Analyzing Massive Datasets from Allen Telescope Array with IBM Cloud


Proposal for Stanford CS341 students to use IBM Cloud services to analyze massive datasets from the Allen Telescope Array at Hat Creek Radio Observatory. The IBM Spark@SETI initiative offers access to the Apache Spark platform, ATA data archives, and support for experimental approaches to the analysis of radio-signal data. Benefits include simple access, no charge for the IBM Spark service, and unlimited access to the ATA data archives on IBM Cloud.

  • IBM Cloud services
  • Allen Telescope Array
  • Spark@SETI
  • data analysis
  • CS341

Uploaded on Apr 13, 2025



Presentation Transcript


  1. Capybara Hive Integration Testing

  2. Issues We've Seen at Hortonworks
  • Many tests for different permutations, e.g. does it work with ORC, with Parquet, with Text
  • Can't run Hive tests on a cluster
      • Forces QE to rewrite tests from scratch; hard to share resources with dev
  • Tests are all small, no ability to scale
  • Golden files are a grievous evil
      • Test writers have to eyeball results, which is error prone
      • A small change in a query plan forces hundreds of expected-output changes
  • QE and dev working in different languages and frameworks
  • It's hard to get user queries with user-like data into the framework
  • Tests built around feature testing and bug fixing, not user experience

  3. Proposed Requirements
  • One test should run in all reasonable permutations
      • Spark/Tez, ORC/Parquet/Text, secure/non-secure, etc.
      • Tests can specify which options make no sense for them
  • Same tests locally and on a cluster
  • Auto-generation of data and expected results
      • At varying scales
      • Expected results generated by a source of truth; won't work for everything but should cover 80%
  • Programmatic access to the query plan
      • Add tools to make it easy to find tasks, operators, and patterns
  • Java, runs in JUnit
  • Ability to simulate user data and run user queries

  4. What's There Today
  • Automated data generation (random, stats-based, dev-specified)
  • Data loaded into Hive and a benchmark
      • State is remembered so that tables are not created for every test
  • Queries run against Hive and the benchmark
  • Comparison of select queries and insert statements
  • Works on a dev's machine or against a cluster
      • Dev's machine: miniclusters and Derby
      • Cluster: user-provided cluster and Postgres
  • A few basic tables provided for tests
      • alltypes, capysrc, capysrcpart, TPC-H-like tables
  • UserQueryGenerator
      • Takes in a set of user queries
      • Reads the user's metastore (the user has to first run analyze table on the included tables)
      • Generates a Java test file that builds simulated data
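The automated data generation mentioned above (random, stats-based, dev-specified) can be pictured with a small sketch. Everything here is a hypothetical illustration, not the framework's actual API: the Column record, the generate method, and the type handling are all invented for this example.

```java
import java.util.*;

// Minimal sketch of per-column random data generation: given a simple
// column spec, emit rows of plausible values at a chosen scale. A fixed
// seed keeps the generated data reproducible across runs.
public class DataGenSketch {
    record Column(String name, String type) {}

    static List<String> generate(List<Column> cols, int rows, long seed) {
        Random rand = new Random(seed);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < rows; i++) {
            StringJoiner row = new StringJoiner(",");
            for (Column c : cols) {
                switch (c.type()) {
                    case "bigint" -> row.add(Long.toString(rand.nextLong()));
                    case "string" -> row.add("s" + rand.nextInt(1000));
                    default -> row.add("NULL");
                }
            }
            out.add(row.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // Same shape as the "create table t (a string, b bigint)" example later.
        List<String> rows = generate(
            List.of(new Column("a", "string"), new Column("b", "bigint")), 3, 42L);
        rows.forEach(System.out::println);
    }
}
```

A stats-based generator would replace the uniform Random draws with draws from the column statistics read out of the metastore, which is presumably why the UserQueryGenerator requires analyze table to be run first.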

  5. What's There Today, Continued
  • SQL Ansifier takes a Hive query and converts it to ANSI SQL to run against the benchmark (incomplete)
  • A given run of tests can be configured with a set of features
      • e.g. file format = orc, engine = tez
  • Annotations
      • Ignore a test when it is inappropriate for the configured features (e.g. no ACID when Spark is the engine)
      • Set configuration for features (e.g. @AcidOn)
  • Scale can be set
  • User can provide a custom benchmark and comparator
  • Programmatic access to the query plan
      • Very limited tools today; more work needed here
  • Initial patch posted to HIVE-12316
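The annotation mechanism above can be sketched in plain Java. This is not the framework's real annotation set: the @SkipWithFeature name, its string-valued feature list, and the FeatureGate check are all hypothetical, meant only to show how a runtime-retained annotation could gate a test against the configured feature set.

```java
import java.lang.annotation.*;
import java.lang.reflect.Method;
import java.util.Set;

// Hypothetical annotation naming features a test cannot run with.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface SkipWithFeature { String[] value(); }

public class FeatureGate {
    // Returns true when the test is runnable under the configured features:
    // unannotated tests always run; annotated ones are skipped if any
    // listed feature is currently configured.
    static boolean shouldRun(Method test, Set<String> configured) {
        SkipWithFeature skip = test.getAnnotation(SkipWithFeature.class);
        if (skip == null) return true;
        for (String feature : skip.value())
            if (configured.contains(feature)) return false;
        return true;
    }

    // Mirrors the slide's example: no ACID when Spark is the engine.
    @SkipWithFeature({"engine=spark"})
    void acidTest() {}

    public static void main(String[] args) throws Exception {
        Method m = FeatureGate.class.getDeclaredMethod("acidTest");
        System.out.println(shouldRun(m, Set.of("engine=spark"))); // false
        System.out.println(shouldRun(m, Set.of("engine=tez")));   // true
    }
}
```

In a JUnit setting the same check would typically live in a custom runner or rule that consults the annotation before invoking the test method.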

  6. Missing Pieces
  • Limited working options
      • Need to add HBase metastore, LLAP, Spark, security, Hive Streaming, ...
      • Tez is there but SUPER slow
      • JDBC is in-process; binary data and complex types don't work
      • Parallel data generation and comparison are written but not yet tested
  • Not yet a way to set or switch users (for security tests)
  • Limited usage testing
      • Many options haven't been tried and I'm sure some don't work
  • Limited qfiles converted

  7. Example Test

     @Test
     public void simple() throws Exception {
       TableTool.createAllTypes();
       runQuery("select cvarchar from alltypes");
       sortAndCompare();
     }

  8. Example Test

     @Test
     public void simpleJoin() throws Exception {
       TableTool.createPseudoTpch();
       runQuery("select p_name, avg(l_price) " +
                "from ph_lineitem join ph_part " +
                "on (l_partkey = p_partkey) " +
                "group by p_name " +
                "order by p_name");
       compare();
     }

  9. Example Test

     @Test
     public void q1() throws Exception {
       set("hive.auto.convert.join", true);
       runQuery("drop table if exists t");
       runQuery("create table t (a string, b bigint)");
       runQuery("insert into t select c, d from u");
       IMetaStoreClient msClient = new HiveMetaStoreClient(new HiveConf());
       Table msTable = msClient.getTable("default", "t");
       TestTable tTable = new TestTable(msTable);
       tableCompare(tTable);
     }

  10. Example Explain

      @Test
      public void explain() throws Exception {
        TableTool.createCapySrc();
        Explain explain = explain("select k, value from capysrc order by k");
        // Expect that somewhere in the plan is a MapRedTask.
        MapRedTask mrTask = explain.expect(MapRedTask.class);
        // Find all scans in the MapRedTask.
        List<TableScanOperator> scans = explain.findAll(mrTask, TableScanOperator.class);
        Assert.assertEquals(1, scans.size());
      }

  11. Run a Test
  • Locally, with default options:
      mvn test -Dtest=TestSkewJoin
  • Locally, specifying Tez:
      mvn test -Dtest=TestSkewJoin -Dhive.test.capybara.engine=tez
  • On a cluster:
      mvn test -Dtest=TestSkewJoin -Dhive.test.capybara.use.cluster=true -DHADOOP_HOME=your_hadoop_path -DHIVE_HOME=your_hive_path

  12. Simulate User Queries
  • For select queries, create one file for each test (a file may contain more than one query)
  • Run analyze table with collect column stats for each table with source data
  • Then run the following, which outputs TestQueries.java:
      hive --service capygen -i queries/*.sql -o TestQueries
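The generator step above emits a Java test file from the collected queries. As a rough sketch of that kind of code emission: the emitTestMethod helper, the method names, and the exact shape of the generated test are all invented here; the real capygen output (which also builds the simulated data) is certainly richer.

```java
// Hypothetical sketch of query-to-test emission: for each input query,
// produce a JUnit-style test method that replays it against the
// framework's runQuery/compare helpers.
public class CapygenSketch {
    static String emitTestMethod(String name, String sql) {
        return "  @Test\n"
             + "  public void " + name + "() throws Exception {\n"
             + "    runQuery(\"" + sql.replace("\"", "\\\"") + "\");\n"
             + "    compare();\n"
             + "  }\n";
    }

    public static void main(String[] args) {
        // One method per input query; a real generator would loop over
        // the .sql files named on the command line.
        System.out.print(emitTestMethod("q1", "select c, d from u"));
    }
}
```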

  13. Questions
