
Efficient SQL on Hadoop Solutions
Explore the landscape of SQL on Hadoop engines for distributed computing, leveraging MPP frameworks to run SQL queries efficiently against data stored in HDFS. Discover the challenges of using Pig and Hive for SQL, and learn about popular SQL engines like Apache Drill, Cloudera Impala, and Presto. Find out how these engines enable in-memory query execution, machine learning integration, and seamless connectivity with visualization tools, opening up new possibilities for leveraging Hadoop in your analytics endeavors.
Presentation Transcript
SQL on Hadoop. CMSC 491 Hadoop-Based Distributed Computing, Spring 2016. Adam Shook.
All of these, but HAWQ specifically, because it has the most features.
Problem! MapReduce is great, but all of my data dudes don't know Java. Well, Pig and Hive exist... they are kind of SQL. But Pig and Hive are slow, and they aren't really SQL... How can I efficiently use all of my SQL scripts that I have today? Well, that's why all these companies are building SQL on Hadoop engines.
SQL Engines for Hadoop Massively Parallel Processing (MPP) frameworks to run SQL queries against data stored in HDFS. Not MapReduce, but still brings the code to the data. SQL for big data sets, but not stupid huge ones; stupid huge ones should still use MapReduce.
Current SQL on Hadoop Landscape Apache Drill (MapR) Apache HAWQ (Pivotal) Cloudera Impala Hive on Tez (Hortonworks) Presto (Facebook) Spark SQL (formerly Shark) Hive on Spark
Why? Ability to execute complex multi-staged queries in-memory against structured data Available SQL-based machine learning libraries can be ported to work on the system A well-known and common query language to express data crunching algorithms Not all queries need to run for hours on end and be super fault tolerant
Okay, tell me more... Many visualization and ETL tools speak SQL, and currently need some hacked version of it for HiveQL. These tools and legacy applications can now connect to big data stored in HDFS. You can start leveraging Hadoop with what you know and begin to explore other Hadoop ecosystem projects. (Your excuse here)
SQL on Hadoop Built for analytics! OLAP vs OLTP Large I/O queries against append-only tables Write-once, read-many much like MapReduce Intent is to retrieve results and run deep analytics in ~20 minutes Anything longer, you may want to consider using MapReduce
Architectures The architectures are all very similar: a master node running the query planner distributes work to query executors on the worker nodes, which read data from HDFS.
HAWQ Master Located on a separate node from the NameNode Does not contain any user data Contains Global System Catalog Authenticates client connections, processes SQL, distributes work between segments, coordinates results returned by segments, presents final client results
HAWQ Transactions No global transaction management No updates or deletes Transactions at the HAWQ master level, with a single-phase commit
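Since HAWQ tables are append-only, changes arrive as new rows rather than as in-place modifications. A minimal sketch (the table and column names here are made up for illustration):

    CREATE TABLE page_views (view_time timestamp, user_id int, url text);

    -- Appending rows is supported
    INSERT INTO page_views
    SELECT view_time, user_id, url FROM staging_page_views;

    -- In-place changes are not; statements like these are rejected on append-only tables
    -- UPDATE page_views SET url = 'http://example.com' WHERE user_id = 42;
    -- DELETE FROM page_views WHERE user_id = 42;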
HAWQ Segments A HAWQ segment within a Segment Host is an HDFS client that runs on a DataNode Multiple segments per Segment Host Segment is the basic unit of parallelism Multiple segments work together to form a parallel query processing system Operations execute in parallel across all segments
Segments Access Data Stored in HDFS Segments are stateless. They communicate with the NameNode to obtain the block lists for where data is located, then access that data directly in HDFS.
HAWQ Parser Accepts SQL queries from clients (e.g. over JDBC) Enforces syntax and semantics Converts the SQL query into a parse tree data structure describing the details of the query
Parallel Query Optimizer Cost-based optimization looks for the most efficient plan Physical plan contains scans, joins, sorts, aggregations, etc. Directly inserts motion nodes for inter-segment communication
Parallel Query Optimizer Continued Inserts motion nodes for efficient non-local join processing (assume table A is distributed across all segments, i.e. each segment k holds a portion A_k) Broadcast Motion (N:N): every segment sends its A_k to all other segments Redistribute Motion (N:N): every segment rehashes A_k by the join column and redistributes each row Gather Motion (N:1): every segment sends its A_k to a single node (usually the master)
Parallel Query Optimization Example Consider the following query:

    SELECT c_custkey, c_name,
           sum(l_extendedprice * (1 - l_discount)) AS revenue,
           c_acctbal, n_name, c_address, c_phone, c_comment
    FROM customer, orders, lineitem, nation
    WHERE c_custkey = o_custkey
      AND l_orderkey = o_orderkey
      AND o_orderdate >= date '1994-08-01'
      AND o_orderdate < date '1994-08-01' + interval '3 month'
      AND l_returnflag = 'R'
      AND c_nationkey = n_nationkey
    GROUP BY c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment
    ORDER BY revenue DESC

The optimizer's plan contains Seq Scans on lineitem, customer, orders, and nation, Hash and HashJoin operators, a Redistribute Motion 4:4 (slice 1) and a Broadcast Motion 4:4 (slice 2) to move rows between segments for the joins, a HashAggregate and Sort for the grouping and ordering, and a Gather Motion 4:1 (slice 3) that returns the final result.
HAWQ Query Optimizer The plan tree for the example above: Seq Scans on lineitem, customer, orders, and nation feed Hash and HashJoin operators; Redistribute and Broadcast Motions move rows between segments for the non-local joins; a HashAggregate and Sort produce the grouped, ordered result; and a Gather Motion collects the final rows at the master.
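Because HAWQ's SQL dialect is PostgreSQL-derived, plans like this can be inspected with EXPLAIN. A minimal sketch using two of the same tables:

    EXPLAIN
    SELECT c_name, n_name
    FROM customer, nation
    WHERE c_nationkey = n_nationkey;

    -- EXPLAIN ANALYZE additionally runs the query and reports what each plan node actually did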
HAWQ Dispatcher and Query Executor The dispatcher sends the same plan slices to every segment, and each query executor runs them against its local data: Scan Bars with the filter b.city = 'San Francisco', a Redistribute Motion on b.name, a HashJoin on b.name = s.bar with Scan Sells, a Project of s.beer and s.price, and a Gather Motion back to the master.
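The plan above roughly corresponds to a query like the following, reconstructed from the operator names in the slide, so treat it as illustrative:

    SELECT s.beer, s.price
    FROM bars b, sells s
    WHERE b.name = s.bar
      AND b.city = 'San Francisco';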
HAWQ Dynamic Pipelining Parallel data flow using a UDP-based interconnect No materialization of intermediate results, unlike MapReduce (Diagram: segment hosts co-located with DataNodes, each running query executors, PXF, and multiple segments with local temp storage on top of HDFS, connected to one another by the dynamic pipelining interconnect.)
HAWQ Data Storage and I/O Overview DataNodes are responsible for serving read and write requests from HAWQ segments Data stored external to HAWQ can be read using Pivotal Xtension Framework (PXF) external tables Data stored in HAWQ can be written to HDFS for external consumption using PXF Writable HDFS Tables MapReduce can access data stored in HAWQ using provided Input/Output formats
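A rough sketch of PXF external tables follows; the hostname, port, profile, paths, and table definitions are placeholders, and the exact LOCATION syntax varies by HAWQ/PXF version:

    -- Readable external table over files already sitting in HDFS
    CREATE EXTERNAL TABLE ext_clicks (click_time text, user_id int, url text)
    LOCATION ('pxf://namenode:51200/data/clicks?PROFILE=HdfsTextSimple')
    FORMAT 'TEXT' (DELIMITER ',');

    -- Writable external table for exporting HAWQ data back out to HDFS
    -- (rows are written with INSERT ... SELECT against this table)
    CREATE WRITABLE EXTERNAL TABLE ext_clicks_out (click_time text, user_id int, url text)
    LOCATION ('pxf://namenode:51200/export/clicks?PROFILE=HdfsTextSimple')
    FORMAT 'TEXT' (DELIMITER ',');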
Data Distributions Every table has a distribution method DISTRIBUTED BY (column) Uses a hash distribution DISTRIBUTED RANDOMLY Uses a random distribution which is not guaranteed to provide a perfectly even distribution
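For example, a minimal sketch (table and column names are made up):

    -- Hash distribution: rows with the same user_id always land on the same segment
    CREATE TABLE clicks (user_id int, url text, click_time timestamp)
    DISTRIBUTED BY (user_id);

    -- Random distribution when no column makes a good hash key
    CREATE TABLE raw_events (payload text)
    DISTRIBUTED RANDOMLY;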
Multi-Level Partitioning Use hash distribution to evenly spread data across all segments Use range partitioning within a node to minimize scan work, e.g. one partition per month from Jan 2007 through Dec 2007
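In Greenplum-style DDL, which HAWQ inherits, the combination looks roughly like this (the table, columns, and date range are illustrative):

    CREATE TABLE sales (id int, sale_date date, amount decimal(10,2))
    DISTRIBUTED BY (id)              -- hash distribution spreads rows across segments
    PARTITION BY RANGE (sale_date)   -- range partitions limit how much each scan reads
    (
        START (date '2007-01-01') INCLUSIVE
        END   (date '2008-01-01') EXCLUSIVE
        EVERY (INTERVAL '1 month')
    );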
HAWQ Fault Tolerance Fault tolerance through HDFS replication Replication factor decided when creating a file space in HDFS When a segment fails the shard is accessible from another node
References Apache Drill Pivotal HAWQ Apache Impala Presto Hive on Tez Spark SQL