Advanced Data Processing Techniques in Pig and Hive

Learn about advanced data processing techniques in Apache Pig and Apache Hive, including the use of macros, user-defined functions, and writing your own functions in Java. Explore how to count the number of words in documents, package reusable code using macros, and leverage user-defined functions from the Piggy Bank repository. Understand the different types of functions: Eval, Filter, and Store/Load. Discover the process of writing and compiling your own functions as Java programs for integration with Pig.

  • Data Processing
  • Pig
  • Hive
  • Macros
  • User-defined Functions


Presentation Transcript


  1. CSE 491/891 Lecture 23 (Pig and Hive)

  2. Outline
     Last lecture (on Pig Latin):
       • Relational operators: FOREACH, GROUP, UNION, ORDER BY, DISTINCT, FILTER, SPLIT, JOIN
       • Functions: TOKENIZE, FLATTEN, etc.
       • How to count the number of words in documents using Pig
     Today's lecture:
       • Macros
       • User-defined functions
       • Intro to Hive

  3. Macros
     Macros provide a way to package reusable code:
       DEFINE <macro> (<args>) RETURNS <returnvalue> {
         Pig Latin code for macro definition
       };
     Example: wordcount.pig
       DEFINE wordcount(text) RETURNS counts {
         tokens   = foreach $text generate TOKENIZE($0) as terms;
         wordlist = foreach tokens generate FLATTEN(terms) as word, 1 as freq;
         groups   = group wordlist by word;
         $counts  = foreach groups generate group as word, SUM(wordlist.freq) as freq;
       };

  4. Macros
     You can either type the macro definition directly into the grunt shell or load it from a script file (see the example sketch below).
     At runtime, Pig will expand the macro using the macro definition above before executing it.
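     The example on the original slide was an image; the following is a minimal sketch of how the wordcount macro might be loaded from a script file and invoked. The input file name mydocs.txt is hypothetical; only the macro name and the wordcount.pig file come from the previous slide.

       -- load the macro definition from a script file
       IMPORT 'wordcount.pig';
       -- apply the macro to a relation of document lines
       docs = LOAD 'mydocs.txt' AS (text:chararray);
       counts = wordcount(docs);
       DUMP counts;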

  5. User Defined Functions
     Before writing your own functions, you may want to look at some examples first.
     Piggy Bank is a repository of Pig functions shared by the Pig community. The functions are written as Java classes and archived in jar files.
     You can download the source files (piggybank.tar) from the class webpage.

  6. User Defined Functions
     There are different types of functions:
     • Eval function: takes one or more expressions and returns another expression (e.g., MAX)
         results = FOREACH grp GENERATE group, myEvalFn(data.val);
     • Filter function: returns a Boolean value (to be used with the FILTER operator)
         results = FILTER records BY myFilterFn(val);
     • Store/Load function: specifies how to load or store data
         mydata = LOAD 'mydata.txt' USING myLoadFn();

  7. Writing Your Own Function
     1. Write your function as a Java program that extends the corresponding Pig abstract class:
          Your Function Type    Pig Abstract Class
          Eval function         EvalFunc
          Filter function       FilterFunc
          Load function         LoadFunc
     2. Compile your program:
          javac -cp CLASSPATH <sourcecode>
        For AWS, the CLASSPATH is /usr/lib/pig/pig-0.16.0-amzn-0-core-h2.jar:/usr/lib/hadoop/hadoop-aws-2.7.3-amzn-1.jar (check first to make sure the files exist).

  8. Writing Your Own Function
     3. Next, create a jar file to archive your program:
          shell> jar cvf <myfunction.jar> <list of class files>
     4. To use the function, you must register it first:
          grunt> REGISTER myfunction.jar
     5. You can now start using the function in the grunt shell or in your Pig script file.

  9. Example
     Consider the problem of finding frequent words in a corpus of documents by filtering out infrequent words that appear fewer times than a threshold.
     Pig Latin:
       data = LOAD 'datafile' AS (text:chararray);
       counts = wordcount(data);    -- uses the wordcount macro defined on slide 3
       REGISTER PopularTerms.jar;
       results = FILTER counts BY PopularTerms(freq, 2);    -- PopularTerms is a user-defined filter function; assume the threshold is 2
       dump results;

  10. PopularTerms.java
     This program takes two input arguments:
       • the word frequency
       • the minimum frequency threshold (a user parameter)
     It returns true if the word frequency is greater than or equal to the threshold.
       results = FILTER counts BY PopularTerms(freq, thres);

  11. PopularTerms.java (Overall Structure)
       import java.io.*;
       import java.util.*;
       import org.apache.pig.FilterFunc;
       import org.apache.pig.backend.executionengine.ExecException;
       import org.apache.pig.data.Tuple;
       import org.apache.pig.impl.logicalLayer.FrontendException;

       public class PopularTerms extends FilterFunc {
         @Override
         public Boolean exec(Tuple tuple) throws IOException {
           // This code will filter tuples with frequency below the threshold
           // tuple contains the parameter values passed to the function
         }
       }

  12. PopularTerms.java
       public Boolean exec(Tuple tuple) throws IOException {
         if (tuple == null || tuple.size() == 0) {
           return false;
         }
         try {
           Object field1 = tuple.get(0);    // get frequency
           Object field2 = tuple.get(1);    // get threshold
           if (field1 == null || field2 == null) {
             return false;
           }
           int count = ((Long) field1).intValue();
           int threshold = (Integer) field2;
           return (count >= threshold);
         } catch (ExecException e) {
           throw new IOException(e);
         }
       }
     Make sure you check the types of the fields first in grunt (using DESCRIBE).

  13. PopularTerms.java
     How to compile and archive the function, and a script to run the job (filterWords.pig); you can also run the same commands interactively by typing them into grunt (a sketch of both is given below).
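     The compile/archive commands and the filterWords.pig script were shown as images on the original slide; the following is a minimal sketch assembled from the pieces on slides 7-9. The exact file layout, and the use of IMPORT for the macro, are assumptions.

       # compile the UDF against the Pig and Hadoop jars listed on slide 7, then archive it
       javac -cp /usr/lib/pig/pig-0.16.0-amzn-0-core-h2.jar:/usr/lib/hadoop/hadoop-aws-2.7.3-amzn-1.jar PopularTerms.java
       jar cvf PopularTerms.jar PopularTerms*.class

       -- filterWords.pig (assumes the wordcount macro is saved in wordcount.pig)
       IMPORT 'wordcount.pig';
       REGISTER PopularTerms.jar;
       data = LOAD 'datafile' AS (text:chararray);
       counts = wordcount(data);
       results = FILTER counts BY PopularTerms(freq, 2);
       dump results;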

  14. PopularTerms.java
     Submit the job (see the sketch below).
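     The submission command was shown as an image; a minimal sketch, assuming the script is saved as filterWords.pig on the cluster's master node:

       shell> pig filterWords.pig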

  15. Summary
     Pig is a high-level programming framework built on top of Hadoop that simplifies the task of writing programs to manipulate data (analogous to SQL).
     Limitation: the syntax is still quite different from SQL; it's a procedural language rather than a declarative language like SQL.

  16. CSE 491/891 Intro to Hive

  17. Hive
     A data warehousing framework built on top of Hadoop.
     Created to make it possible for analysts with strong SQL skills (but limited Java programming skills) to run queries on large volumes of data.
     Developed by Jeff Hammerbacher and his team at Facebook.

  18. Main Components of Hive
     • Hive Server: a service that enables remote clients to execute Hive queries and retrieve the results
     • HiveQL: an SQL-like query language for interacting with Hive
     • Hive Compiler: converts Hive queries into MapReduce jobs
     • Hive Client: a command-line interface (CLI) for interacting with Hive
         - Older Hive versions: HiveCLI (now deprecated)
         - Current Hive versions: Beeline

  19. Hive
     Separation of data from the table schema:
     • Data is stored on HDFS (raw data files)
     • Schema is stored in a metastore
     Unlike a relational database (SQL), you can't simply update or delete an individual record in a table.
     Hadoop is used for batch processing; it is not an online transaction database (where you can add/update/remove any row in real time).

  20. Launching the Beeline Client
     After creating an AWS EMR cluster, you may need to wait more than 15 minutes to ensure Hive is fully installed; otherwise, you may run into various errors when executing Hive query commands.
     To launch the Beeline Hive client, see the sketch below.
     ** This usually takes about 10 minutes to get up and running.
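     The launch command was shown as an image on the original slide; a minimal sketch, assuming you are logged into the EMR master node where Beeline is installed:

       shell> beeline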

  21. Useful Commands
     • !connect: connect to a Hive server
     • !help: get help
     • !close: close the server connection
     • !closeall: close all server connections
     • !quit: exit

  22. Launching the Hive Client
     Use JDBC to connect to the Hive server (see the sketch below).
     You can use the default database.
     Username and password can be left blank (press Enter).
     Once connected, you can start entering SQL commands for creating, altering, querying, and dropping tables in the database.
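     The connection command was shown as an image; a minimal sketch, assuming the Hive server is listening on the default port 10000 of the local (master) node:

       beeline> !connect jdbc:hive2://localhost:10000

     When prompted for a username and password, just press Enter (as noted above).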

  23. Other Example Commands
     To list all the available Beeline commands: !help
     To exit the client program: !quit

  24. Data Storage
     By default, Hive stores your data under the /user/hive/warehouse directory on HDFS.
     You may request Hive to add compression and special directory structures (such as partitions) to the data to improve query performance.
     However, you can also store your data in another directory on HDFS; Hive will take the data as it is and will not try to optimize your data storage during query processing.

  25. Internal and External Tables
     Internal tables:
     • When a new table is created, the raw data is moved from its original location to /user/hive/warehouse
     • When the table is dropped, both the metadata (its schema) and the actual raw data are removed
     External tables:
     • When the table is created, a directory with the same name as the table is created at /user/hive/warehouse, but the data is kept at its original location
     • When the table is dropped, its directory under /user/hive/warehouse is removed (along with the metadata), but the original data (which is stored elsewhere) is not removed

  26. CREATE TABLE
     Internal table:
       CREATE TABLE <tablename> ( ... ) ...;
       LOAD DATA INPATH '<filepath>' INTO TABLE <tablename>;
     External table:
       CREATE EXTERNAL TABLE <tablename> ( ... )
       LOCATION '<dirpath>';    -- specify the location of the original data

  27. Example: Wikipedia Revisions
     Input data (wiki_edit.txt); columns are separated by white space.
     Schema: (RevisionID, Article, EditDate, Username)

  28. CREATE TABLE
     The syntax looks very similar to SQL; each command is terminated by a semicolon (see the sketch below).
     A create table statement like the one shown is incomplete: you can't use an INSERT command as in SQL, so you need to specify the original data file and how the fields within each record are delimited (separated).
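     The statement on the original slide was an image; the following is a sketch of what an incomplete CREATE TABLE for the wiki_edit data might look like. The table name wiki comes from slide 30; the column names and types are assumptions based on the schema from slide 27.

       CREATE TABLE wiki (
         revid INT,
         article STRING,
         editdate STRING,
         username STRING
       );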

  29. CREATE INTERNAL TABLE
     To load data from an existing file, you need to specify the delimiter character for the fields (see the sketch below).
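     The complete statement was an image on the original slide; a minimal sketch, reusing the assumed wiki columns from above and a single space as the field delimiter (the wiki_edit.txt columns are white-space separated, per slide 27):

       CREATE TABLE wiki (
         revid INT,
         article STRING,
         editdate STRING,
         username STRING
       )
       ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';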

  30. Example
     You can use hadoop fs to check the /user/hive/warehouse directory (see below).
     Note that the CREATE TABLE command creates a subdirectory named wiki under /user/hive/warehouse, but there is no data file in the directory since we did not specify where the original data is located.
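     A sketch of the check (the listing output itself is omitted here):

       shell> hadoop fs -ls /user/hive/warehouse
       shell> hadoop fs -ls /user/hive/warehouse/wiki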

  31. LOAD DATA INTO INTERNAL TABLE
     The input data is wiki_edit.txt (make sure it is uploaded to HDFS in a directory named wiki); see the sketch below.
     The LOAD DATA command moves the data files in /user/hadoop/wiki on HDFS to the /user/hive/warehouse/wiki directory.
     OVERWRITE forces the content of the target table to be deleted and replaced with the files referred to by the inpath.
     Hive does not parse the file into an internal database format, i.e., files are stored in their original format on HDFS; conversion from the raw bytes in a file to the appropriate fields occurs when a retrieval or update query is processed.
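     The command itself was an image; a minimal sketch based on the filepath and table name described above:

       LOAD DATA INPATH '/user/hadoop/wiki' OVERWRITE INTO TABLE wiki;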

  32. CREATE EXTERNAL TABLE
     The data will NOT be moved to /user/hive/warehouse; it will remain in the external directory /user/hadoop/wiki (see the sketch below).
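     The statement was an image on the original slide; a sketch reusing the assumed wiki columns and the external location given above (the table name wiki_ext is hypothetical):

       CREATE EXTERNAL TABLE wiki_ext (
         revid INT,
         article STRING,
         editdate STRING,
         username STRING
       )
       ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
       LOCATION '/user/hadoop/wiki';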

  33. Data Types

  34. Data Types
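     The type tables on slides 33-34 were images. As a reminder (not taken from the slides), Hive's commonly used primitive types include TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, BOOLEAN, STRING, and TIMESTAMP, plus complex types such as ARRAY, MAP, and STRUCT. A small sketch using a few of them (the table and column names are hypothetical):

       CREATE TABLE example_types (
         id BIGINT,
         price DOUBLE,
         in_stock BOOLEAN,
         name STRING,
         tags ARRAY<STRING>,
         attrs MAP<STRING, STRING>
       );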

  35. SHOW and DESCRIBE TABLES
     To display the list of available tables, and to show the schema of a table (see the sketch below).
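     The commands were shown as images; a minimal sketch, using the wiki table from the earlier slides:

       SHOW TABLES;
       DESCRIBE wiki;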

  36. SQL Queries
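     The queries on this slide were images; the following is a hypothetical example against the wiki table (the choice of query is an assumption, not from the slide):

       -- count the number of revisions per article
       SELECT article, COUNT(*) AS num_edits
       FROM wiki
       GROUP BY article
       ORDER BY num_edits DESC
       LIMIT 10;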

  37. Another Example: Bestbuy queries
     Upload the data (bbuy.csv) to HDFS into a directory named bestbuy (see below).
     Schema: (user, sku, category, query, clicktime, querytime)
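     A sketch of the upload, assuming bbuy.csv is in the current local directory and your HDFS home directory is /user/hadoop:

       shell> hadoop fs -mkdir bestbuy
       shell> hadoop fs -put bbuy.csv bestbuy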

  38. Another Example: Bestbuy queries
     Create an external table whose LOCATION points to the file's directory on HDFS (see the sketch below).
     If you drop the external table temp, the original data file still exists in /user/hadoop/bestbuy (but the table no longer exists).
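     The statement was an image; a sketch of the external table temp, assuming all columns are loaded as strings and the fields in bbuy.csv are comma-separated:

       CREATE EXTERNAL TABLE temp (
         `user` STRING,       -- backticks since user is a reserved word in some Hive versions
         sku STRING,
         category STRING,
         query STRING,
         clicktime STRING,
         querytime STRING
       )
       ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
       LOCATION '/user/hadoop/bestbuy';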

  39. DROP and ALTER TABLE
     Use DROP TABLE to delete the data and metadata for a table:
       DROP TABLE <tablename>;
     If you want to delete the data but keep the table definition, simply remove the files in HDFS (without removing the directory).
     Use ALTER TABLE to modify the table's schema:
       ALTER TABLE abc RENAME TO xyz;
       ALTER TABLE abc ADD COLUMNS (newcol STRING);
