
Processing Data with Apache Hive - Steps to Load and Query
"Learn how to load and query data using Apache Hive. Follow steps to download data, load into HDFS, and execute queries to create tables with ease."
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
HIVE CSCE 587 Spring 2018
Step 1: Data Start by downloading the data https://raw.githubusercontent.com/hortonworks/data-tutorials/master/tutorials/hdp/how-to- process-data-with-apache-hive/assets/driver_data.zip There are two files in this data set that we saw last week with PIG: drivers.csv timesheet.csv
Step 2: Load the files into HDFS Unlike last week, we will use the GUI to load. Start by logging onto ambaria: vm-Hadoop-xx.cse.sc.edu:8080 use your maria_dev credentials
Step 2: Load the files into HDFS 1. Click on the icon that resembles 3x3 grid on the menu bar (it is at the top of the window on the far right side) 2. Select Files View
Step 2: Load the files into HDFS Navigate to /user/maria_dev 1. scroll to bottom of list to find user. 2. click on user . 3. scroll down list to find maria_dev 4. click on maria_dev
Navigating to /user/maria_dev You should see something like this, although you should also see the other files that you created last week with Hadoop and PIG.
1. Click on the Upload button 2. Then select the Browse button and navigate to where you stored the files on the linux file system. a. Select drivers.csv to upload b. Do the same for timesheet.csv
Had you started with a tabula rasa, your directory would look like this:
3. HIVE View 2.0 Switch context from Files view to Hive View 2.0:
Step 3.1 This brings up the query editor
Step 3.2: Create an empty table Enter: create table create table temp_drivers Then click on Execute temp_drivers ( (col_value col_value STRING); STRING);
Result Result
Step 3.3 Enter: LOAD DATA INPATH '/user/ LOAD DATA INPATH '/user/maria_dev Then click on Execute maria_dev/drivers.csv' OVERWRITE INTO TABLE /drivers.csv' OVERWRITE INTO TABLE temp_drivers temp_drivers; ;
Go back to the Files view What has changed? The file drivers.csv is no longer there Loading the file into Hive has consumed drivers.csv
Take a peek at the table temp_drivers Enter: select * from select * from temp_drivers temp_drivers limit 10; then click on Execute limit 10;
Step 3.4: extract the fields we want enter: CREATE TABLE drivers ( CREATE TABLE drivers (driverId driverId INT, name STRING, INT, name STRING, ssn the click on Execute ssn BIGINT, location STRING, certified STRING, BIGINT, location STRING, certified STRING, wageplan wageplan STRING); STRING);
STEP 3.5: STEP 3.5: QUERY TO EXTRACT DATA FROM QUERY TO EXTRACT DATA FROM TEMP_DRIVERS TEMP_DRIVERS INSERT OVERWRITE TABLE DRIVERS INSERT OVERWRITE TABLE DRIVERS SELECT SELECT REGEXP_EXTRACT REGEXP_EXTRACT( (COL_VALUE REGEXP_EXTRACT REGEXP_EXTRACT( (COL_VALUE REGEXP_EXTRACT REGEXP_EXTRACT( (COL_VALUE REGEXP_EXTRACT REGEXP_EXTRACT( (COL_VALUE REGEXP_EXTRACT REGEXP_EXTRACT( (COL_VALUE REGEXP_EXTRACT REGEXP_EXTRACT( (COL_VALUE COL_VALUE, '^(?:([^,]*),?){1}', 1) , '^(?:([^,]*),?){1}', 1) DRIVERID COL_VALUE, '^(?:([^,]*),?){2}', 1) NAME, , '^(?:([^,]*),?){2}', 1) NAME, COL_VALUE, '^(?:([^,]*),?){3}', 1) , '^(?:([^,]*),?){3}', 1) SSN COL_VALUE, '^(?:([^,]*),?){4}', 1) LOCATION, , '^(?:([^,]*),?){4}', 1) LOCATION, COL_VALUE, '^(?:([^,]*),?){5}', 1) CERTIFIED, , '^(?:([^,]*),?){5}', 1) CERTIFIED, COL_VALUE, '^(?:([^,]*),?){6}', 1) , '^(?:([^,]*),?){6}', 1) WAGEPLAN DRIVERID, , SSN, , WAGEPLAN FROM FROM TEMP_DRIVERS TEMP_DRIVERS; ;
STEP 3.5: STEP 3.5: CREATE A QUERY TO CREATE A QUERY TO EXTRACT DATA FROM INSERT OVERWRITE TABLE DRIVERS INSERT OVERWRITE TABLE DRIVERS SELECT SELECT REGEXP_EXTRACT REGEXP_EXTRACT( (COL_VALUE COL_VALUE, '^(?:([^,]*),?){1}', 1) , '^(?:([^,]*),?){1}', 1) DRIVERID DRIVERID, , REGEXP_EXTRACT REGEXP_EXTRACT( (COL_VALUE COL_VALUE, '^(?:([^,]*),?){2}', 1) NAME, , '^(?:([^,]*),?){2}', 1) NAME, REGEXP_EXTRACT REGEXP_EXTRACT( (COL_VALUE COL_VALUE, '^(?:([^,]*),?){3}', 1) , '^(?:([^,]*),?){3}', 1) SSN SSN, , REGEXP_EXTRACT REGEXP_EXTRACT( (COL_VALUE COL_VALUE, '^(?:([^,]*),?){4}', 1) LOCATION, , '^(?:([^,]*),?){4}', 1) LOCATION, REGEXP_EXTRACT REGEXP_EXTRACT( (COL_VALUE COL_VALUE, '^(?:([^,]*),?){5}', 1) CERTIFIED, , '^(?:([^,]*),?){5}', 1) CERTIFIED, REGEXP_EXTRACT REGEXP_EXTRACT( (COL_VALUE COL_VALUE, '^(?:([^,]*),?){6}', 1) , '^(?:([^,]*),?){6}', 1) WAGEPLAN WAGEPLAN FROM FROM TEMP_DRIVERS TEMP_DRIVERS; ; EXTRACT DATA FROM TEMP_DRIVERS TEMP_DRIVERS
Take a peek at the first 10 rows of the resulting table Enter: select * from drivers limit 10; select * from drivers limit 10; Then click on Execute
CREATE TABLE temp_timesheet (col_value string); Processing timesheet.csv creating similar tables from timesheet.csv Start by creating temp_timesheet with the following command: CREATE TABLE temp_timesheet (col_value string); Then populate it with data from timesheet.csv LOAD DATA INPATH '/user/maria_dev/timesheet.csv' OVERWRITE INTO TABLE temp_timesheet; Finally, look at the first 10 lines of the table as a sanity check. Select * from temp_timesheet limit 10;
3.6 CREATE TABLE CREATE TABLE temp_timesheet temp_timesheet ( (col_value LOAD DATA INPATH '/user/ LOAD DATA INPATH '/user/maria_dev col_value string); maria_dev/timesheet.csv' OVERWRITE INTO TABLE /timesheet.csv' OVERWRITE INTO TABLE temp_timesheet string); temp_timesheet; ;
Creating timesheet from temp_timesheet Start by creating an empty table timesheet Enter: CREATE TABLE timesheet (driverId INT, week INT, hours_logged INT , miles_logged INT); Then click on Execute Next populate the table by extracting columns fromm temp_timesheet
Extracting columns from temp_timesheet Enter: insert overwrite table timesheet SELECT regexp_extract(col_value, '^(?:([^,]*),?){1}', 1) driverId, regexp_extract(col_value, '^(?:([^,]*),?){2}', 1) week, regexp_extract(col_value, '^(?:([^,]*),?){3}', 1) hours_logged, regexp_extract(col_value, '^(?:([^,]*),?){4}', 1) miles_logged from temp_timesheet; Then click on Execute
Take a peek at the first 10 rows enter: select * from timesheet limit 10; click on Execute
Now group timesheet data by driverID so that we can sum the hours logged and sum the miles logged Enter: SELECT driverId, sum(hours_logged), sum(miles_logged) FROM timesheet GROUP BY driverId; Then click on Execute
Results after grouping by driverID and summing logged hours and logged miles
Combine columns from drivers and timesheet tables Columns from drivers table: driverId name Columns from timesheets table: total_hours total_miles Join column: driverId
Combine columns from drivers and timesheet tables SELECT d.driverId, d.name, t.total_hours, t.total_miles from drivers d JOIN (SELECT driverId, sum(hours_logged)total_hours, sum(miles_logged)total_miles FROM timesheet GROUP BY driverId ) t ON (d.driverId = t.driverId);
Results after entering command and executing