Tuesday, October 17, 2017

Different ways of executing the Big Data processing jobs in EMR

There are different ways of kick-starting a Hive/Pig/MR/Spark job on Amazon EMR. We already looked at how to submit a Hive job or a step from the AWS EMR management console in an earlier post. This approach is cool, but it doesn't leave much scope for automation.

Here are other ways to start Big Data processing with some level of automation.

1) Use Apache Oozie to create a workflow and a coordinator.
2) Use the AWS CLI.
3) Log in to the master instance and use the Hive shell.

Of these, Option 1 is a bit complicated and will be explored in another post. Here we will look at the other two options.

Option 2 : Using the AWS CLI

Step 1 : Create airline.sql with the content below. The script creates an external table in Hive and maps it to the data in S3 (to get the data into S3, follow this article). It then runs a query on the table and writes the result back to S3.
create external table ontime_parquet_snappy (
  Year INT,
  Month INT,
  DayofMonth INT,
  DayOfWeek INT,
  DepTime  INT,
  CRSDepTime INT,
  ArrTime INT,
  CRSArrTime INT,
  UniqueCarrier STRING,
  FlightNum INT,
  TailNum STRING,
  ActualElapsedTime INT,
  CRSElapsedTime INT,
  AirTime INT,
  ArrDelay INT,
  DepDelay INT,
  Origin STRING,
  Dest STRING,
  Distance INT,
  TaxiIn INT,
  TaxiOut INT,
  Cancelled INT,
  CancellationCode STRING,
  Diverted STRING,
  CarrierDelay INT,
  WeatherDelay INT,
  NASDelay INT,
  SecurityDelay INT,
  LateAircraftDelay INT
) STORED AS PARQUET LOCATION 's3://airline-dataset/airline-parquet-snappy/' TBLPROPERTIES ("parquet.compression"="SNAPPY");

INSERT OVERWRITE DIRECTORY 's3://airline-dataset/parquet-snappy-query-output' select Origin, count(*) from ontime_parquet_snappy where DepTime > CRSDepTime group by Origin; 

Step 2 : Copy the above file to the master node using the command below.
aws emr put --cluster-id j-PQSG2Q9DS9HV --key-pair-file "/home/praveen/Documents/AWS-Keys/MyKeyPair.pem" --src "/home/praveen/Desktop/airline.sql"
Don't forget to replace the cluster-id, the path of the key-pair and the sql file in the above command.

Step 3 : Kick start the Hive program using the below command.
aws emr ssh --cluster-id j-PQSG2Q9DS9HV --key-pair-file "/home/praveen/Documents/AWS-Keys/MyKeyPair.pem" --command "hive -f airline.sql"
Replace the cluster-id and the key-pair path in the above command.
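
Steps 2 and 3 can be combined into a small script so that the whole submission is a single call. The following is only a sketch: the function names and the DRY_RUN switch are made up for this example, and the default values are the ones used in the commands above. By default it only prints the two commands; set DRY_RUN=0 to actually run them.

```shell
# Placeholders taken from the commands above; replace them with your own values.
CLUSTER_ID="${CLUSTER_ID:-j-PQSG2Q9DS9HV}"
KEY_FILE="${KEY_FILE:-$HOME/Documents/AWS-Keys/MyKeyPair.pem}"
SQL_FILE="${SQL_FILE:-$HOME/Desktop/airline.sql}"

# Hypothetical helper: print the command when DRY_RUN=1 (the default here),
# otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

submit_hive_script() {
  # Step 2: copy the script to the master node.
  run aws emr put --cluster-id "$CLUSTER_ID" --key-pair-file "$KEY_FILE" \
    --src "$SQL_FILE" || return 1
  # Step 3: run the copied script with Hive on the master node.
  run aws emr ssh --cluster-id "$CLUSTER_ID" --key-pair-file "$KEY_FILE" \
    --command "hive -f $(basename "$SQL_FILE")"
}

submit_hive_script
```

Saved as, say, submit.sh, it can be previewed with `sh submit.sh` and run for real with `DRY_RUN=0 sh submit.sh`.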

Step 4 : The final step is to monitor the progress of the Hive job and verify the output in the S3 management console.
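
The output can also be checked from the command line instead of the S3 management console. This is a sketch of the verification part only (it does not monitor the job itself), assuming the output location from airline.sql. LOCAL_DIR, the function name and the DRY_RUN switch are made up for this example; by default the commands are only printed.

```shell
# Output location used by the INSERT OVERWRITE DIRECTORY statement in airline.sql.
OUTPUT_URI="${OUTPUT_URI:-s3://airline-dataset/parquet-snappy-query-output}"
# Hypothetical local folder to download the result files into.
LOCAL_DIR="${LOCAL_DIR:-./query-output}"

# Hypothetical helper: print the command when DRY_RUN=1 (the default here),
# otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

verify_output() {
  # List the result files written by the Hive query.
  run aws s3 ls "$OUTPUT_URI/" || return 1
  # Download them for a closer look.
  run aws s3 cp "$OUTPUT_URI/" "$LOCAL_DIR" --recursive
}

verify_output
```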

Option 3 : Log in to the master instance and use the Hive shell

Step 1 : Delete the output of the Hive query that was created in Option 2 above.
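
If you prefer the command line over the S3 management console for this cleanup, something like the following should work. It is a sketch that assumes the output location from airline.sql; the function name and the DRY_RUN switch are made up for this example. Because the command deletes data, it is only printed by default; set DRY_RUN=0 to run it.

```shell
# Output location used by the INSERT OVERWRITE DIRECTORY statement in airline.sql.
OUTPUT_URI="${OUTPUT_URI:-s3://airline-dataset/parquet-snappy-query-output}"

# Hypothetical helper: print the command when DRY_RUN=1 (the default here),
# otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

clean_output() {
  # Remove every object under the query output prefix.
  run aws s3 rm "$OUTPUT_URI/" --recursive
}

clean_output
```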

Step 2 : Follow the steps mentioned here to ssh into the master.

Step 3 : Start the Hive shell using the 'hive' command and create the table in Hive, using the same CREATE EXTERNAL TABLE statement as in Option 2.

Step 4 : Check whether the table has been created, using the SHOW TABLES and DESCRIBE commands.

Step 5 : Execute the Hive query in the shell and wait for it to complete.
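
Steps 4 and 5 can also be run without entering the interactive shell, by passing the statements to `hive -e` on the master node. This is a sketch, assuming the table name and the query from airline.sql; the function names and the DRY_RUN switch are made up for this example, and by default the commands are only printed.

```shell
# Table and output location taken from airline.sql.
TABLE="${TABLE:-ontime_parquet_snappy}"
OUTPUT_URI="${OUTPUT_URI:-s3://airline-dataset/parquet-snappy-query-output}"

# Hypothetical helper: print the command when DRY_RUN=1 (the default here),
# otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

check_table() {
  # Step 4: list the tables and show the columns of the new table.
  run hive -e "SHOW TABLES; DESCRIBE $TABLE;"
}

run_query() {
  # Step 5: count the delayed departures per origin airport and write the result to S3.
  run hive -e "INSERT OVERWRITE DIRECTORY '$OUTPUT_URI' SELECT Origin, count(*) FROM $TABLE WHERE DepTime > CRSDepTime GROUP BY Origin;"
}

check_table
run_query
```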

Step 6 : Verify the output of the Hive job in the S3 management console.

Step 7 : Forward the local port to the remote port as mentioned here and access the YARN console, to see the status of the Hive job.
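
For reference, the port forwarding can be done with a plain ssh tunnel. This is a sketch and not necessarily the exact method from the linked post. It assumes the YARN ResourceManager web UI listens on port 8088 on the master and that the login user is hadoop, which are the usual EMR defaults. The function name, LOCAL_PORT, the MASTER_PUBLIC_DNS placeholder and the DRY_RUN switch are made up for this example; by default the command is only printed.

```shell
KEY_FILE="${KEY_FILE:-$HOME/Documents/AWS-Keys/MyKeyPair.pem}"
# Local port to open; 8088 is assumed to be the YARN ResourceManager UI port on the master.
LOCAL_PORT="${LOCAL_PORT:-8088}"

# Hypothetical helper: print the command when DRY_RUN=1 (the default here),
# otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

forward_yarn_ui() {
  # $1 is the public DNS name of the master node. It can be looked up with:
  #   aws emr describe-cluster --cluster-id <cluster-id> \
  #     --query 'Cluster.MasterPublicDnsName' --output text
  run ssh -i "$KEY_FILE" -N -L "$LOCAL_PORT:localhost:8088" "hadoop@$1"
}

# MASTER_PUBLIC_DNS is a placeholder; set MASTER_DNS to the real host name.
forward_yarn_ui "${MASTER_DNS:-MASTER_PUBLIC_DNS}"
```

While the tunnel is open, the YARN console should be reachable at http://localhost:8088 (or whichever LOCAL_PORT was chosen).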

This completes the steps for submitting a Hive job in different ways. The same steps can be repeated with minimal changes for Pig, Sqoop and other Big Data software.
