Thursday, March 31, 2016

Twitter analysis with Pig and elephant-bird

Twitter analysis has been one of the popular blog posts on this site. Flume was used to gather the data, and Hive was then used to do some basic analytics. Performing the same analysis with Pig has been pending for quite some time, so here it is.

The JsonLoader which comes with Pig can be used to load the JSON data from Flume. But the problem with the built-in JsonLoader is that the entire schema has to be specified, as shown below. In the case of the Twitter data, the schema becomes really huge and complex.

students = LOAD 'students.json'  USING JsonLoader('name:chararray, school:chararray, age:int');

So, I started using elephant-bird for processing the JSON data. With the JsonLoader from elephant-bird, there is no need to specify the schema. The JsonLoader simply returns a Pig map datatype, and fields can be accessed using the JSON property name, as shown below.

-- Register the elephant-bird jars and the json-simple jar they depend on.
REGISTER '/home/bigdata/Installations/pig-0.15.0/lib/elephantbird/json-simple-1.1.1.jar'
REGISTER '/home/bigdata/Installations/pig-0.15.0/lib/elephantbird/elephant-bird-pig-4.3.jar'
REGISTER '/home/bigdata/Installations/pig-0.15.0/lib/elephantbird/elephant-bird-hadoop-compat-4.3.jar'

tweets = LOAD '/user/bigdata/tweetsJSON/' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);
user_details = FOREACH tweets GENERATE json#'user' AS tweetUser;
user_followers = FOREACH user_details GENERATE (chararray)tweetUser#'screen_name' AS screenName, (int)tweetUser#'followers_count' AS followersCount;
user_followers_distinct = DISTINCT user_followers;
user_followers_sorted = ORDER user_followers_distinct BY followersCount DESC;

DUMP user_followers_sorted;

The above program got converted into a DAG of 3 MapReduce jobs and took 1 min 6 sec to complete, which is not really that efficient. It should be possible to implement the same using two MapReduce jobs. I am not sure if there is any way to optimize the above Pig script. If you have any feedback, please let me know in the comments and I will try it out and update the blog.
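
To make the logic of the Pig script easier to follow, here is a minimal Python sketch of the same three steps (project the user fields, de-duplicate, sort by follower count) run on a few in-memory records. The sample lines and the function name user_followers_sorted are made up for illustration; real tweets gathered by Flume carry many more fields. Note that this sketch skips records without the two user fields, whereas the Pig script would emit nulls for them.

```python
import json

# Hypothetical sample records, one JSON document per line.
SAMPLE_LINES = [
    '{"text": "a", "user": {"screen_name": "alice", "followers_count": 120}}',
    '{"text": "b", "user": {"screen_name": "bob", "followers_count": 45}}',
    '{"text": "c", "user": {"screen_name": "alice", "followers_count": 120}}',
    '{"text": "d", "user": {"screen_name": "carol", "followers_count": 300}}',
]


def user_followers_sorted(lines):
    """Return distinct (screen_name, followers_count) pairs, largest count first."""
    pairs = []
    for line in lines:
        # Rough equivalent of LOAD plus the two FOREACH projections.
        user = json.loads(line).get("user") or {}
        name = user.get("screen_name")
        count = user.get("followers_count")
        if name is None or count is None:
            continue  # assumption of this sketch: drop incomplete records
        pairs.append((str(name), int(count)))
    # Rough equivalent of DISTINCT.
    distinct = set(pairs)
    # Rough equivalent of ORDER ... BY followersCount DESC.
    return sorted(distinct, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for screen_name, followers in user_followers_sorted(SAMPLE_LINES):
        print(screen_name, followers)
```

The de-duplication and the sort are separate steps here, just as DISTINCT and ORDER are separate statements in the Pig script.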

The same data was processed using Hive with the JSONSerDe provided by Cloudera, as mentioned in the original blog. Only a single MapReduce job was generated and it took 21 seconds to process the data, a drastic improvement over Pig using the elephant-bird library.

In the coming blog posts, we will look at a few other ways of processing JSON data, which is a very common format.

1 comment:

  1. Hi, thanks for the great post, really helpful. I'm having a hard time installing Elephant Bird on my instance. Do you have any instructions for that?