Tuesday, October 29, 2013

Executing an Oozie workflow with Pig, Hive & Sqoop actions

In earlier blog entries, we looked at how to install Oozie here and how to do Click Stream analysis using Hive and Pig here. This blog is about executing a simple workflow which imports the user data from a MySQL database using Sqoop, pre-processes the Click Stream data using Pig and finally does some basic analytics on the user and the Click Stream data using Hive.

Before running a Hive query, the table/column/column-types have to be defined, so the data for Hive needs to have some structure. Pig is better suited for processing semi-structured data when compared to Hive. Here is Pig vs Hive at a very high level. For this reason, a common use case is Pig being used for pre-processing the data (filtering out the invalid records, massaging the data, etc.) to make it ready for Hive to consume.

The below DAG was generated by Oozie. The fork will spawn a Pig action (which cleans the Click Stream data) and a Sqoop action (which imports the user data from a MySQL database) in parallel. Once the Pig and the Sqoop actions are done, the Hive action will be started to do the final analytics combining the Click Stream and the User data.
Here are the steps to define the workflow and then execute it. This is with the assumption that MySQL, Oozie and Hadoop have been installed, configured and are working properly. Here are the instructions for installing and configuring Oozie.

- The workflow requires more than 2 map slots in the cluster, so if the workflow is executed on a single-node cluster the following has to be included in `mapred-site.xml`.
<property>
   <name>mapred.tasktracker.map.tasks.maximum</name>
   <value>4</value>
</property>
<property>
   <name>mapred.tasktracker.reduce.tasks.maximum</name>
   <value>4</value>
</property>
- Create the file `oozie-clickstream-examples/input-data/clickstream/clickstream.txt` in HDFS with the below content (a sample command for copying it into HDFS is sketched after the data). Note that the last record is an invalid record which is filtered out by Pig when the workflow is executed. The first field is the userId and the second field is the site visited by the user.
1,www.bbc.com
1,www.abc.com
1,www.gmail.com
2,www.cnn.com
2,www.eenadu.net
2,www.stackoverflow.com
2,www.businessweek.com
3,www.eenadu.net
3,www.stackoverflow.com
3,www.businessweek.com
A,www.thecloudavenue.com
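A minimal sketch of copying the data into HDFS, assuming the records above were saved locally as `/home/vm4learning/Code/clickstream.txt` and the `vm4learning` user used in the rest of this post (both paths are just examples):
bin/hadoop fs -mkdir /user/vm4learning/oozie-clickstream-examples/input-data/clickstream
bin/hadoop fs -put /home/vm4learning/Code/clickstream.txt /user/vm4learning/oozie-clickstream-examples/input-data/clickstream/clickstream.txt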
- Create a user table in MySQL
CREATE TABLE user (
    user_id INTEGER NOT NULL PRIMARY KEY,
    name CHAR(32) NOT NULL,
    age INTEGER,
    country VARCHAR(32),
    gender CHAR(1)
);
- And insert some data into it. A standalone Sqoop import to verify this setup is sketched after the insert statements.
insert into user values (1,"Tom",20,"India","M");
insert into user values (2,"Rick",5,"India","M");
insert into user values (3,"Rachel",15,"India","F");
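Before wiring Sqoop into the workflow, the import can be sanity-checked standalone (a sketch, assuming Sqoop is installed and that the `user` table was created in a database named `clickstream`, the name used in the workflow's connect string later). With a single mapper, Sqoop writes the rows to a part-m-00000 file:
bin/sqoop import --connect jdbc:mysql://localhost/clickstream --table user --target-dir /user/vm4learning/oozie-clickstream-examples/input-data/user -m 1
bin/hadoop fs -cat /user/vm4learning/oozie-clickstream-examples/input-data/user/part-m-00000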
- Extract the `oozie-4.0.0/oozie-sharelib-4.0.0.tar.gz` file from the Oozie installation folder and copy the mysql-connector-java-*.jar to the `share/lib/sqoop` folder, as sketched below. This jar is required for Sqoop to connect to the MySQL database and get the user data.
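A minimal sketch of these two steps, assuming the oozie-4.0.0 folder is in the current directory and the MySQL connector jar has already been downloaded (the connector path is a placeholder):
# extracting the sharelib creates the share/ folder referred to in the next step
tar -xzf oozie-4.0.0/oozie-sharelib-4.0.0.tar.gz
cp /path/to/mysql-connector-java-*.jar share/lib/sqoop/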

- Copy the above mentioned `share` folder into HDFS. Here is the significance of sharelib in Oozie. These are the common libraries which are used across different actions in Oozie.
bin/hadoop fs -put /home/vm4learning/Code/share/ /user/vm4learning/share/

- Create the workflow file in HDFS (oozie-clickstream-examples/apps/cs/workflow.xml). Note that the MySQL connect string in the Sqoop action has to be modified appropriately.
<?xml version="1.0" encoding="UTF-8"?>
<workflow-app xmlns="uri:oozie:workflow:0.2" name="cs-wf-fork-join">
    <start to="fork-node"/>

    <fork name="fork-node">
        <path start="sqoop-node" />
        <path start="pig-node" />
    </fork>

    <action name="sqoop-node">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${nameNode}/${examplesRootDir}/input-data/user"/>
            </prepare>

            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <command>import --connect jdbc:mysql://localhost/clickstream --table user --target-dir ${examplesRootDir}/input-data/user -m 1</command>
        </sqoop>
        <ok to="joining"/>
        <error to="fail"/>
    </action>

    <action name="pig-node">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${nameNode}${examplesRootDir}/intermediate"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
                <property>
                    <name>mapred.compress.map.output</name>
                    <value>true</value>
                </property>
            </configuration>
            <script>filter.pig</script>
            <param>INPUT=${examplesRootDir}/input-data/clickstream</param>
            <param>OUTPUT=${examplesRootDir}/intermediate</param>
        </pig>
        <ok to="joining"/>
        <error to="fail"/>
    </action>

    <join name="joining" to="hive-node"/>

    <action name="hive-node">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${nameNode}/${examplesRootDir}/finaloutput"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <script>script.sql</script>
            <param>CLICKSTREAM=${examplesRootDir}/intermediate/</param>
            <param>USER=${examplesRootDir}/input-data/user/</param>
            <param>OUTPUT=${examplesRootDir}/finaloutput</param>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
- Create the job.properties file (oozie-clickstream-examples/apps/cs/job.properties). As noted below, this file is used from the local file system when submitting the workflow.
nameNode=hdfs://localhost:9000
jobTracker=localhost:9001
queueName=default

examplesRoot=oozie-clickstream-examples
examplesRootDir=/user/${user.name}/${examplesRoot}

oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/cs
- Create the Hive script file in HDFS (oozie-clickstream-examples/apps/cs/script.sql). The below query finds the top 3 URLs visited by users whose age is less than 16.
DROP TABLE clickstream;
CREATE EXTERNAL TABLE clickstream (userid INT, url STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '${CLICKSTREAM}';

DROP TABLE user;
CREATE EXTERNAL TABLE user (user_id INT, name STRING, age INT, country STRING, gender STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '${USER}';

INSERT OVERWRITE DIRECTORY '${OUTPUT}' SELECT url,count(url) c FROM user u JOIN clickstream c ON (u.user_id=c.userid) where u.age<16 group by url order by c DESC LIMIT 3;
- Create the Pig script file in HDFS (oozie-clickstream-examples/apps/cs/filter.pig). A quick local test of the script is sketched after it.
clickstream = load '$INPUT' using PigStorage(',') as (userid:int, url:chararray);
SPLIT clickstream INTO good_records IF userid is not null,  bad_records IF userid is null;
STORE good_records into '$OUTPUT';
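As noted in the final thoughts below, it helps to test each action outside Oozie first. A sketch of running the script in Pig's local mode, assuming Pig is installed and the clickstream data is also available on the local file system (the local paths are examples):
pig -x local -param INPUT=/home/vm4learning/Code/clickstream.txt -param OUTPUT=/tmp/cs-intermediate filter.pig
cat /tmp/cs-intermediate/part-*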
- Execute the Oozie workflow as below. Note that the `job.properties` file should be present in the local file system and not in HDFS.
bin/oozie job -oozie http://localhost:11000/oozie -config /home/vm4learning/Code/oozie-clickstream-examples/apps/cs/job.properties -run
- Initially the job will be in the `RUNNING` state and finally it will reach the `SUCCEEDED` state. The progress of the workflow can be monitored from the Oozie console at http://localhost:11000/oozie/ or from the command line as sketched below.
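A sketch of checking the state from the command line; the job id below is a placeholder for the id printed by the `-run` command above:
bin/oozie job -oozie http://localhost:11000/oozie -info <job-id>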

- The output should appear as below in the `oozie-clickstream-examples/finaloutput/000000_0` file in HDFS (a command to view it is sketched after the output).
www.businessweek.com 2
www.eenadu.net 2
www.stackoverflow.com 2
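A sketch of viewing the result directly from HDFS, assuming the same examples root directory as above:
bin/hadoop fs -cat /user/vm4learning/oozie-clickstream-examples/finaloutput/000000_0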
Here are some final thoughts on Oozie.

- It's better to test the individual actions like Hive, Pig and Sqoop independent of Oozie and later integrate them into the Oozie workflow.

- The Oozie error messages are very cryptic, and the MapReduce log files need to be looked at to figure out the actual error.

- The Web UI which comes with Oozie is very rudimentary and clumsy; some of the alternatives need to be looked into.

- The XML for creating the workflows is very verbose and error prone. A UI for creating Oozie workflows would be very helpful.

Will look into the alternatives for some of the above mentioned problems in a future blog entry.

Tuesday, October 22, 2013

Integrating R and Hadoop using rmr2/rhdfs packages from Revolution Analytics

As mentioned in the earlier post, Hadoop MR programs can be developed in non-Java languages using Hadoop Streaming.

R is a language which has a lot of libraries for statistics, so data processing involving a lot of statistics can be easily done in R. R can be integrated with Hadoop by writing R programs which read from the STDIO, do some processing and then write the results back to the STDIO. An easier approach is to use libraries like RHIPE and rmr2/rhdfs from Revolution Analytics.

With the assumption that R has been installed and integrated with Hadoop as mentioned in the previous blog, this blog will look into how to integrate R with Hadoop using 3rd party packages like rmr2/rhdfs from Revolution Analytics. As mentioned here, it is much easier to write R programs with rmr2/rhdfs than without.

Here are the instructions on how to install rmr2/rhdfs and then write/run MR programs using those packages.

- Install the dependencies for rmr2/rhdfs. Note that the dependencies have to be installed in the system folder and not in a custom folder; this can be done by running R as sudo and then installing the below mentioned packages. Check this question on StackOverflow for the reason why.
install.packages(c("codetools", "R", "Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2", "rJava"))
- Download the latest rmr2*tar.gz and rhdfs*tar.gz from here.

- Install rmr2.
sudo R CMD INSTALL rmr2_2.3.0.tar.gz
- Install rhdfs.
sudo JAVA_HOME=/usr/lib/jvm/jdk1.6.0_45 R CMD javareconf
sudo HADOOP_CMD=/home/vm4learning/Installations/hadoop-1.2.1/bin/hadoop R CMD INSTALL rhdfs_*.tar.gz
- Execute the below commands from the R shell to kick off the MR job.
Sys.setenv(HADOOP_HOME="/home/vm4learning/Installations/hadoop-1.2.1")
Sys.setenv(HADOOP_CMD="/home/vm4learning/Installations/hadoop-1.2.1/bin/hadoop")

library(rmr2)
library(rhdfs)

ints = to.dfs(1:100)
calc = mapreduce(input = ints, map = function(k, v) cbind(v, 2*v))

from.dfs(calc)
- Verify the output of the MR job to make sure that Hadoop and rmr2/rhdfs have been integrated properly.
$val
         v    
  [1,]   1   2
  [2,]   2   4
  ............
  ............

In the upcoming blogs, we will see how to write complex MR programs using rmr2/rhdfs.

Wednesday, October 16, 2013

Finding interested Hadoop users in the vicinity using twitteR

In an earlier blog article, we looked at how to get the Tweets using Flume, put them into HDFS and then analyze them using Hive. Now, we will try to get some more interesting data from Twitter using R.

It's nice to know others with the same interests as ours within our vicinity. One way is to find out who Tweeted about a certain topic of interest in the vicinity. Twitter provides a location feature which attaches the location information as metadata to the Tweets. This feature is off by default in the Twitter settings and here is how to opt in.

R provides a nice library (twitteR) to extract the Tweets, and one of the options is to specify the geo location. Here is the API documentation for the twitteR package. With the assumption that R has been installed as mentioned in the earlier blog, now is the time to install the twitteR package and get Tweets.

Here are the steps.

- First, the `libcurl4-openssl-dev` package has to be installed.
sudo apt-get install libcurl4-openssl-dev
- Create an account with twitter.com and create a new application here and get the `Consumer Key` and the `Consumer Secret`.
- Install the twitteR package from the R shell along with the dependencies.
install.packages("twitteR", dependencies=TRUE)
- Load the twitteR library in the R shell and call the getTwitterOAuth function with the `Consumer Key` and the `Consumer Secret` obtained in the earlier step.
library("twitteR")
getTwitterOAuth("Consumer Key", "Consumer Secret")

- The previous step will provide a URL; copy the URL into the browser and a PIN will be provided.
 
- Copy the PIN and paste it back in the console and now we are all set to get Tweets using the twitteR package.

- Go to Google Maps and find your current location. The latitude/longitude will be in the URL.
https://www.google.com/maps/preview#!q=60532&data=!4m15!2m14!1m13!1s0x880e5123b7fe36d9%3A0xbedcc6bcaa223107!3m8!1m3!1d22176155!2d-95.677068!3d37.0625!3m2!1i1317!2i653!4f13.1!4m2!3d41.794402!4d-88.0803051

- Call the searchTwitter API with the above latitude/longitude and topic of interest to get the Tweets to the console.
searchTwitter('hadoop', geocode='41.794402,-88.0803051,5mi',  n=5000, retryOnRateLimit=1)
If you are providing some sort of service, you can now easily find others who are interested in your service in your vicinity.

1) Here is an interesting article on doing Facebook analysis using R.
 

Monday, October 14, 2013

MapReduce programming in R using Hadoop Streaming

Hadoop supports non-Java languages for writing MapReduce programs with the streaming feature. Any language which runs on Linux and can read/write from the STDIO can be used to write MR programs. With Microsoft HDInsight, it should be possible to even use .NET languages against Hadoop.

So, I often get the question `what is the best language for writing programs in the MR model?`. It all depends on the requirements. If the requirements are about statistics then R is one of the choices; if it is NLP then Python is one of the choices. Based on the library support for the requirements, the appropriate language has to be chosen.

In this blog we will install and configure R on Ubuntu to execute a simple MR program (word count) in R. This is with the assumption that Hadoop is already installed and works properly.

BTW, R can be integrated with Hadoop in multiple ways. One way is to use R without any additional packages; in this case, R would read from the STDIO, do some processing and then write to the STDIO. The second way is to use packages like RHIPE, rmr2 and others. This blog is about integrating R with Hadoop without using any 3rd party libraries (the first approach).

Just as a note, I have been using Ubuntu 12.04, because it's the latest LTS release from Canonical as of this writing and is supported till April 2017. Canonical follows a 6-month release cycle and every fourth release is an LTS, which is supported for 5 years, while the rest of the releases are supported for 9 months. So, the next LTS release will be in April 2014. This is one of the reasons why I am stuck (and also feel good) with Ubuntu 12.04 for now.

The main advantage of sticking with an old release of Ubuntu (which is still supported) is that it becomes more and more stable over time, and the main disadvantage is that the packages are outdated. Canonical provides major upgrades to software only across different Ubuntu releases. So, Ubuntu 12.04 has an old version of R (2.14.1) which doesn't work with some of the Hadoop packages from Revolution Analytics and others, so the first step is to upgrade R and not use the one which comes with Ubuntu.

So, here are the steps (note that the paths should be updated accordingly and also that some of the commands require administrative privileges).

- Pick the closest CRAN mirror from here and add the below lines to the `/etc/apt/sources.list` file to add a new repository. The same thing can also be done using the Synaptic Package Manager; the instructions for the same are here.
deb http://cran.ma.imperial.ac.uk/bin/linux/ubuntu precise/

deb-src http://cran.ma.imperial.ac.uk/bin/linux/ubuntu precise/
- Update the package index using the below command
sudo apt-get update
- Now is the time to install R using apt-get as below. This command will install R with all its dependencies.
sudo apt-get install r-base
- Verify that R is installed properly by running a simple program in the R shell.
> ints = 1:100
> doubleInts = sapply(ints, function(x) 2*x)
> head(doubleInts)

[1]  2  4  6  8 10 12

- Now create a mapper.R file using the below code
#! /usr/bin/env Rscript
# mapper.R - Wordcount program in R
# script for Mapper (R-Hadoop integration)

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))

con <- file("stdin", open = "r")

while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    words <- splitIntoWords(line)

    for (w in words)
        cat(w, "\t1\n", sep="")
}
close(con)

- Now create a reducer.R file using the below code
#! /usr/bin/env Rscript
# reducer.R - Wordcount program in R
# script for Reducer (R-Hadoop integration)

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)

splitLine <- function(line) {
    val <- unlist(strsplit(line, "\t"))
    list(word = val[1], count = as.integer(val[2]))
}

env <- new.env(hash = TRUE)
con <- file("stdin", open = "r")

while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    split <- splitLine(line)
    word <- split$word
    count <- split$count

    if (exists(word, envir = env, inherits = FALSE)) {
        oldcount <- get(word, envir = env)
        assign(word, oldcount + count, envir = env)
    }
    else assign(word, count, envir = env)
}
close(con)

for (w in ls(env, all = TRUE))
    cat(w, "\t", get(w, envir = env), "\n", sep = "")

- Streaming code works independently of Hadoop, because the scripts read/write from the STDIO and have no Hadoop bindings. So, the scripts can be tested independently of Hadoop as below.
vm4learning@vm4learning:~/Code/Streaming$ echo "foo foo quux labs foo bar quux" | Rscript mapper.R  | sort -k1,1 | Rscript reducer.R
bar    1
foo    3
labs    1
quux    2
- Create a simple input file and then upload the data into HDFS before running the MR job.
bin/hadoop fs -put /home/vm4learning/Desktop/input.txt input3/input.txt

- Run the MR job as below
bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar -file  /home/vm4learning/Code/Streaming/mapper.R  -mapper /home/vm4learning/Code/Streaming/mapper.R -file /home/vm4learning/Code/Streaming/reducer.R  -reducer /home/vm4learning/Code/Streaming/reducer.R -input input3/ -output output3/

- The output of the MR job should be in the output3/ folder in HDFS and can be viewed as sketched below.
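A sketch of viewing the word counts, assuming the usual part file naming of streaming job output:
bin/hadoop fs -cat output3/part-*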

In the upcoming blogs, we will see how to use 3rd party libraries like the ones from Revolution Analytics to write much simpler MR programs in R.

1) Here is a blog from HortonWorks on the same.

Monday, October 7, 2013

Installation and configuration of Apache Oozie

Many a time there will be a requirement to run a group of dependent data processing jobs. Also, we might want to run some of them at regular intervals of time. This is where Apache Oozie fits into the picture. Here are some nice articles (1, 2, 3, 4) on how to use Oozie.

Apache Oozie has three components: a workflow engine to run a DAG of actions, a coordinator (similar to a cron job or a scheduler) and a bundle to batch a group of coordinators. Azkaban from LinkedIn is similar to Oozie; here are the articles (1, 2) comparing the two.

Installing and configuring Oozie is not straightforward, not only because of the documentation, but also because the release includes only the source code and not the binaries. The code has to be downloaded, the dependencies installed and then the binaries built. It's a bit of a tedious process, so this blog assumes that Hadoop has already been installed and configured. Here is the official documentation on how to build and install Oozie.

So, here are the steps to install and configure Oozie.

- Make sure the requirements (Unix box (tested on Mac OS X and Linux), Java JDK 1.6+, Maven 3.0.1+, Hadoop 0.20.2+, Pig 0.7+) to build are met.

- Download a release containing the code from Apache Oozie site and extract the source code.
- Execute the below command to start the build. During the build process, the jars have to be downloaded, so it might take some time based on the network bandwidth. Make sure that there are no errors in the build process.
bin/mkdistro.sh -DskipTests
- Once the build is complete the binary file oozie-4.0.0.tar.gz should be present in the folder where Oozie code was extracted. Extract the tar.gz file, this will create a folder called oozie-4.0.0.

- Create a libext/ folder inside the extracted oozie-4.0.0 folder and copy the commons-configuration-*.jar, ext-2.2.zip, hadoop-client-*.jar and hadoop-core-*.jar files into it (a sketch of the commands follows the exception trace below). The Hadoop jars need to be copied from the Hadoop installation folder.

When Oozie is started without the commons-configuration-*.jar, the below exception is seen in the catalina.out log file. This is the reason for including that jar in the libext/ folder.
java.lang.NoClassDefFoundError: org/apache/commons/configuration/Configuration
        at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<init>(DefaultMetricsSystem.java:37)
        at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<clinit>(DefaultMetricsSystem.java:34)
        at org.apache.hadoop.security.UgiInstrumentation.create(UgiInstrumentation.java:51)
        at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:217) 
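A minimal sketch of populating the libext/ folder, assuming HADOOP_HOME points at a Hadoop 1.x installation and that ext-2.2.zip has already been downloaded separately (the /path/to entries are placeholders):
cd oozie-4.0.0
mkdir libext
cp $HADOOP_HOME/hadoop-core-*.jar libext/
cp $HADOOP_HOME/hadoop-client-*.jar libext/
cp $HADOOP_HOME/lib/commons-configuration-*.jar libext/
cp /path/to/ext-2.2.zip libext/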
- Prepare a war file using the below command. The oozie.war file should then be present in the oozie-4.0.0/oozie-server/webapps folder.
bin/oozie-setup.sh prepare-war
- Create Oozie related schema using the below command
bin/ooziedb.sh create -sqlfile oozie.sql -run

- Now is the time to start the Oozie Service which runs in Tomcat.
bin/oozied.sh start

- Check the Oozie log file logs/oozie.log to ensure Oozie started properly. Then run the below command to check the status of Oozie, or instead go to the Oozie console at http://localhost:11000/oozie
bin/oozie admin -oozie http://localhost:11000/oozie -status

- Now, the Oozie client has to be installed by extracting the oozie-client-4.0.0.tar.gz. This will create a folder called oozie-client-4.0.0.

With the Oozie service running and the Oozie client installed, now is the time to run some simple workflows in Oozie to make sure Oozie works fine. Oozie comes with a bunch of examples in the oozie-examples.tar.gz. Here are the steps for the same.

- Extract the oozie-examples.tar.gz and change the port number on which the NameNode listens (the Oozie default is 8020 and the Hadoop default is 9000) in all the job.properties files. Similarly, the port number for the JobTracker also has to be modified (the Oozie default is 8021 and the Hadoop default is 9001). A way to do this in bulk is sketched below.
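A sketch for doing this in bulk with sed, assuming the examples were extracted into the current directory with the usual examples/apps/*/job.properties layout and the default ports mentioned above:
sed -i 's/localhost:8020/localhost:9000/g' examples/apps/*/job.properties
sed -i 's/localhost:8021/localhost:9001/g' examples/apps/*/job.properties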

- In the Hadoop installation, add the below to the conf/core-site.xml file. Check the Oozie documentation for more information on what these parameters mean
     <property>
          <name>hadoop.proxyuser.training.hosts</name>
          <value>localhost</value>
     </property>
     <property>
          <name>hadoop.proxyuser.training.groups</name>
          <value>training</value>
     </property>
- Make sure that HDFS and MR are started and running properly.

- Copy the examples folder into HDFS using the below command
bin/hadoop fs -put /home/training/tmp/examples/ examples/

- Now run the Oozie example as
oozie job -oozie http://localhost:11000/oozie -config /home/training/tmp/examples/apps/map-reduce/job.properties -run
- The status of the job can be checked using the below command; the job log can be fetched similarly, as sketched after it.
oozie job -oozie http://localhost:11000/oozie -info 14-20090525161321-oozie-tucu
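A sketch of pulling the job log from the CLI; the job id is again a placeholder:
oozie job -oozie http://localhost:11000/oozie -log <job-id>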

In the upcoming blogs, we will see how to write some simple work flows and schedule tasks in Oozie.

Friday, October 4, 2013

Alternative to Google Reader Search

For a variety of reasons, RSS aggregators didn't hit the mainstream even with those who work in the IT field. I have been using RSS aggregators for quite some time to keep myself updated with the latest in technology.

Now that Google Reader has been sunset, a lot of alternative RSS aggregators have been popping up. The demise of the once dominant Google Reader really opened up the market for others. I tried a couple of RSS aggregators and finally settled on Feedly.
Feedly is a nice RSS aggregator which has almost all the features of Google Reader. Feedly recently even opened up its API for other applications to be built on the Feedly platform. But one of the features I really miss from Google Reader is searching through the subscribed feeds instead of the entire internet. Google Reader was really good at it, with an option for searching through a subset of feeds which came in very handy.

Feedly didn't initially have this feature, but it was later introduced through Feedly Pro, which is a paid service. So, I was looking for some alternatives to the Google Reader search feature. Google Custom Search Engine (CSE) is an alternative for searching through a bunch of sites. Here is a CSE created using the below mentioned sites.
 
Below is the sequence of steps to move the feeds from Feedly to Google CSE. The problem with this approach is that Feedly and Google CSE are separate data islands, and any feed subscribed to in Feedly needs to be included in Google CSE manually. The process is a bit cumbersome, but worth the effort.

- Export the feeds from Feedly to an OPML file. Here is how to do it.

- Run the below script to extract the feeds from the OPML file. Note that some of the feeds will be extracted as `http://feedproxy.google.com/TechCrunch`, which needs to be modified into the actual site `http://techcrunch.com/`.
grep xmlUrl feedly.opml | perl -ne 'if(/xmlUrl=\"(.+?)"/){print "$1\n"}'
- Once the feeds have been extracted and modified, import them into Google CSE in bulk as mentioned here.

- Extract the Public URL to search through the feeds added in the earlier step.

Voila, here are the results from the custom search for the word Bitcoin.
Please let me know in the comments if there are any better alternatives for solving this problem and I will update the blog accordingly.

Tuesday, October 1, 2013

Sqoop2 vs Sqoop

Apache Sqoop uses a client model where the user needs to install Sqoop along with the connectors/drivers on the client. Sqoop2 (the next version of Sqoop) uses a service based model, where the connectors/drivers are installed on the Sqoop2 server. Also, all the configuration needs to be done on the Sqoop2 server. Note that this blog entry refers to Sqoop 1.x (the client based model) as Sqoop and Sqoop 2.x (the service based model) as Sqoop2.

From an MR perspective, another difference is that Sqoop submits a map-only job, while Sqoop2 submits a MapReduce job in which the Mappers transport the data from the source and the Reducers transform the data as specified. This provides a clean abstraction. In Sqoop, both the transportation and the transformation were handled by the Mappers only.

Another major difference in Sqoop2 is from a security perspective. The administrator sets up the connections to the sources and the targets, while the operator uses the already established connections, so the operator need not know the details of the connections. Also, operators will be given access to only some of the connectors, as required.

Along with the continuation of the CLI, a Web UI can also be used with Sqoop2. The CLI and the Web UI consume the REST services provided by the Sqoop2 server. One thing to note is that the Web UI is part of Hue (HUE-1214) and not part of the ASF. The Sqoop2 REST interface also makes it easy to integrate with other frameworks like Oozie to define a workflow involving Sqoop2.
Here is a video on why Sqoop2; the audio is a bit unclear. For those who are more into reading, here are the articles (1, 2) on the same.

Also, here is the documentation for installing and using Sqoop2. I tried installing it and, as with any other framework in its initial stages, Sqoop2 is still half baked, with a lot more to be improved/developed from a usability/documentation perspective. Although the features of Sqoop2 are interesting, until Sqoop2 becomes more usable/stable we are left with Sqoop to get the data from RDBMSs into Hadoop.

Apache Tez - Beyond Hadoop (MR)

Apache Pig and Hive are higher level abstractions on top of MR (MapReduce). PigLatin scripts of Pig and HiveQL queries of Hive are converted into a DAG of MR jobs, with the first MR job (5 in the DAG) reading the input and the last MR job (2) writing the output. One of the problems with this approach is that the temporary data between the MR jobs (as in the case of 11 and 9) is written to HDFS (by 11) and read from HDFS (by 9), which leads to inefficiency. Not only this, multiple MR jobs also lead to initialization overhead.
With the Click Stream analysis mentioned in the earlier blog, to find the `top 3 urls visited by users whose age is less than 16`, Hive results in a DAG of 3 MR jobs, while Pig results in a DAG of 5 MR jobs.

Apache Tez (Tez means Speed in Hindi) is a generalized data processing paradigm which tries to address some of the challenges mentioned above. Using Tez with Hive/Pig, a single HiveQL query or PigLatin script will be converted into a single Tez job and not into a DAG of MR jobs as happens now. In Tez, data processing is represented as a graph, with the vertices representing the processing of data and the edges representing the movement of data between the processing steps.

With Tez, the output of an MR job can be directly streamed to the next MR job without writing to HDFS. If there is any failure, the tasks from the last checkpoint will be re-executed. In the above DAG, (5) will read the input from the disk and (2) will write the output to the disk. The data movement between the remaining vertices can happen in-memory, can be streamed over the network, or can be written to the disk for the sake of checkpointing.

According to the Tez developers in the below video, Tez is for frameworks like Pig and Hive, and not for end users to code against directly.
HortonWorks has been aggressively promoting Tez as a replacement for the Hadoop (MR) runtime and is also publishing a series of articles on the same (1, 2, 3, 4). A DAG of MR jobs would also run more efficiently on the Tez runtime than on the Hadoop runtime.

Work is in progress to run Pig and Hive on the Tez runtime. Here are the JIRAs (Pig and Hive) for the same. We need to wait and see how far it is accepted by the community. In the future we should be able to run Pig and Hive on either the Hadoop or the Tez runtime just by switching a flag in the configuration files.

Tez is a part of a bigger initiative Stinger, which I will cover in another blog entry.