Monday, December 31, 2012

Cloudera Certified Developer for CDH4 (CCD-410)

Finally, today, after all the procrastination, I got through the Cloudera Certified Developer for Apache Hadoop CDH4 (CCD-410) exam with 95% (57 correct answers out of 60 questions). Cloudera is offering 40% off the CDH4 tests and a Second Shot till the end of the year (2012), so I took the last opportunity on 31st December, 2012.


I took an appointment with Pearson VUE (the customer service was fantastic) on 27th December, 2012 and wrote the certification exam on 31st December, 2012. As many of you know, I am a big fan of Hadoop and related technologies and have been helping others through this blog and through different forums like the Hyderabad-HUG and StackOverflow. All of this, and the support I got, helped me get through the certification with a good score with just two days of preparation.

Finally, for those who are thinking about taking the certification, the Hadoop - The Definitive Guide (3rd Edition) book is a must read, along with a lot of hands-on experience. Also, there were no questions specific to the Cloudera version of Hadoop; getting familiar with the Apache version of Hadoop is enough for the certification.


It was a nice way to say goodbye to 2012 and welcome 2013. I am very comfortable with setting up a Hadoop cluster and plan to complete the Cloudera Certified Administrator for Apache Hadoop CDH4 (CCA-410) sometime in February, 2013. I will post an update on it in another blog entry.

Monday, December 24, 2012

Introduction to Apache Hive and Pig

Apache Hive is a framework that sits on top of Hadoop for running ad-hoc queries on the data in Hadoop. Hive supports HiveQL, which is similar to SQL but doesn't support the complete set of SQL constructs.

Hive converts a HiveQL query into a Java MapReduce program and then submits it to the Hadoop cluster. The same outcome can be achieved with either HiveQL or hand-written Java MapReduce, but the Java MapReduce version requires a lot more code to be written and debugged, as the word count sketch below shows. So, using Hive increases developer productivity.
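To make that concrete, here is a minimal word count sketch written against the Hadoop 1.x / CDH4-era Java MapReduce API (not tested against any particular release); the 'words' table in the HiveQL comment is just an assumption for illustration:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Roughly equivalent HiveQL (assuming a one-column table 'words' already exists):
//   SELECT word, COUNT(*) FROM words GROUP BY word;
public class WordCount {

  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Emit (word, 1) for every token in the input line
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // Sum up all the 1s emitted for this word
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The whole Java program above collapses into the single HiveQL statement shown in the comment, which is exactly the productivity gap Hive addresses.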

To summarize, Hive, through the HiveQL language, provides a higher-level abstraction over Java MapReduce programming. As with any other high-level abstraction, there is a bit of performance overhead in using HiveQL compared to Java MapReduce, but the Hive community is working to narrow this gap for the most commonly used scenarios.

Along the same lines, Pig provides a higher-level abstraction over MapReduce. Pig supports Pig Latin constructs, which are converted into a Java MapReduce program and then submitted to the Hadoop cluster.
While HiveQL is a declarative language like SQL, Pig Latin is a data flow language: the output of one Pig Latin construct is fed as input to the next one, and so on, as in the sketch below.
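As an illustration of that data flow, here is a small sketch that drives Pig from Java through the PigServer API; the input path, the aliases and the use of local mode are assumptions for the example:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigFlowExample {
  public static void main(String[] args) throws Exception {
    // Run Pig in local mode; ExecType.MAPREDUCE would submit to the cluster instead
    PigServer pig = new PigServer(ExecType.LOCAL);
    // Each statement's alias (lines, words, grouped, counts) feeds the next one
    pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
    pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
    pig.registerQuery("grouped = GROUP words BY word;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
    // Nothing runs until a STORE/DUMP; at this point the whole flow is compiled and executed
    pig.store("counts", "wordcount-output");
    pig.shutdown();
  }
}

Each alias is simply the input to the following statement, and nothing is executed until the final store call, at which point Pig compiles the whole flow into MapReduce jobs.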

Some time back, Cloudera published statistics about the workload characteristics of a typical Hadoop cluster, and it is easy to see that Pig and Hive jobs make up a good part of the jobs in a Hadoop cluster. Because of the higher developer productivity, many companies are opting for higher-level abstractions like Pig and Hive. So, we can bet there will be a lot more job openings around Hive and Pig than for MapReduce developers.
While the Programming Pig book was published some time back (October, 2011), the Programming Hive book was published only recently (October, 2012). For those who have experience working with an RDBMS, getting started with Hive would be a better option than getting started with Pig. That said, the Pig Latin language is not very difficult to pick up either.

To the underlying Hadoop cluster it is transparent whether a Java MapReduce job is submitted directly or a MapReduce job is submitted through Hive or Pig. Because of the batch-oriented nature of MapReduce jobs, the jobs submitted through Hive and Pig are also batch oriented in nature.

Hive and Pig don't meet real-time response requirements because of the previously mentioned batch-oriented nature of MapReduce jobs. Cloudera developed Impala, which is based on Dremel (a publication from Google), for interactive ad-hoc queries on top of Hadoop. Impala supports SQL-like queries and is compatible with HiveQL, so any application built on top of Hive should work with minimal changes on Impala. The major difference between Hive and Impala is that while HiveQL is converted into Java MapReduce jobs, Impala does not convert the SQL query into Java MapReduce jobs.
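For example, a Hive application can be as thin as the following JDBC client sketch; the HiveServer driver class, host, port and table name are assumptions here, and per the compatibility claim above, pointing such a client at an Impala endpoint should need little more than a different connection URL:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryClient {
  public static void main(String[] args) throws Exception {
    // Hive JDBC driver that ships with CDH4-era Hive; host/port/database are assumptions
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
    Statement stmt = con.createStatement();
    // The query itself is plain HiveQL; the same text should run on Impala
    ResultSet rs = stmt.executeQuery("SELECT word, COUNT(*) AS cnt FROM words GROUP BY word");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
    }
    rs.close();
    stmt.close();
    con.close();
  }
}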

One might ask whether to go with Pig or Hive for a particular requirement, but that is a topic for another blog post.

Saturday, December 1, 2012

RDBMS vs NoSQL Data Flow Architecture

Recently I had an interesting conversation with an Oracle Database expert on the differences between an RDBMS and a NoSQL database. There are a lot of differences, but the data flow in those systems is as described below.

In a traditional RDBMS, the data is first written to the database, then to memory. When the memory reaches a certain threshold, it is written to the logs. The log files are used for recovery in case of a server crash. In an RDBMS, before success is returned to the client for an insert/update, the data has to be validated against the predefined schema, the indexes have to be updated, and so on, which makes it a bit slower than the NoSQL approach discussed below.
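To see where that validation shows up from the client's point of view, here is a minimal JDBC sketch; the MySQL URL, credentials, table and columns are all assumptions for illustration:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class RdbmsInsertExample {
  public static void main(String[] args) throws Exception {
    // Driver class and connection details are illustrative only
    Class.forName("com.mysql.jdbc.Driver");
    Connection con = DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "user", "password");
    PreparedStatement ps = con.prepareStatement("INSERT INTO events (id, payload) VALUES (?, ?)");
    ps.setInt(1, 1);
    ps.setString(2, "hello");
    try {
      // The statement returns only after the row has been checked against the
      // schema and constraints and the indexes have been updated
      ps.executeUpdate();
    } catch (SQLException e) {
      // A NOT NULL, type or unique-key violation is reported here,
      // before the client ever sees a success
      System.err.println("Insert rejected: " + e.getMessage());
    }
    ps.close();
    con.close();
  }
}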

In the case of a NoSQL database like HBase, the data is first written to the log (the write-ahead log, or WAL), then to memory. When the memory reaches a certain threshold, it is written to the database files. Before returning success for a put call, the data only has to be written to the log file; there is no need for the data to be written to the database files or validated against a schema.
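Here is a minimal sketch of such a put using the CDH4-era HBase Java client; the table name, column family and values are assumptions, and the table is assumed to already exist:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutExample {
  public static void main(String[] args) throws Exception {
    // Picks up hbase-site.xml from the classpath for the ZooKeeper quorum etc.
    Configuration conf = HBaseConfiguration.create();
    // The 'metrics' table with column family 'cf' is assumed to exist already
    HTable table = new HTable(conf, "metrics");
    Put put = new Put(Bytes.toBytes("row-2012-12-01"));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes("42"));
    // put() returns once the mutation is in the WAL and in memory;
    // the write to the database files on disk happens later, in the background
    table.put(put);
    table.close();
  }
}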

Log files (the first step in NoSQL) are simply appended to at the end, which is much faster than writing to the database (the first step in an RDBMS). The NoSQL data flow discussed above therefore supports a higher rate of inserts/updates in NoSQL databases when compared to an RDBMS.