Wednesday, November 27, 2013

Installing and configuring Storm on Ubuntu 12.04

 
In this blog we have so far been exploring frameworks like Hadoop, Pig, Hive and others, which are batch oriented (some call it long running, as opposed to real time). So, depending on the size of the data and of the cluster, it might take a couple of hours to process the data.

The above-mentioned frameworks might not meet every user requirement. Let's take the use case of a credit card or an insurance company: they would like to detect fraud as soon as possible to minimize its effects. This is where frameworks like Apache Storm, LinkedIn Samza and Amazon Kinesis help fill the gap.

Storm was developed at BackType, a social analytics company later acquired by Twitter, and Twitter subsequently released the code. This Twitter blog makes a good introduction to the Storm architecture. Instead of repeating the same here, we will look into how to install and configure Apache Storm on a single node. It took me a couple of hours to figure it out, so this blog entry should make it easy for others to get started with Storm. References 1 and 2 had been really helpful in figuring it out. Note that Storm has been submitted to the Apache Software Foundation and is in the Incubator phase.

Here are the steps from 10k feet:

- Download the binaries, install and configure ZooKeeper.
- Download the code, build and install zeromq and jzmq.
- Download the binaries, install and configure Storm.
- Download the sample Storm code, build and execute it.

Here are the steps in detail; a rough sketch of the commands follows below. There is no need to install Hadoop etc., as Storm works independently of Hadoop. Note that these steps have been tried on Ubuntu 12.04 and might change a bit with other operating systems.
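As a rough sketch of those steps on Ubuntu 12.04, the commands below show one possible sequence. The versions, download URLs and paths are assumptions from around the time of writing, so substitute whatever is current.

# build tools needed for zeromq/jzmq
sudo apt-get install -y build-essential uuid-dev libtool autoconf automake git openjdk-6-jdk

# 1) ZooKeeper - download the binaries and start a single-node instance
wget http://archive.apache.org/dist/zookeeper/zookeeper-3.4.5/zookeeper-3.4.5.tar.gz
tar xzf zookeeper-3.4.5.tar.gz
cp zookeeper-3.4.5/conf/zoo_sample.cfg zookeeper-3.4.5/conf/zoo.cfg
zookeeper-3.4.5/bin/zkServer.sh start

# 2) zeromq - download the code, build and install
wget http://download.zeromq.org/zeromq-2.1.7.tar.gz
tar xzf zeromq-2.1.7.tar.gz
(cd zeromq-2.1.7 && ./configure && make && sudo make install)

#    jzmq - the Java binding for zeromq
git clone https://github.com/nathanmarz/jzmq.git
(cd jzmq && ./autogen.sh && ./configure && make && sudo make install)

# 3) Storm - unzip the release and point it at the local ZooKeeper in conf/storm.yaml:
#      storm.zookeeper.servers:
#        - "localhost"
#      nimbus.host: "localhost"
#    then start the daemons from the Storm directory (each keeps running, so use separate terminals or nohup)
bin/storm nimbus
bin/storm supervisor
bin/storm ui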

Monday, November 25, 2013

Installing/configuring HCatalog and integrating with Pig

As mentioned in the previous blog entry, Hive uses a metastore to store the table details, the mapping of the tables to the data and other details. Any framework which provides an SQL-like interface would require a metastore similar to Hive's. Instead of having a separate, incompatible metastore for each of these frameworks, it would be good to have a common metastore across them. This is where HCatalog comes into the picture.
This way, the data structures created by one framework are visible to the others. Also, by using a common metastore, a data analyst can concentrate more on the analytics than on the location/format of the data.

Using open source is really nice, but sometimes the documentation lacks detail, even more so when it comes to integrating different frameworks, and there are also compatibility issues between them. So, this entry is about how to install/configure HCatalog and how to integrate it with Pig, so that Pig and Hive have a unified view of the metadata; a small taste of that integration is sketched below. Anyway, here is the documentation for HCatalog.
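As a taste of what that integration gives, a table registered in the HCatalog/Hive metastore can be read directly from Pig through HCatLoader. Here is a minimal sketch; the table name web_logs and the column url are just placeholders, and on newer Hive releases the loader class lives under org.apache.hive.hcatalog.pig instead.

# -useHCatalog puts the HCatalog jars and hive-site.xml on Pig's classpath
pig -useHCatalog -e "
  logs   = LOAD 'web_logs' USING org.apache.hcatalog.pig.HCatLoader();
  by_url = GROUP logs BY url;
  counts = FOREACH by_url GENERATE group AS url, COUNT(logs) AS hits;
  DUMP counts;
"

The same web_logs table, created once in Hive, shows up in Pig without the schema being re-declared, which is exactly the unified metadata view mentioned above.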

Saturday, November 23, 2013

Impala or Hive - when to use what?

Hive was initially developed by Facebook and later released to the Apache Software Foundation; here is a paper from Facebook on the same. Impala from Cloudera is based on the Google Dremel paper. Both Impala and Hive provide an SQL type of abstraction for analytics on data on top of HDFS, and both use the Hive metastore. So, when to use Hive and when to use Impala?

Here is a discussion on Quora on the same, and here is a snippet from the Cloudera Impala FAQ:

Impala is well-suited to executing SQL queries for interactive exploratory analytics on large datasets. Hive and MapReduce are appropriate for very long running, batch-oriented tasks such as ETL.

And here is a nice presentation which summarizes the Hive vs Impala question to the point, so I won't be repeating it again in this blog.



Note that performance is not the only non-functional requirement for picking a particular framework. Also, the Big Data space has been moving rapidly, and the comparison results might tip the other way in the future as more improvements are made to the corresponding frameworks.
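Since both engines share the same metastore, the same table can be queried from either one. A minimal sketch, where the table name web_logs and the Impala daemon host are placeholders:

# Hive - the query is compiled into MapReduce job(s), suited to long-running batch work
hive -e "SELECT url, COUNT(*) FROM web_logs GROUP BY url"

# Impala - the query is served by the Impala daemons without MapReduce, suited to interactive use
impala-shell -i localhost -q "SELECT url, COUNT(*) FROM web_logs GROUP BY url"

If the table was created through Hive after the Impala daemons started, an INVALIDATE METADATA from impala-shell may be needed first so that Impala picks up the new metastore entry.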

Thursday, November 21, 2013

Different ways of configuring Hive metastore

Apache Hive is a client-side library providing a table-like abstraction on top of the data in HDFS for data processing. Hive queries are converted into an MR plan, which is then submitted to the Hadoop cluster for execution.

The Hive table definitions and the mapping to the data are stored in a metastore. The metastore consists of (1) the metastore service and (2) the database. The metastore service provides the interface to Hive, and the database stores the table definitions, the mappings to the data and so on.

The metastore (service and database) can be configured in different ways. In the default Hive configuration (as it comes from Apache Hive without any configuration changes), the Hive driver, the metastore interface and the database (Derby) all run in the same JVM. This configuration is called an embedded metastore and is good for development and unit testing, but it won't scale to a production environment, as only a single user can connect to the Derby database at any instant of time; starting a second instance of the Hive driver throws an error.
Another way is to use an external JDBC-compliant database like MySQL, as shown below. This configuration is called a local metastore.
Here are the steps required to configure Hive with a local metastore; a minimal hive-site.xml sketch follows.
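As a minimal sketch, the hive-site.xml below configures a local metastore backed by MySQL. The database name, user and password are placeholders; a MySQL user with rights on that database has to exist (the createDatabaseIfNotExist flag takes care of creating the database itself), and the MySQL JDBC driver jar has to be copied into $HIVE_HOME/lib.

cat > $HIVE_HOME/conf/hive-site.xml <<'EOF'
<configuration>
   <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://localhost:3306/hive_metastore?createDatabaseIfNotExist=true</value>
   </property>
   <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
   </property>
   <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hiveuser</value>
   </property>
   <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>hivepassword</value>
   </property>
</configuration>
EOF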

Tuesday, November 19, 2013

Using Log4J/Flume to log application events into HDFS

Many a time, the events from an application have to be analyzed to know more about customer behavior, for recommendations or to figure out fraudulent use cases. With more data to analyze, it might take a lot of time, or sometimes it is not even possible, to process the events on a single machine. This is where distributed systems like Hadoop and others fit the requirement.

Apache Flume and Sqoop can both be used to move data from a source to a sink. The main difference is that Flume is event based, while Sqoop is not. Also, Sqoop is used to move data from structured data stores like an RDBMS to HDFS and HBase, while Flume supports a wide variety of sources and sinks; a sample Sqoop invocation is shown below for contrast.
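For contrast, a typical Sqoop invocation pulls an entire RDBMS table into HDFS as a one-shot batch job; the connection string, credentials and table name below are placeholders:

sqoop import --connect jdbc:mysql://localhost/retail --username sqoopuser -P \
      --table customers --target-dir /user/vm4learning/customers --num-mappers 1

Flume, on the other hand, keeps running and collects events continuously as they are generated.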

One of the options is to make the application use Log4J to send its log events to Flume, which will then store them in HDFS for further analysis.

Here are the steps to configure Flume with the Avro source, memory channel and HDFS sink, and to chain them together; a sketch of the resulting agent configuration follows.
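A sketch of one way to set this up, assuming Flume 1.x with an agent named agent1; the port and the HDFS path are placeholders:

# the agent definition: Avro source -> memory channel -> HDFS sink
cat > conf/flume.conf <<'EOF'
agent1.sources  = avro-source
agent1.channels = mem-channel
agent1.sinks    = hdfs-sink

# Avro source - receives the events sent by the Log4J Flume appender
agent1.sources.avro-source.type     = avro
agent1.sources.avro-source.bind     = 0.0.0.0
agent1.sources.avro-source.port     = 41414
agent1.sources.avro-source.channels = mem-channel

# Memory channel - buffers events between the source and the sink
agent1.channels.mem-channel.type = memory

# HDFS sink - writes the events into HDFS as plain text
agent1.sinks.hdfs-sink.type          = hdfs
agent1.sinks.hdfs-sink.hdfs.path     = hdfs://localhost:9000/flume/events
agent1.sinks.hdfs-sink.hdfs.fileType = DataStream
agent1.sinks.hdfs-sink.channel       = mem-channel
EOF

# start the agent
bin/flume-ng agent --conf conf --conf-file conf/flume.conf --name agent1

On the application side, log4j.properties points an appender of class org.apache.flume.clients.log4jappender.Log4jAppender (from the flume-ng-log4jappender jar) at the same host and port, for example:

log4j.rootLogger=INFO, flume
log4j.appender.flume=org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname=localhost
log4j.appender.flume.Port=41414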

Tuesday, November 12, 2013

Creating a simple coordinator/scheduler using Apache Oozie

With the assumption that Oozie has been installed/configured as mentioned here and that a simple workflow can be executed as mentioned here, it's now time to look at how to schedule a workflow at regular intervals using Oozie.

- Create the coordinator.properties file (oozie-clickstream-examples/apps/scheduler/coordinator.properties). Note that the Oozie client reads this properties file from the local file system, while the coordinator.xml in the next step has to be in HDFS.
nameNode=hdfs://localhost:9000
jobTracker=localhost:9001
queueName=default

examplesRoot=oozie-clickstream-examples
examplesRootDir=/user/${user.name}/${examplesRoot}

oozie.use.system.libpath=true
oozie.coord.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/scheduler
- Create the coordinator.xml file in HDFS (oozie-clickstream-examples/apps/scheduler/coordinator.xml). The job runs every 10 minutes (the frequency attribute is in minutes) between the specified start and end times.
<coordinator-app name="wf_scheduler" frequency="10" start="2013-10-24T22:08Z" end="2013-10-24T22:12Z" timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
   <action>
      <workflow>
         <app-path>${nameNode}${examplesRootDir}/apps/cs</app-path>
      </workflow>
   </action>
</coordinator-app> 
Note that the Oozie coordinator can be time based or event based, with much more complexity than mentioned here. Here are the specifications for the Oozie coordinator.

- For some reason, Oozie is picking a time 1 hour behind the system time. This can be observed from the `Created` time of the latest submitted workflow and the system time at the top right in the below screen. So, the start and end times in the previous step had to be moved back by an hour from the actual times.
Not sure why this happens (possibly a timezone/DST handling quirk, since Oozie interprets the coordinator start/end times as UTC), but I will update the blog if the cause is found out or someone posts why in the comments.

- Submit the Oozie coordinator job as:
bin/oozie job -oozie http://localhost:11000/oozie -config /home/vm4learning/Code/oozie-clickstream-examples/apps/scheduler/coordinator.properties -run
- The coordinator job should appear in the Oozie console, moving from the PREP to the RUNNING state, all the way to SUCCEEDED.
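The status can also be checked from the command line; <coordinator-job-id> below stands for the id printed by the -run command above:

bin/oozie job -oozie http://localhost:11000/oozie -info <coordinator-job-id>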
- The output should appear as below in the `oozie-clickstream-examples/finaloutput/000000_0` file in HDFS.
www.businessweek.com 2
www.eenadu.net 2
www.stackoverflow.com 2
Note that Oozie has a concept of bundles, where a user can batch a group of coordinator applications and execute an operation on the whole bunch at a time. Will look into it in another blog entry.