Thursday, January 2, 2014

Getting started with Big Data - Part 3 - Installation and configuration of Apache Bigtop
In earlier blog entries we looked at how to install VirtualBox and then install Ubuntu on top of it. In this final part of the series, we will look at how to install Bigtop on top of the Ubuntu guest OS. From the Apache Bigtop site:

The primary goal of Bigtop is to build a community around the packaging and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc...) developed by a community with a focus on the system as a whole, rather than individual projects.

Open source software/frameworks work well individually, but it takes some effort/time to integrate them; the main challenge is the interoperability issues between the different frameworks. This is where companies like Cloudera, Hortonworks, MapR and others come into play. They take the different frameworks from the Apache Software Foundation and make sure they play nicely with each other. Not only do they address the interoperability issues, but they also make performance/usability enhancements.

Apache Bigtop takes this effort from an individual company level to a community level. Bigtop can be compared to Fedora, while Cloudera (CDH) / Hortonworks (HDP) / MapR (M3/M5/M7) can be compared to RHEL. Red Hat provides commercial support for RHEL, while Cloudera / Hortonworks / MapR provide commercial support for their own distributions of Hadoop. Also, just as Fedora carries leading-edge software and a wider variety of it, so does Bigtop. A lot of Apache frameworks (like Mahout / Hama) are included in Bigtop, but not in the commercial distributions like CDH/HDP/M*.

For those who want to take a deep dive into Big Data, Bigtop makes sense as it includes a lot of additional Big Data frameworks. Also, there are not many restrictions on its usage. More on What is Bigtop, and Why Should You Care?

Here is the official documentation on installing Bigtop. But the documentation is a bit outdated and has some steps missing. Here are the steps in detail.

- Install Java as mentioned here. Make sure that Oracle JDK 6 is installed and not JDK 7, because Bigtop has been tested with JDK 6.

- Get the key for Bigtop and add it to the list of trusted keys.
wget -O- | sudo apt-key add -

- Add the Bigtop repository to the Guest OS.
sudo wget -O /etc/apt/sources.list.d/bigtop.list`lsb_release --codename --short`/bigtop.list

- Resynchronize the package index files from their sources
sudo apt-get update 

- Run the below command to get the list of packages in the Bigtop repository.
grep Package /var/lib/apt/lists/bigtop.s3.amazonaws.com_releases_0.7.0_ubuntu_precise_x86%5f64_dists_bigtop_contrib_binary-i386_Packages
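The `grep` above dumps the raw `Package:` lines; a slightly longer pipeline turns the index into a clean, sorted package list. The index snippet in the here-document below is a made-up sample for illustration; on the VM, point the pipeline at the real file under `/var/lib/apt/lists/` instead.

```shell
# Extract just the package names from an apt Packages index and sort them.
# The sample lines (names and versions) below are made up for illustration.
cat <<'EOF' | grep '^Package:' | awk '{print $2}' | sort
Package: hadoop-hdfs-namenode
Version: 2.0.6-1
Package: bigtop-utils
Version: 0.7.0-1
Package: hue
Version: 2.5.1-1
EOF
```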

- Install bigtop-utils. After installing the different frameworks, check the log files in the `/var/log/` folder at any point for exceptions or errors.
sudo apt-get install bigtop-utils
- During the bigtop-utils installation, enter the password for the MySQL db and select the appropriate Postfix configuration.
Screen 1

Screen 2

- Install the hadoop and hue packages, along with any other packages of interest. Hive also gets installed automatically, because of the way the dependencies have been specified.
sudo apt-get install hadoop\* hue
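The backslash in `hadoop\*` is worth a note: it stops the shell from expanding the glob against filenames in the current directory, so apt-get receives the literal pattern `hadoop*` and matches it against package names itself. Quoting achieves the same thing:

```shell
# Both forms hand the literal pattern to the command instead of letting the
# shell expand it against local filenames.
echo hadoop\*
echo "hadoop*"
```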

- Modify `/etc/hue/conf/hue.ini` file using sudo.
# Use WebHdfs/HttpFs as the communication mechanism.
# This should be the web service root URL, such as
# http://namenode:50070/webhdfs/v1

# Defaults to $HADOOP_MR1_HOME or /usr/lib/hadoop-0.20-mapreduce
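The settings themselves were stripped from the snippet above; judging from the surviving comment text, the two entries being edited are most likely `webhdfs_url` (so the file browser can reach HDFS) and `hadoop_mapred_home`. The fragment below is a reconstruction based on the stock Hue 2.x `hue.ini` layout, not the original post's exact content, and the values assume a pseudo-distributed, single-host setup:

```ini
[hadoop]
  [[hdfs_clusters]]
    [[[default]]]
      # Use WebHdfs/HttpFs as the communication mechanism.
      # This should be the web service root URL, such as
      # http://namenode:50070/webhdfs/v1
      webhdfs_url=http://localhost:50070/webhdfs/v1

  [[mapred_clusters]]
    [[[default]]]
      # Defaults to $HADOOP_MR1_HOME or /usr/lib/hadoop-0.20-mapreduce
      hadoop_mapred_home=/usr/lib/hadoop-mapreduce
```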

- Add the below properties to the `/etc/hadoop/conf/hdfs-site.xml` file using sudo.



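The properties themselves did not survive in the text above. For Hue to reach HDFS over WebHDFS (which the `hue.ini` step points at), the property this step typically adds is `dfs.webhdfs.enabled`; treat the fragment below as an assumed reconstruction rather than the post's exact content:

```xml
<!-- Assumed reconstruction: enable WebHDFS so that Hue can talk to HDFS.
     The original post's exact properties were lost from the page. -->
<property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
</property>
```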
- Get the IP address of the guest OS using the ifconfig command and modify `/etc/hosts` to map the hostname to that IP.
bigdatavm@bigdatavm:~$ cat /etc/hosts
    localhost
#    bigdatavm    bigdatavm

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
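The IPv4 address can be pulled out of the ifconfig output with a one-line sed expression. The `inet addr:` line below is a made-up sample of that output; on the VM, replace the echo with the real `ifconfig eth0`:

```shell
# Extract the IPv4 address from an ifconfig "inet addr:" line.
# The sample line (including the address) is made up for illustration.
sample='          inet addr:192.168.56.101  Bcast:192.168.56.255  Mask:255.255.255.0'
echo "$sample" | sed -n 's/.*inet addr:\([0-9.]*\).*/\1/p'
```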
- In the `/etc/default/bigtop-utils` set
export JAVA_HOME=/usr/lib/jvm/java-6-oracle/
- Format the NameNode
sudo /etc/init.d/hadoop-hdfs-namenode init

- For pseudo-distributed mode, start the HDFS services as below.
for i in hadoop-hdfs-namenode hadoop-hdfs-datanode ; do sudo service $i start ; done

- From the HDFS web console (http://localhost:50070/dfshealth.jsp) make sure that the number of live nodes is 1.

Screen 3
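The live-node count can also be checked from the shell with `sudo -u hdfs hdfs dfsadmin -report`. The report line below is a made-up sample of that command's output, just to show how the count can be grepped for:

```shell
# Check that the (sample) dfsadmin report shows exactly one live datanode.
# In a real session, pipe `sudo -u hdfs hdfs dfsadmin -report` in instead.
report='Datanodes available: 1 (1 total, 0 dead)'
echo "$report" | grep -c 'Datanodes available: 1'
```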

- Make sure to create a sub-directory structure in HDFS before running any daemons.
sudo /usr/lib/hadoop/libexec/

- Restart the OS or start the individual services, so that the changes to the above configuration files come into effect.

- Ubuntu by default runs at runlevel 2. Note that the Resource Manager (S20*) and the Node Manager (S20*) have the same priority at startup.

Screen 4
- If the NM starts before the RM, the NM crashes because of a bug with the below exception (see `/var/log/hadoop-yarn/yarn-yarn-nodemanager-bigdatavm.log`) and has to be started again.
Caused by: Call From bigdatavm/... to ... failed on connection exception: Connection refused; For more details see: ...
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(...)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(...)
        at java.lang.reflect.Constructor.newInstance(...)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(...)
        ... 8 more

Caused by: Connection refused
        at ...(Native Method)
        at org.apache.hadoop.ipc.Client$Connection.setupConnection(...)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(...)
        at org.apache.hadoop.ipc.Client$Connection.access$2200(...)
        at org.apache.hadoop.ipc.Client.getConnection(...)
        ... 9 more
- Start the node manager manually.
sudo service hadoop-yarn-nodemanager start

- Check the log files (in `/var/log`) for any errors, and also check the following consoles:
Hue UI - http://localhost:8888
RM UI - http://localhost:8088
HDFS UI - http://localhost:50070 

- Run a MR job which comes with the default installation.
sudo -u hdfs hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen 1000 terasort-input

Screen 5

Hue provides a web console for the different Big Data components. In an upcoming blog, we will look into how to create Oozie workflows using Hue and various other components.

Screen 6



  1. Excellent tutorial! Will refer to it when people want to use Hue with Bigtop!
    (noticed that the 'Modify `/etc/hue/conf/hue.ini` file using sudo.' step should not be needed, as the default is localhost even if the ini does not show it)

    1. Thanks for the feedback. I don't remember, but I think webhdfs_url was commented out in the hue.ini file.

  2. Have you gotten Hue to work properly outside of the Cloudera environment by chance?

    1. I figured it out for some functions, but Impala, Solr and RDBMS are not running so far.

  3. Is it possible to run MapReduce jobs in Oozie with hadoop-2.2.0 without Bigtop? I tried it but it does not work for me.