Monday, February 11, 2013

How to find out which Apache projects are active?

Before adopting any software, whether open source or commercial, besides the features it supports it's equally important to know a few other things, such as

- how soon bug fixes/enhancements will be provided
- how good the documentation is
- how active the community is (applies to both proprietary and open source)
- when a particular version will reach EOL
- whether commercial support is available


It becomes tricky when adopting an open source project. There are tons of nice projects at Apache, GitHub, Google Code and CodePlex from Microsoft. Day by day, more and more companies are open sourcing their internal frameworks through cloud-based version control systems.

Twitter has been leading this effort; more information can be found on this Twitter blog. As mentioned in this earlier blog entry, while Google has been leading the Big Data space with its papers, Twitter has been more aggressive about open sourcing the internal frameworks it uses.

Coming back to the Apache projects: they initially start as incubator projects, and once a project gathers a big enough community of developers to build new features and fix bugs, it graduates to a TLP (Top Level Project). Even some Apache TLPs don't see any updates, either because the community has lost interest or because better alternatives have emerged.

Here are the sources to check when the Apache projects (both TLP and incubator) were last updated. As seen from these sources, some of the projects have not changed in the last couple of years, while others are being updated very actively.

http://svn.apache.org/viewvc/?sortby=date
http://svn.apache.org/viewvc/incubator/?sortby=date
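The last commit on a project's trunk can also be checked directly from the command line with the svn client, instead of browsing ViewVC. The below is just a sketch; the repository path is only an example (Hadoop) and has to be replaced with the path of the project of interest.

svn log --limit 1 http://svn.apache.org/repos/asf/hadoop/common/trunk

The timestamp on the latest revision gives a rough idea of how recently the code base has been touched.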

The above is just one of many signals of how active a particular Apache project is. Also check the Apache mailing lists to see how active the community has been in providing support, and make sure to check whether any companies provide commercial support for the Apache products of interest. In the Big Data space, companies like Cloudera and Hortonworks provide support for a wide range of frameworks, while companies like DataStax provide support for a single framework, Cassandra.

To quickly summarize, enough care and caution should be taken before adopting a framework or piece of software into a project.

Saturday, February 9, 2013

Hadoop in a box

As a technology geek, sometimes I am not sure why I do something :) This experiment falls under the same category. I wanted to set up a Hadoop cluster on my notebook, an HP 430 with a Core i5 processor and 4 GB of RAM.

I chose to use Cloudera Manager to install the cluster, as it automates most of the installation and configuration required for a Hadoop cluster. Below is what the configuration looks like.
The laptop runs Ubuntu 12.04 Desktop (the host OS), which is the OS I use most of the time. On top of it, Oracle VirtualBox has been installed to enable running one OS (guest) on top of another (host). Within VirtualBox, two instances of Ubuntu 12.04 Server have been installed (the guest OSes); there is no need for a full-fledged desktop on the nodes, and Desktop versions of the OS only make the whole thing slower. One of the guests has been configured as a master/slave and the other as a slave.
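As a side note, the guest VMs can also be created from the command line with VBoxManage instead of the VirtualBox GUI. The below is only a rough sketch under assumed settings (VM name, 1 GB memory, 20 GB disk, bridged networking on eth0); the values have to be adjusted for the host, and the Ubuntu Server ISO still has to be attached and installed the usual way.

# assumed name, memory, disk size and network adapter; adjust for the host
vboxmanage createvm --name slave1 --ostype Ubuntu_64 --register
vboxmanage modifyvm slave1 --memory 1024 --nic1 bridged --bridgeadapter1 eth0
vboxmanage createhd --filename slave1.vdi --size 20480
vboxmanage storagectl slave1 --name SATA --add sata
vboxmanage storageattach slave1 --storagectl SATA --port 0 --device 0 --type hdd --medium slave1.vdi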

a) Below are the commands to start/stop Cloudera Manager and the related services.
sudo /etc/init.d/postgresql start
sudo /etc/init.d/cloudera-scm-server-db start
sudo /etc/init.d/cloudera-scm-server start

sudo /etc/init.d/cloudera-scm-server stop
sudo /etc/init.d/cloudera-scm-server-db stop
sudo /etc/init.d/postgresql stop
Here is a screenshot of the same.
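Once the server has been started, it can take a minute or two to come up; a quick sanity check is to see that the Cloudera Manager web port is listening, for example:

sudo netstat -tlnp | grep 7180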

 

b) Here is the command to start the two slave VMs.
vboxmanage startvm slave1 slave2
Here is a screenshot of the same.


Now we have two Ubuntu 12.04 Server guests running on top of the Ubuntu 12.04 Desktop host.
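Since the guests are Ubuntu Server instances without a GUI, they can also be started headless so that VirtualBox does not open a console window for each VM:

vboxmanage startvm slave1 --type headless
vboxmanage startvm slave2 --type headless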


c) The Cloudera Manager console is available at localhost:7180; the default username/password is admin/admin.


d) If everything has been set up properly, the agents on the VMs should send heartbeats to Cloudera Manager and the nodes should be reported as good in the screen below.


e) Similarly, all the services should be in a good, but stopped, state.


f) The services can be started from the same UI, and their status should change to Started as shown below.


g) Now it's time to run some basic commands. Log in to one of the slaves and execute a few HDFS commands, as shown below.
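For reference, a first session usually looks something like the below. This is only a sketch; the directory and file names are placeholders, and depending on how the permissions are set up the commands might have to be run as the hdfs user.

hadoop fs -mkdir /user/test
hadoop fs -put /etc/hosts /user/test
hadoop fs -ls /user/test
hadoop fs -cat /user/test/hosts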


h) CDH installations also provide Hue, a web interface to interact with the Hadoop cluster, as shown below.



As mentioned earlier, this is not a full-blown cluster for conducting performance tests or a proof of concept, but a basic one to get familiar with the installation and usage of a Hadoop cluster.

I will follow up with another blog on how to get started with the installation of a Cloudera CDH cluster.

Friday, February 8, 2013

Caching Proxy - Installation and Configuration

Setting up a Hadoop cluster is fairly easy with a bit of familiarity with system and network administration. It's all interesting; the only frustrating part is downloading the patches after installing the OS and downloading the packages for the software on top of the OS. The downloads can add up to close to a GB, which might take anywhere from a couple of minutes to hours depending on the internet bandwidth.

Here is where caching tools really help. They cache the downloaded packages on a designated local machine (let's call it the cache server), and the other machines point to the cache server to get the packages. This way the packages are downloaded from the internet only the first time, and from then on the local cache server is used. This approach not only saves network bandwidth but also makes the whole installation process faster.
For Debian-based systems, apt-cacher-ng is designed to cache the packages and is really easy to install and configure. Here are the steps involved:

a) On the cache machine, install apt-cacher-ng using the below command. Root privileges are required to run it.
sudo apt-get install apt-cacher-ng
b) All the machines in the local network have to be pointed at the cache server using the below command, where `cacheserver` has to be replaced with the appropriate hostname/IP of the cache machine.
echo 'Acquire::http { Proxy "http://cacheserver:3142"; };' | sudo tee /etc/apt/apt.conf.d/01apt-cacher-ng-proxy
It's as easy as the above two steps to set up a cache server for a Debian-based system.
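To verify that the cache server is reachable, apt-cacher-ng exposes a small report page on its port; a quick check from any of the client machines could look like the below (assuming curl is installed), followed by a normal apt-get update which should now go through the proxy.

curl -s http://cacheserver:3142/acng-report.html | head
sudo apt-get update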

For RPM-based systems it's a bit more complicated: Squid should be installed on either a Debian- or an RPM-based system, and the other systems will fetch the packages through Squid. Below are the instructions for installing Squid on a Debian-based system.

a) On the cache machine, install Squid using the below command (for a Debian-based system). Again, root privileges are required to run it.
sudo apt-get install squid3
b) Uncomment the below line in the /etc/squid3/squid.conf file; by default Squid caches only in memory. With the following setting, all the packages will be stored in the /var/spool/squid3 directory.
cache_dir ufs /var/spool/squid3 100 16 256
Then find the below line
http_access deny all
and add the below lines above it, so that access to the Squid server is allowed from the other machines. Based on the network settings/configuration, the IP addresses have to be chosen appropriately.
acl allcomputers src 192.168.1.0/255.255.255.0
http_access allow allcomputers
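After editing squid.conf, Squid has to be restarted for the changes to take effect, and the access log can be tailed to confirm that the clients are actually going through the cache (the paths below are for the Ubuntu squid3 package):

sudo service squid3 restart
sudo tail -f /var/log/squid3/access.log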
c) Add the below to .bashrc for the proxy to take effect for all applications.
export http_proxy=http://cacheserver:3128
export ftp_proxy=http://cacheserver:3128
or add the below to /etc/yum.conf for the proxy to work only with yum, which is used for installing packages on RPM-based systems. Here is the documentation for the same.
# The proxy server - proxy server:port number
proxy=http://mycache.mydomain.com:3128
# The account details for yum connections
proxy_username=yum-user
proxy_password=qwerty 
Make sure the firewall has been disabled or the appropriate port has been opened on the cache server: 3142 for apt-cacher-ng and 3128 for Squid. gufw is a graphical front-end to the firewall (ufw/iptables) in Ubuntu.
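If the firewall is to be kept enabled, opening just the required ports with ufw (the command-line counterpart of gufw) would look something like the below:

sudo ufw allow 3142/tcp   # apt-cacher-ng
sudo ufw allow 3128/tcp   # squid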

For the past couple of days, I have been working on setting up a two-node Cloudera CDH cluster on a laptop using virtualization. It's not super fast, but decent enough to try some basic commands when access to a cluster is not there. I will follow up with a blog on the process for the same.

The more time I spend with Ubuntu/Linux, the more I like it. It is very customizable and can be easily tweaked for performance.

Happy Hadooping !!!

Sunday, February 3, 2013

MapReduce Patterns

Developing algorithms to process data on top of distributed computing is difficult because of its inherent distributed nature. So there are models like PRAM, MapReduce, BSP (Bulk Synchronous Parallel), MPI (Message Passing Interface) and others. Apache Hadoop, Apache Hama, Open MPI and other frameworks/software make it easier to develop algorithms on top of the aforementioned distributed computing models.


Coming back to our favorite topic, MapReduce: this model is not something new. Map and Reduce have long been known as Map and Fold, which are higher-order functions used in functional programming languages like Lisp and ML. The difference between a regular function and a higher-order function is that a higher-order function can take other functions as input and/or return a function as output, while a regular function doesn't.

As mentioned earlier, Map and Fold have been around for ages, and you might have heard about them if you studied Computer Science or used any of the functional programming languages. But Google, with its paper on MapReduce, made functional programming and higher-order functions popular.

Also, along the same lines, BSP has been around for quite some time, but Google made the BSP model popular with its Large-scale graph computing at Google paper.

Some models suit certain types of algorithms well, while others don't. For instance, iterative algorithms fit better with the BSP model than with the MR model, but Mahout still chose the MR model over the BSP model because Apache Hadoop (an MR implementation) is much more mature and has a huge ecosystem compared to Apache Hama (a BSP implementation).

Each of the models discussed requires a different way of thinking to solve a particular problem. Here are some resources containing the pseudo code/logic for implementing some of the algorithms on top of MapReduce:

- Hadoop - The Definitive Guide (Join in Page 283)

- MapReduce Patterns

- Data Intensive Text Processing with MapReduce

I will follow up with another blog on patterns around the BSP model.