Friday, January 31, 2014

Bloom Filters in HBase and Chrome

Bloom filters allow you to efficiently check whether a particular element/record is present in a set/table. They have minimal impact on insert operations. The only caveat is that they might return a false positive: a Bloom filter might say that a particular element/record is in the set/table even when it is not. Bloom filters have been implemented in HBase and are enabled by default.

It's interesting to know that the Google Chrome browser used to use a Bloom filter, which was later replaced with an alternate approach. According to Wikipedia (now outdated):

The Google Chrome web browser uses a Bloom filter to identify malicious URLs. Any URL is first checked against a local Bloom filter and only upon a hit a full check of the URL is performed.

HBase implements Bloom filters on the server side, while Chrome implemented the Bloom filter on the client/browser side. So, the Bloom filter data is up to date in HBase, but might be a bit outdated in the Chrome browser. There are some variations of Bloom filters, but the basic concept is very simple and beautiful.
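To make the concept concrete, below is a minimal sketch of a Bloom filter in Python. This is just the basic idea, not how HBase implements it; the salted md5 calls stand in for multiple independent hash functions, and the sizes are arbitrary.

import hashlib

class BloomFilter(object):
    def __init__(self, size=1000, hash_count=3):
        self.size = size
        self.hash_count = hash_count
        self.bits = [False] * size

    def _positions(self, item):
        # Simulate hash_count independent hash functions by salting md5
        for i in range(self.hash_count):
            digest = hashlib.md5(str(i) + item).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        # Inserts are cheap - just set a few bits
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means definitely absent; True means probably present
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("row-key-1")
print bf.might_contain("row-key-1")  # True
print bf.might_contain("row-key-2")  # False (or, rarely, a false positive)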
Blooming Georgia St. from Flickr by Tiberiu Ana under CC

Thursday, January 30, 2014

Hadoop in a box (revisit)

It's possible to set up a small Hadoop cluster on a single machine using virtualization. I blogged about it here a year back without the nitty-gritty details, but Cloudera published a blog here with much more detail.
Here is an alternate way of setting up a bigger cluster on a single machine using Linux containers. This solution works only with Linux as the host OS, but places less of a burden on the host, while the approach Cloudera describes works on multiple host OSes. I haven't given the Linux containers option a try, but it seems very interesting.

The same concept can be extended to non-Hadoop frameworks also.

`Technology Radar` 2014 from ThoughtWorks

ThoughtWorks has released the `Technology Radar` for 2014. They have categorized techniques / tools / platforms / languages / frameworks into different rings (adopt / trial / assess / hold). The rings summarize how ready the different aspects of technology are for adoption. The paper is a good way to pick up some of the hip terms around technology.
Reggae - Rare Music Vided by raremusicvideo1 from Flickr under CC

Resources to get started with Machine Learning

With all the action happening around Machine Learning and Artificial Intelligence (1, 2, 3, 4 etc), it looks like this year we will be seeing more and more interesting things happening around Big Data. One of the areas where we can expect a lot of work in the near future is usability. Some of the companies, like BigML, are already working in this space.

ML has been pushed along by the availability of infrastructure at lower costs over the last few years. Amazon has cut prices 40 times since the launch of AWS. Facebook is using Blu-ray disks to bring down the costs of storing cold data.

As I continue my journey through Machine Learning and statistical processes, I will keep the Machine Learning page (also available from the top tabs) updated with some of the best resources available (free and commercial). As of now, I have seeded it with some videos, books, blogs, data sets and use cases. It's nice that sessions from top universities are being provided for free. Some of these sessions overlap, but they provide different perspectives on the various aspects of Machine Learning.

With so much information available, it's difficult to focus and to differentiate the best from the rest. The ML page is not a comprehensive/perfect/exhaustive list, but it is something to begin with, and I will keep it updated as I come across more interesting and useful resources.

If you have any interesting/useful resources to be included on the above mentioned page, send them across to praveensripati@gmail.com.
Machine Learning by Erik Charlton on Flickr under CC

Wednesday, January 29, 2014

Interview with John Chambers (creator of S)

Here is an interview with John Chambers (creator of S, on the left) about the history behind R & S. S is considered the predecessor of R, and he is also a core member of the R team. The same is also summarized in a pdf here.

He is very passionate about what he has been working on, and even remembers the exact dates of some events which happened some 30 years back.

Tuesday, January 28, 2014

Man behind DeepMind

Some interesting (inspiring is the more appropriate word) facts about Demis Hassabis, the man behind DeepMind Technologies. DeepMind is the AI company recently acquired by Google for a cool $400m.

Hats off.

Monday, January 27, 2014

Subscribe by email to thecloudavenue.com

One way to stay updated with Big Data, or any other technology, is to closely follow what others have been doing and saying through blogs (both company and individual). Naturally, a company's blog usually puts a positive spin on its services and products, but some of the individual blogs are unbiased about what they write. With so much happening around Big Data, it's not possible to follow the huge number of blogs one by one. One way to get around this is to use an RSS aggregator like Feedly. In spite of RSS aggregators being easy to use and free, their usage has been very limited. Google decided in mid-2013 to discontinue Google Reader, and this gave an opportunity for alternate RSS aggregators to spring up.

For those who have been following this blog, you can now subscribe to it by providing your email address on the top right. This way you can be sure that you won't be missing some of the latest happenings around Big Data.

There are claims that email is dead with the advent of social media, but most of us still use it. I am of the strong belief that both email and social media have their own space. The same goes for blogging: people claim that blogging is dead with the advent of social media, too.
Email email email by RambergMediaImages on Flickr under CC

Sunday, January 26, 2014

Statistical Learning from Stanford Online

Stanford is offering a MOOC on Statistical Learning. It's free; more details here. The course started a couple of days back, but you can still watch the archives once registered for the course. The Coursera ML course is really good, but it sometimes deep dives into the maths, which makes it not for everyone. The good thing about the Stanford offering is

This is not a math-heavy class, so we try and describe the methods without heavy reliance on formulas and complex mathematics.

On the other side, the focus is on R for statistical analysis. Those who are into Python can still go through the course: DataRobot is planning follow-on blog posts covering the techniques discussed during the Stanford sessions, using Python. DataRobot is still in stealth mode, so you can follow their blog here using Feedly or some other RSS aggregator.

Note that the corresponding book for the course, `An Introduction to Statistical Learning with Applications in R`, can be downloaded here for free.
Free sign by Alan O'Rourke on Flickr under CC

Friday, January 24, 2014

Resilient Distributed Datasets (RDD) for the impatient

The Big Data revolution was started by Google's paper on MapReduce (MR). But the MR model mainly suits batch-oriented processing of data, and some of the other models are being shoehorned into it because of the prevalence of Hadoop and the attention/support it gets. With the introduction of YARN in Hadoop, other models besides MR can be first-class citizens in the Hadoop space.

Let's take the case of MR as shown below: there is a lot of reading and writing happening to the disk after each MR transformation, which makes it too slow and less suitable for iterative processing, as in the case of Machine Learning.
Let's see how we can solve the iterative problem with Apache Spark. Spark is built using Scala around the concept of Resilient Distributed Datasets (RDD) and provides actions / transformations on top of RDDs. It has some of the best documentation among open source projects. There are not many resources around RDDs, but this paper and presentation are the roots of RDD. Check this to know more about RDDs from a Spark perspective.

Let's look at a very high level at what RDDs are and use this as a foundation to build upon Spark and other related frameworks in upcoming blog articles. According to the earlier mentioned paper:

Formally, an RDD is a read-only, partitioned collection of records. RDDs can only be created through deterministic operations on either (1) data in stable storage or (2) other RDDs.

This is the holy grail of what an RDD is. RDDs are an 'immutable, resilient, distributed collection of records' which can be stored in volatile memory or in persistent storage (HDFS, HBase etc) and can be converted into another RDD through transformations. An action like count can also be applied on an RDD.
As observed in the above flow, the data flows from one iteration to another through memory and doesn't touch the disk (except for RDD2). When there is not enough memory for the data to fit, it can either be spilled to disk or simply left to be recreated upon request.

Because of the distributed nature of Big Data processing, there is a good chance that a couple of nodes might go down at any point of time. Note that in the above flow, RDD2 is persisted to disk because of checkpointing. So, for any failure during the transformations t3 or t4, the entire work flow need not be played back, because RDD2 is persisted to disk; it is enough to play back transformations t3 and t4.

Also, an RDD can be cached in memory for frequently accessed data. Let's say different queries are run on the same set of data again and again; this particular data can be kept in memory for better execution times.
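The following is a minimal sketch of these ideas using Spark's Python API (PySpark). The data and transformations are made up for illustration; a real job would typically create RDDs from HDFS, HBase etc.

from pyspark import SparkContext

sc = SparkContext("local", "rdd-sketch")

# Create an RDD from an in-memory collection
numbers = sc.parallelize(range(1, 101))

# Transformations are lazy - they only build up the lineage
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Cache the RDD in memory, since more than one action is run on it
evens.cache()

# Actions trigger the actual computation
print evens.count()
print evens.take(5)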

To summarize, the MR model is less suited for iterative processing than the RDD model. Performance metrics around iterative and other processing are covered in detail in this paper around RDD.

Safari Online All-Access Subscription

Gamler's Arrow by Seema Krishnakumar on Flickr
Big Data is like a moving target and a month doesn't go by without a new book / framework / company / VC funding etc around Big Data. So, I finally jumped into the Safari Online Subscription (Individual) to get myself up to speed.

I was under the impression they would be offering only O'Reilly books, but there are books from a lot of other publishers also, as shown here. They have a huge collection of books and videos which can be searched easily by topic of interest. One of the videos I marked is Hilary Mason: Advanced Machine Learning.
There are multiple subscriptions with Safari Online, as shown below. The main advantage of the `ALL-ACCESS` subscription is that it gives access to Rough Cuts, which are work-in-progress books that have not yet been published. So, I am able to access Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2 and other books around some of the latest technologies.
My only gripe was the lack of offline support, which would mean being connected all the time to read the books; though it looks like limited offline support is provided, more details here. Also, multiple books come up when searching for a particular topic, and there is no user rating to decide which book to go with.

I would definitely recommend this subscription for those who are planning to get into Big Data or are already deep into it.

Monday, January 20, 2014

Review of `Building Machine Learning Systems with Python` Book

For some time I have wanted to get started with Machine Learning (ML) and was looking for some good resources (books, videos, articles etc). Some of them deep dive into the ML topics (with too much stats and maths) without any practical aspects, and some of them have only minimal practical examples. But I wanted to get started with the practical aspects of ML quickly, without deep diving into the ML concepts, which I plan to pursue later.

It's like this: I don't want to know in detail how a car works, but I would still like to drive one. Although knowing how a car works really helps when there is a breakdown in the middle of nowhere. Applying the same analogy, there are a lot of frameworks in different languages which implement the ML algorithms, and it's a matter of knowing which ML algorithm to use for a given problem and calling the appropriate framework API. Later, at some point, we might be interested in the nitty-gritty details of how a particular ML algorithm works or is implemented, to fine tune it, but it is good to get some quick results without getting into the minute details.
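As a minimal sketch of this 'call the framework API' style, here is how little code it takes with scikit-learn (one of the popular Python ML frameworks, assuming it is installed); the dataset and algorithm are chosen purely for illustration:

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Load a small built-in sample dataset
iris = load_iris()

# Train a k-nearest-neighbours classifier - using it requires no
# knowledge of how the algorithm is implemented internally
model = KNeighborsClassifier(n_neighbors=3)
model.fit(iris.data, iris.target)

# Predict the class of the first flower in the dataset
print model.predict(iris.data[:1])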

`Building Machine Learning Systems with Python` maintains a perfect balance between the theoretical and practical aspects of ML. It goes with the assumption that the reader is familiar (not an expert) with Python and ML. For getting started/familiar with ML, I would recommend `The Shape Of Data`, and for Python I would recommend 1 and 2. There are a lot of editors for Python, but I have been using the Eclipse PyDev plugin. For those who are familiar with the Eclipse environment, it is very easy to get started with PyDev.

Just to add, there are a lot of ML frameworks for Python (1, 2, 3, 4 etc). But, as of this writing, I couldn't find any Python framework which implements ML algorithms in a distributed fashion. I have posted a query on SO here and am waiting for a response. I am not exactly sure if Python is a good option for distributed ML, but some of the options for distributed ML around Java are Apache Mahout and the recent Oryx from Cloudera. As the size of the data sets grows, it makes sense to have some nice frameworks implementing ML algorithms in a distributed fashion using Python. Here is an article on how to use Mahout with Python, though both wrappers and native interfaces have their own space. (scikit-learn support is being added to Hadoop, more details here.)

As mentioned in the `O'Reilly Data Science Salary` survey, there is a close tie between Python and R around Data Science. Here is an interesting comparison from Revolution Analytics between the usage of Python and R. Revolution Analytics, like any other company, will naturally speak well of their own products, but they have got some metrics on the same.

I am going through the above mentioned book and will be writing a detailed review of it. I am really excited to have the book and couldn't stop writing about it. I will also be blogging about my experiences with Python and ML in particular. So, keep following.
6 24 09 Bearman Cartoon Artificial Intelligence copy by Bearman2007 on Flickr

Thursday, January 16, 2014

2013 Data Science Salary Survey from O'Reilly

The `2013 Data Science Salary Survey` just landed in my mailbox. The executive summary on `Page 8` is really interesting. I am not sure how big the sampled data is, but the report shows some interesting correlations. SQL/RDB tops the list of data analytics tools, followed by R and Python, which are very close.
Money by Philip Taylor PT on Flickr

Wednesday, January 15, 2014

Map and Reduce in Python without Hadoop

MapReduce is not a new programming model, but Google's paper on MapReduce made it popular. A map is usually used for transformation, while reduce/fold is used for aggregation. They are built-in primitives in functional programming languages like Lisp and ML. More about the functional programming roots of the MapReduce paradigm can be found in Section 2.1 of the Data-Intensive Text Processing with MapReduce book.
Below is a simple Python 2 program using the map/reduce functions. map/reduce are functions in the __builtin__ Python module. More about functional programming in Python here. For those using Python 3, the reduce function has been removed from the built-ins. According to the Python 3.0 release notes:

Removed reduce(). Use functools.reduce() if you really need it; however, 99 percent of the time an explicit for loop is more readable.

The Python 2 program below `squares/transforms` the numbers 0 to 99 using `map/square` and then `sums/aggregates` them using the `reduce/add` function. Note that Hadoop, which provides a run time environment for executing MapReduce programs, also does something similar, but in a distributed fashion to process huge amounts of data.
#!/usr/bin/python

# map/transform: square a single number
def square(x):
    return x * x

# reduce/aggregate: add two numbers
def add(x, y):
    return x + y

def main():
    # Square the numbers 0 to 99, then sum up the squares
    print reduce(add, map(square, range(100)))

if __name__ == "__main__":
    main()
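
For those on Python 3, a sketch of the same program would look like the following, with reduce imported from functools and print used as a function:

#!/usr/bin/python3

from functools import reduce

def square(x):
    return x * x

def add(x, y):
    return x + y

def main():
    # reduce now lives in functools; print is a function in Python 3
    print(reduce(add, map(square, range(100))))

if __name__ == "__main__":
    main()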

Tuesday, January 7, 2014

Oozie hangs on a single node for a work flow with a fork

Apache Oozie is a work flow engine which allows a DAG of tasks to be run. In earlier blog entries we looked at installing/configuring Oozie, creating/executing simple work flows, and finally creating a coordinator/scheduler using Oozie. Azkaban from LinkedIn is similar to Apache Oozie.
For each action (like Hive, Pig, Sqoop etc), Oozie launches an MR launcher which in turn launches the actual action. Here is a discussion from the Apache forums on the rationale behind this.

In the above work flow, after the fork (in green), the Pig and Sqoop actions (in yellow) are launched by two separate MR launchers in parallel. So, at this point of time, four Map slots are required (two for the MR launchers, one for Pig and one for Sqoop). But, by default, only two Map slots are available on each node; that is, only two Map tasks will run at any point of time on a node. This is specified by the `mapred.tasktracker.map.tasks.maximum` property, which defaults to 2, in the mapred-site.xml file.

The two available slots are occupied by the Oozie MR launchers, and there are no more Map slots available to run the Map tasks for the Pig and Sqoop actions. In this case the Job Tracker keeps waiting for Map slots to become available, which will never happen.

The above mentioned problem happens only in the case of Hadoop running in pseudo-distributed mode (i.e., on a single node/machine) with the default configuration, which is usually the case in development. One solution is to increase the number of nodes and configure the Hadoop cluster in fully distributed mode. Another solution is to bump up the `mapred.tasktracker.map.tasks.maximum` property in mapred-site.xml as below.
<property>
     <name>mapred.tasktracker.map.tasks.maximum</name>
     <value>4</value>
</property> 

Monday, January 6, 2014

Different ways of getting started with Big Data

For those who are getting started with Big Data, there are a multitude of options, each with their own pros and cons. Without going into detail about each of them, below is a mind map with some of the options. The appropriate one can be picked based on the budget allocated to the initiative, time-to-market, core expertise of the company, industry regulations and various other factors.
For example, a startup planning a Big Data initiative is more likely to go with a cloud based service, to avoid the initial CAPEX and for the sake of instantaneous scalability. Similarly, a bank is more likely to go with an on-premise commercial distribution (like CDH from Cloudera or HDP from Hortonworks), so that it can concentrate on its line of business rather than on the issues that arise with a Big Data environment, and also because of industry regulations.
{pick*me*2} by { lillith } from Flickr under CC
For those who are following the Big Data space, a week never goes by without the mention of a new startup / framework / funding. It will be some time before we see consolidation, with only the strong players left.


Thursday, January 2, 2014

Getting started with Big Data - Part 3 - Installation and configuration of Apache Bigtop

Image from http://www.breconbeacons.org/
In earlier blog entries we looked at how to install VirtualBox and then how to install Ubuntu on top of VirtualBox. In this final part of the series, we will look at how to install Bigtop on top of the Ubuntu guest OS. From the Apache Bigtop site:

The primary goal of Bigtop is to build a community around the packaging and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc...) developed by a community with a focus on the system as a whole, rather than individual projects.

Open source software/frameworks work well individually, but it takes some effort/time to integrate them; the main challenge is the interoperability issues between the different frameworks. This is where companies like Cloudera, Hortonworks, MapR and others come into play. They take the different frameworks from the Apache Software Foundation and make sure they play nice with each other. Not only do they address the interoperability issues, but they also make performance/usability enhancements.

Apache Bigtop takes this effort from an individual company level to a community level. Bigtop can be compared to Fedora, while Cloudera (CDH) / Hortonworks (HDP) / MapR (M3/M5/M7) can be compared to RHEL. Red Hat provides commercial support for RHEL, while Cloudera / Hortonworks / MapR provide commercial support for their own distributions of Hadoop. Also, Fedora has some of the leading edge software and a greater variety of it, and the same is the case with Bigtop. A lot of Apache frameworks (like Mahout / Hama) are included in Bigtop, but not in the commercial distributions like CDH/HDP/M*.

For those who want to deep dive into Big Data, Bigtop makes sense as it includes a lot of additional Big Data frameworks. Also, there are not many restrictions on its usage. More on What is Bigtop, and Why Should You Care?

Here is the official documentation on installing Bigtop. But the documentation is a bit outdated and has some steps missing. Here are the steps in detail.

- Install Java as mentioned here. Make sure that Oracle JDK 6 is installed and not JDK 7, because Bigtop has been tested with JDK 6.