Monday, January 20, 2014

Review of `Building Machine Learning Systems with Python` Book

For some time I have wanted to get started with Machine Learning (ML) and have been looking for good resources (books, videos, articles, etc.). Some of them dive deep into ML topics (with a lot of statistics and math) without covering any practical aspects, while others come with only a minimum of practical examples. I wanted to get started with the practical side of ML quickly, without diving deep into the ML concepts, which I plan to pursue later.

It's like not wanting to know in detail how a car works, but still wanting to drive one. Knowing how a car works does really help, though, when there is a breakdown in the middle of nowhere. Applying the same analogy, there are a lot of frameworks in different languages that implement the ML algorithms, and it's a matter of knowing which ML algorithm to use for a given problem and calling the appropriate framework API. At some later point we might be interested in the nitty-gritty details of how a particular ML algorithm works or is implemented in order to fine-tune it, but it is good to get some quick results without getting into the minute details.
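To make the "drive the car without opening the hood" idea concrete, here is a minimal sketch of what calling a framework API can look like. It is not taken from the book; the choice of scikit-learn, its bundled iris data set, the k-nearest-neighbours classifier, and the helper name `train_and_score` are all my own assumptions for illustration.

```python
# Illustrative sketch only: the library (scikit-learn), the data set (iris)
# and the algorithm (k-nearest neighbours) are assumptions for this example.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def train_and_score(n_neighbors=3, seed=0):
    """Fit a classifier on part of the iris data and score it on the rest."""
    iris = load_iris()
    # Hold back a quarter of the samples so the score reflects unseen data.
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.25, random_state=seed
    )
    # Picking the algorithm is the real decision; the framework does the rest.
    model = KNeighborsClassifier(n_neighbors=n_neighbors)
    model.fit(X_train, y_train)
    # score() returns the fraction of held-out samples classified correctly.
    return model.score(X_test, y_test)


if __name__ == "__main__":
    print("held-out accuracy: %.2f" % train_and_score())
```

Swapping in a different algorithm would mostly mean changing the one line that creates the model, which is why knowing which algorithm suits the problem matters more at this stage than knowing how it is implemented.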

`Building Machine Learning Systems with Python` maintains a perfect balance between the theoretical and practical aspects of ML. It assumes that the reader is familiar with (though not an expert in) Python and ML. To get familiar with ML I would recommend `The Shape Of Data`, and for Python I would recommend 1 and 2. There are lots of editors for Python, but I have been using the Eclipse PyDev plugin. For those who are familiar with the Eclipse environment, it is very easy to get started with PyDev.

Just to add, there are a lot of ML frameworks for Python (1, 2, 3, 4 etc.). But as of this writing I couldn't find any Python framework that implements ML algorithms in a distributed fashion. I have posted a query on SO here and am waiting for a response. I am not exactly sure whether Python is a good option for distributed ML, but some of the options for distributed ML around Java are Apache Mahout and the recent Oryx from Cloudera. As data sets grow in size, it makes sense to have some nice frameworks implementing ML algorithms in a distributed fashion using Python. Here is an article on how to use Mahout with Python, but wrappers and native interfaces each have their own place. (scikit-learn support is being added to Hadoop, more details here.)

As mentioned in the `O'Reilly Data Science Salary`, there is a close tie between Python and R in Data Science. Here is an interesting comparison from Revolution Analytics of the usage of Python and R. Revolution Analytics, like any other company, will naturally speak well of its own products, but they do have some metrics on this.

I am going through the above-mentioned book and will be writing a detailed review of it. I am really excited to have the book and couldn't stop myself from writing about it. I will also be blogging about my experience with Python, and ML in particular. So, keep following.
6 24 09 Bearman Cartoon Artificial Intelligence copy by Bearman2007 on Flickr


  1. I am impatient to read your review.

  2. Me too reading it .. seems interesting!!!

  3. Can you recommend any good books on the practical implementation of Hadoop with Python?