Big Data and Cloud Tips: February 2014

Monday, February 10, 2014

Review of Learning Python (5th Edition)

Thanks to Vincent Danen for the picture. `A picture is worth a thousand words`. The book in the foreground is the 3rd edition of `Learning Python` and the one in the back is the 5th edition. The book has been getting thicker with each edition. I took a quick look on Amazon, and the table below gives the size and publication date of each edition. Maybe we can use non-linear regression to predict the size of the next edition :)

Kidding aside, the author (Mark Lutz) of Learning Python - 5th edition does a very good job of introducing Python and slowly moving into some of the advanced topics. My only gripe is that the book is huge (it cannot be carried around easily) and that the author repeats some of the topics again and again. So, if you are a quick reader like me, you can skip the repeated content and spend more time on the topics of interest.

For those who want some quick results from Python, this book is certainly not an option. But if you are into Python for the long haul, whether for Data Science or something else, then the book is worth the time. The book is about Python in general, so what it teaches can be applied to many areas (Scripting, Data Science, Dynamic Pages etc). The author also covers the Python ecosystem at a very high level, so the book gives a 360-degree view of Python.

Once you are familiar and comfortable with Python, the same author has also published Python Pocket Reference and Programming Python (on how to develop applications in Python).

Friday, February 7, 2014

Big Data Scenarios / Use Cases

I often get the question `I know Hadoop, Hive, Pig etc. Where do I start using it?`. One quick way is to find out what others have been doing in the domain of interest. This can be done in multiple ways:

1) Follow some of the Big Data related blogs like the ones from Cloudera, Hortonworks etc. Some of the blogs, Cloudera's for example, do a good job of organizing entries into categories.

2) Follow a Big Data aggregator like Planet Big Data or Big Data Made Simple. They don't have any original content, but simply aggregate posts from various other places.

3) Kaggle is a platform that hosts Data Science competitions. Look at the descriptions of the different challenges, and don't forget to follow their blog.

What's Next ? by Crystl from Flickr under CC

Thursday, February 6, 2014

Optical Archival Storage Technology in Facebook

Verbatim 5.25" floppy disk by goosmurf from Flickr under CC

With so much happening around Big Data, it is interesting to know what is going on in the storage space, even for those who are not much into hardware. Here is an interesting perspective from James Hamilton on how Facebook uses optical technology for cold storage (aka archival).

This Facebook hardware project is particularly interesting in that it’s based upon an optical media rather than tape. Tape economics come from a combination of very low cost media combined with only a small number of fairly expensive drives. The tape is moved back and forth between storage slots and the drives when needed by robots. Facebook is taking the same basic approach of using robotic systems to allow a small number of drives to support a large media pool. But, rather than using tape, they are leveraging the high volume Blu-ray disk market with the volume economics driven by consumer media applications. Expect to see over a Petabyte of Blu-ray disks supplied by a Japanese media manufacturer housed in a rack built by a robotic systems supplier.

Here is a video from Facebook showing the actual hardware and an article from Ars Technica. Finally, below is a video (around 30 minutes) with Facebook VP Jay Parikh discussing cold storage and Blu-rays.

Tuesday, February 4, 2014

Free Python books

Lately I have been beating the drum for Python for Data Science and spending time learning it. Many of us might have used Python for creating dynamic web pages, server-side automation and other general-purpose requirements. So, using Python for Data Science would be a natural extension. Here are a lot of free Python books from freepythontips.

Books by Gavin Gilmour from Flickr under CC

Monday, February 3, 2014

Introduction to Spark

In an earlier blog we looked at RDDs, which form the basis for Spark. I planned to write in detail about Spark, but DBMS2 does a very good job of summarizing `Spark and Databricks` here. Databricks, which is in stealth mode, will mostly be providing services (in the cloud) and commercial support around Spark. Cloudera is actively pushing (1, 2) Spark, and sooner or later we should see Spark in CDH.

Spark and Machine Learning are a nice combination. MapReduce processing offers high throughput but also high latency, so it is not well suited for ML algorithms, which are iterative in nature. R and Python (1, 2) interfaces to Spark are still a work in progress. So, over time it should be possible to use the rich ML/Statistical libraries of R/Python with Spark.
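To make the `iterative in nature` point concrete, here is a minimal sketch in plain Python (no Spark involved). It fits a single slope with batch gradient descent, and every iteration makes a full pass over the same data set. The data, the function names (`load_points`, `fit_slope`) and the parameter values are made up for illustration and are not from any of the linked posts.

```python
# Toy illustration of an iterative ML workload (hypothetical data and names).
# Fit y = w * x with batch gradient descent: every iteration scans the
# whole data set again, so keeping the data in memory pays off.


def load_points():
    """Stand-in for an expensive read from disk (made-up data, y = 3x)."""
    return [(float(x), 3.0 * x) for x in range(1, 11)]


def fit_slope(points, iterations=50, learning_rate=0.01):
    """Estimate w in y = w * x by minimizing the mean squared error."""
    w = 0.0
    n = len(points)
    for _ in range(iterations):
        # Full pass over the same data on every iteration.
        gradient = sum(2 * (w * x - y) * x for x, y in points) / n
        w -= learning_rate * gradient
    return w


points = load_points()      # read once, reuse in every iteration
slope = fit_slope(points)
print(round(slope, 3))
```

Here the data is read once and reused across all 50 iterations. If every iteration had to read its input again before doing any work, the read cost would be paid 50 times over, which is roughly the situation the paragraph above describes.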

Btw, here is the original paper on Spark.