Big Data and Cloud Tips: July 2013

Tuesday, July 23, 2013

Crowdfunding for the development of the Ubuntu Phone

I have been using Ubuntu for the last 8-9 years and have been more than happy with it. Initially it didn't support many connected devices and I had to do some workarounds, but lately Ubuntu has been the OS of choice for the desktop and is fast moving into the data centers as well (1, 2). Although Ubuntu releases a new OS version every 6 months, I am sticking with Ubuntu 12.04 (Precise Pangolin) because it's the latest release with Long Term Support (LTS). Ubuntu LTS versions are released every 2 years and are supported for 5 years, so the 12.04 version is supported till April 2017.

The VM which we are using for the Big Data training runs Ubuntu 12.04 and I distribute it without any restrictions. The same would not have been possible with a proprietary OS.

I came to know that Canonical, the company which provides commercial support for Ubuntu, has started crowdfunding the development of an Ubuntu phone called the Ubuntu Edge. The hardware specs look really cool, though they may change over the course of the development.

One of the coolest things about the phone is that it can be connected to a monitor and used as a desktop. All the files will be on the phone, so we have the desktop environment in our pocket all the time. The phone can also dual boot Android and Ubuntu.

Canonical is aiming to raise 32 million USD from the crowd for the development of the phone within a month's time. It's not a donation; the phone is to be delivered around May 2014. It's all or nothing: if the $$ target is reached, the development of the phone happens, or else the $$ are returned to the payer.

If you are not interested in buying the phone, a minimum of $20 can also be given for the initiative. More details of the founder initiative here. This is what I have chosen to contribute, so I am now a founder for this initiative :)

More details about the Ubuntu Edge here.

Wednesday, July 17, 2013

Myrrix gets folded into Cloudera

According to the Myrrix site `Myrrix is a complete, real-time, scalable clustering and recommender system, evolved from Apache Mahout.`

Myrrix was founded by Sean Owen, who is also one of the authors of the `Mahout in Action` book. The book is very good for getting started with Mahout, but it was published at the end of 2011. Since then there have been a lot of changes to the Mahout API, and the API in the book is not compatible with the API in the latest release of the Mahout framework. Sean also started the Taste framework, which became part of Mahout.

With storage and compute costs going down, a lot of interesting things like this are happening in the Machine Learning space. Myrrix has been folded into Cloudera and the plan is to integrate Myrrix with the Cloudera Hadoop distribution. Myrrix and others like BigML and Ayasdi make Machine Learning easy to use for the masses.

Friday, July 12, 2013

What is NoSQL? A video by Martin Fowler

Here is a nice video by Martin Fowler on `Introduction to NoSQL`. At the end he talks about polyglot persistence. It's not as if NoSQL is going to replace RDBMS and RDBMS will vanish forever. NoSQL and RDBMS each have their own space for meeting different requirements and will coexist. Fowler explains it in a very succinct way and his talk is also fun to watch.

It's interesting to note that he doesn't mention Spanner anywhere. For those too impatient to read the Google paper on Spanner, here are some articles on Spanner (1, 2) and a video (1).

I am a firm believer that getting a clear understanding of the core concepts is essential before diving into a new technology. NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence is a very nice book for those who are getting started with NoSQL. I am also starting a new page with links on NoSQL and will keep it updated as I come across interesting information.

Wednesday, July 10, 2013

When to use HBase and when MapReduce?

Very often I get a query on when to use HBase and when to use MapReduce. HBase provides an SQL-like interface with Phoenix, and MapReduce provides a similar SQL interface with Hive. Both can be used to get insights from the data.

I like the analogy of HBase/MapReduce to a plane/train. A train can carry a lot of material at a slow pace, while a plane can carry relatively less material at a faster pace. Depending on the amount of material to be transferred from one location to another and the urgency, either a plane or a train can be used to move the material.

Similarly, HBase (or in fact any database) provides relatively low latency (response time) at the cost of low throughput (data transferred/processed), while MapReduce provides high throughput at the cost of high latency. So, depending on the NFRs (Non-Functional Requirements) of the application, either HBase or MapReduce can be picked.
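The trade-off can be sketched with a toy model in which the time for a request is a fixed startup latency plus the number of records divided by the throughput. The latency and throughput figures below are made-up assumptions for illustration only, not measurements of HBase or MapReduce, and the function names are hypothetical.

```python
def total_seconds(records, latency_s, records_per_s):
    """Toy model: fixed startup latency plus time to stream the records."""
    return latency_s + records / records_per_s


# Hypothetical figures, chosen only to illustrate the trade-off.
LOW_LATENCY_STORE = {"latency_s": 0.01, "records_per_s": 10_000}
BATCH_JOB = {"latency_s": 30.0, "records_per_s": 1_000_000}


def faster_option(records):
    """Return which of the two hypothetical systems finishes first."""
    store_time = total_seconds(records, **LOW_LATENCY_STORE)
    batch_time = total_seconds(records, **BATCH_JOB)
    return "low-latency store" if store_time <= batch_time else "batch job"


if __name__ == "__main__":
    for n in (10, 1_000, 1_000_000_000):
        print(n, "records ->", faster_option(n))
```

With these assumed numbers, a lookup of a handful of records is served far sooner by the low-latency store, while a scan over a billion records finishes sooner as a batch job in spite of its long startup time.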

E-commerce or any customer-facing application requires a quick response time for the end user, and only a few records related to the customer have to be picked, so HBase would fit the bill. But for all the back-end/batch processing, MapReduce can be used.

Sunday, July 7, 2013

The shape of data blog

For those who are interested in getting introduced to data science but are not from a mathematics or statistics background, Jesse Johnson (Assistant Professor of Mathematics at Oklahoma State University) has been regularly writing some nice articles on his blog, `the shape of data`, explaining the different concepts from a geometrical perspective.

I finished reading a couple of articles and couldn't stop myself from mentioning it on my blog. Understanding an algorithm is important, but visualizing it takes us to the next level while using it. During my Bachelor of Engineering, I could visualize some of the courses like Engineering Mechanics/Drawing and got very good scores, while my scores in some other courses like Electronics Engineering went down.

Thanks to Jesse Johnson for the blog. Please let me know in the comments of any other interesting blogs for getting started with data science and I will add them to this entry.

Saturday, July 6, 2013

My new Canon 600D Digital SLR (DSLR)

I had been using a Nikon N65 film SLR for some time, along with a Sony point-and-shoot camera. But the cost of getting the film developed and the time for the processing are a pain in the neck. So, I decided to buy a digital SLR (DSLR) and was wavering between a Canon 600D and a Nikon D5100. After looking at reviews and taking advice from those who have used SLRs a lot, I finally settled on a Canon 600D (also known as the T3i) from Flipkart.

Digital SLRs are nice and can take some very good pictures, but are complex to use. The manual provided by Canon has 323 pages and the DSLR has more than 20 controls and tons of options. Needless to say, it takes some nice photos. Below is a picture of my kid with a bicycle we made with Awesome Strawesome. The focus is on the bicycle with the background blurred, which gives depth to the picture. The LCD monitor of the camera swivels, so I need not twist myself to get pictures at an odd angle.

The main disadvantage of using a DSLR is the size and the weight. Many times I was a bit too lazy to carry it with me and later wished I had it. I have seen smart phones, which fit into a pocket easily, taking some nice pictures. Anyway, both DSLRs and smart phones have their own space in photography.

There are a lot of nice resources on the internet, and books as well, on digital photography. Technology really interests me, but it's nice to get away from it once in a while.

Thursday, July 4, 2013

Google paper on optimal provisioning of flash

Here is a nice article on the different types of memory in a computer. The top of the pyramid, like the L1/L2 cache, is close to the CPU and is fast, but is costlier and small in terms of storage. The bottom of the pyramid is far from the CPU and is slow, but is cheaper and large in terms of storage.

The article doesn't talk about SSDs. SSDs are fast and consume less power when compared to HDDs. This is one of the reasons why laptops with SSDs are lighter and faster. But SSDs cost more per GB than an HDD of similar capacity.

Some recent computers have a hybrid of SSD and HDD to gain the benefits of both. Data is moved into and out of flash based on the LRU or the FIFO algorithms. This makes computers work faster without spending heavily on an SSD-only computer. This is referred to as Express Cache and more about it here.
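To make the two eviction policies concrete, here is a minimal sketch of a small cache that can evict by either LRU (least recently used) or FIFO (first in, first out). It is only an illustration of the general policies; the class and function names are hypothetical, and it is not how Express Cache is actually implemented.

```python
from collections import OrderedDict


class SmallCache:
    """Toy fixed-size cache standing in for the flash tier (illustrative only)."""

    def __init__(self, capacity, policy="LRU"):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        if policy not in ("LRU", "FIFO"):
            raise ValueError("policy must be 'LRU' or 'FIFO'")
        self.capacity = capacity
        self.policy = policy
        self.items = OrderedDict()  # oldest entry first
        self.hits = 0
        self.misses = 0

    def read(self, key):
        """Return True on a cache hit, False on a miss (the key is then cached)."""
        if key in self.items:
            self.hits += 1
            if self.policy == "LRU":
                # LRU refreshes the entry on every access; FIFO does not.
                self.items.move_to_end(key)
            return True
        self.misses += 1
        if len(self.items) >= self.capacity:
            self.items.popitem(last=False)  # evict the entry at the front
        self.items[key] = True
        return False


def hit_count(policy, accesses, capacity):
    """Replay an access sequence and count the cache hits."""
    cache = SmallCache(capacity, policy)
    for key in accesses:
        cache.read(key)
    return cache.hits


if __name__ == "__main__":
    accesses = ["a", "b", "a", "c", "a", "d", "a"]
    for policy in ("LRU", "FIFO"):
        print(policy, "hits:", hit_count(policy, accesses, capacity=2))
```

The only difference between the two policies is whether a hit moves the entry to the back of the queue. With an access pattern that keeps returning to one hot block, LRU tends to keep that block cached while FIFO eventually evicts it.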

Google has published a paper called Janus applying the same concepts at the data center level. Though SSD is fast, its cost makes an SSD-only data center prohibitive. So, the data is moved between the SSD and HDD based on the LRU and FIFO algorithms. Thanks to the GigaOm article for pointing to Janus.

One interesting aspect is the observation below from the paper:

Our results show that the recommendations allow 28% of read operations to be served from flash by placing 1% of our data on flash.
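To get a feel for what such a hit fraction could mean, the average read latency can be estimated as a weighted average of the flash and disk latencies. Only the 28% figure comes from the quote above; the latency values in this sketch are assumptions picked purely for illustration and are not taken from the paper.

```python
def average_read_latency_ms(flash_fraction, flash_ms, disk_ms):
    """Weighted average latency when a fraction of reads is served from flash."""
    if not 0.0 <= flash_fraction <= 1.0:
        raise ValueError("flash_fraction must be between 0 and 1")
    return flash_fraction * flash_ms + (1.0 - flash_fraction) * disk_ms


# Assumed latencies, for illustration only (not from the Janus paper).
ASSUMED_FLASH_MS = 0.1
ASSUMED_DISK_MS = 10.0

if __name__ == "__main__":
    all_disk = average_read_latency_ms(0.0, ASSUMED_FLASH_MS, ASSUMED_DISK_MS)
    with_flash = average_read_latency_ms(0.28, ASSUMED_FLASH_MS, ASSUMED_DISK_MS)
    print(f"all reads from disk: {all_disk:.2f} ms")
    print(f"28% of reads from flash: {with_flash:.2f} ms")
```

Under these assumed latencies, serving 28% of the reads from flash cuts the average read latency by a little over a quarter, even though only a small slice of the data lives on flash.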

As mentioned in an earlier blog, Google has been driving the Big Data space by publishing papers and the Janus paper is one of them.

Another way Google has been spurring innovation is by killing Google Reader. This has led to a lot of alternative online RSS aggregators. I tried feedly and theoldreader, and finally settled on theoldreader. Google Reader and theoldreader are pretty close, and it feels like home using theoldreader.

Although Google has been spurring innovation in many areas, it has also been killing innovation. With Google Reader active, others were too scared of Google to start an alternative RSS aggregator, or in fact any other product similar to what Google had been offering.
