Monday, March 31, 2014

What is a Big Data cluster?

I often get the question `What is a cluster?` when discussing Hadoop and Big Data. To keep it simple, `A cluster is a group or a network of machines wired together, acting as a single entity to work on a task that would take much longer on a single machine.` The given task is split and processed by multiple machines in parallel so that it completes faster. Jesse Johnson explains in simple and clear terms what a cluster is all about and how to design distributed algorithms here.
IMG_9370 by NeoSpire from Flickr under CC
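The split, process in parallel, and combine idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: threads on one machine stand in for the nodes of a cluster, and the function names (`split_into_chunks`, `count_words`, `word_count`) are made up for this example. A real framework like Hadoop ships the chunks to different machines and also deals with node failures, which this sketch does not attempt.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, num_chunks):
    """Split a list into at most num_chunks roughly equal parts."""
    size = max(1, -(-len(items) // num_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_words(lines):
    """Work done by one 'node': count the words in its share of the lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def word_count(lines, num_workers=4):
    """Split the task, process the parts in parallel, then merge the results."""
    chunks = split_into_chunks(lines, num_workers)
    # Assumption: a thread pool stands in for the machines of a cluster.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # combine the partial results
    return total


if __name__ == "__main__":
    text = [
        "a cluster is a group of machines",
        "the machines work on a task in parallel",
    ]
    print(word_count(text, num_workers=2).most_common(3))
```

Each worker sees only its own chunk, and the final answer comes from merging the partial counts. That is the same shape as the task being split across machines, just at a toy scale.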
In a Big Data cluster, the machines (or nodes) are neither as powerful as a server-grade machine nor as limited as a desktop machine. Running thousands of server-grade machines doesn't make sense from a cost perspective, while desktop-grade machines fail often, and those failures have to be handled appropriately. Big Data clusters are made up of commodity machines that fall between server grade and desktop grade.

Similar to open source software projects like Hadoop, Facebook started the Open Compute Project around computing infrastructure. Facebook doesn't see a competitive edge in having specialized hardware that differs from everyone else's, and has been opening up some of its internal infrastructure designs. Anyone can take a design, modify it and come up with their own hardware.

I am not much into hardware, but it makes sense for the different data centers (like those from Amazon, Google, Microsoft and others) to have a common hardware specification, as it brings down the cost of building a data center through manufacturing scale and shared R&D costs. It's very similar to what has been happening in the Apache and Linux space: different companies work together in a collaborative environment on a common goal to make software better, and all enjoy the benefits.

Note (21st April, 2014): Here is a nice article from ZDNet on how Facebook saved $$$ using the Open Compute Project.

1 comment:

  1. Praveen, One open source technology to mention is HPCC Systems from LexisNexis, a data-intensive supercomputing platform for processing and solving big data analytical problems. Their open source Machine Learning Library and Matrix processing algorithms assist data scientists and developers with business intelligence and predictive analytics. Its integration with Hadoop, R and Pentaho extends further capabilities providing a complete solution for data ingestion, processing and delivery. In fact, both libhdfs and webhdfs implementations are available. More at