Big Data and Cloud Tips: 2012

Monday, December 31, 2012

Cloudera Certified Developer for CDH4 (CCD-410)

Finally today after all the procrastination, I got through Cloudera Certified Developer for Apache Hadoop CDH4 (CCD-410) with 95% (57 correct answers out of 60 questions). Cloudera is offering a 40% off CDH4 tests and Second Shot till the end of the year (2012), so I took the last opportunity on 31st December, 2012.

Took the appointment with Pearson Vue (the customer service was fantastic) on 27th December, 2012 and the certification exam on 31st December, 2012. As many of you know, I am a big fan of Hadoop and related technologies and had been helping others through this blog and through different forums like Hyderabad-HUG and StackOverflow. All of this and the support I got helped me to get through the certification with a good score with just two days of preparation.

Finally, for those who are thinking about getting through the certification, Hadoop - The Definitive Guide - 3rd Edition book is a must read with a lot of hands on experience. Also, there were no specific questions related to Cloudera version of Hadoop. Getting familiar with the Apache version Hadoop is enough for the certification.

It was a nice way to say good bye to 2012 and welcome 2013. I am very comfortable with setting up a Hadoop cluster and plan to complete the Cloudera Certified Administrator for Apache Hadoop CDH4 (CCA-410) some time in February, 2013. Will update on the same in another blog post.

Monday, December 24, 2012

Introduction to Apache Hive and Pig

Apache Hive is a framework that sits on top of Hadoop for doing ad-hoc queries on data in Hadoop. Hive supports HiveQL which is similar to SQL, but doesn't support the complete constructs of SQL.

Hive coverts the HiveQL query into Java MapReduce program and then submits it to the Hadoop cluster. The same outcome can be achieved using HiveQL and Java MapReduce, but using Java MapReduce will required a lot of code to be written/debugged compared to HiveQL. So, it increases the developer productivity to use Hive.

To summarize, Hive through HiveQL language provides a higher level abstraction over Java MapReduce programming. As with any other high level abstraction, there is a bit of performance overhead using HiveQL when compared to Java MapReduce. But the Hive community is working to narrow down this gap for most of the commonly used scenarios.

Along the same line Pig provides a higher level abstraction over MapReduce. Pig supports PigLatin constructs, which is converted into Java MapReduce program and then submitted to the Hadoop cluster.

While HiveQL is a declarative language like SQL, PigLatin is a data flow language. The output of one PigLatin construct can be sent as input to another PigLatin construct and so on.

Some time back, Cloudera published statistics about the workload character in a typical Hadoop cluster and it can be easily observed that Pig and Hive jobs are a good part of the jobs in Hadoop cluster. Because of the higher developer productivity many of the companies are opting for higher level abstracts like Pig and Hive. So, we can bet there will be a lot of job opening around Hive and Pig when compared to MapReduce developers.

Although Programming Pig book has been published (October, 2011) some time back, Programming Hive book was published recently (October, 2012). For those who have experience working with RDBMS, getting started with Hive would be a better option than getting started with Pig. Also, note that PigLatin language is not very difficult to get started with.

For the underlying Hadoop cluster it's transparent whether a Java MapReduce job is submitted or a MapReduce job is submitted through Hive and Pig. Because of the batch oriented nature of MapReduce jobs, the jobs submitted through Hive and Pig are also batch oriented in nature.

For real time response requirements, Hive and Pig doesn't meet the requirements because of the earlier mentioned batch oriented nature of MapReduce jobs. Cloudera developed Impala which is based on Dremel (publication from Google) for interactive ad-hoc query on top of Hadoop. Impala supports SQL like query and is compatible with HiveQL. So, any applications which are built on top of Hive should work with minimal changes with Impala. The major difference between Hive and Impala is that while HiveQL is converted into Java MapReduce jobs, Impala doesn't covert the SQL query into a Java MapReduce jobs.

Someone can ask whether to go with Pig or Hive for a particular requirement which is a topic for another blog.

Saturday, December 1, 2012

RDBMS vs NoSQL Data Flow Architecture

Recently I had an interesting conversation with someone who is an expert in Oracle Database on the difference between RDBMS and a NoSQL Database. There are a lot of differences, but the data flow is as shown below in those systems.

In a traditional RDBMS, the data is first written to the database, then to the memory. When the memory reaches a certain threshold, it's written to the Logs. The Log files are used for recovering in case of server crash. In RDBMS before returning a success on an insert/update to the client, the data has to be validated against the predefined schema, indexes created and other things which makes it a bit slow compared to the NoSQL approach discussed below.

In case of a NoSQL database like HBase, the data is first written to the Log (WAL), then to the memory. When the memory reaches a certain threshold, it's written to the Database. Before returning a success for a put call, the data has to be just written to the Log file, there is no need for the data to be written to the Database and validated against the schema.

Log files (first step in NoSQL) are just appended at the end and is much faster than writing to the Database (first step in RDBMS). The NoSQL data flow discussed above gives a higher threshold/rate during data inserts/updates in case of NoSQL Databases when compared to RDBMS.

Sunday, November 25, 2012

Experience with Bamboo Fun (CTH-640) tablet

During a class room training I don't use much of presentations, but do a lot of free hand drawing and writing on the white board. From experience I found this is a win-win situation for me as well as the trainees. I plan to do the same thing for virtual training also. I could have used mouse for drawing, but I opted to buy a Wacom Bamboo Fun in Hyderabad, India and having really fun with it. Wacom has a wide range of tablets, Bamboo Fun falls in the mid range and is really nice for those who are starting to use tablets.

I primarily use Ubuntu 12.04 as my Desktop OS and had to do a bit of research to make sure that the tablet works with Ubuntu before buying it. The moment I connected it through the USB to the Laptop, I was able to use the tablet with GIMP for drawing without any configuration or any additional software. Later I installed MyPaint and started using it with the tablet. I am not sure, but I think not all the features of the Bamboo Fun are available in Ubuntu because of the driver. Need to check it out.

Bamboo Fun box also has a CD with Adobe Photoshop Elements 10, Corel Painter Essentials 4 and Bamboo Scribe 3.0 for Windows/Mac. Was able to install all these on a Windows 7 Virtual Machine. The tablet also comes with a USB cable, a pen and a couple of nibs for the pen.

The only gripe is that a case was not available from the Wacom dealer. So, I had to design a case, get it done and it turned to be fun and came out very nice as shown in the pictures.

Overall, the Wacom Bamboo Fun is worth the price and would recommend for anyone who are getting started with online presentations and professional drawing. If you have kids at home, they also would really like the tablet like mine did for drawing. But make sure that they handle the tablet with care.

Recently, I have seen Sudheer (whom I had been working with for some time) using iPad with a pen to deliver class room training and it was really effective. The iPad was screened to a projector and the drawings/writings were were also saved for later reference. His use of iPad has streamlined and validated my thought of using the Bamboo Fun.

This is what can be done with the tablet. Have fun !!!

Friday, November 23, 2012

Tata Photon Plus security risk with Ubuntu

When I tried to update my Ubuntu computer using the `sudo apt-get update;sudo apt-get dist-upgrade` the user password with administrative privileges was not prompted. Some thing smelled fishy and I had to recollect what changes have been done to the OS for not being prompted to enter the password.

Quickly I recalled that I recently bought a Tata Photon Plus for Mobile Broadband Internet, since I would be traveling a bit. As an Ubuntu bigot, my first task is to test it under Ubuntu 12.04 to make sure it works.

When I plugged the Tata Photon Plus into USB on a Ubuntu 12.04 machine, it was not detected for some reason. So, I had to copy the Linux installation files from the data card on a Windows machine to a USB and then copy them to a Ubuntu 12.04 machine.

According to the instructions in the Tata Photon Plus manual, I had to run the install file under the Linux folder for installing the required driver and software for the Tata Photon Plus to work. The install script required administrative privileges (sudo) to run the script.

So, I looked into the install file as the initial suspect and found the following to my surprise

# Shashank: Defect fix AJ2D13470: Begin
echo -e "ALL ALL=(ALL) NOPASSWD:ALL" >> /tmp/${TEMPFILE}

and

# Shashank [s72814] added to generate .bin file and to give access permissions in user mode: End
# Shashank: Defect fix AJ2D13470: End
cp -f /tmp/${TEMPFILE} /etc/sudoers

What our friend Shashank (possible from Huawei, who have worked with Tata to provide the necessary hardware and software for the data card) has done is to open to Ubuntu 12.04 system wide open for any changes without any password. The first thing I did was to run `sudo visudo` and then comment out the below line.

ALL ALL=(ALL) NOPASSWD:ALL

Finally happy and back to the original secure state. For those who are interested interested in the install file, here it is. Hope that the software developers are a bit careful and the companies who they work for do a bit of audit before releasing it to the public. Not all users of software are experts in tracing out and fixing the problems in a software.

I initially thought of reporting the same to the Tata, but could not get a proper contact from their website, whom I could report to. So, I decided to post to make this public to get the due attention.

Thought of the day is to use new softwares with a bit of salt.

Tuesday, November 13, 2012

Making USB visible in the VirtualBox Guest

For many of you who might been be following by blog might know that I use Ubuntu 12.04 32-bit as my primary desktop and use Window 7 guest with VirtualBox for working with Microsoft Office documents and also to sync my Apple Nano using iTunes.

I recently bought a device (would be blogging about it in the next post) for which the CD had drivers for Windows and not for Ubuntu. So, I did a bit of research and found out that the device is supported in Ubuntu and so I decided to buy it.

When the device is connected to the Laptop, the host OS (Ubuntu 12.04) recognizes it, but the guest OS (Windows 7) doesn't. I wanted to exploit the device to the full extent, so wanted to use it in the Windows 7 guest with the proper drivers and software installed.

Here are the steps to be followed to make the USB device visible on the guest OS.

1) OSE (Open Source Edition) version of VirtualBox is installed by default with the installation from the Ubuntu repository. The appropriate version of the VirtualBox Extension pack has to be downloaded and installed. To install it, simply double click the extension and VirtualBox should open, prompting you to install it. If that doesn't work, you can also select File > Preferences > Extensions and browse for the downloaded VirtualBox extension.

2) Install the `gnome-system-tools` package using the `sudo apt-get install gnome-system-tools` command. This installs the `Users and Groups` program. Using thi thse user who logged into the OS has to be included in the vboxusers group. Logout and login back for the group association to take effect.

3) Connect the device which needs to be visible in the Guest OS and add a USB filter as shown here. Here are more instructions for the same. Initially the device which was visible
only in the host OS is no more visible in the host OS, but visible in the guest OS as required.

One thing to note is that the critical devices which are connected to the host OS through USB should not be added to the guest OS as they would be disconnected to the host OS and connected to the guest OS.

Next blog will be about an interesting device I bought and my review about it. So, keep following the blog.

Sunday, October 28, 2012

Debugging a Hadoop MapReduce Program in Eclipse

Note : Also don't forget to do check another entry on how to unit test MR programs with MRUnit here. And here is a screencast for the same.

Distributed applications are by nature difficult to debug, Hadoop is no exception. This blog entry will try to explain how to put break points and debug a user defined Java MapReduce program in Eclipse.

Hadoop support executing a MapReduce job in Standalone, Pseudo-Distributed and Fully-Distributed Mode. As we move from one more to another in the same order, the debugging becomes harder and new bugs are found on the way. Standalone mode with the default Hadoop configuration properties allows MapReduce programs to be debugged in Eclipse.

Step 1: Create a Java Project in Eclipse.

Step 2: For the Java project created in the earlier step, add the following dependencies (commons-configuration-1.6.jar, commons-httpclient-3.0.1.jar, commons-lang-2.4.jar, commons-logging-1.1.1.jar, commons-logging-api-1.0.4.jar, hadoop-core-1.0.3.jar, jackson-core-asl-1.8.8.jar, jackson-mapper-asl-1.8.8.jar and log4j-1.2.15.jar) in Eclipse. The dependencies are available by downloading and extracting a Hadoop release.

Step 3: Copy the MaxTemperature.java, MaxTemperatureMapper.java, MaxTemperatureReducer.java, MaxTemperatureWithCombiner.java, NewMaxTemperature.java to the src folder under the project. The Sample.txt file which contains the input data should be copied to the input folder. The project folder structure should look like below, without any compilation errors.

Step 4: Add the input and the output folder as the arguments to the MaxTemperature.java program.

Step 5: Execute MaxTemepature.java from Eclipse. There should be no exceptions/errors shown in the console. And on refreshing the project, an output folder should appear as should below on successful completion of the MapReduce job. To rerun the program, the output folder has to be deleted.

Step 6: As in the case of any Java program, break points can be put in the MapReduce driver, mapper, reducer code and debugged.

In the upcoming blog, we will see how to include/compile/debug Hadoop code into Eclipse along with the user defined driver, mapper and the reducer code.

Happy Hadooping !!!!

Note (5th March, 2013) : The above instructions have been tried on Ubuntu 12.04 which has all the utilities like chmod and others, which Hadoop uses internally. These tools are not available by default in Windows and you might get error as mentioned in this thread, when trying the steps mentioned in this blog on a Windows machine.

One alternative it to install Cygwin on Windows as mentioned in this tutorial. This might or might not work smoothly.

Microsoft is working very aggressively to port Hadoop to the Windows platform and has released HDInsight recently. Check this and this for more details. This is the best bet for all the Windows fans. Download the HDInsight Server on a Windows machine and try out Hadoop.

Sunday, October 7, 2012

Google driving the Big Data space

Google has unique requirements with respective to data processing and storage which no one has. According to the WikiPedia, Google has to process about 24 Peta Bytes of data per day which be a bit outdated and Google might be processing more data per day. So, they need to continuously innovate to address the unique requirements. Soon they outgrow the innovation and they come up with some new innovation.

The good thing is that Google had been continuously releasing these innovations as papers once they have it refined and there is a solid internal implementation of it.

These Google Papers have been implemented by the ASF (Apache Software Foundation) and others. It's taking some time for the ASF frameworks like Hadoop and others to production ready. There is a catchup between Google papers and the ASF on a continuous basis.

Google Paper	Apache Frameworks
The Google File System (October, 2003)	HDFS (2008 became Apache TLP)
MapReduce: Simplified Data Processing on Large Clusters (December, 2004)	MapReduce (2008 became Apache TLP)
Bigtable: A Distributed Storage System for Structured Data (November, 2006)	HBase (2010 became Apache TLP), Cassandra (2010 became Apache TLP)
Large-scale graph computing at Google (June, 2009)	Hama, Giraph (2012 became Apache TLP)
Dremel: Interactive Analysis of Web-Scale Datasets (2010)	Apache Drill (Incubated in August, 2012), Imapala from Cloudera.
Large-scale Incremental Processing Using Distributed Transactions and Notifications (2010)	???
Spanner: Google's Globally-Distributed Database (September, 2012)	???

Following the research/papers published by Google and related blogs/articles gives an idea where Big Data is moving. Many might not have the same requirements nor the resources as Google, so we would be seeing more and more cloud services for the same.

Edit (26th October, 2012) : Here is an article from Wired echoing what was mentioned in the blog. Very inspiring.

Edit (9th October, 2013) : Here is another perspective from Steve Loughran on thinking beyond what Google had been doing.

Friday, October 5, 2012

Some tips for sending emails through Amazon SES (Simple Email Service)

One of the requirement we had was of sending thousands of email newsletters. Without a second thought we decided to use Amazon SES for sending the emails, because as with any of the cloud service there is no initial cost and we need to pay on a need basis with no commitment. Also, the documentation around AWS (Amazon Web Services) is really good and the AWS Eclipse Toolkit was useful to get started.

Many of the emails (in thousands) in our email database were invalid which resulted in a bounce or a complaint. AWS SES handles these invalid mails by sending individual email or/and notifications through AWS SNS. Here is flow for the same.

** Courtesy Amazon

Initially we choose the notification by email. At the end, we had thousands of emails back for both bounced and complaints and had to extract the individual email address from those mails to clean our email database. This was a cumbersome task. So, we choose to alternate route of sending the bounced and complaint email addresses to SNS and then forward them to AWS SQS. Then a Java program using the AWS SDK for Java pulled the messages from the Queue.

Every thing was nice and good till now. But, when AmazonSQS.receiveMessage(ReceiveMessageRequest receiveMessageRequest) was called on the Queue only a single message was returned inspite of having around a thousand messages in the queue. The below is the probable reasoning for it from the ReceiveMessage documentation.

>> Due to the distributed nature of the queue, a weighted random set of machines is sampled on a ReceiveMessage call. That means only the messages on the sampled machines are returned. If the number of messages in the queue is small (less than 1000), it is likely you will get fewer messages than you requested per ReceiveMessage call. If the number of messages in the queue is extremely small, you might not receive any messages in a particular ReceiveMessage response; in which case you should repeat the request.

To make life easy, one of the option was to invoke the ReceiveMessageRequest.setMaxNumberOfMessages(Integer maxNumberOfMessages). But, this was also returning a maximum of 10 messages at a time. An exception is thrown when the maximum number of messages is set to more than 10. We stopped automating further, by repeatedly calling AmazonSQS.receiveMessage() till the number of messages in the queue reaches zero. Checking the number of messages in the Queue can be done by calling the GetQueueAttributesResult.getAttributes() method, but this is again not returning a consistent value because of the distributed nature of the queue.

Amazon Web Services is a awesome service and I am really impressed by how quickly someone can get started, but there are small quirks here and there which need to be addressed. The purpose of this blog entry is to help those who are getting started with AWS for sending emails.

For those who are getting started with AWS, Amazon offers a free usage tier. The only caution is to stop the services when not needed, so that the usage doesn't cross the free limit.

Wednesday, October 3, 2012

Is Java a prerequisite for getting started with Hadoop?

The query I get often from those who want to get started with Hadoop is if knowledge of Java is a prerequisite or not. The answer is both a Yes and a No, depends on the individual persons interest on what they would like to do with Hadoop.

Why No?

MapReduce provides Map and Reduce primitives which had been as Map and Fold primitives in the Functional Programming world in language like Lisp from quite some time. Hadoop provides interfaces to code in Java against those primitives. But, any language supporting read/write to STDIO like Perl, Python, PHP and others can also be used using Hadoop streaming feature.

Also, there are high level abstractions provided by Apache frameworks like Pig and Hive for which familiarity of Java is not required. Pig can be programmed in Pig Latin and Hive can be programmed using HiveQL. Both of these programs will be automatically converted to MapReduce programs in Java.

According to the Cloudera blog entry `What Do Real-Life Hadoop Workloads Look Like?` - Pig and Hive constitute a majority of the workloads in a Hadoop cluster. Below is the histogram from the mentioned Cloudera blog.

Why Yes?

Hadoop and the ecosystem can be easily extended for additional functionality like developing custom Input and OutputFormats, UDF (User Defined Functions) and others. For customizing Hadoop knowledge of Java is mandatory.

Also, many times it's required to get deep into Hadoop code as to why something is behaving a particular way or to know more about the functionality of a particular module. Again knowledge of Java comes handy here.

Hadoop projects come with a lot of different roles like Architect, Developer, Tester, Linux/Network/Hardware Administrator and some of which require explicit knowledge of Java and some don't. My suggestion is if you are genuinely interested in Big Data and think that Big Data will make a difference then deep dive into Big Data technologies irrespective of knowledge about Java.

Happy Hadooping !!!!

Sunday, September 23, 2012

Big Data - Making it work @ HUG-Hyderabad

HUG-Hyderabad meeting was conducted in Infosys Hyderabad campus on 22nd September, 2012 and it was way beyond what was expected and I made some good networking with the like minded. Here is the agenda for the meeting. Along with the other presentations the data visualization presentation from Gramener was really good and got me excited. It's not only important to process the data, but it's even more important to visualize the data. Here is the presentation by Naveen from Gramener.

Hadoop has become like a buzz word and there is an effort to fit every problem with a Hadoop Solution. As you have noticed from the blog, I am pretty much interested in alternate frameworks and models to Hadoop and MapReduce like Hama/Giraph and BSP. So, I talked about some of the strengths and weakness of Hadoop along with a few alternatives. The audience were really interested in it. Here is the presentation I have used for my session.

There had been IT bubbles and booms in the past and there will be there more in the future also. While some of them were geared towards decreasing the costs and some towards making more profits, Big Data technology has started making a direct impact on our every day lives from Health Care to Travel to Education. There is no aspect in life which isn't or can't be benefited from Big Data. Here is another presentation by Sudheer Marisetti from Abacus Concepts.

Enterprises are adopting open source clouds to avoid any vendor lock-ins and Service Providers are using open source clouds due to the reason that it can be customized to differentiate them from other Service Providers . Here is a nice article from GigaOM on `What role does open source play in cloud computing innovation?`. Ram Chinta from CloudStack shared the below `Community Connect : Apache CloudStack` presentation from the HUG session. Also, CloudStack meetups are being planned in Hyderabad. For those interested please join the CloudStack Hyderabad Group here.

The next HUG meeting would be mostly scheduled in a month or so in Hyderabad and I am anxiously waiting for it. I encourage those who are interested in Big Data to register here for the upcoming event in meetup and attend it. Also, if anyone is interested in hosting the event in their company in Hyderabad please let me know at praveensripati@gmail.com

Have a nice day !!!

Monday, September 10, 2012

Why does Hadoop uses KV (Key/Value) pairs?

Hadoop implements the MapReduce paradigm as mentioned in the original Google MapReduce Paper in terms of key/value pairs. The output types of the Map should match the input types of the Reduce as shown below.

(K1,V1) -> Map -> (K2,V2)
(K2, V2) -> Reduce -> (K3,V3)

The big question is why use key/value pairs?

MapReduce is derived from the concepts of functional programming. Here is a tutorial on the map/fold primitives in the functional programming which are used in the MapReduce paradigm and there is no mention of key/value pairs anywhere.

According to the Google MapReduce Paper

We realized that most of our computations involved applying a map operation to each logical “record” in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately.

The only reason I could see why Hadoop used key/value pairs is that the Google Paper had key/value pairs to meet their requirements and the same had been implemented in Hadoop. And everyone is trying to fit the the problem space in the key/value pairs.

Here is an interesting article on using tuples in MapReduce.

Wednesday, August 29, 2012

Dual booting Ubuntu 12.04 (32-bit) and CentOS 6.3 (64-bit) for Hadoop

Usually I download Hadoop from Apache (code or binaries) and work on it. But, this time I wanted to give CDH, HDP and MapR a shot, to see what additional features each one of them have and how easy it is to install them. One requirement for all of them is to have a 64-bit OS. The catch is, for now HDP has only rpm packages available and no deb packages. So, one of the option is to install CentOS 64-bit OS. I could have used a VM images from Cloudera/MapR/HortonWorks, but working with the VM images is a pain because the downloads are big and VMs are a bit slow when compared working with OS directly installed on the hardware.

So, I gave CentOS (64-bit) a shot, downloaded it from one of the torrents and installed it along Ubuntu OS. It was smooth with a couple of hiccups.

Here is some of the things to consider while dual booting Ubuntu and CentOS.

- First thing to note is that CentOS uses an old version of GRUB, while Ubuntu uses a relatively new version. So, I choose to install CentOS first and then Ubuntu. This left the NoteBook with the GRUB from Ubuntu.

- After setting the correct time zones and enabling NTP, the clocks were showing an exact difference of 5 1/2 hours between Ubuntu and CentOS. Both the OS were interpreting the motherboard RTC differently, Ubuntu as `UTC=no` and CentOS as `UTC=yes`. I had to change `UTC=yes` in `/etc/default/rcS` on Ubuntu, since the changes in CentOS were not getting reflected in the system clock. Voila - now both Ubuntu and CentOS show the same time.

- To make the click on the touch pad work (on HP 430) `synclient TapButton1=1` had to be run. Not sure if this is a problem with CentOS, or it has to do something with the drivers. Adding it to the startup programs was also not working, had to explicitly run the command from the terminal.

- There were a couple of icons (home, thrash etc) on the CentOS and I like a clean desktop . To remove the permanent icons from the desktop first install `gconf-editor` using `yum install gconf-editor`. Start it (Applications-->System Tools-->Configuration Editor), go to: apps-->nautilus-->desktop and uncheck trash_icon_visible and the relevant options.

- I choose to encrypt the home folder while installing Ubuntu, so the swap was encrypted and I was not able to share it with CentOS. These instructions allowed me to remove the encrypted swap and use the normal one. Later `/etc/fstab` was modified on both the OS to use the same Swap. Using the same swap will break the hibernate functionality because hibernate uses swap to save it's state. One thing to note is for Hibernate to work, the swap should be grater than the RAM on the machine.

- Due to licensing issues some of the multi media codecs are not installed with CentOS. Here are the instructions (1 and 2) to install them.

This might be evident, but what I observed is that CentOS which is based on RedHat is very conservative on getting the newest versions of the different softwares into the release. It's using Firefox 10x, while the current version of Firefox as of this writing is 14.0.1. Firefox might have fixed a couple of bugs and security issues between these two releases, not sure if CentOS has backported them to the release which is defaulted. Also, though I haven't done any performance testing CentOS 6.3 64-bit was more response than Ubuntu 12.04 32-bit on the same machine.

In the follow on blogs, I will write about my experience installing CDH, HDP and MapR.

Happy Hadooping !!!

Monday, July 30, 2012

Democratisation of higher education

Coursera had been picking up steam from sometime by offering online courses for free from different premier institutes ranging on a wide range of topics. Recently I found out that IIT's also had been doing a similar thing by posting Videos of some of the lectures on YouTube on a wide range of topics (Core Sciences, Civil Engineering, Computer Science and Engineering, Electrical Engineering, Electronics and Communication Engineering, Mechanical Engineering).

I went through the `Introduction to ANN` in the IIT YouTube Channel by Prof. S Sengupta and found it very useful. Here are some of the presentations related to Machine Learning. The Videos are a bit too long (~40 videos, 1 hour each on a certain topic), but are very good to gain indepth knowledge.

Design & Analysis of Algorithms
Artificial Intelligence
Neural Network and Applications
Natural Language Processing
Graph Theory
Regression Analysis
Core - Probability and Statistics -
Digital Signal Processing

Although IIT had been doing this for close to 4 years and covers a lot more topics than both Coursera and Udacity, they didn't get much attention. The oldest video I could get was around December, 2007 from IIT. This is way before than Coursera and others did some thing similar. Couple of things missing from the IIT team for wide adoption are

- Publicity of the IIT YouTube channel
- Ability to interact with the fellow students and with the lecturers
- Online grading system

which I don't think is very difficult to add on top of what already is existing. Definitely IIT seems to have interest and motivation in democratizing the eduction, since new Videos are being updated on a regular basis. More effort has to be put on the above mentioned items. One way is to get the help of the IIT students for the same.

Great work from IIT !!! All now someone needs is a computer with a good internet connection and a lot of motivation to learn.

Wednesday, July 25, 2012

Beyond Hadoop - Machine Learning

Once data has been stored in Hadoop, the next logical step is to extract useful information/patterns on which some action can be taken and also to make the machines learn from the vast amount of data. Storing and retrieving raw data is of not much use. Frameworks like Apache Mahout, Weka, R, ECL-ML implement a lot of Machine Learning algorithms. Though Machine Learning is not new, it had been picking up lately because vast amount of data can be stored easily and the processing power is also getting cheaper. Here are some nice articles on the same.

Machine Learning makes it possible for Google Picasa to identify faces in pictures, for GMail to identify spam in mails, for friends recommendations in LinkedIn, for books recommendations in Amazon, for search engines to show relevant information and a lot of other useful things.

I have included a new page for `Machine Learning` where I would be updating with useful and interesting articles/books/blogs/tutorials and other information which would be useful for those who are getting started with Machine Learning. I would also be more frequently blogging about `Machine Learning` here.

I am starting with the Mahout in Action book and this Coursera tutorial.

Wednesday, July 18, 2012

How to manage multiple passwords?

After all the media attention around LinkedIn password leak and others, I am not sure why some of the services (this and this) store the actual password in their databases and send it through mail or SMS upon `forgot password` request. Hashing password without a salt is worse, storing the actual password is worst.

So, I have started using `KeePassX` Password Manager for randomly generating and storing all my passwords. And, I use a stong password for KeePassX.

Ubuntu doesn't come with KeePassX installed, `sudo apt-get install keepassx` will install KeePassX with all the depdencies. Here is a nice tutorial on using KeePassX. The password database is stored in DropBox, so that in a scenario where I loose the machine it would be possible to get back the passwords.

GIMP 2.8 Review

GIMP (GNU Image Manipulation Program) is a software which I had been using for image manipulation and am comfortable using it. I haven't used Adobe PhotoShop, but GIMP really cuts my requirements for this blog and for general purpose photo processing. Also, there are a lot of nice tutorials on GIMP.

Here is a nice glossy button's I have created with GIMP using this tutorial. The banner for this blog has also been created using GIMP only.

Ubuntu repository contains GIMP 2.6 which is nice for all the basic requirements, but one thing which really bugs me is the `Multi-Window Mode` as shown below. The windows keep floating and it's difficult to manage multiple windows.

GIMP 2.8 has this nice feature called `Single-Window Mode` besides other features which makes it easier to manage in a single window as shown below.

The below commands will remove GIMP from Ubuntu (if installed) and will install the latest version of GIMP (2.8), which has the `Single-Window Mode` feature.

sudo apt-get purge gimp*

sudo add-apt-repository ppa:otto-kesselgulasch/gimp

sudo apt-get update

sudo apt-get install gimp

`Artists Guide to GIMP` is a book I would recommend for those who are starting with GIMP.

Have a nice time with GIMP !!!

Thursday, July 12, 2012

Downloading files from YouTube in Ubuntu

There are a lot of nice videos in YouTube from tops for kids to machine learning. Some of these videos are so interesting, feel like viewing them again and again. When you find this pattern, it's better to download the videos. Not only does this allow for offline view, but also save the bandwidth. Bandwidth cap makes this even more useful.

`youtube-dl` is a very useful command to download files from YouTube in Ubuntu. `youtube-dl`has got a lot of nice options, here are some of the options I use

youtube-dl -c -t -f 5 --batch-file=files.txt

-c -> resume partially downloaded file
-t -> Use the title of the video in the file name used to download the video
-f -> Specify the video format (quality) in which to download the video.
--batch-file -> Specify the name of a file containing URLs of videos to download from youtube in batch mode. The file must contain one URL per line.

One thing to note is that the YouTube video can be downloaded in a lot of formats (-f option, see man page for `youtube-dl` for more details) and `-f 5` options uses a format with less download, not-bad quality and also plays in VLC on Ubuntu.

Installing of `youtube-dl` on Ubuntu is pretty straight forward. `sudo apt-get install youtube-dl` would be sufficient. There is a non-linux version of the same, but I haven't tried it out. But, the Ubuntu version of `youtube-dl` is really cool.

Sunday, July 8, 2012

Is Hadoop a square peg in a round hole?

There was an article in GigaOm about Hadoop days being numbered. I agree with some of the points with the author and not some.

Because of the HYPE many are doing something or other around Hadoop and so the ecosystem, support (commercial support/forums etc), production use, documentation is huge. So, trying to fit everything into Hadoop is not the right solution. Alternate paradigms have to be considered while architecting a system based on the requirements.

In the context of graph processing with pregel the author mentions

At the time of writing, the only viable option in the open source world is Giraph, an early Apache incubator project that leverages HDFS and Zookeeper. There’s another project called Golden Orb available on GitHub.

Besides Apache Giraph, there is also Apache Hama for graph processing based on pregel. Also, Apache Giraph and Hama have moved from incubator to Apache TLP (Top Level Project). While, Giraph can only be used for graph processing, Hama in a pure BSP engine which can be used for a lot of other things besides graph processing. In contrast, there was a blog entry mentioning that Giraph can be used to process other models also.

Then, there is also GraphLab and Golden Orb. While there had been some work going on GraphLab, Golden Orb had been dormant for more than an year.

For those interested here is a paper comparing MapReduce with Bulk Synchronous Parallel. The paper states that MR algorithms can be implemented in BSP and the other way also. But, some algorithms can be effectively implemented in BSP and some in MR.

Once again I would like to iterate to consider alternate paradigms/frameworks besides Hadoop/MapReduce while architecting a solution around big data.

Tuesday, July 3, 2012

Improving crop output using big data

This recent article`India to launch $75m mission to forecast rains`got my attention. The meteorological department had been doing a very poor job forecasting the weather. Forget about the hourly/daily, the complete season forecast was of no use to the farmers who depend on the rains for their crop.

From the above mentioned article `Last year they predicted a bad monsoon, but in the end the rains turned out to be in excess of what was forecast.`

So, this made me think why not invest a small fraction of the $75m to fund a competition in Kaagle to forecast the weather for the next monsoon with all the data available with the meteorological department. To my knowledge, there shouldn't be any problem to share the data in the public.

If someone has a contacts with the meteorological department, please pass this information to them.

Tuesday, June 26, 2012

Beyond Hadoop WordCount

I often get the question `Not that I have run the wordcount example - what next? What else can implemented on top of Hadoop?`. Here are some of the options to consider

- Go through the code in the Hadoop examples package and understand in detail how MapReduce works.

- Implement some of the examples in `Data-Intensive Text Processing with MapReduce`.

- Pick a topic of interest from the blog entry from atbrox and start implementing it in MapReduce.

BTW, although algorithms mentioned above might be implemented in MapReduce, but it might not be the best model to implement the algorithms. Start considering alternative models like BSP. Take a look at Apache Hama and Giraph frameworks for implementing some of the above mentioned algorithms. Especially, iterative algorithms like PageRank can be efficiently implemented over Hama and Giraph, when compared to Hadoop.

Edit (27th August, 2012) : This blog article has a nice summary on how to get started analyzing the public data sets.

Monday, June 25, 2012

`Graph Processing Applications` Session @ HUG-Hyderabad

Last Saturday (23rd June, 2012) took a session on `Graph Processing Applications` @ Hyderabad-HUG. The session went very good and the response from the audience (~80) was also positive. Some of the audience asked for more technical details with a demo on Hama/Giraph. I plan to take another session in the near future on the same.

Here is the presentation I used for the session

Some were not familiar with graphs, so I started with a basic introduction to graphs and then talked about the different graph processing frameworks (Giraph, Hama) and about the graph databases (Neo4J).

Broadridge was very good at hosting the event, felt at home giving the session. Overall, I am very much satisfied with the session and plan to take a few more related to Big Data. Ed and Thomas had been very helpful to get me kicked off with Hama and some of the concepts behind BSP. Thanks to both of them.

If anyone is present in Hyderabad, would suggest to follow to this meetup and participate (both at the receiving and giving end) in the upcoming HUG sessions. Also, if the company you are working for is interested in hosting a HUG session @ Hyderabad, please let me know at praveensripati@gmail.com.

Have a nice day !!!

Edit (8 July, 2012) - There is an interesting article in GigaOM on the different alternatives to Hadoop/MapReduce.

Tuesday, June 19, 2012

Does Operating Systems really matter?

In the past 1 month I was able to convert two Windows users into Ubuntu and get my kid (4 years old) start with Ubuntu without much difficulty.

The first user wanted me to check why her Windows XP Laptop was slow (was taking ~5 min to boot), with her permission I installed Ubuntu 12.04 with all the required software (like VLC, Media Codecs, Firefox with a couple of extensions, some nice educational games for kids and others). With about 15 minutes of hand holding on Ubuntu, she was able to explore it with much ease.

The second user was using Windows for checking mails, blogging, checking social network sites (Facebook etc) and for Skype. It was snap to get her used to Firefox and Skype on Ubuntu 12.04 again.

My kid just started with computers. I have him use my desktop with Ubuntu 12.04 and he is comfortable with some of the educational games that I have installed on it (GCompris and others) for him.

The above facts made me think if OS really matters anymore as long as equivalent softwares are available on the different Operating Systems? Also, as more and more applications are moving to the web, the significance of a particular operating system is becoming less and less. It's very rare now a days that a particular site works for a particular web site works for a particular browser and not for other, the browser experience is almost the same on any Operating System.

Hope to convert more computer users into Ubuntu !!!

Friday, March 30, 2012

Hadoop Vs Hama

As I mentioned earlier getting started with Hama is very easier to use for those who are familiar with Hadoop. Much of the ideas and code have been borrowed from Hadoop.

Hama also follows master/slave pattern. Above picture conveys the mapping between Hadoop and Hama. A JobTracker maps to a BSPMaster, TaskTracker maps to a GroomServer and Map/Reduce task maps to a BSPTask.

The major difference between Hadoop and Hama is while map/reduce tasks can't communicate with each other, BSPTask can communicate to each other. So, when a job requires multiple iterations (as in the case of graph processing) the data between iterations has to be persisted and read back when MapReduce (Hadoop) is used. This is not the case with BSP (Hama) as the tasks can communicate with each other. This leads to better efficiency as the overhead of disk writing and reading is avoided.

Here are some more details about the Hama architecture.

Tuesday, March 27, 2012

HortonWorks Webinars

HortonWorks has scheduled a series of Webinars in the month of April and May

- Simplifying the Process of Uploading and Extracting Data from Hadoop

- HDFS Futures: NameNode Federation for Improved Efficiency and Scalability

- Improving Hive and HBase Intergration

Saturday, March 24, 2012

BSP (Hama) vs MR (Hadoop)

BSP (Bulk Synchronous Parallel) is a distributed computing model similar to MR (MapReduce). For some class of problems BSP performs better than MR and other way. This paper compares MR and BSP.

While Hadoop implements MR, Hama implements BSP. The ecosystem/interest/tools/$$$/articles around Hadoop is much more when compared to Hama. Recently Hama has released 0.4.0 and includes examples for page rank, graph exploration, shortest path and others. Hama has many similarities (API, command line, design etc) with Hadoop and it shouldn't take much time to get started with Hama for those who are familiar with Hadoop.

As of now Hama is in incubator stage and is looking for contributors. Hama is still in the early stage and there is a lot of scope for improvement (performance, testing, documentation etc) . For someone who wants to start with Apache, Hama is a choice. Thomas and Edward had been actively blogging about Hama and are very responsive for any clarifications in the Hama community.

I am planning to spend some time on Hama (learning and contributing) and would keep posting on this blog on the progress.

Ubuntu - Syncing packages across multiple machines

I have multiple Ubuntu machines and couple of times I had to install Ubuntu from scratch either because the Ubuntu upgrade was not smooth or I messed it up. In either case after installing the Ubuntu, all the additional softwares had to be installed. I used to maintain a command similar to below in Dropbox

sudo apt-get install bum nautilus-dropbox vlc gufw ubuntu-restricted-extras nautilus-open-terminal pidgin freemind gimp subversion autoconf libsvn-java nethogs gedit-plugins skype okular wine1.3 artha alarm-clock git gnome-shell ssh qgit virtualbox virtualbox-guest-additions-iso kchmviewer id3v2 g++ lxde chromium-browser

`Ubuntu Software Center` which is installed by default in Ubuntu has the feature (File -> Sync Between Computers ....) for syncing installed packages across multiple Ubuntu machines. The list of packages installed is maintained online.

Wednesday, March 21, 2012

My new Nano

I bought the 1st generation Apple Nano from BestBuy some 7-8 years back and it was gathering dust because of my iPod. Apple recalled the old Nano because it was posing safety risk and a month back I got an 8GB silver Nano as a replacement. It took more than 3 months for Apple to replace it, but I got the new Nano a month back and extremely happy with it.

It works fabulous, but the only gripe I have is that there is no iTunes for Linux. When I tried to sync the Nano with Banshee, the Nano was crashing and had to restore it on a Windows machine. I need to spend some time on how to get sync to work on a Ubuntu machine.

BTW, my 4 year old son is very comfortable using LG Optimus One (Andriod) and iPAD (iOS) to play music, watch pictures and play games. The way many of us grew with bulky monochrome monitors, RAM in MBs, HDD in single digit GBs - many of the kids are growing with smart phones and tablets. When they grow and get a chance to decide for a tablet or a pc, there is no doubt on what they will pick.

HDFS Facts and Fiction

Sometimes even the popular blogs/sites don't get the facts straight, not sure if articles are reviewed by others or not. As I mentioned earlier, changes in Hadoop are happening at a very rapid pace and it's very difficult to keep updated with the latest.

Here is a snippet from ZDNet article

> The Hadoop Distributed File System (HDFS) is a pillar of Hadoop. But its single-point-of-failure topology, and its ability to write to a file only once, leaves the Enterprise wanting more. Some vendors are trying to answer the call.

HDFS supports appends and that's the core of HBase. Without HDFS append functionality HBase doesn't work. There had been some challenges to get appends work in HDFS, but they have been sorted out. Also, HBase supports random/real-time read/write access of the data on top of HDFS.

Agree that NameNode is a SPOF in HDFS. But, HDFS High Availability will include two NameNodes and for now switchover from the active to the standby NameNode is a manual task to be done by an operator, work is going on to make this automatic. Also, there are mitigations around SPOF in HDFS like having a Secondary NameNode, writing the NameNode meta data to multiple locations.

One thing to ponder if HDFS is so unreliable, we wouldn't have seen so many cluster using HDFS. Also, on top of HDFS other Apache frameworks are also being built.

Thursday, March 15, 2012

How easy is it to use Hadoop?

This article made me think how easy it is to setup Hadoop. Setting up Hadoop on a single/multiple nodes and running MR jobs is not a big challenge. But, getting it to run efficiently and securely is a completely different game. There are too many Hadoop parameters to be configured, some of which are not documented. Now, add the different hardware configurations, different Hadoop versions and lack of documentation to the mix.

All the Hadoop production clusters have a separate team who are very familiar with the Hadoop code and know it in and out. Hadoop is evolving at a very fast pace and it's like a moving target to keep updated with the changes in the Hadoop code. It's not just possible to download the Hadoop binaries and use as-is in production efficiently.

I believe in the potential of Hadoop and don't want to deter those who wanted to get started with Hadooop. This blog is all about to make it easy for those who are getting started with Hadoop. Things will change as more and more vendors are getting in the Hadoop space and as they are contributing code to Apache. But, for now Hadoop is a beast and there is and will be huge demand for Hadoop professionals.

Friday, March 9, 2012

Home partition under Ubuntu

As mentioned in the previous blog I was playing with LXDE as a replacement for Unity Desktop Environment for Ubuntu. For some reason (after installing/uninstalling some components of LXDE) the Notebook was not booting into either Unity or LXDE, reinstalling LXDE again was of no use. Luckily, I created a separate partition for home and it was a matter of just reinstalling Ubuntu 11.10 under root (/) from CD, updating it (sudo apt-get update;sudo apt-get upgrade) and installing the required softwares (sudo apt-get install ......). The data in the home partition was intact.

In the past I had experiences where Ubuntu upgrade was not smooth and had to re-install Ubuntu loosing all the data because home didn't exist in a separate partition. Ubuntu has a 6 months release cycle and it is recommended to freshly install Ubuntu instead of upgrading, along with having a separate partition for home.

I haven't tried out, but having a separate partition for home might also help to share the user data among multiple OS on a single machine.

Thursday, March 8, 2012

LXDE Desktop on Ubuntu 11.10

Ubuntu ships with Unity which was OK until I installed LXDE (Lightweight X11 Desktop Environment) using the following command

sudo apt-get install lxde

Although LXDE is a bit rough, it's way faster than Unity or GNome. On the same Notebook Unity took ~232 MB, while LXDE took ~140 MB of RAM. With a 4 GB RAM Notebook, a 72 MB saving of RAM won't make much of a difference. But, the snappiness of it makes me love and stick to LXDE.

LXDE site also claims that it requires less energy to perform tasks to other systems on the market, so I would be able to run on the battery for more time which is another plus. Noticed that LXDE boots ~ 20-25 seconds faster than Unity.

When Ubuntu introduced Unity, it took some time to get used to it, same is the case with LXDE also.

Thanks to howtogeek for the tip and the instructions.

Monday, February 20, 2012

Getting started with HBase Coprocoessors - EndPoints

EndPoints allow for deploying custom operations which are not provided by the core HBase and which is specific to an application. AggregateImplementation is an EndPoint which comes with HBase.

In HBase the table is split across multiple regions based on row key ranges. The jar file containing the EndPoint is deployed on all the region servers. The client ultimately needs to specify the regions on which the EndPoints need to be executed. Since, the client doesn't deal directly with the regions, the regions are indirectly specified by row key or row key ranges.

If a row key is specified then the EndPoint is executed on the region to which the row key belongs to. Alternatively, if a row key range is specified then the EndPoint is applied to all the regions on which the row key range belongs. The client needs to iterate the results from all the regions and consolidate them to get the final result.

HBase EndPoints are very similar to MapReduce. The EndPoint execution is similar to the map task and happens on the Region Server and close to the data. The client code iterates the results from all the regions and consolidates them which is similar to the reduce task. Since in most of the cases the EndPoint execution happens close to the data EndPoints are efficient.

1) Compile the following code and prepare a jar file out of it.

package coprocessor;
import java.io.IOException;

import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.ipc.CoprocessorProtocol;

public interface RowCountProtocol extends CoprocessorProtocol {
   
    long getRowCount() throws IOException;

    long getRowCount(Filter filter) throws IOException;

    long getKeyValueCount() throws IOException;

}

package coprocessor;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.coprocessor.BaseEndpointCoprocessor;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.regionserver.InternalScanner;

public class RowCountEndpoint extends BaseEndpointCoprocessor implements
        RowCountProtocol {
    private long getCount(Filter filter, boolean countKeyValues)
            throws IOException {
        Scan scan = new Scan();
        scan.setMaxVersions(1);
        if (filter != null) {
            scan.setFilter(filter);
        }
        RegionCoprocessorEnvironment environment = (RegionCoprocessorEnvironment) getEnvironment();
        // use an internal scanner to perform scanning.
        InternalScanner scanner = environment.getRegion().getScanner(scan);
        int result = 0;
        try {
            List<KeyValue> curVals = new ArrayList<KeyValue>();
            boolean done = false;
            do {
                curVals.clear();
                done = scanner.next(curVals);
                result += countKeyValues ? curVals.size() : 1;
            } while (done);
        } finally {
            scanner.close();
        }
        return result;
    }

    @Override
    public long getRowCount() throws IOException {
        return getRowCount(new FirstKeyOnlyFilter());
    }

    @Override
    public long getRowCount(Filter filter) throws IOException {
        return getCount(filter, false);
    }

    @Override
    public long getKeyValueCount() throws IOException {
        return getCount(null, true);
    }
}

2) Modify the hbase-env.sh file on all the Region Server to include the jar file created earlier containing the coprocessor code.

export HBASE_CLASSPATH="/home/praveensripati/Installations/hbase-0.92.0/lib/coprocessor.jar"

3) Modify the hbase-site.xml to include the class name of the Endpoint on all the Region Servers.

    <property>
        <name>hbase.coprocessor.region.classes</name>
        <value>coprocessor.RowCountEndpoint</value>
    </property>

4) Restart the HBase cluster.

5) Create a 'testtable' table and populate the data in it. The table will be deployed in 5 regions based on row keys. The Endpoint will execute on muliple regions based on the input row keys and send the result to the client.

create 'testtable', 'colfam1', { SPLITS => ['row-300', 'row-500', 'row-700' , 'row-900'] }

for i in '0'..'9' do for j in '0'..'9' do \
for k in '0'..'9' do put 'testtable', "row-#{i}#{j}#{k}", \
"colfam1:#{j}#{k}", "#{j}#{k}" end end end

6) Execute the following on the client to get the count of the the number of rows in the 'testtable'

package coprocessor;

import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.coprocessor.Batch;
import org.apache.hadoop.hbase.util.Bytes;

public class EndpointExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "testtable");
        try {
            Map<byte[], Long> results = table.coprocessorExec(
                    RowCountProtocol.class, null, null,
                    new Batch.Call<RowCountProtocol, Long>() {
                        @Override
                        public Long call(RowCountProtocol counter)
                                throws IOException {
                            return counter.getRowCount();
                        }
                    });

            long total = 0;
            for (Map.Entry<byte[], Long> entry : results.entrySet()) {
                total += entry.getValue().longValue();
                System.out.println("Region: " + Bytes.toString(entry.getKey())
                        + ", Count: " + entry.getValue());
            }
            System.out.println("Total Count: " + total);
        } catch (Throwable throwable) {
            throwable.printStackTrace();
        }
    }
}

7) Here is output of the above program.

Region: testtable,,1329653922153.d88dbec04c8b3093bd256a1e70c5bfe6., Count: 300
Region: testtable,row-300,1329653922157.2431482c120bb0c5939688ef764e3137., Count: 200
Region: testtable,row-500,1329653922157.c8843b18b612d4d8632135d7b8aff0c3., Count: 200
Region: testtable,row-700,1329653922157.abc2ceba898d196334d9561d8eddc431., Count: 200
Region: testtable,row-900,1329653922157.42cfb7cf277782c5cbeba1cc9d3874af., Count: 100
Total Count: 1000

Here is the output from the HBase shell.

hbase(main):006:0> count 'testtable'
Current count: 1000, row: row-999                                                                                          
1000 row(s) in 0.4060 seconds

Note that the output from the EndPoint and the HBase shell is the same.

The above example of EndPoint tries to convey a simple scenario, but more complex scenarios can be built. Also observers which was discussed in the earlier blog can be integrated to build even more complex scenarios.

Edit (10th February, 2013) : Coprocessors can also be deployed dynamically without restaring the cluster to avoid any downtime. Check the `Coprocessor Deployment` section here for more details.

Pages