Monday, July 30, 2012

Democratisation of higher education

Coursera has been picking up steam for some time now by offering free online courses from premier institutes on a wide range of topics. Recently I found out that the IITs have been doing something similar by posting videos of some of their lectures on YouTube across many disciplines (Core Sciences, Civil Engineering, Computer Science and Engineering, Electrical Engineering, Electronics and Communication Engineering, Mechanical Engineering).

I went through the `Introduction to ANN` course on the IIT YouTube channel by Prof. S Sengupta and found it very useful. Here are some of the courses related to Machine Learning. The videos are a bit too long (~40 videos of about an hour each per topic), but are very good for gaining in-depth knowledge.

Design & Analysis of Algorithms
Artificial Intelligence
Neural Network and Applications
Natural Language Processing
Graph Theory
Regression Analysis
Probability and Statistics
Digital Signal Processing

Although the IITs have been doing this for close to four years and cover a lot more topics than both Coursera and Udacity, they haven't received much attention. The oldest IIT video I could find is from around December 2007, well before Coursera and the others started doing something similar. A couple of things missing from the IIT effort for wide adoption are

- Publicity of the IIT YouTube channel
- Ability to interact with fellow students and with the lecturers
- Online grading system

none of which, I think, would be very difficult to add on top of what already exists. IIT definitely seems to have the interest and motivation to democratize education, since new videos are being uploaded on a regular basis. More effort has to be put into the above-mentioned items; one way is to get the help of the IIT students for the same.

Great work from IIT !!! Now all someone needs is a computer with a good internet connection and a lot of motivation to learn.

Wednesday, July 25, 2012

Beyond Hadoop - Machine Learning

Once data has been stored in Hadoop, the next logical step is to extract useful information/patterns on which some action can be taken, and to make machines learn from the vast amounts of data. Storing and retrieving raw data is not of much use by itself. Frameworks like Apache Mahout, Weka, R, and ECL-ML implement a lot of Machine Learning algorithms. Though Machine Learning is not new, it has been picking up lately because vast amounts of data can be stored easily and processing power is getting cheaper. Here are some nice articles on the same.

Machine Learning makes it possible for Google Picasa to identify faces in pictures, for Gmail to identify spam in mails, for LinkedIn to recommend friends, for Amazon to recommend books, for search engines to show relevant information, and for a lot of other useful things.

I have included a new page for `Machine Learning`, which I will keep updated with useful and interesting articles/books/blogs/tutorials and other information for those who are getting started with Machine Learning. I will also be blogging about `Machine Learning` here more frequently.

I am starting with the Mahout in Action book and this Coursera tutorial.

Wednesday, July 18, 2012

How to manage multiple passwords?

After all the media attention around the LinkedIn password leak and others, I am not sure why some services (this and this) store the actual password in their databases and send it through email or SMS upon a `forgot password` request. Hashing a password without a salt is bad enough; storing the actual password is even worse.
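To illustrate the difference, here is a minimal sketch, not tied to any particular service, of how a password can be hashed with a random per-user salt before storage (the function names are mine, purely for illustration):

```python
import hashlib
import hmac
import os

def hash_password(password):
    """Hash a password with a fresh random salt; store both the salt and the digest."""
    salt = os.urandom(16)  # a new random salt for each user
    digest = hashlib.sha256(salt + password.encode("utf-8")).digest()
    return salt, digest

def verify_password(password, salt, digest):
    """Recompute the salted hash and compare it with the stored digest."""
    candidate = hashlib.sha256(salt + password.encode("utf-8")).digest()
    return hmac.compare_digest(candidate, digest)

# Two users with the same password still get different digests,
# so precomputed (rainbow) tables are useless against the database.
salt1, digest1 = hash_password("secret")
salt2, digest2 = hash_password("secret")
assert digest1 != digest2
assert verify_password("secret", salt1, digest1)
assert not verify_password("wrong", salt1, digest1)
```

In practice a slow key-derivation function such as PBKDF2 (`hashlib.pbkdf2_hmac`) or bcrypt is preferable to a plain SHA-256, but even a salted hash avoids the worst of the problems above.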

So, I have started using the `KeePassX` password manager for randomly generating and storing all my passwords. And, I use a strong password for KeePassX itself.

Ubuntu doesn't come with KeePassX installed; `sudo apt-get install keepassx` will install KeePassX with all its dependencies. Here is a nice tutorial on using KeePassX. The password database is stored in Dropbox, so that if I ever lose the machine it would still be possible to get back the passwords.

GIMP 2.8 Review

GIMP (GNU Image Manipulation Program) is software I have been using for image manipulation and am comfortable with. I haven't used Adobe Photoshop, but GIMP covers all my requirements for this blog and for general-purpose photo processing. Also, there are a lot of nice tutorials on GIMP.

Here are some nice glossy buttons I created with GIMP using this tutorial. The banner for this blog was also created using GIMP.

The Ubuntu repository contains GIMP 2.6, which is fine for all the basic requirements, but one thing that really bugs me is the `Multi-Window Mode`: the windows keep floating around, and it's difficult to manage multiple windows.

GIMP 2.8 has a nice feature called `Single-Window Mode`, among other features, which makes everything much easier to manage within a single window.

The commands below will remove GIMP from Ubuntu (if installed) and install the latest version of GIMP (2.8), which has the `Single-Window Mode` feature.

sudo apt-get purge gimp*
sudo add-apt-repository ppa:otto-kesselgulasch/gimp
sudo apt-get update
sudo apt-get install gimp

`The Artist's Guide to GIMP` is a book I would recommend for those who are starting with GIMP.

Have a nice time with GIMP !!!

Thursday, July 12, 2012

Downloading files from YouTube in Ubuntu

There are a lot of nice videos on YouTube, from tops for kids to machine learning. Some of these videos are so interesting that I feel like viewing them again and again. When you notice this pattern, it's better to download the videos. Not only does this allow offline viewing, it also saves bandwidth. A bandwidth cap makes this even more useful.

`youtube-dl` is a very useful command for downloading files from YouTube in Ubuntu. `youtube-dl` has got a lot of nice options; here are some of the options I use

youtube-dl -c -t -f 5 --batch-file=files.txt

-c -> resume a partially downloaded file
-t -> use the title of the video in the name of the downloaded file
-f -> specify the video format (quality) in which to download the video
--batch-file -> specify the name of a file containing the URLs of videos to download in batch mode; the file must contain one URL per line

One thing to note is that a YouTube video can be downloaded in a lot of formats (`-f` option; see the man page for `youtube-dl` for more details). The `-f 5` option picks a format with a smaller download size and decent quality, which also plays in VLC on Ubuntu.
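For reference, the `--batch-file` input is just a plain text file with one video URL per line. A hypothetical files.txt might look like this (the URLs below are placeholders, not real videos):

```text
http://www.youtube.com/watch?v=PLACEHOLDER_1
http://www.youtube.com/watch?v=PLACEHOLDER_2
```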

Installing `youtube-dl` on Ubuntu is pretty straightforward: `sudo apt-get install youtube-dl` is sufficient. There is a non-Linux version of the same, but I haven't tried it out. The Ubuntu version of `youtube-dl` is really cool.

Sunday, July 8, 2012

Is Hadoop a square peg in a round hole?

There was an article on GigaOm about Hadoop's days being numbered. I agree with some of the author's points and disagree with others.

Because of the HYPE, many are doing something or other around Hadoop, and so the ecosystem, support (commercial support, forums, etc.), production use, and documentation are huge. Even so, trying to fit everything into Hadoop is not the right solution. Alternate paradigms have to be considered while architecting a system, based on the requirements.

In the context of graph processing with Pregel, the author mentions

At the time of writing, the only viable option in the open source world is Giraph, an early Apache incubator project that leverages HDFS and Zookeeper. There’s another project called Golden Orb available on GitHub.

Besides Apache Giraph, there is also Apache Hama for graph processing based on Pregel. Also, both Apache Giraph and Hama have moved from the Incubator to become Apache TLPs (Top Level Projects). While Giraph can only be used for graph processing, Hama is a pure BSP engine that can be used for a lot of other things besides graph processing. That said, there was a blog entry mentioning that Giraph can be used to process other models as well.

Then there are also GraphLab and Golden Orb. While there has been some work going on in GraphLab, Golden Orb has been dormant for more than a year.

For those interested, here is a paper comparing MapReduce with Bulk Synchronous Parallel (BSP). The paper states that MR algorithms can be implemented in BSP and vice versa, but some algorithms can be implemented more effectively in BSP and some in MR.

Once again, I would like to reiterate: consider alternate paradigms/frameworks besides Hadoop/MapReduce while architecting a solution around big data.

Tuesday, July 3, 2012

Improving crop output using big data

This recent article, `India to launch $75m mission to forecast rains`, got my attention. The meteorological department has been doing a very poor job of forecasting the weather. Forget about hourly/daily forecasts; even the forecast for the complete season was of no use to the farmers who depend on the rains for their crop.

From the above mentioned article `Last year they predicted a bad monsoon, but in the end the rains turned out to be in excess of what was forecast.`

So this made me think: why not invest a small fraction of the $75m to fund a competition on Kaggle to forecast the weather for the next monsoon, using all the data available with the meteorological department? To my knowledge, there shouldn't be any problem with sharing the data publicly.

If someone has contacts in the meteorological department, please pass this information on to them.