Friday, February 8, 2013

Caching Proxy - Installation and Configuration

Setting up a Hadoop cluster is fairly easy with a bit of familiarity with system and network administration. It's all interesting; the only frustrating part is downloading the patches after installing the OS and then the packages for the software that runs on top of it. The downloads can add up to close to a GB, which might take anywhere from a couple of minutes to hours depending on the internet bandwidth.

Here is where caching tools really help. They cache the downloaded packages on a designated local machine (let's call it the cache server), and the other machines point to the cache server to get the packages. This way the packages are downloaded from the internet only the first time, and from then on they are served from the local cache server. This approach not only saves network bandwidth, but also makes the whole installation process faster.
For Debian systems, apt-cacher-ng is designed to cache the packages and is really easy to install and configure. Here are the steps involved:

a) On the cache machine, install apt-cacher-ng using the below command. Root privileges are required to run the command.
sudo apt-get install apt-cacher-ng
b) All the other machines in the local network have to point to the cache server using the below command, where `cacheserver` has to be replaced with the appropriate host name/IP of the cache machine. Only tee needs root privileges here, since it is the one writing the file.
echo 'Acquire::http { Proxy "http://cacheserver:3142"; };' | sudo tee /etc/apt/apt.conf.d/01apt-cacher-ng-proxy
It's as easy as the above two steps to set up a cache server for a Debian system.
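With more than a couple of client machines, step b) is worth wrapping in a small script. Below is a minimal sketch of that idea, not a definitive implementation. The function names and the optional directory argument are my own for illustration; the file name and the default port 3142 are the ones used above.

```shell
#!/usr/bin/env bash
# Sketch: generate the apt proxy snippet for a given cache server.

# Print the Acquire::http proxy line for a host and an optional port.
# The port defaults to 3142, the apt-cacher-ng port used in this post.
apt_proxy_line() {
    local host="$1"
    local port="${2:-3142}"
    printf 'Acquire::http { Proxy "http://%s:%s"; };\n' "$host" "$port"
}

# Write the snippet into an apt configuration directory.
# The second argument is a hypothetical override so the function can be
# tried against a scratch directory before touching /etc/apt/apt.conf.d.
write_apt_proxy() {
    local host="$1"
    local conf_dir="${2:-/etc/apt/apt.conf.d}"
    apt_proxy_line "$host" > "$conf_dir/01apt-cacher-ng-proxy"
}

# Example (on a client, as root):
#   write_apt_proxy cacheserver
#   apt-config dump | grep -i proxy   # should list the proxy setting
```

Writing to a scratch directory first makes it easy to check the generated line before it is picked up by apt.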

For an rpm based system it's a bit more complicated. squid should be installed on either a Debian or an rpm based system, and the other systems will fetch the packages through squid. Below are the instructions for installing squid on a Debian based system.

a) On the cache machine, install squid using the below command for a Debian based system. Again, root privileges are required to run the command.
sudo apt-get install squid3
b) Uncomment the below line in the /etc/squid3/squid.conf file; by default squid uses memory based caching. With the following setting, the packages will be stored in the /var/spool/squid3 directory. The 100 is the cache size in MB, so it may need to be increased to hold all the packages.
cache_dir ufs /var/spool/squid3 100 16 256
Uncomment the below line
http_access deny all
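and add the below lines above it to enable access to the squid server from the other machines (squid applies the first http_access rule that matches, so the allow rule has to come before the deny). Based on the network settings/configuration, the IP addresses have to be chosen appropriately.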
acl allcomputers src
http_access allow allcomputers
c) Add the below to .bashrc for the proxy to take effect for all the applications.
export http_proxy=http://cacheserver:3128
export ftp_proxy=http://cacheserver:3128
or add the below to /etc/yum.conf for the proxy to work only with yum, which is used for installing packages on rpm based systems. Here is the documentation for the same.
# The proxy server - proxy server:port number
# The account details for yum connections
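To make the squid and yum settings a bit more concrete, here is a rough sketch that prints them for a given network and cache host. Everything specific in it is an assumption on my part: the function names are made up, 192.168.1.0/24 is only a placeholder network, and the yum line relies on the `proxy` option of yum.conf with the squid port 3128 from above.

```shell
#!/usr/bin/env bash
# Sketch: print the settings described above for a given network and cache host.

# Print the squid.conf access lines for a client network in CIDR form.
# These have to appear before "http_access deny all" in squid.conf,
# because squid applies the first http_access rule that matches.
squid_acl_lines() {
    local network="$1"
    printf 'acl allcomputers src %s\n' "$network"
    printf 'http_access allow allcomputers\n'
}

# Print the proxy line for the [main] section of /etc/yum.conf.
# The port defaults to 3128, the squid port used in this post.
yum_proxy_line() {
    local host="$1"
    local port="${2:-3128}"
    printf 'proxy=http://%s:%s\n' "$host" "$port"
}

# Example (192.168.1.0/24 is a placeholder network, not a recommendation):
#   squid_acl_lines 192.168.1.0/24
#   yum_proxy_line cacheserver
```

The printed lines still have to be added to the respective files by hand, and squid has to reload its configuration before the new access rule takes effect.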
Make sure the firewall has been disabled or the appropriate port has been opened on the cache server: 3142 for apt-cacher-ng and 3128 for squid. gufw is a graphical front-end to the firewall (ufw, which in turn manages iptables) in Ubuntu.
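When a client cannot fetch packages, the first thing worth checking is whether the port on the cache server is reachable at all. The below is a small sketch for that, run from a client machine. The function names are hypothetical, and it assumes bash (for its /dev/tcp pseudo-device) and the timeout command from coreutils are available.

```shell
#!/usr/bin/env bash
# Sketch: check from a client whether the cache server's port is reachable.

# Map a cache service name to the port used in this post.
cache_port() {
    case "$1" in
        apt-cacher-ng) echo 3142 ;;
        squid)         echo 3128 ;;
        *)             return 1 ;;
    esac
}

# Return 0 if a TCP connection to host:port succeeds within 3 seconds.
# Relies on bash's /dev/tcp pseudo-device and coreutils' timeout.
port_open() {
    local host="$1"
    local port="$2"
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Report the status of one cache service on the given host.
check_cache() {
    local host="$1"
    local service="$2"
    local port
    port="$(cache_port "$service")" || { echo "unknown service: $service"; return 2; }
    if port_open "$host" "$port"; then
        echo "$service on $host:$port is reachable"
    else
        echo "$service on $host:$port is NOT reachable (firewall or service down?)"
        return 1
    fi
}

# Example:
#   check_cache cacheserver apt-cacher-ng
#   check_cache cacheserver squid
# If ufw is the firewall in use, the ports can be opened on the cache server with:
#   sudo ufw allow 3142/tcp
#   sudo ufw allow 3128/tcp
```

A failed check does not say whether the firewall or the service itself is the problem, but it narrows things down before digging into the apt or yum configuration.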

For the past couple of days, I have been working on setting up a two node Cloudera CDH cluster on a laptop using virtualization. It's not super fast, but decent enough to try some basic commands when access to a cluster is not available. I will follow up with a blog post on the process.

The more time I spend with Ubuntu/Linux, the more I like it. It is very customizable and can be easily tweaked for performance.

Happy Hadooping !!!
