Monday, October 17, 2016

Getting started with Apache Spark

With so much noise around Apache Spark, let's look at how to get started with Spark in local mode and execute a simple Scala program. Many more complex setups are possible, but here we will cover only the minimum steps required to get started.

Most Big Data software is developed with Linux as the platform, and porting to Windows has been an afterthought. It will be interesting to see how Big Data on Windows morphs in the future. Spark can run on both Windows and Linux, but here we will use Linux (Ubuntu 14.04 64-bit Desktop). So, here are the steps:

1) Download and install Oracle VirtualBox as mentioned here.

2) Download and install Ubuntu as mentioned here as a guest OS.

3) Update Ubuntu with the latest patches from a terminal and reboot it.
sudo apt-get update; sudo apt-get dist-upgrade
4) Oracle Java doesn't come with Linux distributions, so it has to be installed manually on top of Ubuntu as mentioned here.

5) Spark has been developed in Scala, so we need to install Scala.
sudo apt-get install scala
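With Java and Scala in place, both installations can be verified from the terminal:
java -version
scala -version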
6) Download Spark from here and extract it. Spark built with Hadoop 1.x or 2.x will work, because HDFS is not being used in this context.
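For reference, fetching and extracting a prebuilt package from the Apache archive looks roughly like this (spark-1.6.2-bin-hadoop2.6 is only an example; use whichever release you actually downloaded):
wget https://archive.apache.org/dist/spark/spark-1.6.2/spark-1.6.2-bin-hadoop2.6.tgz
tar -xzf spark-1.6.2-bin-hadoop2.6.tgz
cd spark-1.6.2-bin-hadoop2.6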

7) From the Spark installation folder, start the Spark shell.
bin/spark-shell
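By default the shell runs with a local master; the number of worker threads can also be set explicitly, for example to use two cores:
bin/spark-shell --master local[2]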
8) Execute the commands below in the shell to load README.md and count the number of lines in it.
val textFile = sc.textFile("README.md")

textFile.count()
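Since textFile is a regular RDD, further transformations can be chained on it in the same way; for example, to count only the lines that mention Spark:
val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark.count()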
What we have done is install Ubuntu as a guest OS, install Spark on it, and finally run a simple Scala program in Spark local mode. There are much more advanced setups, like running Spark programs against data in HDFS, or running Spark in standalone, Mesos, and YARN modes. We will look at them in future blogs.
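For those who prefer a self-contained program over the shell, the same line count can be written as a small standalone Scala application and run in local mode. The sketch below uses the SparkContext API; the object name and file path are just illustrative:

import org.apache.spark.{SparkConf, SparkContext}

object LineCount {
  def main(args: Array[String]): Unit = {
    // Run locally, using as many worker threads as there are cores
    val conf = new SparkConf().setAppName("LineCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Load README.md from the current working directory and count its lines
    val textFile = sc.textFile("README.md")
    println(s"Number of lines: ${textFile.count()}")

    sc.stop()
  }
}

Once packaged into a jar (with sbt, for instance), it can be launched with bin/spark-submit --class LineCount followed by the path to the jar.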
