Monday, October 7, 2013

Installation and configuration of Apache Oozie

It is common to need to run a group of dependent data processing jobs, and some of them may have to run at regular intervals. This is where Apache Oozie fits into the picture. Here are some nice articles (1, 2, 3, 4) on how to use Oozie.

Apache Oozie has three components: a workflow engine to run a DAG of actions, a coordinator (similar to a cron job or a scheduler), and a bundle to batch a group of coordinators. Azkaban from LinkedIn is similar to Oozie; here are the articles (1, 2) comparing the two.

Installing and configuring Oozie is not straightforward, not only because of the documentation, but also because the release includes only the source code and not the binaries. The code has to be downloaded, the dependencies installed, and then the binaries built. It is a somewhat tedious process, hence this blog post, which assumes that Hadoop has already been installed and configured. Here is the official documentation on how to build and install Oozie.
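
Since the build depends on several external tools, it can save time to confirm that they are on the PATH before starting. The snippet below is only a minimal sketch: the function names require_cmd and check_build_tools are made up for this post, and it checks only that each tool is present, not its version.

```shell
#!/bin/sh
# Sketch only: require_cmd and check_build_tools are hypothetical helper
# names, not part of Oozie or Hadoop. Versions are NOT checked here.

# require_cmd NAME: succeed only if NAME is found on the PATH.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "required tool not found: $1" >&2
    return 1
}

# check_build_tools NAME...: report every missing tool, then fail if any
# of them was missing.
check_build_tools() {
    missing=0
    for tool in "$@"; do
        require_cmd "$tool" || missing=1
    done
    return "$missing"
}
```

It could be called as check_build_tools java mvn hadoop pig; the versions still have to be compared by hand against the requirements of the Oozie release being built.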

So, here are the steps to install and configure Oozie:

- Make sure the build requirements are met: a Unix box (tested on Mac OS X and Linux), Java JDK 1.6+, Maven 3.0.1+, Hadoop 0.20.2+ and Pig 0.7+.

- Download a release containing the code from the Apache Oozie site and extract the source code.
- Execute the command below to start the build. The build downloads the dependency jars, so it might take some time depending on the network bandwidth. Make sure that the build finishes without errors.
bin/mkdistro.sh -DskipTests
- Once the build is complete, the binary file oozie-4.0.0.tar.gz should be present in the folder where the Oozie code was extracted (the end of the build output reports the exact location). Extract the tar.gz file; this will create a folder called oozie-4.0.0.

- Create a libext/ folder inside oozie-4.0.0 and copy the commons-configuration-*.jar, ext-2.2.zip, hadoop-client-*.jar and hadoop-core-*.jar files into it. The Hadoop jars need to be copied from the Hadoop installation folder.

If commons-configuration-*.jar is missing from the libext/ folder, the exception below is seen in the catalina.out log file when Oozie is started. This is the reason for including that jar.
java.lang.NoClassDefFoundError: org/apache/commons/configuration/Configuration
        at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<init>(DefaultMetricsSystem.java:37)
        at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<clinit>(DefaultMetricsSystem.java:34)
        at org.apache.hadoop.security.UgiInstrumentation.create(UgiInstrumentation.java:51)
        at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:217) 
- Prepare a war file using the command below. Afterwards, the oozie.war file should be present in the oozie-4.0.0/oozie-server/webapps folder.
bin/oozie-setup.sh prepare-war
- Create the Oozie schema using the command below.
bin/ooziedb.sh create -sqlfile oozie.sql -run

- Now start the Oozie service, which runs in Tomcat.
bin/oozied.sh start

- Check the Oozie log file logs/oozie.log to ensure that Oozie started properly. Then run the command below to check the status of Oozie, or go to the Oozie console at http://localhost:11000/oozie instead.
bin/oozie admin -oozie http://localhost:11000/oozie -status

- Now install the Oozie client by extracting oozie-client-4.0.0.tar.gz. This will create a folder called oozie-client-4.0.0.
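
The libext/ step in the list above is the one most likely to go wrong by hand, so it can be worth scripting. The following is a minimal sketch under a few assumptions: the function name populate_libext and its three arguments are made up for this post, and the jars are located by searching the Hadoop folder because their exact location varies between Hadoop versions.

```shell
#!/bin/sh
# Sketch only: populate_libext is a hypothetical helper, not an Oozie tool.
# Usage: populate_libext OOZIE_HOME HADOOP_HOME EXTJS_ZIP
populate_libext() {
    oozie_home=$1
    hadoop_home=$2
    extjs_zip=$3

    mkdir -p "$oozie_home/libext" || return 1

    # ext-2.2.zip is not shipped with Hadoop, so its path is passed in.
    cp "$extjs_zip" "$oozie_home/libext/" || return 1

    for pattern in 'commons-configuration-*.jar' \
                   'hadoop-client-*.jar' \
                   'hadoop-core-*.jar'; do
        # Search the Hadoop folder instead of hard-coding where each jar
        # lives; the first match is used.
        found=$(find "$hadoop_home" -type f -name "$pattern" | head -n 1)
        if [ -z "$found" ]; then
            echo "missing $pattern under $hadoop_home" >&2
            return 1
        fi
        cp "$found" "$oozie_home/libext/" || return 1
    done
}
```

The function stops at the first jar it cannot find, which is easier to notice than the NoClassDefFoundError that otherwise shows up only after Oozie is started.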

With the Oozie service running and the Oozie client installed, it is time to run some simple workflows to make sure that Oozie works fine. Oozie comes with a bunch of examples in oozie-examples.tar.gz. Here are the steps:

- Extract oozie-examples.tar.gz and, in all the job.properties files, change the port number on which the NameNode listens (the Oozie examples use 8020, while a typical single-node Hadoop setup uses 9000). Similarly, change the JobTracker port number (the examples use 8021, while a typical Hadoop setup uses 9001). The ports must match the values of fs.default.name and mapred.job.tracker in the Hadoop configuration.

- In the Hadoop installation, add the properties below to the conf/core-site.xml file (training here is the user running the Oozie server). Check the Oozie documentation for more information on what these parameters mean.
     <property>
          <name>hadoop.proxyuser.training.hosts</name>
          <value>localhost</value>
     </property>
     <property>
          <name>hadoop.proxyuser.training.groups</name>
          <value>training</value>
     </property>
- Make sure that HDFS and MapReduce are started and running properly.

- Copy the examples folder into HDFS using the command below.
bin/hadoop fs -put /home/training/tmp/examples/ examples/

- Now run the Oozie example as follows.
oozie job -oozie http://localhost:11000/oozie -config /home/training/tmp/examples/apps/map-reduce/job.properties -run
- The status of the job can be checked using the command below. Replace the job ID with the one printed when the job was submitted.
oozie job -oozie http://localhost:11000/oozie -info 14-20090525161321-oozie-tucu
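
Editing the ports in every job.properties file by hand is error prone, so the first step in the list above can be automated. This is a minimal sketch with some assumptions: the function name set_example_ports is made up for this post, and it assumes the files contain lines of the form nameNode=hdfs://host:port and jobTracker=host:port, which should be verified against the extracted examples. It writes through a temporary file instead of using sed -i, since that option behaves differently on Mac OS X and Linux.

```shell
#!/bin/sh
# Sketch only: set_example_ports is a hypothetical helper, not an Oozie tool.
# Assumes job.properties lines like:
#   nameNode=hdfs://localhost:8020
#   jobTracker=localhost:8021
# Usage: set_example_ports EXAMPLES_DIR NAMENODE_PORT JOBTRACKER_PORT
set_example_ports() {
    examples_dir=$1
    namenode_port=$2
    jobtracker_port=$3

    find "$examples_dir" -type f -name job.properties |
    while IFS= read -r file; do
        # Replace only the trailing :port on the two matching lines and
        # leave every other line untouched.
        sed -e "s/^\(nameNode=.*\):[0-9][0-9]*\$/\1:$namenode_port/" \
            -e "s/^\(jobTracker=.*\):[0-9][0-9]*\$/\1:$jobtracker_port/" \
            "$file" > "$file.tmp" && mv "$file.tmp" "$file"
    done
}
```

For the setup in this post it would be called as set_example_ports /home/training/tmp/examples 9000 9001, before the examples folder is copied into HDFS.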

In upcoming posts, we will see how to write some simple workflows and schedule tasks in Oozie.

5 comments:

  1. Does the above configuration work for oozie-3.3.2?

  2. Can anyone solve this problem?

    hduser@ubuntu:~/oozie/distro/target/oozie-3.3.2-distro/oozie-3.3.2$ bin/oozie job -oozie http://localhost:11000/oozie -config examples/apps/map-reduce/job.properties -run
    Error: E0902 : E0902: Exception occured: [Call to localhost/127.0.0.1:8020 failed on local exception: java.io.IOException: Broken pipe]

  3. Nice work! Thanks for creating this ! :-)

  4. This comment has been removed by the author.

  5. Dear Praveen,

    I have been trying to install Oozie with Apache Hadoop version 1.2.1 on CentOS 6.4.
    Maven 3.2.1 has been installed. While trying to build a distribution with the command below, I got the following error:

    mkdistro.sh -e -DskipTests

    [WARNING] Some problems were encountered while building the effective model for org.apache.oozie:oozie-main:pom:3.3.2
    [WARNING] 'build.plugins.plugin.version' for com.atlassian.maven.plugins:maven-clover2-plugin is missing. @ line 742, column 21
    [WARNING] 'build.plugins.plugin.version' for org.codehaus.mojo:findbugs-maven-plugin is missing. @ line 751, column 21

    In {oozie_base_dir}/pom.xml, the versions of the above plugins were not explicitly declared; only the GroupID and ArtifactID were given. Also, the repository mentioned there, "https://repository.cloudera.com/cloudera/ext-release-local/", did not contain the above two plugins.

    I also referred to the links below for more information.

    https://cwiki.apache.org/confluence/display/MAVEN/PluginVersionResolutionException
    http://stackoverflow.com/questions/15213801/pluginversionresolutionexception-in-maven

    Explicitly declaring the plugin versions in the pom file also did not work in my case, as the above-mentioned repository no longer had the two plugins.

    Then I found the two plugins on search.maven.org, as listed below. Maven also refers to {home_dir}/.m2/ whenever it cannot fetch files from the specified repositories.

    org.codehaus.mojo http://search.maven.org/#browse|820238317
    com.atlassian.maven.plugins http://search.maven.org/#browse|673055776

    I request your kind help in guiding me further. Thanks in advance.
