Tuesday, November 12, 2013

Creating a simple coordinator/scheduler using Apache Oozie

Assuming that Oozie has been installed and configured as mentioned here, and that a simple workflow can be executed as mentioned here, it's time to look at how to schedule the workflow at regular intervals using Oozie.

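- Create the coordinator.properties file (oozie-clickstream-examples/apps/scheduler/coordinator.properties). The Oozie client reads this file from the local file system through the -config option of the submit command, so a copy in HDFS is optional.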
nameNode=hdfs://localhost:9000
jobTracker=localhost:9001
queueName=default

examplesRoot=oozie-clickstream-examples
examplesRootDir=/user/${user.name}/${examplesRoot}

oozie.use.system.libpath=true
oozie.coord.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/scheduler
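
To see which HDFS paths the properties above end up pointing to, here is a minimal Python sketch that expands the `${...}` references by hand. It only imitates the substitution and is not how Oozie implements it. The value `vm4learning` for `user.name` is an assumption taken from the local path used in the submit command, and the `resolve` helper is a made-up name for illustration.

```python
import re

# Properties copied from coordinator.properties above.
PROPS = {
    "nameNode": "hdfs://localhost:9000",
    "examplesRoot": "oozie-clickstream-examples",
    "examplesRootDir": "/user/${user.name}/${examplesRoot}",
    "oozie.coord.application.path":
        "${nameNode}/user/${user.name}/${examplesRoot}/apps/scheduler",
}

# Assumption: user.name is the name of the submitting user; "vm4learning"
# is inferred from the local path in the submit command, not stated as such.
SYSTEM_PROPS = {"user.name": "vm4learning"}

_VAR = re.compile(r"\$\{([^}]+)\}")


def resolve(value, props, depth=0):
    """Expand ${name} references recursively; unknown names are left as-is."""
    if depth > 10:
        raise ValueError("too many nested substitutions")

    def replace(match):
        name = match.group(1)
        if name in props:
            return resolve(props[name], props, depth + 1)
        return match.group(0)

    return _VAR.sub(replace, value)


all_props = {**SYSTEM_PROPS, **PROPS}
app_path = resolve(PROPS["oozie.coord.application.path"], all_props)
examples_dir = resolve(PROPS["examplesRootDir"], all_props)
print(app_path)
print(examples_dir)
```

Under these assumptions the coordinator application path resolves to a directory below /user/vm4learning in HDFS, which is where the coordinator.xml from the next step has to be placed.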
- Create the coordinator.xml file in HDFS (oozie-clickstream-examples/apps/scheduler/coordinator.xml). The coordinator triggers the workflow every 10 minutes (the frequency is given in minutes) between the specified start and end times.
<coordinator-app name="wf_scheduler" frequency="10" start="2013-10-24T22:08Z" end="2013-10-24T22:12Z" timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
   <action>
      <workflow>
         <app-path>${nameNode}${examplesRootDir}/apps/cs</app-path>
      </workflow>
   </action>
</coordinator-app> 
Note that an Oozie coordinator can be time based or event based, and can be much more complex than the one shown here. Here are the specifications for the Oozie coordinator.
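
As a rough way to check how many runs a given start, end and frequency produce, the following Python sketch lists the expected trigger times. It is only an approximation of the coordinator's behaviour: it assumes that the first action is created at the start time, that later ones follow at fixed intervals, and that the end time itself is excluded. The `nominal_times` function is a hypothetical helper, not part of Oozie.

```python
from datetime import datetime, timedelta, timezone

# Same layout as the start/end attributes in coordinator.xml.
OOZIE_FMT = "%Y-%m-%dT%H:%MZ"


def parse_oozie_time(text):
    """Parse a UTC timestamp such as 2013-10-24T22:08Z."""
    return datetime.strptime(text, OOZIE_FMT).replace(tzinfo=timezone.utc)


def nominal_times(start, end, frequency_minutes):
    """List trigger times from start (inclusive) to end (assumed exclusive)."""
    current = parse_oozie_time(start)
    stop = parse_oozie_time(end)
    step = timedelta(minutes=frequency_minutes)
    times = []
    while current < stop:
        times.append(current.strftime(OOZIE_FMT))
        current += step
    return times


# Values from the coordinator.xml above.
print(nominal_times("2013-10-24T22:08Z", "2013-10-24T22:12Z", 10))
```

With the values used in this post the window is only 4 minutes wide, so under these assumptions the workflow is triggered just once, at 22:08. A wider window, for example an end time of 22:40, would give four runs.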

- For some reason Oozie is running 1 hour behind the system time. This can be observed by comparing the `Created` time of the latest submitted workflow with the system time at the top right of the screen below. So, the start and end times in the previous step had to be set an hour earlier than the actual times.
Not sure why this happens, but I will update the blog if the cause is found or someone explains it in the comments.
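
One possible explanation, raised in the comments, is that Oozie expects the times in UTC. If that is the case, the start and end values can be derived from the local time instead of being adjusted by hand. The Python sketch below shows the conversion; the UTC+1 offset is only an assumption chosen to match the 1 hour difference seen here, and `to_oozie_utc` is a made-up helper name.

```python
from datetime import datetime, timedelta, timezone


def to_oozie_utc(local_time, utc_offset_hours):
    """Convert 'YYYY-MM-DD HH:MM' local time to the coordinator's UTC format."""
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    aware = datetime.strptime(local_time, "%Y-%m-%d %H:%M").replace(tzinfo=local_zone)
    return aware.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%MZ")


# Assumption: the machine's clock is at UTC+1 (hypothetical offset).
print(to_oozie_utc("2013-10-24 23:08", 1))
```

For a machine at UTC+1, a local time of 23:08 becomes 2013-10-24T22:08Z, which is the kind of one hour shift described above.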

- Submit the Oozie coordinator job as
bin/oozie job -oozie http://localhost:11000/oozie -config /home/vm4learning/Code/oozie-clickstream-examples/apps/scheduler/coordinator.properties -run
- The coordinator job should appear in the Oozie console and move from the PREP state to RUNNING, all the way to SUCCEEDED.
- The output should appear as below in the `oozie-clickstream-examples/finaloutput/000000_0` file in HDFS.
www.businessweek.com 2
www.eenadu.net 2
www.stackoverflow.com 2
Note that Oozie also has the concept of bundles, where a user can batch a group of coordinator applications and execute an operation on the whole bunch at once. I will look into it in another blog entry.

3 comments:

  1. Nice post!

    You can also now enjoy a better Oozie UI with Hue ;) (e.g. http://gethue.tumblr.com/tagged/oozie)

    1. Planned to install Hue, but the documentation mentions CDH as a prerequisite. Not sure if it works with the frameworks from Apache. It looks like the same is the case with Impala.

  2. The cause of the delay may be that it needs the times in UTC and UTC only. I have seen this mentioned in the Hue documentation; maybe it is an Oozie limitation.
