Tuesday, November 19, 2013

Using Log4J/Flume to log application events into HDFS

Often the events from an application have to be analyzed to learn more about customer behavior, for recommendations, or to detect fraudulent usage. With more data to analyze, processing the events on a single machine might take a lot of time or might not be possible at all. This is where distributed systems like Hadoop and others come in.

Apache Flume and Sqoop can both be used to move data from a source to a sink. The main difference is that Flume is event based, while Sqoop is not. Also, Sqoop is used to move data from structured data stores like an RDBMS to HDFS and HBase, while Flume supports a variety of sources and sinks.

One option is to have the application use Log4J to send its log events to a Flume agent, which will store them in HDFS for further analysis.

Here are the steps to configure Flume with the Avro Source, Memory Channel, HDFS Sink and chain them together.

- Download Flume from here and extract it to a folder.

- In conf/flume.conf, define the different Flume components and chain them together.
# Define a memory channel called ch1 on agent1
agent1.channels.ch1.type = memory

# Define an Avro source called avro-source1 on agent1 and tell it
# to bind to 0.0.0.0:41414. Connect it to channel ch1.
agent1.sources.avro-source1.type = avro
agent1.sources.avro-source1.bind = 0.0.0.0
agent1.sources.avro-source1.port = 41414

# Define an HDFS sink that writes all events it receives to HDFS
# and connect it to the other end of the same channel.
agent1.sinks.hdfs-sink1.type = hdfs
agent1.sinks.hdfs-sink1.hdfs.path = hdfs://localhost:9000/flume/events/

# Finally, now that we've defined all of our components, tell
# agent1 which ones we want to activate.
agent1.channels = ch1
agent1.sources = avro-source1
agent1.sinks = hdfs-sink1

#chain the different components together
agent1.sinks.hdfs-sink1.channel = ch1
agent1.sources.avro-source1.channels = ch1
- Copy flume-env.sh.template to flume-env.sh.
cp conf/flume-env.sh.template conf/flume-env.sh

- Start Flume as
bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -Dflume.root.logger=DEBUG,console -n agent1
This should start Flume and the source listening at port 41414.
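Before going further, it is worth confirming that something is actually accepting connections on 41414 (a firewall can silently block the port, as one of the comments below notes). A minimal, standard-library-only check can be done from Java; the `PortCheck` class and `isListening` method names here are just for illustration:

```java
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortCheck {

    // Returns true if something is accepting TCP connections on host:port
    // within the given timeout, false otherwise.
    public static boolean isListening(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 41414 is the port configured for avro-source1 in flume.conf
        System.out.println(isListening("localhost", 41414, 2000));
    }
}
```

If this prints false while the agent is running, check the firewall settings on the machine hosting the Avro source.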

- With the assumption that Hadoop has been installed and configured, start HDFS.

- Now, run the Avro client to send a message to the Avro source as below. The contents of the /etc/passwd file should appear in HDFS under /flume/events as a SequenceFile (the default hdfs.fileType for the HDFS sink).
bin/flume-ng avro-client --conf conf -H localhost -p 41414 -F /etc/passwd -Dflume.root.logger=DEBUG,console

Now it is time to create an Eclipse project that uses Log4J to send info messages to both a file and the Flume appender. Here are the steps:

- Create a project in Eclipse and include all the jars from the <flume-install-folder>/lib as dependencies.

- Define the file and the Flume appenders in the log4j.properties file as
# Define the root logger with appender file
log = /home/vm4learning/WorkSpace/BigData/Log4J-Example/log
log4j.rootLogger = INFO, FILE, flume

# Define the file appender
log4j.appender.FILE = org.apache.log4j.FileAppender
log4j.appender.FILE.File = ${log}/log.out
log4j.appender.FILE.layout = org.apache.log4j.PatternLayout
log4j.appender.FILE.layout.ConversionPattern = %m%n

# Define the flume appender
log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname = localhost
log4j.appender.flume.Port = 41414
log4j.appender.flume.UnsafeMode = false
- Execute the below program in Eclipse to send the info events to the file and the Flume appenders. This is where the Java program hangs. With the Flume appender excluded, the info messages can be seen in the log file.
import java.io.IOException;
import java.sql.SQLException;

import org.apache.log4j.Logger;

public class log4jExample {

    static Logger log = Logger.getRootLogger();

    public static void main(String[] args) throws IOException, SQLException {
        for (int i = 1; i <= 10; i++)
            log.info("Hello this is an info message" + i);
    }
}
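If the hang happens because the Flume agent is unreachable, one thing worth trying is the appender's UnsafeMode property, which makes the appender drop the event instead of failing when the append does not succeed (a sketch; the rest of the appender configuration stays as above):

```
# log4j.properties: don't fail the application if the Flume agent is down
log4j.appender.flume.UnsafeMode = true
```

Note that with UnsafeMode enabled, events logged while the agent is down are silently lost.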

- hdfs.rollCount defaults to 10; this is the reason why log.info has to be called 10 times before the HDFS sink rolls (closes) the file. More details about the property can be found in the Flume User Guide here. Another option is to set this property to a much smaller number in the flume.conf file.
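For example, to make the sink close the HDFS file after every single event while experimenting (the property name is per the Flume User Guide; a value of 1 is only sensible for testing, since it produces one file per event):

```
# Roll the HDFS file after each event instead of the default 10
agent1.sinks.hdfs-sink1.hdfs.rollCount = 1
```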

- The events should appear in the HDFS under the folder specified in the flume.conf configuration file.


  1. Maybe a dumb question: is the port used by the avro source open? At least with CentOS, the ports are not open by default. We faced a similar issue when we tried to distribute the flume flow across nodes. We had a flume agent reading a log file on one node, and the avro source and sink on the other. The flow failed as the port was not open. - Maruti

    1. Irrespective of whether Flume is running or not, log4j just hangs in Eclipse.

  2. Do we need to run the Avro client and the Log4j client as well, to make sure the HDFS sink is listening and receiving? I understand that through log4j the log goes to flume.log, but how do you see the messages inside the /flume/events folder? Is that due to the Avro client writing the /etc/passwd file as a Sequence file, and what do you mean by Sequence file?

  3. Did you keep opening up your port using this command? nc -l -p 41414

  4. I seem to get this error
    2014-04-01 00:23:19,393 (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR - org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:422)] process failed
    java.lang.VerifyError: class org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$AppendRequestProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;


  5. getting below error
    $ bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -Dflume.root.logger=DEBUG,console -n agent1
    bin/flume-ng: line 82: conditional binary operator expected
    bin/flume-ng: line 82: syntax error near `=~'
    bin/flume-ng: line 82: ` if [[ $line =~ ^java\.library\.path=(.*)$ ]]; then

  6. Hi

    Please do following changes. we also faced the same issue and resolved it.

    1) Change the operator to !=
    2) Put condition string in double quote.

    As below :

    if [[ $line != "^java\.library\.path=(.*)$" ]]; then

    It will work !!!!