Often, the events from an application have to be analyzed to learn more about customer behavior, whether for recommendations or to detect fraudulent usage. With more data to analyze, processing the events on a single machine might take a lot of time, or sometimes not be possible at all. This is where distributed systems like Hadoop come in.
Apache Flume and Sqoop can both be used to move data from a source to a sink. The main difference is that Flume is event based, while Sqoop is not. Also, Sqoop is used to move data from structured data stores like an RDBMS to HDFS and HBase, while Flume supports a variety of sources and sinks.
One option is to have the application use Log4J to send its log events to Flume, which will store them in HDFS for further analysis.
Here are the steps to configure Flume with an Avro source, a memory channel, and an HDFS sink, and to chain them together.
- Download Flume from here and extract it to a folder.
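A minimal sketch of this step for a Flume NG binary tarball; the version in the file name below is only an example and should match whatever was actually downloaded:

tar -xzf apache-flume-1.4.0-bin.tar.gz
cd apache-flume-1.4.0-bin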
- In conf/flume.conf, define the different Flume components and chain them together as below.
# Define a memory channel called ch1 on agent1
agent1.channels.ch1.type = memory

# Define an Avro source called avro-source1 on agent1 and tell it
# to bind to 0.0.0.0:41414. Connect it to channel ch1.
agent1.sources.avro-source1.type = avro
agent1.sources.avro-source1.bind = 0.0.0.0
agent1.sources.avro-source1.port = 41414

# Define an HDFS sink that writes the events it receives to HDFS
# and connect it to the other end of the same channel.
agent1.sinks.hdfs-sink1.type = hdfs
agent1.sinks.hdfs-sink1.hdfs.path = hdfs://localhost:9000/flume/events/

# Finally, now that we've defined all of our components, tell
# agent1 which ones we want to activate.
agent1.channels = ch1
agent1.sources = avro-source1
agent1.sinks = hdfs-sink1

# Chain the different components together
agent1.sinks.hdfs-sink1.channel = ch1
agent1.sources.avro-source1.channels = ch1

- Copy flume-env.sh.template to flume-env.sh.

cp conf/flume-env.sh.template conf/flume-env.sh

- Start Flume as

bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -Dflume.root.logger=DEBUG,console -n agent1

This should start Flume with the Avro source listening on port 41414.
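To confirm that the Avro source is indeed listening, a quick check like the one below can be used (the exact netstat flags vary by operating system):

netstat -an | grep 41414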
- Assuming Hadoop has already been installed and configured, start HDFS, for example as below.
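A rough sketch of this step from the Hadoop install folder; the script locations differ between Hadoop releases, so treat the paths below as placeholders for whatever the local install uses:

bin/start-dfs.sh        # sbin/start-dfs.sh on newer Hadoop releases
bin/hadoop fs -ls /     # quick check that the NameNode at localhost:9000 is reachable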
- Now, run the Avro client to send events to the Avro source as below. The contents of the /etc/passwd file should then appear in HDFS under /flume/events as a SequenceFile.
bin/flume-ng avro-client --conf conf -H localhost -p 41414 -F /etc/passwd -Dflume.root.logger=DEBUG,console

Now is the time to create an Eclipse project that uses Log4J to send info messages to both the file and the Flume appenders. Here are the steps:
- Create a project in Eclipse and include all the jars from the <flume-install-folder>/lib as dependencies.
- Define the file and the Flume appenders in the log4j.properties file as below.
# Define the root logger with the file and flume appenders
log = /home/vm4learning/WorkSpace/BigData/Log4J-Example/log
log4j.rootLogger = INFO, FILE, flume

# Define the file appender
log4j.appender.FILE=org.apache.log4j.FileAppender
log4j.appender.FILE.File=${log}/log.out
log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
log4j.appender.FILE.layout.conversionPattern=%m%n

# Define the flume appender
log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname = localhost
log4j.appender.flume.Port = 41414
log4j.appender.flume.UnsafeMode = false
log4j.appender.flume.layout=org.apache.log4j.PatternLayout

- Execute the below program in Eclipse to send the info events to the file and the Flume appenders. This is where the Java program hangs; with the Flume appender excluded, the info messages can be seen in the log file.
import java.io.IOException;
import java.sql.SQLException;

import org.apache.log4j.Logger;

public class log4jExample {

    static Logger log = Logger.getRootLogger();

    public static void main(String[] args) throws IOException, SQLException {
        // Log ten info messages so that the HDFS sink rolls the file
        // (hdfs.rollCount defaults to 10, as noted below).
        for (int i = 1; i <= 10; i++)
            log.info("Hello this is an info message" + i);
    }
}
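For reference, the same program can also be compiled and run outside Eclipse; a rough sketch, assuming log4j.properties sits in the current directory and the jars are in <flume-install-folder>/lib:

javac -cp "<flume-install-folder>/lib/*" log4jExample.java
java -cp ".:<flume-install-folder>/lib/*" log4jExample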
- hdfs.rollCount defaults to 10, which is why log.info has to be called 10 times before the sink rolls the file. More details about the property can be found in the Flume User Guide here. Another option is to set this property to a much smaller number in the flume.conf file, as below.
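For example, adding the line below to flume.conf (using the agent and sink names defined above) would make the HDFS sink roll after every single event; the value 1 is only for experimentation, not a production setting:

agent1.sinks.hdfs-sink1.hdfs.rollCount = 1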
- The events should appear in HDFS under the folder specified in the flume.conf configuration file.
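To verify, the folder can be inspected with the usual HDFS commands; a sketch assuming the /flume/events path from the flume.conf above and the HDFS sink's default FlumeData file prefix:

bin/hadoop fs -ls /flume/events
bin/hadoop fs -text /flume/events/FlumeData.*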
Maybe a dumb question, but is the port used by the Avro source open? At least with CentOS the ports are not open by default. We faced a similar issue when we tried to distribute a Flume flow across nodes: we had the Flume agent reading a log file on one node and the Avro source and sink on the other. The flow failed as the port was not open. - Maruti
Irrespective of whether Flume is running or not, the Log4J program just hangs in Eclipse.
Do we need to run the Avro client and the Log4J client as well to make sure the HDFS sink is listening and receiving? I understand that through Log4J the log goes to flume.log, but how do you see the messages inside the /flume/events folder? Is that due to the Avro client and the /etc/passwd file being written as a SequenceFile, and what do you mean by a SequenceFile?
Did you keep your port open using this command? nc -l -p 41414
Nice post!!
I seem to get this error:
ReplyDelete2014-04-01 00:23:19,393 (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR - org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:422)] process failed
java.lang.VerifyError: class org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$AppendRequestProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
Any idea why?
Getting the below error:

$ bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -Dflume.root.logger=DEBUG,console -n agent1
bin/flume-ng: line 82: conditional binary operator expected
bin/flume-ng: line 82: syntax error near `=~'
bin/flume-ng: line 82: ` if [[ $line =~ ^java\.library\.path=(.*)$ ]]; then'
Hi,
Please make the following changes. We also faced the same issue and resolved it.
1) Change the operator to !=
2) Put the condition string in double quotes.
As below:
if [[ $line != "^java\.library\.path=(.*)$" ]]; then
It will work!!!!