Monday, March 11, 2013

Analyse Tweets using Flume, Hadoop and Hive

Note : Also check out another entry on how to get some interesting facts from Twitter using R here, this entry on how to use Oozie to automate the workflow below, and a newer blog on how to do the same analytics with Pig (using elephant-bird).

It's not a hard rule, but roughly 80% of all data is unstructured, while the remaining 20% is structured. An RDBMS helps to store and process the structured 20%, while Hadoop solves the problem of storing and processing both types of data. The good thing about Hadoop is that it scales incrementally, with less CAPEX in terms of software and hardware.

With the ever-increasing usage of smart devices and high-speed internet, unstructured data has been growing at a very fast rate. It's common to tweet from a smartphone, or to take a picture and share it on Facebook.

In this blog we will get Tweets using Flume and save them into HDFS for later analysis. Twitter exposes an API (more here) to fetch the Tweets. The service is free, but requires the user to register. Cloudera wrote a three-part series (1, 2, 3) on Twitter analysis using Hadoop; the code for it is here. For the impatient, I will quickly summarize how to get the data into HDFS using Flume and start doing some analytics using Hive.

Flume is built around agents. An agent hosts sources, channels and sinks: a source pushes/pulls the data and sends it to one or more channels, which in turn deliver the data to the sinks. Flume thereby decouples the source (Twitter) from the sink (HDFS) in this case, so both can operate at different speeds, and it's much easier to add new sources and sinks. Flume comes with a set of sources, channels and sinks, and new ones can be implemented by extending the Flume base classes.


1) The first step is to create an application in https://dev.twitter.com/apps/ and then generate the corresponding keys.


2) Assuming that Hadoop has already been installed and configured, the next step is to download Flume and extract it to any folder.

3) Download the flume-sources-1.0-SNAPSHOT.jar and add it to the Flume classpath in the conf/flume-env.sh file, as shown below
FLUME_CLASSPATH="/home/training/Installations/apache-flume-1.3.1-bin/flume-sources-1.0-SNAPSHOT.jar" 

The jar contains the java classes to pull the Tweets and save them into HDFS.
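For the curious, here is a minimal sketch of what such a custom source looks like against the Flume 1.x API (the class name MySource and the hard-coded event body are placeholders for illustration only; the actual TwitterSource additionally wires up the twitter4j streaming client and its callbacks):

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDrivenSource;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.source.AbstractSource;

public class MySource extends AbstractSource implements EventDrivenSource, Configurable {
    private String keywords;

    @Override
    public void configure(Context context) {
        // Picks up properties such as TwitterAgent.sources.Twitter.keywords from flume.conf
        keywords = context.getString("keywords", "");
    }

    @Override
    public void start() {
        super.start();
        // In a real source this runs for every record received from the external
        // service; each event is handed to the configured channel(s).
        Event event = EventBuilder.withBody("raw JSON tweet".getBytes());
        getChannelProcessor().processEvent(event);
    }

    @Override
    public void stop() {
        super.stop();
    }
}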

4) The conf/flume.conf should have the agent's source (Twitter), channel (memory) and sink (HDFS) defined as below
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <consumerKey>
TwitterAgent.sources.Twitter.consumerSecret = <consumerSecret>
TwitterAgent.sources.Twitter.accessToken = <accessToken>
TwitterAgent.sources.Twitter.accessTokenSecret = <accessTokenSecret>

TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics, bigdata, cloudera, data science, data scientist, business intelligence, mapreduce, data warehouse, data warehousing, mahout, hbase, nosql, newsql, businessintelligence, cloudcomputing

TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

The consumerKey, consumerSecret, accessToken and accessTokenSecret have to be replaced with those obtained from https://dev.twitter.com/apps. And TwitterAgent.sinks.HDFS.hdfs.path should point to the NameNode and the location in HDFS where the tweets will be written.

The TwitterAgent.sources.Twitter.keywords value can be modified to get the tweets for some other topic like football, movies etc., as in the example below.
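For instance, a hypothetical keyword list for tracking football-related tweets would look like:

TwitterAgent.sources.Twitter.keywords = football, fifa, worldcup, premierleague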

5) Start Flume using the below command
bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent
After a couple of minutes the Tweets should appear in HDFS.
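To verify, list the sink directory (the path assumes the hdfs.path configured above; FlumeData is the HDFS sink's default file prefix):

hadoop fs -ls /user/flume/tweets
hadoop fs -cat '/user/flume/tweets/FlumeData.*' | head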


6) Next download and extract Hive. Modify the conf/hive-site.xml to include the locations of the NameNode and the JobTracker as below
<configuration>
     <property>
         <name>fs.default.name</name>
         <value>hdfs://localhost:9000</value>
     </property>
     <property>
         <name>mapred.job.tracker</name>
         <value>localhost:9001</value>
     </property>
</configuration>
7) Download hive-serdes-1.0-SNAPSHOT.jar to the lib directory in Hive. Twitter returns Tweets in the JSON format, and this library will help Hive parse the JSON.

8) Start the Hive shell using the hive command and register the hive-serdes-1.0-SNAPSHOT.jar file downloaded earlier. Note that ADD JAR only lasts for the current session, so it has to be rerun in every new Hive shell.
ADD JAR /home/training/Installations/hive-0.9.0/lib/hive-serdes-1.0-SNAPSHOT.jar;
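To confirm that the jar got registered in the current session, Hive's LIST JARS command should show it:

LIST JARS;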
9) Now, create the tweets table in Hive
CREATE EXTERNAL TABLE tweets (
   id BIGINT,
   created_at STRING,
   source STRING,
   favorited BOOLEAN,
   retweet_count INT,
   retweeted_status STRUCT<
      text:STRING,
      user:STRUCT<screen_name:STRING,name:STRING>>,
   entities STRUCT<
      urls:ARRAY<STRUCT<expanded_url:STRING>>,
      user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
      hashtags:ARRAY<STRUCT<text:STRING>>>,
   text STRING,
   user STRUCT<
      screen_name:STRING,
      name:STRING,
      friends_count:INT,
      followers_count:INT,
      statuses_count:INT,
      verified:BOOLEAN,
      utc_offset:INT,
      time_zone:STRING>,
   in_reply_to_screen_name STRING
) 
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets';
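A note for those on newer Hive releases: from Hive 1.2 onwards user is a reserved keyword, so the DDL above fails with a ParseException ("Failed to recognize predicate 'user'"). Quoting the identifier with backticks, i.e. `user` STRUCT<...>, is the portable fix; on Hive 1.2.x the keyword check can also be relaxed for the session (later versions dropped this property):

set hive.support.sql11.reserved.keywords=false;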
Now that we have the data in HDFS and the table created in Hive, let's run some queries in Hive.

One way to determine the most influential person in a particular field is to figure out whose tweets are retweeted the most. Give Flume enough time to collect Tweets from Twitter into HDFS and then run the below query in Hive to determine the most influential person.
SELECT t.retweeted_screen_name, SUM(retweets) AS total_retweets, COUNT(*) AS tweet_count
FROM (SELECT retweeted_status.user.screen_name AS retweeted_screen_name,
             retweeted_status.text,
             MAX(retweet_count) AS retweets
      FROM tweets
      GROUP BY retweeted_status.user.screen_name, retweeted_status.text) t
GROUP BY t.retweeted_screen_name
ORDER BY total_retweets DESC
LIMIT 10;
Similarly, to know which user has the most followers, the below query helps.
SELECT user.screen_name, user.followers_count c FROM tweets ORDER BY c DESC;
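Along the same lines, here is an untested sketch that ranks the most used hashtags in the collected tweets by exploding the entities.hashtags array:

SELECT LOWER(hashtag.text) AS tag, COUNT(*) AS total
FROM tweets LATERAL VIEW EXPLODE(entities.hashtags) tags AS hashtag
GROUP BY LOWER(hashtag.text)
ORDER BY total DESC
LIMIT 10;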
For the sake of simplicity, partitions have not been created in Hive. Partitions can be created in Hive using Oozie at regular intervals to make the queries run faster when they are restricted to a particular period of time. Creating partitions will be covered in another blog, but a rough sketch of the idea follows.
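The sketch below is illustrative only: the abbreviated column list, the datehour value and the per-hour directory layout are assumptions (they presume the Flume sink is configured to roll files into date-based directories):

-- abbreviated columns; the real table would repeat the full column list above
CREATE EXTERNAL TABLE tweets_partitioned (
   id BIGINT,
   created_at STRING,
   text STRING
)
PARTITIONED BY (datehour INT)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets';

-- register one hour of data as a partition
ALTER TABLE tweets_partitioned ADD IF NOT EXISTS
PARTITION (datehour = 2013031100) LOCATION '/user/flume/tweets/2013/03/11/00';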

Happy Hadooping !!!

Edit (21st March, 2013) :  Hortonworks blogged a two-part series (1 and 2) on Twitter data processing using Hive.

Edit (30th March, 2016) : With the latest version of Flume, the following error is thrown because of conflicting twitter4j libraries

java.lang.NoSuchMethodError: twitter4j.FilterQuery.setIncludeEntities(Z)Ltwitter4j/FilterQuery;
at com.cloudera.flume.source.TwitterSource.start(TwitterSource.java:139)

One solution is to remove the below libraries from the Flume lib folder. There are a couple more solutions in this StackOverflow article.

lib/twitter4j-core-3.0.3.jar
lib/twitter4j-media-support-3.0.3.jar
lib/twitter4j-stream-3.0.3.jar
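One non-destructive way is to move them out of the lib folder instead of deleting them (this assumes FLUME_HOME points at the extracted Flume directory):

cd $FLUME_HOME
mkdir -p lib-disabled
mv lib/twitter4j-core-3.0.3.jar lib/twitter4j-media-support-3.0.3.jar lib/twitter4j-stream-3.0.3.jar lib-disabled/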

85 comments:

  1. Awesome Post Praveen!! Just to clarify, should flume-env.sh be in the conf folder or the bin folder? When I followed the Apache Flume installation steps I created flume-env.sh in the conf folder.

    1. The flume-env.sh should be in the conf folder and not the bin folder. I have updated the blog.

    2. hi praveen
      In which format are the Tweets? I have loaded the tweets data into a hive table. Now when I run a command in hive like select * from tablename, it's not giving the sql structure. Please help me on this praveen

    3. when executing flume command i am getting below exception
      2016-09-28 17:38:25,545 (Twitter Stream consumer-1[Establishing connection]) [INFO - twitter4j.internal.logging.SLF4JLogger.info(SLF4JLogger.java:83)] Waiting for 10000 milliseconds
      2016-09-28 17:38:35,547 (Twitter Stream consumer-1[Waiting for 10000 milliseconds]) [ERROR - org.apache.flume.source.twitter.TwitterSource.onException(TwitterSource.java:331)] Exception while streaming tweets
      404:The URI requested is invalid or the resource requested, such as a user, does not exist.
      Unknown URL. See Twitter Streaming API documentation at http://dev.twitter.com/pages/streaming_api

      if u know pls let me know

    4. Please post the solution for this. Even I am facing this issue.
      Thanks

  2. Thanks for posting awesome tutorial. I have configured hadoop cluster and now I am trying to follow your tutorial for twitter data analysis. I have installed flume and completed until step 3. For the step 4 when you mention conf/flume.conf are you referring to flume-conf.properties.template file inside conf folder? If not, where can I find that file and define agent fields. Please let me know. I would really appreciate your help. Thanks in advance!!

    1. Reema, please copy flume-conf.properties.template to flume.conf and replace the configurations mentioned above (in step 4)

  3. I am not getting the tweets from twitter... the process is stopping at this point mentioned below.... !! Nothing beyond this step.

    13/04/18 04:13:11 INFO instrumentation.MonitoredCounterGroup: Monitoried counter group for type: SINK, name: HDFS, registered successfully.
    13/04/18 04:13:11 INFO instrumentation.MonitoredCounterGroup: Component type: SINK, name: HDFS started

    ****************************************************************
    I have been trying to change some files due the error i have been observing.

    In flume.conf have changed the TwitterAgent.sources = TwitterSource

    whereas the original content is TwitterAgent.sources = Twitter.


  4. This comment has been removed by the author.

    1. I need your help.. could you please help me out?
      I'm using apache-flume-1.3.1 to move a log file into HDFS. Flume does write the file to the HDFS sink, but it can't pull all of the content: I have a 9.7 MB log file (stored temporarily on the local disk), yet Flume stores only about 3.5 KB of it in HDFS. If I refresh the original log file it stores close to 2 MB, but the lines come through shuffled, some lines are dropped, and it sometimes starts from the middle of a line instead of the first line. What can I do, and how should I configure Flume to solve this problem?
      Thank you
      regards

  5. Thanks for the great post, however after running step 5) l get the following "ERROR properties.PropertiesFileConfigurationProvider: Failed to load configuration data. Exception follows org.apache.flume.FlumeException: Unable to load source type: com.cloudera.flume.source.TwitterSource".

    Please assist.

    1. Looks like you need to make sure your FLUME_CLASSPATH is pointing to the .jar file.

  6. In step 5, I had no tweets coming into HDFS.

    But I found time out information in the console.

    Anyone can help?

    2013-05-10 14:25:53,941 (Twitter Stream consumer-1[Establishing connection]) [DEBUG - twitter4j.internal.logging.SLF4JLogger.debug(SLF4JLogger.java:75)] Post Params: count=0&track=hadoop%2Cbig%20data%2Canalytics%2Cbigdata%2Ccloudera%2Cdata%20science%2Cdata%20scientist%2Cbusiness%20intelligence%2Cmapreduce%2Cdata%20warehouse%2Cdata%20warehousing%2Cmahout%2Chbase%2Cnosql%2Cnewsql%2Cbusinessintelligence%2Ccloudcomputing&include_entities=true
    2013-05-10 14:26:07,436 (conf-file-poller-0) [DEBUG - org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:126)] Checking file:conf/flume.conf for changes
    2013-05-10 14:26:14,039 (Twitter Stream consumer-1[Establishing connection]) [INFO - twitter4j.internal.logging.SLF4JLogger.info(SLF4JLogger.java:83)] connect timed out
    2013-05-10 14:26:14,039 (Twitter Stream consumer-1[Establishing connection]) [INFO - twitter4j.internal.logging.SLF4JLogger.info(SLF4JLogger.java:83)] Waiting for 16000 milliseconds

  7. Hi Praveen,

    I got tweets coming into HadoopFS but when we are writing various queries i am getting this error: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask

    LMK what needs to be done here to run your queries?

    Thanks,
    NT

    1. This comment has been removed by the author.

    2. I also got the same error..

      just add hive-serdes-1.0-SNAPSHOT.jar to you hive shell ..

  8. Does anyone know why I might be getting the following error?
    [ERROR - org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:253)] Unable to start EventDrivenSourceRunner: { source:com.cloudera.flume.source.TwitterSource{name:Twitter,state:IDLE} } - Exception follows.
    java.lang.NoSuchMethodError: twitter4j.FilterQuery.setIncludeEntities(Z)Ltwitter4j/FilterQuery;

    1. This comment has been removed by the author.

    2. I got the same error. This is how I resolved it
      a) First checked the jar file to make sure that FilterQuery.class exists in flume-sources-1.0-SNAPSHOT.jar
      [root@localhost lib]# /usr/java/jdk1.6.0_31/bin/jar tvf /tmp/flume-sources-1.0-SNAPSHOT.jar
      You may see an item
      4451 Tue Nov 13 10:06:42 PST 2012 twitter4j/FilterQuery.class
      So it is there.
      Now need to verify that the class has the method setIncludeEntities
      Extracted the jar file and viewed the source code, I used show my code online class decompiler. See
      http://www.showmycode.com/
      Found that the method setIncludeEntities exists. So FLUME_CLASSPATH ( flume-sources-1.0-SNAPSHOT.jar) is correct.

      Now, it seems the issues is another jar file which has the same class, but does not have this method is getting picked by the flume agent java process and causing the issue.

      So we need to find that jar, for that we need the full class path when starting the flume agent

      I run the below command as posted in the blog

      [root@localhost lib]# /usr/bin/flume-ng agent start --conf /etc/flume-ng/conf/ -f /etc/flume-ng/conf/flume.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent

      Which shows a very lengthy classpath

      [root@localhost lib]# /usr/bin/flume-ng agent start --conf /etc/flume-ng/conf/ -f /etc/flume-ng/conf/flume.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent
      Info: Sourcing environment configuration script /etc/flume-ng/conf/flume-env.sh
      Info: Including Hadoop libraries found via (/usr/bin/hadoop) for HDFS access
      Info: Excluding /usr/lib/hadoop/lib/slf4j-api-1.6.1.jar from classpath
      Info: Excluding /usr/lib/hadoop/lib/slf4j-log4j12-1.6.1.jar from classpath
      Info: Excluding /usr/lib/hadoop-0.20-mapreduce/lib/slf4j-api-1.6.1.jar from classpath
      Info: Including HBASE libraries found via (/usr/bin/hbase) for HBASE access
      Info: Excluding /usr/lib/hbase/bin/../lib/slf4j-api-1.6.1.jar from classpath
      Info: Excluding /usr/lib/zookeeper/lib/slf4j-api-1.6.1.jar from classpath
      Info: Excluding /usr/lib/zookeeper/lib/slf4j-log4j12-1.6.1.jar from classpath
      Info: Excluding /usr/lib/hadoop/lib/slf4j-api-1.6.1.jar from classpath
      Info: Excluding /usr/lib/hadoop/lib/slf4j-log4j12-1.6.1.jar from classpath
      Info: Excluding /usr/lib/hadoop-0.20-mapreduce/lib/slf4j-api-1.6.1.jar from classpath
      + exec /usr/java/jdk1.6.0_31/bin/java -Xmx20m -Dflume.root.logger=DEBUG,console -cp '/etc/flume-ng/conf:/usr/lib/flume-ng/lib/*:/tmp/flume-sources-1.0-SNAPSHOT.jar:/etc/hadoop/conf:/usr/lib/hadoop/lib/activation-1.1.jar:/usr/lib/hadoop/lib/asm-3.2.jar:..........................................................

      Note the beginning of the classpath
      -cp '/etc/flume-ng/conf:/usr/lib/flume-ng/lib/*:/tmp/flume-sources-1.0-SNAPSHOT.jar:

      See the jar files in /usr/lib/flume-ng/lib/*

      Check the content of the jar files for FilterQuery.class

      [root@localhost lib]# find . -name "*.jar" | xargs grep FilterQuery.class
      Binary file ./search-contrib-0.9.1-cdh4.3.0-SNAPSHOT-jar-with-dependencies.jar matches
      [root@localhost lib]#

      So we have another JAR file search-contrib-0.9.1-cdh4.3.0-SNAPSHOT-jar-with-dependencies.jar with the same class and conflicting with correct one in FLUME_CLASSPATH

      Temporarily rename it to .org extension, so that it will be excluded from classpath at the startup

      [root@localhost lib]# mv search-contrib-0.9.1-cdh4.3.0-SNAPSHOT-jar-with-dependencies.jar search-contrib-0.9.1-cdh4.3.0-SNAPSHOT-jar-with-dependencies.jar.org
      [root@localhost lib]#

      Now start the flume agent again
      [root@localhost lib]# /usr/bin/flume-ng agent start --conf /etc/flume-ng/conf/ -f /etc/flume-ng/conf/flume.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent

      It worked like a charm. I do not know the effect of temporarily renaming and excluding the search-contrib-0.9.1-cdh4.3.0-SNAPSHOT-jar-with-dependencies.jar from the classpath, on other components..but it worked for this scenario..I will continue my investigation further...
      If anybody has a better solution, please post it



    3. Hi Praveen and thanks for this tutorial,
      I have the same problem and followed your procedure. The other JAR file conflicting with flume-sources-1.0-SNAPSHOT.jar is twitter4j-stream-3.0.3.jar. Obviously, that last file cannot be renamed without error :

      2013-10-04 10:41:29,873 (Twitter Stream consumer-1[Establishing connection]) [INFO - twitter4j.internal.logging.SLF4JLogger.info(SLF4JLogger.java:83)] 401:Authentication credentials (https://dev.twitter.com/pages/auth) were missing or incorrect. Ensure that you have set valid consumer key/secret, access token/secret, and the system clock is in sync.

      Can you help me?

  9. Hi praveen,
    i have done all the steps in the blog and i got the same results.now how can we create partitions in hive using oozie at regular intervals?
    and i want to schedule them in oozie. Is it possible?

    1. I am not able to create the hive table and I am using Apache Hadoop, not Cloudera. Can you please help me?

  10. Hi,

    I am getting below error

    HDFS IO error
    java.io.IOException: Callable timed out after 10000 ms on file: hdfs://localhost:8502/user/flume/tweets//FlumeData.1375149990564.tmp
    at org.apache.flume.sink.hdfs.BucketWriter.callWithTimeout(BucketWriter.java:550)
    at org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:220)
    at org.apache.flume.sink.hdfs.BucketWriter.append(BucketWriter.java:383)
    at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:392)
    at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
    at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
    at java.lang.Thread.run(Thread.java:679)
    Caused by: java.util.concurrent.TimeoutException
    at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:258)
    at java.util.concurrent.FutureTask.get(FutureTask.java:119)
    at org.apache.flume.sink.hdfs.BucketWriter.callWithTimeout(BucketWriter.java:543)
    ... 6 more

    could you please help me out ?

  11. This comment has been removed by the author.

  12. Hi Praveen/Guys,
    I ran the same code on my HortonWorks cluster and once the sink and channel have started, the program times out trying to connect to twitter.

    Is this because this code is cloudera specific as I see here in conf file.

    TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource

    Would be glad if someone could help.

    Here is the timout portion of the log:

    25 Sep 2013 07:47:54,763 INFO [conf-file-poller-0] (org.apache.flume.node.nodemanager.DefaultLogicalNodeManager.startAllComponents:141) - Starting Sink HDFS
    25 Sep 2013 07:47:54,763 INFO [conf-file-poller-0] (org.apache.flume.node.nodemanager.DefaultLogicalNodeManager.startAllComponents:152) - Starting Source Twitter
    25 Sep 2013 07:47:54,766 INFO [lifecycleSupervisor-1-0] (org.apache.flume.instrumentation.MonitoredCounterGroup.register:89) - Monitoried counter group for type: SINK, name: HDFS, registered successfully.
    25 Sep 2013 07:47:54,766 INFO [lifecycleSupervisor-1-0] (org.apache.flume.instrumentation.MonitoredCounterGroup.start:73) - Component type: SINK, name: HDFS started
    25 Sep 2013 07:47:54,770 INFO [Twitter Stream consumer-1[initializing]] (twitter4j.internal.logging.SLF4JLogger.info:83) - Establishing connection.
    25 Sep 2013 07:48:15,182 INFO [Twitter Stream consumer-1[Establishing connection]] (twitter4j.internal.logging.SLF4JLogger.info:83) - connect timed out
    .
    .
    .
    .
    .

    1. i got the same error.....are you able to resolve it

  13. -Dflume.root.logger=DEBUG,console without this line i got error any idea why? otherwise fine.

  14. Hi Praveen/Guys,

    I am getting an error while getting Tweets into HDFS. Below is the logs for the same.

    13/11/21 06:00:01 INFO twitter4j.TwitterStreamImpl: Establishing connection.
    13/11/21 06:00:02 INFO twitter4j.TwitterStreamImpl: stream.twitter.com
    13/11/21 06:00:02 INFO twitter4j.TwitterStreamImpl: Waiting for 250 milliseconds
    13/11/21 06:00:02 INFO twitter4j.TwitterStreamImpl: Establishing connection.
    13/11/21 06:00:02 INFO twitter4j.TwitterStreamImpl: stream.twitter.com
    13/11/21 06:00:02 INFO twitter4j.TwitterStreamImpl: Waiting for 500 milliseconds
    13/11/21 06:00:02 INFO twitter4j.TwitterStreamImpl: Establishing connection.
    13/11/21 06:00:02 INFO twitter4j.TwitterStreamImpl: stream.twitter.com
    13/11/21 06:00:02 INFO twitter4j.TwitterStreamImpl: Waiting for 1000 milliseconds
    13/11/21 06:00:03 INFO twitter4j.TwitterStreamImpl: Establishing connection.
    13/11/21 06:00:03 INFO twitter4j.TwitterStreamImpl: stream.twitter.com
    13/11/21 06:00:03 INFO twitter4j.TwitterStreamImpl: Waiting for 2000 milliseconds
    13/11/21 06:00:05 INFO twitter4j.TwitterStreamImpl: Establishing connection.
    13/11/21 06:00:05 INFO twitter4j.TwitterStreamImpl: stream.twitter.com
    13/11/21 06:00:05 INFO twitter4j.TwitterStreamImpl: Waiting for 4000 milliseconds
    13/11/21 06:00:09 INFO twitter4j.TwitterStreamImpl: Establishing connection.
    13/11/21 06:00:09 INFO twitter4j.TwitterStreamImpl: stream.twitter.com
    13/11/21 06:00:09 INFO twitter4j.TwitterStreamImpl: Waiting for 8000 milliseconds
    13/11/21 06:00:17 INFO twitter4j.TwitterStreamImpl: Establishing connection.
    13/11/21 06:00:17 INFO twitter4j.TwitterStreamImpl: stream.twitter.com
    13/11/21 06:00:17 INFO twitter4j.TwitterStreamImpl: Waiting for 16000 milliseconds
    13/11/21 06:00:33 INFO twitter4j.TwitterStreamImpl: Establishing connection.

    Can i get some help ?

  15. I've been getting the some error when i'm trying to run this

    /usr/bin/flume-ng agent -n kings-river-flume -c conf -f /usr/lib/flume-ng/conf/flume.conf ( i used the default " kings-river-flume" twitter agent name (becouse i couldn't modify /etc/default/flume-ng-agent file) just hoping this was the problem ) but i'm getting this error

    13/11/18 16:44:38 INFO lifecycle.LifecycleSupervisor: Starting lifecycle supervisor 1
    13/11/18 16:44:38 INFO node.FlumeNode: Flume node starting - kings-river-flume
    13/11/18 16:44:38 INFO nodemanager.DefaultLogicalNodeManager: Node manager starting
    13/11/18 16:44:38 INFO lifecycle.LifecycleSupervisor: Starting lifecycle supervisor 9
    13/11/18 16:44:38 INFO properties.PropertiesFileConfigurationProvider: Configuration provider starting
    13/11/18 16:44:39 INFO properties.PropertiesFileConfigurationProvider: Reloading configuration file:/usr/lib/flume-ng/conf/flume.conf
    13/11/18 16:44:39 INFO conf.FlumeConfiguration: Added sinks: HDFS Agent: TwitterAgent
    13/11/18 16:44:39 INFO conf.FlumeConfiguration: Processing:HDFS
    13/11/18 16:44:39 INFO conf.FlumeConfiguration: Processing:HDFS
    13/11/18 16:44:39 INFO conf.FlumeConfiguration: Processing:HDFS
    13/11/18 16:44:39 INFO conf.FlumeConfiguration: Processing:HDFS
    13/11/18 16:44:39 INFO conf.FlumeConfiguration: Processing:HDFS
    13/11/18 16:44:39 INFO conf.FlumeConfiguration: Processing:HDFS
    13/11/18 16:44:39 INFO conf.FlumeConfiguration: Processing:HDFS
    13/11/18 16:44:39 INFO conf.FlumeConfiguration: Processing:HDFS
    13/11/18 16:44:39 INFO conf.FlumeConfiguration: Post-validation flume configuration contains configuration for agents: [TwitterAgent]
    13/11/18 16:44:39 WARN properties.PropertiesFileConfigurationProvider: No configuration found for this host:kings-river-flume
    13/11/18 16:45:05 INFO node.FlumeNode: Flume node stopping - kings-river-flume
    13/11/18 16:45:05 INFO lifecycle.LifecycleSupervisor: Stopping lifecycle supervisor 8
    13/11/18 16:45:05 INFO properties.PropertiesFileConfigurationProvider: Configuration provider stopping
    13/11/18 16:45:05 INFO nodemanager.DefaultLogicalNodeManager: Node manager stopping
    13/11/18 16:45:05 INFO lifecycle.LifecycleSupervisor: Stopping lifecycle supervisor 8

    Can anybody help?

  16. Hello All,

    I am receiving following error:

    13/11/27 00:11:15 INFO twitter4j.TwitterStreamImpl: Establishing connection.
    13/11/27 00:11:17 INFO twitter4j.TwitterStreamImpl: 406:Returned by the Search API when an invalid format is specified in the request.
    Returned by the Streaming API when one or more of the parameters are not suitable for the resource. The track parameter, for example, would throw this error if:
    The track keyword is too long or too short.
    The bounding box specified is invalid.
    No predicates defined for filtered resource, for example, neither track nor follow parameter defined.
    Follow userid cannot be read.
    Parameter track item index 0 too short:

    13/11/27 00:11:17 WARN twitter4j.TwitterStreamImpl: Parameter not accepted with the role. 406:Returned by the Search API when an invalid format is specified in the request.
    Returned by the Streaming API when one or more of the parameters are not suitable for the resource. The track parameter, for example, would throw this error if:
    The track keyword is too long or too short.
    The bounding box specified is invalid.
    No predicates defined for filtered resource, for example, neither track nor follow parameter defined.
    Follow userid cannot be read.
    Parameter track item index 0 too short:

    Can someone help or guide me?

    I defined the keywords exactly as in the example above...

    Many thanks,

    Filip

  17. Thanks Praveen, debugging tips helped me

  18. Thanks Suresh, nice work around. Solved the problem for me.

  19. I am facing this error althogh keys are correct

    - 401:Authentication credentials (https://dev.twitter.com/docs/auth) were missing or incorrect. Ensure that you have set valid consumer key/secret, access token/secret, and the system clock is in sync.

    For your kind help

    1. Make sure you have tested your credentials(access token/secret) in Twitter site.

    2. hi attia elsayed iam also facing that problem did you find any solution????
      plz reply

  20. Thanks Suresh
    A very nice blog ...configured Apache flume in seconds to get data from twitter...

  21. hello,
    i am getting following error ,can someone help please

    ERROR node.PollingPropertiesFileConfigurationProvider: Failed to load configuration data. Exception follows.
    org.apache.flume.FlumeException: Unable to load source type: com.cloudera.flume.source.TwitterSource, class: com.cloudera.flume.source.TwitterSource
    at org.apache.flume.source.DefaultSourceFactory.getClass(DefaultSourceFactory.java:67)
    at org.apache.flume.source.DefaultSourceFactory.create(DefaultSourceFactory.java:40)
    at org.apache.flume.node.AbstractConfigurationProvider.loadSources(AbstractConfigurationProvider.java:327)
    at org.apache.flume.node.AbstractConfigurationProvider.getConfiguration(AbstractConfigurationProvider.java:102)
    at org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:140)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
    Caused by: java.lang.ClassNotFoundException: com.cloudera.flume.source.TwitterSource
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:190)
    at org.apache.flume.source.DefaultSourceFactory.getClass(DefaultSourceFactory.java:65)

    1. Have you got any solution to this above error

  22. Hi Guys!!
    I m pretty new to this field. I just wanted to know whether I could install flume,hive,oozie,etc without setting up the cloudera environment as suggested in the post. I have already a working pseudo distributed Hadoop cluster set up on my computer.

  23. I am able to see tweets on console but not getting data into hdfs.

    My conf file is :


    TwitterAgent.sources = Twitter
    TwitterAgent.channels = MemChannel
    TwitterAgent.sinks = HDFS
    TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
    TwitterAgent.sources.Twitter.channels = MemChannel
    TwitterAgent.sources.Twitter.consumerKey = g2VoxfPOrIFKo0o58mhA
    TwitterAgent.sources.Twitter.consumerSecret = L8H80HLL3q2LKTQQBX8BtyleMW1YdqxheWJxWozbcbg
    TwitterAgent.sources.Twitter.accessToken = 1382526218-EvFiViJfN8b2CmPankGyaU6BHty1FXYDK1PLZEQ
    TwitterAgent.sources.Twitter.accessTokenSecret = T0owrTGPvvw548CSiydTibwwi6ZfJJqLBW64vSot1jMTI
    TwitterAgent.sources.Twitter.keywords = ind,pak,cricinfo,cricket,asiacup,rgsharma,viratkohli,rahane,dhoni,ipl,6thmatch,dhawan,shahidafridi,boomboom,
    TwitterAgent.sinks.HDFS.channel = MemChannel
    TwitterAgent.sinks.HDFS.type = hdfs
    TwitterAgent.sinks.HDFS.hdfs.path = hdfs://myhost:8020/whyt/twitter/
    TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
    TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
    TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
    TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
    TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
    TwitterAgent.channels.MemChannel.type = memory
    TwitterAgent.channels.MemChannel.capacity = 10000
    TwitterAgent.channels.MemChannel.transactionCapacity = 100


    Any idea;

    1. hi jp, i'm facing the same problem. Did you find any solution?????
      pls reply

    2. If your exception is an HDFS IO error such as java.io.EOFException,
      replace hdfs://myhost:8020 with hdfs://localhost:8020 (if the NameNode port number is 8020).
      You can find the configuration in /etc/alternatives/hadoop-0.20/core-site.xml

  24. I could be wrong but I think the query to determine the most influential person has the following correction - replace 'max(retweet_count)' in the sub-query with 'max(retweeted_status.retweet_count)'.

  25. Hi every body , i have this error while i'm executing Flume , any help ??
    org.apache.flume.node.Application -f conf/flume.conf -n TwitterAgent
    Error: could not find or load main class org.apache.flume.node.Application

  26. Hi, I need help. If someone has managed to complete the tutorial successfully, please contact lhssa@hotmail.com

  27. how to view the output after processing mapreduce??

  28. after I installed hive,,i m getting output like this

    > select user.screen_name, user.followers_count c from tweets order by c desc;
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
    set hive.exec.reducers.bytes.per.reducer=
    In order to limit the maximum number of reducers:
    set hive.exec.reducers.max=
    In order to set a constant number of reducers:
    set mapred.reduce.tasks=
    Starting Job = job_201404170949_0035, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201404170949_0035
    Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=localhost:8021 -kill job_201404170949_0035
    2014-05-14 00:49:49,071 Stage-1 map = 0%, reduce = 0%
    2014-05-14 00:49:52,091 Stage-1 map = 0%, reduce = 100%
    2014-05-14 00:49:53,108 Stage-1 map = 100%, reduce = 100%
    Ended Job = job_201404170949_0035
    OK
    Time taken: 7.453 seconds

    it doesnt show the table,,,bt i got the tweeted info as in tmp file,,,how to resolve this problem

  29. um, this doesnt make any sense.........

    WHERE is the conf/flume-ng??? Why should I have this? I only downloaded the flume-JAR but you didn't say anything about downloading more than this regarding flume?? is there something missing? what else should I DOWNLOAD?

  30. Hi,
    Praveen Sripati

    This post is very helpful for me every things work for me. i want to ask 1 question if i have to store date wise twitter data like (in 4/04/2014 Modi twits) then how to set flume agent or flume.conf.

  31. I couldn't find setIncludeEntities method in my FilterQuery.class. Could you please tell me what to do to solve this issue.

  32. Thanks for great post !!!
    As mentioned in point 8, please give command to register "hive-serdes-1.0-SNAPSHOT.jar" in Hive shell.

  33. Thanks for the Post!
    I am getting following in console.
    14/08/04 02:00:46 INFO instrumentation.MonitoredCounterGroup: Component type: SINK, name: HDFS started
    14/08/04 02:00:46 INFO twitter4j.TwitterStreamImpl: Establishing connection.
    14/08/04 02:01:08 INFO twitter4j.TwitterStreamImpl: stream.twitter.com
    14/08/04 02:01:08 INFO twitter4j.TwitterStreamImpl: Waiting for 250 milliseconds
    14/08/04 02:01:08 INFO twitter4j.TwitterStreamImpl: Establishing connection.
    14/08/04 02:01:08 INFO twitter4j.TwitterStreamImpl: stream.twitter.com
    14/08/04 02:01:08 INFO twitter4j.TwitterStreamImpl: Waiting for 500 milliseconds
    14/08/04 02:01:09 INFO twitter4j.TwitterStreamImpl: Establishing connection.
    14/08/04 02:01:09 INFO twitter4j.TwitterStreamImpl: stream.twitter.com
    14/08/04 02:01:09 INFO twitter4j.TwitterStreamImpl: Waiting for 1000 milliseconds
    ...

    and I couldn''t able to find tweets in HDFS.

    Any help!

    1. hey
      i am getting the same problem.My net is working fine but i am using cntlm.Have you found any solution for this?

  34. I'm trying to start flume but keeping a permission denied error:

    -bash: /user/Flume/apache-flume-1.5.0-bin/bin/flume-ng: Permission denied
    and try to start it like this as well:

    /user/Flume/apache-flume-1.5.0-bin/bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent

    not sure if this makes a difference, but the server I'm working on is on amazon EC2 and where I believe you can't login as root but use sudo su instead, does that make a difference in the permissions? Sorry if this is trivial but i'm a linux rookie..

  35. Hi Praveen,
    I am getting the below log whenever I start flume, but I didn't get any tweets from Twitter. So, according to the below log, is flume running or not?
    Could you please give me a solution?
    2014-09-02 12:22:23,322 (Twitter Stream consumer-1[Establishing connection]) [DEBUG - twitter4j.internal.logging.SLF4JLogger.debug(SLF4JLogger.java:67)] X-Twitter-Client-URL: http://twitter4j.org/en/twitter4j-2.2.6.xml
    2014-09-02 12:22:23,323 (Twitter Stream consumer-1[Establishing connection]) [DEBUG - twitter4j.internal.logging.SLF4JLogger.debug(SLF4JLogger.java:67)] X-Twitter-Client: Twitter4J
    2014-09-02 12:22:23,323 (Twitter Stream consumer-1[Establishing connection]) [DEBUG - twitter4j.internal.logging.SLF4JLogger.debug(SLF4JLogger.java:67)] Accept-Encoding: gzip
    2014-09-02 12:22:23,323 (Twitter Stream consumer-1[Establishing connection]) [DEBUG - twitter4j.internal.logging.SLF4JLogger.debug(SLF4JLogger.java:67)] User-Agent: twitter4j http://twitter4j.org/ /2.2.6
    2014-09-02 12:22:23,323 (Twitter Stream consumer-1[Establishing connection]) [DEBUG - twitter4j.internal.logging.SLF4JLogger.debug(SLF4JLogger.java:67)] X-Twitter-Client-Version: 2.2.6
    2014-09-02 12:22:23,323 (Twitter Stream consumer-1[Establishing connection]) [DEBUG - twitter4j.internal.logging.SLF4JLogger.debug(SLF4JLogger.java:67)] Connection: close

  36. (Twitter Stream consumer-1[Receiving stream]) [DEBUG - twitter4j.internal.logging.SLF4JLogger.debug(SLF4JLogger.java:67)] Twitter Stream consumer-1[Receiving stream]
    Exception in thread "Twitter4J Async Dispatcher[0]" java.lang.NoSuchMethodError: twitter4j.json.JSONObjectType.determine(Ltwitter4j/internal/org/json/JSONObject;)Ltwitter4j/json/JSONObjectType;
    at twitter4j.AbstractStreamImplementation$1.run(AbstractStreamImplementation.java:100)
    at twitter4j.internal.async.ExecuteThread.run(DispatcherImpl.java:116)
    2014-10-30 12:00:21,105 (conf-file-poller-0) [DEBUG - org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:126)] Checking file:conf/flume.conf for changes

    solution for above problem?

  37. This comment has been removed by the author.

  38. sir i m trying to stream twitter data in hdfs but when i run flume agent although data is seen streaming on command line ,when i m checking namenode i m getting empty directory.Data is not going to hdfs pls help

    1. Change the search keywords to more general ones, like good or bad

  39. sir ..im trying my level best to create an hive external table but i m always getting this error ...pls help me sir ... i need to resolve this this issue soon ..

    hive> CREATE External TABLE dandanaka (
    > id BIGINT,
    > created_at STRING,
    > source STRING,
    > favorited BOOLEAN,
    > retweet_count INT,
    > retweeted_status STRUCT<
    > text:STRING,
    > user:STRUCT,
    > retweet_count:INT>,
    > entities STRUCT<
    > urls:ARRAY>,
    > usermentions:ARRAY>,
    > hashtags:ARRAY>>,
    > text STRING,
    > user STRUCT<
    > screen_name:STRING,
    > name:STRING,
    > friends_count:INT,
    > followers_count:INT,
    > statuses_count:INT,
    > verified:BOOLEAN,
    > utc_offset:INT,
    > time_zone:STRING>,
    > in_reply_to_screen_name STRING
    > )
    > ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
    > LOCATION '/user/flume/tweets2';
    FailedPredicateException(identifier,{useSQL11ReservedKeywordsForIdentifier()}?)
    at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.identifier(HiveParser_IdentifiersParser.java:10924)
    at org.apache.hadoop.hive.ql.parse.HiveParser.identifier(HiveParser.java:45850)
    at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameColonType(HiveParser.java:38211)
    at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameColonTypeList(HiveParser.java:36342)
    at org.apache.hadoop.hive.ql.parse.HiveParser.structType(HiveParser.java:39707)
    at org.apache.hadoop.hive.ql.parse.HiveParser.type(HiveParser.java:38655)
    at org.apache.hadoop.hive.ql.parse.HiveParser.colType(HiveParser.java:38367)
    at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameType(HiveParser.java:38051)
    at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameTypeList(HiveParser.java:36203)
    at org.apache.hadoop.hive.ql.parse.HiveParser.createTableStatement(HiveParser.java:5214)
    at org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.java:2640)
    at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1650)
    at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1109)
    at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:202)
    at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:396)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)
    at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1122)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1170)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:213)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:165)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
    at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:736)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:681)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:621)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
    FAILED: ParseException line 9:4 Failed to recognize predicate 'user'. Failed rule: 'identifier' in column specification

  40. Is fluming twitter data to download twitter logs still available? or is it stopped? because i tried to flume twitter data today after long time with flume 1.4,java 1.6 but unable to download the twitter data

  41. Hello Everyone,
    I am getting this error when running the select query.Can anybody help?
    Diagnostic Messages for this Task:
    Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable Objavro.schema�
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:185)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
    Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable Objavro.schema�
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:501)
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:176)
    ... 8 more
    Caused by: org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('O' (code 79)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
    at [Source: java.io.ByteArrayInputStream@649b97b1; line: 1, column: 2]
    at org.apache.hive.hcatalog.data.JsonSerDe.deserialize(JsonSerDe.java:169)
    at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.readRow(MapOperator.java:136)
    at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.access$200(MapOperator.java:100)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:492)
    ... 9 more
    Caused by: org.codehaus.jackson.JsonParseException: Unexpected character ('O' (code 79)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
    at [Source: java.io.ByteArrayInputStream@649b97b1; line: 1, column: 2]
    at org.codehaus.jackson.JsonParser._constructError(JsonParser.java:1433)
    at org.codehaus.jackson.impl.JsonParserMinimalBase._reportError(JsonParserMinimalBase.java:521)
    at org.codehaus.jackson.impl.JsonParserMinimalBase._reportUnexpectedChar(JsonParserMinimalBase.java:442)
    at org.codehaus.jackson.impl.Utf8StreamParser._handleUnexpectedValue(Utf8StreamParser.java:2090)
    at org.codehaus.jackson.impl.Utf8StreamParser._nextTokenNotInObject(Utf8StreamParser.java:606)
    at org.codehaus.jackson.impl.Utf8StreamParser.nextToken(Utf8StreamParser.java:492)
    at org.apache.hive.hcatalog.data.JsonSerDe.deserialize(JsonSerDe.java:158)
    ... 12 more

    1. hi i am getting this error

      please help me if you find any solution.I am getting this error when creating external hive table for twitter data .

      " Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('O' (code 79)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
      at [Source: java.io.StringReader@6f6b3d33; line: 1, column: 2]
      "

  42. Hi,
    while creating the hive table i am getting following exception - ParseException line 9:6 Failed to recognize predicate 'user'. Failed rule: 'identifier' in column specification

    when i change user oclumn name to usr it creates the table but processes no data. Can you please help?

  43. Thanks Alot bro!!! you saved my life....

  44. Hi my hadoop user directory is not showing and twitter data. I am not getting any error while running following command
    bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent

    1. It might not throw any error, but the problem is you are not giving the correct configuration directory path. The command should be as below

      bin/flume-ng agent --conf <flume configuration directory path> -f <flume configuration file path> -Dflume.root.logger=DEBUG,console -n TwitterAgent

      "flume configuration directory path" will usually be "/etc/flume-ng/conf"

      "flume configuration file path" will usually be "/etc/flume-ng/conf/flume.conf"

      and make sure you have created a directories "/user/flume/tweets" in hdfs.

  45. Hello,

    Can we define tweet rule while using "twitteragent.sources.twitter.keywords" property to fetch tweets specifically?

    TIA

  46. Hi Everyone ,

    Am New to Hadoop, am getting the below error ,Could u please help me to sort the issue:-


    hive> create table hashtags as select id as id,entities.hashtags.text as words from tweets;
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
    Query ID = esak_20160823195256_135b0bdf-3fbb-4f69-8402-6f910afd0e9e
    Total jobs = 3
    Launching Job 1 out of 3
    Number of reduce tasks is set to 0 since there's no reduce operator
    ENOENT: No such file or directory
    at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmodImpl(Native Method)
    at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmod(NativeIO.java:230)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:724)
    at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:502)
    at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:600)
    at org.apache.hadoop.mapreduce.JobResourceUploader.uploadFiles(JobResourceUploader.java:94)
    at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:95)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:190)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
    at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
    at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:561)
    at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:433)
    at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:138)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:197)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1858)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1562)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1313)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1084)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1072)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:232)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:183)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:399)
    at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:776)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:714)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
    Job Submission failed with exception 'org.apache.hadoop.io.nativeio.NativeIOException(No such file or directory)'
    FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. No such file or directory


    please Help me to solve this.............

  47. Is fluming twitter data to download twitter logs still available? or is it stopped?
    C:\>curl http://stream.twitter.com 443
    curl: (7) Failed to connect to stream.twitter.com port 80: Timed out
    curl: (6) Could not resolve host: 443

    can anyone explain.. ?

  49. Hi,

    As per the above steps, I have fetched the Twitter data into HDFS and created a table in Hive. I am unable to fetch the data into Hive Table. Facing the following error. Please help me out of it.

    hive> SELECT t.retweeted_screen_name, sum(retweets) AS total_retweets, count(*) AS tweet_count FROM (SELECT retweeted_status.user.screen_name as retweeted_screen_name, retweeted_status.text, max(retweet_count) as retweets FROM tweets GROUP BY retweeted_status.user.screen_name, retweeted_status.text) t GROUP BY t.retweeted_screen_name ORDER BY total_retweets DESC LIMIT 10;
    Query ID = cloudera_20161031070909_33a45a44-9a7b-49e8-a754-e694ff554d67
    Total jobs = 2
    Launching Job 1 out of 2
    Number of reduce tasks not specified. Estimated from input data size: 1
    In order to change the average load for a reducer (in bytes):
    set hive.exec.reducers.bytes.per.reducer=
    In order to limit the maximum number of reducers:
    set hive.exec.reducers.max=
    In order to set a constant number of reducers:
    set mapreduce.job.reduces=
    java.lang.StackOverflowError
    at java.net.URI$Parser.checkChars(URI.java:3000)
    at java.net.URI$Parser.parseHierarchical(URI.java:3086)
    at java.net.URI$Parser.parse(URI.java:3044)
    at java.net.URI.(URI.java:595)
    at java.net.URI.create(URI.java:857)
    at org.apache.hadoop.fs.FileContext.getFileContext(FileContext.java:473)
    at org.apache.hadoop.fs.FileContext.getFileContext(FileContext.java:447)
    at org.apache.hadoop.fs.FileContext.getFileContext(FileContext.java:473)
    at org.apache.hadoop.fs.FileContext.getFileContext(FileContext.java:447)
    at org.apache.hadoop.fs.FileContext.getFileContext(FileContext.java:473)
    at org.apache.hadoop.fs.FileContext.getFileContext(FileContext.java:447)

    FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. null

    1. I am facing the same error . If you have figured out how to fix this , please let me know

  50. I have the same error. If you have figured out how to fix it, can you please tell me.

  51. While running select * query from data loaded in hive, I am geting result of 1 tweet only when using limit 1. But if I run general select query without any limit, then I get error :
    Failed with exception java.io.IOException:org.apache.avro.AvroRuntimeException: java.io.IOException: Block size invalid or too large for this implementation: -40

    Suggest solution

  52. Hi i followed all the steps as per your post but i am facing errors.. anyone can help me??

    INFO twitter4j.TwitterStreamImpl: 404:The URI requested is invalid or the resource requested, such as a user, does not exist.
    Unknown URL. See Twitter Streaming API documentation at http://dev.twitter.com/pages/streaming_api

  53. Hi

    I'm getting the below error, while executing this query,
    SELECT t.retweeted_screen_name, sum(retweets) AS total_retweets, count(*) AS tweet_count FROM (SELECT retweeted_status.user.screen_name as retweeted_screen_name, retweeted_status.text, max(retweet_count) as retweets FROM tweets GROUP BY retweeted_status.user.screen_name, retweeted_status.text) t GROUP BY t.retweeted_screen_name ORDER BY total_retweets DESC LIMIT 10;

    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
    2017-10-05 04:52:12,029 Stage-1 map = 0%, reduce = 0%
    2017-10-05 04:52:29,965 Stage-1 map = 100%, reduce = 100%
    Ended Job = job_1506695677341_0040 with errors
    Error during job, obtaining debugging information...
    Examining task ID: task_1506695677341_0040_m_000000 (and more) from job job_1506695677341_0040

    Task with the most failures(4):
    -----
    Task ID:
    task_1506695677341_0040_m_000000

    URL:
    http://0.0.0.0:8088/taskdetails.jsp?jobid=job_1506695677341_0040&tipid=task_1506695677341_0040_m_000000
    -----
    Diagnostic Messages for this Task:
    Error: java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:446)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
    Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
    ... 9 more
    Caused by: java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
    at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38)
    ... 14 more
    Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
    ... 17 more
    Caused by: java.lang.RuntimeException: Map operator initialization failed
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:157)
    ... 22 more
    Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassNotFoundException: Class com.cloudera.hive.serde.JSONSerDe not found
    at org.apache.hadoop.hive.ql.exec.MapOperator.getConvertedOI(MapOperator.java:334)
    at org.apache.hadoop.hive.ql.exec.MapOperator.setChildren(MapOperator.java:352)
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:126)
    ... 22 more
    Caused by: java.lang.ClassNotFoundException: Class com.cloudera.hive.serde.JSONSerDe not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:195)

    Could you please tel me, what is the issue

  54. I want to store tweets in a Kafka topic. What changes are required? Can anyone suggest? I am new to the Hadoop World

  55. i am having problem with creating table in hive

    I Have added the jar as mentioned above
    and in create table command
    it shows the error


    FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org/apache/hadoop/hive/serde2/SerDe


  56. Hello,

    I have successfully fetched twitter data using flume and stored it in HDFS, but I was using pig for further analysis and I am not able to clean the data because the recorded tweets are in Arabic, Chinese and some other languages including English. The problem I face here is that I am not able to get only the English tweets
