Monday, March 11, 2013

Analyse Tweets using Flume, Hadoop and Hive

Note : Also check out another entry on how to get some interesting facts from Twitter using R here, this entry on how to use Oozie to automate the workflow below, and a newer blog on how to do the same analytics with Pig (using elephant-bird).

It's not a hard rule, but roughly 80% of all data is unstructured, while the remaining 20% is structured. An RDBMS helps to store and process the structured 20%, while Hadoop solves the problem of storing and processing both types of data. The good thing about Hadoop is that it scales incrementally, with less CAPEX in terms of software and hardware.

With the ever-increasing usage of smart devices and high-speed internet, unstructured data has been growing at a very fast rate. It's common to tweet from a smartphone, or to take a picture and share it on Facebook.

In this blog we will get Tweets using Flume and save them into HDFS for later analysis. Twitter exposes an API (more here) to fetch the Tweets. The service is free, but requires the user to register. Cloudera wrote a three-part series (1, 2, 3) on Twitter analysis using Hadoop; the code for it is here. For the impatient, I will quickly summarize how to get the data into HDFS using Flume and start doing some analytics using Hive.

Flume is built around agents. An agent hosts sources, channels and sinks: a source pushes/pulls the data and sends it to one or more channels, which in turn deliver the data to the sinks. Flume thereby decouples the source (Twitter) from the sink (HDFS) in this case, so both can operate at different speeds, and it's much easier to add new sources and sinks. Flume comes with a set of sources, channels and sinks, and new ones can be implemented by extending the Flume base classes.


1) The first step is to create an application in https://dev.twitter.com/apps/ and then generate the corresponding keys.


2) Assuming that Hadoop has already been installed and configured, the next step is to download Flume and extract it to any folder.

3) Download the flume-sources-1.0-SNAPSHOT.jar and add it to the Flume classpath in the conf/flume-env.sh file, as shown below
FLUME_CLASSPATH="/home/training/Installations/apache-flume-1.3.1-bin/flume-sources-1.0-SNAPSHOT.jar" 

The jar contains the java classes to pull the Tweets and save them into HDFS.
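For the curious, here is a minimal sketch of what such a custom source looks like against the Flume 1.x API (the class name MySource and the hard-coded event body are placeholders for illustration only; the actual TwitterSource additionally wires up the twitter4j streaming client and its callbacks):

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDrivenSource;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.source.AbstractSource;

public class MySource extends AbstractSource implements EventDrivenSource, Configurable {
    private String keywords;

    @Override
    public void configure(Context context) {
        // Picks up properties such as TwitterAgent.sources.Twitter.keywords from flume.conf
        keywords = context.getString("keywords", "");
    }

    @Override
    public void start() {
        super.start();
        // In a real source this runs for every record received from the external
        // service; each event is handed to the configured channel(s).
        Event event = EventBuilder.withBody("raw JSON tweet".getBytes());
        getChannelProcessor().processEvent(event);
    }

    @Override
    public void stop() {
        super.stop();
    }
}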

4) The conf/flume.conf should have the agent's source (Twitter), channel (memory) and sink (HDFS) defined as below
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <consumerKey>
TwitterAgent.sources.Twitter.consumerSecret = <consumerSecret>
TwitterAgent.sources.Twitter.accessToken = <accessToken>
TwitterAgent.sources.Twitter.accessTokenSecret = <accessTokenSecret>

TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics, bigdata, cloudera, data science, data scientist, business intelligence, mapreduce, data warehouse, data warehousing, mahout, hbase, nosql, newsql, businessintelligence, cloudcomputing

TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

The consumerKey, consumerSecret, accessToken and accessTokenSecret have to be replaced with those obtained from https://dev.twitter.com/apps. And TwitterAgent.sinks.HDFS.hdfs.path should point to the NameNode and the location in HDFS where the tweets will be written.

The TwitterAgent.sources.Twitter.keywords value can be modified to get the tweets for some other topic like football, movies etc., as in the example below.
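For instance, a hypothetical keyword list for tracking football-related tweets would look like:

TwitterAgent.sources.Twitter.keywords = football, fifa, worldcup, premierleague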

5) Start Flume using the below command
bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent
After a couple of minutes the Tweets should appear in HDFS.
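To verify, list the sink directory (the path assumes the hdfs.path configured above; FlumeData is the HDFS sink's default file prefix):

hadoop fs -ls /user/flume/tweets
hadoop fs -cat '/user/flume/tweets/FlumeData.*' | head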


6) Next download and extract Hive. Modify the conf/hive-site.xml to include the locations of the NameNode and the JobTracker as below
<configuration>
     <property>
         <name>fs.default.name</name>
         <value>hdfs://localhost:9000</value>
     </property>
     <property>
         <name>mapred.job.tracker</name>
         <value>localhost:9001</value>
     </property>
</configuration>
7) Download hive-serdes-1.0-SNAPSHOT.jar to the lib directory in Hive. Twitter returns Tweets in the JSON format, and this library will help Hive parse the JSON.

8) Start the Hive shell using the hive command and register the hive-serdes-1.0-SNAPSHOT.jar file downloaded earlier. Note that ADD JAR only lasts for the current session, so it has to be rerun in every new Hive shell.
ADD JAR /home/training/Installations/hive-0.9.0/lib/hive-serdes-1.0-SNAPSHOT.jar;
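To confirm that the jar got registered in the current session, Hive's LIST JARS command should show it:

LIST JARS;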
9) Now, create the tweets table in Hive
CREATE EXTERNAL TABLE tweets (
   id BIGINT,
   created_at STRING,
   source STRING,
   favorited BOOLEAN,
   retweet_count INT,
   retweeted_status STRUCT<
      text:STRING,
      user:STRUCT<screen_name:STRING,name:STRING>>,
   entities STRUCT<
      urls:ARRAY<STRUCT<expanded_url:STRING>>,
      user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
      hashtags:ARRAY<STRUCT<text:STRING>>>,
   text STRING,
   user STRUCT<
      screen_name:STRING,
      name:STRING,
      friends_count:INT,
      followers_count:INT,
      statuses_count:INT,
      verified:BOOLEAN,
      utc_offset:INT,
      time_zone:STRING>,
   in_reply_to_screen_name STRING
) 
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets';
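A note for those on newer Hive releases: from Hive 1.2 onwards user is a reserved keyword, so the DDL above fails with a ParseException ("Failed to recognize predicate 'user'"). Quoting the identifier with backticks, i.e. `user` STRUCT<...>, is the portable fix; on Hive 1.2.x the keyword check can also be relaxed for the session (later versions dropped this property):

set hive.support.sql11.reserved.keywords=false;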
Now that we have the data in HDFS and the table created in Hive, let's run some queries in Hive.

One way to determine the most influential person in a particular field is to figure out whose tweets are retweeted the most. Give Flume enough time to collect Tweets from Twitter into HDFS and then run the below query in Hive to determine the most influential person.
SELECT t.retweeted_screen_name, SUM(retweets) AS total_retweets, COUNT(*) AS tweet_count
FROM (SELECT retweeted_status.user.screen_name AS retweeted_screen_name,
             retweeted_status.text,
             MAX(retweet_count) AS retweets
      FROM tweets
      GROUP BY retweeted_status.user.screen_name, retweeted_status.text) t
GROUP BY t.retweeted_screen_name
ORDER BY total_retweets DESC
LIMIT 10;
Similarly, to know which user has the most followers, the below query helps.
SELECT user.screen_name, user.followers_count c FROM tweets ORDER BY c DESC;
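Along the same lines, here is an untested sketch that ranks the most used hashtags in the collected tweets by exploding the entities.hashtags array:

SELECT LOWER(hashtag.text) AS tag, COUNT(*) AS total
FROM tweets LATERAL VIEW EXPLODE(entities.hashtags) tags AS hashtag
GROUP BY LOWER(hashtag.text)
ORDER BY total DESC
LIMIT 10;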
For the sake of simplicity, partitions have not been created in Hive. Partitions can be created in Hive using Oozie at regular intervals to make the queries run faster when they are restricted to a particular period of time. Creating partitions will be covered in another blog, but a rough sketch of the idea follows.
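The sketch below is illustrative only: the abbreviated column list, the datehour value and the per-hour directory layout are assumptions (they presume the Flume sink is configured to roll files into date-based directories):

-- abbreviated columns; the real table would repeat the full column list above
CREATE EXTERNAL TABLE tweets_partitioned (
   id BIGINT,
   created_at STRING,
   text STRING
)
PARTITIONED BY (datehour INT)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets';

-- register one hour of data as a partition
ALTER TABLE tweets_partitioned ADD IF NOT EXISTS
PARTITION (datehour = 2013031100) LOCATION '/user/flume/tweets/2013/03/11/00';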

Happy Hadooping !!!

Edit (21st March, 2013) :  Hortonworks blogged a two-part series (1 and 2) on Twitter data processing using Hive.

Edit (30th March, 2016) : With the latest version of Flume, the following error is thrown because of conflicting twitter4j libraries

java.lang.NoSuchMethodError: twitter4j.FilterQuery.setIncludeEntities(Z)Ltwitter4j/FilterQuery;
at com.cloudera.flume.source.TwitterSource.start(TwitterSource.java:139)

One solution is to remove the below libraries from the Flume lib folder. There are a couple more solutions in this StackOverflow article.

lib/twitter4j-core-3.0.3.jar
lib/twitter4j-media-support-3.0.3.jar
lib/twitter4j-stream-3.0.3.jar
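One non-destructive way is to move them out of the lib folder instead of deleting them (this assumes FLUME_HOME points at the extracted Flume directory):

cd $FLUME_HOME
mkdir -p lib-disabled
mv lib/twitter4j-core-3.0.3.jar lib/twitter4j-media-support-3.0.3.jar lib/twitter4j-stream-3.0.3.jar lib-disabled/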

85 comments:

  1. Awesome Post Praveen!! Just to clarify, should flume-env.sh be in the conf folder or the bin folder? When I followed the Apache Flume installation steps I created flume-env.sh in the conf folder.

    1. The flume-env.sh should be in the conf folder and not the bin folder. I have updated the blog.

    2. hi praveen
      In which format are the Tweets? I have loaded the tweets data into a hive table. Now when I run a command in hive like select * from tablename, it's not giving the sql structure. Please help me on this praveen

    3. when executing flume command i am getting below exception
      2016-09-28 17:38:25,545 (Twitter Stream consumer-1[Establishing connection]) [INFO - twitter4j.internal.logging.SLF4JLogger.info(SLF4JLogger.java:83)] Waiting for 10000 milliseconds
      2016-09-28 17:38:35,547 (Twitter Stream consumer-1[Waiting for 10000 milliseconds]) [ERROR - org.apache.flume.source.twitter.TwitterSource.onException(TwitterSource.java:331)] Exception while streaming tweets
      404:The URI requested is invalid or the resource requested, such as a user, does not exist.
      Unknown URL. See Twitter Streaming API documentation at http://dev.twitter.com/pages/streaming_api

      if u know pls let me know

    4. Please post the solution for this. Even I am facing this issue.
      Thanks

  2. Thanks for posting awesome tutorial. I have configured hadoop cluster and now I am trying to follow your tutorial for twitter data analysis. I have installed flume and completed until step 3. For the step 4 when you mention conf/flume.conf are you referring to flume-conf.properties.template file inside conf folder? If not, where can I find that file and define agent fields. Please let me know. I would really appreciate your help. Thanks in advance!!

    1. Reema, please copy flume-conf.properties.template to flume.conf and replace the configurations mentioned above (in step 4)

  3. I am not getting the tweets from twitter... the process is stopping at this point mentioned below.... !! Nothing beyond this step.

    13/04/18 04:13:11 INFO instrumentation.MonitoredCounterGroup: Monitoried counter group for type: SINK, name: HDFS, registered successfully.
    13/04/18 04:13:11 INFO instrumentation.MonitoredCounterGroup: Component type: SINK, name: HDFS started

    ****************************************************************
    I have been trying to change some files due the error i have been observing.

    In flume.conf have changed the TwitterAgent.sources = TwitterSource

    whereas the original content is TwitterAgent.sources = Twitter.


  4. This comment has been removed by the author.

    1. I need your help.. could you please help me out?
      I'm using apache-flume-1.3.1 to move a log file into HDFS. Flume does write the file to the HDFS sink, but it can't pull all of the content: I have a 9.7 MB log file (stored temporarily on the local disk), yet Flume stores only about 3.5 KB of it in HDFS. If I refresh the original log file it stores close to 2 MB, but the lines come through shuffled, some lines are dropped, and it sometimes starts from the middle of a line instead of the first line. What can I do, and how should I configure Flume to solve this problem?
      Thank you
      regards

  5. Thanks for the great post, however after running step 5) l get the following "ERROR properties.PropertiesFileConfigurationProvider: Failed to load configuration data. Exception follows org.apache.flume.FlumeException: Unable to load source type: com.cloudera.flume.source.TwitterSource".

    Please assist.

    1. Looks like you need to make sure your FLUME_CLASSPATH is pointing to the .jar file.

  6. In step 5, I had no tweets coming into HDFS.

    But I found time out information in the console.

    Anyone can help?

    2013-05-10 14:25:53,941 (Twitter Stream consumer-1[Establishing connection]) [DEBUG - twitter4j.internal.logging.SLF4JLogger.debug(SLF4JLogger.java:75)] Post Params: count=0&track=hadoop%2Cbig%20data%2Canalytics%2Cbigdata%2Ccloudera%2Cdata%20science%2Cdata%20scientist%2Cbusiness%20intelligence%2Cmapreduce%2Cdata%20warehouse%2Cdata%20warehousing%2Cmahout%2Chbase%2Cnosql%2Cnewsql%2Cbusinessintelligence%2Ccloudcomputing&include_entities=true
    2013-05-10 14:26:07,436 (conf-file-poller-0) [DEBUG - org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:126)] Checking file:conf/flume.conf for changes
    2013-05-10 14:26:14,039 (Twitter Stream consumer-1[Establishing connection]) [INFO - twitter4j.internal.logging.SLF4JLogger.info(SLF4JLogger.java:83)] connect timed out
    2013-05-10 14:26:14,039 (Twitter Stream consumer-1[Establishing connection]) [INFO - twitter4j.internal.logging.SLF4JLogger.info(SLF4JLogger.java:83)] Waiting for 16000 milliseconds

  7. Hi Praveen,

    I got tweets coming into HadoopFS but when we are writing various queries i am getting this error: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask

    LMK what needs to be done here to run your queries?

    Thanks,
    NT

    1. This comment has been removed by the author.

    2. I also got the same error..

      just add hive-serdes-1.0-SNAPSHOT.jar to you hive shell ..

  8. Does anyone know why I might be getting the following error?
    [ERROR - org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:253)] Unable to start EventDrivenSourceRunner: { source:com.cloudera.flume.source.TwitterSource{name:Twitter,state:IDLE} } - Exception follows.
    java.lang.NoSuchMethodError: twitter4j.FilterQuery.setIncludeEntities(Z)Ltwitter4j/FilterQuery;

    1. This comment has been removed by the author.

    2. I got the same error. This is how I resolved it
      a) First checked the jar file to make sure that FilterQuery.class exists in flume-sources-1.0-SNAPSHOT.jar
      [root@localhost lib]# /usr/java/jdk1.6.0_31/bin/jar tvf /tmp/flume-sources-1.0-SNAPSHOT.jar
      You may see an item
      4451 Tue Nov 13 10:06:42 PST 2012 twitter4j/FilterQuery.class
      So it is there.
      Now need to verify that the class has the method setIncludeEntities
      Extracted the jar file and viewed the source code, I used show my code online class decompiler. See
      http://www.showmycode.com/
      Found that the method setIncludeEntities exists. So FLUME_CLASSPATH ( flume-sources-1.0-SNAPSHOT.jar) is correct.

      Now, it seems the issues is another jar file which has the same class, but does not have this method is getting picked by the flume agent java process and causing the issue.

      So we need to find that jar, for that we need the full class path when starting the flume agent

      I run the below command as posted in the blog

      [root@localhost lib]# /usr/bin/flume-ng agent start --conf /etc/flume-ng/conf/ -f /etc/flume-ng/conf/flume.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent

      Which shows a very lengthy classpath

      [root@localhost lib]# /usr/bin/flume-ng agent start --conf /etc/flume-ng/conf/ -f /etc/flume-ng/conf/flume.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent
      Info: Sourcing environment configuration script /etc/flume-ng/conf/flume-env.sh
      Info: Including Hadoop libraries found via (/usr/bin/hadoop) for HDFS access
      Info: Excluding /usr/lib/hadoop/lib/slf4j-api-1.6.1.jar from classpath
      Info: Excluding /usr/lib/hadoop/lib/slf4j-log4j12-1.6.1.jar from classpath
      Info: Excluding /usr/lib/hadoop-0.20-mapreduce/lib/slf4j-api-1.6.1.jar from classpath
      Info: Including HBASE libraries found via (/usr/bin/hbase) for HBASE access
      Info: Excluding /usr/lib/hbase/bin/../lib/slf4j-api-1.6.1.jar from classpath
      Info: Excluding /usr/lib/zookeeper/lib/slf4j-api-1.6.1.jar from classpath
      Info: Excluding /usr/lib/zookeeper/lib/slf4j-log4j12-1.6.1.jar from classpath
      Info: Excluding /usr/lib/hadoop/lib/slf4j-api-1.6.1.jar from classpath
      Info: Excluding /usr/lib/hadoop/lib/slf4j-log4j12-1.6.1.jar from classpath
      Info: Excluding /usr/lib/hadoop-0.20-mapreduce/lib/slf4j-api-1.6.1.jar from classpath
      + exec /usr/java/jdk1.6.0_31/bin/java -Xmx20m -Dflume.root.logger=DEBUG,console -cp '/etc/flume-ng/conf:/usr/lib/flume-ng/lib/*:/tmp/flume-sources-1.0-SNAPSHOT.jar:/etc/hadoop/conf:/usr/lib/hadoop/lib/activation-1.1.jar:/usr/lib/hadoop/lib/asm-3.2.jar:..........................................................

      Note the beginning of the classpath
      -cp '/etc/flume-ng/conf:/usr/lib/flume-ng/lib/*:/tmp/flume-sources-1.0-SNAPSHOT.jar:

      See the jar files in /usr/lib/flume-ng/lib/*

      Check the content of the jar files for FilterQuery.class

      [root@localhost lib]# find . -name "*.jar" | xargs grep FilterQuery.class
      Binary file ./search-contrib-0.9.1-cdh4.3.0-SNAPSHOT-jar-with-dependencies.jar matches
      [root@localhost lib]#

      So we have another JAR file search-contrib-0.9.1-cdh4.3.0-SNAPSHOT-jar-with-dependencies.jar with the same class and conflicting with correct one in FLUME_CLASSPATH

      Temporarily rename it to .org extension, so that it will be excluded from classpath at the startup

      [root@localhost lib]# mv search-contrib-0.9.1-cdh4.3.0-SNAPSHOT-jar-with-dependencies.jar search-contrib-0.9.1-cdh4.3.0-SNAPSHOT-jar-with-dependencies.jar.org
      [root@localhost lib]#

      Now start the flume agent again
      [root@localhost lib]# /usr/bin/flume-ng agent start --conf /etc/flume-ng/conf/ -f /etc/flume-ng/conf/flume.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent

      It worked like a charm. I do not know the effect of temporarily renaming and excluding the search-contrib-0.9.1-cdh4.3.0-SNAPSHOT-jar-with-dependencies.jar from the classpath, on other components..but it worked for this scenario..I will continue my investigation further...
      If anybody has a better solution, please post it



    3. Hi Praveen and thanks for this tutorial,
      I have the same problem and followed your procedure. The other JAR file conflicting with flume-sources-1.0-SNAPSHOT.jar is twitter4j-stream-3.0.3.jar. Obviously, that last file cannot be renamed without error :

      2013-10-04 10:41:29,873 (Twitter Stream consumer-1[Establishing connection]) [INFO - twitter4j.internal.logging.SLF4JLogger.info(SLF4JLogger.java:83)] 401:Authentication credentials (https://dev.twitter.com/pages/auth) were missing or incorrect. Ensure that you have set valid consumer key/secret, access token/secret, and the system clock is in sync.

      Can you help me?

  9. Hi praveen,
    i have done all the steps in the blog and i got the same results.now how can we create partitions in hive using oozie at regular intervals?
    and i want to schedule them in oozie. Is it possible?

    1. I am not able to create the hive table and I am using Apache Hadoop, not Cloudera. Can you please help me?

  10. Hi,

    I am getting below error

    HDFS IO error
    java.io.IOException: Callable timed out after 10000 ms on file: hdfs://localhost:8502/user/flume/tweets//FlumeData.1375149990564.tmp
    at org.apache.flume.sink.hdfs.BucketWriter.callWithTimeout(BucketWriter.java:550)
    at org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:220)
    at org.apache.flume.sink.hdfs.BucketWriter.append(BucketWriter.java:383)
    at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:392)
    at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
    at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
    at java.lang.Thread.run(Thread.java:679)
    Caused by: java.util.concurrent.TimeoutException
    at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:258)
    at java.util.concurrent.FutureTask.get(FutureTask.java:119)
    at org.apache.flume.sink.hdfs.BucketWriter.callWithTimeout(BucketWriter.java:543)
    ... 6 more

    could you please help me out ?

  11. This comment has been removed by the author.

  12. Hi Praveen/Guys,
    I ran the same code on my HortonWorks cluster and once the sink and channel have started, the program times out trying to connect to twitter.

    Is this because this code is cloudera specific as I see here in conf file.

    TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource

    Would be glad if someone could help.

    Here is the timout portion of the log:

    25 Sep 2013 07:47:54,763 INFO [conf-file-poller-0] (org.apache.flume.node.nodemanager.DefaultLogicalNodeManager.startAllComponents:141) - Starting Sink HDFS
    25 Sep 2013 07:47:54,763 INFO [conf-file-poller-0] (org.apache.flume.node.nodemanager.DefaultLogicalNodeManager.startAllComponents:152) - Starting Source Twitter
    25 Sep 2013 07:47:54,766 INFO [lifecycleSupervisor-1-0] (org.apache.flume.instrumentation.MonitoredCounterGroup.register:89) - Monitoried counter group for type: SINK, name: HDFS, registered successfully.
    25 Sep 2013 07:47:54,766 INFO [lifecycleSupervisor-1-0] (org.apache.flume.instrumentation.MonitoredCounterGroup.start:73) - Component type: SINK, name: HDFS started
    25 Sep 2013 07:47:54,770 INFO [Twitter Stream consumer-1[initializing]] (twitter4j.internal.logging.SLF4JLogger.info:83) - Establishing connection.
    25 Sep 2013 07:48:15,182 INFO [Twitter Stream consumer-1[Establishing connection]] (twitter4j.internal.logging.SLF4JLogger.info:83) - connect timed out
    .
    .
    .
    .
    .

    1. i got the same error.....are you able to resolve it

  13. -Dflume.root.logger=DEBUG,console without this line i got error any idea why? otherwise fine.

  14. Hi Praveen/Guys,

    I am getting an error while getting Tweets into HDFS. Below is the logs for the same.

    13/11/21 06:00:01 INFO twitter4j.TwitterStreamImpl: Establishing connection.
    13/11/21 06:00:02 INFO twitter4j.TwitterStreamImpl: stream.twitter.com
    13/11/21 06:00:02 INFO twitter4j.TwitterStreamImpl: Waiting for 250 milliseconds
    13/11/21 06:00:02 INFO twitter4j.TwitterStreamImpl: Establishing connection.
    13/11/21 06:00:02 INFO twitter4j.TwitterStreamImpl: stream.twitter.com
    13/11/21 06:00:02 INFO twitter4j.TwitterStreamImpl: Waiting for 500 milliseconds
    13/11/21 06:00:02 INFO twitter4j.TwitterStreamImpl: Establishing connection.
    13/11/21 06:00:02 INFO twitter4j.TwitterStreamImpl: stream.twitter.com
    13/11/21 06:00:02 INFO twitter4j.TwitterStreamImpl: Waiting for 1000 milliseconds
    13/11/21 06:00:03 INFO twitter4j.TwitterStreamImpl: Establishing connection.
    13/11/21 06:00:03 INFO twitter4j.TwitterStreamImpl: stream.twitter.com
    13/11/21 06:00:03 INFO twitter4j.TwitterStreamImpl: Waiting for 2000 milliseconds
    13/11/21 06:00:05 INFO twitter4j.TwitterStreamImpl: Establishing connection.
    13/11/21 06:00:05 INFO twitter4j.TwitterStreamImpl: stream.twitter.com
    13/11/21 06:00:05 INFO twitter4j.TwitterStreamImpl: Waiting for 4000 milliseconds
    13/11/21 06:00:09 INFO twitter4j.TwitterStreamImpl: Establishing connection.
    13/11/21 06:00:09 INFO twitter4j.TwitterStreamImpl: stream.twitter.com
    13/11/21 06:00:09 INFO twitter4j.TwitterStreamImpl: Waiting for 8000 milliseconds
    13/11/21 06:00:17 INFO twitter4j.TwitterStreamImpl: Establishing connection.
    13/11/21 06:00:17 INFO twitter4j.TwitterStreamImpl: stream.twitter.com
    13/11/21 06:00:17 INFO twitter4j.TwitterStreamImpl: Waiting for 16000 milliseconds
    13/11/21 06:00:33 INFO twitter4j.TwitterStreamImpl: Establishing connection.

    Can i get some help ?

  15. I've been getting the some error when i'm trying to run this

    /usr/bin/flume-ng agent -n kings-river-flume -c conf -f /usr/lib/flume-ng/conf/flume.conf ( i used the default " kings-river-flume" twitter agent name (becouse i couldn't modify /etc/default/flume-ng-agent file) just hoping this was the problem ) but i'm getting this error

    13/11/18 16:44:38 INFO lifecycle.LifecycleSupervisor: Starting lifecycle supervisor 1
    13/11/18 16:44:38 INFO node.FlumeNode: Flume node starting - kings-river-flume
    13/11/18 16:44:38 INFO nodemanager.DefaultLogicalNodeManager: Node manager starting
    13/11/18 16:44:38 INFO lifecycle.LifecycleSupervisor: Starting lifecycle supervisor 9
    13/11/18 16:44:38 INFO properties.PropertiesFileConfigurationProvider: Configuration provider starting
    13/11/18 16:44:39 INFO properties.PropertiesFileConfigurationProvider: Reloading configuration file:/usr/lib/flume-ng/conf/flume.conf
    13/11/18 16:44:39 INFO conf.FlumeConfiguration: Added sinks: HDFS Agent: TwitterAgent
    13/11/18 16:44:39 INFO conf.FlumeConfiguration: Processing:HDFS
    13/11/18 16:44:39 INFO conf.FlumeConfiguration: Processing:HDFS
    13/11/18 16:44:39 INFO conf.FlumeConfiguration: Processing:HDFS
    13/11/18 16:44:39 INFO conf.FlumeConfiguration: Processing:HDFS
    13/11/18 16:44:39 INFO conf.FlumeConfiguration: Processing:HDFS
    13/11/18 16:44:39 INFO conf.FlumeConfiguration: Processing:HDFS
    13/11/18 16:44:39 INFO conf.FlumeConfiguration: Processing:HDFS
    13/11/18 16:44:39 INFO conf.FlumeConfiguration: Processing:HDFS
    13/11/18 16:44:39 INFO conf.FlumeConfiguration: Post-validation flume configuration contains configuration for agents: [TwitterAgent]
    13/11/18 16:44:39 WARN properties.PropertiesFileConfigurationProvider: No configuration found for this host:kings-river-flume
    13/11/18 16:45:05 INFO node.FlumeNode: Flume node stopping - kings-river-flume
    13/11/18 16:45:05 INFO lifecycle.LifecycleSupervisor: Stopping lifecycle supervisor 8
    13/11/18 16:45:05 INFO properties.PropertiesFileConfigurationProvider: Configuration provider stopping
    13/11/18 16:45:05 INFO nodemanager.DefaultLogicalNodeManager: Node manager stopping
    13/11/18 16:45:05 INFO lifecycle.LifecycleSupervisor: Stopping lifecycle supervisor 8

    Can anybody help?

  16. Hello All,

    I am receiving following error:

    13/11/27 00:11:15 INFO twitter4j.TwitterStreamImpl: Establishing connection.
    13/11/27 00:11:17 INFO twitter4j.TwitterStreamImpl: 406:Returned by the Search API when an invalid format is specified in the request.
    Returned by the Streaming API when one or more of the parameters are not suitable for the resource. The track parameter, for example, would throw this error if:
    The track keyword is too long or too short.
    The bounding box specified is invalid.
    No predicates defined for filtered resource, for example, neither track nor follow parameter defined.
    Follow userid cannot be read.
    Parameter track item index 0 too short:

    13/11/27 00:11:17 WARN twitter4j.TwitterStreamImpl: Parameter not accepted with the role. 406:Returned by the Search API when an invalid format is specified in the request.
    Returned by the Streaming API when one or more of the parameters are not suitable for the resource. The track parameter, for example, would throw this error if:
    The track keyword is too long or too short.
    The bounding box specified is invalid.
    No predicates defined for filtered resource, for example, neither track nor follow parameter defined.
    Follow userid cannot be read.
    Parameter track item index 0 too short:

    Can someone help or guide me?

    I defined the keywords exactly as in the example above...

    Many thanks,

    Filip

  17. Thanks Praveen, debugging tips helped me

  18. Thanks Suresh, nice work around. Solved the problem for me.

  19. I am facing this error althogh keys are correct

    - 401:Authentication credentials (https://dev.twitter.com/docs/auth) were missing or incorrect. Ensure that you have set valid consumer key/secret, access token/secret, and the system clock is in sync.

    For your kind help

    1. Make sure you have tested your credentials(access token/secret) in Twitter site.

    2. hi attia elsayed iam also facing that problem did you find any solution????
      plz reply

  20. Thanks Suresh
    A very nice blog ...configured Apache flume in seconds to get data from twitter...

  21. hello,
    i am getting following error ,can someone help please

    ERROR node.PollingPropertiesFileConfigurationProvider: Failed to load configuration data. Exception follows.
    org.apache.flume.FlumeException: Unable to load source type: com.cloudera.flume.source.TwitterSource, class: com.cloudera.flume.source.TwitterSource
    at org.apache.flume.source.DefaultSourceFactory.getClass(DefaultSourceFactory.java:67)
    at org.apache.flume.source.DefaultSourceFactory.create(DefaultSourceFactory.java:40)
    at org.apache.flume.node.AbstractConfigurationProvider.loadSources(AbstractConfigurationProvider.java:327)
    at org.apache.flume.node.AbstractConfigurationProvider.getConfiguration(AbstractConfigurationProvider.java:102)
    at org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:140)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
    Caused by: java.lang.ClassNotFoundException: com.cloudera.flume.source.TwitterSource
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:190)
    at org.apache.flume.source.DefaultSourceFactory.getClass(DefaultSourceFactory.java:65)

    1. Have you got any solution to this above error

  22. Hi Guys!!
    I m pretty new to this field. I just wanted to know whether I could install flume,hive,oozie,etc without setting up the cloudera environment as suggested in the post. I have already a working pseudo distributed Hadoop cluster set up on my computer.

  23. I am able to see tweets on console but not getting data into hdfs.

    My conf file is :


    TwitterAgent.sources = Twitter
    TwitterAgent.channels = MemChannel
    TwitterAgent.sinks = HDFS
    TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
    TwitterAgent.sources.Twitter.channels = MemChannel
    TwitterAgent.sources.Twitter.consumerKey = g2VoxfPOrIFKo0o58mhA
    TwitterAgent.sources.Twitter.consumerSecret = L8H80HLL3q2LKTQQBX8BtyleMW1YdqxheWJxWozbcbg
    TwitterAgent.sources.Twitter.accessToken = 1382526218-EvFiViJfN8b2CmPankGyaU6BHty1FXYDK1PLZEQ
    TwitterAgent.sources.Twitter.accessTokenSecret = T0owrTGPvvw548CSiydTibwwi6ZfJJqLBW64vSot1jMTI
    TwitterAgent.sources.Twitter.keywords = ind,pak,cricinfo,cricket,asiacup,rgsharma,viratkohli,rahane,dhoni,ipl,6thmatch,dhawan,shahidafridi,boomboom,
    TwitterAgent.sinks.HDFS.channel = MemChannel
    TwitterAgent.sinks.HDFS.type = hdfs
    TwitterAgent.sinks.HDFS.hdfs.path = hdfs://myhost:8020/whyt/twitter/
    TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
    TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
    TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
    TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
    TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
    TwitterAgent.channels.MemChannel.type = memory
    TwitterAgent.channels.MemChannel.capacity = 10000
    TwitterAgent.channels.MemChannel.transactionCapacity = 100


    Any idea;

    1. hi jp, i'm facing the same problem. Did you find any solution?????
      pls reply

    2. If your exception is an HDFS IO error such as java.io.EOFException,
      replace hdfs://myhost:8020 with hdfs://localhost:8020 (if the NameNode port number is 8020).
      You can find the configuration in /etc/alternatives/hadoop-0.20/core-site.xml

  24. I could be wrong but I think the query to determine the most influential person has the following correction - replace 'max(retweet_count)' in the sub-query with 'max(retweeted_status.retweet_count)'.

  25. Hi every body , i have this error while i'm executing Flume , any help ??
    org.apache.flume.node.Application -f conf/flume.conf -n TwitterAgent
    Error: could not find or load main class org.apache.flume.node.Application

  26. Hi, I need help. If someone has managed to complete the tutorial successfully, please contact lhssa@hotmail.com

  27. how to view the output after processing mapreduce??

  28. after I installed hive,,i m getting output like this

    > select user.screen_name, user.followers_count c from tweets order by c desc;
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
    set hive.exec.reducers.bytes.per.reducer=
    In order to limit the maximum number of reducers:
    set hive.exec.reducers.max=
    In order to set a constant number of reducers:
    set mapred.reduce.tasks=
    Starting Job = job_201404170949_0035, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201404170949_0035
    Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=localhost:8021 -kill job_201404170949_0035
    2014-05-14 00:49:49,071 Stage-1 map = 0%, reduce = 0%
    2014-05-14 00:49:52,091 Stage-1 map = 0%, reduce = 100%
    2014-05-14 00:49:53,108 Stage-1 map = 100%, reduce = 100%
    Ended Job = job_201404170949_0035
    OK
    Time taken: 7.453 seconds

    it doesnt show the table,,,bt i got the tweeted info as in tmp file,,,how to resolve this problem

  29. um, this doesnt make any sense.........

    WHERE is the conf/flume-ng??? Why should I have this? I only downloaded the flume-JAR but you didn't say anything about downloading more than this regarding flume?? is there something missing? what else should I DOWNLOAD?

  30. Hi,
    Praveen Sripati

    This post is very helpful for me every things work for me. i want to ask 1 question if i have to store date wise twitter data like (in 4/04/2014 Modi twits) then how to set flume agent or flume.conf.

  31. I couldn't find setIncludeEntities method in my FilterQuery.class. Could you please tell me what to do to solve this issue.

  32. Thanks for great post !!!
    As mentioned in point 8, please give command to register "hive-serdes-1.0-SNAPSHOT.jar" in Hive shell.

  33. Thanks for the Post!
    I am getting following in console.
    14/08/04 02:00:46 INFO instrumentation.MonitoredCounterGroup: Component type: SINK, name: HDFS started
    14/08/04 02:00:46 INFO twitter4j.TwitterStreamImpl: Establishing connection.
    14/08/04 02:01:08 INFO twitter4j.TwitterStreamImpl: stream.twitter.com
    14/08/04 02:01:08 INFO twitter4j.TwitterStreamImpl: Waiting for 250 milliseconds
    14/08/04 02:01:08 INFO twitter4j.TwitterStreamImpl: Establishing connection.
    14/08/04 02:01:08 INFO twitter4j.TwitterStreamImpl: stream.twitter.com
    14/08/04 02:01:08 INFO twitter4j.TwitterStreamImpl: Waiting for 500 milliseconds
    14/08/04 02:01:09 INFO twitter4j.TwitterStreamImpl: Establishing connection.
    14/08/04 02:01:09 INFO twitter4j.TwitterStreamImpl: stream.twitter.com
    14/08/04 02:01:09 INFO twitter4j.TwitterStreamImpl: Waiting for 1000 milliseconds
    ...

    and I couldn''t able to find tweets in HDFS.

    Any help!

    1. hey
      i am getting the same problem.My net is working fine but i am using cntlm.Have you found any solution for this?

  34. I'm trying to start flume but keeping a permission denied error:

    -bash: /user/Flume/apache-flume-1.5.0-bin/bin/flume-ng: Permission denied
    and try to start it like this as well:

    /user/Flume/apache-flume-1.5.0-bin/bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent

    not sure if this makes a difference, but the server I'm working on is on amazon EC2 and where I believe you can't login as root but use sudo su instead, does that make a difference in the permissions? Sorry if this is trivial but i'm a linux rookie..

  35. Hi Praveen,
    I am getting the below log whenever I start flume, but I didn't get any tweets from Twitter. So, according to the below log, is flume running or not?
    Could you please give me a solution?
    2014-09-02 12:22:23,322 (Twitter Stream consumer-1[Establishing connection]) [DEBUG - twitter4j.internal.logging.SLF4JLogger.debug(SLF4JLogger.java:67)] X-Twitter-Client-URL: http://twitter4j.org/en/twitter4j-2.2.6.xml
    2014-09-02 12:22:23,323 (Twitter Stream consumer-1[Establishing connection]) [DEBUG - twitter4j.internal.logging.SLF4JLogger.debug(SLF4JLogger.java:67)] X-Twitter-Client: Twitter4J
    2014-09-02 12:22:23,323 (Twitter Stream consumer-1[Establishing connection]) [DEBUG - twitter4j.internal.logging.SLF4JLogger.debug(SLF4JLogger.java:67)] Accept-Encoding: gzip
    2014-09-02 12:22:23,323 (Twitter Stream consumer-1[Establishing connection]) [DEBUG - twitter4j.internal.logging.SLF4JLogger.debug(SLF4JLogger.java:67)] User-Agent: twitter4j http://twitter4j.org/ /2.2.6
    2014-09-02 12:22:23,323 (Twitter Stream consumer-1[Establishing connection]) [DEBUG - twitter4j.internal.logging.SLF4JLogger.debug(SLF4JLogger.java:67)] X-Twitter-Client-Version: 2.2.6
    2014-09-02 12:22:23,323 (Twitter Stream consumer-1[Establishing connection]) [DEBUG - twitter4j.internal.logging.SLF4JLogger.debug(SLF4JLogger.java:67)] Connection: close

  36. (Twitter Stream consumer-1[Receiving stream]) [DEBUG - twitter4j.internal.logging.SLF4JLogger.debug(SLF4JLogger.java:67)] Twitter Stream consumer-1[Receiving stream]
    Exception in thread "Twitter4J Async Dispatcher[0]" java.lang.NoSuchMethodError: twitter4j.json.JSONObjectType.determine(Ltwitter4j/internal/org/json/JSONObject;)Ltwitter4j/json/JSONObjectType;
    at twitter4j.AbstractStreamImplementation$1.run(AbstractStreamImplementation.java:100)
    at twitter4j.internal.async.ExecuteThread.run(DispatcherImpl.java:116)
    2014-10-30 12:00:21,105 (conf-file-poller-0) [DEBUG - org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:126)] Checking file:conf/flume.conf for changes

    solution for above problem?

  37. This comment has been removed by the author.

  38. sir i m trying to stream twitter data in hdfs but when i run flume agent although data is seen streaming on command line ,when i m checking namenode i m getting empty directory.Data is not going to hdfs pls help

    1. Change the search keywords to more general ones, like good or bad

  39. sir ..im trying my level best to create an hive external table but i m always getting this error ...pls help me sir ... i need to resolve this this issue soon ..

    hive> CREATE External TABLE dandanaka (
    > id BIGINT,
    > created_at STRING,
    > source STRING,
    > favorited BOOLEAN,
    > retweet_count INT,
    > retweeted_status STRUCT<
    > text:STRING,
    > user:STRUCT,
    > retweet_count:INT>,
    > entities STRUCT<
    > urls:ARRAY>,
    > usermentions:ARRAY>,
    > hashtags:ARRAY>>,
    > text STRING,
    > user STRUCT<
    > screen_name:STRING,
    > name:STRING,
    > friends_count:INT,
    > followers_count:INT,
    > statuses_count:INT,
    > verified:BOOLEAN,
    > utc_offset:INT,
    > time_zone:STRING>,
    > in_reply_to_screen_name STRING
    > )
    > ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
    > LOCATION '/user/flume/tweets2';
    FailedPredicateException(identifier,{useSQL11ReservedKeywordsForIdentifier()}?)
    at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.identifier(HiveParser_IdentifiersParser.java:10924)
    at org.apache.hadoop.hive.ql.parse.HiveParser.identifier(HiveParser.java:45850)
    at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameColonType(HiveParser.java:38211)
    at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameColonTypeList(HiveParser.java:36342)
    at org.apache.hadoop.hive.ql.parse.HiveParser.structType(HiveParser.java:39707)
    at org.apache.hadoop.hive.ql.parse.HiveParser.type(HiveParser.java:38655)
    at org.apache.hadoop.hive.ql.parse.HiveParser.colType(HiveParser.java:38367)
    at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameType(HiveParser.java:38051)
    at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameTypeList(HiveParser.java:36203)
    at org.apache.hadoop.hive.ql.parse.HiveParser.createTableStatement(HiveParser.java:5214)
    at org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.java:2640)
    at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1650)
    at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1109)
    at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:202)
    at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:396)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)
    at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1122)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1170)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:213)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:165)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
    at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:736)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:681)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:621)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
    FAILED: ParseException line 9:4 Failed to recognize predicate 'user'. Failed rule: 'identifier' in column specification

  40. Is fluming twitter data to download twitter logs still available? or is it stopped? because i tried to flume twitter data today after long time with flume 1.4,java 1.6 but unable to download the twitter data

  41. Hello Everyone,
    I am getting this error when running the select query.Can anybody help?
    Diagnostic Messages for this Task:
    Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable Objavro.schema�
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:185)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
    Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable Objavro.schema�
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:501)
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:176)
    ... 8 more
    Caused by: org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('O' (code 79)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
    at [Source: java.io.ByteArrayInputStream@649b97b1; line: 1, column: 2]
    at org.apache.hive.hcatalog.data.JsonSerDe.deserialize(JsonSerDe.java:169)
    at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.readRow(MapOperator.java:136)
    at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.access$200(MapOperator.java:100)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:492)
    ... 9 more
    Caused by: org.codehaus.jackson.JsonParseException: Unexpected character ('O' (code 79)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
    at [Source: java.io.ByteArrayInputStream@649b97b1; line: 1, column: 2]
    at org.codehaus.jackson.JsonParser._constructError(JsonParser.java:1433)
    at org.codehaus.jackson.impl.JsonParserMinimalBase._reportError(JsonParserMinimalBase.java:521)
    at org.codehaus.jackson.impl.JsonParserMinimalBase._reportUnexpectedChar(JsonParserMinimalBase.java:442)
    at org.codehaus.jackson.impl.Utf8StreamParser._handleUnexpectedValue(Utf8StreamParser.java:2090)
    at org.codehaus.jackson.impl.Utf8StreamParser._nextTokenNotInObject(Utf8StreamParser.java:606)
    at org.codehaus.jackson.impl.Utf8StreamParser.nextToken(Utf8StreamParser.java:492)
    at org.apache.hive.hcatalog.data.JsonSerDe.deserialize(JsonSerDe.java:158)
    ... 12 more

    1. hi i am getting this error

      please help me if you find any solution.I am getting this error when creating external hive table for twitter data .

      " Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('O' (code 79)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
      at [Source: java.io.StringReader@6f6b3d33; line: 1, column: 2]
      "

  42. Hi,
    while creating the hive table i am getting following exception - ParseException line 9:6 Failed to recognize predicate 'user'. Failed rule: 'identifier' in column specification

    when i change user oclumn name to usr it creates the table but processes no data. Can you please help?

  43. Thanks Alot bro!!! you saved my life....

  44. Hi my hadoop user directory is not showing and twitter data. I am not getting any error while running following command
    bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent

    1. It might not throw any error, but the problem is you are not giving the correct configuration directory path. The command should be as below

      bin/flume-ng agent --conf <flume configuration directory path> -f <flume configuration file path> -Dflume.root.logger=DEBUG,console -n TwitterAgent

      "flume configuration directory path" will usually be "/etc/flume-ng/conf"

      "flume configuration file path" will usually be "/etc/flume-ng/conf/flume.conf"

      and make sure you have created a directories "/user/flume/tweets" in hdfs.

  45. Hello,

    Can we define tweet rule while using "twitteragent.sources.twitter.keywords" property to fetch tweets specifically?

    TIA

  46. Hi Everyone ,

    Am New to Hadoop, am getting the below error ,Could u please help me to sort the issue:-


    hive> create table hashtags as select id as id,entities.hashtags.text as words from tweets;
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
    Query ID = esak_20160823195256_135b0bdf-3fbb-4f69-8402-6f910afd0e9e
    Total jobs = 3
    Launching Job 1 out of 3
    Number of reduce tasks is set to 0 since there's no reduce operator
    ENOENT: No such file or directory
    at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmodImpl(Native Method)
    at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmod(NativeIO.java:230)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:724)
    at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:502)
    at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:600)
    at org.apache.hadoop.mapreduce.JobResourceUploader.uploadFiles(JobResourceUploader.java:94)
    at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:95)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:190)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
    at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
    at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:561)
    at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:433)
    at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:138)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:197)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1858)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1562)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1313)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1084)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1072)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:232)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:183)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:399)
    at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:776)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:714)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
    Job Submission failed with exception 'org.apache.hadoop.io.nativeio.NativeIOException(No such file or directory)'
    FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. No such file or directory


    please Help me to solve this.............

  47. Is fluming twitter data to download twitter logs still available? or is it stopped?
    C:\>curl http://stream.twitter.com 443
    curl: (7) Failed to connect to stream.twitter.com port 80: Timed out
    curl: (6) Could not resolve host: 443

    can anyone explain.. ?

  49. Hi,

    As per the above steps, I have fetched the Twitter data into HDFS and created a table in Hive. I am unable to fetch the data into Hive Table. Facing the following error. Please help me out of it.

    hive> SELECT t.retweeted_screen_name, sum(retweets) AS total_retweets, count(*) AS tweet_count FROM (SELECT retweeted_status.user.screen_name as retweeted_screen_name, retweeted_status.text, max(retweet_count) as retweets FROM tweets GROUP BY retweeted_status.user.screen_name, retweeted_status.text) t GROUP BY t.retweeted_screen_name ORDER BY total_retweets DESC LIMIT 10;
    Query ID = cloudera_20161031070909_33a45a44-9a7b-49e8-a754-e694ff554d67
    Total jobs = 2
    Launching Job 1 out of 2
    Number of reduce tasks not specified. Estimated from input data size: 1
    In order to change the average load for a reducer (in bytes):
    set hive.exec.reducers.bytes.per.reducer=
    In order to limit the maximum number of reducers:
    set hive.exec.reducers.max=
    In order to set a constant number of reducers:
    set mapreduce.job.reduces=
    java.lang.StackOverflowError
    at java.net.URI$Parser.checkChars(URI.java:3000)
    at java.net.URI$Parser.parseHierarchical(URI.java:3086)
    at java.net.URI$Parser.parse(URI.java:3044)
    at java.net.URI.(URI.java:595)
    at java.net.URI.create(URI.java:857)
    at org.apache.hadoop.fs.FileContext.getFileContext(FileContext.java:473)
    at org.apache.hadoop.fs.FileContext.getFileContext(FileContext.java:447)
    at org.apache.hadoop.fs.FileContext.getFileContext(FileContext.java:473)
    at org.apache.hadoop.fs.FileContext.getFileContext(FileContext.java:447)
    at org.apache.hadoop.fs.FileContext.getFileContext(FileContext.java:473)
    at org.apache.hadoop.fs.FileContext.getFileContext(FileContext.java:447)

    FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. null

    1. I am facing the same error . If you have figured out how to fix this , please let me know

  50. I have the same error. If you have figured out how to fix it, can you please tell me.

  51. While running select * query from data loaded in hive, I am geting result of 1 tweet only when using limit 1. But if I run general select query without any limit, then I get error :
    Failed with exception java.io.IOException:org.apache.avro.AvroRuntimeException: java.io.IOException: Block size invalid or too large for this implementation: -40

    Suggest solution

  52. Hi i followed all the steps as per your post but i am facing errors.. anyone can help me??

    INFO twitter4j.TwitterStreamImpl: 404:The URI requested is invalid or the resource requested, such as a user, does not exist.
    Unknown URL. See Twitter Streaming API documentation at http://dev.twitter.com/pages/streaming_api

  53. Hi

    I'm getting the below error, while executing this query,
    SELECT t.retweeted_screen_name, sum(retweets) AS total_retweets, count(*) AS tweet_count FROM (SELECT retweeted_status.user.screen_name as retweeted_screen_name, retweeted_status.text, max(retweet_count) as retweets FROM tweets GROUP BY retweeted_status.user.screen_name, retweeted_status.text) t GROUP BY t.retweeted_screen_name ORDER BY total_retweets DESC LIMIT 10;

    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
    2017-10-05 04:52:12,029 Stage-1 map = 0%, reduce = 0%
    2017-10-05 04:52:29,965 Stage-1 map = 100%, reduce = 100%
    Ended Job = job_1506695677341_0040 with errors
    Error during job, obtaining debugging information...
    Examining task ID: task_1506695677341_0040_m_000000 (and more) from job job_1506695677341_0040

    Task with the most failures(4):
    -----
    Task ID:
    task_1506695677341_0040_m_000000

    URL:
    http://0.0.0.0:8088/taskdetails.jsp?jobid=job_1506695677341_0040&tipid=task_1506695677341_0040_m_000000
    -----
    Diagnostic Messages for this Task:
    Error: java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:446)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
    Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
    ... 9 more
    Caused by: java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
    at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38)
    ... 14 more
    Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
    ... 17 more
    Caused by: java.lang.RuntimeException: Map operator initialization failed
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:157)
    ... 22 more
    Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassNotFoundException: Class com.cloudera.hive.serde.JSONSerDe not found
    at org.apache.hadoop.hive.ql.exec.MapOperator.getConvertedOI(MapOperator.java:334)
    at org.apache.hadoop.hive.ql.exec.MapOperator.setChildren(MapOperator.java:352)
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:126)
    ... 22 more
    Caused by: java.lang.ClassNotFoundException: Class com.cloudera.hive.serde.JSONSerDe not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:195)

    Could you please tel me, what is the issue

  54. I want to store tweets in a Kafka topic. What changes are required? Can anyone suggest? I am new to the Hadoop World

  55. i am having problem with creating table in hive

    I Have added the jar as mentioned above
    and in create table command
    it shows the error


    FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org/apache/hadoop/hive/serde2/SerDe


  56. Hello,

    I have successfully fetched twitter data using flume and stored it in HDFS, but I was using pig for further analysis and I am not able to clean the data because the recorded tweets are in Arabic, Chinese and some other languages including English. The problem I face here is that I am not able to get only the English tweets
