Hadoop – The Definitive Guide revolves around the example of finding the maximum temperature for each year in a weather data set. The code for the same is here and the data here. Below is the equivalent Spark code, implemented in Python.
import re
from pyspark import SparkContext

# Extract (year, temperature) from a fixed-width record, filtering out
# missing readings (+9999) and records with a bad quality code.
def extractData(line):
    val = line.strip()
    (year, temp, q) = (val[15:19], val[87:92], val[92:93])
    if temp != "+9999" and re.match("[01459]", q):
        return [(year, temp)]
    else:
        return []

logFile = "hdfs://localhost:9000/user/bigdatavm/input"

# Create the Spark context with the master details and the application name
sc = SparkContext("spark://bigdata-vm:7077", "max_temperature")

# Create an RDD from the input data in HDFS
weatherData = sc.textFile(logFile)

# Transform the data to extract/filter records, then find the maximum
# temperature per year
max_temperature_per_year = weatherData.flatMap(extractData) \
    .reduceByKey(lambda a, b: a if int(a) > int(b) else b)

# Save the resulting RDD back into HDFS
max_temperature_per_year.saveAsTextFile("hdfs://localhost:9000/user/bigdatavm/output")

The flatMap transformation plays the role of the map function in the MapReduce model, and the reduceByKey transformation plays the role of the reduce function.
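To see what the flatMap and reduceByKey steps actually compute, here is a minimal pure-Python simulation of the same pipeline that runs without a Spark cluster. The fixed-width sample records built by make_record below are hypothetical, constructed only so that the year sits at positions 15:19, the temperature at 87:92, and the quality code at 92:93, matching the offsets used in the Spark code.

```python
import re

def extract_data(line):
    # Same parsing/filtering logic as extractData in the Spark job.
    val = line.strip()
    (year, temp, q) = (val[15:19], val[87:92], val[92:93])
    if temp != "+9999" and re.match("[01459]", q):
        return [(year, temp)]
    return []

def make_record(year, temp, q):
    # Build a hypothetical fixed-width record: filler up to position 15,
    # a 4-char year, filler up to position 87, a 5-char temperature,
    # then a 1-char quality code.
    return "x" * 15 + year + "x" * 68 + temp + q

records = [
    make_record("1950", "+0022", "1"),
    make_record("1950", "-0011", "1"),
    make_record("1950", "+9999", "1"),  # missing reading, filtered out
    make_record("1949", "+0111", "1"),
    make_record("1949", "+0078", "2"),  # bad quality code, filtered out
]

# The flatMap step: each record yields zero or one (year, temp) pair.
pairs = [pair for line in records for pair in extract_data(line)]

# The reduceByKey step: for each year, keep the numerically larger temp.
maxes = {}
for year, temp in pairs:
    if year not in maxes or int(temp) > int(maxes[year]):
        maxes[year] = temp
# maxes is now {'1950': '+0022', '1949': '+0111'}
```

Note that the comparison uses int(temp) rather than the raw string, since signed strings like "-0011" do not sort correctly lexicographically.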
In future blogs, we will see how to perform more complex processing using the other transformations and actions provided by Spark.