Thursday, December 26, 2013

Store and ETL Big Data in the Cloud with Apache Hive

Big Data and cloud storage, paired with the processing capabilities of Apache Hadoop and Hive as a service, can be an excellent complement to expensive data warehouses. The ever-increasing storage demands of big data applications put pressure on in-house storage solutions. This data is generally not stored in database systems because of its size. In fact, it is commonly the precursor to a data mining or aggregation process, which writes the distilled information into a database. For this process, however, the constantly growing data collected from logs, transactions, sensors, and other sources has to be stored somewhere: inexpensively, safely, and accessibly.

Cloud Storage

Most companies can achieve two of these three attributes, e.g. safe and inexpensive (tapes), safe and accessible (multi-location data servers), or inexpensive and accessible (a non-redundant server or network-attached drive). Combining all three requires economies of scale beyond what is feasible for most companies and their data. Cloud providers like Amazon Web Services have taken on the challenge and offer excellent solutions like Amazon S3. It provides a virtually limitless data sink with tremendous durability (99.999999999%), instant access, and reasonable pricing.

Utilizing Cloud Data

Scenarios for big data in the cloud consequently should consider S3 as a data sink that collects data in a single place. Once collected, the data has to be processed. One option is to write custom software to load the data, parse it, and aggregate and export the information contained in it to another storage format or data store. This kind of ETL (Extract, Transform, Load) processing is common in data warehouse settings, so reinventing the wheel with a custom solution is not desirable. At the same time, building or scaling a data warehouse for Big Data is an expensive proposition.

There is an alternative that can ETL giga-, tera-, and petabytes of data with low information density cheaply, and with mature tools that simplify the design and management of the ETL jobs. This is the domain of Apache Hadoop and Hive.

Accessing Cloud Data with Hive

Apache Hive offers a SQL-like interface to data stored on Hadoop's HDFS or Amazon S3. It does this by mapping data formats to tables and associating a directory with a table. We can then execute SQL-like queries against this data. Hive turns the queries into map reduce jobs for Hadoop to execute. Hadoop reads all the files in the directory, parses them based on the table definition, and processes the extracted data according to the query.

A very simple example is a comma-separated file stored on S3:
1,10.34,Flux Capacitor,\N
2,5.00,Head phone,ACME Corp.
The '\N' is a Hive-specific encoding and is read by Hive as NULL. An empty string would be read as exactly that, an empty string, and not as NULL.

A table definition in Hive to utilise this data would be:
CREATE EXTERNAL TABLE products(
    id INT,
    price DECIMAL,
    name STRING,
    manufacturer STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3n://your_bucket/some_where/';
The table create statement contains the usual column definitions. The EXTERNAL keyword indicates to Hive that the table data is managed outside of Hive. This means that creating and dropping the table will not affect the data; overwriting data in the table, however, will replace the files at the location! This is commonly used with data sinks where processes external to Hive deposit data in a location and Hive is used to read and process the data.
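
A small sketch of this behaviour, where staging_products is a hypothetical source table:
-- Writing into the external table rewrites the files at its LOCATION:
INSERT OVERWRITE TABLE products
SELECT id, price, name, manufacturer FROM staging_products;
-- Dropping the table only removes the metadata; the files on S3 remain untouched:
DROP TABLE products;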

The lines ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' and STORED AS TEXTFILE tell Hive to expect the data as comma-separated fields in a text file. Be aware, though, that Hive does not understand how to escape commas in strings, e.g. when a string is enclosed in quotes the way Microsoft Excel exports it. Alternatively, delimiters that do not occur in the data can be used, such as special non-printing characters (Hive's default delimiter is 0x01, for example).
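
For illustration, a table using Hive's default delimiter instead of commas could be declared as follows; the table and columns are hypothetical and '\001' is the octal escape for 0x01:
CREATE EXTERNAL TABLE raw_events(
    event_time STRING,
    message STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE
LOCATION 's3n://your_bucket/events/';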

Lastly, the LOCATION indicates to Hive where to find the data on S3 or HDFS. Hive will read all the files in the location and represent them as one table to queries. Be aware that a large number of small files usually degrades performance.
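
A simple query against the products table above could look like this; Hive compiles it into a map reduce job that scans every file under the location:
-- The '\N' manufacturer in the first sample row is read as NULL and filtered out here:
SELECT manufacturer, COUNT(*) AS product_count, AVG(price) AS avg_price
FROM products
WHERE manufacturer IS NOT NULL
GROUP BY manufacturer;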

Storing Data with Hive File Formats

Apache Hive supports custom extensions to its data formats with SerDes, short for serializer/deserializer. These small programs can be written by anyone and added to Hive as JARs. SerDes are already available to process JSON, Avro, Thrift, and regular expressions (to parse and extract data from strings at read time, e.g. to map logs to tables), to name the most prominent ones (the Avro, Thrift, and regular expression SerDes are included in Hive).
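
As an illustrative sketch, a regular expression SerDe can map raw log lines to columns at read time. The table, location, and regular expression below are hypothetical, all columns have to be strings, and depending on the Hive version the contrib JAR may have to be added first:
-- Each capture group in input.regex is mapped to one column, in order:
CREATE EXTERNAL TABLE access_logs(
    host STRING,
    request_time STRING,
    status STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*).*"
)
STORED AS TEXTFILE
LOCATION 's3n://your_bucket/logs/';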

Hive also supports compression and detects compressed text files based on their file extensions. Commonly used and available in most Hive versions are LZO, Gzip, BZip2, Deflate, and Snappy. The tricky part with compression is splitting files. Not all algorithms allow Hadoop to split (large) compressed files to utilize its parallelism. BZip2 and LZO compress by blocks and allow Hadoop to split files, though they have caveats: LZO out of the box does not add the index required to make files splittable, and BZip2, while very effective, is not very fast. A good practice is to store text files split and compressed as GZip files so that Hadoop processes one file per map task. This requires the process depositing the files in the data sink to be aware of, and able to produce, splits practicable for the subsequent Hadoop processing.
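
As a hedged sketch (property names of the Hadoop 1.x generation; newer releases use the mapreduce.* equivalents), a Hive job can itself write its text output GZip compressed, which produces one compressed file per reducer and thus files that a later map task can process whole:
-- Compress the final text output of the job with GZip:
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;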

An alternative solution is to store text files uncompressed, of course, and bear the additional storage and network costs. Intermediate and final outputs, however, nearly always should use compression and optimized file formats - exceptions are outputs that are required in a specific format for other systems to load the data. Otherwise, Hive file formats like SEQUENCEFILE, RCFILE, and ORC store outputs more efficiently than text formats. They can be defined in the table create statement with 'STORED AS RCFILE', for example. These formats are block oriented and permit compression algorithms on the blocks without penalizing the parallelism of Hadoop. Preferred compression algorithms for these formats are usually GZip, which is efficient and fast, or Snappy, which is slightly less efficient yet faster than GZip.
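
A sketch of such an output table, with hypothetical names and the Hadoop 1.x style property names of that era:
-- Request Snappy compression for the job output:
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- Hypothetical aggregate table stored in the RCFile format:
CREATE TABLE product_stats(
    manufacturer STRING,
    product_count BIGINT,
    avg_price DOUBLE)
STORED AS RCFILE;
-- Populate it from the external text table defined earlier:
INSERT OVERWRITE TABLE product_stats
SELECT manufacturer, COUNT(*), AVG(price)
FROM products
GROUP BY manufacturer;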

SequenceFile

The traditional file format for Hadoop is the SequenceFile. It stores key-value pairs of data in a flat binary file. The advantages are more efficient storage (binary instead of text), wide support across the Hadoop ecosystem, and block compression support. The downside for use with Hive is that for any kind of query all of the data has to be read, since Hive rows are stored as opaque values. To retrieve one column of one row the whole data set has to be read, and no query-driven optimization is possible.
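
A minimal sketch of such a table, with hypothetical names:
-- Rows are written into the value part of each key-value pair:
CREATE TABLE events_seq(
    event_time STRING,
    message STRING)
STORED AS SEQUENCEFILE;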

RCFile

The RCFile (Record Columnar File) is an optimized format for storing and retrieving column oriented data in a parallel processing environment. RCFile guarantees that all the data of a row is stored in a row group and that a row group is not split across nodes. A node always has all the data of one row locally accessible, which simplifies processing. Within a row group RCFile stores data by columns.

[Figure: RCFile (Record Columnar File) Hive file format]

This offers two major advantages. Firstly, compression algorithms can work more efficiently since the data type within a column does not change and the cardinality of a column may be low. Secondly, query conditions can be pushed down the stack, and a query that reads and evaluates only some of the columns can skip reading the irrelevant ones. Map tasks can therefore reduce the amount of data read and processed, saving I/O and CPU time.

ORC

The ORC (Optimized Row Columnar) file format is an evolution of the RCFile idea. It stores groups of data in columnar format. ORC adds data-dependent encoding, i.e. block-mode compression based on the data type: integers are run-length encoded and string columns are dictionary encoded, so the most frequent items are stored efficiently.
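
A minimal sketch of an ORC table, assuming Hive 0.11 or later; the orc.compress table property selects the block compression, here Snappy:
CREATE TABLE product_stats_orc(
    manufacturer STRING,
    product_count BIGINT,
    avg_price DOUBLE)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");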

[Figure: ORC (Optimized Row Columnar) Hive file format]

The second improvement is the addition of headers and footers to data blocks, which can store block statistics and indices. These can be simple pieces of information like the minimum and maximum values of the columns in a block. Such features allow further improvements on data processing beyond RCFile. ORC and RCFile both allow skipping columns within blocks; ORC additionally allows skipping whole blocks that do not pass predicate filtering conditions. Row blocks with certain minimum/maximum values in a column, for example, can be skipped if the block statistics indicate that no items of interest are contained in the block. ORC is already released, but features around these indices are still under development, and Hive in particular is increasingly taking advantage of them. In the future, Bloom filters may be added to exclude even more data from having to be parsed.
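
As an illustrative sketch (products_orc is a hypothetical ORC table and the setting names are those of Hive 0.11/0.12), predicate pushdown lets ORC consult these block statistics and skip row groups whose minimum/maximum values cannot satisfy the filter:
-- Push query predicates down to the storage layer and let ORC use its row-group indices:
SET hive.optimize.ppd=true;
SET hive.optimize.index.filter=true;
-- Row groups whose maximum price is 100.0 or less can be skipped when reading:
SELECT name, price
FROM products_orc
WHERE price > 100.0;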

Conclusion

Big Data can be stored easily and inexpensively in the cloud and processed with Hive for ETL. It is a cost-effective complement to data warehouse solutions, and it reduces risk and cost and/or improves accessibility compared with in-house solutions. Once data is processed and stored with Hive, it makes sense to consider the various file formats available. Generally, Snappy or GZip block-compressed RCFile or ORC formats improve data storage size and access performance for Hive processing.

You can try out the Qubole Data Service, which offers highly optimized S3 access and on-demand Hadoop clusters, a graphical web interface to Hive and Pig, adapters to RDBMS and NoSQL systems, and advanced job management for complex, repeating ETL tasks.

Author Bio:

Rajat Jain currently works as software engineer at Qubole developing its backend and middleware systems.
