Thursday, May 22, 2014

Pig as a Service: Hadoop challenges data warehouses

Thanks to Gil Allouche (Qubole's VP of Marketing) for this post.

Hadoop and its ecosystem have evolved from a narrow map-reduce architecture into a universal data platform set to dominate the data processing landscape. Importantly, the push to simplify Hadoop deployments with managed cloud services, known as Hadoop-as-a-Service, is increasing Hadoop’s appeal for new data projects and architectures. Naturally, this development is permeating the Hadoop ecosystem, for example in the shape of Pig as a Service offerings.

Pig, developed at Yahoo! Research in 2006, enables programmers to write data transformation programs for Hadoop quickly and easily, without the cost and complexity of hand-written map-reduce programs. Consequently, ETL (Extract, Transform, Load), the core workload of DWH (data warehouse) solutions, is often realized with Pig in the Hadoop environment. The business case for Hadoop and Pig as a Service is compelling from both financial and technical perspectives.
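
To make that concrete, here is a minimal sketch of what such a transformation looks like in Pig Latin; the file paths and field layout are illustrative assumptions, not part of any real data set.

    -- Minimal ETL-style Pig Latin sketch (hypothetical paths and schema):
    -- load raw logs, drop incomplete records, aggregate, and store the result.
    raw_logs   = LOAD 'input/access_logs' USING PigStorage('\t')
                 AS (user_id:chararray, url:chararray, bytes:long);
    valid_logs = FILTER raw_logs BY user_id IS NOT NULL;
    by_user    = GROUP valid_logs BY user_id;
    traffic    = FOREACH by_user GENERATE group AS user_id,
                 SUM(valid_logs.bytes) AS total_bytes;
    STORE traffic INTO 'output/traffic_per_user' USING PigStorage('\t');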

Hadoop is becoming data’s Swiss Army knife
News on Hadoop last year was dominated by SQL (Structured Query Language) on Hadoop, with Hive, Presto, Impala, Drill, and countless other flavours competing to make big data accessible to business users. Most of these solutions are supported directly by Hadoop distributors, e.g. Hortonworks, MapR, and Cloudera, or by cloud service providers, e.g. Amazon and Qubole.

The push for development in the area is driven by the vision of Hadoop becoming the data platform of the future. The release of Hadoop 2.0 with YARN (Yet Another Resource Negotiator) last year was an important step. It turned the core of Hadoop’s processing architecture from a map-reduce-centric solution into a generic cluster resource management tool able to run any kind of algorithm and application. Hadoop solution providers are now racing to capture the market for multipurpose, any-size data processing. SQL on Hadoop is only one of the stepping-stones to this goal.

3 Ways Hadoop is Gaining on Data Warehouses
The target and incentive are clear: Hadoop is a comparatively inexpensive technology for storing and processing large data sets. One established market is particularly lucrative and tempting to enter. Data warehouse (DWH) solutions can easily cost many millions of dollars, and Hadoop, with its economical distributed computing architecture and growing ecosystem, promises to deliver much of their feature set for a fraction of the cost.

Three exciting, active developments are eating away at established DWH solutions’ lead over Hadoop, and at the reasons for spending 10 or 100 times more:
  • SQL on Hadoop is making data accessible to data and business analysts, and to existing visualisation and analytics tools, via SQL interfaces. Presto and other new SQL engines show that real-time querying of big data (petabytes and beyond) can be done with Hadoop at a dramatically lower cost than DWH solutions offer.
  • Cloud computing based platforms and software services for Hadoop remove the complexity, risk, and technical barriers to getting started with Hadoop. Importantly, they enable incremental, iterative development of Hadoop data projects, which also makes Hadoop attractive for medium-sized data projects. Today all major Hadoop projects are offered in one shape or another as-a-Service, with cloud service providers working to make increasingly complete Hadoop ecosystems available as-a-Service, billed by the hour, and fully scalable in minutes.
  • The ETL capabilities of Hadoop have matured significantly. Pig offers a full set of data transformation operations executable on Hadoop, Sqoop integrates SQL stores, and scalable NoSQL stores like HBase, Cassandra, or MongoDB are regularly designed with Hadoop in mind in the first place, or at least tightly integrated with it. Workflow tools like Oozie can orchestrate complex data pipelines. Together these developments compete with the core abilities of DWH solutions and go beyond some of their established features; a brief sketch of how these pieces fit together follows this list.
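
As a rough illustration, the sketch below joins reference data imported from a relational store (e.g. via Sqoop) with raw event data already in HDFS; all paths and schemas are assumptions made for this example.

    -- Hypothetical sketch: combine imported reference data with raw events.
    customers = LOAD 'warehouse/customers' USING PigStorage(',')
                AS (customer_id:chararray, region:chararray);
    events    = LOAD 'raw/events' USING PigStorage('\t')
                AS (customer_id:chararray, event_type:chararray, amount:double);
    joined    = JOIN events BY customer_id, customers BY customer_id;
    by_region = GROUP joined BY customers::region;
    summary   = FOREACH by_region GENERATE group AS region,
                SUM(joined.events::amount) AS total_amount;
    STORE summary INTO 'curated/amount_by_region';
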
Pig, the Silent Hero
SQL on Hadoop has been extensively covered in the media over the last year. Pig, being a well-established technology, has been largely overlooked, even though Pig as a Service is a noteworthy development. Considering Hadoop as a data platform, however, requires Pig and an understanding of why and how it is important.

Data users are generally trained in SQL, a declarative language, to query data for reporting, analytics, and ad-hoc exploration. SQL does not describe how the data is processed; this declarative style appeals to many data users. ETL processes, which are developed by data programmers, benefit from and sometimes even require the ability to detail the individual data transformation steps. ETL programmers therefore often prefer a procedural language over a declarative one. Pig’s programming language, Pig Latin, is procedural and gives programmers control over every step of the processing, as sketched below.
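
The following purely illustrative fragment shows this procedural style: every intermediate relation is named and can be inspected or reused, whereas in SQL the same logic would collapse into a single declarative statement. The schema and paths are made up for the example.

    -- Each step is an explicit, named relation.
    orders    = LOAD 'raw/orders' USING PigStorage(',')
                AS (order_id:chararray, region:chararray, status:chararray, total:double);
    shipped   = FILTER orders BY status == 'SHIPPED';
    by_region = GROUP shipped BY region;
    revenue   = FOREACH by_region GENERATE group AS region,
                SUM(shipped.total) AS revenue;
    DUMP revenue;  -- any intermediate relation above could be inspected the same way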

Business users and programmers work on the same data set yet usually focus on different stages. Programmers commonly work on the whole ETL pipeline, i.e. they are responsible for cleaning and extracting the raw data, transforming it, and loading it into third-party systems. Business users either access data on those third-party systems or access the extracted and transformed data for analysis and aggregation. Diverse tooling is therefore important, because the interaction patterns with the same data set are diverse.

Importantly, complex ETL workflows need management, extensibility, and testability to ensure stable and reliable data processing. Pig provides strong support in all three areas. Pig jobs can be scheduled and managed with workflow tools like Oozie to build and orchestrate large-scale, graph-like data pipelines.
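
The sketch below shows the kind of parameterised Pig script a workflow tool such as Oozie might invoke on a schedule, passing in the input location and a date; the parameter names, paths, and script name are assumptions for illustration.

    -- Invoked e.g. as: pig -param INPUT=raw/clicks -param DATE=2014-05-22 -param OUTPUT=daily daily_counts.pig
    raw    = LOAD '$INPUT/$DATE' USING PigStorage('\t')
             AS (user_id:chararray, action:chararray);
    by_act = GROUP raw BY action;
    counts = FOREACH by_act GENERATE group AS action, COUNT(raw) AS n;
    STORE counts INTO '$OUTPUT/$DATE';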

Pig achieves extensibility with UDFs (User Defined Functions), which let programmers add functions written in one of many programming languages. The benefit of this model is that any kind of special functionality can be injected, while Pig and Hadoop manage the distribution and parallel execution of the function over potentially huge data sets in an efficient manner. This allows programmers to focus on solving specific domain problems, e.g. rectifying anomalies in a particular data set or converting data formats, without worrying about the complexity of distributed computing.
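
As a sketch of how a UDF is wired into a script, the jar, package, and function below are hypothetical, not a real library:

    -- Register a (hypothetical) jar and give its UDF a short alias.
    REGISTER 'my-udfs.jar';
    DEFINE NormalizeUrl com.example.pig.NormalizeUrl();

    raw_clicks = LOAD 'raw/clicks' USING PigStorage('\t')
                 AS (user_id:chararray, url:chararray);
    normalized = FOREACH raw_clicks GENERATE user_id, NormalizeUrl(url) AS url;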

Reliable data pipelines require testing before deployment to production to ensure the correctness of the numerous data transformation and combination steps. Pig has features that support easy and testable development of data pipelines: unit tests, an interactive shell, and a local mode that executes programs without requiring a Hadoop cluster. Programmers can use these to test their Pig programs in detail with test data sets before they ever reach production, and to try out ideas quickly and inexpensively, which is essential for fast development cycles.
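
For example, a small script like the one below can be exercised against a sample file entirely on a developer’s machine by running Pig in local mode (the file and script names are illustrative):

    -- Run with: pig -x local wordcount_test.pig  (no Hadoop cluster needed)
    lines   = LOAD 'test-data/sample.txt' AS (line:chararray);
    words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grouped = GROUP words BY word;
    counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
    DUMP counts;  -- print results to the console for quick inspection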

None of these features is particularly glamorous, yet they are important when evaluating Hadoop and data processing with it. The choice to leverage Pig for a big data project can easily make the difference between success and failure.

Pig as a Service
Pig by itself is the important glue that turns raw data from (No)SQL and object stores into structured data. Yet Pig requires a Hadoop environment to execute its programs. The as-a-Service offerings provide the necessary cluster environment ready to run, so data projects can focus on the ETL aspects.

The business case for Pig as a Service is simple and convincing. Hadoop is a complex data platform, and the continued growth of the Hadoop market means that experts are costly and hard to find. At the same time, as mentioned before, Hadoop has the potential to shatter data processing costs per byte.

The second argument for the service route is that while Hadoop may beat alternative processing solutions per byte, it is hard to achieve the economies of scale required for most businesses and projects. And even if those economies may be achievable in the future, few businesses are willing to invest significant capital in a large-scale Hadoop infrastructure without a proven track record of what the eventual savings will be.

The service solution addresses all these problems by effectively outsourcing the expertise and scale challenge. Cloud computing enables providers like Amazon, Mortar Data, or Qubole to offer scalable services around Hadoop, billed on a usage basis. Their business models and services vary from provider to provider, yet they all effectively offer Pig as a Service, removing technical barriers and capital investments while adding expertise that enriches the offerings in various respects.

This approach is significantly different from the more traditional solution offered by Hadoop distributors. They provide complete Hadoop ecosystems and support for running them. However, the customer still has to hire experts, operate the Hadoop cluster, and pay a significant support fee per cluster node per year. Cloud computing has also advanced into this area and allows customers to install Hadoop distributions on virtual machines, removing capital investments. However, the operational burden of maintaining a cluster, even a supported one, remains a cost and an effort that stays with the customer. As-a-Service solutions remove these problems.

Today, anyone considering data projects involving ETL workloads should seriously consider how Pig and Hadoop might fit into their data processing architecture. Pig as a Service offers a low-cost, low-barrier, testable, manageable option for businesses to enter big data platforms and potentially save time and money over traditional data warehouse options. It also opens up the opportunity to leverage more of the Hadoop ecosystem, e.g. SQL on Hadoop for scalable querying of big data by business users, or distributed computing for advanced data processing projects.
