Tuesday, October 1, 2013

Sqoop2 vs Sqoop

Apache Sqoop uses a client model, where the user needs to install Sqoop along with the connectors/drivers on the client. Sqoop2 (the next version of Sqoop) uses a service-based model, where the connectors/drivers are installed on the Sqoop2 server. Also, all the configuration needs to be done on the Sqoop2 server. Note that this blog entry refers to Sqoop 1.x (client-based model) as Sqoop and Sqoop 2.x (service-based model) as Sqoop2.
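
To make the client model concrete, here is a minimal Python sketch that assembles the kind of `sqoop import` command a Sqoop user runs from their own machine. The helper name, host, database, user and paths are all hypothetical; only the flags are standard Sqoop 1.x import options. Note how every connection detail has to be known on the client side.

```python
import shlex


def build_sqoop_import(jdbc_url, username, password_file, table,
                       target_dir, num_mappers=4):
    """Return the argument list for a Sqoop 1.x import (hypothetical helper)."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,             # JDBC URL of the source database
        "--username", username,
        "--password-file", password_file,  # file holding the password
        "--table", table,
        "--target-dir", target_dir,        # HDFS directory for the output
        "--num-mappers", str(num_mappers),
    ]


# All values below are made up for illustration.
cmd = build_sqoop_import(
    jdbc_url="jdbc:mysql://db.example.com/sales",
    username="etl_user",
    password_file="/user/etl/.db-password",
    table="orders",
    target_dir="/data/sales/orders",
)

# Print the command as it would be typed in a shell.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The list could be handed to `subprocess.run(cmd)` on a machine where Sqoop and the JDBC driver are installed, which is exactly the requirement that Sqoop2 moves to the server.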

From an MR perspective, another difference is that Sqoop submits a map-only job, while Sqoop2 submits a MapReduce job in which the Mappers transport the data from the source and the Reducers transform the data according to the source specified. This provides a clean abstraction. In Sqoop, both the transportation and the transformation are done by the Mappers alone.
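
The split of responsibilities can be pictured with a toy Python sketch. This is plain Python and not Hadoop or Sqoop code; the function names, the sample rows and the transformation are made up purely to show where the work happens in each model.

```python
def read_source():
    """Stand-in for rows pulled from an RDBMS table (made-up data)."""
    return [(1, " alice "), (2, " bob ")]


def transform(row):
    """Example transformation: trim whitespace and upper-case the name."""
    row_id, name = row
    return (row_id, name.strip().upper())


def map_only_job(rows):
    # Sqoop style: each mapper both transports and transforms a record.
    return [transform(row) for row in rows]


def map_reduce_job(rows):
    # Sqoop2 style as described above: mappers only transport the records ...
    transported = [row for row in rows]
    # ... and reducers apply the transformation afterwards.
    return [transform(row) for row in transported]


# Both models produce the same result; only the placement of the work differs.
assert map_only_job(read_source()) == map_reduce_job(read_source())
print(map_reduce_job(read_source()))
```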

Another major difference in Sqoop2 is from a security perspective. The administrator sets up the connections to the sources and the targets, while the operator uses the already established connections, so the operator need not know the connection details. Operators are given access only to the connectors they require.
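
The administrator/operator separation can be sketched in a few lines of Python. This is a conceptual model only and not the Sqoop2 API: the class names, methods and sample values are hypothetical, and for simplicity it grants access per connection.

```python
from dataclasses import dataclass, field


@dataclass
class Connection:
    name: str
    jdbc_url: str
    username: str
    password: str = field(repr=False)  # keep the secret out of repr()


class ConnectionRegistry:
    """Lives on the server; operators only ever refer to connections by name."""

    def __init__(self):
        self._connections = {}
        self._grants = {}  # operator name -> set of connection names

    def create(self, conn):
        # Administrator task: register a connection with its credentials.
        self._connections[conn.name] = conn

    def grant(self, operator, conn_name):
        # Administrator task: allow an operator to use one connection.
        if conn_name not in self._connections:
            raise KeyError(conn_name)
        self._grants.setdefault(operator, set()).add(conn_name)

    def run_job(self, operator, conn_name, table):
        # Operator task: use a connection without seeing its details.
        if conn_name not in self._grants.get(operator, set()):
            raise PermissionError(f"{operator} may not use {conn_name}")
        # A real server would open the stored JDBC URL here.
        return f"importing {table} via connection '{conn_name}'"


registry = ConnectionRegistry()
registry.create(Connection("sales-db", "jdbc:mysql://db.example.com/sales",
                           "etl_user", "s3cret"))
registry.grant("operator1", "sales-db")
print(registry.run_job("operator1", "sales-db", "orders"))
```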

Along with the CLI, which continues to be available, a Web UI can also be used with Sqoop2. The CLI and the Web UI consume the REST services provided by the Sqoop2 server. One thing to note is that the Web UI is part of Hue (HUE-1214) and not part of the ASF. The Sqoop2 REST interface also makes it easy to integrate with other frameworks like Oozie to define a workflow involving Sqoop2.
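
Because everything goes through REST, any HTTP client can talk to the server. The sketch below, using only the Python standard library, builds the URL of the version resource and fetches it. The host name is made up, and the default port 12000, the `sqoop` web application name and the `/version` path are assumptions based on the Sqoop2 documentation; they may differ between releases and installations.

```python
import json
from urllib.request import urlopen


def sqoop2_version_url(host, port=12000, webapp="sqoop"):
    """Build the URL of the version resource (defaults are assumptions)."""
    return f"http://{host}:{port}/{webapp}/version"


def fetch_version(host, port=12000, timeout=5):
    """Ask a running Sqoop2 server for its version details as a dict."""
    with urlopen(sqoop2_version_url(host, port), timeout=timeout) as resp:
        return json.load(resp)


# Only builds the URL; call fetch_version() against a real server.
print(sqoop2_version_url("sqoop2.example.com"))
```

A workflow tool would follow the same pattern: issue HTTP requests to the server instead of shelling out to a locally installed client.
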
Here is a video on why Sqoop2. The audio is not very clear. For those who prefer reading, here are articles (1, 2) on the same topic.

Also, here is the documentation for installing and using Sqoop2. I tried installing it and, as with any other framework in its initial stages, Sqoop2 is still half cooked, with a lot more to be improved/developed from a usability/documentation perspective. Although the features of Sqoop2 are interesting, until Sqoop2 becomes more usable/stable we are left with Sqoop to get the data from RDBMSs into Hadoop.

4 comments:

  1. Sqoop 2 is a nonfunctional piece of shit, throw it away immediately

    Replies
    1. As mentioned in the blog, it is half cooked. Maybe I should have highlighted it.

  2. Using Sqoop2 we are able to connect to MySQL, but we are not able to connect to an Oracle DB. Does Sqoop2 not support Oracle? Please provide a solution for connecting to an Oracle DB using Sqoop2.

    Replies
    1. What is the exact issue you are facing? Since an Oracle JDBC connector is available for Sqoop2, you should be able to connect to Oracle.
