Monday, July 30, 2018

Compatibility between the Big Data vendors

What do the Big Data vendors have to offer?

Now that the Big Data wars have pretty much ended, we have Cloudera, MapR and Hortonworks as the major Big Data vendors. There are also pure-play vendors that focus on one or two Big Data products (like DataStax on Apache Cassandra), but Cloudera, MapR and Hortonworks provide a complete suite of software covering storage, processing, security, easy installation etc. These vendors solve problems like:

  • Integrating the different software from Apache. Not every Big Data project from Apache is compatible with the others. These vendors make sure that the different Apache projects play nice with each other.

  • Installing and fine-tuning Big Data software is not easy. It's no longer a matter of download and click. These vendors make the installation process easier and automate as much of it as possible.

  • Although the software from Apache is free to use, the Apache Software Foundation doesn't provide any commercial support. Companies like Cloudera, MapR and Hortonworks fill the gap, as long as their distribution is being used.


Compatibility between the different Big Data vendors?

It's all fine and dandy till now. One of the purposes of using Open Source Software is to avoid vendor lock-in. But the different Big Data vendors, deliberately or not, cause customers to get locked into their software. Here we will look into how.

The suite from each vendor is actually a combination of different software. For example, Cloudera CDH is a combination of Hadoop, Hive, Pig, Sqoop, Oozie and a variety of other Big Data tools. These and others form the common set across the different distributions. But on top of them, the vendors promote their own unique software, which makes the distributions incompatible with each other.

For example, Cloudera pushes Impala for Big Data analytics, while Hortonworks has been pushing Hive for the same. Cloudera includes Impala in its distribution, while Hortonworks doesn't. Impala and Hive are similar to the extent that they both provide an SQL-like interface for the user. But as one digs deeper, more differences surface.

So, if a customer uses Impala from Cloudera CDH, they are bound to be stuck with it, and moving to Hortonworks will definitely be a big migration project. Impala vs Hive is just one example of the differences between the distributions. Apache Sentry (in Cloudera CDH) and Apache Ranger (in Hortonworks HDP and HDF) is another. As we dig deeper, more such differences are bound to turn up.
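The overlap described above can be sketched with plain Python sets. This is only a rough illustration: the component lists contain just the names mentioned in this post, not the full release-note matrix, and `lock_in_risk` is a hypothetical helper made up for this example.

```python
# Illustrative only: components named in this post, not the complete
# contents of either distribution.
COMMON = {"Hadoop", "Hive", "Pig", "Sqoop", "Oozie"}
cdh = COMMON | {"Impala", "Sentry"}   # Cloudera CDH
hdp = COMMON | {"Ranger"}             # Hortonworks HDP

portable = cdh & hdp   # available in both distributions
cdh_only = cdh - hdp   # would need replacing when moving to HDP
hdp_only = hdp - cdh   # would need replacing when moving to CDH


def lock_in_risk(used, target):
    """Return the components in use that the target distribution lacks."""
    return sorted(set(used) - target)


# A customer on CDH who relies on Hive, Impala and Sentry:
print(lock_in_risk({"Hive", "Impala", "Sentry"}, hdp))
# → ['Impala', 'Sentry']
```

Anything that comes back from `lock_in_risk` is a component that would have to be swapped for an equivalent (for example Impala for Hive) before the move, which is where the migration effort goes.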

Here is the matrix of the different software in the Cloudera CDH and Hortonworks HDP/HDF distributions. It has been prepared from the corresponding release notes of the distributions (1, 2, 3). The detailed document (xlsx) for the same is here. For the sake of clarity, the most commonly used components have been highlighted.

Big Data Vendor Software Matrix

Besides the compatibility of the different distributions, the above matrix makes it obvious that the software from Cloudera is a bit behind that from Hortonworks. For example, Cloudera offers Spark 1.6.0 while Hortonworks offers Spark 2.3.1. That's a major version difference, and the difference definitely shows at the API level too. Either Hortonworks is too aggressive in adopting the latest versions, or Cloudera thinks that Spark 2.3.1 is not stable enough.
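To make the "major version difference" point concrete, here is a small sketch that compares the two Spark versions quoted above. The helper names `parse_version` and `major_gap` are made up for this example, and it assumes plain dotted numeric versions like the ones in the matrix.

```python
def parse_version(text):
    """Turn a dotted version string such as '2.3.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def major_gap(a, b):
    """Return how many major versions apart two version strings are."""
    return abs(parse_version(a)[0] - parse_version(b)[0])


cdh_spark = "1.6.0"   # Spark version in Cloudera CDH, per the matrix
hdp_spark = "2.3.1"   # Spark version in Hortonworks HDP, per the matrix

print(parse_version(cdh_spark) < parse_version(hdp_spark))  # → True
print(major_gap(cdh_spark, hdp_spark))                      # → 1
```

A non-zero major gap is a reasonable warning sign that code written against one distribution may need changes before it runs on the other.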

Conclusion

Although open source promises freedom from vendor lock-in, it has delivered on that only to some extent. One way to achieve compatibility is for the different vendors to offer the same set of software, at the same versions, within their suites. This would remove the diversity among them and give less choice to the end user. I don't think this is a solution we will see in the future.

Another way to avoid vendor lock-in is to use only the features common to the different distributions. With this approach, the end user might not be able to take full advantage of the Big Data distributions. In the above discussion, proprietary extensions have not been considered. Using them will leave one pretty much locked to the vendor.

It's difficult, or close to impossible, to move from one vendor to another in spite of the vendor promises. So, it's better to think twice before committing to a Big Data product from a particular vendor.
