Updated 2019-07-18: To address the open questions around what exactly Cloudera is going to open source, I added a statement from Cloudera.
Cloudera recently announced changes to its licensing and distribution model. The blog post is titled "Our Commitment to Open Source Software", but some of the changes are not made very explicit in the article. Cloudera has also published a FAQ to go along with those changes.
This is my summary of the most important changes. Remember, this is my interpretation of those two published pieces, nothing more, nothing less. I've tried to mostly keep opinions out of this and just focus on what's been published.
With the recent licensing changes from companies like MongoDB, Elastic or Confluent to "defend" against Amazon and other SaaS companies profiting from the open source work they are doing, it is very refreshing to see a company going against this trend and doubling down on open source. Cloudera is trying to follow in Red Hat's footsteps; we'll have to see how well that works out.
Others seem to be more optimistic in reading these changes than I am. There are articles to be found all over the place applauding Cloudera. I believe some of the things stated in those articles are false.
This statement from CBR, for example, is not true, I believe, as I'll show later:
with free unsupported releases, and paid-for versions featuring support and maintenance, updates and security patches to drive revenue.
Cloudera says they will make the source available for all their products under either the Apache 2 or the AGPL license.
However, there is this paragraph which mentions something about only paying customers getting access:
Yes, source code will be provided pursuant to the applicable open source license. Customers that have an active subscription agreement for CDP Private Cloud software or CDP Data Center software will have the ability to access CDP source repositories that Cloudera hosts.
Edit: This following section has been updated with a response from Cloudera
Cloudera sends this statement:
By February 2020, all of Cloudera's software will be licensed under the open source Apache Software License or the Affero General Public License. Anyone can download source code from public open source repositories and can modify and use the source code pursuant to those open source licenses. This will include products and features that were previously under a closed source license.
Cloudera’s open source licensing model is aligned with the approach developed by Red Hat and accepted globally by thousands of businesses. Cloudera plans to sell subscriptions that grant access to compiled software and the Cloudera hosted source it was built from plus consulting services, support, training, and tools for more quickly installing new releases and security patches to its data analysis software.
This means that we will indeed get access to all the sources but Cloudera will not provide any binaries. They will also not provide any release-specific source code. There is still some uncertainty about what exactly this means but it clarifies things a great deal.
Currently, Cloudera delivers the contents of its distribution in the form of Parcels (as well as RPM and DEB packages). It is still unclear whether the sources for these will also be open sourced.
Paying customers get access to all those sources, so presumably those must be covered by a different license or agreement.
This move will probably make it possible (or likely) for someone to create a "Community" version of the Cloudera distribution built from those sources for as long as Cloudera is going to maintain that product.
As someone frequently supporting these systems I'm very happy to get access to the full source code, so this is definitely a step in the right direction.
And let's not forget that Cloudera Manager and CDH are only one part of the offering. Things like Cloudera Data Science Workbench will also be open sourced, and they are definitely going to be useful.
Cloudera states the following:
Customers and developers will be able to access our products with a subscription agreement with Cloudera. We will have free (unsupported) subscription agreements for developers, and short-term trial subscriptions. Commercial subscriptions will be available for all customers who want support and maintenance, including access to software with the latest updates and security patches
This reads as very generous, but compared with today's state, things look a bit different. What's missing here? The current free versions of CDH and HDP (or the future CDP). The statement seems to be worded very carefully, but what it really says is that to use Cloudera software in production you will need a paid subscription agreement. Only customers, trial users and developers can access the products.
After the merger with Hortonworks, Cloudera has no competition left in the Hadoop distribution space (Hadoop obviously referring to much more than just Apache Hadoop). This leaves everyone currently running Cloudera software without paying with only four options: staying on the last free version (which looks to be CDH 6.2 and HDP 3.1), paying up, migrating to something else, or building all the software themselves once the source code has been released.
Something else will often be some cloud product but might also be a move to the upstream products (i.e. compile Apache Hadoop yourself). The latter is not an attractive option for most people.
Those two changes are really the gist of the announcement.
Either way, this is hopefully a good step for Cloudera: some users currently on the free version will obviously opt to buy a subscription rather than change their infrastructure, so this should be good news for the shareholders (including me).
I'm interested in any kind of feedback, including errors in or clarifications of my interpretation. We're also interested in hearing from customers who are currently running the free version and contemplating a move to a vanilla Hadoop stack. Talk to us about your requirements: what are you looking for in a (data engineering/data science/data analytics) distribution?