So the biggest revolution in database and analytics technology – namely the distributed batch processing technique known as MapReduce (and the associated Hadoop-centric ecosystem that has built up around it) is a legacy technology for one Silicon Valley player. Last week Google announced the arrival of Google Cloud Dataflow – a new service for cloud-based big data analytics that, Google says, supersedes MapReduce.[
While various VCs and corporations have pumped billions into Hadoop (Cloudera alone getting one of those billions), Google is now making a huge point of the fact that it abandoned MapReduce a long time ago as it was too hard to build flexible analytics pipelines.
So what did they replace it with? Well up to last week there was a lot of speculation, leaks, reports from former employees etc. but no real clue on the details. It was clear that internally they continued to use a distributed big data-crunching platform and that the initial MapReduce concepts were still a key part of that infrastructure. But we suspected, given the time advantage they had (they were working on these technologies for 10 years+ after all), they had evolved them into something with a real-time capability and that they optimized the algorithms for much faster results. Also given the many developer-years of data science that drives Google forward, one would assume that the MapReduce concept had been well refined in terms of usability, productivity etc. – with a large amount of the complexity abstracted away (as Pig had also attempted to do with MapReduce). Google has, of course, taken the concept of a data-driven business to a new level, where timely insights become mission critical, so they would have built out a robust, fault tolerant architecture and refined it to operate at massive scale in data centers across the globe.
So yes, Dataflow turns out to be all of the above. It’s clear that Dataflow makes the life of an analytics developer so much easier – especially if there is a real-time or online (in the machine learning sense of the word) aspect to the computation that is required, much the way IBM did with Infosphere Streams. It hides all the complexity. You subscribe to publish-subscribe message broker to pull a real-time feed, you specify the steps to take processing each message (transforms) and then you write the outputs back to another message broker. Alternatively you can access Google’s file system directly to read and write from files or access your data via BigQuery tables. As the pipeline is the same for both (parallel collections) you can reuse code written for batch processing for online processing jobs and vice versa. The presentation at Google I/O even suggests you can mix real-time and batch processing jobs – but no details were provided on how easy it is to do this.
The real magic on the Google side appears to be the power it has to optimize the data crunching pipelines. As real-world analytics jobs get much more complicated and feature rich, it can often be the case that you have to process the same sets of events multiple times. Dataflow can optimize your code so that it combines multiple passes on the data into a smaller number of processing steps and the Google tools can hide all this from the developer – so they can focus on what the jobs should do and not how they do it.
Now, adding real-time capabilities to Hadoop is not that revolutionary. I tried it myself in previous life – where we had a pretty powerful setup with Apache Kafka, feeding a Storm cluster doing real-time data enrichment and online machine learning, feeding outputs data into HBase, where batch processing jobs were operating on the now-enriched raw data. We also introduced an intermediate layer between real-time and batch processing – so that we could close the loop the other way – the outputs of batch processing could be made available via an in-memory cache, to the real-time layer. So the open-source building blocks are all out there – and getting better all the time. But not everyone wants to go to the expense of building their own analytics framework, so Dataflow does the heavy-lifting in the same way that our customers use Logentries for log analytics rather than building out their own infrastructure.
We are also particularly very pleased to see the Google announcement highlight the importance of log analytics (around 27:45 into the video)! We are very keen to see how Dataflow and Logentries could work together to solve some of these problems. We, of course, aim to be the one-stop service for all your log management needs, but we also recognize that an ecosystem of pre-integrated services (as per our announcement this week) can help solve more log management problems for more of our users.
Google Cloud Dataflow is in private beta right now – so the jury is still out on how ground breaking this announcement really is (we have asked nicely if we can try it out). But, if this really is Google opening up its own internal analytics infrastructure to the rest of the world as a cloud-hosted platform, then combined with Google’s arrival in the infrastructure-as-a -service space, it really looks like it could shake things up in the cloud services arena.
Expect a strong reaction from Amazon with new Kenesis-related announcements.