When Hadoop first started gaining attention and early adoption, it was inseparable – both technologically and rhetorically – from MapReduce, its then-venerable Big Data processing algorithm. But that's changing, and rapidly. With the release of Hadoop 2.0, MapReduce is taking a back seat to some newer technology. But of all the front-seat occupants, who will take the wheel?
MapReduce in Big Data history
Originally, the MapReduce algorithm was essentially "hard-wired" into the guts of Hadoop's cluster management infrastructure. It was a package deal; Big Data pioneers had to take the bitter with the sweet. At first this seemed reasonable, since MapReduce is truly powerful: it divides the query work – and the data – among the numerous servers in its cluster, coordinates the work between them, and combines their partial results into the answer.
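As a rough illustration of that division of labor, here is a minimal single-process sketch of the map, shuffle and reduce steps, using a word count as the example. It is not Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample data are made up for illustration, and each "chunk" stands in for the slice of data a real cluster would hand to one server.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group every emitted value under its key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    # On a real cluster each chunk would be mapped on a different server;
    # here the chunks are simply processed one after another.
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    return reduce_phase(shuffle(mapped))


# Hypothetical input: two chunks standing in for data held on two servers.
chunks = ["big data big cluster", "data cluster data"]
counts = word_count(chunks)
print(counts)
```

In a real deployment the intermediate results are written to disk between the steps and the whole job is scheduled as a batch.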
The problem underlying all of this is pretty simple. MapReduce's "batch" processing approach (where jobs are queued up and then dutifully run) doesn't cut it when multiple, short-lived queries need to be run in quick succession. Hadoop 2.0 introduces YARN (an acronym for "yet another resource negotiator") as a cluster management layer that is independent of any one processing algorithm. It can run MapReduce jobs, but it can host an array of other engines as well.
Along comes Spark
Meanwhile, separate from the development of YARN, an organization called AMPLab, within the University of California at Berkeley, developed an in-memory distributed processing engine called Spark. Spark can run on Hadoop clusters and, because it uses memory instead of disk, can also avoid MapReduce's batch mode malaise. Better still, Hortonworks worked with personnel at Databricks (the commercial entity founded by Spark's AMPLab creators) to make Spark run on YARN.
So far, so good. YARN provides a general framework for batch and interactive engines to process data in a Hadoop cluster, and Spark is one such engine, which utilizes Random Access Memory (RAM) for very fast results on certain workloads.
A question remains though: what about other Hadoop distribution components – like SQL query layer Hive or data transformation scripting environment Pig – that rely on MapReduce? How can those components be retrofitted to take advantage of the shifts in Hadoop's architecture?
Up the stack
Hortonworks, whose engineering team effectively spearheaded the work on YARN, also created a component for it called Tez that sits between Hive or Pig on the one hand, and YARN on the other. Hortonworks added Tez to Hadoop's Apache Software Foundation source code, as it did with an updated version of Hive.
Get the most recent versions of Hive and Hadoop itself and, bam!, you can use them interactively for iterative query work. Meanwhile, an industry consortium, which includes Cloudera and MapR, has announced it will be retrofitting Hive and Pig – as well as most other Hadoop distro components – to run directly against Spark.
Symbiotic adversaries
Spark and Tez, which in most contexts likely wouldn't be compared, suddenly find themselves competitors. Both of them pave the way for MapReduce's diminished influence, and for interactive Hadoop to move to the mainstream. But with the competing approaches they offer, there is a risk of fragmentation here, and customers should take notice.
In-memory engines work extremely well for certain workloads, machine learning chief among them. But making an in-memory engine the default for almost every job, especially those that get into petabyte-scale (or higher) data volumes, seems unorthodox.
A Hadoop in which batch-oriented MapReduce was the only option for data processing wasn't Enterprise-ready. YARN, Tez and Spark have all emerged out of the need to address that shortcoming. The irony here is that giving customers multiple ways to use the very same Hadoop distribution components isn't especially well-suited to the Enterprise either.
The engines, united?
If YARN's open architecture is to enable multiple, nuanced, overlapping solutions, then an optimizer that picks the right one for a given query may be needed, so that the customer needn't make that decision, query after query. Choice is good, but fragmentation and complexity are not.
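To make the idea of such an optimizer concrete, here is a deliberately simple, rule-based sketch. Everything in it is hypothetical: the `Query` and `Cluster` types, the `choose_engine` function and the rules themselves are illustrative assumptions, not part of Hadoop, YARN, Tez or Spark, and a real optimizer would need far richer cost information.

```python
from dataclasses import dataclass


@dataclass
class Query:
    input_bytes: int    # estimated size of the data the query will read
    iterative: bool     # makes repeated passes over the data (e.g. machine learning)
    interactive: bool   # a user is waiting on the result


@dataclass
class Cluster:
    total_memory_bytes: int  # aggregate RAM available across the cluster


def choose_engine(query: Query, cluster: Cluster) -> str:
    """Pick an engine name using hypothetical rules of thumb."""
    fits_in_memory = query.input_bytes <= cluster.total_memory_bytes
    # Assumption: iterative work benefits most from an in-memory engine,
    # but only when the working set actually fits in RAM.
    if query.iterative and fits_in_memory:
        return "spark"
    # Assumption: interactive queries that do not qualify above go to Tez.
    if query.interactive:
        return "tez"
    # Everything else falls back to plain batch processing.
    return "mapreduce"


TB = 10**12
cluster = Cluster(total_memory_bytes=4 * TB)

training_job = Query(input_bytes=1 * TB, iterative=True, interactive=False)
ad_hoc_query = Query(input_bytes=2000 * TB, iterative=False, interactive=True)
nightly_batch = Query(input_bytes=2000 * TB, iterative=False, interactive=False)

for q in (training_job, ad_hoc_query, nightly_batch):
    print(choose_engine(q, cluster))
```

The point of the sketch is only that the decision is made once, in one place, rather than by the customer for every query.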
In the 1980s, the UNIX operating system splintered badly, and this impeded market momentum for that operating system. In this decade, Hadoop has become a data operating system. Hopefully it will avoid UNIX-like entropy.