Apache Spark has been winning over users since it was developed at the University of California, Berkeley's AMPLab in 2009, but it has taken on a whole new level of popularity in the last year. All of the major Hadoop distributions now support it, it’s a top-level Apache Software Foundation project, and there’s a startup, called Databricks, dedicated to productizing, supporting and certifying Spark. Matei Zaharia, one of the creators of Spark and the co-founder and CTO of Databricks, came on the Structure Show podcast this week to talk about what Spark is and why people love it so much.
Here are the highlights of that interview, but anyone interested in the history and capabilities of Spark, or where the big data industry might be heading, will want to hear the whole thing. The second-annual Spark Summit also kicks off Monday in San Francisco, should anyone want to plan a last-minute trip.
Spark is fast
“Basically, [Spark is] based on seeing how some of the earliest users were using MapReduce and seeing the problems they had, and trying to improve on the model to solve those problems,” Zaharia explained.
He continued: “The thing that got us started was some users of Hadoop at UC Berkeley actually, in our lab, who were doing machine learning, wanted to run the algorithms at scale on Hadoop and they ran them and they said, ‘Well, actually, because it’s doing all these scans over the data this is slower than me running it on my laptop. So can we design a distributed execution engine that can actually scale these out?’ … As we went along, we started covering other use cases beyond machine learning, as well.”
And while Spark is best known for being much faster than MapReduce because it processes data in memory, Zaharia said it can still be 5 to 10 times faster than MapReduce when working from disk, depending on the workload. The goal, he said, was to let users write the same program and run it anywhere.
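The iterative-workload problem those Berkeley researchers hit can be sketched in plain Python (this is an illustration of the idea, not Spark’s actual API): an algorithm that re-scans its input on every pass pays the I/O cost on every iteration, while one that loads the data once and iterates over the in-memory copy pays it once — which is roughly what Spark’s in-memory caching provides.

```python
# Sketch (plain Python, not Spark): why iterative algorithms are slow
# when every pass re-scans the input, and why in-memory caching helps.

reads = {"count": 0}

def scan_input():
    """Stand-in for a full pass over data on disk, as each
    MapReduce job would do."""
    reads["count"] += 1
    return [1.0, 2.0, 3.0, 4.0]

# MapReduce-style: each of 10 iterations re-reads the data from disk.
total = 0.0
for _ in range(10):
    total = sum(scan_input())
mapreduce_scans = reads["count"]   # 10 full scans of the input

# Spark-style: load once, keep it in memory, iterate over the cached copy.
reads["count"] = 0
cached = scan_input()              # one scan; analogous to caching an RDD
for _ in range(10):
    total = sum(cached)
spark_scans = reads["count"]       # 1 scan

print(mapreduce_scans, spark_scans)  # → 10 1
```

The arithmetic is trivial here, but for a machine-learning algorithm making dozens of passes over terabytes, avoiding the repeated scans is the difference Zaharia describes.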
Spark is flexible
Hadoop really is a revolutionary technology, but probably more because it lets users store unprecedented amounts of data for much less cost than previously possible than because MapReduce is the best thing since sliced bread. “Spark actually extends and generalizes the MapReduce execution model to be able to do more types of computations more efficiently,” Zaharia said.
“OK, now that you've stored a bunch of data, what can you compute with it?” he continued later. “And the MapReduce model initially was designed for these batch jobs that existed at web companies that they need to run once a night, so it was fine for that. And after that people wanted to do more and more things.
“In Spark, in some of the research we did, we explicitly wanted to come up with a single programming model that is very general that covers these interactive [SQL] use cases, the streaming ones, the more complex applications. I think the thing that really sets Spark apart compared to some other systems that tackle these is that it can actually do all of them. You only have to learn one system and you can easily make an application that combines these. It's only one thing to manage, and I think that's what gets people interested in it.”
Zaharia noted commercial use cases ranging from machine learning at Yahoo to real-time video optimization at Conviva (another company Zaharia co-founded along with Databricks CEO Ion Stoica), and scientific use cases ranging from DNA sequencing to analyzing huge datasets from NASA telescopes.
And going forward, he said, Spark could be extended to work with additional storage layers beyond HDFS, including HBase, Cassandra and MongoDB.
Spark is easy (relatively speaking)
“The thing we're excited about is if you're a much smaller organization, a much smaller group, and you just want to get started with big data, you can do a lot of things — either basic things or complex things — in just a few lines of code,” Zaharia explained. “It's just much easier to get started.”
However, he acknowledged later in the interview, being easier than MapReduce still doesn’t mean a whole world of folks without serious big data and distributed systems skills will start deploying Spark tomorrow: “Really, I think the biggest problem with big data is how do you use it if you're not a tech company, if you don't have a lot of expertise, a lot of Ph.D.s in-house that can work with it. I think innovation in the tools that let the next 10 times more companies use big data is where the action will be.”
This blog post co-written by Zaharia on the Cloudera blog helps explain why and how Spark is simpler than MapReduce.
Example code for writing a word-count job in Spark — about one-eighth the code required to do the same in MapReduce. Source: Cloudera/Databricks
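The figure itself isn’t reproduced here, but the word-count pattern it showed can be sketched in plain Python (again, an illustration of the model, not Spark’s API): the map step emits a (word, 1) pair per word, and the reduce step sums the counts per word — the same two steps a Spark job expresses in a few lines with flatMap and reduceByKey, versus the boilerplate-heavy mapper and reducer classes MapReduce requires.

```python
from collections import defaultdict

def word_count(lines):
    """Map/reduce-style word count over a list of text lines."""
    # Map phase: emit one (word, 1) pair per word in the input.
    pairs = [(word, 1) for line in lines for word in line.split()]
    # Reduce phase: sum the counts grouped by word
    # (what reduceByKey does in Spark).
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

print(word_count(["to be or not to be"]))
# → {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The brevity gap the caption cites comes from Spark handling the grouping, shuffling and job plumbing that a MapReduce program has to spell out by hand.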
Spark plays nice and lets Hadoop shine
Even with all the talk about all the things Spark can do, though, Zaharia is still quick to point out that it’s not always the best tool for the job. Take interactive SQL queries, for example, a space where Spark has a sub-project called Shark. “That's a place where if you have a smaller cluster, all of these [SQL-on-Hadoop] technologies are trying to play catchup with [existing databases],” he said.
And as companies like Google push entirely new, simple frameworks for big data processing, Zaharia thinks Spark is one of the things that could help Hadoop match those capabilities. “I don't think Hadoop is a thing of the past … MapReduce might be a thing of the past,” he said. “The Hadoop ecosystem of projects is still very strong and Spark is part of that ecosystem. … Essentially, most companies that are not Google or Microsoft are actually using this stack of systems, so I think that's a very important benefit of it.”
“In practice, no one uses one of these technologies in a vacuum,” he noted earlier in the interview. “[Spark is] part of a pipeline with more pieces, and being able to just easily build that pipeline is the interesting part.”