It’s been a while since I have posted… in the midst of trying to plow through this dissertation while working on papers for submission to some conferences.
Hadoop has become the de facto standard in the research and industry uses of small and large-scale MapReduce. Since its inception, an entire ecosystem has been built around it including conferences (Hadoop World, Hadoop Summit), books, training, and commercial distributions (Cloudera, Hortonworks, MapR) with support. Several projects that integrate with Hadoop have been released from the Apache incubator and are designed for certain use cases:
- Pig, developed at Yahoo, is a high-level scripting language for working with big data and Hive is a SQL-like query language for big data in a warehouse configuration.
- HBase, developed at Facebook, is a column-oriented database often used as a datastore on which MapReduce jobs can be executed.
- ZooKeeper and Chukwa
- Mahout is a library for scalable machine learning, part of which can use Hadoop.
- Cascading (Chris Wensel), Oozie (Yahoo) and Azkaban (LinkedIn) provide MapReduce job workflows and scheduling.
Hadoop is meant to be modeled after Google MapReduce. To store and process huge amounts of data, we typically need several machines in some cluster configuration. A distributed filesystem (HDFS for Hadoop) uses space across a cluster to store data so that it appears to be in a contiguous volume and provides redundancy to prevent data loss. The distributed filesystem also allows data collectors to dump data into HDFS so that it is already prime for use with MapReduce. A Data Scientist or Software Engineer then writes a Hadoop MapReduce job.
As a review, the Hadoop job consists of two main steps, a map step and a reduce step. There may optionally be other steps before the map phase or between the map and reduce phases. The map step reads in a bunch of data, does something to it, and emits a series of key-value pairs. One can think of the map phase as a partitioner. In text mining, the map phase is where most parsing and cleaning is performed. The output of the mappers is sorted and then fed into a series of reducers. The reduce step takes the key value pairs and computes some aggregate (reduced) set of data such as a sum, average, etc. The trivial word count exercise starts with a map phase where text is parsed and a key-value pair is emitted: a word, followed by the number “1″ indicating that the key-value pair represents 1 instance of the word. The user might also emit something to coerce Hadoop into passing data into different reducers. The words and 1s are sorted and passed to the reducers. The reducers take like key-value pairs and compute the number of times the word appears in the original input.
After working extensively with (Vanilla) Hadoop professional for the past 6 months, and at home for research, I have found several nagging issues with Hadoop that have convinced me to look elsewhere for everyday use and certain applications. For these applications, the though of writing a Hadoop job makes me take a deep breath. Before I continue, I will say that I still love Hadoop and the community.
- Writing Hadoop jobs in Java is very time consuming because everything must be a class, and many times these classes extend several other classes or extend multiple interfaces; the Java API is very bloated. Adding a simple counter to a Hadoop job becomes a chore of its own.
- Documentation for the bloated Java API is sufficient, but not the most helpful.
- HDFS is complicated and has plenty of issues of its own. I recently heard a story about data loss in HDFS just because the IP address block used by the cluster changed.
- Debugging a failure is a nightmare; is it the code itself? Is it a configuration parameter? Is it the cluster or one/several machines on the cluster? Is it the filesystem or disk itself? Who knows?!
- Logging is verbose to the point that finding errors is like finding a needle in a haystack. That is, if you are even lucky to have an error recorded! I’ve had plenty of instances where jobs fail and there is absolutely nothing in the stdout or stderr logs.
- Large clusters require a dedicated team to keep it running properly, but that is not surprising.
- Writing a Hadoop job becomes a software engineering task rather than a data analysis task.
Hadoop will be around for a long time, and for good reason. MapReduce cannot solve every problem (fact), and Hadoop can solve even fewer problems (opinion?). After dealing with some of the innards of Hadoop, I’ve often said to myself “there must be a better way.” For large corporations that routinely crunch large amounts of data using MapReduce, Hadoop is still a great choice. For research, experimentation, and everyday data munging, one of these other frameworks may be better if the advantages of HDFS are not necessarily imperative:
Unlike Hadoop, BashReduce is just a script! BashReduce implements MapReduce for standard Unix commands such as sort, awk, grep, join etc. It supports mapping/partitioning, reducing, and merging. The developers note that BashReduce “sort of” handles task coordination and a distributed file system. In my opinion, these are strengths rather than weaknesses. There is actually no task coordination as a master process simply fires off jobs and data. There is also no distributed file system at all, but BashReduce will distribute files to worker machines. Of course, without a distributed file system there is a lack of fault-tolerance among other things.
Intermachine communication is facilitated with simple passwordless SSH, but there is a large cost associated with transferring files from a master machine to its workers whereas with Hadoop, data is stored centrally in HDFS. Additionally, partition/merge in the standard unix tools is not optimized for this use case, thus the developer had to use a few additional C programs to speed up the process.
Compared to Hadoop, there is less complexity and faster development. The result is the lack of fault-tolerance, and lack of flexibility as BashReduce only works with certain Unix commands. Unlike Hadoop, BashReduce is more of a tool than a full system for MapReduce. BashReduce was developed by Erik Frey et. al. of last.fm.
Disco was initially developed by Nokia Research and has been around silently for a few years. Developers write MapReduce jobs in simple, beautiful Python. Disco’s backend is written in Erlang, a scalable functional language with built-in support for concurrency, fault tolerance and distribution — perfect for a MapReduce system! Similar to Hadoop, Disco distributes and replicates data, but it does not use its own file system. Disco also has efficient job scheduling features.
It seems that Disco is a pretty standard and powerful MapReduce implementation that removes some of the painful aspects of Hadoop, but it also likely removes persistent fault tolerance as it relies on a standard filesystem rather than one like HDFS, but Erlang may impose some functionality that provides a “good enough” level of fault tolerance for data.
Spark is one of the newest players in the MapReduce field. Its purpose is to make data analytics fast to write, and fast to run. Unlike many MapReduce systems, Spark allows in-memory querying of data (even distributed across machines) rather than using disk I/O. It is of no surprise then that Spark out-performs Hadoop on many iterative algorithms. Spark is implemented in Scala, a functional object-oriented language that sits on top of the JVM. Similar to other languages like Python, Ruby, and Clojure, Scala has an interactive propt and users can use Spark to query big data straight from the Scala interpreter.
One wrinkle is that Spark requires installing a cluster manager called Mesos. I had some difficulty installing it on Ubuntu, but the development team was an amazing help, and made a few changes to the source and now it runs well. On the downside, Mesos adds a layer of complexity that we are trying to avoid. On the upside, Mesos allows Spark to co-exist with Hadoop and it can read any data source that Hadoop supports, and it “feels” light, similar to Disco’s server UI.
Spark was developed by the UC Berkeley AMP Lab. Currently, its main users are UC Berkeley researchers and Conviva. Hadoop Summit 2011 featured a talk on Spark by one of the developers, which I wrote about earlier this summer.
GraphLab was developed at Carnegie Mellon and is designed for use in machine learning. GraphLab’s goal is to make the design and implementation of efficient and correct parallel machine learning algorithms easier. Their website states that paradigms like MapReduce lack expressiveness while lower level tools such as MPI present overhead by requiring the researcher to write code that beats a dead horse.
GraphLab has its own version of the map stage, called the update phase. Unlike MapReduce, the update phase can both read and modify overlapping sets of data. Recall that MapReduce requires data to be partitioned. GraphLab accomplishes this by allowing the user to specify data as a graph where each vertex and edge in the graph is associated memory. The update phases can be chained in such a way such that one update function can recursively trigger other update functions that operate on vertices in the graph. This graph-based approach would not only make machine learning on graphs more tractable, but it also improves dynamic iterative algorithms.
GraphLab also has its own version of the reduce stage, called the sync operation. The results of the sync operation are global and can be used by all vertices in the graph. In MapReduce, output from the reducers is local (until committed) and there is a strict data barrier among reducers. The sync operations are performed at time intervals, and there is not as strong of a tie between the update and sync phases. What I mean is that the sync intervals are not necessarily dependent on some prior update completing.
GraphLab’s website also contains the original UAI paper and presentation, a document better explaining the abstraction, and there is even a Google Group for the GraphLab API. To me, GraphLab seems like a very powerful generalization, and re-specification, of MapReduce.
Recently, Nathan Marz of BackType made waves in the Twitter big data community with a blog post titled Preview of Storm: The Hadoop of Realtime Processing. Within a day, Storm became known as “Real-time Hadoop” to the chagrin of some developers from Apache. Hadoop is a batch-processing system — that is, give it a lot of fixed data and it does something with it. Storm is real-time — it processes data in parallel as it streams.
Marz writes that with their previous system, much time was spent worrying about graphs of queues and workers: where to send and receive messages, deploying workers and queues, and a lack of fault tolerance. Storm abstracts all of these complications away. Storm is written in Clojure, but any programming language can be used to write programs on top of Storm. Storm is fault-tolerant, horizontally scalable, and reliable. Storm is also very fast, with ZeroMQ used as the underlying message passing system.
Nathan Marz is a software developer at BackType, and made waves in 2010 with Cascalog. Cascalog really took off after his presentation at the 2010 Hadoop Summit, and I am delighted I got to see him present it. Storm will be open-sourced soon and I hope to write more about it later.
I included Storm in this post based on its colloquial name “Real-time Hadoop” — it is not clear to me whether or not Storm even uses MapReduce though.
HPCC Systems (from LexisNexis)
Perhaps the project with the least flattering name comes from LexisNexis, which has developed its own framework for massive data analytics. HPCC attempts to make writing parallel-processing workflows easier by using Enterprise Control Language (ECL), a declarative, data-centric language. I should note that SQL, Datalog and Pig are also said to be declarative, data-centric languages. A matter of fact, the development team has a converter for translating Pig jobs to ECL. HPCC is written in C++. Some have commented that this will make in-memory querying much faster because there is less bloated object sizes originating from the JVM. I also prefer C++ simply because it feels closer to human though — we think in terms of objects (object-oriented) at times, and a series of steps (procedural) at other times and use both thought processes together.
HPCC already has its own jungle of technologies like Hadoop. HPCC has two “systems” for processing and serving data: the Thor Data Refinery Cluster, and the Roxy Rapid Data Delivery Cluster. Thor is a data processor, like Hadoop. Roxie is similar to a data warehouse (like HBase) and supports transactions. HPCC uses a distributed file system.
Although details are still preliminary as is the system, this certainly has a “feel” for potentially being a solid alternative for Hadoop, but only time will tell.
With all these alternatives, why use Hadoop?
One word: HDFS. For a moment, assume you could bring all of your files and data with you everywhere you go. No matter what system, or type of system, you login to, your data is intact waiting for you. Suppose you find a cool picture on the Internet. You save it directly to your file store and it goes everywhere you go. HDFS gives users the ability to dump very large datasets (usually log files) to this distributed filesystem and easily access it with tools, namely Hadoop. Not only does HDFS store a large amount of data, it is fault tolerant. Losing a disk, or a machine, typically does not spell disaster for your data. HDFS has become a reliable way to store data and share it with other open-source data analysis tools. Spark can read data from HDFS, but if you would rather stick with Hadoop, you can try to spice it up:
Hadoop Streaming is an easy way to avoid the monolith of Vanilla Hadoop without leaving HDFS, and allows the user to write map and reduce functions in any language that supports writing to stdout, and reading from stdin. Choosing a simple language such as Python for Streaming allows the user to focus more on writing code that processes data rather than software engineering. Once code is written, it is easy to test from the command line:
cat a_bunch_of_files | ./mapper.py | sort | ./reducer.py
And, running and monitoring the job is similar to Vanilla Hadoop. Hadoop Streaming was my first introduction to Hadoop and it was quite pleasant.
Or, you could use a Hadoopified project that better solves the problem. Vanilla Hadoop can do some sophisticated stuff, but it suffers the problems I mentioned at the beginning of the post. Developers have created software that works on HDFS, but is geared toward different audiences. A Data Scientist may prefer Pig or Hive for data analysis whereas a Systems and Software Engineer may prefer a workflow solution (Oozie, Cascading etc.) and a (modern) DBA may want to use HBase. Each of these achieve different goals, but still rely on HDFS.