It’s been a while since I have posted… in the midst of trying to plow through this dissertation while working on papers for submission to some conferences.

Hadoop has become the de facto standard in research and industry for small and large-scale MapReduce. Since its inception, an entire ecosystem has been built around it, including conferences (Hadoop World, Hadoop Summit), books, training, and commercial distributions with support (Cloudera, Hortonworks, MapR). Several projects that integrate with Hadoop have graduated from the Apache Incubator and are designed for certain use cases:

• Pig, developed at Yahoo, is a high-level scripting language for working with big data, and Hive is a SQL-like query language for big data in a warehouse configuration.
• HBase, initially developed at Powerset, is a column-oriented database often used as a datastore on which MapReduce jobs can be executed.
• ZooKeeper is a coordination service for distributed applications, and Chukwa is a data collection system for monitoring large distributed systems.
• Mahout is a library for scalable machine learning, part of which can use Hadoop.
• Cascading (Chris Wensel), Oozie (Yahoo) and Azkaban (LinkedIn) provide MapReduce job workflows and scheduling.

Hadoop is modeled after Google MapReduce. To store and process huge amounts of data, we typically need several machines in some cluster configuration. A distributed filesystem (HDFS for Hadoop) uses space across a cluster to store data so that it appears to be in a contiguous volume while providing redundancy to prevent data loss. The distributed filesystem also allows data collectors to dump data into HDFS so that it is already primed for use with MapReduce. A data scientist or software engineer then writes a Hadoop MapReduce job.

As a review, a Hadoop job consists of two main steps, a map step and a reduce step. There may optionally be other steps before the map phase or between the map and reduce phases. The map step reads in a bunch of data, does something to it, and emits a series of key-value pairs. In text mining, the map phase is where most parsing and cleaning is performed. The output of the mappers is sorted, partitioned by key, and then fed into a series of reducers. The reduce step takes the key-value pairs and computes some aggregate (reduced) set of data, such as a sum or an average. The trivial word count exercise starts with a map phase where text is parsed and a key-value pair is emitted: a word, followed by the number “1” indicating that the pair represents one instance of the word. The user may also supply a custom partitioner to control which reducer receives which keys. The words and 1s are sorted and passed to the reducers. The reducers group pairs with like keys and compute the number of times each word appears in the original input.
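The word count flow above can be sketched in a few lines of plain Python (the names here are hypothetical; this just simulates the map, sort, and reduce stages in a single process):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(text):
    # parse the text and emit (word, 1) for each word; the "1" marks
    # a single occurrence of that word
    for word in text.split():
        yield (word.lower(), 1)

def reduce_phase(sorted_pairs):
    # pairs arrive sorted by key, so identical words are adjacent;
    # sum the 1s to get each word's total count
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

# Hadoop sorts mapper output before the reducers see it;
# sorted() stands in for that shuffle/sort step here.
pairs = sorted(map_phase("the cat sat on the mat"))
counts = dict(reduce_phase(pairs))
```

Because the pairs are sorted before reduction, each word's 1s land next to each other, which is exactly why the reducer can stream through them with a simple group-and-sum.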

After working extensively with (vanilla) Hadoop professionally for the past 6 months, and at home for research, I have found several nagging issues with Hadoop that have convinced me to look elsewhere for everyday use and certain applications. For these applications, the thought of writing a Hadoop job makes me take a deep breath. Before I continue, I will say that I still love Hadoop and the community.

• Writing Hadoop jobs in Java is very time consuming because everything must be a class, and many of these classes extend several other classes or implement multiple interfaces; the Java API is very bloated. Adding a simple counter to a Hadoop job becomes a chore of its own.
• Documentation for the bloated Java API is sufficient, but not the most helpful.
• HDFS is complicated and has plenty of issues of its own. I recently heard a story about data loss in HDFS just because the IP address block used by the cluster changed.
• Debugging a failure is a nightmare; is it the code itself? Is it a configuration parameter? Is it the cluster or one/several machines on the cluster? Is it the filesystem or disk itself? Who knows?!
• Logging is verbose to the point that finding errors is like finding a needle in a haystack. That is, if you are even lucky to have an error recorded! I’ve had plenty of instances where jobs fail and there is absolutely nothing in the stdout or stderr logs.
• Large clusters require a dedicated team to keep them running properly, but that is not surprising.

Hadoop will be around for a long time, and for good reason. MapReduce cannot solve every problem (fact), and Hadoop can solve even fewer problems (opinion?). After dealing with some of the innards of Hadoop, I’ve often said to myself “there must be a better way.” For large corporations that routinely crunch large amounts of data using MapReduce, Hadoop is still a great choice. For research, experimentation, and everyday data munging, one of these other frameworks may be better if the advantages of HDFS are not necessarily imperative:

BashReduce

Unlike Hadoop, BashReduce is just a script! BashReduce implements MapReduce for standard Unix commands such as sort, awk, grep, and join. It supports mapping/partitioning, reducing, and merging. The developers note that BashReduce only “sort of” handles task coordination and a distributed file system. In my opinion, these are strengths rather than weaknesses. There is actually no task coordination, as a master process simply fires off jobs and data. There is also no distributed file system at all, but BashReduce will distribute files to worker machines. Of course, without a distributed file system there is a lack of fault tolerance, among other things.

Intermachine communication is facilitated with simple passwordless SSH, but there is a large cost associated with transferring files from a master machine to its workers, whereas with Hadoop, data is stored centrally in HDFS. Additionally, partitioning and merging in the standard Unix tools are not optimized for this use case, so the developers wrote a few additional C programs to speed up the process.

Compared to Hadoop, there is less complexity and faster development. The trade-off is a lack of fault tolerance and a lack of flexibility, as BashReduce only works with certain Unix commands. Unlike Hadoop, BashReduce is more of a tool than a full system for MapReduce. BashReduce was developed by Erik Frey et al. of Last.fm.

Disco Project

Disco was initially developed by Nokia Research and has been around silently for a few years. Developers write MapReduce jobs in simple, beautiful Python. Disco’s backend is written in Erlang, a scalable functional language with built-in support for concurrency, fault tolerance and distribution — perfect for a MapReduce system! Similar to Hadoop, Disco distributes and replicates data, but it does not use its own file system. Disco also has efficient job scheduling features.

Disco seems to be a fairly standard and powerful MapReduce implementation that removes some of the painful aspects of Hadoop. Because it relies on a standard filesystem rather than one like HDFS, it likely gives up persistent fault tolerance, though Erlang may provide a “good enough” level of fault tolerance for data.
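A sketch in the style of Disco's Python API, based on its word-count example (exact signatures may differ by version, so treat the commented-out job submission as an assumption rather than gospel):

```python
def fun_map(line, params):
    # emit a (word, 1) pair for every word in an input line
    for word in line.split():
        yield word, 1

def fun_reduce(iter, params):
    # group sorted pairs by word and sum their counts
    from itertools import groupby
    for word, pairs in groupby(sorted(iter), key=lambda kv: kv[0]):
        yield word, sum(n for _, n in pairs)

# On a running Disco cluster the job would be submitted roughly like:
#   from disco.core import Job, result_iterator
#   job = Job().run(input=["tag://data:mylogs"],
#                   map=fun_map, reduce=fun_reduce)
#   for word, count in result_iterator(job.wait()):
#       print(word, count)

# The same functions can be exercised locally without a cluster:
pairs = list(fun_map("to be or not to be", None))
counts = dict(fun_reduce(pairs, None))
```

The appeal is exactly this: the map and reduce logic is plain Python that you can test at the interpreter before handing it to the cluster.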

Spark

Spark is one of the newest players in the MapReduce field. Its purpose is to make data analytics fast to write and fast to run. Unlike many MapReduce systems, Spark allows in-memory querying of data (even distributed across machines) rather than relying on disk I/O. It is no surprise, then, that Spark outperforms Hadoop on many iterative algorithms. Spark is implemented in Scala, a functional, object-oriented language that sits on top of the JVM. Similar to languages like Python, Ruby, and Clojure, Scala has an interactive prompt, and users can use Spark to query big data straight from the Scala interpreter.

One wrinkle is that Spark requires installing a cluster manager called Mesos. I had some difficulty installing it on Ubuntu, but the development team was an amazing help; they made a few changes to the source, and now it runs well. On the downside, Mesos adds a layer of complexity that we are trying to avoid. On the upside, Mesos allows Spark to co-exist with Hadoop, it can read any data source that Hadoop supports, and it “feels” light, similar to Disco’s server UI.

Spark was developed by the UC Berkeley AMP Lab. Currently, its main users are UC Berkeley researchers and Conviva. Hadoop Summit 2011 featured a talk on Spark by one of the developers, which I wrote about earlier this summer.

GraphLab

GraphLab was developed at Carnegie Mellon and is designed for use in machine learning. GraphLab’s goal is to make the design and implementation of efficient and correct parallel machine learning algorithms easier. Its website states that paradigms like MapReduce lack expressiveness, while lower-level tools such as MPI add overhead by requiring the researcher to write a great deal of repetitive low-level code.

GraphLab has its own version of the map stage, called the update phase. Unlike MapReduce, the update phase can both read and modify overlapping sets of data. (Recall that MapReduce requires data to be partitioned.) GraphLab accomplishes this by allowing the user to specify the data as a graph where each vertex and edge is associated with data in memory. Update functions can be chained so that one update function can recursively trigger other update functions that operate on vertices in the graph. This graph-based approach not only makes machine learning on graphs more tractable, it also improves the performance of dynamic iterative algorithms.

GraphLab also has its own version of the reduce stage, called the sync operation. The results of the sync operation are global and can be used by all vertices in the graph. In MapReduce, output from the reducers is local (until committed) and there is a strict data barrier among reducers. Sync operations are performed at time intervals, so the tie between the update and sync phases is weaker: sync intervals are not necessarily dependent on some prior update completing.
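A toy rendering of the update/sync abstraction in Python may make the contrast with MapReduce clearer. GraphLab itself is a C++ framework; every name and the scheduling loop below are illustrative only, not its real API:

```python
# Data is specified as a graph: each vertex carries its own mutable state.
graph = {
    "a": {"value": 1.0, "neighbors": ["b", "c"]},
    "b": {"value": 1.0, "neighbors": ["c"]},
    "c": {"value": 1.0, "neighbors": ["a"]},
}

def update(vertex):
    # Unlike a MapReduce mapper, an update function may read and modify
    # overlapping data: it reads its in-neighbors' values in place.
    incoming = [data["value"] for v, data in graph.items()
                if vertex in data["neighbors"]]
    graph[vertex]["value"] = 0.15 + 0.85 * sum(incoming) / max(len(incoming), 1)
    # Returning the neighbor list models one update recursively
    # triggering further updates on other vertices.
    return graph[vertex]["neighbors"]

def sync():
    # A sync computes a global aggregate visible to all vertices.
    return sum(data["value"] for data in graph.values()) / len(graph)

pending = list(graph)
for _ in range(5):  # a few rounds of chained updates
    pending = [w for v in pending for w in update(v)]
avg = sync()  # global average vertex value
```

Note how the sync runs on its own schedule at the end, independent of which updates have fired, which mirrors the loose coupling described above.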

GraphLab’s website also contains the original UAI paper and presentation, a document better explaining the abstraction, and there is even a Google Group for the GraphLab API. To me, GraphLab seems like a very powerful generalization, and re-specification, of MapReduce.

Storm

Recently, Nathan Marz of BackType made waves in the Twitter big data community with a blog post titled Preview of Storm: The Hadoop of Realtime Processing. Within a day, Storm became known as “Real-time Hadoop” to the chagrin of some developers from Apache. Hadoop is a batch-processing system — that is, give it a lot of fixed data and it does something with it. Storm is real-time — it processes data in parallel as it streams.

Marz writes that with their previous system, much time was spent worrying about graphs of queues and workers: where to send and receive messages, deploying workers and queues, and a lack of fault tolerance. Storm abstracts all of these complications away. Storm is written in Clojure, but any programming language can be used to write programs on top of Storm. Storm is fault-tolerant, horizontally scalable, and reliable. Storm is also very fast, with ZeroMQ used as the underlying message passing system.

Nathan Marz is a software developer at BackType, and made waves in 2010 with Cascalog. Cascalog really took off after his presentation at the 2010 Hadoop Summit, and I am delighted I got to see him present it. Storm will be open-sourced soon and I hope to write more about it later.

I included Storm in this post based on its colloquial name “Real-time Hadoop” — it is not clear to me whether or not Storm even uses MapReduce though.

HPCC Systems (from LexisNexis)

Perhaps the project with the least flattering name comes from LexisNexis, which has developed its own framework for massive data analytics. HPCC attempts to make writing parallel-processing workflows easier by using Enterprise Control Language (ECL), a declarative, data-centric language. I should note that SQL, Datalog, and Pig are also said to be declarative, data-centric languages. As a matter of fact, the development team has a converter for translating Pig jobs to ECL. HPCC is written in C++. Some have commented that this will make in-memory querying much faster because there is no object-size bloat originating from the JVM. I also prefer C++ simply because it feels closer to human thought: we think in terms of objects (object-oriented) at times, and a series of steps (procedural) at other times, and use both thought processes together.

HPCC already has its own jungle of technologies, like Hadoop. HPCC has two “systems” for processing and serving data: the Thor Data Refinery Cluster and the Roxie Rapid Data Delivery Cluster. Thor is a data processor, like Hadoop. Roxie is similar to a data warehouse (like HBase) and supports transactions. HPCC uses a distributed file system.

Although details are still preliminary, as is the system itself, HPCC certainly has the “feel” of a potentially solid alternative to Hadoop, but only time will tell.

With all these alternatives, why use Hadoop?

One word: HDFS. For a moment, assume you could bring all of your files and data with you everywhere you go. No matter what system, or type of system, you login to, your data is intact waiting for you. Suppose you find a cool picture on the Internet. You save it directly to your file store and it goes everywhere you go. HDFS gives users the ability to dump very large datasets (usually log files) to this distributed filesystem and easily access it with tools, namely Hadoop. Not only does HDFS store a large amount of data, it is fault tolerant. Losing a disk, or a machine, typically does not spell disaster for your data. HDFS has become a reliable way to store data and share it with other open-source data analysis tools. Spark can read data from HDFS, but if you would rather stick with Hadoop, you can try to spice it up:

Hadoop Streaming is an easy way to avoid the monolith of Vanilla Hadoop without leaving HDFS, and allows the user to write map and reduce functions in any language that supports writing to stdout, and reading from stdin. Choosing a simple language such as Python for Streaming allows the user to focus more on writing code that processes data rather than software engineering. Once code is written, it is easy to test from the command line:

cat a_bunch_of_files | ./mapper.py | sort | ./reducer.py

And, running and monitoring the job is similar to Vanilla Hadoop. Hadoop Streaming was my first introduction to Hadoop and it was quite pleasant.
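For the curious, a minimal sketch of what the mapper.py and reducer.py in that pipeline might look like for word count (combined into one hypothetical script here for brevity; a real job would keep them as two files):

```python
#!/usr/bin/env python
# Hypothetical Hadoop Streaming word count; run as "script.py map"
# or "script.py reduce" inside a pipeline like the one shown above.
import sys

def mapper(lines):
    # emit one tab-separated "word<TAB>1" record per word
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word.lower()

def reducer(lines):
    # input arrives sorted by key, so equal words are adjacent;
    # accumulate the count for each run of identical keys
    current, count = None, 0
    for line in lines:
        word, n = line.rstrip("\n").rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield "%s\t%d" % (current, count)
            current, count = word, 0
        count += int(n)
    if current is not None:
        yield "%s\t%d" % (current, count)

if __name__ == "__main__" and sys.argv[1:]:
    phase = mapper if sys.argv[1] == "map" else reducer
    for record in phase(sys.stdin):
        print(record)
```

The key contract is simply stdin in, tab-separated key-value lines out, with the shell's sort standing in for Hadoop's shuffle when testing locally.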

Or, you could use a Hadoopified project that better solves the problem. Vanilla Hadoop can do some sophisticated stuff, but it suffers the problems I mentioned at the beginning of the post. Developers have created software that works on HDFS but is geared toward different audiences. A data scientist may prefer Pig or Hive for data analysis, whereas a systems or software engineer may prefer a workflow solution (Oozie, Cascading, etc.) and a (modern) DBA may want to use HBase. Each of these achieves different goals, but all still rely on HDFS.

• I think Sector/Sphere should probably have a place in this sort of roundup.

Also, your promise of your data going where you go isn’t really met by HDFS. That’s not a knock against HDFS; it’s just not what HDFS is for, or what it’s good at. There are filesystems that do local sync (or near-sync) replication for the sake of availability and there are filesystems that do remote async replication for the sake of disaster recovery, or sometimes for performance in either case. HDFS is simply in the first category, while the best current open-source example of the second category is probably XtreemFS.

• I have never heard of Sector/Sphere. I look forward to reading about it!

My analogy was bad and/or lazy. My point was that given a cluster running HDFS, I can kick off a job from any machine with the proper software and use the data on HDFS. In that sense, HDFS kind of fits, but I think I misled people. I’ll have to correct that.

• xmikedavis

Nice post, but your conclusion that HDFS is a unique flower among MR frameworks rests on shaky ground. Disco makes extensive use of a fault-tolerant filesystem, DDFS. It’s certainly not as tightly coupled to the core framework as HDFS is, but I’ve found that to be a blessing, as using it is much more straightforward and transparent than HDFS. In practice, DDFS’ simplicity is a huge win, as it affords the overall system the sort of malleability that allows one to easily tailor the core to absorb new datasets far outside of traditional MR’s ascii-newline-chunkable log data wheelhouse.

• Thanks. HDFS is definitely not “unique” but it is by far the most ubiquitous right now, unfortunately. Thanks for the tip about Disco. I had no idea Disco used a filesystem. I will admit Disco was probably the system I researched the least because I could never get it to work. I love the concept, so I think Spark and Disco will be on my list of projects to try. The fact that it does in fact have a fault tolerant filesystem is a huge win.

Part of the ubiquity of HDFS is the ecosystem that has been built around it. I would love an alternative, but all alternatives are shut out currently. I think that will change. I am hoping Disco will come out of the silence.

• Great post!

Do note the LexisNexis link is to the Cascalog project. The correct URL is: http://www.lexisnexis.com.

• Thanks. Fixed.


• albert

At this year’s Data Mining Camp, the revelation is Yelp’s Python Hadoop framework mrjob.
http://engineeringblog.yelp.com/2010/10/mrjob-distributed-computing-for-everybody.html

• Nikhil Prabhakar (@_nipra)

Nice overview. LexisNexis link points to cascalog git repo.

• Depending on your needs, Riak (http://wiki.basho.com/) also has a MapReduce-type system. You can write the jobs in JavaScript or Erlang. It’s nice because it provides a mix of a flexible data store and the ability to bring the code to the data.

• Thanks!

• Rajendran

Great article. Doing research on various languages / frameworks around Big Data. Learnt a lot. Thanks so much Ryan.

• Happy to help!

• Dan M

I agree with the above comments that HDFS is not all that. As for the Hadoop scheduler, and all the add-ons that do workflow, debugging, etc…. Condor has been doing it better for almost 20 years. If you use Condor with whichever file system is best for your task (maybe HDFS), then there is no technical reason to EVER use Hadoop. Condor is to Hadoop as Linux is to Windows’95.

• Great article!

You wrote: “Storm in this post based on its colloquial name “Real-time Hadoop” — it is not clear to me whether or not Storm even uses MapReduce though.”

• Thanks! You’re absolutely correct. I probably should have revised the post once I learned more. At the time, it wasn’t entirely clear what exactly Storm was. The phrase “real-time MapReduce” kept standing out, but it does not use map-reduce natively.

• Matt J.

I guess the reason he did not mention my pet peeve with Hadoop is that this article was written a while ago. That pet peeve is: why, oh why, won’t Apache’s MapReduce plugin work with the up to date releases/versions of Eclipse? It is really silly to have to run it in a VM just so that it doesn’t use the later releases I must use for the other projects I do with Eclipse.

• Actually, the reason I did not mention it was because I was afraid of Eclipse. Fortunately, in my previous job I got to use it religiously and am not afraid of Eclipse or Java anymore. I love them now! I agree though, it is frustrating that there is no decent plugin for Eclipse.


• Frank

An alternative to HDFS is QFS.

They claim that QFS is 75% faster for writes and 45% faster for reads than HDFS: see here:

HDFS is fault tolerant because it replicates data (usually 3 times), and it thus loses 66% of disk space.
In a way, HDFS is similar to a RAID1 setup: you duplicate data to get better fault tolerance.
But a better approach to “fault tolerance” has existed for a long time: for example, RAID5 (which loses only 20% of disk space).
QFS uses an approach similar to RAID5 (it uses error-correcting codes) to be as “fault tolerant” as Hadoop while, at the same time, losing a lot less disk space (apparently only 33%, compared to the 66% of space that is lost in Hadoop).

QFS is also open source, with a nice Apache 2.0 license that allows everybody to use it. Source code is available here:
https://github.com/quantcast/qfs/wiki/Introduction-To-QFS

To summarize: QFS is faster than HDFS, QFS consumes significantly less disk space, QFS is also open source and with a nice license.
I think QFS should really not go unnoticed.
See you!
Frank




• Nice writeup, Ryan. Do you have an updated version, since this was from 2011? It is interesting that I reached many of the same conclusions about the complexity of Hadoop, and in fact many, like you, are saying this is slowing adoption of Big Data processes. I am also very put off by the unnecessary complexity of Java and prefer Disco’s approach, where being able to write such simple, efficient, and high-performance Python code natively is a huge advantage. Yet I also agree that Hadoop still seems to have the best ecosystem.