Some Lessons in Production Development (Hadoop) – Part 1

Wow. I can’t believe it has been a month since I last posted. On December 1, I started a new chapter in my life, working full time as a Data Scientist at the Rubicon Project. Needless to say, that has kept me occupied, as has thinking about my dissertation. For the time being, I am getting settled in here.

When I accepted this position, one of my hopes/expectations was to become professionally competent and confident in C, Java, Python, Hadoop, and the software development process, rather than relying on hobby and academic knowledge. That is something a degree cannot help with. It has been a great, though very frustrating, experience, but that is expected when jumping into development professionally.

I am writing this post to chronicle what I have learned about using Hadoop in production and how greatly it differs from its use in my research and personal analysis.
To start, I was asked to check out a huge stack of code from a Subversion repository. But then what?

But you’re a Computer Scientist! This should be easy!

The first part is true, but there is a stark difference between a garden-variety computer scientist and one that converts from another field. Homegrown developers typically have an undergraduate degree in Computer Science (mine is close, but not purely CS). They endure a strict and challenging curriculum of coursework, and their lives are peppered with summer internships. Those internships season them with professional work experience that they were not expected to have beforehand. Once they graduate, they have the skills necessary to work as a software engineer.

“Converts” typically have a love for Computer Science, but hold undergraduate degrees in related fields. Mine were Mathematics of Computation and Statistics. The Statistics B.S. was totally irrelevant. The Math of Computation B.S. was relevant in discovering my true interest; however, my life was consumed with writing proofs and solving puzzles. I only programmed for fun and to solve problems that I faced in daily life. Shortly after starting a Ph.D. in Statistics, I discovered that I wanted a career change to Computer Science and software engineering. Although I was subjected to the same curriculum as the undergraduates, it was rushed and did not provide the full experience the college CS majors received. Although I am book-smart in the field, I come to the development world either expected to know the ins and outs of engineering, or expected to pick it up quickly without the pampering. Learning engineering this way is fun, and not so mundane, but more frustrating.

But I thought you already knew Java and Hadoop?

Well, that is all relatively speaking! The Java code I had written was not for Hadoop. The code that I write for my own research and hobby is much different from the code I am expected to write in production. In my research I always used Python (except in a few instances where I used Java) and the Hadoop Streaming package. Although Streaming is a great package, I feel it lacks the customization that a standard Hadoop job written in Java has. It also lacks the “meat” that the vanilla Hadoop distribution enjoys in the professional world. There are also performance gains from using vanilla Hadoop, as both the code and the framework are written in the same language and there is no interface between languages to deal with. Production work differs from my research and hobby code in several ways:

  1. My code must execute successfully on tens or hundreds of thousands of files, not just some manageable subset that I have created.
  2. I did not write the code that created these files, so there are bugs and intricacies that I must defensively program against.
  3. These files are created in real time, on the fly, as real events are occurring. Stuff happens. Sometimes it’s not good stuff, and code must be able to work with (or toss out) that data. Data for my own research has been massaged and curated by moi before running in Hadoop. In production, this is not realistic.
  4. The code must be efficient. These jobs must not take forever to run, because time can equal money: either wasted CPU cycles that others could use, or time in the cloud.
  5. Development time must be used to integrate existing code rather than reinventing the wheel. My coworkers have already written a lot of code. I have been learning how to integrate this code into my own.
  6. The developer must not go rogue with code and must code carefully. Crashing the cluster or maxing out the hard drive can have dire consequences.

Tip 0: Nobody’s Perfect

0.1: Prepare to be Frustrated. I cannot stress this enough. We all like to start new things thinking that we know close to everything we need to know to get started. Not true. Unless you literally have somebody sitting next to you whose job it is to train you, expect to do a lot of self-learning and to be very frustrated. So, don’t get down on yourself, get plenty of sleep, eat well, and drink plenty (after hours) if necessary.

0.2: Java Development can be Intimidating at First. To use somebody else’s Java code, just point the CLASSPATH to the src directory containing the package. If the source code is packaged correctly, this should work fine. The CLASSPATH can be set in your .bashrc file, or you can pass it using the -cp flag. Note that you must pass the CLASSPATH both to the compiler javac as well as the runtime java. This gets old really fast…more on this later.

When using a JAR file containing an archive of a package, note that the first line of each source file starts with a package declaration. The package name is essentially an “address”: combined with the class name, it lets you refer to a class without having to know where it lives inside the JAR, and it is a good way to index the content of the JAR file.

0.3: Know your Limitations. A coworker would frequently chime in “use Eclipse!” Time and time again I have tried to use Eclipse and it makes me want to cry. Eclipse introduces its own level of complexity and customization to the build process that feels icky to me.

I have found that vim and ant have proven valuable for the time being. Adding an IDE just potentially adds another layer of misery to the learning process. The only IDE I ever recommend is Visual C++, and I am not even a Microsoft guy.

0.4: Make Subversion Play Nice. I did not realize that it is not necessary to check out the entire project. There is nothing wrong with checking out only the trunk directory. You can also give it a new name, so instead of having some meaningless directory called trunk, I can call it ryans_first_project:

svn co <repository-url>/trunk ryans_first_project

Tip 1: Partition your Data

Suppose we have n mappers and more than one of these mappers sees a key cat. We want to make sure that each instance of cat gets sent to the same reducer. One way to do this is to use a Partitioner. An ad-hoc way to do this is to just hash the key and get the remainder after dividing by some number k. This might also provide a way of controlling how many reduce tasks are spawned (I am not positive about that though).
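As a sketch of this hash-and-remainder idea: in a real job the logic would live in a custom Partitioner's getPartition() method, but the arithmetic stands on its own, so here it is as a plain static method (the class and method names are mine, for illustration). Note the mask against Integer.MAX_VALUE, which clears the sign bit; hashCode() can return negative values, and a negative remainder would be an invalid partition number.

```java
public class KeyPartitioning {

    // Map a key to one of numReducers buckets. Masking with
    // Integer.MAX_VALUE clears the sign bit so the remainder
    // is always in [0, numReducers).
    public static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        // Every occurrence of "cat" lands in the same bucket,
        // no matter which mapper emitted it.
        System.out.println(partitionFor("cat", 4) == partitionFor("cat", 4));
    }
}
```

Because the bucket depends only on the key, every mapper that sees cat routes it to the same reducer.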

Tip 2: The Key in the Mapper is Useless

If you use TextInputFormat, know that lines of text come into the mapper in key/value pairs and also leave the mapper in key/value pairs. The key coming INTO the mapper is of the weird type LongWritable; it is just the byte offset of the line within the file, and it is useless for most purposes. What we really want to parse is the value coming into the mapper, and from it we emit our own key and value to the next phase.
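A minimal sketch of that parsing step, pulled out of the Hadoop framework so it runs standalone: in a real mapper, map() would receive the (useless) LongWritable offset and the Text value, call something like this helper on the value, and emit the result. The tab-separated, key-first layout is a made-up example, not any particular production format.

```java
public class LineParsing {

    // Split an incoming line into the key we actually care about and
    // the rest of the record. In a real mapper this would be called on
    // value.toString(); the LongWritable input key is ignored entirely.
    public static String[] toKeyValue(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            // Defensive: a malformed line becomes a key with an empty value.
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }

    public static void main(String[] args) {
        String[] kv = toKeyValue("cat\t1");
        System.out.println(kv[0] + " -> " + kv[1]);  // cat -> 1
    }
}
```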

Tip 3: Use ant (or maven) to Configure and Build your Jobs

As I said, passing CLASSPATHs around in .bashrc and on the command line is a clusterf*ck. In my case, the code stack I am working with has a build.xml file. When I write source code, I don’t need to do anything; ant knows to compile the file because build.xml contains instructions to compile it.

Also, any and all libraries I need I just dump into a lib directory, and by simply adding the name of the library file to build.xml (it is obvious where to put it), it is automatically added to the CLASSPATH at compile time. To build the project, all I type is

ant some_target

and it spits out a cute little JAR file in the target directory, ready for me to use with Hadoop. Of course, this process actually builds the entire project, not just my code, but it only takes 10 seconds or so to build.
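For readers who have not seen one, here is a hypothetical skeleton of such a build.xml. The directory names, target name, and JAR name are assumptions for illustration, not the actual project’s layout; the point is that the lib fileset puts every dropped-in library on the compile-time CLASSPATH automatically.

```xml
<!-- Hypothetical build.xml sketch; names are illustrative, not the real project's. -->
<project name="ryans_first_project" default="jar">
  <path id="classpath">
    <!-- Every JAR dumped into lib/ lands on the CLASSPATH. -->
    <fileset dir="lib" includes="*.jar"/>
  </path>
  <target name="compile">
    <mkdir dir="build/classes"/>
    <javac srcdir="src" destdir="build/classes"
           classpathref="classpath" includeantruntime="false"/>
  </target>
  <target name="jar" depends="compile">
    <mkdir dir="target"/>
    <jar destfile="target/myjob.jar" basedir="build/classes"/>
  </target>
</project>
```

With something like this in place, `ant jar` plays the role of `ant some_target` above.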

Tip 4: input_dir and output_dir are just Parameters

The Hadoop command given in tutorials usually has the following form

hadoop jar somejarfile.jar input output

input and output are not “set” parameters; they are just plain old parameters that you can interact with in your Java program via the args array passed to main. If you want to pass in 10 directories, you can do that!
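A small sketch of that idea in plain Java. The convention here, which is my assumption for the example and not a Hadoop rule, is that every argument except the last is an input directory and the last one is the output directory; a real job would then hand the inputs to the framework’s input-path configuration.

```java
public class JobArgs {

    // All arguments but the last are treated as input directories.
    public static String[] inputDirs(String[] args) {
        String[] inputs = new String[args.length - 1];
        System.arraycopy(args, 0, inputs, 0, args.length - 1);
        return inputs;
    }

    // The last argument is treated as the output directory.
    public static String outputDir(String[] args) {
        return args[args.length - 1];
    }

    public static void main(String[] args) {
        // Simulate: hadoop jar somejarfile.jar logs1 logs2 out
        String[] demo = { "logs1", "logs2", "out" };
        System.out.println(inputDirs(demo).length);  // 2
        System.out.println(outputDir(demo));         // out
    }
}
```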

Tip 5: The JAR file is the Key to Success

Once you have a JAR file built by hand or by ant, all you need to do is move that baby around to wherever you want to run the Hadoop job. Of course, this assumes that Hadoop is installed on the machines you want to use. Then, with one file, running the job is as simple as:

hadoop jar myjarfile.jar input_dir output_dir

Tip 6: When it Gets to be Stressful, It’s Nothing a Little Ping Pong Can’t Fix

I’ve only been at this for 2 weeks… there will undoubtedly be a part 2.

Comments on Some Lessons in Production Development (Hadoop) – Part 1

  • Byron

    Re: Tip 1. job.setNumReduceTasks(n) in your task setup (usually a class that extends Configured and implements Tool) controls the number of reducers spawned. You’ll note that your Partitioner implementation takes the number of reducers as an argument (so you’re often doing something like (segmentOfKeyIWanted.hashCode() & Integer.MAX_VALUE) % numReducers). From experience, a common mistake is to forget the mask against MAX_VALUE, not realizing that hashCode() can return negative values.

    Re: Tip 2. Using the “new” Hadoop lib’s mapper (mapreduce.Mapper, which is a class not an interface) I usually specialize the mapper as Mapper<Object, Text, …> so that the input key can change to anything and I will never care.

    Another useful trick is creating the mapper’s output key and value objects once and then using the appropriate set() call on each. Calling new Text() a couple of billion times can actually make a difference (not much of one, but it can be noticeable).
