Byte Mining

Estimating Population Size: Animals and Web Pages

Ryan Rosario — Sun, 11 Mar 2018 04:45:04 +0000

Thinking about all of the things in the statistical world that we can estimate, the one that has always perplexed me is estimating the size of an unknown population $N$. Usually when we compute estimates based on samples, we involve the size of the sample $n$ somewhere, and so we take “size” for granted: the size of a sample is known. We also make inferences based on sample statistics using theory such as the Central Limit Theorem, but we never seem to care about the population size $N$; we either know it or assume it is infinite. But in fields like ecology and environmental studies, this attitude of gluttony is dangerous!

In the summers I spend a lot of time in the Eastern Sierra where there are deer and bears. These animals are so large that we cannot consider their population in the area to be infinite (like we may with ants or bacteria). As a matter of fact, one of our local animal behavior specialists knows the exact number of bears that live in my mountain community. I had always just assumed he spent all day tracking down the local bears and marking them with radio collars. There must be some way to estimate the population size, but how? This will be the first aspect of the problem we will discuss. Then we will talk about how to formally derive an estimate of the population. The computation, and even the probability distribution used, is quite niche and really piqued my interest. Finally, I conclude with some wisdom on why we even need to go through this theoretical process, something I did not appreciate in school.


Two deer enjoying a nice day across the street.	A bear fishing in Lake Mary.	A (sedated) bear walks past me at Lake Mary.

The Capture-Recapture Design

One nugget of information I learned in a course about web data mining and search, taught by Professor John Cho, was a method called capture-recapture. Oddly enough, it was used to estimate the number of web pages on the Internet¹. A short while later, I came across this same mechanism in a textbook about Markov Chain Monte Carlo. It also goes by other names, such as mark-recapture.

How it works for animals such as my woodland friends:

Capture some number of bears, deer, whatever. The number itself doesn’t matter.
Mark these animals somehow to remember that we have seen them. For animals, this means physically marking them (birds, snails, squirrels) or putting radio collars on them (bears, deer and mountain lions).
Release them back into the wild.
Allow enough time to pass so that the animals disperse into their natural habitat.
Capture a sample of $n$ animals and count the number of animals that are marked.

For estimating the number of web pages on the Internet, it might look like the following:

Collect a random sample of web pages using a random walk (see ², ³), or assume a week of search queries and the link clicked is a sample of the web (see ⁴).
Store the page address.
Crawl, or collect, a sample again.
Count the number of pages in the new sample that we have already seen.
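The steps above can be sketched as a tiny simulation. This is only an illustration under strong assumptions: the "web" is a made-up list of URLs on `example.com`, both samples are drawn uniformly at random (real crawls are not), and the function name `recapture_counts` is hypothetical.

```python
import random


def recapture_counts(population, first_size, second_size, rng):
    """Simulate one capture-recapture round over a list of hashable items.

    Returns (K, n, k): the number marked, the size of the second sample,
    and how many items in the second sample were already marked.
    """
    marked = set(rng.sample(population, first_size))   # capture and "mark" (store)
    second = rng.sample(population, second_size)       # recapture after release
    k = sum(1 for item in second if item in marked)    # marked items seen again
    return len(marked), second_size, k


# Hypothetical population: 5,000 page addresses we pretend not to know the size of.
rng = random.Random(0)
pages = [f"https://example.com/page/{i}" for i in range(5000)]
K, n, k = recapture_counts(pages, first_size=400, second_size=300, rng=rng)
print(K, n, k)
```

The three counts it returns are exactly the quantities the estimators in the following sections need.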

Estimating $N$ with a Simple Proportion

We can simply form an 8th grade proportion of the following form, where we let $N$ represent the unknown population size, $n$ represent the number of animals we captured at the second sampling point, $K$ be the total number of animals we marked and $k$ be the number of tagged animals we counted in the recaptured sample.

\[
\frac{k}{n} = \frac{K}{N}
\]

This yields the estimator:

\[
\hat{N}_{\tiny\mbox{EST}} = \frac{Kn}{k}
\]

This estimator is actually called the Lincoln index and assumes that only one marking and one recapture take place, and that the population is “closed.” A “closed” population is one that does not lose animals due to death or migration and does not gain any from birth or migration. In my area, the population is mostly closed throughout the year if we perform the mark and recapture within one day or so. During migration periods and mating periods, this assumption is not as reliable. Being in the mountains, targeted or clustered deaths of these larger animals are not common.
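As a quick numeric illustration, here is the proportion above as a small function. The counts used (50 marked animals, 40 recaptured, 8 of them marked) are invented for the example, and the function name is my own.

```python
def lincoln_index(K, n, k):
    """Estimate population size from K marked, n recaptured, k marked among the recaptured."""
    if k == 0:
        # With no marked animals in the recapture, the proportion gives no finite estimate.
        raise ValueError("k must be positive to form the estimate")
    return K * n / k


# Hypothetical counts: 50 marked, 40 recaptured, 8 of those carrying a mark.
estimate = lincoln_index(K=50, n=40, k=8)
print(estimate)  # 250.0
```

Note the guard for `k == 0`: if none of the recaptured animals are marked, the simple proportion cannot produce an estimate at all.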

Computing the Maximum Likelihood Estimator of $N$

Consider the formulation of the problem again. We have some population $N$, and from that population we draw a sample of $n$ objects. That sample itself is divided into two subgroups: marked objects $k$, and unmarked objects $n-k$. This sounds like the classic colored balls in urns problem where we want to find the probability that we select $k$ of the $K$ red balls when we draw a sample of $n$ balls from an urn containing $N$ balls. If this smells hypergeometric to you, you would be right.

The hypergeometric is a discrete probability distribution with the following probability mass function:

\[
P(X = k) = \frac{\left( \begin{array}{c} K \\ k \end{array} \right) \left( \begin{array}{c} N - K \\ n - k \end{array} \right)}{\left( \begin{array}{c} N \\ n \end{array} \right)}
\]
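The mass function above translates almost directly into code. This is a minimal sketch using only the standard library; the argument order and the function name `hypergeom_pmf` are my own choices, and the numbers in the sanity check are made up.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """P(X = k): k marked among n drawn, from N total of which K are marked."""
    # Outside the support the probability is zero.
    if k < max(0, n - (N - K)) or k > min(K, n):
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


# Sanity check with hypothetical values: the probabilities over all k sum to 1.
total = sum(hypergeom_pmf(k, N=250, K=50, n=40) for k in range(0, 41))
print(total)
```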

Usually when we construct a likelihood function, we take the product of the PMF (or PDF) over all the observations. Since we are estimating a population size from a single observation (one recapture count) rather than from a sample of observations, the likelihood is just the PMF itself. Perhaps if we did multiple recaptures, this would be different. The likelihood function thus looks like the following.

\[
\begin{align}
\mathscr{L}(N ; K, k, n) &= \frac{\left( \begin{array}{c} K \\ k \end{array} \right) \left( \begin{array}{c} N - K \\ n - k \end{array} \right)}{\left( \begin{array}{c} N \\ n \end{array} \right)}
\end{align}
\]

We could take the log of this likelihood function like we usually do, but we will see that this is not necessary. Recall that the beauty of the log-likelihood was that it converted all of the products into sums of logs. Since we are only working with one term, we don’t need to do this.

This is where the story gets weird. Usually we take the partial derivative of the log-likelihood function with respect to the parameter of interest, set it equal to zero and then solve for the estimate. There is a wrinkle in this case because the parameter of interest $N$ is a non-negative integer. This means we cannot find the maximum likelihood estimate using calculus by taking the derivative with respect to $N$. Instead, we look at the likelihood ratio of successive hypothetical values of $N$. That is

\[
D(N) = \frac{\mathscr{L}(N)}{\mathscr{L}(N-1)} = \frac{\frac{\left( \begin{array}{c} K \\ k \end{array} \right) \left( \begin{array}{c} N - K \\ n - k \end{array} \right)}{\left( \begin{array}{c} N \\ n \end{array} \right)}}{\frac{\left( \begin{array}{c} K \\ k \end{array} \right) \left( \begin{array}{c} N - K - 1 \\ n - k \end{array} \right)}{\left( \begin{array}{c} N - 1 \\ n \end{array} \right)}}
\]

The purpose of taking the partial derivative of the log-likelihood function was to determine the point where the log-likelihood (and thus the likelihood) function switches from increasing to decreasing: an optimum. We want to find this same turning point using our likelihood ratio $D$. The likelihood is increasing in $N$ wherever $D(N) > 1$ and decreasing wherever $D(N) < 1$, so we want to find where $D(N)$ crosses 1. Note that $D(N) = 1$ exactly only for special combinations of the counts, so we work with the strict inequalities. I apologize for the formatting of the math; I wanted to save screen real estate. Anyway, we then have

\[
\begin{align}
D(N) &= \frac{\mathscr{L}(N)}{\mathscr{L}(N-1)} = \frac{\frac{\left( \begin{array}{c} K \\ k \end{array} \right) \left( \begin{array}{c} N - K \\ n - k \end{array} \right)}{\left( \begin{array}{c} N \\ n \end{array} \right)}}{\frac{\left( \begin{array}{c} K \\ k \end{array} \right) \left( \begin{array}{c} N - K - 1 \\ n - k \end{array} \right)}{\left( \begin{array}{c} N - 1 \\ n \end{array} \right)}} = \frac{\frac{\frac{K!}{(K-k)! k!} \frac{(N-K)!}{(n-k)!(N-K-n+k)!}}{\frac{N!}{(N-n)! n!}}}{\frac{\frac{K!}{(K-k)! k!} \frac{(N-K-1)!}{(n-k)!(N-K-n-1+k)!}}{\frac{(N-1)!}{(N-n-1)! n!}}} \\
& \mbox{Cancel the ${}_KC_k$ term as well as the $n!$ terms and $(n-k)!$ terms.} \\
&= \frac{\frac{\frac{(N-K)!}{(N-K-n+k)!}}{\frac{N !}{(N-n)!}}}{\frac{\frac{(N-K-1)!}{(N-K-n -1+k)!}}{\frac{(N-1)!}{(N-n-1)!}}} = \frac{\frac{(N-K)! (N-n)!}{N! (N-K-n+k)!}}{\frac{(N-K-1)! (N-n-1)!}{(N-1)! (N-K-1-n+k)!}} = \frac{(N-K)! (N-n)! (N-1)! (N-K-1-n+k)!}{N! (N-K-n+k)! (N-K-1)! (N-n-1)!} \\
& \mbox{Apply variations of $y (y - 1)! = y!$} \\
&= \frac{(N-K)(N-K-1)! (N-n)! (N-K-1-n+k)! (N-1)!}{N(N-1)! (N-K-n+k)(N-K-n+k-1)! (N-K-1)! (N-n-1)!} \\
&= \frac{(N-K) (N-n)}{N (N-K-n+k)}
\end{align}
\]
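Algebra with this many factorials is easy to get wrong, so a numerical spot check is reassuring. The sketch below compares the raw ratio $\mathscr{L}(N)/\mathscr{L}(N-1)$ with the simplified form using exact rational arithmetic. The values of $K$, $n$ and $k$ are arbitrary choices of mine, and the range of $N$ starts where $\mathscr{L}(N-1)$ is nonzero.

```python
from fractions import Fraction
from math import comb


def likelihood(N, K, n, k):
    """Hypergeometric likelihood of N, kept as an exact fraction."""
    return Fraction(comb(K, k) * comb(N - K, n - k), comb(N, n))


def ratio_closed_form(N, K, n, k):
    """The simplified ratio (N-K)(N-n) / (N (N-K-n+k))."""
    return Fraction((N - K) * (N - n), N * (N - K - n + k))


K, n, k = 50, 40, 8            # hypothetical counts
start = K + n - k + 1          # smallest N for which L(N-1) > 0
mismatches = [
    N for N in range(start, 400)
    if likelihood(N, K, n, k) / likelihood(N - 1, K, n, k) != ratio_closed_form(N, K, n, k)
]
print(mismatches)  # an empty list means the two forms agree on this range
```

Agreement on a finite range is not a proof, of course, but it catches slips like a dropped factor quickly.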

which is a nice result, but we still are not done. We need to find where $D(N)$ goes from being greater than 1, to being less than 1. So let’s continue.

\[
\begin{align}
D(N) = \frac{(N-K)(N-n)}{N(N-K-n+k)} &> 1 \\
(N-K)(N-n) &> N(N-K-n+k) \\
N^2 + Kn - NK - Nn &> N^2 - NK - Nn + Nk \\
N^2 - N^2 + Kn - NK + NK - Nn + Nn &> Nk \\
Kn &> Nk \\
\frac{Kn}{k} &> N \\
N &< \frac{Kn}{k}
\end{align}
\]

Thus $D(N) > 1$, and so $\mathscr{L}(N) > \mathscr{L}(N-1)$, when $N < \frac{Kn}{k}$. Likewise $D(N) < 1$, and so $\mathscr{L}(N) < \mathscr{L}(N-1)$, when $N > \frac{Kn}{k}$. So $\hat{N} = \frac{Kn}{k}$ is the maximum likelihood estimator, right? Not quite.

According to the definition of a maximum likelihood estimator, the estimate must fall in the original parameter space, which here is a set of integers. The fraction $\frac{Kn}{k}$ is likely not an integer, so it cannot be the maximum likelihood estimator. But don’t fret, this is very close to the MLE.

The correct maximum likelihood estimator is (drumroll) the integer part of this result.

\[
\hat{N}_{MLE} = \left[ \frac{Kn}{k} \right]
\]

Note that our original 8th grade proportion gave us the Lincoln index, which does not use the integer part. For small $n$, the Lincoln index is biased. Shao⁵ shows that there is also a second candidate for the MLE, due to the increasing and decreasing nature of $D(N)$ and the fact that the estimate must be an integer. This second candidate is simply $\left[ \frac{Kn}{k} \right] + 1$, but there is no further discussion of this candidate in Shao’s work. References: ⁶ ⁷
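One way to build confidence in the integer-part result is to maximize the likelihood by brute force over a range of integers and compare. The sketch below does that for made-up counts where $\frac{Kn}{k}$ is not an integer; the search bound of 2000 is an arbitrary assumption that happens to be large enough for these numbers.

```python
from fractions import Fraction
from math import comb


def likelihood(N, K, n, k):
    """Hypergeometric likelihood of N, kept exact to avoid floating-point ties."""
    return Fraction(comb(K, k) * comb(N - K, n - k), comb(N, n))


K, n, k = 50, 40, 7                     # hypothetical counts; Kn/k = 285.71...
candidates = range(K + n - k, 2000)     # N must be at least K + (n - k)
brute_force_mle = max(candidates, key=lambda N: likelihood(N, K, n, k))
integer_part = (K * n) // k

print(brute_force_mle, integer_part)  # 285 285
```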

So What Was the Point of Having to Compute the MLE?

When I was in graduate school, I used to ask the same thing all the time. We would spend 5-10 minutes deriving the fact that the sample mean is the maximum-likelihood estimator of the population mean under the normal distribution. I would ask myself “What is the point of this? Isn’t it obvious?”

In some ways it is obvious, but just like mathematicians have to prove obvious statements, so do statisticians. If somebody were to ask “how do we know that the 8th grade proportion yields a good estimate?”, we as statisticians would be expected to prove it rigorously. Consider that in our first derivation of the Lincoln index we made a crucial assumption: that the relationship between the sample markings and the population was linear. Aside from that, it came out of educated thin air. We also have no idea how good an estimator our result is, and because we essentially just assumed it, we cannot prove anything about it without some sort of theoretical context.

So now that we know we have to prove it rigorously, the next step is to figure out what distribution the phenomenon follows. In this case it was the hypergeometric, which is a bit niche, but still a fairly simple distribution to work with. For other problems it could be a distribution that is very messy, or a mixture model, or something requiring simulation. Once we have defined a probability distribution that makes sense for our context, we can compute all the estimators we want, whether it be by maximum likelihood, method of moments or MAP such as in a Bayesian context. Once we have constructed an estimate using these tried and true methods, we can rely on centuries worth of statistical theory to measure how optimal each estimator is — unbiasedness, consistency, efficiency, sufficiency etc. If we wanted, we could probably show that in order for our estimate of $N$ to have low variance, we would need high $n$ or high $K$ or both. Let’s leave that as an exercise for the reader…
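Short of doing the exercise properly, a simulation can at least hint at the variance claim. The sketch below is not a proof and rests on assumptions of mine: a known true population of 500 with 100 marked animals, uniform random recapture, 2,000 trials per setting, and trials with no marked recaptures simply discarded.

```python
import random
import statistics


def simulate_estimates(N, K, n, trials, rng):
    """Draw repeated recapture samples and return the Lincoln index for each."""
    estimates = []
    for _ in range(trials):
        recaptured = rng.sample(range(N), n)
        # Animals with ids 0..K-1 play the role of the marked ones.
        k = sum(1 for animal in recaptured if animal < K)
        if k > 0:  # the estimate is undefined when no marked animal is recaptured
            estimates.append(K * n / k)
    return estimates


rng = random.Random(42)
spread_small_n = statistics.pstdev(simulate_estimates(500, 100, 50, 2000, rng))
spread_large_n = statistics.pstdev(simulate_estimates(500, 100, 200, 2000, rng))
print(spread_small_n, spread_large_n)
```

In this setup the spread of the estimates shrinks when the recapture sample grows from 50 to 200, which is consistent with the intuition that a larger $n$ lowers the variance.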

Conclusion

In this article we take a problem that seems hopeless, or at least difficult, to solve and solve it rigorously. We start with an educated guess of an estimator by setting up a simple proportion. We then set up this basic form of the problem as a realization of the hypergeometric distribution. We then execute a niche maximum-likelihood estimation based on the likelihood ratio since the parameter we are trying to solve for must be an integer. We get the same answer as our original guess, but we now have the backup of statistical theory such that we can ask questions about how good of an estimator the Lincoln index really is.

Footnotes and References

¹ This course is actually the course that got me interested in pursuing additional graduate study in computer science.
² Baykan, Eda, et al. “A comparison of techniques for sampling web pages.” arXiv preprint arXiv:0902.1604 (2009).
³ Rusmevichientong, Paat, et al. “Methods for sampling pages uniformly from the world wide web.” AAAI Fall Symposium on Using Uncertainty Within Computation. 2001.
⁴ Mauldin, Michael L. “Measuring the Size of the Web with Lycos.”
⁵ Shao, J. Mathematical Statistics. Springer Texts in Statistics (2003). Pg. 227.
⁶ Watkins, Joe. “Topic 15: Maximum Likelihood Estimation.” 1 Nov. 2011. Mathematics 363, University of Arizona.
⁷ Zhang, Hanwen. “A note about maximum likelihood estimator in hypergeometric distribution.” Revista Comunicaciones en Estadistica, Universidad Santo Tomas 2.2 (2009).

Highlights from My First NIPS

Ryan Rosario — Thu, 14 Dec 2017 05:18:07 +0000

The first few hundred registrations received a mug.

As a machine learning practitioner in the Los Angeles area, I was ecstatic to learn that NIPS 2017 would be in Long Beach this year. The conference sold out in a day or two. The conference was held at the Long Beach Convention Center (and Performing Arts Center), very close to the Aquarium of the Pacific and about a mile from the Queen Mary. The venue itself was beautiful, and probably the nicest place I’ve ever attended a conference. It’s also the most expensive place I’ve ever had a conference. $5 for a bottle of Coke? $11 for two cookies? But I digress.

I attended most of the conference, but as someone who has attended many conferences, I’ve learned that attending everything is not necessary, and is counterproductive to one’s sanity. I attended the main conference, and one workshop day, but skipped the tutorials, the Saturday workshops and the industry demos. The conference talks were livestreamed via Facebook Live at the NIPS Foundation’s Facebook page, and the recordings are also archived there.

This may make some question why one would actually want to attend the conference in person, but there are several reasons:

to talk with the authors of interesting papers during the poster sessions;
to meet up with likeminded people — a reunion of sorts. I had dinner with the LA Data Science crew;
to be surrounded by likeminded people and perhaps get to meet some of the big names in machine learning, or people whose work has been valuable. During the week, I saw Yann LeCun, Ian Goodfellow, Hal Daume III, Judea Pearl etc. There were so many people at this NIPS that I did not see many others that I knew were present;
As my friend Rob pointed out “THE WORKSHOPS!” Yes, the workshops are legit and are not recorded. You can also buy a special ticket for only the tutorials or only the workshops. That is something to keep in mind if your time is limited;
The sponsor and employer expo can be useful for those looking for internships, full time jobs, or post-docs. Unfortunately, the opportunities were heavily focused on research fellow positions, and positions in research labs as a researcher, not the standard applied roles that I usually gravitate towards. This was a real bummer.
There are also plenty of parties for the TensorBros. I kid. If convex optimization, TPUs and functional programming are too boring for you, you could have chilled with Flo Rida instead. Wait, who??

~~I usually write very long, drawn out blog posts about these conferences, but I am getting old, so I will try to just summarize some of the sessions and research I found the most interesting.~~ It looks like it is just as long as usual.

Keynotes

Usually I nod off during the keynote and plenary talks as I tend to find them too general. Honestly, I think I found these talks to be the most interesting and motivating talks of the entire conference. The presenters spoke more about issues facing the community without getting hung up on deep learning and particular ways of doing ~~machine learning~~ AI.

Ali Rahimi, this year’s winner of the Test of Time award, delivered an acceptance speech that earned a standing ovation, and it gave all of us a reality check about the direction of ~~machine learning~~ AI. He described a “self-congratulatory” aura in the AI community. He further likened our current deep learning discourse to alchemy and encouraged a return to rigor, which NIPS seemed to be quite religious about in earlier days. He seemed to take issue with Andrew Ng’s tweet, “AI is the new electricity.” My take is that we are currently in a hype cycle, one that I believe transformed from the “data science” hype cycle. I admit I have not embraced deep learning in my own work and Ali’s claim that we are treating AI as alchemy really struck me to the point that I feel a bit vindicated. I am not a proofs or theory person, but I cannot use methods that do not seem to have some sort of mathematical basis… and to use such methods for life changing decisions would be unethical and irresponsible. Yann LeCun posted disagreement to some of Ali’s points here.

Kate Crawford spoke about fairness and bias in machine learning models, and how many models are biased against particular groups because they are trained on data that is biased by preconceived notions about race, gender roles, and more. Her concern is that if we allow these biases to affect models that make life-changing decisions, machine learning will suffer negative backlash leading to another AI winter. Kate listed several examples, but a few of them stood out to me as very surprising. She noted that one study showed that when Googling a name that sounds African-American, Google’s ad server chose to display an ad for criminal background checks. To approach resolving the problem, Kate suggested building pre-release models and carefully studying how the model treats each subpopulation. This is commonplace in the world of educational testing (I originally studied psychometrics), where a field test procedure is always performed on new test items. If a particular subpopulation performs significantly better or worse than the others net of all other factors, the test item is dropped. This phenomenon can be described mathematically as differential item functioning (DIF). Anyway, back to Kate. What I appreciated about her talk is that the problem was clearly obvious to anyone who works in machine learning, but she went into a level of detail that we have not heard before.

Main Conference

The main conference was divided into two parallel tracks that started with 4-6 15-minute talks followed by 12-20 “spotlight” (i.e. lightning) talks of 5 minutes each. The tracks were: Algorithms, Optimization, Algorithms/Optimization (a 2 for 1!), Theory (goodness no), Deep Learning Applications, Probabilistic Methods Theory, Deep Learning and Reinforcement Deep Learning. The tracks were very blurred – I mean, the entire conference was theory and there is a lot of optimization involved with deep learning, so the main conference was sort of a grab bag involving a lot of walking back and forth between rooms depending on the topic… or for me, whether or not the air conditioning was on.

Most of the talks involved deep learning obviously. I found that the majority of applications focused on images, video and speech… the usual. I would love to see more talks focused on language/text, music and motion, though I am sure those are coming. There was some discussion about art and style transfer, which is cool, but, well, cool. There were a lot of interesting talks, but the one that stood out to me (and many others) was actually a 5 minute spotlight/lightning talk on interpreting models using a technique called Shapley Additive Explanations, or SHAP (paper, code). The method boils down to an importance score for each feature and each observation which can be studied after model prediction to determine why a particular observation was labeled as it was and which feature(s) was/were responsible. There was a similar talk focused on image processing, where a proposed algorithm would “highlight” parts of an image that “encouraged” the model to attach a certain label (such as the ears and nose for a dog).
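To make the idea of a per-feature, per-observation importance score concrete, here is a brute-force sketch of exact Shapley values for a toy model. To be clear about the assumptions: this is not the algorithm from the SHAP paper or its code, it enumerates every feature ordering (so it only works for a handful of features), and "removing" a feature is approximated by swapping in a baseline value, which is my simplification.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one observation x, relative to a baseline input."""
    n_features = len(x)
    phi = [0.0] * n_features
    orderings = list(permutations(range(n_features)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for j in order:
            current[j] = x[j]                 # "switch on" feature j
            updated = predict(current)
            phi[j] += updated - previous      # marginal contribution of feature j
            previous = updated
    return [value / len(orderings) for value in phi]


def toy_model(v):
    """A hypothetical linear model used only for this illustration."""
    return 2 * v[0] + 3 * v[1] - v[2]


scores = shapley_values(toy_model, x=[1.0, 2.0, 3.0], baseline=[0.0, 0.0, 0.0])
print(scores)
```

For a linear model like this one, each score is just the coefficient times the feature's distance from the baseline, and the scores sum to the difference between the prediction for `x` and the prediction for the baseline.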

Many, if not all, of the spotlight/lightning talks are associated with a poster and also have slides, code, and sometimes a video associated with it. Check the NIPS 2017 schedule to find resources for each poster.

Symposia

The symposia just seemed like another “main conference” track but with a panel discussion. I attended the symposium on Interpretable Machine Learning, which seems like a hot topic right now… but we statisticians have been doing it for years, and have stuck to the unsexy methods “regression” and “decision trees.” Many of the talks involved causality and interventions, which initially came out of left field to me, but make sense in the grand scheme of things. If one can “prove” that x causes y, interpreting models becomes easier. Although we can prove correlations, many are spurious and meaningless, and thus the model likely is not interpretable. This whole issue seems to have arisen from the medical community (my opinion/observation) as ~~machine learning~~ AI is being used more and more in medicine for diagnoses and recommendations. If we are going to deploy models that prescribe certain medicines or procedures, we (and doctors) need to be able to debug model errors or we will injure or kill many people. For machine learning practitioners, this “debugging” is conceptual and mechanical. For doctors, this debugging must be in terms of their original training… in other words, the model must be interpretable. Another area where AI-in-a-box can cause problems is with driverless cars. Honk, honk!

One talk I found very interesting in this session was a talk called On Fairness and Calibration (unfortunately I don’t remember which author spoke). The speaker basically rehashed the importance of looking at metrics other than accuracy such as true positive rate, false positive rate, true negative rate etc., particularly among subpopulations. He suggested analyzing performance across groups and looking at the gap between how we expect the model to perform for a subpopulation and what performance we actually observe. What was amusing to me as a statistician is that this paper basically “rediscovers” ROC and PR curves (calibration in general), hypothesis testing (observed vs. expected results), and the mixed effects model (different analysis for each group) used in statistics. Of course, the audience was not from statistics and it was a very impactful talk.
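The kind of per-subpopulation check described in that talk can be sketched in a few lines. This is my own minimal illustration, not the paper's method: it assumes binary labels coded as 0/1, a single grouping variable, and made-up data.

```python
from collections import defaultdict


def rates_by_group(y_true, y_pred, groups):
    """True positive rate and false positive rate for each subpopulation."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        correct = "t" if truth == pred else "f"
        sign = "p" if pred == 1 else "n"
        counts[group][correct + sign] += 1
    result = {}
    for group, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["fp"] + c["tn"]
        result[group] = {
            "tpr": c["tp"] / positives if positives else float("nan"),
            "fpr": c["fp"] / negatives if negatives else float("nan"),
        }
    return result


# Hypothetical labels and predictions for two groups, "a" and "b".
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
rates = rates_by_group(y_true, y_pred, groups)
print(rates)
```

Overall accuracy here is 75%, which hides the fact that the toy model is perfect for group "a" and no better than a coin flip for group "b".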

In statistics we are taught that interpretable models are extremely important. This is why some machine learning competitions on platforms such as Kaggle are bothersome for aspiring data scientists. The problem statements and datasets often encourage extremely complicated models that really have no meaning but seem to “just work.” I suppose these models are fine for products, but they are dangerous in high stakes situations.

The panel discussion showed that we have a long way to go in terms of interpretable models. Much of the discussion involved participants’ definitions of the word interpretability and statements of “it depends on your definition of interpretable.” I think this whole issue is going to end up being resolved by, “try to make models interpretable, but if you can’t, just don’t make them so complicated that nobody knows what they are doing.”

I ended up leaving early and grabbing dinner with some fellow attendees. By the time I left, my ears had become completely numb to the words interpretable and interpretability, a cacophony of syllables that just ran into each other.

Friday Workshop: Machine Learning Systems

I’ve never been to a conference where there were 27 workshops going on at the same time, on the same day. Then there were another 26 the following day, all at the same time. This was a bummer because there were so many good ones to choose from. One might as well just throw their hands in the air and go to the one that had the best seating.

Since I mainly work as a machine learning engineer, and have experienced the usual issues building and monitoring machine learning systems, I decided to attend the ML Systems workshop. Of course, that’s not what the workshop was about, but it was still very interesting. Ion Stoica presented on Ray, a distributed execution system for AI and reinforcement learning. Another interesting talk was on DLVM, which is a compiler framework for creating neural network DSLs. There was also a series of talks giving updates on current AI systems: TensorFlow (project), PyTorch (project), Caffe2 (project), CNTK (project), MXNet (project), TVM (project), Clipper (project), MacroBase (project) and ModelDB (project). There was also a presentation about ONNX (project), an ecosystem for interchangeable models that can be used across deep learning systems (which reminds me of the seldom used PMML). Most, if not all, of these systems are based on Python. Woof!

There were two very interesting talks that did not involve deep learning frameworks. Alex Beutel from Google presented on The Case for Learned Index Structures which focused on using the distributions of the data within an index to speed up common database operations such as selects, by presumably using percentiles and other statistical measures. The premise of Alex’s talk was that the B-Tree induced by the index can be considered a machine learning model over an assumed uniform distribution. Further work is required for data that changes over time. Virginia Smith presented on Federated Multi-Task Learning (paper) which discussed a framework for building models from data provided by several heterogeneous devices, all of which have their own failure rates and communication limitations.
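To get a feel for the learned index idea, here is a toy sketch of my own. It is not the structure from the talk: it fits a single straight line from key to position (which amounts to assuming the keys are roughly uniformly distributed), records the worst prediction error, and then binary-searches only inside that error window. It also assumes the keys are unique numbers and never change.

```python
import bisect


class LinearLearnedIndex:
    """Toy index: predict a key's position with a line, then search a small window."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        count = len(self.keys)
        low, high = self.keys[0], self.keys[-1]
        # Line through the first and last key; exact if keys are evenly spaced.
        self.slope = (count - 1) / (high - low) if high != low else 0.0
        self.intercept = -self.slope * low
        # Worst-case distance between predicted and true position.
        self.max_err = max(
            abs(self._predict(key) - position)
            for position, key in enumerate(self.keys)
        )

    def _predict(self, key):
        return int(round(self.slope * key + self.intercept))

    def lookup(self, key):
        """Return the position of key in the sorted keys, or None if absent."""
        guess = self._predict(key)
        lo = min(max(0, guess - self.max_err), len(self.keys))
        hi = max(lo, min(len(self.keys), guess + self.max_err + 1))
        position = bisect.bisect_left(self.keys, key, lo, hi)
        if position < len(self.keys) and self.keys[position] == key:
            return position
        return None


# Hypothetical keys for illustration.
keys = [3, 8, 15, 16, 23, 42, 57, 91]
index = LinearLearnedIndex(keys)
print([index.lookup(key) for key in keys], index.lookup(4))
```

The better the model matches the real key distribution, the smaller `max_err` becomes and the less searching each lookup has to do.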

Takeaway Lessons and Learnings

I had a good time at NIPS, but because I have not yet embraced deep learning, I did not dive into it as much as I could have. My first learning is that I can no longer put off reading Ian Goodfellow’s book and some of the other deep learning books I’ve collected such as Josh Patterson’s book and Francois Chollet’s new book. NIPS is a very academic conference, and I do not believe I have been to this level of an academic conference before (and I’ve been to IJCAI, KDD, CIKM, and JSM). That is not a bad thing, but as a more applied person, I think KDD et al. are more my cup of tea. With those conferences, I felt the application was the star, and that the methods and theory were discussed in detail as a means to an end. At NIPS, I feel the methods and theory are the star.

datascience.LA dinner at Padre. Source: Szilard Pafka

It was empowering being surrounded by the best minds in ~~machine learning~~ artificial intelligence that rise above the hype. By listening to the talks, following the discussion on Twitter and speaking with others, I could sense a level of frustration with the hype train and some of the charlatans that are going to crash the train if the community lets them. It was also great to meet up with several folks from the LA Data Science organizations. I am looking forward to going to WSDM, the ACM International Conference on Web Search and Data Mining, which is more my cup of tea, in February, in Los Angeles. I hope to see you there!

Ph.D. Defense Post Mortem and Advice for Others

Ryan Rosario — Fri, 14 Apr 2017 18:00:03 +0000

NOTE 1: This is part 1 in a series that will probably contain 3 or 4 parts. Then I will return to the usual data science etc. posts.
NOTE 2: This post was intentionally delayed until I received final approval on the submission of my final dissertation.

On March 14, I passed my final oral defense for my Ph.D. in Statistics.

It was the moment I had been waiting for. The moment I dreaded for so many years and the moment that I thought would never come. I had very few days and times to choose from. It came down to the 13th (bad luck?) and the 14th (Pi Day). My defense began at 10am and was supposed to take a max of 2 hours. I left home around 7am to be sure I arrived in enough time after what could be a long ride. I arrived at 8:45am after sitting in rush hour traffic for almost 2 hours. I printed color copies of my slides and four copies of my dissertation and set up each committee member’s spot nicely: one water, slides, dissertation, a plate and a napkin. I placed a big tray of color-sprinkled cookies in the middle of the table equidistant from each member’s “assigned” seat. The IT manager set up my laptop so it would project to the portable projector, and set up a second laptop which would serve as one of my committee members — my original advisor retired and was participating remotely via Skype.

Then the waiting started. At 9:50am my original advisor rang in on Skype and gave me a quick wave while chewing on a toothpick, a signature quirk. At 9:59am, nobody else was there and I started to panic. A few minutes later I got an email that one member was sitting outside a locked door on the other side of the hallway. I went to meet him, and the rest of my committee all arrived together, including my advisor, who I was warned by others is always late. I was then kicked out of the room while the committee deliberated my case, decided who would ask what questions, and what behavioral tests may be used (I am convinced this is a thing in PhD defenses, and I suppose it’s useful). Then a problem arose. Nobody could hear my remote member. I ran down the hallway to get the IT manager and he brought in some speakers that really didn’t help. “Sigh,” I thought. We had no choice but to continue.

My advisor had joked that I had chosen a very “nice” committee. My advisor himself is very soft spoken, but knows how to deliver constructive criticism. He has this down to an art. I had never known this about him before working with him in this capacity. Another committee member is on the younger end and is perhaps even more soft-spoken than my advisor. I had heard that he is very rigorous in terms of theory and has high expectations in that regard. Many of his comments trailed off and I couldn’t really hear them. My outside member is known for being very friendly and I always felt this way about him (but I had forgotten something — more on that later). And nobody could hear my previous advisor, which was a shame. The day before my defense, my advisor was trying to pep me up speaking very highly of my work and that this should be pretty easy if not a “slam dunk.” So I was feeling pretty good, though he warned me “X really appreciates theory, so it would be helpful if you can frame your research theoretically.” The color left my face as I thought to myself “you’re telling me this now?” I had become accustomed to minor surprises though. During each of our meetings, my advisor would at one point put his clenched knuckles to his lower lip and stare deep into space for about 60 seconds or so, and return with a suggestion that required significant thought. It was almost like he was running through all combinations and permutations of how things could go down in an effort to be “preemptive” (that’s the word he used).

I won’t go into a narrative of everything that happened during the defense because that would take all day. It was a tough experience, but in hindsight it was a positive experience. Actually, it is a good thing I am writing this now because the committee’s feedback makes a lot more sense now and made my dissertation stronger. I consider myself open to constructive criticism (and even flat out criticism) but I was so caught off-guard and confused I sort of wish I could have seen the look on my face.

Fast forward to the aftershow:

My advisor was smiling, but there was no “congratulations” which scared me. He invited me into the room, closed the door, and we all quickly discussed some small changes. I took notes, and soon everyone left. On their way out they congratulated me on passing the defense.

After all that, nobody ate any of the cookies or drank the water.

My advisor said that my defense went “very smoothly,” and was “gentle” though it felt “brutal.” I admit it could have been a ton worse. These are some of the comments I received during my defense, which kind of made me think I had failed, but I had to keep up a facade that I was confident about what I was doing:

My dissertation title, and the name I gave to the method I developed, was rejected.
“The mathematical notation is very critical, and doesn’t seem dissertation-quality.”

Thought in my head: “I am meticulous about notation. Are you serious?” His arguments made sense though which I will discuss later.

“Your method is very ad-hoc and makes a lot of assumptions.”

Thought in my head: “So are 99% of the methods in the literature!”

“You have not really proven that your method works.”

Thought in my head: “I had a very easy to understand baseline, and my method exceeded that baseline.”

I got a lot of questions from a committee member about what TF-IDF is which was bizarre because the person that asked me… was the person that taught me what TF-IDF is. I figured out what he was doing. He was doing his job and testing me. I think he was genuinely confused, but I also feel he intentionally stayed on the same topic, to see how I would respond. This is important for any academic that does research or teaches… to be able to communicate effectively.
“Please don’t think I hate your research because I don’t, I’m just confused!” This made everyone laugh and was a great moment of levity.
My advisor commented that I could address many concerns by adding some more literature review on data augmentation and he explained how it is used on the MNIST dataset in computer vision applications with neural networks (his research area).

Thought in my head: “Please no. How am I going to relate computer vision (a field that kind of scares me) methods to NLP?” and “But, neural networks are not my strong suit…”

A newer faculty member in my field asked, “wait, is that a truncated SVD?”

Thought in my flustered head: “WTF is a truncated SVD? This has nothing to do with truncated or censored data!” He then mentioned the k in the subscript in my SVD and I immediately thought, “Oh, yeah, it’s a truncated SVD.”

And even the quintessential, “Why didn’t you use Deep Learning or a GAN?”

I won’t comment on the thought in my head.

Most of the feedback above came from two of my members, with my advisor basically agreeing with them which made me nervous, but I found out why later. Sometimes if you can agree with people on the small stuff, they won’t be so eager to move on to harder things. I’ve been told that I am humble, so I know that my job as a student and professional in a setting like this is to shut up and listen because these people know much more than I do about what I am doing in terms of theory. I am actually much more comfortable in a technical-explained-practically setting. Anyway, I had seven strategies to get through the defense:

Be confident. You did a lot of work and you know this work better than anyone else. Besides, there is a quote that the dissertation is “supposed to be the worst research you ever write, and is the beginning of your research career, not the end.”
Be cooperative. (My advisor actually told me this one, but it’s obvious). The faculty want you to pass (my committee was very friendly, not all are), and they want you to also write a good manuscript.
Be humble and patient. It’s possible someone on your committee will be in the dark (possibly the outside member) and/or you may assume the committee is more familiar with your specialization (e.g. machine learning, finance, whatever) than they are.
Defend yourself, and push back (hard) if needed. Everybody of course has an opinion, but each should be respected (especially in this setting). My goal was to acknowledge and appreciate their suggestion, but push back if I felt someone’s suggestion was infeasible, was confusing, or did not make sense for my work. (YMMV here, as I think I gave my advisor the wrong idea, but that’s always bound to happen depending on personality…)
Readily admit “I Don’t Know.” I actually never had to use this, but considering the wringer I was put through in terms of questioning, I would have had no problem saying “I don’t know” if I truly did not know the answer to the question.
Take tons of notes to show you care about their feedback (you should!).
While organizing my research and writing the dissertation, I questioned myself about everything. “Why did you choose a topic model instead of a regular classifier?” “What kind of cross-validation did you do and why?” “What is the point of this? Why is this method not stupid?” “Why didn’t you use Deep Learning?” Except for Deep Learning, none of the questions on my “list” were asked.

I had a few big hiccups that I could have avoided:

My advisor suggested that I meet with my committee with my slides and have them go through them with me. Some of the issues I faced (notation) could have been fixed much earlier. Also, the slides were not in the sequence that one of the committee members was expecting. It led to the awkward “what problem are you solving?” question. I had this exact same problem in my preliminary oral. If your advisor tells you to do something, do it, even if it sounds like a “mere suggestion.”
Somehow, I must have given the wrong impression that I took the feedback personally. The thing is, I can be soft-spoken and intimidated when talking with faculty, but I am not that way in general, so I think they may have been surprised when I was a little more assertive than I usually was with them. Also, I figured out that my advisor may have interpreted the word “brutal” literally when all I meant was that it was “tough.” Because of that, he thought I was offended. Later on, he had used another word quite literally and I figured out what was going on.
It had been years since I had witnessed a final oral defense, and it was in a different department (Computer Science), not my own (Statistics). I do recall the one Stats defense I attended was very tense (even compared to CS) and I genuinely thought the guy was going to fail. The committee sounded like they hated his research, one member was late and the student called him out on it, and at one point the candidate told one of the members (paraphrased) “all of your questions could have been answered by actually reading my dissertation when I sent it to you a month ago” and his advisor laughed. He passed. And everyone was smiles and laughter afterwards. Go see a dissertation defense in your own department before you defend!
Not really a hiccup, but it had been a while since I had seen my outside member in a research setting. At the last seminar of his I presented at, I discussed a fun project I was working on, something I did not intend to publish. I remember he was very vocal and asked me really tough questions about my work that I was not expecting at all. This is what faculty is supposed to do. I wish I had taken that project more seriously and looked at it through publishing rather than engineering curiosity. I had forgotten about that experience, but boy did the memories come back during the defense.

My advisor met with me for about 20 minutes after my defense to go over some minor revisions and he spent quite a bit of time explaining why the committee had the feedback they did. I think he realized how surprised I was at the amount of “tension” in the room. I did not take any of it personally because a defense is not supposed to be easy, but since he was much more critical of my work compared to the past ten weeks, it threw me for a loop. We discussed the changes to be made and he explained that this is like submitting a paper to a journal. If you want the paper to be published and a referee says to do something, just do it, even if it may not make sense. I agree with him and after thinking about their feedback more, I realize that there is a kernel of truth in each suggestion, and part of this process is learning how to incorporate feedback into something that makes sense:

The title was confusing and suggested that I was researching something very theoretical, and it seems the committee wasn’t satisfied that enough theory had gone into the work, assuming the title was correct as it was. I had been lazy with my terminology, and I changed the title to something more believable.
The notation was on par with some research papers I have read, but I remember feeling so confused about their notation. I had made a lot of assumptions and placed a lot of conditions on my sampling method, and those conditions were not reflected in the notation. So, this was a fair point and I ended up redoing all of the notation (a frustrating process) and adding a “notation guide” in the appendix.
The ad-hoc nature ended up being fine. The feedback came from the theoretical member and our program graduates dissertations that are applied in nature.
Not proving the method works was the hardest one to agree with but even as I was doing the research, I felt like something was missing in my comparison because it was my method of creating the baseline. I quickly did another experiment with someone else’s baseline.
Data augmentation was an area I had completely missed in my literature review. I avoided it because it seemed to be used only with neural networks and computer vision. I instead focused on “query expansion” which is still used to this day, but the theory behind it is quite dated. I found through my expanded literature review that data augmentation was the other side of the coin and filled in some of the gaps in my thought process. I saw other researchers that had done similar (though still quite different) things to what I did, and their work was much more relatable than query expansion. Honestly, it got me a bit more interested in computer vision and neural networks because it suggested that there were some principled methods and not just magic black boxes.

I took about a week off and then started making my edits. I used several different color pens to mark where I needed to make changes and what those changes should be. After all was said and done, that draft looked like it had open heart surgery performed on it. It seemed like no word went untouched.

Anyway, I hope this advice can help someone. “You’ll get through it” is what I say with some trepidation because it is an experience I do not want to go through again!

For anyone interested, my defense slides are here and my final dissertation is here.

Some New Year Resolutions for This Data Scientist in 2017

Ryan Rosario — Tue, 10 Jan 2017 18:00:00 +0000

I’ve never been very big on New Year’s resolutions. I’ve tried them in the past, and while they are nice to think about, they are always overly vague, difficult to accomplish in a year, trite, or just don’t get done (or attempted). This year I decided to try something different instead of just not making resolutions at all. I set out some professional goals for myself as a Data Scientist. So without further ado…

1. Don’t Complain about It, Fix It: Contribute to Open Source Software (More)

Open source software is only as good as its community and/or developer(s). Developers are human and typically cannot manage all bugs and feature requests themselves. My goal is to routinely contribute back to the community either with new features, or by fixing bugs that I discover. This not only helps the community at large, but also helps me as a software engineer. There is no better way to become an even better engineer than by wading through someone else’s code. While this is something I did all day every day at my $DAYJOB, I do it less while on my sabbatical.

Some of the projects I use the most and that I hope to contribute to are scikit-learn and pandas, particularly parts for higher performance computing such as out-of-core processing, and batched processing. These “tricks” are critical to working with huge datasets on small machines, particularly for students that may not wish to pay for Amazon EC2, Azure etc.
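To make the batched-processing idea concrete, here is a minimal sketch that uses only the Python standard library. The function name `batched_mean`, the column name `value`, and the in-memory CSV are hypothetical stand-ins for illustration; none of this is scikit-learn or pandas code. pandas offers a similar pattern through the `chunksize` argument of `read_csv`.

```python
import csv
import io
from itertools import islice


def batched_mean(rows, column, batch_size=1000):
    """Compute the mean of one column while holding only one batch in memory."""
    total = 0.0
    count = 0
    reader = csv.DictReader(rows)
    while True:
        # Pull at most batch_size rows from the stream.
        batch = list(islice(reader, batch_size))
        if not batch:
            break
        # Only running totals survive between batches.
        total += sum(float(row[column]) for row in batch)
        count += len(batch)
    return total / count if count else float("nan")


# Hypothetical stand-in for a file that is too large to load at once.
fake_file = io.StringIO("value\n" + "\n".join(str(i) for i in range(10)))
print(batched_mean(fake_file, "value", batch_size=3))
```

The point of the design is that memory use depends on `batch_size` rather than on the size of the file, which is what makes this kind of trick workable on a small machine.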

2. Focus on Sharing, Not Just Doing

One of the qualities of my Ph.D. advisor that I admire the most is his dedication to sharing pretty much everything that he does, even if it isn’t complete. Anytime there is a new medium for writing and sharing technical content, he adopts it for his work. Through him, I learned about GitBooks and RPubs. There is also GitHub Pages, ways to share Jupyter Notebooks, R Notebooks, etc. It is hard to keep up with all these new ways of sharing work, but the takeaway is that I need to get better at it.

When I am asked, I typically recommend that people only post on their GitHub completed projects that they are proud of. I am thinking of using a secondary GitHub account for exploration. There are often times I start a project, get distracted and never complete it. But, many times I learn some interesting techniques or hacks that somebody else could use and do not have time to blog about it. Right now, all of that knowledge goes to waste. By sharing this work somebody else can find these gems, even if the project itself is not complete(d). In academia there is the mantra publish or perish. While my academic pursuits will end at a Ph.D., some teaching and maybe a conference talk or two, I want to start taking this to heart in the technical world — give talks, contribute to meetups, blog more, participate on StackOverflow, Quora, Gitter, IRC… and maybe Slack…

One other aspect of sharing work I have done is that it encourages accountability. If I post unfinished work, I may be more likely to finish it, and if there is enough content to remind me of what I did and why I was doing it, that would encourage even more success.

This concept of sharing also applies to my personal at-home habit of writing one-off scripts. In software engineering we focus on reusable code. I want to start taking these one-off scripts and turning them into scripts with at least a command-line interface. This of course assumes that the script has some use to someone other than myself. It is my job to try to make it so, all in the name of sharing and contributing. It is not always about the goal. While some of my advisor’s shared manuscripts and code snippets are not useful to me in what they do, I have learned a lot about coding techniques and new algorithms and that makes sharing content worth it.

3. Create a Usable Web Service, Running on a Real Server (not localhost)

People say that communication is a big part of being a Data Scientist. I believe this depends on the type of Data Scientist role. A Data Science Engineer focuses more of his/her time on accessibility… developing data products or tools that allow people (or machines) to make decisions or present data in a way that a human can easily understand with interactive graphics and other forms of user interaction. Of course this is a special form of communication.

I’ve built machine learning systems, but at the time I did not appreciate the full lifecycle of the system. The system needs to sell itself. Not only should it implement a model in a scalable way, it also needs to adapt to new data (online learning and tuning) and also . At the time, I thought this was a pain, but I now realize that this is what makes a system speak to the user: a full feedback loop.

Going with that theme, web apps, web services, whatever, are a very useful way of sharing insights and presenting data to the client. I’ve worked almost entirely backend, and thus I would like to spend time turning some of my projects into front end web apps using JavaScript (D3, Leaflet, React) and perhaps some of the new features of Python 3 such as asyncio. Most of my experimentation has been on localhost. I want to create a real app on Amazon EC2 or running on this host… just somewhere other than 127.0.0.1.

4. Open My Mind (More) to Neural Networks and Deep Learning

Naturally, I do a lot of machine learning. Over the past 2 to 3 years, Deep Learning has been named the solution to every problem under the sun and I am sure it can be used to find that missing sock. I knew about neural networks but it wasn’t the branch of machine learning I focused much on, so I put the whole deep learning thing on the “not right now” list. During 2016, the field went beyond just a bunch of headlines by well-known practitioners; I also saw what people I respected, as well as academia, thought on the matter.

I received my copy of Deep Learning by Goodfellow and friends and I intend to read it cover to cover. I never had an interest in computer vision because I was not sure if we could ever solve vision problems on consumer hardware and yet here we are. While deep learning is part of that, I feel that it may be a more natural fit for vision and it would be more accessible to me and others. Of course, I am also very interested in applications to natural language processing.

5. Learn a New Language

I love Python. I also love R. Both of them do pretty much everything I need to do as a Data Scientist. Ok, I also need SQL and Bash quite a bit as well. As a developer, I do not want to get stuck in my ways and for my brain to start to rust so I would like to learn a new language. Scala is the one that is really calling my name. I suppose its rise was due to the rise of Spark, but it seems to have idioms that make it very useful for Data Science. I am very conservative when it comes to learning new programming languages because I have been caught up in the fad of new languages that end up being popular for a year or two and then falling out of popularity, even if they are still useful. Think Haskell, Erlang…

Of course, I might also just read Stroustrup’s C++ book cover-to-cover as well as Bruce Eckel’s Java book cover to cover to beef up my C++ and Java respectively as both of those languages are very important for high performance computing (C++), distributed computing (both, but mostly Java) and systems development.

6. Learn about Electronics and Explore

If you look anywhere on Twitter or in the blogosphere, you are sure to read about some gadget somebody has developed using a Raspberry Pi, Arduino, or just plain ole’ circuit boards and components. Developing software can be exciting, but it can only do so much. We need circuit boards, sensors and other components to physically do something. Of course this something can be talking to servers, software, the cloud or other devices. I have very little idea of how electronics work at this basic level and I am looking for a challenge. Right now, my most advanced “gadget” is actually just a tiny computer powered by Raspberry Pi, which serves as a snowcam.

My parents need a new doorbell, one that has a camera that always runs, has decent motion detection and sends alerts over multiple different channels of communication. We have the Ring which is proprietary and just does not do a great job of this. Wireless performance is terrible, and the bell only rings in one room. With some electronic components and either an Arduino or Raspberry Pi, I am convinced I can do better at least for our purpose. I can also access all of the video and alerts on my own server rather than having to pay and deal with the cloud. Another thing… my mother has an elaborate Christmas display in the front yard connected to several timers. The timers are never synched properly and half the yard will be dark. I want to create a power bar that can be programmed over wifi or Bluetooth and that keeps itself synched. Such a device already exists, but I want to do it myself.

My fear of electricity and either electrocuting myself or wasting money burning out circuit boards has precluded me from participating in this fascinating field. I plan on going through this book on electronics to get me started, and from there we will see!

As for myself, personally…

I only have one personal resolution. One that is doable and that would give me joy: Travel somewhere new just to mountain bike. Who knows where I will end up in 2017, but if it involves me mountain biking somewhere other than Mammoth, Lake Tahoe or Southern California, I will consider that a success. Some places on my wishlist include Moab, UT, Bend, OR, Ashland, OR, Whistler, BC, Downieville (not really a trip though), Crested Butte, CO, Park City, UT and maybe Brevard, NC… or… Scotland?

What are your Data Science resolutions or goals this year? Tell me in the comments, and also if you have any recommendations for me based on what I wrote above, feel free to share them!

It’s Been a While

Ryan Rosario — Thu, 18 Feb 2016 18:00:02 +0000

These past three years have really flown. It’s time for me to finally get back to my roots and also start blogging more, like I did previously.

My last post was about Strata 2013. During this time period, I was taking a break from working full-time to finish a Ph.D. dissertation that I had neglected during my previous two positions. I learned my lesson the hard way: never work externally if you want a Ph.D. in a reasonable amount of time! I quickly got my dissertation from an intro to the first 65 pages or so during this gap. I then received an offer from Facebook. I was ready to move to Silicon Valley and enjoy all the things I had been envious over for so many years: the perks, the culture of innovation and intelligence, and the technology community. This was an opportunity I could not pass up, and the dissertation went on the back-burner for another two years as I spent the majority of my waking hours, both during the week and the weekend… and on holidays… coding into a frenzy. I was looking forward to living in a world where I was entrenched in the technology and data ecosystem. But…

The Grass Isn’t Greener on the Other Side

The technology community is definitely there and is obviously very strong, but it isn’t what I thought it would be. Due to the sheer size of the industry, meetups and other events were very impersonal compared to what I was used to in the LA area. Additionally, it seems that most of that original Silicon Valley startup energy has moved to San Francisco. To get to meetups, I would spend hours on shuttles, Caltrain, BART and Muni getting to SoMa and then being disappointed at the frequent company pitches instead of discussing actual science and technology. Not all groups are like that, as I attended plenty of meetups that were technical and whetted my appetite to learn more. Of course, there was also the question if I could even get into the meetup. The majority of the meetup groups I was a part of would fill up in a few hours for a hot topic or engaging speaker with waiting lists sometimes 100 to 200 people long. The final blow was that my attendance assumed I could get away from my projects at work, which I really could not. My technology community ended up being the others at the company, which may have been helpful for my job, but gave me a narrower focus than I wanted, and was just one more thing that kept me at work. Meetups are not the only important thing in the technology community though. I did attend a few conferences such as ACM SIGCIKM, BayLearn, Strata 2014 (but for recruiting), and I spoke at PyData when it was held at Facebook. To be fully immersed in the technology community and experience, it seems one now needs to live in San Francisco, and San Francisco is definitely not the city for me — I am more of a Silicon Valley suburb type but the energy wasn’t the same.

I am not alone when I say that I spent most of my waking life working. Since I had moved there for a job, and I didn’t have any roots, friends or family in the area I thought it would make sense for me to do this. But, working at this rate took a toll on me physically, mentally and emotionally. Although there is a lot to do in the Bay Area, there really wasn’t any time to do it because of the work culture. And people didn’t seem to have time for me for the same reason. This is not true for everyone, but I found it much more true in the Bay Area than anywhere else I have lived. To add to the long work hours, this is not the first time I have been an “overachiever” in life — this is something I had been afflicted with since high school (the 90s).

Not only is there stress from long hours and a lack of any outside world, the Bay Area is extremely expensive — nobody argues against this point. My 850 square foot one-bedroom apartment in Mountain View is now on the market for $4200/month. Buying a house is typically not practical for new and mid-career engineers unless they have been at a large company for a while, or had a big payout from a startup, or were willing to have a longer commute from outside of the valley. A small one-bedroom house can easily list at over one-million dollars in Mountain View and Palo Alto. Next, there are going to be n other bidders that also want the same house. It is incredibly common for already ridiculously overpriced houses in Palo Alto to sell for a lot more than the listing price. If you are single, you are going to need that tax writeoff, or you are in for a huge surprise at tax time. This lifestyle is not sustainable in the long run. And for me, it was an issue that does not make me miss the area much.

On the other hand, many parts of the Bay Area are absolutely beautiful. From the green forests above Santa Cruz, to the pristine coastline from Monterey north, the green rolling hills in San Jose and the East Bay, to the bizarre other-worldly marshes along the bay. It was a dry two years so the weather was not all that different from LA.

I learned a valuable lesson. It’s true that the grass isn’t greener on the other side. You can shower a person with free meals, free rides and other perks (I even forget what they were… they ended up not being important), but all it does is keep you at work, and keep you engaged with only that one part of your life. Your “friends” end up being at the company, which ends up being a bad thing in such a competitive environment. Other perks like an onsite doctor, dentist and physical therapist may sound nice, but they were not up to par with services I received elsewhere, and again are just ways to keep you at work. These things are gimmicks. They are good to entice people, they are good to make life convenient, but they really are just ways to keep you at work and pay you less.

Burn Out: Time to Reflect and Slow Down

When I returned to LA, I drove down Pacific Coast Highway and looked out to the ocean. As the orange winter sun beat on my face through the window, I could not believe it had been 2 years since I had taken that drive. That was not like me. I lived for the beach atmosphere and the sense of unwinding it provided me. At that moment, I realized that I wanted to slow down my “nerves,” not only back to my original levels, but even slower. I wanted to take time out for myself, not only to finish my research, but also to enjoy life, and think about what makes me happy both as a person and as a professional. I realized that in those two years, I lost myself and with each lonely day, I lost my passion for my interests and I did not have many hobbies other than working.

For the past ten years, I’ve spent time in the Eastern Sierra but not nearly enough. I was finally able to buy a vacation home in Mammoth and now have the time to enjoy it. I have spent the past several months hiking, mountain biking, snowshoeing and taking drives through some of nature’s finest beauty. The solitude and intrigue of the wilderness is very cleansing and good for the soul. I’ve also lost quite a bit of weight since leaving behind the free meals and becoming so active. I had always been a brisk walker, and always preferred to use my legs rather than my wheels, but this was the first time since college I did vigorous heart-pumping activity on a routine basis. When I look back, I realize it was not just the past three years that have burned me out. It’s the entire way I have been living my adulthood.

Things Have Changed

I am not closing any doors in this post, but I have learned to value a sane work environment and work hours over perks and pure compensation. Rather than focus on compensation and working at a “hot” company, my intention is to do work that benefits the common good with respect to my interests, while providing me with the means to live, retire, and have funds available for my own hobbies and side projects. There is only one thing that I will not compromise on (ok, two): I must be able to wear shorts, and I must be able to have flexible hours. Whether or not to accept a position is now a lot more complex than looking at a company’s base product and having coding, machine learning and statistics in the job requirements/description. I do not want to spend 1% of my time doing machine learning using some basic model (e.g. Naive Bayes or Logistic Regression) and the other 99% scaling it to billions of observations. Rather, I would like to be able to explore more on the machine learning side, and learn new algorithms and methods for prediction and classification. This does not mean that I completely want to move away from the systems engineering stuff, but it will really depend on the product and the team rather than just the company.

For the time being, I am consulting and also mentoring a startup. I may continue to do that as my career, I may not. I have several ideas for startups that I may pursue, but I may not. Who knows, maybe I will return to the Bay Area (under the constraints I mentioned earlier), or I may not. I will probably return to being more active in the community like I used to be, but I have realized that there is a lot of noise, hype and ego in the blogosphere, Twittersphere, and these thousand-dollar conferences. There is something to be said for just doing my own thing. Maybe I have just grown as a professional, I don’t know. I just know that these things should be taken with a grain of salt.

Switching Fields?

After a lot of introspection, I want to take a look at some other fields outside of “pure tech” including but not limited to:

Environmental and activity geospatial data. After living in the mountains, I’ve become very interested in environmental data, particularly using time series, GPS telemetry and geospatial analysis. My interest in this field has applications ranging from efficient placement of snowmobiles for SAR operations, to action sports and activity intelligence, even navigation.
Finance. Finance used to be on my list of “never ever.” After learning more about economics and Wall Street from my time in startups and Silicon Valley, I am also interested in some applications in finance. Machine learning is obviously very useful for automated investing, but data visualization has proven to be useful in manual transactions for me.
Education. My original draw to statistics was the field of psychometrics and the development of educational assessments. I am considering going back in this direction. I am also interested in the educational technology sector improving the delivery of educational materials and assessment of learning. Of course, I may go into teaching altogether, most likely at the college level, or as some type of training consultant.
Aviation: Airliners and drones. People that know me well know that I love airports, airlines and flying. Aviation uses a lot of different data science techniques. Drones are an emerging technology and routing drones in the sky has become a challenge that companies are working on. Routing, both for drones and airliners, uses geospatial/map data and network/graph data and takes into account many variables that affect flight, airspace congestion, and airport/ground resource usage. Wait time and queuing theory is also very important for runway operations. There is a lot of game theory, network analysis, and other data science involved in pricing and scheduling of airliner flights. All of these challenges are interesting to me.
“Internet of Things.” It annoys me that the emerging field of embedded systems, their development and data processing has become yet another cheap buzzword like “big data” or the misuse of the term “data science.” Devices such as the Raspberry Pi, Arduino and custom printed circuit boards allow the masses to create new data collection devices that unobtrusively fit anywhere data need recording. While the data itself is interesting, in this one particular case, I am actually more interested in the hardware, and pure engineering side rather than the data science side.
Security is an exponentially growing field that has become pivotal not only for national security, but for privacy. Security is a field that is very interesting to me, but one I know very little about, and thus is an option for a more ambitious change of field. I can see it being a field I would be passionate about the more I learn about it. Security would be uncharted waters for me, but I do not see it as a field that will be disappearing anytime soon.

After typing up this list and re-reading it, I realize I still have the same level of passion I always did, and perhaps my soul needed to focus on something else for a while. Now I just have to make the choices of which ones are the most rewarding, and which ones provide the best opportunities for me. In any job interview, there is always the “Do you have any questions for me/us?” Over the past several years, I have compiled a long list of questions. And if I do not like the answer, or if I can tell the interviewer is BSing the answer, abort! Perks and big names are not the key to happiness or a more fulfilled life — becoming a better person and being able to enjoy the process of life is.

Below are some pictures from my neighborhood!

Summary of My First Trip to Strata #strataconf

Ryan Rosario — Thu, 28 Feb 2013 18:00:00 +0000

In this post I am going to summarize some of the things that I learned at Strata Santa Clara 2013. For now, I will only discuss the conference sessions as I have a much longer post about the tutorial sessions that I am still working on and will post at a later date. I will add to this post as the conference winds down.

The slides for most talks will be available here but not all speakers will share their slides.

This is/was my first trip to Strata so I was eagerly awaiting participating as an attendee. In the past, I had been put off by the cost and was also concerned that the conference would be an endless advertisement for the conference sponsors and Big Data platforms. I am happy to say that for the most part I was proven wrong. For easier reading, I am summarizing talks by topic rather than giving a laundry list schedule for a long day, and I am also skipping sessions that I did not find all that illuminating. I also do not claim 100% accuracy of this text as the days are very long and my ears and mind can only process so much data when I am context switching between listening, tweeting, emailing etc.

In the mornings there were several short plenary talks where people throughout the industry discussed their particular views of Data Science. This was basically a warm-up for what would become very long days. I mostly used this time to catch up on email and review stuff from the previous day. The second day apparently had a lot of gimmicky sales talks, but I was not paying attention. The most interesting talk came from Jennifer Pahlka from Code for America. I had first learned about Code for America at the Data Scientist Summit hosted by EMC in 2011. At the time it sounded like a very good idea: we have college graduates that dedicate a couple years to teaching in inner-city schools, so it makes sense that we should have some data scientists working on “projects that make a difference in the world” as Jennifer would say. The projects that these data scientists work on involve democratizing data and open data initiatives in local governments. A couple of projects stood out to me. One was a project to release some 800+ datasets from the City of Santa Cruz (an amazing city) on a website. Another involved studying bail amounts and the outcome of a criminal trial. This talk was apropos considering the recent announcement of code.org, the start of a movement to begin teaching computer programming to K-12 students. [You can sign the petition and register as a volunteer here.]

Visualization Strand

I have said in the past that visualization is not my thing. I greatly appreciate interactive graphics and cool infographics that convey strong meaning to non-data scientists but it simply is not my cup of tea yet. However, it is something that I want to invest time into. I decided to attend visualization talks up to my tolerance level (which isn’t very high)… which meant ~~one~~ two.

I attended Chang She’s talk Agile Data Wrangling and Web-based Visualizations. Chang did what I usually do: pack too much into a one-hour talk… but I feel that talks like this really whet the appetite to learn more. He discussed how data science is missing a “blue button” that takes care of data management and then visualization. Using the federal election commission dataset, he showed political donations by party, candidate and state as the motivating example. Chang showed several examples of using pandas (a Python data munging library) to manipulate the data and then passing that data to d3.js using a JSON data format with a web server. I felt that this was just a basic talk on how to combine tools to munge data and then visualize it. It is far from a blue button, but shows how important such processing pipelines are.
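To make the hand-off from data munging to d3.js a bit more concrete, here is a minimal sketch of the server-side half: aggregate donation records, then serialize them as JSON records that a d3.js page could fetch. It uses only the Python standard library so that it runs anywhere; the records, the field names and the totals_by helper are made up for illustration and are not taken from Chang's talk.

```python
import json
from collections import defaultdict

# Hypothetical donation records (not real FEC data).
donations = [
    {"party": "DEM", "state": "CA", "amount": 250.0},
    {"party": "REP", "state": "TX", "amount": 500.0},
    {"party": "DEM", "state": "NY", "amount": 100.0},
    {"party": "REP", "state": "CA", "amount": 75.0},
]


def totals_by(records, key):
    """Sum donation amounts per value of `key`, as a list of JSON-friendly records."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key]] += record["amount"]
    return [{"key": k, "total": totals[k]} for k in sorted(totals)]


# A web server would return this string; d3.json() on the page would parse it.
payload = json.dumps(totals_by(donations, "party"))
print(payload)
```

With pandas, the same aggregation is roughly a groupby on the key column followed by to_json(orient="records").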

Law, Ethics and Open Data Strand

One of the highly acclaimed talks of the day came from Joseph Turian of MetaOptimize, titled Sci vs. Sci: Attack Vectors for Black-Hat Data Scientists and Possible Countermeasures. Every skill has a good use and an evil use and Data Science is no exception. We create models to try to combat fraud, detect spam, measure influence and much more. These “good” uses of skills are called “white hat.” On the other hand, a more evil Data Scientist can circumvent these models to allow their spam to go undetected or game an influence metric such as PageRank. For example, consider a malicious web page that contains code that simply repeats a user’s Google query endlessly. To a very stupid search engine, such a web page would game a keyword matching algorithm and the search engine that is based on it. This crap web page would appear as the first result because it appears the most relevant. This is a very elementary example, but one can imagine how sophisticated models can produce nasty results.
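The keyword-stuffing example can be sketched in a few lines. This is a deliberately naive ranker that scores a page by how often the query terms occur in it; the function names and the two toy pages are my own assumptions, and no real search engine works this simply.

```python
def keyword_score(query, page_text):
    """Naive relevance: total number of times the query terms occur in the page."""
    terms = query.lower().split()
    words = page_text.lower().split()
    return sum(words.count(term) for term in terms)


def rank(query, pages):
    """Return page names ordered from most to least 'relevant'."""
    return sorted(pages, key=lambda name: keyword_score(query, pages[name]), reverse=True)


# Toy corpus: one honest page and one page that just repeats the query.
pages = {
    "honest": "a short guide to mountain weather and forecasts",
    "stuffed": "mountain weather " * 50,
}

print(rank("mountain weather", pages))
```

The stuffed page scores 100 against the honest page's 2, so it is ranked first even though it contains no information at all.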

Turian believes that most Data Scientists originally come from academia where the skills we learned are mainly “white hat”, but that our use in industry is mainly “grey hat” (somewhere between good and not-so-good). Such “grey hat” methods may involve some sort of data privacy issue such as with ad retargeting. A “black hat” data scientist may be useful in constructing a botnet, using Markov models or other language models to generate human-looking spam text, or to create sock puppets to sway opinion in a large social network. A sock puppet is essentially a social media account that is designed to look like a real genuine human but that has an ulterior motive, mainly to proliferate propaganda or false information. The use of these sock puppets is referred to as “astroturfing,” that is, a fake grassroots movement. One easy example I can think of is the thousands and thousands of Twitter accounts that are created simply to sway opinion about President Obama (search for #tcot and you are likely to find some examples, though many are also legitimate users). Turian cited one unsophisticated example of astroturfing: Newt Gingrich and his huge jump of followers in a short period of time, which was determined to be fake. In this case, it is alleged that Gingrich’s campaign paid for followers rather than create an army of sock puppets. Some methods for locating sock puppets are the presence of reply spam (@spam), manual classification, and honeypots.

Some interesting statistics:

7% of Tweeps (Twitter users) are spam bots.
20% of us accept friend requests from people we do not know.
30% of us have been deceived by chat bots.

Note: MetaOptimize hosts an amazing machine learning Q and A site similar in function to StackExchange/StackOverflow. You can visit it here.

Data Science Strand

IPython Notebooks

The first talk of this series I attended was The IPython Notebook: a Comprehensive Tool for Data Science by Brian Granger at Cal Poly San Luis Obispo and Chronicle Labs. One of the major problems in Data Science is that “code and data do not communicate much.” That is, code is usually placed in one file, and data in another file and an analysis involves the coupling of data and code that must be kept in sync throughout the process. Imagine if all of your work as a Data Scientist could be contained on your physical desktop as separate objects; this is a good analogy for IPython Notebooks. An IPython Notebook functions much like a Mathematica notebook, or a Sage notebook. One can analyze data in pandas data frames, use some fancy models from SciPy or scikit-learn, use the general Python language as well as the niceties provided by IPython all in one place. Once the code is written, one can produce plots with matplotlib in place and then distribute the document to others. IPython Notebooks provide a living document of one’s work and allow resilience to change by keeping all of the code in one place. Additionally, the concept of cell magic allows the execution of other languages such as R, Ruby and Julia from within the IPython Notebook! Soon there may be no need to run multiple interpreters or have multiple different open-source notebook projects for each additional language!

Here is the amazing part: by using so-called cell magic, one can push a Python object, say a pandas dataframe directly into R and it is converted into an R dataframe. I do not remember the specifics of why this is possible, but this is huge. ~~This eliminates the need for packages like RPy2 for basic computations between R and Python.~~ [Edit: RPy2 is used under the hood for this conversion. Thanks to Dirk for pointing this out.] Brian mentioned that it also may be possible to eventually allow Python objects to interact with JavaScript libraries such as d3.js for visualization using widgets.

IPython Notebooks support narrative text, headings, graphics and also mathematical typesetting via MathJax. Executing code produces JSON strings that are portable and serializable for saving results without requiring code to be re-executed. The site nbviewer.ipython.com provides an online viewer for IPython notebooks via URL, Git repository URL or Gist URL. This viewer does not require the web service to be installed locally. One current limitation of IPython Notebooks is that they only support a single user and thus cannot be hosted for, say, multiple students to login to their own notebook session in a classroom.

Once ipython and ipython-notebook (the Ubuntu package name) are installed, one just executes the command ipython notebook in the directory of interest to start up a web server for working with IPython Notebooks.

Apparently entire textbooks are being written as IPython Notebooks for their beauty, scientific ease and portability.

Adversarial Learning

The final talk I attended was What To Do When Your Machine Learning Gets Attacked by Vishwanath Ramarao. The purpose of this talk was to discuss issues with the bad guys trying to circumvent machine learning models designed to prevent abuse of a system, such as a spammer learning how to get around a spam filter over time. This spammer is called an adversary, and can be a “black hat” data scientist. Some examples of adversarial situations are login fraud (spear phishing, PR embarrassment or financial information), comment/mail spam, sign up fraud, astroturfing, credit card fraud and click fraud. Adversarial learning is the set of techniques that classify data emitted from an adversary.

An adversarial situation arises when the adversary is able to observe the output of the learning system and can change some subset of the features used in that system so that their attempts go unpunished. The goal of adversarial learning is to make it costly for an adversary to change features. The approach towards a solution is labor intensive, but simple to explain. Ramarao essentially said that the best way to combat adversaries is to

1. engineer features interactively and quickly.
2. not throw away features as we commonly do. It is possible that some features may be activated as the adversary’s methods evolve.
3. consider the entire transmission of an adversarial transaction. That is, do not just look at the words in a spam email but also look at the HTTP headers and other communication information passed along with the text.
4. study anomalies (outliers and high leverage points) and not discard them. Usually such anomalies are adversaries.
5. permit overfitting when necessary for the reason mentioned in #3.

As a text mining enthusiast, I learned some interesting tricks on fitting machine learning models to text, none of which had anything to do with adversarial learning.

A homoglyph is a rewriting of a word in which some characters are replaced with characters that look similar. For example, p0rn is a homoglyph of porn: the o in porn is replaced with a character that looks similar, the zero 0.
A broken word is a rewriting of an intended word with spaces added. For example, the word nigeria could be a feature for a spam detection algorithm. An adversary can bypass the filter by instead writing ni geria.
Hash busters are cases where new words that were not in the lexicon used to train the text model are injected into content. One should count the hash busters and use that count as a feature in a model. One common hash buster for a naive profanity filter would be the word fcuk instead of the actual word f*ck.
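The three tricks above can each be turned into a simple count feature. The following is a minimal sketch under loud assumptions: the homoglyph table and the lexicon are tiny and made up, the function names are mine, and this is not the feature set Ramarao presented.

```python
import re

# Assumed, deliberately tiny lookup tables; a real filter would use far larger ones.
HOMOGLYPHS = {"0": "o", "1": "l", "3": "e", "@": "a", "$": "s"}
LEXICON = {"buy", "cheap", "pills", "from", "nigeria", "porn", "now"}


def normalize_homoglyphs(token):
    """Map look-alike characters back to the letters they imitate."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in token.lower())


def count_broken_words(tokens, lexicon):
    """Count adjacent out-of-lexicon pairs that form a lexicon word when joined."""
    return sum(
        1
        for left, right in zip(tokens, tokens[1:])
        if left not in lexicon and right not in lexicon and (left + right) in lexicon
    )


def extract_features(text, lexicon=LEXICON):
    raw = re.findall(r"\S+", text.lower())
    tokens = [normalize_homoglyphs(t) for t in raw]
    return {
        # Tokens that only match the lexicon after undoing look-alike characters.
        "homoglyph_tokens": sum(1 for r, t in zip(raw, tokens) if r != t and t in lexicon),
        "broken_words": count_broken_words(tokens, lexicon),
        # Out-of-lexicon tokens; note this also counts fragments of broken words.
        "hash_busters": sum(1 for t in tokens if t not in lexicon),
    }


features = extract_features("buy p0rn from ni geria now xqzv")
print(features)
```

For the sample text this gives one homoglyph token (p0rn), one broken word (ni geria) and three out-of-lexicon tokens (ni, geria, xqzv), which could then be passed to a classifier alongside the usual word features.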

Julia

After being enlightened by this wonderful talk, I am going to write a more substantial post focusing solely on Julia, so for now I will just briefly describe some of the more easy-to-explain content. This talk was presented by Michael Bean from Forio (developers of Julia Studio). As data scientists we love dynamic environments for interactive data munging such as R, or the Python shell (with pandas or SciPy). We typically start with a high level language such as R and then port this code to a compiled or performant language like C, C++ or Java (and maybe Python). This is a large barrier in scientific computing because it requires the data scientist to know two languages: one to experiment, and one to implement. Julia is a scientific computing language that provides the performance of a programming language like C++ and adds technical libraries and accessibility for scientific exploration. Bean cited that Julia’s performance is similar to C++. Julia allows us to complete tasks faster because we remove the need for “glue” code and Julia packages are written in Julia for performance rather than requiring C or Fortran. [R packages can be written solely in R, but for computationally intensive operations, or for packages that will sit in a bottom layer such as data structures etc. there is a huge performance hit.] Once one is familiar with Julia, it is easy to “hack the core” so to speak.

Other features that impress me:

the user can redefine arithmetic operations and construct new data types. Julia uses multiple dispatch which is a programming language feature that uses different implementations of functions depending on the data types passed to the function. For example, if A and B are of type matrix, then Julia will know that A * B is the matrix multiplication operation rather than elementwise multiplication.
common data structures found in computer science are supported natively such as BitArrays and SubArrays as well as types statisticians are already familiar with including Distribution and DataFrame.
support for list comprehensions. For example, to square every element, use [xi^2 for xi in x] instead of a for loop.
every package is a Git repository and thus open-source and easy to access.
some packages support multicore natively.
certain functions can have a bang (!) appended, which tells Julia to modify the object in place rather than make a copy (think in-place sort, which is sort!).

Bean showed that the development process with Julia is shorter than with languages such as R because production-level re-implementation is not necessary. The runtime was also faster for the few examples he showed. The following is an example of the recursive implementation of generating Fibonacci numbers in both R and Julia:

R Code

fib <- function(n)
{
  if (n < 2) {
    return(n)
  } else {
    return(fib(n-1) + fib(n-2))
  }
}

start <- Sys.time()
fib(36)
end <- Sys.time()
end - start

Runtime: 192 seconds

Julia Code

fib(n) = n < 2 ? n : fib(n - 1) + fib(n - 2)
@elapsed fib(36)

Runtime: 0.24 seconds

Connected World Strand

Bit.ly: Deriving an Interest Graph

The first talk in this strand that I attended was by Anna Smith of bit.ly titled Deriving an Interest Graph for Social Data. It should be no surprise that a URL shortening service would have a ton of data to sift through. Anna stated that a lot of her work is one-off analysis. What I liked about Anna’s talk in particular is that the visualizations she used were very basic. There was nothing fancy about what her graphics displayed — they just displayed some insights about the data and that is it.

Bitly extracts a lot of data from each shortened URL including keywords, topics and the probability that the click came from a human. One can derive a taxonomy and interest graph by analyzing click data among links. The idea is to look at other webpages a user went to from the page related to the shortened URL. It is hypothesized that the next page the user visits is related in content to the current page. On a domain level, a coclick graph uses domains as nodes and the number of clicks between them as edge weights. From this, we can derive a graph of keywords by computing the Jaccard similarity between pairs of keywords, based on the number of clicks to domains associated with each keyword. The resulting coclick graph has 4.5 million keywords and 9 million edges. By using some basic processing (removing non-English keywords and keywords with low click numbers) and then running a clustering algorithm called DBSCAN, they were able to simplify their graph to 200,000 keyword clusters and 1 million edges.
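I did not catch exactly how bit.ly computes the keyword similarity, so the following is only a sketch of one plausible reading: a weighted Jaccard similarity (sum of minimum click counts over sum of maximum click counts) between the per-domain click counts of two keywords. The data, the threshold and the function names are all assumptions of mine.

```python
from itertools import combinations


def weighted_jaccard(clicks_a, clicks_b):
    """Weighted Jaccard: sum of per-domain minima divided by sum of per-domain maxima."""
    domains = set(clicks_a) | set(clicks_b)
    overlap = sum(min(clicks_a.get(d, 0), clicks_b.get(d, 0)) for d in domains)
    total = sum(max(clicks_a.get(d, 0), clicks_b.get(d, 0)) for d in domains)
    return overlap / total if total else 0.0


def keyword_edges(keyword_clicks, threshold=0.1):
    """Build keyword-graph edges for pairs whose similarity reaches the threshold."""
    edges = {}
    for a, b in combinations(sorted(keyword_clicks), 2):
        similarity = weighted_jaccard(keyword_clicks[a], keyword_clicks[b])
        if similarity >= threshold:
            edges[(a, b)] = similarity
    return edges


# Hypothetical click counts: keyword -> {domain: clicks}.
keyword_clicks = {
    "python": {"docs.example": 40, "news.example": 10},
    "pandas": {"docs.example": 30, "blog.example": 10},
    "skiing": {"snow.example": 25},
}

print(keyword_edges(keyword_clicks))
```

Here python and pandas share clicks on one domain and get a similarity of 0.5, while skiing shares nothing with either and ends up with no edges. A clustering step such as DBSCAN would then run on the resulting graph.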

The Data Science group at bit.ly keeps an updated GitHub repository for their work here.

LinkedIn Endorsements

The last session in this strand that I attended was by Sam Shah and Pete Skomoroch from LinkedIn. This talk discussed the skills endorsement feature of LinkedIn and how they made it successful using science. Sam and Pete credit most of the success to establishing viral loops and using recommendation engines as follows: A endorses B -> B is notified -> B accepts the endorsement and endorses someone else.

Social tagging of skills also accelerated adoption. First users market their skills, and then other skills are recommended for them to add. First, a user thinks about their skills and tags them on their profile. Then, a recommendation system recommends other related skills as well as some potential people for the user to endorse. But this is not the interesting part…

How does LinkedIn maintain a skill dictionary and taxonomy? This is a highly unwieldy problem due to human psychology and variations of language usage. One of the biggest issues is in phrase sense disambiguation. The motivating example was the skill angel. If I list angel as a skill on my profile, am I referring to myself as an angel investor or as a spiritual being? The speakers indicated that by using the graph of all the skills listed in addition to angel, we could use agglomerative clustering and then a distance metric to determine which meaning is most likely. Another issue is de-duplication. An example is MS Office, Microsoft Office, Office. All of these concepts refer to the same thing. For this particular problem, LinkedIn used crowdsourcing with Mechanical Turk tasks. An example human interaction task was to ask a participant to find the best Wikipedia article for the particular topic, since Wikipedia tends to already have a strong army that de-duplicates content.

This is all great for users that actively use the Skills feature, but some do not. For those users, a system passes a sliding window over the profile text (n-grams) and emits possible matches based on the taxonomy, tossing out words that do not fit into the inferred topics in the profile. For example, suppose my profile text says “I love working with data, Python, Java and Hadoop.” The words I, with and and will all be tossed as stopwords. Then, I have the following keywords left: love, working, data, Python, Java, Hadoop. For all practical purposes, working is probably considered a stopword or low-impact word because it appears so much on LinkedIn profiles. data is probably not an actual skill so both of these words are removed, leaving love, Python, Java, Hadoop. Using LinkedIn’s taxonomy of skills, we would probably deduce that Python, Java and Hadoop are highly related and love is an extreme outlier (for some people love may be an actual skill, but likely not in this context). Finally, this system would tag Python, Java, and Hadoop as skills to add to the profile. For more complex (realistic) examples, LinkedIn would then apply word sense disambiguation and de-duplication. A simple Naive Bayes algorithm is used to generate the actual recommendations. In the event of a completely blank profile, recommended skills are based on title, organization and perhaps social network features.
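The profile example can be walked through in code. This is a minimal sketch that assumes a made-up stopword list and a made-up skill taxonomy; it only does the stopword filtering and n-gram lookup, and leaves out the topic inference, disambiguation and Naive Bayes steps that LinkedIn described.

```python
import re

# Assumed stopword list and a hypothetical, tiny skill taxonomy.
STOPWORDS = {"i", "with", "and"}
SKILLS = {"python", "java", "hadoop", "machine learning"}


def candidate_keywords(profile_text, stopwords=STOPWORDS):
    """Lowercase, tokenize and drop stopwords."""
    tokens = re.findall(r"[a-z0-9+#]+", profile_text.lower())
    return [t for t in tokens if t not in stopwords]


def ngrams(tokens, n):
    """All runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def suggest_skills(profile_text, taxonomy=SKILLS, max_n=2):
    """Slide windows of 1..max_n tokens over the text and keep taxonomy matches."""
    tokens = candidate_keywords(profile_text)
    found = []
    for n in range(1, max_n + 1):
        for gram in ngrams(tokens, n):
            if gram in taxonomy and gram not in found:
                found.append(gram)
    return found


profile = "I love working with data, Python, Java and Hadoop."
print(candidate_keywords(profile))
print(suggest_skills(profile))
```

The first print shows the keywords left after stopword removal (love, working, data, python, java, hadoop) and the second shows the three that match the taxonomy. In this sketch love, working and data fall out simply because they are not in the taxonomy, which is cruder than the relatedness reasoning described above.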

LinkedIn can also suggest endorsements where the system asks a user to endorse another user for particular skills that the user may know about. Some features used for this recommendation engine include people-skill combinations, school overlap, group overlap, similarity in industry, title similarity, site interactions, and co-interactions. Such a recommendation engine is basically a binary classification problem for link presence.

This talk by LinkedIn was surprisingly candid. Obviously, one cannot employ the methods they discussed because we do not have access to their data or infrastructure, thus such a talk is of no risk to intellectual property. Many companies do not get this and do not allow their employees to speak about anything involving their work.

Conclusion

I am glad I forked out the money to attend Strata and I will likely attend next year. The conference was huge and there was something for every data geek including a ton of food. The conference overall was not as sales-y as I thought it would be, but there were definitely moments, particularly in the morning sessions and at the expo. I mainly just collected t-shirts at the expo hall, but it was basically just a giant “my Hadoop distribution is 100x faster than the other guys’.” There was also a really cool sensor lab setup for collection of data using Arduino sensors. There were several sensors placed throughout the conference venue and the data was visualized and placed here.

During my time at Strata so far, I have finally had the chance to meet some longtime Twitter friends and reunite with others. It was great meeting Neil Kodner and discussing our common interests as well as meeting Mathieu Bastian and discussing graph processing and the future of Gephi (I need to write a blog post about Gephi soon). I had a chance to talk to Wes McKinney over lunch as well about Python and the pandas community. On the last day, I went to an event hosted by Facebook and met several Facebook engineers and other Twitter friends including Joseph Turian, Bradford Stephens, Daniel Tunkelang and Greg Rahn. Everybody I met has now relocated to the Bay Area and I think I am going to need to follow…

Merry Christmas and Happy Holidays!

Ryan Rosario — Mon, 24 Dec 2012 19:58:02 +0000

Wishing you all a very Merry Christmas, Happy Holidays and Happy New Year!

An update on me. In October, I began working at Riot Games, the developers of League of Legends. It has been an amazing experience and has occupied the majority of my free time as has my dissertation. My New Year’s resolution this year is to dust the cobwebs off this blog!

Have a safe holiday season!

Here in California, I will be having Christmas in the Sand

A New Data Toy — Unboxing the Raspberry Pi

Ryan Rosario — Tue, 09 Oct 2012 17:30:49 +0000

Last week I received two Raspberry Pis in the mail from AdaFruit and just now have some time to play with them. The Raspberry Pi is a minimal computer system that is about the size of a credit card. In the embedded systems community, the excitement is for obvious reasons, but I strongly believe that such a device can help collect and use data to help us make better decisions because not only is it a computer, but it is small and portable.

For development, Raspberry Pi can connect to a television (or other display) via HDMI or composite video (the “yellow” plug for those still stuck in the 1900s haha). A keyboard, mouse and other devices can be connected via two USB ports. A powered hub can provide support for even more devices. There are also various pins for connecting to a breadboard for analyzing analog signals, for a camera or for an external (or touchscreen) display. An SD Card essentially serves as the hard disk and probably a portion of the RAM. The more recent Model B ships with 256MB RAM. Raspberry Pi began shipping in February 2012 and these little guys have been very difficult to get a hold of. I finally got tipped off as to when more became available by following the Raspberry Pi subreddit. Raspberry Pi was originally not designed with geeks in mind. In fact, they were originally designed to teach school children about computers and programming.

The figure below shows the size of my Raspberry Pi versus the size of the credit card I purchased it with (just kidding). The price is also small, at about $35 depending on where you buy it!

So what can you do with it? I imagine almost anything a computer can do. Just remember that you are limited by a lightweight CPU, power restrictions and potential heat issues. Raspberry Pi does allow outputting high definition video, though. I have not done enough testing to verify any of this yet.

Here are some generic ideas:

Realtime informational displays of data and graphics on a large display.
- The Raspberry Pi conforms to some standard that allows it to be mounted (with assistance) to the back of an HDMI display.
- Use the RPi as a dedicated system for pulling data from other systems, doing some lightweight processing (or pull results from another system) and then display the results.
Small, portable data collection and transmission devices.
- Raspberry Pi can be powered from AC of course, or over a MicroUSB to USB cable, similar to those used to charge Android devices.
- Connect a small (or regular sized) wireless adapter, or 3G/4G dongle for data transfer.
- Connect a Bluetooth dongle for communication with other data collection devices (think GPS receivers etc.).
- Connect an IR receiver via USB for remote control.
- Connect a USB battery backup for times where AC is not available (5V) such as in the field, or when an automobile does not provide power.
Development of data-driven “fat” clients.
- Use Raspberry Pi to make automated decisions using machine learning using your favorite development tools and statistical libraries including R. Obviously, mileage may vary. We are not talking about 8-core Xeon CPUs here…
For use as a “motherboard” (pun?) for collecting and analyzing analog signals using a separate breakout board for Raspberry Pi.

It is important to understand that hardware compatibility is more hit-or-miss than it is with a standard desktop or laptop. Certain chipsets must be matched, and drivers must be compatible with the ARM architecture. To research which items to purchase, I took a look at the RPi Verified Peripherals wiki page.

A similar platform I was eyeing was the Arduino. The biggest win of the Raspberry Pi over Arduino (I believe) is that Raspberry Pi is a mini-computer that can run a standard garden-variety operating system (well, Linux), whereas Arduino is a platform for collecting and transmitting analog and digital signals using its own software.

So, what do I plan to do with my Raspberry Pis? It is kind of secret… OK, not really, but I don’t want to write about it until I have something to show! What will you do with yours?

Adventures at My First JSM (Joint Statistical Meetings) #JSM2012

Ryan Rosario — Mon, 06 Aug 2012 16:30:00 +0000

During the past few decades that I have been in graduate school (no, not literally) I have boycotted JSM on the notion that “I am not a statistician.” Ok, I am a renegade statistician, a statistician by training. JSM 2012 was held in San Diego, CA, one of the best places to spend a week during the summer. This time, I had no excuse not to go, and I figured that in order to get my Ph.D. in Statistics, I have to have been to at least one JSM. The conference itself was 5 days, but I did not think I could hang with statisticians for five days, and more importantly, taking three days off to attend the conference was more reasonable since I work in industry.

I arrived at the conference on the third day, July 31. Unfortunately, upon arriving at the Manchester Grand Hyatt, I was informed that they had screwed up my reservation and that they had overbooked for the night. No wonder I never received a confirmation email despite calling them three times. Grmph. But my faith in humanity was restored by their compensation package: one night FREE next door at the Marriott which was closer to the conference, FREE parking for the entire stay, FREE Internet for the entire stay, and FREE breakfast at their buffet for the entire stay. I also got a free upgrade to a room with a view of the bay and the USS Midway.

My First Day, July 31

After getting lost in the convention center for quite a while, a friend I had met at KDD-2011 was one of the first people I saw and encouraged me to attend the Facebook talk, strangely titled Stat-Us. Anyway, all of the speakers in this session were Data Scientists at Facebook, including Jonathan Chang, the author of the LDA package in R, called lda. Most of their talks discussed cleaning the social graph using machine learning. Facebook actively removes fan pages etc. that are deemed to be duplicates using decision trees and other algorithms with features such as age of the creator, grammar and number of fans. Jonathan Chang discussed disambiguating and clustering places for the Places product. For example, users have created several places to refer to Disneyland: “disneyland”, “Disney Land”, “Happiest Place on Earth” etc. Facebook uses some NLP techniques, but even more effective are their techniques that compare the distribution and demographics of check-ins to each of the places. Of course, seasonality in check-ins is another aspect used in their model. But what about ubiquitous venues such as McDonald’s or Starbucks? By studying the radial density of these establishments, it is easier to disambiguate which location a user is discussing. This also makes it easier to correct places tagged at the wrong location. One other interesting point is sterile computing. Much of the information Facebook data scientists work with is highly personal. During some sensitive analyses, data scientists use sterile machines that are not connected to the Internet to perform their analysis. Analysis, diagnostics and graphics to be conducted during the process must be constructed beforehand. Also, data scientists have no access to any of the raw data in this system, only the results of their analysis. All in all, I was excited. I was surprised at the level of detail they provided.

That afternoon I attended one of the many sessions titled Clustering. I had a difficult time choosing among L1 Regression, Prediction in Social Networks and Clustering, but since my dissertation topic involves Latent Dirichlet Allocation, a form of clustering text, I felt this was the most interesting. Most of the talks were very good. The most interesting talks in the session were about analyzing massive structured data using sparse generalized PCA, estimating similarity metrics to evaluate clustering algorithms, clustering autoregressive time series and using clustering to detect network intrusions.

After the Clustering talk, I got to meet John Ramey (Baylor, @ramhiser), one of the speakers and also a Twitter friend. As we were talking, Yihui Xie (Iowa State, @xieyihui) joined us. It wasn’t until we were all in conversation that I realized Yihui is the author of the now popular knitR package! Later, John and I grabbed a drink nearby and talked about R, Python, computing and academia. It was a great conversation because we both have similar interests in statistics and computer science and a similar way of using tools to solve problems. This was great to hear because lately I have been working as a data scientist in more of a software engineering capacity than a typical data scientist capacity. Since this field is still new, and since I am still new to industry, I am not sure what is “typical” yet. I also got some great advice about publications and getting back into academia. (My “life plan” is to work in industry in a research capacity and return to academia later in life.) Later in the evening I met up with some colleagues that have since graduated. I also got to meet an intern data scientist from Redfin.

My Second Day, August 1

The next day I had my usual stressful dilemma of trying to pick which sessions to attend. I was faced with the choice of non-parametrics, high dimensional learning, cheating on tests, and networks. My original interest in the field of Statistics was a fascination with Psychometrics, so I attended the “Statistics in Uncovering Administrative Cheating on Tests” session first. I only attended the first talk which was by CTB/McGraw-Hill. They were on my short list of companies I wanted to work for when I started graduate school, and some of their products include familiar names like the California Achievement Tests (CAT/6) and California Test of Basic Skills (CTBS) both of which are deprecated and replaced by TerraNova. The speaker addressed three issues: copied answers, fraudulent erasures and stolen test items. Detection of all of these depended on the three parameter logistic model common in item response theory (IRT). Detecting copying simply involved doing pairwise comparisons of multinomial distributions (the response to each item forms a multinomial random variable). Fraudulent erasures were more interesting, and occur when a teacher changes a student’s incorrect answer to a correct one. Empirically, students usually change an answer from an incorrect one to a correct one, but sometimes the opposite occurs. McGraw-Hill is able to detect fraudulent erasures as an unusually high number of modified answers from incorrect to correct compared to the student’s ability. Modern optical mark readers (“Scantrons”) report not only the selected answer, but also a parameter that shows other answers that were chosen and erased, and how dark the mark was. Stolen test items can be detected using the amount of time it took a student to mark an answer with respect to their ability. Stolen items will have a statistically significantly large residual where the student answered the item much faster than they should have given their ability. 
The speaker showed an illuminating example from the GMAT. After this first talk, I ran over to catch the high dimensional learning talk. The most interesting talk was from Peter Hall, where he considered using pairs and groups of interacting variables for machine learning prediction. Although this was interesting, this is very common in machine learning already. I then ran back to the networks talk. At that point, I was exhausted from running back and forth and could only try to listen.
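For readers who have not seen the three parameter logistic model mentioned above, here is a minimal Python sketch of its item response function. The talk did not go into enough detail to reproduce the actual detection procedures, so this only shows the standard textbook form of the model. The parameter names (theta for ability, a for discrimination, b for difficulty, c for the guessing floor) are the usual notation, and the example values are made up for illustration.

```python
import math

def p_correct_3pl(theta, a, b, c):
    """Probability of a correct answer under the three parameter logistic model.

    theta: student ability
    a: item discrimination, b: item difficulty, c: guessing floor
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item: moderate discrimination, average difficulty,
# five answer choices (so a guessing floor of 0.2).
item = {"a": 1.0, "b": 0.0, "c": 0.2}

# Expected probability of a correct answer for a low and a high ability student.
p_low = p_correct_3pl(-2.0, **item)
p_high = p_correct_3pl(2.0, **item)

# A residual compares what was observed (1 = correct) to what the model expects.
# A low ability student answering correctly leaves a large positive residual.
residual_low = 1 - p_low
```

A single large residual means nothing on its own. The idea described in the talk is that an unusually large number of them, relative to the student's ability, is what raises a flag.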

Later in the morning I attended the Internet-Scale Statistical Computing talk. For some reason, the organizers chose a room that was far too small for the interest. I cannot say I am surprised. One of the talks discussed how Google scales R. Although the talk was great, we all know that Google is probably never going to open-source what they are working on, and that was clear from their responses to some of the audience questions. Google has created a package for R called flume which they say is similar to open-source projects Cascading and Clank. They also use an abstraction they call distributed data objects which is probably some way of storing and replicating data across systems using GFS. The final talk was from Saptarshi Guha from Mozilla. Saptarshi developed the package RHIPE which is the original interface to Hadoop from R and is still in active development. Mozilla uses RHIPE to analyze Firefox crash logs and Saptarshi showed an example of using quantile regression in a RHIPE context. After lunch I attended a talk on the LASSO, but I was far too exhausted to hear about theoretical methods and applications to biological data. All I could think of is that song by Phoenix…

If there is one way to bore me, it is to make me listen to a talk about biological and medical applications of statistics.
–Ryan

My Final Day, August 2

I attended the Visualizing Complex Models talk first thing in the morning. These talks basically discussed methods for visualizing specific model types related to the speaker’s research rather than a broad overarching discussion of complex models. Models that were discussed include the hierarchical linear model (HLM), likert scales, generalized linear models (GLM), logic forests and maps. I learned about a technique called logic regression which attempts to predict a conjunction or disjunction of terms using predictors which are themselves conjunctions or disjunctions of logical atoms. I would not be surprised if this has already been done in the artificial intelligence community, but it was very cool to see it discussed from the statistics angle. The most entertaining talk came from Samuel Buttrey from the Naval Postgraduate School. He discussed a system called DaViTo which is basically a Java dashboard graphically displaying statistics about incidents occurring in Afghanistan including where they occurred (maps) and how they vary throughout the day and throughout the year. The beauty was that the backend which computed the statistics and drew the graphs was R. Buttrey had a knack for keeping the audience on its toes like a comedian. At one point he even rapped which is something I have never seen at a conference. For those inquiring minds, it went something like…

Then ya spot a fine woman sittin in your row
She’s dressed in yellow, she says “Hello,
come sit next to me you fine fellow.”

— Bust a Move, Young MC

Um, yeah.

Free Time

JSM was rare in that I always felt like I was in a rush to get from one talk to another and I felt like I had very little free time. I spent most of Thursday afternoon and Friday touring San Diego. I love the beach so I spent most of my time walking through Seaport Village and the neighboring parks. I also visited many of the tiki looking bars in the area. I also spent a few hours at the U.S.S. Midway which was a very interesting but strange experience. Apparently I am claustrophobic because not being able to see the sun while I was on the lower decks and having to crouch on the stairs was kind of scary. I also got lost in the ship’s maze of hallways and they had these dolls and animatronic figures that freaked me out. The flight deck was amazing though and I got to talk to a few of the WWII-era docents. It also provided a great view of San Diego Bay, Coronado Island and the departures and arrivals at San Diego International. It was cool getting to see such a big part of U.S. history.

Conclusion and Differences from Other Conferences

I had a great time at JSM and it was great to see what I have and have not been missing. Still, I prefer Computer Science conferences. I love mathematics, and I love getting into the hairy details… but only of my own research. I found that most JSM talks got too lost in the math and often times the speaker did not really convey the grand point.

JSM has staunch differences from other conferences I have attended logistically:

No free Internet. This is 2012. If you wanted to pay for Internet, it was $12.95 per day in the Convention Center, $12.95 per day at the Hilton (assuming you attended both venues), and then whatever your hotel charged (the Hilton, Marriott and Hyatt all charge around $11 per day). So, unless one used their phone’s Internet, it would have cost a maximum of 5 * ($12.95 * 2 + $11) = $184.50, which is approximately two months of my home cable service. Absurd.
No refreshments except at the Hilton. The multiple Starbucks were a nice touch though.
Having the conference separated into two parts requiring a ten minute walk between venues. Part of JSM was in the middle of the convention center, and another part was at the Hilton next door which was about a 10-15 minute walk in the humidity. You would see a constant stream of people, like ants, moving between the Hilton and the Convention Center to catch their next talk. Many people had to leave sessions early or arrived late because of this walk. The irony was that half the Convention Center was empty.
Far too many sessions on biology and medicine.
Not enough talks on computing and big data, though this was no surprise.
Far too many concurrent sessions (up to 50) that were not optimized into tracks, or based on the “type” of audience it would draw. There were times where there were 5 or 6 related talks I wanted to see… and others where nothing sounded interesting.
Far too many special events that required a fee to attend.
All of the poster sessions took up a full session slot.
All of the vendors were in another room. I never got to even go to the Expo because it closed early, and there was no indication of where it was. That was frustrating because I love the Springer and Wiley tables.

Next year JSM will be held in Montreal, Quebec, Canada. If I submit something and it is accepted, I may attend. Otherwise, I am happy I got to go to a JSM after all these years.

Finally, I’d like to thank my employer, GumGum, for allowing me to attend JSM 2012 during the work week.

OpenPaths and a Progressive Approach to Privacy

Ryan Rosario — Sun, 08 Jul 2012 18:00:00 +0000

OpenPaths is a service that allows users with mobile phones to transmit and store their location. It is an initiative by the New York Times that allows users to use their own data, or to contribute their location data for research projects and perhaps startups that wish to get into the geospatial space. OpenPaths brands itself as “a secure data locker for personal location information.” There is one aspect where OpenPaths is very different from other services like Google Latitude: Only the user has access to his/her own data and it is never shared with anybody else unless the user chooses to do so. Additionally, initiatives that wish to use a user’s location data must ask the user personally via email (pictured below), and the user has the ability to deny the request. The data shared with each initiative provides only location, and not other data that may be personally identifiable such as name, email, browser, mobile type etc. In this sense, OpenPaths has provided a barebones platform for the collection and storage of location information. Google Latitude is similar, but the data stored on Google’s servers is obviously used by other Google services without explicit user permission.

The service is also opt-in, that is, it does not use hidden files on the user’s phone to track location; instead, the user must launch the OpenPaths mobile application to “opt-in” to location tracking. This is a double-edged sword though. I am in the minority, but I would rather have OpenPaths transparently transmit my location data (like Google Latitude) without me having to launch an app, because then the data is more complete. Despite running the application consistently for the past month, there are some unexplainable gaps in the data. For example, somehow only a few days of data for the month of June are available. Fortunately (or unfortunately), my habits do not change that much. After logging in, the user has the ability to visualize their data, or download it in several formats: JSON, CSV or KML for Google Earth.

The user can then view his or her location history on a map. The points are colored by time of day (morning, afternoon, night), or time of week (weekday, weekend). The gradient could and should be more fine-grained to provide for better understanding of the user’s location and habits. The map has an animation setting that shows your movement throughout time. Compared to other services, the data seems very coarse, even with the finest settings. Driving at 70 mph, a data point seems to be transmitted every 20 miles or so, which is about 3-4 times per hour. All in all, the map functionality is cool, but leaves a lot to be desired currently.

OpenPaths also allows data to be collected from FourSquare if you choose to do so. Unfortunately, none of my check-ins showed up on the map. Of course, the user can choose to delete their information entirely, adding to OpenPaths’ idea of “opt-in” location sharing.

All in all, I am very excited about where OpenPaths can go. Currently, there seem to be a few technical issues with the Android app that prevent transmitting location data when the app is running in the background. I hope that OpenPaths can run more transparently and reliably on Android as development continues. The privacy policy and ownership of data are both very important and a model that others should implement. The visualization aspect is not incredibly useful or flashy right now, but the user is free to download their data and create their own visualizations or analysis.

To download an OpenPaths app for your Apple or Android phone, click here.

SIAM Data Mining 2012 Conference

Ryan Rosario — Tue, 15 May 2012 18:00:00 +0000

Note: This would have been up a lot sooner but I have been dealing with a bug on and off for pretty much the past month!

From April 26-28 I had the pleasure to attend the SIAM Data Mining conference in Anaheim on the Disneyland Resort grounds. Aside from KDD2011, most of my recent conferences had been more “big data” and “data science” oriented, and I wanted to step away from the hype and just listen to talks that had more substance.

Attending a conference on Disneyland property was quite a bizarre experience. I wanted to get everything I could out of the conference, but the weather was so nice that I also wanted to get everything out of Disneyland as I could. Seeing adults wearing Mickey ears carrying Mickey shaped balloons, and seeing girls dressed up as their favorite Disney princesses screams “fun” rather than “business”, but I managed to make time for both.

The first two days started with a plenary talk from industry or research labs. After a coffee break, there were the usual breakout sessions followed by lunch. During my free 90 minutes, I ran over to Disneyland and California Adventure both days to eat lunch. I managed to run there, wait in line, guide myself through crowds, wait in line, get my food, eat it, and run back to the conference in 90 minutes on a weekend. After lunch on the first two days was another plenary session followed by breakout sessions. The evening of the first two days was reserved for poster sessions. Saturday hosted half-day and full-day workshops.

Below is my summary of the conference. Of course, such a summary is very high level, so my description may miss things, or may not be entirely correct if I misunderstood the speaker.

Plenary Talks

Bharat Rao from SIEMENS provided the first plenary talk bright and early the first day of the conference. I only got to see the first half as I could not wake up. His talk was about privacy preserving data mining in medicine using matrix factorization. Although privacy has become an important issue in data mining, I do not totally buy that it is entirely necessary. The idea is that observations should not be personally identifiable. I personally do not agree that such privacy measures are necessary when only a computer system is using the data, and not an individual person. Besides, with such massive amounts of data, someone digging through gigs and gigs of personally identifiable data to find one person’s data does not seem like a viable threat. My thoughts are similar to those on the Netflix grand challenge dataset lawsuit.

The second plenary talk came from Noshir Contractor. The main point of his work seemed to be how to build effective teams using graphs and data about each of the candidates for such a team. This did not excite me itself, but it was the data his team used that excited me and some of the stuff they learned from it. The first part of the talk discussed research into NSF grants and the types of collaboration that are more likely to lead to the awarding of such grants. His group found that women were more likely to be collaborators on awarded proposals and that multidisciplinary teams were more likely to be funded. Some analogous work involved the detection of “gold farmers” on the MMORPG Everquest 2. Gold farming involves gathering virtual goods and selling them for real cash. Interestingly, Contractor’s group found that the graph signatures present in gold farming are remarkably similar to those present with drug trafficking. There were a few other interesting tidbits that the group found. They found that a great number of players only play with friends and are somewhat disconnected from the rest of the game graph. Also, male-male and female-male graph links were very common, but female-female links were uncommon. Contractor hypothesized that the male-male relationships were obvious (men are more likely to play computer games) and that women often play the game with men because it was the only way for them to get time with their significant others.

The Friday morning talk on transfer learning came from Qiang Yang from Hong Kong University. Transfer learning in this context discussed how to adapt models developed in one domain to data from another domain. Transfer learning seems to be picking up steam in Machine Learning, but anybody with training in Statistics can tell you that it really is just latent variable analysis. Of course, transfer learning applies more to learning classifiers than building descriptive models of data. The speaker’s proposed method is called Transfer Component Analysis (TCA) which is similar to, of course, Principal Component Analysis (PCA). Yang found that semi-supervised TCA was useful for sentiment analysis in a transfer learning context. A common use of transfer learning is mapping a text classifier to an image classifier where we have few labeled instances in the image domain. We can then use unlabeled source data (text) in a semi-supervised way to create a better classifier in the image domain.

The last plenary talk came from Susan Dumais from Microsoft Research who discussed temporal dynamics and information retrieval. The talk basically discussed how to mine important concepts over time from data streams. One part of her research was discovering the staying power of certain words. Susan has noticed four distinct word behaviors based on how the density of the word’s usage changes over time: fast, hybrid, medium, and slow. Susan’s research also studies how often people revisit certain webpages and why. Presumably revisits are an alternative measure of influence to in-links and out-links used in PageRank (remember, Microsoft has its own anti-Google search engine). Studying temporal behavior of web visits and keyword usage is important because current methods consider only a snapshot of the web with very little evolution. Susan stated that a great page is defined as a mixture of bags of words that are formed based on page changes. Such research is important because query relevance changes over time. For example, a query of US Open refers to golf at certain times of the year and tennis at others. The query March Madness should probably return ticket prices before the event, scores during the event, and Wikipedia or sports articles recapping the event after the event.

Social Media

Social media has a session at pretty much every academic conference these days. The speakers in this session used social media data to test their hypotheses and they are always interesting. One talk discussed a feature selection technique for social processes using data from Twitter. The method used in the paper uses user-post relations (favorites, retweets, replies) and user-user relations (following etc.). The second talk used heat diffusion models to model the diffusion, cascading and propagation of ideas. The researchers were also interested in discovering or predicting the “tipping point” (or burst of activity, in their words) of a social phenomenon. Another talk discussed credibility in a social network and how credible and incredible information spreads. This work particularly discussed rumors and fake events such as the untimely death of Justin Bieber. Some of the questions investigated were: how can we filter these fake events out of the timeline? How do such rumors spread? The final talk in this session was a bit of an odd duck: how to build a team using social network analysis. The purpose of that work was to balance skillsets in a team and enhance collaborative compatibility.

Pattern Mining

The Thursday afternoon session I attended had a very generic name considering all of data mining is about finding patterns. Really, it should have been called “association rule mining.” Unfortunately, this session was fairly dry and was my least favorite of the conference. The one talk that really stood out to me discussed how to mine association rules out of long temporal events. Such association rules consisted of “episodes” which were partial orders on the graph of the event. The type of association rules considered were basically motifs — subsequences of interesting events that occurred within a long event.

Kernels and Classification

The first two talks in this session discussed multi-label classification, which is distinct from multi-class classification. In multi-class classification, we have multiple classes and each instance can belong to one, and only one class. In multi-label classification, each instance can belong to one or more classes/labels. Multi-label classification exploits correlation information among labels whereas independent classifiers do not. The first talk discussed how to use multi-label classification when there are multiple objectives. For example, when buying a cell phone, we may want to minimize price, and maximize battery life. The second talk discussed dimension reduction for multi-label classification and coupling feature selection with modeling. Another talk attempted to study the theoretical principles behind pruning and grafting in decision trees. The C4.5 software does pruning and grafting, but its theoretical properties are not well understood. The last talk discussed augmenting matrix factorization with graph information and other metadata prior to building a model. For example, for a movie recommendation problem, one factor would be a movie and another factor would be a user. These factors can be combined into a Bayesian model that can be scaled up better than other existing methods.
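To make the multi-class versus multi-label distinction above concrete, here is a small Python sketch of how the targets differ. The data and label names are invented for illustration and have nothing to do with the talks themselves.

```python
# Hypothetical toy data: three phone reviews and the topics they touch on.
LABELS = ["battery", "camera", "price"]

# Multi-class: every instance belongs to one, and only one, class.
multiclass_targets = ["battery", "price", "camera"]

# Multi-label: every instance belongs to one or more labels.
multilabel_targets = [
    {"battery", "price"},
    {"camera"},
    {"battery", "camera", "price"},
]

def to_indicator(label_sets, labels):
    """Encode each label set as a row of 0/1 flags, one column per label."""
    return [[1 if label in s else 0 for label in labels] for s in label_sets]

# Multi-label rows may contain several 1s.
multilabel_indicator = to_indicator(multilabel_targets, LABELS)

# The multi-class case is the special case where every row has exactly one 1.
multiclass_indicator = to_indicator([{y} for y in multiclass_targets], LABELS)
```

Training one independent classifier per column of the multi-label matrix ignores the fact that columns tend to be correlated (battery and price showing up together, say), which is the information multi-label methods try to exploit.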

Transfer Learning

As I mentioned earlier, the goal of transfer learning is to map a model used in one domain to another similar domain. The classic example is classifying images using models trained on text data and some labeled images — both domains are reduced to a common set of concepts. The talks in this session mainly talked about advances in latent variable analysis. I kept finding myself confused and wondering, “why is this considered groundbreaking?” The work presented in this session basically used existing models for transfer learning. The first few talks discussed using Latent Dirichlet Allocation (LDA) to map data into concepts, and then the third talk discussed Hierarchical Latent Dirichlet Allocation (hLDA) which could be used for taxonomies and hierarchies of concepts. Although Transfer Learning is very useful, I did not find it to be all that groundbreaking. Of course, using text and images as the source and target domains is not incredibly interesting. I think Transfer Learning could be revolutionary if it could be applied to two very different domains.

Full Day Workshop: Text Mining

Of course, if there is a text mining talk, I will attend it. The workshop was led by David W. Berry from University of Tennessee, Knoxville. The keynote speaker was Malu Castellanos from Hewlett-Packard Labs. Malu’s talk was amazing. She discussed a live customer intelligence system that is used for intent and sentiment analysis on various channels. Working with text is not easy. She began with a discussion of the many challenges in sentiment analysis including deceitful adjectives (despicable is negative, but Despicable Me is a proper noun that is not negative), dependency relations (wicked as slang for “good” vs. wicked witch), comparisons (x is better than y), spam, sarcasm, coreferences (use of the word it), special expressions and emoticons (LOL, ;-)), and context dependencies (predictable movie is negative whereas predictable weather may be positive). What was particularly illuminating about Malu’s talk was that she was fairly candid about how complex HP’s sentiment analysis system is. The system does not use one model for sentiment. Different models are used to handle different kinds of tweets and based on their classifications, these tweets are ushered off to other models for further classification. For example, comparative statements are treated distinctly by the system. There may be a naive Bayes step that classifies the text as comparative or not, and then sends the tweet for further processing. She mentioned something about using special processing such as linear programming and generalized additive models (GAM) to take words such as BUT, AND etc. into account. GAMs seem rare to encounter in text mining. Some other features of the system include sentiment intensity (really good vs. good) and clustering similar words by using temporal histograms (tomorrow and 2morrow have similar usage patterns).

The first talk was from David Skillicorn, who recently published a book about mining large datasets. He discussed how to pick documents out of a corpus that are the most interesting. The second talk was given by a brave undergraduate student on query expansion. He did a very good job, but what was strange about this talk was that it used… Latent Semantic Indexing (…from 1990…) rather than one of the more useful and iterative models such as LDA. This brings me to my first personal “weird moment” about this workshop. There was very little discussion about modern (post 2000) topic models. This is very strange to me. Just a few months earlier, topic models were all the rage at KDD 2011. After the lunch break, there were talks about incremental online clustering of documents and discovery of patent trolls. The final sessions of the afternoon discussed extraction of hierarchies for increasing performance of multi-labeled classifiers and automatically evaluating text summarizers. Only one of the presentations in this workshop seemed to be attached to a paper.

I do not want to be critical because I am sure a lot of work goes into planning such events. I just found this workshop to be a bit weird. A lot of the methods used in the papers were quite old fashioned for text mining (LSI, regression) and the applications were also quite old-school (patents and legal documents just scream the old-fashioned use of information retrieval… library cataloging). It also seemed like a disproportionate number of the speakers had a prior relationship with the workshop chair. I am also not used to a workshop with so few associated papers.

Concluding Thoughts

This was a data mining conference so of course I enjoyed it. I must say though that the vibe was very different from some of the other conferences I have attended like KDD and IJCAI. Most of the speakers came from overseas, and as someone with hearing loss, it was very difficult to understand many of the speakers. It also seemed like there were very few people just attending the conference. It seemed like the majority of the people at the conference were presenting, or had a poster etc. and that is different from what I am used to. Because of that, I felt like the usual community feel was a bit missing. Additionally, there was no mention of Hadoop or R, and I found that a bit concerning since every other conference I have been to has speakers that are proud to contribute to those open-source projects. And then there was the weird text mining workshop (could have just been an off-year). Could it be because SIAM is a mathematics group? Not sure. All in all, I still had a great time and learned a lot as always.

Of course, my attendance would not have been possible without sponsorship and support from my company, GumGum. I attended this conference as part of my position as Data Scientist.

Disneyland

Of course, the white elephant in every room of the conference was the fact that Disneyland was only a 5-10 minute walk away. I got a 2-day park hopper pass and spent my lunch hours and evenings at both Disneyland and California Adventure. It really is the Happiest Place on Earth. Just being there I forget about stress and the things that worry me. I had a great time walking around and watching all the kids have fun. At Disneyland I only went on a few rides: Space Mountain, Pirates of the Caribbean, Haunted Mansion, It’s a Small World and the Disneyland Railroad (not really a ride). I also got to ride the Monorail for the first time. At California Adventure I only did the California Screaming roller coaster and Soaring Over California which features my hometown (the part with the orange orchards). Unfortunately, I missed Tom Sawyer Island again. I will have to go there first next time!


The view of California Adventure from my hotel room!	A room just for kids.	Surfer Goofy at the lobby entrance.

My Interview about the Statistics Major

Ryan Rosario — Fri, 16 Mar 2012 20:23:25 +0000

Recently, I participated in an email interview about what being a Statistics major entailed, how I got interested in the field and the future of Statistics. I figured this might be of interest to those that are contemplating majoring in Statistics, or considering a career in Data Science.

Q1: Why did you decide to pursue a major in statistics in college?

A: “When I was a kid, I really enjoyed looking at graphs, plots and maps. My parents and I could not make out what was behind the interest. At the same time, I was also heavily interested in education. My mother was a teacher and the first set of statistics I ever encountered were standardized test scores. I strived to understand what the scores attempted to say about me, and why such scores and tests are so trustworthy. When the stakes increased with the AP and SAT exams, I began reading articles published by the Educational Testing Service and learned a ton about how these tests are constructed to minimize bias, and how scores are comparable across forms. It fascinated me how much science goes into these tests, but at the end of the day they are still just one factor in the whole picture of a student. This niche interest led me to statistics, psychometrics in particular, and although I no longer study psychometrics, I found what I learned to be incredibly valuable.”

Q2: “I noticed you have bachelor’s, master’s, and doctoral degrees in statistics. How did your graduate study build on what you learned in your undergraduate program?”

A: “For me, the undergraduate and graduate programs were night and day. The undergraduate program focused more on modeling and data analysis. The graduate program focused more on thinking about data and how to develop a scientific “common sense” about how to work with, express and make automated decisions based on data. The graduate program was much more mathematically and computationally intensive than the undergraduate major. My graduate study actually built more on my mathematics major in college because many of the concepts in graduate statistics require knowledge of linear algebra, numerical analysis and real analysis. Fortunately, our statistics major requires upper division math courses.”

Q3: What was the most interesting part of majoring in statistics? What did you find most challenging?

A: “The most interesting part of majoring in statistics was seeing how many fields can grow and transform based on insights from data and statistics. In my case, I found it most interesting seeing how it integrates and interacts with computer science. Every time someone surfs through Facebook, enters a Google search, or looks at an item on Amazon, data about what they are doing are collected and algorithms process this data to enrich the experience by, say, recommending books and offering special deals on Amazon, recommending friends and showing relevant stories on Facebook, and the most groundbreaking of all: returning relevant search results.

The most challenging part for me was the mathematical theory. Although I loved math, I sometimes had trouble connecting the theory to the application, and statistics is such an applied field. I look at it as a rite of passage and once I saw enough theory relevant to my interests, the learning process became easier.”

Q4: How do you apply what you learned in your statistics education in your current line of work?

A: “It is ironic, but it is the more basic concepts of statistics and probability that I use every day rather than the complicated models I learned. Concepts such as independence, confidence, power, accuracy etc. are important building blocks for building my own models, or for choosing an existing one from those that I learned in school.

I always start with some exploratory analysis such as computing some statistics and making plots that show relationships clearly. Then I set explicit guidelines for the input and output of the model I want to build and note any critical assumptions that are violated or that must be met. I then try several different methods and models and validate their results using common metrics taught in undergraduate statistics before settling on a final model configuration.”

Q5: What skills did you learn in the statistics major that you find useful for work and everyday life?

A: “The training in mathematics I received as part of the statistics major taught me how to think logically, and this is very important in my work in computer science. I think patience was another very important skill I learned. I love what I do, and sometimes I take for granted that others have the same mathematical training that I do because I am so entrenched in it. Through my experience teaching as well as consulting as a student, I gained a better sense of the challenges and difficulties many people face when thinking about and interpreting statistics and how to better communicate results and ideas.”

Q6: Any advice for students who are considering majoring in statistics?

A: “My advice for students majoring in statistics is to choose an additional major or minor that uses statistics and is of interest to the student. I do not consider statistics to be a “standalone” major. When interviewing for a job, employers want to know why an interviewee is passionate about their company. For example, if interviewing for a finance company, the company wants to hear about passion for finance, and see education or experience in such fields. Another way to accomplish this instead of double majoring is to do some internships, projects or research in a field of interest.”

Conclusion: Finally, could you tell me a little about yourself for an intro bio we will include before the Q&A interview? For instance, what university(ies) did you attend, what degree(s) have you earned, what is your current job title, where do you work and for how long (you can be general here, or include a link to your professional website or blog if you have one), and what are your career goals?

A: “I attended University of California, Los Angeles (UCLA) for my B.S. (Statistics, Mathematics of Computation), two M.S. (Statistics and Computer Science) and Ph.D. (Statistics). I currently work for an Internet advertising startup in Santa Monica, CA as Chief Data Scientist/Research Engineer, and have been working in the field for three years. Whenever I get a free moment, I write about statistics, data mining and computer science topics on my blog at http://www.bytemining.com. I plan on dedicating the rest of my life to working with and communicating about data and turning online phenomena into knowledge that can be used to progress technology and change the world!”

“Hold Only That Pair of 2s?” Studying a Video Poker Hand with R

Ryan Rosario — Sun, 08 Jan 2012 09:32:00 +0000

Whenever I tell people in my family that I study Statistics, one of the first questions I get from laypeople is “do you count cards?” A blank look comes over their face when I say “no.”

Look, if I am at a casino, I am well aware that the odds are against me, so why even try to think that I can use statistics to make money in this way? Although I love numbers and math, the stuff flows through my brain all day long (and night long), every day. If the goal is to enjoy and have fun, I do not want to sit there crunching probability formulas in my head (yes that’s fun, but it is also work). So that leaves me at the video Poker machines enjoying the free drinks. Another positive about video Poker is that $20 can sometimes last a few hours. So it should be no surprise that I do not agree with using Poker to teach probability. Poker is an extremely superficial way to introduce such a powerful tool and gives the impression that probability is a way to make a quick buck, rather than as an important tool in science and society. The only time that I have used Poker in teaching (besides when required), is to cover the hypergeometric distribution and sampling without replacement.

Since I took Intro Probability Theory, I have always wondered what to do in the following situation. Say a pair of cruddy low cards appear on the first draw. The game only awards money for pairs of jacks or better. If all I have in the hand is a pair of low cards and no face cards, my decision is easy: hold the pair of low cards. But what if there is at least one face card showing (no other pairs)? Pictorially this looks like

The conundrum:

Hold the two low cards and deal, hoping for a three of a kind, or
Hold the two low cards AND one of the face cards, hoping for a three of a kind, OR a pair of Jacks or Better.

Under each of these decisions, which yields the highest probability of winning something and which one yields the highest payout? This problem can be solved exactly by using combinatorics, conditional probability and expectation, but since a video poker game is basically a simulator (though likely biased), I wrote my own simulation. For the answer, scroll to the end!

Data Structure

In most card games, we would want to store the state of the game: the outstanding cards in the deck(s), and the hand(s) of each player. In standard video poker, there is one deck, and one player, so only the player hand needs to be recorded because every card in the deck is either in the hand, or it is not. One obvious way to represent the hand is as an array of denomination/suit tuples. Unfortunately, this data structure requires other data structures to store the possible suits, and possible denominations. It is also more tedious to detect certain kinds of wins. For this simulation, I use a 13 x 4 matrix where each row is a different denomination, and each column is one of the four suits. This matrix allows us to easily see which cards are possible to be dealt. Additionally, this matrix, as well as vector-based languages such as R, make it easy to detect wins. Such a matrix looks like the following for the hand 2♠ 5♣ 8♥ 8♣ A♦

where $C_{ij}$ denotes a card, $i$ is the denomination, $j$ is the suit and $H$ is the player’s hand in question.
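As a minimal sketch of this data structure in Python (not the R code used for the simulation), the 13 x 4 indicator matrix for the hand 2♠ 5♣ 8♥ 8♣ A♦ could be built as follows. The names `RANKS`, `SUITS` and `hand_matrix` are hypothetical, and the column order of the suits is an arbitrary assumption.

```python
# Rows are denominations (index 0 = 2, ..., index 12 = Ace); columns are suits.
RANKS = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]
SUITS = ["♠", "♥", "♦", "♣"]  # assumed column order; any fixed order works


def hand_matrix(cards):
    """Return a 13 x 4 matrix of 0/1 flags; 1 means the card is in the hand."""
    matrix = [[0] * len(SUITS) for _ in RANKS]
    for rank, suit in cards:
        matrix[RANKS.index(rank)][SUITS.index(suit)] = 1
    return matrix


hand = [("2", "♠"), ("5", "♣"), ("8", "♥"), ("8", "♣"), ("A", "♦")]
H = hand_matrix(hand)
row_sums = [sum(row) for row in H]        # number of cards per denomination
col_sums = [sum(col) for col in zip(*H)]  # number of cards per suit
print(row_sums)
print(col_sums)
```

The row and column sums are the only quantities needed for most of the win checks, which is what makes this layout convenient. Note that Python indexes rows from 0, so the Jack through Ace rows are indices 9 through 12 here.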

Detecting Wins

Poker wins are not disjoint. A three of a kind involving Jacks is also a pair of Jacks or better, etc. When checking wins, I start with the lowest paying win, and move up to Royal Flush, only keeping track of the highest win. Thus, this algorithm detects a four-of-a-kind involving Queens as Jacks or Better, two pairs of Queens, and a three-of-a-kind of Queens, but only counts it as the highest win, the four-of-a-kind.

Pair of Jacks or Better: a pair of Jacks, Queens, Kings or Aces. In A, this is simply the condition that at least one row in rows 10 through 13 has a row sum greater than 1.
Two pair: two pairs of anything. In A, this is the condition that at least two rows have a sum greater than 1.
Three of a kind: three of any card. In A, this is the condition that at least one row has a sum of at least 3.
Straight: all 5 cards can be permuted such that they form an ascending sequence: A, 2, 3, 4, 5, 6, 7, 8, 9, 10, J, Q, K, A. This case is interesting and will be discussed in a bit.
Flush: all 5 cards are of the same suit. In A, this is the condition that at least one column has a sum of at least 5.
Full House: one three-of-a-kind, and a pair of anything. In A, this is the condition that a row has sum 3, and another row has sum 2.
Four of a Kind: 4 of any card. In A, this is the condition that a row has sum 4.
Straight Flush: the 5 cards can be permuted to form an ascending sequence and are all of the same suit. In A, this is simply the condition that we have a straight and a flush in the same hand.
Royal Flush: a straight flush with the Ace as the high card. In A, this is simply the condition that we have a straight flush AND the sum of row 13 is 1.

Of course, this “short circuit logic” only works for a game containing 5 cards. Also, note that under my scenario (a pair of low cards is dealt first), it is never possible to have a straight, flush, royal flush, or straight flush as the highest win. Also, it is not possible to have Jacks or Better as the highest win because we already have one pair (low cards), and if we happen to draw a pair of Jacks or Better, we then have two pairs as the highest win.
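The row-sum and column-sum conditions listed above can be sketched in Python as below. This is an illustration under assumptions, not the original R code: `best_count_win` and `matrix_from_cells` are hypothetical names, rows are 0-indexed (so Jack through Ace are indices 9 to 12), and straights are deliberately left out because they need a separate check.

```python
def matrix_from_cells(cells):
    """Build a 13 x 4 matrix of 0/1 flags from (row, column) index pairs."""
    matrix = [[0] * 4 for _ in range(13)]
    for row, col in cells:
        matrix[row][col] = 1
    return matrix


def best_count_win(matrix):
    """Return the highest win detectable from row and column sums, or None."""
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]
    best = None
    # Check from the lowest-paying win upward; a later match overwrites an earlier one.
    if any(s > 1 for s in row_sums[9:13]):      # J, Q, K, A rows
        best = "jacks or better"
    if sum(1 for s in row_sums if s > 1) >= 2:
        best = "two pair"
    if any(s >= 3 for s in row_sums):
        best = "three of a kind"
    if any(s >= 5 for s in col_sums):
        best = "flush"
    if 3 in row_sums and 2 in row_sums:
        best = "full house"
    if 4 in row_sums:
        best = "four of a kind"
    return best


# Two 5s and three Queens.
full_house = matrix_from_cells([(3, 0), (3, 1), (10, 0), (10, 2), (10, 3)])
print(best_count_win(full_house))
```

Because every check runs and only the last match is kept, the ordering of the `if` statements is what encodes the payout ranking.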

Detecting the Straight: In A, we have a straight when five successive rows have sum equal to 1. We can do this iteratively, but there is a better way. Note that if all of the row sums are 0 or 1, we can treat the vector of row sums as a binary number and convert it to its integer representation. Each binary number has 13 bits. If we let 2 be the zeroth power, then straights will lead to the following binary and integer representations:

Bug alert: It just occurred to me that there are many more wrap-around straights such as Q, K, A, 2, 3. This will be fixed this evening.

From basic computer science and number theory, every natural number can be written as the sum of distinct powers of 2 and the representation of such an integer is unique. Furthermore, the sum of $n$ successive powers of 2 is divisible by $2^n - 1$. After some experimentation I came up with the following rule: if all of the row sums are 0/1 and the integer representation of this binary vector is divisible by $2^5 - 1 = 31$, then A is a straight. The only straight that does not fit this pattern is the wrap-around straight: J, Q, K, A, 2 which can be checked manually.
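The binary encoding can be sketched in Python as follows. Five successive powers of 2 starting at $2^k$ sum to $31 \cdot 2^k$, which is where the divisibility test comes from. One caveat: divisibility by 31 alone also holds for some non-consecutive hands (2, 3, 4, 5, J encodes to 527 = 31 × 17), so this sketch additionally requires the quotient to be a power of two. The function name `is_straight` is hypothetical, and treating only the Ace-low hand A, 2, 3, 4, 5 as the manually checked case is an assumption of this sketch.

```python
def is_straight(row_sums):
    """Check for five consecutive denominations using the binary encoding.

    `row_sums` has 13 entries (index 0 = 2, ..., index 12 = Ace).
    """
    # A straight needs five different denominations, so every row sum is 0 or 1.
    if any(s not in (0, 1) for s in row_sums) or sum(row_sums) != 5:
        return False
    # Treat the row sums as bits: the 2 is the zeroth power, the Ace the twelfth.
    value = sum(bit << power for power, bit in enumerate(row_sums))
    if value % 31 == 0:
        quotient = value // 31
        # 31 * 2**k is exactly five adjacent bits; reject other multiples of 31.
        if quotient & (quotient - 1) == 0:
            return True
    # Assumption: the Ace-low straight A, 2, 3, 4, 5 is checked by hand.
    return value == (1 << 12) + 0b1111


def rows(*indices):
    """Helper for the examples: row sums with a 1 at each given index."""
    return [1 if i in indices else 0 for i in range(13)]


print(is_straight(rows(0, 1, 2, 3, 4)))     # 2, 3, 4, 5, 6
print(is_straight(rows(0, 1, 2, 3, 9)))     # 2, 3, 4, 5, J
```

Wrap-around hands such as J, Q, K, A, 2 are not treated as straights here; adding them would mean comparing `value` against each wrapped bit pattern explicitly.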

The Algorithm

Randomly generate a hand containing a pair of low cards (2-10) and at least one face card.
Hold the pair of low cards. Under strategy 2, hold one (and only one) of the face cards.
Discard the unheld cards from the deck and draw 2 or 3 cards at random from the same deck.
Check for wins.
Increment a win counter.
Repeat steps 1-5 tons of times, recording the percentage of hands that yielded a win, of the n games/hands played.
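The steps above can be sketched as a small Monte Carlo simulation in Python. This is an illustration under stated assumptions rather than the simulation used for the results: all function names are hypothetical, Aces are counted with the face cards (since they pay under Jacks or Better), the starting hand is generated by rejection sampling, and hands are stored as (rank, suit) tuples for brevity instead of the 13 x 4 matrix.

```python
import random
from collections import Counter

DECK = [(rank, suit) for rank in range(13) for suit in range(4)]  # rank 0 = 2, 12 = Ace
FACE = 9  # ranks 9..12 are J, Q, K, A (assumption: Ace counts as a face card)


def classify(hand):
    """Return the best rank-based win; straights and flushes cannot occur
    here because a pair is always held."""
    counts = Counter(rank for rank, _ in hand)
    shape = sorted(counts.values(), reverse=True)
    if shape[0] == 4:
        return "four of a kind"
    if shape[:2] == [3, 2]:
        return "full house"
    if shape[0] == 3:
        return "three of a kind"
    if shape[:2] == [2, 2]:
        return "two pair"
    if any(c == 2 and r >= FACE for r, c in counts.items()):
        return "jacks or better"
    return None


def deal_starting_hand(rng):
    """Step 1: redeal until the hand is one low pair plus at least one face card."""
    while True:
        hand = rng.sample(DECK, 5)
        counts = Counter(rank for rank, _ in hand)
        pairs = [r for r, c in counts.items() if c == 2]
        # Four distinct ranks among five cards means exactly one pair.
        if len(counts) == 4 and pairs[0] < FACE and any(r >= FACE for r in counts):
            return hand, pairs[0]


def play(strategy, rng):
    """Steps 2-4: hold, discard, redraw from the remaining 47 cards, check for a win."""
    hand, pair_rank = deal_starting_hand(rng)
    held = [card for card in hand if card[0] == pair_rank]
    if strategy == 2:
        held.append(rng.choice([card for card in hand if card[0] >= FACE]))
    remaining = [card for card in DECK if card not in hand]  # discards are not redrawn
    return classify(held + rng.sample(remaining, 5 - len(held)))


def win_rate(strategy, hands=10000, seed=0):
    """Steps 5-6: fraction of hands that end in any win."""
    rng = random.Random(seed)
    return sum(play(strategy, rng) is not None for _ in range(hands)) / hands


print(win_rate(1), win_rate(2))
```

Calling `win_rate` repeatedly with different seeds gives a distribution of win percentages per strategy, analogous to playing many games of a fixed number of hands.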

Results: Hold the Pair of Low Cards Only

My usual strategy is to always hold the low pair and take one face card along for the ride. That way, I hopefully match one of the two denominations I hold. My parents on the other hand, always told me to hold the low pair only, because that gives one more card (degree of freedom) for a win. It turns out they were right. Each game consisted of 1,000 hands. A percentage of these hands yields a win. This percentage is a random variable, so I ran this simulation to play 1,000 games. The table below shows the distribution of the win percentages.

Note that under strategy 1 (hold low pair only), all wins are more likely than under strategy 2! Of course, the estimate in the last column is an average; the mean in this case. The plot below shows the distribution of win percentages for both strategies.

The Code

The code for my simulation is below. Note that it can easily be modified for your own target hands of interest. In my simulation, certain functions were never used because certain winning hands were not possible.

DISCLAIMER: I did this for fun, and it is possible that there are bugs or problems with my code, algorithm or simulation. The results seem correct because empirically I seem to do about the same using either strategy, and from a gambling perspective, an 8% discrepancy is not likely to set off bells in the head.

Merry Christmas 2011 From Byte Mining!

Ryan Rosario — Sat, 24 Dec 2011 19:28:44 +0000

To all of my readers and followers, I wish you a very Merry Christmas and a very joyous and safe Happy New Year! This year, I am thankful for the community that has sprung up around Data Science and open-source data collection and processing. This blog is almost two years old, and like with Twitter, I have been able to communicate with many data scientists, enthusiasts and some of the most prolific contributors to the data science software community. I am thankful for all of the wonderful people I have met and have yet to meet, and for your comments and reading.

Parsing Wikipedia Articles: Wikipedia Extractor and Cloud9

Ryan Rosario — Mon, 28 Nov 2011 19:00:00 +0000

Lately I have been doing a lot of work with the Wikipedia XML dump as a corpus. Wikipedia provides a wealth of information to researchers in easy to access formats including XML, SQL and HTML dumps for all language properties. Some of the data freely available from the Wikimedia Foundation include

article content and template pages
article content with revision history (huge files)
article content including user pages and talk pages
redirect graph
page-to-page link lists: redirects, categories, image links, page links, interwiki etc.
image metadata
site statistics

The above resources are available not only for Wikipedia, but for other Wikimedia Foundation projects such as Wiktionary, Wikibooks and Wikiquote.

As Wikipedia readers will notice, the articles are very well formatted and this formatting is generated by a somewhat unusual markup format defined by the MediaWiki project. As Dirk Riehle stated:

There was no grammar, no defined processing rules, and no defined output like a DOM tree based on a well defined document object model. This is to say, the content of Wikipedia is stored in a format that is not an open standard. The format is defined by 5000 lines of php code (the parse function of MediaWiki). That code may be open source, but it is incomprehensible to most. That’s why there are 30+ failed attempts at writing alternative parsers.

For example, below is an excerpt of Wiki-syntax for a page on data mining.

'''Data mining''' (the analysis step of the '''knowledge discovery in databases''' process,<ref name="Fayyad"> or KDD), 
a relatively young and interdisciplinary field of [[computer science]]<ref name="acm" />
{{cite web|url=http://www.sigkdd.org/curriculum.php |title=Data Mining Curriculum |
publisher=[[Association for Computing Machinery|ACM]] [[SIGKDD]] |date=2006-04-30 |accessdate=2011-10-28}}
</ref><ref name=brittanica>{{cite web | last = Clifton | first = Christopher | title = Encyclopedia Britannica: Definition 
of Data Mining | year = 2010 | url = http://www.britannica.com/EBchecked/topic/1056150/data-mining | 
accessdate = 2010-12-09}}</ref> is the process of discovering new patterns from large [[data set]]s 
involving methods at the intersection of [[artificial intelligence]], [[machine learning]], [[statistics]] and 
[[database system]]s.<ref name="acm"> The goal of data mining is to extract knowledge from a data set in a 
human-understandable structure<ref name="acm" /> and involves database and [[data management]], 
[[Data Pre-processing|data preprocessing]], [[statistical model|model]] and [[Statistical inference|inference]] 
considerations, interestingness metrics, [[Computational complexity theory|complexity]] considerations, post-processing 
of found structure, [[Data visualization|visualization]] and [[Online algorithm|online updating]].<ref name="acm" />

I was epically worried that I would spend weeks writing my own parser and never complete the project I am working on at work. To my surprise, I found a fairly good parser. Since I am working on named entity extraction and ngram extraction, I wanted to only extract the plain text. If we take the above junk and extract only the plain text, we would get

Data mining (the analysis step of the knowledge discovery in databases process, or KDD), a relatively young 
and interdisciplinary field of computer science is the process of discovering new patterns from large data sets 
involving methods at the intersection of artificial intelligence, machine learning, statistics and database systems. 
The goal of data mining is to extract knowledge from a data set in a human-understandable structure and involves 
database and data management, data preprocessing, model and inference considerations, interestingness
metrics, complexity considerations, post-processing of found structure, visualization and online updating.

and from this we can remove punctuation (except sentence terminators .?!), convert to lower case and perform other pre-processing text mining steps. There are many, many Wikipedia parsers of various qualities. Some do not work at all, some work only on certain articles, some have been abandoned as incomplete and some are slow as molasses.
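As a rough sketch of that pre-processing step in Python, the function below lowercases the extracted plain text and drops punctuation other than the sentence terminators. The name `preprocess` is hypothetical, and replacing punctuation with a space (so that hyphenated words split in two) is a design assumption that may not suit every task.

```python
import re


def preprocess(text):
    """Lowercase text and remove punctuation except the terminators . ? !"""
    text = text.lower()
    # Replace anything that is not a word character, whitespace or . ? ! with a space.
    text = re.sub(r"[^\w\s.?!]", " ", text)
    # Collapse the runs of whitespace left behind by removed characters and line breaks.
    return re.sub(r"\s+", " ", text).strip()


sample = (
    "Data mining (the analysis step of the knowledge discovery in databases "
    "process, or KDD), a relatively young and interdisciplinary field."
)
print(preprocess(sample))
```

Keeping the terminators makes it possible to split the result into sentences later, which matters for tasks such as ngram extraction that should not cross sentence boundaries.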

I was delighted to stumble upon Wikipedia Extractor, a Python library developed by Antonio Fuschetto, Multimedia Laboratory, Dipartimento di Informatica, Università di Pisa, that extracts plain-text from the Wikipedia XML dump file. The script is heavily object-oriented, and it is very easy to modify and extend for other purposes. For me, it is the easiest parser to use and yields the best quality output although there are other options.

Pros

Very easy to run; it’s just a Python script.
Yields high quality output; no stray wikisyntax garbage.
Highly object-oriented; easy to extend and embed in text mining projects.
Object-oriented style makes it easier to parallelize with lightweight processes (written by the user).
Allows specifying the maximum size of each produced file (good for sending to S3).
It is written in Python.

Cons

Far too slow. Python profilers show major overhead involved in regex search and replace, and string replacement.
Is not perfect, but one of the best I have seen. For some reason, Wikilinks are converted to HTML links. Correcting this required modifying the source code.
Retooling the package to work with Hadoop Streaming is not too difficult, but requires some work and grokery that should be easier.

Wikipedia Extractor is good for offline analysis, but users will probably want something that runs faster. Wikipedia Extractor parsed the entire Wikipedia dump in approximately 13 hours, on one core, which is quite painful. Add in further parsing and the processing time becomes unbearable even on multiple cores. A Hadoop Streaming job using Wikipedia Extractor as well as too much file I/O between Elastic MapReduce and S3 required 10 hours to complete on 15 c1.medium instances.

Ken Weiner (@kweiner) recently re-introduced me to the Cloud9 package by Jimmy Lin (@lintool) of Twitter which fills in some of these gaps. I avoided it at first because Java is not the first language I like to turn to. Cloud9 is written in Java and designed for use with Hadoop MapReduce in mind. There is a method within the package that explicitly extracts the body text of each Wikipedia article. This method calls the Bliki Wikipedia parsing library. One common problem with these Wikipedia parsers is that they often leave syntax still in the output. Jimmy seems to wrap Bliki with his own code to do a better job of extracting high quality text only output. Cloud9 also has counters and functions that detect non-article content such as redirects, disambiguation pages, and more.

Developers can introduce their own analysis, text mining and NLP code to process the article text in the mapper or reducer code. An example job distributed with Cloud9 which simply counts the number of pages in the corpus took approximately 15 minutes to run on 8 cores on an EC2 instance. A job that did more substantial work required 3 hours to complete, and once the corpus was refactored as sequence files, the same job took approximately 90 minutes to run.

Conclusion

I am looking forward to playing with Cloud9 some more… I will take 90 minutes over 10 hours any day! Wikipedia Extractor is an impressive Python package that does a very good job of extracting plain text from Wikipedia articles and for that I am grateful. Unfortunately, it is far too slow to be used on a pay-per-use system such as AWS or for quick processing. Cloud9 is a Java package designed with scalability and MapReduce in mind, allowing much quicker and more wallet friendly processing.

LexisNexis Open-Sources its Hadoop Alternative

Ryan Rosario — Sat, 10 Sep 2011 03:33:25 +0000

A month ago, I wrote about alternatives to the Hadoop MapReduce platform and HPCC was included in that article. For more information, see here.

LexisNexis has open-sourced its alternative to Hadoop, called High Performance Computing Cluster. The code is available on GitHub. For years the code was restricted to LexisNexis Risk Solutions. The system contains two major components:

Thor (Thor Data Refinery Cluster) is the data processing framework. It “crunches, analyzes and indexes huge amounts of data a la Hadoop.”
Roxie (Roxie Rapid Data Delivery Cluster) is more like a data warehouse and is designed with quick querying in mind for frontends.

The protocol that drives the whole process is the Enterprise Control Language which is said to be faster and more efficient than Hadoop’s version of MapReduce. A picture is a much better way to show how the system works. Below is a diagram from the Gigaom article from which most of this information originates.

To me, Roxie seems much more exciting because it seems to complement (or replace) several technologies currently in the space. I do not know all the details, but it seems to potentially encapsulate technologies such as HBase, Hive, RabbitMQ and MemcacheDB, technologies that are commonly used to query and speed data to a web frontend.

My opinion on HPCC is mixed. Although Hadoop has already taken off in usage, LexisNexis is a very strong institution and could potentially convince some corporate users to use their system instead — those that do not want to use Microsoft’s Dryad project. I do not see HPCC being a Hadoop killer, just as I do not see Spark or any other alternative to be a Hadoop killer. However, if HPCC does become a strong alternative, I sense this could be trouble for some of the newer players in the Hadoop field such as HortonWorks and MapR. I do not have much of an interest in studying business and competition, but Hadoop Summit 2011 showed that the Hadoop space has become crowded, and small breakthroughs such as another company developing a similar project is enough to add volatility and uncertainty for all involved.

SIGKDD 2011 Conference — Days 2/3/4 Summary

Ryan Rosario — Sat, 27 Aug 2011 18:00:00 +0000

<< My review of Day 1.

I am summarizing all of the days together since each talk was short, and I was too exhausted to write a post after each day. Due to the broken-up schedule of the KDD sessions, I group everything together instead of switching back and forth among a dozen different topics. By far the most enjoyable and interesting aspects of the conference were the breakout sessions.

Keynotes

KDD 2011 featured several keynote speeches that were spread out among three days and throughout the day. This year’s conference had a few big names.

Stephen Boyd, Convex Optimization: From Embedded Real-Time to Large-Scale Distributed. The first keynote, by Stephen Boyd, discussed convex optimization. The goal of convex optimization is to minimize some objective function given linear constraints. The caveat is that the objective function and all of the constraints must be convex (“non-negative curvature” as Boyd said). The goal of convex optimization is to turn the problem into a linear programming problem. We should care about convex optimization because it comes from some beautiful and complete theory like duality and optimality conditions. I must say, that whenever I am chastising statisticians, I often say that all they care about is “beautiful theory” so his comment was humorous to me. Convex optimization is a very intuitive way to think about regression and techniques such as the lasso. Convex optimization has tons of use cases including parameter estimation (MLE, MAP, least-squares, lasso, logistic SVM and modern L1 optimization). Boyd showed an example of convex optimization for disk head scheduling.

For more information about convex optimization, see the website for Convex Optimization by Boyd and Vandenberghe. The book is available for free as well as lecture slides etc. Even better, the second author is from UCLA! I did not realize that.

Peter Norvig, Internet Scale Data Analysis. It is always great to hear from Peter Norvig. At the very least, you may have seen his name on your Artificial Intelligence introductory textbook Artificial Intelligence: A Modern Approach. Norvig is also well known as the Director of Research at Google. He also spoke at SciPyCon 2009 and was wearing a similarly flashy shirt. Norvig discussed how to get around long latencies in a large scale system. Interestingly, his talk began with a discussion about Google’s interest in its carbon footprint because of course all of Google’s massive systems require a lot of power. The carbon output of 2500 queries is approximately equal to the carbon output in a beer. Norvig noted that most of Google’s most successful engineers are well-versed in distributed systems, and this should come as no surprise. He then introduced MapReduce and showed an example of how Google uses MapReduce to process map tiles for Google Maps. Norvig concluded by mentioning a variety of large systems used by Google including BigTable (column oriented store), and Pregel for graph processing. Pregel is vertex based, and thus programs “think like a vertex” where each vertex responds to actions transmitted over edges.

(There was a keynote by a fellow named David Haussler about cancer genomics. After an exhausting first two days, I skipped this talk as I needed to sleep…and I was not incredibly interested in the topic.)

Judea Pearl, The Mathematics of Causal Inference. Go Bruins! Judea Pearl is a professor at the UCLA Department of Computer Science and teaches a course on his field, Causality, each spring. His talk was essentially the same talk he gives at UCLA at the beginning of the quarter. I attempted to take his course in 2009, but quite frankly, I don’t get it and my mind cannot bend into that realm. I remember sitting in his class and wondering “what is wrong with me?” I love listening to Dr. Pearl speak only because of his sense of humor. Despite his age and the fact that he is slowing down, he had the crowd in hysterics as he struggled with the presentation technology and made intelligent jokes at every chance.

Pearl believes that humans do not communicate with probability, but causality (I do not agree with this entirely). I appreciated that he mentioned that it takes work to overcome the difference in thinking between probability and causality. In statistics, we use some data and a joint distribution to make inferences about some quantity or variable P. In causality, there is an intentional intervention that changes the joint distribution P into another joint distribution P’. Causality requires new language and mathematics (I do not see it). In order to use causality, one must introduce some untestable hypothesis. Pearl mentioned that some non-standard mathematical methods include counterfactuals and structural equation modeling. I do not know how I feel about any of this. For more information about Pearl’s Causality, check out his book.

Data Mining Competitions

One interesting event during KDD 2011 was the panel Lessons Learned from Contests in Data Mining. This panel featured Jeremy Howard (Kaggle), Yehuda Koren (Yahoo!), Tie-Yan Liu (Microsoft), and Claudia Perlich (Media6Degrees). Both Kaggle and Yahoo run data mining competitions: Kaggle has its own series of competitions and Yahoo is a major sponsor of the KDD Cup competition. Perlich has participated and won many data mining competitions. Liu provided a different insight into data mining competitions as an industry observer.

Jeremy Howard gave some insight into the history of data mining competitions. He credited KDD 97 with the formation of the first data mining competition. He announced to the crowd that companies spend 100 billion dollars every year on data mining products and services (not including in-house costs such as employment) and that there are approximately 2 million Data Scientists. The estimate of the number of Data Scientists was based on the number of times R was downloaded, and is an estimate based on David Smith’s (Revolution Computing) blog post. I love R, and every Data Scientist should use it, but there are several problems with this estimate. First, not everyone that uses R is a Data Scientist; a large portion of R users are statisticians (“beautiful theory”), teachers, miscellaneous students etc. Second, not all Data Scientists use R. Some are even more creative and write their own tools or use little-adopted software packages. There are also a lot of Data Scientists that use Python instead of R. Howard also announced that over the next year, Kaggle will be starting 1000s of “invitation only” competitions. Personally, I do not care for this type of exclusion even though their intentions are good.

Yehuda Koren introduced the crowd to Yahoo’s involvement in data mining competitions. Yahoo is a major force behind the KDD Cup and the Heritage Foundation competition. Yahoo also won a progress award in the Netflix challenge. Koren then described how data mining competitions help the community. Competitions raise awareness and attract research to a field, end up involving the release of a cool dataset to the community, encourage contribution and education, and provide publicity for participants and winners. Contestants are attracted to competitions for various reasons including fun, competitiveness, fame, the desire to learn more, peer pressure and of course the monetary reward. As with every competition, data mining competitions have rules and Koren stated that rules are very difficult to enforce. I believe that data mining is vague as it is, so competitions would be just as vague. It is important that the rules deter as few participants as possible while maximizing fairness and innovation. Some such “rules” include discouraging huge ensembles (which probably overfit anyway), submission frequency, team duplication, team size (the KDD Cup winning team had 25 members). Some obvious keys to success in data mining competitions are ensembles, hard work, team size, innovation vs. fancy models, quick coding and patience.

I felt that Tie-Yan Liu from Microsoft sort of served as the Simon Cowell of the panel, and I feel that his role was necessary. He provided industry insight that served as a bit of a reality check as to what data mining competitions accomplish and do not accomplish. Liu questioned whether the problems being solved in data mining competitions are really important problems. Part of the problem is that many datasets are censored to protect privacy. Additionally, the really interesting problems cannot be opened to the public because they involve trade secrets. I consider myself an inclusive guy – I do not like the concept of winners and losers. I was elated that Liu brought up this point: “what about the losers?” Is it bad publicity to “lose” several (or all) competitions? The answer to this question varies person-to-person. I honestly believe that the goal of these competitions is of the open-source nature (fun, share, learn, solve) and not so much to cure cancer. They are great for college students and for people who are interested in data science but do not have access to great data. For the rest of us, learning on our own using interesting data is probably better.

Claudia Perlich (Media6Degrees) discussed her experience participating in data mining competitions. She has won several contests. She commented on the distinction between sterile/cleaned data and real data as competitions can include either type. The concept of Occam’s Razor applies to data mining competitions; Perlich won most of her competitions using a linear model, but by using more complex and creative features. Perlich emphasizes that complex features are better than complex models.

Considering the Netflix Prize has been one of the biggest data mining competitions, I was disappointed that they were not represented on the panel since there were several researchers from Netflix at the conference.

Rather than write a few sentences for each topic, I will just bullet the goals of the research discussed in the sessions. Descriptions with a star (*) denote my favorite papers and are cited later.

Text Mining

I attended two of the three text mining sessions. I must say that I am quite topic-modeled and LDAed out! Latent Dirichlet Allocation (LDA) and several variations were part of every talk I heard. That was very exciting and reaffirms that I am in a hot field. Still, nobody has taken my dissertation topic yet (which I have remained quiet about).

Using explicit user feedback to improve LDA and display topics appropriately by combining topic labels, topic n-grams and capitalization/entity detection.* This talk was presented by David Andrzejewski (@davidandrzej). I finally got to meet him and I discussed my dissertation topic with him. I am always entertained by the fact that we all look much different than our Twitter avatars portray.
Using external metadata and topics (LDA) to predict user ratings on items using localized factor models.
Using preferences and relative emphasis of each factor (i.e. how important to you is free wireless Internet in a hotel room?) to predict rating scores.*
Determining the network process that created a piece of text: who copied from whom?
Using a topic model (LDA) with other features such as part-of-speech tag (noun, verb etc.), WordNet features, sentiment/polarity etc.*
Modeling how topics and interests grow over time and understanding the correlations between terms over time.*

Social Network Analysis and Graph Analysis

The Social Networks session conflicted with one of the Text Mining sessions, but since I knew there would be two more, I decided to attend this one instead. I also combined the two Graph Analysis sessions into this section since they are so related. The goals of the research presented in these talks were as follows:

To label venue (Foursquare venues etc.) types (restaurant, bar, park etc.) based on several attributes of the user: user’s friends, user’s weekly and daily schedule using label propagation.
To determine the connections/edges in a social network that are the most critical for propagation of data (an idea, tweet, viral marketing etc.)*
To use tagging (items on Amazon can be tagged with keywords by users) and reviews to predict the success of a new item.
To find a better metric for ranking search engine results by starting with a relevant subgraph rather than a random surfer model. Also models attention span of user.*
Classification of nodes, labeling of nodes and node link prediction using one unified algorithm (C3).*
Ranking using large graphs using a priori information about good/bad nodes and edges.*
The importance of bias in sampling from networks.*

User Modeling

This session I suspect was similar to the Web User Modeling session and focused on recommendation engines and rating prediction.

Using endorsements to measure user bias (retweets, likes, etc.) to perform real time sentiment analysis.
Estimating user reputation using thumbs-up vote rates on Yahoo News comments.
Selecting a set of reviews that encapsulates the most information about a product with the most diverse viewpoints.

Frequent Sets

I did some work with itemset mining at my last job and I was not incredibly interested in the Online Data and Streams session at the time so I attended this talk.

Using background knowledge about transactions to minimize redundancy.
Studying the effects of order on itemset mining.
Mining graphs as frequent itemsets from streams.

Classification

I got stuck in this session because the session I really wanted to attend “Web User Modeling” was full and there was nowhere to sit or stand. This session was more technical and theoretical. The only talk that I really enjoyed was about a classifier called CHIRP. I did not follow the details, but this is a paper that I am interested in reading. The authors used a classifier based on Composite Hypercubes on Iterated Random Projections to classify spaces that have complex topology (think of classifying items that appear in a bullseye/dartboard pattern).*

Unsupervised Learning

This talk was similar to the classification talk but more practical in my opinion.

Using decision trees for density estimation classifiers.
Clustering cell phone user behavior using “Earth Mover” distance.
Clustering of multidimensional data using mixture modeling with components of different distributions and copulas.*

Favorite Papers

Below is a short bibliography of my favorite papers. There were also a few at the poster session (the first four) that I include here.

Ranking-Based Classification of Heterogeneous Information Networks, Ming Ji, Jiawei Han, Marina Danilevsky.
Axiomatic Ranking of Network Role Similarity, Ruoming Jin, Victor E. Lee, Hui Hong.
Approximate Kernel k-means: Solutions to Large Scale Kernel Clustering
User-Level Sentiment Analysis Incorporating Social Networks, Chenhao Tan, Lillian Lee, Jie Tang, Lang Jiang, Ming Zhou, Ping Li.
Latent Topic Feedback for Information Retrieval, David Andrzejewski, Lawrence Livermore National La; David Buttler, Lawrence Livermore National Laboratory
Latent Aspect Rating Analysis without Aspect Keyword Supervision, Hongning Wang, UIUC; Yue Lu, University of Illinois; ChengXiang Zhai, UIUC
Conditional Topical Coding: an Efficient Topic Model Conditioned on Rich Features, Jun Zhu, Carnegie Mellon University; Ni Lao, Carnegie Mellon University; Ning Chen, Tsinghua University; Eric Xing, CMU
Tracking Trends: Incorporating Term Volume into Temporal Topic Models, Liangjie Hong, Lehigh University; Dawei Yin, Lehigh University; Jian Guo, University of Michigan; Brian Davison, Lehigh University
Diversity in ranking via resistive graph centers, Kumar Dubey, IBM Research; Soumen Chakrabarti, “Indian Institute of Technology, Bombay”; Chiru Bhattacharya, IISc
Collective Graph Identification, Galileo Namata, University of Maryland; Stanley Kok, University of Maryland; Lise Getoor, “University of Maryland, College Park”
Semi-Supervised Ranking on Very Large Graph with Rich Metadata, Bin Gao, Microsoft Research Asia; Tie-Yan Liu, Microsoft Research Asia; Wei Wei, ; Taifeng Wang, Microsoft Research; Hang Li, Microsoft
Benefits of Bias: Towards Better Characterization of Network Sampling, Arun Maiya, UIC; Tanya Berger-Wolf, University of Illinois at Chicago
CHIRP: A new classifier based on Composite Hypercubes on Iterated Random Projections, Leland Wilkinson, Systat; Anushka Anand, UIC; Tuan Dang, UIC
Sparsification of Influence Networks, Michael Mathioudakis, University of Toronto; Francesco Bonchi, Yahoo! Research; Carlos Castillo, Yahoo!; Aristides Gionis, Yahoo! Research Barcelona; Antti Ukkonen,
Online heterogeneous mixture modeling with marginal and copula selection, RYOHEI FUJIMAKI, NEC Laboratories America; Yasuhiro Sogawa, ; Satosi Morinaga,

Wrapping Up

I had an awesome time at KDD and wish I could go next year, but it will be held in Beijing. I got to meet a lot of different people in the field that have the same passion for data and that was really cool. I got to meet with recruiters from a few different companies and get some swag from Yahoo and Google.

It was awesome being around such greatness. I ran into Peter Norvig several times, ran into Judea Pearl in the restroom (I already know him), as well as Christos Faloutsos (I am a huge fan) and Ross Quinlan. I stopped at the Springer booth and found a cool book about link prediction with Faloutsos as one of the authors. I went to buy it, handed the lady my credit card, and learned that it was $206 (AFTER conference discount)! Interestingly… Amazon has the same book for $165. I will probably order it anyway.

Here’s hoping that KDD returns to California (or the US) real soon!

<< My review of Day 1.

Candid Shots


Ross Quinlan enjoying a beer during the poster session. What a cool guy!	Christos Faloutsos talking with a student during the poster session.

SIGKDD 2011 Conference — Day 1 (Graph Mining and David Blei/Topic Models)

Ryan Rosario — Mon, 22 Aug 2011 16:41:22 +0000

I have been waiting for the KDD conference to come to California, and I was ecstatic to see it held in San Diego this year. AdMeld did an awesome job displaying KDD ads on the sites that I visit, sometimes multiple times per page. That’s good targeting!

Mining and Learning on Graphs Workshop 2011

I had originally planned to attend the 2-day workshop Mining and Learning with Graphs (MLG2011) but I forgot that it started on Saturday and I arrived on Sunday. I attended part of MLG2011 but it was difficult to pay attention considering it was my first time waking up at 7am in a long time. The first talk I arrived for was Networks Spill the Beans by Lada Adamic from the University of Michigan. Adamic’s presented work involved inferring properties of content (the “what”) using network structure alone (using only the “who”: who shares with whom). One example she presented involved questions and answers on a Java programming language forum. The research problem was to determine things such as who is most likely to answer a Java beginner’s question: a guru, or a slightly more experienced user? Another research question asked what dynamic interactions tell us about information flow. For this example, Adamic used data from the virtual world SecondLife. Certain landmarks (such as a bench) can be bookmarked by users and certain gestures (like a kiss) can be studied. This made my ears perk up. SecondLife is a treasure trove of cool data. Is there a way to access it? It looks like there might be a way to access some of it including monetary valuation, market purchases, and several APIs for different aspects of SecondLife. I will have to look into that later though. Adamic concluded with a discussion of Twitter as a social network, but I was starting to fall asleep from my hectic and early morning. The gist of her talk, and many other talks in this field, was to combine semantic variables (NLP) with topological variables (SNA) to predict some other semantic variables. This talk was very digestible and very interesting (despite my lack of sleep). It featured some of the worst visualizations I have ever seen (area plots representing correlations across multiple levels of an ordinal variable), but that was minor. Of course, Nathan might disagree ;).

Social network analysis, and network analysis in general, is a field that I really want to sink my teeth into. The difficulty I have is that the discussion of this field seems to involve so much vernacular that is specific to the field that everything seems so much more difficult than it really is.

At this point I took off to lunch. Just across from the Hyatt (a beautiful hotel by the way) is Seaport Village, a beautiful waterfront park containing nice landscaping, shops, restaurants, all with the ocean in the background. There is no beach there — the village backs right up to the water. Across the bay is some type of military complex and Coronado Island. I had a $7 hot dog, followed by a chocolate-covered strawberry and a peanut butter cup from the nearby candy store. It was such a nice day I walked around for a while, grabbed a strawberry shake and then headed back for the next session… the one I had been waiting for!

Afternoon Tutorial: Probabilistic Topic Models
David Blei, Princeton

My dissertation topic is related to Latent Dirichlet Allocation (well, topic modeling in general), so I was definitely interested to hear what the father of LDA had to say. Since this was a 3 hour tutorial, I was expecting that Blei would start with the unigram model, and then discuss Latent Semantic Analysis (LSA) and Probabilistic Latent Semantic Indexing (pLSI) building up to LDA. Instead, Blei started with LDA and for good reason! In this post, I will not summarize the mechanics of Latent Dirichlet Allocation as that is another post entirely. For some introduction, see here. LDA and its extensions can be used to model the evolution of topics over time, to model the connections among topics, and to predict links among objects in a network. Topic modeling is a case study in machine learning rather than a field in itself; topic modeling draws on several different concepts including Bayesian statistics, time series analysis, hierarchical models, Markov chain Monte Carlo (MCMC), Bayesian non-parametric statistics and sparsity. In LDA, a document is represented as a mixture of topics (some hypothetical quantity that captures content clustering), and a topic is a distribution over words in a vocabulary.
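To make the "mixture of topics" idea concrete, here is a minimal sketch of the LDA generative story using only the Python standard library. The function names, the toy topics and the value of alpha are my own illustrative assumptions, not anything shown in the tutorial.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one point on the simplex from Dirichlet(alpha) via normalized Gamma draws."""
    while True:
        draws = [rng.gammavariate(a, 1.0) for a in alpha]
        total = sum(draws)
        if total > 0:  # guard against numerical underflow for tiny alpha
            return [d / total for d in draws]

def generate_document(topics, alpha, n_words, rng):
    """Generate one document under the LDA generative story.

    topics: list of dicts mapping word -> probability (one dict per topic).
    alpha:  Dirichlet hyperparameter, one entry per topic.
    """
    theta = sample_dirichlet(alpha, rng)  # per-document mixture over topics
    words = []
    for _ in range(n_words):
        # Pick a topic for this word position, then a word from that topic.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        vocab, probs = zip(*topics[z].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Hypothetical two-topic example (toy vocabulary, made up for illustration).
toy_topics = [{"gene": 0.5, "dna": 0.5}, {"ball": 0.5, "team": 0.5}]
theta, words = generate_document(toy_topics, [0.5, 0.5], 10, random.Random(0))
```

Inference runs this story in reverse: given only the words, estimate the topics and each document's mixture.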

Again, this is a high-level description of what was discussed. A full mechanical analysis would require dozens of pages. LDA is just a probabilistic model. As such, there are established ways for estimating the parameters of the model as well as the topic assignments. Some of these include mean field variational methods, expectation propagation, Gibbs sampling, collapsed Gibbs sampling, collapsed variational Bayes and online variational Bayes. Each of these estimation methods has its own advantages and disadvantages. Blei showed that LDA and pLSI have a lot in common. Unlike LDA, pLSI uses maximum likelihood estimation (and the EM algorithm) for parameter estimation; pLSI tends to overfit badly. The hyperparameter α adds regularization to the ϴ parameter in the LDA model. [Sorry to refer to these random parameters, but it is difficult to describe without them. See the links mentioned earlier for an overview of LDA.]

Preprocessing. A lot of preprocessing must be performed before computing a topic model. First, we should remove stopwords, which are words that provide absolutely no clues to the content of the text. If we leave stopwords in the corpus when computing the model, we may end up with meaningless topics that are described with only stopwords, due to their high probability. Second, Blei mentioned that stemming is a good idea, but modern stemming algorithms tend to be too aggressive. If resources allow, I think it would be useful to have humans manually strip words to their root words. Multiword phrases such as “black hole” are also an issue. With sufficient resources, one could ask human labelers to identify these phrases and recode them as a single word by replacing the space between words with an underscore. Hanna Wallach (U. Mass) has a paper that describes how to identify multiword phrases by using n-grams. Blei has a similar paper that discusses an algorithm called TurboTopics. He also mentioned that a standard statistical hypothesis test such as chi-squared, permutation tests, or a nested hypothesis test would also be sufficient, though inefficient. I have not thought through how this would work, however. Finally, remove rare words because they can introduce local optima in the likelihood surface, which probably makes computation less efficient.
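The preprocessing steps above can be sketched in a few lines of Python. This is only a rough illustration: the stopword set and the phrase table are tiny made-up stand-ins for real resources, tokenization is a plain whitespace split, and stemming is left out entirely.

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "is", "and", "in", "to"}  # illustrative, not a standard list
PHRASES = {("black", "hole"): "black_hole"}              # hypothetical hand-labeled phrases

def preprocess(docs, min_count=2):
    """Lowercase, join known multiword phrases, drop stopwords and rare words."""
    tokenized = []
    for doc in docs:
        tokens = doc.lower().split()
        merged, i = [], 0
        while i < len(tokens):
            pair = tuple(tokens[i:i + 2])
            if pair in PHRASES:
                # Recode the phrase as a single underscore-joined token.
                merged.append(PHRASES[pair])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokenized.append([t for t in merged if t not in STOPWORDS])
    # Rare words are judged on corpus-wide counts after the steps above.
    counts = Counter(t for doc in tokenized for t in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in tokenized]
```

The `min_count` threshold is an assumption of mine; in practice it would be tuned to the size of the corpus.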

Some hairy details. One of the parameters that makes LDA useful is α. α is a hyperparameter in the LDA model that determines the sparsity of draws from the underlying Dirichlet distribution. α is typically a small number; Blei mentioned that 0.01 is a good a priori value for α. As α gets larger, the distribution of topics tends towards the uniform (each topic equally likely) distribution and as α approaches 0, we get sparser draws, meaning more peaked topic probabilities. Setting α to be ridiculously small (e.g. 0.001) may yield a single topic dominating the model. α can be chosen, or we can fit α to the data using cross-validation or some other method. He also discussed the parameter η.
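The effect of α on sparsity is easy to check by simulation. The sketch below draws from a symmetric Dirichlet and reports the average weight of the single largest topic; the helper names, the number of topics and the α values tried are my own choices for illustration.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw from a symmetric Dirichlet(alpha) over k topics via normalized Gamma draws."""
    while True:
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
        total = sum(draws)
        if total > 0:  # guard against underflow when alpha is very small
            return [d / total for d in draws]

def mean_largest_weight(alpha, k=10, n_draws=2000, seed=0):
    """Average weight of the most probable topic across many draws."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n_draws)) / n_draws

if __name__ == "__main__":
    for alpha in (0.1, 1.0, 100.0):
        print(alpha, round(mean_largest_weight(alpha), 3))
```

With a small α one topic tends to take most of the mass in each draw, while a large α pushes every draw towards the uniform value of 1/k.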

Open source software. We quickly (flash of an eye) went through a list of some open-source LDA implementations:

LDA-C (variational EM), Blei.
HDP (hierarchical Dirichlet processes), Blei
LDA (R package, collapsed Gibbs), Jonathan Chang, Data Scientist, Facebook.
Lingpipe, alias-i
Mallet (collapsed Gibbs), UMass
Gensim (online and batch LDA), Radim Řehůřek

To my delight, Blei seemed to favor the R package (although Gensim is a nice Python implementation). The R package not only contains LDA, but several other models including RTMs, MMSB and sLDA which will be discussed later. It is supposedly fast as well. The output from the R package can be visualized using the Topic Model Visualizer by Allison Chaney.

The beauty of LDA is that it can be embedded in many more complicated models. Some applications of these extensions include word sense, graphs and hierarchies. Before delving into specifics, there are a couple of changes to the LDA model that motivate the next topics.

The probability of observing word w given a set of topics β and a set of topic labels z is given by P(w|β,z) which is multinomial. The distribution of P(w|β,z) can be changed depending on what we are modeling. For example, for count data, P(w|β,z) can be Poisson. This drastically changes the model, however. In LDA, P(w|β,z) is multinomial, which is convenient because the Dirichlet distribution is its conjugate prior.
The characteristic LDA posterior distribution can be used in more creative ways…
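The convenience of the Dirichlet and multinomial pairing can be written out in one line. As a sketch, assume a single topic β over a vocabulary of size V with a Dirichlet prior with parameters η, and let n_v be the number of times word v was assigned to that topic (the count notation is mine, not from the tutorial):

```latex
\begin{aligned}
\beta &\sim \mathrm{Dirichlet}(\eta_1, \dots, \eta_V), \qquad
(n_1, \dots, n_V) \mid \beta \sim \mathrm{Multinomial}(n, \beta), \\
p(\beta \mid n_1, \dots, n_V)
  &\propto \prod_{v=1}^{V} \beta_v^{n_v} \cdot \prod_{v=1}^{V} \beta_v^{\eta_v - 1}
  = \prod_{v=1}^{V} \beta_v^{\eta_v + n_v - 1}, \\
\beta \mid n_1, \dots, n_V &\sim \mathrm{Dirichlet}(\eta_1 + n_1, \dots, \eta_V + n_V).
\end{aligned}
```

The posterior stays in the Dirichlet family and the update amounts to adding counts, which is what makes Gibbs sampling for LDA so simple. Swapping the multinomial for a Poisson breaks this closed form.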

Correlated Topic Model. In LDA, all topics are considered independent of each other, and this is usually unrealistic. CTM allows the topics to be correlated. For example, a paper classified as about calculus is more likely to also be classified as about physics, than it is to be classified as about sewing. Blei mentioned that CTM allows for better prediction, likely because it is more realistic. CTM is also more robust to overfitting. The main distinction from LDA is that ϴ follows the logistic normal distribution instead of the Dirichlet distribution.
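A logistic normal draw is just a multivariate normal draw pushed through a softmax, and the covariance matrix is where the topic correlations live. Here is a minimal standard-library sketch; the hand-rolled Cholesky helper and the example covariance are my own assumptions for illustration, not code from any CTM implementation.

```python
import math
import random

def cholesky(cov):
    """Lower-triangular Cholesky factor of a small symmetric positive-definite matrix."""
    n = len(cov)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(cov[i][i] - s)
            else:
                L[i][j] = (cov[i][j] - s) / L[j][j]
    return L

def sample_logistic_normal(mean, cov, rng):
    """Draw topic proportions as the softmax of a multivariate normal draw."""
    L = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in mean]
    # eta = mean + L z has covariance L L^T = cov.
    eta = [m + sum(L[i][k] * z[k] for k in range(i + 1)) for i, m in enumerate(mean)]
    peak = max(eta)
    exps = [math.exp(e - peak) for e in eta]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical topics: calculus, physics, sewing. The first two are positively correlated.
cov = [[1.0, 0.9, 0.0], [0.9, 1.0, 0.0], [0.0, 0.0, 1.0]]
theta = sample_logistic_normal([0.0, 0.0, 0.0], cov, random.Random(1))
```

A Dirichlet cannot express "calculus and physics tend to appear together", whereas the off-diagonal 0.9 here does exactly that.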

Dynamic Topic Model. DTM models how each individual topic changes over time. One example Blei showed involved a topic that could be labeled “technology”. In the late 1700s, this topic contained the words “coal”, “steel” (I am making it up from memory…probably badly…bear with me) and in 2011 contained the words “silicon” and “solar”. The main distinction from LDA is two-fold: the topic at time t is assumed to be normally distributed with the topic at time t-1 as the mean and some variance. That is,

$\beta_{t,k} \mid \beta_{t-1,k} \sim \mathcal{N}(\beta_{t-1,k}, \sigma^2 I)$

and

$\alpha_t \mid \alpha_{t-1} \sim \mathcal{N}(\alpha_{t-1}, \delta^2 I)$

instead of multinomial.

A limitation of DTM is that it does not handle the death of a topic gracefully.
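The "topic at time t is centered on the topic at time t-1" idea can be sketched as a Gaussian random walk. One assumption I am adding here: the walk happens on unconstrained real-valued parameters, which are then mapped to word probabilities with a softmax. The function names and the value of sigma are hypothetical.

```python
import math
import random

def softmax(values):
    """Map real-valued parameters to a probability distribution."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def evolve_topic(beta0, n_steps, sigma, rng):
    """Random-walk a topic's parameters and return its word distribution per time slice."""
    beta = list(beta0)
    history = [softmax(beta)]
    for _ in range(n_steps):
        # Each step is normal with the previous value as the mean and variance sigma^2.
        beta = [b + rng.gauss(0.0, sigma) for b in beta]
        history.append(softmax(beta))
    return history

# Four hypothetical words, starting from equal probabilities.
slices = evolve_topic([0.0, 0.0, 0.0, 0.0], n_steps=5, sigma=0.5, rng=random.Random(0))
```

Because the walk never stops, a topic can drift but cannot disappear, which is one way to see why the death of a topic is awkward in this model.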

Supervised LDA. In sLDA, we associate each document with an external variable. For example, a document may be a Yelp review containing text. The external variable associated with the Yelp review may be the number of stars in the associated rating. We can use sLDA to use the topics estimated by LDA as regressors to predict this external variable Y. Various types of regression can be performed from standard linear regression to the generalized linear model (GLM). The Yelp example would likely use an ordered logit model for Y.

Relational Topic Models. RTM applies sLDA to every pair of documents in a corpus and attempts to use content to predict connectedness in a graph. For example, given the content on my Facebook profile, one could use sLDA to predict what kind of reaction I would have to an ad (i.e. click or no click) and this could be used for targeted ad serving, or any other type of recommendation engine. Think collaborative filtering! RTM is also good for certain types of data that have spatial/geographic dependencies.

Ideal Point Topic Models were barely touched upon since we were running short on time (although we voted to extend the session by 30 mins and Blei happily obliged). They seem particularly useful in political science for predicting roll call votes.

Bayesian Non-Parametric Models are a hot topic but are too complicated to describe here. In LDA, the number of topics is determined a priori and remains fixed throughout the model. In real life, topics can be “born” and can “die” off and we may not know a priori how many topics to use. One can model the latter situation as a Chinese Restaurant Process where each table is associated with a topic. Furthermore, a Chinese Restaurant Franchise can be used for modeling hierarchies (hLDA). In CRF, there is a corpus level restaurant where each table is a parameter and a topic (called plates). Then, each document has its own Chinese restaurant where each table is associated with a customer in the corpus level Chinese restaurant. Blei recommended a book by Hjort.
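The Chinese Restaurant Process itself is simple to simulate, and seeing it run makes the "topics can be born" idea less mysterious. In this sketch each new customer joins an existing table with probability proportional to its size, or opens a new table with probability proportional to a concentration parameter. The function and parameter names are mine.

```python
import random

def chinese_restaurant_process(n_customers, concentration, rng):
    """Seat customers one at a time; return each customer's table and the table sizes."""
    table_sizes = []
    seating = []
    for _ in range(n_customers):
        # Existing table k has weight equal to its size; a new table has weight `concentration`.
        weights = table_sizes + [concentration]
        table = rng.choices(range(len(weights)), weights=weights)[0]
        if table == len(table_sizes):
            table_sizes.append(1)   # a new table (topic) is born
        else:
            table_sizes[table] += 1
        seating.append(table)
    return seating, table_sizes

seating, table_sizes = chinese_restaurant_process(100, 1.0, random.Random(0))
```

The number of occupied tables is not fixed in advance and grows slowly with the number of customers, which is the property that lets the number of topics be learned from the data.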

Algorithms. The last few minutes were dedicated to discussing inference algorithms for LDA, particularly Gibbs sampling and variational Bayes. Gibbs sampling is very simple to implement, though Blei stated that it does not work for DTM or CTM because the assumptions of conjugacy (multinomial/Dirichlet) are violated. Variational Bayes is more difficult to implement, but handles non-conjugacy in CTM and DTM much better.
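To back up the claim that Gibbs sampling is simple to implement, here is a minimal collapsed Gibbs sampler for plain LDA in pure Python. It is a sketch under my own assumptions (symmetric priors alpha and eta, a fixed number of sweeps, no convergence checks), not the code from any of the packages listed above.

```python
import random

def lda_collapsed_gibbs(docs, n_topics, alpha=0.1, eta=0.01, n_iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized docs (lists of strings)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]                        # n_{d,k}
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]   # n_{k,w}
    topic_total = [0] * n_topics                                      # n_k
    assignments = []
    # Random initialization of the topic assignment of every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove the current token from all counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Full conditional for this token's topic, up to a constant.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + eta) / (topic_total[k] + V * eta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return assignments, doc_topic, topic_word
```

The closed-form weights rely on the multinomial/Dirichlet conjugacy, which is why this recipe does not carry over directly to CTM or DTM.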

Plenary Sessions

The plenary sessions consisted of several thank-yous and awards. The committee provided some humor which brought some levity to the long process of writing and submitting papers. They went over paper acceptance statistics and read some of the funnier comments that reviewers gave, one of which was something like “It is clear that the author did not read this paper before submitting it.” I don’t know how many times I have said that in various situations. The committee handed out awards for best paper and best dissertation. This year’s KDD Cup competition was a contest similar to the Netflix challenge, but involved music recommendation. The winner was the National Taiwan University, for the fourth year in a row I am told. The innovation award went to a researcher dear to my heart, Ross Quinlan, who developed the C4.5 decision tree modeling software.

For more information about topic modeling software, see David Blei’s website at http://www.cs.princeton.edu/~blei which contains code for most if not all of these topic models. For the notes from the tutorial, see http://www.cs.princeton.edu/~blei/kdd-tutorial.pdf.

Hadoop Fatigue — Alternatives to Hadoop

Ryan Rosario — Tue, 16 Aug 2011 17:30:00 +0000

It’s been a while since I have posted… in the midst of trying to plow through this dissertation while working on papers for submission to some conferences.

Hadoop has become the de facto standard in the research and industry uses of small and large-scale MapReduce. Since its inception, an entire ecosystem has been built around it including conferences (Hadoop World, Hadoop Summit), books, training, and commercial distributions (Cloudera, Hortonworks, MapR) with support. Several projects that integrate with Hadoop have been released from the Apache incubator and are designed for certain use cases:

Pig, developed at Yahoo, is a high-level scripting language for working with big data and Hive is a SQL-like query language for big data in a warehouse configuration.
HBase, developed at Powerset and used heavily at Facebook, is a column-oriented database often used as a datastore on which MapReduce jobs can be executed.
ZooKeeper and Chukwa
Mahout is a library for scalable machine learning, part of which can use Hadoop.
Cascading (Chris Wensel), Oozie (Yahoo) and Azkaban (LinkedIn) provide MapReduce job workflows and scheduling.

Hadoop is modeled after Google MapReduce. To store and process huge amounts of data, we typically need several machines in some cluster configuration. A distributed filesystem (HDFS for Hadoop) uses space across a cluster to store data so that it appears to be in a contiguous volume and provides redundancy to prevent data loss. The distributed filesystem also allows data collectors to dump data into HDFS so that it is already primed for use with MapReduce. A Data Scientist or Software Engineer then writes a Hadoop MapReduce job.

As a review, the Hadoop job consists of two main steps, a map step and a reduce step. There may optionally be other steps before the map phase or between the map and reduce phases. The map step reads in a bunch of data, does something to it, and emits a series of key-value pairs. One can think of the map phase as a partitioner. In text mining, the map phase is where most parsing and cleaning is performed. The output of the mappers is sorted and then fed into a series of reducers. The reduce step takes the key value pairs and computes some aggregate (reduced) set of data such as a sum, average, etc. The trivial word count exercise starts with a map phase where text is parsed and a key-value pair is emitted: a word, followed by the number “1” indicating that the key-value pair represents 1 instance of the word. The user might also emit something to coerce Hadoop into passing data into different reducers. The words and 1s are sorted and passed to the reducers. The reducers take like key-value pairs and compute the number of times the word appears in the original input.
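The word count exercise can be mimicked on a single machine in a few lines of Python. This is only a sketch of the data flow (map, sort and group, reduce); it does not use Hadoop or any of its APIs, and the function names are my own.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Sort and group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in sorted(pairs):
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Sum the 1s emitted for a single word."""
    return key, sum(values)

def word_count(lines):
    """Run the three steps in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

In a real cluster the mappers and reducers run on different machines and the grouping step moves data over the network, but the contract between the phases is the same.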

After working extensively with (Vanilla) Hadoop professionally for the past 6 months, and at home for research, I have found several nagging issues with Hadoop that have convinced me to look elsewhere for everyday use and certain applications. For these applications, the thought of writing a Hadoop job makes me take a deep breath. Before I continue, I will say that I still love Hadoop and the community.

Writing Hadoop jobs in Java is very time consuming because everything must be a class, and many times these classes extend several other classes or implement multiple interfaces; the Java API is very bloated. Adding a simple counter to a Hadoop job becomes a chore of its own.
Documentation for the bloated Java API is sufficient, but not the most helpful.
HDFS is complicated and has plenty of issues of its own. I recently heard a story about data loss in HDFS just because the IP address block used by the cluster changed.
Debugging a failure is a nightmare; is it the code itself? Is it a configuration parameter? Is it the cluster or one/several machines on the cluster? Is it the filesystem or disk itself? Who knows?!
Logging is verbose to the point that finding errors is like finding a needle in a haystack. That is, if you are even lucky to have an error recorded! I’ve had plenty of instances where jobs fail and there is absolutely nothing in the stdout or stderr logs.
Large clusters require a dedicated team to keep them running properly, but that is not surprising.
Writing a Hadoop job becomes a software engineering task rather than a data analysis task.

Hadoop will be around for a long time, and for good reason. MapReduce cannot solve every problem (fact), and Hadoop can solve even fewer problems (opinion?). After dealing with some of the innards of Hadoop, I’ve often said to myself “there must be a better way.” For large corporations that routinely crunch large amounts of data using MapReduce, Hadoop is still a great choice. For research, experimentation, and everyday data munging, one of these other frameworks may be better if the advantages of HDFS are not necessarily imperative:

BashReduce

Unlike Hadoop, BashReduce is just a script! BashReduce implements MapReduce for standard Unix commands such as sort, awk, grep, join etc. It supports mapping/partitioning, reducing, and merging. The developers note that BashReduce “sort of” handles task coordination and a distributed file system. In my opinion, these are strengths rather than weaknesses. There is actually no task coordination as a master process simply fires off jobs and data. There is also no distributed file system at all, but BashReduce will distribute files to worker machines. Of course, without a distributed file system there is a lack of fault-tolerance among other things.

Intermachine communication is facilitated with simple passwordless SSH, but there is a large cost associated with transferring files from a master machine to its workers whereas with Hadoop, data is stored centrally in HDFS. Additionally, partition/merge in the standard unix tools is not optimized for this use case, thus the developer had to use a few additional C programs to speed up the process.

Compared to Hadoop, there is less complexity and faster development. The trade-off is the lack of fault-tolerance, and lack of flexibility as BashReduce only works with certain Unix commands. Unlike Hadoop, BashReduce is more of a tool than a full system for MapReduce. BashReduce was developed by Erik Frey et al. of last.fm.

Disco Project

Disco was initially developed by Nokia Research and has been around silently for a few years. Developers write MapReduce jobs in simple, beautiful Python. Disco’s backend is written in Erlang, a scalable functional language with built-in support for concurrency, fault tolerance and distribution — perfect for a MapReduce system! Similar to Hadoop, Disco distributes and replicates data, but it does not use its own file system. Disco also has efficient job scheduling features.

It seems that Disco is a pretty standard and powerful MapReduce implementation that removes some of the painful aspects of Hadoop. It also likely removes persistent fault tolerance, since it relies on a standard filesystem rather than one like HDFS, although Erlang may provide a “good enough” level of fault tolerance for data.

Spark

Spark is one of the newest players in the MapReduce field. Its purpose is to make data analytics fast to write, and fast to run. Unlike many MapReduce systems, Spark allows in-memory querying of data (even distributed across machines) rather than using disk I/O. It is no surprise then that Spark out-performs Hadoop on many iterative algorithms. Spark is implemented in Scala, a functional object-oriented language that sits on top of the JVM. Similar to other languages like Python, Ruby, and Clojure, Scala has an interactive prompt and users can use Spark to query big data straight from the Scala interpreter.

One wrinkle is that Spark requires installing a cluster manager called Mesos. I had some difficulty installing it on Ubuntu, but the development team was an amazing help, and made a few changes to the source and now it runs well. On the downside, Mesos adds a layer of complexity that we are trying to avoid. On the upside, Mesos allows Spark to co-exist with Hadoop and it can read any data source that Hadoop supports, and it “feels” light, similar to Disco’s server UI.

Spark was developed by the UC Berkeley AMP Lab. Currently, its main users are UC Berkeley researchers and Conviva. Hadoop Summit 2011 featured a talk on Spark by one of the developers, which I wrote about earlier this summer.

GraphLab

GraphLab was developed at Carnegie Mellon and is designed for use in machine learning. GraphLab’s goal is to make the design and implementation of efficient and correct parallel machine learning algorithms easier. Their website states that paradigms like MapReduce lack expressiveness while lower level tools such as MPI present overhead by requiring the researcher to write code that beats a dead horse.

GraphLab has its own version of the map stage, called the update phase. Unlike MapReduce, the update phase can both read and modify overlapping sets of data. Recall that MapReduce requires data to be partitioned. GraphLab accomplishes this by allowing the user to specify data as a graph where each vertex and edge in the graph has associated memory. The update phases can be chained in such a way that one update function can recursively trigger other update functions that operate on vertices in the graph. This graph-based approach not only makes machine learning on graphs more tractable, but also improves dynamic iterative algorithms.

GraphLab also has its own version of the reduce stage, called the sync operation. The results of the sync operation are global and can be used by all vertices in the graph. In MapReduce, output from the reducers is local (until committed) and there is a strict data barrier among reducers. The sync operations are performed at time intervals, and there is not as strong of a tie between the update and sync phases. What I mean is that the sync intervals are not necessarily dependent on some prior update completing.
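
The update and sync ideas can be sketched in a few lines of Python. This is only a toy under my own assumptions, not GraphLab's API: the names `run_updates`, `spread_min` and `sync` are hypothetical, the scheduler is a plain single-threaded queue, and the example update simply pushes the smallest label through a graph.

```python
from collections import deque


def run_updates(data, neighbours, update, start):
    """Apply `update` to scheduled vertices until the queue is empty."""
    queue = deque(start)
    while queue:
        vertex = queue.popleft()
        # An update may read and modify the vertex and its neighbourhood,
        # and it returns the vertices it wants scheduled next.
        queue.extend(update(vertex, data, neighbours))
    return data


def spread_min(vertex, data, neighbours):
    """Example update: push this vertex's label to neighbours with larger labels."""
    to_schedule = []
    for other in neighbours[vertex]:
        if data[other] > data[vertex]:
            data[other] = data[vertex]
            to_schedule.append(other)  # trigger a further update on that vertex
    return to_schedule


def sync(data, fold, initial):
    """Sync-style global aggregate over the data of every vertex."""
    result = initial
    for value in data.values():
        result = fold(result, value)
    return result


labels = {"a": 1, "b": 5, "c": 9}
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
run_updates(labels, graph, spread_min, start=["a", "b", "c"])
total = sync(labels, lambda acc, value: acc + value, 0)
```

Note how one update schedules others, and how the sync result is a single global value rather than per-reducer output. A real system would run updates in parallel and would have to manage consistency on overlapping neighbourhoods, which this sketch ignores.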

GraphLab’s website also contains the original UAI paper and presentation, a document better explaining the abstraction, and there is even a Google Group for the GraphLab API. To me, GraphLab seems like a very powerful generalization, and re-specification, of MapReduce.

Storm

Recently, Nathan Marz of BackType made waves in the Twitter big data community with a blog post titled Preview of Storm: The Hadoop of Realtime Processing. Within a day, Storm became known as “Real-time Hadoop” to the chagrin of some developers from Apache. Hadoop is a batch-processing system — that is, give it a lot of fixed data and it does something with it. Storm is real-time — it processes data in parallel as it streams.

Marz writes that with their previous system, much time was spent worrying about graphs of queues and workers: where to send and receive messages, deploying workers and queues, and a lack of fault tolerance. Storm abstracts all of these complications away. Storm is written in Clojure, but any programming language can be used to write programs on top of Storm. Storm is fault-tolerant, horizontally scalable, and reliable. Storm is also very fast, with ZeroMQ used as the underlying message passing system.

Nathan Marz is a software developer at BackType, and made waves in 2010 with Cascalog. Cascalog really took off after his presentation at the 2010 Hadoop Summit, and I am delighted I got to see him present it. Storm will be open-sourced soon and I hope to write more about it later.

I included Storm in this post based on its colloquial name “Real-time Hadoop” — it is not clear to me whether or not Storm even uses MapReduce though.

HPCC Systems (from LexisNexis)

Perhaps the project with the least flattering name comes from LexisNexis, which has developed its own framework for massive data analytics. HPCC attempts to make writing parallel-processing workflows easier by using Enterprise Control Language (ECL), a declarative, data-centric language. I should note that SQL, Datalog and Pig are also said to be declarative, data-centric languages. As a matter of fact, the development team has a converter for translating Pig jobs to ECL. HPCC is written in C++. Some have commented that this will make in-memory querying much faster because it avoids the bloated object sizes that originate from the JVM. I also prefer C++ simply because it feels closer to human thought: we think in terms of objects (object-oriented) at times, and a series of steps (procedural) at other times, and use both thought processes together.

HPCC already has its own jungle of technologies like Hadoop. HPCC has two “systems” for processing and serving data: the Thor Data Refinery Cluster, and the Roxie Rapid Data Delivery Cluster. Thor is a data processor, like Hadoop. Roxie is similar to a data warehouse (like HBase) and supports transactions. HPCC uses a distributed file system.

Although details are still preliminary as is the system, this certainly has a “feel” for potentially being a solid alternative for Hadoop, but only time will tell.

With all these alternatives, why use Hadoop?

One word: HDFS. For a moment, assume you could bring all of your files and data with you everywhere you go. No matter what system, or type of system, you log in to, your data is intact waiting for you. Suppose you find a cool picture on the Internet. You save it directly to your file store and it goes everywhere you go. HDFS gives users the ability to dump very large datasets (usually log files) to this distributed filesystem and easily access them with tools, namely Hadoop. Not only does HDFS store a large amount of data, it is fault tolerant. Losing a disk, or a machine, typically does not spell disaster for your data. HDFS has become a reliable way to store data and share it with other open-source data analysis tools. Spark can read data from HDFS, but if you would rather stick with Hadoop, you can try to spice it up:

Hadoop Streaming is an easy way to avoid the monolith of Vanilla Hadoop without leaving HDFS, and allows the user to write map and reduce functions in any language that supports writing to stdout, and reading from stdin. Choosing a simple language such as Python for Streaming allows the user to focus more on writing code that processes data rather than software engineering. Once code is written, it is easy to test from the command line:

cat a_bunch_of_files | ./mapper.py | sort | ./reducer.py

And, running and monitoring the job is similar to Vanilla Hadoop. Hadoop Streaming was my first introduction to Hadoop and it was quite pleasant.
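
To make the Streaming pipeline concrete, here is a minimal word-count sketch of the logic that `mapper.py` and `reducer.py` might contain. The word-count task and the function names are my own assumptions for illustration. In real Streaming scripts the mapper and reducer would each read lines from stdin and print records to stdout; here they are plain generator functions so the whole pipeline can be exercised in one process.

```python
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated (word, 1) record per word, as mapper.py would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts for each word; the input must already be sorted by key."""
    pairs = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_local(lines):
    """Mimic `cat files | ./mapper.py | sort | ./reducer.py` in a single process."""
    return list(reducer(sorted(mapper(lines))))


counts = run_local(["the quick the", "Quick fox"])
```

The reducer relies on the sort step placing all records for a key next to each other, which is the same guarantee Hadoop provides between the map and reduce phases.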

Or, you could use a Hadoopified project that better solves the problem. Vanilla Hadoop can do some sophisticated stuff, but it suffers the problems I mentioned at the beginning of the post. Developers have created software that works on HDFS, but is geared toward different audiences. A Data Scientist may prefer Pig or Hive for data analysis whereas a Systems and Software Engineer may prefer a workflow solution (Oozie, Cascading etc.) and a (modern) DBA may want to use HBase. Each of these achieves different goals, but still relies on HDFS.

My Review of Hadoop Summit 2011 #hadoopsummit

Ryan Rosario — Thu, 30 Jun 2011 07:00:45 +0000

I woke up early and cheery Wednesday morning to attend the 2011 Hadoop Summit in Santa Clara, after a long drive from Los Angeles and the Big Data Camp that lasted until 10pm the night before. Having been to Hadoop Summit 2010, I was interested to see how much of the content in the conference had changed.

This year, there were approximately 1,600 participants and the summit was moved a few feet away to the Convention Center rather than the Hyatt. Still, space and seating were pretty cramped. That just goes to show how much the Hadoop field has grown in just one year.

Keynotes

We first heard a series of keynote speeches which I will summarize. The first keynote was from Jay Rossiter, SVP of the Cloud Platform Group at Yahoo. He introduced how Hadoop is used at Yahoo, which is fitting since they organized the event. The content of his presentation was very similar to last year’s. One interesting application of Hadoop at Yahoo was for “retiling” the map of the United States. I imagine this refers to the change in aerial imagery over time. When performed by hand, retiling took 6 weeks; with Hadoop, it took 5 days. Yahoo also uses Hadoop for fraud detection, spam detection, search assist, geotagging data/local indexing, ad targeting, predicting supply and demand and the aggregation and categorization of news stories. Jay also mentioned that Dapper runs models on data with Hadoop for ad personalization. Jay also mentioned that Big Data conferences all over the country are selling out.

Eric Baldeschwieler, the CEO of Hortonworks was next. Hortonworks seems to be a new company that spun off from Yahoo. Their goal is to provide commercial support and a full Apache Hadoop platform for users. Yes, they are very similar to Cloudera, and yes, they are competition. (Hortonworks and MapR both did a good job of not stepping on everyone’s toes in terms of how they presented themselves.) Cloudera provides its own distribution of Hadoop, which is of course similar to the Apache version. Hortonworks’ goal is to provide similar services, but with more transparency by using the Apache Hadoop distribution rather than wrapping its own. Paraphrasing Eric, Hortonworks is open-source from the ground up. A bit later, Sanjay Radia also of Hortonworks discussed Hadoop for the enterprise. Hortonworks has contributed, or is working on security (preventing users from deleting others’ data), service level agreements (SLAs), predictability and a Fair-Share scheduler.

Anant Jhingran, CTO of IBM discussed how Hadoop was used in IBM Watson. It seemed pretty obvious that Hadoop or some form of map-reduce was used in the system, but it did not seem to be highly publicized. Watson learned from 200 million pages of data, about 2-5TB and required between 3000 and 4000 Watts. Anant went quickly through a cool user interface representing a Jeopardy board and stated that the user interface to an artificial intelligence application is just as important as the application itself. He also prefers the term IA (intelligence augmentation) over AI, and apparently this is a common distinction. I interpret AI vs. IA to be artificial intelligence vs. knowledge discovery (data mining).

Karthik Ranganathan from Facebook discussed Facebook’s messaging system which was built on HBase, HDFS and MapReduce. Facebook sees 15 billion messages per month, excluding SMS and email, approximately 14TB of data! There are also 120 billion chat messages (25TB), for a grand total of almost 300TB per month. (I may have missed something as these numbers do not add up). Facebook uses HBase for the bodies of small messages, metadata, and for the search index. Facebook uses HBase because of its high write throughput and easy horizontal scalability. Facebook uses another system called Haystack for photos, bodies of large messages and attachments. Of course, HDFS is used for fault tolerance, scalability, checksums for data integrity and its MapReduce abilities. Profiles and services are partitioned by user. Each machine has 16 cores, with 12 1TB hard disks, and 48GB RAM (24GB used for HBase). Some things that Facebook would like to contribute and improve: NameNode high availability and a second NameNode, better performance overall, and using flash memory to improve performance. Facebook often adds several columns to a table so that DevOps does not need to take the server offline to add new columns.

Breakout Sessions

There were so many great sessions and I can only summarize the ones I attended. Check out the event agenda for abstracts on all sessions.

First I attended Web Crawl Cache – Using HBase to Manage a Copy of the Web. In this talk, we learned about Yahoo’s Web Crawl Cache (WCC) that collects and organizes data from Microsoft as a result of a search deal. These snapshots of the web are not only useful for search, but also for drilling into other avenues such as local assets, influence and language corpora. WCC uses HBase for several reasons: bulk load, MapReduce jobs are efficient, random access reads, a usable consistency model, and it is easy to dynamically add columns (this seems to contradict Karthik’s claim).

It was very difficult to pick a session for the 1:45 to 2:15 time slot. Options included Next Generation Hadoop, Scaling out Realtime Data (Facebook) and Building Kafka (LinkedIn). I admire the work and clout that LinkedIn has built over the past year or two, so I attended Jay Kreps’ session. LinkedIn’s data pipeline includes a lot of tracking, logging, metrics, messages and queuing. LinkedIn attempted to use messaging systems such as JMS and RabbitMQ. Streaming data is prevalent at LinkedIn such as search trends, click trends, invitation social networks etc. Kafka is LinkedIn’s solution for a distributed message queue; rather than polling for data, users subscribe to a data stream and data sources publish data to it. Kafka is 7000 lines of Scala, a functional and object-oriented language on top of the Java Virtual Machine (JVM). Kafka can produce about 250,000 messages per second (50 MB) and consume 550,000 messages per second (110 MB).
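
The publish/subscribe idea described above can be illustrated with a toy, in-memory broker. This is a sketch under my own assumptions and has nothing to do with Kafka's actual API or implementation: `ToyBroker`, its methods and the `"clicks"` topic are hypothetical names, and there is no persistence, distribution or networking.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a message broker; hypothetical, not Kafka's API."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        """Register interest in a topic instead of polling the data source."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver a message to every subscriber; return the delivery count."""
        for callback in self._subscribers[topic]:
            callback(message)
        return len(self._subscribers[topic])


broker = ToyBroker()
click_log = []
broker.subscribe("clicks", click_log.append)
delivered = broker.publish("clicks", {"user": 1, "page": "/jobs"})
```

The point of the design is that the publisher never needs to know who consumes the stream, so new consumers such as a metrics job or a search-trends job can be added without touching the data source.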

Next I attended another talk by Hortonworks, this time on HCatalog. HCatalog changes the way we think about data in HDFS. No longer do we need to worry about files and directories. Instead, HCatalog seems to add a layer of abstraction on top of HDFS that treats data as a set of tables. Tools such as Pig and Hive use this layer of abstraction, and currently Hive is tightly integrated with HCatalog. Hortonworks intends to add support for HBase and Streaming later this year.

I waited all day to see Matei Zaharia’s talk on Spark. Zaharia is a graduate student at UC Berkeley and it was a nice change of pace to see a student present some work. Spark is a data processing platform that sits on top of the Mesos cluster management project (also produced by Berkeley). Mesos can handle tens of thousands of nodes and hundreds of concurrent jobs, and jobs can be isolated in Linux containers (e.g., OpenVZ). Spark aims to extend MapReduce for iterative algorithms, and interactive low latency data mining. One major difference between MapReduce and Spark is that MapReduce is acyclic. That is, data flows in from a stable source, is processed, and flows out to a stable filesystem. Spark allows iterative computation on the same data, which would form a cycle if jobs were visualized. The Resilient Distributed Dataset (RDD) serves as an abstraction over raw data, and some data is kept in memory and cached for later use. This last point is very important; Spark allows data to be kept in RAM for an approximate 20x speedup over MapReduce based on disks. RDDs are immutable and created through parallel transformations such as map, filter, groupBy and reduce. RDD immutability is similar to immutable types in functional programming languages. It does not mean that the dataset cannot change. Instead, it means that a new copy of the dataset is created, with the change included. The user can also perform actions on RDDs such as count, collect, etc. Some applications using Spark are traffic prediction (Berkeley), spam classification (Twitter), k-means, alternating least squares matrix factorization, and network simulation.
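
The difference between transformations, actions and immutability is easier to see in code. Below is a toy, single-machine sketch in Python; `ToyRDD` is a hypothetical class of my own and not Spark's API (which is in Scala), and it does nothing in parallel or in a distributed fashion.

```python
class ToyRDD:
    """Toy immutable dataset; a hypothetical stand-in, not Spark's RDD API."""

    def __init__(self, items):
        self._items = tuple(items)  # a tuple cannot be modified in place

    # Transformations: each returns a new ToyRDD and leaves this one unchanged.
    def map(self, fn):
        return ToyRDD(fn(x) for x in self._items)

    def filter(self, predicate):
        return ToyRDD(x for x in self._items if predicate(x))

    # Actions: each returns a plain Python value to the caller.
    def count(self):
        return len(self._items)

    def collect(self):
        return list(self._items)


numbers = ToyRDD(range(1, 6))
even_squares = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
result = even_squares.collect()   # [4, 16]
original = numbers.collect()      # [1, 2, 3, 4, 5], untouched by the chain above
```

Because every transformation produces a new dataset, an intermediate result such as `numbers` can be kept around and reused across iterations, which is the property that makes caching in memory worthwhile.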

The main takeaway from Hadoop Summit 2010 was Cascalog. I predict the main takeaway from Hadoop Summit 2011 is Spark.

One time at work I had a bizarre issue with corrupted data in HDFS. After that, I began blaming everything on HDFS. The next session Data Integrity and Availability of HDFS was enlightening. HDFS takes good care of Yahoo’s data. We can trust Yahoo because if HDFS breaks, Yahoo begins losing money so they know what they are talking about! Yahoo’s goal is to have 60 PB online all the time. The key to HDFS reliability is replication. A replication factor of 3 (3 copies of every file? block?) is appropriate. A replication factor of 2 is also quite robust, but should only be used when there is a backup of the data because the probability of data loss is much higher. Yahoo has had issues with losing blocks (blocks are pieces of data, so lost blocks = data loss). There are a variety of reasons and most of them had nothing to do with HDFS. One cause of lost blocks is a bug in a Hadoop component like Pig, particularly a new version. In one incident, a new version of Pig opened a lot of files without closing them, and created a lot of abandoned files. In the speaker’s anecdotal case study, none of the incidents of data loss were caused by HDFS proper. Other causes of data loss encountered were exhausting disk space, users hammering HDFS, and “other.” The speaker noted that NameNode high availability (a hot topic) would have only helped in 8 of the 36 incidents studied. Some ways of preventing data loss include resource allocation, selecting good tenants of a cluster, and fixing hardware errors quickly.

If your job isn’t running, it’s not likely caused by HDFS.

Bill Graham of CBS Interactive gave an interesting talk about using Hadoop to build a graph of users and content. Surprisingly, CBSi has quite a large arsenal of MapReduce enabled technologies: Chukwa, Pig, Hive, HBase, Cascading, Sqoop and Oozie. CBSi uses only 100 nodes with 500 TB of disk space for processing data associated with 235 million uniques (individuals, roughly). Mapping users to content should be easy, right? Well, some users have multiple identities, including anonymous identities. The goal is to create a holistic graph that “matches” all of the identities efficiently for uses such as ad targeting. CBSi’s needs in a Hadoop platform: rapid experimentation and data mining, and to power new site features and ad optimization. The main vehicle for representing data is a Pig RDF that allows for a kind of graph based join so to speak. CBSi hopes to add Oozie, Azkaban, HCatalog and Hama (graph processing) to its arsenal.

MapR was a very prominent sponsor of Hadoop Summit. M. C. Srivas presented a technical discussion of MapR’s capabilities and how it differs from Apache Hadoop. MapR is a full distribution of Hadoop and is 100% compatible with the Apache distribution and projects such as Pig and Oozie. MapR is fast and boasts high availability by rethinking the NameNode. The NameNode is a bottleneck since 60% of file operations are metadata. The NameNode and its limitations limit the size of a cluster. To resolve some problems with the NameNode, MapR turns every server into a metadata server. Since metadata is seldom retrieved, it is paged to disk so more RAM can be used for MapReduce proper. MapR distributes NameNode functionality and provides full random read and write semantics as well as export to NFS. With the distributed NameNode, runaway tasks no longer take down the NameNode. MapR has some lofty performance goals. While HDFS can handle 10-50PB, MapR can handle 1-10 EB (exabytes). While HDFS can handle 2000 nodes in a cluster, MapR can handle 10,000 or more. It was mentioned at BigDataCamp that MapR does not rely on HDFS at all.

The final session I attended was Avery Ching’s talk on Giraph: Large-scale Graph Processing on Hadoop. Unfortunately, Avery jumped right into the technical details of Giraph without giving a high level overview of the problem Giraph solves. Also, his slides were in 10 point font and I could not read them. Combined with the fact that my brain was exhausted, this made me want to head to the bar. Vanilla Hadoop incurs too much overhead for graph data processing. Yahoo used MPI in the past for graph data but it had no fault tolerance and was too generic. Giraph is a library for iterative graph processing. Giraph is fault-tolerant and dynamic. Giraph takes a vertex centric approach to graph data. I found this interesting because most of my work is edge centric. Overall, Giraph is similar in goal to Pregel, but available to non-Googlers and has no single point of failure (except those incurred by Hadoop).
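
Since the talk skipped the overview, here is my own rough sketch of what a vertex-centric, iterative computation looks like. It is a single-machine toy in Python and not Giraph's Java API: the function `propagate_max`, the example graph and the choice of "spread the maximum value" as the algorithm are all assumptions for illustration.

```python
def propagate_max(values, edges):
    """Every vertex repeatedly adopts the largest value it hears from a neighbour.

    values: dict mapping vertex -> initial number
    edges:  dict mapping vertex -> list of vertices it sends messages to
    Returns (final values, number of supersteps run).
    """
    values = dict(values)
    supersteps = 0
    changed = True
    while changed:
        supersteps += 1
        # Message phase: each vertex sends its current value along its out-edges.
        inbox = {vertex: [] for vertex in values}
        for vertex, neighbours in edges.items():
            for neighbour in neighbours:
                inbox[neighbour].append(values[vertex])
        # Compute phase: each vertex sees only its own value and its inbox.
        changed = False
        for vertex, messages in inbox.items():
            best = max(messages, default=values[vertex])
            if best > values[vertex]:
                values[vertex] = best
                changed = True
    return values, supersteps


final, steps = propagate_max(
    {"a": 3, "b": 6, "c": 2},
    {"a": ["b"], "b": ["c"], "c": ["a"]},
)
```

The logic is written from the point of view of a single vertex, and the loop stops once no vertex changes its value. In a distributed setting each superstep would be executed in parallel across machines, with the messages exchanged between them.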

Now I have to catch my breath with some wine, beer and cheese at the nice happy hour reception afterwards. It was a long day, and a great day at Hadoop Summit 2011 and I will of course be back next year. I have no clue what is in store for me next year. Will the NameNode be removed as the single point of failure? Will other open-source software start integrating Hadoop? We shall see…

And now it is time to head back to Los Angeles.

Have a happy and safe Fourth of July!

Big Data Camp 2011 #BigDataCamp

Ryan Rosario — Wed, 29 Jun 2011 06:35:45 +0000

It has been a while since I have been to Silicon Valley, but Hadoop Summit gave me the opportunity to go. To make the most of the long trip, I also decided to check out BigDataCamp held the night before from 5:30 to 10pm. Although the weather was as predicted, I was not prepared for the deluge of pouring rain at the end of June. The weather is one of the things that is preventing me from moving up to Silicon Valley.

The food/drinks/networking event must have been amazing because it was very difficult to get everyone to come to the main room to start the event! We started with a series of lightning talks from some familiar names and some unfamiliar ones.

Chris Wensel, the developer of Cascading, is also the founder of Concurrent, Inc. Cascading is an alternate API for Map-Reduce written in Java. With Cascading, developers can chain multiple map-reduce jobs to form an ad hoc workflow. Cascading adds a built-in planner to manage jobs. Cascading usually implies Hadoop, but Cascading can run on other platforms including EMC Greenplum and the new MapR project. RazorFish and BestBuy use Cascading for behavioral targeting. Flightcaster uses a domain specific language (DSL) written in Clojure on top of Cascading for large data processing jobs. Etsy uses a DSL written in JRuby as a layer on top of Cascading. Of course, the big player is BackType. Cascalog combines Cascading with the Datalog language to provide a declarative language for working with data and map-reduce. Wensel noted that one disadvantage of Pig and Hive that Cascading addresses is that Pig and Hive lack a physical planner. Workflow managers such as Oozie and Azkaban can run Cascading jobs as part of a workflow. Version 2.0 of Cascading removes Hadoop as a dependency and will allow users to run Cascading jobs on data that is in RAM rather than on disk.

James Falgout from Pervasive DataRush presented the second lightning talk. Pervasive’s products seem to use this “dataflow” paradigm that attempts to fill in features that are missing in map-reduce. The basic description compared dataflow to the Unix shell pipeline with message passing. James showed an example dataflow that a user could configure visually. Pervasive is working on integrating dataflow with Hive.

Guy Harrison from Quest Software introduced their system Toad for Cloud Databases. Toad attempts to merge data from several different data sources for analysis such as Hive, MongoDB, and Cassandra. Unfortunately, Guy’s thick Australian accent made his humorous talk unintelligible to me (hearing loss?).

Steve Wooledge from AsterData (now part of Teradata) discussed the company’s product goal of taking a standard relational database system and integrating map-reduce on top of it. Such a system is flexible and allows both SQL-like access as well as programmatic access to data. This hybrid row-oriented and column-oriented datastore can be used for path and pattern matching, text processing and graph traversal among the usual tasks. nPath is a product that enhances a system with transactional data analytics (click analysis, sessionization).

Andrew Yu from EMC presented some of EMC’s data analytics products. I wrote about EMC in an earlier blog post so I will spare the details. EMC offers a data warehouse product as well as a hybrid, pre-configured system containing its Greenplum warehouse and map-reduce built-in.

Ben Lee from Foursquare discussed how big data is used at Foursquare and gave some statistics about its service. This was by far the most interesting talk to me. Foursquare offers realtime suggestions of places to visit based on the user’s history, and the user’s friends’ histories based on day of week and time of day. Foursquare has 10 million users, 50 million venues, and 750 million check-ins. There are over 3 million check-ins per day. 10,000 developers use Foursquare’s API. MongoDB is the main datastore and Scala is used for the front end. Back end data processing uses Hadoop (both vanilla, and Streaming) as well as Flume, Elastic MapReduce, and S3. Ben displayed an awesome visualization of check-in data; researchers took check-ins from New York City and performed sentiment analysis on the text attached to the check-in. The visualization suggested that people were the “happiest” in Manhattan.

Paul Baclace introduced some software called Phatvis that allows developers to visualize map-reduce jobs. It is his hope that the visualization can be used to fine tune Hadoop parameters based on evidence from prior jobs. The source can be found here

Of course, the fun in every “unconference” is the circus known as scheduling the sessions. Some of the proposed sessions:

Big Data 101 / Intro to Hadoop
Extract MapReduce Data into Relational Database High Performance Database
“ETL was Yesterday” What’s next?
Operations of Hadoop Clusters
SQL / NoSQL Why not Both? (Aster)
Geodata
Big Data Retention / Compression
Business Intelligence and Hadoop
Data Management Lifecycle
Distributions of Hadoop
Hadoop for Bioinformatics and Healthcare

The topics did not seem exciting this time, and seemed to have a lot of overlap with presentations at Hadoop Summit, but I found two (we could only attend two) that stood out.

Session 1: Operating a Hadoop Cluster

Thank goodness managing a Hadoop cluster is not in my job description (only small clusters I use for research). Charles Wimmer, the lead of the Operations track for Hadoop Summit, led this discussion and much of the discussion dovetailed off of incidents that occurred at Yahoo. A popular topic of discussion was backup. There is no such thing as “backing up” a Hadoop cluster, we agreed. Any data that is important should be replicated, preferably 3 times, or transmitted in parallel over a pipe to multiple data centers. One strict limitation of replication is that if some new release of Hadoop, or some new Hadoop distribution, contains a bug that corrupts the data, all replicas may also be corrupted.

Discussion then turned to hardware. Yahoo uses high-density storage nodes with 6 drives each containing 2-3TB of space. Charles mentioned that a common problem with Hadoop is that it is difficult to keep the CPUs busy especially in a server with 8 Nehalem processors (8 CPUs or 8 cores?). The major reason for this is that the main bottleneck in map-reduce jobs is the network I/O required in the shuffle phase as data comes out of the mappers. The map phase is the most CPU bound phase. Wimmer, and several others, made one thing clear: use SATA, not SAS. Apparently SATA and SAS drives have similar read performance (I believe I misheard that) for practical purposes. The original Google map-reduce was based on commodity hardware and quantity is more important than quality (within reason). For this reason, SATA provides a lot more space for your data. The same amount of space is an order of magnitude more expensive for SAS drives.

The next topic of discussion was the NameNode as the single point of failure. Apparently the MapR system does not use the HDFS, and recovering from a lost NameNode is not as severe as it is for Hadoop. Hadoop 0.20.2 also supposedly introduces sharding, called NameNode federation, where the namespace is divided over several NameNodes.

Hadoop has some issues with certain types of scalability, particularly with the JobTracker. When a large job with a large number of mappers and reducers finishes quickly, the TaskTrackers send an influx of messages to the JobTracker and it gets overwhelmed. To prevent users from thrashing a cluster, use a capacity scheduler to put hard caps on queues. There was also some high level discussion of QoS-like functionality among users and sophisticated monitoring of jobs. Map-Reduce NextGen improves scalability by allocating a JobTracker to each individual job whose purpose is solely to monitor resource allocation. The biggest feature Charles would like to see is high availability NameNodes.

Yahoo boasts an impressive 22 clusters each containing between 400 and 4200 nodes. A fellow from AOL indicated that AOL has a cluster of size close to 1000. Is AOL coming back from the dead?

Session 2: Geodata

I do not get the opportunity to work with geographical data often, so I was curious to see what these folks had to say. The discussion was led by a fellow named Brian from OSGeo. The largest point that I took away from this talk was that not an incredible amount of thought has been dedicated to Big Geodata, particularly how to store and process it. PostgreSQL and PostGIS are a few ways to store and analyze manageable amounts of data, but not large data. MongoDB is one solution but has its issues. A fellow from Foursquare mentioned that MongoDB cannot shard across geographic data, but I could not hear precisely what he said to that effect. The biggest challenge seems to be a lack of an indexer capable of indexing a large amount of geospatial data aside from the standard RTree implementation. I believe that geodata as well as streaming data and multimedia are some of the biggest unsolved problems in Big Data.

Anyways, on to Hadoop Summit!

Google — Is Search-by-Multimedia on the Way?

Ryan Rosario — Tue, 21 Jun 2011 17:00:00 +0000

Recently, I have been thinking about alternate ways of specifying search queries other than with text. A couple of weeks ago I came across a piece of music that I could not identify. I thought it would be a huge win for a search engine to allow me to upload this piece, and it would present me with matches, or near matches to other pieces that sound similar, or have similar characteristics. Some services already exist. Shazam allows a user to place a microphone near playing music and it will identify the artist and song. Some uses of search-by-sound:

Music identification (“solved” – Shazam)
Music personalization and recommendation (“solved” – Pandora)
Identification of the source of a sound (e.g., a species of bird, a musical instrument, an inanimate object)
MP3 and media file search
Finding material that violates copyright

As our motivating example, suppose we find some really cool graphic on the web and we want to know where it likely originated (e.g., art, a meme). In such a search engine, we could upload the graphic and get results containing the exact image, or images that are very similar such as variations of the image (crop, resize, borders, different effects), modifications of the image (consider Obama-izing [the campaign logo] someone’s picture), and semantically similar images (different photos of the same object or person). Wouldn’t this be cool? A billion-dollar idea, right?

Well, Google apparently beat me (and millions of others I’m sure) to it with its search-by-image feature on Google Images. I uploaded a photo of myself to see what I would get. We see my school website (where the image originated), as well as several other sites that use my Gravatar. Not too bad.

On the results page, users can also provide some type of labeled data to Google. I am not exactly sure what it is used for yet, but note the text in the search bar: “Describe this image.” Upon entering my name, Google found another photo that looks almost identical to the first one — a variation.

Below are the “visually related” images that were presented to me (before I labeled my photo in the search bar):

I see Steve Jobs (I am honored), but 7 out of 16 images are women, and of the men, we look nothing alike. I know, I know, “visually related” refers to similarity in pixels between images, but I expected more. In these images, we see a lot of red and blue hues.

Let’s try something that will generate many more hits: a popular meme…

The image I uploaded was originally posted on Amazon S3, and is linked to by the above two web pages. Google does a much better job when using a URL rather than uploading an image for obvious reasons. More interestingly, the “visually similar” images show variations and modifications of the same image, based on pixel similarity.

And we also see web pages containing a copy of the image (not linked to the original S3 file):

But this Isn’t Good Enough Yet!

Google “Search-by-Image” is an awesome first step, and I look forward to seeing more as it is undoubtedly coming. For search-by-image (or search-by-multimedia) to be useful, it must also take “semantic” or conceptual knowledge into account, just like with text search. That is, if I upload a photo of myself, I should get back other photos of myself from various (hopefully authorized) sources. Or, if we upload a photo of the Eiffel Tower, we should get back integrated search results containing other images of the Eiffel Tower as well as text results with information about the Eiffel Tower, and perhaps a tourist’s video or documentary.

One may at first believe that the O RLY search used some semantic knowledge; however, all of the images share a large number of pixels and these images are likely just “visually similar” as stated. Using semantic knowledge, one may see results of other famous owls used in memes in addition to the variations and modifications of the O RLY owl.
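To make “visually similar” concrete, here is a toy sketch of one pixel-level comparison, an average hash compared by Hamming distance. This is an illustration only; it is an assumption on my part and not a description of how Google’s feature works. Images are plain 2D lists of grayscale values so that nothing beyond the standard library is needed.

```python
# Toy illustration of pixel-level similarity; NOT how Google's feature works.
# Images are plain 2D lists of grayscale values (0-255) to avoid dependencies.

def average_hash(pixels):
    """Return a tuple of bits: 1 where a pixel is brighter than the image mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)


def hamming_distance(hash_a, hash_b):
    """Count the positions at which two equal-length hashes differ."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return sum(a != b for a, b in zip(hash_a, hash_b))


# A small checkerboard-like image.
original = [[200, 200, 10, 10],
            [200, 200, 10, 10],
            [10, 10, 200, 200],
            [10, 10, 200, 200]]
# The same picture, slightly darker: every pixel shifted by a constant.
darker = [[max(value - 5, 0) for value in row] for row in original]
# A different picture: left half bright, right half dark.
different = [[200, 200, 10, 10] for _ in range(4)]

print(hamming_distance(average_hash(original), average_hash(darker)))
print(hamming_distance(average_hash(original), average_hash(different)))
```

A small distance means the two images share most of their light and dark regions, which says nothing about whether they depict the same thing. That gap is exactly the difference between “visually similar” and semantically similar.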

All of the data collected by such a system would also provide a hell of a corpus for image and multimedia classification. Researchers could construct classifiers for detecting spammy multimedia, knockoff multimedia (second, third generation grain in images, waveform distortion in audio), pornographic content, as well as augmenting labeled and unlabeled multimedia with metadata. For example, suppose I take a picture of what I think is a rhododendron (inside joke for readers). With such a large corpus, I can upload the photo and have Google (or some other AwesomeSearch) retag the image as that of a hydrangea instead.

Uses of search-by-multimedia with semantic knowledge:

Cross referencing objects or people on various different sites.
Product search when textual information (or QR code) is not known
Catching criminals
Cataloging media
Methods for multimedia spam detection
Geolocation without use of GPS or WiFi, and location search
Augmentation of metadata and tagging of objects, people, etc.
Detecting adult, inappropriate or illegal content.
Identification of actions from images, video or audio and retrieval of related information

Of course, search-by-multimedia poses the same challenges that we face in big data today:

choosing and boosting the proper features
collecting a significant and correctly labeled corpus
fast processing of large datasets with new and existing machine learning algorithms
efficient indexing and retrieval algorithms to match queries with probable results

These things are easier said than done, but a lot of fun.

Search-by-multimedia is a very interesting concept and is exciting to think about. In this age of big data and technology, anything is possible. I look forward to the day when anything on the Internet can be found, no matter its content or medium.

To check out Google’s search-by-image, click here and then click on the camera icon.

Want to Build a Research Server?

Ryan Rosario — Tue, 31 May 2011 17:00:00 +0000

I am usually pretty reserved with cash, but after working full-time for six months, I finally decided to spend some of my money on building a new research development server. This process was long overdue and the reason it took me so long to commit to this project was all of the new technology developed since building my last server. This “new technology” can be pretty confusing unless one specializes in computer architecture. I want to share what I have learned throughout this process, while giving some background. These are only my opinions, and I may be wrong on some things as I am not a hardware expert. I encourage you to read and learn more on your own.

The CPU/Processor

If you are reading this article, I probably do not need to explain what the CPU/processor does. For high performance computing, you will want to get a CPU that is very “fast” and also has multiple cores. The definition of the word “fast” is in the eyes of the beholder and typically refers to more than just clock speed (GHz). In the constant war between AMD and Intel, I stick with Intel. AMD processors are powerful, but they seem to have more of a market with gamers. Intel is my preference, but I have not yet run into anyone that feels strongly towards AMD for high-performance computing (HPC). There are two main processor lines under Intel: standard, and Xeon. Standard processors are your run of the mill CPUs that are found in consumer desktop machines. Xeon processors are designed for non-consumer server, workstation and embedded systems use. I do not consider researchers as “consumers,” we are producers, so the Xeon family is better suited to our needs. On the other hand, you may find that a standard CPU will fit your needs for your particular research or use case. Xeon processors typically have more cache and more multiprocessing capabilities…and they are a lot more expensive. For high-performance computing, I strongly suggest Intel Xeon.

After months of research, I have concluded that multiple Intel Xeon processors are better than one Intel Core i7. As of the time of this writing, it seems that i7 processors cannot be doubled (or tripled etc.) up like Xeons can. Like the AMD, the i7 seems to be favored by gamers and those needing a richer multimedia experience.

In 2011, most CPUs in new systems have multiple cores. Each core can essentially run one process at a time, so a system with n cores can run n processes simultaneously. Many CPUs are hyperthreading enabled, meaning that each core can actually run 2 threads simultaneously, bringing the total number of threads to 2n. But can’t the system already run multiple processes concurrently? We can run Firefox, TweetDeck, Thunderbird etc. concurrently, right? On a single core, it only seems that the CPU is processing multiple threads simultaneously. If we could slow down time to the micro level, we would see that the CPU works on one process at a time, then does a context switch to another process. This gives the illusion that the CPU is running multiple processes simultaneously.

While Intel makes great products, its inventory is a nightmare to navigate. There are several things that you must know to ballpark a particular CPU model.

the model number (the most reliable!)
the brand name specifies a group of CPU models satisfying similar use cases (Core [i3/i5/i7/i9], Core 2 Duo, Quad Core, Pentium, Xeon).
the architecture/subarchitecture — specifies a type of processor within a brand, each containing many series (Nehalem, Westmere, Sandy Bridge are common ones these days)
the chipset (not commonly referred to, examples: Tylersburg, Cougar Point, Panther Point)
the platform which refers to a set of models (e.g. Harpertown, Jasper Forest, Gainestown, Prescott, Gulftown). Models within a series are typically only differentiated by clock speed (GHz).
the socket type specifies the shape and size of the CPU. The CPU and the motherboard must have the same socket type (i.e. LGA1366, Socket 775)

As if this is not confusing enough, each Intel Xeon model number is prefixed with a letter for different use cases. The letter distinguishes CPUs with differing thermal design power (TDP). (source)

W stands for “Workstation” and is meant to be installed in pairs. This designation does not seem very common anymore. They typically run the fastest (clock speed) and the hottest. They require significant cooling.
E is “mainstream (rack mount)” and the standard model of CPU. Although it is “standard,” there is nothing wrong with its performance, but it will run hot even when idle.
X stands for “performance”; these are similar to E but provide extra overclocking capabilities and have lower idle power draw.
L stands for “power optimized” and are low voltage CPUs (60W or less) that are typically only used for data centers or rack servers. They typically do not come in the higher clock speeds etc.

For the Intel Xeon, model numbers indicate what configuration it is compatible with on the motherboard (source):

3xxx Xeons are designed to be used by themselves, as the only CPU on the motherboard.
5xxx Xeons are designed to be used in pairs; two CPUs on the motherboard.
7xxx Xeons are designed to be used in pairs, or in larger groups.
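The two naming rules above (prefix letter, then first digit of the model number) can be sketched as a small lookup. This is a minimal sketch that only encodes what the lists above say about the 2011-era scheme; the function name `describe_xeon` and the dictionary names are hypothetical, and anything outside those lists is reported as unknown.

```python
import re

# Meanings copied from the two lists above; anything else is "unknown".
PREFIX_MEANING = {
    "W": "workstation",
    "E": "mainstream (rack mount)",
    "X": "performance",
    "L": "power optimized",
}
SERIES_MEANING = {
    "3": "single CPU on the motherboard",
    "5": "used in pairs",
    "7": "used in pairs or larger groups",
}


def describe_xeon(model):
    """Split a model string such as 'E5645' into prefix letter and 4-digit number."""
    match = re.fullmatch(r"([A-Z])(\d{4})", model.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized model format: {model!r}")
    prefix, number = match.groups()
    return {
        "prefix": PREFIX_MEANING.get(prefix, "unknown"),
        "configuration": SERIES_MEANING.get(number[0], "unknown"),
    }


print(describe_xeon("E5645"))
```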

The 2 CPUs that I purchased are model Intel Xeon E5645. The Intel Xeon E5645 is part of the Gulftown platform of the Xeon family. It uses the Westmere subarchitecture, which is the 32 nm shrink of the Nehalem architecture spec, and connects to the system bus using socket LGA1366. (This is the same architecture used for the i7-9xx series, to make it more confusing.) The E means that it is a “mainstream” CPU. Since it is a 5000 model, it is installed with another identical CPU on the same board.

The number of cores is important. Most chips in current desktops contain 2 or 4 cores. Higher end systems and servers may have 6, 8 or 10 cores per chip. Xeons with 8 and 10 cores per unit debuted in Q2 of 2011 and are very expensive (about $2000 for 8 cores). They also require a brand new socket type (LGA1567), which means a new, expensive motherboard. A CPU with more cores allows an application to perform several units of work at the same time; these processors allow higher bandwidth.

The clock speed (GHz) used to be the deciding factor for most people, until Moore’s Law broke down. Higher clock speed possibly allows a single process to complete faster. Since games typically use a limited number of threads and require quick performance, a single i7 is a good choice. The i7 has multiple cores, and also has a very high clock speed.

The cache size and speed is also important. The cache allows very high speed access to memory locations that are frequently accessed by copying the data from RAM into the CPU cache. Modern systems typically have three levels of cache: L1, L2 and L3. L1 cache is said to be the “closest” to the CPU, meaning the CPU queries the L1 cache first when performing a memory access. The L1 cache is the smallest. The L2 and L3 caches are accessed next in order, and L3 cache is larger than L2 cache. Very simply put, CPUs with larger caches (especially L1) are better.

Newer processors report CPU throughput as gigatransfers per second (GT/s) which, like GHz, quantifies some measure of “speed.” Using GT/s, one can compute the number of bits the CPU can transfer per second as the transfer rate multiplied by the number of bits moved per transfer (the width of the channel).
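To make the GT/s arithmetic concrete, here is a minimal sketch that multiplies transfers per second by the bits moved in each transfer. The 16 data bits per transfer is an assumption (the figure commonly quoted for one direction of an Intel QuickPath link), and 5.86 GT/s is just an example input; substitute the numbers for your own bus.

```python
def bits_per_second(gigatransfers_per_second, bits_per_transfer):
    """Raw data rate = transfers per second x bits moved in each transfer."""
    return gigatransfers_per_second * 1e9 * bits_per_transfer


# Example input: 5.86 GT/s with an ASSUMED 16 data bits per transfer.
rate = bits_per_second(5.86, 16)
print(f"{rate / 1e9:.2f} Gbit/s, or {rate / 8e9:.2f} GB/s")
```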

Think of the cores vs. clock speed decision as a highway. Suppose the clock speed indicates the maximum speed limit on a single lane highway. A faster CPU corresponds to a single lane highway with a high speed limit. You will get to your destination faster. On the other hand, consider a one-lane vs. a two-lane highway, both with identical speed limits. If one lane is too busy for you, take the other lane. An increase in the number of cores increases the number of choices of lanes you can transition to. On the single-lane highway, you would need to slow down and wait for the cars in front of you to move forward. By switching lanes, you may get to your destination faster, or you may not, but more driving is completed overall.

Review of 2011 Data Scientist Summit

Ryan Rosario — Fri, 13 May 2011 23:34:53 +0000

Some time over the past 6 weeks I randomly saw a tweet announcing the “Data Scientist Summit” and shortly below it I saw that it would be held in Las Vegas at the Venetian. Being a Data Scientist myself is reason enough to not pass up this opportunity, but Vegas definitely sweetens the deal! On Wednesday I woke up at 6am to embark on the 5.5 hour voyage to Las Vegas.

The Pre-Party

The Venetian and all close hotels were booked, so I ended up at the Aria; a new experience. The hotel is beautiful and very ritzy. I had heard that the rooms were very technologically advanced but I wasn’t prepared for the recorded welcome message, music and automatic shades opening upon entry to the room. The Aria is a geek’s paradise. Everything is computerized. Key cards are “waved” rather than swiped, lights are turned on/off and dimmed by use case (“sleep”, “read” etc.), rather than manually. There are no paper “Do Not Disturb” signs; rather, a switch on the wall (or via TV) toggles an indicator light outside the door. And the best part… Internet is FREE!


The rhododendrons hydrangeas are real!	Work desk panel contains Ethernet, power, USB, VGA, audio.	Cables, provided you want to pay the minibar charge.

Data Scientist Summit, Day 1

I arrived at the conference room and quickly took my seat. Seated in close vicinity were several familiar faces. I also finally got a chance to meet Drew Conway (@drewconway) and David Smith (@revodavid), both of whom happened to sit in the row in front of me. The keynote by Thornton May provided a lot of humor that kicked off a very energetic event. In the second session, we heard from data scientists and teams from Bloom Studios, 23andMe, Kaggle and Google. I was happy to see somebody from Google present, as they never seem to attend these types of events (neither does Facebook).

There has been a lot of buzz about 23andMe and Kaggle in the past few months. It is hard to keep up with all of the buzz, so it was great to hear from the companies themselves. 23andMe provides users with a kit containing a test tube into which the user spits. The kit is then sent back to 23andMe labs, which analyze something like 500,000 to a million different markers (I am not a biologist) and can provide information about which markers are present, such as a predisposition to diabetes or cancer. In 2011, it costs about $5,000 to do this analysis whereas 10 to 20 years ago the figure was in the millions. 23andMe goes a step further. They understand that genetics have a strong association with particular conditions, but that they are not necessarily causal. For example, someone with a predisposition to diabetes will not necessarily contract the disease. 23andMe wants to integrate other data into their models to help predict how likely a patient is to contract a certain condition, given their genetics.

Kaggle is a community-based platform for individuals and organizations to submit datasets and open them up to the Data Science community for analysis…as a competition. I love the geekiness of this endeavor, and it continues where the Netflix Prize left off. Kaggle has some awesome prizes for winning the competition such as $3M for the Heritage Health Prize. There are other freebies as well, such as Revolution R Enterprise free for competitors.

As a disclaimer, I am not a huge visualization guy. I see its importance and usefulness in educating end-users about statistical results, and there are quite a few infographics that are exciting to me. However, there are many times when a boring ol’ boxplot works better than a Processing applet. So, it takes quite a bit to get me excited about cutting-edge graphics. The Immersive Data Visualization session by Dr. JoAnn Kuchera-Morin from UC Santa Barbara did exactly that. They have created a large metal sphere, called AlloSphere, containing a bridge in the center where researchers/analysts stand. Their data is projected, for the eye, throughout the ball in a 3D, or 3D-like world. Of course, data can be represented several ways to the eye: color, size, shape, texture, etc. AlloSphere also represents data using the other senses, particularly sound. In her presentation, JoAnn took us on a 3D tour of her colleague’s brain (fMRI). Of course, we could “see” the inside of the brain, but we could also hear the blood pressure change in different parts of the brain, indicating differing activities. There were some other demonstrations of studies from physics, but I cannot comment on those because I lost interest (physics has always been my worst subject). I attended UC Santa Barbara for one year after high school, so I am particularly proud of what they have done.

Of all the presentations on the first day, Data Scientist DNA was my favorite. In this panel, Anthony Goldbloom of Kaggle, Joe Hellerstein from UC Berkeley, David Steier from Deloitte and Roger Magoulas from O’Reilly Media discussed what makes a good Data Scientist or “data ninja” as stated in the program. All were in agreement that candidates should have an understanding of Probability and Statistics, although someone on the panel suggested that a “basic” background was all that was needed; I disagree with that. A Data Scientist should also be a proficient programmer in some language, either compiled or interpreted, and understand at least one statistical package. More importantly, the panel stressed that above and beyond knowledge, it is imperative that a Data Scientist be willing to learn new tools, technologies and languages on the job. Dr. Hellerstein suggested some general guidelines on classes students should take: Statistics (I argue for a full year of upper division statistics, and graduate study), Operating Systems, Database Systems and Distributed Computing. My favorite quote from the panel came from David Steier, “you don’t just hire a Data Scientist by themselves, you hire them onto a team.” I could not agree more. Finally, the moderator of the panel suggested that Roger Magoulas may have been the one to coin the term “big data” in 2005, but a Twitter follower found evidence that the term has been used since as early as 2000 (Thanks Amund! @atveit).

The last session of the day was given by Jennifer Pahlka from Code for America, titled Imagining and Enabling a Better World. Pahlka started her talk by stating that the millennial generation is the most “pro-government” generation of the modern day. Regardless of politics, millennials see potential in the government and believe that it can be used for good. Jennifer compared Code for America to Teach for America for Data Scientists. The goal of Code for America is to put together very bright minds to tackle local, state and federal government issues using data. Pahlka brilliantly stated, “we don’t need guns, we need geeks. We are trying to create a geek army.”

During the end of day cocktail reception, I scored two posters of data visualizations: “super powers” and “game controllers over the years.” The other two posters offered were “beers” and “rappers.” I also had a chance to quickly meet Tim O’Reilly, Founder and CEO of O’Reilly Media, whose books are my favorite for learning programming languages and technologies (the animal books).

Data Scientist Summit, Day 2

Personally, I enjoyed the second day more than the first day but that may have been due to the fact that I got sleep the night before.

It seemed that the highlight of the morning was the talk by Jonathan Harris titled The Art and Science of Storytelling. He introduced his project “We Feel Fine” which is a conglomeration of emotions. His project aims to capture the status of the human condition. This was more of the touchy-feely kind of presentation which is different from most of the Data Science talks. He showed beautiful user interfaces and great examples of fluid user experience. Some statistics that caught my eye regard human emotion over time. It seemed that people experienced loneliness earlier in the week than later in the week. Joy and sadness were approximately inversely related throughout the week and hours of the day, but I cannot remember the direction of the trends. The most interesting graphics involved the difference between “feeling fat” and “being fat.” States like California and New York were hot spots for “feeling fat”, but they are actually some of the skinniest states. Instead, the region between the Gulf of Mexico and the Great Lakes was actually the fattest, but did not feel that way. A graphic for “I feel sick” showed a hotspot in Nevada which I thought was very interesting (nuclear fallout? alcohol poisoning in Vegas?). The interesting part of this discussion was that it showed the vast geography of the field called Data Science. Some Data Scientists are more of the visualization and human connection variety, and others (where I consider myself) are more of the classic geeks that like to write code and dig into the data to get a noteworthy result. Well, I guess there isn’t much difference between both camps after all. As Jonathan would probably say, Data Science is about storytelling.

The next few sessions got a bit blurry (as is Data Science); they talked about various interconnected topics. Pete Skomoroch from LinkedIn, Sharon Franks Chiarella from Amazon Mechanical Turk, Gil Elbaz from Factual and Toby Segaran from Google discussed the fact that you can’t turn data into a story without joining the data with, well, other data. Another major topic discussed was how to get labeled data, and this is where Mechanical Turk stands out as a data resource. The next talk was humorously titled Hadoop – The Data Scientist’s Dream. I know some people that would gouge their eyes out when seeing that title. Really, Map-Reduce is the Data Scientist’s dream, but yeah, yeah, I know, Hadoop is the first widely accepted implementation. Paul Brown from Booz Allen Hamilton and Martin Hall from Karmasphere discussed how Hadoop is typically being used in production and briefly mentioned how Hadoop’s cousins make the Hadoop ecosystem more powerful.

The last session in this trifecta was titled The Data Scientist’s Toolset – The Recipes that Win. Representatives from various companies were panelists: SAS, Informatica, Cloudscale, Revolution Analytics, and Zementis. I felt that this discussion was lacking. The strength of the Data Science community stems from open-source technology I believe, and except for Revolution Analytics, none of the companies have a strong reputation in the open-source community yet. Discussion seemed to focus too much on enterprise analytics (SQL, SAS, Greenplum etc.) and Hadoop, and not enough on analysis and visualization. All in all, this panel was a bit too “enterprisey” for me. Some Twitterers felt that they were pushing their products too much. This was surprising because I felt the exact opposite, unless they were picking up on the “enterprisey” vibe. The panelists were asked what one tool for data science they would choose if they were on a desert island. The panelists responded with the following tools, “Perl, C++, Java, R [sic, thanks David], SQL and Python.” I was disappointed that SQL was mentioned without a countermention for NoSQL because not all data fits in a nice rectangle called a table. By itself, SQL is very limited. Python and R I definitely agree with. Perl is dated, but still has a use in the Data Scientist’s toolbox if the user is not familiar with Python, and doesn’t want to be. I was baffled by the C++ response and the lack of overlap in the other responses. But these are my opinions only.

The Summit Spotlight, Secrets of Attribution – The Stories Beyond the Last-Click, discussed how researchers are trying to use data to “give credit” to not only the site that referred the user to a resource via a click, but all of the sites in the path that led to that click, the so-called “conversion path” in SEO land. The final session, Building Data Science Firepower – Taking the Leap, was very similar to the Data Scientist DNA talk but added in some food for thought. There are two philosophies for hiring and working with Data Scientists. The first is to hire a strong data science team, and the second is to enhance each team with Data Scientists.

2011 EMC Data Hero Awards

At the end of the summit, the recipients of the EMC Data Hero Awards were announced. I missed some of the honorable mentions, but here goes:

Consumer Services, LinkedIn.
Energy, Silver Spring Networks.
Health Care, Jeffrey Brenner, The Camden Coalition.
Life Sciences, The Broad Institute of MIT and Harvard.
Media, CMU Create Lab.
Public Services, Global Virus Forecasting Initiative.
Technology Application, IBM Watson Computing System.
Technology IT Infrastructure, Apache Foundation, Hadoop.
Visionary, Vivek Kundra, CIO, Data.gov.

Vivek Kundra was not present at the summit, but recorded a message to the attendees, which was really cool. He stated that in 2009, there were only 47 government datasets publicly available; in 2011, there are close to 400,000 datasets available to the public.

There were some interesting honorable mentions. Zynga received an honorable mention for Consumer Services. As a player of Farmville and Cityville, I can see the plethora of data that Zynga must work with. Additionally, Zynga has some very creative ways for advertising for brands such as McDonald’s and Tostitos (with Farmville items for both companies), 7-11’s new slurpee (seeds for the Goji Berry), and GagaVille. Zynga also participates in community service: “Sweet Seeds for Haiti” (pay to plant special seeds, with proceeds to Haiti) just to name one.

Tableau Software also received an honorable mention for the Media category. Tableau develops data visualization software, and is picking up huge steam in the data viz community.

The conference ended with an awesome video created by EMC called “I Am a Data Scientist” featuring several EMC Data Scientists, most of which I happened to have lunch with!

Overall Impression

All in all, the Data Scientist Summit was an eye-opening and empowering event, and it was only planned in six weeks. There was a great sense of community and collaboration among those in attendance. I work as a Data Scientist professionally because I love it. The one fact that I tend to overlook is that Data Scientists are in high demand and short supply. I was reminded of how important our work as Data Scientists is.

This was the first annual Data Scientist Summit, and I will no doubt be back. With that said, discussion of technical topics had a bit of an introductory flavor, which made the discussion of the technology seem dated. For example, “Vanilla” Hadoop was introduced as a tool for processing vast amounts of data. I would expect that most Data Scientists have worked with Hadoop, or at least know what it is. Hadoop is somewhat old news in terms of “cutting-edge technology.” Tools like Pig, Cascalog, HBase, Hive, Cascading, etc. would have been better discussion topics. I was also disappointed with how little coverage of data mining tools there was (except for Hadoop, NoSQL, and enterprise databases). It seemed as if R had gone M.I.A. and I was surprised that there was so little discussion of visualization tools like Tableau, Processing, Gephi, D3, Polymaps, etc.

The Data Scientist Summit set a very solid foundation for the future. I felt like the modus operandi was “here is why Data Science is cool” and “here is why others should be interested.” Although this is not a groundbreaking discussion, it sets the stage for future conferences and solidification of the community. The people that probably got the most value out of the technical discussions were people looking to switch careers, or enter Data Science.

Without a doubt I will be at next year’s Data Scientist Summit!

My thoughts and opinion on this blog do not reflect those of my employer, the Rubicon Project.

EC2 Trials and Tribulations, Part 1 (Web Crawling)

Ryan Rosario — Wed, 11 May 2011 17:00:00 +0000

Elastic Compute Cloud (EC2) is a service provided by Amazon Web Services that allows users to leverage computing power without the need to build and maintain servers, or spend money on special hardware. The idea is simple: the user “boots” up one or more machines and then accesses those machines as if they were logged into any other machine remotely. I used EC2 and Elastic MapReduce extensively for my M.S. thesis last spring, but mainly used its large memory capabilities rather than its potential for explicit parallelism.

Recently, I ran a crawling job on EC2 using a parallel crawler I wrote in Python with twill. Using EC2 poses its own challenges. Using parallel code poses more challenges. Combining these two facts with the fact that crawling is I/O bound can create some more interesting challenges. If you have taken a course in operating systems, you have heard this stuff over and over again. So have I, but I am stubborn. I tend to learn lessons from experience, and this was no exception. Through this series of posts, I want to point out difficulties and “gotchas” that are important to keep in mind when using EC2, and in this post, with using parallelism in your code to accomplish large tasks.

Monitor your Instances

Monitoring your instances has two important benefits. First, to make sure that you are not maxing out resources on the machine. EC2 is “elastic.” With some clever programming, you can boot up more machines if you notice resources becoming scarce on your current machines, and then decommission them later when they are not needed. I did not do this at first, and I ran into several issues.

Disk Space. The concept of a “disk” is very confusing in EC2. The AMI forms a disk, sort of. Above and beyond the OS and any other software and packages you may install as part of the AMI, you can use whatever free space is remaining to store output files. The total disk space used by the AMI seems to be configured at the moment the AMI is constructed. Thus, it is not a good idea to store files in the instance. I did this. Fortunately, I found out before it was too late that my “disk” was filling up. I wrote a cron job to copy all of my output files to /mnt every five minutes. Use /mnt to store your files as it has lots and lots of space; HOWEVER if you terminate your instance, the files are gone. This is still true if you use space within the instance. Once your job completes, upload your files to S3. s3cmd allows access to S3 from the command line, and with the modification here, you can upload and download files in parallel (a life saver for big batches). Another option is to create an EBS volume, mount it, and write files directly to the EBS volume. EBS space is much more expensive than S3 space.
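The cron job described above can be approximated in Python. This is a minimal sketch under my own assumptions, not the script I actually ran: the directory names, the `*.html` pattern and the 10% threshold are all hypothetical, and the files are moved rather than copied.

```python
import shutil
from pathlib import Path


def free_fraction(path):
    """Fraction (0.0 to 1.0) of the filesystem holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def relocate_outputs(source_dir, dest_dir, pattern="*.html"):
    """Move finished output files to a roomier volume; return how many were moved."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    moved = 0
    for path in sorted(Path(source_dir).glob(pattern)):
        shutil.move(str(path), str(dest / path.name))
        moved += 1
    return moved


# Hypothetical usage, e.g. run every few minutes from cron:
#     if free_fraction("/") < 0.10:
#         relocate_outputs("output", "/mnt/crawl")
```

Remember that anything under /mnt still disappears when the instance is terminated, so this only buys time until the files are uploaded to S3.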

Memory. On my first attempt, I maxed out memory to the point that the OS killed 6 of my 8 processes. This dealt a huge blow to the performance of my crawler and wasted the extra money I spent on an extra large instance. Monitor your job’s memory using top. If memory usage seems to grow too fast for your liking, consider using a memory profiler to make sure that there are no memory leaks in your code. I have found that long running Python processes eat up a lot of RAM, even if there are no explicit growths of data structures.

Additionally, maxing out RAM means that the disk will begin to swap. This is devastating to performance because this extra grinding of the disk decreases the total I/O throughput your job can handle. This is crucial for crawlers as files need to be written to disk quickly.

If after profiling you find that your job is still using too much RAM, consider caching, or using a high memory EC2 instance.
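One way to do the profiling mentioned above is the standard library’s tracemalloc module (available in Python 3.4 and later, so newer than the crawler in this post). The sketch below compares two snapshots and reports which source lines gained the most memory; `crawl_batch` and `leaky_cache` are hypothetical stand-ins for a crawler step that quietly accumulates data.

```python
import tracemalloc


def top_allocations(workload, limit=3):
    """Run `workload()` and return the source lines that gained the most memory."""
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    workload()
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # Entries are sorted with the largest change in allocated size first.
    return after.compare_to(before, "lineno")[:limit]


leaky_cache = []  # hypothetical structure that is never emptied


def crawl_batch():
    # Stand-in for crawling: keeps every "page" in memory.
    leaky_cache.extend(("page %d " % i) * 50 for i in range(10_000))


for stat in top_allocations(crawl_batch):
    print(stat)
```

The line that fills `leaky_cache` should appear at or near the top of the report, which is the kind of hint that points to a structure growing without bound.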

I/O Throughput. How fast your job consumes and produces data is a good way to determine if something is going amiss in your job, or with the other resources you are using. When I started my crawling job, I was crawling n pages per hour, but after twelve hours, the throughput decreased exponentially until it got so slow that I had to add more instances. One way to monitor throughput is to save the results of ls -latr --full-time to disk and extract the date/time of each file. Using a tool like R, you can quickly plot your I/O throughput over time using an aggregate(). A decrease in I/O throughput can be the result of many things: 1) swapping from exhausting RAM, 2) low disk space, 3) network congestion within AWS, 4) poor resource performance (if crawling, the resource would be the website being crawled), 5) hammering an external resource and/or HTTP throttling, 6) congestion in the Internet. For crawling, you may want to consider using several smaller instances rather than fewer larger instances. This way, you will be accessing the resource from many IPs and the result of being throttled should be lessened. Additionally, use instances that have “High” I/O performance; some are rated “Moderate” or “Low.”
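The same files-per-hour summary can be computed directly in Python instead of with ls and R. This is a swapped-in alternative to the approach described above, not the method I used; it reads each file’s last-modified time and buckets by hour (in UTC, an arbitrary choice for the sketch).

```python
from collections import Counter
from datetime import datetime, timezone
from pathlib import Path


def bucket_by_hour(timestamps):
    """Map 'YYYY-MM-DD HH:00' (UTC) to the number of timestamps in that hour."""
    counts = Counter(
        datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:00")
        for ts in timestamps
    )
    return dict(sorted(counts.items()))


def files_per_hour(directory):
    """Hourly count of files in `directory`, based on last-modified times."""
    mtimes = [p.stat().st_mtime for p in Path(directory).iterdir() if p.is_file()]
    return bucket_by_hour(mtimes)


# Hypothetical usage: print one line per hour for the crawler's output directory.
#     for hour, count in files_per_hour("output").items():
#         print(hour, count)
```

A steady drop in the hourly counts is the signal to start checking the six causes listed above.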

CPU. A general rule of thumb is that you can run n processes in parallel, for n cores. Additionally, if each core supports hyperthreading then the number of processes you can run is approximately 2n. If you run more than the suggested number, the price of context switching can slow down your performance. If you find the need to routinely exceed this guideline, use an instance with more cores.
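In Python, this rule of thumb can be applied without hard-coding the instance size. A small sketch, assuming the standard multiprocessing module; worker_count and its reserve parameter are hypothetical. Note that cpu_count() reports logical CPUs, so on a hyperthreaded machine it already reflects the roughly 2n figure.

```python
import multiprocessing

def worker_count(reserve=0):
    """Suggest how many worker processes to start on this machine.

    `reserve` is a hypothetical knob for leaving logical CPUs free for
    other services running on the same instance.
    """
    logical_cpus = multiprocessing.cpu_count()
    # Always allow at least one worker, even on a single-core instance.
    return max(1, logical_cpus - reserve)
```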

When running parallel code, routinely do a ps aux | grep processName to make sure the correct number of processes is running. If any were killed, this will be noted in the system log (for example, /var/log/syslog on Ubuntu) with a reason.

Financial metrics. Are you getting your money’s worth? Are you really using all of the cores you are paying for? Are you really using all of the memory you are paying for? This is up to you and your budget to dictate. But do not get carried away and assume that you must stay with the same instance size. Most AMIs can run on different instance types (except that 64-bit AMIs are restricted to m1.large and bigger).

Quarantine Essential Services

My crawler used Redis as a work queue. Each process could easily write new thread IDs and page numbers to the queue as they were discovered, and read thread IDs and page numbers from the queue whenever it was ready to crawl a page. One problem that I faced was that I coupled the crawling operation and queue management in the same script, and ran the Redis server on a machine where a crawler was also running. This coupling posed two challenges. First, it can harbor nasty bugs. Whenever a process was created on the master Redis node, my code would wipe the Redis queue clean to prepare it for crawling (bug!). Flushing the queue and initially populating it should have taken place in a separate script from the crawler. Due to this major bug, I wiped the entire queue clean in the middle of the crawl. Fortunately, I followed the advice in the next section.

Second, I had to be careful that my processes did not exceed RAM limitations. Because Redis is mainly an in-memory key-value store, it itself can hog up most of the RAM in the instance. For this reason, it is best to quarantine essential services such as queues to their own instances.
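One way to keep queue management out of the crawler is a small, separate reset routine that refuses to run unless explicitly told to. This is a sketch rather than my actual code: reset_queue, the key name, and the confirm flag are hypothetical, and client is assumed to be a redis-py style connection that exposes delete() and rpush().

```python
def reset_queue(client, key, items, confirm=False):
    """Flush and repopulate the work queue; meant to be run once, by hand.

    `client` is assumed to be a redis-py style connection. Crawler
    processes should never call this function.
    """
    if not confirm:
        # Guard against wiping the queue by accident in the middle of a crawl.
        raise RuntimeError("refusing to wipe %r without confirm=True" % key)
    client.delete(key)
    if items:
        client.rpush(key, *items)
    return len(items)
```

The crawler script then only ever pushes to and pops from the queue, so starting a new crawler process can no longer destroy pending work.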

Document Everything

Log everything. Log every resource you are going to use (URLs for a crawl) and log everything that was done and any problems that arise. Using the directory structure (ls) as well as a log of what work was already performed, I was able to reconstruct and repopulate the work queue and essentially start where I left off. For my crawling operation, I wrote the following events to logs, each with a timestamp.

Starting the crawl.
Logging in to the site being crawled.
Clearing and populating the queue.
Visiting a thread’s first page.
Discovering the number of pages of posts in the thread/inserting to the queue.
HTTP redirects, when a thread has been moved.
Visiting a thread ID that does not exist.
Inadvertent logouts, marking work to be redone.
Queuing inconsistencies.

An activity log verbosely documented everything that occurred without logging actual data. An inventory manifest indicated which URLs/forum posts had valid content and how many pages of content were associated with them. A standard directory listing indicated what work had been done. By cross referencing the manifest and the directory listing, it is easy to see which posts had not yet been processed. A system log prepared by the operating system also documents critical failures for you, such as lack of disk space or processes being killed.
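The cross-referencing step amounts to a set difference between what the manifest says should exist and what is actually on disk. A minimal sketch, assuming the manifest has been loaded into a dict of thread ID to page count and that each page is saved under a file name like "<thread_id>-<page>.html"; both the helper name and that naming scheme are made up for illustration.

```python
import os

def unprocessed(manifest, output_dir):
    """Return (thread_id, page) pairs in the manifest with no file on disk.

    `manifest` maps a thread ID to its number of pages. The file naming
    scheme "<thread_id>-<page>.html" is an assumption for illustration.
    """
    done = set(os.listdir(output_dir))
    todo = []
    for thread_id, pages in sorted(manifest.items()):
        for page in range(1, pages + 1):
            if "%s-%d.html" % (thread_id, page) not in done:
                todo.append((thread_id, page))
    return todo
```

The returned pairs are exactly what needs to be pushed back onto the work queue to resume the crawl.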

When writing your logs, use the advice in the next section!

Take Care to not Clobber Files and Objects

It’s been said over and over again. Each process should hold as much of its own real estate as possible. When two or more processes write to the same object, corruption can occur unless there is a locking mechanism in place. If two processes write to the same file at the same time, you will notice garbled entries in your logs. This did not affect my crawled data because each file was written by a single process. The same can be true for reading data as well. When spawning multiple processes, I shared the same Redis connection with all of the processes. If two processes read from the queue at the same time, one process would get the correct data (a thread ID and page number) and the other would get “OK”, which was the result of the first process’ fetch operation. This is mostly my fault, but partially redis-py’s fault for filling some buffer between Python and Redis with meaningless information (“OK”).

Each process should write its own log files. When opening a file, you can use the following:

import os
# Put the process ID in the file name so no two processes share a log file.
OUT = open("mylogfile-%s.log" % os.getpid(), "w")
...
OUT.close()
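If you would rather have timestamps added automatically, the same per-process idea can be sketched with Python's standard logging module. This is one possible arrangement, not what my crawler used; make_process_logger, its parameters, and the log format are hypothetical choices.

```python
import logging
import os

def make_process_logger(prefix="mylogfile", directory="."):
    """Return a logger that writes to a file named after this process's PID."""
    pid = os.getpid()
    logger = logging.getLogger("crawler.%d" % pid)
    logger.setLevel(logging.INFO)
    # Do not pass records up to the root logger, which may be shared.
    logger.propagate = False
    if not logger.handlers:
        path = os.path.join(directory, "%s-%d.log" % (prefix, pid))
        handler = logging.FileHandler(path)
        # Each line gets a timestamp and the PID of the process that wrote it.
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(process)d %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger
```

Call it once inside each worker process, after the process has been created, so that the PID in the file name is the worker's own and not the parent's.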

Crawler Specific: Set an Upper Bound

Crawling is fun, but you must practice moderation or it is easy to attempt to boil the ocean. When I first started, I would run a crawl, have it crash, and then deem the data out of date and start over from the beginning and crawl until there was nothing left to crawl. It is good to set an upper bound: “I will crawl 10 days’ worth of data”, or “I will only use threads created prior to May 1, 2011.”

One of the keys to success with EC2 is to get over the penny pinching. If you have a project, just take the plunge and do it on EC2 (if required). The amount you spend on the first few projects will save you more on future projects.