My Day at ACM Data Mining Camp III

My first time at ACM Data Mining Camp was so awesome, that I was thrilled the make the trip up to San Jose for the November 2010 version. In July, I gave a talk at the Emerging Technologies for Online Learning Symposium conference with a faculty member in the Department of Statistics, at the Fairmont. The place was amazing, and I told myself I would save up to stay there. This trip gave me an opportunity to check it out, and pretend that I am posh for a weekend . The night I arrived I had the best dinner and drinks at this place called Gordon Biersch. I had the best garlic fries and BBQ burger I have ever had. I ate it with a Dragonfruit Strawberry Mojito, the Barbados Rum Runner, and finished off with a Long Island Iced Tea, so the drinks were awesome as well. Anyway, to the point of this post…

The next morning I made the short trek to the PayPal headquarters for a very long 9am-8pm day. Since I came up here for the camp, I wanted to make the most of it and paid the $30 for the morning session, even though I [...]

Exciting Tools for Big Data: S4, Sawzall and mrjob!

This week, a few different big data processing tools were released to the open-source community. I know, I know, this is probably the 1000th blog post about this, and perhaps the train has left the station without me, but here I am.

Yahoo’s S4: Distributed Stream Computing Platform

First off, it must be said. S4 is NOT real-time map-reduce! This is the meme that has been floating around the Internets lately.

S4 is a distributed, scalable, partially fault-tolerant, pluggable platform that allows users to create applications that process unbounded streaming data. It is not a Hadoop project. A matter of fact, it is not even a form of map-reduce. S4 was developed at Yahoo for personalization of search advertising products. Map-reduce, so far, is not a great platform for dealing with streaming/non-stored data.

Pieces of data, apparently called events, are sent and consumed by a Processing Element (yes, PE, but not the kind that requires you to sweat). The PEs can do one of two things:

emit another event that will be consumed by another PE, or
publish some result

Streaming data is different from non-streaming data in that the user does not know how much data will be transmitted, and at what rate. Analysis on [...]