My Day at ACM Data Mining Camp III

My first time at ACM Data Mining Camp was so awesome, that I was thrilled the make the trip up to San Jose for the November 2010 version. In July, I gave a talk at the Emerging Technologies for Online Learning Symposium conference with a faculty member in the Department of Statistics, at the Fairmont. The place was amazing, and I told myself I would save up to stay there. This trip gave me an opportunity to check it out, and pretend that I am posh for a weekend ;). The night I arrived I had the best dinner and drinks at this place called Gordon Biersch. I had the best garlic fries and BBQ burger I have ever had. I ate it with a Dragonfruit Strawberry Mojito, the Barbados Rum Runner, and finished off with a Long Island Iced Tea, so the drinks were awesome as well. Anyway, to the point of this post…

The next morning I made the short trek to the PayPal headquarters for a very long 9am-8pm day. Since I came up here for the camp, I wanted to make the most of it and paid the $30 for the morning session, even though I had […]

Exciting Tools for Big Data: S4, Sawzall and mrjob!

This week, a few different big data processing tools were released to the open-source community. I know, I know, this is probably the 1000th blog post about this, and perhaps the train has left the station without me, but here I am.

Yahoo’s S4: Distributed Stream Computing Platform

First off, it must be said. S4 is NOT real-time map-reduce! This is the meme that has been floating around the Internets lately.

S4 is a distributed, scalable, partially fault-tolerant, pluggable platform that allows users to create applications that process unbounded streaming data. It is not a Hadoop project. A matter of fact, it is not even a form of map-reduce. S4 was developed at Yahoo for personalization of search advertising products. Map-reduce, so far, is not a great platform for dealing with streaming/non-stored data.

Pieces of data, apparently called events, are sent and consumed by a Processing Element (yes, PE, but not the kind that requires you to sweat). The PEs can do one of two things:

emit another event that will be consumed by another PE, or
publish some result

Streaming data is different from non-streaming data in that the user does not know how much data will be transmitted, and at what rate. Analysis on […]

Taking R to the Limit, Part II - Large Datasets in R

For Part I, Parallelism in R, click here.

Tuesday night I again had the opportunity to present on high performance computing in R, at the Los Angeles R Users’ Group. This was the second part of a two part series called “Taking R to the Limit: High Performance Computing in R.” Part II discussed ways to work with large datasets in R. I also tied in MapReduce into the talk. Unfortunately, there was too much material and I had originally planned to cover Rhipe, using R on EC2 and sparse matrix libraries.

Slides

My edited slides are posted on SlideShare, and available for download here.

Taking R to the Limit (High Performance Computing in R), Part 2 — Large Datasets, LA R Users' Group 8/17/10

View more presentations from Ryan Rosario.

Topics included:

bigmemory, biganalytics and bigtabulate
ff
HadoopStreaming
brief mention of Rhipe

Code

The corresponding demonstration code is here.

Data

Since this talk discussed large datasets, I used some, well, large datasets. Some demonstrations used toy data including trees and the famous iris dataset included in base R. To load these, just use the call library(iris) or library(trees).

Large datasets:

On-Time Airline Performance data from 2009 Data Expo. This Bash script will download all of the necessary data files and create a nice dataset […]

My Experience at Hadoop Summit 2010

This week I had the opportunity the trek up north to Silicon Valley to attend Yahoo’s Hadoop Summit 2010. I love Silicon Valley. The few times I’ve been there the weather was perfect (often warmer than LA), little to no traffic, no road rage and people overall seem friendly and happy. Not to mention there are so many trees it looks like a forest!

The venue was the Hyatt Regency Great America which seemed like a very posh business hotel. Walking into the lobby and seeing a huge crowd of enthusiasts and an usher every two steps was overwhelming!

After being welcomed by the sound of the vuvuzela, we heard about some statistics about the growth of Hadoop over the years. Apparently in 2008 there were 600 attendees to Hadoop Summit and in 2010 attendance grew to 1000. The conference started with three hours of keynotes and sessions that everyone attended in one room. The tone was somewhat corporate in nature, but there were also several gems in there about how Yahoo uses Hadoop:

A user’s click streams are mapped to latent interests, similar to mapping words to interests in topic models like LSA and LDA. Yahoo uses Hadoop to recompute a user’s […]

Lessons Learned from EC2

A week or so ago I had my first experience using someone else’s cluster on Amazon EC2. EC2 is the Amazon Elastic Compute Cloud. Users set up a virtual computing platform that runs on Amazon’s servers “in the cloud.” Amazon EC2 is not just another cluster. EC2 allows the user to create a disk image containing an operating system and all of the software they need to perform their computations. In my case, the disk image would contain Hadoop, R, Python and all of the R and Python packages I need for my work. This prevents the user (and the provider) from having to worry about providing or upgrading software and having compatibility issues.

No subscription is required. Users pay for the amount of resources used for the computing session. Hourly prices are very cheap, but accrue quickly. Additionally, Amazon charges for pretty much everything single thing you can do with an OS: transferring data to/from the cloud per GB, data storage per GB, CPU time per hour per core etc.

This is somewhat of a tangent, but EC2 was a brilliant business move in my opinion.

Anyway, life gets a bit more difficult when the EC2 instance you’re working with […]

My Experience at ACM Data Mining Camp #DMcamp

My parents and I made plans to visit San Jose and Saratoga on my grandmother’s birthday, March 19, since that is where she grew up. I randomly saw someone tweet about the ACM Data Mining Camp unconference that happened to be the next day, March 20, only a couple of miles from our hotel in Santa Clara. This was an opportunity I could not pass up.

Upon arriving at eBay/PayPal’s “Town Hall” building, I was greeted by some very hyper people! Surrounding me were a lot of people my age and my interest. I finally felt like I was in my element. The organizers of the event also had a predetermined Twitter hashtag for the event #DMCAMP, and also set up a blog where people could add material and write comments about the sessions. I felt like a kid in a candy shop when I saw the proposed sessions for the breakout sessions.

Some of the proposed topics I found really interesting:

Anonamly Detection
Natural Language Processing
Collaborative Filtering and a Netflix Paper
CPC Optimization for Events
Data Mining Programming Tools
Structured Tags
Status of Mahout
Machine Learning with Parallel Processors
Sentiment Analysis
Parallel R

About half of these actually made it onto the schedule. Unfortunately, I was only able to attend 4 […]