## SIAM Data Mining 2012 Conference

Note: This would have been up a lot sooner but I have been dealing with a bug on and off for pretty much the past month!

From April 26-28 I had the pleasure of attending the SIAM Data Mining conference in Anaheim on the Disneyland Resort grounds. Aside from KDD2011, most of my recent conferences had been more “big data” and “data science” oriented, and I wanted to step away from the hype and just listen to talks that had more substance.

Attending a conference on Disneyland property was quite a bizarre experience. I wanted to get everything I could out of the conference, but the weather was so nice that I also wanted to get everything I could out of Disneyland. Seeing adults wearing Mickey ears and carrying Mickey-shaped balloons, and girls dressed up as their favorite Disney princesses, screams “fun” rather than “business”, but I managed to make time for both.

The first two days started with a plenary talk from industry or research labs. After a coffee break, there were the usual breakout sessions followed by lunch. During my free 90 minutes, I ran over to Disneyland and California Adventure both days to eat lunch. I managed to […]

## “Hold Only That Pair of 2s?” Studying a Video Poker Hand with R

Whenever I tell people in my family that I study Statistics, one of the first questions I get is “do you count cards?” A blank look comes over their faces when I say “no.”

## Exciting Tools for Big Data: S4, Sawzall and mrjob!

This week, a few different big data processing tools were released to the open-source community. I know, I know, this is probably the 1000th blog post about this, and perhaps the train has left the station without me, but here I am.

### Yahoo’s S4: Distributed Stream Computing Platform

First off, it must be said: S4 is NOT real-time map-reduce! This is the meme that has been floating around the Internets lately.

S4 is a distributed, scalable, partially fault-tolerant, pluggable platform that allows users to create applications that process unbounded streaming data. It is not a Hadoop project. As a matter of fact, it is not even a form of map-reduce. S4 was developed at Yahoo for personalization of search advertising products. Map-reduce, so far, is not a great platform for dealing with streaming/non-stored data.

Pieces of data, apparently called events, are sent to and consumed by a Processing Element (yes, PE, but not the kind that requires you to sweat). A PE can do one of two things:

- emit another event that will be consumed by another PE, or
- publish some result
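
To make the two PE behaviors concrete, here is a toy, single-process sketch in Python. It is not S4's actual API (S4 is a distributed Java platform); the names `Event`, `WordSplitterPE`, `WordCounterPE` and `run` are all hypothetical and exist only for illustration. One PE emits new events for another PE to consume, and the other keeps a running result that it could publish.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Event:
    """A piece of data flowing through the system (hypothetical structure)."""
    stream: str  # name of the stream, used to route the event to a PE
    key: str     # payload: a line of text or a single word
    value: int = 0


class WordSplitterPE:
    """Behavior 1: consume a 'line' event and emit 'word' events."""

    def process(self, event):
        return [Event("word", w.lower(), 1) for w in event.key.split()]


class WordCounterPE:
    """Behavior 2: consume 'word' events and keep a result to publish."""

    def __init__(self):
        self.counts = {}

    def process(self, event):
        self.counts[event.key] = self.counts.get(event.key, 0) + event.value
        return []  # emits nothing further; self.counts is the published result


def run(source, routes):
    """Feed events from a (possibly unbounded) source through the PEs."""
    for incoming in source:
        queue = deque([incoming])
        while queue:
            event = queue.popleft()
            # Route each event to the PE registered for its stream.
            queue.extend(routes[event.stream].process(event))


if __name__ == "__main__":
    counter = WordCounterPE()
    lines = (Event("line", text) for text in ["S4 is not map-reduce",
                                               "s4 is a stream platform"])
    run(lines, {"line": WordSplitterPE(), "word": counter})
    print(counter.counts)
```

Because `run` pulls events from an iterator one at a time, it never needs to know how many events will arrive, which is the property that separates this style of processing from a batch job over stored data.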

Streaming data is different from non-streaming data in that the user does not know how much data will be transmitted, or at what rate. Analysis on […]

## Accessing R from Python using RPy2

This past Tuesday I had the opportunity to present a short talk (a bit long) related to text mining at the Los Angeles R Users’ Group. Since I do most of my text mining in Python, I took this opportunity to discuss RPy2, an interface to R from Python. My slides are below:

Accessing R from Python using RPy2
View more presentations from Ryan Rosario.

Using Python with R with an example using web mining.
Web mining using pure R rather than Python.

Code for the demonstration is here:

- `offtopic_demo.py` is a pure Python script that extracts data from a web forum and dumps it to disk. To actually use it, you will need to register for an account.
- `RPy2_demo.py` reads the forum data from disk and calls R from Python to perform some basic analysis.
- `curljson_demo.R` grabs some JSON data from the Twitter Search API using RCurl and converts it to R lists using rjson.

Running the code requires some packages that you need to install.

twill package for web browsing, that installs a Python package for you. Requires the mechanize package […]