My Day at ACM Data Mining Camp III

My first time at ACM Data Mining Camp was so awesome that I was thrilled to make the trip up to San Jose for the November 2010 version. In July, I gave a talk at the Emerging Technologies for Online Learning Symposium with a faculty member in the Department of Statistics, at the Fairmont. The place was amazing, and I told myself I would save up to stay there. This trip gave me an opportunity to check it out, and pretend that I am posh for a weekend ;). The night I arrived I had the best dinner and drinks at this place called Gordon Biersch. I had the best garlic fries and BBQ burger I have ever had. I ate it with a Dragonfruit Strawberry Mojito, the Barbados Rum Runner, and finished off with a Long Island Iced Tea, so the drinks were awesome as well. Anyway, to the point of this post…

The next morning I made the short trek to the PayPal headquarters for a very long 9am-8pm day. Since I came up here for the camp, I wanted to make the most of it and paid the $30 for the morning session, even though I had […]

UCLA Statistics: Analyzing Thesis/Dissertation Lengths

As I am working on my dissertation and piecing together a mess of notes, code and output, I am wondering to myself “how long is this thing supposed to be?” I am definitely not into this to win the prize for longest dissertation. I just want to say my piece, make my point and move on. I’ve heard that the shortest dissertation in my program was 40 pages (not true). I heard from someone at another school that their dissertation was over 300 pages. I am not holding myself to a strict limit, but I wanted a rough guideline. As a disclaimer, this blog post is more “fun” than “business.” This was just an analysis that I was interested in and felt was worth sharing since it combined Python, web scraping, R and ggplot2. It is not meant to be a thorough analysis of dissertation lengths or of the academic quality of the Department.

The UCLA Department of Statistics publishes most of its M.S. theses and Ph.D. dissertations on a website. It is not complete, especially for the earlier years, but it is a good enough population for my use.

Using this web page, I was able to extract information about each […]

Taking R to the Limit, Part II - Large Datasets in R

For Part I, Parallelism in R, click here.

Tuesday night I again had the opportunity to present on high performance computing in R, at the Los Angeles R Users’ Group. This was the second part of a two part series called “Taking R to the Limit: High Performance Computing in R.” Part II discussed ways to work with large datasets in R. I also tied MapReduce into the talk. Unfortunately, there was too much material; I had originally planned to cover Rhipe, using R on EC2 and sparse matrix libraries as well.

Slides

My edited slides are posted on SlideShare, and available for download here.

Taking R to the Limit (High Performance Computing in R), Part 2 — Large Datasets, LA R Users' Group 8/17/10

View more presentations from Ryan Rosario.

Topics included:

bigmemory, biganalytics and bigtabulate
ff
HadoopStreaming
brief mention of Rhipe

Code

The corresponding demonstration code is here.

Data

Since this talk discussed large datasets, I used some, well, large datasets. Some demonstrations used toy data including trees and the famous iris dataset included in base R. To load these, just use the call data(iris) or data(trees).

Large datasets:

On-Time Airline Performance data from 2009 Data Expo. This Bash script will download all of the necessary data files and create a nice dataset […]

Hitting the Big Data Ceiling in R

As a true R fan, I like to believe that R can do anything, no matter how big, how small or how complicated: there is some way to do it in R. I decided to approach my large, sparse matrix problem with this attitude. But here I sit a broken man.

There is no “native” big data support built into R, even when using the 64-bit build of R. Before venturing on this endeavor, I consulted with my advisor, who reassured me that R uses the state of the art for sparse matrices. That was enough for me.

My Problem

For part of my Masters thesis, I wrote code to extract all of the friends and followers out to network degree 2 to construct a “small-world” snapshot of a user via their relationships. In a graph, nodes and edges grow exponentially as the degree increases. The number of nodes was on the order of 300,000. The number of edges I predict will be around 900,000. The code is still running. This means that a dense matrix would have size on the order of 300,000 × 300,000. Some of you already know how this story is going to end…

The matrix is very sparse.

Very sparse.
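
To put rough numbers on "very sparse", here is a back-of-the-envelope sketch in Python. The node and edge counts are the estimates quoted above. The byte sizes (8-byte doubles, 4-byte integer indices) and the simple (row, column, value) triplet layout are assumptions for illustration, not a description of how any particular R sparse matrix package stores its data.

```python
# Back-of-the-envelope memory estimate for the adjacency matrix.
# Node and edge counts are the post's estimates; byte sizes are assumptions.
N_NODES = 300_000
N_EDGES = 900_000
BYTES_PER_DOUBLE = 8  # assumption: one 8-byte double per stored value
BYTES_PER_INDEX = 4   # assumption: 4-byte integer row/column indices


def dense_bytes(n_nodes: int) -> int:
    """Bytes needed to store every cell of an n x n matrix of doubles."""
    return n_nodes * n_nodes * BYTES_PER_DOUBLE


def triplet_bytes(n_edges: int) -> int:
    """Bytes for one (row, column, value) triplet per nonzero entry."""
    return n_edges * (2 * BYTES_PER_INDEX + BYTES_PER_DOUBLE)


def density(n_nodes: int, n_edges: int) -> float:
    """Fraction of cells that are nonzero, assuming one cell per edge."""
    return n_edges / (n_nodes * n_nodes)


if __name__ == "__main__":
    print(f"dense:   {dense_bytes(N_NODES) / 1e9:.0f} GB")
    print(f"triplet: {triplet_bytes(N_EDGES) / 1e6:.1f} MB")
    print(f"density: {density(N_NODES, N_EDGES):.4%}")
```

Under these assumptions the dense matrix needs about 720 GB, while the triplet form needs about 14.4 MB, because only about 0.001% of the cells are nonzero.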

The raw data graph.log consists of an […]

Opening Statements on Markov Chain Monte Carlo

This quarter I am TAing UCLA’s Statistics 102C, Introduction to Monte Carlo Methods, for Professor Qing Zhou. This course did not exist when I was an undergraduate, and I think it is pretty rare to teach Monte Carlo (minus the bootstrap if you count that) or MCMC to undergrads. I am excited about this class because to me, MCMC turns Statistics on its head. It felt like a totally different paradigm compared to the regression and data analysis paradigm that I was used to at the time. It also exposes students to the connection between Statistics/MCMC and other fields such as Computer Science, Genetics/Biology, etc.

I usually do not have much to talk about during week 1, especially if my class is the second day of the quarter. Today was an exception because I wanted to excite the class about this topic.

Some examples I discussed:

the general recipe for Monte Carlo methods
the bootstrap as an example of resampling, and R loops
computing π and mention of Buffon’s Needle
scheduling/timetabling and occupancy/matching problems using stochastic search (simulated annealing, Tabu search etc.)
mention of genetic algorithms and swarm intelligence
PageRank as a Markov process
drawing a random sample of web pages using Random Walk Metropolis-Hastings
short inventory of fields […]
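
As an illustration of the general Monte Carlo recipe (simulate, count, average), here is a minimal Python sketch of two of the examples listed above: estimating π by sampling points in the unit square, and Buffon's Needle. It is not the code used in class. The function names, default seeds and sample sizes are made up for illustration.

```python
import math
import random


def estimate_pi(n_samples: int, seed: int = 0) -> float:
    """Estimate pi from the share of random points inside a quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # Quarter-circle area / unit-square area = pi / 4.
    return 4.0 * inside / n_samples


def buffon_pi(n_drops: int, needle: float = 1.0, spacing: float = 1.0,
              seed: int = 0) -> float:
    """Estimate pi by dropping a needle on evenly spaced parallel lines.

    Requires needle <= spacing, so P(cross) = 2 * needle / (pi * spacing).
    Note the usual circularity: drawing the angle already uses math.pi.
    """
    if needle > spacing:
        raise ValueError("needle must not be longer than the line spacing")
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_drops):
        center = rng.uniform(0.0, spacing / 2.0)  # distance to nearest line
        angle = rng.uniform(0.0, math.pi / 2.0)   # acute angle to the lines
        if center <= (needle / 2.0) * math.sin(angle):
            hits += 1
    if hits == 0:
        raise ValueError("no crossings observed; use more drops")
    return 2.0 * needle * n_drops / (spacing * hits)


if __name__ == "__main__":
    print(estimate_pi(200_000))
    print(buffon_pi(200_000))
```

Both estimators converge slowly (the error shrinks like one over the square root of the number of samples), which is part of what makes them good classroom examples.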

My Experience at ACM Data Mining Camp #DMcamp

My parents and I made plans to visit San Jose and Saratoga on my grandmother’s birthday, March 19, since that is where she grew up. I randomly saw someone tweet about the ACM Data Mining Camp unconference that happened to be the next day, March 20, only a couple of miles from our hotel in Santa Clara. This was an opportunity I could not pass up.

Upon arriving at eBay/PayPal’s “Town Hall” building, I was greeted by some very hyper people! Surrounding me were a lot of people my age who shared my interests. I finally felt like I was in my element. The organizers of the event also had a predetermined Twitter hashtag for the event, #DMCAMP, and also set up a blog where people could add material and write comments about the sessions. I felt like a kid in a candy shop when I saw the proposed topics for the breakout sessions.

Some of the proposed topics I found really interesting:

Anomaly Detection
Natural Language Processing
Collaborative Filtering and a Netflix Paper
CPC Optimization for Events
Data Mining Programming Tools
Structured Tags
Status of Mahout
Machine Learning with Parallel Processors
Sentiment Analysis
Parallel R

About half of these actually made it onto the schedule. Unfortunately, I was only able to attend 4 […]

Exact Complexity of Mergesort, and an R Regression Oddity

It’s nice to be back after a pretty crazy two weeks or so.

Let me start off by stating that this blog post is simply me pondering and may not be correct. Feel free to comment on inaccuracies or improvements!

In preparation for an exam, and given my natural masochistic tendencies, I am forcing myself to find the exact complexities of some sorting algorithms, and I decided to start with a favorite – mergesort. Mergesort divides an array or linked list first into two halves (or close to it) and then recursively divides the successive lists into halves until it ends up with two lists containing 1 element each – the base case. The elements are then compared and switched so that they are in order, and form their own list.

At successive levels we compare the last element of the first sublist to the first element of the second sublist and merge them together to form another list. This process continues up the recursion tree until the entire original list is sorted. For a more comprehensive and precise description, see this article on mergesort.
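
To make the description concrete, here is a minimal top-down mergesort sketch in Python that also counts element comparisons, so the counts can be checked empirically on small inputs. It is an illustrative sketch, not code from this post. The function name and the choice to return a (sorted list, comparison count) pair are assumptions. The merge step here is the standard one: it repeatedly compares the front elements of the two sublists.

```python
def merge_sort(items):
    """Return (sorted_list, number_of_element_comparisons)."""
    if len(items) <= 1:
        # Base case: a list of 0 or 1 elements is already sorted.
        return list(items), 0

    mid = len(items) // 2
    left, left_cmp = merge_sort(items[:mid])
    right, right_cmp = merge_sort(items[mid:])

    merged = []
    comparisons = left_cmp + right_cmp
    i = j = 0
    # Repeatedly compare the front elements of the two sorted sublists.
    while i < len(left) and j < len(right):
        comparisons += 1
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # One sublist is exhausted; the rest needs no further comparisons.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, comparisons


if __name__ == "__main__":
    print(merge_sort([5, 2, 8, 1, 9, 3]))
    print(merge_sort(list(range(8))))
```

For an already sorted list of 8 elements, this version uses 12 comparisons, since each merge stops as soon as the left sublist runs out.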

Easy: The Worst Case

The worst case is easy as any CS student will tell you. Looking […]

Mining Tuition Data for US Colleges and Universities, and a Tangent

I wrote this script for the UCLA Statistical Consulting Center. I don’t know all of the specifics, but one of our faculty members has this idea that we can help our paper, The Daily Bruin, with their graphics or something to that effect. I don’t quite understand because our paper has never really been big on graphics for data, but apparently some undergraduates are going to work on this.

Anyway, we need datasets that are of interest to UCLA students so that our undergraduates can create cool graphics that will stun the readers. Some of the data we were considering:

parking data for one week; gate entries, to correlate with some other variable (weather was mentioned. ugh)
Registrar study list/class schedule information for every student (anonymized of course) from Fall 2008. $50 for programmer time. (I could have done it quickly, for free! …if I worked in their office and it was legal, I mean.)
9/11 pager intercepts.
tuition data for US colleges and universities over ten years.

The tuition data was presented in a bunch of tables spread over several pages. Unfortunately, the type of school is not reported. Due to this limitation, I had to execute separate queries to access each year of data, […]

Advanced Graphics in R

Each quarter the UCLA Statistical Consulting Center hosts minicourses twice per week in R and LaTeX. Tonight was my turn to present.

I presented Advanced Graphics in R. This was the same presentation I gave at the LA R Users’ Group in August with a fellow consultant. She and I had trouble coming together to make one presentation, so we shared our outlines; hers was deemed “Intermediate Graphics in R” with some ggplot, and mine was deemed “Advanced.” It seems to work.

My slides are here, and the handout version is here. The corresponding code is here.

Topics include:

Customizing graphics with par parameters
Using attributes of graphic objects
Basic graphics devices
Math typesetting for R graphics
An example of a movie (here, but there is some funkiness with it)

Many think that “advanced” graphics would be lattice or ggplot. We chose to address those packages in their own minicourses.

My advisor gave me some good advice on writing R code that fits well in Beamer slides and lstlisting:

use local variables and introduce them.
don’t use function names as variable names (I violated this one here).