Exciting Tools for Big Data: S4, Sawzall and mrjob!

This week, a few different big data processing tools were released to the open-source community. I know, I know, this is probably the 1000th blog post about this, and perhaps the train has left the station without me, but here I am.

Yahoo’s S4: Distributed Stream Computing Platform

First off, it must be said. S4 is NOT real-time map-reduce! This is the meme that has been floating around the Internets lately.

S4 is a distributed, scalable, partially fault-tolerant, pluggable platform that allows users to create applications that process unbounded streaming data. It is not a Hadoop project. As a matter of fact, it is not even a form of map-reduce. S4 was developed at Yahoo for personalization of search advertising products. Map-reduce, so far, is not a great platform for dealing with streaming/non-stored data.

Pieces of data, apparently called events, are sent to and consumed by a Processing Element (yes, PE, but not the kind that requires you to sweat). The PEs can do one of two things:

- emit another event that will be consumed by another PE, or
- publish some result (see the sketch below).
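To make the PE idea concrete, here is a rough sketch in Python. Note that S4 applications are actually written in Java; the class, method, and event names below are my own invention for illustration, not S4's API.

# Hypothetical sketch of the PE abstraction; not S4's real (Java) API.
class WordCountPE:
    def __init__(self, word, emit, publish):
        self.word = word        # keyed attribute: one PE instance per word
        self.count = 0
        self.emit = emit        # callback: send a new event to another PE
        self.publish = publish  # callback: push a result out of the system

    def process_event(self, event):
        self.count += 1
        if self.count % 100 == 0:
            # Option 1: publish some result.
            self.publish({"word": self.word, "count": self.count})
        else:
            # Option 2: emit another event for a downstream PE to consume.
            self.emit({"type": "partial_count",
                       "word": self.word, "count": self.count})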

Streaming data is different from non-streaming data in that the user does not know how much data will […]

UCLA Statistics: Analyzing Thesis/Dissertation Lengths

As I am working on my dissertation and piecing together a mess of notes, code and output, I am wondering to myself, "how long is this thing supposed to be?" I am definitely not in this to win the prize for longest dissertation. I just want to say my piece, make my point and move on. I've heard that the shortest dissertation in my program was 40 pages (not true). I heard from someone at another school that their dissertation was over 300 pages. I am not holding myself to a strict limit, but I wanted a rough guideline. As a disclaimer, this blog post is more "fun" than "business." This was just an analysis I was interested in, and I felt it was worth sharing since it combined Python, web scraping, R and ggplot2. It is not meant to be a thorough analysis of dissertation lengths or of the academic quality of the Department.

The UCLA Department of Statistics publishes most of its M.S. theses and Ph.D. dissertations on a website. It is not complete, especially for the earlier years, but it is a good enough population for my use.

Using this web page, I was able to extract information […]
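For flavor, the extraction step might have looked something like this Python sketch. The URL and the page's markup below are my assumptions, not the department site's actual structure.

# Hypothetical sketch of the scraping step; URL and markup are assumed.
import urllib.request
from bs4 import BeautifulSoup

URL = "http://www.stat.ucla.edu/theses/"   # placeholder listing page

html = urllib.request.urlopen(URL).read()
soup = BeautifulSoup(html, "html.parser")

theses = []
for row in soup.find_all("tr"):            # assume one thesis per table row
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if len(cells) >= 3:                    # e.g. author, year, degree
        theses.append(tuple(cells[:3]))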

Opening Statements on Markov Chain Monte Carlo

This quarter I am TAing UCLA's Statistics 102C, Introduction to Monte Carlo Methods, for Professor Qing Zhou. This course did not exist when I was an undergraduate, and I think it is pretty rare to teach Monte Carlo (minus the bootstrap, if you count that) or MCMC to undergrads. I am excited about this class because to me, MCMC turns Statistics on its head. It felt like a totally different paradigm from the regression and data analysis framework I was used to at the time. It also exposes students to the connection between Statistics/MCMC and other fields such as Computer Science, Genetics/Biology, etc.

I usually do not have much to talk about during week 1, especially if my class meets on the second day of the quarter. Today was an exception, because I wanted to excite the class about this topic.

Some examples I discussed:

- the general recipe for Monte Carlo methods
- the bootstrap as an example of resampling, and R loops
- computing π, with mention of Buffon's Needle (a sketch follows this list)
- scheduling/timetabling and occupancy/matching problems using stochastic search (simulated annealing, Tabu search, etc.)
- mention of genetic algorithms and swarm intelligence
- PageRank as a Markov process
- drawing a random sample of web […]
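As a taste of the π example, here is a minimal Monte Carlo estimate in Python. This is my own illustration, not code from the course.

# Estimate pi by sampling points in the unit square: the fraction that
# lands inside the quarter circle approximates pi/4.
import random

def estimate_pi(n=1_000_000):
    inside = 0
    for _ in range(n):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n

print(estimate_pi())   # roughly 3.14; Monte Carlo error shrinks like 1/sqrt(n)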

Some Code for Dumping Data from Twitter Gardenhose

Gardenhose is a Streaming API feed that continuously sends a sample (roughly 15%, according to Ryan Sarver at the 140tc in September 2009) of all tweets to feed recipients. This is some code for dumping the tweets to files named by date and hour. It is in PHP, which is not my favorite language, but it works nonetheless. I received a few requests to post it, so here it is.

<?php
// gardenhosedump.php
// Dumps the Gardenhose stream to files named by date and hour (YmdH).
$username = '';
$password = '';
$newTime  = '';
$file2    = false;

while (true) {
    $file = fopen("http://{$username}:{$password}@stream.twitter.com/1/statuses/sample.json", "r");
    while ($data = fgets($file)) {
        $time = date("YmdH");
        if ($newTime != $time) {
            // The hour rolled over: close the old dump file, open a new one.
            if (is_resource($file2)) {
                fclose($file2);
            }
            $file2 = fopen("{$time}.txt", "a");
        }
        fputs($file2, $data);
        $newTime = $time;
    }
    // The stream dropped. Need to close the files, but only if they are open!
    if (is_resource($file)) {
        fclose($file);
    }
    if (is_resource($file2)) {
        fclose($file2);
        $file2 = false;
    }
    $newTime = '';   // force a fresh dump file after reconnecting
}
?>
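To use it, fill in your Twitter credentials and run php gardenhosedump.php from the command line. It appends each incoming tweet to a file named for the current hour (e.g. 2010110514.txt) and reconnects whenever the stream drops.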

Mining Tuition Data for US Colleges and Universities, and a Tangent

I wrote this script for the UCLA Statistical Consulting Center. I don't know all of the specifics, but one of our faculty members has this idea that we can help our paper, The Daily Bruin, with their graphics, or something to that effect. I don't quite understand it, because our paper has never really been big on data graphics, but apparently some undergraduates are going to work on this.

Anyway, we need datasets that are of interest to UCLA students so that our undergraduates can create cool graphics that will stun the readers. Some of the data we were considering:

- parking data for one week (gate entries), to correlate with some other variable (weather was mentioned. ugh)
- Registrar study list/class schedule information for every student (anonymized of course) from Fall 2008, at $50 for programmer time. (I could have done it quickly, for free! …if I worked in their office and it was legal, I mean.)
- 9/11 pager intercepts
- tuition data for US colleges and universities over ten years

The tuition data was spread across a bunch of tables on several pages. Unfortunately, the type of school is not reported. Due to this limitation, I had to execute separate queries […]
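The collection step might look something like the Python sketch below. The URL pattern, query parameters, and table layout are placeholders of my own, since the source site is not named in this excerpt; the point is just that running one query per school type lets you tag each row with the type the tables omit.

# Hypothetical sketch: pull tuition tables from several paginated result
# pages, one query per school type. The base URL and markup are assumed.
import urllib.request
from bs4 import BeautifulSoup

BASE = "http://example.org/tuition?type=%s&page=%d"   # placeholder

rows = []
for school_type in ("public", "private"):             # separate query per type
    for page in range(1, 6):
        html = urllib.request.urlopen(BASE % (school_type, page)).read()
        soup = BeautifulSoup(html, "html.parser")
        for tr in soup.find_all("tr"):
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if cells:
                rows.append([school_type] + cells)    # recover the type label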