Some Code for Dumping Data from Twitter Gardenhose

Gardenhose is a Streaming API feed that continuously sends a sample of all tweets (roughly 15%, according to Ryan Sarver at the 140tc in September 2009) to feed recipients. Below is some code for dumping the tweets to files named by date and hour. It is in PHP, which is not my favorite language, but it works nonetheless. I received a few requests to post it, so here it is.

<?php
// gardenhosedump.php
// Dump the Twitter gardenhose stream to files named by date and hour (YmdH.txt).
$username = '';
$password = '';
$out = null;          // current output file handle
$currentHour = '';    // hour stamp of the currently open output file

while (true) {
    // Streaming API v1 with HTTP basic auth embedded in the URL.
    $stream = fopen("http://{$username}:{$password}@stream.twitter.com/1/statuses/sample.json", "r");
    if ($stream === false) {
        sleep(5);     // could not connect; wait briefly before retrying
        continue;
    }
    while ($data = fgets($stream)) {
        $hour = date("YmdH");
        if ($hour !== $currentHour) {
            // The hour rolled over: close the old file and open a new one.
            if ($out) {
                fclose($out);
            }
            $out = fopen("{$hour}.txt", "a");
            $currentHour = $hour;
        }
        fputs($out, $data);
    }
    // The stream dropped; need to close the files, but only if they are open!
    fclose($stream);
    if ($out) {
        fclose($out);
        $out = null;
        $currentHour = '';
    }
}
?>
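To use it, fill in your Twitter credentials and run php gardenhosedump.php from the command line; each hour's tweets land in a file named for that hour, e.g. 2009093015.txt. When the hour rolls over, the script closes the old file and appends to a new one, and when Twitter drops the connection the outer loop simply reconnects.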

Lessons Learned from EC2

A week or so ago I had my first experience using someone else’s cluster on Amazon EC2. EC2 is the Amazon Elastic Compute Cloud. Users set up a virtual computing platform that runs on Amazon’s servers “in the cloud.” Amazon EC2 is not just another cluster: EC2 allows the user to create a disk image containing an operating system and all of the software needed for their computations. In my case, the disk image would contain Hadoop, R, Python and all of the R and Python packages I need for my work. This spares both the user and the provider from having to install or upgrade software and from chasing compatibility issues.
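As a minimal sketch of that workflow, and assuming the current AWS SDK for PHP (v3) rather than anything I actually used at the time, launching an instance from such a custom image looks roughly like this; the region, AMI id, and instance type are all placeholders:

<?php
// launch_instance.php -- hedged sketch, not the exact setup described above.
require 'vendor/autoload.php';   // AWS SDK for PHP, installed via Composer

use Aws\Ec2\Ec2Client;

$ec2 = new Ec2Client([
    'region'  => 'us-east-1',    // assumed region; credentials come from the environment
    'version' => '2016-11-15',
]);

// Boot one instance from a custom image with Hadoop, R, Python, etc. baked in.
$result = $ec2->runInstances([
    'ImageId'      => 'ami-00000000',   // hypothetical custom AMI id
    'InstanceType' => 't2.micro',       // placeholder instance type
    'MinCount'     => 1,
    'MaxCount'     => 1,
]);

echo "Launched " . $result['Instances'][0]['InstanceId'] . "\n";
?>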

No subscription is required; users pay for the resources consumed during each computing session. Hourly prices are very cheap, but the charges accrue quickly. Additionally, Amazon charges for pretty much every single thing you can do with an OS: transferring data to or from the cloud (per GB), data storage (per GB), CPU time (per core, per hour), and so on.
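To make the accrual concrete, here is a tiny back-of-the-envelope sketch; every rate in it is a made-up placeholder, not an actual Amazon price:

<?php
// ec2_cost_sketch.php -- illustrative only; all rates below are assumptions.
$cpuRate      = 0.10;   // $ per core per hour (assumed)
$storageRate  = 0.10;   // $ per GB-month stored (assumed)
$transferRate = 0.10;   // $ per GB transferred (assumed)

$cores = 8;             // one eight-core instance
$hours = 24 * 30;       // left running for a month
$storedGB = 50;
$transferredGB = 20;

$total = $cores * $hours * $cpuRate
       + $storedGB * $storageRate
       + $transferredGB * $transferRate;

printf("Estimated monthly bill: $%.2f\n", $total);  // $583.00 with these numbers
?>

Even at a dime per core-hour, a single eight-core instance left running for a month comes to several hundred dollars before storage and transfer are even counted.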

This is somewhat of a tangent, but EC2 was a brilliant business move in my opinion.

Anyway, life gets a bit more difficult when the EC2 instance […]

My Experience at ACM Data Mining Camp #DMcamp

My parents and I made plans to visit San Jose and Saratoga on my grandmother’s birthday, March 19, since that is where she grew up. I randomly saw someone tweet about the ACM Data Mining Camp unconference that happened to be the next day, March 20, only a couple of miles from our hotel in Santa Clara. This was an opportunity I could not pass up.

Upon arriving at eBay/PayPal’s “Town Hall” building, I was greeted by some very hyper people! Surrounding me were a lot of people of my age and with my interests. I finally felt like I was in my element. The organizers had a predetermined Twitter hashtag for the event, #DMCAMP, and also set up a blog where people could add material and write comments about the sessions. I felt like a kid in a candy shop when I saw the proposed topics for the breakout sessions.

Some of the proposed topics I found really interesting:

Anomaly Detection
Natural Language Processing
Collaborative Filtering and a Netflix Paper
CPC Optimization for Events
Data Mining Programming Tools
Structured Tags
Status of Mahout
Machine Learning with Parallel Processors
Sentiment Analysis
Parallel R

About half of these actually […]