My Experience at Hadoop Summit 2010 #hadoopsummit

This week I had the opportunity to trek up north to Silicon Valley to attend Yahoo’s Hadoop Summit 2010. I love Silicon Valley. The few times I’ve been there, the weather was perfect (often warmer than LA), there was little to no traffic and no road rage, and people overall seemed friendly and happy. Not to mention there are so many trees it looks like a forest!

The venue was the Hyatt Regency Great America which seemed like a very posh business hotel. Walking into the lobby and seeing a huge crowd of enthusiasts and an usher every two steps was overwhelming!

After being welcomed by the sound of the vuvuzela, we heard some statistics about the growth of Hadoop over the years. Apparently, in 2008 there were 600 attendees at Hadoop Summit, and in 2010 attendance grew to 1000. The conference started with three hours of keynotes and sessions that everyone attended in one room. The tone was somewhat corporate in nature, but there were also several gems in there about how Yahoo uses Hadoop:

A user’s click streams are mapped to latent interests, similar to mapping words to interests in topic models like LSA and LDA. Yahoo uses Hadoop […]
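The topic-model analogy suggests a simple illustration. Here is a minimal sketch in R (mine, not Yahoo’s system), using the topicmodels package and its bundled AssociatedPress data to recover latent “interests” from word counts:

# LDA maps documents/words to latent topics; the same machinery can map
# users/clicks to latent interests. This is an illustrative sketch only.
library(topicmodels)
data("AssociatedPress", package = "topicmodels")
fit <- LDA(AssociatedPress[1:50, ], k = 5)
terms(fit, 3)  # top 3 terms characterizing each latent topic ("interest")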

Hitting the Big Data Ceiling in R

As a true R fan, I like to believe that R can do anything, no matter how big, how small or how complicated: there is some way to do it in R. I decided to approach my large, sparse matrix problem with this attitude. But here I sit, a broken man.

There is no “native” big data support built into R, even when using the 64-bit build. Before venturing on this endeavor, I consulted with my advisor, who reassured me that R uses the state of the art for sparse matrices. That was enough for me.

My Problem

For part of my Master’s thesis, I wrote code to extract all of the friends and followers out to network degree 2 to construct a “small-world” snapshot of a user via their relationships. In such a graph, the numbers of nodes and edges grow exponentially as the network degree increases. The number of nodes was on the order of 300,000, and I predict the number of edges will be around 900,000. The code is still running. This means that a dense matrix would have size 300,000 × 300,000, about 9 × 10^10 entries. Some of you already know how this story is going to end…

The matrix is very sparse.

Very sparse.

The […]
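To make the sparsity concrete, here is a minimal sketch (my code, not the thesis code) of storing such an edge list with the Matrix package, which is where R’s sparse-matrix support lives:

# Store a 300,000-node edge list as a sparse adjacency matrix.
library(Matrix)

n <- 300000
# Hypothetical edge list: one row per edge, columns "from" and "to".
edges <- data.frame(from = c(1, 1, 2), to = c(2, 3, 3))

A <- sparseMatrix(i = edges$from, j = edges$to, x = 1, dims = c(n, n))
object.size(A)  # a few KB; the dense 300,000 x 300,000 matrix would need ~720 GB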

Some LaTeX Gems – Part 1: TikZ, Loops and more

This logo means that the blog post is about something I have found interesting, but does not apply directly to the exact purpose of this blog.

Note: These commands have been tested with pdflatex. I am not sure if they work with other engines.

Over the past couple of months, I have been assisting with editing some papers and also doing some projects in LaTeX. Seeing other people’s code has taught me some interesting things. Here are a few of them.

Arrays, Lists, and Loops

Indeed, it is possible to write loops in LaTeX, with a few limitations. I have used this in some of the following ways:

- creating a perfectly aligned bubble answer sheet for scoring by a desktop scanner
- constructing midterms using a different dataset (form) for each student
- creating sheets of business cards for our consulting center / mail merge

Some other ways arrays and loops can be used with some clever programming:

- creating multiple forms of an exam using items from different files
- reports such as rosters, invoices, pre-printed forms, etc.

Arrays. To create an array in LaTeX, you will need the arrayjob package that can be downloaded from here. Then, to create the […]
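The excerpt cuts off here, but a minimal sketch of the arrayjob pattern it introduces looks like the following (the loop via the multido package is my addition, not from the post):

\documentclass{article}
\usepackage{arrayjob}  % provides \newarray and \readarray
\usepackage{multido}   % provides \multido loops (my addition)
\begin{document}

% Define an array and fill it with three items.
\newarray\Names
\readarray{Names}{Alice&Bob&Carol}

% Loop over the array, printing one item per line.
\multido{\i=1+1}{3}{Item \i: \Names(\i)\par}

\end{document}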

Anecdotal Evidence that Facebook Stores all Clicks?

This is not really news. A few months ago, news broke that Facebook recorded each user’s clicks and profile views in a database. Of course, I am not at all surprised. I would be more surprised if they didn’t store every single click.

By now, most people have some sense as to how Facebook’s recommendation system works. It typically performs what one of my professors called “completing the triangle.” If users A and B are friends, and users B and C are friends, Facebook may hypothesize that A and C should also be friends. Of course, Facebook’s algorithm is not that naive. Consider a slightly more realistic example in the graph below. I must provide a picture, otherwise I will end up using “recursive language” (i.e. “friends of a friend of a friend that’s friends with…”). The red lines represent existing friendships. This graph consists of two triangles, one containing one man and the two women, and another containing one woman and the two men. Facebook would most likely conclude that the two people with spiky hair should be friends, denoted by the green dashed line.
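A toy version of “completing the triangle” via common neighbors, in R (my sketch, certainly not Facebook’s actual algorithm): squaring the adjacency matrix counts shared friends, and pairs with shared friends who are not yet connected are the candidates.

# Recommend pairs of users who share a friend but are not yet friends.
A <- matrix(0, 4, 4)
rownames(A) <- colnames(A) <- c("u1", "u2", "u3", "u4")
A["u1", "u2"] <- A["u2", "u1"] <- 1
A["u2", "u3"] <- A["u3", "u2"] <- 1
A["u3", "u4"] <- A["u4", "u3"] <- 1

common <- A %*% A            # entry (i, j) = number of shared friends
diag(common) <- 0
which(common > 0 & A == 0, arr.ind = TRUE)  # u1-u3 and u2-u4 close triangles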

On Facebook, I am a member of several different network clusters, as most people are. Some include high […]

Opening Statements on Markov Chain Monte Carlo

This quarter I am TAing UCLA’s Statistics 102C: Introduction to Monte Carlo Methods, for Professor Qing Zhou. This course did not exist when I was an undergraduate, and I think it is pretty rare to teach Monte Carlo (minus the bootstrap, if you count that) or MCMC to undergrads. I am excited about this class because, to me, MCMC turns Statistics on its head. It felt like a totally different paradigm from the regression and data analysis I was used to at the time. It also exposes students to the connection between Statistics/MCMC and other fields such as Computer Science, Genetics/Biology, etc.

I usually do not have much to talk about during week 1, especially if my class meets on the second day of the quarter. Today was an exception because I wanted to excite the class about this topic.

Some examples I discussed:

- the general recipe for Monte Carlo methods
- the bootstrap as an example of resampling, and R loops
- computing π (a quick sketch follows below) and mention of Buffon’s Needle
- scheduling/timetabling and occupancy/matching problems using stochastic search (simulated annealing, Tabu search, etc.)
- mention of genetic algorithms and swarm intelligence
- PageRank as a Markov process, drawing a random sample of web […]
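Since week 1 is mostly motivation, here is a tiny, self-contained version of the π example (my sketch of the standard approach, not the course’s exact code):

# Estimate pi by Monte Carlo: throw uniform points at the unit square and
# count the fraction that land inside the quarter circle of radius 1.
set.seed(1)
n <- 1e6
x <- runif(n); y <- runif(n)
4 * mean(x^2 + y^2 <= 1)   # should be close to 3.1416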

Some Code for Dumping Data from Twitter Gardenhose

Gardenhose is a Streaming API feed that continuously sends a sample (roughly 15% according to Ryan Sarver at the 140tc in September 2009) of all tweets to feed recipients. This is some code for dumping the tweets to files named by date and hour. It is in PHP, which is not my favorite language, but it works nonetheless. I received a few requests to post it, so here it is.

<?php
// gardenhosedump.php
$username = '';
$password = '';
$file2 = null;
$newTime = '';
while (true) {
    $file = @fopen("http://" . $username . ":" . $password .
                   "@stream.twitter.com/1/statuses/sample.json", "r");
    if (!$file) {
        sleep(10);  // brief pause before reconnecting (added; not in the original)
        continue;
    }
    while ($data = fgets($file)) {
        // Name the output file by date and hour; roll over when the hour changes.
        $time = @date("YmdH");
        if ($newTime != $time) {
            if ($file2) { @fclose($file2); }
            $file2 = fopen("{$time}.txt", "a");
        }
        fputs($file2, $data);
        $newTime = $time;
    }
    // Need to close the files, but only if they are open!
    if ($file)  { @fclose($file); }
    if ($file2) { @fclose($file2); $file2 = null; }
}
?>

Lessons Learned from EC2

A week or so ago I had my first experience using someone else’s cluster on Amazon EC2. EC2 is the Amazon Elastic Compute Cloud. Users set up a virtual computing platform that runs on Amazon’s servers “in the cloud.” Amazon EC2 is not just another cluster. EC2 allows the user to create a disk image containing an operating system and all of the software they need to perform their computations. In my case, the disk image would contain Hadoop, R, Python and all of the R and Python packages I need for my work. This prevents the user (and the provider) from having to worry about providing or upgrading software and having compatibility issues.

No subscription is required. Users pay for the amount of resources used during the computing session. Hourly prices are very cheap, but charges accrue quickly. Additionally, Amazon charges for pretty much every single thing you can do with an OS: transferring data to/from the cloud per GB, data storage per GB, CPU time per hour per core, etc.

This is somewhat of a tangent, but EC2 was a brilliant business move in my opinion.

Anyway, life gets a bit more difficult when the EC2 instance […]

My Experience at ACM Data Mining Camp #DMcamp

My parents and I made plans to visit San Jose and Saratoga on my grandmother’s birthday, March 19, since that is where she grew up. I randomly saw someone tweet about the ACM Data Mining Camp unconference that happened to be the next day, March 20, only a couple of miles from our hotel in Santa Clara. This was an opportunity I could not pass up.

Upon arriving at eBay/PayPal’s “Town Hall” building, I was greeted by some very hyper people! Surrounding me were a lot of people my age who shared my interests. I finally felt like I was in my element. The organizers of the event had a predetermined Twitter hashtag, #DMCAMP, and also set up a blog where people could add material and write comments about the sessions. I felt like a kid in a candy shop when I saw the proposed topics for the breakout sessions.

Some of the proposed topics I found really interesting:

- Anomaly Detection
- Natural Language Processing
- Collaborative Filtering and a Netflix Paper
- CPC Optimization for Events
- Data Mining Programming Tools
- Structured Tags
- Status of Mahout
- Machine Learning with Parallel Processors
- Sentiment Analysis
- Parallel R

About half of these actually […]

Be Careful Searching Python Dictionaries!

For my talk on High Performance Computing in R (which I had to reschedule due to a nasty stomach bug), I used Wikipedia linking data, an adjacency list of articles and the articles to which they link. This data was linked from DataWrangling and was originally created by Henry Haselgrove. The dataset is small on disk, but I needed one that was huge, very huge. So, without a simple option off the top of my head, I took this data and expanded a subset of it into an incidence matrix, occupying 2GB in RAM. Perfect!

The subsetting was a bit of a problem because I had to maintain the links within the subgraph induced by the sample. This required me to search dictionary objects for keys, and this is where things went awry. Usually, I try to be as efficient as possible, but since I was just producing a little example, and would never ever run this code otherwise, I wasn’t as careful.

The data were presented as follows:

1. First, I looked at the from node; if from was in the chosen subset, I kept it and proceeded to step 2; otherwise, I threw it out.
2. Then, take the to nodes […]
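The excerpt cuts off before the punchline, but the title’s warning is presumably the classic membership-test pitfall (my assumption, not stated in the visible text): in Python 2, key in d.keys() builds a list of all keys and scans it linearly, while key in d queries the hash table directly. A small timing sketch:

# Hypothetical large node dictionary, standing in for the Wikipedia data above.
import timeit

d = dict((i, None) for i in range(1000000))

# O(n) per lookup in Python 2: keys() materializes a list, then `in` scans it.
slow = timeit.timeit(lambda: 999999 in d.keys(), number=10)
# O(1) per lookup: membership test against the dict's hash table.
fast = timeit.timeit(lambda: 999999 in d, number=10)

# In Python 2, slow is orders of magnitude larger. (In Python 3, keys()
# returns a view and both are fast.)
print(slow, fast)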

Exact Complexity of Mergesort, and an R Regression Oddity

It’s nice to be back after a pretty crazy two weeks or so.

Let me start off by stating that this blog post is simply me pondering and may not be correct. Feel free to comment on inaccuracies or improvements!

In preparation for an exam, and owing to my natural masochistic tendencies, I am forcing myself to find the exact complexities of some sorting algorithms, and I decided to start with a favorite – mergesort. Mergesort first divides an array or linked list into two halves (or close to it) and then recursively divides the successive lists in half until it ends up with lists containing one element each – the base case. The elements are then compared and merged in order to form their own sorted list.

At successive levels we repeatedly compare the smallest remaining element of the first sublist to the smallest remaining element of the second sublist, and merge the two into a single sorted list. This process continues up the recursion tree until the entire original list is sorted. For a more comprehensive and precise description, see this article on mergesort.

Easy: The Worst Case

The worst case is easy as any CS […]
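The excerpt ends here, but to make “exact” concrete: the worst-case comparison count satisfies T(n) = T(⌈n/2⌉) + T(⌊n/2⌋) + (n − 1), which solves to n⌈lg n⌉ − 2^⌈lg n⌉ + 1 (a standard result, not quoted from the post). Here is a small R mergesort instrumented to count comparisons, as a sketch for checking this empirically:

# Mergesort with a global comparison counter (illustrative sketch).
comparisons <- 0

merge_sorted <- function(left, right) {
  result <- numeric(length(left) + length(right))
  i <- 1; j <- 1
  for (k in seq_along(result)) {
    # Take from left if right is exhausted, or if left's head is smaller;
    # each actual element-to-element comparison bumps the counter.
    if (i <= length(left) && (j > length(right) || {
          comparisons <<- comparisons + 1
          left[i] <= right[j]
        })) {
      result[k] <- left[i]; i <- i + 1
    } else {
      result[k] <- right[j]; j <- j + 1
    }
  }
  result
}

mergesort <- function(x) {
  if (length(x) <= 1) return(x)
  mid <- floor(length(x) / 2)
  merge_sorted(mergesort(x[1:mid]), mergesort(x[(mid + 1):length(x)]))
}

comparisons <- 0
invisible(mergesort(sample(16)))
comparisons  # at most 16*4 - 2^4 + 1 = 49 for n = 16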