
For Part I, Parallelism in R, click here.
Tuesday night I again had the opportunity to present on high performance computing in R at the Los Angeles R Users’ Group. This was the second part of a two-part series called “Taking R to the Limit: High Performance Computing in R.” Part II discussed ways to work with large datasets in R, and I also tied MapReduce into the talk. Unfortunately, there was too much material: I had originally planned to cover Rhipe, using R on EC2, and sparse matrix libraries, but could not fit them all in.
Slides
My edited slides are posted on SlideShare, and available for download here.
Topics included:
- bigmemory, biganalytics and bigtabulate
- ff
- HadoopStreaming
- brief mention of Rhipe
Code
The corresponding demonstration code is here.
Data
Since this talk discussed large datasets, I used some, well, large datasets. Some demonstrations used toy data, including trees and the famous iris dataset included in base R. To load these, just call data(iris) or data(trees).
Large datasets:
- On-Time Airline Performance data from 2009 Data Expo. This Bash script will download all of the necessary data files and create a nice dataset for you called airline.csv in the directory in which it is executed. I would just post it here, but it is very large and I only have so much bandwidth!
- The Twitter dataset appears to no longer be available. Instead, use anna.txt which comes with HadoopStreaming. Simply replace twitter.tsv with anna.txt.
Video
The video was created with Vara ScreenFlow and I am very happy with how easy it is to use and how painless editing was.
For Part I, Parallelism in R, click here.
Tuesday night I had the opportunity to present on high performance computing in R at the Los Angeles R Users’ Group. There was so much to talk about that I had to split my talk into two parts. The first part was parallelization and the second part will be big data (and a bit left over from parallelization, including Hadoop).
My slides are posted on SlideShare, and available for download here.
Taking R to the Limit (High Performance Computing in R), Part 1 — Parallelization, LA R Users' Group 7/27/
View more presentations from Ryan Rosario.
The corresponding demonstration code is here.
Topics included:
Rmpi
snow
snowfall and sfCluster
multicore
foreach
brief mention of CUDA and GPUs
Video of the presentation with my commentary:
The video was created with Vara ScreenFlow and I am very happy with how easy it is to use and how painless editing was.
For Part 2, Large Datasets in R, click here.
This week I had the opportunity to trek up north to Silicon Valley to attend Yahoo’s Hadoop Summit 2010. I love Silicon Valley. The few times I’ve been there the weather was perfect (often warmer than LA), with little to no traffic and no road rage, and people overall seem friendly and happy. Not to mention there are so many trees it looks like a forest!
The venue was the Hyatt Regency Great America which seemed like a very posh business hotel. Walking into the lobby and seeing a huge crowd of enthusiasts and an usher every two steps was overwhelming!
After being welcomed by the sound of the vuvuzela, we heard some statistics about the growth of Hadoop over the years. Apparently in 2008 there were 600 attendees at Hadoop Summit, and in 2010 attendance grew to 1000. The conference started with three hours of keynotes and sessions that everyone attended in one room. The tone was somewhat corporate in nature, but there were also several gems in there about how Yahoo uses Hadoop:
A user’s click streams are mapped to latent interests, similar to mapping words to interests in topic models like LSA and LDA. Yahoo uses Hadoop to recompute a [...]
As a true R fan, I like to believe that R can do anything, no matter how big, how small or how complicated: there is some way to do it in R. I decided to approach my large, sparse matrix problem with this attitude. But here I sit a broken man.
There is no “native” big data support built into R, even when using the 64-bit build of R. Before venturing on this endeavor, I consulted with my advisor, who reassured me that R uses the state of the art for sparse matrices. That was enough for me.
My Problem
For part of my Master’s thesis, I wrote code to extract all of the friends and followers out to network degree 2 to construct a “small-world” snapshot of a user via their relationships. In a graph, nodes and edges grow exponentially as the degree increases. The number of nodes was on the order of 300,000. I predict the number of edges will be around 900,000. The code is still running. This means that a dense matrix would have size roughly 300,000 × 300,000. Some of you already know how this story is going to end…
The matrix is very sparse.
Very sparse.
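To put rough numbers on how sparse that is, here is a small back-of-the-envelope sketch in Python. It is only an illustration, not the code from my thesis: the 8-byte entries, the 4-byte indices and the coordinate (row, column, value) layout are assumptions chosen for the arithmetic, and the function names are made up for this example.

```python
# Back-of-the-envelope memory comparison for the graph described above.
# Assumptions (not from the original analysis): 8-byte entries for the dense
# matrix, and a coordinate (row, column, value) layout with 4-byte indices
# and 8-byte values for the sparse one.

N_NODES = 300_000   # approximate node count quoted in the post
N_EDGES = 900_000   # predicted edge count quoted in the post


def dense_bytes(n_nodes: int, bytes_per_entry: int = 8) -> int:
    """Memory needed to store every cell of an n x n adjacency matrix."""
    return n_nodes * n_nodes * bytes_per_entry


def sparse_bytes(n_edges: int, index_bytes: int = 4, value_bytes: int = 8) -> int:
    """Memory needed to store only the nonzero cells as (row, col, value)."""
    return n_edges * (2 * index_bytes + value_bytes)


def density(n_nodes: int, n_edges: int) -> float:
    """Fraction of cells that are nonzero, counting one cell per edge."""
    return n_edges / (n_nodes * n_nodes)


if __name__ == "__main__":
    print(f"dense:   {dense_bytes(N_NODES) / 1e9:.0f} GB")
    print(f"sparse:  {sparse_bytes(N_EDGES) / 1e6:.1f} MB")
    print(f"density: {density(N_NODES, N_EDGES):.6%}")
```

Under those assumptions the dense matrix needs about 720 GB, while the nonzero entries alone fit in about 14 MB, which is why a sparse representation is the only realistic option here.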
The raw data graph.log consists of [...]
This logo means that the blog post is about something I have found interesting, but does not apply directly to the exact purpose of this blog.
Note: These commands have been tested in pdflatex. I am not sure if they work in other distributions.
Over the past couple of months, I have been assisting with editing some papers and also doing some projects in LaTeX. Seeing other people’s code has taught me some interesting things. Here are a few of them.
Arrays, Lists, and Loops
Indeed, it is possible to write loops in LaTeX, with a few limitations. I have used this in some of the following ways:
creating a perfectly aligned bubble answer sheet for scoring by a desktop scanner.
constructing midterms using a different dataset (form) for each student.
creating sheets of business cards for our consulting center / mail merge.
Some other ways arrays and loops can be used with some clever programming:
creating multiple forms of an exam using items from different files.
reports such as rosters, invoices, pre-printed forms, etc.
Arrays. To create an array in LaTeX, you will need the arrayjob package that can be downloaded from here. Then, to create the array we use the command \newarray\ArrayName where ArrayName is the name of [...]
This is not really news. A few months ago, news broke that Facebook recorded each user’s clicks and profile views in a database. Of course, I am not at all surprised. I would be more surprised if they didn’t store every single click.
By now, most people have some sense as to how Facebook’s recommendation system works. It typically performs what one of my professors called “completing the triangle.” If users A and B are friends, and users B and C are friends, Facebook may hypothesize that A and C should also be friends. Of course, Facebook’s algorithm is not that naive. Consider a slightly more realistic example in the graph below. I must provide a picture, otherwise I will end up using “recursive language” (i.e. “friends of a friend of a friend that’s friends with…”). The red lines represent existing friendships. This graph consists of two triangles, one containing one man and the two women, and another containing one woman and the two men. Facebook would most likely conclude that the two people with spiky hair should be friends, denoted by the green dashed line.
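The naive “completing the triangle” idea can be sketched in a few lines of Python. To be clear, this is my own toy illustration and certainly not Facebook’s algorithm: it simply counts, for every pair of people who are not yet friends, how many friends they have in common. The function name and the user labels are placeholders.

```python
from collections import Counter
from itertools import combinations


def suggest_friends(friendships):
    """Rank non-friend pairs by how many mutual friends they share.

    friendships: iterable of (person, person) pairs, treated as undirected.
    Returns a list of ((a, b), mutual_count) sorted by count, descending.
    """
    neighbors = {}
    for a, b in friendships:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    open_triangles = Counter()
    # Each person is the shared corner of a potential triangle between
    # every pair of their friends.
    for person, friends in neighbors.items():
        for a, b in combinations(sorted(friends), 2):
            if b not in neighbors[a]:   # a and b are not yet friends
                open_triangles[(a, b)] += 1

    # Most mutual friends first; ties broken alphabetically.
    return sorted(open_triangles.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    # Hypothetical users; the labels are placeholders for illustration.
    edges = [("A", "B"), ("B", "C"), ("A", "D"), ("D", "C")]
    print(suggest_friends(edges))
```

Pairs that would close more than one triangle float to the top of the list, which is one simple way to move beyond the single-triangle case.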
On Facebook, I am a member of several different network clusters, as most people are. [...]
This quarter I am TAing UCLA’s Statistics 102C, Introduction to Monte Carlo Methods, for Professor Qing Zhou. This course did not exist when I was an undergraduate, and I think it is pretty rare to teach Monte Carlo (minus the bootstrap if you count that) or MCMC to undergrads. I am excited about this class because to me, MCMC turns Statistics on its head. It felt like a totally different paradigm compared to the regression and data analysis paradigm that I was used to at the time. It also exposes students to the connection between Statistics/MCMC and other fields such as Computer Science, Genetics/Biology, etc.
I usually do not have much to talk about during week 1, especially if my class is the second day of the quarter. Today was an exception because I wanted to excite the class about this topic.
Some examples I discussed:
the general recipe for Monte Carlo methods
the bootstrap as an example of resampling, and R loops
computing π and mention of Buffon’s Needle
scheduling/timetabling and occupancy/matching problems using stochastic search (simulated annealing, Tabu search etc.)
mention of genetic algorithms and swarm intelligence
PageRank as a Markov process
drawing a random sample of web pages using Random Walk Metropolis-Hastings
short inventory of [...]
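As a concrete instance of the general Monte Carlo recipe in the list above (draw random samples, evaluate something on each, average), here is a minimal Python sketch that estimates π. It is not the code I used in section and it is not Buffon’s Needle, which targets the same quantity by dropping needles on lined paper; it uses the simpler quarter-circle version. The function name and the fixed seed are my own choices for the example.

```python
import random


def estimate_pi(n_samples: int, seed: int = 0) -> float:
    """Estimate pi by sampling points uniformly in the unit square.

    The fraction landing inside the quarter circle x^2 + y^2 <= 1
    approximates its area, pi / 4.
    """
    rng = random.Random(seed)   # fixed seed so runs are repeatable
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_samples


if __name__ == "__main__":
    # The estimate tightens slowly as the sample size grows.
    for n in (100, 10_000, 1_000_000):
        print(n, estimate_pi(n))
```

The error shrinks roughly like one over the square root of the sample size, so each extra digit of accuracy costs about a hundred times more samples.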
Gardenhose is a Streaming API feed that continuously sends a sample (roughly 15% according to Ryan Sarver at the 140tc in September 2009) of all tweets to feed recipients. This is some code for dumping the tweets to files named by date and hour. It is in PHP, which is not my favorite language, but it works nonetheless. I received a few requests to post it, so here it is.
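<?php
// gardenhosedump.php
// Dumps the Gardenhose sample stream to files named by date and hour.
$username = '';
$password = '';
$newTime = '';   // hour stamp (YmdH) of the file currently being written
$file2 = null;   // handle for the current output file
while (true) {
    // Open the stream; the outer loop reconnects whenever it drops.
    $file = fopen("http://" . $username . ":" . $password . "@stream.twitter.com/1/statuses/sample.json", "r");
    while ($data = fgets($file))
    {
        $time = @date("YmdH");
        if ($newTime != $time)
        {
            // The hour has changed: close the old file and open the next one.
            if ($file2) { fclose($file2); }
            $file2 = fopen("{$time}.txt", "a");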
[...]
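For readers who do not use PHP, the hourly file rotation idea can be sketched in Python as follows. This is not a translation of the elided remainder of the script above. It is a separate minimal sketch that assumes the tweets arrive as any iterable of text lines; the function name, the out_dir parameter and the injectable clock are hypothetical and exist only to keep the example self-contained.

```python
import os
from datetime import datetime


def dump_by_hour(lines, out_dir=".", clock=datetime.now):
    """Append each line to a file named YYYYMMDDHH.txt for the current hour.

    lines: any iterable of text lines (a stand-in for the streaming feed).
    clock: callable returning the current datetime; injectable for testing.
    Returns the list of file paths written to, in order of first use.
    """
    current_stamp = None
    out = None
    paths = []
    try:
        for line in lines:
            # Same naming scheme as PHP's date("YmdH").
            stamp = clock().strftime("%Y%m%d%H")
            if stamp != current_stamp:
                # The hour changed: close the old file, open the next one.
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, stamp + ".txt")
                out = open(path, "a", encoding="utf-8")
                paths.append(path)
                current_stamp = stamp
            out.write(line if line.endswith("\n") else line + "\n")
    finally:
        if out is not None:
            out.close()
    return paths
```

Files are opened in append mode, so restarting the process in the middle of an hour keeps adding to that hour’s file instead of overwriting it.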
A week or so ago I had my first experience using someone else’s cluster on Amazon EC2. EC2 is the Amazon Elastic Compute Cloud. Users set up a virtual computing platform that runs on Amazon’s servers “in the cloud.” Amazon EC2 is not just another cluster. EC2 allows the user to create a disk image containing an operating system and all of the software they need to perform their computations. In my case, the disk image would contain Hadoop, R, Python and all of the R and Python packages I need for my work. This prevents the user (and the provider) from having to worry about providing or upgrading software and having compatibility issues.
No subscription is required. Users pay for the amount of resources used for the computing session. Hourly prices are very cheap, but accrue quickly. Additionally, Amazon charges for pretty much every single thing you can do with an OS: transferring data to/from the cloud per GB, data storage per GB, CPU time per hour per core, etc.
This is somewhat of a tangent, but EC2 was a brilliant business move in my opinion.
Anyway, life gets a bit more difficult when the EC2 instance you’re working [...]
My parents and I made plans to visit San Jose and Saratoga on my grandmother’s birthday, March 19, since that is where she grew up. I randomly saw someone tweet about the ACM Data Mining Camp unconference that happened to be the next day, March 20, only a couple of miles from our hotel in Santa Clara. This was an opportunity I could not pass up.
Upon arriving at eBay/PayPal’s “Town Hall” building, I was greeted by some very hyper people! Surrounding me were a lot of people my age who shared my interests. I finally felt like I was in my element. The organizers also had a predetermined Twitter hashtag for the event, #DMCAMP, and set up a blog where people could add material and write comments about the sessions. I felt like a kid in a candy shop when I saw the proposed topics for the breakout sessions.
Some of the proposed topics I found really interesting:
Anomaly Detection
Natural Language Processing
Collaborative Filtering and a Netflix Paper
CPC Optimization for Events
Data Mining Programming Tools
Structured Tags
Status of Mahout
Machine Learning with Parallel Processors
Sentiment Analysis
Parallel R
About half of these actually made it onto the schedule. Unfortunately, I was only able to attend [...]