Merry Christmas and Happy Holidays!

Wishing you all a very Merry Christmas, Happy Holidays and Happy New Year!

An update on me: in October, I began working at Riot Games, the developers of League of Legends. It has been an amazing experience and, along with my dissertation, has occupied most of my free time. My New Year’s resolution this year is to dust the cobwebs off this blog!

Have a safe holiday season!

Here in California, I will be having Christmas in the Sand

My Interview about the Statistics Major

Recently, I participated in an email interview about what being a Statistics major entails, how I got interested in the field, and the future of Statistics. I figured this might be of interest to those who are contemplating majoring in Statistics or considering a career in Data Science.

Q1: Why did you decide to pursue a major in statistics in college?

A: “When I was a kid, I really enjoyed looking at graphs, plots and maps. My parents and I could not make out what was behind the interest. At the same time, I was also heavily interested in education. My mother was a teacher and the first set of statistics I ever encountered were standardized test scores. I strived to understand what the scores attempted to say about me, and why such scores and tests are considered so trustworthy. When the stakes increased with the AP and SAT exams, I began reading articles published by the Educational Testing Service and learned a ton about how these tests are constructed to minimize bias, and how scores are made comparable across forms. It fascinated me how much science goes into these tests, but at the end of the day they are still just one factor [...]

Merry Christmas 2011 From Byte Mining!

To all of my readers and followers, I wish you a very Merry Christmas and a very joyous and safe Happy New Year! This year, I am thankful for the community that has sprung up around Data Science and open-source data collection and processing. This blog is almost two years old and, as with Twitter, it has let me communicate with many data scientists, enthusiasts and some of the most prolific contributors to the data science software community. I am thankful for all of the wonderful people I have met and have yet to meet, and for your comments and readership.

Parsing Wikipedia Articles: Wikipedia Extractor and Cloud9

Lately I have been doing a lot of work with the Wikipedia XML dump as a corpus. Wikipedia provides a wealth of information to researchers in easy-to-access formats, including XML, SQL and HTML dumps for all language properties. Some of the data freely available from the Wikimedia Foundation include:

article content and template pages
article content with revision history (huge files)
article content including user pages and talk pages
redirect graph
page-to-page link lists: redirects, categories, image links, page links, interwiki etc.
image metadata
site statistics

The above resources are available not only for Wikipedia, but also for other Wikimedia Foundation projects such as Wiktionary, Wikibooks and Wikiquote.
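As a rough illustration of working with the XML dump as a corpus, here is a minimal Python sketch that streams pages out of a MediaWiki XML export one at a time. It is not the approach taken by Wikipedia Extractor or Cloud9. The `iter_pages` and `local_name` helpers and the tiny `SAMPLE` document are hypothetical, and the sketch assumes the usual `<page>`, `<title>` and `<text>` element names while ignoring whichever export namespace version the dump declares.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(xml_stream):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream.

    Pages are handled one at a time, so the whole dump never has to be
    parsed into a single in-memory tree. If a page carries several
    revisions, the text of the last one in the file is returned.
    """
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # drop this page's children once they have been read


# Tiny made-up export, standing in for a real (multi-gigabyte) dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision><text>'''Example''' is a [[sample]] article.</text></revision>
  </page>
  <page>
    <title>Redirect me</title>
    <revision><text>#REDIRECT [[Example]]</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for title, text in pages:
    print(title, "->", text)
```

For a real dump, the `io.BytesIO(SAMPLE)` argument would be replaced by an open file handle. Note that the text that comes back is still raw MediaWiki markup, which is exactly the part that is hard to parse.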

As Wikipedia readers will notice, the articles are very well formatted, and this formatting is generated by a somewhat unusual markup format defined by the MediaWiki project. As Dirk Riehle stated:

There was no grammar, no defined processing rules, and no defined output like a DOM tree based on a well defined document object model. This is to say, the content of Wikipedia is stored in a format that is not an open standard. The format is defined by 5000 lines of php code (the parse function of MediaWiki). That code may be open source, but it is incomprehensible to most. That’s why [...]

EC2 Trials and Tribulations, Part 1 (Web Crawling)

Elastic Compute Cloud (EC2) is a service provided by Amazon Web Services that allows users to leverage computing power without the need to build and maintain servers, or spend money on special hardware. The idea is simple: the user “boots” up one or more machines and then accesses those machines as if they were logged into any other machine remotely. I used EC2 and Elastic MapReduce extensively for my M.S. thesis last spring, but mainly used its large memory capabilities rather than its potential for explicit parallelism.

Recently, I ran a crawling job on EC2 using a parallel crawler I wrote in Python with twill. Using EC2 poses its own challenges. Using parallel code poses more challenges. Combining the two with the fact that crawling is I/O bound creates even more interesting challenges. If you have taken a course in operating systems, you have heard this stuff over and over again. So have I, but I am stubborn. I tend to learn lessons from experience, and this was no exception. Through this series of posts, I want to point out difficulties and “gotchas” that are important to keep in mind when using EC2, and in this post, with [...]

Merry Christmas from Byte Mining!

To all of my readers and followers, I wish you a very Merry Christmas and a very joyous and safe Happy New Year! I will be spending the holidays coding on my new Motorola Droid X (goodbye AT&T!).

Taking R to the Limit, Part I - Parallelization in R

Tuesday night I had the opportunity to present on high performance computing in R at the Los Angeles R Users’ Group. There was so much to talk about that I had to split my talk into two parts. The first part covered parallelization, and the second part will cover big data (plus a bit left over from parallelization, including Hadoop).

My slides are posted on SlideShare, and available for download here.

Taking R to the Limit (High Performance Computing in R), Part 1 — Parallelization, LA R Users' Group 7/27/
View more presentations from Ryan Rosario.

The corresponding demonstration code is here.

Topics included:

Rmpi
snow
snowfall and sfCluster
multicore
foreach
brief mention of CUDA and GPUs

Video of the presentation with my commentary:

The video was created with Vara ScreenFlow and I am very happy with how easy it is to use and how painless editing was.

For Part 2, Large Datasets in R, click here.

Anecdotal Evidence that Facebook Stores all Clicks?

This is not really news. A few months ago, news broke that Facebook recorded each user’s clicks and profile views in a database. Of course, I am not at all surprised. I would be more surprised if they didn’t store every single click.

By now, most people have some sense as to how Facebook’s recommendation system works. It typically performs what one of my professors called “completing the triangle.” If users A and B are friends, and users B and C are friends, Facebook may hypothesize that A and C should also be friends. Of course, Facebook’s algorithm is not that naive. Consider a slightly more realistic example in the graph below. I must provide a picture, otherwise I will end up using “recursive language” (i.e. “friends of a friend of a friend that’s friends with…”). The red lines represent existing friendships. This graph consists of two triangles, one containing one man and the two women, and another containing one woman and the two men. Facebook would most likely conclude that the two people with spiky hair should be friends, denoted by the green dashed line.
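To make “completing the triangle” concrete, here is a minimal Python sketch that ranks a user's non-friends by how many mutual friends they share. This is only the naive heuristic described above, not Facebook's actual algorithm. The `suggest_friends` function, the names, and the small friendship graph are all made up for illustration.

```python
from collections import Counter


def suggest_friends(friends, user):
    """Rank non-friends of `user` by the number of triangles they would close.

    `friends` maps each person to the set of people they are friends with
    (an undirected graph stored as adjacency sets).
    """
    mutual = Counter()
    for friend in friends[user]:
        for candidate in friends[friend]:
            # Skip the user and anyone who is already a friend.
            if candidate != user and candidate not in friends[user]:
                mutual[candidate] += 1
    return mutual.most_common()


# Hypothetical friendship graph.
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan"},
    "cat": {"ann", "dan", "eve"},
    "dan": {"bob", "cat"},
    "eve": {"cat"},
}

suggestions = suggest_friends(friends, "ann")
print(suggestions)
```

Here "dan" shares two friends with "ann" (bob and cat) while "eve" shares only one (cat), so "dan" is ranked first.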

On Facebook, I am a member of several different network clusters, as most people are. Some [...]

Some Python Nooks and Crannies

I spent this weekend reading Learning Python (Second Edition for Python 2.3!) by Mark Lutz. Python is my favorite programming language, but my experience with it has been mostly anecdotal; I come up with my own solutions and functions and I Google whatever I do not know. I decided to spend a couple of days with this incredibly out-of-date book to formalize my knowledge of the base Python language. It was fairly easy reading because I already had experience with about 80% of the constructs discussed. But it was fun to learn some things that I have not used, and some things that I did not even know existed. I want to share some of these gems here. Pardon me if all of this stuff is obvious to you.

Populating a String with a Dictionary

>>> data = {}
>>> data['first_name'] = "Ryan"
>>> data['age'] = 21  # some programming humor
>>> print("Hello, my name is %(first_name)s and I am %(age)d years old." % data)
Hello, my name is Ryan and I am 21 years old.
Notice that we can also put a function call after the last % sign above, as long as the function returns a dictionary.
The s and the d after the dictionary [...]
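As a small sketch of the function variant mentioned above, the right-hand side of % can be a call that returns the dictionary. The `get_profile` helper below is hypothetical and exists only for illustration.

```python
def get_profile():
    """Hypothetical helper that returns the values to substitute."""
    return {"first_name": "Ryan", "age": 21}


# %(first_name)s looks up the 'first_name' key and formats it as a string;
# %(age)d looks up the 'age' key and formats it as a decimal integer.
template = "Hello, my name is %(first_name)s and I am %(age)d years old."
message = template % get_profile()
print(message)
```

Extra keys in the dictionary are simply ignored, so the same dictionary can feed several different templates.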

What to Expect?

In 2007, I was introduced to Twitter via the written qualifying exam towards my Ph.D. At first, I did not know what to do with it. After a good year or so (maybe even sooner) had passed, I began to follow some very interesting people who share the same interests as me. It has transformed my academic experience. It is great to run across tweets promoting conferences and newly released papers in my field. One of my favorite parts about Twitter, aside from interacting with tweeps, is the ability to quickly post a status update on what I am doing and refer back to it later. I consider it a platform for collaboration because I see what others are doing via tweets as well as linked blogs, whether it is a Twitter user, or some offline user. I quickly realized that 140 characters were not enough to solidify my thoughts and participate in the community. Thus, I decided to start this blog so I can share cool things I have found in my research/work with others anywhere on the web and communicate in more than 140 characters.

Here are some things that I am very interested in and [...]