Merry Christmas from Byte Mining!

To all of my readers and followers, I wish you a very Merry Christmas and a joyous, safe New Year! I will be spending the holidays coding on my new Motorola Droid X (goodbye, AT&T!).

Some Lessons in Production Development (Hadoop) – Part 1

Wow. I can’t believe it has been a month since I last posted. On December 1, I started a new chapter in my life, working full time as a Data Scientist at the Rubicon Project. Needless to say, that has been keeping me occupied, as has thinking about working on my dissertation. For the time being, I am getting settled in here.

When I accepted this position, one of my hopes/expectations was to become professionally competent and confident in C, Java, Python, Hadoop, and the software development process rather than relying on hobby and academic knowledge. That is something a degree cannot help with. It has been a great experience, although a very frustrating one at times, but that is to be expected when jumping into development professionally.

I am writing this post to chronicle what I have learned about using Hadoop in production and how greatly it differs from its use in my research and personal analysis. To start, I was asked to check out a huge stack of code from a Subversion repository. But then what?

But you’re a Computer Scientist! This should be easy!

The first part is true, but there is a stark difference between a garden variety computer […]

My Day at ACM Data Mining Camp III

My first time at ACM Data Mining Camp was so awesome that I was thrilled to make the trip up to San Jose for the November 2010 edition. In July, I gave a talk at the Emerging Technologies for Online Learning Symposium at the Fairmont with a faculty member in the Department of Statistics. The place was amazing, and I told myself I would save up to stay there. This trip gave me an opportunity to check it out and pretend that I am posh for a weekend ;). The night I arrived, I had a great dinner and drinks at a place called Gordon Biersch: the best garlic fries and BBQ burger I have ever had, washed down with a Dragonfruit Strawberry Mojito, a Barbados Rum Runner, and a Long Island Iced Tea to finish, so the drinks were awesome as well. Anyway, to the point of this post…

The next morning I made the short trek to PayPal headquarters for a very long 9am-8pm day. Since I had come up just for the camp, I wanted to make the most of it and paid the $30 for the morning session, even though I […]

Exciting Tools for Big Data: S4, Sawzall and mrjob!

This week, a few different big data processing tools were released to the open-source community. I know, I know, this is probably the 1000th blog post about this, and perhaps the train has left the station without me, but here I am.

Yahoo’s S4: Distributed Stream Computing Platform

First off, it must be said. S4 is NOT real-time map-reduce! This is the meme that has been floating around the Internets lately.

S4 is a distributed, scalable, partially fault-tolerant, pluggable platform that allows users to create applications that process unbounded streaming data. It is not a Hadoop project. As a matter of fact, it is not even a form of map-reduce. S4 was developed at Yahoo for personalization of search advertising products. Map-reduce, so far, is not a great platform for dealing with streaming/non-stored data.

Pieces of data, apparently called events, are sent to and consumed by a Processing Element (yes, PE, but not the kind that requires you to sweat). The PEs can do one of two things:

- emit another event that will be consumed by another PE, or
- publish some result

Streaming data is different from non-streaming data in that the user does not know how much data will […]

Accessing R from Python using RPy2

This past Tuesday I had the opportunity to present a short talk (which ran a bit long) related to text mining at the Los Angeles R Users’ Group. Since I do most of my text mining in Python, I took this opportunity to discuss RPy2, an interface to R from Python. My slides are below:


Download/view slides here. Topics include:

- Using Python with R, with an example using web mining
- Web mining using pure R rather than Python

Code for the demonstration is here:

- offtopic_demo.py is a pure Python script that extracts data from a web forum and dumps it to disk. To actually use it, you will need to register for an account.
- RPy2_demo.py reads the forum data from disk and calls R from Python to perform some basic analysis.
- curljson_demo.R grabs some JSON data from the Twitter Search API using RCurl and converts it to R lists using rjson (a minimal sketch of this pattern appears below).
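
To give a flavor of that last script, here is a minimal sketch of the RCurl + rjson pattern; the query and the "results"/"text" field names are assumptions based on the 2010-era Twitter Search API, not code lifted from curljson_demo.R:

    # Minimal sketch of the RCurl + rjson pattern (not curljson_demo.R
    # itself). The query and field names assume the 2010-era Search API.
    library(RCurl)
    library(rjson)

    url <- "http://search.twitter.com/search.json?q=rstats"
    raw <- getURL(url)        # the raw JSON response as one string
    parsed <- fromJSON(raw)   # converted to nested R lists

    # Pull the text of each tweet out of the results.
    sapply(parsed$results, function(tweet) tweet$text)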

Video:

Running the code requires some packages that you need to install.

- twill, a package for web browsing that installs as a Python package. It requires the mechanize package as well. […]

Transactions, and Pondering their Use in Casinos

A couple of weeks ago, Bradford Cross of FlightCaster posted in Measuring Measures that transactions are the next big data category. I argue that they already are, and from reading his blog post, he seems to suggest this as well, though I admit I may have missed his point. There are some clear examples of transactions and their importance:

- Itemset Mining. Cross discusses this in his article. Financial transactions on sites like Amazon contain items (merchandise). Using these transactions, Amazon built a recommendation engine to recommend new items to customers on its website, and even to customize deals for customers via email and on the site. (A toy sketch of itemset mining follows this list.)
- Wireless Localization. Fantasyland at The Magic Kingdom in Walt Disney World was to undergo a big overhaul to provide a personalized experience based on transactions throughout the park. An RFID chip would be included in a ticket (or some type of document), and the visitor’s information from a survey would be transmitted to the attraction’s intelligent system. Such a system would also give Disney a wealth of information about which attractions certain audiences visit, when, how often, and even what items a visitor may purchase during the day.
- Website Conversion Path Optimization. A visit […]
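
To make the itemset mining idea concrete, here is a toy sketch using the R package arules on invented grocery baskets; it illustrates the general technique, not Amazon’s actual system:

    # Toy itemset-mining sketch with the arules package. The baskets are
    # invented for illustration; this shows the general technique only.
    library(arules)

    baskets <- list(
      c("milk", "bread", "butter"),
      c("beer", "bread"),
      c("milk", "bread", "butter", "beer"),
      c("milk", "butter")
    )
    trans <- as(baskets, "transactions")

    # Mine association rules above minimum support/confidence thresholds.
    rules <- apriori(trans, parameter = list(supp = 0.5, conf = 0.8))
    inspect(rules)   # includes, e.g., {butter} => {milk}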

Lists of English Words

When I was a kid, I went through an 80s music phase…well, some things never change. “People just love to play with words…” Know that song? Anyway…

One of the biggest pains of text mining and NLP is colloquialism: language that is only appropriate in casual speech and not in formal speech or writing. Informal contractions (“gonna”, “wanna”, “whatcha”, “ain’t”, “y’all”) are colloquialisms and are everywhere on the Web. There is also a great deal of slang common on the Web, including acronyms (“LOL”, “WTF”) and smilies that add sentiment to text. There is also a less common slang called leetspeak that replaces letters with numbers (“n00b” rather than “noob”, “pwned” instead of “owned”, and “pr0n” instead of “porn”).
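
This is the kind of mess a good word list helps clean up. Here is a tiny normalization sketch in R; the lookup table is a made-up sample, not one of the lists this post is about:

    # Tiny slang-normalization sketch. The lookup table is a made-up
    # sample; a real one would be built from word lists like those below.
    slang <- c("gonna" = "going to", "wanna" = "want to",
               "n00b"  = "noob",     "pr0n"  = "porn")

    normalize <- function(text) {
      for (s in names(slang)) {
        text <- gsub(s, slang[[s]], text, fixed = TRUE)
      }
      text
    }

    normalize("that n00b is gonna learn")
    # "that noob is going to learn"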

There are also regionalisms, which are a pain for semantic analysis but not so much for probabilistic analysis. Some examples are pancakes (“flapjacks”, “griddlecakes”) and carbonated beverages (“soda”, “pop”, “Coke”). Or, little did I know, “maple bars” vs. “Long Johns”. Now I am hungry. There are also words that have both a formal and an informal meaning, such as “kid” (a young goat, or a child…same thing).

Source: http://popvssoda.com/

Linguists consider colloquialisms different from slang. Slang is informal language […]

UCLA Statistics: Analyzing Thesis/Dissertation Lengths

As I am working on my dissertation and piecing together a mess of notes, code, and output, I am wondering to myself, “how long is this thing supposed to be?” I am definitely not in this to win the prize for the longest dissertation. I just want to say my piece, make my point, and move on. I’ve heard that the shortest dissertation in my program was 40 pages (not true). I heard from someone at another school that their dissertation was over 300 pages. I am not holding myself to a strict limit, but I wanted a rough guideline. As a disclaimer, this blog post is more “fun” than “business.” This was just an analysis that I was interested in and felt was worth sharing, since it combined Python, web scraping, R, and ggplot2. It is not meant to be a thorough analysis of dissertation lengths or of the academic quality of the Department.

The UCLA Department of Statistics publishes most of its M.S. theses and Ph.D. dissertations on a website. It is not complete, especially for the earlier years, but it is a good enough population for my use.

Using this web page, I was able to extract information […]
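
To give a flavor of the R/ggplot2 side of an analysis like this, here is a minimal sketch using simulated page counts; the numbers are invented, not the scraped data:

    # Minimal ggplot2 sketch of the kind of figure such an analysis could
    # produce. The page counts are simulated, not the scraped data.
    library(ggplot2)

    set.seed(1)
    docs <- data.frame(
      degree = rep(c("M.S.", "Ph.D."), each = 30),
      pages  = c(round(rnorm(30, mean = 60,  sd = 15)),
                 round(rnorm(30, mean = 120, sd = 30)))
    )

    ggplot(docs, aes(x = pages, fill = degree)) +
      geom_histogram(binwidth = 10, position = "identity", alpha = 0.5) +
      labs(x = "Length (pages)", y = "Count")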

Taking R to the Limit, Part II – Large Datasets in R

For Part I, Parallelization in R, click here.

Tuesday night I again had the opportunity to present on high performance computing in R at the Los Angeles R Users’ Group. This was the second part of a two-part series called “Taking R to the Limit: High Performance Computing in R.” Part II discussed ways to work with large datasets in R, and I also tied MapReduce into the talk. Unfortunately, there was too much material; I had originally planned to also cover Rhipe, using R on EC2, and sparse matrix libraries.

Slides

My edited slides are posted on SlideShare, and available for download here.

Taking R to the Limit (High Performance Computing in R), Part 2 — Large Datasets, LA R Users' Group 8/17/10


Topics included:

- bigmemory, biganalytics, and bigtabulate (a minimal bigmemory sketch appears under Code below)
- ff
- HadoopStreaming
- brief mention of Rhipe

Code

The corresponding demonstration code is here.
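
As a flavor of the bigmemory portion, here is a minimal sketch of a file-backed big.matrix; it is a generic illustration, not taken from the demonstration code:

    # Minimal bigmemory sketch (a generic illustration, not the demo code).
    # A file-backed big.matrix lives on disk, so it can exceed available RAM.
    library(bigmemory)

    x <- filebacked.big.matrix(
      nrow = 1e6, ncol = 3, type = "double",
      backingfile = "demo.bin", descriptorfile = "demo.desc"
    )
    x[, 1] <- rnorm(1e6)   # indexed like an ordinary R matrix
    mean(x[, 1])

    # A separate R session can attach the same data from disk:
    # y <- attach.big.matrix("demo.desc")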

Data

Since this talk discussed large datasets, I used some, well, large datasets. Some demonstrations used toy data, including the trees and famous iris datasets included in base R. To load these, just use the call data(iris) or data(trees).

Large datasets:

On-Time Airline Performance data from […]

Taking R to the Limit, Part I – Parallelization in R

Tuesday night I had the opportunity to present on high performance computing in R at the Los Angeles R Users’ Group. There was so much to talk about that I had to split my talk into two parts. The first part was parallelization, and the second part will be big data (and a bit left over from parallelization, including Hadoop).

My slides are posted on SlideShare, and available for download here.

Taking R to the Limit (High Performance Computing in R), Part 1 — Parallelization, LA R Users' Group 7/27/


The corresponding demonstration code is here.

Topics included:

- Rmpi
- snow
- snowfall and sfCluster
- multicore
- foreach
- brief mention of CUDA and GPUs
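
To give a taste of what these packages do, here is a minimal snow sketch that farms a computation out to a local socket cluster; it is a generic illustration, not the talk’s demo code:

    # Minimal snow sketch (a generic illustration, not the demo code):
    # start 4 local worker processes and farm out a simple computation.
    library(snow)

    cl <- makeCluster(4, type = "SOCK")
    squares <- parSapply(cl, 1:100, function(i) i^2)
    stopCluster(cl)

    sum(squares)   # 338350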

Video of the presentation with my commentary:

The video was created with Vara Software’s ScreenFlow, and I am very happy with how easy it was to use and how painless editing was.

For Part 2, Large Datasets in R, click here.

[…]