Review of 2011 Data Scientist Summit

Some time over the past 6 weeks I randomly saw a tweet announcing the “Data Scientist Summit” and shortly below it I saw that it would be held in Las Vegas at the Venetian. Being a Data Scientist myself is reason enough to not pass up this opportunity, but Vegas definitely sweetens the deal! On Wednesday I woke up at 6am to partake on the 5.5 hour voyage to Las Vegas.

The Pre-Party

The Venetian and all close hotels were booked, so I ended up at the Aria; a new experience. The hotel is beautiful and very ritzy. I had heard that the rooms were very technologically advanced but I wasn’t prepared for the recorded welcome message, music and automatic shades opening upon entry to the room. The Aria is a geek’s paradise. Everything is computerized. Key cards are “waved” rather than swiped, lights are turned on/off and dimmed by use case (“sleep”, “read” etc.), rather than manually. There are no paper “Do Not Disturb” signs; rather, a switch on the wall (or via TV) toggles an indicator light outside the door. And the best part… Internet is FREE!

The rhododendrons hydrangeas are real!
Work […]

My Day at ACM Data Mining Camp III

My first time at ACM Data Mining Camp was so awesome, that I was thrilled the make the trip up to San Jose for the November 2010 version. In July, I gave a talk at the Emerging Technologies for Online Learning Symposium conference with a faculty member in the Department of Statistics, at the Fairmont. The place was amazing, and I told myself I would save up to stay there. This trip gave me an opportunity to check it out, and pretend that I am posh for a weekend ;). The night I arrived I had the best dinner and drinks at this place called Gordon Biersch. I had the best garlic fries and BBQ burger I have ever had. I ate it with a Dragonfruit Strawberry Mojito, the Barbados Rum Runner, and finished off with a Long Island Iced Tea, so the drinks were awesome as well. Anyway, to the point of this post…

The next morning I made the short trek to the PayPal headquarters for a very long 9am-8pm day. Since I came up here for the camp, I wanted to make the most of it and paid the $30 for the morning session, even though I had […]

Lists of English Words

When I was a kid, I went through an 80s music phase…well, some things never change. “People just love to play with words…” Know that song? Anyway…

One of the biggest pains of text mining and NLP is colloquialism — language that is only appropriate in casual language and not in formal speech or writing. Words such as informal contractions (“gonna”, “wanna”, “whatcha”, “ain’t”, “y’all”) are colloquialisms and are everywhere on the Web. There is also a great deal of slang common on the Web including acronyms/emoticons (“LOL”, “WTF”) and smilies that add sentiment to text. There is also a less used slang called leetspeak that replaces letters with numbers (“n00b” rather than “noob”, “pwned” instead of “owned” and “pr0n” instead of “porn”).

There are also regionalisms which are a pain for semantic analysis but not so much for probabilistic analysis. Some examples are pancakes (“flapjacks”, “griddlecakes”) or carbonated beverages (“soda”, “pop”, “Coke”). Or, little did I know, “maple bars” vs. “Long Johns”. Now I am hungry. There are also words that have a formal and informal meeting such as “kid” (a young goat, or a child…same thing).


Linguists consider colloquialisms different than slang. Slang is informal language used by a specific […]

Opening Statements on Markov Chain Monte Carlo

This quarter I am TAing UCLA’s Statistics 102C. Introduction to Monte Carlo Methods for Professor Qing Zhou. This course did not exist when I was an undergraduate, and I think it is pretty rare to teach Monte Carlo (minus the bootstrap if you count that) or MCMC to undergrads. I am excited about this class because to me, MCMC turns Statistics on its head. It felt like a totally different paradigm compared to the regression and data analysis paradigm that I was used to at the time. It also exposes students to the connection between Statistics/MCMC and other fields such as Computer Science, Genetics/Biology, etc.

I usually do not have much to talk about during week 1, especially if my class is the second day of the quarter. Today was an exception because I wanted to excite the class about this topic.

Some examples I discussed:

the general recipe for Monte Carlo methods
the bootstrap as an example of resampling, and R loops
computing and mention of Buffon’s Needle
scheduling/timetabling and occupancy/matching problems using stochastic search (simulated annealing, Tabu search etc.)
mention of genetic algorithms and swarm intelligence
PageRank as a Markov process
drawing a random sample of web pages using Random Walk Metropolis-Hastings
short inventory of fields […]

My Experience at ACM Data Mining Camp #DMcamp

My parents and I made plans to visit San Jose and Saratoga on my grandmother’s birthday, March 19, since that is where she grew up. I randomly saw someone tweet about the ACM Data Mining Camp unconference that happened to be the next day, March 20, only a couple of miles from our hotel in Santa Clara. This was an opportunity I could not pass up.

Upon arriving at eBay/PayPal’s “Town Hall” building, I was greeted by some very hyper people! Surrounding me were a lot of people my age and my interest. I finally felt like I was in my element. The organizers of the event also had a predetermined Twitter hashtag for the event #DMCAMP, and also set up a blog where people could add material and write comments about the sessions. I felt like a kid in a candy shop when I saw the proposed sessions for the breakout sessions.

Some of the proposed topics I found really interesting:

Anonamly Detection
Natural Language Processing
Collaborative Filtering and a Netflix Paper
CPC Optimization for Events
Data Mining Programming Tools
Structured Tags
Status of Mahout
Machine Learning with Parallel Processors
Sentiment Analysis
Parallel R

About half of these actually made it onto the schedule. Unfortunately, I was only able to attend 4 […]