Summary of My First Trip to Strata #strataconf

In this post I am goIing to summarize some of the things that I learned at Strata Santa Clara 2013. For now, I will only discuss the conference sessions as I have a much longer post about the tutorial sessions that I am still working on and will post at a later date. I will add to this post as the conference winds down.

The slides for most talks will be available here but not all speakers will share their slides.

This is/was my first trip to Strata so I was eagerly awaiting participating as an attendant. In the past, I had been put off by the cost and was also concerned that the conference would be an endless advertisement for the conference sponsors and Big Data platforms. I am happy to say that for the most part I was proven wrong. For easier reading, I am summarizing talks by topic rather than giving a laundry list schedule for a long day and also skip sessions that I did not find all that illuminating. I also do not claim 100% accuracy of this text as the days are very long and my ears and mind can only process so much data when I am context [...]

SIAM Data Mining 2012 Conference

Note: This would have been up a lot sooner but I have been dealing with a bug on and off for pretty much the past month!

From April 26-28 I had the pleasure to attend the SIAM Data Mining conference in Anaheim on the Disneyland Resort grounds. Aside from KDD2011, most of my recent conferences had been more “big data” and “data science” oriented, and I wanted to step away from the hype and just listen to talks that had more substance.

Attending a conference on Disneyland property was quite a bizarre experience. I wanted to get everything I could out of the conference, but the weather was so nice that I also wanted to get everything out of Disneyland as I could. Seeing adults wearing Mickey ears carrying Mickey shaped balloons, and seeing girls dressed up as their favorite Disney princesses screams “fun” rather than “business”, but I managed to make time for both.

The first two days started with a plenary talk from industry or research labs. After a coffee break, there were the usual breakout sessions followed by lunch. During my free 90 minutes, I ran over to Disneyland and California Adventure both days to eat lunch. I managed to [...]

Parsing Wikipedia Articles: Wikipedia Extractor and Cloud9

Lately I have doing a lot of work with the Wikipedia XML dump as a corpus. Wikipedia provides a wealth information to researchers in easy to access formats including XML, SQL and HTML dumps for all language properties. Some of the data freely available from the Wikimedia Foundation include

article content and template pages
article content with revision history (huge files)
article content including user pages and talk pages
redirect graph
page-to-page link lists: redirects, categories, image links, page links, interwiki etc.
image metadata
site statistics

The above resources are available not only for Wikipedia, but for other Wikimedia Foundation projects such as Wiktionary, Wikibooks and Wikiquotes.

As Wikipedia readers will notice, the articles are very well formatted and this formatting is generated by a somewhat unusual markup format defined by the MediaWiki project. As Dirk Riehl stated:

There was no grammar, no defined processing rules, and no defined output like a DOM tree based on a well defined document object model. This is to say, the content of Wikipedia is stored in a format that is not an open standard. The format is defined by 5000 lines of php code (the parse function of MediaWiki). That code may be open source, but it is incomprehensible to most. That’s why [...]

SIGKDD 2011 Conference -- Days 2/3/4 Summary

<< My review of Day 1.

I am summarizing all of the days together since each talk was short, and I was too exhausted to write a post after each day. Due to the broken-up schedule of the KDD sessions, I group everything together instead of switching back and forth among a dozen different topics. By far the most enjoyable and interesting aspects of the conference were the breakout sessions.

Keynotes

KDD 2011 featured several keynote speeches that were spread out among three days and throughout the day. This year’s conference had a few big names.

Steven Boyd, Convex Optimization: From Embedded Real-Time to Large-Scale Distributed. The first keynote, by Steven Boyd, discussed convex optimization. The goal of convex optimization is to minimize some objective function given linear constraints. The caveat is that the objective function and all of the constraints must be convex (“non-negative curvature” as Boyd said). The goal of convex optimization is to turn the problem into a linear programming problem. We should care about convex optimization because it comes from some beautiful and complete theory like duality and optimality conditions. I must say, that whenever I am chastising statisticians, I often say that all they care about is “beautiful theory” [...]

SIGKDD 2011 Conference -- Day 1 (Graph Mining and David Blei/Topic Models)

I have been waiting for the KDD conference to come to California, and I was ecstatic to see it held in San Diego this year. AdMeld did an awesome job displaying KDD ads on the sites that I visit, sometimes multiple times per page. That’s good targeting!

Mining and Learning on Graphs Workshop 2011

I had originally planned to attend the 2-day workshop Mining and Learning with Graphs (MLG2011) but I forgot that it started on Saturday and I arrived on Sunday. I attended part of MLG2011 but it was difficult to pay attention considering it was my first time waking up at 7am in a long time. The first talk I arrived for was Networks Spill the Beans by Lada Adamic from the University of Michigan. Adamic’s presented work involved inferring properties of content (the “what”) using network structure alone (using only the “who”: who shares with whom). One example she presented involved questions and answers on a Java programming language forum. The research problem was to determine things such as who is most likely to answer a Java beginner’s question: a guru, or a slightly more experienced user? Another research question asked what dynamic interactions tell us about information flow. [...]

Google -- Is Search-by-Multimedia on the Way?

Recently, I have been thinking about alternate ways of specifying search queries other than with text. A couple of weeks ago I came across a piece of music that I could not identify. I thought it would be a huge win for a search engine to allow me to upload this piece, and it would present me with matches, or near matches to other pieces that sound similar, or have similar characteristics. Some services already exist. Shazam allows a user to place a microphone near playing music and it will identify the artist and song. Some uses of search-by-sound:

Music identification (“solved” – Shazam)
Music personalizaton and recommendation (“solved” – Pandora)
Identification of the source of a sound (i.e. a species of bird, a musical instrument, an inanimate object)
MP3 and media file search
Finding material that violates copyright

As our motivating example, consider we find some really cool graphic on the web and we want to know where it likely originated (i.e. art, a meme). In such a search engine, we could upload the graphic and get results containing the exact image, or images that are very similar such as variations of the image (crop, resize, borders, different effects), modifications of the image (consider Obama-izing [...]

Lists of English Words

When I was a kid, I went through an 80s music phase…well, some things never change. “People just love to play with words…” Know that song? Anyway…

One of the biggest pains of text mining and NLP is colloquialism — language that is only appropriate in casual language and not in formal speech or writing. Words such as informal contractions (“gonna”, “wanna”, “whatcha”, “ain’t”, “y’all”) are colloquialisms and are everywhere on the Web. There is also a great deal of slang common on the Web including acronyms/emoticons (“LOL”, “WTF”) and smilies that add sentiment to text. There is also a less used slang called leetspeak that replaces letters with numbers (“n00b” rather than “noob”, “pwned” instead of “owned” and “pr0n” instead of “porn”).

There are also regionalisms which are a pain for semantic analysis but not so much for probabilistic analysis. Some examples are pancakes (“flapjacks”, “griddlecakes”) or carbonated beverages (“soda”, “pop”, “Coke”). Or, little did I know, “maple bars” vs. “Long Johns”. Now I am hungry. There are also words that have a formal and informal meeting such as “kid” (a young goat, or a child…same thing).

Source: http://popvssoda.com/

Linguists consider colloquialisms different than slang. Slang is informal language used by a specific [...]