Summary of My First Trip to Strata #strataconf

In this post I am goIing to summarize some of the things that I learned at Strata Santa Clara 2013. For now, I will only discuss the conference sessions as I have a much longer post about the tutorial sessions that I am still working on and will post at a later date. I will add to this post as the conference winds down.

The slides for most talks will be available here but not all speakers will share their slides.

This is/was my first trip to Strata so I was eagerly awaiting participating as an attendant. In the past, I had been put off by the cost and was also concerned that the conference would be an endless advertisement for the conference sponsors and Big Data platforms. I am happy to say that for the most part I was proven wrong. For easier reading, I am summarizing talks by topic rather than giving a laundry list schedule for a long day and also skip sessions that I did not find all that illuminating. I also do not claim 100% accuracy of this text as the days are very long and my ears and mind can only process so much data when […]

OpenPaths and a Progressive Approach to Privacy

OpenPaths is a service that allows users with mobile phones to transmit and store their location. It is an initiative by the New York Times that allows users to use their own data, or to contribute their location data for research projects and perhaps startups that wish to get into the geospatial space. OpenPaths brands itself as “a secure data locker for personal location information.” There is one aspect where OpenPaths is very different from other services like Google Latitude: Only the user has access to his/her own data and it is never shared with anybody else unless the user chooses to do so. Additionally, initiatives that wish to use a user’s location data must be asked personally via email (pictured below), and the user has the ability to deny the request.The data shared with each initiative provides only location, and not other data that may be personally identifiable such as name, email, browser, mobile type etc. In this sense, OpenPaths has provided a barebones platform for the collection and storage of location information. Google Latitude is similar, but the data stored on Google’s servers is obviously used by other Google services without explicit user permission.

The service is […]

Parsing Wikipedia Articles: Wikipedia Extractor and Cloud9

Lately I have doing a lot of work with the Wikipedia XML dump as a corpus. Wikipedia provides a wealth information to researchers in easy to access formats including XML, SQL and HTML dumps for all language properties. Some of the data freely available from the Wikimedia Foundation include

article content and template pages article content with revision history (huge files) article content including user pages and talk pages redirect graph page-to-page link lists: redirects, categories, image links, page links, interwiki etc. image metadata site statistics

The above resources are available not only for Wikipedia, but for other Wikimedia Foundation projects such as Wiktionary, Wikibooks and Wikiquotes.

As Wikipedia readers will notice, the articles are very well formatted and this formatting is generated by a somewhat unusual markup format defined by the MediaWiki project. As Dirk Riehl stated:

There was no grammar, no defined processing rules, and no defined output like a DOM tree based on a well defined document object model. This is to say, the content of Wikipedia is stored in a format that is not an open standard. The format is defined by 5000 lines of php code (the parse function of MediaWiki). That code may […]

SIGKDD 2011 Conference — Days 2/3/4 Summary

<< My review of Day 1.

I am summarizing all of the days together since each talk was short, and I was too exhausted to write a post after each day. Due to the broken-up schedule of the KDD sessions, I group everything together instead of switching back and forth among a dozen different topics. By far the most enjoyable and interesting aspects of the conference were the breakout sessions.

Keynotes

KDD 2011 featured several keynote speeches that were spread out among three days and throughout the day. This year’s conference had a few big names.

Steven Boyd, Convex Optimization: From Embedded Real-Time to Large-Scale Distributed. The first keynote, by Steven Boyd, discussed convex optimization. The goal of convex optimization is to minimize some objective function given linear constraints. The caveat is that the objective function and all of the constraints must be convex (“non-negative curvature” as Boyd said). The goal of convex optimization is to turn the problem into a linear programming problem. We should care about convex optimization because it comes from some beautiful and complete theory like duality and optimality conditions. I must say, that whenever I am chastising statisticians, I often say that all they […]

My Review of Hadoop Summit 2011 #hadoopsummit

I woke up early and cheery Wednesday morning to attend the 2011 Hadoop Summit in Santa Clara, after a long drive from Los Angeles and the Big Data Camp that lasted until 10pm the night before. Having been to Hadoop Summit 2010, I was interested to see how much of the content in the conference had changed.

This year, there were approximately 1,600 participants and the summit was moved a few feet away to the Convention Center rather than the Hyatt. Still, space and seating was pretty cramped. That just goes to show how much the Hadoop field has grown in just one year.

Keynotes

We first heard a series of keynote speeches which I will summarize. The first keynote was from Jay Rossiter, SVP of the Cloud Platform Group at Yahoo. He introduced how Hadoop is used at Yahoo, which is fitting since they organized the event. The content of his presentation was very similar to last year’s. One interesting application of Hadoop at Yahoo was for “retiling” the map of the United States. I imagine this refers to the change in aerial imagery over time. When performed by hand, retiling took 6 weeks; with Hadoop, it took […]

Big Data Camp 2011 #BigDataCamp

It has been a while since I have been to Silicon Valley, but Hadoop Summit gave me the opportunity to go. To make the most of the long trip, I also decided to check out BigDataCamp held the night before from 5:30 to 10pm. Although the weather was as predicted, I was not prepared for the deluge of pouring rain in the end of June. The weather is one of the things that is preventing me from moving up to Silicon Valley.

The food/drinks/networking event must have been amazing because it was very difficult to get everyone to come to the main room to start the event! We started with a series of lightning talks from some familiar names and some unfamiliar ones.

Chris Wensel, the developer of Cascading, is also the founder of Concurrent, Inc. Cascading is an alternate API for Map-Reduce written in Java. With Cascading, developers can chain multiple map-reduce jobs to form an ad hoc workflow. Cascading adds a built-in planner to manage jobs. Cascading usually infers Hadoop, but Cascading can run on other platforms including EMC Greenplum and the new MapR project. RazorFish and BestBuy use Cascading for behavioral targeting. Flightcaster uses a domain specific […]

Google — Is Search-by-Multimedia on the Way?

Recently, I have been thinking about alternate ways of specifying search queries other than with text. A couple of weeks ago I came across a piece of music that I could not identify. I thought it would be a huge win for a search engine to allow me to upload this piece, and it would present me with matches, or near matches to other pieces that sound similar, or have similar characteristics. Some services already exist. Shazam allows a user to place a microphone near playing music and it will identify the artist and song. Some uses of search-by-sound:

Music identification (“solved” – Shazam) Music personalizaton and recommendation (“solved” – Pandora) Identification of the source of a sound (i.e. a species of bird, a musical instrument, an inanimate object) MP3 and media file search Finding material that violates copyright

As our motivating example, consider we find some really cool graphic on the web and we want to know where it likely originated (i.e. art, a meme). In such a search engine, we could upload the graphic and get results containing the exact image, or images that are very similar such as variations of the image (crop, resize, borders, different effects), […]

Review of 2011 Data Scientist Summit

Some time over the past 6 weeks I randomly saw a tweet announcing the “Data Scientist Summit” and shortly below it I saw that it would be held in Las Vegas at the Venetian. Being a Data Scientist myself is reason enough to not pass up this opportunity, but Vegas definitely sweetens the deal! On Wednesday I woke up at 6am to partake on the 5.5 hour voyage to Las Vegas.

The Pre-Party

The Venetian and all close hotels were booked, so I ended up at the Aria; a new experience. The hotel is beautiful and very ritzy. I had heard that the rooms were very technologically advanced but I wasn’t prepared for the recorded welcome message, music and automatic shades opening upon entry to the room. The Aria is a geek’s paradise. Everything is computerized. Key cards are “waved” rather than swiped, lights are turned on/off and dimmed by use case (“sleep”, “read” etc.), rather than manually. There are no paper “Do Not Disturb” signs; rather, a switch on the wall (or via TV) toggles an indicator light outside the door. And the best part… Internet is FREE!

The rhododendrons hydrangeas are real! Work desk panel contains Ethernet, […]

EC2 Trials and Tribulations, Part 1 (Web Crawling)

Elastic Compute Cloud (EC2) is a service provided a Amazon Web Services that allows users to leverage computing power without the need to build and maintain servers, or spend money on special hardware. The idea is simple, the user “boots” up one or more machines and then accesses those machines as if they were logged into any other machine remotely. I used EC2 and Elastic MapReduce extensively for my M.S. thesis last spring, but mainly used its large memory capabilities rather than its potential for explicit parallelism.

Recently, I ran a crawling job on EC2 using a parellel crawler I wrote in Python with twill. Using EC2 poses its own challenges. Using parallel code poses more challenges. Combining these two facts with the fact that crawling is I/O bound can create some more interesting challenges. If you have taken a course in operating systems, you have heard this stuff over and over again. So have I, but I am stubborn. I tend to learn lessons from experience, and this was no exception. Through this series of posts, I want to point out difficulties and “gotchas” that are important to keep in mind when using EC2, and in this post, […]

Web Mining Pitfalls

Programming defensively requires knowing the input that your code should be able to handle. Typically, the programmer may be intimately familiar with the type of data that his/her code will encounter and can perform checks and catch exceptions with respect to the format of the data.

Web mining requires a lot more sophistication. The programmer in many cases does not know the full formatting of the data published on a web site. Additionally, this format may change over time. There are certain standards that do apply to certain types of data on the web, but one cannot rely on web developers to follow these standards. For example, the RSS Advisory Board developed a convention for the formatting of web pages so that browsers can automatically discover the links to the site’s RSS feeds. I have found in my research that approximately 95% of my sample actually implemented this convention. Not bad, but not perfect.

Always Have a Plan B, C, D, …

One might say that 95% is good enough. I am a bit obsessive when it comes to data quality, so I wanted to extract a feed for 99% of the sites I had on my list. Also, I […]