In this post I am goIing to summarize some of the things that I learned at Strata Santa Clara 2013. For now, I will only discuss the conference sessions as I have a much longer post about the tutorial sessions that I am still working on and will post at a later date. I will add to this post as the conference winds down.
The slides for most talks will be available here but not all speakers will share their slides.
This is/was my first trip to Strata so I was eagerly awaiting participating as an attendant. In the past, I had been put off by the cost and was also concerned that the conference would be an endless advertisement for the conference sponsors and Big Data platforms. I am happy to say that for the most part I was proven wrong. For easier reading, I am summarizing talks by topic rather than giving a laundry list schedule for a long day and also skip sessions that I did not find all that illuminating. I also do not claim 100% accuracy of this text as the days are very long and my ears and mind can only process so much data when I am context [...]
Lately I have doing a lot of work with the Wikipedia XML dump as a corpus. Wikipedia provides a wealth information to researchers in easy to access formats including XML, SQL and HTML dumps for all language properties. Some of the data freely available from the Wikimedia Foundation include
article content and template pages
article content with revision history (huge files)
article content including user pages and talk pages
page-to-page link lists: redirects, categories, image links, page links, interwiki etc.
The above resources are available not only for Wikipedia, but for other Wikimedia Foundation projects such as Wiktionary, Wikibooks and Wikiquotes.
As Wikipedia readers will notice, the articles are very well formatted and this formatting is generated by a somewhat unusual markup format defined by the MediaWiki project. As Dirk Riehl stated:
There was no grammar, no defined processing rules, and no defined output like a DOM tree based on a well defined document object model. This is to say, the content of Wikipedia is stored in a format that is not an open standard. The format is defined by 5000 lines of php code (the parse function of MediaWiki). That code may be open source, but it is incomprehensible to most. That’s why [...]
<< My review of Day 1.
I am summarizing all of the days together since each talk was short, and I was too exhausted to write a post after each day. Due to the broken-up schedule of the KDD sessions, I group everything together instead of switching back and forth among a dozen different topics. By far the most enjoyable and interesting aspects of the conference were the breakout sessions.
KDD 2011 featured several keynote speeches that were spread out among three days and throughout the day. This year’s conference had a few big names.
Steven Boyd, Convex Optimization: From Embedded Real-Time to Large-Scale Distributed. The first keynote, by Steven Boyd, discussed convex optimization. The goal of convex optimization is to minimize some objective function given linear constraints. The caveat is that the objective function and all of the constraints must be convex (“non-negative curvature” as Boyd said). The goal of convex optimization is to turn the problem into a linear programming problem. We should care about convex optimization because it comes from some beautiful and complete theory like duality and optimality conditions. I must say, that whenever I am chastising statisticians, I often say that all they care about is “beautiful theory” [...]
Some time over the past 6 weeks I randomly saw a tweet announcing the “Data Scientist Summit” and shortly below it I saw that it would be held in Las Vegas at the Venetian. Being a Data Scientist myself is reason enough to not pass up this opportunity, but Vegas definitely sweetens the deal! On Wednesday I woke up at 6am to partake on the 5.5 hour voyage to Las Vegas.
The Venetian and all close hotels were booked, so I ended up at the Aria; a new experience. The hotel is beautiful and very ritzy. I had heard that the rooms were very technologically advanced but I wasn’t prepared for the recorded welcome message, music and automatic shades opening upon entry to the room. The Aria is a geek’s paradise. Everything is computerized. Key cards are “waved” rather than swiped, lights are turned on/off and dimmed by use case (“sleep”, “read” etc.), rather than manually. There are no paper “Do Not Disturb” signs; rather, a switch on the wall (or via TV) toggles an indicator light outside the door. And the best part… Internet is FREE!
The rhododendrons hydrangeas are real!
Elastic Compute Cloud (EC2) is a service provided a Amazon Web Services that allows users to leverage computing power without the need to build and maintain servers, or spend money on special hardware. The idea is simple, the user “boots” up one or more machines and then accesses those machines as if they were logged into any other machine remotely. I used EC2 and Elastic MapReduce extensively for my M.S. thesis last spring, but mainly used its large memory capabilities rather than its potential for explicit parallelism.
Recently, I ran a crawling job on EC2 using a parellel crawler I wrote in Python with twill. Using EC2 poses its own challenges. Using parallel code poses more challenges. Combining these two facts with the fact that crawling is I/O bound can create some more interesting challenges. If you have taken a course in operating systems, you have heard this stuff over and over again. So have I, but I am stubborn. I tend to learn lessons from experience, and this was no exception. Through this series of posts, I want to point out difficulties and “gotchas” that are important to keep in mind when using EC2, and in this post, with [...]
Numpy and SciPy are packages for numerical computation and scientific computing, for Python.
One wrinkle with NumPy/SciPy that needs to be ironed out is the difficulty of installation on certain OSes, and particularly, architectures.The SciPy SuperPack has done a good job of taking care of this issue, but it has not yet been updated for 2.7.1 and manually hacking away at its script has not worked for me.
I cannot take credit for the instructions in this article. A brave warrior, Jeremy Conlin, somehow managed to figure out how to install 64-bit NumPy and SciPy, with 64-bit Python 2.7.1 on Snow Leopard; he posted the directions to the SciPy User mailing list on February 24. I followed the directions, and miraculously they worked. I am reproducing them here for Google bait.
Install Python 2.7.1
1. Download the universal Mac 2.7.1 installer here (Python 2.7.1 Mac OS X 64-bit/32-bit x86-64/i386 Installer). Typically, Python will be installed to /Library/Frameworks/Python.framework/Versions/2.7/, but may be in other locations.
2. Verify that your new version of Python is 64-bit enabled. Note: Python installations typically do not get toggled as the default Python, so find the location of the 2.7.1 Python executable. On my machine, it is /Library/Frameworks/Python.framework/Versions/2.7/bin/python. python2.7 should also work.
Programming defensively requires knowing the input that your code should be able to handle. Typically, the programmer may be intimately familiar with the type of data that his/her code will encounter and can perform checks and catch exceptions with respect to the format of the data.
Web mining requires a lot more sophistication. The programmer in many cases does not know the full formatting of the data published on a web site. Additionally, this format may change over time. There are certain standards that do apply to certain types of data on the web, but one cannot rely on web developers to follow these standards. For example, the RSS Advisory Board developed a convention for the formatting of web pages so that browsers can automatically discover the links to the site’s RSS feeds. I have found in my research that approximately 95% of my sample actually implemented this convention. Not bad, but not perfect.
Always Have a Plan B, C, D, …
One might say that 95% is good enough. I am a bit obsessive when it comes to data quality, so I wanted to extract a feed for 99% of the sites I had on my list. Also, I am always leery [...]
Wow. I can’t believe it has been a month since I have posted. On December 1, I started a new chapter in my life, working full time as a Data Scientist at the Rubicon Project. Needless to say, that has been keeping me occupied, as well as thinking about working on my dissertation. For the time, I am getting settled in here.
When I accepted this position, one of my hopes/expectations would be to become professionally competent and confident in C, Java, Python, Hadoop, and the software development process rather than relying on hobby and academic knowledge. That is something a degree cannot help with. It has been a great experience, although very frustrating, but that is expected when jumping into development professionally.
I am writing this post to chronicle what I have learned about using Hadoop in production and how it majorly differs from its use in my research and personal analysis.
To start, I was asked to check out a huge stack of code from a Subversion repository. But then what?
But you’re a Computer Scientist! This should be easy!
The first part is true, but there is a stark difference between a garden variety computer scientist and one that converts from [...]
This week, a few different big data processing tools were released to the open-source community. I know, I know, this is probably the 1000th blog post about this, and perhaps the train has left the station without me, but here I am.
Yahoo’s S4: Distributed Stream Computing Platform
First off, it must be said. S4 is NOT real-time map-reduce! This is the meme that has been floating around the Internets lately.
S4 is a distributed, scalable, partially fault-tolerant, pluggable platform that allows users to create applications that process unbounded streaming data. It is not a Hadoop project. A matter of fact, it is not even a form of map-reduce. S4 was developed at Yahoo for personalization of search advertising products. Map-reduce, so far, is not a great platform for dealing with streaming/non-stored data.
Pieces of data, apparently called events, are sent and consumed by a Processing Element (yes, PE, but not the kind that requires you to sweat). The PEs can do one of two things:
emit another event that will be consumed by another PE, or
publish some result
Streaming data is different from non-streaming data in that the user does not know how much data will be transmitted, and at what rate. Analysis on [...]
As I am working on my dissertation and piecing together a mess of notes, code and output, I am wondering to myself “how long is this thing supposed to be?” I am definitely not into this to win the prize for longest dissertation. I just want to say my piece, make my point and move on. I’ve heard that the shortest dissertation in my program was 40 pages (not true). I heard someone from another school that their dissertation was over 300 pages. I am not holding myself to a strict limit, but I wanted a rough guideline. As a disclaimer, this blog post is more “fun” than “business.” This was just an analysis that I was interested in and felt that it was worth sharing since it combined Python, web scraping, R and ggplot2. It is not meant to be a thorough analysis of dissertation lengths or academic quality of the Department.
The UCLA Department of Statistics publishes most of its M.S. theses and Ph.D. dissertations on a website. It is not complete, especially for the earlier years, but it is a good enough population for my use.
Using this web page, I was able to extract information about each [...]