Summary of My First Trip to Strata #strataconf

In this post I am going to summarize some of the things that I learned at Strata Santa Clara 2013. For now, I will only discuss the conference sessions, as I have a much longer post about the tutorial sessions that I am still working on and will publish at a later date. I will add to this post as the conference winds down.

The slides for most talks will be available here but not all speakers will share their slides.

This was my first trip to Strata, so I was eager to attend. In the past, I had been put off by the cost and was also concerned that the conference would be an endless advertisement for the conference sponsors and Big Data platforms. I am happy to say that for the most part I was proven wrong. For easier reading, I am summarizing talks by topic rather than giving a laundry-list schedule for a long day, and I also skip sessions that I did not find all that illuminating. I do not claim 100% accuracy of this text, as the days are very long and my ears and mind can only process so much data when I am context switching between listening, tweeting, emailing, etc.


In the mornings there were several short plenary talks where people from throughout the industry discussed their particular views of Data Science. This was basically a warm-up for what would become very long days. I mostly used this time to catch up on email and review stuff from the previous day. The second day apparently had a lot of gimmicky sales talks, but I was not paying attention. The most interesting talk came from Jennifer Pahlka of Code for America. I had first learned about Code for America at the Data Scientist Summit hosted by EMC in 2011. At the time it sounded like a very good idea — we have college graduates that dedicate a couple of years to teaching in inner-city schools, so it makes sense that we should have some data scientists working on “projects that make a difference in the world,” as Jennifer would say. The projects that these data scientists work on involve democratizing data and open data initiatives in local governments. A couple of projects stood out to me: one released some 800+ datasets from the City of Santa Cruz (an amazing city) on a website; another studied bail amounts and the outcomes of criminal trials. This talk was apropos considering the recent announcement of code.org, the start of a movement to begin teaching computer programming to K-12 students. [You can sign the petition and register as a volunteer here.]

Visualization Strand

I have said in the past that visualization is not my thing. I greatly appreciate interactive graphics and cool infographics that convey strong meaning to non-data scientists, but it simply is not my cup of tea yet. However, it is something that I want to invest time in. I decided to attend visualization talks up to my tolerance level (which isn’t very high)… which meant one or two.

I attended Chang She’s talk Agile Data Wrangling and Web-based Visualizations. Chang did what I usually do: pack too much into a one-hour talk… but I feel that talks like this really whet the appetite to learn more. He discussed how data science is missing a “blue button” that takes care of data management and then visualization. Using the Federal Election Commission dataset, he showed political donations by party, candidate and state as the motivating example. Chang showed several examples of using pandas (a Python data munging library) to manipulate the data and then passing that data to d3.js as JSON via a web server. I felt that this was just a basic talk on how to combine tools to munge data and then visualize it. It is far from a blue button, but it shows how important such processing pipelines are.
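
A minimal sketch of that pandas-to-JSON pipeline, assuming only pandas and the standard library; the donation records below are invented stand-ins for the FEC data:

```python
import json
import pandas as pd

# Invented donation records standing in for the FEC dataset.
donations = pd.DataFrame({
    "candidate": ["Obama", "Obama", "Romney", "Romney"],
    "state":     ["CA", "NY", "CA", "TX"],
    "amount":    [200.0, 150.0, 500.0, 300.0],
})

# Aggregate donations by candidate -- the kind of rollup you would
# hand to a d3.js bar chart.
totals = donations.groupby("candidate")["amount"].sum()

# Serialize to JSON; in the talk, a small web server returned a
# string like this to the browser, where d3.js rendered it.
payload = totals.to_json()
print(payload)
```

Here the JSON string stands in for the HTTP response a web server would send to the d3.js front end.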

Law, Ethics and Open Data Strand

One of the highly acclaimed talks of the day came from Joseph Turian of MetaOptimize, titled Sci vs. Sci: Attack Vectors for Black-Hat Data Scientists and Possible Countermeasures. Every skill has a good use and an evil use, and Data Science is no exception. We create models to combat fraud, detect spam, measure influence and much more. These “good” uses of skills are called “white hat.” On the other hand, a more evil Data Scientist can circumvent these models to allow their spam to go undetected or game an influence metric such as PageRank. For example, consider a malicious web page that contains code that simply repeats a user’s Google query endlessly. To a very stupid search engine, such a web page would game a keyword matching algorithm and the search engine that is based on it. This crap web page would appear as the first result because it appears the most relevant. This is a very elementary example, but one can imagine the nasty results that attacks on more sophisticated models could produce.
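
The keyword-stuffing attack above can be sketched in a few lines; the scoring functions and documents are invented for illustration, not any real engine's ranking:

```python
# Toy version of the "very stupid search engine": rank pages by how
# often they contain the query terms.
def raw_score(doc, query):
    words = doc.lower().split()
    return sum(words.count(term) for term in query.lower().split())

def normalized_score(doc, query):
    # Naive countermeasure: divide by document length.
    words = doc.lower().split()
    return raw_score(doc, query) / len(words) if words else 0.0

honest = "strata conference schedule and session summaries for data science"
stuffed = " ".join(["data science"] * 50)  # page that just repeats the query

query = "data science"
# The stuffed page wins on raw keyword count...
assert raw_score(stuffed, query) > raw_score(honest, query)
# ...and length normalization alone does not save us, because the
# stuffed page consists of *nothing but* query terms.
print(normalized_score(stuffed, query), normalized_score(honest, query))
```

This is exactly why the naive keyword matcher in the example is gameable, and why real engines layer on link analysis and spam detection.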

Turian believes that most Data Scientists originally come from academia, where the skills we learned are mainly “white hat,” but that our use in industry is mainly “grey hat” (somewhere between good and not-so-good). Such “grey hat” methods may involve some sort of data privacy issue, as with ad retargeting. A “black hat” data scientist may be useful in constructing a botnet, using Markov models or other language models to generate human-looking spam text, or creating sock puppets to sway opinion in a large social network. A sock puppet is essentially a social media account that is designed to look like a genuine human but that has an ulterior motive, mainly to proliferate propaganda or false information. The use of these sock puppets is referred to as “astroturfing” — that is, a fake grassroots movement. One easy example I can think of is the thousands and thousands of Twitter accounts that are created simply to sway opinion about President Obama (search for #tcot and you are likely to find some examples, though many are also legitimate users). Turian cited one unsophisticated example of astroturfing: Newt Gingrich and his huge jump in followers over a short period of time, which was determined to be fake. In this case, it is alleged that Gingrich’s campaign paid for followers rather than creating an army of sock puppets. Some methods for locating sock puppets are the presence of reply spam (@spam), manual classification, and honeypots.

Some interesting statistics:

  1. 7% of Tweeps (Twitter users) are spam bots.
  2. 20% of us accept friend requests from people we do not know.
  3. 30% of us have been deceived by chat bots.

Note: MetaOptimize hosts an amazing machine learning Q and A site similar in function to StackExchange/StackOverflow. You can visit it here.

Data Science Strand

IPython Notebooks

The first talk of this series I attended was The IPython Notebook: a Comprehensive Tool for Data Science by Brian Granger of Cal Poly San Luis Obispo and Chronicle Labs. One of the major problems in Data Science is that “code and data do not communicate much.” That is, code is usually placed in one file and data in another, and an analysis involves a coupling of data and code that must be kept in sync throughout the process. Imagine if all of your work as a Data Scientist could be contained on your physical desktop as separate objects — this is a good analogy for IPython Notebooks. An IPython Notebook functions much like a Mathematica notebook or a Sage notebook. One can analyze data in pandas data frames, use some fancy models from SciPy or scikit-learn, and use the general Python language as well as the niceties provided by IPython, all in one place. Once the code is written, one can produce plots with matplotlib in place and then distribute the document to others. IPython Notebooks provide a living document of one’s work and offer resilience to change by keeping all of the code in one place. Additionally, the concept of cell magic allows the execution of other languages such as R, Ruby and Julia from within the IPython Notebook! Soon there may be no need to run multiple interpreters or to maintain a separate open-source notebook project for each additional language!

Here is the amazing part: by using so-called cell magic, one can push a Python object, say a pandas dataframe, directly into R, where it is converted into an R dataframe. I do not remember the specifics of why this is possible, but this is huge. This eliminates the need for packages like RPy2 for basic computations between R and Python. [Edit: RPy2 is used under the hood for this conversion. Thanks to Dirk for pointing this out.] Brian mentioned that it may also eventually be possible to allow Python objects to interact with JavaScript libraries such as d3.js for visualization using widgets.

IPython Notebooks support narrative text, headings, graphics and also mathematical typesetting via MathJax. Executing code produces JSON strings that are portable and serializable, so results can be saved without requiring code to be re-executed. The site nbviewer.ipython.org provides an online viewer for IPython Notebooks via URL, Git repository URL or Gist URL. This viewer does not require the web service to be installed locally. One current limitation of IPython Notebooks is that they only support a single user and thus cannot be hosted for, say, multiple students to log in to their own notebook sessions in a classroom.
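
As a rough sketch of why this portability works, here is a hand-built dictionary shaped like a notebook document; the field names follow the notebook JSON of the era but should be treated as illustrative, not the authoritative schema:

```python
import json

# Hand-built dictionary roughly shaped like an .ipynb file: cells
# carry both the code and its captured outputs.
notebook = {
    "metadata": {"name": "strata_demo"},
    "worksheets": [{
        "cells": [
            {"cell_type": "markdown", "source": ["# Analysis notes"]},
            {
                "cell_type": "code",
                "input": ["print(1 + 1)"],
                "outputs": [{"output_type": "stream", "text": ["2\n"]}],
            },
        ],
    }],
}

# Because code AND outputs are plain JSON, the whole document can be
# shipped around and viewed (e.g. on nbviewer) without re-executing
# anything.
serialized = json.dumps(notebook)
restored = json.loads(serialized)
print(restored["worksheets"][0]["cells"][1]["outputs"][0]["text"])
```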

Once ipython and ipython-notebook (the Ubuntu package names) are installed, one just executes the command ipython notebook in the directory of interest to start up a web server for working with IPython Notebooks.

Apparently entire textbooks are being written as IPython Notebooks for their beauty, scientific ease and portability.

Adversarial Learning

The final talk I attended was What To Do When Your Machine Learning Gets Attacked by Vishwanath Ramarao. The purpose of this talk was to discuss issues with bad guys trying to circumvent machine learning models designed to prevent abuse of a system, such as a spammer learning how to get around a spam filter over time. This spammer is called an adversary, and can be a “black hat” data scientist. Some examples of adversarial situations are login fraud (spearphishing, PR embarrassment or financial information), comment/mail spam, sign-up fraud, astroturfing, credit card fraud and click fraud. Adversarial learning is the set of techniques that classify data emitted by an adversary.

An adversarial situation arises when the adversary is able to observe the output of the learning system and can change some subset of the features used in that system so that their attempts go unpunished. The goal of adversarial learning is to make it costly for an adversary to change features. The approach towards a solution is labor intensive, but simple to explain. Ramarao essentially said that the best way to combat adversaries is to

  1. engineer features interactively and quickly. 
  2. not throw away features as we commonly do. It is possible that some features may be activated as the adversary’s methods evolve. 
  3. consider the entire transmission of an adversarial transaction — that is, do not just look at the words in a spam email but also look at the HTTP headers and other communication information passed along with the text.
  4. study anomalies (outliers and high leverage points) and not discard them. Usually such anomalies are adversaries.
  5. permit overfitting when necessary for the reason mentioned in #3.
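
Point 3 can be sketched as feature extraction over the whole transmission rather than just the body; the message, header values and feature names below are invented:

```python
from email import message_from_string

# An invented spam message with its transport headers attached.
raw = (
    "Received: from mail.example.net\r\n"
    "X-Mailer: BulkBlaster 2.0\r\n"
    "Subject: cheap meds\r\n"
    "\r\n"
    "Buy now and save!!!\r\n"
)

msg = message_from_string(raw)
body = msg.get_payload()

features = {
    # Body-only feature -- where a naive filter would stop.
    "exclamations": body.count("!"),
    # Transmission-level features drawn from the headers.
    "has_x_mailer": msg["X-Mailer"] is not None,
    "received_hops": len(msg.get_all("Received") or []),
}
print(features)
```

A classifier trained on both kinds of features forces the adversary to forge the transport path as well as the text, which is exactly the "make changing features costly" goal above.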

As a text mining enthusiast, I learned some interesting tricks for fitting machine learning models to text, none of which had anything to do with adversarial learning.

  • A homoglyph is a disguised word in which some characters are replaced with look-alike characters. For example, p0rn is a homoglyph of porn – the o in porn is replaced with a character that looks similar, the zero 0.
  • A broken word is an intended word with spaces inserted. For example, the word nigeria could be a feature for a spam detection algorithm. An adversary can bypass the filter by instead writing ni geria.
  • Hash busters are cases where new words that were not in the lexicon used to train the text model are injected into content. The count of hash busters can itself be used as a feature in a model. One common hash buster for a naive profanity filter would be the word fcuk instead of the actual word f*ck.
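
The first two tricks suggest a normalization pass before the filter runs; here is a toy sketch with an invented homoglyph table and blacklist:

```python
# Tiny, invented lookup tables -- a real system would need far more.
HOMOGLYPHS = {"0": "o", "1": "l", "3": "e", "@": "a"}
BLACKLIST = {"porn", "nigeria"}

def normalize_homoglyphs(text):
    # Map look-alike characters back to the letters they imitate.
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)

def contains_broken_word(text, blacklist):
    # Collapse spaces so "ni geria" matches "nigeria".
    squashed = text.replace(" ", "")
    return any(word in squashed for word in blacklist)

assert normalize_homoglyphs("p0rn") == "porn"
assert contains_broken_word("greetings from ni geria", BLACKLIST)
```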

Julia


After being enlightened by this wonderful talk, I am going to write a more substantial post focusing solely on Julia, so for now I will just briefly describe some of the more easy-to-explain content. This talk was presented by Michael Bean of Forio (developers of Julia Studio). As data scientists we love dynamic environments for interactive data munging, such as R or the Python shell (with pandas or SciPy). We typically start with a high-level language such as R and then port the code to a compiled or performant language like C, C++ or Java (and maybe Python). This is a large barrier in scientific computing because it requires the data scientist to know two languages: one to experiment, and one to implement. Julia is a scientific computing language that provides the performance of a language like C++ and adds technical libraries and accessibility for scientific exploration. Bean cited benchmarks showing Julia’s performance to be similar to C++’s. Julia allows us to complete tasks faster because we remove the need for “glue” code, and Julia packages are written in Julia for performance rather than requiring C or Fortran. [R packages can be written solely in R, but for computationally intensive operations, or for packages that will sit in a bottom layer such as data structures, there is a huge performance hit.] Once one is familiar with Julia, it is easy to “hack the core,” so to speak.

Other features that impress me:

  • the user can redefine arithmetic operations and construct new data types. Julia uses multiple dispatch, a programming language feature that selects among different implementations of a function depending on the types of the arguments passed to it. For example, if A and B are of type matrix, then Julia will know that A * B is the matrix multiplication operation rather than elementwise multiplication.
  • common data structures found in computer science are supported natively such as BitArrays and SubArrays as well as types statisticians are already familiar with including Distribution and DataFrame.
  • support for list comprehensions. For example, to square every element, use [xi^2 for xi in x] instead of a for loop.
  • every package is a Git repository and thus open-source and easy to access.
  • some packages support multicore natively.
  • certain functions can have a bang (!) appended, which tells Julia to modify the object in place rather than make a copy (think of in-place sort, which is sort!).

Bean showed that the development process with Julia is shorter than with languages such as R because production-level re-implementation is not necessary. The runtime was also faster for the few examples he showed. The following is a recursive implementation of the Fibonacci numbers in both R and Julia.

R code (runtime: 192 seconds):

fib <- function(n)
{
  if (n < 2) {
    return(n)
  } else {
    return(fib(n-1) + fib(n-2))
  }
}

start <- Sys.time()
fib(36)
end <- Sys.time()
end - start

Julia code (runtime: 0.24 seconds):

fib(n) = n < 2 ? n : fib(n - 1) + fib(n - 2)
@elapsed fib(36)

Connected World Strand

Bit.ly: Deriving an Interest Graph

The first talk in this strand that I attended was by Anna Smith of bit.ly titled Deriving an Interest Graph for Social Data. It should be no surprise that a URL shortening service would have a ton of data to sift through. Anna stated that a lot of her work is one-off analysis. What I liked about Anna’s talk in particular is that the visualizations she used were very basic. There was nothing fancy about what her graphics displayed — they just displayed some insights about the data and that is it. 

Bitly extracts a lot of data from each shortened URL, including keywords, topics and the probability that the click came from a human. One can derive a taxonomy and interest graph by analyzing click data among links. The idea is to look at other webpages a user went to from the page behind the shortened URL. It is hypothesized that the next page the user visits is related in content to the current page. On a domain level, a coclick graph uses domains as nodes and the number of clicks between them as edges. From this, we can derive a graph of keywords: for each pair of keywords, compute the Jaccard similarity between the sets of domain clicks associated with each keyword. The resulting coclick graph has 4.5 million keywords and 9 million edges. With some basic processing (removing non-English keywords and keywords with low click counts) and then running a clustering algorithm called DBSCAN, they were able to simplify their graph to 200,000 keyword clusters and 1 million edges.
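
The keyword-linking step can be sketched with a plain Jaccard function; the click sets below are invented:

```python
def jaccard(a, b):
    # |A intersect B| / |A union B|, defined as 0 for two empty sets.
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Invented example: the domains whose clicks carry each keyword.
domains_for = {
    "python":  {"github.com", "stackoverflow.com", "python.org"},
    "pandas":  {"github.com", "stackoverflow.com"},
    "cooking": {"foodnetwork.com"},
}

print(jaccard(domains_for["python"], domains_for["pandas"]))   # high overlap
print(jaccard(domains_for["python"], domains_for["cooking"]))  # no overlap
```

In the keyword graph, an edge between two keywords would be weighted by this similarity, and low-weight edges pruned before clustering.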

The Data Science group at bit.ly keeps an updated GitHub repository for their work here.

LinkedIn Endorsements


The last session in this strand that I attended was by Sam Shah and Pete Skomoroch from LinkedIn. This talk discussed the skills endorsement feature of LinkedIn and how they made it successful using science. Sam and Pete credit most of the success to establishing viral loops and using recommendation engines as follows: A endorses B -> B is notified -> B accepts the endorsement and endorses someone else.

Social tagging of skills also accelerated adoption. First, a user thinks about their skills and tags them on their profile. Then, a recommendation system suggests other related skills as well as some potential people for the user to endorse. But this is not the interesting part…

How does LinkedIn maintain a skill dictionary and taxonomy? This is a highly unwieldy problem due to human psychology and variations in language usage. One of the biggest issues is phrase sense disambiguation. The motivating example was the skill angel. If I list angel as a skill on my profile, am I referring to myself as an angel investor or as a spiritual being? The speakers indicated that by using the graph of all the skills listed in addition to angel, we could use agglomerative clustering and then a distance metric to determine which meaning is most likely. A related problem is de-duplication: MS Office, Microsoft Office and Office all refer to the same thing. For this particular problem, LinkedIn used crowdsourcing with Mechanical Turk tasks. An example human intelligence task was to ask a participant to find the best Wikipedia article for a particular topic, since Wikipedia tends to already have a strong army that de-duplicates content.

This is all great for users that actively use the Skills feature, but some do not. For those users, a system passes a sliding window over the profile text (n-grams) and emits possible matches based on the taxonomy, tossing out words that do not fit the inferred topics in the profile. For example, suppose my profile text says “I love working with data, Python, Java and Hadoop.” The words I, with and and will all be tossed as stopwords, leaving the keywords love, working, data, Python, Java, Hadoop. For all practical purposes, working is probably considered a stopword or low-impact word because it appears so often on LinkedIn profiles, and data is probably not an actual skill, so both of these words are removed, leaving love, Python, Java, Hadoop. Using LinkedIn’s taxonomy of skills, we would probably deduce that Python, Java and Hadoop are highly related and that love is an extreme outlier (for some people love may be an actual skill, but likely not in this context). Finally, this system would tag Python, Java and Hadoop as skills to add to the profile. For more complex (realistic) examples, LinkedIn would then apply word sense disambiguation and de-duplication. A simple Naive Bayes algorithm is used to generate the actual recommendations. In the event of a completely blank profile, recommended skills are based on title, organization and perhaps social network features.
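
A toy sketch of that tagging pass, with an invented stopword list and a tiny skill taxonomy standing in for LinkedIn's:

```python
# Invented stand-ins for LinkedIn's stopword list, low-impact word
# list, and skill taxonomy (which includes multi-word skills).
STOPWORDS = {"i", "with", "and"}
LOW_IMPACT = {"love", "working"}  # too common on profiles to signal a skill
TAXONOMY = {("python",), ("java",), ("hadoop",), ("machine", "learning")}

def ngrams(tokens, max_n=2):
    # Sliding window over the token stream, unigrams then bigrams.
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def extract_skills(text):
    tokens = [t.strip(",.").lower() for t in text.split()]
    skills = []
    for gram in ngrams(tokens):
        if any(t in STOPWORDS | LOW_IMPACT for t in gram):
            continue
        # Keep only n-grams found in the taxonomy ("data" falls out here).
        if gram in TAXONOMY and gram not in skills:
            skills.append(gram)
    return [" ".join(g) for g in skills]

profile = "I love working with data, Python, Java and Hadoop."
print(extract_skills(profile))  # ['python', 'java', 'hadoop']
```

The real system would then run word sense disambiguation and de-duplication over these candidates before recommending them.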

LinkedIn can also suggest endorsements, where the system asks a user to endorse another user for particular skills that the user may know about. Some features used for this recommendation engine include people-skill combinations, school overlap, group overlap, similarity in industry, title similarity, site interactions, and co-interactions. Such a recommendation engine is basically a binary classification problem for link presence.

This talk by LinkedIn was surprisingly candid. Obviously, one cannot employ the methods they discussed because we do not have access to their data or infrastructure, thus such a talk is of no risk to intellectual property. Many companies do not get this and do not allow their employees to speak about anything involving their work.

Conclusion

I am glad I forked out the money to attend Strata and I will likely attend next year. The conference was huge and there was something for every data geek, including a ton of food. The conference overall was not as sales-y as I thought it would be, but there were definitely moments, particularly in the morning sessions and at the expo. I mainly just collected t-shirts at the expo hall, which was basically one giant “my Hadoop distribution is 100x faster than the other guys’.” There was also a really cool sensor lab set up for collecting data using Arduino sensors. There were several sensors placed throughout the conference venue and the data was visualized and placed here.

During my time at Strata so far, I have finally had the chance to meet some longtime Twitter friends and reunite with others. It was great meeting Neil Kodner and discussing our common interests as well as meeting Mathieu Bastian and discussing graph processing and the future of Gephi (I need to write a blog post about Gephi soon). I had a chance to talk to Wes McKinney over lunch as well about Python and the pandas community. On the last day, I went to an event hosted by Facebook and met several Facebook engineers and other Twitter friends including Joseph Turian, Bradford Stephens, Daniel Tunkelang and Greg Rahn. Everybody I met has now relocated to the Bay Area and I think I am going to need to follow…

Merry Christmas and Happy Holidays!

Wishing you all a very Merry Christmas, Happy Holidays and Happy New Year!

An update on me. In October, I began working at Riot Games, the developers of League of Legends. It has been an amazing experience and has occupied the majority of my free time as has my dissertation. My New Year’s resolution this year is to dust the cobwebs off this blog!

Have a safe holiday season!

Here in California, I will be having Christmas in the Sand

A New Data Toy -- Unboxing the Raspberry Pi

Last week I received two Raspberry Pis in the mail from Adafruit and just now have some time to play with them. The Raspberry Pi is a minimal computer system about the size of a credit card. In the embedded systems community the excitement is for obvious reasons, but I strongly believe that such a device can help collect and use data to help us make better decisions: not only is it a computer, but it is small and portable.

For development, a Raspberry Pi can connect to a television (or other display) via HDMI or composite video (the “yellow” plug for those still stuck in the 1900s, haha). A keyboard, mouse and other devices can be connected via two USB ports. A powered hub can provide support for even more devices. There are also various pins for connecting to a breadboard for analyzing analog signals, for a camera, or for an external (or touchscreen) display. An SD card essentially serves as the hard disk (and likely some swap space). The more recent Model B ships with 256MB of RAM. Raspberry Pi began shipping in February 2012 and these little guys have been very difficult to get a [...]

Adventures at My First JSM (Joint Statistical Meetings) #JSM2012

During the past few decades that I have been in graduate school (no, not literally) I have boycotted JSM on the notion that “I am not a statistician.” Ok, I am a renegade statistician, a statistician by training. JSM 2012 was held in San Diego, CA, one of the best places to spend a week during the summer. This time, I had no excuse not to go, and I figured that in order to get my Ph.D. in Statistics, I have to have been to at least one JSM. [...]

OpenPaths and a Progressive Approach to Privacy

OpenPaths is a service that allows users with mobile phones to transmit and store their location. It is an initiative by the New York Times that allows users to use their own data, or to contribute their location data for research projects and perhaps startups that wish to get into the geospatial space. OpenPaths brands itself as “a secure data locker for personal location information.” There is one aspect where OpenPaths is very different from other services like Google Latitude: only the user has access to his/her own data, and it is never shared with anybody else unless the user chooses to do so. Additionally, initiatives that wish to use a user’s location data must ask personally via email (pictured below), and the user has the ability to deny the request. The data shared with each initiative provides only location, and not other data that may be personally identifiable such as name, email, browser, mobile type etc. In this sense, OpenPaths has provided a barebones platform for the collection and storage of location information. Google Latitude is similar, but the data stored on Google’s servers is obviously used by other Google services without explicit user permission.

The service is also opt-in, that [...]

SIAM Data Mining 2012 Conference

Note: This would have been up a lot sooner but I have been dealing with a bug on and off for pretty much the past month!

From April 26-28 I had the pleasure to attend the SIAM Data Mining conference in Anaheim on the Disneyland Resort grounds. Aside from KDD2011, most of my recent conferences had been more “big data” and “data science” oriented, and I wanted to step away from the hype and just listen to talks that had more substance.

Attending a conference on Disneyland property was quite a bizarre experience. I wanted to get everything I could out of the conference, but the weather was so nice that I also wanted to get everything out of Disneyland as I could. Seeing adults wearing Mickey ears carrying Mickey shaped balloons, and seeing girls dressed up as their favorite Disney princesses screams “fun” rather than “business”, but I managed to make time for both.

The first two days started with a plenary talk from industry or research labs. After a coffee break, there were the usual breakout sessions followed by lunch. During my free 90 minutes, I ran over to Disneyland and California Adventure both days to eat lunch. I managed to [...]

My Interview about the Statistics Major

Recently, I participated in an email interview about what being a Statistics major entailed, how I got interested in the field and the future of Statistics. I figured this might be of interest to those that are contemplating majoring in Statistics, or considering a career in Data Science.

Q1: Why did you decide to pursue a major in statistics in college?

A: “When I was a kid, I really enjoyed looking at graphs, plots and maps. My parents and I could not make sense of what was behind the interest. At the same time, I was also heavily interested in education. My mother was a teacher, and the first set of statistics I ever encountered were standardized test scores. I strove to understand what the scores attempted to say about me, and why such scores and tests are so trustworthy. When the stakes increased with the AP and SAT exams, I began reading articles published by the Educational Testing Service and learned a ton about how these tests are constructed to minimize bias, and how scores are comparable across forms. It fascinated me how much science goes into these tests, but at the end of the day they are still just one factor [...]

“Hold Only That Pair of 2s?” Studying a Video Poker Hand with R

Whenever I tell people in my family that I study Statistics, one of the first questions I get from laypeople is “do you count cards?” A blank look comes over their face when I say “no.”

Look, if I am at a casino, I am well aware that the odds are against me, so why even try to think that I can use statistics to make money in this way? Although I love numbers and math, the stuff flows through my brain all day long (and night long), every day. If the goal is to enjoy and have fun, I do not want to sit there crunching probability formulas in my head (yes that’s fun, but it is also work). So that leaves me at the video Poker machines enjoying the free drinks. Another positive about video Poker is that $20 can sometimes last a few hours. So it should be no surprise that I do not agree with using Poker to teach probability.  Poker is an extremely superficial way to introduce such a powerful tool and gives the impression that probability is a way to make a quick buck, rather than as an important tool in science and society. The only [...]

Merry Christmas 2011 From Byte Mining!

To all of my readers and followers, I wish you a very Merry Christmas and a very joyous and safe Happy New Year! This year, I am thankful for the community that has sprung up around Data Science and open-source data collection and processing. This blog is almost two years old, and like with Twitter, I have been able to communicate with many data scientists, enthusiasts and some of the most prolific contributors to the data science software community. I am thankful for all of the wonderful people I have met and have yet to meet, and for your comments and reading.

Parsing Wikipedia Articles: Wikipedia Extractor and Cloud9

Lately I have been doing a lot of work with the Wikipedia XML dump as a corpus. Wikipedia provides a wealth of information to researchers in easy-to-access formats, including XML, SQL and HTML dumps for all language properties. Some of the data freely available from the Wikimedia Foundation include

  • article content and template pages
  • article content with revision history (huge files)
  • article content including user pages and talk pages
  • redirect graph
  • page-to-page link lists: redirects, categories, image links, page links, interwiki etc.
  • image metadata
  • site statistics

The above resources are available not only for Wikipedia, but for other Wikimedia Foundation projects such as Wiktionary, Wikibooks and Wikiquotes.

As Wikipedia readers will notice, the articles are very well formatted, and this formatting is generated by a somewhat unusual markup format defined by the MediaWiki project. As Dirk Riehle stated:

There was no grammar, no defined processing rules, and no defined output like a DOM tree based on a well defined document object model. This is to say, the content of Wikipedia is stored in a format that is not an open standard. The format is defined by 5000 lines of php code (the parse function of MediaWiki). That code may be open source, but it is incomprehensible to most. That’s why [...]