Summary of My First Trip to Strata #strataconf

In this post I am going to summarize some of the things that I learned at Strata Santa Clara 2013. For now, I will only discuss the conference sessions, as I have a much longer post about the tutorial sessions that I am still working on and will publish at a later date. I will add to this post as the conference winds down.

The slides for most talks will be available here but not all speakers will share their slides.

This was my first trip to Strata, so I was eager to participate as an attendee. In the past, I had been put off by the cost and was also concerned that the conference would be an endless advertisement for the conference sponsors and Big Data platforms. I am happy to say that for the most part I was proven wrong. For easier reading, I summarize talks by topic rather than giving a laundry-list schedule of a long day, and I skip sessions that I did not find all that illuminating. I also do not claim 100% accuracy of this text: the days are very long, and my ears and mind can only process so much data when I am context switching between listening, tweeting, emailing, etc.


In the mornings there were several short plenary talks where people from throughout the industry discussed their particular views of Data Science. This was basically a warm-up for what would become very long days, and I mostly used this time to catch up on email and review material from the previous day. The second day apparently had a lot of gimmicky sales talks, but I was not paying much attention. The most interesting talk came from Jennifer Pahlka of Code for America. I had first learned about Code for America at the Data Scientist Summit hosted by EMC in 2011. At the time it sounded like a very good idea: we have college graduates who dedicate a couple of years to teaching in inner-city schools, so it makes sense that we should also have data scientists working on “projects that make a difference in the world,” as Jennifer would say. The projects these data scientists work on involve democratizing data and open data initiatives in local governments. A couple of projects stood out to me: one released some 800+ datasets from the City of Santa Cruz (an amazing city) on a public website, and another studied bail amounts and their relationship to the outcomes of criminal trials. This talk was apropos considering the recent announcement of code.org, the start of a movement to teach computer programming to K-12 students. [You can sign the petition and register as a volunteer here.]

Visualization Strand

I have said in the past that visualization is not my thing. I greatly appreciate interactive graphics and cool infographics that convey strong meaning to non-data scientists, but it simply is not my cup of tea yet. However, it is something that I want to invest time in, so I decided to attend visualization talks up to my tolerance level (which isn’t very high)… which meant one or two.

I attended Chang She’s talk Agile Data Wrangling and Web-based Visualizations. Chang did what I usually do: pack too much into a one-hour talk… but I feel that talks like this really whet the appetite to learn more. He discussed how data science is missing a “blue button” that takes care of data management and then visualization. Using the Federal Election Commission dataset, he showed political donations by party, candidate and state as the motivating example. Chang showed several examples of using pandas (a Python data munging library) to manipulate the data and then passing it to d3.js as JSON served from a small web server. I felt that this was a basic talk on how to combine tools to munge data and then visualize it. It is far from a blue button, but it shows how important such processing pipelines are.
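I did not capture Chang’s actual code, but the pandas-to-JSON half of the pipeline looks roughly like the sketch below; the file name and column names are invented for illustration.

import pandas as pd

# Load FEC-style contribution data (hypothetical file and column names)
donations = pd.read_csv("donations.csv")

# Aggregate donation totals by party and contributor state
totals = donations.groupby(["party", "contbr_st"])["contb_receipt_amt"].sum()

# Serialize to JSON records that a page using d3.js can request from a web server
records = totals.reset_index()
with open("totals.json", "w") as f:
    f.write(records.to_json(orient="records"))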

Law, Ethics and Open Data Strand

One of the highly acclaimed talks of the day came from Joseph Turian of MetaOptimize, titled Sci vs. Sci: Attack Vectors for Black-Hat Data Scientists and Possible Countermeasures. Every skill has a good use and an evil use, and Data Science is no exception. We create models to try to combat fraud, detect spam, measure influence and much more. These “good” uses of skills are called “white hat.” On the other hand, a more evil Data Scientist can circumvent these models to allow their spam to go undetected or game an influence metric such as PageRank. For example, consider a malicious web page that contains code that simply repeats a user’s Google query endlessly. To a very stupid search engine, such a web page would game a keyword matching algorithm and the search engine that is based on it: this crap web page would appear as the first result because it appears the most relevant. This is a very elementary example, but one can imagine how sophisticated models can produce nasty results.
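To make the toy example concrete, here is a minimal sketch of my own (not from the talk) of a relevance score based only on raw keyword counts, and how a keyword-stuffed page beats an honest one:

def naive_relevance(query, page_text):
    # Score a page by how many times the query terms appear, with no normalization
    terms = query.lower().split()
    words = page_text.lower().split()
    return sum(words.count(t) for t in terms)

honest_page = "A short, genuinely useful article about machine learning."
spam_page = "machine learning " * 500   # the query repeated endlessly

print(naive_relevance("machine learning", honest_page))   # small score
print(naive_relevance("machine learning", spam_page))     # huge score, so it ranks first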

Turian believes that most Data Scientists originally come from academia, where the skills we learned are mainly “white hat,” but that our use in industry is mainly “grey hat” (somewhere between good and not-so-good). Such “grey hat” methods may involve some sort of data privacy issue, such as ad retargeting. A “black hat” data scientist may be useful in constructing a botnet, using Markov models or other language models to generate human-looking spam text, or creating sock puppets to sway opinion in a large social network. A sock puppet is essentially a social media account that is designed to look like a genuine human but that has an ulterior motive, mainly to spread propaganda or false information. The use of these sock puppets is referred to as “astroturfing,” that is, a fake grassroots movement. One easy example I can think of is the thousands and thousands of Twitter accounts that are created simply to sway opinion about President Obama (search for #tcot and you are likely to find some examples, though many are also legitimate users). Turian cited one unsophisticated example of astroturfing: Newt Gingrich and his huge jump in followers over a short period of time, which was determined to be fake. In this case, it is alleged that Gingrich’s campaign paid for followers rather than creating an army of sock puppets. Some methods for locating sock puppets are the presence of reply spam (@spam), manual classification, and honeypots.

Some interesting statistics:

  1. 7% of Tweeps (Twitter users) are spam bots.
  2. 20% of us accept friend requests from people we do not know.
  3. 30% of us have been deceived by chat bots.

Note: MetaOptimize hosts an amazing machine learning Q and A site similar in function to StackExchange/StackOverflow. You can visit it here.

Data Science Strand

IPython Notebooks

The first talk of this series I attended was The IPython Notebook: a Comprehensive Tool for Data Science by Brian Granger of Cal Poly San Luis Obispo and Chronicle Labs. One of the major problems in Data Science is that “code and data do not communicate much.” That is, code usually lives in one file and data in another, and an analysis involves a coupling of data and code that must be kept in sync throughout the process. Imagine if all of your work as a Data Scientist could be laid out on your physical desktop as separate objects: this is a good analogy for IPython Notebooks. An IPython Notebook functions much like a Mathematica notebook or a Sage notebook. One can analyze data in pandas data frames, use some fancy models from SciPy or scikit-learn, and use the general Python language as well as the niceties provided by IPython, all in one place. Once the code is written, one can produce plots with matplotlib in place and then distribute the document to others. IPython Notebooks provide a living document of one’s work and offer resilience to change by keeping all of the code in one place. Additionally, the concept of cell magic allows the execution of other languages such as R, Ruby and Julia from within the IPython Notebook! Soon there may be no need to run multiple interpreters or maintain a separate open-source notebook project for each additional language!

Here is the amazing part: by using so-called cell magic, one can push a Python object, say a pandas DataFrame, directly into R, where it is converted into an R data frame. I do not remember the specifics of why this is possible, but this is huge. This eliminates the need for packages like RPy2 for basic computations between R and Python. [Edit: RPy2 is used under the hood for this conversion. Thanks to Dirk for pointing this out.] Brian mentioned that it also may be possible to eventually allow Python objects to interact with JavaScript libraries such as d3.js for visualization using widgets.
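If I remember the demo correctly, the workflow looks roughly like the following three notebook cells; the DataFrame is a made-up example, and the rmagic extension (which wraps RPy2) ships with IPython:

# cell 1: enable the R cell magic
%load_ext rmagic

# cell 2: a made-up pandas DataFrame to hand off to R
import pandas as pd
df = pd.DataFrame({"x": [1, 2, 3], "y": [2.0, 4.1, 6.2]})

%%R -i df
# cell 3 (this body is R): -i pushed df in as an R data.frame
summary(lm(y ~ x, data = df))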

IPython Notebooks support narrative text, headings, graphics and mathematical typesetting via MathJax. Executed code produces JSON output that is portable and serializable, so results can be saved without requiring the code to be re-executed. The site nbviewer.ipython.org provides an online viewer for IPython notebooks via a URL, Git repository URL or Gist URL; this viewer does not require any local installation. One current limitation of IPython Notebooks is that they only support a single user and thus cannot be hosted for, say, multiple students to log in to their own notebook sessions in a classroom.

Once ipython and ipython-notebook (the latter is the Ubuntu package name) are installed, one just executes the command ipython notebook in the directory of interest to start up a web server for working with IPython Notebooks.

Apparently entire textbooks are being written as IPython Notebooks for their beauty, scientific ease and portability.

Adversarial Learning

The final talk I attended was What To Do When Your Machine Learning Gets Attacked by Vishwanath Ramarao. The purpose of this talk was to discuss issues with the bad guys trying to circumvent machine learning models designed to prevent abuse of a system, such as a spammer learning how to get around a spam filter over time. This spammer is called an adversary, and can be a “black hat” data scientist. Some examples of adversarial situations are login fraud (spearphishing, PR embarrassment or theft of financial information), comment/mail spam, sign-up fraud, astroturfing, credit card fraud and click fraud. Adversarial learning is the set of techniques for classifying data emitted by an adversary.

An adversarial situation arises when the adversary is able to observe the output of the learning system and can change some subset of the features used in that system so that their attempts go undetected. The goal of adversarial learning is to make it costly for an adversary to change features. The approach towards a solution is labor intensive, but simple to explain. Ramarao essentially said that the best way to combat adversaries is to

  1. engineer features interactively and quickly. 
  2. not throw away features as we commonly do. It is possible that some features may be activated as the adversary’s methods evolve. 
  3. consider the entire transmission of an adversarial transaction — that is, do not just look at the words in a spam email but also look at the HTTP headers and other communication information passed along with the text (see the sketch after this list).
  4. study anomalies (outliers and high leverage points) and not discard them. Usually such anomalies are adversaries.
  5. permit overfitting when necessary for the reason mentioned in #3.
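To illustrate point 3, here is a rough sketch of my own (not Ramarao’s) of building features from the whole transmission rather than just the message body; the header names are generic HTTP-style fields chosen for illustration:

def transaction_features(text, headers):
    # Features from the message body itself
    feats = {"word:" + w: 1 for w in text.lower().split()}
    # Features from the surrounding transmission, which are costlier for an
    # adversary to change than the words in the message
    feats["ua:" + headers.get("User-Agent", "missing")] = 1
    feats["lang:" + headers.get("Accept-Language", "missing")] = 1
    feats["has_referer"] = int("Referer" in headers)
    return feats

print(transaction_features(
    "cheap pills buy now",
    {"User-Agent": "curl/7.29.0", "Accept-Language": "en-US"},
))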

As a text mining enthusiast, I learned some interesting tricks for fitting machine learning models to text, none of which had much to do with adversarial learning per se (a sketch turning them into model features follows the list).

  • A homoglyph is a rewriting of a word that replaces some characters with characters that look similar. For example, p0rn is a homoglyph of porn: the o in porn is replaced with a similar-looking character, the zero 0.
  • A broken word is an intended word written with extra spaces. For example, the word nigeria could be a feature for a spam detection algorithm; an adversary can bypass the filter by instead writing ni geria.
  • Hash busters are new words, absent from the lexicon used to train the text model, that are injected into content. The count of hash busters can itself be used as a feature in a model. One common hash buster for a naive profanity filter would be the word fcuk instead of the actual word f*ck.
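A toy sketch of how these three tricks might be turned into model features; the homoglyph map and lexicon below are tiny stand-ins of my own, not anything from the talk:

HOMOGLYPHS = {"0": "o", "1": "l", "3": "e", "@": "a"}      # tiny illustrative map
LEXICON = {"porn", "nigeria", "viagra", "offer", "click"}  # stand-in vocabulary

def normalize_homoglyphs(token):
    # p0rn -> porn; deliberately crude, it is only a heuristic
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in token.lower())

def text_features(message):
    original = message.lower().split()
    tokens = [normalize_homoglyphs(t) for t in original]
    joined = "".join(tokens)  # catches broken words such as "ni geria"
    return {
        "homoglyph_hits": sum(o != t for o, t in zip(original, tokens)),
        "broken_word_hits": sum(w in joined for w in LEXICON)
                            - sum(t in LEXICON for t in tokens),
        "hash_busters": sum(t not in LEXICON for t in tokens),  # out-of-lexicon tokens
    }

print(text_features("ni geria p0rn offer"))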

Julia


This talk was presented by Michael Bean from Forio (developers of Julia Studio). After being enlightened by it, I am going to write a more substantial post focusing solely on Julia, so for now I will just briefly describe some of the more easy-to-explain content. As data scientists we love dynamic environments for interactive data munging such as R or the Python shell (with pandas or SciPy). We typically start with a high-level language such as R and then port the code to a compiled or more performant language like C, C++ or Java (and maybe Python). This is a large barrier in scientific computing because it requires the data scientist to know two languages: one to experiment, and one to implement. Julia is a scientific computing language that provides the performance of a language like C++ while adding technical libraries and accessibility for scientific exploration. Bean cited benchmarks showing Julia’s performance to be similar to C++. Julia allows us to complete tasks faster because it removes the need for “glue” code, and Julia packages are written in Julia for performance rather than requiring C or Fortran. [R packages can be written solely in R, but for computationally intensive operations, or for packages that sit at a bottom layer such as data structures, there is a huge performance hit.] Once one is familiar with Julia, it is easy to “hack the core,” so to speak.

Other features that impress me:

  • the user can redefine arithmetic operations and construct new data types. Julia uses multiple dispatch, a programming language feature that selects among different implementations of a function depending on the types of the arguments passed to it. For example, if A and B are of type matrix, then Julia will know that A * B is the matrix multiplication operation rather than elementwise multiplication.
  • common data structures found in computer science are supported natively such as BitArrays and SubArrays as well as types statisticians are already familiar with including Distribution and DataFrame.
  • support for list comprehensions. For example, to square every element of x, use [xi^2 for xi in x] instead of a for loop.
  • every package is a Git repository and thus open-source and easy to access.
  • some packages support multicore natively.
  • certain functions can have a bang (!) appended, which tells Julia not to make copies of the object (think of in-place sort, which is sort!).

Bean showed that the development process with Julia is shorter than with languages such as R because production-level re-implementation is not necessary. The runtime is also faster for the few examples he showed. The following is an example of the recursive implementation of generating Fibonacci numbers in both R and Julia:

R code:

fib <- function(n)
{
  if (n < 2) {
    return(n)
  } else {
    return(fib(n-1) + fib(n-2))
  }
}

start <- Sys.time()
fib(36)
end <- Sys.time()
end - start

Runtime: 192 seconds

Julia code:

fib(n) = n < 2 ? n : fib(n - 1) + fib(n - 2)
@elapsed fib(36)

Runtime: 0.24 seconds

Connected World Strand

Bit.ly: Deriving an Interest Graph

The first talk in this strand that I attended was by Anna Smith of bit.ly, titled Deriving an Interest Graph for Social Data. It should be no surprise that a URL shortening service has a ton of data to sift through. Anna stated that a lot of her work is one-off analysis. What I liked about Anna’s talk in particular is that the visualizations she used were very basic. There was nothing fancy about her graphics; they simply conveyed some insights about the data, and that is it.

Bitly extracts a lot of data from each shortened URL, including keywords, topics and the probability that the click came from a human. One can derive a taxonomy and interest graph by analyzing click data among links. The idea is to look at the other web pages a user went to from the page behind the shortened URL; the hypothesis is that the next page the user visits is related in content to the current page. At the domain level, a coclick graph uses domains as nodes and the number of clicks between them as edge weights. From this, one can derive a graph of keywords by computing, for each pair of keywords, the Jaccard similarity between the sets of domain clicks associated with each keyword. The resulting coclick graph has 4.5 million keywords and 9 million edges. By applying some basic processing (removing non-English keywords and keywords with low click counts) and then running a clustering algorithm called DBSCAN, they were able to simplify the graph to 200,000 keyword clusters and 1 million edges.
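A rough reconstruction of the keyword-similarity step; the click data below are invented, and I treat the domain sets as unweighted even though the real computation uses click counts:

# Each keyword maps to the set of domains whose clicks carried that keyword
clicks = {
    "python":  {"github.com", "stackoverflow.com", "python.org"},
    "pandas":  {"github.com", "stackoverflow.com"},
    "recipes": {"foodnetwork.com", "allrecipes.com"},
}

def jaccard(a, b):
    # |A intersect B| / |A union B|
    return len(a & b) / float(len(a | b))

# Edges of the keyword graph, keeping only reasonably similar pairs
keywords = sorted(clicks)
edges = [
    (k1, k2, jaccard(clicks[k1], clicks[k2]))
    for i, k1 in enumerate(keywords)
    for k2 in keywords[i + 1:]
    if jaccard(clicks[k1], clicks[k2]) > 0.1
]
print(edges)   # [('pandas', 'python', 0.666...)]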

The Data Science group at bit.ly keeps an updated GitHub repository for their work here.

LinkedIn Endorsements


The last session in this strand that I attended was by Sam Shah and Pete Skomoroch from LinkedIn. This talk discussed the skills endorsement feature of LinkedIn and how they made it successful using science. Sam and Pete credit most of the success to establishing viral loops and using recommendation engines as follows: A endorses B -> B is notified -> B accepts the endorsement and endorses someone else.

Social tagging of skills also accelerated adoption: users market their skills by tagging them on their profiles, and then a recommendation system suggests other related skills as well as some potential people for the user to endorse. But this is not the interesting part…

How does LinkedIn maintain a skill dictionary and taxonomy? This is a highly unwieldy problem due to human psychology and variation in language usage. One of the biggest issues is phrase sense disambiguation. The motivating example was the skill angel: if I list angel as a skill on my profile, am I referring to myself as an angel investor or as a spiritual being? The speakers indicated that by using the graph of all the skills listed alongside angel, we can apply agglomerative clustering and a distance metric to determine which meaning is most likely. A related problem is de-duplication, as in MS Office, Microsoft Office, Office: all of these refer to the same thing. For this particular problem, LinkedIn used crowdsourcing with Mechanical Turk tasks. An example human intelligence task was to ask a participant to find the best Wikipedia article for a particular topic, since Wikipedia tends to already have a strong army of editors that de-duplicates content.

This is all great for users who actively use the Skills feature, but some do not. For those users, a system passes a sliding window over the profile text (producing n-grams), emits possible matches based on the taxonomy, and tosses out words that do not fit the topics inferred from the profile. For example, suppose my profile text says “I love working with data, Python, Java and Hadoop.” The words I, with and and are tossed as stopwords, leaving the keywords love, working, data, Python, Java, Hadoop. For all practical purposes, working is probably treated as a stopword or low-impact word because it appears so often on LinkedIn profiles, and data is probably not an actual skill, so both of these words are removed, leaving love, Python, Java, Hadoop. Using LinkedIn’s taxonomy of skills, we would deduce that Python, Java and Hadoop are highly related and that love is an extreme outlier (for some people love may be an actual skill, but likely not in this context). Finally, the system tags Python, Java and Hadoop as skills to add to the profile. For more complex (realistic) examples, LinkedIn then applies word sense disambiguation and de-duplication. A simple Naive Bayes algorithm is used to generate the actual recommendations. In the event of a completely blank profile, recommended skills are based on title, organization and perhaps social network features.
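A bare-bones sketch of the sliding-window idea; the taxonomy and stopword list are tiny stand-ins of my own, and the disambiguation and Naive Bayes steps are omitted entirely:

SKILL_TAXONOMY = {"python", "java", "hadoop", "machine learning"}  # stand-in
STOPWORDS = {"i", "love", "working", "with", "and", "data"}        # stand-in

def candidate_skills(profile_text, max_n=2):
    tokens = [t.strip(",.").lower() for t in profile_text.split()]
    tokens = [t for t in tokens if t not in STOPWORDS]
    found = set()
    # Slide windows of length 1..max_n over the remaining tokens
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            if gram in SKILL_TAXONOMY:
                found.add(gram)
    return found

print(candidate_skills("I love working with data, Python, Java and Hadoop."))
# {'python', 'java', 'hadoop'}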

LinkedIn can also suggest endorsements, where the system asks a user to endorse another user for particular skills that the user may know about. Some features used for this recommendation engine include people-skill combinations, school overlap, group overlap, similarity in industry, title similarity, site interactions, and co-interactions. Such a recommendation engine is basically a binary classification problem for link presence.
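A hedged sketch of what a binary classifier for link presence might look like; the feature set and training rows are made up and this is not LinkedIn’s actual model:

from sklearn.linear_model import LogisticRegression

# Each row: [school_overlap, group_overlap, same_industry, title_similarity]
X = [
    [1, 2, 1, 0.8],   # pairs of members where an endorsement happened...
    [0, 3, 1, 0.6],
    [0, 0, 0, 0.1],   # ...and pairs where it did not
    [1, 0, 0, 0.2],
]
y = [1, 1, 0, 0]      # 1 = an endorsement link exists

model = LogisticRegression()
model.fit(X, y)

# Probability that a new candidate pair would produce an endorsement
print(model.predict_proba([[0, 1, 1, 0.7]])[0][1])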

This talk by LinkedIn was surprisingly candid. Obviously, we cannot employ the methods they discussed because we do not have access to their data or infrastructure, so such a talk poses little risk to their intellectual property. Many companies do not get this and do not allow their employees to speak about anything involving their work.

Conclusion

I am glad I forked out the money to attend Strata, and I will likely attend next year. The conference was huge and there was something for every data geek, including a ton of food. The conference overall was not as sales-y as I thought it would be, but there were definitely moments, particularly in the morning sessions and at the expo. I mainly just collected t-shirts at the expo hall, which was basically a giant “my Hadoop distribution is 100x faster than the other guys’.” There was also a really cool sensor lab set up to collect data using Arduino sensors; several sensors were placed throughout the conference venue and the data was visualized and placed here.

During my time at Strata so far, I have finally had the chance to meet some longtime Twitter friends and reunite with others. It was great meeting Neil Kodner and discussing our common interests as well as meeting Mathieu Bastian and discussing graph processing and the future of Gephi (I need to write a blog post about Gephi soon). I had a chance to talk to Wes McKinney over lunch as well about Python and the pandas community. On the last day, I went to an event hosted by Facebook and met several Facebook engineers and other Twitter friends including Joseph Turian, Bradford Stephens, Daniel Tunkelang and Greg Rahn. Everybody I met has now relocated to the Bay Area and I think I am going to need to follow…

4 comments to Summary of My First Trip to Strata #strataconf

  • Love the t-shirts! Great review – next time, make sure you stop by SiSense’s booth! We got voted “Best In Show” at the event. Find out why @ http://bit.ly/YOnUdy

  • Great review. We attended several of the same sessions. It was a rich conference for both its content and for the t-shirts.

    Spark and Shark from the AMPLab at Berkeley caught a lot of buzz the first couple of days. The Berkeley Data Analytics Stack (BDAS) is something to keep our eyes on. One of my to-do projects is to spin up an instance on AWS and finish going through their tutorials. Their documentation on how to do this and how to go through their exercises is pretty good. A paper published in Nov 2012 goes through some of the voodoo CS they did to get Shark so much faster than Hive. It’s a dense paper. I need to read more to see if it’s just for some use cases where Shark runs 100x faster than Hive.

    • I have added it to my list of things to read 🙂

      I am planning on doing an unscientific bakeoff between Java Hadoop, Spark and Shark using a basic filter and map/reduce. Curious to see the difference in performance.

  • Strata and other conferences | Fraud Analytics Blog

    […] Rosario’s excellent post about his first trip to the […]
