# Review of 2011 Data Scientist Summit

Some time over the past 6 weeks I randomly saw a tweet announcing the “Data Scientist Summit” and shortly below it I saw that it would be held in Las Vegas at the Venetian. Being a Data Scientist myself is reason enough not to pass up this opportunity, but Vegas definitely sweetens the deal! On Wednesday I woke up at 6am to embark on the 5.5-hour voyage to Las Vegas.

The Pre-Party

The Venetian and all nearby hotels were booked, so I ended up at the Aria, which was a new experience for me. The hotel is beautiful and very ritzy. I had heard that the rooms were very technologically advanced, but I wasn’t prepared for the recorded welcome message, the music and the automatic shades opening upon entry to the room. The Aria is a geek’s paradise. Everything is computerized. Key cards are “waved” rather than swiped, and lights are turned on/off and dimmed by use case (“sleep”, “read” etc.) rather than manually. There are no paper “Do Not Disturb” signs; rather, a switch on the wall (or the TV) toggles an indicator light outside the door. And the best part… Internet is FREE!

Photos: The hydrangeas are real! The work desk panel contains Ethernet, power, USB, VGA and audio ports. Cables are available, provided you want to pay the minibar charge.

Data Scientist Summit, Day 1

I arrived at the conference room and quickly took my seat. Seated nearby were several familiar faces. I also finally got a chance to meet Drew Conway (@drewconway) and David Smith (@revodavid), both of whom happened to sit in the row in front of me. The keynote by Thornton May provided a lot of humor and kicked off a very energetic event. In the second session, we heard from data scientists and teams from Bloom Studios, 23andMe, Kaggle and Google. I was happy to see somebody from Google present, as they never seem to attend these types of events (neither does Facebook).

There has been a lot of buzz about 23andMe and Kaggle in the past few months. It is hard to keep up with all of the buzz, so it was great to hear from the companies themselves. 23andMe provides users with a kit containing a test tube into which the user spits. The kit is then sent back to 23andMe’s labs, which analyze something like 500,000 to a million different markers (I am not a biologist) and can report which markers are present, such as a predisposition to diabetes or cancer. In 2011, it costs about $5,000 to do this analysis, whereas 10 to 20 years ago the figure was in the millions. 23andMe goes a step further. They understand that genetics have a strong association with particular conditions, but that they are not necessarily causal. For example, someone with a predisposition to diabetes will not necessarily contract the disease. 23andMe wants to integrate other data into their models to help predict how likely a patient is to contract a certain condition, given their genetics.

Kaggle is a community-based platform for individuals and organizations to submit datasets and open them up to the Data Science community for analysis…as a competition. I love the geekiness of this endeavor, and it continues where the Netflix Prize left off. Kaggle has some awesome prizes for winning competitions, such as $3M for the Heritage Health Prize. There are other freebies as well, such as Revolution R Enterprise free for competitors.

As a disclaimer, I am not a huge visualization guy. I see its importance and usefulness in educating end-users about statistical results, and there are quite a few infographics that are exciting to me. However, there are many times when a boring ol’ boxplot works better than a Processing applet. So, it takes quite a bit to get me excited about cutting-edge graphics. The Immersive Data Visualization session by Dr. JoAnn Kuchera-Morin from UC Santa Barbara did exactly that. Her group has created a large metal sphere, called the AlloSphere, containing a bridge in the center where researchers/analysts stand. Their data is projected, for the eye, throughout the sphere in a 3D, or 3D-like, world. Of course, data can be represented to the eye in several ways: color, size, shape, texture, etc. The AlloSphere also represents data using the other senses, particularly sound. In her presentation, JoAnn took us on a 3D tour of her colleague’s brain (fMRI). Of course, we could “see” the inside of the brain, but we could also hear the blood pressure change in different parts of the brain, indicating differing activities. There were some other demonstrations of studies from physics, but I cannot comment on those because I lost interest (physics has always been my worst subject). I attended UC Santa Barbara for one year after high school, so I am particularly proud of what they have done.

Of all the presentations on the first day, Data Scientist DNA was my favorite. In this panel, Anthony Goldbloom of Kaggle, Joe Hellerstein from UC Berkeley, David Steier from Deloitte and Roger Magoulas from O’Reilly Media discussed what makes a good Data Scientist, or “data ninja” as stated in the program. All were in agreement that candidates should have an understanding of Probability and Statistics, although someone on the panel suggested that a “basic” background was all that was needed; I disagree with that. A Data Scientist should also be a proficient programmer in some language, either compiled or interpreted, and understand at least one statistical package. More importantly, the panel stressed that above and beyond knowledge, it is imperative that a Data Scientist be willing to learn new tools, technologies and languages on the job. Dr. Hellerstein suggested some general guidelines on the classes students should take: Statistics (I argue for a full year of upper division statistics, and graduate study), Operating Systems, Database Systems and Distributed Computing. My favorite quote from the panel came from David Steier: “you don’t just hire a Data Scientist by themselves, you hire them onto a team.” I could not agree more. Finally, the moderator of the panel suggested that Roger Magoulas may have been the one to coin the term “big data” in 2005, but a Twitter follower found evidence that the term has been in use since at least 2000 (Thanks Amund! @atveit).

The last session of the day was given by Jennifer Pahlka from Code for America, titled Imagining and Enabling a Better World. Pahlka started her talk by stating that the millennial generation is the most “pro-government” generation of the modern day. Regardless of politics, millennials see potential in the government and believe that it can be used for good. Jennifer described Code for America as a Teach for America for Data Scientists. The goal of Code for America is to put together very bright minds to tackle local, state and federal government issues using data. Pahlka brilliantly stated, “we don’t need guns, we need geeks. We are trying to create a geek army.”

During the end of day cocktail reception, I scored two posters of data visualizations: “super powers” and “game controllers over the years.” The other two posters offered were “beers” and “rappers.” I also had a chance to quickly meet Tim O’Reilly, Founder and CEO of O’Reilly Media, whose books are my favorite for learning programming languages and technologies (the animal books).

Data Scientist Summit, Day 2

Personally, I enjoyed the second day more than the first day but that may have been due to the fact that I got sleep the night before.

It seemed that the highlight of the morning was the talk by Jonathan Harris titled The Art and Science of Storytelling. He introduced his project “We Feel Fine”, which is a conglomeration of emotions. His project aims to capture the status of the human condition. This was more of a touchy-feely kind of presentation, which is different from most of the Data Science talks. He showed beautiful user interfaces and great examples of fluid user experience. Some statistics that caught my eye concerned human emotion over time. It seemed that people experienced loneliness earlier in the week rather than later in the week. Joy and sadness were approximately inversely related throughout the week and across hours of the day, but I cannot remember the direction of the trends. The most interesting graphics involved the difference between “feeling fat” and “being fat.” States like California and New York were hot spots for “feeling fat”, but they are actually some of the skinniest states. Instead, the region between the Gulf of Mexico and the Great Lakes was actually the fattest, but did not feel that way. A graphic for “I feel sick” showed a hotspot in Nevada, which I thought was very interesting (nuclear fallout? alcohol poisoning in Vegas?). The interesting part of this discussion was that it showed the vast geography of the field called Data Science. Some Data Scientists are more of the visualization and human connection variety, and others (where I consider myself) are more of the classic geeks that like to write code and dig into the data to get a noteworthy result. Well, I guess there isn’t much difference between the two camps after all. As Jonathan would probably say, Data Science is about storytelling.

The next few sessions got a bit blurry (as is Data Science); they covered various interconnected topics. Pete Skomoroch from LinkedIn, Sharon Franks Chiarella from Amazon Mechanical Turk, Gil Elbaz from Factual and Toby Segaran from Google discussed the fact that you can’t turn data into a story without joining the data with, well, other data. Another major topic discussed was how to get labeled data, and this is where Mechanical Turk stands out as a data resource. The next talk was humorously titled Hadoop – The Data Scientist’s Dream. I know some people that would gouge their eyes out when seeing that title. Really, Map-Reduce is the Data Scientist’s dream, but yeah, yeah, I know, Hadoop is the first widely accepted implementation. Paul Brown from Booz Allen Hamilton and Martin Hall from Karmasphere discussed how Hadoop is typically being used in production and then briefly mentioned how Hadoop’s cousins make the Hadoop ecosystem more powerful.

The last session in this trifecta was titled The Data Scientist’s Toolset – The Recipes that Win. Representatives from various companies were panelists: SAS, Informatica, Cloudscale, Revolution Analytics, and Zementis. I felt that this discussion was lacking. I believe the strength of the Data Science community stems from open-source technology, and except for Revolution Analytics, none of the companies have a strong reputation in the open-source community yet. Discussion seemed to focus too much on enterprise analytics (SQL, SAS, Greenplum etc.) and Hadoop, and not enough on analysis and visualization. All in all, this panel was a bit too “enterprisey” for me. Some Twitterers felt that the panelists were pushing their products too much. This was surprising because I felt the exact opposite, unless they were picking up on the “enterprisey” vibe. The panelists were asked what one tool for data science they would choose if they were on a desert island. The panelists responded with the following tools, “Perl, C++, Java, R [sic, thanks David], SQL and Python.” I was disappointed that SQL was mentioned without a countermention for NoSQL because not all data fits in a nice rectangle called a table. By itself, SQL is very limited. Python and R I definitely agree with. Perl is dated, but still has a use in the Data Scientist’s toolbox if the user is not familiar with Python, and doesn’t want to be. I was baffled by the C++ response and the lack of overlap in the other responses. But these are my opinions only.

The Summit Spotlight, Secrets of Attribution – The Stories Beyond the Last-Click, discussed how researchers are trying to use data to “give credit” not only to the site that referred the user to a resource via a click, but to all of the sites in the path that led to that click, the so-called “conversion path” in SEO land. The final session, Building Data Science Firepower – Taking the Leap, was very similar to the Data Scientist DNA talk but added in some food for thought. There are two philosophies for hiring and working with Data Scientists. The first is to build a strong, dedicated data science team, and the second is to embed Data Scientists in each existing team.
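Going back to the attribution talk for a moment, the difference between last-click credit and crediting the whole conversion path can be shown with a small sketch. This is not the method presented at the summit (I did not capture those details); it assumes the simplest possible rule, an equal split of each conversion across every touch in its path. The function names, site names and paths below are all hypothetical.

```python
from collections import defaultdict


def last_click_credit(paths):
    """Give all credit for each conversion to the final site in its path."""
    credit = defaultdict(float)
    for path in paths:
        if path:
            credit[path[-1]] += 1.0
    return dict(credit)


def linear_credit(paths):
    """Split each conversion's credit equally across every touch in its path.

    Assumption: an equal split. A site that appears twice in one path
    receives two shares of that conversion.
    """
    credit = defaultdict(float)
    for path in paths:
        if not path:
            continue
        share = 1.0 / len(path)
        for site in path:
            credit[site] += share
    return dict(credit)


# Hypothetical conversion paths: each list is the ordered sequence of
# referring sites a user touched before converting.
paths = [
    ["blog", "search", "email"],
    ["search", "email"],
    ["blog"],
]

last_click = last_click_credit(paths)
linear = linear_credit(paths)

for site in sorted(linear):
    print(f"{site}: last-click={last_click.get(site, 0.0):.2f}, linear={linear[site]:.2f}")
```

Under last-click, the hypothetical "search" site gets no credit at all even though it appears in two of the three paths; the equal-split rule gives it a share of both. Real attribution models would likely weight touches by position or recency rather than splitting evenly.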

2011 EMC Data Hero Awards

At the end of the summit, the recipients of the EMC Data Hero Awards were announced. I missed some of the honorable mentions, but here goes:

• Energy, Silver Spring Networks.
• Health Care, Jeffrey Brenner, The Camden Coalition.
• Life Sciences, The Broad Institute of MIT and Harvard.
• Media, CMU Create Lab.
• Public Services, Global Viral Forecasting Initiative.
• Technology Application, IBM Watson Computing System.
• Technology IT Infrastructure, Apache Foundation, Hadoop.
• Visionary, Vivek Kundra, CIO, Data.gov.

Vivek Kundra was not present at the summit, but recorded a message to the attendees, which was really cool. He stated that in 2009, there were only 47 government datasets publicly available; in 2011, there are close to 400,000 datasets available to the public.

There were some interesting honorable mentions. Zynga received an honorable mention for Consumer Services. As a player of Farmville and Cityville, I can see the plethora of data that Zynga must work with. Additionally, Zynga has some very creative ways of advertising brands such as McDonald’s and Tostitos (with Farmville items for both companies), 7-Eleven’s new Slurpee (seeds for the Goji Berry), and GagaVille. Zynga also participates in community service: “Sweet Seeds for Haiti” (pay to plant special seeds, with proceeds going to Haiti), just to name one.

Tableau Software also received an honorable mention for the Media category. Tableau develops data visualization software, and is picking up huge steam in the data viz community.

The conference ended with an awesome video created by EMC called “I Am a Data Scientist” featuring several EMC Data Scientists, most of whom I happened to have lunch with!

Overall Impression

All in all, the Data Scientist Summit was an eye-opening and empowering event, even though it was planned in only six weeks. There was a great sense of community and collaboration among those in attendance. I work as a Data Scientist professionally because I love it. The one fact that I tend to overlook is that Data Scientists are in high demand and short supply. I was reminded of how important our work as Data Scientists is.

This was the first annual Data Scientist Summit, and I will no doubt be back. With that said, the discussion of technical topics had a bit of an introductory flavor to it, which made the technology seem dated. For example, “Vanilla” Hadoop was introduced as a tool for processing vast amounts of data. I would expect that most Data Scientists have worked with Hadoop, or at least know what it is. Hadoop is somewhat old news in terms of “cutting-edge technology.” Tools like Pig, Cascalog, HBase, Hive, Cascading, etc. would have been better discussion topics. I was also disappointed with how little coverage of data mining tools there was (except for Hadoop, NoSQL, and enterprise databases). It seemed as if R had gone M.I.A., and I was surprised that there was so little discussion of visualization tools like Tableau, Processing, Gephi, D3, Polymaps, etc.

The Data Scientist Summit set a very solid foundation for the future. I felt like the modus operandi was “here is why Data Science is cool” and “here is why others should be interested.” Although this is not a groundbreaking discussion, it sets the stage for future conferences and solidification of the community. The people that probably got the most value out of the technical discussions were people looking to switch careers, or enter Data Science.

Without a doubt I will be at next year’s Data Scientist Summit!

My thoughts and opinion on this blog do not reflect those of my employer, the Rubicon Project.

### 6 comments to Review of 2011 Data Scientist Summit

• Great writeup, Ryan! I feel like I just saved thousands of dollars! 🙂

The relationship between “sexy” data science, on Twitter data or similar, and Enterprise data science, on corporate data in SQL databases, is interesting. I use R and machine learning and fancy visualizations to understand our corporate data, and to build predictive models and optimization systems. In many ways it’s the same thing. On the other hand, half of my job is to take statistical understanding and models and help to turn them into Enterprise software and decision-support systems. I mostly don’t write user interfaces and Java code with a 15-year lifetime, but I do work closely with programmers, annotate data in databases with statistically-based predictions, etc.

Also, I work with medium-sized data. Rarely more than 100,000 records in a table. But I tend to have to deal with dozens or hundreds of columns. A lot of the application databases that “sexier” data science tackles, like the Foursquare database, are trivially wide, less than 10 attributes, but insanely long. You really do need MapReduce to deal with that. I’ve never needed it.

Somehow the continuum that links my work to, say, “data journalism”, where the visualization is the end point, will be fully defined and fleshed out.

• Stuart

Those plants are hydrangeas, not rhododendrons.

• Ariana

Loved the review! Keep up the great writing. 🙂
