In this post I am goIing to summarize some of the things that I learned at Strata Santa Clara 2013. For now, I will only discuss the conference sessions as I have a much longer post about the tutorial sessions that I am still working on and will post at a later date. I will add to this post as the conference winds down.
The slides for most talks will be available here but not all speakers will share their slides.
This is/was my first trip to Strata so I was eagerly awaiting participating as an attendant. In the past, I had been put off by the cost and was also concerned that the conference would be an endless advertisement for the conference sponsors and Big Data platforms. I am happy to say that for the most part I was proven wrong. For easier reading, I am summarizing talks by topic rather than giving a laundry list schedule for a long day and also skip sessions that I did not find all that illuminating. I also do not claim 100% accuracy of this text as the days are very long and my ears and mind can only process so much data when I am context [...]
During the past few decades that I have been in graduate school (no, not literally) I have boycotted JSM on the notion that “I am not a statistician.” Ok, I am a renegade statistician, a statistician by training. JSM 2012 was held in San Diego, CA, one of the best places to spend a week during the summer. This time, I had no excuse not to go, and I figured that in order to get my Ph.D. in Statistics, I have to have been to at least one JSM. [...]
Whenever I tell people in my family that I study Statistics, one of the first questions I get from laypeople is “do you count cards?” A blank look comes over their face when I say “no.”
Look, if I am at a casino, I am well aware that the odds are against me, so why even try to think that I can use statistics to make money in this way? Although I love numbers and math, the stuff flows through my brain all day long (and night long), every day. If the goal is to enjoy and have fun, I do not want to sit there crunching probability formulas in my head (yes that’s fun, but it is also work). So that leaves me at the video Poker machines enjoying the free drinks. Another positive about video Poker is that $20 can sometimes last a few hours. So it should be no surprise that I do not agree with using Poker to teach probability. Poker is an extremely superficial way to introduce such a powerful tool and gives the impression that probability is a way to make a quick buck, rather than as an important tool in science and society. The only [...]
<< My review of Day 1.
I am summarizing all of the days together since each talk was short, and I was too exhausted to write a post after each day. Due to the broken-up schedule of the KDD sessions, I group everything together instead of switching back and forth among a dozen different topics. By far the most enjoyable and interesting aspects of the conference were the breakout sessions.
KDD 2011 featured several keynote speeches that were spread out among three days and throughout the day. This year’s conference had a few big names.
Steven Boyd, Convex Optimization: From Embedded Real-Time to Large-Scale Distributed. The first keynote, by Steven Boyd, discussed convex optimization. The goal of convex optimization is to minimize some objective function given linear constraints. The caveat is that the objective function and all of the constraints must be convex (“non-negative curvature” as Boyd said). The goal of convex optimization is to turn the problem into a linear programming problem. We should care about convex optimization because it comes from some beautiful and complete theory like duality and optimality conditions. I must say, that whenever I am chastising statisticians, I often say that all they care about is “beautiful theory” [...]
I have been waiting for the KDD conference to come to California, and I was ecstatic to see it held in San Diego this year. AdMeld did an awesome job displaying KDD ads on the sites that I visit, sometimes multiple times per page. That’s good targeting!
Mining and Learning on Graphs Workshop 2011
I had originally planned to attend the 2-day workshop Mining and Learning with Graphs (MLG2011) but I forgot that it started on Saturday and I arrived on Sunday. I attended part of MLG2011 but it was difficult to pay attention considering it was my first time waking up at 7am in a long time. The first talk I arrived for was Networks Spill the Beans by Lada Adamic from the University of Michigan. Adamic’s presented work involved inferring properties of content (the “what”) using network structure alone (using only the “who”: who shares with whom). One example she presented involved questions and answers on a Java programming language forum. The research problem was to determine things such as who is most likely to answer a Java beginner’s question: a guru, or a slightly more experienced user? Another research question asked what dynamic interactions tell us about information flow. [...]
Some time over the past 6 weeks I randomly saw a tweet announcing the “Data Scientist Summit” and shortly below it I saw that it would be held in Las Vegas at the Venetian. Being a Data Scientist myself is reason enough to not pass up this opportunity, but Vegas definitely sweetens the deal! On Wednesday I woke up at 6am to partake on the 5.5 hour voyage to Las Vegas.
The Venetian and all close hotels were booked, so I ended up at the Aria; a new experience. The hotel is beautiful and very ritzy. I had heard that the rooms were very technologically advanced but I wasn’t prepared for the recorded welcome message, music and automatic shades opening upon entry to the room. The Aria is a geek’s paradise. Everything is computerized. Key cards are “waved” rather than swiped, lights are turned on/off and dimmed by use case (“sleep”, “read” etc.), rather than manually. There are no paper “Do Not Disturb” signs; rather, a switch on the wall (or via TV) toggles an indicator light outside the door. And the best part… Internet is FREE!
The rhododendrons hydrangeas are real!
Elastic Compute Cloud (EC2) is a service provided a Amazon Web Services that allows users to leverage computing power without the need to build and maintain servers, or spend money on special hardware. The idea is simple, the user “boots” up one or more machines and then accesses those machines as if they were logged into any other machine remotely. I used EC2 and Elastic MapReduce extensively for my M.S. thesis last spring, but mainly used its large memory capabilities rather than its potential for explicit parallelism.
Recently, I ran a crawling job on EC2 using a parellel crawler I wrote in Python with twill. Using EC2 poses its own challenges. Using parallel code poses more challenges. Combining these two facts with the fact that crawling is I/O bound can create some more interesting challenges. If you have taken a course in operating systems, you have heard this stuff over and over again. So have I, but I am stubborn. I tend to learn lessons from experience, and this was no exception. Through this series of posts, I want to point out difficulties and “gotchas” that are important to keep in mind when using EC2, and in this post, with [...]
This week it was revealed that the iPhone stores users’ locations, and this immediately caused a huge firestorm of commentary by tech geeks, panic among privacy advocates, and delight to data geeks like myself. Even better/worse, it seems that the iPhone caches location traces long-term, possibly back to the date the phone was activated.
I ditched my iPhone this past December (good riddance) in favor of the Droid X (Android). I figured, on such an open source OS, Google must be doing the same thing. After surfing through Hacker News, it turns out I was right.
Compared to the iPhone though, getting the data on an Android phone is not simple.
The data is stored in two files, cache.cell and cache.wifi in the directory /data/data/com.google.android.location/files.
First, the user cannot browse this directory by attaching it to a computer. I installed an SSH daemon QuickSSHD to allow remote access into my phone.
Second, it is not possible to access this directory without getting a Permission denied error, even if logged in as “root” as Google has not made this directory readable.
Finally, for those (myself) that are still determined to crack this nut, you will need to root your phone. This makes the “root” user a real [...]
As most readers are probably aware, the free IDE for R, called RStudio, was recently released for general use and it immediately made huge waves within the R community. IDE stands for Integrated Development Environment. IDEs typically provides a rich set tools developing in some target language. For standard programming languages like C++ (VisualStudio) and Java (Eclipse or NetBeans), IDEs contain:
an editor tailored to the target language. The editor typically has tab/auto-complete for variable names, functions and class methods and properties and also features syntax highlighting.
a multiple document interface (MDI) where there may be several documents opened in different tabs.
a window that interacts with the compiler, or a panel containing the console to the language, a la MATLAB, and even vanilla R’s GUI.
a file browser and language reference.
RStudio plays to this analogy very well, and makes modifications where appropriate. RStudio provides many features that are lacking in the standard R GUI, and improves on features that do not work properly in the Windows R GUI. Over the past few days, I have been doing all of my R analysis within RStudio, shortly with the Desktop version, and mostly with the Server version. I will discuss mostly the server version [...]
I am happy to report that ByteMining is listed on “40 Fascinating Blogs for the Ultimate Statistics Geek“!
Some of the ones that I frequently read, or are written by Twitter friends/followers (in no particular order):
R-bloggers, an aggregate site containing blog posts tagged as posts about R. High quality content.
Statistical modeling, causal inference and social science. This one is a no brainer, as it is the blog for Andrew Gelman‘s group.
FlowingData by Nathan Yau (@flowingdata), fellow Statistics Ph.D. student at UCLA. Focuses on the data and information visualization side of Data Science.
dataists by Hilary Mason (@hmason, bit.ly), Vince Buffalo (@vsbuffalo, UC Davis),
Drew Conway (@drewconway, NYU), Mike Dewar (@mikedewar, Columbia),
John Myles White (@johnmyleswhite, Princeton) and others.
A new blog on several aspects of Data Science including Data Mining, visualization and uses of Statistics in current events. Heavy use of R and ggplot2.
Revolutions by Revolution Analytics provides a variety of content around R, Data Science and Statistics in general.
FiveThirtyEight by Nate Silver shares sophisticated modeling and analysis of elections and government happenings. It is in a different realm, as it attracts political news junkies (and the occasional extremist) rather than just Statisticians.
LoveStats by Annie Pettit, Ph.D. (@LoveStats) discusses Statistics as used in Social [...]