Some New Year Resolutions for (this) Data Scientist in 2017

I’ve never been very big on New Year’s resolutions. I’ve tried them in the past, and while they are nice to think about, they are always overly vague, difficult to accomplish in a year, trite, or simply never attempted. This year, instead of skipping resolutions entirely, I decided to try something different: I set out some professional goals for myself as a Data Scientist. So without further ado…

1. Don’t Complain about It, Fix It: Contribute to Open Source Software (More)

Open source software is only as good as its community and/or developer(s). Developers are human and typically cannot manage all bugs and feature requests themselves. My goal is to routinely contribute back to the community either with new features, or by fixing bugs that I discover. This not only helps the community at large, but also helps me as a software engineer. There is no better way to become an even better engineer than by wading through someone else’s code. While this is something I did all day every day at my $DAYJOB, I do it less while on my sabbatical.

Some of the projects I use the most, and that I hope to contribute to, are scikit-learn and pandas, particularly the parts aimed at higher-performance computing such as out-of-core and batched processing. These “tricks” are critical for working with huge datasets on small machines, particularly for students who may not wish to pay for Amazon EC2, Azure, etc.
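To make the out-of-core idea concrete, here is a minimal sketch (not code from any of my projects) of the pattern: stream a dataset in chunks with pandas and fit a model incrementally with scikit-learn’s partial_fit, so the full dataset never has to sit in memory at once. The data here is synthetic purely for illustration.

```python
# Out-of-core learning sketch: pandas chunked reading + incremental fitting.
import io

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

# Fake a "large" CSV in memory so the sketch is self-contained.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(int)
buf = io.StringIO()
pd.DataFrame(np.column_stack([X, y]),
             columns=["f1", "f2", "f3", "label"]).to_csv(buf, index=False)
buf.seek(0)

clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # all labels must be declared before the first batch

# chunksize bounds the memory footprint: only one batch is in RAM at a time.
for chunk in pd.read_csv(buf, chunksize=200):
    clf.partial_fit(chunk[["f1", "f2", "f3"]].to_numpy(),
                    chunk["label"].astype(int).to_numpy(),
                    classes=classes)

accuracy = clf.score(X, y)
```

The same loop works on a real multi-gigabyte file: only `chunksize` rows are ever loaded, which is exactly what makes a small machine viable.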

2. Focus on Sharing, Not Just Doing

One of the qualities of my Ph.D. advisor that I admire the most is his dedication to sharing pretty much everything that he does, even if it isn’t complete. Anytime there is a new medium for writing and sharing technical content, he adopts it for his work. Through him, I learned about GitBooks and RPubs. There is also GitHub Pages, ways to share Jupyter Notebooks, R Notebooks, etc. It is hard to keep up with all these new ways of sharing work, but the takeaway is that I need to get better at it.

When I am asked, I typically recommend that people post on GitHub only completed projects that they are proud of. I am thinking of using a secondary GitHub account for exploration. There are often times I start a project, get distracted, and never complete it. But many times I learn some interesting techniques or hacks that somebody else could use, and I do not have time to blog about them. Right now, all of that knowledge goes to waste. By sharing this work, somebody else can find these gems, even if the project itself is never completed. In academia there is the mantra publish or perish. While my academic pursuits will end at a Ph.D., some teaching, and maybe a conference talk or two, I want to start taking this to heart in the technical world: give talks, contribute to meetups, blog more, participate on StackOverflow, Quora, Gitter, IRC… and maybe Slack…

One other benefit of sharing work is that it encourages accountability. If I post unfinished work, I may be more likely to finish it, and if there is enough content to remind me of what I did and why I was doing it, that encourages even more success.

This concept of sharing also applies to my personal at-home habit of writing one-off scripts. In software engineering we focus on reusable code. I want to start taking these one-off scripts and turning them into tools with at least a command-line interface. This of course assumes that the script has some use to someone other than myself; it is my job to try to make it so, all in the name of sharing and contributing. And it is not always about the end goal: while some of my advisor’s shared manuscripts and code snippets are not useful to me in what they do, I have learned a lot about coding techniques and new algorithms from them, and that alone makes sharing content worth it.
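The upgrade I have in mind is small but real. Here is a sketch of the pattern, using a made-up stand-in task (summarizing a column of numbers): the same logic a one-off script would hard-code, but wrapped in an argparse command-line interface so someone else can run it with their own inputs.

```python
# One-off logic promoted to a reusable command-line tool with argparse.
import argparse
import statistics


def summarize(values):
    """Return (mean, stdev) of a list of floats."""
    return statistics.mean(values), statistics.stdev(values)


def main(argv=None):
    parser = argparse.ArgumentParser(
        description="Summarize a list of numbers.")
    parser.add_argument("numbers", nargs="+", type=float,
                        help="values to summarize")
    parser.add_argument("--round", type=int, default=2,
                        help="decimal places in the output")
    args = parser.parse_args(argv)
    mean, stdev = summarize(args.numbers)
    print(f"mean={round(mean, args.round)} stdev={round(stdev, args.round)}")


if __name__ == "__main__":
    main()
```

Factoring the logic into `summarize()` separate from `main()` also means the function can be imported and tested, which is most of the distance between a throwaway script and shareable code.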

3. Create a Usable Web Service, Running on a Real Server (not localhost)

People say that communication is a big part of being a Data Scientist. I believe this depends on the type of Data Scientist role. A Data Science Engineer focuses more of his/her time on accessibility… developing data products or tools that allow people (or machines) to make decisions or present data in a way that a human can easily understand with interactive graphics and other forms of user interaction. Of course this is a special form of communication.

I’ve built machine learning systems, but at the time I did not appreciate the full lifecycle of the system. The system needs to sell itself. Not only should it implement a model in a scalable way, it also needs to adapt to new data (online learning and tuning). At the time, I thought this was a pain, but I now realize that this is what makes a system speak to the user: a full feedback loop.

Going with that theme, web apps and web services are a very useful way of sharing insights and presenting data to the client. I’ve worked almost entirely on the backend, and thus I would like to spend time turning some of my projects into front-end web apps using JavaScript (D3, Leaflet, React) and perhaps some of the new features of Python 3 such as asyncio. Most of my experimentation has been on localhost. I want to create a real app on Amazon EC2 or running on this host… just somewhere other than localhost.
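For a taste of the asyncio side of that, here is a toy sketch: a TCP echo service plus a client request to it in the same script. A real web app would sit behind a framework on a public host rather than 127.0.0.1, but the event-loop mechanics are the same.

```python
# Minimal asyncio service sketch: start a server, send one request, read the reply.
import asyncio


async def handle(reader, writer):
    # Echo a single line back to the client, prefixed so we can see it worked.
    data = await reader.readline()
    writer.write(b"echo: " + data)
    await writer.drain()
    writer.close()
    await writer.wait_closed()


async def main():
    # Port 0 asks the OS for any free port, so the demo never collides.
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    async with server:
        reader, writer = await asyncio.open_connection("127.0.0.1", port)
        writer.write(b"hello\n")
        await writer.drain()
        reply = await reader.readline()
        writer.close()
        await writer.wait_closed()
    return reply


reply = asyncio.run(main())
```

The handler and the client both run on one event loop, which is exactly the concurrency model an asyncio web framework builds on.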

4. Open My Mind (More) to Neural Networks and Deep Learning

Naturally, I do a lot of machine learning. Over the past 2 to 3 years, Deep Learning has been named the solution to every problem under the sun, and I am sure it can be used to find that missing sock. I knew about neural networks, but it wasn’t the branch of machine learning I focused much on, so I put the whole deep learning thing on the “not right now” list. During 2016, the field went from being just a bunch of headlines by well-known practitioners to something that people I respected, as well as academia, were taking seriously.

I received my copy of Deep Learning by Goodfellow and friends, and I intend to read it cover to cover. I never had much interest in computer vision because I was not sure we could ever solve vision problems on consumer hardware, and yet here we are. Deep learning is a big part of that, and I feel it may be a natural fit for vision in a way that makes the field more accessible to me and others. Of course, I am also very interested in applications to natural language processing.

5. Learn a New Language

I love Python. I also love R. Both of them do pretty much everything I need to do as a Data Scientist. Ok, I also need SQL and Bash quite a bit as well. As a developer, I do not want to get stuck in my ways and for my brain to start to rust, so I would like to learn a new language. Scala is the one that is really calling my name. I suppose its rise was due to the rise of Spark, but it seems to have idioms that make it very useful for Data Science. I am very conservative when it comes to learning new programming languages because I have been caught up in the fad of new languages that end up being popular for a year or two and then fall out of popularity, even if they are still useful. Think Haskell, Erlang…

Of course, I might also just read Stroustrup’s C++ book cover-to-cover as well as Bruce Eckel’s Java book cover to cover to beef up my C++ and Java respectively as both of those languages are very important for high performance computing (C++), distributed computing (both, but mostly Java) and systems development.

6. Learn about Electronics and Explore

If you look anywhere on Twitter or in the blogosphere, you are sure to read about some gadget somebody has developed using a Raspberry Pi, Arduino, or just plain ole’ circuit boards and components. Developing software can be exciting, but it can only do so much. We need circuit boards, sensors and other components to physically do something. Of course this something can be talking to servers, software, the cloud or other devices. I have very little idea of how electronics work at this basic level and I am looking for a challenge. Right now, my most advanced “gadget” is actually just a tiny computer powered by Raspberry Pi, which serves as a snowcam.

My parents need a new doorbell, one that has a camera that always runs, has decent motion detection, and sends alerts over multiple different channels of communication. We have the Ring, which is proprietary and just does not do a great job of this. Wireless performance is terrible, and the bell only rings in one room. With some electronic components and either an Arduino or Raspberry Pi, I am convinced I can do better, at least for our purposes. I can also access all of the video and alerts on my own server rather than having to pay and deal with the cloud. Another thing… my mother has an elaborate Christmas display in the front yard connected to several timers. The timers are never synced properly, and half the yard will be dark. I want to create a power bar that can be programmed over WiFi or Bluetooth and that keeps itself synced. Such a device already exists, but I want to build it myself.

My fear of electricity, and of either electrocuting myself or wasting money burning out circuit boards, has precluded me from participating in this fascinating field. I plan on going through this book on electronics to get started, and from there we will see!

As for myself, personally…

I only have one personal resolution. One that is doable and that would give me joy: Travel somewhere new just to mountain bike. Who knows where I will end up in 2017, but if it involves me mountain biking somewhere other than Mammoth, Lake Tahoe or Southern California, I will consider that a success. Some places on my wishlist include Moab, UT, Bend, OR, Ashland, OR, Whistler, BC, Downieville (not really a trip though), Crested Butte, CO, Park City, UT and maybe Brevard, NC… or… Scotland?

What are your Data Science resolutions or goals this year? Tell me in the comments, and also if you have any recommendations for me based on what I wrote above, feel free to share them!

It’s Been a While

These past three years have really flown by. It’s time for me to finally get back to my roots and start blogging more, like I did previously.

My last post was about Strata 2013. During this time period, I was taking a break from working full-time to finish a Ph.D. dissertation that I had neglected during my previous two positions. I learned my lesson the hard way: never work externally if you want a Ph.D. in a reasonable amount of time! I quickly got my dissertation from an intro to the first 65 pages or so during this gap. I then received an offer from Facebook. I was ready to move to Silicon Valley and enjoy all the things I had been envious of for so many years: the perks, the culture of innovation and intelligence, and the technology community. This was an opportunity I could not pass up, and the dissertation went on the back-burner for another two years as I spent the majority of my waking hours, both during the week and the weekend… and on holidays… coding in a frenzy. I was looking forward to living in a world where I was entrenched in the technology and data ecosystem. […]

Summary of My First Trip to Strata #strataconf

In this post I am going to summarize some of the things that I learned at Strata Santa Clara 2013. For now, I will only discuss the conference sessions, as I have a much longer post about the tutorial sessions that I am still working on and will post at a later date. I will add to this post as the conference winds down.

The slides for most talks will be available here but not all speakers will share their slides.

This is/was my first trip to Strata, so I was eagerly awaiting participating as an attendee. In the past, I had been put off by the cost and was also concerned that the conference would be an endless advertisement for the conference sponsors and Big Data platforms. I am happy to say that for the most part I was proven wrong. For easier reading, I am summarizing talks by topic rather than giving a laundry-list schedule for a long day, and I also skip sessions that I did not find all that illuminating. I also do not claim 100% accuracy of this text, as the days are very long and my ears and mind can only process so much data when I am context […]

Merry Christmas and Happy Holidays!

Wishing you all a very Merry Christmas, Happy Holidays and Happy New Year!

An update on me. In October, I began working at Riot Games, the developers of League of Legends. It has been an amazing experience and has occupied the majority of my free time as has my dissertation. My New Year’s resolution this year is to dust the cobwebs off this blog!

Have a safe holiday season!

Here in California, I will be having Christmas in the Sand

A New Data Toy -- Unboxing the Raspberry Pi

Last week I received two Raspberry Pis in the mail from AdaFruit and just now have some time to play with them. The Raspberry Pi is a minimal computer system that is about the size of a credit card. In the embedded systems community, the excitement is for obvious reasons, but I strongly believe that such a device can help collect and use data to help us make better decisions because not only is it a computer, but it is small and portable.

For development, Raspberry Pi can connect to a television (or other display) via HDMI or composite video (the “yellow” plug for those still stuck in the 1900s haha). A keyboard, mouse and other devices can be connected via two USB ports. A powered hub can provide support for even more devices. There are also various pins for connecting to a breadboard for analyzing analog signals, for a camera or for an external (or touchscreen) display. An SD card essentially serves as the hard disk and can also host swap space to supplement the onboard memory. The more recent Model B ships with 256MB RAM. Raspberry Pi began shipping in February 2012 and these little guys have been very difficult to get a […]

Adventures at My First JSM (Joint Statistical Meetings) #JSM2012

During the past few decades that I have been in graduate school (no, not literally) I have boycotted JSM on the notion that “I am not a statistician.” Ok, I am a renegade statistician, a statistician by training. JSM 2012 was held in San Diego, CA, one of the best places to spend a week during the summer. This time, I had no excuse not to go, and I figured that in order to get my Ph.D. in Statistics, I have to have been to at least one JSM. […]

OpenPaths and a Progressive Approach to Privacy

OpenPaths is a service that allows users with mobile phones to transmit and store their location. It is an initiative by the New York Times that allows users to use their own data, or to contribute their location data for research projects and perhaps startups that wish to get into the geospatial space. OpenPaths brands itself as “a secure data locker for personal location information.” There is one aspect where OpenPaths is very different from other services like Google Latitude: Only the user has access to his/her own data and it is never shared with anybody else unless the user chooses to do so. Additionally, initiatives that wish to use a user’s location data must be asked personally via email (pictured below), and the user has the ability to deny the request. The data shared with each initiative provides only location, and not other data that may be personally identifiable such as name, email, browser, mobile type, etc. In this sense, OpenPaths has provided a barebones platform for the collection and storage of location information. Google Latitude is similar, but the data stored on Google’s servers is obviously used by other Google services without explicit user permission.

The service is also opt-in, that […]

SIAM Data Mining 2012 Conference

Note: This would have been up a lot sooner but I have been dealing with a bug on and off for pretty much the past month!

From April 26-28 I had the pleasure to attend the SIAM Data Mining conference in Anaheim on the Disneyland Resort grounds. Aside from KDD2011, most of my recent conferences had been more “big data” and “data science” oriented, and I wanted to step away from the hype and just listen to talks that had more substance.

Attending a conference on Disneyland property was quite a bizarre experience. I wanted to get everything I could out of the conference, but the weather was so nice that I also wanted to get everything out of Disneyland as I could. Seeing adults wearing Mickey ears carrying Mickey shaped balloons, and seeing girls dressed up as their favorite Disney princesses screams “fun” rather than “business”, but I managed to make time for both.

The first two days started with a plenary talk from industry or research labs. After a coffee break, there were the usual breakout sessions followed by lunch. During my free 90 minutes, I ran over to Disneyland and California Adventure both days to eat lunch. I managed to […]

My Interview about the Statistics Major

Recently, I participated in an email interview about what being a Statistics major entailed, how I got interested in the field and the future of Statistics. I figured this might be of interest to those that are contemplating majoring in Statistics, or considering a career in Data Science.

Q1: Why did you decide to pursue a major in statistics in college?

A: “When I was a kid, I really enjoyed looking at graphs, plots and maps. My parents and I could not make sense of what was behind the interest. At the same time, I was also heavily interested in education. My mother was a teacher, and the first set of statistics I ever encountered were standardized test scores. I strived to understand what the scores attempted to say about me, and why such scores and tests are so trustworthy. When the stakes increased with the AP and SAT exams, I began reading articles published by the Educational Testing Service and learned a ton about how these tests are constructed to minimize bias, and how scores are comparable across forms. It fascinated me how much science goes into these tests, but at the end of the day they are still just one factor […]

“Hold Only That Pair of 2s?” Studying a Video Poker Hand with R

Whenever I tell people in my family that I study Statistics, one of the first questions I get from laypeople is “do you count cards?” A blank look comes over their face when I say “no.”

Look, if I am at a casino, I am well aware that the odds are against me, so why even try to think that I can use statistics to make money in this way? Although I love numbers and math, the stuff flows through my brain all day long (and night long), every day. If the goal is to enjoy and have fun, I do not want to sit there crunching probability formulas in my head (yes that’s fun, but it is also work). So that leaves me at the video Poker machines enjoying the free drinks. Another positive about video Poker is that $20 can sometimes last a few hours. So it should be no surprise that I do not agree with using Poker to teach probability.  Poker is an extremely superficial way to introduce such a powerful tool and gives the impression that probability is a way to make a quick buck, rather than as an important tool in science and society. The only […]