Some New Year Resolutions for (this) Data Scientist in 2017

I’ve never been very big on New Year’s resolutions. I’ve tried them in the past, and while they are nice to think about, they are always overly vague, difficult to accomplish in a year, trite, or just don’t get done (or attempted). This year I decided to try something different instead of just not making resolutions at all. I set out some professional goals for myself as a Data Scientist. So without further ado…

1. Don’t Complain about It, Fix It: Contribute to Open Source Software (More)

Open source software is only as good as its community and/or developer(s). Developers are human and typically cannot manage all bugs and feature requests themselves. My goal is to routinely contribute back to the community either with new features, or by fixing bugs that I discover. This not only helps the community at large, but also helps me as a software engineer. There is no better way to become an even better engineer than by wading through someone else’s code. While this is something I did all day every day at my $DAYJOB, I do it less while on my sabbatical.

Some of the projects I use the most and that I hope to contribute to are scikit-learn and pandas, particularly parts for higher performance computing such as out-of-core processing, and batched processing. These “tricks” are critical to working with huge datasets on small machines, particularly for students that may not wish to pay for Amazon EC2, Azure etc.
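To make the batching idea concrete, here is a minimal sketch using only the Python standard library. It is not how scikit-learn or pandas implement out-of-core processing; the helper names `iter_batches` and `streaming_mean` and the sample data are made up for illustration. The point is just that a statistic can be computed while holding one batch of rows in memory at a time.

```python
import csv
import io
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield lists of at most batch_size items from any iterable."""
    rows = iter(rows)
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            return
        yield batch


def streaming_mean(lines, column, batch_size=1000):
    """Mean of one numeric CSV column, keeping only one batch in memory."""
    reader = csv.DictReader(lines)
    count = 0
    total = 0.0
    for batch in iter_batches(reader, batch_size):
        values = [float(row[column]) for row in batch]
        count += len(values)
        total += sum(values)
    return total / count if count else float("nan")


# Hypothetical stand-in for a file too large for memory; an open file
# handle (opened with newline="") can be passed in the same way.
sample = io.StringIO("x,y\n1,10\n2,20\n3,30\n4,40\n")
print(streaming_mean(sample, "y", batch_size=2))  # prints 25.0
```

The same pattern is what the `chunksize` argument of `pandas.read_csv` and the `partial_fit` method on some scikit-learn estimators are built around: read a piece, update a running result, discard the piece.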

2. Focus on Sharing, Not Just Doing

One of the qualities of my Ph.D. advisor that I admire the most is his dedication to sharing pretty much everything that he does, even if it isn’t complete. Anytime there is a new medium for writing and sharing technical content, he adopts it for his work. Through him, I learned about GitBooks and RPubs. There is also GitHub Pages, ways to share Jupyter Notebooks, R Notebooks, etc. It is hard to keep up with all these new ways of sharing work, but the takeaway is that I need to get better at it.

When I am asked, I typically recommend that people only post completed projects that they are proud of on their GitHub. I am thinking of using a secondary GitHub account for exploration. I often start a project, get distracted and never complete it. But many times I learn some interesting techniques or hacks that somebody else could use, and I do not have time to blog about them. Right now, all of that knowledge goes to waste. If I share this work, somebody else can find these gems, even if the project itself is not complete(d). In academia there is the mantra “publish or perish.” While my academic pursuits will end at a Ph.D., some teaching and maybe a conference talk or two, I want to start taking this to heart in the technical world: give talks, contribute to meetups, blog more, participate on StackOverflow, Quora, Gitter, IRC… and maybe Slack…

One other aspect of sharing work I have done is that it encourages accountability. If I post unfinished work, I may be more likely to finish it, and if there is enough content to remind me of what I did and why I was doing it, that would encourage even more success.

This concept of sharing also applies to my personal at-home habit of writing one-off scripts. In software engineering we focus on reusable code. I want to start taking these one-off scripts and turning them into scripts with at least a command-line interface. This of course assumes that the script has some use to someone other than myself. It is my job to try to make it so, all in the name of sharing and contributing. It is not always about the goal. While some of my advisor’s shared manuscripts and code snippets are not useful to me in what they do, I have learned a lot about coding techniques and new algorithms, and that makes sharing content worth it.
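As a rough sketch of what that conversion can look like, the example below wraps a throwaway word-counting snippet in an `argparse` command-line interface. The script's purpose, the function names (`count_words`, `build_parser`, `main`) and the `--encoding` option are all invented for illustration; the pattern is what matters: keep the logic in a plain function and let a thin `main` handle arguments.

```python
import argparse


def count_words(text):
    """The original one-off logic, kept as a plain importable function."""
    return len(text.split())


def build_parser():
    """Describe the command-line interface in one place."""
    parser = argparse.ArgumentParser(description="Count words in text files.")
    parser.add_argument("paths", nargs="+", help="one or more files to read")
    parser.add_argument("--encoding", default="utf-8", help="file encoding")
    return parser


def main(argv=None):
    """Parse arguments (sys.argv[1:] when argv is None) and print one line per file."""
    args = build_parser().parse_args(argv)
    for path in args.paths:
        with open(path, encoding=args.encoding) as handle:
            print(f"{path}\t{count_words(handle.read())}")
    return 0
```

Ending the file with `if __name__ == "__main__": raise SystemExit(main())` makes it runnable as, say, `python wordcount.py notes.txt` (file names hypothetical), and `--help` comes for free. Because `main` accepts an explicit argument list, the script is also easy to call from a test or another program.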

3. Create a Usable Web Service, Running on a Real Server (not localhost)

People say that communication is a big part of being a Data Scientist. I believe this depends on the type of Data Scientist role. A Data Science Engineer focuses more of his/her time on accessibility… developing data products or tools that allow people (or machines) to make decisions or present data in a way that a human can easily understand with interactive graphics and other forms of user interaction. Of course this is a special form of communication.

I’ve built machine learning systems, but at the time I did not appreciate the full lifecycle of the system. The system needs to sell itself. Not only should it implement a model in a scalable way, it also needs to adapt to new data (online learning and tuning). At the time, I thought this was a pain, but I now realize that this is what makes a system speak to the user: a full feedback loop.

Going with that theme, web apps, web services, whatever, are a very useful way of sharing insights and presenting data to the client. I’ve worked almost entirely on the backend, and thus I would like to spend time turning some of my projects into front-end web apps using JavaScript (D3, Leaflet, React) and perhaps some of the new features of Python 3 such as asyncio. Most of my experimentation has been on localhost. I want to create a real app on Amazon EC2 or running on this host… just somewhere other than 127.0.0.1.
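For what it is worth, the practical difference between "localhost only" and "reachable from elsewhere" often comes down to the address the server binds to. The sketch below uses only Python's standard `http.server` module; the names `build_payload`, `InsightHandler` and `make_server`, the port number and the placeholder JSON are assumptions for illustration, and a real deployment would also need a firewall or security-group rule plus a production-grade server in front.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def build_payload():
    """Placeholder for whatever result the service would actually expose."""
    return {"status": "ok", "message": "placeholder result"}


class InsightHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serialize the payload and send it back as JSON.
        body = json.dumps(build_payload()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(host="0.0.0.0", port=8000):
    """Bind to all interfaces by default; "127.0.0.1" would accept local connections only."""
    return HTTPServer((host, port), InsightHandler)
```

Calling `make_server().serve_forever()` on the remote machine starts the service; the same code with `host="127.0.0.1"` is the localhost-only version.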

4. Open My Mind (More) to Neural Networks and Deep Learning

Naturally, I do a lot of machine learning. Over the past 2 to 3 years, Deep Learning has been named the solution to every problem under the sun and I am sure it can be used to find that missing sock. I knew about neural networks but it wasn’t the branch of machine learning I focused much on, so I put the whole deep learning thing on the “not right now” list. During 2016, the field became more than just a bunch of headlines by well-known practitioners: I also saw what people I respect, as well as academia, thought on the matter.

I received my copy of Deep Learning by Goodfellow and friends and I intend to read it cover to cover. I never had an interest in computer vision because I was not sure if we could ever solve vision problems on consumer hardware and yet here we are. While deep learning is part of that, I feel that it may be a more natural fit for vision and it would be more accessible to me and others. Of course, I am also very interested in applications to natural language processing.


5. Learn a New Language


I love Python. I also love R. Both of them do pretty much everything I need to do as a Data Scientist. Ok, I also need SQL and Bash quite a bit as well. As a developer, I do not want to get stuck in my ways and for my brain to start to rust so I would like to learn a new language. Scala is the one that is really calling my name. I suppose its rise was due to the rise of Spark, but it seems to have idioms that make it very useful for Data Science. I am very conservative when it comes to learning new programming languages because I have been caught up in the fad of new languages that end up being popular for a year or two and then falling out of popularity, even if they are still useful. Think Haskell, Erlang…

Of course, I might also just read Stroustrup’s C++ book and Bruce Eckel’s Java book cover to cover to beef up my C++ and Java respectively, as both of those languages are very important for high performance computing (C++), distributed computing (both, but mostly Java) and systems development.


6. Learn about Electronics and Explore


If you look anywhere on Twitter or in the blogosphere, you are sure to read about some gadget somebody has developed using a Raspberry Pi, Arduino, or just plain ole’ circuit boards and components. Developing software can be exciting, but it can only do so much. We need circuit boards, sensors and other components to physically do something. Of course this something can be talking to servers, software, the cloud or other devices. I have very little idea of how electronics work at this basic level and I am looking for a challenge. Right now, my most advanced “gadget” is actually just a tiny computer powered by Raspberry Pi, which serves as a snowcam.

My parents need a new doorbell, one that has a camera that always runs, has decent motion detection and sends alerts over multiple different channels of communication. We have the Ring, which is proprietary and just does not do a great job of this. Wireless performance is terrible, and the bell only rings in one room. With some electronic components and either an Arduino or Raspberry Pi, I am convinced I can do better, at least for our purpose. I can also access all of the video and alerts on my own server rather than having to pay and deal with the cloud. Another thing… my mother has an elaborate Christmas display in the front yard connected to several timers. The timers are never synched properly and half the yard will be dark. I want to create a power bar that can be programmed over Wi-Fi or Bluetooth and that keeps itself synched. Such a device already exists, but I want to do it myself.

My fear of electricity, and of either electrocuting myself or wasting money burning out circuit boards, has precluded me from participating in this fascinating field. I plan on going through this book on electronics to get me started, and from there we will see!


As for myself, personally…

I only have one personal resolution. One that is doable and that would give me joy: Travel somewhere new just to mountain bike. Who knows where I will end up in 2017, but if it involves me mountain biking somewhere other than Mammoth, Lake Tahoe or Southern California, I will consider that a success. Some places on my wishlist include Moab, UT, Bend, OR, Ashland, OR, Whistler, BC, Downieville (not really a trip though), Crested Butte, CO, Park City, UT and maybe Brevard, NC… or… Scotland?

What are your Data Science resolutions or goals this year? Tell me in the comments, and also if you have any recommendations for me based on what I wrote above, feel free to share them!

