Big Data Camp 2011 #BigDataCamp

It has been a while since I have been to Silicon Valley, but Hadoop Summit gave me the opportunity to go. To make the most of the long trip, I also decided to check out BigDataCamp, held the night before from 5:30 to 10pm. Although the weather was as predicted, I was not prepared for a deluge of pouring rain at the end of June. The weather is one of the things preventing me from moving up to Silicon Valley.

The food/drinks/networking event must have been amazing because it was very difficult to get everyone to come to the main room to start the event! We started with a series of lightning talks from some familiar names and some unfamiliar ones.

Chris Wensel, the developer of Cascading, is also the founder of Concurrent, Inc. Cascading is an alternative API for MapReduce written in Java. With Cascading, developers can chain multiple MapReduce jobs to form an ad hoc workflow. Cascading adds a built-in planner to manage jobs. Cascading usually implies Hadoop, but it can run on other platforms, including EMC Greenplum and the new MapR project. RazorFish and BestBuy use Cascading for behavioral targeting. Flightcaster uses a domain specific language (DSL) […]

Google -- Is Search-by-Multimedia on the Way?

Recently, I have been thinking about alternate ways of specifying search queries other than with text. A couple of weeks ago I came across a piece of music that I could not identify. I thought it would be a huge win for a search engine to allow me to upload this piece, and it would present me with matches, or near matches to other pieces that sound similar, or have similar characteristics. Some services already exist. Shazam allows a user to place a microphone near playing music and it will identify the artist and song. Some uses of search-by-sound:

Music identification (“solved” – Shazam)
Music personalization and recommendation (“solved” – Pandora)
Identification of the source of a sound (e.g. a species of bird, a musical instrument, an inanimate object)
MP3 and media file search
Finding material that violates copyright

As our motivating example, suppose we find some really cool graphic on the web and we want to know where it likely originated (e.g. art, a meme). In such a search engine, we could upload the graphic and get results containing the exact image, or images that are very similar such as variations of the image (crop, resize, borders, different effects), modifications of the image (consider Obama-izing […]

Want to Build a Research Server?

I am usually pretty reserved with cash, but after working full-time for six months, I finally decided to spend some of my money on building a new research development server. This process was long overdue and the reason it took me so long to commit to this project was all of the new technology developed since building my last server. This “new technology” can be pretty confusing unless one specializes in computer architecture. I want to share what I have learned throughout this process, while giving some background. These are only my opinions, and I may be wrong on some things as I am not a hardware expert. I encourage you to read and learn more on your own.

The CPU/Processor

If you are reading this article, I probably do not need to explain what the CPU/processor does. For high performance computing, you will want to get a CPU that is very “fast” and also has multiple cores. The definition of the word “fast” is in the eyes of the beholder and typically refers to more than just clock speed (GHz). In the constant war between AMD and Intel, I stick with Intel. AMD processors are powerful, but they seem to have […]

Review of 2011 Data Scientist Summit

Some time over the past 6 weeks I randomly saw a tweet announcing the “Data Scientist Summit” and shortly below it I saw that it would be held in Las Vegas at the Venetian. Being a Data Scientist myself is reason enough not to pass up this opportunity, but Vegas definitely sweetens the deal! On Wednesday I woke up at 6am to embark on the 5.5-hour voyage to Las Vegas.

The Pre-Party

The Venetian and all nearby hotels were booked, so I ended up at the Aria, which was a new experience. The hotel is beautiful and very ritzy. I had heard that the rooms were very technologically advanced, but I wasn’t prepared for the recorded welcome message, music and automatic shades opening upon entry to the room. The Aria is a geek’s paradise. Everything is computerized. Key cards are “waved” rather than swiped, and lights are turned on/off and dimmed by use case (“sleep”, “read” etc.) rather than manually. There are no paper “Do Not Disturb” signs; rather, a switch on the wall (or via TV) toggles an indicator light outside the door. And the best part… Internet is FREE!

The hydrangeas are real!
Work […]

EC2 Trials and Tribulations, Part 1 (Web Crawling)

Elastic Compute Cloud (EC2) is a service provided by Amazon Web Services that allows users to leverage computing power without the need to build and maintain servers, or spend money on special hardware. The idea is simple: the user “boots” up one or more machines and then accesses those machines as if logged into any other machine remotely. I used EC2 and Elastic MapReduce extensively for my M.S. thesis last spring, but mainly used its large memory capabilities rather than its potential for explicit parallelism.

Recently, I ran a crawling job on EC2 using a parallel crawler I wrote in Python with twill. Using EC2 poses its own challenges. Using parallel code poses more challenges. Combining these two facts with the fact that crawling is I/O bound can create some more interesting challenges. If you have taken a course in operating systems, you have heard this stuff over and over again. So have I, but I am stubborn. I tend to learn lessons from experience, and this was no exception. Through this series of posts, I want to point out difficulties and “gotchas” that are important to keep in mind when using EC2, and in this post, with […]
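
Because crawling is I/O bound, most of the wall-clock time goes to waiting on the network, so a pool of threads can overlap those waits. What follows is only a minimal sketch of that idea using Python's standard library, not the twill-based crawler described here; the names `fetch_url` and `crawl`, the timeout, and the worker count are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen


def fetch_url(url, timeout=10):
    """Download one page and return its raw bytes (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def crawl(urls, fetch=fetch_url, max_workers=8):
    """Fetch many URLs concurrently with a thread pool.

    Returns (pages, errors): pages maps url -> content and errors maps
    url -> the exception raised, so one bad host does not kill the job.
    """
    pages, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything up front; threads spend their time blocked on I/O.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                pages[url] = future.result()
            except Exception as exc:  # timeouts, DNS failures, bad URLs
                errors[url] = exc
    return pages, errors
```

Passing the fetch function as a parameter keeps the pool logic separate from the download logic, so the same loop can be exercised with a stub before it is pointed at real hosts. Recording failures per URL rather than raising matters on a rented machine, where a crashed job still costs money.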

Location Tracking on Android, too!

This week it was revealed that the iPhone stores users’ locations, and this immediately caused a huge firestorm of commentary by tech geeks, panic among privacy advocates, and delight among data geeks like myself. Even better/worse, it seems that the iPhone caches location traces long-term, possibly back to the date the phone was activated.

I ditched my iPhone this past December (good riddance) in favor of the Droid X (Android). I figured, on such an open source OS, Google must be doing the same thing. After surfing through Hacker News, it turns out I was right.

Compared to the iPhone though, getting the data on an Android phone is not simple.

The data is stored in two files, cache.cell and cache.wifi in the directory /data/data/com.google.android.location/files.
First, the user cannot browse this directory by attaching the phone to a computer. I installed an SSH daemon, QuickSSHD, to allow remote access into my phone.
Second, it is not possible to access this directory without getting a Permission denied error, even if logged in as “root”, as Google has not made this directory readable.
Finally, for those (like me) who are still determined to crack this nut, you will need to root your phone. This makes the “root” user a real […]

Instructions for Installing 64bit SciPy, Python 2.7.1 on MacOS X 10.6

NumPy and SciPy are packages for numerical computation and scientific computing in Python.

One wrinkle with NumPy/SciPy that needs to be ironed out is the difficulty of installation on certain OSes and, particularly, architectures. The SciPy SuperPack has done a good job of taking care of this issue, but it has not yet been updated for 2.7.1, and manually hacking away at its script has not worked for me.

I cannot take credit for the instructions in this article. A brave warrior, Jeremy Conlin, somehow managed to figure out how to install 64-bit NumPy and SciPy, with 64-bit Python 2.7.1 on Snow Leopard; he posted the directions to the SciPy User mailing list on February 24. I followed the directions, and miraculously they worked. I am reproducing them here for Google bait.

Install Python 2.7.1

1. Download the universal Mac 2.7.1 installer here (Python 2.7.1 Mac OS X 64-bit/32-bit x86-64/i386 Installer). Typically, Python will be installed to /Library/Frameworks/Python.framework/Versions/2.7/, but may be in other locations.

2. Verify that your new version of Python is 64-bit enabled. Note: Python installations typically do not get toggled as the default Python, so find the location of the 2.7.1 Python executable. On my machine, it is /Library/Frameworks/Python.framework/Versions/2.7/bin/python. python2.7 should also work.

Load […]
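
One way to perform the 64-bit check from step 2 is sketched below. It is not necessarily the check used in the original directions; it relies only on `sys.maxsize` and `struct.calcsize`, both of which exist in Python 2.7, and the function names are made up for illustration.

```python
import platform
import struct
import sys


def pointer_bits():
    """Size of a C pointer, in bits, for the running interpreter."""
    return struct.calcsize("P") * 8


def is_64bit():
    """True when the interpreter is running as a 64-bit process."""
    # sys.maxsize reflects the running process, which matters for
    # universal (64-bit/32-bit) builds that can start in either mode.
    return sys.maxsize > 2 ** 32


if __name__ == "__main__":
    print("executable: " + sys.executable)
    print("version: " + platform.python_version())
    print("pointer size (bits): " + str(pointer_bits()))
    print("64-bit: " + str(is_64bit()))
```

Save it under any name (say, a hypothetical check64.py) and run it with the full path to the 2.7.1 executable from step 2, so that the interpreter being tested is the new one and not the system default.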

My First Few Days with RStudio

As most readers are probably aware, the free IDE for R, called RStudio, was recently released for general use and it immediately made huge waves within the R community. IDE stands for Integrated Development Environment. IDEs typically provide a rich set of tools for developing in some target language. For standard programming languages like C++ (Visual Studio) and Java (Eclipse or NetBeans), IDEs contain:

an editor tailored to the target language. The editor typically has tab/auto-complete for variable names, functions and class methods and properties and also features syntax highlighting.
a multiple document interface (MDI) where there may be several documents opened in different tabs.
a window that interacts with the compiler, or a panel containing the console to the language, a la MATLAB, and even vanilla R’s GUI.
a debugger.
a file browser and language reference.

RStudio plays to this analogy very well, and makes modifications where appropriate. RStudio provides many features that are lacking in the standard R GUI, and improves on features that do not work properly in the Windows R GUI. Over the past few days, I have been doing all of my R analysis within RStudio, briefly with the Desktop version, and mostly with the Server version. I will discuss mostly the server version […]

Web Mining Pitfalls

Programming defensively requires knowing the input that your code should be able to handle. Typically, the programmer may be intimately familiar with the type of data that his/her code will encounter and can perform checks and catch exceptions with respect to the format of the data.

Web mining requires a lot more sophistication. The programmer in many cases does not know the full formatting of the data published on a web site. Additionally, this format may change over time. There are certain standards that do apply to certain types of data on the web, but one cannot rely on web developers to follow these standards. For example, the RSS Advisory Board developed a convention for the formatting of web pages so that browsers can automatically discover the links to the site’s RSS feeds. I have found in my research that approximately 95% of my sample actually implemented this convention. Not bad, but not perfect.
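
To make the convention concrete: a page advertises its feed with a `link` element in its head whose `rel` is `alternate` and whose `type` is `application/rss+xml`. The sketch below shows how a crawler might look for it using Python's standard library. It is not the code used in my research; `FeedLinkParser` and `discover_feeds` are hypothetical names, and accepting the Atom MIME type is an extra assumption beyond the RSS convention itself.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# application/rss+xml is the type named by the RSS convention; Atom is
# accepted here as a common extension (an assumption of this sketch).
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collect feed URLs advertised through <link rel="alternate"> tags."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attribute values may be None (e.g. <link rel>), so normalize them.
        attr = {name: (value or "") for name, value in attrs}
        rel = attr.get("rel", "").lower().split()
        mime = attr.get("type", "").strip().lower()
        href = attr.get("href", "").strip()
        if "alternate" in rel and mime in FEED_TYPES and href:
            # Resolve relative hrefs against the page's own URL.
            self.feeds.append(urljoin(self.base_url, href))


def discover_feeds(html, base_url=""):
    """Return the feed URLs a page advertises, in document order."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds
```

An empty list is the signal that the page did not follow the convention, which is exactly the case that the remaining 5% of my sample falls into.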

Always Have a Plan B, C, D, …

One might say that 95% is good enough. I am a bit obsessive when it comes to data quality, so I wanted to extract a feed for 99% of the sites I had on my list. Also, I am always leery […]

40 Fascinating Blogs for the Ultimate Statistics Geek!

I am happy to report that ByteMining is listed on “40 Fascinating Blogs for the Ultimate Statistics Geek”!

Some of the ones that I frequently read, or are written by Twitter friends/followers (in no particular order):

R-bloggers, an aggregator site containing blog posts tagged as posts about R. High quality content.
Statistical modeling, causal inference and social science. This one is a no-brainer, as it is the blog for Andrew Gelman’s group.
FlowingData by Nathan Yau (@flowingdata), fellow Statistics Ph.D. student at UCLA. Focuses on the data and information visualization side of Data Science.
dataists by Hilary Mason (@hmason, bit.ly), Vince Buffalo (@vsbuffalo, UC Davis), Drew Conway (@drewconway, NYU), Mike Dewar (@mikedewar, Columbia), John Myles White (@johnmyleswhite, Princeton) and others. A new blog on several aspects of Data Science including Data Mining, visualization and uses of Statistics in current events. Heavy use of R and ggplot2.
Revolutions by Revolution Analytics provides a variety of content around R, Data Science and Statistics in general.
FiveThirtyEight by Nate Silver shares sophisticated modeling and analysis of elections and government happenings. It is in a different realm, as it attracts political news junkies (and the occasional extremist) rather than just Statisticians.
LoveStats by Annie Pettit, Ph.D. (@LoveStats) discusses Statistics as used in Social […]