Hitting the Big Data Ceiling in R

As a true R fan, I like to believe that R can do anything, no matter how big, how small or how complicated: there is some way to do it in R. I decided to approach my large, sparse matrix problem with this attitude. But here I sit a broken man.

There is no “native” big data support built into R, even if using the 64bit build of R. Before venturing on this endeavor, I consulted with my advisor who reassured me that R uses the state of the art for sparse matrices. That was enough for me.

My Problem

For part of my Masters thesis, I wrote code to extract all of the friends and followers out to network degree 2 to construct a “small-world” snapshot of a user via their relationships. In a graph, nodes and edges grow exponentially as the degree increases. The number of nodes was on the order of 300,000. The number of edges I predict will be around 900,000. The code is still running. This means that a dense matrix would have size . Some of you already know how this story is going to end…

The matrix is very sparse.

Very sparse.

The raw data graph.log consists of an […]

Be Careful Searching Python Dictionaries!

For my talk on High Performance Computing in R (which I had to reschedule due to a nasty stomach bug), I used Wikipedia linking data, an adjacency list of articles and the articles to which they link. This data was linked from DataWrangling and was originally created by Henry Haselgrove. The dataset is small on disk, but I needed a dataset that was huge, very huge. So, without a simple option off the top of my head, I took this data and expanded a subset of it into an incidence matrix, occupying 2GB in RAM. Perfect!

The subsetting was a bit of a problem because I had to maintain the links within the subgraph induced by the same. This required me to search dictionary objects for keys. This is where things went awry. Usually, I try to be efficient as possible. Since I was just producing a little example, and would never ever run this code otherwise, I wasn’t as careful.

The data were presented as a follows

1. First, I looked at from, and if from was in the chosen subset, keep it and proceed to 2, otherwise, throw it out.
2. Then, take the to nodes and only keep the nodes that […]