# Hitting the Big Data Ceiling in R

As a true R fan, I like to believe that R can do anything, no matter how big, how small or how complicated: there is some way to do it in R. I decided to approach my large, sparse matrix problem with this attitude. But here I sit a broken man.

There is no “native” big data support built into R, even if using the 64bit build of R. Before venturing on this endeavor, I consulted with my advisor who reassured me that R uses the state of the art for sparse matrices. That was enough for me.

My Problem

For part of my Masters thesis, I wrote code to extract all of the friends and followers out to network degree 2 to construct a “small-world” snapshot of a user via their relationships. In a graph, nodes and edges grow exponentially as the degree increases. The number of nodes was on the order of 300,000. The number of edges I predict will be around 900,000. The code is still running. This means that a dense matrix would have size $300000^2 \approx 9 \times 10^{10}$. Some of you already know how this story is going to end…

The matrix is very sparse.

$\frac{900000}{9 \times 10^{10}} \approx 10^{-6}$

Very sparse.

The raw data graph.log consists of an edgelist with Twitter usernames separated by a comma. Another file names.row contains the node order from NetworkX.

R has some packages for working with large datasets. Let’s take a look.

Bigmemory

Bigmemory is a high-level interface for creating and manipulating very large C++ matrices in shared memory or on disk. Storing matrices in shared memory allows several instances of R to work on the same dataset, particularly useful in a multicore system or an analytics server that needs to have several instances of R operating on the same data.

Even better, Bigmemory supports filebacked matrices. A filebacked matrix resides on disk and also comes with a descriptor file that essentially tells R where the data is. Once a big.matrix is constructed, R simply retains a pointer to the matrix on disk. Wonderful! So how did I try to use it?

the.names < - read.table("names.row")[,1] dims <- length(the.names) graph <- read.table("graph.log",sep=",",header=FALSE) graph[,1] <- levels(graph[,1])[graph[,1]] graph[,2] <- levels(graph[,2])[graph[,2]] indices <- cbind(match(graph[,1], the.names), match(graph[,2], the.names))   library(bigmemory) X <- filebacked.big.matrix(dims,dims, type="char", init=0, separated=FALSE, backingfile="incidence_matrix.bin", descriptor="incidence_matrix.desc") for (row in 1:nrow(indices)) { X[indices[row,1], indices[row,2]] <- '1' }

The good news is that it worked. It would be nice if the constructor though had some type of progress indicator because the filesystem does not report the filesize of the matrix while it is being written on disk. As a workaround, I used separated=TRUE so that each column was written to its own file and I could see which files have been created. This worked, but it caused the filesystem to choke due to the number of files created. Trying to do anything yielded the following nightmare:

Argument list too long.

So when I had to delete this mess of files, I had to use some shell magic.

find . -type f -name 'incidence_matrix.bin_column_*' -exec rm {} \;

Now that I knew how long it took to write all 300,000 columns/files to disk (about 25 mins), I knew it would take about the same amount of time with separated=FALSE.

The bad news? The file was 117GB! Of course, this is because Bigmemory apparently stores matrices as dense matrices. There does not seem to be a sparse option with Bigmemory.

ff

Then there’s ff. What does it stand for? Fun files? Fancy files? Flat files? ff and its sister package bit have very fine-grained controls for matrices on disk, but unfortunately, R was not up to my challenge.

library(ff) ffMatrix < - ff(vmode="logical", dim=c(300000,300000))

yielded

Error in if (length < 0 || length > .Machine$integer.max) stop("length must be between 1 and .Machine$integer.max") :
missing value where TRUE/FALSE needed
In addition: Warning message:
In ff(vmode = "logical", dim = c(3e+05, 3e+05)) :
NAs introduced by coercion

So what did I do wrong? Well, nothing. What did the package do wrong? There must be something wrong to yield such a type error, right? Again, nothing is wrong. This is an R problem. I have a machine with 64bit CPUs, a 64bit capable OS (Snow Leopard) running in 64bit mode (hold keys 6 4 at bootup), running 64bit R, 64bit ff and all prerequisite libraries built for 64bit. So, what gives?

The missing piece of the puzzle is that although R is 64bit capable, the ramifications mostly involve memory and the amount of memory that can be addressed. The kicker is that R does not have an int64, or long long type. R will not recognize integers that are larger than .Machine\$max.integer for anything. Thanks to Hadley Wickham for pointing this out.

sparseMatrix() in Matrix

The Matrix package contains a handy function sparseMatrix that takes in a vector of row coordinates, a vector of column coordinates and a vector of values for location (row, col), and some other arguments. Constructing the matrix was fast, but there was a major problem: there is no way to modify the sparsity structure of the matrix. The best I could do for the full matrix (much bigger than the one used for bigmemory and ff) was to create sparse matrices based on one million elements at a time, and accumulate by adding the individual matrices together. This worked fine for a while, but the system started to drastically slow down around 65 million entries. Thanks to David Smith for suggesting sparseMatrix as it looks powerful for smaller problems.

So, What to Do?

The common response is, “just read in what you need.” That’s fine, but not all matrix factorizations can be treated this way, and all of these algorithms would need to be implemented by the user as nothing convenient seems to exist. If you know of any piecemeal filewise packages, comment!

If what I just mentioned is not an option, I suggest using another language for the time being. Python‘s NumPy and SciPy modules are awesome with huge, sparse matrices, but still some issues exist when using a matrix of my size requirement. Clojure has become another good option due to its pure functional credentials. Right now, I just sound like a Clojure fanboy because I really have no explanation as to why Clojure may be better than R and Python in this regard, but it seems to be better for parallel operations necessary for large data.

The conclusion? R is great, but really falls flat with large and sparse matrices. If SciPy can do it, R will do it…eventually.

### 22 comments to Hitting the Big Data Ceiling in R

• Hi Ryan,

Regarding progress bars (although this is probably the least of your problems), I suggest going through this post:
http://ryouready.wordpress.com/2009/03/16/r-monitor-function-progress-with-a-progress-bar/
Very good tips there.

The second suggestion I have for you is to see if you could (in your situation) use the:
?raw
Data types (instead of boolean or integer), since it is the data type which takes up the least amount of space.

A third suggestion is for you to have a look (and report? π ) at the SparseM package:
http://cran.r-project.org/web/packages/SparseM/
I remember using it (combined with raw) to try and handle the algorithms I tried running on the netflix-prize dataset.

And last suggestion is regarding blogging and wordpress, consider installing the following two plugins:
1) subscribe to comments
2) wp-syntax for R

Best wishes,
Tal

• Thanks for the suggestions. The documentation for SparseM seemed to last be updated in 2003, but I could be wrong. I will check it out.

1) You can subscribe to comments by clicking on the link at the top right of the page. I will try to find a way to make it more obvious.
2) Thanks, will do. I am using SyntaxHighlighter which supports R, but couldn’t get it to recognize the R brush.

• Hi Ryan,

Regarding SparseM, the latest release was two months ago, but the original doc is from 2003. So I think it is worth a look.

Regarding subscribe to comments, it allows people subscribing to only the comments to this post, and via e-mail.

The link at the top does it through RSS, and to all the comments.

Cheers π

Tal

• Done, thanks! I will check out SparseM as well. Apparently there is another new package called “slam” that was recommended to me by Dirk Eddelbuettel.

• Petr

Try to use package igraph, I am using that package for dealing with data 300,000-1,000,000 nodes connected by 3E6 to 300E6 on 64bit system with 16GB RAM. It does not store data on disk however.

• I graph is a great package. Are you referring to the R wrapper or the main C package? I would imagine the C package has support for file-backed graphs. In R, definitely not, at least not currently.

• Zach

The graph you describe above (9,000,000 edges) only needs about 3.6 MB of RAMβ you won’t need a file backed graph!

• Do you really need to compute on the adjacency matrix? This is a very inefficient way of storing networks.

I’m backing Petr’s comment here. Use ‘igraph’, its quite efficient for big networks. The ‘network’ package is following closely on efficiency too.

~michal

• Thanks for your comment! I agree, adjacency matrix is inefficient. For the metrics I am trying to compute (Jaccard), it seems easier to compute using the matrix. I’m not sure how I would do it using graph traversal except for V passes of DFS or BFS which would be O(V^2) which is about the same as computing on the matrix O(V^2). Ideas? π

• Why not just use Pig? (http://hadoop.apache.org/pig/) This is the type of thing it is designed for. (Note: it will spill to disk when memory cap is reached but keep going…). See http://github.com/Ganglion/Sounder for examples.

–jacob (@thedatachef)

• Pig is definitely on my todo list of things to master. I am planning on replicating my research using fancier tools (Pig, Mahout etc.) this summer.

• I was doing cellular automata iterations with matrices a few months ago, and quite honestly fell in love with the R-C interface. It took me 2 days to get running, was conceptually simple, provided tremendous speedup. I evangelize “C in R” to everyone who’s hitting the time-resource roof. I’m not sure exactly how this will apply to memory, esp. w.r.t. 64btt, but you have much more control of when and how memory is allocated, and have much thinner overhead.

If your full matrix will actually fit in memory as a C object, then you may be able to do all your processing there and return the information via indices that can be passed to sparseMatrix.

This seems more of a learning exercise than a “good strategy” for networks. Still, it should make for an entertaining exercise!

• Yes this would! Good for a blog post as well. I definitely want to use the C-R interface more. For this particular project, the matrix did not fit in RAM π

• One other thought: Relational databases are great at sparse matrices (assuming you store them intelligently).

You could use RMySQL (or some such R DB interface) to connect R to an engine designed to handle so much data.

An interesting related talk available at here

• Using a RDBMS completely slipped my mind. This would be very useful and RMySQL is a great interface to MySQL. Hmm. I think I might try that this summer π

• Miguel

I found a master thesis that talks about how to avoid the overhead of RMySQL either embedding R into MySQL or the other way around. The C/C++ bridge of R are key for this:

http://repository.lib.ncsu.edu/ir/handle/1840.16/2781

• Thanks! This is great news and very helpful.
The overhead of calling MySQL from R vs. Python is very painful.

• Chris S

Mahout is where its at!

• Zach

What do you mean by “there is no way to modify the sparsity structure of the matrix”?

• Zach

Ok, I’m 5 years or so late to the party, but this post shows up prominently on google (at least when I search for “R file-backed data.tables”), so I thought I’d throw in my 2 cents. The following snippet of code creates a 300,000 x 300,000 sparse matrix in R and calculates the Jacquard similarity between every pair of rows in less than 3 seconds (including the time to create the matrix):

#Define the problem
t1 <- Sys.time()
set.seed(1)
n_nodes <- 300000L
n_edges <- 900000L
nodes <- 1L:n_nodes
edge_node_1 <- sample(nodes, n_edges, replace=TRUE)
edge_node_2 <- sample(nodes, n_edges, replace=TRUE)

#Sparse matrix
library(Matrix)
M <- sparseMatrix(
i = edge_node_1,
j = edge_node_2
)

#Row-wise Jaccard similarity
#http://stats.stackexchange.com/a/89947/2817
jaccard 0, arr.ind=TRUE)
b = rowSums(m)
Aim = A[im]
J = sparseMatrix(
i = im[,1],
j = im[,2],
x = Aim / (b[im[,1]] + b[im[,2]] – Aim),
dims = dim(A)
)
return(J)
}
J dim(M)
[1] 300000 300000
> dim(J)
[1] 300000 300000
> Sys.time() – t1
Time difference of 2.132279 secs

I know I’m from the future and all, but my spaceship computer has a 2.2 GHz Intel processor and 16 GB of RAM, which isn’t too insane even by 2010 standards.

More generally, I’ve seen a lot of people fall into this same trap: converting high-dimensional sparse matrices to a dense representation is guaranteed to bog down any system. I’ve seen this kind of naive sparse-to-dense conversion bring down “big data” clusters with terabytes of storage and 100’s of gigabytes of RAM (let alone my laptop with a 256GB hard drive and 16GB of RAM).

In the post above, you’re trying to create a 260 GB object and then do math on it, which is really scary even by future-man standards.

*Keep your sparse matrices sparse*. You’ll have to do some work to re-think your analysis in terms of sparse matrix operations (or better yet graph operations), but this is pretty much the only way sparse (graph) problems ever scale, no matter what “big data” technology you’re using.

• skan

Hello.

What if we need to do something more complex with our data, such as fitting a mixed-effects model regression for a large dataset? (also called repeated measures or hierarchical)
I mean a dataset 10 or 100 times larger than the computer’s memory.
R is not able to work with this data.
And other tools such as Spark don’t have advanced statistical functions able to do this kind of analysis.