What if we need to do something more complex with our data, such as fitting a mixed-effects regression model (also called a repeated-measures or hierarchical model) to a large dataset?

I mean a dataset 10 or 100 times larger than the computer’s memory.

R cannot work with data of this size.

And other tools, such as Spark, lack the advanced statistical functions needed for this kind of analysis.

I posted a Stack Overflow question about using the Cloud9 process described here, because I can't get it to work. If anybody who has used it could look at the post and suggest what I'm doing wrong, that would be much appreciated! Thanks. Post here:

http://stackoverflow.com/questions/35760657/extracting-wikipedia-article-text-with-cloud9-and-hadoop

Seth

I have a blog centered on the same ideas you discuss and would really like to have you share some stories/information. I know my visitors would value your work.

If you are even remotely interested, feel free to shoot me an email.

My reasoning is that the casino is giving you some extra incentive to hold a pair of 2s or 3s.

Giving you that extra thing to bet on is, to me, almost like giving you an option to buy insurance.

So they want you to keep a hand that is less likely to win.

“dump the 22 and keep the J”

to my friends. I'm confident they'll benefit from this website.

# Define the problem
t1 <- Sys.time()
set.seed(1)
n_nodes <- 300000L
n_edges <- 900000L
nodes <- 1L:n_nodes
edge_node_1 <- sample(nodes, n_edges, replace=TRUE)
edge_node_2 <- sample(nodes, n_edges, replace=TRUE)

# Sparse matrix
library(Matrix)
M <- sparseMatrix(
  i = edge_node_1,
  j = edge_node_2
)

# Row-wise Jaccard similarity
# http://stats.stackexchange.com/a/89947/2817
jaccard <- function(m) {
  # Counts of common entries for each pair of rows
  A = tcrossprod(m)
  # Indices of the row pairs that share at least one entry
  im = which(A > 0, arr.ind=TRUE)
  # Number of entries in each row
  b = rowSums(m)
  # Common counts for the non-zero pairs only
  Aim = A[im]
  # Jaccard formula: common / (count_i + count_j - common)
  J = sparseMatrix(
    i = im[,1],
    j = im[,2],
    x = Aim / (b[im[,1]] + b[im[,2]] - Aim),
    dims = dim(A)
  )
  return(J)
}

J <- jaccard(M)

> dim(M)
[1] 300000 300000
> dim(J)
[1] 300000 300000
> Sys.time() - t1
Time difference of 2.132279 secs

I know I’m from the future and all, but my spaceship computer has a 2.2 GHz Intel processor and 16 GB of RAM, which isn’t too insane even by 2010 standards.

More generally, I've seen a lot of people fall into this same trap: converting high-dimensional sparse matrices to a dense representation is guaranteed to bog down any system. I've seen this kind of naive sparse-to-dense conversion bring down "big data" clusters with terabytes of storage and hundreds of gigabytes of RAM (let alone my laptop with a 256 GB hard drive and 16 GB of RAM).

In the post above, you’re trying to create a 260 GB object and then do math on it, which is really scary even by future-man standards.

*Keep your sparse matrices sparse*. You’ll have to do some work to re-think your analysis in terms of sparse matrix operations (or better yet graph operations), but this is pretty much the only way sparse (graph) problems ever scale, no matter what “big data” technology you’re using.
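To make that concrete, here is a minimal sketch comparing the memory footprint of a sparse matrix with its dense copy. The sizes `n` and `n_edges` are made up for illustration and kept small so that the dense copy actually fits in memory. This is not the matrix from the post.

```r
library(Matrix)

set.seed(1)
n <- 2000L        # hypothetical size, small enough to densify safely
n_edges <- 6000L  # hypothetical number of stored entries

# Sparse pattern matrix: only the positions of the entries are stored
M <- sparseMatrix(
  i = sample(n, n_edges, replace = TRUE),
  j = sample(n, n_edges, replace = TRUE),
  dims = c(n, n)
)

# Dense copy, built here only for comparison: every cell is stored
D <- as.matrix(M)

sparse_bytes <- as.numeric(object.size(M))
dense_bytes <- as.numeric(object.size(D))

cat(sprintf("sparse: %.0f KB, dense: %.0f KB\n",
            sparse_bytes / 1024, dense_bytes / 1024))
```

The dense footprint grows with the number of cells (n squared), while the sparse one grows with the number of stored entries, so the gap widens quickly as n increases.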

Could you upload the videos again, please?

NoSQL be better than using a SQL database?
