Comments for Byte Mining

Comment on Accessing R from Python using RPy2 by Ryan Rosario

Ryan Rosario — Tue, 20 Mar 2018 19:38:04 +0000

In reply to Renel Chesak. Thanks for letting me know. The links seem to throw an HTTP 500. I'll see if I can fix it tonight.

Comment on Accessing R from Python using RPy2 by Renel Chesak

Renel Chesak — Tue, 20 Mar 2018 08:43:09 +0000

Hi, it seems the links provided for the .py files are no longer valid. Is there any way they can be re-uploaded? Thanks!

Comment on “Hold Only That Pair of 2s?” Studying a Video Poker Hand with R by Joseph

Joseph — Fri, 06 Oct 2017 19:00:27 +0000

Would you mind adding the option to hold the Jack alone, and throw in the awesomest 53rd card, Mr. Joker, please? I’m at Harrah’s and find when I hold a low pair when it’s ACES or better, the following cards have about a 60% chance of having an Ace, if not a Joker and Ace (very rare).

Have you considered doing one for deuces wild? I need to study the strategies for playing that. Any tips? I.e., when you’re dealt two pair, do you go for the full house, or ditch a pair, hoping for four or five of a kind?

Thanks for this!

Joseph

Comment on Hadoop Fatigue — Alternatives to Hadoop by Areeb

Areeb — Fri, 31 Mar 2017 04:17:47 +0000

Nice writeup, Ryan. Do you have an updated version, since this was from 2011? It is interesting that I came to much the same conclusions about the complexity of Hadoop, and in fact many, like you, are saying this is slowing adoption of Big Data processes. I am also very put off by the unnecessary complexity of Java and prefer Disco’s approach, where being able to write simple, efficient, high-performance Python code natively is a huge advantage, yet I also agree that Hadoop still seems to have the best ecosystem.

Comment on Be Careful Searching Python Dictionaries! by Sridevi P

Sridevi P — Fri, 24 Mar 2017 07:58:03 +0000

Awesome tip!! Reduced computation time of my code from 3.5 hours to 13 sec!!

Comment on Taking R to the Limit, Part II – Large Datasets in R by Handling large data sets in R | dataprasad

Handling large data sets in R | dataprasad — Fri, 17 Mar 2017 07:08:26 +0000

[…] Taking R to the limit […]

Comment on Be Careful Searching Python Dictionaries! by Ryan Rosario

Ryan Rosario — Mon, 28 Nov 2016 01:13:55 +0000

In reply to Wai Kay. Correct. I should add an update header to this article. This was written back when Python 2 was more common... but even I have moved on to 3.

Comment on Be Careful Searching Python Dictionaries! by Wai Kay

Wai Kay — Sat, 26 Nov 2016 05:55:59 +0000

This is only a problem with Python 2. In Python 3 both have the same efficiency.
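
A minimal sketch of the point above, assuming "both" refers to `key in d` versus `key in d.keys()` (the comment does not spell this out). The dictionary `d` and the probe key `missing` are hypothetical names made up for this illustration. `list(d)` stands in for what Python 2's `keys()` used to return.

```python
import timeit

# Hypothetical dictionary, built only for this illustration.
d = {i: str(i) for i in range(100_000)}
missing = -1  # a key that is not present, the worst case for a linear scan

# In Python 3, keys() returns a view backed by the dict's hash table,
# so both membership tests below are hash lookups.
keys_view = d.keys()
direct = timeit.timeit(lambda: missing in d, number=100)
via_view = timeit.timeit(lambda: missing in keys_view, number=100)

# Python 2's keys() built a list. list(d) mimics that here, and membership
# on a list compares against the elements one by one.
keys_list = list(d)
via_list = timeit.timeit(lambda: missing in keys_list, number=100)

print(f"key in d:         {direct:.6f} s")
print(f"key in d.keys():  {via_view:.6f} s")
print(f"key in list(d):   {via_list:.6f} s")
```

On Python 3 the first two timings should be of the same order, while the list scan should be far slower. Exact numbers depend on the machine.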

Comment on It’s Been a While by Statwonk

Statwonk — Mon, 31 Oct 2016 02:14:04 +0000

Hey, thanks for sharing, Ryan. It’s nice to read honest accounts like this; indeed, the grass isn’t always greener on the other side. I’ve followed you for some time now on Twitter and I’ve read your work. It’s great! I bet you’ll have no problem finding interesting problems in the new areas you listed. Take care buddy. 🙂

Comment on Parsing Wikipedia Articles: Wikipedia Extractor and Cloud9 by Zhonglin Ye

Zhonglin Ye — Thu, 08 Sep 2016 03:54:37 +0000

Thanks for your hard work; it helped me save a lot of time on extraction. Originally, I wanted to write Java code to extract it, but the performance was poor. Your work really contributes to everyone.

Comment on Hitting the Big Data Ceiling in R by skan

skan — Tue, 12 Jul 2016 15:28:16 +0000

Hello.

What if we need to do something more complex with our data, such as fitting a mixed-effects model regression for a large dataset? (also called repeated measures or hierarchical)
I mean a dataset 10 or 100 times larger than the computer’s memory.
R is not able to work with this data.
And other tools such as Spark don’t have advanced statistical functions able to do this kind of analysis.

Comment on Parsing Wikipedia Articles: Wikipedia Extractor and Cloud9 by Seth Green

Seth Green — Thu, 03 Mar 2016 00:28:33 +0000

Hi,
I have a Stack Overflow question posted about trying to use the Cloud9 process described here, but I can’t seem to get it to work. If anybody who has used this could cruise over to this post and give me any ideas of what I’m doing wrong, that would be much appreciated! Thanks. Post here:
http://stackoverflow.com/questions/35760657/extracting-wikipedia-article-text-with-cloud9-and-hadoop

Seth

Comment on My Experience at Hadoop Summit 2010 #hadoopsummit by My Review of Hadoop Summit 2011 #hadoopsummit « Byte Mining

My Review of Hadoop Summit 2011 #hadoopsummit « Byte Mining — Thu, 18 Feb 2016 22:46:57 +0000

[…] from Los Angeles and the Big Data Camp that lasted until 10pm the night before. Having been to Hadoop Summit 2010, I was interested to see how much of the content in the conference had […]

Comment on Taking R to the Limit, Part II – Large Datasets in R by Devenir Riche En France

Devenir Riche En France — Sat, 12 Dec 2015 09:39:59 +0000

Have you ever considered publishing an ebook or guest authoring on other websites? I have a blog centered on the same ideas you discuss and would really like to have you share some stories/information. I know my visitors would value your work. If you are even remotely interested, feel free to shoot me an email.

Comment on Advanced Graphics in R by Useful Links | Clint P. George

Useful Links | Clint P. George — Tue, 24 Nov 2015 00:08:01 +0000

[…] for MATLAB users Producing Simple Graphs with R Advanced Graphics in R Basic R Tutorial Google’s R coding style R Free IDE for academic […]

Comment on “Hold Only That Pair of 2s?” Studying a Video Poker Hand with R by Barack Obama

Barack Obama — Tue, 10 Nov 2015 18:27:00 +0000

In reply to Ryan.

My take is that the J must be better.

My reasoning is that the casino is giving you some extra incentive to hold a pair of 2’s or 3’s.
Giving you that extra thing to bet on, is to me, almost like giving you an option to buy insurance.
So they want you to keep a hand that is less likely to win.

Comment on “Hold Only That Pair of 2s?” Studying a Video Poker Hand with R by Barack Obama

Barack Obama — Tue, 10 Nov 2015 18:18:52 +0000

The other question is

“dump the 22 and keep the J”

Comment on Hadoop Fatigue — Alternatives to Hadoop by Home Design Ideas

Home Design Ideas — Sun, 18 Oct 2015 14:51:33 +0000

Hey there, you’ve done a fantastic job. I will definitely digg it and personally recommend it to my friends. I’m confident they’ll benefit from this web site.

Comment on Hitting the Big Data Ceiling in R by Zach

Zach — Wed, 16 Sep 2015 19:55:05 +0000

In reply to Ryan. The graph you describe above (9,000,000 edges) only needs about 3.6 MB of RAM, so you won't need a file-backed graph!

Comment on Hitting the Big Data Ceiling in R by Zach

Zach — Wed, 16 Sep 2015 19:51:26 +0000

In reply to Zach.

My code in the above post got mangled. See here: https://gist.github.com/zachmayer/f2b643d8d1b4d1589dcc

Comment on Hitting the Big Data Ceiling in R by Zach

Zach — Wed, 16 Sep 2015 19:49:16 +0000

Ok, I’m 5 years or so late to the party, but this post shows up prominently on Google (at least when I search for “R file-backed data.tables”), so I thought I’d throw in my 2 cents. The following snippet of code creates a 300,000 x 300,000 sparse matrix in R and calculates the Jaccard similarity between every pair of rows in less than 3 seconds (including the time to create the matrix):

#Define the problem
t1 <- Sys.time()
set.seed(1)
n_nodes <- 300000L
n_edges <- 900000L
nodes <- 1L:n_nodes
edge_node_1 <- sample(nodes, n_edges, replace=TRUE)
edge_node_2 <- sample(nodes, n_edges, replace=TRUE)

#Sparse matrix
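library(Matrix)
M <- sparseMatrix(
  i = edge_node_1,
  j = edge_node_2
)

#Row-wise Jaccard similarity
#http://stats.stackexchange.com/a/89947/2817
jaccard <- function(m) {
  #Number of shared non-zero columns for every pair of rows
  A = tcrossprod(m)
  #Index pairs of rows that share at least one column
  im = which(A > 0, arr.ind=TRUE)
  #Number of non-zero entries in each row
  b = rowSums(m)
  #Shared counts for the non-zero pairs only
  Aim = A[im]
  #Jaccard: shared / (row i + row j - shared)
  J = sparseMatrix(
    i = im[,1],
    j = im[,2],
    x = Aim / (b[im[,1]] + b[im[,2]] - Aim),
    dims = dim(A)
  )
  return(J)
}
J <- jaccard(M)

> dim(M)
[1] 300000 300000
> dim(J)
[1] 300000 300000
> Sys.time() - t1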
Time difference of 2.132279 secs
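
For readers who want to check the formula in the snippet by hand, here is a rough pure-Python sketch of the same row-wise Jaccard calculation: shared count divided by the two row counts minus the shared count. It is an illustration only, not the commenter's code. The toy `rows` data and the helper name `jaccard` are assumptions made up for this example, and it loops over pairs instead of using sparse matrix operations.

```python
from itertools import combinations

# Hypothetical toy data: each row is the set of column indices that are non-zero.
rows = {
    0: {0, 1, 2},
    1: {1, 2, 3},
    2: {4},
}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / (|a| + |b| - |a & b|)."""
    common = len(a & b)
    if common == 0:
        # Covers disjoint and empty rows, and avoids dividing by zero.
        return 0.0
    return common / (len(a) + len(b) - common)


# Similarity for every unordered pair of distinct rows.
similarities = {
    (i, j): jaccard(rows[i], rows[j]) for i, j in combinations(rows, 2)
}

for pair, value in similarities.items():
    print(pair, value)
```

Rows 0 and 1 share two of four distinct columns, so their similarity is 0.5. The sparse R version stores only the pairs with a non-zero similarity, which is what keeps the result small.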

I know I’m from the future and all, but my spaceship computer has a 2.2 GHz Intel processor and 16 GB of RAM, which isn’t too insane even by 2010 standards.

More generally, I’ve seen a lot of people fall into this same trap: converting high-dimensional sparse matrices to a dense representation is guaranteed to bog down any system. I’ve seen this kind of naive sparse-to-dense conversion bring down “big data” clusters with terabytes of storage and hundreds of gigabytes of RAM (let alone my laptop with a 256GB hard drive and 16GB of RAM).

In the post above, you’re trying to create a 260 GB object and then do math on it, which is really scary even by future-man standards.

*Keep your sparse matrices sparse*. You’ll have to do some work to re-think your analysis in terms of sparse matrix operations (or better yet graph operations), but this is pretty much the only way sparse (graph) problems ever scale, no matter what “big data” technology you’re using.
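
A back-of-the-envelope sketch of why the dense form hurts, using the sizes from the snippet earlier in this comment (300,000 nodes, 900,000 edges) and not the sizes from the original post. The assumptions are: 8-byte values for the dense matrix, and a compressed-column pattern matrix that stores one 4-byte row index per non-zero plus one 4-byte pointer per column boundary. Object overhead and duplicate edges from sampling are ignored, and the function names are made up for this example.

```python
def dense_bytes(n_rows, n_cols, bytes_per_value=8):
    """Memory for a dense matrix that stores every cell."""
    return n_rows * n_cols * bytes_per_value


def sparse_pattern_bytes(n_cols, n_nonzero, bytes_per_index=4):
    """Memory for a compressed-column pattern matrix (indices only, no values)."""
    row_indices = n_nonzero * bytes_per_index
    column_pointers = (n_cols + 1) * bytes_per_index
    return row_indices + column_pointers


n_nodes = 300_000
n_edges = 900_000

dense = dense_bytes(n_nodes, n_nodes)
sparse = sparse_pattern_bytes(n_nodes, n_edges)

print(f"dense:  {dense / 1e9:.1f} GB")   # 720.0 GB
print(f"sparse: {sparse / 1e6:.1f} MB")  # 4.8 MB
```

Under these assumptions the dense copy is roughly 150,000 times larger than the sparse one, which is the gap the advice above is warning about.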

Comment on Hitting the Big Data Ceiling in R by Zach

Zach — Wed, 16 Sep 2015 19:19:28 +0000

What do you mean by “there is no way to modify the sparsity structure of the matrix”?

Comment on Some LaTeX Gems – Part 1: TikZ, Loops and more by Iterating through a character/string array in LaTeX - codeengine

Iterating through a character/string array in LaTeX - codeengine — Wed, 22 Jul 2015 17:22:51 +0000

[…] have tried looking at similar posts (Iteration in LaTeX, http://www.bytemining.com/2010/04/some-latex-gems-part-1-tikz-loops-and-more/) but their focus is different enough that I cannot apply it to solve my problem […]

Comment on Parsing Wikipedia Articles: Wikipedia Extractor and Cloud9 by Data Science Tools: Python | Likelihood Log

Data Science Tools: Python | Likelihood Log — Sun, 19 Jul 2015 10:27:42 +0000

[…] http://www.bytemining.com/2011/11/parsing-wikipedia-articles-wikipedia-extractor-and-cloud9/ […]

Comment on My First Few Days with RStudio by Chelsea

Chelsea — Wed, 08 Jul 2015 22:49:45 +0000

Thank you!! Your writing is so comprehensible.