For Part I, Parallelism in R, click here.
Tuesday night I again had the opportunity to present on high performance computing in R at the Los Angeles R Users’ Group. This was the second part of a two-part series called “Taking R to the Limit: High Performance Computing in R.” Part II discussed ways to work with large datasets in R. I also tied MapReduce into the talk. Unfortunately, there was too much material: I had originally planned to also cover Rhipe, using R on EC2, and sparse matrix libraries, but there was not enough time for all of it.
Slides
My edited slides are posted on SlideShare, and available for download here.
Topics included:
- bigmemory, biganalytics and bigtabulate
- ff
- HadoopStreaming
- brief mention of Rhipe
Code
The corresponding demonstration code is here.
Data
Since this talk discussed large datasets, I used some, well, large datasets. Some demonstrations used toy data, including trees and the famous iris dataset, both included in base R. These are datasets rather than packages, so to load them use the call data(iris) or data(trees), not library().
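As a minimal sketch of loading the toy data: both iris and trees ship in the datasets package, which R attaches at startup, so data() simply copies them into the workspace. The dim() and str() calls are only there as a sanity check and are not part of the talk's demonstration code.

```r
# Both datasets live in the 'datasets' package, which R attaches at startup.
data(iris)    # flower measurements plus a Species factor
data(trees)   # Girth, Height and Volume of black cherry trees

# Quick look at the shape and column types of each data frame.
dim(iris)
dim(trees)
str(iris)
```

Calling data() is optional here, since both objects can be referenced by name without it, but it makes the dependency explicit at the top of a script.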
Large datasets:
- On-Time Airline Performance data from the 2009 Data Expo. This Bash script will download all of the necessary data files and create a single dataset called airline.csv in the directory in which it is executed. I would just post the file here, but it is very large and I only have so much bandwidth!
- The Twitter dataset appears to no longer be available. Instead, use anna.txt which comes with HadoopStreaming. Simply replace twitter.tsv with anna.txt.
Video
The video was created with Vara ScreenFlow and I am very happy with how easy it is to use and how painless editing was.

Thanks for sharing it!
Do you know if anyone is gonna also publish it (even by reference to this post) on here:
http://www.r-bloggers.com/RUG/
?
[…] Data with R Given that I gave a talk to the Los Angeles R Users’ Group on working with large datasets in R, I figured this would be an enlightening session. Unfortunately, the R skills that were covered […]
Ryan, thank you for your post. Excuse me, can one visualize big data this way? Have a nice day.
These packages do not take care of the visualization aspect. That requires quite a bit of creativity.
Hi Ryan,
This is great material!!!
I would appreciate it if you could give me some advice on using R and MapReduce, though I am a pure newbie to MapReduce.
I was asked at my company to process 100 million rows × 10 columns of data and run a cluster analysis. Though I have been looking for materials, I haven’t had any luck yet except for yours.
So I would like to ask whether you have experience using the EC2 and MapReduce services on AWS.
1. Suppose I subscribe to 4 EC2 instances; how many MapReduce instances should I subscribe to? I guess that would be 4.
2. Should I use MapReduce or HadoopStreaming?
Please forgive me for asking such simple/basic questions.
Thanks in advance.
Taka
Taka,
You have most likely completed the project by now. Still, here are some answers for anyone with similar questions:
a) The total number of slots (map + reduce) should be slightly more than the number of cores, assuming about 75% of CPU goes to MapReduce [1].
b) It depends on the programming language: for Python and the like, use Hadoop Streaming; for Java, use MapReduce directly.
Ref:
[1] http://goo.gl/FWT8y
[…] Via Ryan Rosario’s Byte Mining, challenges of and solutions to performing analytics on 10 to 15 gig data sets using R. It’s a long deck, but Ryan covers a lot of very cool material. I’m looking forward to trying a few of these myself. SlideShare below, and grab the PDF as reference. Taking R to the Limit (High Performance Computing in R), Part 2: Large Datasets, LA R Users’ Group 8/17/10 […]
Are the videos of this and the previous post available somewhere else? The current link indicates that the file has been removed from Blip.
Hi Ryan,
I currently work in SAS on some pretty big datasets. Some tables are just 10 to 30 million records, but a few that I slice and dice against are in the 300 to 600 million record range. I used to program in R during grad school on tiny data and have had 20 years of software development in C++/Java/Python/etc. Can ff and filebacked.big.matrix handle 300 to 600 million records at about 50 to 300 bytes per record? That is my current work environment in SAS.
Are there any R packages that would allow me to grind through a table of about 5 billion records? Currently, I suspect I just have to chunk at the 50 to 100 million record size (one site at a time). But it would be good to avoid this segmentation and have one large file so that I can analyze across sites. Everything in R seems to limit one to a mere 32-bit record index, a problem SAS used to have but which seems to have been lifted with 9.4 on Linux x64 platforms.
Is there anything out there for handling 5 billion record tables? Does the RAM on the system have to be under one CPU, or can it be clustered, like some big systems that have 40 cores and 512 gigs of RAM?
Patrick
Have you ever considered publishing an ebook or guest authoring on other websites?
I have a blog centered on the same ideas you discuss and would really like to have you share some stories/information. I know my visitors would value your work.
If you are even remotely interested, feel free to shoot me an email.