Taking R to the Limit, Part II - Large Datasets in R


For Part I, Parallelism in R, click here.

Tuesday night I again had the opportunity to present on high performance computing in R at the Los Angeles R Users’ Group. This was the second part of a two-part series called “Taking R to the Limit: High Performance Computing in R.” Part II discussed ways to work with large datasets in R, and I also tied MapReduce into the talk. Unfortunately, there was too much material to fit everything in; I had originally planned to also cover Rhipe, using R on EC2, and sparse matrix libraries.


My edited slides are posted on SlideShare, and available for download here.

Topics included:

  • bigmemory, biganalytics and bigtabulate
  • ff
  • HadoopStreaming
  • brief mention of Rhipe
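To give a taste of the bigmemory family listed above, here is a minimal sketch. It is an illustration, not the talk's demonstration code; it assumes bigmemory, biganalytics and bigtabulate are installed from CRAN and that the airline.csv file described below is in the working directory (with a `Year` column, as in the Data Expo data).

```r
# Hypothetical bigmemory workflow sketch; assumes bigmemory,
# biganalytics and bigtabulate are installed, and that airline.csv
# (see below) is in the working directory.
library(bigmemory)
library(biganalytics)
library(bigtabulate)

# Read the CSV into a file-backed big.matrix so it lives on disk,
# not in RAM; the backing/descriptor files let you re-attach
# instantly in later sessions with attach.big.matrix().
x <- read.big.matrix("airline.csv", header = TRUE,
                     backingfile = "airline.bin",
                     descriptorfile = "airline.desc",
                     type = "integer")

# biganalytics: column means without pulling the data into memory.
colmean(x, na.rm = TRUE)

# bigtabulate: fast tabulation, e.g. number of flights per year.
bigtable(x, ccols = "Year")
```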


The corresponding demonstration code is here.
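For ff, a similarly hedged sketch (again not the demonstration code itself; it assumes the ff package is installed and uses the same airline.csv, with its `Distance` column):

```r
# Hypothetical ff sketch; assumes the ff package is installed and
# airline.csv is in the working directory.
library(ff)

# read.csv.ffdf reads the file in chunks into an ffdf: an on-disk
# data frame whose columns are memory-mapped ff vectors.
airline <- read.csv.ffdf(file = "airline.csv", header = TRUE,
                         first.rows = 10000, next.rows = 50000)

dim(airline)                            # dimensions, data stays on disk
mean(airline$Distance[], na.rm = TRUE)  # [] materializes one column in RAM
```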


Since this talk discussed large datasets, I used some, well, large datasets. Some demonstrations used toy data, including the trees and the famous iris datasets included in base R. To load these, just use the call data(iris) or data(trees).
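For reference, loading the toy datasets takes nothing but base R:

```r
# trees and iris ship with base R's datasets package, so no
# installation is needed; data() copies them into the workspace.
data(trees)
data(iris)

str(trees)   # 31 obs. of 3 variables: Girth, Height, Volume
head(iris)   # sepal/petal measurements plus Species
```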

Large datasets:

  • On-Time Airline Performance data from the 2009 Data Expo. This Bash script will download all of the necessary data files and create a nice dataset called airline.csv in the directory in which it is executed. I would just post the data here, but it is very large and I only have so much bandwidth!
  • The Twitter dataset appears to no longer be available. Instead, use anna.txt, which comes with HadoopStreaming; simply replace twitter.tsv with anna.txt.
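For anyone following along with anna.txt, here is a generic Hadoop Streaming word-count mapper in R. This is not the talk's code, just a sketch of the streaming contract: read lines from stdin, emit tab-separated key/value pairs on stdout.

```r
#!/usr/bin/env Rscript
# Generic streaming word-count mapper: reads lines from stdin and
# emits "word<TAB>1" pairs, the standard Hadoop Streaming contract.
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  # split on anything that is not a letter or apostrophe
  words <- strsplit(tolower(line), "[^a-z']+")[[1]]
  for (w in words[nchar(words) > 0]) {
    cat(w, "1", sep = "\t")
    cat("\n")
  }
}
close(con)
```

You can test it locally without a cluster: `cat anna.txt | ./mapper.R | sort | uniq -c` behaves like a single-machine map-plus-reduce.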


The video was created with Vara ScreenFlow and I am very happy with how easy it is to use and how painless editing was.

For Part I, Parallelism in R, click here.

13 comments to Taking R to the Limit, Part II – Large Datasets in R

  • Thanks for sharing it!

    Do you know if anyone is also going to publish it (even by reference to this post) here:



  • Felipe

    Ryan, thank you for your post. Excuse me, can one create visualizations of big data this way? Have a nice day.

  • Taka

    Hi Ryan,

    This is great material!

    I would appreciate it if you could give me some advice on using R and MapReduce, though I am a pure newbie to MapReduce.
    I was asked at my company to process 100 million rows × 10 columns of data and conduct a cluster analysis. Though I am
    looking for materials, I haven’t had luck yet except for yours.

    I would also like to ask whether you have experience using EC2 and MapReduce services on AWS.

    1. Suppose I subscribe to 4 EC2 instances; how many MapReduce instances should I subscribe to? – I guess that would be 4.
    2. Should I use MapReduce or HadoopStreaming?

    Please forgive me for asking very simple/basic questions.

    Thanks in advance.


    • Taka,
      Most probably you have completed the project by now. Still, some answers for anyone with the same questions:
      a) The total number of slots (map + reduce) should be slightly more than the number of cores (assuming roughly 75% CPU utilization for MapReduce) [1]
      b) It depends on the programming language: for Python et al., use Hadoop Streaming; for Java, use the native MapReduce API

      [1] http://goo.gl/FWT8y


  • Pablo

    Are the videos of this and the previous posts available somewhere else? The current link indicates that the file has been removed from Blip.

  • Patrick Champion

    Hi Ryan,

    I currently work in SAS on some pretty big datasets – some tables are just 10 to 30 million records, but a few that I slice and dice against are in the 300 to 600 million record range. I used to program in R during grad school on tiny data, and I have 20 years of software development experience in C++/Java/Python/etc. Can ff and filebacked.big.matrix handle 300-600 million records at about 50 to 300 bytes per record? That is my current work environment in SAS.

    Are there any R packages that would allow me to grind through a table with about 5 billion records? Currently, I suspect I just have to chunk at the 50-100 million record size (one site at a time), but it would be good to avoid this segmentation and keep one large file so I can analyze across sites. Everything in R seems to limit one to a mere 32-bit record index – a problem SAS used to have, but one that seems to have been lifted with 9.4 on Linux x64 platforms.

    Is there anything out there for handling 5-billion-record tables? Does the RAM on the system have to be under one CPU, or can it be clustered like some big systems that have 40 cores and 512 GB of RAM?

