For Part I, Parallelism in R, click here.
Tuesday night I again had the opportunity to present on high performance computing in R, at the Los Angeles R Users’ Group. This was the second part of a two part series called “Taking R to the Limit: High Performance Computing in R.” Part II discussed ways to work with large datasets in R. I also tied in MapReduce into the talk. Unfortunately, there was too much material and I had originally planned to cover Rhipe, using R on EC2 and sparse matrix libraries.
My edited slides are posted on SlideShare, and available for download here.
Taking R to the Limit (High Performance Computing in R), Part 2 — Large Datasets, LA R Users' Group 8/17/10
View more presentations from Ryan Rosario.
bigmemory, biganalytics and bigtabulate
brief mention of Rhipe
The corresponding demonstration code is here.
Since this talk discussed large datasets, I used some, well, large datasets. Some demonstrations used toy data including trees and the famous iris dataset included in base R. To load these, just use the call library(iris) or library(trees).
On-Time Airline Performance data from 2009 Data Expo. This Bash script will download all of the necessary data files and create a nice dataset [...]