Package management on Spark with Packrat
Apache Spark is a distributed compute framework which runs numerous tasks in
parallel across several worker nodes. Within DataLabs we have used Spark to
solve large compute problems using the R
language.
We have installed devtools
and packrat
on the worker nodes to enable
management of packages on a project-by-project basis.
Using Packrat on the Spark cluster
Prerequisites
First, initialise the project using Packrat (packrat::init), then install the
required packages, and finally update the lockfile with packrat::snapshot; see
the Packrat Quick-Start Guide.
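As a rough sketch, the prerequisite steps might look like the following, run once from a notebook. The project path and the fortunes package are illustrative; substitute your own project directory and dependencies.
setwd('/data/example_project')   # illustrative project path
packrat::init()                  # create the private library and packrat.lock
install.packages('fortunes')     # install the packages the project needs
packrat::snapshot()              # record the installed versions in the lockfile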
Opening a project in a notebook with a Spark context
When using R in a Zeppelin notebook, or a Jupyter notebook with the R (SparkR)
kernel, the Spark context has already been generated automatically and the
SparkR library loaded. To prevent Packrat from unloading this library, the
project must be opened using the clean.search.path = FALSE argument.
setwd('/data/example_project')
packrat::on(clean.search.path = FALSE)
Install R packages in a function running on Spark
In DataLabs, the /data directory is shared between the Spark worker nodes and
the notebooks. Running packrat::restore will compile and install any missing
packages from the lockfile. These packages are stored within the private project
library, so they should only need to be built once as it is a shared
directory.
sparkFunct <- function(idx) {
  # Open the project (same shared path as above)
  setwd('/data/example_project')
  packrat::on(clean.search.path = FALSE)
  # Install any missing packages into the private library if needed
  packrat::restore()
  # Run code on the cluster
  library(fortunes)
  return(fortune())
}
spark.lapply(seq(4), sparkFunct)
In a multi-node cluster, it is possible that many nodes will race to build the same package in an identical location. To prevent this, run the Spark function once (using a sequence of length 1) before running it with a larger sequence.
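For example, a warm-up call of the following form (reusing sparkFunct from above; the larger sequence length is only illustrative) lets a single task build any missing packages before the full-size run:
spark.lapply(seq(1), sparkFunct)    # single task: builds missing packages once
spark.lapply(seq(100), sparkFunct)  # larger run: shared library is already populated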