SparkR with Project Datastore
This page describes how users can create Spark clusters from within R sessions in Jupyter notebooks and submit jobs to them that use the project datastore.
Create Notebook, Conda Environment and Select R Kernel
Create a Jupyter notebook within your defined project. When the notebook is created, the Spark properties needed to create the Spark cluster are dynamically loaded into the notebook environment. Install a conda environment by following https://datalab-docs.datalabs.ceh.ac.uk/conda-pkgs/conda_environments.html and select the R kernel.
Create Spark session
# SparkR is bundled with the Spark installation, so load it from SPARK_HOME
# rather than installing it from CRAN
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))

# Start a SparkR session on the cluster
sparkR.session(appName = "SparkR-Test",
               sparkHome = Sys.getenv("SPARK_HOME"),
               sparkConfig = list(spark.executor.instances = "6",
                                  spark.kubernetes.container.image = "nerc/sparkr-k8s:latest"))
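Once the session has started, it can be useful to confirm that the requested configuration was picked up. The check below is a minimal sketch using SparkR's sparkR.version() and sparkR.conf() helpers; the keys queried are the ones set in sparkConfig above.

# Confirm the session is running and the requested settings were applied
sparkR.version()                                 # Spark version of the active session
sparkR.conf("spark.executor.instances")          # expected: "6"
sparkR.conf("spark.kubernetes.container.image")  # expected: "nerc/sparkr-k8s:latest"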
Read data from project datastore
df <- read.df("cities.csv", source = "csv", header = "true")
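To check that the file was read correctly, the schema and row count can be inspected. This is a small sketch using the standard SparkR printSchema() and nrow() functions.

printSchema(df)   # column names and types of the loaded DataFrame
nrow(df)          # number of rows read from the datastore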
Display the DataFrame created
head(df)
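Transformations on the DataFrame run as jobs on the Spark cluster. The sketch below is illustrative only: the "country" column is hypothetical and should be replaced with an actual column in cities.csv, and the output path is an example location on the project datastore.

# Hypothetical column name -- adjust to match the actual cities.csv schema
by_country <- count(groupBy(df, "country"))   # runs as a distributed job on the executors
head(by_country)

# Write the result back to the project datastore (example output path)
write.df(by_country, path = "cities_by_country", source = "csv", mode = "overwrite")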
Stop the Spark session
sparkR.session.stop()