Parallelization Using R


Overview

In order to effectively (and efficiently) use R on massively parallel systems (like the LRZ Linux Cluster or SuperMUC-NG), familiarity with parallel computing principles is essential. In addition, the available infrastructure needs to be considered, i.e. the layout of the cluster and the architecture of the individual compute nodes should be kept in mind. See Access and Overview of HPC Systems for a technical outline of the LRZ cluster systems.

As is often the case, R provides multiple approaches to a specific challenge such as parallel computing. A general overview of the parallel computing capabilities of R and additional packages, as well as other topics of interest in the area of High-Performance Computing (HPC), can be found in the respective CRAN Task View: https://cran.r-project.org/web/views/HighPerformanceComputing.html

Basic Concepts and Requirements

Parallelization on a Single Node

When working on a single compute node (shared memory environment) of an LRZ cluster system (or a local machine with a UNIX-like operating system), the multicore-like functionality of the "parallel"-package (included in the base R installation) can be utilized for explicit parallelization. This allows the R main process to fork additional worker processes on the available cores of the specific node. See vignette("parallel") within R or https://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf for a comprehensive description of the "parallel"-package's functionality. The use-case described in this paragraph is covered in chapter 5 of that document (note that "parallel" incorporates and extends the functionality of two predecessor packages, "multicore" and "snow").

A minimal working example of a batch script specification (see SLURM Workload Manager for a comprehensive description of usage options) and R script for this use-case on the CoolMUC-2 cm2_tiny cluster of the Linux Cluster (see Job Processing on the Linux-Cluster for a cluster overview) is as follows:

submit.sh
#!/bin/bash
#SBATCH --clusters=cm2_tiny
#SBATCH --nodes=1

module load slurm_setup 

module load r
 
Rscript script.R

The following R script is used to probe the number of available cores and provides two time measurements for a dummy task, once run serially and again using multicore-like parallelization functionality provided by the "parallel"-package:

script.R
library(parallel)

# number of cores available on the node
n_cores <- detectCores()
n_cores

# dummy task, executed serially
system.time(
    lapply(1:20, function(x) sum(sort(runif(1e7))))
)

# dummy task, executed in parallel via forked worker processes;
# mclapply() defaults to only 2 workers (option "mc.cores"), so the
# detected core count is passed explicitly
system.time(
    mclapply(1:20, function(x) sum(sort(runif(1e7))), mc.cores = n_cores)
)

Inter-Node Parallelization

Once parallelization across multiple compute nodes is desired, additional measures need to be taken to allow for communication between R processes running on different compute nodes. The established HPC approach for this scenario is message passing, i.e. relying on the Message Passing Interface (MPI). The original "snow"-package can be utilized for explicit MPI parallelization. This allows multiple R processes to be distributed as MPI tasks across multiple compute nodes. See https://cran.r-project.org/web/packages/snow/snow.pdf for a comprehensive description of the "snow"-package's functionality. The MPI functionality of "snow" relies on the "Rmpi"-package, which provides an interface/wrapper to MPI and depends on a properly configured MPI implementation. For usage with R, an installation of OpenMPI is preferred. On the LRZ Linux Cluster, installation and usage of "Rmpi" has successfully been tested with OpenMPI version >= 3.1.3 (e.g. the "openmpi/4.0.3-intel19"-module).

First, an adequate OpenMPI installation has to be made available by loading the corresponding module (and unloading conflicting implementations):

$ module unload intel-mpi
$ module load openmpi

Then the "Rmpi"-package can successfully be installed within R, followed by an installation of "snow":

$ module load r
$ R
 
> install.packages("Rmpi")
> install.packages("snow")

Once these packages are installed, the command mpirun can be used in combination with the RMPISNOW shell script (provided as part of the "snow"-package installation) to create an MPI environment which can then be registered as a "snow" cluster object within R.
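
The full path of the RMPISNOW script, as needed in the batch script below, can for example be looked up from within R (the script is installed into the top-level directory of the "snow" package installation):

> system.file("RMPISNOW", package = "snow")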

A minimal working example of a batch script specification (see SLURM Workload Manager for a comprehensive description of usage options) and R script for a use-case with 3 nodes on the CoolMUC-2 (cm2 cluster) cm2_std partition of the Linux Cluster is as follows. Make sure to include the demonstrated module commands so that the required modules are correctly available on the compute nodes. Also, in this case MPI is used without task spawning, i.e. the cluster is created in "non-spawn" or "static" mode, which should give good performance (details can be found here: http://homepage.divms.uiowa.edu/~luke/R/cluster/cluster.html). Please note that the first task will act as the main process and run the script as provided, while any additional tasks will provide the worker processes for the registered backend (the mpirun default, as in the following, is an allocation of a single CPU per task, i.e. the overall number of tasks equals the number of available CPUs):

submit.sh
#!/bin/bash
#SBATCH --clusters=cm2
#SBATCH --partition=cm2_std
#SBATCH --nodes=3

module load slurm_setup 

module unload intel-mpi
module load openmpi
module load r
 
mpirun /full/path/to/R_LIB/snow/RMPISNOW < script.R

The following R script provides a time measurement for the dummy task (extended to be distributed across all MPI workers):

script.R
library(snow)

# makeCluster() without arguments picks up the MPI cluster that was
# set up by the RMPISNOW startup script
cl <- makeCluster()

system.time(
    parLapply(cl, 1:167, function(x) sum(sort(runif(1e7))))
)

stopCluster(cl)
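
Note that the workers are separate R sessions. The dummy task above is self-contained, but if the function passed to parLapply() relied on objects or packages from the main process, these would have to be made available on the workers explicitly, e.g. (a sketch with a hypothetical object input_data and package somePackage):

clusterExport(cl, "input_data")         # copy an object from the main process to all workers
clusterEvalQ(cl, library(somePackage))  # load a package on all workers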

Using the "foreach" package

for loop

Using a basic for loop in R, the previous dummy task could be expressed as follows. This would be executed serially directly in the main R thread:

for(i in 1:100) sum(sort(runif(1e7)))
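
Note that such a loop evaluates the expression only for its side effects; to keep the individual results, they have to be collected explicitly, for example:

results <- vector("list", 100)
for(i in 1:100) results[[i]] <- sum(sort(runif(1e7)))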

foreach looping construct

The R package "foreach" provides an alternative looping construct for executing R code repeatedly. In combination with various backends it allows for parallel code execution in a unified manner. The basic usage pattern is the following:

library(foreach)
foreach(i = 1:100) %do% sum(sort(runif(1e7)))  # serial execution
foreach(i = 1:100) %dopar% sum(sort(runif(1e7)))  # potential parallel execution
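
In contrast to the for loop, foreach() returns the collected results, as a list by default. The .combine argument can be used to aggregate them differently, e.g. into a numeric vector:

foreach(i = 1:100, .combine = c) %do% sum(sort(runif(1e7)))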

doParallel - multicore-like single node parallelization

A "backend" has to be used to create and register a cluster resource for the %dopar%-operator to rely on for parallel execution. This can, for example, be based on the "parallel"-package's multicore-like functionality (provided by the "doParallel"-package) and can as such be used on single nodes of an LRZ cluster system (or a local machine with a UNIX-like operating system):

library(foreach)
library(doParallel)
registerDoParallel(cores = <no.of.cores>)
foreach(i = 1:100) %dopar% sum(sort(runif(1e7)))  # parallel execution
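
Within a batch job, the number of cores to register can, for example, be derived from the environment set up by the scheduler (a sketch assuming Slurm sets SLURM_CPUS_ON_NODE inside the allocation; detectCores() serves as fallback). The registered backend can be checked with getDoParWorkers():

library(foreach)
library(doParallel)

slurm_cores <- Sys.getenv("SLURM_CPUS_ON_NODE")
n_cores <- if (slurm_cores != "") as.integer(slurm_cores) else parallel::detectCores()

registerDoParallel(cores = n_cores)
getDoParWorkers()  # number of workers provided by the registered backend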

doParallel - snow-like single node parallelization

The snow-like functionality of "parallel" (again, provided via the "doParallel"-package) can be used on any major operating system (though the multicore-like approach above is more efficient where available). It follows the general "foreach"/"%dopar%"-backend pattern of registering an explicit cluster object:

library(foreach)
library(doParallel)
cl <- makePSOCKcluster(<no.of.cores>)
registerDoParallel(cl)
foreach(i = 1:100) %dopar% sum(sort(runif(1e7)))
stopCluster(cl)
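
In contrast to forked worker processes, the PSOCK workers are fresh R sessions, so packages used inside the loop body have to be loaded on the workers. foreach supports this via its .packages argument; a minimal sketch, assuming the loop body uses a function some_function() from a hypothetical package somePackage:

foreach(i = 1:100, .packages = "somePackage") %dopar% {
    some_function(runif(1e7))  # some_function()/somePackage are placeholders
}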

doSNOW - snow MPI cluster multi-node parallelization

Similarly, the "snow"-package's MPI cluster functionality can be used as a backend (provided by the "doSNOW"-package), once an MPI environment is established (relying on "Rmpi") as described in the "Inter-Node Parallelization" example of the "Basic Concepts and Requirements" section above:

library(foreach)
library(doSNOW)
cl <- makeCluster()  # picks up the MPI cluster set up by the RMPISNOW startup script
registerDoSNOW(cl)
foreach(i = 1:100) %dopar% sum(sort(runif(1e7)))
stopCluster(cl)

However, the following approach using the "doMPI"-package is typically preferred.

doMPI - multi-node parallelization (directly using Rmpi)

The "Rmpi"-package's MPI interface/wrapper can directly be utilized as "%dopar-backend" using the "doMPI"-package. This typically executes more efficiently than a "snow"-package's MPI cluster. See the "Inter-Node Parallelization" example of the "Basic Concepts and Requirements" section above for the necessary installation requirements for "Rmpi" (and "doMPI"). Once these packages are installed, the command mpirun can used to create a MPI environment which can then be registered as a "%dopar-backend" in R.

A minimal working example of a batch script specification (see SLURM Workload Manager for a comprehensive description of usage options) and R script for a use-case with 3 nodes on the CoolMUC-2 (cm2 cluster) cm2_std partition of the Linux Cluster is as follows. Please note that the first task will act as the main process and run the script as provided, while any additional tasks will provide the worker processes for the registered backend (the mpirun default, as in the following, is an allocation of a single CPU per task, i.e. the overall number of tasks equals the number of available CPUs):

submit.sh
#!/bin/bash
#SBATCH --clusters=cm2
#SBATCH --partition=cm2_std
#SBATCH --nodes=3

module load slurm_setup
 
module unload intel-mpi
module load openmpi
module load r
 
mpirun Rscript script.R

The following R script provides a time measurement for the dummy task (extended to be distributed across all MPI workers):

script.R
library(foreach)
library(doMPI)

cl <- startMPIcluster()  # use verbose = TRUE for detailed worker message output
registerDoMPI(cl)

system.time(
    foreach(i = 1:167) %dopar% sum(sort(runif(1e7)))
)

closeCluster(cl)
mpi.quit()