MATLAB Parallel Server (MPS), formerly known as MATLAB Distributed Computing Server (MDCS) in R2018b and older releases, extends the functionality of the Parallel Computing Toolbox (PCT) by allowing parallel jobs to run across multiple compute nodes. MPS jobs are handled like ordinary parallel jobs, using the Slurm queueing system in the background. They can be submitted both from the login nodes of the Linux Cluster and from the user's remote computer (e.g., laptop, desktop PC). Both ways are briefly described below. Please also consult the MPS User Guide for use on CoolMUC-2 for detailed information and MATLAB Parallel Server for general information.


Please note: This MATLAB product is available to (particular) TUM users only.

Submit MPS Job to the Linux Cluster

To run your parallel MATLAB code using PCT + MPS, follow the steps described in the tables in the following sections.

I want to submit MPS jobs from a MATLAB session on the Linux Cluster Login Node

Step | Comment
1. Log in to one of the Linux Cluster login nodes, load a MATLAB module, and start MATLAB

Getting Started: MATLAB Modules
2. Job configuration for CoolMUC-2
>> configCluster(cluster_name, partition_name);

Run the cluster configuration. This step is mandatory. Otherwise, MATLAB will use its default cluster settings ('local' cluster), which will not work! Both the name of the cluster (e.g. cm2) and the name of the partition (= queue, e.g. cm2_std) have to be passed to configCluster().

Please check the requirements of your job, i. e., the number of tasks (workers) and tasks per node. Then set the correct cluster and partition.

>> ch = parcluster;

Create a cluster object and return the cluster object handle.

>> % Job walltime in format hh:mm:ss => --time=00:30:00 in Slurm
>> ch.AdditionalProperties.WallTime = '00:30:00';

>> % MPI tasks per node => --tasks-per-node=28 in Slurm
>> ch.AdditionalProperties.ProcsPerNode = 28;		

>> % additional: disabling multi-threading and setting memory requirement
>> ch.AdditionalProperties.AdditionalSubmitArgs = '--cpus-per-task=1 --mem=55G';  

Define the job parameters (members of the cluster object). MPS translates all settings into the corresponding Slurm flags (needed by the sbatch command, see the documentation of the Slurm Workload Manager at LRZ). Please note that the cluster object provides only the most important Slurm flags. The following parameters can or must be adjusted by the user.

Further Slurm flags can be added to the cluster object as a space-separated string using the field AdditionalSubmitArgs.
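
As a minimal sketch of how such a string could be assembled, the flags can be kept in a cell array and joined with spaces. The first two flags are the ones used above; `--mail-type=END` is only an illustrative extra sbatch flag, and the guard around `ch` is an assumption so that the snippet also runs without a configured cluster object.

```matlab
% Slurm flags to pass through to sbatch (the last entry is illustrative only)
flags = {'--cpus-per-task=1', '--mem=55G', '--mail-type=END'};

% AdditionalSubmitArgs expects one space-separated string
submit_args = strjoin(flags, ' ');
disp(submit_args);

% Apply to the cluster object if it exists (created by ch = parcluster)
if exist('ch', 'var')
    ch.AdditionalProperties.AdditionalSubmitArgs = submit_args;
end
```

Keeping the flags in a cell array makes it easy to add or remove a single flag without editing the string by hand.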

There are pre-defined parameters which may not be changed:

  • job output: logfiles for output (*.out) and error (*.err) are created automatically
>> jobdir = fullfile(getenv('SCRATCH'), 'MdcsDataLocation/coolmuc', version('-release'));
>> if ~exist(jobdir, 'dir'), mkdir(jobdir); end
>> ch.JobStorageLocation = jobdir; 

MPS stores both the results of the user code (MATLAB's "mat" file format) and the job output in the file system. By default, the HOME directory is used. For performance and capacity reasons, we highly recommend using the SCRATCH partition.

NOTE: Depending on the use case, the output might exceed the maximum size of a mat file. In that case the job still finishes successfully, but the data is lost. Hence, we also recommend that the user code writes all data directly to the file system.
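
The following sketch illustrates one way the user code could write its data directly to the file system. The matrix `C`, the directory name `mps_results` and the file name `result_C.mat` are hypothetical placeholders. The fallback to `tempdir` is an assumption so that the snippet also runs where SCRATCH is not defined. The `-v7.3` option of `save` selects the MAT-file version that supports variables larger than 2 GB.

```matlab
% Hypothetical result data standing in for the output of the user code
C = rand(200, 50);

% Target directory: SCRATCH if defined (as on the Linux Cluster), else tempdir
outroot = getenv('SCRATCH');
if isempty(outroot)
    outroot = tempdir;
end
outdir = fullfile(outroot, 'mps_results');   % hypothetical directory name
if ~exist(outdir, 'dir')
    mkdir(outdir);
end

% Write the data inside the user code instead of returning it
outfile = fullfile(outdir, 'result_C.mat');  % hypothetical file name
save(outfile, 'C', '-v7.3');

% Read the data back, e.g. later in an interactive session
loaded = load(outfile, 'C');
```

With this pattern, the user function only needs to return small values such as the file name, which keeps the job output small.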

>> ch.saveProfile;

Save settings.

3. Submit MPS job to Slurm workload manager
>> job = ch.batch(@myfunction, n_arg_out, {arg_in_1, ..., arg_in_n}, 'Pool', np);

Submit job, which will run the user code 'myfunction.m', by calling the batch function as a member of the cluster object. The input/output arguments are as follows:

Input:
@myfunction ... reference to myfunction
n_arg_out ..... number of output arguments of myfunction
arg_in_# ...... list of input arguments of myfunction
'Pool', np .... key-value pair with size of parallel pool
                = number of workers
Output:
job ........... job object providing all job information and
                member functions to control the job

IMPORTANT:
Requesting np workers results in the allocation of np+1 tasks, because MPS requires an additional management task!

Example: The job uses 14 tasks per node and 28 workers in total. With the additional management task, 29 tasks are required, so a third compute node running only one task becomes part of the job. This results in inefficient resource usage and probably longer waiting times!
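
The arithmetic behind this example can be sketched as follows. The helper names `nodes_needed` and `max_workers` are hypothetical, and the sketch assumes that tasks are packed node by node, up to the configured number of tasks per node.

```matlab
% Number of nodes occupied by np workers plus one management task
nodes_needed = @(np, tasks_per_node) ceil((np + 1) / tasks_per_node);

% Example from the text: 28 workers, 14 tasks per node => 29 tasks
n_bad = nodes_needed(28, 14);     % third node runs a single task

% Leaving room for the management task: 27 workers => 28 tasks
n_good = nodes_needed(27, 14);

% Largest pool size that fills num_nodes nodes exactly
max_workers = @(num_nodes, tasks_per_node) num_nodes * tasks_per_node - 1;
np = max_workers(2, 14);

fprintf('28 workers: %d nodes, 27 workers: %d nodes, best pool size: %d\n', ...
        n_bad, n_good, np);
```

Choosing the pool size as a multiple of the tasks per node minus one avoids the nearly empty extra node.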

4. Basic job control functions
>> job.State

Show the current state ('queued', 'running', 'finished', 'failed'). Equivalent to "squeue --clusters=cm2 --users=$USER" on the Linux command line.

>> job.cancel

Cancel the job, i.e. remove it from the Slurm queue.

>> myresults = job.fetchOutputs

Obtain all results (return values) from myfunction.
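
These functions could be combined into a simple monitoring loop, sketched below. The helper `is_done` and the polling interval of 30 seconds are assumptions. The guard only ensures that the snippet does nothing when no job object `job` and cluster object `ch` exist in the workspace. MATLAB's built-in `wait(job)` is a blocking alternative to the polling loop.

```matlab
% States after which the job will not change any more (from the list above)
is_done = @(state) any(strcmp(state, {'finished', 'failed'}));

% Assumes 'job' and 'ch' were created in steps 2 and 3
if exist('job', 'var') && exist('ch', 'var')
    while ~is_done(job.State)
        fprintf('Job %d is %s\n', job.ID, job.State);
        pause(30);                      % poll every 30 seconds
    end
    if strcmp(job.State, 'finished')
        myresults = job.fetchOutputs;   % cell array with the return values
    else
        getDebugLog(ch, job)            % print the log of the failed job
    end
end
```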

I want to submit MPS jobs from a MATLAB session on my Laptop or Desktop PC

Step | Comment

1. Prerequisites

1a. MATLAB
Install MATLAB on your computer.

The MATLAB release has to match one of the releases supported by LRZ. Please refer to the next step.

1b. Download LRZ MPS configuration
Supported MATLAB Release | File
R2021a | matlab-R2021a.mps.remote.zip

Using a file whose release does not match the installed MATLAB release will cause MPS jobs to fail.

1c. Extract zip archive and install files

For example on Linux terminal:

> unzip matlab-RYYYYx.mps.remote.zip
> cp -r matlab-RYYYYx.mps.remote/* <MATLAB_PATH>/toolbox/local/

The zip file matlab-RYYYYx.mps.remote.zip contains the directory
matlab-RYYYYx.mps.remote. Please copy its content, not the entire directory.

MATLAB_PATH refers to the base directory of your MATLAB installation.
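
A quick way to check the installation from within MATLAB is sketched below. It assumes that the copied files provide the function configCluster used in step 2; if the archive is organized differently, the function name has to be adapted.

```matlab
% Refresh MATLAB's view of the toolbox directories after copying the files
rehash toolbox;

% Look up the configuration function on the MATLAB search path
p = which('configCluster');
if isempty(p)
    disp('configCluster not found: check the copy step and <MATLAB_PATH>.');
else
    fprintf('configCluster found at: %s\n', p);
end
```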

2. Job configuration for CoolMUC-2
Please follow the instructions described in step 2 of the previous table.
3. Submit MPS job to Slurm workload manager

Please follow the instructions described in step 3 of the previous table.

After execution of the batch command you will be asked

  • to enter the full path of your HOME directory on the Linux Cluster, which should look like:
    /dss/dsshome1/XX/USERID
  • to enter your credentials (user ID and password)

Then, the job will be transferred to the cluster and submitted via Slurm.


4. Basic job control functions
Please refer to step 4 in the previous table.

Now, you are working remotely on the Linux Cluster. Execute the job control functions inside your local MATLAB installation. The commands will be transferred to the cluster via ssh (in the background).

MPS Examples

The following table shows two examples, using either an spmd environment or a parfor loop. For convenience, the MATLAB file "job_config.m" summarizes all configuration steps and submits the job to Slurm. With this example, you can test working with MPS on both the login node and your remote computer. Start MATLAB and run job_config, for example:

>> myfunction = 'myfunction_spmd';
>> % or
>> myfunction = 'myfunction_parfor';
>>
>> cluster_name = 'cm2_tiny';
>> partition_name = 'cm2_tiny';
>> walltime = '00:30:00';
>> tasks_per_node = 28;
>> num_worker = 16;
>>
>> [job,ch] = job_config(myfunction, cluster_name, partition_name, walltime, tasks_per_node, num_worker);
Configuration script | Implementation of user-defined function
job_config.m
function [job,ch] = job_config(example, ...
                               cluster_name, ...
                               partition_name, ...
                               walltime, ...
                               tasks_per_node, ...
                               num_worker)

%===============================================================================
% MATLAB MPS EXAMPLE
% -> configuration script to initialize MPS and submit job
%===============================================================================
% input:
% example .................... name of user function without extension ".m"
% cluster_name ............... name of cluster, e. g.: cm2 (refers to the HPC
%                              machine, e.g. CoolMUC-2)
% partition_name ............. name of queue/partition, e. g.: cm2_std
%                              Details on cluster/partition names:
%                              https://doku.lrz.de/x/AgaVAg
% walltime, tasks_per_node ... equivalent to Slurm parameters
% num_worker ................. number of MPS workers
% return values:
% job ........................ job handle
% ch ......................... cluster object handle
%===============================================================================

%===============================================================================
% Step 1: cluster configuration
%===============================================================================
configCluster(cluster_name, partition_name);

%===============================================================================
% Step 2: job configuration
%===============================================================================
ch = parcluster;
jobdir = fullfile(getenv('SCRATCH'), 'MdcsDataLocation/coolmuc/', version('-release'));
if ~exist(jobdir, 'dir')
    mkdir(jobdir);
end
ch.JobStorageLocation = jobdir; 
ch.AdditionalProperties.WallTime = walltime;
ch.AdditionalProperties.ProcsPerNode = tasks_per_node;
ch.saveProfile;

%===============================================================================
% Step 3: job submission to Slurm
%===============================================================================
% Command:
%   job = ch.batch(@myfunction, n_arg_out, {arg_in_1, ..., arg_in_n}, 'Pool', np)
% Input:
%   @myfunction .. reference to user-defined function myfunction.m
%   n_arg_out .... number of expected output arguments
%   {arg_in_#} ... list of input arguments of myfunction
%   'Pool', np ... key-value-pair: define size of pool (number of workers)
%
% Help via Matlab commands:
%   help batch
%   doc batch

% convert the function name into a function handle (avoids eval)
fhandle = str2func(example);
job = ch.batch(fhandle, 4, {}, 'Pool', num_worker);
spmd example
myfunction_spmd.m
function [nlabs,comptime,Cref,Cfin] = myfunction_spmd

%===================================================================
% MATLAB EXAMPLE: PARALLEL HELLO WORLD USING PCT TOOLBOX
%                 -> matrix-matrix multiplication C = A*B
%
% return values:
% nlabs ...... total number of workers (just FYI and used by data
%              distribution functions)
% comptime ... time needed for multiplication
% Cref ....... reference result obtained from serial computation
% Cfin ....... result obtained from parallel computation
%===================================================================

%===================================================================
% Input
%===================================================================
% exemplary matrices
SIZE_A = [2000 100];
SIZE_B = [100 8000];

A = zeros(SIZE_A);
B = zeros(SIZE_B);

for n=1:SIZE_A(1)
    A(n,:) = linspace(1,n, SIZE_A(2));
end
for n=1:SIZE_B(1)
    B(n,:) = linspace(1,n, SIZE_B(2));
end

% reference result
Cref = A*B;

%===================================================================
% Manage parallel pool
%===================================================================
% get number of workers:
spmd
    nl = numlabs;
end
nlabs = nl{1};

% disallow Threading
maxNumCompThreads(1);

%===================================================================
% Parallel work
%===================================================================
SIZE_C = [SIZE_A(1) SIZE_B(2)];

% parallel environment
spmd
    % distribute data to all workers
    Ad = codistributed(A, codistributor2dbc([nlabs 1]));
    Bd = codistributed(B, codistributor2dbc([1 nlabs]));
    Cd = zeros(SIZE_C, codistributor2dbc([1 nlabs]));

    % timing
    tic;
    Cd = Ad*Bd;
    t = toc;
end

% collect data from all workers => final result
Cfin = gather(Cd);

% time measured on the first worker
comptime = t{1};
parfor example
myfunction_parfor.m
function [nlabs,comptime,Cref,C] = myfunction_parfor

%===================================================================
% MATLAB EXAMPLE: PARALLEL HELLO WORLD USING PCT TOOLBOX
%                 -> matrix-matrix multiplication C = A*B
%
% return values:
% nlabs ...... total number of workers (just FYI and used by data
%              distribution functions)
% comptime ... time needed for multiplication
% Cref ....... reference result obtained from serial computation
% C .......... result obtained from parallel computation
%===================================================================

%===================================================================
% Input
%===================================================================
% exemplary matrices
SIZE_A = [2000 100];
SIZE_B = [100 8000];

A = zeros(SIZE_A);
B = zeros(SIZE_B);

for n=1:SIZE_A(1)
    A(n,:) = linspace(1,n, SIZE_A(2));
end
for n=1:SIZE_B(1)
    B(n,:) = linspace(1,n, SIZE_B(2));
end

% reference result
Cref = A*B;

%===================================================================
% Manage parallel pool
%===================================================================
% get number of workers:
spmd
    nl = numlabs;
end
nlabs = nl{1};

% disallow Threading
maxNumCompThreads(1);

%===================================================================
% Parallel work
%===================================================================
SIZE_C = [SIZE_A(1) SIZE_B(2)];
C = zeros(SIZE_C);

% timing
tic;
% parallel environment
parfor n=1:SIZE_A(1)
    % compute
    C(n,:) = A(n,:)*B;
end
comptime = toc;
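
After the job has finished, the four return values of either example can be unpacked and checked, as sketched below. To keep the snippet runnable without a cluster, small random matrices stand in for the result of `out = job.fetchOutputs;`. The variable names with suffix `_local` and the tolerance of 1e-12 are assumptions. A small deviation between the serial and the parallel result is possible, because the order of the floating-point operations may differ.

```matlab
% Stand-in for "out = job.fetchOutputs;" so that the snippet runs locally
A = rand(20, 10);
B = rand(10, 30);
Cref_local = A*B;
Cpar_local = zeros(20, 30);
for n = 1:20
    Cpar_local(n,:) = A(n,:)*B;     % row-wise product as in the parfor example
end
out = {1, 0, Cref_local, Cpar_local};

% Unpack the four return values in the order of the function signature
[nlabs, comptime, Cref, Cfin] = out{:};

% Relative deviation between serial and parallel result
rel_err = max(abs(Cref(:) - Cfin(:))) / max(abs(Cref(:)));
ok = rel_err < 1e-12;               % tolerance is an assumption
fprintf('%d workers, %.3f s, relative error %.2e, ok = %d\n', ...
        nlabs, comptime, rel_err, ok);
```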