

MATLAB Parallel Server (MPS, formerly known as MATLAB Distributed Computing Server (MDCS) in R2018b and older releases) extends the functionality of the Parallel Computing Toolbox (PCT) by allowing parallel jobs to run across multiple compute nodes. MPS jobs are handled as common parallel jobs by the Slurm queueing system in the background. MPS jobs can be submitted both from the login nodes of the Linux Cluster and from the user's remote computer (e.g. laptop, desktop PC). Both ways are briefly described below. Please also consult the MPS User Guide for use on CoolMUC-2 for detailed information and the MATLAB Parallel Server documentation for general information.


Note

Please note: this MATLAB product is available to particular TUM users only.

Submit MPS Job to the Linux Cluster

In order to run your parallel MATLAB code using PCT + MPS, follow the steps described in the tables in the following sections.

I want to submit MPS jobs from a MATLAB session on the Linux Cluster Login Node

Step | Comment
1. Log in to one of the Linux Cluster login nodes, load a MATLAB module, and start MATLAB

Getting Started: MATLAB Modules
2. Job configuration for CoolMUC-2


Codeblock
languagetext
>> configCluster(cluster_name, partition_name);


Run the cluster configuration. This step is mandatory; otherwise, MATLAB will use its default cluster settings (the 'local' cluster), which will not work! Both the name of the cluster (e.g. cm2) and the name of the partition (= queue, e.g. cm2_std) have to be passed to configCluster().

Note

Please check the requirements of your job, i.e. the number of tasks (workers) and tasks per node. Then set the correct cluster and partition.



Codeblock
languagetext
>> ch = parcluster;


Create a cluster object and return the cluster object handle.


Codeblock
languagetext
>> % Job walltime in format hh:mm:ss => --time=00:30:00 in Slurm
>> ch.AdditionalProperties.WallTime = '00:30:00';

>> % MPI tasks per node => --ntasks-per-node=28 in Slurm
>> ch.AdditionalProperties.ProcsPerNode = 28;		

>> % additional: disabling multi-threading and setting memory requirement
>> ch.AdditionalProperties.AdditionalSubmitArgs = '--cpus-per-task=1 --mem=55G';  


Define job parameters (members of the cluster object). MPS translates all settings into the corresponding Slurm flags (needed by the sbatch command; see the documentation of the Slurm Workload Manager at LRZ). Note that only the most important Slurm flags are exposed by the cluster object. The following parameters can or must be adjusted by the user.

Further Slurm flags can be added to the cluster object as a space-separated string using the field AdditionalSubmitArgs.

There are pre-defined parameters which may not be changed:

  • job output: logfiles for output (*.out) and error (*.err) are created automatically


Codeblock
languagetext
>> jobdir = fullfile(getenv('SCRATCH'), 'MdcsDataLocation/coolmuc', version('-release'));
>> if ~exist(jobdir, 'dir'), mkdir(jobdir); end
>> ch.JobStorageLocation = jobdir; 


MPS stores both the results of the user code (MATLAB's "mat" file format) and the job output to the file system. By default, the HOME directory is used. For performance and capacity reasons, we highly recommend using the SCRATCH partition.

NOTE: Depending on the use case, the output might exceed the maximum size of a mat file. The job will still finish successfully, but the data will be lost. Hence, we also recommend that the user code writes all data directly to the file system.
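As a sketch of that recommendation (all names here are hypothetical, not part of the MPS setup), the user function can write large results to SCRATCH itself instead of returning them through the job's mat file:

Codeblock
languagetext
function myfunction_with_save
% hypothetical user function: writes a large result directly to the
% file system instead of returning it as an output argument
outdir = fullfile(getenv('SCRATCH'), 'my_results');   % assumed target directory
if ~exist(outdir, 'dir')
    mkdir(outdir);
end
C = rand(2000, 8000);                                 % placeholder for a large result
save(fullfile(outdir, 'result.mat'), 'C', '-v7.3');   % -v7.3 supports files > 2 GB
end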


Codeblock
languagetext
>> ch.saveProfile;


Save the settings to the cluster profile.

3. Submit MPS job to Slurm workload manager


Codeblock
languagetext
>> job = ch.batch(@myfunction, n_arg_out, {arg_in_1, ..., arg_in_n}, 'Pool', np);


Submit the job, which will run the user code 'myfunction.m', by calling the batch function as a member of the cluster object. The input/output arguments are as follows:

Codeblock
languagetext
Input:
@myfunction ... reference to myfunction
n_arg_out ..... number of output arguments of myfunction
arg_in_# ...... list of input arguments of myfunction
'Pool', np .... key-value pair with size of parallel pool
                = number of workers
Output:
job ........... job object providing all job information and
                member functions to control the job


Warning

IMPORTANT:
Setting np workers results in the allocation of np+1 tasks, because MPS requires an additional management task!

Example: A job uses 14 tasks per node and 28 workers in total. Including the additional management task, a third compute node running only one task will be included in the job. This results in inefficient resource usage and probably longer waiting times!
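The node count can be checked with simple arithmetic in MATLAB (illustrative only, not part of the MPS API):

Codeblock
languagetext
>> np = 28; tasks_per_node = 14;             % workers and tasks per node
>> nodes = ceil((np + 1) / tasks_per_node)   % 29 tasks -> 3 nodes
>> % choosing np = 27 instead fills exactly two nodes:
>> ceil((27 + 1) / tasks_per_node)           % 28 tasks -> 2 nodes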


4. Basic job control functions


Codeblock
languagetext
>> % show current job state ('queued', 'running', 'finished', 'failed'),
>> % equivalent to "squeue --clusters=cm2 --users=$USER" on the Linux command line
>> job.State

>> % cancel the job, i.e. remove it from the Slurm queue
>> job.cancel

>> % obtain all results (return values) from myfunction
>> myresults = job.fetchOutputs

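If a workflow should block until the job completes, the wait member function (standard Parallel Computing Toolbox API) can be combined with fetchOutputs, for example:

Codeblock
languagetext
>> % block until the job has left the queue, then collect the results
>> job.wait;
>> if strcmp(job.State, 'finished')
>>     myresults = job.fetchOutputs;
>> end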

I want to submit MPS jobs from a MATLAB session on my Laptop or Desktop PC

Step | Comment

1. Prerequisites

1a. MATLAB
Install MATLAB on your computer.


Warning

The MATLAB release has to match one of the releases supported by LRZ. Please refer to the next step.


1b. Download LRZ MPS configuration


Supported MATLAB Release | File
R2021a | matlab-R2021a.mps.remote.zip



Warning

Using a file whose MPS release does not match the installed MATLAB release will cause MPS jobs to fail.


1c. Extract zip archive and install files

For example, on a Linux terminal:

Codeblock
languagebash
> unzip matlab-RYYYYx.mps.remote.zip
> cp -r matlab-RYYYYx.mps.remote/* <MATLAB_PATH>/toolbox/local/



Note

The zip file matlab-RYYYYx.mps.remote.zip contains the directory
matlab-RYYYYx.mps.remote. Please copy its contents, not the entire directory.

MATLAB_PATH refers to the base directory of your MATLAB installation.
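If you are unsure of MATLAB_PATH, the matlabroot function (standard MATLAB) prints the base directory of the running installation:

Codeblock
languagetext
>> matlabroot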


2. Job configuration for CoolMUC-2
Please follow the instructions described in step 2 of the previous table.
3. Submit MPS job to Slurm workload manager

Please follow the instructions described in step 3 of the previous table.

After executing the batch command, you will be asked

  • to enter the full path of your HOME directory on the Linux Cluster, which should look like:
    /dss/dsshome1/XX/USERID
  • to enter your credentials (user ID and password)

Then, the job will be transferred to the cluster and submitted via Slurm.


4. Basic job control functions
Please refer to step 4 in the previous table. Now you are working remotely on the Linux Cluster. Execute job control functions inside your local MATLAB installation; the commands are transferred to the cluster via ssh in the background.

MPS Examples

The following table shows two examples, using either the spmd environment or a parfor loop. For convenience, the MATLAB file "job_config.m" summarizes all configuration steps and submits the job to Slurm. Using this example, you can test working with MPS both on a login node and on your remote computer. Start MATLAB and run job_config, for example:

Codeblock
languagetext
>> myfunction = 'myfunction_spmd';
>> % or
>> myfunction = 'myfunction_parfor';
>>
>> cluster_name = 'cm2_tiny';
>> partition_name = 'cm2_tiny';
>> walltime = '00:30:00';
>> tasks_per_node = 28;
>> num_worker = 16;
>>
>> [job,ch] = job_config(myfunction, cluster_name, partition_name, walltime, tasks_per_node, num_worker);


Configuration script | Implementation of user-defined function



Codeblock
languagetext
firstline1
titlejob_config.m
linenumberstrue
collapsetrue
function [job,ch] = job_config(example, ...
                               cluster_name, ...
                               partition_name, ...
                               walltime, ...
                               tasks_per_node, ...
                               num_worker)

%===============================================================================
% MATLAB MPS EXAMPLE
% -> configuration script to initialize MPS and submit job
%===============================================================================
% input:
% example .................... name of user function without extension ".m"
% cluster_name ............... name of cluster, e. g.: cm2 (refers to the HPC
%                              machine, e.g. CoolMUC-2)
% partition_name ............. name of queue/partition, e. g.: cm2_std
%                              Details on cluster/partition names:
%                              https://doku.lrz.de/x/AgaVAg
% walltime, tasks_per_node ... equivalent to Slurm parameters
% num_worker ................. number of MPS workers
% return values:
% job ........................ job handle
% ch ......................... cluster object handle
%===============================================================================

%===============================================================================
% Step 1: cluster configuration
%===============================================================================
configCluster(cluster_name, partition_name);

%===============================================================================
% Step 2: job configuration
%===============================================================================
ch = parcluster;
jobdir = fullfile(getenv('SCRATCH'), 'MdcsDataLocation/coolmuc/', version('-release'));
if ~exist(jobdir, 'dir')
    mkdir(jobdir);
end
ch.JobStorageLocation = jobdir; 
ch.AdditionalProperties.WallTime = walltime;
ch.AdditionalProperties.ProcsPerNode = tasks_per_node;
ch.saveProfile;

%===============================================================================
% Step 3: job submission to Slurm
%===============================================================================
% Command:
%   job = ch.batch(@myfunction, n_arg_out, {arg_in_1, ..., arg_in_n}, 'Pool', np)
% Input:
%   @myfunc ...... reference to user-defined function/script myfunc.m
%   n_arg_out .... number of expected output arguments
%   {arg_in_#} ... list of input arguments of myfunction
%   'Pool', np ... key-value-pair: define size of pool (number of workers)
%
% Help via Matlab commands:
%   help batch
%   doc batch

fhandle = str2func(example);   % convert function name to function handle
job = ch.batch(fhandle, 4, {}, 'Pool', num_worker);





spmd example


Codeblock
languagetext
firstline1
titlemyfunction_spmd.m
linenumberstrue
collapsetrue
function [nlabs,comptime,Cref,Cfin] = myfunction_spmd

%===================================================================
% MATLAB EXAMPLE: PARALLEL HELLO WORLD USING PCT TOOLBOX
%                 -> matrix-matrix multiplication C = A*B
%
% return values:
% nlabs ...... total number of workers (just FYI and used by data
%              distribution functions)
% comptime ... time needed for multiplication
% Cref ....... reference result obtained from serial computation
% Cfin ....... result obtained from parallel computation
%===================================================================

%===================================================================
% Input
%===================================================================
% exemplary matrices
SIZE_A = [2000 100];
SIZE_B = [100 8000];

A = zeros(SIZE_A);
B = zeros(SIZE_B);

for n=1:SIZE_A(1)
    A(n,:) = linspace(1,n, SIZE_A(2));
end
for n=1:SIZE_B(1)
    B(n,:) = linspace(1,n, SIZE_B(2));
end

% reference result
Cref = A*B;

%===================================================================
% Manage parallel pool
%===================================================================
% get number of workers:
spmd
    nl = numlabs;
end
nlabs = nl{:};

% disallow Threading
maxNumCompThreads(1);

%===================================================================
% Parallel work
%===================================================================
SIZE_C = [SIZE_A(1) SIZE_B(2)];

% parallel environment
spmd
    % distribute data to all workers
    Ad = codistributed(A, codistributor2dbc([nlabs 1]));
    Bd = codistributed(B, codistributor2dbc([1 nlabs]));
    Cd = zeros(SIZE_C, codistributor2dbc([1 nlabs]));

    % timing
    tic;
    Cd = Ad*Bd;
    t = toc;
end

% collect data from all workers => final result
Cfin = gather(Cd);

comptime = t{:}


parfor example


Codeblock
languagetext
firstline1
titlemyfunction_parfor.m
linenumberstrue
collapsetrue
function [nlabs,comptime,Cref,C] = myfunction_parfor

%===================================================================
% MATLAB EXAMPLE: PARALLEL HELLO WORLD USING PCT TOOLBOX
%                 -> matrix-matrix multiplication C = A*B
%
% return values:
% nlabs ...... total number of workers (just FYI and used by data
%              distribution functions)
% comptime ... time needed for multiplication
% Cref ....... reference result obtained from serial computation
% C .......... result obtained from parallel computation
%===================================================================

%===================================================================
% Input
%===================================================================
% exemplary matrices
SIZE_A = [2000 100];
SIZE_B = [100 8000];

A = zeros(SIZE_A);
B = zeros(SIZE_B);

for n=1:SIZE_A(1)
    A(n,:) = linspace(1,n, SIZE_A(2));
end
for n=1:SIZE_B(1)
    B(n,:) = linspace(1,n, SIZE_B(2));
end

% reference result
Cref = A*B;

%===================================================================
% Manage parallel pool
%===================================================================
% get number of workers:
spmd
    nl = numlabs;
end
nlabs = nl{:};

% disallow Threading
maxNumCompThreads(1);

%===================================================================
% Parallel work
%===================================================================
SIZE_C = [SIZE_A(1) SIZE_B(2)];
C = zeros(SIZE_C);

% timing
tic;
% parallel environment
parfor n=1:SIZE_A(1)
    % compute
    C(n,:) = A(n,:)*B;
end
comptime = toc;