OpenACC Labs

General Setup

For the hands-on sessions, open a terminal on the local PC and log in to HAWK via

ssh hawk.hww.hlrs.de -l username

To work interactively on a compute node submit a job using

qsub -I -l select=1:node_type=rome-ai -l walltime=08:00:00 -q R_dl

Several nodes on HAWK have been specially reserved for the course. R_dl is the name of this reservation, which is available only during the course.

Further details can be found under:

https://kb.hlrs.de/platforms/index.php/HPE_Hawk

https://kb.hlrs.de/platforms/index.php/Big_Data,_AI_Aplications_and_Frameworks#Hawk

Copy the exercises to your own home directory via

cp -r /lustre/hpe/ws10/ws10.3/ws/hpctohaa-dl-openacc/materials/OpenACC ~/

This folder contains both C and Fortran versions of the code.

Initialise the environment on the compute node via

cd ~/OpenACC; . init.sh

This runs the following commands in the current shell:

module use -a /zhome/academic/HLRS/hlrs/hlrsk40/nvhpc/modulefiles

module load nvhpc/23.5

export TMPDIR=/localscratch/$PBS_JOBID

In the labs you will implement 12 different versions of the code for multicore CPUs and GPUs. Note down the execution time of each of these versions in the results sheet handed out during the first hands-on session (also available online under https://doku.lrz.de/files/29609547/35881452/1/1688912940323/OpenACC-Labs-Resultsheet-v1.pdf).

Lab 1: Profiling with Nsight Systems

Check the configuration of the GPUs via the following commands, which are provided with the NVIDIA HPC Software Development Kit (SDK) on the compute node:

nvidia-smi

nvaccelinfo

We are using the NVIDIA HPC SDK C and Fortran compilers for the hands-on.

Compile the code using the following commands:

For C:

nvc -fast -o laplace jacobi.c laplace2d.c

For Fortran:

nvfortran -fast -o laplace laplace2d.f90 jacobi.f90

Alternatively, you can adapt the Makefile located in the C/Fortran directory and just run "make" to compile and execute the program.

To profile the code run the Nsight Systems command line interface on the compute node via:

nsys profile -t nvtx --stats=true --force-overwrite true -o laplace ./laplace

This will produce the trace files laplace.nsys-rep and laplace.sqlite. Use "nsys --help" and "nsys --help profile" to get more information about the available profiler options.

For performance reasons, visualisation of the trace files should always be done on the local PCs.

Copy the trace files to your local PC, e.g. via the following command in a local terminal:

scp username@hawk.hww.hlrs.de:OpenACC/C/*.nsys-rep . 

Run "Nsight Systems 2021.2.1" on the local PC and open the trace file via File → Open. Zoom in the timeline and also test different views like "Top-Down view". Identify the 2 most time consuming routines of the code.

Further information is available under https://developer.nvidia.com/nsight-systems

Using NVIDIA Tools Extension (NVTX)

For C:

Add #include "nvtx3/nvToolsExt.h" to a copy jacobi-nvtx.c of the source file jacobi.c and wrap the parts of the code for which you want to capture events in calls to the NVTX API functions. For example, try adding nvtxRangePush("calc") just before the call to calcNext() and nvtxRangePop() just after it. We recommend wrapping the calls to initialize(), calcNext() and swap() as well as the while-loop in which the last two functions are called.
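As a rough sketch (the argument lists of initialize(), calcNext() and swap() below are assumptions; use the ones from your copy of jacobi.c), the instrumented main part could look as follows:

#include "nvtx3/nvToolsExt.h"
...
nvtxRangePush("init");
initialize(A, Anew, m, n);
nvtxRangePop();
nvtxRangePush("while");                /* covers the whole iteration loop */
while ( error > tol && iter < iter_max )
{
    nvtxRangePush("calc");
    error = calcNext(A, Anew, m, n);
    nvtxRangePop();
    nvtxRangePush("swap");
    swap(A, Anew, m, n);
    nvtxRangePop();
    iter++;
}
nvtxRangePop();                        /* closes the "while" range */

Every nvtxRangePush() must be matched by exactly one nvtxRangePop(), otherwise the ranges will be nested incorrectly in the timeline.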

Compile the NVTX-annotated code as before:

nvc -fast -o laplace-nvtx jacobi-nvtx.c laplace2d.c

For Fortran:

For Fortran codes, an additional Fortran module nvtx in the file nvtx.f90 is needed to instrument the code. Add use nvtx to a copy jacobi-nvtx.f90 of the source file jacobi.f90 and wrap the parts of the code for which you want to capture events in calls to the NVTX API functions. For example, try adding call nvtxStartRange("calc") just before the call to calcNext() and call nvtxEndRange() just after it. We recommend wrapping the calls to initialize(), calcNext() and swap() as well as the while-loop in which the last two functions are called.

Compile the NVTX-annotated code, including the additional nvtx module, and link the library libnvToolsExt as follows:

nvfortran -fast -o laplace-nvtx laplace2d.f90 jacobi-nvtx.f90 nvtx.f90 -lnvToolsExt

Profile again on the compute node using

nsys profile -t nvtx --stats=true --force-overwrite true -o laplace ./laplace-nvtx

and visualise the profile data on the local PC.

Further information can be found in the CUDA Profiler User’s Guide https://docs.nvidia.com/cuda/profiler-users-guide/index.html

If you want, you can also use the NVTX-annotated code for profiling in the following labs.

Lab 2: OpenACC Directives

Compiling for multicore CPUs

Compile the code for a multicore CPU using the following commands:

For C:

nvc -fast -acc=multicore -Minfo=accel,opt -o laplace-multicore jacobi.c laplace2d.c

For Fortran:

nvfortran -fast -acc=multicore -Minfo=accel,opt -o laplace-multicore jacobi.f90 laplace2d.f90

Run and profile the code on the compute node and compare with the profiling results of the previous lab.

Using OpenACC for multicore CPUs with the parallel directive

For C:

Make a copy laplace2d-multicore-parallel.c of the file laplace2d.c and include the following directives in the right places to parallelise the calcNext() and swap() functions (a sketch follows after the directives):

#pragma acc parallel loop

#pragma acc loop
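As an illustration, the swap() function could then look as follows; this is only a sketch which assumes the flat array layout and the OFFSET indexing macro used in laplace2d.c (your copy may differ):

#pragma acc parallel loop
for( int j = 1; j < n-1; j++ )
{
    #pragma acc loop
    for( int i = 1; i < m-1; i++ )
    {
        A[OFFSET(j, i, m)] = Anew[OFFSET(j, i, m)];
    }
}

Note that the max reduction on the error variable in calcNext() needs a reduction(max:error) clause on its parallel loop directive unless the compiler reports that it generates an implicit reduction (check the -Minfo output).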

For Fortran:

Make a copy laplace2d-multicore-parallel.f90 of the file laplace2d.f90 and include the following directives in the right places to parallelise the calcNext() and swap() functions:

!$acc parallel loop

!$acc loop

Compile the code for multicore CPUs using the following commands:

For C:

nvc -fast -acc=multicore -Minfo=accel,opt -o laplace-multicore-parallel jacobi.c laplace2d-multicore-parallel.c

For Fortran:

nvfortran -fast -acc=multicore -Minfo=accel,opt -o laplace-multicore-parallel jacobi.f90 laplace2d-multicore-parallel.f90

Run the code on the compute node.

Using OpenACC for multicore CPUs with the kernels directive

For C:

Make a copy laplace2d-multicore-kernels.c of the file laplace2d.c and include the following directives in the right places to parallelise the calcNext() and swap() functions (a sketch follows below):

#pragma acc kernels
{
...
}

If you run into a situation where the compiler refuses to parallelise a loop, you may override the compiler's decision (at your own risk!) via

#pragma acc kernels loop independent

#pragma acc loop independent
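A sketch, again for the swap() loops and assuming the OFFSET indexing of laplace2d.c; in contrast to parallel, the kernels directive leaves the parallelisation decisions to the compiler, so check the -Minfo=accel output:

#pragma acc kernels
{
    for( int j = 1; j < n-1; j++ )
    {
        for( int i = 1; i < m-1; i++ )
        {
            A[OFFSET(j, i, m)] = Anew[OFFSET(j, i, m)];
        }
    }
}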

For Fortran:

Make a copy laplace2d-multicore-kernels.f90 of the file laplace2d.f90 and include the following directives in the right places to parallelise the calcNext() and swap() functions:

!$acc kernels
...
!$acc end kernels

If you run into a situation where the compiler refuses to parallelise a loop, you may override the compiler's decision (at your own risk!) via

!$acc kernels loop independent

!$acc loop independent

Compile the code for a multicore CPU using the following commands:

For C:

nvc -fast -acc=multicore -Minfo=accel,opt -o laplace-multicore-kernels jacobi.c laplace2d-multicore-kernels.c

For Fortran:

nvfortran -fast -acc=multicore -Minfo=accel,opt -o laplace-multicore-kernels jacobi.f90 laplace2d-multicore-kernels.f90

Run the code on the compute node.

Profiling OpenACC code

Profile both versions of the code on the compute node via

nsys profile -t nvtx,openacc --stats=true --force-overwrite true -o laplace-multicore-parallel ./laplace-multicore-parallel

nsys profile -t nvtx,openacc --stats=true --force-overwrite true -o laplace-multicore-kernels ./laplace-multicore-kernels

Compare with the previous profiling results.

Lab 3: GPU Programming

Programming GPUs using OpenACC with CUDA managed memory

Compile the code using the following commands, including the option -gpu=managed to use CUDA managed memory:

For C:

Reuse the files laplace2d-multicore-parallel.c and laplace2d-multicore-kernels.c from the previous lab.

nvc -fast -acc=gpu -gpu=managed -Minfo=accel,opt -o laplace-gpu-managed-parallel jacobi.c laplace2d-multicore-parallel.c

nvc -fast -acc=gpu -gpu=managed -Minfo=accel,opt -o laplace-gpu-managed-kernels jacobi.c laplace2d-multicore-kernels.c

For Fortran:

Reuse the files laplace2d-multicore-parallel.f90 and laplace2d-multicore-kernels.f90 from the previous lab.

nvfortran -fast -acc=gpu -gpu=managed -Minfo=accel,opt -o laplace-gpu-managed-parallel jacobi.f90 laplace2d-multicore-parallel.f90

nvfortran -fast -acc=gpu -gpu=managed -Minfo=accel,opt -o laplace-gpu-managed-kernels jacobi.f90 laplace2d-multicore-kernels.f90

Run and profile both versions of the code on the compute node and compare with the previous profiling results.

Programming GPUs using OpenACC without CUDA managed memory

For C:

Make copies laplace2d-gpu-parallel.c and laplace2d-gpu-kernels.c of laplace2d-multicore-parallel.c and laplace2d-multicore-kernels.c, respectively.

Compile the code using the following commands:

nvc -fast -acc=gpu -Minfo=accel,opt -o laplace-gpu-parallel jacobi.c laplace2d-gpu-parallel.c

nvc -fast -acc=gpu -Minfo=accel,opt -o laplace-gpu-kernels jacobi.c laplace2d-gpu-kernels.c

For Fortran:

Make copies laplace2d-gpu-parallel.f90 and laplace2d-gpu-kernels.f90 of laplace2d-multicore-parallel.f90 and laplace2d-multicore-kernels.f90, respectively.

Compile the code using the following commands:

nvfortran -fast -acc=gpu -Minfo=accel,opt -o laplace-gpu-parallel jacobi.f90 laplace2d-gpu-parallel.f90

nvfortran -fast -acc=gpu -Minfo=accel,opt -o laplace-gpu-kernels jacobi.f90 laplace2d-gpu-kernels.f90

Check if the code compiles and runs correctly.

Modify the code to correctly specify the sizes of the arrays to be copied to/from the GPU. The array sizes can be specified with the copy, copyin, create etc. clauses of the parallel/kernels directives (see the sketch below), e.g.:

In C:

copy(array[starting_index:length])

In Fortran:

copy(array(starting_index:ending_index))
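As an illustration for the C version, the swap() loops could be shaped as follows. The arrays are assumed to be flat allocations of m*n doubles as in laplace2d.c; these clauses are one safe choice, not necessarily the minimal one:

#pragma acc parallel loop copyin(Anew[0:m*n]) copy(A[0:m*n])
for( int j = 1; j < n-1; j++ )
{
    #pragma acc loop
    for( int i = 1; i < m-1; i++ )
    {
        A[OFFSET(j, i, m)] = Anew[OFFSET(j, i, m)];
    }
}

copy is used for A because only its interior is written on the GPU, so the boundary values have to be transferred in before the loop and the whole array transferred back out afterwards.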

Run and profile both versions of the code on the compute node and compare with the previous profiling results.

Lab 4: Data Management

OpenACC Structured Data Directive

For C:

Make a copy jacobi-gpu-data-structured.c of the file jacobi.c and include the following directive, with the right data clauses (copy, copyin, create etc.), in the right place to create a data region (a sketch follows below):

#pragma acc data <data clauses>
{
...
}
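One possible placement is around the while-loop in jacobi-gpu-data-structured.c, e.g. (again assuming flat m*n arrays and the argument lists shown earlier; finding minimal data clauses is part of the exercise):

#pragma acc data copy(A[0:m*n]) copyin(Anew[0:m*n])
{
    while ( error > tol && iter < iter_max )
    {
        error = calcNext(A, Anew, m, n);
        swap(A, Anew, m, n);
        ...
        iter++;
    }
}

The data region keeps A and Anew present on the GPU for the whole iteration, so the data clauses on the compute constructs inside laplace2d-gpu-parallel.c/laplace2d-gpu-kernels.c no longer trigger transfers.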

Compile the code using the following commands:

nvc -fast -acc=gpu -Minfo=accel,opt -o laplace-gpu-data-structured-parallel jacobi-gpu-data-structured.c laplace2d-gpu-parallel.c

nvc -fast -acc=gpu -Minfo=accel,opt -o laplace-gpu-data-structured-kernels jacobi-gpu-data-structured.c laplace2d-gpu-kernels.c

Run and profile both versions of the code on the compute node and compare with the previous profiling results.

For Fortran:

Make a copy jacobi-gpu-data-structured.f90 of the file jacobi.f90 and include the following directive, with the right data clauses (copy, copyin, create etc.), in the right place to create a data region:

!$acc data <data clauses>
...
!$acc end data

Compile the code using the following commands:

nvfortran -fast -acc=gpu -Minfo=accel,opt -o laplace-gpu-data-structured-parallel jacobi-gpu-data-structured.f90 laplace2d-gpu-parallel.f90

nvfortran -fast -acc=gpu -Minfo=accel,opt -o laplace-gpu-data-structured-kernels jacobi-gpu-data-structured.f90 laplace2d-gpu-kernels.f90

Run and profile both versions of the code on the compute node and compare with the previous profiling results.

OpenACC Unstructured Data Directives

For C:

Make copies laplace2d-gpu-data-unstructured-parallel.c and laplace2d-gpu-data-unstructured-kernels.c of laplace2d-gpu-parallel.c and laplace2d-gpu-kernels.c, respectively.

Include the following directive at the end of the initialize() function with the right data clauses (copy, copyin, create, delete etc.):

#pragma acc enter data <data clauses>

Include the following directive at the beginning of the deallocate() function with the right data clauses (copy, copyin, create, delete etc.); a sketch of both directives follows below:

#pragma acc exit data <data clauses>
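A sketch, assuming A and Anew are flat m*n arrays allocated in initialize() and freed in deallocate(); if the host needed the final contents of A, copyout(A[0:m*n]) could be used on the exit data directive instead of delete:

/* at the end of initialize() */
#pragma acc enter data copyin(A[0:m*n], Anew[0:m*n])

/* at the beginning of deallocate() */
#pragma acc exit data delete(A[0:m*n], Anew[0:m*n])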

Compile the code using the following commands:

nvc -fast -acc=gpu -Minfo=accel,opt -o laplace-gpu-data-unstructured-parallel jacobi.c laplace2d-gpu-data-unstructured-parallel.c

nvc -fast -acc=gpu -Minfo=accel,opt -o laplace-gpu-data-unstructured-kernels jacobi.c laplace2d-gpu-data-unstructured-kernels.c

Run and profile both versions of the code on the compute node and compare with the previous profiling results.

For Fortran:

Make copies laplace2d-gpu-data-unstructured-parallel.f90 and laplace2d-gpu-data-unstructured-kernels.f90 of laplace2d-gpu-parallel.f90 and laplace2d-gpu-kernels.f90, respectively.

Include the following directive at the end of the initialize() function with the right data clauses (copy, copyin, create, delete etc.):

!$acc enter data <data clauses>

Include the following directive at the beginning of the deallocate() function with the right data clauses (copy, copyin, create, delete etc.):

!$acc exit data <data clauses>

Compile the code using the following commands:

nvfortran -fast -acc=gpu -Minfo=accel,opt -o laplace-gpu-data-unstructured-parallel jacobi.f90 laplace2d-gpu-data-unstructured-parallel.f90

nvfortran -fast -acc=gpu -Minfo=accel,opt -o laplace-gpu-data-unstructured-kernels jacobi.f90 laplace2d-gpu-data-unstructured-kernels.f90

Run and profile both versions of the code on the compute node and compare with the previous profiling results.