

#### **Overview**







- The workshop is co-organized by LRZ and NVIDIA Deep Learning Institute (DLI) for the Partnership for Advanced Computing in Europe (PRACE).
- NVIDIA Deep Learning Institute (DLI) offers hands-on training for developers, data scientists, and researchers looking to solve challenging problems with deep learning.
- This 4-day online workshop combines lectures on the fundamentals of Deep Learning for Multiple Data Types and Multi-GPUs with lectures on Accelerated Computing with OpenACC and CUDA C/C++.
- Learn how to accelerate your applications with OpenACC and CUDA, how to train and deploy a neural network to solve real-world problems, and how to effectively parallelize the training of deep neural networks across multiple GPUs.
- The lectures are interleaved with many hands-on sessions using Jupyter Notebooks. The exercises will be done on a fully configured GPU-accelerated workstation in the cloud.









#### DEEP LEARNING INSTITUTE

DLI Mission: Help the world to solve the most challenging problems using AI and deep learning

We help developers, data scientists and engineers to get started in architecting, optimizing, and deploying neural networks to solve real-world problems in diverse industries such as autonomous vehicles, healthcare, robotics, media & entertainment and game development.

## Fundamentals of Accelerated Computing with CUDA C/C++







- You learn the basics of CUDA C/C++ by:
  - Accelerating CPU-only applications so that their latent parallelism runs on GPUs.
  - Using essential CUDA memory management techniques to optimize accelerated applications.
  - Exposing an accelerated application's potential for concurrency and exploiting it with CUDA streams.
  - Using command-line and visual profiling to guide and check your work.
- Upon completion, you'll be able to accelerate and optimize existing C/C++ CPU-only applications using the most essential CUDA tools and techniques. You'll understand an iterative style of CUDA development that will allow you to ship accelerated applications fast.

#### **Tentative Agenda**







#### **Workshop Webpage**







- Lecture material will be made available at:
  - https://tinyurl.com/dli-workshop-lrz

- Access to the CUDA C/C++ code: see the chat window.

#### **Training Setup**







- To get started, follow these steps:
- Create an NVIDIA Developer account at <a href="http://courses.nvidia.com/join">http://courses.nvidia.com/join</a>. Select "Log in with my NVIDIA Account" and then "Create Account".
- If you use your own laptop, make sure that WebSockets work for you. Test your laptop at <a href="http://websocketstest.com">http://websocketstest.com</a>:
  - Under ENVIRONMENT, confirm that "WebSockets" is checked yes.
  - Under WEBSOCKETS (PORT 80), confirm that "Data Receive", "Send", and "Echo Test" are checked yes.
  - If there are issues with WebSockets, try updating your browser. We recommend Chrome, Firefox, or Safari for optimal performance.
- Visit <a href="http://courses.nvidia.com/dli-event">http://courses.nvidia.com/dli-event</a> and enter the event code provided by the instructor.
- You're ready to get started.







#### **Enjoy the course!**







#### Moore's law is dead!

The long-held notion that the processing power of computers increases exponentially every couple of years has hit its limit.

The free lunch is over.

The future is parallel!



Typical example of an Intel chip: Core i7, 7th Gen

- 4 CPU cores
- with hyperthreading
- Each with 8-wide AVX instructions
- GPU with 1280 processing elements

#### Programming on chip:

- Serial C/C++ code alone uses only a very small fraction of the resources available on the chip
- Using vectorisation allows you to fully utilise the resources of a single hyper-thread
- Using multi-threading allows you to fully utilise all CPU cores

#### Does the GPU need to be used?









Intel Kaby Lake-S











Using heterogeneous programming allows you to dispatch and fully utilise the entire chip.









#### GPU programming:

- Limited to a specific domain
- Separate-source solutions
- Verbose low-level APIs

Programming models and tools include:

- SYCL
- CUDA C/C++
- Kokkos
- HPX
- OpenCL
- oneAPI & DPC++
- NVPTX, RAJA...



#### Why do we need GPUs on HPC?







- Increase in parallelism.
- Today, real applications invest a nearly equal amount of effort in using CPUs and GPUs.
- GPUs are well suited to deep learning.
- NVIDIA software uses CUDA.

#### Why do we need "accelerators" on HPC?

Fundamentals of Accelerated Computing with CUDA C/C++ | 13.07.2021; Allalen@lrz.de

Top500.org



| Rank | System | Cores | Rmax (TFlop/s) | Rpeak (TFlop/s) | Power (kW) |
|------|--------|-------|----------------|-----------------|------------|
| 1 | Supercomputer Fugaku - Supercomputer Fugaku, A64FX 48C 2.2GHz, Tofu interconnect D, Fujitsu RIKEN Center for Computational Science Japan | 7,630,848 | 442,010.0 | 537,212.0 | 29,899 |
| 2 | Summit - IBM Power System AC922, IBM POWER9 22C 3.07GHz, NVIDIA Volta GV100, Dual-rail Mellanox EDR Infiniband, IBM DOE/SC/Oak Ridge National Laboratory United States | 2,414,592 | 148,600.0 | 200,794.9 | 10,096 |
| 3 | Sierra - IBM Power System AC922, IBM POWER9 22C 3.1GHz, NVIDIA Volta GV100, Dual-rail Mellanox EDR Infiniband, IBM / NVIDIA / Mellanox DOE/NNSA/LLNL United States | 1,572,480 | 94,640.0 | 125,712.0 | 7,438 |
| 4 | Sunway TaihuLight - Sunway MPP, Sunway SW26010 260C 1.45GHz, Sunway, NRCPC National Supercomputing Center in Wuxi China | 10,649,600 | 93,014.6 | 125,435.9 | 15,371 |
| 5 | Perlmutter - HPE Cray EX235n, AMD EPYC 7763 64C 2.45GHz, NVIDIA A100 SXM4 40 GB, Slingshot-10, HPE DOE/SC/LBNL/NERSC United States | 706,304 | 64,590.0 | 89,794.5 | 2,528 |
| 6 | Selene - NVIDIA DGX A100, AMD EPYC 7742 64C 2.25GHz, NVIDIA A100, Mellanox HDR Infiniband, Nvidia NVIDIA Corporation United States | 555,520 | 63,460.0 | 79,215.0 | 2,646 |
| 7 | Tianhe-2A - TH-IVB-FEP Cluster, Intel Xeon E5-2692v2 12C 2.2GHz, TH Express-2, Matrix-2000, NUDT National Super Computer Center in Guangzhou China | 4,981,760 | 61,444.5 | 100,678.7 | 18,482 |
| 8 | JUWELS Booster Module - Bull Sequana XH2000, AMD EPYC 7402 24C 2.8GHz, NVIDIA A100, Mellanox HDR InfiniBand/ParTec ParaStation ClusterSuite, Atos Forschungszentrum Juelich (FZJ) Germany | 449,280 | 44,120.0 | 70,980.0 | 1,764 |
| 9 | HPC5 - PowerEdge C4140, Xeon Gold 6252 24C 2.1GHz, NVIDIA Tesla V100, Mellanox HDR Infiniband, Dell EMC Eni S.p.A. Italy | 669,760 | 35,450.0 | 51,720.8 | 2,252 |

#### Why do we need "accelerators" on HPC?



| Rank | TOP500 Rank | System | Cores | Rmax (TFlop/s) | Power (kW) | Power Efficiency (GFlops/W) |
|----|-----|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------|----------|-------|--------|
| 1  | 335 | MN-3 - MN-Core Server, Xeon Platinum 8260M<br>24C 2.4GHz, Preferred Networks MN-Core,<br>MN-Core DirectConnect, Preferred Networks<br>Preferred Networks<br>Japan                                                  | 1,664   | 1,822.4  | 61    | 29.700 |
| 2  | 22  | HiPerGator AI - NVIDIA DGX A100, AMD EPYC<br>7742 64C 2.25GHz, NVIDIA A100, Infiniband<br>HDR, Nvidia<br>University of Florida<br>United States                                                                    | 138,880 | 17,200.0 | 583   | 29.521 |
| 3  | 100 | Wilkes-3 - PowerEdge XE8545, AMD EPYC<br>7763 64C 2.45GHz, NVIDIA A100 80GB,<br>Infiniband HDR200 dual rail, Dell EMC<br>University of Cambridge<br>United Kingdom                                                 | 44,800  | 4,124.0  | 147   | 28.144 |
| 4  | 36  | MeluXina - Accelerator Module - BullSequana<br>XH2000, AMD EPYC 7452 32C 2.35GHz, NVIDIA<br>A100 40GB, Mellanox HDR InfiniBand/ParTec<br>ParaStation ClusterSuite, Atos<br>LuxProvide<br>Luxembourg                | 99,200  | 10,520.0 | 390   | 26.957 |
| 5  | 214 | NVIDIA DGX SuperPOD - NVIDIA DGX A100,<br>AMD EPYC 7742 64C 2.25GHz, NVIDIA A100,<br>Mellanox HDR Infiniband, Nvidia<br>NVIDIA Corporation<br>United States                                                        | 19,840  | 2,356.0  | 90    | 26.195 |
| 6  | 5   | Perlmutter - HPE Cray EX235n, AMD EPYC<br>7763 64C 2.45GHz, NVIDIA A100 SXM4 40 GB,<br>Slingshot-10, HPE<br>DOE/SC/LBNL/NERSC<br>United States                                                                     | 706,304 | 64,590.0 | 2,528 | 25.550 |
| 7  | 8   | JUWELS Booster Module - Bull Sequana<br>XH2000, AMD EPYC 7402 24C 2.8GHz, NVIDIA<br>A100, Mellanox HDR InfiniBand/ParTec<br>ParaStation ClusterSuite, Atos<br>Forschungszentrum Juelich (FZJ)<br>Germany           | 449,280 | 44,120.0 | 1,764 | 25.008 |
| 8  | 43  | JURECA Data Centric Module - BullSequana<br>XH2000, AMD EPYC 7742 64C 2.25GHz, NVIDIA<br>A100 40GB, Mellanox HDR InfiniBand/ParTec<br>ParaStation ClusterSuite, Atos<br>Forschungszentrum Juelich (FZJ)<br>Germany | 105,840 | 9,330.0  | 384   | 24.291 |
| 9  | 189 | Spartan2 - Bull Sequana XH2000 , AMD EPYC<br>7402 24C 2.8GHz, NVIDIA A100, Mellanox HDR<br>Infiniband, Atos<br>Atos<br>France                                                                                      | 23,040  | 2,566.0  | 106   | 24.262 |
| 10 | 93  | Wisteria/BDEC-01 (Aquarius) - PRIMERGY<br>GX2570 M6, Xeon Platinum 8360Y 36C 2.4GHz,<br>NVIDIA A100 SXM4 40 GB, Infiniband HDR,<br>Fujitsu<br>Information Technology Center, The University                        | 42,120  | 4,425.0  | 184   | 24.058 |
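The last column of the table above can be read as power efficiency in GFlops per watt, which is approximately Rmax divided by power (TFlop/s per kW equals GFlop/s per W). As a worked check using the Perlmutter row:

$$
\text{efficiency} \approx \frac{R_{\max}}{P} = \frac{64{,}590\ \text{TFlop/s}}{2{,}528\ \text{kW}} \approx 25.55\ \text{GFlop/s per W}
$$

This matches the listed value of 25.550. Other rows may differ slightly from this simple ratio, presumably because the listed power values are rounded.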

#### **GPU vs CPU Architecture**









**CPU**

- More control structures and fewer processing units
- Newer CPUs have more parallelism
- Optimised for latency, which requires quite a lot of power

**GPU**

- Large number of small cores
- Fewer control structures and more processing units
- Less flexible programming model
- More restrictions, but requires a lot less power
- High throughput
- Newer GPUs: scatter/gather memory access and better flow control (becoming more CPU-like)

A GPU devotes more transistors to data processing than to data caching and flow control. The same program is executed on many data elements in parallel.





#### What and Why CUDA C/C++?







- CUDA = "Compute Unified Device Architecture"
  - Introduced and released in 2006 for the GeForce 8800
  - GPU = dedicated super-threaded, massively data-parallel co-processor
- C/C++ plus a few simple extensions
- Compute-oriented drivers, language, and tools

Allows HPC programmers to model problems and achieve speedups of up to 100x.

#### **Documentation:**

- CUDA\_C\_Programming\_Guide.pdf
- CUDA\_C\_Getting\_Started.pdf
- CUDA\_C\_Toolkit\_Release.pdf

#### **CUDA Programming Model**

- A kernel is executed as a grid of thread blocks
- All threads share data memory space
- A thread block is a batch of threads that can cooperate with each other by:
  - Synchronizing their execution
  - Efficiently sharing data through a low latency shared memory
- Two threads from two different blocks cannot cooperate
- Sequential (host) code launches GPU kernels asynchronously
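As a minimal sketch of the points above, the following example launches a grid of thread blocks in which the threads of each block cooperate through shared memory and `__syncthreads()`, while the per-block results are combined on the host because blocks cannot cooperate with each other. The kernel name `blockSum`, the helper `sumOnGpu`, and the block size of 256 are illustrative assumptions, not part of the workshop material; Unified Memory is used only to keep the example short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int THREADS_PER_BLOCK = 256;  // illustrative choice; must be a power of two here

// Each thread block sums its own slice of the input and writes one partial sum.
__global__ void blockSum(const int *in, int *partial, int n)
{
    __shared__ int cache[THREADS_PER_BLOCK];            // shared by the threads of ONE block

    int gid = blockIdx.x * blockDim.x + threadIdx.x;    // global index within the grid
    cache[threadIdx.x] = (gid < n) ? in[gid] : 0;
    __syncthreads();                                    // block-wide barrier

    // Tree reduction: threads of the same block cooperate through shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        partial[blockIdx.x] = cache[0];
}

// Hypothetical helper: sums n ones on the GPU and returns the total.
long long sumOnGpu(int n)
{
    const int blocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    int *in = nullptr, *partial = nullptr;
    cudaMallocManaged(&in, n * sizeof(int));
    cudaMallocManaged(&partial, blocks * sizeof(int));
    for (int i = 0; i < n; ++i) in[i] = 1;

    blockSum<<<blocks, THREADS_PER_BLOCK>>>(in, partial, n);  // launch returns immediately
    cudaDeviceSynchronize();                                   // host waits for the whole grid

    // Blocks cannot cooperate with each other, so partial sums are combined on the host.
    long long total = 0;
    for (int b = 0; b < blocks; ++b) total += partial[b];

    cudaFree(in);
    cudaFree(partial);
    return total;
}

int main()
{
    const int n = 1 << 20;
    printf("sum = %lld, expected = %d\n", sumOnGpu(n), n);
    return 0;
}
```

Error checking is omitted here for brevity; a real application would check the return value of every CUDA call.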









#### CUDA C/C++







#### Terminology:

- **Host**: the CPU and its memory (host memory)
- **Device**: the GPU and its memory (device memory)

# CUDA Devices and Threads Execution Model









#### CUDA C/C++









- The CPU allocates memory on the GPU.
- The CPU copies data from the CPU to the GPU.
- The CPU launches kernels on the GPU.
- The CPU copies data from the GPU back to the CPU.
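The four steps above can be sketched with explicit device memory management. This is only an illustration under assumptions: the kernel `scale`, the helper `scaleOnGpu`, and the launch configuration of 256 threads per block are hypothetical choices, and error checking is left out for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Multiplies every element by a factor; one thread per element.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper that walks through the four host-driven steps.
void scaleOnGpu(float *h_data, int n, float factor)
{
    const size_t bytes = n * sizeof(float);
    float *d_data = nullptr;

    cudaMalloc(&d_data, bytes);                                  // 1. allocate memory on the GPU
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // 2. copy CPU -> GPU
    scale<<<(n + 255) / 256, 256>>>(d_data, factor, n);          // 3. launch the kernel
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // 4. copy GPU -> CPU
                                                                 //    (waits for the kernel first)
    cudaFree(d_data);
}

int main()
{
    const int n = 1024;
    float h_data[n];
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    scaleOnGpu(h_data, n, 2.0f);
    printf("h_data[0] = %.1f\n", h_data[0]);
    return 0;
}
```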



#### **NVCC** Compiler







NVIDIA provides a CUDA C compiler: nvcc

- nvcc splits your code into two parts: host code and device code.
- Device code is sent to the NVIDIA device compiler; host code is passed to the host C/C++ compiler.
- nvcc is capable of linking together both host and device code into a single executable.
- Convention: C++ source files containing CUDA syntax are typically given the extension .cu.
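A small example of a source file that contains both host and device code may help here. The file name `hello.cu` and the function names are assumptions chosen for illustration; the compile command in the comment uses only the basic `nvcc -o` form.

```cuda
// hello.cu (the file name is only an example)
// Compile and run:  nvcc -o hello hello.cu && ./hello
#include <cstdio>
#include <cuda_runtime.h>

// Host code: compiled by the host compiler, runs on the CPU.
void helloFromCpu()
{
    printf("Hello from the CPU (host code).\n");
}

// Device code: compiled by the device compiler, runs on the GPU.
__global__ void helloFromGpu()
{
    printf("Hello from the GPU: block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main()
{
    helloFromCpu();
    helloFromGpu<<<2, 4>>>();     // 2 blocks of 4 threads each
    cudaDeviceSynchronize();      // wait for the kernel so its output is flushed
    return 0;
}
```

The order in which the GPU lines are printed across blocks is not guaranteed.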









## Lab1: Accelerating Applications with CUDA C/C++

Dr. Momme Allalen Leibniz Computing Centre, Munich Germany - www.lrz.de

Deep Learning Certified Instructor, NVIDIA Deep Learning Institute NVIDIA Corporation.








#### **Prerequisites**

You should already be able to:

- Declare variables, write loops, and use if / else statements in C.
- Define and invoke functions in C.
- Allocate arrays in C.
- No previous CUDA knowledge is required.

#### **Objectives**

By the time you complete this lab, you will be able to:

- Write, compile, and run C/C++ programs that both call CPU functions and launch GPU kernels.
- Control parallel thread hierarchy using execution configuration.
- Refactor serial loops to execute their iterations in parallel on a GPU.
- Allocate and free memory available to both CPUs and GPUs.
- Handle errors generated by CUDA code.
- Accelerate CPU-only applications.
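Several of these objectives can be seen together in one short sketch: memory available to both CPU and GPU (`cudaMallocManaged`), a serial loop refactored into a grid-stride loop, and basic error handling. The helper names `checkCuda` and `doubleOnGpu` and the launch configuration `<<<32, 256>>>` are assumptions made for this example, not code taken from the lab.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a readable message if a CUDA call failed.
inline void checkCuda(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Grid-stride loop: works even when there are fewer threads than elements.
__global__ void doubleElements(int *a, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int i = idx; i < n; i += stride)
        a[i] *= 2;
}

// Hypothetical helper: doubles 0..n-1 on the GPU and verifies the result on the CPU.
bool doubleOnGpu(int n)
{
    int *a = nullptr;
    checkCuda(cudaMallocManaged(&a, n * sizeof(int)), "cudaMallocManaged");
    for (int i = 0; i < n; ++i) a[i] = i;           // initialised on the CPU

    doubleElements<<<32, 256>>>(a, n);              // execution configuration: 32 blocks x 256 threads
    checkCuda(cudaGetLastError(), "kernel launch");         // errors from the launch itself
    checkCuda(cudaDeviceSynchronize(), "kernel execution"); // errors raised while the kernel ran

    bool ok = true;
    for (int i = 0; i < n; ++i)
        if (a[i] != 2 * i) { ok = false; break; }

    checkCuda(cudaFree(a), "cudaFree");
    return ok;
}

int main()
{
    printf("all elements doubled: %s\n", doubleOnGpu(10000) ? "yes" : "no");
    return 0;
}
```

Kernel launches do not return an error code, which is why the sketch queries `cudaGetLastError()` right after the launch and checks the result of `cudaDeviceSynchronize()` afterwards.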









**nvc** is a C11 compiler for NVIDIA GPUs and AMD, Intel, OpenPOWER, and Arm CPUs. It invokes the C compiler, assembler, and linker for the target processors with options derived from its command line arguments. nvc supports ISO C11, supports GPU programming with OpenACC, and supports multicore CPU programming with OpenACC and OpenMP.

**nvc++** is a C++17 compiler for NVIDIA GPUs and AMD, Intel, OpenPOWER, and Arm CPUs. It invokes the C++ compiler, assembler, and linker for the target processors with options derived from its command line arguments. nvc++ supports ISO C++17, supports GPU and multicore CPU programming with C++17 parallel algorithms, OpenACC, and OpenMP.

#### nvfortran, nvcc Compiler







**nvfortran** is a Fortran compiler for NVIDIA GPUs and AMD, Intel, OpenPOWER, and Arm CPUs. It invokes the Fortran compiler, assembler, and linker for the target processors with options derived from its command line arguments. nvfortran supports ISO Fortran 2003 and many features of ISO Fortran 2008, supports GPU programming with CUDA Fortran, and GPU and multicore CPU programming with ISO Fortran parallel language features, OpenACC, and OpenMP.

**nvcc** is the CUDA C and CUDA C++ compiler driver for NVIDIA GPUs. nvcc accepts a range of conventional compiler options, such as for defining macros and include/library paths, and for steering the compilation process. nvcc produces optimized code for NVIDIA GPUs and drives a supported host compiler for AMD, Intel, OpenPOWER, and Arm CPUs.









# Lab2: Managing Accelerated Application Memory with CUDA Unified Memory and nsys








#### **Prerequisites**

You should already be able to:

- Write, compile, and run C/C++ programs that both call CPU functions and launch GPU kernels.
- Control parallel thread hierarchy using execution configuration.
- Refactor serial loops to execute their iterations in parallel on a GPU.
- Allocate and free Unified Memory.

#### **Objectives**

By the time you complete this lab, you will be able to:

- Use the NVIDIA Nsight Systems command line tool (nsys) to profile accelerated application performance.
- Use an understanding of **Streaming Multiprocessors** to optimize execution configurations.
- Understand the behavior of Unified Memory with regard to page faulting and data migrations.
- Use asynchronous memory prefetching to reduce page faults and data migrations for increased performance.
- Employ an iterative development cycle to rapidly accelerate and deploy applications.
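The sketch below illustrates two of these objectives: sizing the grid from the number of Streaming Multiprocessors, and prefetching Unified Memory to avoid page faults. It assumes the `cudaMemPrefetchAsync(ptr, bytes, device)` form of the call that was current in CUDA toolkits at the time of this workshop; later toolkit releases may use a different signature. The helper name `initOnGpu` and the factor of 32 blocks per SM are illustrative assumptions, not tuned values.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void initWith(float value, float *a, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int i = idx; i < n; i += stride)
        a[i] = value;
}

// Hypothetical helper: fills n floats on the GPU and verifies them on the CPU.
bool initOnGpu(int n, float value)
{
    int deviceId = 0, numSMs = 0;
    cudaGetDevice(&deviceId);
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, deviceId);

    const size_t bytes = n * sizeof(float);
    float *a = nullptr;
    cudaMallocManaged(&a, bytes);

    // Move the pages to the GPU up front instead of faulting them in one by one.
    cudaMemPrefetchAsync(a, bytes, deviceId);

    // Number of blocks chosen as a multiple of the SM count (a heuristic, not a rule).
    initWith<<<32 * numSMs, 256>>>(value, a, n);

    // Move the pages back to the CPU before the host reads them.
    cudaMemPrefetchAsync(a, bytes, cudaCpuDeviceId);
    cudaDeviceSynchronize();

    bool ok = true;
    for (int i = 0; i < n; ++i)
        if (a[i] != value) { ok = false; break; }

    cudaFree(a);
    return ok;
}

int main()
{
    // Profile with, for example:  nsys profile --stats=true ./a.out
    printf("initialised correctly: %s\n", initOnGpu(1 << 24, 3.0f) ? "yes" : "no");
    return 0;
}
```

Comparing the nsys output with and without the two prefetch calls is one way to observe the effect on page faults and memory migrations.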

#### **CUDA® PROFILING TOOLS**







- nvvp: NVIDIA Visual Profiler
- nvprof: tool to understand and optimize the performance of your CUDA, OpenACC or OpenMP applications

#### Application level opportunities

- Overall application performance
- Overlap CPU and GPU work, identify the bottlenecks (CPU or GPU)
- Overall GPU utilization and efficiency
  - Overlap compute and memory copies
  - Utilize compute and copy engines effectively

#### Kernel level opportunities

- Use memory bandwidth efficiently
- Use compute resources efficiently
- Hide instruction and memory latency

There are more features, for example dependency analysis.

Command: nvprof --dependency-analysis --cpu-thread-tracing on ./executable\_cuda




#### **NSIGHT PRODUCT FAMILY**







#### **Standalone Performance Tools:**

- Nsight Systems: system-wide application algorithm tuning
- Nsight Compute: debug and profile specific CUDA kernels
- Nsight Graphics: analyze and optimize specific graphics workloads


#### **IDE Plugins**

Nsight Eclipse Edition/Visual Studio – editor, debugger, some perf analysis

nvprof is replaced by: nsys profile --stats=true ./exe

Docs/product: https://developer.nvidia.com/nsight-systems

#### **NSIGHT SYSTEMS**







- System-wide application algorithm tuning
  - Multi-process tree support
- Locate optimization opportunities
  - Visualize millions of events on a very fast GUI timeline
  - Spot gaps of unused CPU and GPU time
- Balance your workload across multiple CPUs and GPUs
  - CPU algorithms, utilization, and thread state
  - GPU streams, kernels, memory transfers, etc.
- Multi-platform: Linux & Windows, x86-64, Tegra, Power, macOS (host only)
- GPUs: Volta, Turing ...

Docs/product: https://developer.nvidia.com/nsight-systems

#### **NSIGHT COMPUTE**







#### **CUDA Kernel profiler**

- Targeted metric sections for various performance aspects (debug and profile)
- Very high-frequency GPU performance counters; customizable data collection and presentation (tables, charts, ...)
- Python-based rules for guided analysis (or postprocessing)
- GPUs: Volta, Turing, Ampere

Docs/product: https://developer.nvidia.com/nsight-systems











- **Nsight Systems**: analyze application algorithms system-wide.
- **Nsight Compute**: debug/optimize CUDA kernels. Dive into CUDA kernels by using metrics/counter collection for detailed CUDA kernel performance.
- **Nsight Graphics**: debug/optimize graphics workloads, with detailed frame/render performance.
- After each optimization step, re-check overall performance.









**NVIDIA Tools Extension API Library (NVTX)** 

The NVIDIA Tools Extension SDK (NVTX) is a C-based Application Programming Interface (API) for annotating events, code ranges, and resources in your applications.

Applications which integrate NVTX can use NVIDIA Nsight VSE to capture and visualize these events and ranges.

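```
#include <stdint.h>
#include <unistd.h>        // usleep, syscall (Linux assumed)
#include <sys/syscall.h>   // SYS_gettid
#include <nvToolsExt.h>    // NVTX; link with -lnvToolsExt

void Wait(int waitMilliseconds)
{
    nvtxRangePush(__FUNCTION__);        // open a range named after this function
    nvtxMark("Waiting...");             // instantaneous marker on the timeline
    usleep(waitMilliseconds * 1000);    // the Windows equivalent is Sleep(waitMilliseconds)
    nvtxRangePop();                     // close the range
}

int main(void)
{
    // Name the calling OS thread; NVTX expects the OS thread id as first argument.
    nvtxNameOsThread((uint32_t)syscall(SYS_gettid), "MAIN");
    nvtxRangePush(__FUNCTION__);
    Wait(100);                          // 100 ms is an arbitrary example value
    nvtxRangePop();
    return 0;
}
```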

nsys profile -t nvtx --stats=true ...









# Lab3: Asynchronous Streaming, and Visual Profiling with CUDA C/C++








#### **Prerequisites**

To get the most out of this lab you should already be able to:

- Write, compile, and run C/C++ programs that both call CPU functions and launch GPU kernels.
- Control parallel thread hierarchy using execution configuration.
- Refactor serial loops to execute their iterations in parallel on a GPU.
- Allocate and free CUDA Unified Memory.
- Understand the behaviour of Unified Memory with regard to page faulting and data migrations.
- Use asynchronous memory prefetching to reduce page faults and data migrations.

#### **Objectives**

By the time you complete this lab you will be able to:

- Use Nsight Systems to visually profile the timeline of GPU-accelerated CUDA applications.
- Use Nsight Systems to identify, and exploit, optimization opportunities in GPU-accelerated CUDA applications.
- Utilize CUDA streams for concurrent kernel execution in accelerated applications.
- (Optional Advanced Content) Use manual memory allocation, including allocating pinned memory, in order to asynchronously transfer data in concurrent CUDA streams.

#### **Multiple Streams**









Goal: overlap copies with kernel execution.

With a single stream (Stream 0), all operations run one after another: memcpy A to GPU, memcpy B to GPU, kernel, memcpy C from GPU, and then the same sequence again for the next chunk.



#### **Multiple Streams**







```
// Process the input in chunks of N elements, alternating between two streams.
// host_a, host_b and host_c must be page-locked (pinned) host buffers.
for (int i = 0; i < FULL_SIZE; i += N * 2) {
    // stream0: copy the locked memory to the device, async
    cudaMemcpyAsync(dev_a0, host_a + i, N * sizeof(int), cudaMemcpyHostToDevice, stream0);
    cudaMemcpyAsync(dev_b0, host_b + i, N * sizeof(int), cudaMemcpyHostToDevice, stream0);
    kernel<<<N / 256, 256, 0, stream0>>>(dev_a0, dev_b0, dev_c0);
    // copy the data from device to locked memory
    cudaMemcpyAsync(host_c + i, dev_c0, N * sizeof(int), cudaMemcpyDeviceToHost, stream0);

    // stream1: copy the locked memory to the device, async
    cudaMemcpyAsync(dev_a1, host_a + i + N, N * sizeof(int), cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(dev_b1, host_b + i + N, N * sizeof(int), cudaMemcpyHostToDevice, stream1);
    kernel<<<N / 256, 256, 0, stream1>>>(dev_a1, dev_b1, dev_c1);
    // copy the data from device to locked memory
    cudaMemcpyAsync(host_c + i + N, dev_c1, N * sizeof(int), cudaMemcpyDeviceToHost, stream1);
}
```
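The loop above leaves out the surrounding setup. A self-contained sketch of the same pattern could look as follows; it adds stream creation, pinned host allocation, synchronization, and cleanup. The chunk size, the simple addition kernel, and the helper name `runTwoStreams` are assumptions for illustration, since the original kernel body is not shown, and the two streams are stored in arrays rather than as `stream0`/`stream1`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int N = 1 << 18;          // elements per chunk (illustrative)
constexpr int FULL_SIZE = N * 8;    // total elements, a multiple of 2 * N

// Stand-in kernel (assumption): element-wise addition of one chunk.
__global__ void kernel(const int *a, const int *b, int *c)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) c[idx] = a[idx] + b[idx];
}

// Hypothetical helper: returns true if all chunks were processed correctly.
bool runTwoStreams()
{
    cudaStream_t stream[2];
    int *dev_a[2], *dev_b[2], *dev_c[2];
    for (int s = 0; s < 2; ++s) {
        cudaStreamCreate(&stream[s]);
        cudaMalloc(&dev_a[s], N * sizeof(int));
        cudaMalloc(&dev_b[s], N * sizeof(int));
        cudaMalloc(&dev_c[s], N * sizeof(int));
    }

    // Page-locked (pinned) host memory is needed for truly asynchronous copies.
    int *host_a, *host_b, *host_c;
    cudaMallocHost(&host_a, FULL_SIZE * sizeof(int));
    cudaMallocHost(&host_b, FULL_SIZE * sizeof(int));
    cudaMallocHost(&host_c, FULL_SIZE * sizeof(int));
    for (int i = 0; i < FULL_SIZE; ++i) { host_a[i] = i; host_b[i] = 2 * i; }

    // Each stream handles every second chunk; work in different streams may overlap.
    for (int i = 0; i < FULL_SIZE; i += 2 * N) {
        for (int s = 0; s < 2; ++s) {
            const int offset = i + s * N;
            cudaMemcpyAsync(dev_a[s], host_a + offset, N * sizeof(int),
                            cudaMemcpyHostToDevice, stream[s]);
            cudaMemcpyAsync(dev_b[s], host_b + offset, N * sizeof(int),
                            cudaMemcpyHostToDevice, stream[s]);
            kernel<<<N / 256, 256, 0, stream[s]>>>(dev_a[s], dev_b[s], dev_c[s]);
            cudaMemcpyAsync(host_c + offset, dev_c[s], N * sizeof(int),
                            cudaMemcpyDeviceToHost, stream[s]);
        }
    }
    for (int s = 0; s < 2; ++s)
        cudaStreamSynchronize(stream[s]);   // wait until both streams have drained

    bool ok = true;
    for (int i = 0; i < FULL_SIZE; ++i)
        if (host_c[i] != 3 * i) { ok = false; break; }

    for (int s = 0; s < 2; ++s) {
        cudaFree(dev_a[s]); cudaFree(dev_b[s]); cudaFree(dev_c[s]);
        cudaStreamDestroy(stream[s]);
    }
    cudaFreeHost(host_a); cudaFreeHost(host_b); cudaFreeHost(host_c);
    return ok;
}

int main()
{
    printf("result correct: %s\n", runTwoStreams() ? "yes" : "no");
    return 0;
}
```

Operations issued to the same stream run in order, so reusing the device buffers of a stream in the next iteration is safe. Whether copies and kernels of the two streams actually overlap depends on the hardware and can be checked on the Nsight Systems timeline.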

