





# **Fundamentals of Accelerated Computing with CUDA C/C++**

Dr. Momme Allalen | CSC | 09.09.2020

CSC

IT4INNOVATIONS NATIONAL SUPERCOMPUTING CENTER

#### **Overview**



- The workshop is co-organized by LRZ, CSC, IT4Innovations and NVIDIA Deep Learning Institute (DLI) for the Partnership for Advanced Computing in Europe (PRACE).
- NVIDIA Deep Learning Institute (DLI) offers hands-on training for developers, data scientists, and researchers looking to solve challenging problems with deep learning.
- This 4-day workshop, offered online for the first time, combines lectures about Fundamentals of Deep Learning for Multiple Data Types and Multi-GPUs with lectures about Accelerated Computing with OpenACC and CUDA C/C++.
- Learn how to train and deploy a neural network to solve real-world problems, how to generate effective descriptions of content within images and video clips, how to effectively parallelize training of deep neural networks on Multi-GPUs and how to accelerate your applications with OpenACC and CUDA C/C++.
- The lectures are interleaved with many hands-on sessions using Jupyter Notebooks. The exercises will be done on a fully configured GPU-accelerated workstation in the cloud.



## **PRACE Training**






Mission: Serve as European hubs and key drivers of advanced high-quality training for researchers working in the computational sciences.

http://www.training.prace-ri.eu/







# DEEP LEARNING INSTITUTE

DLI Mission: Help the world to solve the most challenging problems using AI and deep learning

We help developers, data scientists and engineers to get started in architecting, optimizing, and deploying neural networks to solve real-world problems in diverse industries such as autonomous vehicles, healthcare, robotics, media & entertainment and game development.



# Fundamentals of Accelerated Computing with CUDA C/C++



- You learn the basics of CUDA C/C++ by:
  - Accelerating CPU-only applications to run their latent parallelism on GPUs.
  - Utilizing essential CUDA memory management techniques to optimize accelerated applications.
  - Exposing accelerated application potential for concurrency and exploiting it with CUDA streams.
  - Leveraging command line and visual profiling to guide and check your work.
- Upon completion, you'll be able to accelerate and optimize existing C/C++ CPU-only applications using the most essential CUDA tools and techniques. You'll understand an iterative style of CUDA development that will allow you to ship accelerated applications fast.



#### **Tentative Agenda**



- 10:00-10:15 Introduction to CUDA C/C++
- 10:15-12:00 Accelerating Applications with CUDA C/C++
- 12:00-13:00 Lunch Break
- 13:00-14:20 Managing Accelerated Application Memory with CUDA Unified Memory and nsys
- 14:20-14:30 Coffee Break
- 14:30-15:45 Asynchronous Streaming and Visual Profiling for Accelerated Applications with CUDA C/C++
- 15:45-16:00 Q&A, Final Remarks



## Workshop Webpage



- Lecture material will be made available under:
  - https://tinyurl.com/dl-gpu-workshop-csc
- Access to the CUDA C/C++ code: see the chat window.



## **Training Setup**



To get started, follow these steps:

- Create an NVIDIA Developer account at http://courses.nvidia.com/join. Select "Log in with my NVIDIA Account" and then "Create Account".
- If you use your own laptop, make sure that WebSockets work for you. Test your laptop at http://websocketstest.com
  - Under ENVIRONMENT, confirm that "WebSockets" is checked yes.
  - Under WEBSOCKETS (PORT 80), confirm that "Data Receive", "Send", and "Echo Test" are checked yes.
  - If there are issues with WebSockets, try updating your browser. We recommend Chrome, Firefox, or Safari for optimal performance.
- Visit http://courses.nvidia.com/dli-event and enter the event code provided by the instructor.
- You're ready to get started.



And now ...



## **Enjoy the course!**





#### Moore's law is dead!

The long-held notion that the processing power of computers increases exponentially every couple of years has hit its limit.

The free lunch is over.

The future is parallel!





Typical example of an Intel chip: Core i7, 7th Gen

- 4 CPU cores
- with hyperthreading
- Each with 8-wide AVX instructions
- GPU with 1280 processing elements

Programming on chip:

- Serial C/C++ code alone takes advantage of only a very small amount of the available resources of the chip
- Using vectorisation allows you to fully utilise the resources of a single hyper-thread
- Using multi-threading allows you to fully utilise all CPU cores

# Does the GPU need to be used?



#### Intel Kaby Lake-S







Using heterogeneous programming allows you to dispatch and fully utilise the entire chip.


GPU programming:

- Limited only to a specific domain
- Separate source solutions
- Verbose low-level APIs
- C++ AMP
- CUDA C/C++
- Kokkos
- HPX
- Raja
- SYCL
- NVPTX



## Why do we need GPUs on HPC?




- Increase in parallelism
- Today, real applications put almost as much effort into using GPUs as into using CPUs
- GPUs well-suited to deep learning.




Why do we need "accelerators" on HPC? Top500.org (slide annotation: NVIDIA GPUs)

| Rank | System | Cores | Rmax (TFlop/s) | Rpeak (TFlop/s) | Power (kW) |
|------|--------|-------|----------------|-----------------|------------|
| 1 | Supercomputer Fugaku - Supercomputer Fugaku, A64FX 48C 2.2GHz, Tofu interconnect D, Fujitsu<br>RIKEN Center for Computational Science<br>Japan | 7,299,072 | 415,530.0 | 513,854.7 | 28,335 |
| 2 | Summit - IBM Power System AC922, IBM POWER9 22C 3.07GHz, NVIDIA Volta GV100, Dual-rail Mellanox EDR Infiniband, IBM<br>DOE/SC/Oak Ridge National Laboratory<br>United States | 2,414,592 | 148,600.0 | 200,794.9 | 10,096 |
| 3 | Sierra - IBM Power System AC922, IBM POWER9 22C 3.1GHz, NVIDIA Volta GV100, Dual-rail Mellanox EDR Infiniband, IBM / NVIDIA / Mellanox<br>DOE/NNSA/LLNL<br>United States | 1,572,480 | 94,640.0 | 125,712.0 | 7,438 |
| 4 | Sunway TaihuLight - Sunway MPP, Sunway SW26010 260C 1.45GHz, Sunway, NRCPC<br>National Supercomputing Center in Wuxi<br>China | 10,649,600 | 93,014.6 | 125,435.9 | 15,371 |
| 5 | Tianhe-2A - TH-IVB-FEP Cluster, Intel Xeon E5-2692v2 12C 2.2GHz, TH Express-2, Matrix-2000, NUDT<br>National Super Computer Center in Guangzhou<br>China | 4,981,760 | 61,444.5 | 100,678.7 | 18,482 |
| 6 | HPC5 - PowerEdge C4140, Xeon Gold 6252 24C 2.1GHz, NVIDIA Tesla V100, Mellanox HDR Infiniband, Dell EMC<br>Eni S.p.A.<br>Italy | 669,760 | 35,450.0 | 51,720.8 | 2,252 |
| 7 | Selene - DGX A100 SuperPOD, AMD EPYC 7742 64C 2.25GHz, NVIDIA A100, Mellanox HDR Infiniband, Nvidia<br>NVIDIA Corporation<br>United States | 272,800 | 27,580.0 | 34,568.6 | 1,344 |
| 8 | Frontera - Dell C6420, Xeon Platinum 8280 28C 2.7GHz, Mellanox InfiniBand HDR, Dell EMC<br>Texas Advanced Computing Center/Univ. of Texas<br>United States | 448,448 | 23,516.4 | 38,745.9 | |
| 9 | Marconi-100 - IBM Power System AC922, IBM POWER9 16C 3GHz, Nvidia Volta V100, Dual-rail Mellanox EDR Infiniband, IBM<br>CINECA<br>Italy | 347,776 | 21,640.0 | 29,354.0 | 1,476 |
| 10 | Piz Daint - Cray XC50, Xeon E5-2690v3 12C 2.6GHz, Aries interconnect, NVIDIA | 387,872 | 21,230.0 | 27,154.3 | 2,384 |

Why do we need "accelerators" on HPC? Green top500 (slide annotation: NVIDIA GPUs)

| Rank | TOP500 Rank | System | Cores | Rmax (TFlop/s) | Power (kW) | Power Efficiency (GFlops/watts) |
|------|-------------|--------|-------|----------------|------------|---------------------------------|
| 1 | 393 | MN-3 - MN-Core Server, Xeon 8260M 24C 2.4GHz, MN-Core, RoCEv2/MN-Core DirectConnect, Preferred Networks<br>Japan | 2,080 | 1,621.1 | 77 | 21.108 |
| 2 | 7 | Selene - DGX A100 SuperPOD, AMD EPYC 7742 64C 2.25GHz, NVIDIA A100, Mellanox HDR Infiniband, Nvidia<br>NVIDIA Corporation<br>United States | 272,800 | 27,580.0 | 1,344 | 20.518 |
| 3 | 468 | NA-1 - ZettaScaler-2.2, Xeon D-1571 16C 1.3GHz, Infiniband EDR, PEZY-SC2 700MHz, PEZY Computing / Exascaler Inc.<br>PEZY Computing K.K.<br>Japan | 1,271,040 | 1,303.2 | 80 | 18.433 |
| 4 | 204 | A64FX prototype - Fujitsu A64FX, Fujitsu A64FX 48C 2GHz, Tofu interconnect D, Fujitsu<br>Fujitsu Numazu Plant<br>Japan | 36,864 | 1,999.5 | 118 | 16.876 |
| 5 | 26 | AiMOS - IBM Power System AC922, IBM POWER9 20C 3.45GHz, NVIDIA Volta GV100, Dual-rail Mellanox EDR Infiniband, IBM<br>Rensselaer Polytechnic Institute Center for Computational Innovations (CCI)<br>United States | 130,000 | 8,339.0 | 512 | 16.285 |
| 6 | 6 | HPC5 - PowerEdge C4140, Xeon Gold 6252 24C 2.1GHz, NVIDIA Tesla V100, Mellanox HDR Infiniband, Dell EMC<br>Eni S.p.A.<br>Italy | 669,760 | 35,450.0 | 2,252 | 15.740 |
| 7 | 421 | Satori - IBM Power System AC922, IBM POWER9 20C 2.4GHz, Infiniband EDR, NVIDIA Tesla V100 SXM2, IBM<br>MIT/MGHPCC Holyoke, MA<br>United States | 23,040 | 1,464.0 | 94 | 15.574 |
| 8 | 2 | Summit - IBM Power System AC922, IBM POWER9 22C 3.07GHz, NVIDIA Volta GV100, Dual-rail Mellanox EDR Infiniband, IBM<br>DOE/SC/Oak Ridge National Laboratory<br>United States | 2,414,592 | 148,600.0 | 10,096 | 14.719 |
| 9 | 1 | Supercomputer Fugaku - Supercomputer Fugaku, A64FX 48C 2.2GHz, Tofu interconnect D, Fujitsu<br>RIKEN Center for Computational Science<br>Japan | 7,299,072 | 415,530.0 | 28,335 | 14.665 |
| 10 | 9 | Marconi-100 - IBM Power System AC922, IBM POWER9 16C 3GHz, | 347,776 | 21,640.0 | 1,476 | 14.661 |

# **GPU vs CPU Architecture**




CPU:

- Small number of large cores
- More control structures and fewer processing units
- Optimised for latency, which requires quite a lot of power

GPU:

- Large number of small cores
- Fewer control structures and more processing units
- Less flexible programming model
- More restrictions, but requires a lot less power


- The GPU devotes more transistors to data processing rather than data caching and flow control. The same program is executed on many data elements in parallel.



#### PCI Express 4.0 Host Interface



Fundamentals of Accelerated Computing with CUDA C/C++ | CSC | 09.09.2020; Allalen@lrz.de

# What is CUDA C/C++ ?



- CUDA = "Compute Unified Device Architecture"
  - Introduced in 2006
  - GPU = dedicated super-threaded, massively data parallel co-processor
- C/C++ plus a few simple extensions
  - Compute-oriented drivers, language, and tools

Documentation:

- CUDA\_C\_Programming\_Guide.pdf
- CUDA\_C\_Getting\_Started.pdf
- CUDA\_C\_Toolkit\_Release.pdf



# **CUDA Programming Model**

- A kernel is executed as a grid of thread blocks
- All threads share data memory space
- A thread block is a batch of threads that can cooperate with each other by:
  - Synchronizing their execution
  - Efficiently sharing data through a low latency shared memory
- Two threads from two different blocks cannot cooperate
- Sequential code launches GPU kernels asynchronously
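The points above can be sketched in a few lines. This is only an illustration under stated assumptions: the kernel name `reverseWithinBlock`, the helper `runBlockReverse` and the block size of 64 are hypothetical and not part of the course material. Each block reverses its own tile of an array, so threads of one block cooperate through shared memory and `__syncthreads()`, while different blocks never exchange data.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kBlockSize = 64;  // threads per block (assumed value)

// Each block reverses its own tile of the array. Threads of one block
// cooperate through shared memory; different blocks never exchange data.
__global__ void reverseWithinBlock(int *data)
{
    __shared__ int tile[kBlockSize];

    int local = threadIdx.x;                           // index inside the block
    int gidx = blockIdx.x * blockDim.x + threadIdx.x;  // index inside the grid

    tile[local] = data[gidx];
    __syncthreads();  // wait until every thread of this block has stored its element

    data[gidx] = tile[blockDim.x - 1 - local];
}

// Runs the kernel on numBlocks * kBlockSize elements and checks the result.
bool runBlockReverse(int numBlocks)
{
    int n = numBlocks * kBlockSize;
    int *data = nullptr;
    if (cudaMallocManaged(&data, n * sizeof(int)) != cudaSuccess) return false;
    for (int i = 0; i < n; ++i) data[i] = i;

    // The kernel runs as a grid of numBlocks thread blocks.
    reverseWithinBlock<<<numBlocks, kBlockSize>>>(data);

    // The launch returns immediately; synchronize before the host reads the data.
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    for (int b = 0; ok && b < numBlocks; ++b)
        for (int t = 0; ok && t < kBlockSize; ++t)
            ok = (data[b * kBlockSize + t] == b * kBlockSize + (kBlockSize - 1 - t));

    cudaFree(data);
    return ok;
}

int main()
{
    printf("block-wise reverse %s\n", runBlockReverse(4) ? "succeeded" : "failed");
    return 0;
}
```

The sketch assumes the array length is an exact multiple of the block size; a general version would need a bounds check before reading and writing.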








# Terminology:

**Host**: The CPU and its memory (host memory)

**Device**: The GPU and its memory (device memory)







# CUDA Devices and Threads Execution Model







- The CPU allocates memory on the GPU
- The CPU copies data from CPU to GPU
- The CPU launches kernels on the GPU
- The CPU copies data from GPU to CPU
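The four steps of the execution model map directly onto CUDA runtime calls. The following is a minimal sketch, not the course's own code: the kernel `doubleElements` and the helper `doubleOnDevice` are hypothetical names chosen for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: doubles every element of the array.
__global__ void doubleElements(float *d, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) d[idx] *= 2.0f;
}

// Doubles the contents of h on the GPU; returns true on success.
bool doubleOnDevice(std::vector<float> &h)
{
    int n = static_cast<int>(h.size());
    if (n == 0) return true;  // nothing to do; a launch with zero blocks is invalid
    size_t bytes = n * sizeof(float);
    float *d = nullptr;

    // 1. The CPU allocates memory on the GPU.
    if (cudaMalloc((void **)&d, bytes) != cudaSuccess) return false;

    // 2. The CPU copies data from CPU to GPU.
    cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);

    // 3. The CPU launches the kernel on the GPU.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    doubleElements<<<blocks, threads>>>(d, n);

    // 4. The CPU copies data from GPU to CPU. This blocking copy in the
    //    default stream also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d);
    return err == cudaSuccess;
}

int main()
{
    std::vector<float> values = {1.0f, 2.0f, 3.0f};
    if (doubleOnDevice(values))
        printf("%.1f %.1f %.1f\n", values[0], values[1], values[2]);
    return 0;
}
```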

# CUDA C/C++






# **NVCC Compiler**



- NVIDIA provides a CUDA C compiler: nvcc
- nvcc splits your code in two: host code and **device** code.
- **Device** code is sent to the NVIDIA device compiler.
- nvcc is capable of linking together both host and device code into a single executable.
- Convention: C++ source files containing CUDA syntax are typically given the extension .cu.
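As a small illustration of host and device code living in one `.cu` file, consider the sketch below. The file name `hello.cu` and the function names are assumptions made for this example; the compile command in the first comment uses only the basic `nvcc -o` form.

```cuda
// hello.cu (hypothetical file name)
// Compile and run:  nvcc -o hello hello.cu && ./hello
#include <cstdio>
#include <cuda_runtime.h>

// Host code: compiled by the host C++ compiler and executed on the CPU.
void helloFromCpu()
{
    printf("Hello from the CPU (host code)\n");
}

// Device code: __global__ marks a kernel that nvcc hands to the device compiler.
__global__ void helloFromGpu()
{
    printf("Hello from the GPU (device code)\n");
}

// Calls both functions; returns 0 if the kernel ran without error.
int runHello()
{
    helloFromCpu();
    helloFromGpu<<<1, 1>>>();  // one block with one thread
    // Kernel launches are asynchronous: wait so that the GPU output is flushed.
    return cudaDeviceSynchronize() == cudaSuccess ? 0 : 1;
}

int main()
{
    return runHello();
}
```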


# Lab1: Accelerating Applications with CUDA C/C++

Dr. Momme Allalen, Leibniz Computing Centre, Munich, Germany - www.lrz.de
Deep Learning Certified Instructor, NVIDIA Deep Learning Institute, NVIDIA Corporation.



#### Prerequisites

You should already be able to:

- Declare variables, write loops, and use if / else statements in C.
- Define and invoke functions in C.
- Allocate arrays in C.
- No previous CUDA knowledge is required.

#### Objectives

By the time you complete this lab, you will be able to:

- Write, compile, and run C/C++ programs that both call CPU functions and launch GPU kernels.
- Control parallel thread hierarchy using execution configuration.
- Refactor serial loops to execute their iterations in parallel on a GPU.
- Allocate and free memory available to both CPUs and GPUs.
- Handle errors generated by CUDA code.
- Accelerate **CPU-only applications**.
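Several of these objectives fit into one short program. The sketch below is an assumption-laden illustration rather than the lab's solution: `checkCuda`, `addVectors`, `runVectorAdd` and `invalidLaunchIsReported` are hypothetical names. It allocates memory visible to both CPU and GPU with `cudaMallocManaged`, refactors a serial loop into a grid-stride kernel, and checks errors from both the launch and the execution.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: reports a CUDA error and returns false if one occurred.
inline bool checkCuda(cudaError_t result, const char *what)
{
    if (result != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(result));
        return false;
    }
    return true;
}

// The serial loop "for (i = 0; i < n; ++i) c[i] = a[i] + b[i];" as a
// grid-stride kernel: it is correct for any n and any execution configuration.
__global__ void addVectors(const float *a, const float *b, float *c, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int i = idx; i < n; i += stride) c[i] = a[i] + b[i];
}

__global__ void doNothing() {}

bool runVectorAdd(int n)
{
    float *a = nullptr, *b = nullptr, *c = nullptr;
    size_t bytes = n * sizeof(float);

    // Memory allocated this way is accessible from both the CPU and the GPU.
    bool ok = checkCuda(cudaMallocManaged(&a, bytes), "cudaMallocManaged(a)")
           && checkCuda(cudaMallocManaged(&b, bytes), "cudaMallocManaged(b)")
           && checkCuda(cudaMallocManaged(&c, bytes), "cudaMallocManaged(c)");
    if (ok) {
        for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

        addVectors<<<32, 256>>>(a, b, c, n);  // execution configuration: 32 blocks of 256 threads

        // Launch errors surface via cudaGetLastError, execution errors at synchronization.
        ok = checkCuda(cudaGetLastError(), "kernel launch")
          && checkCuda(cudaDeviceSynchronize(), "kernel execution");
        for (int i = 0; ok && i < n; ++i) ok = (c[i] == 3.0f);
    }
    cudaFree(a);  // cudaFree on a null pointer performs no operation
    cudaFree(b);
    cudaFree(c);
    return ok;
}

// 2048 threads per block exceeds the 1024-thread limit, so the launch must fail.
bool invalidLaunchIsReported()
{
    doNothing<<<1, 2048>>>();
    return cudaGetLastError() != cudaSuccess;
}

int main()
{
    printf("vector add: %s\n", runVectorAdd(1 << 16) ? "ok" : "failed");
    printf("invalid launch detected: %s\n", invalidLaunchIsReported() ? "yes" : "no");
    return 0;
}
```

Without the `cudaGetLastError()` check, a kernel launched with an invalid configuration would fail silently and the program would simply read unchanged data.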


# Lab2: Managing Accelerated Application Memory with CUDA Unified Memory and nsys

#### **Prerequisites**

You should already be able to:

- Write, compile, and run C/C++ programs that both call CPU functions and launch GPU kernels.
- Control parallel thread hierarchy using execution configuration.
- Refactor serial loops to execute their iterations in parallel on a GPU.
- Allocate and free Unified Memory.

#### **Objectives**

By the time you complete this lab, you will be able to:

- Use the **NVIDIA Command Line Profiler (nvprof)** to profile accelerated application performance.
- Use an understanding of Streaming Multiprocessors to optimize execution configurations.
- Understand the behavior of Unified Memory with regard to page faulting and data migrations.
- Use asynchronous memory prefetching to reduce page faults and data migrations for increased performance.
- Employ an iterative development cycle to rapidly accelerate and deploy applications.
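Two of these objectives can be sketched together: querying the number of Streaming Multiprocessors to size the grid, and prefetching Unified Memory before it is touched. The names `initWith` and `runPrefetchDemo` are hypothetical, the factor of 32 blocks per SM is only a starting heuristic, and the sketch assumes a CUDA 12 or earlier toolkit, where `cudaMemPrefetchAsync` takes an integer destination device (newer toolkits changed this signature).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: grid-stride initialization of an array.
__global__ void initWith(float value, float *a, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int i = idx; i < n; i += stride) a[i] = value;
}

bool runPrefetchDemo(int n)
{
    int deviceId = 0;
    int numSMs = 0;
    if (cudaGetDevice(&deviceId) != cudaSuccess) return false;
    if (cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, deviceId)
            != cudaSuccess) return false;

    float *a = nullptr;
    size_t bytes = n * sizeof(float);
    if (cudaMallocManaged(&a, bytes) != cudaSuccess) return false;

    // Migrate the pages to the GPU up front instead of on first access, which
    // avoids page faults inside the kernel. Prefetching is only a hint, so its
    // return value is ignored here for brevity (assumes CUDA 12 or earlier).
    cudaMemPrefetchAsync(a, bytes, deviceId);

    // Size the grid as a multiple of the SM count (32 per SM is an assumption).
    int threadsPerBlock = 256;
    int blocks = 32 * numSMs;
    initWith<<<blocks, threadsPerBlock>>>(3.0f, a, n);

    // Prefetch the results back to host memory before the CPU reads them.
    cudaMemPrefetchAsync(a, bytes, cudaCpuDeviceId);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    for (int i = 0; ok && i < n; ++i) ok = (a[i] == 3.0f);
    cudaFree(a);
    return ok;
}

int main()
{
    printf("prefetch demo: %s\n", runPrefetchDemo(1 << 20) ? "ok" : "failed");
    return 0;
}
```

Profiling the program with and without the two prefetch calls is a simple way to observe the difference in page faults and data migrations.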


# CUDA® PROFILING TOOLS




- nvvp: NVIDIA Visual Profiler
- nvprof: tool to understand and optimize the performance of your CUDA, OpenACC or OpenMP applications

Application level opportunities (**Nsight Systems**):

- Overall application performance
- Overlap CPU and GPU work, identify the bottlenecks (CPU or GPU)
- Overall GPU utilization and efficiency
- Overlap compute and memory copies
- Utilize compute and copy engines effectively

Kernel level opportunities (**Nsight Compute**):

- Use memory bandwidth efficiently
- Use compute resources efficiently
- Hide instruction and memory latency

There are more features, for example dependency analysis. Command: `nvprof --dependency-analysis --cpu-thread-tracing on ./executable_cuda`


# **NSIGHT PRODUCT FAMILY**





Docs/product: https://developer.nvidia.com/nsight-systems


# **NSIGHT SYSTEMS**



- System-wide application algorithm tuning
  - Multi-process tree support
- Locate optimization opportunities
  - Visualize millions of events on a very fast GUI timeline
  - See gaps of unused CPU and GPU time
- Balance your workload across multiple CPUs and GPUs
  - CPU algorithms, utilization, and thread state
  - GPU streams, kernels, memory transfers, etc.
- Multi-platform: Linux & Windows, x86-64, Tegra, Power, MacOSX (host only)
- GPUs: Volta, Turing

Docs/product: https://developer.nvidia.com/nsight-systems



# **NSIGHT COMPUTE**



## **CUDA Kernel profiler**

- Targeted metric sections for various performance aspects (Debug & Profile)
- Very high-frequency GPU performance counters, customizable data collection and presentation (tables, charts, ...)
- Python-based rules for guided analysis (or post-processing)
- GPUs: Volta, Turing, Ampere

Docs/product: https://developer.nvidia.com/nsight-systems



#### NVIDIA System Profiler 4.0

[Figure: timeline view of a multi-GPU run on Tesla P100-SXM2-16GB devices, showing processes and threads, thread state, API traces (CUDA, cuDNN, cuBLAS, OpenGL), and kernel and memory transfer activities per stream.]

# **NSIGHT PRODUCT FAMILY**





- Nsight Systems - Analyze application algorithm system-wide
- Nsight Compute - Debug/optimize CUDA kernel
- Nsight Graphics - Debug/optimize graphics workloads








# Lab3: Asynchronous Streaming, and Visual Profiling with CUDA C/C++

#### Prerequisites

To get the most out of this lab you should already be able to:

- Write, compile, and run C/C++ programs that both call CPU functions and launch GPU kernels.
- Control parallel thread hierarchy using execution configuration.
- Refactor serial loops to execute their iterations in parallel on a GPU.
- Allocate and free CUDA Unified Memory.
- Understand the behaviour of Unified Memory with regard to page faulting and data migrations.
- Use asynchronous memory prefetching to reduce page faults and data migrations.

#### Objectives

By the time you complete this lab you will be able to:

- Use Nsight Systems to visually profile the timeline of GPU-accelerated CUDA applications.
- Use Nsight Systems to identify, and exploit, optimization opportunities in GPU-accelerated CUDA applications.
- Utilize CUDA streams for concurrent kernel execution in accelerated applications.
- (Optional Advanced Content) Use manual memory allocation, including allocating pinned memory, in order to asynchronously transfer data in concurrent CUDA streams.
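To make the stream objective concrete, here is a minimal sketch in which independent kernels are launched into separate non-default streams. The names `fillSegment` and `runStreamsDemo` and the launch sizes are assumptions for illustration. Kernels in different streams may overlap on the GPU, but concurrency is not guaranteed; a timeline in Nsight Systems shows what actually happened.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: fills one segment of an array with a value.
__global__ void fillSegment(int *a, int n, int value)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int i = idx; i < n; i += stride) a[i] = value;
}

bool runStreamsDemo(int numStreams, int segmentSize)
{
    int *data = nullptr;
    size_t bytes = static_cast<size_t>(numStreams) * segmentSize * sizeof(int);
    if (cudaMallocManaged(&data, bytes) != cudaSuccess) return false;

    std::vector<cudaStream_t> streams(numStreams);
    for (int s = 0; s < numStreams; ++s) cudaStreamCreate(&streams[s]);

    // The fourth launch argument selects the stream. Kernels within one stream
    // run in order; kernels in different streams are allowed to overlap.
    for (int s = 0; s < numStreams; ++s)
        fillSegment<<<4, 256, 0, streams[s]>>>(data + s * segmentSize, segmentSize, s);

    // Wait for the work in all streams before the host reads the results.
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    for (int s = 0; s < numStreams; ++s) cudaStreamDestroy(streams[s]);

    for (int s = 0; ok && s < numStreams; ++s)
        for (int i = 0; ok && i < segmentSize; ++i)
            ok = (data[s * segmentSize + i] == s);

    cudaFree(data);
    return ok;
}

int main()
{
    printf("streams demo: %s\n", runStreamsDemo(3, 1 << 16) ? "ok" : "failed");
    return 0;
}
```

The small grid of 4 blocks per kernel is deliberate: a kernel that fills the whole GPU leaves no room for another kernel to run beside it.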

## **Multiple Streams**




    // process the full data set in chunks of N, alternating between two streams
    for (int i = 0; i < FULL_SIZE; i += N * 2) {
        // copy the locked memory to the device, async
        cudaMemcpyAsync(dev_a0, host_a + i, N * sizeof(int), cudaMemcpyHostToDevice, stream0);
        cudaMemcpyAsync(dev_b0, host_b + i, N * sizeof(int), cudaMemcpyHostToDevice, stream0);

        kernel<<<N / 256, 256, 0, stream0>>>(dev_a0, dev_b0, dev_c0);

        // copy the data from device to locked memory
        cudaMemcpyAsync(host_c + i, dev_c0, N * sizeof(int), cudaMemcpyDeviceToHost, stream0);

        // copy the locked memory to the device, async
        cudaMemcpyAsync(dev_a1, host_a + i + N, N * sizeof(int), cudaMemcpyHostToDevice, stream1);
        cudaMemcpyAsync(dev_b1, host_b + i + N, N * sizeof(int), cudaMemcpyHostToDevice, stream1);

        kernel<<<N / 256, 256, 0, stream1>>>(dev_a1, dev_b1, dev_c1);

        // copy the data from device to locked memory
        cudaMemcpyAsync(host_c + i + N, dev_c1, N * sizeof(int), cudaMemcpyDeviceToHost, stream1);
    }
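The loop above relies on setup that the slide does not show: two streams, page-locked host buffers, and one set of device buffers per stream. The sketch below fills in that surrounding code under stated assumptions. The kernel body (an element-wise sum), the values of `N` and `FULL_SIZE`, and the helper `runTwoStreamPipeline` are hypothetical and chosen only so the example can be run and verified.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int N = 1024 * 256;     // elements per chunk (assumed; multiple of 256)
constexpr int FULL_SIZE = N * 8;  // total elements (assumed; multiple of 2 * N)

// Hypothetical kernel body: element-wise sum of one chunk.
__global__ void kernel(const int *a, const int *b, int *c)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) c[idx] = a[idx] + b[idx];
}

bool runTwoStreamPipeline()
{
    bool ok = true;
    auto check = [&ok](cudaError_t e) { ok = ok && (e == cudaSuccess); };

    cudaStream_t stream0 = nullptr, stream1 = nullptr;
    int *host_a = nullptr, *host_b = nullptr, *host_c = nullptr;
    int *dev_a0 = nullptr, *dev_b0 = nullptr, *dev_c0 = nullptr;
    int *dev_a1 = nullptr, *dev_b1 = nullptr, *dev_c1 = nullptr;

    check(cudaStreamCreate(&stream0));
    check(cudaStreamCreate(&stream1));

    // Page-locked (pinned) host memory lets cudaMemcpyAsync overlap with kernels.
    check(cudaHostAlloc((void **)&host_a, FULL_SIZE * sizeof(int), cudaHostAllocDefault));
    check(cudaHostAlloc((void **)&host_b, FULL_SIZE * sizeof(int), cudaHostAllocDefault));
    check(cudaHostAlloc((void **)&host_c, FULL_SIZE * sizeof(int), cudaHostAllocDefault));

    // One set of chunk-sized device buffers per stream.
    check(cudaMalloc((void **)&dev_a0, N * sizeof(int)));
    check(cudaMalloc((void **)&dev_b0, N * sizeof(int)));
    check(cudaMalloc((void **)&dev_c0, N * sizeof(int)));
    check(cudaMalloc((void **)&dev_a1, N * sizeof(int)));
    check(cudaMalloc((void **)&dev_b1, N * sizeof(int)));
    check(cudaMalloc((void **)&dev_c1, N * sizeof(int)));

    if (ok) {
        for (int i = 0; i < FULL_SIZE; ++i) { host_a[i] = i; host_b[i] = 2 * i; }

        for (int i = 0; i < FULL_SIZE; i += N * 2) {
            cudaMemcpyAsync(dev_a0, host_a + i, N * sizeof(int), cudaMemcpyHostToDevice, stream0);
            cudaMemcpyAsync(dev_b0, host_b + i, N * sizeof(int), cudaMemcpyHostToDevice, stream0);
            kernel<<<N / 256, 256, 0, stream0>>>(dev_a0, dev_b0, dev_c0);
            cudaMemcpyAsync(host_c + i, dev_c0, N * sizeof(int), cudaMemcpyDeviceToHost, stream0);

            cudaMemcpyAsync(dev_a1, host_a + i + N, N * sizeof(int), cudaMemcpyHostToDevice, stream1);
            cudaMemcpyAsync(dev_b1, host_b + i + N, N * sizeof(int), cudaMemcpyHostToDevice, stream1);
            kernel<<<N / 256, 256, 0, stream1>>>(dev_a1, dev_b1, dev_c1);
            cudaMemcpyAsync(host_c + i + N, dev_c1, N * sizeof(int), cudaMemcpyDeviceToHost, stream1);
        }

        // host_c is valid only after both streams have drained.
        check(cudaStreamSynchronize(stream0));
        check(cudaStreamSynchronize(stream1));
        for (int i = 0; ok && i < FULL_SIZE; ++i) ok = (host_c[i] == 3 * i);
    }

    cudaFreeHost(host_a); cudaFreeHost(host_b); cudaFreeHost(host_c);
    cudaFree(dev_a0); cudaFree(dev_b0); cudaFree(dev_c0);
    cudaFree(dev_a1); cudaFree(dev_b1); cudaFree(dev_c1);
    cudaStreamDestroy(stream0);
    cudaStreamDestroy(stream1);
    return ok;
}

int main()
{
    printf("two-stream pipeline: %s\n", runTwoStreamPipeline() ? "ok" : "failed");
    return 0;
}
```

Reusing the same device buffers in every iteration is safe here because operations within one stream execute in issue order, so the next chunk's copy cannot start before the previous chunk's result has been copied back.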








# THANK YOU

# Instructor: Dr. Momme Allalen www.nvidia.com/dli