Computationally intensive CUDA® C++ applications in high-performance computing, data science, bioinformatics, and deep learning can be accelerated by using multiple GPUs, which can increase throughput and/or decrease total runtime. When combined with the concurrent overlap of computation and memory transfers, computation can be scaled across multiple GPUs without increasing the cost of memory transfers. For organisations with multi-GPU servers, whether in the cloud or on NVIDIA DGX systems, these techniques enable you to achieve peak performance from GPU-accelerated applications. It is also important to implement these single-node, multi-GPU techniques before scaling your applications across multiple nodes.

This course covers how to write CUDA C++ applications that efficiently and correctly utilise all available GPUs in a single node, dramatically improving the performance of your applications and making the most cost-effective use of systems with multiple GPUs.

The course is co-organised by LRZ and NVIDIA Deep Learning Institute (DLI). NVIDIA DLI offers hands-on training for developers, data scientists, and researchers looking to solve challenging problems with deep learning.

Learning Objectives

By participating in this workshop, you’ll:

  • Use concurrent CUDA streams to overlap memory transfers with GPU computation
  • Utilise all available GPUs on a single node to scale workloads across them
  • Combine the use of copy/compute overlap with multiple GPUs
  • Use the NVIDIA Nsight Systems Visual Profiler timeline to observe improvement opportunities and the impact of the techniques covered in the workshop
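
To give a flavour of how the first three objectives fit together, the following is a minimal sketch and not the course material itself. It assumes a hypothetical kernel `scale` and a hypothetical helper `scale_on_all_gpus`; the chunk sizes, the number of streams per GPU (4) and the block size (256) are arbitrary choices for illustration. The data is split into one chunk per GPU and each chunk is split again across several non-default streams, so that the copy for one sub-chunk can overlap with the kernel for another. The host buffer is assumed to be pinned (allocated with `cudaMallocHost`), which is what allows the asynchronous copies to overlap with computation.

```cuda
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",           \
                         cudaGetErrorString(err_), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

// Hypothetical example kernel: multiply each element of a chunk by `factor`.
__global__ void scale(float* data, size_t n, float factor) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: scales host[0..n) in place using every visible GPU.
// `host` is assumed to be pinned memory so the async copies can overlap.
void scale_on_all_gpus(float* host, size_t n, float factor, int streams_per_gpu = 4) {
    int num_gpus = 0;
    CUDA_CHECK(cudaGetDeviceCount(&num_gpus));
    if (n == 0) return;

    const size_t gpu_chunk = (n + num_gpus - 1) / num_gpus;  // ceiling division
    std::vector<float*> dev(num_gpus, nullptr);
    std::vector<std::vector<cudaStream_t>> streams(num_gpus);

    // Enqueue all work without waiting, so every GPU is busy at the same time.
    for (int g = 0; g < num_gpus; ++g) {
        const size_t gpu_lo = std::min(g * gpu_chunk, n);
        const size_t gpu_n = std::min(gpu_lo + gpu_chunk, n) - gpu_lo;
        if (gpu_n == 0) continue;

        CUDA_CHECK(cudaSetDevice(g));
        CUDA_CHECK(cudaMalloc(&dev[g], gpu_n * sizeof(float)));
        streams[g].resize(streams_per_gpu);
        const size_t stream_chunk = (gpu_n + streams_per_gpu - 1) / streams_per_gpu;

        for (int s = 0; s < streams_per_gpu; ++s) {
            CUDA_CHECK(cudaStreamCreate(&streams[g][s]));
            const size_t lo = std::min(s * stream_chunk, gpu_n);
            const size_t len = std::min(lo + stream_chunk, gpu_n) - lo;
            if (len == 0) continue;

            float* d = dev[g] + lo;
            float* h = host + gpu_lo + lo;
            // Copy in, compute, copy out: all ordered within this one stream,
            // but free to overlap with work queued in the other streams.
            CUDA_CHECK(cudaMemcpyAsync(d, h, len * sizeof(float),
                                       cudaMemcpyHostToDevice, streams[g][s]));
            scale<<<(unsigned)((len + 255) / 256), 256, 0, streams[g][s]>>>(d, len, factor);
            CUDA_CHECK(cudaGetLastError());
            CUDA_CHECK(cudaMemcpyAsync(h, d, len * sizeof(float),
                                       cudaMemcpyDeviceToHost, streams[g][s]));
        }
    }

    // Wait for every stream on every GPU, then release resources.
    for (int g = 0; g < num_gpus; ++g) {
        if (dev[g] == nullptr) continue;
        CUDA_CHECK(cudaSetDevice(g));
        for (cudaStream_t st : streams[g]) {
            CUDA_CHECK(cudaStreamSynchronize(st));
            CUDA_CHECK(cudaStreamDestroy(st));
        }
        CUDA_CHECK(cudaFree(dev[g]));
    }
}

int main() {
    const size_t n = 1 << 24;
    float* host = nullptr;
    CUDA_CHECK(cudaMallocHost(&host, n * sizeof(float)));  // pinned host memory
    for (size_t i = 0; i < n; ++i) host[i] = 1.0f;

    scale_on_all_gpus(host, n, 3.0f);

    size_t mismatches = 0;
    for (size_t i = 0; i < n; ++i) mismatches += (host[i] != 3.0f);
    std::printf("mismatches: %zu\n", mismatches);
    CUDA_CHECK(cudaFreeHost(host));
    return mismatches == 0 ? 0 : 1;
}
```

Compiled with `nvcc` and run under Nsight Systems, a program of this shape should show copies and kernels from different streams and devices overlapping on the timeline, which is the kind of observation the last objective refers to. The degree of overlap depends on the hardware and on the chunk sizes chosen.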

Important information

After you are accepted, please create an account under

Ensure your laptop / PC will run smoothly by going to  Make sure that WebSockets work for you: under Environment, WebSockets should be listed as supported, and Data Receive, Send and Echo Test should all show Yes under WebSockets (Port 80). If there are issues with WebSockets, try updating your browser. If you have any questions, please contact Marjut Dieringer at mdieringer"at"


Basic CUDA knowledge as taught in the DLI course "Fundamentals of Accelerated Computing with CUDA C/C++".


The lectures are interleaved with many hands-on sessions using Jupyter Notebooks. The exercises will be done on a fully configured GPU-accelerated workstation in the cloud.




Dr. Momme Allalen (LRZ, NVIDIA certified University Ambassador)

Prices and Eligibility

The course is open and free of charge for academic participants.


Please register with your official e-mail address to prove your affiliation.

Withdrawal Policy

See Withdrawal

Legal Notices

For registration for LRZ courses and workshops we use the service edoobox from Etzensperger Informatik AG. Etzensperger Informatik AG acts as processor and we have concluded a Data Processing Agreement with them.


1    30.11.2021    10:00 – 16:00    Momme Allalen    ONLINE