2025-03-26 Bootcamp on Accelerated Distributed Computing - powered by NVIDIA (hdta5w24)
Course | Bootcamp on Accelerated Distributed Computing - powered by NVIDIA |
Number | hdta5w24 |
Available places | 2 |
Date | 26.03.2025 – 26.03.2025 |
Price | EUR 0.00 |
Location | Leibniz Rechenzentrum Boltzmannstr. 1 85748 Garching b. München |
Room | Seminarraum 1 |
Registration deadline | 28.02.2025 13:59 |
Contact | education@lrz.de |
Contents
The Accelerated Distributed Model Training Bootcamp takes a real-world perspective on using GPUs efficiently to train models in a distributed manner. Attendees walk through the system topology to learn the dynamics of multi-GPU and multi-node connections and architecture. They will also learn state-of-the-art strategies for training models in a multi-GPU and multi-node environment using the PyTorch framework. Furthermore, attendees will learn to profile, inspect, analyse, and optimise code using NVIDIA® Nsight™ Systems, a tool that helps identify optimisation opportunities and improve the performance of applications running on systems with multiple CPUs and GPUs.
Participation in this bootcamp is strongly recommended for teams who wish to apply for the EuroCC AI Hackathon, taking place in October 2025. For this reason, we strongly recommend that participants apply to this bootcamp as a team of two or more; priority will be given to applicants with a team.
Topics
- Training strategy
- Data Parallelism
- Model Parallelism
- Message Passing
- Horovod
- Pipeline Parallelism
- Mixed Precision
- ZeRO, Fully Sharded Data Parallelism (FSDP), Mixture-of-Experts (MoE)
- PyTorch SLURM
- System Topology
- Communication concepts
- Intra-Node Communication Topology
- NCCL
- Implementation
- NeMo Megatron Core/Nemotron
- Profiler
Preliminary Agenda
All times are in Central European Time (CET).
- 09:00 - 09:15: Welcome and Introduction
- 09:15 - 09:30: Cluster connection walkthrough (Demo)
- 09:30 - 10:30: Fundamentals of accelerated distributed model training methods (Lecture)
- 10:30 - 11:30: Instructor Lab Walkthrough (Demo)
- 11:30 - 12:00: Break
- 12:00 - 13:30: Multi-GPU Multi-node Training strategy (Lab)
- 13:30 - 14:00: Nsight Systems Profiling (Lab)
- 14:00 - 14:15: Wrap up and Q&A
The bootcamp is co-organised by LRZ, NVIDIA and the OpenACC organization.
Prerequisites
- Background knowledge of Python programming and the PyTorch framework is required.
Language
English
Lecturers
The lecturers will be from NVIDIA.
Prices and Eligibility
The course is open and free of charge for academic participants from Germany. Priority admission to the event will be given to members of MCML.
Registration
Please apply with your official email address to prove your affiliation. The final participants will be selected and informed after the registration deadline has passed. Priority will be given to members of the Munich Center for Machine Learning (MCML).
Withdrawal Policy
See Withdrawal
Legal Notices
This bootcamp is co-organised with NVIDIA. Some of your personal data will be transferred to NVIDIA (salutation, title, first name, surname, institution, country, email and bootcamp-specific information provided in the registration form). The legal basis is in accordance with Article 6(1)(b) GDPR. Please see also our data protection notice (in German: https://www.lrz.de/datenschutzerklaerung/).
For registration for LRZ courses and workshops we use the service edoobox from Etzensperger Informatik AG (www.edoobox.com). Etzensperger Informatik AG acts as processor and we have concluded a Data Processing Agreement with them.
See Legal Notices
No. | Date | Time | Teacher | Location | Room | Description |
---|---|---|---|---|---|---|
1 | 26.03.2025 | 09:00 – 14:15 | | Leibniz Rechenzentrum | Seminarraum 1 | Lecture |