Data Parallelism - How to Train Deep Learning Models on Multiple GPUs @ LRZ 2023

Overview

Modern deep learning challenges rely on increasingly large datasets and more complex models. As a result, significant computational power is required to train models effectively and efficiently. Learning to distribute data across multiple GPUs during model training opens up a wide range of new deep learning applications.

Additionally, the effective use of systems with multiple GPUs reduces training time, allowing for faster application development and much faster iteration cycles. Teams who are able to perform training using multiple GPUs will have an edge, building models trained on more data in shorter periods of time and with greater engineer productivity.

This workshop teaches you techniques for data-parallel deep learning training on multiple GPUs to shorten the training time required for data-intensive applications. Working with deep learning tools, frameworks, and workflows to perform neural network training, you’ll learn how to decrease model training time by distributing data to multiple GPUs, while retaining the accuracy of training on a single GPU.
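As a rough illustration of why distributing data need not change the training result, the sketch below simulates data parallelism in plain Python without any GPU. It is not taken from the workshop material: the toy model y = w * x, the mean squared error loss, and the helper names `gradient`, `shard`, and `data_parallel_gradient` are all assumptions chosen for illustration. Each simulated worker computes a gradient on its own equally sized shard, and the averaged result matches the gradient computed on the full batch.

```python
from statistics import fmean


def gradient(w, xs, ys):
    """Gradient of the mean squared error of the toy model y = w * x w.r.t. w."""
    return fmean(2.0 * (w * x - y) * x for x, y in zip(xs, ys))


def shard(items, num_workers):
    """Split a list into num_workers equally sized contiguous shards.

    Assumes len(items) is divisible by num_workers (hypothetical helper).
    """
    size = len(items) // num_workers
    return [items[i * size:(i + 1) * size] for i in range(num_workers)]


def data_parallel_gradient(w, xs, ys, num_workers):
    """Average the per-shard gradients, standing in for an all-reduce step."""
    local_grads = [
        gradient(w, shard_x, shard_y)
        for shard_x, shard_y in zip(shard(xs, num_workers), shard(ys, num_workers))
    ]
    return fmean(local_grads)


# Toy data following y = 2 * x, evaluated at an arbitrary starting weight.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
w = 0.5

full = gradient(w, xs, ys)                                    # one device, full batch
parallel = data_parallel_gradient(w, xs, ys, num_workers=4)   # four simulated workers

print(full, parallel)
```

The equality holds here because every shard has the same size. With unequal shards, the per-worker gradients would need to be weighted by shard size before averaging.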

The course is co-organised by LRZ and NVIDIA Deep Learning Institute (DLI). All instructors are NVIDIA certified University Ambassadors.

Lecturer

PD Dr. Juan Durillo Barrionuevo (LRZ, NVIDIA certified University Ambassador)

NVIDIA Deep Learning Institute

The NVIDIA Deep Learning Institute delivers hands-on training for developers, data scientists, and engineers. The program is designed to help you get started with training, optimizing, and deploying neural networks to solve real-world problems across diverse industries such as self-driving cars, healthcare, online services, and robotics.

Training Setup

To get started, follow these steps:

  1. Create an NVIDIA Developer account at http://courses.nvidia.com/join. Select "Log in with my NVIDIA Account" and then "Create Account".
  2. Make sure that WebSockets work for you:
    • Test your laptop at http://websocketstest.com
    • Under ENVIRONMENT, confirm that "WebSockets" is checked yes.
    • Under WEBSOCKETS (PORT 80), confirm that "Data Receive", "Send", and "Echo Test" are checked yes.
  3. If there are issues with WebSockets, try updating your browser.
  4. Visit http://courses.nvidia.com/dli-event and enter the event code provided by the instructor.
  5. You're ready to get started. Please complete the survey at the end of the course to share your feedback.

Agenda (all times in CEST)

10:00-10:15    Introduction

10:15-11:15    Neural Network Training and Stochastic Gradient Descent 

11:15-11:30   Coffee Break

11:30-12:30    Neural Network Training and Intro to Parallel Training

12:30-13:30   Lunch Break

13:30-15:00    Data Parallelism using PyTorch Distributed Data Parallel

15:00-15:15    Coffee Break

15:15-16:45    Challenges of Data Parallel using Multiple GPUs

16:45-17:00    Q&A, Final Remarks

Slides

Survey

  • Please fill out the online survey at https://tinyurl.com/hdli2w23-survey
  • This helps us and GCS to
    • increase the quality of the courses,
    • design the future training programme at LRZ and GCS according to your needs and wishes,
    • get future funding for training events.

Next Steps

Visit the NVIDIA Deep Learning Institute's website at https://www.nvidia.com/en-us/training/ to access more training and resources.

  • Start online, self-paced training in deep learning and accelerated computing (using the account you created today).
  • View upcoming workshops around the world and request an onsite workshop at your company or organization.
  • Learn about the University Ambassador Program.