This page gives you a simple, step-by-step introduction to the AI-Systems.
You should be able to follow along with the examples and apply them directly on the AI-Systems.
Along the way, we provide links to further information in case you want to dive deeper.

We especially encourage you to join one of our free courses on LINK at LRZ.

We assume that you already have access to the AI-Systems. If not, check out LINK to get access.

with SLURM

When starting out with SLURM on an HPC system, you typically follow this structured sequence of commands to explore and understand your system:

Check general cluster status, nodes, partitions, and job activity

sinfo

Detailed node configuration and availability

sinfo -Nel

List all partitions and their configurations

scontrol show partitions

Detailed information about nodes

scontrol show nodes
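
The same overview can also be collected from a script. The following is a minimal sketch, not an official LRZ tool. It assumes that Python 3 is available on the login node and relies on the sinfo options --noheader and --format with the fields %P (partition), %a (availability), %D (node count) and %T (node state). The helper names parse_sinfo and query_sinfo are made up for this illustration.

```python
import subprocess
from collections import Counter

# Fields requested from sinfo: partition, availability, node count, node state
SINFO_FORMAT = "%P %a %D %T"


def parse_sinfo(text):
    """Sum the node counts per (partition, state) from sinfo output lines."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or unexpected lines
        partition, _availability, nodes, state = fields
        # sinfo marks the default partition with a trailing "*"
        counts[(partition.rstrip("*"), state)] += int(nodes)
    return counts


def query_sinfo():
    """Run sinfo (only works where SLURM is installed) and parse its output."""
    result = subprocess.run(
        ["sinfo", "--noheader", f"--format={SINFO_FORMAT}"],
        capture_output=True, text=True, check=True,
    )
    return parse_sinfo(result.stdout)
```

On a login node, query_sinfo() returns the number of nodes per partition and state, for example how many nodes of a partition are currently idle. parse_sinfo can also be tried on saved sinfo output without cluster access.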

with PyTorch

This section provides a simple, practical introduction to training a neural network using PyTorch on the MNIST dataset. The example covers data loading, model definition, training, evaluation, logging results to a CSV file, and saving the trained model.

Steps Covered:

Data Preparation: The MNIST dataset is loaded with transformations.
Model Definition: A simple neural network with two hidden layers is defined.
Training Loop: The model is trained using cross-entropy loss and SGD optimizer.
Evaluation: Accuracy is calculated on both training and test datasets after each epoch.
Logging and Saving: Results are logged to a CSV file, and the trained model is saved.
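
To make the training step more concrete, the two building blocks named above can be written out by hand. This is a pure-Python sketch for illustration only, not the PyTorch implementation: cross_entropy computes the loss of a single sample as the negative log of the softmax probability of the correct class, and sgd_step applies one plain gradient descent update. Both function names and the example numbers are assumptions made for this sketch.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one sample: -log(softmax(logits)[label])."""
    # Subtract the maximum before exponentiating for numerical stability
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def sgd_step(weights, grads, lr=0.01):
    """Plain SGD update: move each weight against its gradient."""
    return [w - lr * g for w, g in zip(weights, grads)]


# A confident, correct prediction gives a smaller loss than a wrong one
loss_correct = cross_entropy([2.0, 1.0, 0.1], label=0)
loss_wrong = cross_entropy([2.0, 1.0, 0.1], label=2)
print(loss_correct < loss_wrong)
```

In the PyTorch example, nn.CrossEntropyLoss averages this per-sample loss over each batch by default, and optim.SGD without momentum performs the same kind of update on all model parameters.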

import csv

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Data preparation: download MNIST and convert the images to tensors
transform = transforms.ToTensor()
train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

test_dataset = datasets.MNIST(root='./data', train=False, download=True, transform=transform)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

# Model definition: fully connected network with two hidden layers
class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28*28, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 10)
        )

    def forward(self, x):
        return self.fc(x)

model = SimpleNN()

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Evaluation: accuracy (in percent) of the model on all samples of a loader
def evaluate(loader):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in loader:
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    return 100 * correct / total

# Training loop with evaluation on both datasets after each epoch
results = []
for epoch in range(5):
    model.train()
    for images, labels in train_loader:  # reuse the loader created above
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

    train_acc = evaluate(train_loader)
    test_acc = evaluate(test_loader)
    results.append([epoch + 1, train_acc, test_acc])
    print(f'Epoch {epoch+1}, Train accuracy: {train_acc:.2f}%, Test accuracy: {test_acc:.2f}%')

# Log results to a CSV file (the file name is only an example)
with open('results.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['epoch', 'train_accuracy', 'test_accuracy'])
    writer.writerows(results)

# Save model
torch.save(model.state_dict(), 'mnist_model.pth')

with Batch Jobs

#!/bin/bash
#SBATCH --job-name=introduction_test
#SBATCH --partition=<partition_name>
#SBATCH --nodes=1
#SBATCH --gpus=1                # Adjust GPU resources if required
#SBATCH --time=00:10:00         # Job time limit
# Note: the ./logs directory must already exist when the job is submitted
#SBATCH --output=./logs/output.out
#SBATCH --error=./logs/output.err

# Run the container with Pyxis using `srun`
srun --container-image=pytorch-container --container-mounts=$HOME:/workspace --gpus=1 bash -c "\
    nvidia-smi;\
    python -c 'import torch; print(torch.__version__)'\
"

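Submitting the script can also be wrapped in a few lines of Python. This is a minimal sketch under stated assumptions: the batch script above has been saved to a file (the name passed to submit is up to you), sbatch is on the PATH, and sbatch reports success with a line of the form "Submitted batch job <id>". The helper names parse_job_id and submit are hypothetical. Because the script writes to ./logs, the sketch creates that directory first, since SLURM typically does not create missing output directories.

```python
import re
import subprocess
from pathlib import Path


def parse_job_id(sbatch_output):
    """Extract the numeric job ID from the text printed by sbatch."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"unexpected sbatch output: {sbatch_output!r}")
    return int(match.group(1))


def submit(script_path, log_dir="logs"):
    """Create the log directory, submit the script, and return the job ID."""
    # The batch script writes its output to ./logs relative to the
    # directory from which the job is submitted
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    result = subprocess.run(
        ["sbatch", str(script_path)],
        capture_output=True, text=True, check=True,
    )
    return parse_job_id(result.stdout)
```

The returned job ID can then be used with commands such as squeue or scancel to monitor or cancel the job.
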
with Notebooks

Steps to allocate resources

Steps to plot/infer the results
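
As a starting point for inspecting results in a notebook, the sketch below reads per-epoch results back from a CSV file and picks the epoch with the best test accuracy. It assumes that the training run logged its results to a CSV file with the columns epoch, train_accuracy and test_accuracy. These column names and the helper names load_results and best_epoch are assumptions for this illustration, so adapt them to your own log file.

```python
import csv


def load_results(path):
    """Read per-epoch results into a list of dicts with numeric values."""
    with open(path, newline="") as f:
        return [
            {
                "epoch": int(row["epoch"]),
                "train_accuracy": float(row["train_accuracy"]),
                "test_accuracy": float(row["test_accuracy"]),
            }
            for row in csv.DictReader(f)
        ]


def best_epoch(results):
    """Return the entry with the highest test accuracy."""
    return max(results, key=lambda r: r["test_accuracy"])
```

The list returned by load_results can be passed directly to a plotting library of your choice, for example to draw training and test accuracy over the epochs.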

with Free Courses

If you made it this far, you are clearly interested, and we strongly encourage you to join our free courses at LRZ LINK.