0. Getting Started
This page gives you a simple, step-by-step introduction to the AI-Systems.
You should be able to follow the example and apply it directly on the AI-Systems.
We provide links where you can dive deeper if you want to.
We especially encourage you to join one of our free courses at LRZ (LINK).
We assume that you already have access to the AI-Systems. If not, check out LINK to get access.
with SLURM
When starting out with SLURM on an HPC system, you typically follow this structured sequence of commands to explore and understand your system:
Check general cluster status, nodes, partitions, and job activity
sinfo
Detailed node configuration and availability
sinfo -Nel
List all partitions and their configurations
scontrol show partitions
Detailed information about nodes
scontrol show nodes
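Once you have an overview of the cluster, a quick sanity check is to run a trivial command on a compute node and watch your job queue. This is a minimal sketch; <partition_name> is a placeholder you need to replace with a partition from the sinfo output:
# Run a trivial command on one node of a chosen partition
srun --partition=<partition_name> --nodes=1 --time=00:01:00 hostname
# List your own pending and running jobs
squeue -u $USER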
with PyTorch
This section provides a simple, practical introduction to training a neural network using PyTorch on the MNIST dataset. The example covers data loading, model definition, training, evaluation, logging results to a CSV file, and saving the trained model.
Steps Covered:
Data Preparation: The MNIST dataset is loaded with transformations.
Model Definition: A simple neural network with two hidden layers is defined.
Training Loop: The model is trained using cross-entropy loss and the SGD optimizer.
Evaluation: Accuracy is calculated on both training and test datasets after each epoch.
Logging and Saving: Results are logged to a CSV file, and the trained model is saved.
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
import csv

# Data preparation
train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transforms.ToTensor())
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_dataset = datasets.MNIST(root='./data', train=False, download=True, transform=transforms.ToTensor())
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

# Model definition
class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 10)
        )

    def forward(self, x):
        return self.fc(x)

model = SimpleNN()

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Helper: compute accuracy (%) of the model on a given data loader
def evaluate(loader):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in loader:
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    return 100 * correct / total

# Training and evaluation loop, logging per-epoch results to a CSV file
with open('results.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['epoch', 'train_accuracy', 'test_accuracy'])
    for epoch in range(5):
        model.train()
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        train_acc = evaluate(train_loader)
        test_acc = evaluate(test_loader)
        writer.writerow([epoch + 1, train_acc, test_acc])
        print(f'Epoch {epoch+1}, Train Accuracy: {train_acc:.2f}%, Test Accuracy: {test_acc:.2f}%')

# Save model
torch.save(model.state_dict(), 'mnist_model.pth')
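Once training has finished, you can reload the saved weights for inference. The following is a minimal sketch, assuming the SimpleNN class and test_dataset from the script above are still in scope and that mnist_model.pth was written by it:
import torch

# Recreate the architecture and load the trained weights
model = SimpleNN()
model.load_state_dict(torch.load('mnist_model.pth'))
model.eval()

# Classify a single test image (add a batch dimension before the forward pass)
image, label = test_dataset[0]
with torch.no_grad():
    prediction = model(image.unsqueeze(0)).argmax(dim=1).item()
print(f'Predicted: {prediction}, actual: {label}')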
with Batch Jobs
#!/bin/bash
#SBATCH --job-name=introduction_test
#SBATCH --partition=<partition_name>
#SBATCH --nodes=1
#SBATCH --gpus=1                 # Adjust GPU resources if required
#SBATCH --time=00:10:00          # Job time limit
#SBATCH --output=./logs/output.out
#SBATCH --error=./logs/output.err

# Run the container with Pyxis using `srun`
srun --container-image=pytorch-container --container-mounts=$HOME:/workspace --gpus=1 bash -c "\
nvidia-smi;\
python -c 'import torch; print(torch.__version__)'\
"
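Assuming you saved the script under a name of your choice, e.g. introduction_test.sbatch (a hypothetical name), you would submit and monitor it as sketched below. Note that the ./logs directory must exist before the job starts, since SLURM writes the output files there:
# Create the log directory, then submit the script
mkdir -p logs
sbatch introduction_test.sbatch

# Monitor the job and follow its output once it is running
squeue -u $USER
tail -f ./logs/output.out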
with Notebooks
Steps to allocate resources
Steps to plot/infer the results
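The exact notebook workflow depends on how your system is set up. As a rough sketch of one common HPC pattern (this assumes jupyter is available in your environment; the partition name, node names, and port are placeholders):
# Allocate resources interactively and start Jupyter on the compute node
salloc --partition=<partition_name> --gpus=1 --time=01:00:00
srun jupyter notebook --no-browser --ip=0.0.0.0 --port=8888

# On your local machine, forward the port to reach the notebook in your browser
ssh -L 8888:<compute_node>:8888 <login_node>
Inside the notebook you can then load mnist_model.pth as shown above and plot predictions, e.g. with matplotlib.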
with Free Courses
If you made it this far, you are clearly interested, and we strongly encourage you to join our free courses at LRZ (LINK).