Commit 331876e3 authored by Sören Wacker

add Kerberos docs and fix multi-GPU examples for daic-new

- document Kerberos requirement for SSH key users (storage, quickstart, tutorials)
- add mount delay warning for network storage
- fix Python version constraint (<3.13) for PyTorch compatibility
- fix UV paths and SLURM_SUBMIT_DIR in job scripts
- use extra-index-url for PyTorch wheels
- add data pre-download to avoid race conditions
parent b142d306
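
The Python cap mentioned in the commit message (`<3.13`, presumably because PyTorch CUDA wheels were not yet published for 3.13 at the time) can be mirrored as a small runtime guard. The function below is our own sketch, not part of the repository:

```python
import sys

def python_supported(version_info=sys.version_info):
    """Check the interpreter against requires-python = ">=3.10,<3.13"."""
    major_minor = (version_info[0], version_info[1])
    return (3, 10) <= major_minor < (3, 13)

print(python_supported((3, 12, 4)))   # True: inside the supported range
print(python_supported((3, 13, 0)))   # False: excluded by the <3.13 cap
```

Running such a check at the top of a training script fails fast, before `uv sync` spends time resolving wheels that do not exist.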
+35 −0
@@ -49,6 +49,41 @@ You may see "Permission denied" errors for other filesystems. These can be ignor

On first login, symlinks are created in your home directory pointing to TU Delft network storage:

### Kerberos authentication

TU Delft network storage requires a valid Kerberos ticket. Without it, you will get "Permission denied" or "Stale file handle" errors when accessing `linuxhome`, `windowshome`, project, group, or bulk storage.

**When logging in via SSH with a password**, a Kerberos ticket is created automatically.

**When logging in via SSH with a public key** or through Open OnDemand, you must obtain a ticket manually:

```bash
kinit
```

Enter your NetID password when prompted.

Check your current ticket status:

```bash
klist
```

Example output with a valid ticket:

```
Ticket cache: KCM:656519
Default principal: <NetID>@TUDELFT.NET

Valid starting     Expires            Service principal
03/23/26 11:05:12  03/23/26 21:05:12  krbtgt/TUDELFT.NET@TUDELFT.NET
        renew until 03/30/26 12:05:03
```

{{% alert title="First access delay" color="info" %}}
Network storage locations may take up to 30 seconds to mount on first access. If you get a "Stale file handle" error, wait a moment and try again.
{{% /alert %}}

- `~/linuxhome` - Your Linux home on TU Delft storage
- `~/windowshome` - Your Windows home on TU Delft storage

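Since password logins get a ticket automatically but key-based logins do not, a small guard in `~/.bash_profile` can prompt for one only when needed. This is a sketch of ours, not part of the committed docs; it assumes MIT Kerberos, whose `klist -s` exits non-zero when no valid ticket exists:

```bash
# Sketch for ~/.bash_profile: request a Kerberos ticket only when none is valid.
# klist -s is silent and exits non-zero without a ticket; kinit then prompts once.
if command -v klist >/dev/null 2>&1 && ! klist -s 2>/dev/null; then
    kinit
fi
```
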
+28 −0
@@ -27,6 +27,34 @@ $ ssh <YourNetID>@daic01.hpc.tudelft.nl

You should connect without entering a password.

## Kerberos for network storage

{{% alert title="Important" color="warning" %}}
SSH key login does not create a Kerberos ticket. Without a ticket, you cannot access network storage (`~/linuxhome`, project storage, etc.).
{{% /alert %}}

After connecting with SSH keys, run:

```shell-session
$ kinit
Password for <YourNetID>@TUDELFT.NET:
```

Verify your ticket:

```shell-session
$ klist
Ticket cache: KCM:656519
Default principal: <YourNetID>@TUDELFT.NET

Valid starting     Expires            Service principal
03/23/26 11:05:12  03/23/26 21:05:12  krbtgt/TUDELFT.NET@TUDELFT.NET
```

{{% alert title="First access delay" color="info" %}}
Network storage may take up to 30 seconds to mount on first access. If you see "Stale file handle", wait and retry.
{{% /alert %}}

## SSH config shortcut

Add to `~/.ssh/config` on your local machine:
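
The hunk is cut off here; a typical entry (the alias `daic` and the layout below are our illustration, using the hostname shown in the quickstart above) might look like:

```
Host daic
    HostName daic01.hpc.tudelft.nl
    User <YourNetID>
```

With this in place, `ssh daic` replaces the full `ssh <YourNetID>@daic01.hpc.tudelft.nl` invocation.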
+5 −1
@@ -48,7 +48,7 @@ These tutorials take you from first login to running GPU workloads. Each tutoria
┌─────────────────────────────────────────────────────────────────────────┐
│                           STORAGE                                        │
│    /home/<netid>              - 5 MB, config only                       │
-│    ~/linuxhome                - 8 GB, personal files                    │
+│    ~/linuxhome                - ~25 GB, personal files                  │
│    /tudelft.net/staff-umbrella/<project>  - Project data                │
│    /tudelft.net/staff-bulk/<project>      - Large datasets              │
└─────────────────────────────────────────────────────────────────────────┘
@@ -68,6 +68,10 @@ These tutorials take you from first login to running GPU workloads. Each tutoria
**I just got access to DAIC**
→ Start with [Bash Basics](/tutorials/bash/), then [Slurm Basics](/tutorials/slurm/)

{{% alert title="Using SSH keys?" color="info" %}}
If you log in with SSH keys instead of a password, run `kinit` after connecting to access network storage (`linuxhome`, project storage). See [Storage](/docs/storage/) for details.
{{% /alert %}}

**I know Linux but not clusters**
→ Start with [Slurm Basics](/tutorials/slurm/)

+13 −5
#!/bin/bash
#SBATCH --job-name=accelerate-multi-gpu
-#SBATCH --account=<your-account>
+# SBATCH --account=<your-account>  # Uncomment and set if required
#SBATCH --partition=all
#SBATCH --time=00:30:00
#SBATCH --nodes=1
@@ -18,8 +18,13 @@ set -e
module purge
module load 2025/gpu cuda/12.9

-# Navigate to script directory
-cd "$(dirname "$0")"
+# Set up UV paths (adjust if UV is installed elsewhere)
+export PATH="$HOME/linuxhome/.local/bin:$PATH"
+export UV_CACHE_DIR="$HOME/linuxhome/.cache/uv"
+export UV_PYTHON_INSTALL_DIR="$HOME/linuxhome/.local/share/uv/python"
+
+# Navigate to script directory (SLURM_SUBMIT_DIR is where sbatch was called)
+cd "$SLURM_SUBMIT_DIR"

# Print job info
echo "========================================"
@@ -44,8 +49,11 @@ if [ ! -d ".venv" ]; then
    uv sync
fi

-# Run training with accelerate
-srun uv run accelerate launch \
+# Pre-download data to avoid race conditions
+uv run python -c "from torchvision import datasets; datasets.MNIST('${TMPDIR:-/tmp}/mnist', train=True, download=True); datasets.MNIST('${TMPDIR:-/tmp}/mnist', train=False, download=True)"
+
+# Run training with accelerate (handles multi-GPU spawning)
+uv run accelerate launch \
    --num_processes=$NUM_GPUS \
    --mixed_precision=fp16 \
    train.py \
+2 −2
@@ -2,7 +2,7 @@
name = "accelerate-multi-gpu"
version = "0.1.0"
description = "Multi-GPU training example with Hugging Face Accelerate"
-requires-python = ">=3.10"
+requires-python = ">=3.10,<3.13"
dependencies = [
    "torch>=2.0.0",
    "torchvision>=0.15.0",
@@ -10,4 +10,4 @@ dependencies = [
]

[tool.uv]
-index-url = "https://download.pytorch.org/whl/cu124"
+extra-index-url = ["https://download.pytorch.org/whl/cu124"]
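
The last change deserves a note: in uv, `index-url` *replaces* PyPI as the sole package source, so dependencies that live only on PyPI fail to resolve, while `extra-index-url` keeps PyPI and searches the CUDA wheel index in addition. The resulting section (mirroring the diff above):

```toml
[tool.uv]
# Searched in addition to PyPI, so non-PyTorch dependencies still resolve.
extra-index-url = ["https://download.pytorch.org/whl/cu124"]
```
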