Commit edcf0cc2 authored by Sören Wacker's avatar Sören Wacker

add Kerberos docs, contact info, troubleshooting, fix multi-GPU tutorial

parent 331876e3
+2 −2
@@ -12,8 +12,8 @@ DAIC provides access to multiple storage areas. Understanding their purposes and
| Storage | Location | Quota | Purpose | Backup |
|---------|----------|-------|---------|--------|
| Cluster home | `/trinity/home/<NetID>` | ~5 MB | Config files only | No |
-| Linux home | `~/linuxhome` | ~25 GB | Personal files | Yes |
-| Windows home | `~/windowshome` | ~25 GB | Personal files | Yes |
| Linux home | `~/linuxhome` | ~30 GB | Personal files | Yes |
| Windows home | `~/windowshome` | ~30 GB | Personal files | Yes |
| Project | `/tudelft.net/staff-umbrella/<project>` | By request | Research data | Yes |
| Group | `/tudelft.net/staff-groups/<faculty>/<dept>/<group>` | Fair use | Shared files | Yes |
| Bulk | `/tudelft.net/staff-bulk/<faculty>/<dept>/<group>` | Fair use | Large datasets | Yes |
+3 −1
@@ -8,3 +8,5 @@ menu:
    weight: 40
---

- [Contact](/support/contact/) - Mattermost, Service Desk, request forms
- [Troubleshooting](/support/troubleshooting/) - Common issues and solutions
+27 −3
@@ -5,6 +5,30 @@ type: docs
description: "Ways to contact the DAIC support team."
---

-{{% alert color="warning" %}}
-This page is under construction.
-{{% /alert %}}
## Community support

Join the DAIC Mattermost channel for questions, discussions, and announcements:

{{< external-link "https://mattermost.tudelft.nl/signup_user_complete/?id=cb1k3t6ytpfjbf7r397395axyc&md=link&sbr=su" "Join DAIC Mattermost" >}}

This is the best place to get help from fellow users and the DAIC team.

## Service Desk

For technical issues, account problems, or storage issues, submit a ticket through the TU Delft Self-Service Portal:

{{< external-link "https://tudelft.topdesk.net/tas/public/ssp/" "TU Delft Self-Service Portal" >}}

## Request forms

| Request | Link |
|---------|------|
| DAIC account access | {{< external-link "https://tudelft.topdesk.net/tas/public/ssp/content/serviceflow?unid=89811f26713645a89a5ca1cdef263ac5" "Request Access" >}} |
| Project storage | {{< external-link "https://tudelft.topdesk.net/tas/public/ssp/content/detail/service?unid=846ebb16181c43b5836c063a917dd199" "Request Storage" >}} |
| General inquiry | {{< external-link "https://tudelft.topdesk.net/tas/public/ssp/content/serviceflow?unid=889f49ca2fe440539cbd713918432046&openedFromService=true" "Contact Form" >}} |

## Scientific output

Share your DAIC-based publications in the ScientificOutput channel:

{{< external-link "https://mattermost.tudelft.nl/daic/channels/scientificoutput" "ScientificOutput Channel" >}}
+52 −3
@@ -5,6 +5,55 @@ type: docs
description: "Common issues and troubleshooting steps for DAIC."
---

-{{% alert color="warning" %}}
-This page is under construction.
-{{% /alert %}}
## Storage access errors

### "Permission denied" or "Stale file handle" when accessing linuxhome

**Cause:** Missing Kerberos ticket. Password logins obtain a ticket automatically, but SSH-key logins bypass password authentication, so no ticket is created.

**Solution:** Run `kinit` and enter your NetID password:

```bash
kinit
```

Verify your ticket with `klist`:

```bash
klist
```

You should see output like:

```
Default principal: <YourNetID>@TUDELFT.NET
Valid starting     Expires            Service principal
03/23/26 11:05:12  03/23/26 21:05:12  krbtgt/TUDELFT.NET@TUDELFT.NET
```
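
A convenient pattern is to run `kinit` only when no valid ticket is cached. This is a sketch, not DAIC-specific configuration; it relies on `klist -s`, which is silent and exits non-zero when there is no active ticket:

```shell
# Obtain a Kerberos ticket only when none is currently valid:
# `klist -s` exits non-zero if no active ticket exists, so `kinit`
# (and its password prompt) runs only when actually needed.
klist -s || kinit
```

You could place this in your shell profile so sessions started with SSH keys still pick up a ticket.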

### Storage takes a long time on first access

**Cause:** Network storage mounts on-demand and may take up to 30 seconds on first access.

**Solution:** Wait and retry. Subsequent accesses will be fast.
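
If a script needs the share right away, a small retry loop can absorb the mount delay. This is a sketch; the path reuses the `<project>` placeholder from the storage table, and the attempt count and sleep interval are arbitrary:

```shell
# Retry the first access a few times while the share mounts on demand.
dir="/tudelft.net/staff-umbrella/<project>"   # placeholder path
for attempt in 1 2 3; do
    ls "$dir" >/dev/null 2>&1 && break        # succeeds once mounted
    sleep 10
done
```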

## Job submission errors

### "Disk quota exceeded" in home directory

**Cause:** Cluster home (`/trinity/home`) has a 5 MB quota for config files only.

**Solution:** Store code and data in `~/linuxhome` or project storage. Check quota with:

```bash
quota -s
```
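
To see what is filling a directory, a quick size listing helps. This is a sketch using GNU `du`; point it at whichever directory is over quota:

```shell
# List the largest entries directly under your home, biggest first.
du -ah "$HOME" --max-depth=1 2>/dev/null | sort -rh | head -n 10
```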

### Job fails immediately with no output

**Cause:** Often a missing module or incorrect path.

**Solution:**
1. Check the error file: `cat <jobname>_<jobid>.err`
2. Verify that modules load correctly: `module load 2025/gpu cuda/12.9`
3. Check that the working directory in your job script uses `$SLURM_SUBMIT_DIR`
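
Putting these checks together, a minimal job script might look like the sketch below. The `%x_%j` patterns are Slurm's job-name and job-id filename placeholders, matching the `<jobname>_<jobid>.err` naming above; `train.py` and the job name are placeholders:

```shell
#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --output=%x_%j.out     # %x = job name, %j = job id
#SBATCH --error=%x_%j.err

# Load the environment before anything else (modules from step 2).
module load 2025/gpu cuda/12.9

# Run from the directory the job was submitted from.
cd "$SLURM_SUBMIT_DIR"

python train.py                # placeholder for your own script
```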
+29 −38
@@ -14,44 +14,35 @@ menu:

These tutorials take you from first login to running GPU workloads. Each tutorial builds on the previous one, so we recommend following them in order.

-```
-┌─────────────────────────────────────────────────────────────────────────┐
-│                           YOUR COMPUTER                                  │
-│    You write code, prepare data, connect via SSH                        │
-└────────────────────────────────┬────────────────────────────────────────┘
-                                 │ SSH
-
-┌─────────────────────────────────────────────────────────────────────────┐
-│                           LOGIN NODE                                     │
-│    daic01.hpc.tudelft.nl                                                │
-│    - Prepare scripts                                                    │
-│    - Submit jobs (sbatch)                                               │
-│    - Monitor jobs (squeue)                                              │
-│    - Edit files (vim, nano)                                             │
-│    - Transfer data (scp, rsync)                                         │
-│                                                                         │
-│    DO NOT run computations here!                                        │
-└────────────────────────────────┬────────────────────────────────────────┘
-                                 │ Slurm
-
-┌─────────────────────────────────────────────────────────────────────────┐
-│                         COMPUTE NODES                                    │
-│    gpu01, gpu02, ... gpu45                                              │
-│    - Run your training scripts                                          │
-│    - Access GPUs (L40, A40, RTX Pro 6000)                              │
-│    - Process large datasets                                             │
-│                                                                         │
-│    Managed by Slurm - request resources with sbatch/salloc              │
-└────────────────────────────────┬────────────────────────────────────────┘
-
-
-┌─────────────────────────────────────────────────────────────────────────┐
-│                           STORAGE                                        │
-│    /home/<netid>              - 5 MB, config only                       │
-│    ~/linuxhome                - ~25 GB, personal files                  │
-│    /tudelft.net/staff-umbrella/<project>  - Project data                │
-│    /tudelft.net/staff-bulk/<project>      - Large datasets              │
-└─────────────────────────────────────────────────────────────────────────┘
```mermaid
flowchart TB
    subgraph local["YOUR COMPUTER"]
        L1["Write code, prepare data"]
    end

    subgraph login["LOGIN NODE - daic01.hpc.tudelft.nl"]
        L2["Prepare scripts"]
        L3["Submit jobs (sbatch)"]
        L4["Monitor jobs (squeue)"]
        L5["Transfer data (scp, rsync)"]
        L6["DO NOT run computations here!"]
    end

    subgraph compute["COMPUTE NODES - gpu01...gpu45"]
        C1["Run training scripts"]
        C2["Access GPUs (L40, A40, RTX Pro 6000)"]
        C3["Process large datasets"]
    end

    subgraph storage["STORAGE"]
        S1["/home - 5 MB, config only"]
        S2["~/linuxhome - ~30 GB, personal files"]
        S3["staff-umbrella - Project data"]
    end

    local -->|SSH| login
    login -->|Slurm| compute
    compute --> storage
```
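
The flow above, end to end, looks roughly like this. It is a sketch: the hostname comes from the diagram, `<NetID>` is your TU Delft account, and `train_job.sh` is a placeholder job script:

```shell
# Your computer -> login node
ssh "<NetID>@daic01.hpc.tudelft.nl"

# Login node -> compute nodes, via Slurm
sbatch train_job.sh      # submit the job
squeue -u "$USER"        # monitor your jobs in the queue
```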

## The learning path