Commit b3a68e1f authored by Sören Wacker

improve tutorials with fixes, module docs, troubleshooting, and exercise verification

parent 1e9c4681
+128 −1
@@ -30,7 +30,7 @@ Containerization packages your software, libraries, and dependencies into a sing
On DAIC specifically, users often encounter issues with limited home directory space or Windows-based `/tudelft.net` mounts (see [Storage](/docs/system/storage)), which can complicate the use of `conda/mamba` and/or `pip`. Containers offer a solution by encapsulating all software and dependencies in a self-contained environment. You can, for instance, store containers on `staff-umbrella` with all required dependencies, including those installed via `pip`, and run them reliably and reproducibly without being limited by home directory size or mount compatibility.

## Containerization on DAIC: Apptainer
DAIC supports [Apptainer](https://apptainer.org/docs/user/main/introduction.html)  (previously Apptainer), an open-source container platform, designed to run on High-performance computing environments. Apptainer runs container images securely on shared clusters and allows you to use Docker images directly, without needing Docker itself.
DAIC supports [Apptainer](https://apptainer.org/docs/user/main/introduction.html) (formerly known as Singularity), an open-source container platform designed for high-performance computing environments. Apptainer runs container images securely on shared clusters and allows you to use Docker images directly, without needing Docker itself.

A typical Apptainer workflow revolves around three key components:

@@ -584,6 +584,16 @@ Pull the `python:3.11-slim` image from Docker Hub and explore it:
4. List the contents of `/usr/local/lib/python3.11/`
5. Exit the container

{{% alert title="Check your work" color="info" %}}
After pulling, you should have `python_3.11-slim.sif`. Inside the container:
```shell-session
Apptainer> python --version
Python 3.11.x
Apptainer> ls /usr/local/lib/python3.11/
...  site-packages  ...
```
{{% /alert %}}

### Exercise 2: Run a command in a container

Using the Python image from Exercise 1:
@@ -592,6 +602,19 @@ Using the Python image from Exercise 1:
2. Use `apptainer exec` to run the script inside the container
3. Try running it with the `-C` flag - what happens to your script?

{{% alert title="Check your work" color="info" %}}
Without `-C`:
```shell-session
$ apptainer exec python_3.11-slim.sif python hello.py
Hello from Apptainer!
```
With `-C`, you get an error because the container can't see your files:
```shell-session
$ apptainer exec -C python_3.11-slim.sif python hello.py
python: can't open file 'hello.py': [Errno 2] No such file or directory
```
{{% /alert %}}

### Exercise 3: Build a custom image

Create a definition file for a container with your favorite tools:
@@ -601,6 +624,17 @@ Create a definition file for a container with your favorite tools:
3. Add a `%runscript` that displays a welcome message
4. Build the image and test it with `apptainer run`

{{% alert title="Check your work" color="info" %}}
After building:
```shell-session
$ apptainer run mytools.sif
Welcome to my custom container!
$ apptainer exec mytools.sif which curl jq
/usr/bin/curl
/usr/bin/jq
```
{{% /alert %}}

### Exercise 4: GPU container on DAIC

Test GPU access with a prebuilt image:
@@ -610,6 +644,16 @@ Test GPU access with a prebuilt image:
3. Run a Python command that checks `torch.cuda.is_available()`
4. Verify the GPU is detected with `nvidia-smi` inside the container

{{% alert title="Check your work" color="info" %}}
```shell-session
$ srun apptainer exec --nv pytorch.sif python -c "import torch; print(torch.cuda.is_available())"
True
$ srun apptainer exec --nv pytorch.sif nvidia-smi
... (GPU info displayed) ...
```
If you see `False`, check that you used `--nv` and requested a GPU with `--gres=gpu:1`.
{{% /alert %}}

### Exercise 5: Bind mounts

Practice data isolation:
@@ -619,6 +663,89 @@ Practice data isolation:
3. Inside the container, verify you can access the test file but not your home directory
4. Try mounting the directory as read-only with `--mount`

{{% alert title="Check your work" color="info" %}}
```shell-session
$ mkdir testdir && echo "test" > testdir/data.txt
$ apptainer shell -C --bind testdir:/mnt ubuntu_latest.sif
Apptainer> cat /mnt/data.txt
test
Apptainer> ls /home/$USER
ls: cannot access '/home/...': No such file or directory
```
With read-only mount, writing fails:
```shell-session
$ apptainer shell -C --mount type=bind,source=testdir,destination=/mnt,ro ubuntu_latest.sif
Apptainer> echo "new" >> /mnt/data.txt
bash: /mnt/data.txt: Read-only file system
```
{{% /alert %}}

---

## Troubleshooting

### Build fails with "no space left on device"

By default, Apptainer stores its cache under your home directory (`~/.apptainer/cache`), and builds write large intermediate files. Since `/home` on DAIC is limited to 5 MB, builds often fail.

**Solution**: Point both the cache and the temporary directory at project storage before building:

```shell-session
$ export APPTAINER_CACHEDIR=/tudelft.net/staff-umbrella/<project>/apptainer/cache
$ export APPTAINER_TMPDIR=/tudelft.net/staff-umbrella/<project>/apptainer/tmp
$ mkdir -p "$APPTAINER_CACHEDIR" "$APPTAINER_TMPDIR"
```

Add the two `export` lines to your `~/.bashrc` to make them permanent.
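
The same setup can be wrapped in a small shell function. This is a minimal sketch, not part of Apptainer: the function name `use_apptainer_dirs` is hypothetical, and the base directory is whatever project path you have write access to.

```bash
# Hypothetical helper: point Apptainer's cache and temp dirs at a base directory.
use_apptainer_dirs() {
    local base="$1"
    if [ -z "$base" ]; then
        echo "usage: use_apptainer_dirs <base-directory>" >&2
        return 1
    fi
    export APPTAINER_CACHEDIR="$base/apptainer/cache"
    export APPTAINER_TMPDIR="$base/apptainer/tmp"
    # Create both directories; quoting keeps paths with spaces intact.
    mkdir -p "$APPTAINER_CACHEDIR" "$APPTAINER_TMPDIR"
}
```

Call it with your project directory as the only argument, for example `use_apptainer_dirs /tudelft.net/staff-umbrella/<project>`.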

### GPU not visible inside container

Your container runs but `torch.cuda.is_available()` returns `False` or `nvidia-smi` fails.

**Possible causes and solutions**:

1. **Missing `--nv` flag**: Always pass `--nv` to enable GPU access:
   ```shell-session
   $ apptainer exec --nv myimage.sif python -c "import torch; print(torch.cuda.is_available())"
   ```

2. **Not running on a GPU node**: Check that you requested a GPU and are using `srun`:
   ```shell-session
   $ salloc --gres=gpu:1 ...
   $ srun apptainer exec --nv myimage.sif nvidia-smi
   ```

3. **CUDA version mismatch**: The container's CUDA version must be compatible with the host driver. Check the host driver version:
   ```shell-session
   $ nvidia-smi | grep "Driver Version"
   ```
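
The checks above can be run in order before launching the container. The following is a rough sketch under stated assumptions: the function name `gpu_preflight` is hypothetical, and it relies on Slurm setting `SLURM_JOB_ID` inside a job and on `nvidia-smi` being on the `PATH` of GPU nodes.

```bash
# Hypothetical pre-flight check; run it inside your job before starting the container.
gpu_preflight() {
    # Slurm sets SLURM_JOB_ID inside an allocation or batch job.
    if [ -z "${SLURM_JOB_ID:-}" ]; then
        echo "not inside a Slurm job: request a GPU and launch with srun"
        return 1
    fi
    # nvidia-smi is only expected on nodes with an NVIDIA driver installed.
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found on this node: is this a GPU node?"
        return 1
    fi
    # Print the host driver version to compare with the container's CUDA version.
    nvidia-smi --query-gpu=driver_version --format=csv,noheader
}
```

If the function prints a driver version, the host side is fine and the remaining suspects are the `--nv` flag and the CUDA version inside the image.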

### Cache filling up disk space

Apptainer caches pulled images and build layers. This can consume significant space over time.

**Solution**: Periodically clean the cache:

```shell-session
$ apptainer cache clean
```

To see cache usage:

```shell-session
$ apptainer cache list
```

### Container can't access my files

By default, Apptainer mounts your home directory and current working directory. With `-C` (`--contain`), these default mounts are dropped, so the container sees only the paths you bind explicitly.

**Solution**: Explicitly bind the directories you need:

```shell-session
$ apptainer exec -C --bind /tudelft.net/staff-umbrella/myproject:/data myimage.sif ls /data
```

---

## Summary
+44 −3
@@ -95,8 +95,8 @@ $ cd ~
$ pwd
```

{{% alert title="Question" color="info" %}}
What did you see in `/tudelft.net/staff-umbrella`? These are project directories - you'll have access to at least one for your work.
{{% alert title="Check your work" color="info" %}}
You should see project directories when listing `/tudelft.net/staff-umbrella`. After `cd ~` and `pwd`, you should see your home directory path (e.g., `/home/netid01`).
{{% /alert %}}

## Part 2: Understanding DAIC storage
@@ -202,6 +202,18 @@ $ echo "Author: $(whoami)" >> nlp-project/README.md
$ cat nlp-project/README.md
```

{{% alert title="Check your work" color="info" %}}
`ls nlp-project` should show:
```
data  notebooks  outputs  src
```
`cat nlp-project/README.md` should show:
```
# NLP Project
Author: <your-netid>
```
{{% /alert %}}

## Part 4: Working with files

Let's create some actual code to work with.
@@ -309,6 +321,14 @@ $ ls src
evaluate.py  train.py
```

{{% alert title="Check your work" color="info" %}}
`ls src` should show both files:
```
evaluate.py  train.py
```
If `train.py` is missing, you may have forgotten to copy before moving.
{{% /alert %}}

## Part 5: Viewing and editing files

### Viewing file contents
@@ -418,6 +438,10 @@ $ grep -l "import" src/*.py # Just show filenames
   $ find . -type d -name "data"
   ```

{{% alert title="Check your work" color="info" %}}
The `find . -mtime -1` command should list files you recently created. The `grep -n` command shows line numbers where "print" appears. The directory search should show `./data` (and any other data directories you created).
{{% /alert %}}

## Part 7: Transferring files

You'll often need to move data between your local computer and DAIC.
@@ -471,6 +495,14 @@ $ cat ~/linuxhome/test.txt
test data
```

{{% alert title="Check your work" color="info" %}}
After the `scp` command, you should see:
```
test.txt                      100%   10     0.0KB/s   00:00
```
On DAIC, `cat ~/linuxhome/test.txt` should display "test data".
{{% /alert %}}

## Part 8: Automating with scripts

When you find yourself typing the same commands repeatedly, it's time to write a script.
@@ -600,6 +632,15 @@ $ chmod +x cleanup_logs.sh
$ ./cleanup_logs.sh logs/
```

{{% alert title="Check your work" color="info" %}}
Verify the script is executable:
```shell-session
$ ls -l cleanup_logs.sh
-rwxr-xr-x 1 netid01 netid01 ... cleanup_logs.sh
```
The `x` in the permissions confirms it's executable. When run, it prints "Cleaning logs in logs/" and "Done!" (plus any files it removes).
{{% /alert %}}

## Part 9: Useful shortcuts and tips

### Tab completion
@@ -684,4 +725,4 @@ Now that you're comfortable with the command line:

## Quick reference

See the [Bash Cheatsheet](/reference/bash-cheatsheet/) for a compact command reference.
For more advanced shell customization, see [Shell Setup](/quickstart/shell-setup/).
+70 −2
@@ -309,7 +309,9 @@ srun python train.py
echo "End time: $(date)"
```

### Understanding module load
### Understanding the module system

DAIC uses an *environment modules* system to manage software. Instead of having every version of every library available at once (which would cause conflicts), software is organized into modules that you load when needed.

The `module` commands set up your software environment:

@@ -319,7 +321,23 @@ module load 2025/gpu # Load the 2025 GPU software stack
module load cuda/12.9   # Load CUDA 12.9
```

Different software requires different modules. Use `module avail` to see what's available and `module load <name>` to load them.
Why use modules?

- **Version selection**: Run `module load python/3.11` today, `python/3.12` tomorrow
- **Avoid conflicts**: Different projects can use different library versions
- **Clean environment**: `module purge` gives you a fresh start

Common module commands:

| Command | Purpose |
|---------|---------|
| `module avail` | List all available modules |
| `module avail cuda` | List modules matching "cuda" |
| `module list` | Show currently loaded modules |
| `module load <name>` | Load a module |
| `module purge` | Unload all modules |

For a complete guide, see [Loading Software](/howto/loading-software/).
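
In a job script, the commands from the table are often combined into a short preamble. A minimal sketch, assuming the `module` command is available as a shell function on the cluster; the helper name `load_stack` is hypothetical:

```bash
# Sketch: reset the environment, load the requested modules, and log the result.
load_stack() {
    if ! type module >/dev/null 2>&1; then
        echo "module command not available in this shell" >&2
        return 1
    fi
    module purge                      # start from a clean environment
    local m
    for m in "$@"; do
        module load "$m" || return 1  # stop at the first module that fails
    done
    module list                       # record what was loaded in the job log
}
```

For the stack shown earlier, this would be called as `load_stack 2025/gpu cuda/12.9`. Loading in a fixed order from a purged environment makes jobs easier to reproduce.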

### Submit and monitor

@@ -761,18 +779,68 @@ Try these on your own to solidify your understanding:
### Exercise 1: Basic job submission
Create and submit a job that prints your username, hostname, and current date. Check the output.

{{% alert title="Check your work" color="info" %}}
Your output file should contain something like:
```
netid01
gpu15.ethernet.tudhpc
Fri Mar 20 10:30:00 CET 2026
```
The hostname should be a compute node (not `daic01`).
{{% /alert %}}

### Exercise 2: GPU job
Modify the basic job to request a GPU. Add `nvidia-smi` to verify the GPU is available.

{{% alert title="Check your work" color="info" %}}
Your output should include `nvidia-smi` output showing a GPU:
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI ...    Driver Version: ...    CUDA Version: ...                  |
|-------------------------------+----------------------+----------------------+
| GPU  Name        ...
```
If you see "NVIDIA-SMI has failed", check that you requested a GPU with `--gres=gpu:1`.
{{% /alert %}}

### Exercise 3: Resource tuning
Submit a job, then use `seff` to check its efficiency. Was your resource request appropriate?

{{% alert title="Check your work" color="info" %}}
Run `seff <jobid>` after your job completes. Good efficiency looks like:
```
CPU Efficiency: 70-95%
Memory Efficiency: 50-90%
```
If either value is below 50%, request fewer CPUs or less memory next time.
{{% /alert %}}
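
The 50% rule of thumb can be checked automatically. This is a sketch only: the function name `check_efficiency` is hypothetical, and it assumes `seff` prints lines of the form `CPU Efficiency: 85.20% of ...` and `Memory Efficiency: 12.50% of ...`.

```bash
# Sketch: read seff output on stdin and flag efficiency values below a threshold.
check_efficiency() {
    local threshold="${1:-50}"        # percent; defaults to 50
    awk -v t="$threshold" '
        /Efficiency:/ {
            label = $1                # "CPU" or "Memory"
            pct = $3                  # e.g. "85.20%"
            sub(/%/, "", pct)         # strip the percent sign
            status = (pct + 0 < t) ? "LOW" : "ok"
            printf "%s efficiency %s%% %s\n", label, pct, status
        }'
}
```

Pipe the report into it with `seff <jobid> | check_efficiency`, or pass a different threshold as the first argument.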

### Exercise 4: Job array
Create a job array that runs 5 tasks. Each task should print its array task ID.

{{% alert title="Check your work" color="info" %}}
You should see 5 output files (e.g., `job_12345_1.out` through `job_12345_5.out`). Each should contain its task ID:
```shell-session
$ cat job_*_1.out
Task ID: 1
$ cat job_*_5.out
Task ID: 5
```
{{% /alert %}}
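
One possible batch script that produces these files is sketched below. The job name and time limit are placeholders, and you may need to add the account, partition, or resource directives your jobs normally use.

```bash
#!/bin/bash
#SBATCH --job-name=array-demo
#SBATCH --array=1-5
#SBATCH --output=job_%A_%a.out
#SBATCH --time=00:05:00
# In --output, %A expands to the array job ID and %a to the task ID.

# Slurm sets SLURM_ARRAY_TASK_ID for each task; "unset" appears only
# when the script is run outside a job array.
print_task_id() {
    echo "Task ID: ${SLURM_ARRAY_TASK_ID:-unset}"
}

print_task_id
```

Each of the five tasks runs the same script and differs only in the value of `SLURM_ARRAY_TASK_ID`, which is what you would use to select an input file or parameter set.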

### Exercise 5: Dependencies
Submit two jobs where the second depends on the first completing successfully.

{{% alert title="Check your work" color="info" %}}
After submitting both jobs, `squeue -u $USER` should show:
```
  JOBID PARTITION     NAME     USER ST  REASON
  12346       all   second  netid01 PD  (Dependency)
  12345       all    first  netid01  R
```
The second job shows `(Dependency)` while waiting. After the first completes, the second starts automatically.
{{% /alert %}}
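
A chain like this can be submitted in one step. A minimal sketch using `sbatch --parsable` and `--dependency=afterok`; the function name `submit_chain` and the script names are hypothetical placeholders.

```bash
# Sketch: submit two batch scripts so the second starts only if the first succeeds.
submit_chain() {
    local first="$1" second="$2" jid
    # --parsable makes sbatch print only the job ID (optionally "id;cluster").
    jid="$(sbatch --parsable "$first")" || return 1
    jid="${jid%%;*}"                  # drop the cluster name if present
    # afterok: start only after the first job finishes with exit code 0.
    sbatch --dependency="afterok:${jid}" "$second"
}
```

For example, `submit_chain first.sh second.sh` submits both jobs. If the first job fails, the second is never started and remains pending until you cancel it or Slurm removes it, depending on the cluster configuration.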

## Next steps

- [Apptainer Tutorial](/tutorials/apptainer/) - Package your environment in containers
+36 −0
@@ -765,21 +765,57 @@ Practice these tasks to build muscle memory:
### Exercise 1: Basic editing
Create a new file, add three lines of text, save and quit. Then reopen it and verify your changes.

{{% alert title="Check your work" color="info" %}}
After `:wq`, verify with:
```shell-session
$ cat myfile.txt
line one
line two
line three
```
If the file is empty or missing, you may have quit without saving (`:q!` instead of `:wq`).
{{% /alert %}}

### Exercise 2: Navigation
Open a Python file and practice: go to end (`G`), go to beginning (`gg`), jump by words (`w`, `b`), go to specific line (`10G`).

{{% alert title="Check your work" color="info" %}}
Check your position with `:set number` to show line numbers. After `G`, you should be on the last line. After `gg`, you should be on line 1. After `10G`, you should be on line 10.
{{% /alert %}}

### Exercise 3: Delete and undo
Open a file, delete a line (`dd`), undo (`u`), delete a word (`dw`), undo again.

{{% alert title="Check your work" color="info" %}}
After each `u`, the deleted content should reappear. If undo doesn't work, make sure you're in Normal mode (press `Esc` first).
{{% /alert %}}

### Exercise 4: Copy and paste
Copy a line (`yy`), move to a new location, paste it (`p`). Then try with multiple lines using `V`.

{{% alert title="Check your work" color="info" %}}
After `yy` and `p`, you should see the same line duplicated. With `V`, select multiple lines (they highlight), then `y` to copy and `p` to paste them elsewhere.
{{% /alert %}}

### Exercise 5: Search and replace
Open a file and search for a word (`/word`). Then replace all occurrences of one word with another (`:%s/old/new/g`).

{{% alert title="Check your work" color="info" %}}
After `/word` and pressing Enter, the cursor jumps to the next match after the cursor position. Press `n` to move to subsequent matches. After `:%s/old/new/g`, Vim reports how many substitutions were made (e.g., "5 substitutions on 3 lines").
{{% /alert %}}

### Exercise 6: Real task
Edit a SLURM batch script: change the time limit, add a new `#SBATCH` directive, and save.

{{% alert title="Check your work" color="info" %}}
After saving, verify your changes:
```shell-session
$ grep -E "time|gres" submit.sh
#SBATCH --time=4:00:00
#SBATCH --gres=gpu:1
```
{{% /alert %}}

## Keep learning

- Run `vimtutor` for a 30-minute interactive tutorial