3. Check that the working directory in your job script is set via `$SLURM_SUBMIT_DIR`
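
For reference, a minimal job script applying this check might look like the following sketch (the resource requests and `train.py` entry point are placeholders, not DAIC-specific values):

```bash
#!/bin/bash
#SBATCH --job-name=train
#SBATCH --gres=gpu:1        # placeholder resource request
#SBATCH --time=01:00:00

# Slurm sets $SLURM_SUBMIT_DIR to the directory sbatch was invoked from;
# cd there so relative paths in the script resolve as expected.
cd "$SLURM_SUBMIT_DIR"

python train.py             # hypothetical training entry point
```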
## Multi-GPU training issues
### Training hangs with multiple GPUs
**Symptoms:** Training hangs after "Initializing distributed" or "All distributed processes registered". NCCL `all_reduce` operations never complete.
**Cause:** DAIC GPU nodes have GPUs on different NUMA nodes (CPU sockets). NCCL P2P (peer-to-peer) communication fails between GPUs that aren't directly connected.
**Solution:** Add this to your job script:
```bash
export NCCL_P2P_DISABLE=1
```
This forces NCCL to use shared memory instead of P2P, which works across NUMA boundaries.
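
In context, the export goes before the launch command in your job script. A sketch assuming a `torchrun` launcher and a hypothetical `train.py`:

```bash
#!/bin/bash
#SBATCH --gres=gpu:2                   # placeholder: two GPUs on one node
#SBATCH --time=04:00:00

# Route NCCL traffic through shared memory instead of P2P,
# which works across NUMA boundaries.
export NCCL_P2P_DISABLE=1

torchrun --nproc_per_node=2 train.py   # hypothetical training entry point
```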
### Verify GPU topology
Check how GPUs are connected:
```bash
nvidia-smi topo -m
```
If the matrix shows `SYS` between GPUs (rather than `NV#`, which indicates an NVLink connection), you need `NCCL_P2P_DISABLE=1`.
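
As an illustrative example (not actual DAIC output), a two-GPU node without NVLink might report something like:

```
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      SYS     0-23            0
GPU1    SYS      X      24-47           1
```

Here `SYS` means the GPUs communicate across the inter-socket (SMP) interconnect, which is exactly the case where NCCL P2P transfers can stall.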
See [Multi-GPU Training](/tutorials/multi-gpu/#nccl-configuration-on-daic) for details.