HPC Reference Guide

SLURM Instructions

A complete reference for submitting, monitoring, and managing jobs on a SLURM-managed HPC cluster — including GPU allocation, distributed training, and Prajna-specific configuration.

NVIDIA A100 · OpenNMT / PyTorch · Prajna Cluster · CUDA 12.4 · Multi-Node DDP

01. GPU Partition

The GPU partition includes nodes equipped with NVIDIA A100 GPUs. Jobs submitted here can leverage A100 GPU cards for high-performance parallel processing.

The GPU partition exclusively contains GPU nodes. To submit jobs to GPU nodes, you must specify both the partition name and the number of GPU cards required.
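
For example, a minimal set of directives for a single-GPU job on this partition could look like the following (the partition name gpu is a placeholder; use the name reported by sinfo on your cluster):

bash
#SBATCH --partition=gpu   # placeholder; check `sinfo` for the actual partition name
#SBATCH --gres=gpu:1      # number of A100 cards required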

02. How to List Partitions

List all available partitions:

bash
sinfo

Show only partition names:

bash
sinfo -h -o "%P"

Detailed view — state, nodes, time limits:

bash
sinfo -o "%P %a %l %D %t %N"

Show specific partition info:

bash
scontrol show partition <partition_name>

03. Partition & Cluster Information

bash
# View all partitions and status
sinfo

# Detailed partition info (nodes, GPUs, time limits)
sinfo -o "%P %N %G %l %c %m"

# Specific partition details
scontrol show partition <partition_name>

# Node-level details
scontrol show node <node_name>

04. Job Submission & Testing

bash
# Submit a job
sbatch job_script.sh

# Run an interactive job
srun --pty bash

# Test job without submitting (dry run)
sbatch --test-only job_script.sh
To check the estimated start time, use --test-only or the checktime command (uses check-time.sh).
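
As a rough sketch of the same check without the site wrapper, SLURM's own start-time estimate for a pending job can also be queried directly:

bash
# Expected start time and nodes for a pending job
squeue --start -j <jobid>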

05. Resources

Request GPUs on compute nodes:

bash — sbatch directives
#SBATCH --gres=gpu:1   # 1 GPU
#SBATCH --gres=gpu:2   # 2 GPUs

#SBATCH --mem=[MB]          # Total memory per node
#SBATCH --mem-per-cpu=[MB]  # Memory per CPU core
Resource Guidelines
  • CPUs in a GPU job: 8 is a safe default.
  • RAM: always allocate at least as much RAM as the total GPU memory (an A100 has 80 GB per card).
  • Example: 2× A100 (80 GB each) → --mem=160000 (160 GB).
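
Putting these guidelines together, a sketch of the resource directives for a 2× A100 job (values are illustrative; adjust to your partition and workload):

bash
#SBATCH --gres=gpu:2        # 2× A100
#SBATCH --cpus-per-task=8   # safe CPU count for a GPU job
#SBATCH --mem=160000        # 160 GB, i.e. at least 2 × 80 GB of GPU memory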

06. Job Monitoring & Queue Utilities

bash
# View all jobs in queue
squeue

# View only your jobs
squeue --me

# Custom formatted view
squeue -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"

# View completed jobs
sacct

# Detailed job accounting
sacct -j <jobid> --format=JobID,JobName,Partition,AllocCPUS,State,Elapsed

07. Viewing & Modifying Job Details

bash
# View details of a specific job
scontrol show job <jobid>

# Update the time limit
scontrol update jobid=<jobid> TimeLimit=4-00:00:00

08. Checking GPU Usage

1. Check your job ID:

bash
squeue --me

2. Run nvidia-smi within your allocated job:

bash
srun --jobid=<jobid> nvidia-smi
The --jobid flag attaches srun to an already-allocated job.
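
To watch utilization over time, the same command can be wrapped in watch (a sketch; on recent SLURM versions the extra job step may need --overlap to share resources with the running step):

bash
# Refresh GPU usage every 5 seconds; Ctrl+C to stop
watch -n 5 "srun --jobid=<jobid> --overlap nvidia-smi"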

09. Opening a Shell Inside a Worker Node

Open an interactive bash shell inside the allocated worker node:

bash
srun --jobid=<jobid> --pty bash

This lets you inspect the node environment, check GPU state, validate paths, and debug issues directly.
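
Typical checks once inside the node (a sketch; paths and environment names are illustrative):

bash
nvidia-smi                    # GPU state and processes on this node
echo $CUDA_VISIBLE_DEVICES    # GPUs SLURM assigned to this job
df -h .                       # is the working (shared) filesystem mounted?
which python && python -V     # is the expected environment active?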

10. Cancel / Modify Jobs

bash
# Cancel a specific job
scancel <jobid>

# Cancel all your jobs
scancel -u $USER

# Hold a job (prevent it from starting)
scontrol hold <jobid>

# Release a held job
scontrol release <jobid>

11. Debugging & Logs

bash
# Check the default job output file
cat slurm-<jobid>.out

# Follow logs live
tail -f slurm-<jobid>.out

# Check node allocation and job errors
scontrol show job <jobid>

12. Useful Tips & Common Pitfalls

Check pending reason for a job in PD state:

bash
squeue -j <jobid> -o "%i %t %r"

Common Pending Reasons

Reason               Description
Resources            Requested resources are not yet available
PartitionTimeLimit   Requested time exceeds the partition's maximum
Priority             Other jobs have higher priority
QOSMaxGRESPerUser    Per-user GPU limit reached
NodeDown             Allocated node is unavailable/down
ReqNodeNotAvail      A specifically requested node is unavailable

Other Common Pitfalls

  • Don't over-request memory — jobs may be rejected or delayed if memory exceeds node capacity.
  • Avoid long --time values — unnecessarily high time limits reduce scheduling priority.
  • Module env mismatch — ensure all required modules (module load) are loaded inside your job script, not just in your interactive shell.
  • Path issues — always use absolute paths in job scripts.
  • Job array pitfalls — use %A (array job ID) and %a (task index) in output filenames; see the sketch after this list.
  • NCCL errors in multi-GPU jobs — usually caused by incorrect --ntasks-per-node settings. See §13.
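
A minimal job-array sketch for the %A / %a naming (the script name and index range are placeholders):

bash
#SBATCH --array=0-9                 # 10 tasks, indices 0..9
#SBATCH --output=logs/%A_%a.out     # %A = array job ID, %a = task index

srun python process_shard.py --shard "$SLURM_ARRAY_TASK_ID"   # hypothetical script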

13. Training

Single-Node Training

For initial experimentation and debugging, run training manually inside a single SLURM task.

bash — job script
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:2
Setting --ntasks-per-node=2 without proper distributed setup (srun / DDP) leads to duplicate training processes, GPU contention, and NCCL crashes. Keep it at 1 and let your script handle process spawning.
Key Principle
There are two independent ways processes can be created:
  1. SLURM via --ntasks-per-node
  2. Your training script via manual spawning or framework internals

→ Use only ONE of these at a time.
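
A single-node sketch that follows this principle: SLURM creates exactly one task, and the training command itself drives both GPUs. The config file name is a placeholder; with OpenNMT the GPU count is set inside the YAML (e.g. world_size: 2, gpu_ranks: [0, 1]) rather than via SLURM tasks.

bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=8

# One SLURM task; OpenNMT spawns one process per GPU internally
onmt_train -config yamls/train.single-node.yaml   # placeholder config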

Multi-Node Training

bash — multi-node job script
#SBATCH -N 2                        # 2 nodes
#SBATCH --ntasks-per-node=2         # 2 processes per node (1 per GPU)
#SBATCH --gres=gpu:2                # 2 GPUs per node
#SBATCH --cpus-per-task=8
#SBATCH --mem=0

# ---- Distributed Setup ----
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=29500
export WORLD_SIZE=$(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))

# ---- Launch ----
srun onmt_train -config yamls/a40.train.multilingual.en-xx-en.yaml
Do NOT manually set CUDA_VISIBLE_DEVICES in multi-node jobs — SLURM assigns GPUs to each process automatically. Do NOT combine srun with torchrun (srun torchrun ...) — this spawns duplicate processes and causes NCCL hangs.

Mental Model

Topology
Single-node:  1 machine → multiple GPUs → shared memory

Multi-node:   Node 0 <──network──> Node 1
                 GPU GPU             GPU GPU

Bottleneck shifts from compute → network (NCCL)

srun vs torchrun

Feature                torchrun   srun
Cluster-aware          no         yes
Multi-node launch      manual     via the scheduler
Resource allocation    manual     handled by SLURM
Recommended on SLURM   no         yes

DDP Environment Variables Set by Launchers

Variable       Example / Meaning
RANK           0, 1, 2, 3 (global process rank)
LOCAL_RANK     GPU id on the node
WORLD_SIZE     Total number of processes
MASTER_ADDR    Node 0 hostname
MASTER_PORT    Communication port
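
When launching with srun, SLURM already exposes per-process equivalents; a common sketch is to map them onto the DDP names inside the step (SLURM_PROCID, SLURM_LOCALID, and SLURM_NTASKS are standard SLURM variables; whether your training script reads RANK/LOCAL_RANK directly is an assumption):

bash
# Run inside the srun step (e.g. in a small wrapper script), not at the top of the batch script
export RANK=$SLURM_PROCID          # global process rank
export LOCAL_RANK=$SLURM_LOCALID   # rank within this node (maps to a GPU)
export WORLD_SIZE=$SLURM_NTASKS    # total number of processes
# MASTER_ADDR / MASTER_PORT exported as in the multi-node script above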

Final Checklist (Before Multi-Node Jobs)

  • Nodes can communicate (test with ping)
  • Same software environment on all nodes (modules, conda envs)
  • Shared filesystem accessible from all nodes (data & checkpoints)
  • CUDA_VISIBLE_DEVICES is not manually set
  • Training launched using srun
  • MASTER_ADDR and MASTER_PORT are exported correctly
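
A quick pre-flight sketch for this checklist, run from inside an allocated multi-node job (the commands assume nvidia-smi and a shared Python environment on every node):

bash
# One line per task: every node reachable and GPUs visible?
srun bash -c 'echo "$(hostname): $(nvidia-smi -L | wc -l) GPU(s)"'

# Same Python interpreter and version on every node?
srun bash -c 'hostname; which python; python -V'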

14. Prajna Server Configuration

Rules and settings specific to the Prajna cluster. May not apply to other SLURM-managed servers.

Partition & QoS Must Match

On Prajna, --partition and --qos values must be identical. Mismatching causes a QoS error.

bash
#SBATCH --partition=a40
#SBATCH --qos=a40

GPU Must Be Explicitly Requested

Even after specifying a GPU partition, you must also set --gres=gpu:<n>. Without it, no GPUs will be visible.

bash — 4 GPUs on dgx partition
#SBATCH --partition=dgx
#SBATCH --qos=dgx
#SBATCH --gres=gpu:4
Common Prajna Pitfalls
QoS mismatch (e.g. --partition=a40 with --qos=dgx) → job rejected immediately.
Missing --gres → job runs but sees 0 GPUs; CUDA fails silently with no device found.

15. Environment Setup: Spack & Conda

Specific to the Prajna server environment. System CUDA version: 12.4.

Session Setup Checklist

Run these in order at the start of every new terminal session on Prajna:

bash
# 1. Load Spack (use main path — NOT the outdated /scratch path)
source /lustre-flash/apps/spack/share/spack/setup-env.sh

# 2. Load Miniconda via Spack
spack load miniconda3

# 3. Activate your conda environment
conda activate <your_env>
Do NOT use the outdated path:
/scratch/apps/spack/share/spack/setup-env.sh
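
Because modules and environments must also be loaded inside the job script (see §12), a sketch of a Prajna job-script preamble, with the conda environment name left as a placeholder:

bash
# Job-script preamble: recreate the session setup on the worker node
source /lustre-flash/apps/spack/share/spack/setup-env.sh
spack load miniconda3
conda activate <your_env>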

Fixing Spack Errors

If you encounter broken package repo or stale cache issues:

bash
rm -rf ~/.spack/package_repos       # Remove broken local repo cache
spack clean -a                      # Clean all cached/built data
source /lustre-flash/apps/spack/share/spack/setup-env.sh
spack load miniconda3