HPC Reference Guide

SLURM Instructions

A complete reference for submitting, monitoring, and managing jobs on a SLURM-managed HPC cluster — including GPU allocation, distributed training, and Prajna-specific configuration.

NVIDIA A100 · OpenNMT / PyTorch · Prajna Cluster · CUDA 12.4 · Multi-Node DDP

01. GPU Partition

The GPU partition includes nodes equipped with NVIDIA A100 GPUs. Jobs submitted here can leverage A100 GPU cards for high-performance parallel processing.

The GPU partition exclusively contains GPU nodes. To submit jobs to GPU nodes, you must specify both the partition name and the number of GPU cards required.
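
For example, a minimal set of directives for a single-GPU job on this partition could look like the following (the partition name gpu is a placeholder; use the name reported by sinfo on your cluster):

bash
#SBATCH --partition=gpu   # placeholder; check `sinfo` for the actual partition name
#SBATCH --gres=gpu:1      # number of A100 cards required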

02. How to List Partitions

List all available partitions:

bash
sinfo

Show only partition names:

bash
sinfo -h -o "%P"

Detailed view — state, nodes, time limits:

bash
sinfo -o "%P %a %l %D %t %N"

Show specific partition info:

bash
scontrol show partition <partition_name>

03. Partition & Cluster Information

bash
# View all partitions and status
sinfo

# Detailed partition info (nodes, GPUs, time limits)
sinfo -o "%P %N %G %l %c %m"

# Specific partition details
scontrol show partition <partition_name>

# Node-level details
scontrol show node <node_name>

04. Job Submission & Testing

bash
# Submit a job
sbatch job_script.sh

# Run an interactive job
srun --pty bash

# Test job without submitting (dry run)
sbatch --test-only job_script.sh
To check the estimated start time, use --test-only or the checktime command (uses check-time.sh).
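
As a rough sketch of the same check without the site wrapper, SLURM's own start-time estimate for a pending job can also be queried directly:

bash
# Expected start time and nodes for a pending job
squeue --start -j <jobid>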

05. Resources

Request GPUs on compute nodes:

bash — sbatch directives
#SBATCH --gres=gpu:1   # 1 GPU
#SBATCH --gres=gpu:2   # 2 GPUs

#SBATCH --mem=[MB]          # Total memory per node
#SBATCH --mem-per-cpu=[MB]  # Memory per CPU core
Resource Guidelines
  • CPUs in a GPU job: 8 is a safe default.
  • RAM: always allocate at least as much RAM as the total GPU memory (an A100 has 80 GB per card).
  • Example: 2× A100 (80 GB each) → --mem=160000 (160 GB).
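
Putting these guidelines together, a sketch of the resource directives for a 2× A100 job (values are illustrative; adjust to your partition and workload):

bash
#SBATCH --gres=gpu:2        # 2× A100
#SBATCH --cpus-per-task=8   # safe CPU count for a GPU job
#SBATCH --mem=160000        # 160 GB, i.e. at least 2 × 80 GB of GPU memory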

06. Job Monitoring & Queue Utilities

bash
# View all jobs in queue
squeue

# View only your jobs
squeue --me

# Custom formatted view
squeue -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"

# View completed jobs
sacct

# Detailed job accounting
sacct -j <jobid> --format=JobID,JobName,Partition,AllocCPUS,State,Elapsed

07. Viewing & Modifying Job Details

bash
# View details of a specific job
scontrol show job <jobid>

# Update the time limit
scontrol update jobid=<jobid> TimeLimit=4-00:00:00

08. Checking GPU Usage

1. Check your job ID:

bash
squeue --me

2. Run nvidia-smi within your allocated job:

bash
srun --jobid=<jobid> nvidia-smi
The --jobid flag attaches srun to an already-allocated job.
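
To watch utilization over time, the same command can be wrapped in watch (a sketch; on recent SLURM versions the extra job step may need --overlap to share resources with the running step):

bash
# Refresh GPU usage every 5 seconds; Ctrl+C to stop
watch -n 5 "srun --jobid=<jobid> --overlap nvidia-smi"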

09. Opening a Shell Inside a Worker Node

Open an interactive bash shell inside the allocated worker node:

bash
srun --jobid=<jobid> --pty bash

This lets you inspect the node environment, check GPU state, validate paths, and debug issues directly.
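
Typical checks once inside the node (a sketch; paths and environment names are illustrative):

bash
nvidia-smi                    # GPU state and processes on this node
echo $CUDA_VISIBLE_DEVICES    # GPUs SLURM assigned to this job
df -h .                       # is the working (shared) filesystem mounted?
which python && python -V     # is the expected environment active?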

10. Cancel / Modify Jobs

bash
# Cancel a specific job
scancel <jobid>

# Cancel all your jobs
scancel -u $USER

# Hold a job (prevent it from starting)
scontrol hold <jobid>

# Release a held job
scontrol release <jobid>

11. Debugging & Logs

bash
# Check the default job output file
cat slurm-<jobid>.out

# Follow logs live
tail -f slurm-<jobid>.out

# Check node allocation and job errors
scontrol show job <jobid>

12. Useful Tips & Common Pitfalls

Check pending reason for a job in PD state:

bash
squeue -j <jobid> -o "%i %t %r"

Common Pending Reasons

Reason               Description
Resources            Requested resources are not yet available
PartitionTimeLimit   Requested time exceeds the partition's maximum
Priority             Other jobs have higher priority
QOSMaxGRESPerUser    Per-user GPU limit reached
NodeDown             Allocated node is unavailable/down
ReqNodeNotAvail      A specifically requested node is unavailable

Other Common Pitfalls

  • Don't over-request memory — jobs may be rejected or delayed if memory exceeds node capacity.
  • Avoid long --time values — unnecessarily high time limits reduce scheduling priority.
  • Module env mismatch — ensure all required modules (module load) are loaded inside your job script, not just in your interactive shell.
  • Path issues — always use absolute paths in job scripts.
  • Job array pitfalls — use %A (array job ID) and %a (task index) in output filenames; see the sketch after this list.
  • NCCL errors in multi-GPU jobs — usually caused by incorrect --ntasks-per-node settings. See §13.
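
A minimal job-array sketch for the %A / %a naming (the script name and index range are placeholders):

bash
#SBATCH --array=0-9                 # 10 tasks, indices 0..9
#SBATCH --output=logs/%A_%a.out     # %A = array job ID, %a = task index

srun python process_shard.py --shard "$SLURM_ARRAY_TASK_ID"   # hypothetical script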

13. Training

Single-Node Training

For initial experimentation and debugging, run training manually inside a single SLURM task.

bash — job script
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:2
Setting --ntasks-per-node=2 without proper distributed setup (srun / DDP) leads to duplicate training processes, GPU contention, and NCCL crashes. Keep it at 1 and let your script handle process spawning.
Key Principle
There are two independent ways processes can be created:
  1. SLURM via --ntasks-per-node
  2. Your training script via manual spawning or framework internals

→ Use only ONE of these at a time.
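
A single-node sketch that follows this principle: SLURM creates exactly one task, and the training command itself drives both GPUs. The config file name is a placeholder; with OpenNMT the GPU count is set inside the YAML (e.g. world_size: 2, gpu_ranks: [0, 1]) rather than via SLURM tasks.

bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=8

# One SLURM task; OpenNMT spawns one process per GPU internally
onmt_train -config yamls/train.single-node.yaml   # placeholder config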

Multi-Node Training

bash — multi-node job script
#SBATCH -N 2                        # 2 nodes
#SBATCH --ntasks-per-node=2         # 2 processes per node (1 per GPU)
#SBATCH --gres=gpu:2                # 2 GPUs per node
#SBATCH --cpus-per-task=8
#SBATCH --mem=0

# ---- Distributed Setup ----
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=29500
export WORLD_SIZE=$(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))

# ---- Launch ----
srun onmt_train -config yamls/a40.train.multilingual.en-xx-en.yaml
Do NOT manually set CUDA_VISIBLE_DEVICES in multi-node jobs — SLURM assigns GPUs to each process automatically. Do NOT combine srun with torchrun (srun torchrun ...) — this spawns duplicate processes and causes NCCL hangs.

Mental Model

Topology
Single-node:  1 machine → multiple GPUs → shared memory

Multi-node:   Node 0 <──network──> Node 1
                 GPU GPU             GPU GPU

Bottleneck shifts from compute → network (NCCL)

srun vs torchrun

Feature                torchrun   srun
Cluster-aware          no         yes
Multi-node launch      manual     via the scheduler
Resource allocation    manual     handled by SLURM
Recommended on SLURM   no         yes

DDP Environment Variables Set by Launchers

Variable       Example / Meaning
RANK           0, 1, 2, 3 (global process rank)
LOCAL_RANK     GPU id on the node
WORLD_SIZE     Total number of processes
MASTER_ADDR    Node 0 hostname
MASTER_PORT    Communication port
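
When launching with srun, SLURM already exposes per-process equivalents; a common sketch is to map them onto the DDP names inside the step (SLURM_PROCID, SLURM_LOCALID, and SLURM_NTASKS are standard SLURM variables; whether your training script reads RANK/LOCAL_RANK directly is an assumption):

bash
# Run inside the srun step (e.g. in a small wrapper script), not at the top of the batch script
export RANK=$SLURM_PROCID          # global process rank
export LOCAL_RANK=$SLURM_LOCALID   # rank within this node (maps to a GPU)
export WORLD_SIZE=$SLURM_NTASKS    # total number of processes
# MASTER_ADDR / MASTER_PORT exported as in the multi-node script above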

Final Checklist (Before Multi-Node Jobs)

  • Nodes can communicate (test with ping)
  • Same software environment on all nodes (modules, conda envs)
  • Shared filesystem accessible from all nodes (data & checkpoints)
  • CUDA_VISIBLE_DEVICES is not manually set
  • Training launched using srun
  • MASTER_ADDR and MASTER_PORT are exported correctly
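
A quick pre-flight sketch for this checklist, run from inside an allocated multi-node job (the commands assume nvidia-smi and a shared Python environment on every node):

bash
# One line per task: every node reachable and GPUs visible?
srun bash -c 'echo "$(hostname): $(nvidia-smi -L | wc -l) GPU(s)"'

# Same Python interpreter and version on every node?
srun bash -c 'hostname; which python; python -V'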

14. Prajna Server Configuration

Rules and settings specific to the Prajna cluster. May not apply to other SLURM-managed servers.

Partition & QoS Must Match

On Prajna, --partition and --qos values must be identical. Mismatching causes a QoS error.

bash
#SBATCH --partition=a40
#SBATCH --qos=a40

GPU Must Be Explicitly Requested

Even after specifying a GPU partition, you must also set --gres=gpu:<n>. Without it, no GPUs will be visible.

bash — 4 GPUs on dgx partition
#SBATCH --partition=dgx
#SBATCH --qos=dgx
#SBATCH --gres=gpu:4
Common Prajna Pitfalls
QoS mismatch (e.g. --partition=a40 with --qos=dgx) → job rejected immediately.
Missing --gres → job runs but sees 0 GPUs; CUDA fails silently with no device found.

15. Environment Setup: Spack & Conda

Specific to the Prajna server environment. System CUDA version: 12.4.

Session Setup Checklist

Run these in order at the start of every new terminal session on Prajna:

bash
# 1. Load Spack (use main path — NOT the outdated /scratch path)
source /lustre-flash/apps/spack/share/spack/setup-env.sh

# 2. Load Miniconda via Spack
spack load miniconda3

# 3. Activate your conda environment
conda activate <your_env>
Do NOT use the outdated path:
/scratch/apps/spack/share/spack/setup-env.sh
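
Because modules and environments must also be loaded inside the job script (see §12), a sketch of a Prajna job-script preamble, with the conda environment name left as a placeholder:

bash
# Job-script preamble: recreate the session setup on the worker node
source /lustre-flash/apps/spack/share/spack/setup-env.sh
spack load miniconda3
conda activate <your_env>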

Fixing Spack Errors

If you encounter broken package repo or stale cache issues:

bash
rm -rf ~/.spack/package_repos       # Remove broken local repo cache
spack clean -a                      # Clean all cached/built data
source /lustre-flash/apps/spack/share/spack/setup-env.sh
spack load miniconda3