GPU Partition
The GPU partition includes nodes equipped with NVIDIA A100 GPUs. Jobs submitted here can leverage A100 GPU cards for high-performance parallel processing.
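A minimal batch script for this partition might look like the sketch below. The partition name `gpu` is an assumption (the real name can be found with `sinfo`), and the resource values follow the rules of thumb given later in this page.

```bash
#!/bin/bash
#SBATCH --job-name=a100-test
#SBATCH --partition=gpu        # assumed partition name; check `sinfo` for the real one
#SBATCH --gres=gpu:1           # request one A100
#SBATCH --cpus-per-task=8
#SBATCH --mem=80000            # at least the GPU's 80 GB of memory
#SBATCH --time=01:00:00

nvidia-smi                     # confirm the GPU is visible before starting real work
```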
How to List Partitions
List all available partitions:
```bash
sinfo
```
Show only partition names:
```bash
sinfo -h -o "%P"
```
Detailed view (state, nodes, time limits):
```bash
sinfo -o "%P %a %l %D %t %N"
```
Show specific partition info:
```bash
scontrol show partition <partition_name>
```
Partition & Cluster Information
```bash
# View all partitions and status
sinfo

# Detailed partition info (nodes, GPUs, time limits)
sinfo -o "%P %N %G %l %c %m"

# Specific partition details
scontrol show partition <partition_name>

# Node-level details
scontrol show node <node_name>
```
Job Submission & Testing
```bash
# Submit a job
sbatch job_script.sh

# Run an interactive job
srun --pty bash

# Test job without submitting (dry run)
sbatch --test-only job_script.sh
```
The dry run can be done either with `--test-only` or with the cluster's `checktime` command (which uses check-time.sh).

Resources
Request GPUs on compute nodes:
```bash
#SBATCH --gres=gpu:1          # 1 GPU
#SBATCH --gres=gpu:2          # 2 GPUs
#SBATCH --mem=[MB]            # Total memory per node
#SBATCH --mem-per-cpu=[MB]    # Memory per CPU core
```
- CPUs in a GPU environment: 8 CPUs per task is a safe choice (e.g. `--cpus-per-task=8`).
- RAM: always allocate at least as much system RAM as the total GPU memory. Example: 2× A100 (80 GB each) → `--mem=160000`.
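Putting these guidelines together, a resource request for a 2-GPU job might look like the following sketch (values follow the rules of thumb above; adjust them to your node's actual limits):

```bash
#SBATCH --gres=gpu:2          # 2× A100, 80 GB each
#SBATCH --cpus-per-task=8     # safe CPU count per task
#SBATCH --mem=160000          # at least the total GPU memory (2 × 80 GB)
```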
Job Monitoring & Queue Utilities
```bash
# View all jobs in queue
squeue

# View only your jobs
squeue --me

# Custom formatted view
squeue -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"

# View completed jobs
sacct

# Detailed job accounting
sacct -j <jobid> --format=JobID,JobName,Partition,AllocCPUS,State,Elapsed
```
Viewing & Modifying Job Details
```bash
# View details of a specific job
scontrol show job <jobid>

# Update the time limit (example: job 106)
scontrol update jobid=106 TimeLimit=4-00:00:00
```
Checking GPU Usage
```bash
# Find your job ID
squeue --me

# Run nvidia-smi inside the job's allocation
srun --jobid=<jobid> nvidia-smi
```
The `--jobid` flag attaches srun to an already-allocated job.

Opening a Shell Inside a Worker Node
Open an interactive bash shell inside the allocated worker node:
```bash
srun --jobid=<jobid> --pty -i bash
```
This lets you inspect the node environment, check GPU state, validate paths, and debug issues directly.
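Once inside the node, a few quick checks usually tell you whether the environment matches what the job script expects. This is a suggested sequence, not an exhaustive list, and the data path is a placeholder:

```bash
hostname                      # confirm which worker node you landed on
nvidia-smi                    # GPU model, memory, and running processes
echo $CUDA_VISIBLE_DEVICES    # GPUs SLURM assigned to this job
module list                   # modules actually loaded in this shell
df -h /path/to/your/data      # replace with your data path; checks the filesystem is mounted
```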
Cancel / Modify Jobs
```bash
# Cancel a specific job
scancel <jobid>

# Cancel all your jobs
scancel -u $USER

# Hold a job (prevent it from starting)
scontrol hold <jobid>

# Release a held job
scontrol release <jobid>
```
Debugging & Logs
```bash
# Check the default job output file
cat slurm-<jobid>.out

# Follow logs live
tail -f slurm-<jobid>.out

# Check node allocation and job errors
scontrol show job <jobid>
```
Useful Tips & Common Pitfalls
Check pending reason for a job in PD state:
```bash
squeue -j <jobid> -o "%i %t %r"
```
Common Pending Reasons
| Reason | Description |
|---|---|
| Resources | Requested resources not yet available |
| PartitionTimeLimit | Requested time exceeds partition's maximum |
| Priority | Other jobs have higher priority |
| QOSMaxGRESPerUser | GPU limit per user reached |
| NodeDown | Allocated node is unavailable/down |
| ReqNodeNotAvail | Specific requested node is unavailable |
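To watch a pending job's reason update over time, one option (assuming the standard `watch` utility is available on the login node) is:

```bash
# Re-run the pending-reason query every 30 seconds
watch -n 30 'squeue -j <jobid> -o "%i %t %r"'
```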
Other Common Pitfalls
- Don't over-request memory — jobs may be rejected or delayed if memory exceeds node capacity.
- Avoid long `--time` values: unnecessarily high time limits reduce scheduling priority.
- Module env mismatch: ensure all required modules (`module load`) are loaded inside your job script, not just in your interactive shell.
- Path issues: always use absolute paths in job scripts.
- Job array pitfalls: use `%A` (array job ID) and `%a` (task index) in output filenames (see the sketch after this list).
- NCCL errors in multi-GPU jobs: usually caused by incorrect `--ntasks-per-node` settings. See §13.
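As an illustration of the job-array naming point, a minimal array script might name its outputs like this. The array range, job name, and `logs/` directory are placeholders, and the directory is assumed to exist:

```bash
#!/bin/bash
#SBATCH --job-name=array-example
#SBATCH --array=0-3                 # placeholder range: 4 tasks
#SBATCH --output=logs/%A_%a.out     # %A = array job ID, %a = task index
#SBATCH --error=logs/%A_%a.err

echo "Running task $SLURM_ARRAY_TASK_ID of array job $SLURM_ARRAY_JOB_ID"
```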
Training
Single-Node Training
For initial experimentation and debugging, run training manually inside a single SLURM task.
```bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:2
```
Setting `--ntasks-per-node=2` without a proper distributed setup (srun / DDP) leads to duplicate training processes, GPU contention, and NCCL crashes. Keep it at 1 and let your script handle process spawning.

Two mechanisms can spawn training processes:
1. SLURM, via `--ntasks-per-node`
2. Your training script, via manual spawning or framework internals

→ Use only ONE of these at a time.
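A complete single-node script following this rule might look like the sketch below. The environment-loading step and the config name are placeholders; the trainer mirrors the launch command used in the multi-node example further down.

```bash
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=1      # one SLURM task; let the training script spawn any workers
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=8
#SBATCH --mem=160000

# Load your environment here (e.g. the Spack/Conda steps from the setup section below)

onmt_train -config <your_config>.yaml   # placeholder config file
```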
Multi-Node Training
```bash
#SBATCH -N 2                    # 2 nodes
#SBATCH --ntasks-per-node=2     # 2 processes per node (1 per GPU)
#SBATCH --gres=gpu:2            # 2 GPUs per node
#SBATCH --cpus-per-task=8
#SBATCH --mem=0

# ---- Distributed Setup ----
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=29500
export WORLD_SIZE=$(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))

# ---- Launch ----
srun onmt_train -config yamls/a40.train.multilingual.en-xx-en.yaml
```
Do not set `CUDA_VISIBLE_DEVICES` in multi-node jobs: SLURM assigns GPUs per process automatically. Do NOT mix `srun` and `torchrun` (e.g. `srun torchrun ...`): this causes duplicate processes and NCCL hangs.

Mental Model
```
Multi-node:   Node 0  <──network──>  Node 1
              GPU GPU                GPU GPU

Bottleneck shifts from compute → network (NCCL)
```
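When NCCL problems do appear at this stage, turning on NCCL's own logging before the launch is often the fastest way to see where communication stalls. `NCCL_DEBUG` is a standard NCCL environment variable; the launch line is the one from the example above:

```bash
export NCCL_DEBUG=INFO        # print NCCL setup and communication details to the job log
srun onmt_train -config yamls/a40.train.multilingual.en-xx-en.yaml
```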
srun vs torchrun
| Feature | torchrun | srun |
|---|---|---|
| Cluster-aware | ✗ | ✓ |
| Multi-node via scheduler | ✗ manual | ✓ |
| Resource allocation | ✗ | ✓ |
| Recommended on SLURM | ⚠ | ✓ |
DDP Environment Variables Set by Launchers
| Variable | Meaning / Example |
|---|---|
| RANK | 0, 1, 2, 3 |
| LOCAL_RANK | GPU id on node |
| WORLD_SIZE | Total number of processes |
| MASTER_ADDR | Node 0 hostname |
| MASTER_PORT | Communication port |
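If a training framework expects `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` explicitly rather than reading SLURM's own variables, a common pattern (a sketch, not cluster policy) is a small per-task wrapper that srun runs once per process, so the per-task `SLURM_*` values are picked up correctly:

```bash
# launch_task.sh — hypothetical wrapper; launch it as: srun bash launch_task.sh
export RANK=$SLURM_PROCID          # global rank of this process
export LOCAL_RANK=$SLURM_LOCALID   # GPU index on this node
export WORLD_SIZE=$SLURM_NTASKS    # total number of processes across all nodes
exec python train.py               # placeholder training entry point
```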
Final Checklist (Before Multi-Node Jobs)
- Nodes can communicate (test with `ping`)
- Same software environment on all nodes (modules, conda envs)
- Shared filesystem accessible from all nodes (data & checkpoints)
- `CUDA_VISIBLE_DEVICES` is not manually set
- Training launched using `srun`
- `MASTER_ADDR` and `MASTER_PORT` are exported correctly
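Several of these items can be verified in one pass from inside the allocation. The line below is a quick sketch, assuming `MASTER_ADDR` and `MASTER_PORT` have already been exported in the job script:

```bash
# One line of output per task: hostname, visible GPU count, and the exported master address/port
srun bash -c 'echo "$(hostname): $(nvidia-smi -L | wc -l) GPU(s), MASTER_ADDR=$MASTER_ADDR MASTER_PORT=$MASTER_PORT"'
```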
Prajna Server Configuration
Partition & QoS Must Match
On Prajna, the `--partition` and `--qos` values must be identical. Mismatching them causes a QoS error.
```bash
#SBATCH --partition=a40
#SBATCH --qos=a40
```
GPU Must Be Explicitly Requested
Even after specifying a GPU partition, you must also set `--gres=gpu:<n>`. Without it, no GPUs will be visible.
```bash
#SBATCH --partition=dgx
#SBATCH --qos=dgx
#SBATCH --gres=gpu:4
```
- QoS mismatch (e.g. `--partition=a40` with `--qos=dgx`) → job rejected immediately.
- Missing `--gres` → job runs but sees 0 GPUs; CUDA fails with no device found.
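To catch the second failure mode early, it can help to assert GPU visibility at the top of the job script. This is a suggestion, not site policy:

```bash
# Fail fast if no GPUs are visible (e.g. --gres was forgotten)
if ! nvidia-smi -L > /dev/null 2>&1 || [ "$(nvidia-smi -L | wc -l)" -eq 0 ]; then
    echo "ERROR: no GPUs visible to this job" >&2
    exit 1
fi
```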
Environment Setup: Spack & Conda
Session Setup Checklist
Run these in order at the start of every new terminal session on Prajna:
```bash
# 1. Load Spack (use the main path, NOT the outdated /scratch path)
source /lustre-flash/apps/spack/share/spack/setup-env.sh

# 2. Load Miniconda via Spack
spack load miniconda3

# 3. Activate your conda environment
conda activate <your_env>
```
Do not source the outdated path `/scratch/apps/spack/share/spack/setup-env.sh`.

Fixing Spack Errors
If you encounter broken package repo or stale cache issues:
```bash
rm -rf ~/.spack/package_repos    # Remove broken local repo cache
spack clean -a                   # Clean all cached/built data
source /lustre-flash/apps/spack/share/spack/setup-env.sh
spack load miniconda3
```