Slurm Partition and Resource Quota

Partitions are work queues, each with its own set of rules and policies and a set of compute nodes on which jobs run. The available partitions are normal, large, and cpu. You can run sinfo to list the available partitions on the cluster.
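For example, you can check the partitions and their current state from a login node as sketched below; the exact output depends on the cluster configuration at the time:

    # List partitions, their state, and node counts
    sinfo

    # Summarized view, one line per partition
    sinfo -s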

Resource Request Policy

  • Computational resources in the HKUST SuperPOD are requested in units of H800 (80GB) GPUs. Each GPU is associated with the following default CPU cores and system memory in Slurm:
    • 14 CPU cores (28 threads)
    • 224GB system memory
  • In general, we recommend that users specify only --gpus-per-node (number of GPUs per node) and --nodes (number of nodes) in a job request, and let Slurm allocate the cores and memory across the nodes for optimal resource utilization; see the example batch script after this list.
  • The normal partition supports job requests for mainstream GPU computation, from a single H800 GPU up to a maximum of 16 GPUs.
  • The large partition supports large multi-node job requests; the request unit must be a multiple of 8 H800 GPUs, i.e. a full node. The minimum request is 2 nodes (16 GPUs) and the maximum is 12 nodes (96 GPUs).
  • The number of nodes assigned to the large and normal partitions may vary depending on workload conditions.
  • Job requests for a very large number of nodes, i.e. more than 12 nodes, must be arranged by reservation only.
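A minimal sketch of a batch script for the normal partition, assuming a single-node job with 4 GPUs. The job name, time limit, and train.py are placeholders, and your site may require additional options (for example an account or QOS):

    #!/bin/bash
    #SBATCH --job-name=h800-job        # placeholder job name
    #SBATCH --partition=normal         # mainstream GPU partition
    #SBATCH --nodes=1                  # number of nodes
    #SBATCH --gpus-per-node=4          # GPUs per node; default CPU cores and memory follow each GPU
    #SBATCH --time=1-00:00:00          # wall time (normal partition allows up to 3 days)

    # Placeholder workload; replace with your own program
    srun python train.py

    # For the large partition, request full nodes instead, e.g.:
    #   --partition=large --nodes=2 --gpus-per-node=8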


Partition Table

| Slurm Partition | large | normal | cpu |
| --- | --- | --- | --- |
| No. of nodes | 32 DGX nodes | 22 DGX nodes | 2 Intel nodes |
| Purpose | Large-scale, multi-node GPU computation | Mainstream GPU computation | Data pre-processing for GPU computation |
| Max wall time | 3 days | 3 days | 12 hours |
| Min resource requested per job | 16 GPUs (equivalent to 2 nodes) | 1 GPU | 1 CPU core |
| Max resource requested per account | 96 GPUs (equivalent to 12 nodes) | 16 GPUs | 8 CPU cores (per job) |
| Concurrent running jobs quota per user | 4 | 8 | 28 |
| Queued and running jobs limit per user | 5 | 10 | 28 |
| Chargeable | Yes | Yes | No |
| Interactive job | One session allowed, up to 8 hours wall time | One session allowed, up to 8 hours wall time | Not allowed |
| Remarks | GPU resources must be requested in multiples of 8 (full nodes) | GPU resources can be requested in any quantity up to the maximum | No access to the /scratch directory for the time being |
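For reference, an interactive session on a GPU partition (limited to one session of at most 8 hours, per the table above) can be started along these lines; this is a sketch only, and the exact options accepted may vary with the site configuration:

    # Request an interactive shell with 1 GPU on the normal partition for up to 8 hours
    srun --partition=normal --nodes=1 --gpus-per-node=1 --time=08:00:00 --pty bash -i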