Slurm Partition and Resource Quota

Partitions are work queues, each with its own set of rules and policies and a set of compute nodes on which jobs run. The available partitions are normal, large, and cpu. You can run sinfo to list the available partitions on the cluster.
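For example, you can check the partitions and their current state from a login node as sketched below; the exact output depends on the cluster configuration at the time:

    # List partitions, their state, and node counts
    sinfo

    # Summarized view, one line per partition
    sinfo -s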

Resource Request Policy

  • Computational resources in the HKUST SuperPOD are requested in units of H800 (80GB) GPUs. Each GPU is associated with the following default CPU cores and system memory in Slurm:
    • 14 CPU cores (28 threads)
    • 224GB system memory
  • In general, we recommend that users specify only --gpus-per-node (number of GPUs per node) and --nodes (number of nodes) in a job request, and let Slurm allocate the cores and memory across the nodes for optimal resource utilization; see the example batch script after this list.
  • The normal partition supports job requests for mainstream GPU computation, from a single H800 GPU up to a maximum of 16 GPUs.
  • The large partition supports large multi-node job requests; the request unit must be a multiple of 8 H800 GPUs, i.e. a full node. The minimum request is 2 nodes (16 GPUs) and the maximum is 12 nodes (96 GPUs).
  • The number of nodes assigned to the large and normal partitions may vary depending on workload conditions.
  • Job requests for a very large number of nodes, i.e. more than 12 nodes, must be arranged by reservation only.
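A minimal sketch of a batch script for the normal partition, assuming a single-node job with 4 GPUs. The job name, time limit, and train.py are placeholders, and your site may require additional options (for example an account or QOS):

    #!/bin/bash
    #SBATCH --job-name=h800-job        # placeholder job name
    #SBATCH --partition=normal         # mainstream GPU partition
    #SBATCH --nodes=1                  # number of nodes
    #SBATCH --gpus-per-node=4          # GPUs per node; default CPU cores and memory follow each GPU
    #SBATCH --time=1-00:00:00          # wall time (normal partition allows up to 3 days)

    # Placeholder workload; replace with your own program
    srun python train.py

    # For the large partition, request full nodes instead, e.g.:
    #   --partition=large --nodes=2 --gpus-per-node=8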


Partition Table

| Slurm Partition | large | normal | cpu |
| --- | --- | --- | --- |
| No. of nodes | 32 DGX nodes | 22 DGX nodes | 2 Intel nodes |
| Purpose | Large-scale, multi-node GPU computation | Mainstream GPU computation | Data pre-processing for GPU computation |
| Max wall time | 3 days | 3 days | 12 hours |
| Min resource requested per job | 16 GPUs (equivalent to 2 nodes) | 1 GPU | 1 CPU core |
| Max resource requested per account | 96 GPUs (equivalent to 12 nodes) | 16 GPUs | 8 CPU cores (per job) |
| Concurrent running jobs quota per user | 4 | 8 | 28 |
| Queued and running jobs limit per user | 5 | 10 | 28 |
| Chargeable | Yes | Yes | No |
| Interactive job | One session allowed, up to 8 hours wall time | One session allowed, up to 8 hours wall time | Not allowed |
| Remarks | GPU resources must be requested in multiples of 8 (full nodes) | GPU resources can be requested in any quantity up to the maximum | No access to the /scratch directory for the time being |
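For reference, an interactive session on a GPU partition (limited to one session of at most 8 hours, per the table above) can be started along these lines; this is a sketch only, and the exact options accepted may vary with the site configuration:

    # Request an interactive shell with 1 GPU on the normal partition for up to 8 hours
    srun --partition=normal --nodes=1 --gpus-per-node=1 --time=08:00:00 --pty bash -i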