Partitions are work queues, each with its own set of scheduling policies and compute nodes on which jobs run. The available partitions are `normal`, `large`, and `cpu`. You can run `sinfo` to list the available partitions on discovery.
Resource Request Policy
- Computational resources on the HKUST SuperPOD are requested in units of H800 (80GB) GPUs. Each GPU is associated with the following default CPU cores and system memory in Slurm:
  - 14 CPU cores (28 threads)
  - 224GB system memory
- In general, we recommend that users specify only `--gpus-per-node` (number of GPUs per node) and `--nodes` (number of nodes) in a job request, and let Slurm allocate the cores and memory among the nodes for optimized resource utilization.
- The `normal` partition supports mainstream GPU computation, from a single H800 GPU up to a maximum of 16 GPUs.
- The `large` partition supports large multi-node jobs. Requests must be in multiples of 8 H800 GPUs (i.e. a full node), from a minimum of 2 nodes (16 GPUs) up to a maximum of 12 nodes (96 GPUs).
- The number of nodes assigned to the `large` and `normal` partitions may vary depending on workload conditions.
- Jobs requesting a very large number of nodes (i.e. more than 12) must be arranged by reservation only.
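As an illustration of the recommendation above, a minimal batch script for the `normal` partition might look like the following (the job name, time limit, and workload command are placeholders):

```shell
#!/bin/bash
#SBATCH --job-name=myjob        # placeholder job name
#SBATCH --partition=normal      # mainstream GPU partition
#SBATCH --nodes=1               # number of nodes
#SBATCH --gpus-per-node=2       # GPUs per node; Slurm allocates the
                                #   matching default cores and memory
#SBATCH --time=24:00:00         # within the 3-day wall-time limit

srun python train.py            # placeholder workload
```

Note that no `--cpus-per-task` or `--mem` options are given, so Slurm applies the per-GPU defaults described above.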
Partition Table
Slurm Partition | large | normal | cpu |
---|---|---|---|
No. of nodes | 32 DGX nodes | 22 DGX nodes | 2 Intel nodes |
Purpose | Large-scale multi-node GPU computation | Mainstream GPU computation | Data pre-processing for GPU computation |
Max wall time | 3 days | 3 days | 12 hours |
Min resource requested per job | 16 GPUs (equivalent to 2 nodes) | 1 GPU | 1 CPU core |
Max resource requested per account | 96 GPUs (equivalent to 12 nodes) | 16 GPUs | 8 CPU cores (per job) |
Concurrent running jobs quota per user | 4 | 8 | 28 |
Queuing and running jobs limit per user | 5 | 10 | 28 |
Chargeable | Yes | Yes | No |
Interactive job | One session allowed, with a maximum 8-hour wall time | One session allowed, with a maximum 8-hour wall time | Not allowed |
Remarks | GPU resources must be requested in multiples of 8 (a full node) | GPU resources can be requested in any quantity up to the maximum | No access to the /scratch directory for the time being |
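The per-GPU defaults from the resource request policy mean that even the minimum `large` partition request (2 full nodes, 16 GPUs) implicitly reserves a substantial amount of CPU and memory. A quick sketch of the arithmetic:

```shell
# Per-GPU defaults from the resource request policy above
cores_per_gpu=14
mem_gb_per_gpu=224
gpus_per_node=8

# Minimum large-partition request: 2 full nodes
nodes=2
gpus=$((nodes * gpus_per_node))

echo "GPUs:   ${gpus}"                          # 16
echo "Cores:  $((gpus * cores_per_gpu))"        # 224
echo "Memory: $((gpus * mem_gb_per_gpu)) GB"    # 3584 GB
```

This is why explicit `--cpus-per-task` or `--mem` requests are usually unnecessary: the defaults scale with the GPU count.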