Slurm Partition and Resource Quota

The system supports two types of user accounts: project-based accounts and individual student accounts (for students with an approved UROP).

  • Project-based accounts
    • Allow access to more computational resources, with the allocation granted during project approval
    • Computational resources are shared among all members of the project group
    • Usage accounting for computational resources is implemented; details will be announced later
    • Provide shared storage space for the group
       
  • Individual student accounts
    • Computational resources are allocated to each student individually
    • No usage accounting

 

Resource Request

  • Resource requests are counted in GPU Resource Units (GRUs). The maximum number of CPU cores and amount of system memory that come with one GRU differ between Slurm partitions.
     
  • For the project & large-project partitions, 1 GRU corresponds to
    • One H800 GPU with 80GB GPU memory
    • 14 CPU cores with 28 Threads
    • 224GB system memory

 

  • For the student partition, each H800 GPU is partitioned into GPU instances of different sizes using NVIDIA MIG technology, with 1 GRU corresponding to one 3g.40gb, 4g.40gb, or 7g.80gb MIG device
    • For 3g.40gb, 1 GRU is
      • 3/7 of the computational power of one H800 GPU, with 40GB GPU memory
    • For 4g.40gb, 1 GRU is
      • 4/7 of the computational power of one H800 GPU, with 40GB GPU memory
    • For 7g.80gb, 1 GRU is
      • equivalent to a whole H800 GPU in computational power and memory
    • 8 CPU cores with 16 threads
    • 160GB system memory
       
  • For the debug partition, 1 GRU corresponds to
    • One H800 GPU with 80GB GPU memory
    • 14 CPU cores with 28 threads
    • 224GB system memory
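As a worked example of the GRU arithmetic above, the following minimal Python sketch tabulates what one GRU bundles in each partition and multiplies it out for a given GRU count. It assumes that resources scale linearly with the number of GRUs and that the 8-core / 160GB figures apply to every MIG profile in the student partition; both are readings of the list above rather than confirmed scheduler behaviour. The labels such as "student/3g.40gb" and the names GruSpec and resources_for are hypothetical and are not Slurm option values.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class GruSpec:
    """Maximum resources bundled with one GRU (figures from the list above)."""
    gpu_fraction: Fraction   # share of one H800's computational power
    gpu_memory_gb: int
    cpu_cores: int
    cpu_threads: int
    system_memory_gb: int


# Keys are illustrative labels only, not Slurm partition or gres names.
GRU_SPECS = {
    "project": GruSpec(Fraction(1), 80, 14, 28, 224),
    "large-project": GruSpec(Fraction(1), 80, 14, 28, 224),
    "debug": GruSpec(Fraction(1), 80, 14, 28, 224),
    "student/3g.40gb": GruSpec(Fraction(3, 7), 40, 8, 16, 160),
    "student/4g.40gb": GruSpec(Fraction(4, 7), 40, 8, 16, 160),
    "student/7g.80gb": GruSpec(Fraction(1), 80, 8, 16, 160),
}


def resources_for(label: str, gru_count: int) -> dict:
    """Upper bounds for `gru_count` GRUs, assuming linear scaling."""
    spec = GRU_SPECS[label]
    return {
        "gpu_equivalents": spec.gpu_fraction * gru_count,
        "gpu_memory_gb": spec.gpu_memory_gb * gru_count,
        "cpu_cores": spec.cpu_cores * gru_count,
        "cpu_threads": spec.cpu_threads * gru_count,
        "system_memory_gb": spec.system_memory_gb * gru_count,
    }


if __name__ == "__main__":
    # 8 GRU in the project partition: 8 GPUs, 112 cores, 1792GB memory
    print(resources_for("project", 8))
    # 1 GRU on a 3g.40gb MIG device: 3/7 of a GPU, 8 cores, 160GB memory
    print(resources_for("student/3g.40gb", 1))
```

Under these assumptions, a request for 8 GRU in the project partition comes to at most 112 CPU cores and 1792GB of system memory.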

 

Partition Table

Slurm Partition | project & large-project | student | debug | cpu
--------------- | ----------------------- | ------- | ----- | ---
No. of DGX nodes | 52 | 2 (GPUs partitioned with MIG) | 1 | 2 CPU nodes
Who can access | Project-based users only | Non-project-based student users only | All | Project-based users only
Purpose | Computation | Computation | Compiling, building containers, interactive debugging, code profiling | Data pre-processing for GPU computation
Max wall time | 3 days | 1 day | 2 hours | 12 hours
Max resources requested | Varies by project; default is 8 GRU | 1 GRU | 1 GRU | 8 CPU cores (per job)
Concurrent running jobs quota per user | 8 | 1 | 1 | 28
Queuing and running jobs limit per user | 10 | 2 | 1 | 28
Usage accounting | Yes | No | No | No
Job preemption | In the large-project partition, jobs from approved projects can preempt other jobs, which are allowed to run for at least 2 hours before being preempted | No | No | No
Remarks | Resource quotas are per project unless specified | Resource quotas are per user rather than per project | Resource quotas are per user rather than per project | Resource quotas are per project unless specified. No access to the /scratch directory
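
The per-partition limits listed above can be encoded and checked before a job is submitted. The sketch below is a simplified illustration in Python and is not part of the cluster's tooling: the names PartitionLimits and check_request are hypothetical, the 8 GRU figure for the project partitions is only the default (the actual value varies by project), and the check treats every limit as if it applied to a single user's single job, whereas project quotas are per project.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class PartitionLimits:
    max_wall_time: timedelta
    max_units: int               # GRUs, or CPU cores for the cpu partition
    max_running_jobs: int        # concurrent running jobs per user
    max_queued_and_running: int  # queuing plus running jobs per user


# Default 8 GRU for project partitions; the real quota varies by project.
_PROJECT = PartitionLimits(timedelta(days=3), 8, 8, 10)

LIMITS = {
    "project": _PROJECT,
    "large-project": _PROJECT,
    "student": PartitionLimits(timedelta(days=1), 1, 1, 2),
    "debug": PartitionLimits(timedelta(hours=2), 1, 1, 1),
    "cpu": PartitionLimits(timedelta(hours=12), 8, 28, 28),
}


def check_request(partition, wall_time, units, jobs_in_queue=0):
    """Return a list of limit violations; an empty list means the request fits.

    `jobs_in_queue` is the number of the user's jobs already queuing or
    running in the partition before this submission.
    """
    limits = LIMITS[partition]
    problems = []
    if wall_time > limits.max_wall_time:
        problems.append(
            f"wall time {wall_time} exceeds the limit of {limits.max_wall_time}"
        )
    if units > limits.max_units:
        problems.append(
            f"{units} units requested, but the limit is {limits.max_units}"
        )
    if jobs_in_queue + 1 > limits.max_queued_and_running:
        problems.append(
            f"would exceed the limit of {limits.max_queued_and_running} "
            "queuing and running jobs"
        )
    return problems


if __name__ == "__main__":
    # A 3-hour job in the debug partition exceeds the 2-hour wall time.
    for problem in check_request("debug", timedelta(hours=3), 1):
        print(problem)
```

A request that returns an empty list fits the limits as modelled here; the scheduler itself remains the authority on whether a job is accepted.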