Cluster Resource Limits

Accounting

The cluster account is organized around the principal investigator (PI) or the group leader of a research team. Each member of the team has an individual user account under the PI's group, which is used to access the cluster and run jobs on the partitions (queues) with SLURM. With this accounting scheme, the system can impose resource limits (usage quotas) on different partitions for different groups of users.
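To see which Unix group your user account belongs to, and which SLURM account and partitions are associated with it, the following commands can be used (a minimal sketch; the exact fields shown depend on how SLURM accounting is configured on the site):

id -gn
sacctmgr show associations user=$USER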

Resource Limits

Each compute node has processors, memory, swap and local disk as resources. Our cluster resource allocation is based on CPU cores only; in particular, no core can run more than one job in a partition at a time. If a job needs a number of nodes exclusively, the user can specify the exclusive option in the SLURM script. The resource limits on partitions are imposed on the PI group as a whole, which means that individual users in the same group share the quota limit.
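For example, a minimal batch script requesting two whole nodes with the exclusive option might look as follows (the job name, partition, wall time and program are placeholders to be adjusted to your own job):

#!/bin/bash
#SBATCH -J exclusive_job         # job name (placeholder)
#SBATCH -p standard              # partition (queue) to submit to
#SBATCH -N 2                     # number of nodes
#SBATCH --exclusive              # allocate the nodes exclusively to this job
#SBATCH -t 1-00:00:00            # wall time of 1 day (within the 3-day limit)

srun ./my_program                # my_program is a placeholder for your application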

Partitions

The ownership of the HPC2 compute nodes is diversified. The hardware configuration of the partitions and their resource limits are summarized in the tables below.

Partition | No. of Nodes | CPU | Memory | Coprocessor
standard | 20 | 2 x Intel Xeon E5-2670 v3 (12-core) | 64G DDR4-2133 | --
himem | 15 | 2 x Intel Xeon E5-2670 v3 (12-core) | 128G DDR4-2133 | --
gpu | 5 | 2 x Intel Xeon E5-2670 v3 (12-core) | 128G DDR4-2133 | 2 x Nvidia Tesla K80
ssci | 15 | 2 x Intel Xeon E5-2670 v3 (12-core) | 128G DDR4-2133 | --
cbme | 1 | 2 x Intel Xeon E5-2670 v3 (12-core) | 128G DDR4-2133 | 2 x Nvidia Tesla K80
ce | 2 | 2 x Intel Xeon E5-2670 v3 (12-core) | 128G DDR4-2133 | --
ch | 10 | 2 x Intel Xeon E5-2670 v3 (12-core) | 64G DDR4-2133 | --
ch1 | 1 | 2 x Intel Xeon E5-2670 v3 (12-core) | 128G DDR4-2133 | --
cse | 1 | 2 x Intel Xeon E5-2670 v3 (12-core) | 128G DDR4-2133 | 2 x Nvidia Tesla K80
ece | 1 | 2 x Intel Xeon E5-2683 v4 (16-core) | 128G DDR4-2400 | --
ias | 6 | 2 x Intel Xeon E5-2670 v3 (12-core) | 256G DDR4-2133 | --
lifs | 8 | 2 x Intel Xeon E5-2650 v4 (12-core) | 128G DDR4-2400 | --
ph | 3 | 2 x Intel Xeon E5-2670 v3 (12-core) | 128G DDR4-2133 | --
sbm | 4 | 2 x Intel Xeon E5-2683 v4 (16-core) | 128G DDR4-2400 | --

Partition | No. of Nodes | Access (SSCI/SENG) | GrpJobs (Max) | GrpNodes (Max) | GrpSubmitJobs (Max) | MaxWallTime
standard | 20 | Both | 4 | 5 | 4 | 3 days
himem | 15 | Both | 3 | 5 | 3 | 3 days
gpu | 5 | GPU user | 2 | 2 | 2 | 3 days
ssci | 15 | SSCI only | 3 | 5 | 3 | 3 days

Partition | No. of Nodes | MaxCPUs/User | MaxJobs/User | MaxSubmit/User | MaxWallTime
cbme | 1 | 24 | 10 | 10 | 60 days
ce | 2 | 48 | 4 | 4 | 20 days
ch | 10 | 96 | 3 | 4 | 7 days
ch1 | 1 | 24 | 1 | 3 | 7 days
cse | 1 | -- | -- | -- | --
ece | 1 | 32 | 10 | 10 | 30 days
ias | 6 | 144 | 10 | 50 | 15 days
lifs | 8 | 192 | 8 | 10 | 5 days
ph | 3 | 72 | 72 | 108 | 30 days
sbm | 4 | 128 | 4 | 8 | 15 days

For the quota terminology, please refer here.
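The limits currently configured for a partition can also be queried from SLURM directly. A minimal sketch, assuming accounting queries are enabled for ordinary users (the partition name standard is only an example; your group's limits are attached to its SLURM account/association):

scontrol show partition standard
sacctmgr show associations account=<your_group>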

Job Scheduling

Currently, SLURM jobs are scheduled with basic priority, i.e. first-in-first-out (FIFO), depending on the order of arrival.
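To see the jobs currently waiting in a partition (the standard partition is used here only as an example), the pending queue can be listed with squeue:

squeue -p standard --state=PENDING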

Community Cluster

In order to maximize the usage of computational resources, ITSO has configured a community cluster strategy such that idle resources on the HPC2 cluster can be used by anybody. The community cluster can be accessed via the partition “general”. Jobs submitted to this partition are scheduled ONLY when there are idle resources, and the maximum wall time is 12 hours. Usage of the community cluster is open to all users. The usage quota is summarized as follows.

Partition | GrpJobs (Max) | GrpNodes (Max) | GrpSubmitJobs (Max) | MaxWallTime
general | 2 | 6 | 2 | 12 hours
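
A minimal batch script for the community cluster might look as follows (the job name and program are placeholders; the requested wall time must stay within the 12-hour limit):

#!/bin/bash
#SBATCH -J community_job         # job name (placeholder)
#SBATCH -p general               # community cluster partition
#SBATCH -N 1                     # number of nodes
#SBATCH -t 12:00:00              # wall time; must not exceed 12 hours

srun ./my_program                # my_program is a placeholder for your application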

Disk Quota

The disk quota for each SSCI and SENG PI group in HPC2 is 2 TB, and for other PI groups it is 500 GB. The quota is shared among all members of the group. Users exceeding the quota have a 24-hour grace period to clean up the extra data. The total disk space available in the cluster is 340 TB.


To check the disk usage and quota of your group:

lfs quota -h -g <your_group> /home

 

Group Share Directory

A share directory is assigned to each PI group. Users from the same group can access, create and modify files in the share directory.

To access the share directory:

cd /home/share/<your_group>

or

cd $PI_HOME

Note that the group disk quota also applies to the share directory.

 

Backup

There is NO backup service on the cluster, and users are required to manage backups of their data themselves.
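
One common approach is to copy important data to a machine outside the cluster with rsync. A minimal sketch (the source directory, user name, remote host and destination path are all placeholders):

rsync -avz ~/important_data/ <username>@<backup_host>:/path/to/backup/important_data/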

Scratch Files

There is about 900 GB of local scratch space in /tmp on each compute node. Users are advised to make use of it and to clean up their files as soon as the application is finished. Files in /tmp on all nodes will be removed automatically by the system if they have not been accessed for more than 10 days.
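
A minimal batch script sketch that stages data into a job-specific scratch directory under /tmp and removes it when the job ends (the partition, input/output file names and application are placeholders):

#!/bin/bash
#SBATCH -J scratch_job           # job name (placeholder)
#SBATCH -p standard              # partition (queue)
#SBATCH -N 1                     # number of nodes
#SBATCH -t 0-06:00:00            # wall time of 6 hours

# Job-specific scratch directory on the node's local /tmp
SCRATCH=/tmp/${USER}_${SLURM_JOB_ID}
mkdir -p "$SCRATCH"

# Clean up the scratch directory when the job exits, even on failure
trap 'rm -rf "$SCRATCH"' EXIT

# Stage input data to local scratch, run, and copy results back
cp "$SLURM_SUBMIT_DIR"/input.dat "$SCRATCH"/
cd "$SCRATCH"
"$SLURM_SUBMIT_DIR"/my_application input.dat > output.dat
cp output.dat "$SLURM_SUBMIT_DIR"/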