Cluster Resource Limits

Accounting

The cluster account is organized around the principal investigator (PI) or the group leader of a research team. Each member of the team has an individual user account under the PI's group, which is used to access the cluster and run jobs on the partitions (queues) with SLURM. With this accounting scheme, the system can impose resource limits (usage quotas) on different partitions for different groups of users.
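To see which Unix group your user account belongs to, and which SLURM account and partitions are associated with it, the following commands can be used (a minimal sketch; the exact fields shown depend on how SLURM accounting is configured on the site):

id -gn
sacctmgr show associations user=$USER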

Resource Limits

Each compute node has processors, memory, swap and local disk as resources. Our cluster resource allocation is based on CPU cores only; in particular, no core can run more than one job in a partition at a time. If a job needs a number of nodes exclusively, the user can specify the exclusive option in the SLURM script. The resource limits on partitions are imposed on the PI group as a whole, which means that individual users in the same group share the quota limit.
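For example, a minimal batch script requesting two whole nodes with the exclusive option might look as follows (the job name, partition, wall time and program are placeholders to be adjusted to your own job):

#!/bin/bash
#SBATCH -J exclusive_job         # job name (placeholder)
#SBATCH -p standard              # partition (queue) to submit to
#SBATCH -N 2                     # number of nodes
#SBATCH --exclusive              # allocate the nodes exclusively to this job
#SBATCH -t 1-00:00:00            # wall time of 1 day (within the 3-day limit)

srun ./my_program                # my_program is a placeholder for your application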

Partitions

The ownership of the HPC2 compute nodes is diversified. The hardware configuration of the partitions and their resource limits are summarized in the tables below.

Partition | No. of Nodes | CPU | Memory | Coprocessor
standard | 20 | 2 x Intel Xeon E5-2670 v3 (12-core) | 64G DDR4-2133 | --
himem | 15 | 2 x Intel Xeon E5-2670 v3 (12-core) | 128G DDR4-2133 | --
gpu | 5 | 2 x Intel Xeon E5-2670 v3 (12-core) | 128G DDR4-2133 | 2 x Nvidia Tesla K80
ssci | 15 | 2 x Intel Xeon E5-2670 v3 (12-core) | 128G DDR4-2133 | --
cbme | 1 | 2 x Intel Xeon E5-2670 v3 (12-core) | 128G DDR4-2133 | 2 x Nvidia Tesla K80
ce | 2 | 2 x Intel Xeon E5-2670 v3 (12-core) | 128G DDR4-2133 | --
ch | 10 | 2 x Intel Xeon E5-2670 v3 (12-core) | 64G DDR4-2133 | --
ch1 | 1 | 2 x Intel Xeon E5-2670 v3 (12-core) | 128G DDR4-2133 | --
cse | 1 | 2 x Intel Xeon E5-2670 v3 (12-core) | 128G DDR4-2133 | 2 x Nvidia Tesla K80
ece | 1 | 2 x Intel Xeon E5-2683 v4 (16-core) | 128G DDR4-2400 | --
ias | 6 | 2 x Intel Xeon E5-2670 v3 (12-core) | 256G DDR4-2133 | --
lifs | 8 | 2 x Intel Xeon E5-2650 v4 (12-core) | 128G DDR4-2400 | --
ph | 3 | 2 x Intel Xeon E5-2670 v3 (12-core) | 128G DDR4-2133 | --
sbm | 4 | 2 x Intel Xeon E5-2683 v4 (16-core) | 128G DDR4-2400 | --

Partition | No. of Nodes | Access (SSCI/SENG) | GrpJobs (Max) | GrpNodes (Max) | GrpSubmitJobs (Max) | MaxWallTime
standard | 20 | Both | 4 | 5 | 4 | 3 days
himem | 15 | Both | 3 | 5 | 3 | 3 days
gpu | 5 | GPU user | 2 | 2 | 2 | 3 days
ssci | 15 | SSCI only | 3 | 5 | 3 | 3 days

Partition | No. of Nodes | MaxCPUs/User | MaxJobs/User | MaxSubmit/User | MaxWallTime
cbme | 1 | 24 | 10 | 10 | 60 days
ce | 2 | 48 | 4 | 4 | 20 days
ch | 10 | 96 | 3 | 4 | 7 days
ch1 | 1 | 24 | 1 | 3 | 7 days
cse | 1 | -- | -- | -- | --
ece | 1 | 32 | 10 | 10 | 30 days
ias | 6 | 144 | 10 | 50 | 15 days
lifs | 8 | 192 | 8 | 10 | 5 days
ph | 3 | 72 | 72 | 108 | 30 days
sbm | 4 | 128 | 4 | 8 | 15 days

For the quota terminology, please refer here.
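The limits currently configured for a partition can also be queried from SLURM directly. A minimal sketch, assuming accounting queries are enabled for ordinary users (the partition name standard is only an example; your group's limits are attached to its SLURM account/association):

scontrol show partition standard
sacctmgr show associations account=<your_group>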

Job Scheduling

Currently, SLURM jobs are scheduled with basic priority, i.e. first-in-first-out (FIFO), depending on the order of arrival.
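To see the jobs currently waiting in a partition (the standard partition is used here only as an example), the pending queue can be listed with squeue:

squeue -p standard --state=PENDING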

Community Cluster

In order to maximize the usage of computational resources, ITSO has configured a community cluster strategy such that idle resources on the HPC2 cluster can be used by anybody. The community cluster can be accessed via the partition “general”. Jobs submitted to this partition are scheduled ONLY when there are idle resources, and the maximum wall time is 12 hours. Usage of the community cluster is open to all users. The usage quota is summarized as follows.

Partition | GrpJobs (Max) | GrpNodes (Max) | GrpSubmitJobs (Max) | MaxWallTime
general | 2 | 6 | 2 | 12 hours
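
A minimal batch script for the community cluster might look as follows (the job name and program are placeholders; the requested wall time must stay within the 12-hour limit):

#!/bin/bash
#SBATCH -J community_job         # job name (placeholder)
#SBATCH -p general               # community cluster partition
#SBATCH -N 1                     # number of nodes
#SBATCH -t 12:00:00              # wall time; must not exceed 12 hours

srun ./my_program                # my_program is a placeholder for your application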

Disk Quota

The disk quota for each SSCI and SENG PI group in HPC2 is 2 TB, and for other PI groups it is 500 GB. The quota is shared among all members of the group. Users exceeding the quota have a 24-hour grace period to clean up the extra data. The total disk space available in the cluster is 340 TB.


To check the disk usage and quota of your group:

lfs quota -h -g <your_group> /home

 

Group Share Directory

A share directory is assigned to each PI group. Users from the same group can access, create and modify files in the share directory.

To access the share directory:

cd /home/share/<your_group>

or

cd $PI_HOME

Note that the group disk quota also applies to the share directory.

 

Backup

There is NO backup service on the cluster, and users are required to manage backups of their data themselves.
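
One common approach is to copy important data to a machine outside the cluster with rsync. A minimal sketch (the source directory, user name, remote host and destination path are all placeholders):

rsync -avz ~/important_data/ <username>@<backup_host>:/path/to/backup/important_data/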

Scratch Files

There is about 900 GB of local scratch space in /tmp on each compute node. Users are advised to make use of it and to clean up their files as soon as the application is finished. Files in /tmp on all nodes will be removed automatically by the system if they have not been accessed for more than 10 days.
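
A minimal batch script sketch that stages data into a job-specific scratch directory under /tmp and removes it when the job ends (the partition, input/output file names and application are placeholders):

#!/bin/bash
#SBATCH -J scratch_job           # job name (placeholder)
#SBATCH -p standard              # partition (queue)
#SBATCH -N 1                     # number of nodes
#SBATCH -t 0-06:00:00            # wall time of 6 hours

# Job-specific scratch directory on the node's local /tmp
SCRATCH=/tmp/${USER}_${SLURM_JOB_ID}
mkdir -p "$SCRATCH"

# Clean up the scratch directory when the job exits, even on failure
trap 'rm -rf "$SCRATCH"' EXIT

# Stage input data to local scratch, run, and copy results back
cp "$SLURM_SUBMIT_DIR"/input.dat "$SCRATCH"/
cd "$SCRATCH"
"$SLURM_SUBMIT_DIR"/my_application input.dat > output.dat
cp output.dat "$SLURM_SUBMIT_DIR"/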