How to Run Jobs
Many researchers run their programs on PDC's computer systems, often simultaneously. To manage this, the computer systems need workload management and job scheduling. For job scheduling, PDC uses the Slurm Workload Manager.
When you log in to the supercomputer with ssh, you arrive at a designated login node, in your Klemming home directory. Here you can modify your scripts and manage your files.
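For example, assuming the standard Dardel login address and with the placeholder replaced by your own username, logging in looks like:
ssh <username>@dardel.pdc.kth.se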
To run your script/program on the compute nodes, you either write a job script describing what resources you need and submit it using the sbatch command as described here, or run the script/program interactively as described here.
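As a minimal sketch of such a job script (the project name, partition, node count, time limit and program name below are placeholders or examples that you would replace with your own values):
#!/bin/bash
# placeholder: the time allocation (project) that the job is charged to
#SBATCH -A naissXXXX-XX-XXX
# partition, number of nodes and maximum run time (hh:mm:ss)
#SBATCH -p main
#SBATCH -N 1
#SBATCH -t 01:00:00
# launch the (placeholder) program on the allocated nodes
srun ./my_program
The script is then submitted with, for example, sbatch jobscript.sh.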
How jobs are scheduled
The queue system uses two main methods to decide which jobs are run: fair-share and backfill. Unlike at some other centers, the time a job has spent in the queue is not a factor.
Fair share
The goal of the fair-share algorithm is to make sure that all projects can use their fair share of the available resources within a reasonable time frame. The priority given to a job (belonging to a particular project) depends on how much of that project's time quota has been used recently, in relation to the quotas of other projects; the effect of this usage on the priority declines gradually with a half-life of 14 days. As a rough illustration, core hours used 14 days ago count only half as much towards lowering a project's priority as core hours used today, and usage from 28 days ago counts a quarter as much. So jobs submitted by projects that have not used much of their quota recently will be given high priority, and vice versa.
Backfill
In addition to the main queue, the job scheduling system implements "backfill" to keep the systems as full as possible. If the next job in the queue is large (i.e., it requires many nodes to run), the scheduler accumulates nodes as they become available until there are enough to start the large job. Backfill means that the scheduler looks for smaller jobs that can start on the nodes that are free now and that would finish before enough nodes are free for the large job to start. For backfill to work well, the scheduler needs to know how long jobs will take, so to take advantage of backfill you should set the maximum time your job needs as accurately as possible in your submit scripts.
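For example, a job that you expect to finish well within two hours could request a correspondingly short time limit instead of the partition maximum (the value is of course job-specific):
#SBATCH -t 02:00:00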
This graph shows the percentage of the nodes on Beskow (PDC's previous supercomputer) that were in use on different dates from early 2015 until late 2016. You can see how the scheduler made good use of Beskow, as nearly all of the available nodes were in use all the time.
Note
All researchers sharing a particular time allocation have the same priority. This means that if other people in your time project have used up lots of the allocated time recently, then any jobs you (or they) submit within that project will be given the same low priority.
Example of scheduling
Consider two researchers, Anna and Björn, with projects A and B. Of course, both Anna and Björn would like their jobs to run as soon as possible.
Assume now that project B has used less of its time allocation than project A. In this case, the scheduler will give priority to Björn’s job.
Even if Anna has not used any time herself, this does not make any difference as it is the total amount of time recently used by each project that is taken into consideration when deciding which job will be scheduled next.
Dardel compute nodes
Compute nodes on Dardel come in five different flavors with different amounts of memory. A certain amount of the memory is reserved for the operating system and system software, so the amount of memory available for user software is also listed in the table below. All nodes have the same dual-socket processors, for a total of 128 physical cores per node.
Node type | Node count | RAM | Partitions | Available memory | Example memory flag |
---|---|---|---|---|---|
Thin node | 700 | 256 GB | main, shared, long | 227328 MB | |
Large node | 268 | 512 GB | main, memory | 456704 MB | --mem=440GB |
Huge node | 8 | 1 TB | main, memory | 915456 MB | --mem=880GB |
Giant node | 10 | 2 TB | memory | 1832960 MB | --mem=1760GB |
GPU node | 62 | 512 GB | gpu | 456704 MB | --mem=440GB |
More details on the hardware are available at https://www.pdc.kth.se/hpc-services/computing-systems/about-dardel-1.1053338.
Different node types are allocated based on the Slurm partition (the -p flag) and the --mem or --mem-per-cpu flags. By default, any free node in the partition can be allocated to the job. Using the --mem=X flag restricts the set of eligible nodes to those with at least X amount of available memory.
The following configuration allocates one node with at least 250 GB of available memory in the main partition, i.e. a large or huge node:
#SBATCH --mem=250GB
#SBATCH -p main
The --mem flag imposes a hard upper limit on the amount of memory the job can use. In the above example, even if the job is allocated a node with 1 TB of memory, it will not be able to use more than 250 GB of memory due to this limit.
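As a further sketch based on the table above, a job that needs one of the giant nodes could combine the memory partition with the corresponding example memory flag:
#SBATCH -p memory
#SBATCH --mem=1760GB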
Dardel partitions
The compute nodes on Dardel are divided into five partitions. Each job must specify one of these partitions using the -p flag. The table below explains the differences between the partitions. See the table above for descriptions of the various node types.
Partition name | Characteristics |
---|---|
main | Thin, large and huge nodes. Job gets whole nodes (exclusive). Maximum job time 24 hours. |
long | Thin nodes. Job gets whole nodes (exclusive). Maximum job time 7 days. |
shared | Thin nodes. Jobs are allocated cores, not whole nodes; by default one core is granted, request more with -n or -c. Job shares the node with other jobs. Maximum job time 7 days. |
memory | Large, huge and giant nodes. Job gets whole nodes (exclusive). Maximum job time 7 days. |
gpu | GPU nodes. Job gets whole nodes (exclusive). Maximum job time 24 hours. |
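As an illustration of how the partition choice and time limit fit together (the node count and time below are examples only), a job needing three days on two thin nodes could request:
#SBATCH -p long
#SBATCH -N 2
# 3 days, within the 7-day limit of the long partition
#SBATCH -t 3-00:00:00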
How to run on the shared partition
Running on the shared partition is a little different from running on exclusive nodes. You need to specify the number of cores you will be using, and at the same time you are granted a corresponding amount of RAM for your job.
Defining the number of cores or memory
When running on shared nodes, you need to specify the number of cores you will be using or the amount of memory you need. The amount of memory granted is proportional to the number of cores you ask for, and vice versa. For example, on a node with 128 cores and 256 GB of RAM, there are 2 GB of RAM per core: if you ask for 20 cores, you will receive 40 GB of RAM, and if you instead ask for 80 GB of RAM, you will automatically receive 40 cores. Whichever of the two requests (cores or memory) is the larger determines what the job is allocated.
Parameters needed on shared nodes
Parameter | Description |
---|---|
-n [tasks] | Allocates [tasks] tasks |
--cpus-per-task=[cores] | Allocates [cores] physical cores per task, i.e. ntasks*cpus-per-task cores in total (default: cpus-per-task=1) |
--mem=[RAM in Mbytes] | The maximum amount of RAM allocated for your job |
Example 1: On a shared node with 128 cores and 256 GB RAM, the following request of 10 tasks with 2 cores each gives 20 cores and, correspondingly, 40 GB RAM:
#SBATCH -p shared
#SBATCH --ntasks=10
#SBATCH --cpus-per-task=2
Example 2: On a shared node with 128 cores and 256 GB RAM. Here the 2 requested cores alone would not cover the need for RAM, so the --mem request of 40 GB determines the allocation and the job receives 20 cores and 40 GB RAM:
#SBATCH -p shared
#SBATCH --ntasks=2
#SBATCH --mem=40G
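Once a job script like the examples above has been prepared (the file name below is a placeholder), it is submitted and monitored in the usual way:
sbatch myjob.sh
squeue -u $USER
Here squeue -u $USER lists your own queued and running jobs.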