Running jobs at DKRZ

Job scheduler

A job scheduler, or ‘batch’ scheduler, is a tool that manages how user jobs (running a model or script) are queued and run on a set of compute resources.

At the German Climate Computing Centre (DKRZ), the compute resources are the set of compute nodes that make up the supercomputer Levante. Each user can submit jobs to the scheduler, which then decides which jobs to run and where to execute them. The scheduler manages the jobs to ensure that the compute resources are used efficiently and that users get appropriate access to those resources.

At DKRZ, the Simple Linux Utility for Resource Management (SLURM) is used for submission, scheduling, execution, and monitoring of jobs on Levante. SLURM is a free, open-source resource manager and scheduler, which is used at many high-performance computing centres around the world.

Key SLURM commands

  • sinfo
    … shows information about all partitions and nodes managed by SLURM, as well as the general system state. It has a wide variety of filtering, sorting, and formatting options.

  • sbatch
    … submits a batch script (your model run script). The script is executed on the first node of the allocation, and the working directory coincides with the working directory from which sbatch was invoked. Within the script, one or multiple srun commands can be used to create job steps and execute parallel applications.

  • squeue
    … queries the list of pending and running jobs. By default it reports pending and running jobs, each sorted by priority. The most relevant job states are running (R), pending (PD), completing (CG), completed (CD) and cancelled (CA). The TIME field shows the actual job execution time. Use squeue -u ${USER} to list only your own jobs.

  • scontrol
    … provides functionality for users to manage jobs and to query information about the system configuration, such as nodes, partitions, and jobs. Use scontrol show partition compute to display the configuration and limits of a specific partition (here compute).

  • scancel
    … cancels a pending or running job or job step. It can also be used to send an arbitrary signal to all processes associated with a running job or job step.

  • srun
    … launches parallel tasks within a job or starts an interactive job. Note that the srun command is usually included in the run script of your application or model. An example session using these commands is sketched below.
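
As an illustration, a typical sequence of these commands might look as follows (the job ID 1234567 is a placeholder for the ID assigned to your job at submission):

# Show state and availability of the nodes in the compute partition
sinfo --partition=compute

# Show configuration and limits of the compute partition
scontrol show partition compute

# List only your own pending and running jobs
squeue -u ${USER}

# Cancel a job, using its job ID as reported by squeue
scancel 1234567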

Allocating Resources with SLURM

A job allocation, which is a set of computing resources (nodes or cores) assigned to a user’s request for a specified amount of time, can be created using the SLURM commands sbatch and srun.
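
As a minimal sketch of both routes (the script name my_batch_script.sh is a placeholder, the account xz0123 is the same example account used in the script below, and hostname simply prints the name of the allocated node):

# sbatch: create an allocation and run a batch script within it
sbatch my_batch_script.sh

# srun outside an existing allocation: allocate one node for five minutes
# and run a single command on it
srun --account=xz0123 --partition=compute --nodes=1 --time=00:05:00 hostname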

The usual way to allocate resources and execute a job on Levante is to write a batch script (or to turn your model run script into one) and submit it to SLURM with the sbatch command.

The batch script is a shell script consisting of two parts:

  1. Resource requests, such as the number of required nodes or the maximum wall-clock time of the job. These options are given on lines starting with the #SBATCH directive and must precede any executable commands in the batch script.

  2. Job steps (e.g. starting the model), i.e. the user's tasks to be executed.

For example:

#!/bin/bash
#SBATCH --partition=compute       # run on nodes of the compute partition
#SBATCH --account=xz0123          # project account the job is charged to
#SBATCH --nodes=1                 # number of nodes
#SBATCH --ntasks-per-node=128     # number of (MPI) tasks per node
#SBATCH --time=00:30:00           # limit of total run time (HH:MM:SS)

# Begin of section with executable commands
set -e
ls -l
srun ./my_program

The script itself is regarded by SLURM as the first job step and is executed serially on the first compute node of the job allocation. To execute parallel (MPI) tasks, users call the SLURM srun command within the script.
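
To make the notion of job steps more concrete, a script body may contain several srun calls, each of which becomes a separate job step (my_preprocessing is a hypothetical executable; my_program is the one from the example above):

# Job step 1: run a serial preprocessing program on a single task
srun --ntasks=1 ./my_preprocessing

# Job step 2: run the parallel (MPI) application on all requested tasks
srun ./my_program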

For further explanations and commands see also: https://docs.dkrz.de/doc/levante/running-jobs/slurm-introduction.html