SLURM tasks
Support for the SLURM workload manager.
The hpc.slurm.sbatch task
This task sends ScriptEngine jobs to SLURM queues by providing the
functionality of the SLURM sbatch command to ScriptEngine scripts. The
usage pattern is:
    - hpc.slurm.sbatch:
        scripts: <SE_SCRIPT | LIST_OF_SE_SCRIPTS>  # optional
        hetjob_spec: <LIST_OF_SBATCH_OPTIONS>      # optional
        submit_from_sbatch: <true | false>         # optional, default False
        stop_after_submit: <true | false>          # optional, default True
        set_jobid: <CONTEXT_NAME>                  # optional
        <SBATCH_OPTIONS>                           # optional
The main usage principle of hpc.slurm.sbatch is that a new batch job is
created and sent into a SLURM queue. Once SLURM executes the job, one or more
ScriptEngine scripts are run.
There are two ways to specify which scripts are run in the batch job. By default
(no scripts argument is given), the batch job runs the script(s) given on the
se command line. For example, if the following script (assumed name
sbatch.yml):
    - hpc.slurm.sbatch:
        account: <MY_SLURM_ACCOUNT>
        time: !noparse 0:10:00
    - base.echo:
        msg: Hello world, from batch job!
is run with se sbatch.yml, a batch job will be queued, which eventually
writes “Hello world, from batch job!” to the default job logfile. This default
is the desired behavior in most use cases. However, it is possible to have the
batch job run a different script (or scripts) instead of the initial one, by
specifying one or more other ScriptEngine scripts with the scripts argument.
More than one script has to be specified as a list.
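As an illustration of the scripts argument, the following sketch queues a batch job that runs two other scripts instead of the submitting one. The script names setup.yml and run.yml are hypothetical placeholders; only the scripts, account and time arguments are taken from the description above.

```yaml
# Submitting script: queues a job that runs setup.yml and run.yml
# (hypothetical script names) instead of this script itself.
- hpc.slurm.sbatch:
    account: <MY_SLURM_ACCOUNT>
    time: !noparse 0:10:00
    scripts:
      - setup.yml
      - run.yml
```

A single script can be given as a plain value (scripts: run.yml) rather than as a list.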
Most of the hpc.slurm.sbatch arguments will be passed right through to the
sbatch command. Thus, in the above example, the command executed under the
hood is:
    sbatch --account MY_SLURM_ACCOUNT --time 0:10:00 se sbatch.yml
Only a few arguments are processed by the hpc.slurm.sbatch task itself, see
the usage pattern above. Thus, it is possible to use any sbatch argument, as
long as it is a valid long argument (i.e. with the double dash syntax). Note
that the sbatch arguments and options are not checked for validity!
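To make the pass-through concrete, here is a minimal sketch that adds a few more sbatch long options to the earlier example. The options job-name, nodes and output are standard sbatch long options; the values hello and hello.log are made up for illustration.

```yaml
- hpc.slurm.sbatch:
    account: <MY_SLURM_ACCOUNT>
    time: !noparse 0:10:00
    job-name: hello      # forwarded as sbatch --job-name (assumed value)
    nodes: 1             # forwarded as sbatch --nodes
    output: hello.log    # forwarded as sbatch --output (assumed file name)
- base.echo:
    msg: Hello world, from batch job!
```

Following the pattern of the command shown above, each key would presumably appear as the corresponding double-dash option on the sbatch command line. Since nothing is validated, a misspelled key would only be noticed when sbatch itself rejects it.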
An important principle of hpc.slurm.sbatch is that on the initial execution,
it will stop the processing of the current script once the batch job is queued.
Hence, when the above example script is run, a job is put in the batch queue
(first task), but the base.echo task is not executed. When the script is run
(again) from within the batch job, the hpc.slurm.sbatch task detects that it
is in a batch job and does nothing. Therefore, the following echo task is run as
part of the job.
Again, this behavior will be appropriate in most use cases. The script is run
until the sbatch task, a job is queued and processing stops. Once the job is
running, hpc.slurm.sbatch does nothing and all other tasks are run.
Sometimes, though, it makes sense to submit a batch job even if the current script already runs in a batch job itself. For example, one may want to queue a follow-on job at the end of the script. In order to do this, one needs to set:
    - hpc.slurm.sbatch:
        [...]
        submit_from_sbatch: true
        [...]
If submit_from_sbatch is set to true, a new job is queued even if the
current script is itself running in a batch job.
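The follow-on job use case could look roughly like the sketch below. It assumes a hypothetical second script named followup.yml; everything else uses only the arguments described above.

```yaml
# First task: queues this script as a batch job and stops; inside the
# running job it does nothing, so processing continues with the next task.
- hpc.slurm.sbatch:
    account: <MY_SLURM_ACCOUNT>
    time: !noparse 0:10:00
- base.echo:
    msg: Doing the actual work in the batch job.
# Last task: reached only inside the batch job; queues a follow-on job
# that runs followup.yml (hypothetical script name).
- hpc.slurm.sbatch:
    account: <MY_SLURM_ACCOUNT>
    time: !noparse 0:10:00
    scripts: followup.yml
    submit_from_sbatch: true
```

Pointing the follow-on job at a different script via scripts is a deliberate choice in this sketch: without it, the follow-on job would run the same script again and queue yet another job at its end.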
A related switch is stop_after_submit, which defaults to True. If it is
set to False, the script will continue after a new SLURM job has been submitted.
If stop_after_submit is not explicitly set (or set to True), the script
execution will be stopped, as described above.
Saving the SLURM JOBID
When the job submission via SLURM sbatch succeeds, it is possible to save the
JOBID of the new job in the ScriptEngine context. For this, the set_jobid
task argument can be set to a key for the context dictionary. If set_jobid
is not given (or set to False), the JOBID is not stored in the context.
Note that only simple context keys, no dot-separated values, are supported.
Example:
    - hpc.slurm.sbatch:
        [...]
        set_jobid: jobid
        [...]
    - base.echo:
        msg: "Submitted job with ID {{jobid}}."
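A more complete variant of this example is sketched below. It rests on two assumptions drawn from the behavior described earlier: the submitting script only reaches the echo task if stop_after_submit is set to false, and the batch job runs a separate script (here the hypothetical job.yml) so that the echo task is not re-run inside the job, where no JOBID would have been stored.

```yaml
- hpc.slurm.sbatch:
    account: <MY_SLURM_ACCOUNT>
    time: !noparse 0:10:00
    scripts: job.yml           # hypothetical script run inside the batch job
    stop_after_submit: false   # keep processing this script after submission
    set_jobid: jobid           # simple context key, no dots
- base.echo:
    msg: "Submitted job with ID {{jobid}}."
```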
SLURM Heterogeneous Job Support
The hpc.slurm.sbatch task supports submitting heterogeneous SLURM jobs by
providing the hetjob_spec option:
    - hpc.slurm.sbatch:
        time: 10
        hetjob_spec:
          - nodes: 1
          - nodes: 2
    - base.command:
        name: srun
        args: [
          -l,
          --ntasks, 1, /usr/bin/hostname, ':',
          --ntasks, 10, --ntasks-per-node, 5, /usr/bin/hostname
        ]
In this example, a heterogeneous job with two components is submitted to SLURM,
the first requesting one node and the second two nodes. The srun command in
the second task of the script starts executables on the allocated nodes while
specifying further job characteristics (such as the number of tasks and tasks
per node).
The hetjob_spec argument takes a list of dictionaries and passes the key-value
pairs of each dictionary on to sbatch as the specification for the respective
component of the heterogeneous job. Note that in the example above, each
dictionary contains only one key-value pair, the number of requested nodes.
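A component dictionary can presumably carry more than one entry. The sketch below assumes that every key is forwarded as the corresponding sbatch long option for that component, and uses the standard sbatch option ntasks-per-node as the additional key; the chosen values are illustrative only.

```yaml
- hpc.slurm.sbatch:
    time: 10
    hetjob_spec:
      # first component: one node with a single task
      - nodes: 1
        ntasks-per-node: 1
      # second component: two nodes with five tasks each
      - nodes: 2
        ntasks-per-node: 5
```

As with the other sbatch arguments, these entries are not checked for validity, so each key has to be a valid sbatch long option name.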