Table of Contents
- Introduction
- Scheduling Fundamentals on chip: Partitions, QOS’s, and more
- Interacting with the SLURM Scheduling System
- Running Serial Jobs
- Running Parallel Jobs
- Parallel Runs on the Production Partitions
- Some details about the batch system
Introduction
Running a program on chip is different from running one on a standard workstation. When we log into the cluster, we are interacting with the usernode. But we would like our programs to run on the compute nodes, which is where the real computing power of the cluster is. We will walk through the process of running serial and parallel code on the cluster, and then later discuss some of the finer details. This page uses the code examples from the compile tutorial. Please download and compile those examples first, before following the run examples below. But first, some more general explanations.
On chip, jobs must be run on the compute nodes of the cluster. You cannot execute jobs directly on the compute nodes yourself; you must request that the cluster’s batch system do it on your behalf. To use the batch system, you submit a special script which contains instructions to execute your job on the compute nodes. When submitting your job, you specify a partition (group of nodes, e.g., 2018, 2021, or 2024) and a QOS (a classification that determines what kind of resources your job will need). Your job will wait in the queue until it is “next in line” and free processors on the compute nodes become available. Once a job is started, it continues to run until it either completes (with or without error) or reaches its time limit, in which case it is terminated by the scheduler.
During the runtime of your job, your instructions are executed on the assigned compute nodes and have access to the resources of those nodes, notably their memory, processors, and local disk space (the /scratch space). Note that the /scratch space is cleared after every job terminates.
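For example, inside a slurm script (introduced below) a job could stage its temporary work on the node-local /scratch space and copy the results back before the job ends. The following is only a sketch; the /scratch directory layout, program name, and file names are placeholders, not documented conventions of chip:

# Sketch: use node-local /scratch inside a job script; /scratch is cleared when the job ends.
WORKDIR=/scratch/$SLURM_JOB_ID          # $SLURM_JOB_ID is set by SLURM for each job
mkdir -p $WORKDIR
cp input.dat $WORKDIR/                  # stage input to the fast local disk
cd $WORKDIR
./my_program input.dat > results.out
cp results.out $SLURM_SUBMIT_DIR/       # copy results back to the submission directory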
Scheduling Fundamentals on chip: Partitions, QOS’s, and more
The examples below on this page are designed for running code on the CPU cluster, such as the C programs in the compile tutorial.
Interacting with the SLURM Scheduling System
There are several basic commands you need to know to submit jobs, cancel them, and check their status. These are:
- sbatch – submit a job to the batch queue system
- squeue – check the current jobs in the batch queue system
- sinfo – view the current status of the queues
- scancel – cancel a job
- scontrol – show detailed information about a job
scancel
The first command we will mention is scancel. If you have submitted a job that you no longer want, you should be a responsible user and kill it. This will prevent resources from being wasted, and allows other users’ jobs to run. Jobs can be killed while they are pending (waiting to run), or while they are actually running. To remove a job from the queue or to cancel a running job cleanly, use the scancel command with the identifier of the job to be deleted, for instance
[gobbert@c24-41 Nodesused]$ scancel --cluster=chip-cpu 110031
The job identifier can be obtained from the job listing produced by squeue (see below) or from the message printed by sbatch when you originally submitted the job (also below). See “man scancel” for more information.
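Besides a single job ID, scancel also accepts other selectors. The lines below are sketches of standard SLURM usage (replace gobbert by your own username):

scancel --cluster=chip-cpu -u gobbert              # cancel all of your own jobs
scancel --cluster=chip-cpu -u gobbert -t PENDING   # cancel only your jobs that are still pending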
sbatch
Now that we know how to cancel a job, we will see how to submit one. You can use the sbatch command to submit a script to the queue system.
[gobbert@c24-41 Nodesused]$ sbatch run-nodesused-n2ppn4mpi.slurm
Submitted batch job 110031 on cluster chip-cpu
In this example, run-nodesused-n2ppn4mpi.slurm is the script we are sending to the slurm scheduler. We will see shortly how to formulate such a script. Notice that sbatch returns a job identifier. We can use this to kill the job later if necessary (as in the scancel example above), or to check its status. For more information, see “man sbatch”.
squeue
You can use the squeue command to check the status of jobs in the batch queue system. Here’s an example of the basic usage on chip-cpu and for the user gobbert; to see your jobs, replace gobbert by your username on chip:
[gobbert@c24-41 Nodesused]$ squeue --cluster=chip-cpu -u gobbert
CLUSTER: chip-cpu
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            110031      2024 nodesuse  gobbert  R       0:01      2 c24-[41-42]
            110023      2024     bash  gobbert  R      47:04      1 c24-41
The most interesting column is the one titled ST for “status”. It shows what a job is doing at this point in time. The state “PD” indicates that the job is queued and waiting for resources. When enough free processor cores become available, it changes to the “R” state and begins running. You may also see a job with status “CG”, which means it is completing (for instance, still writing stdout and stderr) and about to exit the batch system, or “CF”, which means the assigned nodes are still being configured before the job starts. Other statuses are possible too; see “man squeue”. Once a job has exited the batch queue system, it no longer shows up in the squeue display.
We can also see several other pieces of useful information. The TIME column shows the walltime used by the job up to the present time. For example, job 110031 has been running for 1 second so far. The NODES column shows the number of nodes used by the job, and the NODELIST column shows which compute node(s) have been assigned to the job. For job 110031, the 2 nodes are c24-41 and c24-42, which is abbreviated to c24-[41-42].
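If you want different or wider columns than the default, squeue accepts a format string via the -o option. This is only a sketch of standard SLURM usage; see “man squeue” for the complete list of format codes:

# job ID, partition, job name, state, elapsed time, node count, node list / pending reason
squeue --cluster=chip-cpu -u gobbert -o "%.10i %.9P %.16j %.2t %.10M %.6D %R"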
scontrol
While a job is queued, while it is running, and for a limited period of time (a few hours) after it has finished, the scontrol command gives access to the full details of the job. See “man scontrol” for complete information; here we just demonstrate the basic output of “scontrol show job” followed by the job ID, using the job ID from above as an example again:
[gobbert@c24-41 Nodesused]$ scontrol --cluster=chip-cpu show job 110031
JobId=110031 JobName=nodesused
   UserId=gobbert(32296) GroupId=pi_gobbert(1152) MCS_label=N/A
   Priority=1 Nice=0 Account=pi_gobbert QOS=shared
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:02 TimeLimit=00:05:00 TimeMin=N/A
   SubmitTime=2025-05-28T13:36:41 EligibleTime=2025-05-28T13:36:41
   AccrueTime=2025-05-28T13:36:41
   StartTime=2025-05-28T13:36:41 EndTime=2025-05-28T13:36:43 Deadline=N/A
   PreemptEligibleTime=2025-05-28T13:36:41 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-05-28T13:36:41 Scheduler=Main
   Partition=2024 AllocNode:Sid=c24-41:122675
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=c24-[41-42]
   BatchHost=c24-41
   NumNodes=2 NumCPUs=8 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=8,mem=8G,node=2,billing=8
   AllocTRES=cpu=8,mem=8G,node=2,billing=8
   Socks/Node=* NtasksPerN:B:S:C=4:0:*:1 CoreSpec=*
   MinCPUsNode=4 MinMemoryNode=4G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/gobbert/Nodesused/run-nodesused-n2ppn4mpi.slurm
   WorkDir=/home/gobbert/Nodesused
   StdErr=/home/gobbert/Nodesused/slurm.err
   StdIn=/dev/null
   StdOut=/home/gobbert/Nodesused/slurm.out
   Power=
This is very dense output, but if you read it carefully, you can find all the details of the job, such as the number of nodes, the node list, the time limit, the working directory of the job, the slurm script used, the stdout and stderr files, etc.
sinfo
The sinfo command also shows the current status of the batch system, but from the point of view of the SLURM partitions. Here is a sample output showing how many nodes in each partition are available:
[gobbert@chip ~]$ sinfo --cluster=chip-cpu
CLUSTER: chip-cpu
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
2024         up   infinite     12  alloc c24-[29-40]
2024         up   infinite     39   idle c24-[01-28,41-51]
2021         up   infinite      5    mix c21-[01-05]
2021         up   infinite     13   idle c21-[06-18]
2018         up   infinite      3  down* c18-[17,19,23]
2018         up   infinite      1  drain c18-39
2018         up   infinite      5    mix c18-[18,20-21,30,40]
2018         up   infinite     33   idle c18-[01-16,22,24-29,31-38,41-42]
To see more details of available equipment, use sinfo with options such as
sinfo -o "%10N %4c %10m %40f %10G"
A sample output of this is
[gobbert@chip ~]$ sinfo -o "%10N %4c %10m %40f %10G"
CLUSTER: chip-cpu
NODELIST   CPUS MEMORY     AVAIL_FEATURES                           GRES
c24-[14-51 64   476837     location=local,low_mem                   (null)
c24-[01-13 64   953674     location=local,high_mem                  (null)
c18-[01,05 36+  182524+    location=local                           (null)
CLUSTER: chip-gpu
NODELIST   CPUS MEMORY     AVAIL_FEATURES                           GRES
g20-[01,03 96   385581     RTX_2080TI,RTX_2080ti,rtx_2080TI,2080,20 gpu:8
g20-[12-13 96   238418     RTX_8000,rtx_8000,8000                   gpu:8
g24-[01-08 32   257443     L40S,l40s,L40s,l40S                      gpu:4
g24-[09-10 32   257443     h100,H100                                gpu:2
g20-[02,04 96   385581     RTX_2080TI,RTX_2080ti,rtx_2080TI,2080,20 gpu:6
g20-[05-11 96   385581     RTX_6000,rtx_6000,6000                   gpu:8
The key use of this output is to see exactly how many nodes have which type of equipment and how the feature names are spelled, so that you can request them in your srun or sbatch commands.
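For example, the strings in the AVAIL_FEATURES column can be passed to the --constraint flag, and GPUs are requested via --gres. The lines below are sketches of standard SLURM usage; any account, partition, or QOS options you normally use still apply:

# request a high-memory CPU node by its feature name
sbatch --cluster=chip-cpu --constraint=high_mem run-hello-serial.slurm

# request one L40S GPU on the GPU cluster (add your usual account/partition/QOS options)
srun --cluster=chip-gpu --constraint=L40S --gres=gpu:1 --time=00:05:00 --mem=4G --pty bash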
Running Serial Hello World
This section assumes you have already compiled the serial “Hello, world!” example. Now we will see how to run it in several different ways.
Interactive run on a compute node
The most obvious way is to run the program interactively, that is, from the Linux command line on the compute node where you are while compiling.
[gobbert@c24-01 Hello_Serial]$ ./hello_serial
Hello world from c24-01
The reported hostname confirms that the program ran on the compute node c24-01, which was the node of the interactive session used for these steps.
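If you do not already have such an interactive session from the compile tutorial, one standard way to request one from the usernode is an srun with a pseudo-terminal. This is only a sketch, reusing the account, partition, and QOS values from the srun example in the next sub-section; adjust them to your own settings:

# request an interactive shell on a compute node (issue this on the usernode)
srun --cluster=chip-cpu --account=pi_gobbert --partition=2024 --qos=shared --time=01:00:00 --mem=4G --pty bash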
Batch run using srun without a slurm script
For jobs that take more than a few seconds, interactive running is not really appropriate. The srun command reserves a compute node for your job, runs the job there, and then releases the compute node again, so others can use it. The most basic srun command can be issued without a slurm script (see the next sub-section), meaning that you issue the srun command directly from the Linux command line. However, note carefully that this srun command works only from the usernode, not if you are already in an interactive session on a compute node! This is in contrast to the sbatch command in the following sub-section, which can be issued from either the usernode or a compute node.
srun --cluster=chip-cpu --account=pi_gobbert --partition=2024 --qos=shared --time=00:05:00 --mem=4G ./hello_serial
Batch run using sbatch with a slurm script
To submit a batch job, it is best to assemble all job options in a slurm file and then submit this file with sbatch.
Download the slurm script using wget to your workspace, where the executable hello_serial is already located.
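The contents of the script are not reproduced on this page. As a guide, here is a minimal sketch of what run-hello-serial.slurm could look like, reusing the account, partition, QOS, and time values from the srun example above; the job-name value is only an example:

#!/bin/bash
#SBATCH --job-name=hello_serial      # name of the job as displayed by squeue
#SBATCH --output=slurm.out           # file to capture stdout
#SBATCH --error=slurm.err            # file to capture stderr
#SBATCH --account=pi_gobbert         # example account; replace by your own PI account
#SBATCH --partition=2024             # partition of the CPU cluster to run on
#SBATCH --qos=shared                 # QOS; this job runs only for a brief moment
#SBATCH --time=00:05:00              # maximum walltime of the job
#SBATCH --nodes=1                    # request 1 node ...
#SBATCH --ntasks-per-node=1          # ... with 1 task per node, i.e., a serial job

./hello_serial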
Here, the job-name flag simply sets the string that is displayed as the name of the job in squeue. The output and error flags set the file names for capturing standard output (stdout) and standard error (stderr), respectively. The next flag chooses the 2024 partition of the CPU cluster for the job to run on. The QOS flag requests the shared QOS, since this particular “Hello, world!” job runs only for a brief moment, and the time flag provides a more precise estimate of the maximum possible time for the job to take. After a job has reached its time limit, it is stopped by the scheduler; this is done to ensure that everyone has a fair chance to use the cluster. The next two flags set the total number of nodes requested and the number of MPI tasks per node; by choosing both of these as 1, we are requesting space for a serial job. Now we are ready to submit our job to the scheduler. To accomplish this, use the sbatch command as follows:
[gobbert@c24-01 Hello_Serial]$ sbatch run-hello-serial.slurm
Submitted batch job 109328 on cluster chip-cpu
If the submission is successful, the sbatch command returns a job ID, here 109328. We can use this to check the status of the job (squeue) or to delete it (scancel) if necessary. To check on the running jobs of user gobbert in this example, use the squeue command as follows:
[gobbert@c24-01 Hello_Serial]$ squeue --cluster=chip-cpu -u gobbert
CLUSTER: chip-cpu
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            109298      2024     bash  gobbert  R    1:38:59      1 c24-01
Notice that this is actually NOT the job that we just submitted; see the job ID! Rather, this is the job of the interactive session itself, in which the compiling and submission were done. The reason is that this “Hello, world!” job takes only fractions of a second, so we are typically not able to catch it in squeue while it is running.
But looking at the files in the directory now, we see that the files slurm.err and slurm.out exist, so the job definitely ran.
If slurm.err is not empty, check the contents carefully as something may have gone wrong. The file slurm.out contains our stdout output; it should contain the hello world message from our program.
[gobbert@c24-01 Hello_Serial]$ ll
total 192
-rwxrwx--- 1 gobbert pi_gobbert 16600 May 27 16:46 hello_serial*
-rw-rw---- 1 gobbert pi_gobbert   184 Feb  1  2014 hello_serial.c
-rw-rw---- 1 gobbert pi_gobbert   628 May 27 17:06 run-hello-serial.slurm
-rw-rw---- 1 gobbert pi_gobbert     0 May 27 17:27 slurm.err
-rw-rw---- 1 gobbert pi_gobbert    24 May 27 17:27 slurm.out
[gobbert@c24-01 Hello_Serial]$ more slurm.err
[gobbert@c24-01 Hello_Serial]$ more slurm.out
Hello world from c24-01
Running Parallel Hello World
This section assumes you have already compiled the parallel “Hello, world!” example. Now we will see how to run it.
The slurm script is very similar to the serial slurm script in the options for sbatch at the top of the file, except that we choose 2 nodes by --nodes=2 and 4 processes per node by --ntasks-per-node=4. These are just examples. You can choose the number of nodes from 1 up to however many are available, and the number of processes per node from 1 up to the number of cores on one node; on a 2024 node, there are two 32-core CPUs, for a total of 64 cores.
The key difference in a parallel slurm script is the use of mpirun in front of the executable in the last line of the script:
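The original script is not reproduced on this page; the following is only a sketch of what run-hello-n2ppn4mpi.slurm could look like, assuming the same account, partition, QOS, and time values as before, and assuming Intel MPI pinning variables for the two export lines discussed below:

#!/bin/bash
#SBATCH --job-name=hello             # name of the job as displayed by squeue
#SBATCH --output=slurm.out           # file to capture stdout
#SBATCH --error=slurm.err            # file to capture stderr
#SBATCH --account=pi_gobbert         # example account; replace by your own PI account
#SBATCH --partition=2024             # partition of the CPU cluster to run on
#SBATCH --qos=shared                 # example QOS
#SBATCH --time=00:05:00              # maximum walltime of the job
#SBATCH --nodes=2                    # request 2 nodes ...
#SBATCH --ntasks-per-node=4          # ... with 4 MPI processes per node

# optional: pin the MPI processes to their assigned cores
# (Intel MPI variables assumed here; the original script may use a different form)
export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN=core

mpirun -print-rank-map ./hello_parallel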
The sbatch options above cover the essential settings. The mpirun option -print-rank-map prints to stdout the hostnames of the assigned nodes and the list of MPI ranks on each node; this is entirely optional, and you do not have to use this option. The two lines before the mpirun pin the MPI processes to their assigned cores, which can potentially make the job more efficient; these lines are also optional, and you do not have to use them.
Submit the script to the batch queue system:
[gobbert@c24-41 Hello_Parallel]$ sbatch run-hello-n2ppn4mpi.slurm
Submitted batch job 110025 on cluster chip-cpu
The parallel Hello World program also does not run for long, but we managed to capture some output from squeue:
[gobbert@c24-41 Hello_Parallel]$ squeue --cluster=chip-cpu -u gobbert
CLUSTER: chip-cpu
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            110025      2024    hello  gobbert  R       0:01      2 c24-[41-42]
            110023      2024     bash  gobbert  R      14:25      1 c24-41
[gobbert@c24-41 Hello_Parallel]$ squeue --cluster=chip-cpu -u gobbert
CLUSTER: chip-cpu
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            110023      2024     bash  gobbert  R      14:30      1 c24-41
This time, the first squeue shows the job running briefly, with 1 second of runtime so far, before the job again finishes quickly and is gone from the queue in the second squeue output. Notice again that the job with the name bash is the interactive shell and not the parallel Hello World job.
We check the output that we obtained now:
[gobbert@c24-41 Hello_Parallel]$ ll
total 192
-rwxrwx--- 1 gobbert pi_gobbert 16776 May 28 11:57 hello_parallel*
-rw-rw---- 1 gobbert pi_gobbert   490 Feb  1  2014 hello_parallel.c
-rw-rw---- 1 gobbert pi_gobbert   726 May 28 12:09 run-hello-n2ppn4mpi.slurm
-rw-rw---- 1 gobbert pi_gobbert     0 May 28 12:09 slurm.err
-rw-rw---- 1 gobbert pi_gobbert   539 May 28 12:09 slurm.out
[gobbert@c24-41 Hello_Parallel]$ more slurm.err
[gobbert@c24-41 Hello_Parallel]$ more slurm.out
(c24-41:0,1,2,3)
(c24-42:4,5,6,7)
Hello world from process 002 out of 008, processor name c24-41
Hello world from process 003 out of 008, processor name c24-41
Hello world from process 005 out of 008, processor name c24-42
Hello world from process 001 out of 008, processor name c24-41
Hello world from process 006 out of 008, processor name c24-42
Hello world from process 007 out of 008, processor name c24-42
Hello world from process 004 out of 008, processor name c24-42
Hello world from process 000 out of 008, processor name c24-41
We see that the error file slurm.err is empty again, indicating that no error occurred. The output file slurm.out lists the MPI ranks and the compute node that each ran on, here c24-41 for MPI ranks 0, 1, 2, 3 and c24-42 for MPI ranks 4, 5, 6, 7. Clearly, the output lines are mixed up and appear in random order. This is to be expected, since several output streams write to the same file. In any case, we clearly see that 2 nodes were used and that each node ran 4 MPI processes. The first two lines of slurm.out are caused by the -print-rank-map option to mpirun; they show the hostname and the MPI ranks run on each node.
Parallel Nodesused Run
Now we show the results possible with the nodesused program from the compile tutorial. The submission script for the nodesused program reads like the parallel Hello World script above, with obvious changes to the job-name and the mpirun line, as sketched below.
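Relative to the parallel Hello World sketch above, only the job name and the last line change; the rest of the script stays the same, under the same assumptions as before:

#SBATCH --job-name=nodesused         # displayed truncated as "nodesuse" by squeue

mpirun -print-rank-map ./nodesused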
Running the code using sbatch gives the following results:
[gobbert@c24-41 Nodesused]$ ll
total 240
-rwxrwx--- 1 gobbert pi_gobbert 17504 May 28 12:40 nodesused*
-rw-rw---- 1 gobbert pi_gobbert  3809 Oct 22  2018 nodesused.c
-rw-rw---- 1 gobbert pi_gobbert   440 May 28 12:42 nodesused_cpuid.log
-rw-rw---- 1 gobbert pi_gobbert   320 May 28 12:42 nodesused.log
-rw-rw---- 1 gobbert pi_gobbert   721 May 28 12:41 run-nodesused-n2ppn4mpi.slurm
-rw-rw---- 1 gobbert pi_gobbert     0 May 28 12:42 slurm.err
-rw-rw---- 1 gobbert pi_gobbert   555 May 28 12:42 slurm.out
[gobbert@c24-41 Nodesused]$ more slurm.err
[gobbert@c24-41 Nodesused]$ more slurm.out
(c24-41:0,1,2,3)
(c24-42:4,5,6,7)
Hello world from process 0001 out of 0008, processor name c24-41
Hello world from process 0000 out of 0008, processor name c24-41
Hello world from process 0003 out of 0008, processor name c24-41
Hello world from process 0002 out of 0008, processor name c24-41
Hello world from process 0004 out of 0008, processor name c24-42
Hello world from process 0006 out of 0008, processor name c24-42
Hello world from process 0005 out of 0008, processor name c24-42
Hello world from process 0007 out of 0008, processor name c24-42
Notice that the node names confirm that the job again ran on 2 nodes with 4 processes per node. As before for the parallel “Hello, world!” program, the order of the output lines to stdout is random.
But the listing of the file nodesused.log shows that our code in the nodesused() function ordered the output by the MPI process IDs:
[gobbert@c24-41 Nodesused]$ more nodesused.log
MPI process 0000 of 0008 on node c24-41
MPI process 0001 of 0008 on node c24-41
MPI process 0002 of 0008 on node c24-41
MPI process 0003 of 0008 on node c24-41
MPI process 0004 of 0008 on node c24-42
MPI process 0005 of 0008 on node c24-42
MPI process 0006 of 0008 on node c24-42
MPI process 0007 of 0008 on node c24-42