How to Run on chip

Introduction

Running a program on chip is different from running one on a standard workstation. When we log into the cluster, we are interacting with the usernode, but we would like our programs to run on the compute nodes, which is where the real computing power of the cluster lies. We will walk through the process of running serial and parallel code on the cluster, and then discuss some of the finer details. This page uses the code examples from the compile tutorial; please download and compile those examples first, before following the run examples below. But first, some general explanations.

On chip, jobs must be run on the compute nodes of the cluster. You cannot execute jobs directly on the compute nodes yourself; you must request that the cluster’s batch system do it on your behalf. To use the batch system, you submit a special script which contains instructions to execute your job on the compute nodes. When submitting your job, you specify a partition (a group of nodes, e.g., 2018, 2021, or 2024) and a QOS (a classification that determines what kind of resources your job will need). Your job will wait in the queue until it is “next in line” and free processors on the compute nodes become available. Once a job is started, it continues to run until it either completes (with or without error) or reaches its time limit, in which case it is terminated by the scheduler.

During the runtime of your job, your instructions will be executed across the compute nodes. These instructions will have access to the resources of the nodes on which they are running, notably the memory, processors, and local disk space (/scratch space). Note that the /scratch space is cleared after every job terminates.

The batch system (also called the scheduler or work load manager) used on chip is called SLURM, which is short for Simple Linux Utility for Resource Management.

Scheduling Fundamentals on chip: Partitions, QOS’s, and more

The examples below on this page are designed for running code on the CPU cluster, such as the C programs from the compile tutorial.

Interacting with the SLURM Scheduling System

There are several basic commands you need to know to submit jobs, cancel them, and check their status. These are:

    • sbatch – submit a job to the batch queue system
    • squeue – check the current jobs in the batch queue system
    • sinfo – view the current status of the queues
    • scancel – cancel a job

scancel

The first command we will mention is scancel. If you have submitted a job that you no longer want, you should be a responsible user and kill it. This will prevent resources from being wasted, and allows other users’ jobs to run. Jobs can be killed while they are pending (waiting to run), or while they are actually running. To remove a job from the queue or to cancel a running job cleanly, use the scancel command with the identifier of the job to be deleted, for instance

[gobbert@c24-41 Nodesused]$ scancel --cluster=chip-cpu 110031

The job identifier can be obtained from the job listing produced by squeue (see below), or it is reported by sbatch when you originally submit the job (also below). See “man scancel” for more information.

sbatch

Now that we know how to cancel a job, we will see how to submit one. You can use the sbatch command to submit a script to the queue system.

[gobbert@c24-41 Nodesused]$ sbatch run-nodesused-n2ppn4mpi.slurm
Submitted batch job 110031 on cluster chip-cpu

In this example, run-nodesused-n2ppn4mpi.slurm is the script we are sending to the SLURM scheduler. We will see shortly how to formulate such a script. Notice that sbatch returns a job identifier. We can use this to kill the job later if necessary (as in the scancel example above), or to check its status. For more information, see “man sbatch”.

squeue

You can use the squeue command to check the status of jobs in the batch queue system. Here’s an example of the basic usage on chip-cpu and for the user gobbert; to see your jobs, replace gobbert by your username on chip:

[gobbert@c24-41 Nodesused]$ squeue --cluster=chip-cpu -u gobbert
CLUSTER: chip-cpu
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            110031      2024 nodesuse  gobbert  R       0:01      2 c24-[41-42]
            110023      2024     bash  gobbert  R      47:04      1 c24-41

The most interesting column is the one titled ST for “status”. It shows what a job is doing at this point in time. The state “PD” indicates that the job is queued and waiting. When enough free processor cores become available, it changes to the “R” state and begins running. You may also see a job with status “CG”, which means it is completing (for instance, still writing stdout and stderr) and about to exit the batch system, or “CF”, which means its nodes are still being configured before the job starts. Other states are possible too; see “man squeue”. Once a job has exited the batch queue system, it no longer shows up in the squeue display.

We can also see several other pieces of useful information. The TIME column shows the walltime used by the job up to the present time. For example, job 110031 has been running for 1 second so far. The NODES column shows the number of nodes used by the job, and the NODELIST column shows which compute node(s) have been assigned to the job. For job 110031, the 2 nodes are c24-41 and c24-42, which is abbreviated as c24-[41-42].

scontrol

While a job is queued, while it is running, and for a limited period of time after it has finished (a few hours), the scontrol command gives access to the full details of the job. See “man scontrol” for complete information; here I just demonstrate the basic output of “show job”, using the job ID from above as example again:

[gobbert@c24-41 Nodesused]$ scontrol --cluster=chip-cpu show job 110031
JobId=110031 JobName=nodesused
   UserId=gobbert(32296) GroupId=pi_gobbert(1152) MCS_label=N/A
   Priority=1 Nice=0 Account=pi_gobbert QOS=shared
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:02 TimeLimit=00:05:00 TimeMin=N/A
   SubmitTime=2025-05-28T13:36:41 EligibleTime=2025-05-28T13:36:41
   AccrueTime=2025-05-28T13:36:41
   StartTime=2025-05-28T13:36:41 EndTime=2025-05-28T13:36:43 Deadline=N/A
   PreemptEligibleTime=2025-05-28T13:36:41 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-05-28T13:36:41 Scheduler=Main
   Partition=2024 AllocNode:Sid=c24-41:122675
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=c24-[41-42]
   BatchHost=c24-41
   NumNodes=2 NumCPUs=8 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=8,mem=8G,node=2,billing=8
   AllocTRES=cpu=8,mem=8G,node=2,billing=8
   Socks/Node=* NtasksPerN:B:S:C=4:0:*:1 CoreSpec=*
   MinCPUsNode=4 MinMemoryNode=4G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/gobbert/Nodesused/run-nodesused-n2ppn4mpi.slurm
   WorkDir=/home/gobbert/Nodesused
   StdErr=/home/gobbert/Nodesused/slurm.err
   StdIn=/dev/null
   StdOut=/home/gobbert/Nodesused/slurm.out
   Power=

This is very dense output, but if you read carefully, you can find all the details of the job, such as the number of nodes, the node list, the time limit, the directory of the job, the slurm script used, the stdout and stderr files, etc.

sinfo

The sinfo command also shows the current status of the batch system, but from the point of view of the SLURM partitions. Here is a sample output showing how many nodes in each partition are available:

[gobbert@chip ~]$ sinfo --cluster=chip-cpu
CLUSTER: chip-cpu
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
2024         up   infinite     12  alloc c24-[29-40]
2024         up   infinite     39   idle c24-[01-28,41-51]
2021         up   infinite      5    mix c21-[01-05]
2021         up   infinite     13   idle c21-[06-18]
2018         up   infinite      3  down* c18-[17,19,23]
2018         up   infinite      1  drain c18-39
2018         up   infinite      5    mix c18-[18,20-21,30,40]
2018         up   infinite     33   idle c18-[01-16,22,24-29,31-38,41-42]

To see more details of available equipment, use sinfo with options such as

sinfo -o "%10N %4c %10m %40f %10G"

A sample output of this is

[gobbert@chip ~]$ sinfo -o "%10N %4c %10m %40f %10G"
CLUSTER: chip-cpu
NODELIST   CPUS MEMORY     AVAIL_FEATURES                           GRES
c24-[14-51 64   476837     location=local,low_mem                   (null)
c24-[01-13 64   953674     location=local,high_mem                  (null)
c18-[01,05 36+  182524+    location=local                           (null)

CLUSTER: chip-gpu
NODELIST   CPUS MEMORY     AVAIL_FEATURES                           GRES
g20-[01,03 96   385581     RTX_2080TI,RTX_2080ti,rtx_2080TI,2080,20 gpu:8
g20-[12-13 96   238418     RTX_8000,rtx_8000,8000                   gpu:8
g24-[01-08 32   257443     L40S,l40s,L40s,l40S                      gpu:4
g24-[09-10 32   257443     h100,H100                                gpu:2
g20-[02,04 96   385581     RTX_2080TI,RTX_2080ti,rtx_2080TI,2080,20 gpu:6
g20-[05-11 96   385581     RTX_6000,rtx_6000,6000                   gpu:8

The key use of this output is to know how many nodes have which type of equipment and how to spell the feature names in order to request them in your srun or sbatch commands.
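
For example, the names in the AVAIL_FEATURES column can be requested with the standard SLURM --constraint flag. As an illustrative sketch (the account, partition, QOS, time, and memory values are assumptions modeled on the srun example later on this page; replace them with your own settings), a high-memory CPU node could be requested like this:

srun --cluster=chip-cpu --account=pi_gobbert --partition=2024 --qos=shared --time=00:05:00 --mem=4G --constraint=high_mem ./hello_serial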

Running Serial Hello World

This section assumes you have already compiled the serial “Hello, world!” example. Now we will see how to run it several different ways.

Interactive run on a compute node

The most obvious way is to run the program interactively, that is, from the Linux command-line on the compute node where you are while compiling.

[gobbert@c24-01 Hello_Serial]$ ./hello_serial
Hello world from c24-01

The reported hostname confirms that the program ran on the compute node c24-01, where I had an interactive session when performing these steps.

Batch run using srun without a slurm script

For jobs that take more than a few seconds, interactive running is not really appropriate. The srun command reserves a compute node for your job, runs the job there, and then releases the compute node again so others can use it. A basic srun command can be issued without a slurm script (see the next sub-section), meaning that you issue the srun command from the Linux command-line. However, note carefully that this srun command works only from the usernode, not if you are already in an interactive session on a compute node! This is in contrast to the sbatch command in the following sub-section, which can be issued either from the usernode or from a compute node.

srun --cluster=chip-cpu --account=pi_gobbert --partition=2024 --qos=shared --time=00:05:00 --mem=4G ./hello_serial

Batch run using sbatch with a slurm script


To submit a batch job, it is best to assemble all options for the job in a slurm script and then submit this file with sbatch.
Download the slurm script using wget to your workspace, where the executable “hello_serial” is already located.
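
As a rough sketch of what run-hello-serial.slurm might contain (the account, partition, QOS, and memory values are illustrative assumptions and should be replaced by the values appropriate for your project):

#!/bin/bash
#SBATCH --job-name=hello_serial       # name shown in squeue
#SBATCH --output=slurm.out            # file for stdout
#SBATCH --error=slurm.err             # file for stderr
#SBATCH --account=pi_gobbert          # project account (assumption)
#SBATCH --partition=2024              # partition = group of nodes (assumption)
#SBATCH --qos=shared                  # QOS (assumption)
#SBATCH --time=00:05:00               # maximum walltime
#SBATCH --nodes=1                     # number of nodes
#SBATCH --ntasks-per-node=1           # tasks per node; 1 node with 1 task = serial job
#SBATCH --mem=4G                      # memory per node (assumption)

./hello_serial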

Here, the job-name flag simply sets the string that is displayed as the name of the job in squeue. The output and error flags set the file names for capturing standard output (stdout) and standard error (stderr), respectively. The partition flag chooses the partition of the CPU cluster to request for the job to run on, and the qos flag selects a QOS appropriate for the job. Since this particular “Hello, world!” job should run for only a brief moment, the time flag provides a more precise estimate of the maximum possible time for the job to take. After a job has reached its time limit, it is stopped by the scheduler. This is done to ensure that everyone has a fair chance to use the cluster. The next two flags set the total number of nodes requested and the number of MPI tasks per node; by choosing both of these as 1, we are requesting space for a serial job. Now we are ready to submit our job to the scheduler. To accomplish this, use the sbatch command as follows

[gobbert@c24-01 Hello_Serial]$ sbatch run-hello-serial.slurm
Submitted batch job 109328 on cluster chip-cpu

If the submission is successful, the sbatch command returns a job ID, here 109328. We can use this to check the status of the job (squeue) or to delete it (scancel) if necessary. To check on the running jobs of user “gobbert” in this example, use the squeue command as follows

[gobbert@c24-01 Hello_Serial]$ squeue --cluster=chip-cpu -u gobbert
CLUSTER: chip-cpu
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            109298      2024     bash  gobbert  R    1:38:59      1 c24-01

Notice that this is actually NOT the job that we just submitted, see the job ID! Rather, this is showing the job of my interactive session itself that I did the compiling and submission in. The issue is that this “Hello, world!” job just takes fractions of a second, so we are typically not able to catch it in squeue while it is running.
But looking at the files in the directory now, we see that the files slurm.err and slurm.out exist, so the job definitely ran.
If slurm.err is not empty, check the contents carefully as something may have gone wrong. The file slurm.out contains our stdout output; it should contain the hello world message from our program.

[gobbert@c24-01 Hello_Serial]$ ll
total 192
-rwxrwx--- 1 gobbert pi_gobbert 16600 May 27 16:46 hello_serial*
-rw-rw---- 1 gobbert pi_gobbert   184 Feb  1  2014 hello_serial.c
-rw-rw---- 1 gobbert pi_gobbert   628 May 27 17:06 run-hello-serial.slurm
-rw-rw---- 1 gobbert pi_gobbert     0 May 27 17:27 slurm.err
-rw-rw---- 1 gobbert pi_gobbert    24 May 27 17:27 slurm.out
[gobbert@c24-01 Hello_Serial]$ more slurm.err
[gobbert@c24-01 Hello_Serial]$ more slurm.out
Hello world from c24-01

Running Parallel Hello World

This section assumes you have already compiled the parallel “Hello, world!” example. Now we will see how to run it.

The slurm script is very similar to the serial slurm script in the options for sbatch at the top of the file, except that we choose 2 nodes with --nodes=2 and 4 processes per node with --ntasks-per-node=4. These are just examples: you can choose the number of nodes from 1 up to however many are available, and the number of processes per node from 1 up to the number of cores on one node; on a 2024 node, there are two 32-core CPUs for a total of 64 cores.

The real key difference in a parallel slurm script is the use of mpirun in front of the executable in the last line of the slurm script.
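
As a rough sketch of what run-hello-n2ppn4mpi.slurm might look like (the account, partition, QOS, memory, and the Intel MPI process-pinning lines are illustrative assumptions):

#!/bin/bash
#SBATCH --job-name=hello              # name shown in squeue
#SBATCH --output=slurm.out            # file for stdout
#SBATCH --error=slurm.err             # file for stderr
#SBATCH --account=pi_gobbert          # project account (assumption)
#SBATCH --partition=2024              # partition (assumption)
#SBATCH --qos=shared                  # QOS (assumption)
#SBATCH --time=00:05:00               # maximum walltime
#SBATCH --nodes=2                     # number of nodes
#SBATCH --ntasks-per-node=4           # MPI processes per node
#SBATCH --mem=4G                      # memory per node (assumption)

# pin the MPI processes to their assigned cores (optional; illustrative Intel MPI settings)
export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN=core

mpirun -print-rank-map ./hello_parallel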

The mpirun option -print-rank-map prints the hostnames of the assigned nodes and a list of the MPI ranks on each node to stdout; this is entirely optional, you do not have to use this option. The two lines before the mpirun call pin the MPI processes to their assigned cores, which can potentially make the job more efficient; these lines are also optional, you do not have to use them.

Submit the script to the batch queue system

[gobbert@c24-41 Hello_Parallel]$ sbatch run-hello-n2ppn4mpi.slurm
Submitted batch job 110025 on cluster chip-cpu

The parallel Hello World program also does not run long, but this time we managed to catch some output in squeue:

[gobbert@c24-41 Hello_Parallel]$ squeue --cluster=chip-cpu -u gobbert
CLUSTER: chip-cpu
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
110025      2024    hello  gobbert  R       0:01      2 c24-[41-42]
110023      2024     bash  gobbert  R      14:25      1 c24-41

[gobbert@c24-41 Hello_Parallel]$ squeue --cluster=chip-cpu -u gobbert
CLUSTER: chip-cpu
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
110023      2024     bash  gobbert  R      14:30      1 c24-41

This time, squeue shows the job running briefly, with 1 second of runtime so far, before it is again done quickly and gone from the queue in the second output of squeue. Notice again that the job with the name bash is my interactive shell and not the parallel Hello World job.

We now check the output that we obtained:

[gobbert@c24-41 Hello_Parallel]$ ll
total 192
-rwxrwx--- 1 gobbert pi_gobbert 16776 May 28 11:57 hello_parallel*
-rw-rw---- 1 gobbert pi_gobbert   490 Feb  1  2014 hello_parallel.c
-rw-rw---- 1 gobbert pi_gobbert   726 May 28 12:09 run-hello-n2ppn4mpi.slurm
-rw-rw---- 1 gobbert pi_gobbert     0 May 28 12:09 slurm.err
-rw-rw---- 1 gobbert pi_gobbert   539 May 28 12:09 slurm.out
[gobbert@c24-41 Hello_Parallel]$ more slurm.err
[gobbert@c24-41 Hello_Parallel]$ more slurm.out
(c24-41:0,1,2,3)
(c24-42:4,5,6,7)
Hello world from process 002 out of 008, processor name c24-41
Hello world from process 003 out of 008, processor name c24-41
Hello world from process 005 out of 008, processor name c24-42
Hello world from process 001 out of 008, processor name c24-41
Hello world from process 006 out of 008, processor name c24-42
Hello world from process 007 out of 008, processor name c24-42
Hello world from process 004 out of 008, processor name c24-42
Hello world from process 000 out of 008, processor name c24-41

We see that the error file slurm.err is empty again, indicating that no error occurred. The output file slurm.out lists the MPI ranks and the compute node that each ran on, here c24-41 for MPI ranks 0, 1, 2, 3 and c24-42 for MPI ranks 4, 5, 6, 7. Clearly, the output lines are mixed up and in random order. This is to be expected, as several output streams write to the same file. In any case, we clearly see that 2 nodes were used and that each node ran 4 MPI processes. The first two lines of slurm.out are produced by the -print-rank-map option to mpirun; they show the hostname and the MPI ranks run on each node.

Parallel Nodesused Run

Now we show the results obtained with the nodesused program from the Compile tutorial. The submission script for the nodesused program, with the obvious changes to the job-name and the mpirun line, reads as follows.
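
The following sketch shows what run-nodesused-n2ppn4mpi.slurm might look like, based on the scontrol output above for job 110031; the process-pinning lines are again illustrative assumptions.

#!/bin/bash
#SBATCH --job-name=nodesused          # name shown in squeue
#SBATCH --output=slurm.out            # file for stdout
#SBATCH --error=slurm.err             # file for stderr
#SBATCH --account=pi_gobbert          # project account
#SBATCH --partition=2024              # partition
#SBATCH --qos=shared                  # QOS
#SBATCH --time=00:05:00               # maximum walltime
#SBATCH --nodes=2                     # number of nodes
#SBATCH --ntasks-per-node=4           # MPI processes per node
#SBATCH --mem=4G                      # memory per node

# pin the MPI processes to their assigned cores (optional; illustrative Intel MPI settings)
export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN=core

mpirun -print-rank-map ./nodesused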

Running the code using sbatch gives the following results:

[gobbert@c24-41 Nodesused]$ ll
total 240
-rwxrwx--- 1 gobbert pi_gobbert 17504 May 28 12:40 nodesused*
-rw-rw---- 1 gobbert pi_gobbert  3809 Oct 22  2018 nodesused.c
-rw-rw---- 1 gobbert pi_gobbert   440 May 28 12:42 nodesused_cpuid.log
-rw-rw---- 1 gobbert pi_gobbert   320 May 28 12:42 nodesused.log
-rw-rw---- 1 gobbert pi_gobbert   721 May 28 12:41 run-nodesused-n2ppn4mpi.slurm
-rw-rw---- 1 gobbert pi_gobbert     0 May 28 12:42 slurm.err
-rw-rw---- 1 gobbert pi_gobbert   555 May 28 12:42 slurm.out
[gobbert@c24-41 Nodesused]$ more slurm.err
[gobbert@c24-41 Nodesused]$ more slurm.out
(c24-41:0,1,2,3)
(c24-42:4,5,6,7)
Hello world from process 0001 out of 0008, processor name c24-41
Hello world from process 0000 out of 0008, processor name c24-41
Hello world from process 0003 out of 0008, processor name c24-41
Hello world from process 0002 out of 0008, processor name c24-41
Hello world from process 0004 out of 0008, processor name c24-42
Hello world from process 0006 out of 0008, processor name c24-42
Hello world from process 0005 out of 0008, processor name c24-42
Hello world from process 0007 out of 0008, processor name c24-42

Notice that the node names confirm that the job was run on 2 nodes with 4 processes per node again. As before for the parallel “Hello, world!” program, the order of output lines to stdout is random.
But the listing of the file nodesused.log shows that our code in the nodesused() function ordered the output by the MPI process IDs:

[gobbert@c24-41 Nodesused]$ more nodesused.log
MPI process 0000 of 0008 on node c24-41
MPI process 0001 of 0008 on node c24-41
MPI process 0002 of 0008 on node c24-41
MPI process 0003 of 0008 on node c24-41
MPI process 0004 of 0008 on node c24-42
MPI process 0005 of 0008 on node c24-42
MPI process 0006 of 0008 on node c24-42
MPI process 0007 of 0008 on node c24-42