The school has a small compute cluster composed of the workstations in Trottier. You can submit jobs from mimi, and they can run simultaneously on 32 machines with GPUs.
When you submit jobs to the cluster, the code that runs on each machine must be entirely self-contained, so the cluster is best suited to experiments where you run the same piece of code repeatedly. An obvious example is tuning a hyperparameter of a machine learning model.
To begin using the cluster, first log in to mimi via ssh. mimi is the control host for the cluster, so you will use it to interact with the cluster.
To get some information about the state of the cluster, use the sinfo command:
canton14@teach-vw2:~$ sinfo
PARTITION     AVAIL  TIMELIMIT  NODES  STATE  NODELIST
teaching-gpu*    up    1:00:00      3  down*  open-gpu-[1,5,8]
teaching-gpu*    up    1:00:00     12  drain  open-gpu-[2-4,6-7,9-12,14-16]
teaching-gpu*    up    1:00:00     17  idle   open-gpu-[13,17-32]
At this time, the partition teaching-gpu has 3 nodes down, 12 nodes in a "drained" state, and 17 nodes idle. Down nodes are powered off, drained nodes currently do not have enough free resources to be assigned new jobs (perhaps they are being used by someone else), and idle nodes are ready for you to use. The asterisk by the teaching-gpu partition name indicates that it is the default partition. As we add more partitions, you may need to select the one you want to use when running commands.
Since this cluster is shared amongst all CS students, there is a time limit of one hour on all jobs. If your job takes longer than that, it will be terminated early, so please ensure that your experiment will take less than an hour before running it on the cluster. Depending on usage, we may adjust the time limit.
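If you are not sure whether a script will finish in time, one option is to build a time budget into the script itself. The following is only a sketch: the 55-minute budget and the run_one_trial() placeholder are illustrative, not part of the cluster configuration.

#!/usr/bin/env python3
import time

BUDGET_SECONDS = 55 * 60  # stop with some margin under the one-hour limit

def run_one_trial():
    # Placeholder for one self-contained unit of work, e.g. one training run.
    time.sleep(1)

start = time.time()
while time.time() - start < BUDGET_SECONDS:
    run_one_trial()
print('Stopping before the time limit')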
Once you have found a partition that will be able to run your job, you can freely launch it. First, let's launch a simple example: running the hostname command on 9 nodes. To do this, we will use the srun command on mimi.
canton14@teach-vw2:~$ srun -p teaching-gpu -N9 /bin/hostname
open-gpu-21
open-gpu-24
open-gpu-20
open-gpu-19
open-gpu-22
open-gpu-17
open-gpu-18
open-gpu-25
open-gpu-23
As you can see, all it took was prepending the srun command and some options to the command we wanted to run in order to distribute it across the nodes.
The -p teaching-gpu argument specifies the partition to use. Since teaching-gpu is currently the default (and only) partition, we could have omitted it. The -N9 argument asks for the command to run on 9 nodes. If we only needed 9 CPU cores rather than 9 whole nodes, we could have used the -n9 option instead.
canton14@teach-vw2:~$ srun -p teaching-gpu -n9 /bin/hostname
open-gpu-17
open-gpu-17
open-gpu-17
open-gpu-17
open-gpu-17
open-gpu-18
open-gpu-18
open-gpu-18
open-gpu-18
The benefit of using -N is that each process gets the full RAM and GPU of its node, whereas the benefit of using -n is that each machine has 8-12 cores, so you can run 8-12 times as many processes.
You can also launch an interactive shell across a set of nodes using the salloc command with the same arguments as srun. You can then prefix commands with srun (without arguments) and they will be run across the nodes you previously selected.
canton14@teach-vw2:~$ ps
  PID TTY          TIME CMD
 1229 pts/129  00:00:00 bash
 2453 pts/129  00:00:00 ps
canton14@teach-vw2:~$ salloc -N 10
salloc: Granted job allocation 31
canton14@teach-vw2:~$ ps
  PID TTY          TIME CMD
 1229 pts/129  00:00:00 bash
 1301 pts/129  00:00:00 salloc
 1303 pts/129  00:00:00 bash
 1319 pts/129  00:00:00 ps
canton14@teach-vw2:~$ hostname
teach-vw2
canton14@teach-vw2:~$ srun hostname
open-gpu-19
open-gpu-17
open-gpu-25
open-gpu-23
open-gpu-21
open-gpu-26
open-gpu-18
open-gpu-22
open-gpu-24
open-gpu-20
As you can see from the output of ps, the current shell is running as a subprocess of salloc. To quit salloc, just exit your shell as normal.
Say you have a (very bad) model which takes an integer as input, adds some random value between zero and one, and returns that as the output. You could evaluate it for ten different inputs in parallel by submitting an array job with sbatch:
canton14@teach-vw2:~$ cat batch_job.py
#!/usr/bin/env python3
import os
import random
jobid = os.getenv('SLURM_ARRAY_TASK_ID')
result = int(jobid) + random.random()
print(jobid, result)
canton14@teach-vw2:~$ sbatch --array=1-10 -N 10 batch_job.py
Submitted batch job 92
canton14@teach-vw2:~$ cat slurm-92_*
10 10.551001157018066
1 1.7942053382823158
2 2.781956597945983
3 3.9022921961241126
4 4.063291931356006
5 5.501764124355088
6 6.5673130218314775
7 7.08193661136367
8 8.412695441528129
9 9.234586397610121
The results of each job will be put in a file named slurm-92_n.out, where n is the array task ID, so you will get one output file per task. You can change that behaviour using the -o option to sbatch.
This pattern would allow you to tune a hyperparameter of your model. You would just need to have your script train the model, with the hyperparameter set via the SLURM_ARRAY_TASK_ID environment variable. Unfortunately, this variable can only be an integer, so you will have to transform it as needed in your code.
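For example, a minimal sketch of that transformation, assuming a made-up grid of learning rates and a placeholder train() function (neither of which is part of the cluster setup), might look like this:

#!/usr/bin/env python3
import os

# Hypothetical grid: map each array task ID (1-10) to a learning rate.
LEARNING_RATES = [0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0, 3.0]

def train(learning_rate):
    # Placeholder for your actual training and evaluation code.
    return 1.0 / (1.0 + learning_rate)

task_id = int(os.getenv('SLURM_ARRAY_TASK_ID'))
lr = LEARNING_RATES[task_id - 1]  # task IDs from --array=1-10 are one-based
score = train(lr)
print(task_id, lr, score)

You would submit this the same way as batch_job.py above, e.g. with sbatch --array=1-10, and each array task would train with a different learning rate.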