Wednesday, August 01, 2007

Supercomputing Course: Batch Scripts

A supercomputer is basically a powerful computer consisting of many interlinked CPUs. It has multiple users who are all competing for a share of the computational resources. A big question is how to best launch jobs and schedule them fairly, making users feel like they are getting their fair share of the machine and preventing any one user from hogging all the resources.

One answer to that question is to implement a batch system. A batch system takes user jobs, puts them in a queue, and launches them when resources become available. Job scheduling is an NP-complete problem, so there's no way to run the jobs absolutely optimally, but scheduler programs decide when jobs should run based on the machine's scheduling policies, taking into account a user's priority, the length of the job, the number of nodes requested, and the length of time the job has been sitting in the queue. Many machines use the Portable Batch System (PBS), which was originally developed for NASA in the 1990s. Many machines also use the Maui Scheduler, which is actively developed and well-supported by much of the computing community. (It's called the Maui Scheduler because it was originally developed for the Maui High-Performance Computing Center. If you become a supercomputing expert, you might be able to get a job there. Back in the day, I tried, but alas, they didn't have any suitable openings at the time!)

There are a few basic concepts to understand. When you run a job on a supercomputer, you tell the scheduler how many processors you want and for how long. You want to give the scheduler the lowest time estimate you can without going under the time you'll actually need, because it will terminate your job if it runs past its limit. Furthermore, there are limits on the walltime (the time that elapses between the beginning and end of your job) and on the number of processors, so if your request exceeds those limits, it will be automatically rejected.

Your basic script will take the following form: it will have some PBS commands, then some variable initializations and the like, and finally invoke your program. Here is a very basic script:

#!/bin/bash
#PBS -V
#PBS -j oe
#PBS -m ae
#PBS -M rebecca@fictitious.com
#PBS -N loadbal
#PBS -l walltime=00:10:00,nodes=2:ppn=2
#PBS -q workq

EXEC=${PBS_O_WORKDIR}/myprog
INPUT_FILE=${PBS_O_WORKDIR}/prog_input.dat

mpiexec $EXEC $INPUT_FILE

Let's go through the script line by line and see what it means. The first line means that we want to use the bash shell when we run this script. There are a lot of different shells out there, but you don't have to know much about them beyond the fact that the third- and second-from-last lines in this script (the variable assignments) would use a different syntax if we were using the csh or tcsh shells.
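Here's a tiny sketch of what I mean, with the csh/tcsh equivalents shown only as comments for comparison:

# bash syntax, as used in the script above:
EXEC=${PBS_O_WORKDIR}/myprog
INPUT_FILE=${PBS_O_WORKDIR}/prog_input.dat

# if this were a csh or tcsh script, the same assignments would use "set" instead:
#   set EXEC = ${PBS_O_WORKDIR}/myprog
#   set INPUT_FILE = ${PBS_O_WORKDIR}/prog_input.dat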

The second line means I want my environment variables to be exported to the run environment. Environment variables are preset variables that evaluate to useful information, e.g. if you type echo $PATH at a command prompt, the result is a list of all the directories, separated by colons, in which *nix will look for executable files.
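For instance, you can poke at a few common environment variables at any command prompt (the exact values will of course depend on your machine and account):

echo $PATH        # the colon-separated list of directories searched for executables
echo $HOME        # your home directory
env | grep PBS    # inside a running batch job, this lists the PBS_* variables the scheduler sets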

The next line tells PBS to join standard error (e) and standard output (o) into the same output file. This is useful because it can help you pinpoint exactly where in your program's output an error occurred. The two lines after that tell PBS to e-mail me when the job aborts or ends (that's what the a and e in -m ae mean) at the address given with -M. The -N option tells PBS that the name of this run is "loadbal", and it will name the output file accordingly, appending .o${JOBNUM} (where ${JOBNUM} is the number that PBS assigns to my job when I submit it to the queue).
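So, for example, if PBS happened to assign my job the (made-up) number 12345, once the job finished I could read the combined output and error with:

cat loadbal.o12345    # standard output and standard error, joined by the -j oe option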

The next line is the most important: I request a wall time (the amount of time that elapses on a clock on the wall as my job runs) of ten minutes, using two nodes of the machine and two processors per node (ppn). Most machines today have at least two processors bundled together into what's called a node. You can use one processor per node if you so desire, but usually you want to use all the processors that are available, unless you have some sort of very memory-intensive application.
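As a hypothetical example, if I had a very memory-hungry code and wanted each process to have a whole node's memory to itself, I could request one processor per node instead:

#PBS -l walltime=00:10:00,nodes=4:ppn=1    # 4 processes, each alone on its own node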

The final line of the PBS commands tells PBS which queue to put my job in. A queue is a set of jobs waiting to be run. The scheduler selects jobs from the queue based on the scheduling policies of the machine and the available resources.
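On most PBS installations you can also peek at the queues themselves; on machines I've used, something like this works (the queue names and limits will differ from machine to machine):

qstat -q    # list the queues and their limits (walltime, node counts, etc.)
qstat -Q    # show the current status of each queue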

Next I set the variables EXEC and INPUT_FILE, as a matter of convenience. That way, if I change the name of my executable or input file, I only have to change it in one place. The PBS_O_WORKDIR variable is a PBS environment variable containing the directory I launched the job from. Just like we've seen in Makefiles, we evaluate a variable by preceding it with a $. The curly braces indicate that the enclosed word is the variable name; we use them to prevent ambiguity.
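Here's a quick sketch of why the curly braces matter (the names here are made up for illustration):

RESULT=output
echo $RESULT_v2      # bash looks for a variable named "RESULT_v2", which doesn't exist, so this prints a blank line
echo ${RESULT}_v2    # the braces end the variable name at RESULT, so this prints "output_v2"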

You launch your job by typing qsub myscript (where myscript is the name of your script). If your job is outright rejected (i.e., your request for resources was outside the rules; for example, you requested more wall time than is allowed), you will see a message immediately. Otherwise, you should see a job number or other identifier for your job. If you type qstat you can see the jobs in the queue and the state each one is in: which jobs are running, how many processors they're using, and so on. Typing qstat -u username shows only your jobs. There are other commands, but they will vary depending on your machine and what's implemented there.
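A typical session might look something like this (the job number is made up, and the exact output format varies from machine to machine):

qsub myscript        # submit; PBS replies with an identifier such as 12345.machinename
qstat -u username    # check on just my jobs
qdel 12345           # change my mind and remove job 12345 from the queue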

In fact, even the commands that I've described may be different. Check your machine's documentation before trying any of this. In particular, IBM supercomputers always seem to have their own unique scripting rules and a batch system called LoadLeveler.

Next topic: Concepts of parallelism
