Running Jobs on Socrates

(TORQUE/Maui batching system)

All jobs on Socrates must be run through the scheduler. Do not run jobs on the head node, and do not start jobs directly on the compute nodes.

Batch job scripts are UNIX shell scripts: text files of commands for the UNIX shell to interpret, much like what you could execute by typing directly at a keyboard. They also contain special comment lines holding TORQUE directives. TORQUE evolved from software called PBS (Portable Batch System). As a consequence of that history, TORQUE directive lines begin with #PBS, some environment variables contain "PBS" (such as $PBS_O_WORKDIR in the script below), and the script files themselves typically have a .pbs suffix (although that is not required).

You should always specify the amount of RAM your job needs per node; otherwise the default amount is used. At the time of this writing, the default is 950 MB (megabytes). The script below includes an example memory request.

Executables compiled with the OpenMPI API can be run in batch files simply with the command:

mpirun [options] program [arguments]

Socrates - Sample Script:

Here is an example job script, diffuse.pbs, for a job to run an OpenMPI program named diffuse. The command

#PBS -l nodes=3:ppn=2

requests 2 processors on each of 3 computers, for a total of 6 cores.

The adjacent command

#PBS -l mem=2GB

requests that 2 GB (2 gigabytes) of RAM be allocated for this job.

#!/bin/sh
#Sample PBS Script for use with OpenMPI on Socrates
#Jason Hlady

# Begin PBS directives for defaults
# All TORQUE directives (batch scheduler commands) begin with #PBS for historical reasons

# Default is for serial job: one processor on one node
# can override this with qsub at the command line, or alter it in the script
# in the form of nodes=X:ppn=Y
# X = number of computers : Y = number of processors per computer

#PBS -l nodes=3:ppn=2
#PBS -l mem=2GB

# There are other directives to control the maximum time your job will take.
# These are walltime and cput. Both use the format hours:minutes:seconds (hh:mm:ss)
# This would stop your job after 3 days: #PBS -l walltime=72:00:00
# This would stop your job after 200 CPU-hours (total): #PBS -l cput=200:00:00
# There are also maximum values that cannot be overridden.

# Job name which will show up in queue, job output
# Remove the second # from the line below and supply a job name
##PBS -N jobname

#Optional: join error and output into one stream
#PBS -j oe


#------------------------------------------------------
# Debugging section for PBS
echo "Node file $PBS_NODEFILE :"
echo "---------------------"
cat $PBS_NODEFILE
echo "---------------------"
echo "Shell is $SHELL"
NUM_PROCS=`/bin/awk 'END {print NR}' $PBS_NODEFILE`
echo "Running on $NUM_PROCS processors."
echo "which mpirun = `which mpirun`"
#-------------------------------------------------------

### Run the application
# shows what node the app started on--useful for serial jobs
echo `hostname`
cd $PBS_O_WORKDIR
echo "Current working directory is `pwd`"
echo "Starting run at: `date`"

# Change the program name "./diffuse" in the line between the hash marks
# below to the name of your executable

############
mpirun ./diffuse
############

echo "Program finished with exit code $? at: `date`"
exit 0
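The NUM_PROCS line in the script's debugging section works because $PBS_NODEFILE lists one hostname per allocated processor slot, so counting its lines gives the total core count. You can try the same counting trick outside the scheduler with a mock node file (the hostnames below are made up):

```shell
# Fake node file: 3 nodes x 2 processors each, as nodes=3:ppn=2 would produce
printf 'node01\nnode01\nnode02\nnode02\nnode03\nnode03\n' > mock_nodefile

# Same counting trick as in the script: awk's NR holds the line count at END
NUM_PROCS=`awk 'END {print NR}' mock_nodefile`
echo "Running on $NUM_PROCS processors."
```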
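Note that the $? on the script's final echo line holds mpirun's exit status, but the script then exits 0 regardless, so the scheduler always records success. If you would rather have the job's recorded status reflect your program's, capture the status and pass it through. A minimal sketch (not part of the sample script; 'true' stands in for "mpirun ./diffuse"):

```shell
# Assumption: you want the scheduler to see your program's real exit code.
# 'true' stands in here for "mpirun ./diffuse".
true
status=$?                  # save the code before any later command overwrites $?
echo "Program finished with exit code $status at: `date`"
exit $status               # propagate the real status instead of a fixed exit 0
```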

To submit the script called diffuse.pbs to the batch job handling system, use the qsub command as below:
qsub diffuse.pbs
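As the script's comments note, #PBS directives can also be overridden on the qsub command line. A sketch, reusing diffuse.pbs with hypothetical resource values (command-line -l options take precedence over the #PBS lines in the script):

```shell
# Request 4 processors on 1 node and 4 GB of RAM, overriding the script's
# #PBS -l lines for this submission only
qsub -l nodes=1:ppn=4,mem=4GB diffuse.pbs
```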

To check on the status of all the jobs on the system, type:
qstat

To limit the listing to show just the jobs associated with your user name, type:
qstat -u username

To delete a job, use the qdel command with the jobid assigned from qsub:
qdel jobid