Job Priorities and Scheduling

Scheduling on a system such as Plato is complicated. Please contact us with any questions or requests for clarification about Plato scheduling. We have some capability to alter priorities under certain circumstances, if needed. We have tried to include enough information here for you to make decisions that optimize your computing workflow.

Plato uses "fairshare" scheduling, a common scheduling system used around the world, including at Compute Canada/WestGrid. 

The basic mechanism for determining the priority is called fairshare, in which target usage amounts are assigned to each project. In Plato's case, each faculty member is assigned a project group for them and all of their students/staff.

When considering which jobs to run, the scheduling software takes into account the past history (a scheduling window typically over a time span of a couple of weeks, with more recent usage weighted more heavily) and compares the amount of processing completed to the target assigned for the group.

There are a number of factors that affect the priority of the jobs waiting to run. Priorities of the jobs are raised or lowered so as to try to meet the fairshare targets. If a group has used more than their fairshare target within the scheduling window, their jobs will be deprioritized in that scheduling window. If a group has used less than their fairshare target in that same period, their jobs will have increased priority. See Fairshare Details for more information.
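The idea of comparing recent, weighted usage against a target can be illustrated with a small sketch. This is not the formula the Plato scheduler uses: the decay factor, the daily usage figures, the group names, and the simple "target share minus actual share" adjustment are all assumptions chosen for illustration.

```python
def weighted_usage(daily_usage, decay=0.5):
    """Sum past usage, counting older days less.

    daily_usage[0] is the most recent day. The decay factor is a
    hypothetical value, not a Plato setting.
    """
    return sum(used * decay ** age for age, used in enumerate(daily_usage))


def fairshare_adjustment(usage_by_group, target_share, decay=0.5):
    """Return target share minus actual weighted share for each group.

    A positive value means the group is under its target (priority goes up);
    a negative value means it is over its target (priority goes down).
    """
    weighted = {g: weighted_usage(u, decay) for g, u in usage_by_group.items()}
    total = sum(weighted.values())
    adjustments = {}
    for group, value in weighted.items():
        actual_share = value / total if total else 0.0
        adjustments[group] = target_share[group] - actual_share
    return adjustments


# Both groups used 300 core-hours in total, but group_a used them recently.
usage = {"group_a": [100, 100, 100], "group_b": [0, 0, 300]}
targets = {"group_a": 0.5, "group_b": 0.5}
adjustments = fairshare_adjustment(usage, targets)
for group, value in adjustments.items():
    print(group, "raised" if value > 0 else "lowered")
```

In this sketch both groups consumed the same total, yet group_a is deprioritized because its usage is more recent and therefore weighted more heavily.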

The batch system has different queues to accommodate submitted jobs based on run time and resources. However, for most practical purposes, submitting a job remains the same: sbatch <your_script>. You don't have to specify a queue at submission. The system automatically selects the most suitable execution queue for your job, depending on your credentials, the status of the queues, and the characteristics of the submitted job. To find out which queue your job has been directed to, use the command squeue -u <NSID>.

To take full advantage of the resources, and to ensure that the system selects the right queue for your job, you may want to adjust some of the SBATCH directives in your submission scripts, including walltimes and total resources requested, according to the details listed below.

  1. The maximum walltime across all queues is 3 weeks. This is the longest walltime possible on Plato. No job with a specified walltime longer than 3 weeks will be accepted. If you don't specify a walltime in your script, a default walltime of 8 hours will be assigned to your job. Any job is cancelled automatically once its walltime is reached.
  2. Any researcher can have up to 30 jobs assigned to the researchers queue (running and queued combined). Once this limit is reached, you can still submit as many jobs as you want; however, these jobs will be directed, if possible, to the common queue.
  3. The common queue has a slightly lower quality of service: maximum walltime is 3 days, default walltime is just 1 hour, and the priority is lower. This means that, whenever resources become available, any job queued in researchers will run before any job queued in common. So, as a good practice, you should submit your larger (or more important) jobs first, so that they fit in the researchers queue. In some cases, it will also be beneficial to aggregate hundreds of serial jobs into a smaller number of larger jobs. The advanced computing analysts can help you script this.
  4. There are some cores dedicated to rush situations. When all the cores assigned to the normal queues (researchers, students and common) are in use, you can still run your job in the rush queue if you need the results within a short period of time. For a job to be accepted in the rush queue, the specified walltime must be less than or equal to 4 hours. The default walltime for rush is 10 minutes. Each user can have only one job running in rush at a time.
  5. There is an absolute limit for every user: the number of processor-seconds (PS). The PS is calculated as the number of cores assigned to all your running jobs, multiplied by the remaining walltime of all of them. So, for example, if you have 1 job running on 3 cores and it will take 2 hours for this job to reach its assigned walltime, then you have, at this exact moment, 3 (cores) x 2 (hours) x 60 (minutes) x 60 (seconds) = 21,600 PS assigned.
  6. The maximum amount of PS that the researchers of a group can have all together, at any given time, is MAXPS = 580,608,000. This means that your group could use up to 320 cores (20 nodes) for three weeks, the maximum walltime allowed in the researchers queue, or, for example, the entire cluster (1536 cores) for 105 hours. A submitted job will not run if it would make the number of PS assigned to your group larger than MAXPS.
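The processor-second arithmetic in items 5 and 6 can be checked with a short sketch. It assumes that PS is summed job by job (cores times remaining walltime for each running job) and that a new job counts with its full requested walltime; the function names are hypothetical and are not part of any scheduler tool.

```python
HOUR = 60 * 60          # seconds
WEEK = 7 * 24 * HOUR    # seconds
MAXPS = 580_608_000     # group limit quoted in item 6


def processor_seconds(jobs):
    """Total PS for jobs given as (cores, remaining_walltime_seconds) pairs.

    Assumption: PS is the per-job product cores x remaining walltime,
    summed over all running jobs.
    """
    return sum(cores * remaining for cores, remaining in jobs)


def would_exceed_maxps(running_jobs, new_job, maxps=MAXPS):
    """True if starting new_job would push the group's PS above maxps."""
    return processor_seconds(running_jobs) + processor_seconds([new_job]) > maxps


# Item 5: one job on 3 cores with 2 hours of walltime left.
print(processor_seconds([(3, 2 * HOUR)]))            # 21600

# Item 6: both scenarios land exactly on MAXPS.
print(processor_seconds([(320, 3 * WEEK)]) == MAXPS)     # True
print(processor_seconds([(1536, 105 * HOUR)]) == MAXPS)  # True
```

A group already holding 320 cores for three weeks has no PS left, so under these assumptions would_exceed_maxps returns True for any additional job until some of that remaining walltime has elapsed. Requesting a realistic walltime therefore leaves more room for the rest of your group.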

Fairshare Details

Priorities indicate which jobs will be started next: the job in the queue with the most positive (or least negative) priority will be selected from all the jobs available to be scheduled. This has a few implications:

  1. No matter how negative the priority is, if the system is not completely full at time of job submission, the job will start. That is, if there are free nodes and no other jobs in the queue, any job will start, regardless of prior usage patterns.
  2. Assuming all the nodes are full and jobs are waiting in the queue, it doesn't matter how many jobs any given user has submitted, or when they submitted those jobs. The next job to be scheduled will be the job with the highest priority in the queue.
  3. Scheduling, priority, and usage are based on the amount of resources made unavailable to others. Getting a good bound on the resource consumption of your code can lead to increased priorities and shorter waits in the queue. The memory and amount of CPU that you request can be viewed as a contract with the scheduler: you indicate that you need at least, but no more than, the amount of resources you request. The extra information you provide lets the scheduler slot you onto a node more quickly, perhaps, than if you had requested more resources; in return, the scheduler has the right to cancel your job if you use more than you requested.
  4. Fairshare scheduling is currently done at the group level (i.e. per faculty group). Within groups reporting to the same supervisor, scheduling will be close to first in, first out, as all jobs submitted within a research group will have nearly the same priority at all times. Users within a research group are expected to coordinate their usage with each other, as the fairshare system provided by Maui does not apply within a group.
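The selection rule described above can be sketched as follows. The Job fields, the priority values, and the use of submission order as a tie-breaker are assumptions made for illustration; the real scheduler combines more factors than this.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    group: str
    submit_order: int   # lower means submitted earlier
    priority: float     # hypothetical group-level fairshare priority


def next_job(queue):
    """Pick the waiting job with the highest priority.

    Submission order only breaks ties, which is why jobs from one group
    (sharing nearly the same priority) run close to first in, first out.
    """
    return max(queue, key=lambda job: (job.priority, -job.submit_order))


waiting = [
    Job("a1", "group_a", submit_order=1, priority=-50.0),  # submitted first
    Job("b1", "group_b", submit_order=2, priority=10.0),
    Job("b2", "group_b", submit_order=3, priority=10.0),
]

first = next_job(waiting)
waiting.remove(first)
second = next_job(waiting)
print(first.name, second.name)
```

Here group_a submitted first but has used more than its target, so both group_b jobs start ahead of it, in the order they were submitted.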