survivor: Technical Document: Scheduling

SURVIVOR Technical Document: Scheduling

About This Document
Scheduling
- Staggered Scheduling
- Transient Failure Scheduling

About This Document

This document contains details of the SURVIVOR implementation. It contains no information necessary for the operation of the systems monitor that is not available in the general sections of this manual.

Scheduling

The SURVIVOR scheduler (ss) is responsible for scheduling the execution of checks and alerts. Ordinarily, the scheduler tracks the schedules configured in schedule.cf as closely as possible. Under certain circumstances, however, the scheduler (unless told otherwise) will adjust these schedules using a collection of methods called smart scheduling.

Staggered Scheduling

For large sites, checking all hosts for a service simultaneously can result in slowdowns and timeouts. While the threaded check utilities in libcm and Survivor::Check.pm will prevent too many hosts from being checked simultaneously, if the number of hosts is large enough, timeouts will be reported because the master check process will not be able to complete in time.

If smart scheduling is enabled, staggered scheduling will be used to spread the checking of the hosts over the interval defined for the schedule in effect. Staggered scheduling is used only when at least a minimum number of hosts are monitored for the same service on the same schedule.

For each service, the scheduler will sort the monitored hosts according to the schedules on which they are monitored. The scheduler will then determine the interval for each schedule, divide it by the timeout for the check to obtain the number of check runs that fit within one interval, and spread the hosts across those runs. Once the maximum number of hosts has been scheduled, only hosts in a failed state (where the return code is not MODEXEC_OK) will be added to the list of hosts to be checked until the next run of the scheduler.
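The selection rule in the paragraph above can be sketched roughly as follows. This is an illustrative Python sketch, not the scheduler's actual code: the function name select_hosts, the last_rc mapping, the ordering of the input list, and the numeric value of MODEXEC_OK are all assumptions made for the example.

```python
MODEXEC_OK = 0  # assumed numeric value, used here only for illustration


def select_hosts(hosts, last_rc, max_hosts):
    """Pick the hosts to check in one scheduler run (hypothetical helper).

    hosts     -- hosts due on this schedule, in the order they should be tried
    last_rc   -- mapping of host name to its most recent check return code
    max_hosts -- maximum number of hosts to schedule in a single run
    """
    # Fill the run up to the maximum, regardless of host state.
    selected = list(hosts[:max_hosts])
    # Beyond the maximum, only hosts in a failed state are added.
    for host in hosts[max_hosts:]:
        if last_rc.get(host, MODEXEC_OK) != MODEXEC_OK:
            selected.append(host)
    return selected
```

Hosts with no recorded return code are treated as healthy in this sketch; that choice is also an assumption.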

For example, suppose the service ping is checked on 100 hosts using a schedule with a frequency of every 15 minutes and on another 50 hosts using a schedule with a frequency of every 30 minutes, and the service has a timeout of three minutes. The maximum number of hosts to be scheduled simultaneously from the first group will be 100 / (15 / 3) = 20, while the maximum number of hosts from the second group will be 50 / (30 / 3) = 5.
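The arithmetic in this example can be reproduced with a short Python sketch. The function name max_hosts_per_run is hypothetical, and the rounding behavior (rounding up when the division is not exact, and never using fewer than one run per interval) is an assumption, since the example only covers values that divide evenly.

```python
import math


def max_hosts_per_run(num_hosts, interval_minutes, timeout_minutes):
    """Maximum hosts to schedule at once for one service and schedule."""
    # Number of check runs that fit within one schedule interval.
    # Assumption: at least one run, even if the timeout exceeds the interval.
    runs = max(1, interval_minutes // timeout_minutes)
    # Spread the hosts across those runs.
    # Assumption: round up so that every host is covered within the interval.
    return math.ceil(num_hosts / runs)


print(max_hosts_per_run(100, 15, 3))  # 20
print(max_hosts_per_run(50, 30, 3))   # 5
```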

Transient Failure Scheduling

If a check module fails to return any results for any hosts, it may be due to a transient failure of the module. (Programmers aren't perfect, after all.) If smart scheduling is enabled and a check module fails to return any results without timing out, the scheduler will wait 30 seconds and then retry. If the module fails a second time, it will be marked as misconfigured.
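The retry behavior can be sketched as below. This is a minimal Python illustration under stated assumptions, not the scheduler's implementation: run_check_with_retry, CheckTimeout, the run_module callable, and the returned status strings are all hypothetical names. Only the 30 second delay and the single retry come from the description above.

```python
import time


class CheckTimeout(Exception):
    """Hypothetical signal that the check module timed out."""


def run_check_with_retry(run_module, retry_delay=30, sleep=time.sleep):
    """Run a check module, retrying once if it returns no results.

    run_module is assumed to return a mapping of host name to return code,
    or to raise CheckTimeout. A timeout is not retried here; it propagates
    to the caller, since the retry applies only when the module does not
    time out.
    """
    for attempt in range(2):
        results = run_module()
        if results:
            return "ok", results
        if attempt == 0:
            # First empty result: possibly transient, so wait and retry.
            sleep(retry_delay)
    # Second empty result: treat the module as misconfigured.
    return "misconfigured", {}
```

The sleep function is passed in as a parameter so the delay can be replaced in tests; that is a convenience of this sketch only.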

$Date: 2006/11/20 02:53:03 $
$Revision: 0.4 $
