survivor: Technical Document: Locking

SURVIVOR Technical Document: Locking

About This Document
Locking

About This Document

This document describes details of the SURVIVOR implementation. Everything needed to operate the systems monitor is available in the general sections of this manual; nothing here is required reading for operators.

Locking

The SURVIVOR scheduler (ss) is written as a multi-threaded application in order to facilitate the concurrent execution of the various modules within the system. In addition, other parts of the package, notably the user interfaces sc, sg, and sw, can operate concurrently with the scheduler.

Given all these competing entities demanding access to the resources of the system, including the state storage, locking must be implemented in order to prevent corrupt or inaccurate data. This document describes the locking implemented by SURVIVOR.

State Locking

The most critical resource is the state hierarchy. Host state, service state, and calllist state are protected by lockfiles controlled by lockf(). This ensures different components of the system do not read partial or incorrect information or perform simultaneous writes to the state files.
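As a rough illustration of lockfile-based protection, the sketch below serializes reads and writes of a state file behind an exclusive lock on a separate lockfile. It is not the SURVIVOR code: the helper names (locked, write_state, read_state) and the file layout are hypothetical, and Python's fcntl.lockf() stands in for the C lockf() call.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def locked(lockfile_path):
    """Hold an exclusive lock on lockfile_path for the duration of the block.

    Hypothetical helper. The descriptor is opened read/write because an
    exclusive lock requires a writable descriptor.
    """
    fd = os.open(lockfile_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX)  # blocks until the lock is granted
        try:
            yield
        finally:
            fcntl.lockf(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)


def write_state(state_path, lockfile_path, text):
    """Write state while holding the lock, so no reader sees a partial file."""
    with locked(lockfile_path):
        with open(state_path, "w") as f:
            f.write(text)


def read_state(state_path, lockfile_path):
    """Read state while holding the lock, so no writer is mid-update."""
    with locked(lockfile_path):
        with open(state_path) as f:
            return f.read()
```

Locks of this kind are advisory: they only protect the state files if every component that touches them takes the lock first.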

When state is updated, the mtime of the lockfile protecting that state is also updated. The scheduler may cache state information to reduce disk access, and by checking the modification time of the lockfile it can determine whether the cached copy is still valid or whether the state must be read from disk again. Note that in the current implementation only check state is cached. Because the same lockfile protects both check and alert state, if alert state were also cached, updating either check or alert state would invalidate the cache of both for the service@host in question. (This could be addressed by using atime vs mtime, or by creating a new file to be stat'd.)
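The mtime check can be sketched as follows. This is only an illustration of the idea: CachedState, update_state, and the disk_reads counter are hypothetical names, and the locking itself is omitted to keep the example short.

```python
import os


class CachedState:
    """Cache the contents of a state file, keyed on the lockfile's mtime."""

    def __init__(self, state_path, lockfile_path):
        self.state_path = state_path
        self.lockfile_path = lockfile_path
        self._mtime = None   # lockfile mtime seen when the cache was filled
        self._value = None   # cached contents of the state file
        self.disk_reads = 0  # counts real reads, for illustration only

    def read(self):
        # A stat() of the lockfile is cheaper than re-reading the state.
        mtime = os.stat(self.lockfile_path).st_mtime_ns
        if mtime != self._mtime:
            with open(self.state_path) as f:
                self._value = f.read()
            self._mtime = mtime
            self.disk_reads += 1
        return self._value


def update_state(state_path, lockfile_path, text):
    """Write new state, then bump the lockfile mtime to invalidate caches."""
    with open(state_path, "w") as f:
        f.write(text)
    os.utime(lockfile_path)  # sets atime and mtime to the current time
```

One consequence of this design is that the cache is only as precise as the filesystem's timestamp resolution: two updates that land on the same timestamp would look like one.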

See the History Locking section immediately following for important additional information.

History Locking

History files are locked since concurrent updates are possible (although rare). The locking mechanism is similar to state locking, but uses a different filename and does not update the mtime.

If a service needs to lock both state and history simultaneously, the locks must always be taken in the same order so that two components cannot deadlock waiting on each other. The correct order is:

Lock state
Lock history
Unlock history
Unlock state
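The ordering above maps naturally onto nested scopes, where the history lock is taken and released entirely inside the state lock. The sketch below assumes separate lockfiles for state and history; the helper names and the trace list are hypothetical and exist only to make the order visible.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def locked(lockfile_path, name, trace):
    """Hold an exclusive lock on lockfile_path, recording events in trace."""
    fd = os.open(lockfile_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX)
        trace.append("lock " + name)
        try:
            yield
        finally:
            fcntl.lockf(fd, fcntl.LOCK_UN)
            trace.append("unlock " + name)
    finally:
        os.close(fd)


def update_state_and_history(state_lock, history_lock, trace):
    """Take the state lock first and the history lock second.

    Leaving the nested blocks releases the locks in reverse order,
    history first and state last.
    """
    with locked(state_lock, "state", trace):
        with locked(history_lock, "history", trace):
            trace.append("update")  # state and history would be written here
    return trace
```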

Scheduler Locking

Within the scheduler process, multiple threads of execution may require access to scheduler resources. Of greatest interest from a practical standpoint is the read/write lock protecting the global Configuration object. When any thread requires configuration information, a read lock is obtained. If a new configuration becomes available (say, via SIGHUP), a write lock is requested, and will only be granted upon the release of all outstanding read locks. Execution of a check currently requires a read lock, and so any SIGHUP received will be deferred until the check has completed.
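The behaviour described here, where a writer waits for all outstanding readers, can be illustrated with a small read/write lock. This is a generic sketch and not the scheduler's own implementation; the class and method names are hypothetical, and it makes no attempt to give waiting writers priority over new readers.

```python
import threading


class ReadWriteLock:
    """Many concurrent readers, or one writer, but never both."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0      # number of read locks currently held
        self._writer = False   # True while the write lock is held

    def acquire_read(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()  # a waiting writer may now proceed

    def acquire_write(self):
        with self._cond:
            # Granted only once every outstanding read lock is released.
            while self._writer or self._readers > 0:
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```

In these terms, a thread executing a check would hold a read lock for the duration of the check, and the thread installing a new configuration would block in acquire_write() until that check finishes.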

Additionally, access to the internal alert and check processing queues and to the timer array is controlled. Call list state is controlled globally, and cached state data is protected on a per-item basis.

Fix Locking

Fix locking is required to prevent undesirable concurrent execution of fix modules. When a fix is scheduled for a given service@host, by default no other fix may run for the same service@host. Depending on the configuration, fixes may instead lock out an entire service or an entire host. Locks are implemented as symbolic links with expiration times, so that a failed fix module cannot block other fix attempts indefinitely.
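A symbolic link works as a lock because creating one is atomic and fails if the name already exists. The sketch below shows one way such a lock with an expiration time could look. How SURVIVOR actually encodes the expiration is not described here, so storing the expiry timestamp as the link target is purely an assumption, as are the function names and parameters.

```python
import os
import time


def try_fix_lock(lock_path, lifetime_seconds, now=None):
    """Try to take a fix lock; return True on success, False if it is held.

    Assumption: the symlink target holds the expiry time as epoch seconds.
    """
    now = time.time() if now is None else now
    target = str(int(now + lifetime_seconds))
    try:
        os.symlink(target, lock_path)  # atomic: fails if the link exists
        return True
    except FileExistsError:
        pass

    try:
        expires = int(os.readlink(lock_path))
    except (OSError, ValueError):
        return False  # lock vanished or is unreadable; caller may retry later
    if expires > now:
        return False  # another fix still holds a valid lock

    # The lock has expired, e.g. because a fix module failed. Remove it and
    # try once more. Two processes racing through this path could still
    # collide; a production version would need to close that window.
    try:
        os.unlink(lock_path)
    except FileNotFoundError:
        pass
    try:
        os.symlink(target, lock_path)
        return True
    except FileExistsError:
        return False


def release_fix_lock(lock_path):
    """Remove the lock when the fix has finished."""
    try:
        os.unlink(lock_path)
    except FileNotFoundError:
        pass
```

Locking an entire service or an entire host instead of a single service@host would then only be a matter of which path is used for the link.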

$Date: 2006/11/20 02:48:45 $
$Revision: 0.4 $
