SURVIVOR: Overview

The Unix Systems Group of Columbia University Information Technology is responsible for around 200 hosts, ranging from desktop workstations to file servers for 60,000 users. The existing monitoring system was designed at a time when there were closer to 10 hosts, and it became clear that a replacement was needed.

A search began for a replacement product, but none met the exact requirements, which included the following:

  • Reliable
  • Easy to use
  • Easy to maintain
  • Scalable
  • Modular
  • State that appropriately survives reconfiguration
  • Time aware
  • Check and alert behaviors defined by hosts and/or services
  • Logical groupings of hosts
  • Acknowledgements
  • Escalations
  • History records for long term status monitoring
  • Dependencies
  • Call list maintenance
  • Local vs remote checking
  • Configurable levels of failure
  • Enabling/disabling of services, hosts, or service@host pairs
  • Configuration files that are easy to read and edit
  • Error checking that provides useful information rather than bizarre behavior

And so the survivor project was established.

The central portion of the package is the survivor scheduler (ss). It is a multi-threaded daemon that handles the scheduling and execution of checks and alerts. The scheduler runs an instance, which is a set of configuration files, state, and history. Multiple instances can be created for multiple configurations, with each instance run via a separate scheduler.

The scheduler executes checks and alerts in accordance with its configuration files. Checks and alerts are implemented by modules, which may be written in any language so long as they conform to the appropriate specifications. The results of these checks and alerts are stored in state and history files.
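
To make the idea of language-independent modules concrete, here is a minimal Python sketch of a scheduler-like loop that runs an external check program against several hosts in parallel and collects result records. It is an illustration only, not survivor's implementation: the exit-code convention, the record fields, and the names run_check, run_all, and demo_module are all assumptions made for this example, and the real module specifications may differ.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical convention for this sketch only: a check module is an
# executable that takes a host name as its last argument, prints a one-line
# comment, and exits 0 on success or nonzero on failure.
# survivor's real module specification may differ.


def run_check(module_argv, host, timeout=10.0):
    """Run one check module against one host and return a result record."""
    started = time.time()
    try:
        proc = subprocess.run(
            module_argv + [host],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        code, comment = proc.returncode, proc.stdout.strip()
    except subprocess.TimeoutExpired:
        code, comment = 1, "check timed out"
    return {"host": host, "time": started, "rc": code, "comment": comment}


def run_all(module_argv, hosts, workers=4):
    """Run the same check against many hosts concurrently, in host order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda h: run_check(module_argv, h), hosts))


def demo_module():
    """Return argv for a stand-in module that fails for a host named 'bad'."""
    script = (
        "import sys; h = sys.argv[1]; print('checked ' + h); "
        "sys.exit(1 if h == 'bad' else 0)"
    )
    return [sys.executable, "-c", script]


if __name__ == "__main__":
    history = run_all(demo_module(), ["alpha", "bad", "gamma"])
    for record in history:
        print(record["host"], record["rc"], record["comment"])
```

Because the module is just a command line, the stand-in could be replaced by a program in any language that follows the same (assumed) convention. A real scheduler would also write each record to persistent state and history rather than keeping it in memory.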

Checks, by default, execute on the host where the scheduler runs. This is sufficient for many cases, such as checking network services like HTTP, SMTP, and IMAP. The survivor remote daemon (sr) is provided for checks that must run on the individual hosts themselves.
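
A network service can be checked from the scheduler host because the check only needs to reach the service's port. The following Python sketch shows the simplest form of such a check, a plain TCP connect. The function name check_tcp and its return shape are assumptions for this example and are not part of survivor. Checks that need local information from a host (the case sr exists for) are not covered here, since the remote protocol is not described in this overview.

```python
import socket


def check_tcp(host, port, timeout=5.0):
    """Return (ok, comment) for a plain TCP connect to host:port.

    This only proves that something accepts connections on the port;
    it does not speak HTTP, SMTP, or IMAP.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, f"{host}:{port} accepted a connection"
    except OSError as exc:
        # Covers refused connections, timeouts, and name lookup failures.
        return False, f"{host}:{port} failed: {exc}"


if __name__ == "__main__":
    # Placeholder host name and the standard SMTP port, for illustration only.
    ok, comment = check_tcp("mail.example.edu", 25)
    print("OK" if ok else "PROBLEM", comment)
```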

The state may be viewed and manipulated by other programs, including the command line interface (sc), the web interface (sw), and the mail gateway (sg).


$Date: 2006/11/19 02:54:58 $
$Revision: 0.6 $