SURVIVOR: Check Module Specification

Specification

All requirements apply to both scheduler and remote check modules, unless otherwise stated.

  1. Check modules must be reentrant. That is, if a check module is run more than once simultaneously, all instances must run to completion without interference.

  2. Check modules must not change their process group by any means, including via setsid(), setpgid(), setpgrp(), or any similar function.

  3. Check modules should handle their own parallelization. If a module is passed more than one host name to check, it is up to the module to determine the best way to handle them. (This requirement is relaxed from "must" to "should" because scripted modules may be run under the parallel module.)
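
As an illustration of requirement 3, the following is a minimal sketch of one way a scripted module might check several hosts concurrently. It is written in Python for brevity; the names check_host and check_hosts and the worker limit are hypothetical and not part of this specification. Threads are used so that the process group is left untouched, in keeping with requirement 2.

```python
from concurrent.futures import ThreadPoolExecutor


def check_host(host):
    """Hypothetical per-host check; returns (return_code, scalar, comment).

    Placeholder logic only: a real module would probe the host here.
    """
    return (0, 1, "%s responded" % host)


def check_hosts(hosts, max_workers=8):
    """Run check_host over all hosts concurrently, preserving input order."""
    if not hosts:
        return []
    workers = min(max_workers, len(hosts))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input hosts.
        results = list(pool.map(check_host, hosts))
    return list(zip(hosts, results))
```

A module that wants to report each host as soon as its result is known could iterate over completed futures instead of collecting the whole list first.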

  4. Remote check modules should be written in a scripting language such as perl to make changes easier and more transparent, and to allow for easier portability. Scheduler check modules may also be written in a scripting language. Compiled modules are permitted when necessary, but are actively discouraged for remote check modules.

  5. Each module must place its source code in a directory underneath survivor/src/modules/check/ with the following conventions:
    1. The name of the directory must be check/modulename/.

    2. A Makefile.in must be present, with directives for clean, veryclean, all, install, and install-remote.

      The install directive should, barring exceptional circumstances, install the module into @prefix@/mod/check, owned by @INST_USER@ and @INST_GROUP@, mode 555.

      The install-remote directive should be the same as install, except where it does not make sense for the module to be installed as part of a remote distribution.

    3. Documentation describing the module should be in doc/cm-modulename.html.

  6. Check modules must accept the following arguments:
    • -v
      A flag indicating that the module should validate its configuration. The module must test for any dependencies (executables, libraries, modules, configuration files, etc.) required for normal successful execution. If the configuration is valid, the module must exit with MODEXEC_OK, using scalar value 0 and the string "Module OK" as the comment, where Module is the name of the module. Otherwise, it must exit with MODEXEC_PROBLEM. In both cases, the output must follow the output format specification described below.
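
The -v behavior might be sketched as follows. This is an assumption-laden illustration, not the reference implementation: the numeric values given for MODEXEC_OK and MODEXEC_PROBLEM are placeholders (the real values are defined in include/survivor.H), the function name validate is hypothetical, and the scalar reported on failure is an arbitrary choice, since this specification only fixes the scalar for the success case.

```python
import os
import shutil

# Placeholder values; the authoritative definitions are in include/survivor.H.
MODEXEC_OK = 0
MODEXEC_PROBLEM = 1


def validate(module_name, required_executables, required_files):
    """Return (return_code, scalar, comment) for a -v run."""
    # Executables are looked up on PATH; files are checked for existence.
    missing = [exe for exe in required_executables if shutil.which(exe) is None]
    missing += [path for path in required_files if not os.path.exists(path)]
    if missing:
        # Scalar 0 on failure is an assumption made for this sketch.
        return (MODEXEC_PROBLEM, 0, "Missing dependencies: " + ", ".join(missing))
    return (MODEXEC_OK, 0, "%s OK" % module_name)
```

The returned tuple would then be written out in the result format and used as the exit value.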

  7. Check modules receive the rest of their data via a SurvivorCheckData document, which provides the following elements:

    • Host
      Host to perform the check on. Remote check modules will still be provided this argument, with the value localhost. Absence of this argument should cause the check module to exit immediately with an appropriate return code.

    • Timeout
      The timeout for this module. After timeout seconds, the check module may be gracelessly terminated. The check module may use this timeout value to exit gracefully before time expires. If this option is not provided, the module may act as if there is no timeout.

    • ModuleOption
      The names and values of the arguments provided in check.cf or dependency.cf. This element should conform to the Module XML Argument Specification.
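
Reading the Host and Timeout elements could look roughly like the sketch below. The exact layout of a SurvivorCheckData document is not given here, so the code assumes that Host and Timeout are plain child elements with text content and that no XML namespace is in use. ModuleOption is left unparsed because its structure is governed by the Module XML Argument Specification. The function name parse_check_data is hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_check_data(xml_text):
    """Extract (hosts, timeout) from a SurvivorCheckData document.

    Assumes Host and Timeout are simple text elements (an assumption,
    not confirmed by this specification).
    """
    root = ET.fromstring(xml_text)
    hosts = [h.text.strip() for h in root.iter("Host") if h.text and h.text.strip()]
    timeout = None  # None means "act as if there is no timeout"
    timeout_el = root.find("Timeout")
    if timeout_el is not None and timeout_el.text and timeout_el.text.strip():
        timeout = int(timeout_el.text.strip())
    return hosts, timeout
```

An empty host list corresponds to the case in which the module should exit immediately with an appropriate return code.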

  8. Check modules should not write output files.

  9. Check modules must write to stdout one XML document, consisting of a SurvivorCheckResult element, for each host specified. These documents must not be interleaved. Each host's element should be generated as soon as information is available, in case the module is timed out. The elements defined are

    • Host
      This must be the name of the host as provided by the SurvivorCheckData argument.

    • ReturnCode
      The numeric return code (as defined in include/survivor.H). Possible values include
      • MODEXEC_OK: No problem was found.
      • MODEXEC_PROBLEM: A critical problem was found, or the check could not be completed for critical reasons.
      • MODEXEC_WARNING: A non-critical problem was found, and is in danger of becoming critical.
      • MODEXEC_NOTICE: A non-critical problem was found, or the check could not be completed for non-critical reasons.
      • MODEXEC_MISCONFIG: The module is misconfigured and is unable to perform its check.
      • MODEXEC_TIMEDOUT: The check timed out.
      In addition, ReturnCode may be a value from 20 through 1000 to convey custom return information.

    • Scalar
      The scalar value must be an integer, either positive or negative, indicating a value that may be used for long term monitoring. For example, the number might be a load, or a simple '0' (no) or '1' (yes) indicating whether a service is responding. For disk space usage, it might be between 0 and 100 to indicate fullness, or it might be the actual number of bytes in use.

    • Comment
      The comment may be an empty string, or it may provide a human readable explanation of the return or scalar values. The comment may be reformatted or truncated.

    • Duration
      The duration of the check execution, in milliseconds. If present, this value must be an integer zero or greater.
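
A result for one host might be produced along the lines of the sketch below. It assumes that Host, ReturnCode, Scalar, Comment, and Duration are child elements of SurvivorCheckResult with text content; any attributes, version markers, or XML declaration the real format may require are not shown. The function names format_result and emit_result are hypothetical.

```python
import sys
import xml.etree.ElementTree as ET


def format_result(host, return_code, scalar, comment="", duration_ms=None):
    """Build one SurvivorCheckResult element as a string (assumed layout)."""
    result = ET.Element("SurvivorCheckResult")
    ET.SubElement(result, "Host").text = host
    ET.SubElement(result, "ReturnCode").text = str(int(return_code))
    ET.SubElement(result, "Scalar").text = str(int(scalar))
    ET.SubElement(result, "Comment").text = comment
    if duration_ms is not None:
        if duration_ms < 0:
            raise ValueError("Duration must be zero or greater")
        ET.SubElement(result, "Duration").text = str(int(duration_ms))
    # ElementTree escapes characters such as '<' and '&' in text content.
    return ET.tostring(result, encoding="unicode")


def emit_result(host, return_code, scalar, comment="", duration_ms=None,
                stream=sys.stdout):
    """Write one complete result and flush, so it survives a later timeout."""
    stream.write(format_result(host, return_code, scalar, comment, duration_ms) + "\n")
    stream.flush()
```

Writing each result with a single call and flushing immediately keeps results from being interleaved and makes each one available even if the module is terminated afterwards. A module that emits results from several threads would additionally need a lock around emit_result.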

  10. Check modules must exit with the highest return value generated by any host checked, unless custom return values are in use, in which case the check module may exit with whatever value the custom specifications require. Check modules executing a Type II dependency must exit with a return value appropriate for the results obtained from the check.
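
For the common case in which custom return values are not in use, the exit value can be computed as in this small sketch. The function name module_exit_code is hypothetical.

```python
def module_exit_code(return_codes):
    """Return the highest return code generated by any host checked."""
    codes = list(return_codes)
    if not codes:
        # No host was checked; the caller must choose an appropriate code.
        raise ValueError("no return codes to aggregate")
    return max(codes)
```

A module would pass the result to its process exit call after all per-host results have been written.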


$Date: 2006/11/20 00:05:07 $
$Revision: 0.11 $