SURVIVOR: Upgrading
About Upgrading

Although every effort is made to minimize incompatibilities between versions, occasionally changes are required to facilitate future enhancements and make the system more flexible.

Upgrading from 0.9.7 to 1.0

  1. v1.0 removes support for the notwarncount and notprobcount arguments to the snmp check module. There is no replacement for this functionality.

  2. v1.0 adds a snmpversion argument to the snmp, ups, storedge-t3, nadisk, and hplj check modules. The default SNMP version used is '1'. For installations where these modules were previously used with a different SNMP version, add the appropriate version as an argument to all relevant checks.

  3. v1.0 displays the first time a check returned the current result, via both the command line and web interfaces. After upgrading to v1.0, the original first check time is not available, so the time of the first check run after the upgrade is used instead. As a result, this time may not match the number of instances reported by the same interfaces.
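
To locate the check definitions affected by items 1 and 2, something like the following sketch may help. It assumes check.cf introduces each module with a line of the form "module NAME {", as in the examples elsewhere in this document. The function names are illustrative and are not part of the package.

```shell
#!/bin/sh
# Sketch only: locate check.cf lines affected by the 1.0 snmp changes.
# Assumes each module is introduced by a line of the form "module NAME {".

# Print lines that still use the removed snmp arguments.
find_removed_snmp_args() {
  grep -nE '(^|[^[:alnum:]_])(notwarncount|notprobcount)([^[:alnum:]_]|$)' "$1"
}

# Print lines that invoke a module which now accepts snmpversion.
find_snmpversion_modules() {
  grep -nE '(^|[[:space:]])module[[:space:]]+(snmp|ups|storedge-t3|nadisk|hplj)([[:space:]{]|$)' "$1"
}

# When given a path to check.cf, report both kinds of lines.
if [ $# -gt 0 ]; then
  find_removed_snmp_args "$1" || true
  find_snmpversion_modules "$1" || true
fi
```

Each reported line is prefixed with its line number. Whether a given check needs an explicit snmpversion still has to be decided by hand.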

Upgrading from 0.9.6 to 0.9.7

  1. v0.9.7 changes some flags to the web interface (sw) and may require existing bookmarks referencing actions and custom pagesets to be updated.
    • The flags h and s are no longer used to specify service@host for an action to be performed. Instead, sh should be used.
    • The flag ret should now be used when submitting an action (a flag) to provide a return path after the action is processed.

  2. v0.9.7 changes the specification for callliststatus files in order to allow substitutions to properly end. Rotating call lists may lose their place following the upgrade. Before upgrading, determine who is on call for each rotating call list.
         % sc clstat list1
         [list1]
          -> abc was last notified at abc@site.org via this list
          -> List last rotated at Thu Aug 25 09:40:32 2005
          -> abc is now on call
         
    After the upgrade, use the new clset command to reset the call list.
         % sc -o person=abc clset list1
         [list1] OK: Set
         

  3. v0.9.7 changes the specification for report modules slightly. Report modules may no longer assume a TmpDir will be provided. Additionally, a new check style has been defined. For more information, see the updated specification.

    Any custom report modules should be updated.
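
For sites with several rotating call lists, the bookkeeping in item 2 can be scripted. The sketch below reads saved "sc clstat" output and prints the matching "sc clset" commands. It assumes the output has exactly the shape shown in item 2 (a "[list]" header followed by a "-> PERSON is now on call" line). The function name is hypothetical.

```shell
#!/bin/sh
# Sketch only: turn saved "sc clstat" output into matching "sc clset" commands.
# Assumes a "[list]" header line followed by "-> PERSON is now on call".

clset_commands() {
  awk '
    # Remember the most recent call list header, e.g. "[list1]".
    NF == 1 && $1 ~ /^\[.*\]$/ { list = substr($1, 2, length($1) - 2) }
    # Emit a clset command for the person reported as on call.
    $1 == "->" && NF == 6 && $3 == "is" && $4 == "now" && $5 == "on" && $6 == "call" {
      print "sc -o person=" $2 " clset " list
    }
  '
}

# When given a file of saved clstat output, print the commands to run later.
if [ $# -gt 0 ]; then
  clset_commands < "$1"
fi
```

Save the clstat output before the upgrade, then review and run the printed commands once the new version is installed.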

Upgrading from 0.9.5 to 0.9.6

  1. v0.9.6 removes the parallel check module. This module did not perform any checks, but ran other checks in parallel, reducing the time required to run them.

    Modules run under the parallel module can simply be run serially instead.

    Other methods are available for module parallelization. Most modules included with the package use these methods by default.

  2. v0.9.6 converts check, fix, and transport modules to accept arguments via XML documents. Custom check, fix, and transport modules that do not use libcm or Survivor.pm must be modified. Custom modules that use libcm may need to be modified.

  3. v0.9.6 generalizes the apc check to support Liebert UPSs as well. To reflect this, the module has been renamed to ups. References to apc in check.cf must be changed to ups.
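
The rename in item 3 could be applied with a filter along these lines. This is a sketch: it assumes modules are referenced as "module apc" followed by whitespace, "{", or end of line, and it writes the result to standard output rather than editing check.cf in place. The function name is illustrative.

```shell
#!/bin/sh
# Sketch only: print check.cf with the apc module renamed to ups.
# Review the output before replacing the original file.

rename_apc_to_ups() {
  sed -E 's/(^|[[:space:]])module([[:space:]]+)apc([[:space:]{]|$)/\1module\2ups\3/' "$1"
}

# When given a path to check.cf, print the rewritten file.
if [ $# -gt 0 ]; then
  rename_apc_to_ups "$1"
fi
```

Check names that merely contain "apc" are left untouched, because only the token following "module" is matched.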

Upgrading from 0.9.4 to 0.9.5

  1. v0.9.5 converts most state files to an XML based format in order to facilitate the addition of new features both in this release and in future releases.

    In order to convert the existing state files, as root stop the v0.9.4 scheduler and then run the convert-state.pl script in src/util/upgrading. This must be done while the scheduler is stopped. Run the script once for each statedir directory defined in /etc/survivor/instance.cf. Then, start the 0.9.5 scheduler.

    If this conversion is not performed, all existing state (but not history) will be lost when the 0.9.5 scheduler is started.

  2. v0.9.5 adds a tmpdir keyword to instance.cf, used for components of the system that require the ability to write temporary files. The default value is /tmp, which is not suitable if the system is installed on a host accessible by non-trusted users. See instance.cf for more information.
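
When several instances are defined, the per-statedir conversion in item 1 can be driven from instance.cf. The sketch below only prints the commands to run. It rests on two assumptions not stated above: that instance.cf lists each directory as "statedir PATH" on its own line, and that convert-state.pl takes the directory as its only argument, as convert-history.pl does. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: print one convert-state.pl command per statedir.
# Assumptions: instance.cf contains lines of the form "statedir PATH", and
# convert-state.pl accepts the directory as its only argument.

list_statedirs() {
  awk '$1 == "statedir" && NF >= 2 { print $2 }' "$1"
}

print_convert_commands() {
  list_statedirs "$1" | while read -r dir; do
    printf './convert-state.pl %s\n' "$dir"
  done
}

# When given a path to instance.cf, print the commands without running them.
if [ $# -gt 0 ]; then
  print_convert_commands "$1"
fi
```

Run the printed commands as root from src/util/upgrading while the scheduler is stopped.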

Upgrading from 0.9.3 to 0.9.4

  1. v0.9.4 overhauls the internal management of history records in order to facilitate several changes now and to prepare for additional changes later. While most of these changes are not visible, the format of history records has changed.

    In order to convert existing history records, as root run the convert-history.pl script in src/util/upgrading. This must be done while the scheduler is stopped. It can be done after the 0.9.4 scheduler has been installed as long as it is not currently running. Run the script once for each historydir directory defined in /etc/survivor/instance.cf.

          scheduler# /etc/init.d/survivor stop
          scheduler# cd src/util/upgrading
          scheduler# ./convert-history.pl /survivor/sample/history
          scheduler# /etc/init.d/survivor start
          

    If necessary, it is safe to run convert-history.pl more than once on the same directory, as long as the scheduler is not running.

    Strictly speaking, converting existing history records is not necessary. However, having unconverted history records may prevent the 0.9.4 utilities (including the history retrieval and rotation functions of sc and the reporting function of sw) from completing successfully.

  2. v0.9.4, by default, disables the command line interface for the root user. This increases accountability: at larger sites many administrators may have access to the root account, making it difficult to determine who, for example, inhibited alerts for a host. Since it is not necessary to be root to run sc, no functionality is lost. To allow the root user to run the command line interface anyway, add the following line to each instance defined in instance.cf:
          allow root
          
    For more information, see the instance configuration file documentation.

  3. The 0.9.4 web interface has been overhauled, with two notable changes. First, cookies are now required for authenticated sessions. Second, the format of several keywords in cgi.cf has changed. For full details of how authentication and authorization now work, see the documentation for cgi.cf and the sample configuration file in the source config directory.

Upgrading from 0.9.2 to 0.9.3

0.9.3 removes a dependency on sendmail. In order to ensure alerts transmitted with the mail transmit module are successfully sent out, the Perl module Mail::Mailer must be installed on the scheduler host.
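
One common way to confirm that a Perl module is available is to ask perl to load it. The sketch below wraps that idiom in a hypothetical helper function. It tests the perl found in PATH, which is assumed to be the same perl the scheduler uses.

```shell
#!/bin/sh
# Sketch only: report whether a Perl module can be loaded by the perl in PATH.

have_perl_module() {
  # perl -MNAME -e 1 exits non-zero if the module cannot be loaded.
  perl -M"$1" -e 1 2>/dev/null
}

if have_perl_module Mail::Mailer; then
  echo "Mail::Mailer is installed"
else
  echo "Mail::Mailer is missing; install it on the scheduler host"
fi
```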

Additionally, 0.9.3 overhauls the configuration of dependencies. Type II dependencies were improperly implemented in prior releases. For information on Type II dependencies, see the documentation. Any Type II dependencies in dependency.cf must be converted.

Type I dependencies have a new syntax, as defined in the documentation. Any Type I dependencies in dependency.cf must be converted to the new syntax. For example,

  depend foo on bar status
  
would become
  depend {
    checks { foo }
    for all hosts
    on bar status
  }
  
and
  depend all except { foo bar } on baz status
  
would become
  depend {
    all checks except { foo bar }
    for all hosts
    on baz status
  }
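
The two Type I conversions above are mechanical, so a filter such as the following sketch could do most of the work. It handles only the two forms shown in this section and passes every other line through unchanged, including a bare "depend all on ..." line, whose new form is not shown here. The function name is illustrative, and the output should be reviewed against the dependency documentation.

```shell
#!/bin/sh
# Sketch only: rewrite old one-line Type I dependencies in the new block syntax.
# Handles "depend CHECK on X status" and "depend all except { ... } on X status".

convert_type1() {
  awk '
    $1 == "depend" && $2 != "{" && NF >= 5 && $(NF-2) == "on" && $NF == "status" {
      # Everything between "depend" and "on" names the dependent checks.
      subj = $2
      for (i = 3; i <= NF - 3; i++) subj = subj " " $i
      if ($2 == "all" && $3 == "except") {
        sub(/^all/, "all checks", subj)
      } else if ($2 == "all") {
        print; next            # form not covered above; leave as is
      } else {
        subj = "checks { " subj " }"
      }
      print "depend {"
      print "  " subj
      print "  for all hosts"
      print "  on " $(NF-1) " status"
      print "}"
      next
    }
    { print }
  '
}

# When given a path to dependency.cf, print the converted file.
if [ $# -gt 0 ]; then
  convert_type1 < "$1"
fi
```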
  
Upgrading from 0.9 or 0.9.1 to 0.9.2

0.9.2 introduces a small change to increase the flexibility of the alerting infrastructure, splitting alert modules into format and transmit modules. The changes required to use the standard modules are very simple: add the following to the beginning of calllist.cf (assuming no local alert modules are in use):

  alert via mail {
    format as full
    transmit with mail
  }
 
  alert via mailtopager {
    format as compact
    transmit with mail
  }
 
  alert via mailtonextel {
    format as nextel
    transmit with mail
  }
 
  alert via mailtosms {
    format as sms
    transmit with mail
  }
 
Any custom alert modules written will need to be rewritten into format and/or transmit modules.

Upgrading from 0.8.x to 0.9

0.9 introduces many changes that are incompatible with 0.8.x. Many of these changes will facilitate future enhancements and make the system more flexible. These instructions identify the steps needed to upgrade. We hope that future upgrades will not require nearly so many changes (and preferably none at all).

  1. Build the new version.

  2. Stop the 0.8.x scheduler.

  3. Install the 0.9 remote package on all remotely monitored hosts.

  4. Install the 0.9 package on the scheduler host, but do not try to start the scheduler.

  5. In calllist.cf, change any rotating call lists to rotate using a schedule instead of an explicit time. For example, a call list originally defined as
         call list foo {
           notifies {
             jane@foo
             joe@foo
           }
    
           via mail
           rotates monday 12:00
         }
         
    would become
         call list foo {
           notifies {
             jane@foo
             joe@foo
           }
    
           via mail
           rotates using mondayNoon schedule
         }
         
    with mondayNoon defined in schedule.cf as
         schedule mondayNoon {
           at {
             monday 12:00
           }
         }
         

  6. Also in calllist.cf, switch to Person-based call lists. Note that because the state file format for call lists has changed, conversion of call lists can be slightly complicated. There are three options.

    1. The easiest procedure is to simply change the definitions. See the documentation for full details, but as an example
      	 call list foo {
      	   notifies {
      	     jane@foo
      	     joe@foo
      	   }
      
      	   via mail
      	   broadcasts to all
      	 }
      	 
      would become
      	 person jane {
      	   notify jane@foo via mail
      	 }
      
      	 person joe {
      	   notify joe@foo via mail
      	 }
      
      	 call list foo {
      	   notifies {
      	     jane
      	     joe
      	   }
      
      	   via mail
      	   broadcasts to all
      	 }
      	 
      Note that this procedure will reset simple and rotating call lists, and may confuse any existing substitutions.

    2. To clear out all previous state, including any existing substitutions, first remove all the directories under the directory calllist in each instance's state directory. To determine the state directory, see the instance configuration file, which is usually /etc/survivor/instance.cf.

      Then, follow the instructions for #1, above. This will still reset simple and rotating call lists, but there will be no confused entries for substitutions.

    3. If it is necessary to maintain previous state, it is possible to manually convert the state files. For assistance, please file a bug report.

  7. The state file formats for check state and alert state have changed to support new features introduced in v0.9 and to improve performance. In order to preserve 0.8.x state, run the movestate.sh script found in src/util/upgrading once for each statedir directory defined in /etc/survivor/instance.cf. If the script is not run, old check and alert state will not carry forward to the v0.9 scheduler and superfluous files will be left lying around.

    Run this script as $INSTUSER, or add an appropriate chown line after the chmod line in the script.

         scheduler% su - survivor
         % cd src/util/upgrading
         % ./movestate.sh /survivor/sample/state
         
    This script should not be run more than once per statedir.

  8. In schedule.cf, redefine the alertplans. Alertplans are now defined in terms of tries rather than the number of check failures. Whereas in v0.8.x an alert action was based on the number of times a check failed (this ability is still retained in v0.9 alertplans, but in a less prominent way), v0.9 selects alert actions based on the number of times an alert is transmitted. See the documentation for full details, but as an example
         alertplan standard {
           default {
             after 2 warnings {
               alert unix-mailer
               using standard schedule
             }
           }
         }
         
    would become
         alertplan standard {
           default {
             after 2 check failures
    
             using standard schedule {
               try { alert unix-mailer }
             }
           }
         }
         
    while in the following example, where multiple returngroup problem stanzas were required to allow 1 failure overnight,
         alertplan replicated {
           on returngroup problem {
             after 2 warnings {
               alert unix-pager
               using dayevening schedule
             }
             after 4 warnings {
               alert unix-pager
               using dayevening schedule
             }
           }
           on returngroup problem {
             after 2 warnings {
               alert unix-pager
               using overnight schedule
               allow 1 failure
             }
             after 4 warnings {
               alert unix-pager
               using overnight schedule
               allow 1 failure
             }
           }
           default {
             after 2 warnings {
               alert unix-mailer
               using extended schedule
             }
           }
         }
         
    would become the less redundant
         # Putting this here makes it the default for all alertplans through
         # the end of the file, or until redefined.
         after 2 check failures
        
         alertplan replicated {
           on returngroup problem {
             using extended schedule {
               try 2 times {
                 allow 1 failed host during overnight schedule
                 alert unix-pager
               }
               try {
                 allow 1 failed host during overnight schedule
                 alert unix-pager
                 flag escalated   # This is actually optional, since the second try
                                  # is considered escalated by default
               }
             }
           }
           default {
             using extended schedule {
               try { alert unix-mailer }
             }
           }
         }
         

  9. Also in schedule.cf, the semantics of global notify on clear have changed. Instead of applying to all alertplans, a global notify on clear is now a default value, and applies to all alertplans defined after it (but not before it), until redefined or until the end of the file. To replicate the v0.8 behavior, make sure the notify on clear statement is before the first alertplan definition, and add the keyword default. For example:
         alertplan foo {
           ...
         }
    
         # enable global notify on clear
         notify using bar schedule on clear
         
    becomes
         # this applies to all subsequently defined alertplans, unless overridden
         # or redefined
         default notify using bar schedule on clear
    
         alertplan foo {
           ...
         }
         

  10. In check.cf, all check modules must now be converted to named argument style. Unfortunately, the only way to do this is to read the documentation for each module and convert each entry appropriately. The reason this is so hard is the exact reason named arguments have been introduced: the old format was inconsistent, hard to read, and hard to use.

  11. Also in check.cf, the remote module is no longer a special type of check module, but is now one of a new class of modules called transport modules. In order for a module to run remotely, a transport module must be defined for it to use. A simple example to port a typical v0.8 entry (including the conversion to named arguments, described above) would change
         check mailq {
           module remote(mailq,/var/spool/mqueue,0,1000:2000)
         }
    
         check swap {
           module remote(swap,80,90)
         }
         
    to
         transport remote {
           module plaintext {}
         }
    
         check mailq {
           module mailq {
             queuedir /var/spool/mqueue   # This could be omitted, it is the default
             age      0                   # This could also be omitted, same reason
             warn     1000
             prob     2000
           }
           via remote
         }
    
         check swap {
           module swap {
             warn     80
             prob     90
           }
           via remote
         }
         

  12. Also in check.cf, the semantics of the global check timeout have changed. Instead of applying to all checks that do not define their own timeout, the global check timeout is now a default value, and applies to all checks defined after it (but not before it), until redefined or until the end of the file. Until it is defined, the initial default timeout of 45 seconds will apply. To replicate the v0.8 behavior, make sure the timeout statement is before the first check definition, and add the keyword default. For example:
         check foo {
           ...
         }
    
         # set global timeout
         timeout 3 minutes
         
    becomes
         # this applies to all subsequently defined checks, unless overridden
         # or redefined
         default timeout 3 minutes
    
         check foo {
           ...
         }
         

  13. In dependency.cf, changes are required for both Type I and Type II dependencies. Type I dependencies simply need the keyword status appended. For example:
         depend all on ping
         
    becomes
         depend all on ping status
         
    Type II dependencies need to be converted to named arguments, the same as for check modules, described earlier. Note that Type II dependencies do not currently support transport modules. This will be addressed with a forthcoming revision of the dependency mechanisms.

  14. Test the new configuration. One way to do this is by using the 0.9 sc to manually run some or all checks.

  15. Start the 0.9 scheduler.
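
As an aid for step 12, a converted check.cf can be scanned for a global timeout that would no longer behave as it did in v0.8. The sketch below is a rough audit, not a parser for the configuration format. It assumes stanzas are delimited by braces and that "#" starts a comment, as in the examples above. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch only: flag top-level timeout statements in check.cf that would not
# reproduce the v0.8 global timeout behavior.

audit_global_timeout() {
  awk '
    # depth counts open braces, so depth == 0 means outside any stanza.
    depth == 0 && $1 == "check" { seen_check = 1 }
    depth == 0 && $1 == "timeout" {
      print FILENAME ":" FNR ": global timeout lacks the default keyword"
    }
    depth == 0 && $1 == "default" && $2 == "timeout" && seen_check {
      print FILENAME ":" FNR ": default timeout follows a check definition"
    }
    {
      line = $0
      sub(/#.*/, "", line)     # ignore braces inside comments
      depth += gsub(/[{]/, "{", line) - gsub(/[}]/, "}", line)
    }
  ' "$1"
}

# When given a path to check.cf, print any findings.
if [ $# -gt 0 ]; then
  audit_global_timeout "$1"
fi
```

A default timeout placed after a check definition is still valid, since it applies to the checks defined after it. It is reported only because it does not replicate the v0.8 behavior.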


$Date: 2007/03/29 12:17:29 $
$Revision: 0.12 $