survivor: Confusing Error Messages Explained

About Confusing Error Messages
Parsing Configuration Files
Command Interface (sc)
Scheduler (ss)
Web Interface (sw)

About Confusing Error Messages

Although error messages are generally written to be transparent and understandable in context, some, by virtue of the underlying code design, may seem a little confusing.

Parsing Configuration Files

Error

When parsing the configuration files, the parser reports unexpected tokens, for example:

   *lex* Unexpected token at line 79 (state 7): "#commented"

Sample

   1024 *lex* Unexpected token at line 79 (state 7): "#commented"
   1024 *lex* Unexpected token at line 79 (state 7): "out"
   1024 *lex* Unexpected token at line 79 (state 7): "block"
   1024 || 3 errors encountered while parsing /home/symon/sample/config/check.cf
   sc: WARNING: Configuration parse failed

Explanation

The configuration parser, due to the complexities of the underlying parsing mechanism, requires every comment to be terminated with a newline. If the last line of a file is a comment with no newline after it, the parser fails and, as in the sample above, reports each word of the comment as an unexpected token.

Simply add a newline after the comment to fix this problem.
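
The fix can also be scripted. The following is a minimal sketch, not part of survivor itself: the function name ensure_trailing_newline is hypothetical, and it assumes the only problem is a missing newline at the very end of the file.

```python
import os

def ensure_trailing_newline(path):
    """Append a newline to the file at `path` if its last byte is not one.

    Returns True if the file was modified, False otherwise.
    (Hypothetical helper; not part of the survivor package.)
    """
    if os.path.getsize(path) == 0:
        return False  # an empty file cannot end in an unterminated comment
    with open(path, "rb+") as f:
        f.seek(-1, os.SEEK_END)
        if f.read(1) == b"\n":
            return False
        f.seek(0, os.SEEK_END)  # position explicitly at end before writing
        f.write(b"\n")
        return True
```

For the sample above, this would be called with the path named in the parse error, /home/symon/sample/config/check.cf.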

Command Interface (sc)

Error

   State::lock_type_state failed to open .../lock

Sample

   % sc -i instance clstate oncall
   [oncall]
   sc: WARNING: State::lock_calllist_state failed to open
   /var/instance/state/calllist/oncall/lock
   sc: WARNING: State::lock_calllist_state failed to open
   /var/instance/state/calllist/oncall/lock
   sc: WARNING: State::lock_calllist_state failed to open
   /var/instance/state/calllist/oncall/lock
   -> [email protected] is now on call

Explanation

When configuration files are updated, it is the responsibility of the scheduler to update the state directories in accordance with the new configuration. If other utilities, such as the command interface, are run with the new configuration before the scheduler is told of the update, they may try to access state files or directories that do not yet exist.

In the case of the sample above, the oncall call list was added to calllist.cf. Before the scheduler was told of the update, the command interface was run to get the state of the new call list. Since the scheduler has not made the call list state directories consistent, the directory /var/instance/state/calllist/oncall does not yet exist, and so its lock file cannot be opened.
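
A wrapper script could check for the state directory before running the command interface. This is only a sketch: the function name is hypothetical, and the directory layout (state directory, then calllist, then the call list name) is inferred from the path in the sample above rather than taken from the survivor sources.

```python
import os

def calllist_state_ready(state_dir, calllist):
    """Return True if the state directory for `calllist` already exists.

    The layout state_dir/calllist/<name> is an assumption based on the
    sample path /var/instance/state/calllist/oncall.
    """
    return os.path.isdir(os.path.join(state_dir, "calllist", calllist))
```

For the sample above, calllist_state_ready("/var/instance/state", "oncall") would return False until the scheduler has been told of the update and has created the directory.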

Scheduler (ss)

Error

Large numbers of warnings in /var/log/survivor-instance.log.

Sample

   ss: WARNING: state_consistency failed to create service state directory
   ss: WARNING: Unable to create host history directory
   ss: WARNING: CheckState::lastcheck failed to open
   ss: WARNING: CheckState::write_results failed to reset permissions on

Explanation

Most likely, the user running the scheduler is not INSTUSER or is not a member of INSTGROUP, and so cannot properly access and update the files. See the building instructions for more details.
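
One way to confirm this is to compare the owner and group of a state directory with the user running the scheduler. The sketch below is an assumption-laden illustration: the function name is hypothetical, it looks only at the owner and group permission bits, and it ignores "other" permissions, ACLs, and superuser privileges.

```python
import os
import stat

def can_update_state(path, uid=None, gids=None):
    """Rough check: may this user create files in `path` via owner or group bits?

    uid and gids default to the effective user and groups of the current
    process. (Hypothetical helper; not part of the survivor package.)
    """
    if uid is None:
        uid = os.geteuid()
    if gids is None:
        gids = set(os.getgroups()) | {os.getegid()}
    st = os.stat(path)
    if uid == st.st_uid:
        # Owner needs write and search permission on the directory.
        return bool(st.st_mode & stat.S_IWUSR) and bool(st.st_mode & stat.S_IXUSR)
    if st.st_gid in gids:
        # Group member needs group write and search permission.
        return bool(st.st_mode & stat.S_IWGRP) and bool(st.st_mode & stat.S_IXGRP)
    return False
```

Run as the scheduler's user against the instance state directory, a False result suggests the ownership problem described above.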

Error

On certain Linux platforms, SIGHUP sometimes causes the scheduler to exit, complaining the scheduler is restarting too frequently.

Sample

   
   ss: WARNING: Keepalive process is restarting the scheduler too frequently.
   ss: WARNING: There may be a configuration file error or some other problem.
   ss: WARNING: Keepalive process is exiting as a precaution.

Explanation

This is a bug in the system's implementation of sigwait(3). Instead of returning the signal that was actually sent to the appropriate thread (in this case, SIGHUP, or 1), the nonexistent signal 0 is returned. Since the scheduler cannot tell what signal 0 really means, it exits. When run under keepalive (ss -k), the keepalive daemon restarts the scheduler, effectively simulating a SIGHUP. However, if this happens too often (more than once per minute), the keepalive exits, assuming there is a bigger problem.

Update: This appears to have been fixed. libc version 2.3.2 is known to work properly.
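
The keepalive precaution described above can be pictured as a simple rate check. The sketch below illustrates the idea only and is not the keepalive's actual code: the class name, the 60-second default, and the injectable clock are all assumptions made for the example.

```python
import time

class RestartGuard:
    """Refuse a restart that follows the previous one too closely.

    min_interval=60 mirrors the "more than once per minute" rule in the
    text; the real keepalive logic may differ.
    """

    def __init__(self, min_interval=60.0, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock
        self.last_restart = None

    def allow_restart(self):
        """Record a restart request and return False if it came too soon."""
        now = self.clock()
        too_soon = (self.last_restart is not None
                    and now - self.last_restart < self.min_interval)
        self.last_restart = now
        return not too_soon
```

With the sigwait bug, every SIGHUP looks like a scheduler exit, so two SIGHUPs within a minute are enough to make such a guard give up.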

Error

   Failed to queue check 'service' (check may already be scheduled)

Sample

   ss: WARNING: Failed to queue check 'syslogd' (check may already be
   scheduled)

Explanation

Every minute, the check scheduler attempts to schedule any checks that are due to be executed. If a particular check takes more than a minute to execute, the check scheduler will attempt to schedule that check again.

To prevent the same check from being queued multiple times (and thus causing backlogs or concurrency problems), the scheduler produces the above warning if a previously scheduled check with the same name has not yet completed.
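
The behavior can be modeled as a queue that tracks pending checks by name. This is a simplified sketch under that assumption; the class and method names are hypothetical, and the scheduler's real data structures are not described in this document.

```python
class CheckQueue:
    """Track pending checks by name and refuse duplicates."""

    def __init__(self):
        self._pending = set()

    def queue(self, name):
        """Queue a check; return False if one with this name is still pending."""
        if name in self._pending:
            return False  # corresponds to "Failed to queue check ..."
        self._pending.add(name)
        return True

    def complete(self, name):
        """Mark a check as finished so that it can be queued again."""
        self._pending.discard(name)
```

In this model, a check that runs for longer than a minute is still pending when the next scheduling pass arrives, so the second queue attempt fails and the warning is logged.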

Error

   CheckState::lastcheck failed to open .../lastcheck

Sample

   ss: WARNING: CheckState::lastcheck failed to open
   /var/instance/state/host/hostname/service/lastcheck (No such file
   or directory)

Explanation

When a new check or host is added, there is no way to guarantee that the check scheduler will notice it before the alert scheduler does. If the alert scheduler notices the new check or host first, there is no lastcheck state for it to examine yet, and so this warning may be generated even when both hostname and service are known to be valid.

This warning should not continue after a minute or two, by which time the new check or host will have been queued by the check scheduler. Note that setting an alert shift time in schedule.cf will not eliminate this message, as the alert shift time only adjusts the period during which alerts are generated, and not the relative times compared to the check scheduler within that period.

Update: Due to changes within the scheduler, this error should no longer occur.

Web Interface (sw)

Error

(blank output) or

   HTTP/1.1 500 Internal Server Error

Sample

   HTTP/1.1 500 Internal Server Error

Explanation

Some dynamic libraries, including the OpenSSL libraries such as libssl or libcrypto, are not in the runtime LD_LIBRARY_PATH of the web interface.

Ordinarily, the paths to these libraries are encoded into the executable when the package is built. If, however, these libraries are moved or removed, when the program is executed the runtime linker will fail to resolve symbol definitions and the program will not run.

It may be possible to replicate this failure by manually running the program:

   % unsetenv LD_LIBRARY_PATH
   % ./sw
   ld.so.1: ./sw: fatal: libssl.so.0.9.6: open failed: No such file or directory
   Killed

To fix this problem, replace the libraries or rebuild the package with the new locations included in Makefile.inc.
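
To see which libraries cannot be resolved, the output of ldd ./sw can be scanned for unresolved entries. The sketch below assumes that each unresolved library appears on a line beginning with its name and containing the words "not found", which is the usual ldd format but may vary by platform. The function name is hypothetical.

```python
def missing_libraries(ldd_output):
    """Return the library names that ldd reported as not found.

    Assumes lines such as "libssl.so.0.9.6 => not found" or
    "libssl.so.0.9.6 =>  (file not found)".
    """
    missing = []
    for line in ldd_output.splitlines():
        if "not found" in line:
            missing.append(line.split()[0])
    return missing
```

Any library named in the result must be restored or its new location included when the package is rebuilt.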

$Date: 2006/11/20 02:54:21 $
$Revision: 0.13 $
