Columbia University in the City of New York  



Introduction

Uptime turns out, on initial investigation, to be harder to define than might be imagined, but, in the words of Justice Potter Stewart, "I know it when I see it."

Perhaps the better analogy is that of the blind men and the elephant.

And so these men of Indostan
Disputed loud and long,
Each in his own opinion
Exceeding stiff and strong,
Though each was partly in the right,
And all were in the wrong!

To compare availability in a useful fashion, it is necessary to identify which aspects of availability are being compared. Some views of availability data are better suited to public relations, others to system maintainability.

Consider the simple example where a pool of three web servers is fronted by a load balancing system. One server crashes for 20 minutes, but the load balancing system distributes incoming requests among the remaining two servers, so no outage is perceived by clients (except perhaps the handful of connections that come in at the exact time of the crash, a case which will be ignored here for simplicity). In this simple example, the server that crashed has an availability of 98.611% for the day (1420 of 1440 minutes), the hardware cluster of all three servers has an availability of 99.537% for the day (4300 of 4320 server-minutes), and the service has an availability of 100% for the day, since no outage was perceived by the clients.

(Note that all uptime calculations must have a frame of reference: the time interval over which the calculation is performed. Thus, it is not acceptable to say "our uptime is 99.99%", but rather the statement must be time qualified, for example "our uptime is 99.99% for the period January through June.")
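
The arithmetic in the example above can be reproduced with a short Python sketch. The function and variable names are illustrative only, and the sketch assumes that uptime is tracked in whole minutes over a one-day frame of reference.

```python
MINUTES_PER_DAY = 24 * 60  # the frame of reference: one day


def availability_pct(up_minutes, total_minutes):
    """Percent of the interval during which the entity was available."""
    if total_minutes <= 0:
        raise ValueError("the measurement interval must be positive")
    return 100.0 * up_minutes / total_minutes


crash_minutes = 20
server_count = 3

# The crashed server: 1420 of 1440 minutes.
server = availability_pct(MINUTES_PER_DAY - crash_minutes, MINUTES_PER_DAY)

# The hardware cluster: 4300 of 4320 server-minutes.
cluster_total = server_count * MINUTES_PER_DAY
cluster = availability_pct(cluster_total - crash_minutes, cluster_total)

# The service: clients perceived no outage at all.
service = availability_pct(MINUTES_PER_DAY, MINUTES_PER_DAY)

print(f"server  {server:.3f}% for the day")   # 98.611%
print(f"cluster {cluster:.3f}% for the day")  # 99.537%
print(f"service {service:.3f}% for the day")  # 100.000%
```

Each printed figure carries its time qualification, in keeping with the note above.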

Aspects of Availability

  • Services and Components
    It is useful to define what availability is being measured. The term service will refer to the top-level entity whose availability information is desired, the blank in a sentence like
        Our [blank] has an availability of 99.999% over the past 12 months.
        
    A service is made up of one or more components. (Each component could be considered a service on its own.) Calculating the availability of a service therefore requires knowing the availability of each of its components, and performing an appropriate combinational operation, depending on whether or not any of the components are redundant.

    In the simple example provided in the introduction, the service is web, and its components are the three servers that provide web service, plus the non-redundant load balancing device.

    If a service is made up of non-redundant components (for example, where users are assigned to unique backends), the overall availability of the service is still composed of its components, even though not all users perceive the same availability. This is similar to a train line advertised as 99% on time: for 100% of the riders on the 1% of trains that are late, the service is late, but overall riders can generally expect to be on time.

  • Scheduled vs Unscheduled Unavailability
    An outage that is scheduled is still an outage, but including scheduled outages in availability calculations can distort information about reliability. It may be important to know that a service is unavailable for 10 hours a month, but if 9 of those hours are due to scheduled maintenance, then the corrective action taken is likely different from the scenario where 9 of those hours are due to hardware failure.

  • Affected Users and Potentially Affected Users
    The impact of an outage can be directly related to the number of affected users. Defining an affected user can be extremely difficult, so it is generally more useful to make an order of magnitude estimate than to try to calculate the exact number of users.

    The number of potentially affected users provides another useful data point. For example, if an email system that serves 10,000 clients is to be made unavailable for scheduled maintenance, and 500 of those clients are expected to access the service at midnight versus 5,000 at noon, then a midnight downtime potentially affects 5% of the user base instead of 50%.

  • External vs Internal Availability
    When a service is provided via redundant components, the failure of a single component is unlikely to result in a downtime visible to the client. The external availability is what the client sees, whereas the internal availability is what the service provider sees. This is the example provided in the introduction.
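
The "appropriate combinational operation" mentioned under Services and Components is not spelled out here. One common way to model it, offered only as a sketch, is to multiply the availabilities of components that are all required, and to treat a redundant group as unavailable only when every member is unavailable. This assumes component failures are independent, which real systems may not satisfy. The function names and the figures below are hypothetical.

```python
from math import prod


def serial(availabilities):
    """All components are required: up only when every one of them is up."""
    return prod(availabilities)


def redundant(availabilities):
    """Any one component suffices: down only when all of them are down."""
    return 1.0 - prod(1.0 - a for a in availabilities)


# Hypothetical figures: three web servers, each 98.6% available, behind a
# single non-redundant load balancer that is 99.9% available.
web_pool = redundant([0.986, 0.986, 0.986])
web_service = serial([0.999, web_pool])

print(f"web pool    {100 * web_pool:.4f}%")
print(f"web service {100 * web_service:.4f}%")
```

Under this model the non-redundant load balancer dominates the result: the service can never be more available than its least available required component.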

With these definitions, it is possible to identify several useful availability calculations.
  • Absolute External Availability
    The availability as viewed by the clients of the service, expressed as percent available over a specified time interval.

    In the example, the web service has an absolute external availability of 100% over 1 day, since there was no client-visible outage.

  • Absolute Internal Availability
    The availability as viewed by the administrators of the service.

    In the example, the load balancer and two of the web servers each have an absolute internal availability of 100% over 1 day, while the server that crashed has an absolute internal availability of 98.611% over 1 day.

  • Absolute Scheduled External Availability
    The availability as viewed by the clients of the service, excluding any scheduled maintenance window.

    For example, if a 1 hour outage is scheduled for a service, and it is up for the other 23 hours of the day, then the service has an absolute scheduled external availability of 100% (compared to an absolute external availability of 95.833%) for the day. If, however, the scheduled maintenance exceeded its window by fifteen minutes, then the absolute scheduled external availability drops to 98.913% for the day, reflecting availability for 1365 of the 1380 scheduled available minutes.

  • Absolute Scheduled Internal Availability
    The availability as viewed by the administrators of the service, excluding any scheduled maintenance window.

  • Relative Unavailability
    Relative unavailability attempts to measure the impact of an outage. Because all relative calculations involve estimates of the number of affected users, they are all external calculations. Relative unavailability is defined as the duration of an outage multiplied by the number of affected users.

    For example, an outage that is externally visible for 30 minutes and affects O(1000) clients has a relative unavailability of 30,000 client-minutes, or more compactly has a magnitude of O(10^4).

  • Relative Unscheduled Unavailability
    The same as relative unavailability, except excluding scheduled outages.

  • Potential Unavailability
    For a given outage, the fraction of the client base that is affected. If O(100) out of O(10000) users are affected by an outage, then the potential unavailability of that outage is 1%.

    Potential unavailability can be fixed, when the number of affected users is unchanged regardless of the outage duration, or it can increase over time, when the number of affected users increases the longer an outage continues.
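
The scheduled, relative, and potential calculations defined above can be sketched as follows. The function names and parameters are illustrative and not part of any existing tool; user counts are the order of magnitude estimates described earlier.

```python
def scheduled_availability_pct(interval_min, window_min, unscheduled_down_min):
    """Availability with the scheduled maintenance window excluded."""
    scheduled_up = interval_min - window_min
    return 100.0 * (scheduled_up - unscheduled_down_min) / scheduled_up


def relative_unavailability(duration_min, affected_users):
    """Impact of an outage in client-minutes."""
    return duration_min * affected_users


def potential_unavailability_pct(affected_users, total_users):
    """Share of the client base affected by an outage."""
    return 100.0 * affected_users / total_users


# A one-hour window that overran by fifteen minutes: 1365 of 1380 minutes.
overrun = scheduled_availability_pct(1440, 60, 15)
print(f"{overrun:.3f}%")                                   # 98.913%

# A 30 minute outage affecting O(1000) clients.
print(relative_unavailability(30, 1000))                   # 30000

# O(100) of O(10000) users affected.
print(f"{potential_unavailability_pct(100, 10000):.0f}%")  # 1%
```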

Some more examples.
  • A service becomes unavailable, but nobody notices because it fails from 03:10 until 03:25. The absolute external availability of the service is 98.958% for the day, but the relative unavailability of the service is 0 client-minutes (15 minutes times O(0) clients).

  • A service is slow from 14:30 until 14:45, but usable. However, 20% of the 10,000 active clients of the service give up in frustration without being able to use it. The absolute external availability of the system remains 100% for the day, but the relative unavailability of the service is 15 minutes times O(1000) clients, or O(10^4) client-minutes.

  • A site runs three services, which have been available for 43,200, 43,140, and 43,170 minutes of the 30-day month. The services have absolute external availabilities of 100%, 99.861%, and 99.931% respectively for the month. The site has an absolute external availability of 99.931% for the month (129,510 of 129,600 service-minutes).
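
The first and last of these examples can be checked with a brief sketch. It assumes outages start and end on the same day and that a site's availability is pooled over service-minutes, as in the third example; the helper names are illustrative.

```python
from datetime import datetime


def outage_minutes(start, end, fmt="%H:%M"):
    """Length of a same-day outage in minutes, from HH:MM clock times."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def pooled_availability_pct(up_minutes, interval_min):
    """Availability of a site, pooling service-minutes across its services."""
    return 100.0 * sum(up_minutes) / (interval_min * len(up_minutes))


day = 24 * 60
down = outage_minutes("03:10", "03:25")
overnight = 100.0 * (day - down) / day
print(f"{overnight:.3f}% for the day")    # 98.958%

month = 30 * day                          # 43,200 minutes
site = pooled_availability_pct([43200, 43140, 43170], month)
print(f"{site:.3f}% for the month")       # 99.931%
```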

Why Calculate Availability?

Different availability calculations provide different information, each of which can be used to help plan workflows and upgrades, to help schedule outages, and to supply service reliability information for budgeting and public relations purposes.

For example, absolute external availability is useful for public relations, absolute internal availability is useful for detecting problematic equipment and software, and relative unavailability and potential unavailability are useful for measuring the user impact of planned work.

Tracking Data For Availability Calculations

The specific mechanism for tracking the data is less important than ensuring the data is standardized, which makes later analysis simpler. For the first version of this project, a simple text file will be used, containing the following fields:

Data               Description                                               Possible Values
Planned?           Whether or not the outage was scheduled                   U[nplanned]/P[lanned]
Core?              Whether or not the service affected is considered "core"  Y[es]/N[o]
Start Time         Time outage began                                         YYYYMMDD HH:MM
Interval           Outage duration                                           HH:MM
Users Affected     Estimated number of affected users                        0, <10, <100, <1000, <10000, 10000+
Services Affected  Service or services unavailable due to outage             (list to be determined by experience)
Underlying Reason  Why the outage happened                                   Misconfiguration, Hardware Failure, Preventative Maintenance, Corrective Maintenance (additional entries to be determined)

This data will be used primarily to calculate external availability statistics.

Additionally, data from the systems monitor history logs may be used to determine internal availability statistics.
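
The layout of the text file is not fixed beyond the fields listed above. As a minimal sketch, the following assumes one outage per line with fields separated by "|" in the order of the table, and a comma-separated list of affected services. The delimiter, the `Outage` class, the `parse_outage` function, and the sample record are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Outage:
    planned: bool
    core: bool
    start: datetime
    duration: timedelta
    users_affected: str   # one of the estimate buckets, e.g. "<100"
    services: list
    reason: str


def parse_outage(line):
    """Parse one record; the '|' delimiter and field order are assumptions."""
    planned, core, start, interval, users, services, reason = (
        field.strip() for field in line.split("|")
    )
    hours, minutes = interval.split(":")
    return Outage(
        planned=planned.upper().startswith("P"),
        core=core.upper().startswith("Y"),
        start=datetime.strptime(start, "%Y%m%d %H:%M"),
        duration=timedelta(hours=int(hours), minutes=int(minutes)),
        users_affected=users,
        services=[s.strip() for s in services.split(",")],
        reason=reason,
    )


# Made-up sample record, for illustration only.
record = parse_outage(
    "U | Y | 20051026 03:10 | 00:15 | <100 | web | Hardware Failure"
)
print(record.start, record.duration, record.services)
```

Keeping the planned flag and the duration as typed fields makes it straightforward to sum unscheduled downtime separately from scheduled downtime when computing the statistics described earlier.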

http://www.columbia.edu/acis/dev/unixdev/doc/uptime.shtml Wednesday, 26-Oct-2005 08:26:44 EDT