Introduction
Uptime, after an initial investigation, is not as easy to define as might
be imagined, but, in the words of Justice Potter Stewart, "I know it when
I see it."
Perhaps the better analogy is that of the blind men and the elephant.
And so these men of Indostan
Disputed loud and long,
Each in his own opinion
Exceeding stiff and strong,
Though each was partly in the right,
And all were in the wrong!
In order to be able to compare availability in a useful
fashion, it is necessary to identify what aspects of availability are
being compared. Some views into availability data are better suited
to public relations, while others are better suited to system maintenance.
Consider a simple example in which a pool of three web servers is
fronted by a load balancing system. One server crashes for 20
minutes, but the load balancing system distributes incoming requests
among the remaining two servers, so no outage is perceived by clients
(except perhaps the handful of connections that come in at the exact
time of the crash, a case which will be ignored here for simplicity).
In this simple example, the server that crashed has an availability of
98.611% for the day (1420 of 1440 minutes), the hardware cluster of
all three servers has an availability of 99.537% for the day (4300 of
4320 server minutes), and the service has an availability of 100% for
the day, since no outage was perceived by the clients.
(Note that all uptime calculations must have a frame of reference: the time
interval over which the calculation is performed. Thus, it is not
sufficient to say "our uptime is 99.99%"; the statement must be
time qualified, for example "our uptime is 99.99% for the period
January through June.")
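The three figures in the load-balanced example can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the function name `availability_pct` and the variable names are illustrative and do not come from any existing tool.

```python
def availability_pct(up_minutes: float, total_minutes: float) -> float:
    """Percent of the measurement interval during which the entity was up."""
    if total_minutes <= 0:
        raise ValueError("the interval must be longer than zero minutes")
    return 100.0 * up_minutes / total_minutes


MINUTES_PER_DAY = 24 * 60  # the frame of reference: one day, 1440 minutes

# Downtime per web server in minutes: one crashed for 20, two stayed up.
server_downtime = [20, 0, 0]

crashed_server = availability_pct(MINUTES_PER_DAY - server_downtime[0],
                                  MINUTES_PER_DAY)
cluster = availability_pct(
    sum(MINUTES_PER_DAY - down for down in server_downtime),  # 4300
    MINUTES_PER_DAY * len(server_downtime),                   # 4320
)
# Clients saw no outage, so the service was up for the whole interval.
service = availability_pct(MINUTES_PER_DAY, MINUTES_PER_DAY)

print(f"crashed server: {crashed_server:.3f}% over 1 day")  # 98.611%
print(f"cluster:        {cluster:.3f}% over 1 day")         # 99.537%
print(f"service:        {service:.3f}% over 1 day")         # 100.000%
```

Every printed figure carries its interval, in keeping with the note above that an uptime percentage means little without one.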
Aspects of Availability
- Services and Components
It is useful to define whose availability is being measured.
The term service will refer to the top-level entity whose
availability information is desired, the blank in a sentence like
"Our [blank] has an availability of 99.999% over the past 12 months."
A service is made up of one or more components. (Each component
could be considered a service on its own.) Calculating the
availability of a service therefore requires knowing the
availability of each of its components, and performing an
appropriate combinational operation, depending on whether or not
any of the components are redundant.
In the simple example provided in the introduction, the service is
"web", and its components are the three servers that provide
web service, plus the non-redundant load balancing device.
If a service is made up of non-redundant components (for example,
where users are assigned to unique backends), the overall availability
of the service is still composed of its components, even though
not all users perceive the same availability. This is similar
to a train line advertised as 99% on time: for 100% of the riders
on the 1% of trains that are late, the service is late, but overall
riders can generally expect to be on time.
- Scheduled vs Unscheduled Unavailability
An outage that is scheduled is still an outage, but including
scheduled outages in availability calculations can distort
information about reliability. It may be important to
know that a service is unavailable for 10 hours a month, but
if 9 of those hours are due to scheduled maintenance, then the
corrective action taken is likely different from the scenario
where 9 of those hours are due to hardware failure.
- Affected Users and Potentially Affected Users
The impact of an outage can be directly related to the number of
affected users. Defining "affected user" can be extremely
difficult, so it is generally more useful to make an
order-of-magnitude estimate than to try to calculate the exact
number of users.
The number of potentially affected users can also provide another
useful datapoint. For example, if an email system that serves
10,000 clients is to be made unavailable for scheduled
maintenance, and it is expected that 500 of those clients would
access the service at midnight vs 5,000 at noon, then a midnight
downtime potentially affects 5% of the userbase instead of 50%.
- External vs Internal Availability
When a service is provided via redundant components, the failure
of a single component is unlikely to result in a downtime visible
to the client. The external availability is what the client sees,
whereas the internal availability is what the service provider
sees. This is the example provided in the introduction.
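The "combinational operation" mentioned under Services and Components is not spelled out in the text. One common choice, shown here only as a sketch, is the series/parallel model from reliability engineering: required components multiply, and redundant components fail together only when all of them fail. It assumes component failures are independent, which real systems often violate. The per-component figures below (0.99 per web server, 0.999 for the load balancer) are hypothetical and are not taken from the example in the introduction.

```python
from math import prod


def serial(availabilities):
    """Every component is required: up only when all of them are up."""
    return prod(availabilities)


def redundant(availabilities):
    """Any one component suffices: down only when all of them are down."""
    return 1.0 - prod(1.0 - a for a in availabilities)


# Hypothetical figures, expressed as fractions rather than percentages.
web_server = 0.99
load_balancer = 0.999

# Three interchangeable web servers behind one non-redundant load balancer.
pool = redundant([web_server] * 3)
web_service = serial([load_balancer, pool])

print(f"server pool: {pool:.6f}")
print(f"web service: {web_service:.6f}")
```

Under these assumptions the redundant pool is far more available than any single server, and the non-redundant load balancer ends up dominating the result.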
With these definitions, it is possible to identify several useful
availability calculations.
- Absolute External Availability
The availability as viewed by the clients of the service,
expressed as percent available over a specified time interval.
In the example, the web service has an absolute external
availability of 100% over 1 day, since there was no client-visible
outage.
- Absolute Internal Availability
The availability as viewed by the administrators of the service.
In the example, the load balancer and two of the web servers each
have an absolute internal availability of 100% over 1 day, while
the server that crashed has an absolute internal availability of
98.611% over 1 day.
- Absolute Scheduled External Availability
The availability as viewed by the clients of the service,
excluding any scheduled maintenance window.
For example, if a 1 hour outage is scheduled for a service, and it
is up for the other 23 hours of the day, then the service has an
absolute scheduled external availability of 100% (compared to an
absolute external availability of 95.833%) for the day. If,
however, the scheduled maintenance exceeded its window by fifteen
minutes, then the absolute scheduled external availability drops
to 98.913% for the day, reflecting availability for 1365 of the 1380
scheduled available minutes.
- Absolute Scheduled Internal Availability
The availability as viewed by the administrators of the service,
excluding any scheduled maintenance window.
- Relative Unavailability
Relative unavailability attempts to measure the impact of an
outage. Because all relative calculations involve estimates of
the number of affected users, they are all external calculations.
Relative unavailability is defined as the duration of an outage
multiplied by the number of affected users.
For example, an outage that is externally visible for 30 minutes
and affects O(1000) clients has a relative unavailability of 30,000
client-minutes, or more compactly has a magnitude of O(10^4).
- Relative Unscheduled Unavailability
The same as relative unavailability, except excluding scheduled
outages.
- Potential Unavailability
For a given outage, the fraction of the client base that is affected.
If O(100) out of O(10000) users are affected by an outage, then
the potential unavailability of that outage is 1%.
Potential unavailability can be fixed, when the number of affected
users is unchanged regardless of the outage duration, or it can
increase over time, when the number of affected users increases
the longer an outage continues.
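The scheduled and relative calculations defined above can be sketched in Python as follows. The function and parameter names are illustrative only, and the user counts are the same order-of-magnitude estimates used in the text.

```python
def scheduled_availability_pct(total_min, window_min, unscheduled_down_min=0):
    """Availability over only the minutes the service was scheduled to be up."""
    scheduled_up = total_min - window_min
    if scheduled_up <= 0:
        raise ValueError("the maintenance window covers the whole interval")
    return 100.0 * (scheduled_up - unscheduled_down_min) / scheduled_up


def relative_unavailability(outage_min, affected_users):
    """Outage duration multiplied by affected users, in client-minutes."""
    return outage_min * affected_users


def potential_unavailability_pct(affected_users, total_users):
    """Share of the client base affected by a given outage."""
    return 100.0 * affected_users / total_users


DAY = 24 * 60  # 1440 minutes

# A 1 hour scheduled window that finishes on time, then one that overruns
# by 15 minutes.
on_time = scheduled_availability_pct(DAY, window_min=60)
overrun = scheduled_availability_pct(DAY, window_min=60,
                                     unscheduled_down_min=15)

print(f"{on_time:.3f}%")                           # 100.000%
print(f"{overrun:.3f}%")                           # 98.913% (1365 of 1380)
print(relative_unavailability(30, 1000))           # 30000 client-minutes
print(potential_unavailability_pct(100, 10_000))   # 1.0
```

The internal variants use the same arithmetic, with downtime taken from what the administrators observe rather than what clients observe.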
Some more examples:
- A service becomes unavailable, but nobody notices because it fails
from 03:10 until 03:25. The absolute external availability of the
service is 98.958% for the day, but the relative unavailability of
the service is 0 client-minutes (15 minutes times O(0) clients).
- A service is slow from 14:30 until 14:45, but usable. However,
20% of the 10,000 active clients of the service give up in frustration
without being able to use it. The absolute external availability
of the system remains 100% for the day, but the relative
unavailability of the service is 15 minutes times O(1000) clients,
or O(10^4) client-minutes.
- A site runs three services, which have been available for 43,200,
43,140, and 43,170 minutes of the 30-day month. The services have
absolute external availabilities of 100%, 99.861%, and 99.931%
respectively for the month. The site has an absolute external
availability of 99.931% for the month (129,510 of 129,600
service-minutes).
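The site-wide figure in the last example pools service-minutes across all three services. A short sketch of that arithmetic, with illustrative variable names:

```python
MINUTES_IN_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

# Minutes each of the three services was available during the month.
up_minutes = [43_200, 43_140, 43_170]

per_service = [100.0 * up / MINUTES_IN_MONTH for up in up_minutes]

# Site availability: total service-minutes up over total service-minutes.
site = 100.0 * sum(up_minutes) / (MINUTES_IN_MONTH * len(up_minutes))

for pct in per_service:
    print(f"service: {pct:.3f}% for the month")  # 100.000, 99.861, 99.931
print(f"site:    {site:.3f}% for the month")     # 99.931 (129,510 of 129,600)
```

Because every service is measured over the same interval, the pooled figure equals the plain mean of the three per-service percentages. It only coincidentally matches the third service's 99.931%.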
Why Calculate Availability?
Different availability calculations can provide different information,
each of which can be used to help plan workflows and upgrades, to help
schedule outages, and to help provide service reliability information
for budgeting and public relations reasons.
For example, absolute external availability is useful for public
relations, absolute internal availability is useful for detecting
problematic equipment and software, and relative unavailability and
potential unavailability are useful for measuring the user impact of
planned work.
Tracking Data For Availability Calculations
The specific mechanism for tracking the data is less important than
ensuring the data is standardized, to make later analysis of it
simpler. For the first version of this project, a simple text file
will be used, containing the following fields:
Data | Description | Possible Values |
Planned? | Whether or not the outage was scheduled | U[nplanned]/P[lanned] |
Core? | Whether or not the service affected is considered "core" | Y[es]/N[o] |
Start Time | Time outage began | YYYYMMDD HH:MM |
Interval | Outage duration | HH:MM |
Users Affected | Estimated number of affected users | 0, <10, <100, <1000, <10000, 10000+ |
Services Affected | Service or services unavailable due to outage | (list to be determined by experience) |
Underlying Reason | Why the outage happened | Misconfiguration, Hardware Failure, Preventative Maintenance, Corrective Maintenance (additional entries to be determined) |
This data will be used primarily to calculate external availability
statistics.
Additionally, data from the system monitor's history logs may be used
to determine internal availability statistics.
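The text does not specify the layout of the text file, so the following is only a sketch of how one record might be read. It assumes one outage per line, with the seven fields in the order listed above, separated by `|`, and with multiple services separated by commas. The `Outage` class, the `parse_outage` function, and the sample record are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Outage:
    planned: bool
    core: bool
    start: datetime
    duration: timedelta
    users_affected: str   # order-of-magnitude bucket, e.g. "<100"
    services: List[str]
    reason: str


def parse_outage(line: str) -> Outage:
    """Parse one record of the assumed '|'-delimited layout."""
    fields = [field.strip() for field in line.split("|")]
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    planned, core, start, interval, users, services, reason = fields
    if planned not in ("U", "P"):
        raise ValueError(f"Planned? must be U or P, got {planned!r}")
    if core not in ("Y", "N"):
        raise ValueError(f"Core? must be Y or N, got {core!r}")
    hours, minutes = interval.split(":")
    return Outage(
        planned=(planned == "P"),
        core=(core == "Y"),
        start=datetime.strptime(start, "%Y%m%d %H:%M"),
        duration=timedelta(hours=int(hours), minutes=int(minutes)),
        users_affected=users,
        services=[name.strip() for name in services.split(",")],
        reason=reason,
    )


# Hypothetical sample record, not real outage data.
record = parse_outage("U|Y|20240115 03:10|00:15|<100|web|Hardware Failure")
print(record.start, record.duration, record.services)
```

Keeping the users-affected value as a bucket string, rather than converting it to a number, preserves the order-of-magnitude intent of that field.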