The database manager can collect, maintain, and use "frequent-value statistics" and "quantiles", two types of statistics that estimate, in a concise way, the distribution of the data values in a column. Use of these statistics by the optimizer can lead to significantly more accurate estimates of the number of rows in a column that satisfy given equality or range predicates. These more accurate estimates in turn increase the likelihood that the optimizer will choose an optimal plan.

You may collect statistics about the distribution of these data values by using the WITH DISTRIBUTION clause on the RUNSTATS command. While collecting these additional statistics results in additional overhead for the RUNSTATS utility, the SQL compiler can use this information to help ensure the best access plan is chosen.

In some cases, the database manager will not collect distribution statistics and no error will be returned. For example:

- The *num_freqvalues* and *num_quantiles* configuration parameters are set to zero (0), indicating that you do not want to collect distribution statistics.
- The distribution of the data is known without the use of distribution statistics. For example, a column in which no data value appears more than once, that is, each data value in the column is unique.
- The data type is one for which statistics are not collected. That is, the column is defined using a long field or large object data type.
- In the case of quantiles, there is only one non-NULL value in the column.

Distribution statistics are exact for the first column of indexes. For each additional column, the database manager uses hashing and sampling techniques to estimate the distribution statistics because calculating exact statistics would require too much time and memory to be practical. These techniques are accepted statistical methods with accepted degrees of accuracy.

The following topics provide information to help you understand and use these distribution statistics:

- "Understanding Distribution Statistics"
- "When Should You Use Distribution Statistics?"
- "How Many Statistics Should You Keep?"
- "How Does the Optimizer Use Distribution Statistics?"
- "Modelling Production Databases"
- "Rules for Updating Distribution Statistics for Columns"

For a fixed number N>=1, the *N most frequent values* in a column
consist of the data value having the highest frequency (that is, number of
duplicates), the data value having the second highest frequency, and so forth,
down to the data value having the Nth highest frequency. The
corresponding *frequent-value statistics* consist of these "N"
data values, together with the frequencies of these values in the
column.

The *K-quantile* for a column is the smallest data value,
`V`, such that at least "K" rows have data values less than or
equal to `V`. A K-quantile can be computed by sorting the rows
in the column according to increasing data values; the K-quantile is the data
value in the Kth row of the sorted column.

For example, consider the following column of data:

C1
--
B
E
Y
B
F
G
E
A
J
K
E
L

This column can be sorted to obtain the following ordered values:

C1'
---
A
B
B
E
E
E
F
G
J
K
L
Y

There are nine distinct data values in column C1. For N = 2, the frequent value statistics are:

SEQNO  COLVALUE  VALCOUNT
-----  --------  --------
    1  E                3
    2  B                2

If the number of quantiles being collected is 5 (see "Number of Quantiles for Columns (num_quantiles)"), then the K-quantiles for this column for K = 1, 3, 6, 9, and 12 are:

SEQNO  COLVALUE  VALCOUNT
-----  --------  --------
    1  A                1
    2  B                3
    3  E                6
    4  J                9
    5  Y               12

In this example, the 6-quantile is equal to E since the sixth row in the sorted column has a data value equal to E (and 6 rows in the original column have data values less than or equal to E).

The same quantile value can occur more than once if it is a common value. A maximum of two quantiles are stored for a given value: the first has a VALCOUNT that gives the number of rows strictly less than COLVALUE, and the second gives the number of rows less than or equal to COLVALUE.
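The definitions above can be sketched in Python. This is an illustrative sketch only (the database manager computes these internally during RUNSTATS, using hashing and sampling for non-leading index columns); the function names are hypothetical, but the results match the SEQNO/COLVALUE/VALCOUNT rows shown above.

```python
from collections import Counter

def frequent_values(column, n):
    """Return the N most frequent (value, frequency) pairs,
    keeping only values that occur more than once."""
    counts = Counter(column)
    return [(v, f) for v, f in counts.most_common(n) if f > 1]

def k_quantile(column, k):
    """The K-quantile: the smallest value V such that at least K rows
    have values less than or equal to V (the value in the Kth row of
    the sorted column)."""
    ordered = sorted(column)
    return ordered[k - 1]

c1 = list("BEYBFGEAJKEL")
print(frequent_values(c1, 2))                         # [('E', 3), ('B', 2)]
print([k_quantile(c1, k) for k in (1, 3, 6, 9, 12)])  # ['A', 'B', 'E', 'J', 'Y']
```

In DB2, the corresponding catalog statistics carry the same SEQNO, COLVALUE, and VALCOUNT shape shown in the tables above.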

To decide whether distribution statistics should be kept for a given table, two factors should be considered:

- The use of static or dynamic SQL.
Distribution statistics are most useful for dynamic SQL and static SQL that does not use host variables. When using SQL with host variables, the optimizer makes limited use of distribution statistics.

- The lack of uniformity in the data distributions.
Keeping distribution statistics is advisable if at least one column in the table has a highly "non-uniform" distribution of data and the column appears frequently in equality or range predicates; that is, in clauses such as the following:

WHERE C1 = KEY;
WHERE C1 IN (KEY1, KEY2, KEY3);
WHERE (C1 = KEY1) OR (C1 = KEY2) OR (C1 = KEY3);
WHERE C1 <= KEY;
WHERE C1 BETWEEN KEY1 AND KEY2;

There can be two types of non-uniformity in a data distribution, possibly occurring together:

- One type of non-uniformity occurs when the data, instead of being evenly
spread out between the highest and lowest data value, is clustered in some
sub-interval, as in the following column, where the data is clustered in the
range (5,10):
C1
-----
  0.0
  5.1
  6.3
  7.1
  8.2
  8.4
  8.5
  9.1
 93.6
100.0

It can be useful to keep quantiles when this type of non-uniformity is present.

The following example shows a query that can be used to help determine whether a high degree of non-uniformity exists in a column.

SELECT C1, COUNT(*) AS OCCURRENCES
FROM T1
GROUP BY C1
ORDER BY OCCURRENCES DESC;

- Another type of non-uniformity occurs when certain data values have a much
higher frequency than other data values, as in a column having data values
with the following frequencies:
Data Value  Frequency
----------  ---------
    20          5
    30         10
    40         10
    50         25
    60         25
    70         20
    80          5

It can be useful to keep both quantiles and frequent-value statistics when this type of non-uniformity is present.


You may collect distribution statistics by using the WITH DISTRIBUTION
clause on the RUNSTATS command, or by specifying `D`, `E`,
or `A` for the `statsopt` parameter when calling the
RUNSTATS API. For more information, refer to the Command Reference or the API Reference manuals.

Keeping a large number of column distribution statistics can lead to improved selection of access plans by the optimizer, but the cost of collecting these statistics and compiling your queries increases accordingly. The size of the statistics heap (see "Statistics Heap Size (stat_heap_sz)") may place limitations on the number of statistics that can be computed and stored.

When distribution statistics are requested, the database manager stores a default of the 10 most frequent values for a column. Keeping between 10 and 100 frequent values should suffice for most practical situations. Ideally, enough frequent-value statistics should be retained so that the frequencies of the remaining values are either approximately equal to each other or negligible compared to the frequencies of the most frequent values.

To set the number of frequent values to collect, use the
*num_freqvalues* configuration parameter, as described in "Number of Frequent Values Retained (num_freqvalues)". The database manager may collect fewer frequent-value statistics than this number, because these statistics are collected only for data values that occur more than once. If you are collecting only quantile statistics, this parameter can be set to zero.

When distribution statistics are requested, the database manager stores a default of 20 quantiles for a column. This value guarantees a maximum estimation error of approximately 2.5% for any simple single-sided range predicate (>, >=, <, or <=), and a maximum error of 5% for any BETWEEN predicate. A rough rule of thumb for determining the number of quantiles is:

- Determine the maximum error that is tolerable in estimating the number of rows of any range query, as a percentage, P
- The number of quantiles should be approximately 100/P if the predicate is a BETWEEN predicate, and 50/P if the predicate is any other type of range predicate (<, <=, >, or >=).

For example, 25 quantiles should result in a maximum estimate error of 4% for BETWEEN predicates and of 2% for ">" predicates. In general, at least 10 quantiles should be kept, and more than 50 quantiles should be necessary only for extremely non-uniform data.
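The rule of thumb above reduces to simple arithmetic. The following sketch (hypothetical helper names, illustrative only) computes the quantile count for a tolerable error P, and the implied worst-case error for a given quantile count:

```python
import math

def quantiles_needed(max_error_pct, between=False):
    """Approximate number of quantiles for a tolerable estimation error
    of P percent: 100/P for BETWEEN predicates, 50/P for other range
    predicates (<, <=, >, >=)."""
    return math.ceil((100 if between else 50) / max_error_pct)

def max_error_pct(num_quantiles, between=False):
    """Inverse: worst-case error implied by a given quantile count."""
    return (100 if between else 50) / num_quantiles

print(quantiles_needed(2.5))             # 20 quantiles for 2.5% on one-sided ranges
print(max_error_pct(25, between=True))   # 4.0 (% error for BETWEEN with 25 quantiles)
print(max_error_pct(25))                 # 2.0 (% error for ">" with 25 quantiles)
```

These match the defaults and examples in the text: 20 quantiles gives roughly 2.5% for one-sided ranges and 5% for BETWEEN, while 25 quantiles gives 4% and 2% respectively.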

To set the number of quantiles, use the *num_quantiles*
configuration parameter as described in "Number of Quantiles for Columns (num_quantiles)". If collecting only frequent value statistics, this parameter can be
set to zero. Setting this parameter to "1" will also result in no
quantile statistics being gathered since the entire range of values will fit
in one quantile.

Why collect and store distribution statistics? The answer lies in the fact that an optimizer needs to estimate the number of rows in a column that satisfy an equality or range predicate in order to select the least expensive access plan. The more accurate the estimate, the greater the likelihood that the optimizer will choose the optimal access plan. For example, consider the query

SELECT C1, C2
FROM TABLE1
WHERE C1 = 'NEW YORK' AND C2 <= 10

and suppose that there is an index on C1 and an index on C2. One
possible access plan is to use the index on C1 to retrieve all rows with
`C1 = 'NEW YORK'` and then check each retrieved row to see
if `C2 <= 10`. An alternative plan is to use the index on
C2 to retrieve all rows with `C2 <= 10` and then check each
retrieved row to see if `C1 = 'NEW YORK'`.
Typically, the primary cost in executing the above query is the cost of
retrieving the rows, so it is desirable to choose the plan that
requires the minimum number of retrievals. To choose the best plan, it
is necessary to estimate the number of rows that satisfy each
predicate.

When you do not request distribution statistics, the optimizer maintains
only the second-highest data value (HIGH2KEY), second-lowest data value
(LOW2KEY), number of distinct values (COLCARD), and number of rows (CARD) for
a column. The number of rows that satisfy an equality or range
predicate is then estimated under the assumption that the frequencies of the
data values in a column are all equal and the data values are evenly spread
out over the interval (LOW2KEY, HIGH2KEY). Specifically, the number of
rows satisfying an equality predicate `C1 = KEY` is estimated as
CARD/COLCARD, and the number of rows satisfying a range predicate `C1
BETWEEN KEY1 AND KEY2` is estimated as:

    KEY2 - KEY1
------------------  x  CARD        (1)
HIGH2KEY - LOW2KEY

These estimates are accurate only when the true distribution of data values in a column is reasonably uniform. When distribution statistics are unavailable and either the frequencies of the data values differ widely from each other or the data values are clustered in a few sub-intervals of the interval (LOW2KEY, HIGH2KEY), the estimates can be off by orders of magnitude and the optimizer may choose a less than optimal access plan.
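The uniform-assumption estimates can be written out directly. In this sketch the column statistics (CARD = 100, COLCARD = 7, LOW2KEY = 30, HIGH2KEY = 70) are hypothetical values chosen only to exercise the formulas:

```python
def uniform_equality_estimate(card, colcard):
    """Rows satisfying C1 = KEY, assuming all values are equally
    frequent: CARD / COLCARD."""
    return card / colcard

def uniform_range_estimate(key1, key2, low2key, high2key, card):
    """Formula (1): rows satisfying C1 BETWEEN KEY1 AND KEY2, assuming
    values are evenly spread over (LOW2KEY, HIGH2KEY)."""
    return (key2 - key1) / (high2key - low2key) * card

# Hypothetical column: 100 rows, 7 distinct values, LOW2KEY=30, HIGH2KEY=70
print(uniform_equality_estimate(100, 7))            # about 14.3 rows per value
print(uniform_range_estimate(20, 30, 30, 70, 100))  # 25.0
```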

When distribution statistics are available, the errors described above can be greatly reduced by using frequent-value statistics to compute the number of rows that satisfy an equality predicate and using frequent-value statistics and quantiles to compute the number of rows that satisfy a range predicate.

**Example of Impact on Equality Predicates:**

Consider first a predicate of the form `C1 = KEY`. If
`KEY` is one of the N most frequent values, then the optimizer simply
uses the frequency of `KEY` that is stored in the catalog. If
`KEY` is not one of the N most frequent values, the optimizer
estimates the number of rows that satisfy the predicate under the assumption
that the (COLCARD - N) non-frequent values have a uniform distribution.
That is, the number of rows is estimated as:

CARD - NUM_FREQ_ROWS
--------------------        (2)
     COLCARD - N

where NUM_FREQ_ROWS is the total number of rows with a value equal to one of the N most frequent values.

For example, consider a column (`C`) for which the frequency of
the data values is as follows:

Data Value  Frequency
----------  ---------
     1          2
     2          3
     3         40
     4          4
     5          1

Suppose that frequent-value statistics based on only the most frequent
value (that is, N = 1) are available. For this column, CARD = 50 and
COLCARD = 5. For the predicate `C = 3`, exactly 40 rows
satisfy it. Assuming a uniform data distribution, the number of rows
that satisfy the predicate is estimated as 50/5 = 10, an error of -75%.
Using frequent-value statistics, the number of rows is estimated as 40, with
no error.

Similarly, 2 rows satisfy the predicate `C = 1`. Without
frequent-value statistics, the number of rows that satisfy the predicate is
estimated as 10, an error of 400%. You may use the following formula to
calculate the estimation error (as a percentage):

estimated rows - actual rows
----------------------------  x  100
        actual rows

Using the frequent-value statistics (N = 1), the optimizer estimates the number of rows containing this value using formula (2) given above:

(50 - 40)
---------  =  2.5, or approximately 3
 (5 - 1)

and the error is reduced by an order of magnitude, as shown below:

3 - 2
-----  =  50%
  2
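The equality-predicate estimates in this example can be reproduced with a short sketch. The function name is hypothetical; the column data is the five-value column above, and formula (2) is applied only when the search key is not among the N most frequent values:

```python
from collections import Counter

def equality_estimate(key, freq_stats, card, colcard):
    """Estimate rows satisfying C = KEY. If KEY is a stored frequent
    value, use its recorded frequency; otherwise apply formula (2)."""
    if key in freq_stats:
        return freq_stats[key]
    n = len(freq_stats)
    num_freq_rows = sum(freq_stats.values())
    return (card - num_freq_rows) / (colcard - n)

# Column from the example: value 3 occurs 40 times, and so on
column = [1] * 2 + [2] * 3 + [3] * 40 + [4] * 4 + [5] * 1
card, colcard = len(column), len(set(column))      # 50 rows, 5 distinct values
freq_stats = dict(Counter(column).most_common(1))  # N = 1: {3: 40}

print(equality_estimate(3, freq_stats, card, colcard))  # 40 (exact)
print(equality_estimate(1, freq_stats, card, colcard))  # 2.5 (shown as ~3 above)
```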

**Example of Impact on Range Predicates:**

The number of rows that satisfy a range predicate can be estimated using
quantiles, as illustrated by the following examples. Consider a column
(`C`) given by:

C
-------
  0.0
  5.1
  6.3
  7.1
  8.2
  8.4
  8.5
  9.1
 93.6
100.0

and suppose that K-quantiles are available for K = 1, 4, 7, and 10:

 K   K-quantile
---  ----------
  1       0.0
  4       7.1
  7       8.5
 10     100.0

First consider the predicate `C <= 8.5`. For the
data given above, exactly 7 rows satisfy this predicate. Assuming a
uniform data distribution and using formula (1) from above, with KEY1 replaced
by LOW2KEY, the number of rows that satisfy the predicate is estimated
as:

 8.5 - 5.1
-----------  x  10  *=  0
93.6 - 5.1

where `*=` means "approximately equal to". The error
in this estimation is approximately -100%.

Using quantiles, the number of rows that satisfy this same predicate
(`C <= 8.5`) is estimated by locating 8.5 as one of
the K-quantile values and using the corresponding value of K, namely 7, as the
estimate. In this case, the error is reduced to 0.
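When the search key is itself a stored quantile value, the estimate is a direct lookup, as this illustrative sketch shows (hypothetical function name; quantiles are (value, K) pairs from the table above):

```python
def rows_le(key, quantiles):
    """Estimate rows with C <= key when key is a stored quantile value:
    return the K for which key is the K-quantile."""
    for value, k in quantiles:
        if value == key:
            return k
    raise ValueError("key is not a stored quantile value")

quantiles = [(0.0, 1), (7.1, 4), (8.5, 7), (100.0, 10)]
print(rows_le(8.5, quantiles))  # 7, matching the exact row count
```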

Now consider the predicate `C <= 10`. Exactly 8 rows
satisfy this predicate. Unlike the previous example, the value 10 is
not one of the stored K-quantiles. Assuming a uniform data distribution
and using formula (1), the number of rows that satisfy the predicate is
estimated as 1, an error of -86%.

Using quantiles, the optimizer estimates the number of rows that satisfy
the predicate as `r_1 + r_2`, where `r_1` is the number of
rows satisfying the predicate `C <= 8.5` and `r_2`
is the number of rows satisfying the predicate `C > 8.5 AND C <= 10`.
As in the above example, `r_1 = 7`. To estimate `r_2`, the optimizer
uses linear interpolation:

        10.0 - 8.5
r_2 *= ------------  x  (# rows with value > 8.5 and <= 100.0)
       100.0 - 8.5

        10.0 - 8.5
     = ------------  x  (10 - 7)
       100.0 - 8.5

    *= 0

The final estimate is `r_1 + r_2 *= 7`, and the absolute error is
reduced by a factor of 7.

The reason that the use of quantiles improves the accuracy of the estimates in the above examples is that the real data values are "clustered" in the range 5 - 10, but the standard estimation formulas assume that the data values are spread out evenly between 0 and 100.

The use of quantiles also improves accuracy when there are significant differences in the frequencies of different data values. Consider a column having data values with the following frequencies:

Data Value  Frequency
----------  ---------
    20          5
    30          5
    40         15
    50         50
    60         15
    70          5
    80          5

Suppose that K-quantiles are available for K = 5, 25, 75, 95, and 100:

  K   K-quantile
----  ----------
   5      20
  25      40
  75      50
  95      70
 100      80

Also suppose that frequent value statistics are available based on the 3 most frequent values.

Consider the predicate `C BETWEEN 20 AND 30`. From the
distribution of the data values, you can see that exactly 10 rows satisfy this
predicate. Assuming a uniform data distribution and using formula (1),
the number of rows that satisfy the predicate is estimated as:

30 - 20
-------  x  100  =  25
70 - 30

which has an error of 150%.

Using frequent-value statistics and quantiles, the number of rows that
satisfy the predicate is estimated as `r_1 + r_2`, where
`r_1` is the number of rows that satisfy the predicate `(C =
20)` and `r_2` is the number of rows that satisfy the predicate
`C > 20 AND C <= 30`. Using formula (2), `r_1`
is estimated as:

100 - 80
--------  =  5
  7 - 3

Using linear interpolation, `r_2` is estimated as:

30 - 20
-------  x  (# rows with value > 20 and <= 40)
40 - 20

   30 - 20
=  -------  x  (25 - 5)
   40 - 20

=  10,

yielding a final estimate of 15 and reducing the error by a factor of 3.
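This combined frequent-value/quantile estimate can be sketched as follows, using the quantile pairs and frequent-value rows from the tables in this example (hypothetical function name; the optimizer's internal computation may differ in detail):

```python
def interpolate_range(key1, key2, quantiles):
    """Estimate rows with KEY1 < C <= KEY2 by linear interpolation,
    where KEY1 is a stored quantile value and KEY2 falls before the
    next stored quantile. quantiles: sorted (value, K) pairs, with
    K = number of rows <= value."""
    v_lo, k_lo = next(q for q in quantiles if q[0] == key1)
    v_hi, k_hi = next(q for q in quantiles if q[0] > key2)
    return (key2 - key1) / (v_hi - v_lo) * (k_hi - k_lo)

quantiles = [(20, 5), (40, 25), (50, 75), (95, 95), (100, 100)]
quantiles = [(20, 5), (40, 25), (50, 75), (70, 95), (80, 100)]
freq = {50: 50, 40: 15, 60: 15}  # 3 most frequent values
card, colcard, n = 100, 7, 3

# C BETWEEN 20 AND 30: r_1 = rows with C = 20 (formula (2), since 20 is
# not a frequent value), r_2 = rows with 20 < C <= 30 (interpolation)
r1 = (card - sum(freq.values())) / (colcard - n)  # (100 - 80) / (7 - 3) = 5.0
r2 = interpolate_range(20, 30, quantiles)         # (30-20)/(40-20) * (25-5) = 10.0
print(r1 + r2)                                    # 15.0
```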