## Collecting and Using Distribution Statistics

The database manager can collect, maintain, and use "frequent-value statistics" and "quantiles", two types of statistics that estimate, in a concise way, the distribution of the data values in a column. Use of these statistics by the optimizer can lead to significantly more accurate estimates of the number of rows in a column that satisfy given equality or range predicates. These more accurate estimates in turn increase the likelihood that the optimizer will choose an optimal plan.

You may collect statistics about the distribution of these data values by using the WITH DISTRIBUTION clause on the RUNSTATS command. While collecting these additional statistics results in additional overhead for the RUNSTATS utility, the SQL compiler can use this information to help ensure the best access plan is chosen.

In some cases, the database manager will not collect distribution statistics and no error will be returned. For example:

• The num_freqvalues and num_quantiles configuration parameters are set to zero (0) to indicate that you do not want to collect distribution statistics. For more information about these parameters, see:
• The distribution of the data is known without the use of distribution statistics. For example, a column that does not have any data value appearing more than once, that is, each data value in the column is unique.
• The data type is one for which statistics are not collected. That is, the column is defined using a long field or large object data type.
• In the case of quantiles, there is only one non-NULL value in the column.

Distribution statistics are exact for the first column of indexes. For each additional column, the database manager uses hashing and sampling techniques to estimate the distribution statistics because calculating exact statistics would require too much time and memory to be practical. These techniques are accepted statistical methods with accepted degrees of accuracy.

The following topics provide information to help you understand and use these distribution statistics:

### Understanding Distribution Statistics

For a fixed number N>=1, the N most frequent values in a column consist of the data value having the highest frequency (that is, number of duplicates), the data value having the second highest frequency, and so forth, down to the data value having the Nth highest frequency. The corresponding frequent-value statistics consist of these "N" data values, together with the frequencies of these values in the column.

The K-quantile for a column is the smallest data value, V, such that at least "K" rows have data values less than or equal to V. A K-quantile can be computed by sorting the rows in the column according to increasing data values; the K-quantile is the data value in the Kth row of the sorted column.

For example, consider the following column of data:

```   C1
--
B
E
Y
B
F
G
E
A
J
K
E
L
```

This column can be sorted to obtain the following ordered values:

```   C1'
--
A
B
B
E
E
E
F
G
J
K
L
Y
```

There are nine distinct data values in column C1. For N = 2, the frequent value statistics are:

```   SEQNO    COLVALUE     VALCOUNT
-----    ---------    --------
1          E            3
2          B            2
```

If the number of quantiles being collected is 5 (see "Number of Quantiles for Columns (num_quantiles)"), then the K-quantiles for this column for K = 1, 3, 6, 9, and 12 are:

```   SEQNO    COLVALUE     VALCOUNT
-----    ---------    --------
1          A            1
2          B            3
3          E            6
4          J            9
5          Y           12
```

In this example, the 6-quantile is equal to E since the sixth row in the sorted column has a data value equal to E (and 6 rows in the original column have data values less than or equal to E).

The same quantile value may occur more than once, if it is a common value. A maximum of two quantiles will be stored for a given value. The first of these two quantiles has a COLCOUNT that gives the number of rows strictly less than COLVALUE, and the second of the two quantiles gives the number of rows less than or equal to COLVALUE.

### When Should You Use Distribution Statistics?

To decide whether distribution statistics should be kept for a given table, two factors should be considered:

1. The use of static or dynamic SQL.

Distribution statistics are most useful for dynamic SQL and static SQL that does not use host variables. When using SQL with host variables, the optimizer makes limited use of distribution statistics.

2. The lack of uniformity in the data distributions.

Keeping distribution statistics is advisable if at least one column in the table has a highly "non-uniform" distribution of data and the column appears frequently in equality or range predicates; that is, in clauses such as the following:

```   WHERE C1 = KEY;
WHERE C1 IN (KEY1, KEY2, KEY3);
WHERE (C1 = KEY1) OR (C1 = KEY2) OR (C1 = KEY3);
WHERE C1 <= KEY;
WHERE C1 BETWEEN KEY1 AND KEY2;
```

There can be two types of non-uniformity in a data distribution, possibly occurring together:

• One type of non-uniformity occurs when the data, instead of being evenly spread out between the highest and lowest data value, is clustered in some sub-interval, as in the following column, where the data is clustered in the range (5,10):
```     C1
-----
0.0
5.1
6.3
7.1
8.2
8.4
8.5
9.1
93.6
100.0
```

It can be useful to keep quantiles when this type of non-uniformity is present.

The following example shows a query that can be used to help determine whether a high degree of non-uniformity exists in a column.

```   SELECT C1, COUNT(*) AS OCCURRENCES
FROM T1
GROUP BY C1
ORDER BY OCCURRENCES DESC;
```
• Another type of non-uniformity occurs when certain data values have a much higher frequency than other data values, as in a column having data values with the following frequencies:
```   Data Value   Frequency
----------   ---------
20           5
30          10
40          10
50          25
60          25
70          20
80           5
```

It can be useful to keep both quantiles and frequent-value statistics when this type of non-uniformity is present.

You may collect distribution statistics by using the WITH DISTRIBUTION clause on the RUNSTATS command, or by specifying D, E, or A for the statsopt parameter when calling the RUNSTATS API. For more information, refer to the Command Reference or the API Reference manuals.

### How Many Statistics Should You Keep?

Keeping a large number of column distribution statistics can lead to improved selection of access plans by the optimizer, but the cost of collecting these statistics and compiling your queries increases accordingly. The size of the statistics heap (see "Statistics Heap Size (stat_heap_sz)") may place limitations on the number of statistics that can be computed and stored.

When distribution statistics are requested, the database manager stores a default of the 10 most frequent values for a column. Keeping between 10 and 100 frequent values should suffice for most practical situations. Ideally, enough frequent-value statistics should be retained so that the frequencies of the remaining values are either approximately equal to each other or negligible compared to the frequencies of the most frequent values.

To set the number of frequent values to collect, use the num_freqvalues configuration parameter, as described in "Number of Frequent Values Retained (num_freqvalues)". The database manager may collect less than this number of frequent value statistics, because these statistics will only be collected for data values that occur more than once. If collecting only quantile statistics, this parameter can be set to zero.

When distribution statistics are requested, the database manager stores a default of 20 quantiles for a column. This value guarantees a maximum estimation error of approximately 2.5% for any simple single-sided range predicate (>, >=, <, or <=), and a maximum error of 5% for any BETWEEN predicate. A rough rule of thumb for determining the number of quantiles is:

• Determine the maximum error that is tolerable in estimating the number of rows of any range query, as a percentage, P
• The number of quantiles should be approximately 100/P if the predicate is a BETWEEN predicate, and 50/P if the predicate is any other type of range predicate (<, <=, >, or >=).

For example, 25 quantiles should result in a maximum estimate error of 4% for BETWEEN predicates and of 2% for ">" predicates. In general, at least 10 quantiles should be kept, and more than 50 quantiles should be necessary only for extremely non-uniform data.

To set the number of quantiles, use the num_quantiles configuration parameter as described in "Number of Quantiles for Columns (num_quantiles)". If collecting only frequent value statistics, this parameter can be set to zero. Setting this parameter to "1" will also result in no quantile statistics being gathered since the entire range of values will fit in one quantile.

### How Does the Optimizer Use Distribution Statistics?

Why collect and store distribution statistics? The answer lies in the fact that an optimizer needs to estimate the number of rows in a column that satisfy an equality or range predicate in order to select the least expensive access plan. The more accurate the estimate, the greater the likelihood that the optimizer will choose the optimal access plan. For example, consider the query

```   SELECT C1, C2
FROM TABLE1
WHERE C1 = 'NEW YORK'
AND C2 <= 10
```

and suppose that there is an index on C1 and an index on C2. One possible access plan is to use the index on C1 to retrieve all rows with C1 = 'NEW YORK' and then check each retrieved row to see if C2 <= 10. An alternative plan is to use the index on C2 to retrieve all rows with C2 <= 10 and then check each retrieved row to see if C1 = 'NEW YORK'. Typically, the primary cost in executing the above query is the cost of the retrieving the rows, and so it is desirable to choose the plan the that requires the minimum number of retrievals. To choose the best plan, it is necessary to estimate the number of rows that satisfy each predicate.

When you do not request distribution statistics, the optimizer maintains only the second-highest data value (HIGH2KEY), second-lowest data value (LOW2KEY), number of distinct values (COLCARD), and number of rows (CARD) for a column. The number of rows that satisfy an equality or range predicate is then estimated under the assumption that the frequencies of the data values in a column are all equal and the data values are evenly spread out over the interval (LOW2KEY, HIGH2KEY). Specifically, the number of rows satisfying an equality predicate C1 = KEY is estimated as CARD/COLCARD, and the number of rows satisfying a range predicate C1 BETWEEN KEY1 AND KEY2 is estimated as:

```       KEY2 - KEY1
-------------------  x CARD       (1)
HIGH2KEY - LOW2KEY
```

These estimates are accurate only when the true distribution of data values in a column is reasonably uniform. When distribution statistics are unavailable and either the frequencies of the data values differ widely from each other or the data values are clustered in a few sub-intervals of the interval (LOW_KEY,HIGH_KEY), the estimates can be off by orders of magnitude and the optimizer may choose a less than optimal access plan.

When distribution statistics are available, the errors described above can be greatly reduced by using frequent-value statistics to compute the number of rows that satisfy an equality predicate and using frequent-value statistics and quantiles to compute the number of rows that satisfy a range predicate.

Example of Impact on Equality Predicates:

Consider first a predicate of the form C1 = KEY. If KEY is one of the N most frequent values, then the optimizer simply uses the frequency of KEY that is stored in the catalog. If KEY is not one of the N most frequent values, the optimizer estimates the number of rows that satisfy the predicate under the assumption that the (COLCARD - N) non-frequent values have a uniform distribution. That is, the number of rows is estimated as:

```   CARD - NUM_FREQ_ROWS
--------------------           (2)
COLCARD - N
```

where NUM_FREQ_ROWS is the total number of rows with a value equal to one of the N most frequent values.

For example, consider a column (C) for which the frequency of the data values is as follows:

```   Data Value   Frequency
----------   ---------
1            2
2            3
3           40
4            4
5            1
```

Suppose that frequent-value statistics based on only the most frequent value (that is, N = 1) are available. For this column, CARD = 50 and COLCARD = 5. For the predicate C = 3, exactly 40 rows satisfy it. Assuming a uniform data distribution, the number of rows that satisfy the predicate is estimated as 50/5 = 10, an error of -75%. Using frequent-value statistics, the number of rows is estimated as 40, with no error.

Similarly, 2 rows satisfy the predicate C = 1. Without frequent-value statistics, the number of rows that satisfy the predicate is estimated as 10, an error of 400%. You may use the following formula to calculate the estimation error (as a percentage):

```   estimated rows  - actual rows
-----------------------------  X 100
actual rows
```

Using the frequent value statistics (N = 1), the optimizer will estimate the number of rows containing this value using the formula (2) given above, for example:

```   (50 - 40)
--------- = 3
(5 - 1)
```

and the error is reduced by an order of magnitude as shown below:

```    3 - 2
------- = 50%
2
```

The number of rows that satisfy a range predicate can be estimated using quantiles, as illustrated by the following examples. Consider a column (C) given by:

```     C
-------
0.0
5.1
6.3
7.1
8.2
8.4
8.5
9.1
93.6
100.0
```

and suppose that K-quantiles are available for K = 1, 4, 7, and 10:

```    K   K-quantile
---  ----------
1      0.0
4      7.1
7      8.5
10    100.0
```

First consider the predicate C <= 8.5. For the data given above, exactly 7 rows satisfy this predicate. Assuming a uniform data distribution and using formula (1) from above, with KEY1 replaced by LOW2KEY, the number of rows that satisfy the predicate is estimated as:

```    8.5 - 5.1
---------- x 10 *= 0
93.6 - 5.1
```

where *= means "approximately equal to". The error in this estimation is approximately -100%.

Using quantiles, the number of rows that satisfy this same predicate (C <= 8.5) is estimated by locating 8.5 as one of the K-quantile values and using the corresponding value of K, namely 7, as the estimate. In this case, the error is reduced to 0.

Now consider the predicate C <= 10. Exactly 8 rows satisfy this predicate. Unlike the previous example, the value 10 is not one of the stored K-quantiles. Assuming a uniform data distribution and using formula (1), the number of rows that satisfy the predicate is estimated as 1, an error of -86%.

Using quantiles, the optimizer estimates the number of rows that satisfy the predicate as r_1 + r_2, where r_1 is the number of rows satisfying the predicate C <= 8.5 and r_2 is the number of rows satisfying the predicate C > 8.5 AND C <= 10.. As in the above example, r_1 = 7. To estimate r_2 the optimizer uses linear interpolation:

```          100.0 - 10.0
r_2 *= ------------ x (# rows with value > 8.5 and <= 100.0)
100.0 - 8.5
100.0 - 10.0
= ----------- x (10 - 7)
100.0 - 8.5
*= 3
```

The final estimate is r_1 + r_2 *= 10, and the absolute error is reduced by more than a factor of 3.

The reason that the use of quantiles improves the accuracy of the estimates in the above examples is that the real data values are "clustered" in the range 5 - 10, but the standard estimation formulas assume that the data values are spread out evenly between 0 and 100.

The use of quantiles also improves accuracy when there are significant differences in the frequencies of different data values. Consider a column having data values with the following frequencies:

```   Data Value   Frequency
----------   ---------
20            5
30            5
40           15
50           50
60           15
70            5
80            5
```

Suppose that K-quantiles are available for K = 5, 25, 75, 95, and 100:

```    K     K-quantile
----   ----------
5       20
25       40
75       50
95       70
100       80
```

Also suppose that frequent value statistics are available based on the 3 most frequent values.

Consider the predicate C BETWEEN 20 AND 30. From the distribution of the data values, you can see that exactly 10 rows satisfy this predicate. Assuming a uniform data distribution and using formula (1), the number of rows that satisfy the predicate is estimated as:

```   30 - 20
-------  x 100 = 25
70 - 30
```

which has an error of 150%.

Using frequent-value statistics and quantiles, the number of rows that satisfy the predicate is estimated as r_1 + r_2, where r_1 is the number of rows that satisfy the predicate (C = 20) and r_2 is the number of rows that satisfy the predicate C > 20 AND C <= 30. Using formula (2), r_1 is estimated as:

```   100 - 80
-------- = 5
7 - 3
```

Using linear interpolation, r_2 is estimated as:

```     30 - 20
------- x (# rows with value > 20 and <= 40)
40 - 20
30 - 20
= ------- x (25 - 5)
40 - 20
= 10,
```

yielding a final estimate of 15 and reducing the error by a factor of 3.

[ Top of Page | Previous Page | Next Page | Table of Contents | Index ]

[ DB2 List of Books | Search the DB2 Books ]