Administration Guide

Collecting Statistics Using the RUNSTATS Utility

The RUNSTATS utility updates statistics in the system catalog tables to help with the query optimization process. Without these statistics, the database manager could make a decision that would adversely affect the performance of an SQL statement. The RUNSTATS utility allows you to collect statistics on the data contained in the tables, indexes, or both tables and indexes.

Use the RUNSTATS utility to collect statistics based on both the table and the index data to provide accurate information to the access plan selection process in the following situations:

When a table has been loaded with data, and the appropriate indexes have been created.
When a table has been reorganized with the REORG utility.
When there have been extensive updates, deletions, and insertions that affect a table and its indexes. ("Extensive" in this case may mean that 10 to 20 percent of the table and index data has been affected.)
Before binding application programs whose performance is critical
When comparison with previous statistics is desired. Running statistics on a periodic basis permits the discovery of performance problems at an early stage, as described below.
When the prefetch quantity is changed.
When you have used the REDISTRIBUTE NODEGROUP utility.

When you are working in a partitioned database, collect the statistics related to a table and its indexes by executing the RUNSTATS operation at a single node. (The node at which the utility executes is determined by whether the node at which you issue the command contains table data or not. See "The Database Partition Where RUNSTATS is Executed" for details.) Because the statistics stored in the catalogs are supposed to represent table-level information, the node-level statistics collected by the database manager are multiplied where appropriate by the number of nodes across which the table is partitioned. This provides an approximation of the actual statistics that would be collected by executing RUNSTATS at every node and aggregating these statistics.
Note: The DB2 query optimizer assumes that attribute values (data) are placed equally and evenly across the database partitions of the system. If the placement of data is not equal, you should run this command on a database partition that you think has a representative table distribution.

The Database Partition Where RUNSTATS is Executed

When you invoke RUNSTATS on a table, you must be connected to the database in which the table is stored, but the database partition from which you issue the command does not have to contain a partition for this table:

If you issue RUNSTATS from a database partition that contains a partition for the table, the utility executes at this database partition.
If you issue RUNSTATS from a database partition that does not contain a table partition, the request is sent to the first database partition in the nodegroup that holds a partition for the table. The utility then executes at this database partition.

Analyzing Statistics

Analyzing the statistics can indicate when reorganization is necessary. Some of these indications are:

Clustering of indexes
If cluster ratio statistics are collected, their value will be in the range from 0 to 100. If cluster factor statistics are collected, their value will be a number between 0 and 1. Only one of these two clustering statistics will be recorded in the SYSCAT.INDEXES catalog. In general, only one of the indexes in a table can have a high degree of clustering. A value of -1 is used to indicate that no statistics are available.
If you wish to compare ratio values, multiply the cluster factor by 100 to obtain a percentage value for the amount of clustering.
Index scans that are not index-only accesses might perform better with higher cluster ratios. A low cluster ratio leads to more I/O for this type of scan, since after the first access of each data page, it is less likely that the page is still in the buffer pool the next time it is accessed. Increasing the buffer size can improve the performance of an unclustered index. (See "Understanding List Prefetching" for information about how the database manager can improve index scan performance for indexes with low cluster ratios and see "Clustered Indexes" for information about how the optimizer uses index statistics.)
If the table data was initially clustered with respect to a certain index, and the above clustering information indicates that the data is now poorly clustered for that same index, you may wish to reorganize the table to re-cluster the data with respect to that index.
Overflow of rows
The overflow number indicates the number of rows that do not fit on their original pages. This can occur when VARCHAR columns are updated with longer values. In such cases, a pointer is kept at the row's original location. This can hurt performance, because the database manager must follow the pointer to find the row's contents, which increases the processing time and may also increase the number of I/Os.
As the number of overflow rows grows higher, the potential benefit of reorganizing your table data also increases. Reorganizing the table data will eliminate the overflowing of rows.
Comparison of file pages
The number of pages with rows can be compared with the total number of pages that a table contains. Empty pages will be read for a table scan. Empty pages can occur when entire ranges of rows are deleted.
As the number of empty pages grows higher, so does the need for a table reorganization. Reorganizing the table can compress the amount of space used by a table, by reclaiming these empty pages. In addition to more efficient use of disk space, reclaiming unused pages can also improve the performance of table scan, since fewer pages will be read into the buffer pool.

Number of leaf pages

The number of leaf pages predicts how many index page I/Os are needed for a complete scan of an index.

Random update activity can cause page splits to occur that increase the size of the index beyond the minimum amount of space required. When indexes are rebuilt during the reorganization of a table, it is possible to build each index with the minimum amount of space possible. For more information on the minimum space requirements for an index, see "Creating an Index" and "Indexing Impact on Query Optimization".
Note: A default of ten percent free space is left on each index page when the indexes are rebuilt. You can increase the free space amount by using the PCTFREE parameter when first creating the index. Then, whenever you reorganize the index, the PCTFREE value is used. Having a free space larger than ten percent may be important if you wish to reduce the number of times you need to reorganize the index. The free space is used to accomodate additional index inserts.

RUNSTATS can also help you determine how performance is related to changes in your database. The statistics show the data distribution within a table. When used routinely, RUNSTATS provides data about tables and indexes over a period of time, thereby allowing performance trends to be identified for your data model as it evolves over time.

Ideally, you should rebind application programs after running statistics, because the query optimizer may choose a different access plan given the new statistics.

If you do not have enough time available to collect all of the statistics at one time, you may choose to periodically run RUNSTATS to update only a portion of the statistics that could be gathered. If inconsistencies are found as a result of activity on the table between the periods where you run RUNSTATS with a selective partial update, then a warning message (SQL0437W, reason code 6) is issued. For example, you first use RUNSTATS to gather table distribution statistics. Subsequently, you use RUNSTATS to gather index statistics. If inconsistencies are detected as a result of activity on the table, then the table distribution statistics are dropped and the warning message is issued. It is recommended that you run RUNSTATS to gather table distribution statistics when this happens.

You should periodically use RUNSTATS to gather both table and index statistics at once, to ensure that the index statistics are synchronized with the table statistics. Index statistics retain most of the table and column statistics collected from the last run of RUNSTATS. If the table has been modified extensively since the last time its table statistics were gathered, gathering only the index statistics for that table will leave the two sets of statistics out of synchronization.

You may wish to collect statistics based only on index data in the following situations:

A new index has been created since the utility was performed and you do not want to re-collect statistics on the table data.
There have been a lot of changes to the data that affect the first column of an index.

The RUNSTATS utility allows you to collect varying levels of statistics. For tables, you can collect basic level statistics or you can also collect distribution statistics for the column values within a table (see "Collecting and Using Distribution Statistics"). For indexes, you can collect basic level statistics or you can also collect detailed statistics which can help the optimizer better estimate the I/O cost of an index scan. (See "Clustered Indexes" for information about these "detailed" statistics).
Note: Statistics are not collected for LONG or large object (LOB) columns. For row types, the table level statistics NPAGES, FPAGES, and OVERFLOW are not collected for a sub-table.

The following tables show the catalog statistics that are updated by the RUNSTATS utility:

Table 39. Table Statistics (SYSCAT.TABLES and SYSSTAT.TABLES)

Statistic Description RUNSTATS Option
Table Indexes
FPAGES number of pages being used by a table Yes Yes
NPAGES number of pages containing rows Yes Yes
OVERFLOW number of rows that overflow Yes No
CARD number of rows in table (cardinality) Yes Yes (Note 2)

Note:

For a partitioned database, the values for each statistic are estimated from the value of the count at the database partition multiplied by the number of database partitions.
If the table has no indices defined and you request statistics for indexes, no new CARD statistics are updated. The previous CARD statistics are retained.

Table 40. Column Statistics (SYSCAT.COLUMNS and SYSSTAT.COLUMNS)

Statistic Description RUNSTATS Option
Table Indexes
COLCARD column cardinality Yes (Note 1) Yes (Note 2)
AVGCOLLEN average length of column Yes Yes (Note 2)
HIGH2KEY second highest value in column Yes Yes (Note 2)
LOW2KEY second lowest value in column Yes Yes (Note 2)

Note:

COLCARD is estimated for all columns in the table. In a partitioned database, if the column is the single-column partitioning key for the table, the value of the count is estimated as the count at the database partition multiplied by the number of database partitions.
Column statistics are gathered for the first column in the index key.

Table 41. Index Statistics (SYSCAT.INDEXES and SYSSTAT.INDEXES)

Statistic Description RUNSTATS Option
Table Indexes
NLEAF number of index leaf pages No Yes (Note 3)
NLEVELS number of index levels No Yes
CLUSTERRATIO degree of clustering of table data No Yes (Note 2)
CLUSTERFACTOR finer degree of clustering No Detailed (Notes 1,2)
DENSITY Ratio (percentage) of SEQUENTIAL_PAGES to number of pages in the range of pages occupied by the index (Note 4) No Yes
FIRSTKEYCARD number of distinct values in first column of the index No Yes (Note 3)
FIRST2KEYCARD number of distinct values in first two columns of the index No Yes (Note 3)
FIRST3KEYCARD number of distinct values in first three columns of the index No Yes (Note 3)
FIRST4KEYCARD number of distinct values in first four columns of the index No Yes (Note 3)
FULLKEYCARD number of distinct values in all columns of the index No Yes (Note 3)
PAGE_FETCH_PAIRS page fetch estimates for different buffer sizes No Detailed (Notes 1,2)
SEQUENTIAL_PAGES number of leaf pages located on disk in index key order, with few or no large gaps between them No Yes

Note:

Detailed index statistics are gathered by specifying the DETAILED clause on the RUNSTATS command, or by specifying A, Y or X for the statsopt parameter when calling the RUNSTATS API.
CLUSTER_FACTOR and PAGE_FETCH_PAIRS are not collected with the DETAILED clause unless the table is of a respectable size. If the table is greater than about 25 pages, then CLUSTERFACTOR and PAGE_FETCH_PAIRS statistics are collected. In this case, CLUSTERRATIO is -1 (not collected). If the table is a relatively small table, only CLUSTERRATIO is filled in by RUNSTATS while CLUSTERFACTOR and PAGE_FETCH_PAIRS are not. If the DETAILED clause is not specified, only the CLUSTERRATIO statistic is collected.
For a partitioned database, the value is estimated from the value of the count at the database partition multiplied by the number of database partitions.
This statistic measures the percentage of pages in the file containing the index that belongs to that table. For a table having only one index defined on it, DENSITY should normally be 100. DENSITY is used by the optimizer to estimate how many irrelevant pages from other indexes might be read, on average, if the index pages were prefetched.

Table 42. Column Distribution Statistics (SYSCAT.COLDIST and SYSSTAT.COLDIST)

Statistic Description RUNSTATS Option
Table Indexes
DISTCOUNT If TYPE is Q, the number of distinct values that are less than or equal to COLVALUE statistics Distribution (Note 2) No
TYPE Indicator of whether row provides frequent-value or quantile statistics Distribution No
SEQNO Frequency ranking of a sequence number to help uniquely identify the row in the table Distribution No
COLVALUE Data value for which frequency or quantile statistic is collected Distribution No
VALCOUNT Frequency with which the data value occurs in column, or for quantiles, the number of values less than or equal to the data value (COLVALUE) Distribution No

Note:

Column distribution statistics are gathered by specifying the WITH DISTRIBUTION clause on the RUNSTATS command, or by specifying A, D or Y for the statsopt parameter when calling the RUNSTATS API. Note that distribution statistics may not be gathered unless there is a sufficient lack of uniformity in the column values.
DISTCOUNT is collected only for columns that are the first key column in an index.
In a partitioned database, VALCOUNT is the estimated value of the count at the database partition multiplied by the number of database partitions. The exception to this is where the TYPE is 'F' and the column is the single-column partitioning key of the table, in which case VALCOUNT is simply the count at the database partition.

For more information about column distribution statistics, see "Collecting and Using Distribution Statistics".

Statistics for user-defined functions are not collected by the RUNSTATS utility. You must manually update the statistics for these functions. See "User Update-Capable Catalog Statistics" and "Updating Statistics for User-Defined Functions".

[ Top of Page | Previous Page | Next Page | Table of Contents | Index ]

[ DB2 List of Books | Search the DB2 Books ]