IBM Books

Administration Guide


Using the LOAD Utility

The LOAD utility is intended for the initial load, or an append, of a table where large amounts of data are moved. There are no restrictions on the data types that the LOAD utility can handle, including large objects (LOBs) and user-defined types (UDTs). The LOAD utility speeds up the task of loading large amounts of data into a database. LOAD is faster than IMPORT because LOAD writes formatted pages directly into the database, while IMPORT performs SQL INSERTs. The data being loaded must be local to the server (unlike IMPORT and EXPORT, where data can be passed from the client).

In addition to the overview above, here are some details about the LOAD utility that may interest you if you are concerned about recovering your database after a failure. See Chapter 7. "Recovering a Database" for more information on this subject. Logging is required for fully recoverable databases. The LOAD utility almost completely eliminates the logging associated with the loading of data. In place of logging, you have the option of making a copy of the loaded portion of the table. LOAD does not fire triggers, and does not perform referential or table constraint checking (other than validating the uniqueness of unique indexes). Tables with such options defined may be populated faster, or more simply, using IMPORT. If you have a database environment that allows for database recovery following a failure, you can do one of the following:

Note:You cannot LOAD data into typed tables.

The LOAD utility can take advantage of a hardware configuration in which multiple processors or multiple storage devices are used, such as in a symmetric multiprocessor (SMP) environment. There are several ways in which the LOAD utility can process large amounts of data in parallel. One way is through the use of multiple storage devices, which allows for I/O parallelism during the LOAD process. Another is the use of the multiple processors in an SMP environment, which allows for intra-partition parallelism. Both can be used together to load the data even faster.

The following topics provide more information:

Overview of the LOAD Process

The LOAD process has multiple phases: Load, where the data is written into the table; Build, where the indexes are created; and Delete, where the rows that caused a unique key violation are removed from the table. If tables are left in the check pending state after the load completes, you must run the SET CONSTRAINTS SQL statement to validate them for referential integrity and check constraints. The LOAD utility generates messages about the progress of each phase. If a failure occurs during the LOAD process, these messages will help you decide how to recover.

During the Load phase, data is loaded into the table; index keys and table statistics are collected if necessary. Save points, or points of consistency, are established at intervals specified by you with the SAVECOUNT parameter on the LOAD command. These points of consistency are not established at exactly the number of rows specified with the SAVECOUNT parameter; rather, the number of rows is converted to a page count, which is rounded up to an interval of the extent size. Messages let you know how many input rows were successfully loaded at the time of the save point. If a failure occurs, use the number of input rows at the last successful consistency point with the RESTARTCOUNT parameter during a restart. If the failure occurs near the beginning of the LOAD process and you were doing a REPLACE, consider restarting the load using the REPLACE option.
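The rounding described above can be sketched as follows. The rows-per-page figure and the extent size used here are illustrative assumptions, not values the utility reports; this is a model of the stated behavior, not the utility's actual code:

```python
import math

def pages_per_consistency_point(savecount_rows, rows_per_page, extent_size_pages):
    # Convert the SAVECOUNT row value to a page count, then round the
    # page count up to the next multiple of the extent size.
    pages = math.ceil(savecount_rows / rows_per_page)
    extents = math.ceil(pages / extent_size_pages)
    return extents * extent_size_pages

# Example: SAVECOUNT 100000, roughly 50 rows per 4 KB page, 32-page extents.
print(pages_per_consistency_point(100000, 50, 32))
```

Under these assumptions, a SAVECOUNT of 100 000 rows yields a consistency point roughly every 2 016 pages, rather than at exactly 100 000 rows.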

During the Build phase, indexes are created based on the index keys collected in the Load phase. The index keys are sorted during the Load phase and index statistics are collected. The statistics collected are similar to those collected during RUNSTATS. If a failure occurs, the Build phase restarts from the beginning.

Unique key violations are placed into the exception table, if one was specified, and messages about rejected rows are put into the message file. Following the completion of the LOAD process, review these messages, correct any problems, and insert the corrected rows into the table.
Note:The recording of warnings has a detrimental effect on LOAD performance. If performance is important and you anticipate a large number of warnings, consider using the NOROWWARNINGS filetype modifier, which suppresses these warnings.

During the Delete phase, all rows causing a unique key violation are deleted. If a failure occurs, restart this phase from the beginning. Information on the rows containing the invalid keys is stored in a temporary file. After you request a restart beginning at the Delete phase, the violating rows are deleted based on the information in that temporary file. You must not modify any data in any temporary files, and you must restart the LOAD command with the same parameters; otherwise, the Delete phase will fail. If the temporary file has been modified, or does not exist, restart the LOAD command at the Build phase. Once the index is rebuilt, any invalid keys are placed in the exception table, if it exists, and duplicate keys are deleted.
Note:The Delete phase removes from the table those records, inserted during the Load phase, that were discovered during the Build phase to violate a uniqueness condition (for example, duplicate records). However, each deletion is logged. If a large number of records violate the uniqueness condition, the log could fill up during the Delete phase.

Since regular logging is not performed, LOAD uses pending states to preserve the consistency of the database. The Load and Build phases of the LOAD process place any associated table spaces into a load pending state. The Delete phase places any associated table spaces into a delete pending state. If you complete the LOAD process without having either logretain or userexit "on", and without specifying either the COPY YES option or the NONRECOVERABLE option, any associated table spaces are placed in a backup pending state. These states can be checked using the LIST TABLESPACES command. (For more information on LIST TABLESPACES, see the Command Reference manual.) One last possible state associated with the LOAD process concerns referential and check constraints: dependent tables may be placed in a check pending state following the completion of the LOAD process.

LOAD can also be used with a non-recoverable option. This allows you to perform a non-recoverable LOAD without affecting the recoverability of the other tables in the database. When this type of LOAD is run, there is no requirement to use the COPY YES option or to take a backup.

If a LOAD fails, the table space(s) involved could be in an inconsistent state because there is no logging. For this reason, the table spaces are left in a load pending state. To remove the load pending state, you will have to restart the LOAD, perform a LOAD REPLACE on the same table on which the LOAD failed, or recover the table space(s) using a RESTORE with the most recent backup (either table space or database backup) and then carry out further recovery actions. (You could also drop the table space and then re-create it.)

For more information on how to recover, see Chapter 7. "Recovering a Database".

Details About LOAD

The LOAD utility inserts data into a table from an input file, from a device, or using a named pipe, any of which must reside on the node where the database resides. The table must exist. (Indexes on the table may or may not exist. LOAD only builds indexes that are already defined on the table.) If the table receiving the new data already contains data, you can replace or append to the existing data.
Note:If you are loading a table that already contains data and the database is non-recoverable, make sure that you have a backed-up copy of the database, or the table space for the table being loaded, before using LOAD so that you can recover from errors.

If the existing table is a parent table containing a primary or unique key referenced by a foreign key in a dependent table, replacing data in the parent table places the dependent table in a check pending state. The SET CONSTRAINTS statement must then be used to validate the referential and check constraints.

The table spaces in which the loaded table resides are quiesced in exclusive mode. For more information on QUIESCE, see the Command Reference manual.
Note:In the following command line processor example, and in the other examples in this chapter, relative path names are used. Be aware that relative path names are only allowed on calls from a client on the same node as the database. The use of fully-qualified path names is recommended.

The following is an example of the command line processor syntax for the LOAD command:

     db2 load from stafftab.ixf of ixf messages staff.msgs
        insert into userid.staff copy yes use adsm data buffer 20

This example assumes no indexes are involved, any warning or error messages are placed in staff.msgs, a copy of the changes made is stored in ADSTAR Distributed Storage Manager (ADSM), and 20 pages of buffer space are to be used during the load. See "ADSTAR Distributed Storage Manager" for more information on using ADSM.

The following is a different example of the command line processor syntax for the LOAD command:

     db2 load from stafftab.ixf of ixf messages staff.msgs
        remote file stafftmp replace into staff using . sort buffer 200

This example is similar to the previous one in its use of the MESSAGES parameter. It differs from the previous example in two ways: the data in the table is replaced instead of inserted; and, instead of taking a COPY using ADSM, the USING and SORT BUFFER parameters are used to assist with index creation. The REMOTE FILE parameter declares stafftmp as the base file name for temporary files such as "stafftmp.msg", "stafftmp.log", and "stafftmp.rid". The current directory of the user, indicated by ".", is used as the working directory to hold the temporary files needed for index creation. Two hundred (200) pages of buffer space are used to sort index keys. For more information on the LOAD command, see the Command Reference manual.

Since you will typically be loading large amounts of data using the LOAD command, a LOAD QUERY command can be used to check the progress of the LOAD process if the REMOTE FILE option has been specified. You require a connection to the same database and a separate CLP session to enter this command. This command can be used by local and remote users. For more information on the LOAD QUERY command, see the Command Reference manual.

The LOAD utility can also be invoked by the application programming interfaces (APIs) sqluload and sqlgload. For more information on the requirements when loading data to a table using these APIs, see the API Reference manual.

You should also review the following points. Each represents a task that you may need to perform and each is carried out as part of the LOAD command. For more information on any of the following tasks, see the API Reference and/or the Command Reference manuals.

There are restrictions and limitations with the LOAD utility:

The LOAD utility optionally updates table and index statistics as part of the load process if run in REPLACE mode. If data is appended to a table, statistics are not gathered for the table; run RUNSTATS following the completion of the load process to collect up-to-date statistics. If statistics are gathered on a table with a unique index and duplicate keys are deleted during the Delete phase, the statistics are not updated to account for the deleted records. If you expect to have duplicate records, do not collect statistics during the LOAD; instead, run RUNSTATS after the LOAD process.

To ensure that the loaded data does not cause referential integrity or check constraint violations, loading any table that is a parent table causes all of its dependent tables to be placed in a check pending state. To remove a table from this restricted state, you must run the SET CONSTRAINTS statement.

With LOAD, unequal code page situations are possible, involving the expansion or contraction of character data. Such situations can occur with Japanese or Traditional-Chinese Extended UNIX Code (EUC) and double-byte character sets (DBCS), which may have different length encodings for the same character. The NOCHECKLENGTHS option toggles between two behaviors:

  1. Comparison of input data length to target column length is performed before reading in any data. If the input length is larger than the target, NULLs are inserted for that column if it is nullable. Otherwise, the request is rejected. This is the default.
  2. No initial comparison is performed and, on a row-by-row basis, an attempt is made to load the data. If the data is too long after translation is complete, the row is rejected. Otherwise, the data is loaded. Specifying NOCHECKLENGTHS will enable this behavior.

LOAD Performance Considerations

The performance of LOAD depends on the nature and size of the data, the number of indexes, the options used, and whether the SET CONSTRAINTS statement is required. Use of SET CONSTRAINTS lengthens the total time needed to load the table and make it usable again. (For more information on the SET CONSTRAINTS statement, see the SQL Reference manual.)

The LOAD utility performs almost equally well in either INSERT mode or REPLACE mode.

Index creation reduces the performance of the load process, especially when data is added to a table that already contains data. If there are many indexes on a table that already holds a large amount of data, and only a small percentage of data is to be loaded, consider using the IMPORT utility instead of the LOAD utility. Unique indexes also reduce the performance of the load process if duplicates are encountered. In most cases, however, it is still more efficient to create the indexes during the LOAD than to complete the LOAD and then use the CREATE INDEX command for each of the indexes.

The LOAD utility automatically attempts to provide the best performance possible by determining the best values for DISK_PARALLELISM, CPU_PARALLELISM, DATA_BUFFER, and SORT_BUFFER if these parameters have not been specified by the user at the time the utility is run. Optimization of these values is based on the size of, and the free space available in, the utility heap. Consider allowing the LOAD utility to choose the values for these parameters first, and then tune them for your particular needs.

Performance of the LOAD can be improved by:

The COPY YES/NO option specifies whether to create a copy of the input data during the LOAD. If "YES" is chosen, performance is reduced because all the data being loaded is copied at the same time; however, this choice is faster than accepting a backup pending state and having to do a backup later before accessing the table. If "NO" is chosen and forward recovery is enabled, the table space is placed in a backup pending state.

LOAD Parameter Hints and Tips

The parameters associated with the LOAD command offer you many options and alternatives when loading data into tables. All of the options, however, can be overwhelming because of their number (for newer users) and the way that they can interact with each other. This section provides additional hints and tips on most LOAD parameters with a particular focus on those affecting load performance.

DATA BUFFER and SORT BUFFER
The database configuration parameter Utility Heap Size (util_heap_sz) is the source of the memory used by the following utilities within DB2:

The default for the Utility Heap Size configuration parameter is 5 000 pages (each 4 KB in size). Depending on the amount of storage available on your system, you should consider allocating more pages for use by the DB2 utilities.

Within the load utility, the memory allocated from the Utility Heap is divided among many areas, of which the DATA BUFFER and the SORT BUFFER are two of the most important. Because LOAD is only one of several utilities that use memory from the Utility Heap, for planning purposes you should assume that no more than fifty percent of the pages defined by the Utility Heap Size configuration parameter are available for LOAD. The remainder of the memory is available to the other utilities mentioned above.
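As a planning sketch of the fifty-percent guideline above (the heap size and page size defaults are those mentioned earlier; the even split itself is only a rule of thumb, not something the utility enforces):

```python
def load_heap_budget_kb(util_heap_sz_pages=5000, page_size_kb=4):
    # Assume at most half of the utility heap is available to LOAD,
    # per the planning guideline above; the rest serves other utilities.
    load_pages = util_heap_sz_pages // 2
    return load_pages * page_size_kb

print(load_heap_budget_kb())  # with the 5 000-page default
```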

The DATA BUFFER controls the size of the buffer used to load data into a table. The recommendation is that this buffer should be several extents in size. The extent size is defined within DB2 and can be one or many 4 KB pages in size. An extent is the unit of movement for data within DB2. This parameter is useful when working with large objects (LOBs), and it reduces I/O waiting time.

The SORT BUFFER controls the size of the buffer to sort index keys associated with the data being loaded into a table. This buffer is used during the index create phase of the LOAD process. It is only used if there are indexes on the table data being loaded. If used, this buffer is more important than the DATA BUFFER when considering the performance of the load operation. The recommendation is that the SORT BUFFER size be set as large as possible (depending on the overall storage available on your system and the other considerations mentioned above).

To set the configuration parameter for the Utility Heap Size, you should do the following from the command line:
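A sketch of the command sequence follows; the database name sample and the value 10 000 pages are illustrative assumptions:

```shell
# Update the utility heap size for the database (name and value are
# illustrative), then restart the instance so the change takes effect.
db2 update database configuration for sample using util_heap_sz 10000
db2stop
db2start
```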

Changes to the database configuration file become effective only after they are loaded into memory. All applications must disconnect from the database before this can occur. For a server configuration parameter, this occurs during the execution of DB2STOP followed by a DB2START. For a client configuration parameter, this occurs when the application is restarted. If the client is the command line processor, it is necessary to use TERMINATE. (TERMINATE breaks the application connection to a database and terminates the back-end process.)

CPU_PARALLELISM
Use CPU_PARALLELISM to exploit Symmetric Multi-Processor (SMP) parallelism (if this is part of your machine's capability). The parameter specifies the number of processes or threads used by the load utility to parse, convert, and format data records. The maximum number allowed is 30. Using this parameter whenever possible has a strong positive impact in reducing the time taken to complete the load operation.

Also, if this parameter is not specified, the load utility will select a reasonable default based on the number of CPUs on the system.

DISK_PARALLELISM
Use DISK_PARALLELISM to exploit the available number of containers that can be written to when loading data. The parameter specifies the number of processes or threads used by the load utility to write data records to disk. The maximum default number allowed is the greater of four times the CPU_PARALLELISM value or 50. Using this parameter whenever possible has a strong positive impact in reducing the time taken to complete the load operation.

By default, DISK_PARALLELISM is equal to the sum of the table space containers on all table spaces containing objects for the table being loaded, except where this default exceeds the maximum default number allowed (see above).
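The default described above can be sketched as follows; this is a simple model of the stated rule, not the utility's actual code:

```python
def disk_parallelism_default(total_containers, cpu_parallelism):
    # Default: the number of containers across all table spaces holding
    # objects for the table, capped at the maximum default value of
    # max(4 * CPU_PARALLELISM, 50).
    maximum_default = max(4 * cpu_parallelism, 50)
    return min(total_containers, maximum_default)

print(disk_parallelism_default(12, 4))   # fewer containers than the cap
print(disk_parallelism_default(80, 16))  # capped at max(64, 50) = 64
```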

SAVECOUNT
This parameter sets the interval at which points of consistency are established for the data being loaded. If SAVECOUNT must be used, ensure that its value is set sufficiently high: the synchronization required to establish a point of consistency takes time, and if done too frequently, there will be a noticeable slowing of the load performance. Depending on the estimated number of rows to be loaded, a value greater than one million is recommended.

The value shown by the LOAD QUERY command can be used if the LOAD fails and needs to be restarted. When loading, we recommend that the REMOTE FILE parameter be used. After the failure, LOAD QUERY using the remote file will give you the last consistency point. You can then rerun the load with RESTARTCOUNT n (where n is the last consistency point value) and with the RESTART parameter.

STATISTICS YES
If STATISTICS YES is requested during the load, the performance of the load will decrease. The most significant decrease in the performance of the load occurs when DETAILED INDEXES ALL is chosen. Even with this performance decrease, you should consider using this option. The load utility collects statistics faster this way than using the RUNSTATS command following the completion of the LOAD.

To perform their best, applications require the use of the best distribution and index statistics possible. To update the DB2 system statistics, you should use STATISTICS YES during the load operation; or, use the RUNSTATS command following the load. Once the statistics are updated, applications may need the new access paths to the table data based on the latest statistics. New access paths to a table can be created by rebinding the application packages using the BIND command.

The Statistics Heap Size (stat_heap_sz), allocated from agent private memory, is used when collecting statistics. LOAD is typically used to load data into large tables; therefore, the default for the stat_heap_sz configuration parameter is not recommended when using STATISTICS YES on the LOAD command, and a larger value is recommended.

NONRECOVERABLE, COPY NO, COPY YES
If you are working with a recoverable database, there are parameters related to this type of database that affect the performance of the load operation. A recoverable database is one where you have determined that the data is important and must not be lost. The ability to back up and restore databases or table spaces is one part of what it means to have a recoverable database. The other part is the ability to apply any changes made to the database between the times the backups are taken. This part is called roll-forward recovery and involves saving the activity against the database in logs. The information from these saved logs is applied to the restored database so that no information is lost as a result of a problem with the database.

You can decide the level of recoverability you wish for your database. You can decide to backup and restore a database or table space without needing the roll-forward capability. In such a case, you may not need the logs since you have another way to reconstruct the activity against the database between backups. For example, you may be collecting all transactions against the database and applying them in batch mode; and, you may also not be allowing any ad hoc user updates to the database. In this example, there is no need to log all changes to the database.

The load operation does not log each record added to the table as part of the load operation.

By choosing the NONRECOVERABLE parameter, there is no performance penalty, since no activity beyond the movement of the data into the table is required. In addition, the load operation completes without leaving the table spaces in a backup pending state. However, during a subsequent restore and roll-forward recovery in which this LOAD command is one of the operations being rolled forward, the table that is the object of the load is not updated and is marked "invalid". Further actions against this table are ignored. After the roll-forward is complete, the table cannot be used and can only be dropped. Depending on the table, this may be an acceptable course of action in your work environment.

If you choose COPY NO and roll-forward recovery is enabled (that is, either the logretain or the userexit parameter is on), the table space in which the table being loaded resides is placed in a backup pending state. The database or table space must be backed up before the data in the table can be accessed. The forced backup establishes a point of consistency for the table space or database. Recovery would then begin with the restoration of this backup. (Roll-forward could then follow.)

COPY NO is the default when using the LOAD command.

You can only choose COPY YES if roll-forward recovery is enabled. A copy of all changes to the table being loaded is saved. In a subsequent restore and roll-forward recovery in which this LOAD command is one of the operations being rolled forward, the changes are applied to the table, and the table is usable following the completion of the roll-forward activity. The copying of each change as the table is being loaded causes increased I/O activity, which may increase the load time on an I/O-bound system. Specifying multiple devices or directories (on different disks) can offset some of the performance penalty resulting from this operation.

The syntax when using the COPY YES option is:

   db2 load from <filename> of DEL replace into <tablename>
      copy yes <dir1>, <dir2>, <dir3>, <dir4>, ...

where

   <dirN> are devices or directories.

You might select this option if you are adding new data to an existing table already containing data. You have decided you do not want to take a backup of the table space or database involved. Recovery would begin with the restoration of a backup taken before the load operation. Roll-forward could then follow including going through the load operation using the changes saved using the COPY YES option.

WARNINGCOUNT
During the load operation, warnings and errors may be generated. The WARNINGCOUNT, FOR EXCEPTION, and MESSAGES parameters can be used to keep track of them. If you are expecting a significant number of warnings, set the WARNINGCOUNT parameter to zero (0) or omit this optional parameter; the load operation will then continue to completion regardless of the number of warnings issued.

If you are expecting only a few warnings or none, set the WARNINGCOUNT parameter to approximately the number you are expecting, or to twenty (20) if you are expecting none. The load operation will stop after the WARNINGCOUNT number is reached. This gives you the opportunity to correct the data (or drop and re-create the table being loaded) before attempting to complete the load operation. Although WARNINGCOUNT has no direct effect on the performance of the load operation, it can assist with the successful completion of your load operations. If you encounter unexpected warnings during load processing, a WARNINGCOUNT threshold prevents you from having to wait until the full completion of the load operation before discovering there was a problem.

USING
If indexes are defined on the table, the load utility sorts all of the index keys in order to rebuild the indexes. If the number of records in the table is large, then the sort may not fit into the available memory (SORT BUFFER) and sorted data may need to overflow to disk. You can decrease the I/O overhead of sorting by striping the sort overflow to a series of disks. This is done with the USING option as follows:
   db2 load from <filename> of IXF insert into <tablename>
      using <dir1>, <dir2>, <dir3>, <dir4>, ...

where

   <dirN> are directories.

Here is a suggested method to determine how many directories are required:

If D is the number of directories specified, L is the file size limit in GB (this limit varies by operating system), and I is the estimated size, in GB, of the sum of all the indexes, then the value of D is established using the following equation:

   D > (2 * I) / L

For example, assume the sum of all the indexes in your database is about 2.8 GB. If the file size limit for your system is 2 GB, then L = 2. Putting these two values into the equation results in:

   D > (2 * 2.8) / 2

Therefore, rounding D to the next highest integer, you would specify three (3) directories in the USING clause.
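The rule of thumb above, including the strict inequality and the rounding up, can be sketched as:

```python
import math

def directories_needed(index_size_gb, file_limit_gb):
    # D must be strictly greater than (2 * I) / L, so take the next
    # integer above the quotient.
    quotient = (2 * index_size_gb) / file_limit_gb
    return math.floor(quotient) + 1

print(directories_needed(2.8, 2))  # the worked example above: 3
```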

LOAD Temporary Files

You have control over the disk space used when working with LOAD. If you explicitly specify the location of the remote files used by the LOAD, you know where the temporary files will be created, and therefore which disk will temporarily have its space used.

The REMOTE FILE parameter specifies the directory path and the base name to be used when creating temporary files during a load operation. For example, we could declare the path and base name as:

   REMOTE FILE /u/bobdir/bob

During the load processing, LOAD may generate the following temporary files:

   /u/bobdir/bob.msg
   /u/bobdir/bob.rid
   /u/bobdir/bob.log

These files are temporary and are removed when load processing completes without difficulty. They are binary files and are not designed to be viewed during load processing.

If you do not use the REMOTE FILE parameter on the LOAD command, the load process uses "db2utmp" as the base name and the directory path is the current working directory (if the LOAD command is executed on the server). If the LOAD command is not executed on the server and you do not use the REMOTE FILE parameter with an explicit directory path and base name, it may be difficult to determine where the temporary files are located.
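The naming scheme above can be sketched as follows; the base name is the one from the example, and the extension list reflects the files shown above:

```python
def load_temp_files(remote_file_base="/u/bobdir/bob"):
    # LOAD derives its temporary file names by appending fixed
    # extensions to the REMOTE FILE base name.
    return [remote_file_base + ext for ext in (".msg", ".rid", ".log")]

print(load_temp_files())
```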

Using LOAD QUERY

When using the LOAD QUERY command, you simply specify the base name to see the message information for the load process.

The base name is declared using the LOAD command and the REMOTE FILE parameter. For example, we could declare the path and base name as:

   REMOTE FILE /u/bobdir/bob

To check the status of the load process, you could issue the following commands:

   db2 connect to database_name
   db2 load query /u/bobdir/bob

You could also declare a location to place the warning and error messages that occur during load processing. This is done by using the TO parameter on the LOAD QUERY command and declaring a fully-qualified path and a unique file name.

See the Command Reference for additional details on the LOAD QUERY command.

Running Concurrent LOAD Jobs

It is extremely important when running concurrent LOAD jobs that you know the following:

When running concurrent LOADs, the remote files must be unique. LOAD does not enforce this uniqueness; if the remote files are not unique (for example, if you do not use the REMOTE FILE parameter on any of your concurrent LOAD commands), you will receive indeterminate results.

Restarting LOAD and Database Recovery

A LOAD can be restarted following a failure using RESTART or RESTARTCOUNT. The REMOTE FILE information is important when using a LOAD RESTART command. A copy image of the loaded data can be created for use when recovering a database.

The following sections discuss these considerations in more detail:

REMOTE FILE Considerations

LOAD RESTART restarts the load process after a previous load was interrupted. As part of restarting, LOAD RESTART needs the information that was saved in the remote files.

The REMOTE FILE parameter specifies the directory path and the base name to be used when creating temporary files during a load operation. For example, we could declare the path and base name as:

   REMOTE FILE /u/bobdir/bob

During the load processing, LOAD may generate the following temporary files:

   /u/bobdir/bob.msg
   /u/bobdir/bob.rid
   /u/bobdir/bob.log

Each of these files is temporary and are all removed when the load processing completes without difficulty. The files are temporary in nature and are binary in format so that they are not designed to be viewed by you during load processing.

If you do not use the REMOTE FILE parameter on the LOAD command, the load process uses "db2utmp" as the base name and the directory path is the current working directory (if the LOAD command is executed on the server). If the LOAD command is not executed on the server and you do not use the REMOTE FILE parameter with an explicit directory path and base name, it may be difficult to determine where the temporary files are located.
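For example, a LOAD command that explicitly places its temporary files under /u/bobdir might look like the following, where staff.del and staff are illustrative file and table names (verify the full syntax in the Command Reference):

   db2 load from staff.del of del remote file /u/bobdir/bob insert into staff

Because the path is fully qualified, the location of the temporary files is unambiguous no matter where the command is later restarted.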

Restarting the LOAD Using RESTART

LOAD RESTART requires the information in the remote files. The specification of a fully-qualified path removes any possible ambiguity over the location of these files during the restart of the load process.
Note: It may still be possible to restart without a fully-qualified path for the temporary remote files, but only if the LOAD is restarted from the same directory where the original LOAD command was issued.
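Assuming the original load named its remote file /u/bobdir/bob, a restart of an interrupted load might be invoked as follows (the file and table names are illustrative; see the Command Reference for the exact RESTART syntax):

   db2 load from staff.del of del remote file /u/bobdir/bob restart into staff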

Restarting the LOAD Using RESTARTCOUNT

If a failure occurs while loading data, you can restart the load from the last save point or point of consistency; or reload the entire table by using the REPLACE option.

The remote file specified in the LOAD restart operation should be the one that was specified in the LOAD command that failed.

There are a number of options available should you decide to restart the load.

If you decide to restart using the RESTARTCOUNT number option, then you must use the number of rows at the last successful consistency point. To determine that value, use the LOAD QUERY command with either the name specified with the REMOTE FILE option or the default name db2utmp. By choosing RESTARTCOUNT number, the LOAD restarts from the row following the row identified by the number and attempts to finish the load.
Note: The RESTARTCOUNT number can only be used with the last successfully completed consistency point. If the last consistency point started but did not complete (that is, SQL3519W is not followed by SQL3520W), then you must carry out the action described in the help for message SQL3519W.

If you do not want to continue loading rows, or if the failure was during the Build phase, you can use the RESTARTCOUNT B option. The LOAD process brings the table to the state of the last save point or point of consistency and then restarts the Build phase. By choosing this option the LOAD restarts, does not attempt to load additional rows, and builds the indexes for the rows already loaded.

If the message file states that the Build phase completed, and the temporary files are unmodified since the LOAD left them, you can use the RESTARTCOUNT D option. The information about rows containing duplicate keys, stored in the temporary files, is used to delete those rows.
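For example, if LOAD QUERY reports that the last completed consistency point was at row 100000, the restart might look like the following sketch (the file and table names are illustrative, and the exact placement of RESTARTCOUNT should be verified in the Command Reference):

   db2 load from staff.del of del remote file /u/bobdir/bob restartcount 100000 restart into staff

Substituting restartcount B or restartcount D selects the Build-phase or duplicate-deletion restart behavior described above.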

The restarted LOAD command should continue until the completion of the LOAD process.
Note: For minor errors, such as nonexistent data files or invalid dcoldata, the LOAD cleans up and takes the table out of the load pending state. You must then run the LOAD again, in either REPLACE or INSERT mode, with the correct parameters.

Eliminating a Phantom Quiesce

A quiesce on table spaces is persistent across any system failure. This is important to know when working with LOAD since it uses a quiesce in exclusive mode while carrying out the load process.

A quiesce is owned by a connection. When you lose a connection, the quiesce remains but it has no owner. This is called a "phantom quiesce".

To remove a phantom quiesce, you must:
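As a sketch, assuming an illustrative table name of staff, resetting the quiesce from a new connection might look like the following (verify the exact procedure in the Command Reference):

   db2 connect to database_name
   db2 quiesce tablespaces for table staff reset
   db2 connect reset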

Creating a Copy Image of Loaded Data

If the table being loaded is part of a recoverable database, then logging is in effect. Since LOAD does not log the changes made to the table, you have the option of specifying COPY YES to create a copy of the data being loaded. This copy is used during roll-forward recovery to re-create the changes to the database done by LOAD.

When using this option you should also consider using multiple devices or directories to allow for the best possible I/O exploitation.
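For example, a LOAD that spreads the copy image across two directories might look like the following, where the file, table, and directory names are illustrative:

   db2 load from staff.del of del copy yes to /u/copy1,/u/copy2 insert into staff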

For more information on the load copy location file, see "Using the Load Copy Location File".

If forward recovery is enabled (logretain or userexit is "on") and the COPY option was not used, all table spaces in which the loaded table resides are left in a backup pending state. A backup of the database or of the affected table space(s) is required to remove this pending status, and must be taken before any other units of work can be started against the database or table space.
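For example, a table space backup that removes the backup pending status might look like the following, where sample and ts1 are illustrative database and table space names:

   db2 backup database sample tablespace (ts1)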

For more information on how to recover, see Chapter 7. "Recovering a Database".

During forward recovery, if the load copy is not available, then the table spaces (of the loaded table) which are being rolled forward will be set to the restore pending state. These table spaces must be restored from a backup image taken after the table load.

An error is returned if you specify NONRECOVERABLE and COPY YES. There is no need for a copy in such a case since it would not be needed during recovery.

LOAD Exception Table

The exception table is a user-created table which mimics the definition of the table being loaded. It is specified by the FOR EXCEPTION option on the LOAD command. The table is used to store copies of rows that violate unique index rules.
Note: Any rows rejected because of invalid data before the building of the index on the loaded table are not inserted into the exception table.

Rows are added to the existing information in the exception table. The existing information may include rows listing check constraint or foreign key violations; or invalid rows from a previous LOAD. If you want only the invalid rows from this LOAD, you will need to remove the existing rows before invoking LOAD.

The exception table used with the LOAD utility is identical to the exception table(s) used by the SET CONSTRAINTS statement. An exception table should be created when performing a load against a table that has a unique index and may contain duplicate records. If an exception table is not provided for the LOAD, and duplicate records are found, the LOAD continues; however, only a warning message is issued about the deleted duplicate records, and the deleted records themselves are not placed anywhere.
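For example, an exception table for an illustrative table named staff might be created as follows; the additional timestamp and message columns are optional, and all names here are illustrative:

   db2 "create table staffexc like staff"
   db2 "alter table staffexc add column load_ts timestamp add column load_msg clob(32k)"
   db2 load from staff.del of del insert into staff for exception staffexc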

After completing the load, information in the exception table can be used any way you wish. You may want to use the information to correct any data that was in error and insert the rows into the original table.

The exception table message column has the following structure:

Table 32. Exception Table Message Column Structure for LOAD

Field  Contents                           Size                 Comments
1      Number of violations               5 characters         Right justified, padded with '0'
2      Type of first violation            1 character          'I' - unique index violation (the only type used by LOAD)
3      Length of constraint/index token   5 characters         Right justified, padded with '0'
4      Constraint/index token             Length from field 3

Note: Only unique index violations are reported by LOAD. Check constraint and foreign key violations are reported by running the SET CONSTRAINTS statement with the IMMEDIATE CHECKED FOR EXCEPTION options. Only one unique index violation per row is reported. LONG or large object (LOB) data is not inserted into the exception table. The index token is the "IID" value from SYSCAT.INDEXES that identifies the index.

See the SQL Reference manual for more information on exception tables.

Checking For Constraint Violations

The loaded table may be in the check pending state if it has table check constraints or referential integrity constraints defined on it. The STATUS flag of the SYSCAT.TABLES in the row corresponding to the loaded table indicates the check pending state of the table. For the loaded table to be usable, the STATUS must have a value of "N" indicating the normal state of the table.

To remove a table from the check pending state, use the SET CONSTRAINTS statement. For more information on the SET CONSTRAINTS statement, see the SQL Reference manual. One or more tables may be submitted to be checked in a single invocation. For a dependent table to be checked, the parent table must not be in the check pending state. In the case of a referential integrity cycle, all the tables involved in the cycle must be included in a single invocation of the SET CONSTRAINTS statement.
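For example, a parent table and its dependent could be checked in a single invocation, with violating rows routed to exception tables, as in the following sketch (all table names here are illustrative):

   db2 "set constraints for dept, staff immediate checked for exception in dept use deptexc, in staff use staffexc"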

To manage the loading of several tables, consider the position of each within referential relationships along with the table size and time windows available to carry out the load. It may be convenient, for example, to check the parent table for check constraint violations while the dependent table is loaded. This can only occur if the two tables are not in the same table space.

Exception tables are convenient for a consolidated report of all the rows that have constraints violated. If the exception table option is not used, only the first violation is reported. This may be a cause for frustration when dealing with large tables having more than one constraint violation. The same exception table used for the LOAD utility may be used for checking constraint violations. As with the LOAD utility, there is no checking done when running the SET CONSTRAINTS statement to ensure that the exception table is empty. The extra timestamp column in the exception table may be used to distinguish newly-inserted rows from the old ones, if necessary.

The SET CONSTRAINTS statement does not activate any DELETE triggers as a result of deleting rows that violate constraints. Note, however, that once the table is removed from the check pending state, triggers are active. This means that if you correct the data and INSERT rows from the exception table into the loaded table, any INSERT trigger defined on the table will be activated. Consider the implications of this for your data and, if necessary, take suitable action. One option is to drop the INSERT trigger, INSERT rows from the exception table, and then re-create the INSERT trigger.

If the underlying table (or tables) of a summary table are put in the check pending state, then all summary tables with REFRESH IMMEDIATE will be put into the check pending state. To take the summary table out of the check pending state, run the REFRESH TABLE statement against the table. You can also load the summary table and issue the SET CONSTRAINTS IMMEDIATE UNCHECKED statement. Refer to the SQL Reference for more information about these statements.
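For example, assuming an illustrative summary table named sales_summary, either of the following takes it out of the check pending state, depending on whether you want the data recomputed or accepted as loaded:

   db2 refresh table sales_summary
   db2 set constraints for sales_summary immediate unchecked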

