Administration Guide


Failover Configurations

Two modes of failover support are provided. A brief description of each mode and its application to DB2 follows; for each, the simple scenario of a two-server cluster is described.

Hot Standby
One server actively runs DB2, and the second is in standby mode, ready to take over if an operating system or hardware failure occurs on the first server.

Mutual Takeover
With the DB2 Extended Enterprise Edition product, a single database instance can be scaled across multiple servers. This is done using a shared-nothing model and partitioning the data so that one or more partitions run on each server in the cluster. If an operating system or hardware failure occurs on one of the servers, the other servers take over the partition (or partitions) of the failing server.

Either of these configurations can be used to fail over one or more partitions of a partitioned database.

Hot Standby Configuration

You can use the hot standby capability to set up failover for a partition or partitions of a partitioned database configuration. If one server fails, another server in the cluster can substitute for the failed server by automatically taking over the database partitions. To achieve this, the database instance and the actual database must be accessible to both the primary and the failover server, which requires that the appropriate installation and configuration tasks be performed.

Database Partition Server Failover

Figure 86 shows how partitions fail over in a hot standby configuration. System A runs one or more partitions of the overall configuration, and System B is the failover system. When System A fails, its partitions are restarted on System B: the failover updates the db2nodes.cfg file to point each partition to System B's hostname and netname, and then restarts the partition on the new system. When the failover is complete, all other partitions forward requests targeted for these partitions to System B.

Figure 86. Hot Standby Configuration

The following is a portion of the db2nodes.cfg file before and after the failover. In this example, node numbers 20, 22, and 24 are running on the cluster system named MachineA, which has the netname MachineA-scid0. After the failover, node numbers 20, 22, and 24 are running on the system named MachineB, with the netname MachineB-scid0. The db2start ... restart commands shown between the two listings are the commands issued during the failover to move each partition; a sketch of how these commands might be scripted follows the listings.

Before:
        20 MachineA 0 MachineA-scid0   <= Sun Cluster 2.1
        22 MachineA 1 MachineA-scid0   <= Sun Cluster 2.1
        24 MachineA 2 MachineA-scid0   <= Sun Cluster 2.1
 
        db2start nodenum 20 restart hostname MachineB port 0 netname MachineB-scid0
        db2start nodenum 22 restart hostname MachineB port 1 netname MachineB-scid0
        db2start nodenum 24 restart hostname MachineB port 2 netname MachineB-scid0
 
After:
        20 MachineB 0 MachineB-scid0   <= Sun Cluster 2.1
        22 MachineB 1 MachineB-scid0   <= Sun Cluster 2.1
        24 MachineB 2 MachineB-scid0   <= Sun Cluster 2.1
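
As an illustration only, the restart commands could be issued from a small script run by the instance owner. The following is a rough sketch, not part of the product; the node numbers, ports, hostname, and netname are taken from the example above and are assumptions for your environment:

        #!/bin/sh
        # Hypothetical sketch: restart partitions 20, 22, and 24 on MachineB
        # after System A fails. Node numbers, ports, and the netname follow
        # the example above; adjust them to your configuration.
        NEWHOST=MachineB
        NETNAME=MachineB-scid0
        port=0
        for node in 20 22 24
        do
            db2start nodenum $node restart hostname $NEWHOST port $port netname $NETNAME
            port=`expr $port + 1`
        done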

Mutual Takeover Configuration

Mutual failover of partitions in a partitioned database environment requires that each partition fail over as a logical node on the failover server. For example, if two partitions of a partitioned database system run on separate servers of a cluster configured for mutual takeover, a failing partition must be restarted as a logical node on the surviving server.

Figure 87 shows an example of a mutual takeover configuration.

Figure 87. Mutual Takeover Configuration

Another important consideration when configuring a system for mutual partition takeover is the database path of the local partition. When a database is created in a partitioned database environment, it is created on a root path, which is not shared across the database partition servers. For example, consider the following statement:

  CREATE DATABASE db_a1 ON /dbpath

This statement is executed under instance db2inst and creates the database db_a1 on the path /dbpath. Each partition creates its actual database partition on its local /dbpath file system under /dbpath/db2inst/NODExxxx, where xxxx represents the node number. After a failover, a database partition starts up on another system with a different local /dbpath directory. The only file systems that move along with the logical host during a failover are the logical host file systems. This means that a symbolic link must be created at the appropriate /dbpath/db2inst/NODExxxx path, pointing to the corresponding directory on the logical host file system.

For example:

cd /dbpath/db2inst
# Link the local NODE0001 directory to its location on the logical host
# file system (log0), which moves with the logical host during a failover:
ln -s /log0/disks/db2inst/NODE0001 NODE0001

The hadb2eee_addinst script sets up symbolic links from INSTHOME/INSTANCE to the logical host file system that corresponds to each database partition (where INSTHOME is the instance owner's home directory, INSTANCE is the instance name, and log0 is the logical host that is bound to database partition 1 through the hadb2-eee.cfg file). You must create these links manually for any other database directories, as in the sketch below.
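
As a rough sketch only, the manual linking for several partitions might look like the following. The partition numbers, the database path, and the single logical host (log0) are taken from the examples in this section and are assumptions; adjust them to match your hadb2-eee.cfg configuration:

        #!/bin/sh
        # Hypothetical sketch: link each local NODExxxx directory under
        # /dbpath/db2inst to its directory on the logical host file system
        # that hosts that partition. Partition numbers and the logical host
        # name are assumptions for illustration.
        cd /dbpath/db2inst || exit 1
        for node in 0020 0022 0024
        do
            ln -s /log0/disks/db2inst/NODE$node NODE$node
        done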

The following example shows a portion of the db2nodes.cfg file before and after the failover. In this example, node numbers 20, 22, and 24 run on System A, which has a hostname of MachineA and a netname of MachineA-scid0. Node numbers 30, 32, and 34 run on System B, which has a hostname of MachineB and a netname of MachineB-scid0. System A hosts a logical host that is responsible for database partitions 20, 22, and 24. System B is listed as a backup for this logical host and will host it if System A goes down. Note that after the failover, the restarted partitions use port numbers 3, 4, and 5 on MachineB, because ports 0 through 2 are already in use by the partitions originally running there.

Before:
        20 MachineA 0 MachineA-scid0   <= Sun Cluster 2.1
        22 MachineA 1 MachineA-scid0   <= Sun Cluster 2.1
        24 MachineA 2 MachineA-scid0   <= Sun Cluster 2.1
        30 MachineB 0 MachineB-scid0   <= Sun Cluster 2.1
        32 MachineB 1 MachineB-scid0   <= Sun Cluster 2.1
        34 MachineB 2 MachineB-scid0   <= Sun Cluster 2.1
 
        db2start nodenum 20 restart hostname MachineB port 3 netname MachineB-scid0
        db2start nodenum 22 restart hostname MachineB port 4 netname MachineB-scid0
        db2start nodenum 24 restart hostname MachineB port 5 netname MachineB-scid0
 
After:
        20 MachineB 3 MachineB-scid0   <= Sun Cluster 2.1
        22 MachineB 4 MachineB-scid0   <= Sun Cluster 2.1
        24 MachineB 5 MachineB-scid0   <= Sun Cluster 2.1
        30 MachineB 0 MachineB-scid0   <= Sun Cluster 2.1
        32 MachineB 1 MachineB-scid0   <= Sun Cluster 2.1
        34 MachineB 2 MachineB-scid0   <= Sun Cluster 2.1

If you do decide to use a mutual takeover environment for the coordinator node, you may want to adjust the database manager configuration parameters that affect FCM connection retries and timeouts.

Reducing the values of these parameters reduces the failover time for the coordinator node, but increases the risk of an FCM connection timeout. Tune these parameters to meet your requirements; a sketch follows.
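
As an illustration only, lowering such parameters might be done with commands like the following. CONN_ELAPSE and MAX_CONNRETRIES are the FCM-related database manager configuration parameters assumed here, and the values shown are arbitrary examples rather than recommendations; confirm the parameter names and defaults for your release:

        # Hypothetical sketch: lower the FCM connection parameters (assumed
        # here to be CONN_ELAPSE and MAX_CONNRETRIES) so that a failed
        # coordinator node is detected sooner. Values are examples only.
        db2 update dbm cfg using CONN_ELAPSE 3
        db2 update dbm cfg using MAX_CONNRETRIES 3
        # Restart the instance so the new values take effect.
        db2stop
        db2start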

