Administration Guide

Character Sets

The database manager does not, in general, restrict the character set available to an application except as noted below.

Each combined Single-Byte Character Set (SBCS) or Double-Byte Character Set (DBCS) code page allows for both single- and double-byte character code points. This is usually accomplished by reserving a subset of the 256 available code points of a mixed code table for single-byte characters, with the remainder of the code points either undefined or allocated to the first byte of double-byte code points. These code points are shown in the following table.

Table 134. Mixed Character Set Code Points

Country Supported Mixed Code Page Code Points for Single-byte Characters Code Points for First Byte of Double-Byte Characters
Japan 932, 943 x00-7F, xA1-DF x81-9F, xE0-FC
Japan 942 x00-80, xA0-DF, xFD-FF x81-9F, xE0-FC
Taiwan 938 (*) x00-7E x81-FC
Taiwan 948 (*) x00-80, FD, FE x81-FC
Korea 949 x00-7F x8F-FE
Taiwan 950 x00-7E x81-FE
China 1381 x00-7F x8C-FE
Korea 1363 x00-7F x81-FE
China 1386 x00 x81-FE

Notes:

(*) means that this is an old code page and is not recommended anymore.

Country	Supported Mixed Code Page	Code Points for Single-byte Characters	Code Points for First Byte of Double-Byte Characters
Japan	932, 943	x00-7F, xA1-DF	x81-9F, xE0-FC
Japan	942	x00-80, xA0-DF, xFD-FF	x81-9F, xE0-FC
Taiwan	938 (*)	x00-7E	x81-FC
Taiwan	948 (*)	x00-80, FD, FE	x81-FC
Korea	949	x00-7F	x8F-FE
Taiwan	950	x00-7E	x81-FE
China	1381	x00-7F	x8C-FE
Korea	1363	x00-7F	x81-FE
China	1386	x00	x81-FE
Notes: (*) means that this is an old code page and is not recommended anymore.

Code points not assigned to either category above are not defined, and are processed as single-byte undefined code points.

Within each implied DBCS code table, there are 256 code points available as the second byte for each valid first byte. Second byte values can have any value from 0x40 to 0x7E and from 0x80 to 0xFE. Note that in DBCS environments, DB2 does not perform validity checking on individual double-byte characters.

Extended UNIX Code (EUC) Character Sets

Each EUC code page allows for both single-byte character code points, and up to three different sets of multi-byte character code points. This is accomplished by reserving a subset of the 256 available code points of each implied SBCS code page identifier for single-byte characters. The remainder of the code points is undefined, allocated as an element of a multi-byte character, or allocated as a single-shift introducer of a multi-byte character. These code points are shown in the following tables.

Table 135. Japanese EUC Code Points

Group 1st Byte 2nd Byte 3rd Byte 4th Byte
G0 x20-7E n/a n/a n/a
G1 xA1-FE xA1-FE n/a n/a
G2 x8E xA1-FE n/a n/a
G3 x8E xA1-FE xA1-FE n/a

Group	1st Byte	2nd Byte	3rd Byte	4th Byte
G0	x20-7E	n/a	n/a	n/a
G1	xA1-FE	xA1-FE	n/a	n/a
G2	x8E	xA1-FE	n/a	n/a
G3	x8E	xA1-FE	xA1-FE	n/a

Table 136. Traditional Chinese EUC Code Points

Group 1st Byte 2nd Byte 3rd Byte 4th Byte
G0 x20-7E n/a n/a n/a
G1 xA1-FE xA1-FE n/a n/a
G2 x8E xA1-FE xA1-FE xA1-FE
G3 n/a n/a n/a n/a

Group	1st Byte	2nd Byte	3rd Byte	4th Byte
G0	x20-7E	n/a	n/a	n/a
G1	xA1-FE	xA1-FE	n/a	n/a
G2	x8E	xA1-FE	xA1-FE	xA1-FE
G3	n/a	n/a	n/a	n/a

Table 137. Korean EUC Code Points

Group 1st Byte 2nd Byte 3rd Byte 4th Byte
G0 x20-7E n/a n/a n/a
G1 xA1-FE xA1-FE n/a n/a
G2 n/a n/a n/a n/a
G3 n/a n/a n/a n/a

Group	1st Byte	2nd Byte	3rd Byte	4th Byte
G0	x20-7E	n/a	n/a	n/a
G1	xA1-FE	xA1-FE	n/a	n/a
G2	n/a	n/a	n/a	n/a
G3	n/a	n/a	n/a	n/a

Table 138. Simplified Chinese EUC Code Points

Group 1st Byte 2nd Byte 3rd Byte 4th Byte
G0 x20-7E n/a n/a n/a
G1 xA1-FE xA1-FE n/a n/a
G2 n/a n/a n/a n/a
G3 n/a n/a n/a n/a

Group	1st Byte	2nd Byte	3rd Byte	4th Byte
G0	x20-7E	n/a	n/a	n/a
G1	xA1-FE	xA1-FE	n/a	n/a
G2	n/a	n/a	n/a	n/a
G3	n/a	n/a	n/a	n/a

Code points not assigned to the categories shown above are not defined, and are treated as single-byte undefined code points.

Character Set for Identifiers

The basic character set that may be used in database names consists of the single-byte uppercase and lowercase Latin letters (A...Z, a...z), the Arabic numerals (0...9) and the underscore character (_). This list of letters is augmented with the three special characters #, @ and $ to provide compatibility with host database products. However, these special characters should be used with care in an NLS environment because they are not included in the NLS host (EBCDIC) invariant character set.

When naming database objects (such as tables and views), program labels, host variables, cursors and statements alphabetics from the extended character set may also be used. For example, those letters with diacritical marks. The available characters depend on the code page in use and if you are using the database in a multiple code page environment, you must ensure that all code pages support any alphabetics you plan on using from the extended character set. See the SQL Reference for a discussion of delimited identifiers which can be used in SQL statements and can also contain characters outside the extended character set.

Extended Character Set Definition for DBCS Identifiers

In DBCS environments, the extended character set consists of all the characters in the basic character set, plus those identified as a letter or digit as follows:

All double-byte characters in each DBCS code page, except the double-byte space, are valid letters.
The double-byte space is a special character.
The single-byte characters available in each mixed code page are assigned to various categories as follows:

Category
Valid Code Points within each Mixed Code Page
Digits
x30-39
Letters
x23-24, x40-5A, x61-7A, xA6-DF (A6-DF for code pages 932 and 942 only)
Special Characters
All other valid single-byte character code points

Coding of SQL Statements

The coding of SQL statements is not language dependent. SQL is a programming language and, like other programming languages such as C, it is language invariant. The SQL keywords must be typed as shown, although they may be typed in uppercase, lowercase, or mixed case. The names of database objects, host variables and program labels that occur in an SQL statement cannot contain characters outside the database manager extended character set as described above.

Bidirectional CCSID Support

The following BiDi attributes are required for correct handling of Bidirectional data on different platforms:

   - Text type (LOGICAL vs VISUAL)
   - Shaping (SHAPED vs UNSHAPED)
   - Orientation (RIGHT-TO-LEFT vs LEFT-TO-RIGHT)
   - Numeral shape (ARABIC vs HINDI)
   - Symmetric swapping (YES or NO)

Defaults on different platforms are not the same, problems appear when DB2 data is sent from one platform to another. For example, Windows platforms use LOGICAL UNSHAPED data, while data on OS/390 is usually in SHAPED VISUAL format. Therefore, without any support for bidirectional attributes, data sent from DB2 for OS/390 to DB2 UDB on a Windows 32-bit operating systems may display incorrectly.

Bidirectional-specific CCSIDs

DB2 supports bidirectional data attributes through special bidirectional Coded Character Set Identifiers (CCSIDs). The following bidirectional CCSIDs have been defined and are implemented with DB2 UDB:

 
        CCSID ]  Code  ] String
              ]  Page  ]  Type
       -------+--------+----------
         00420    420      4
         00424    424      4
         08612    420      5
         08616    424      6
         62208    856      4
         62209    862      4
         62210    916      4
         62211    424      5
         00856    856      5
         62213    862      5
         00916    916      5
         01255   1255      5
         01046   1046      5
         00864    864      5
         01089   1089      5
         01256   1256      5
         62220    856      6
         62221    862      6
         62222    916      6
         62223   1255      6
         62224    420      6
         62225    864      6
         62226   1046      6
         62227   1089      6
         62228   1256      6
         62235    424     10
         62236    856     10
         00862    862     10
         62238    916     10
         62239   1255     10
         62240    424     11
         62241    856     11
         62242    862     11
         62243    916     11
         62244   1255     11

Where CDRA String Types are defined:

    String ]  Text ] Numerical  ] Orientation ]  Shaping  ] Symmetrical
     Type  ]  Type ]   Shape    ]             ]           ]  Swapping
  ---------+-------+------------+-------------+-----------+-------------
      4     Visual     Arabic         LTR       Shaped         OFF
      5     Implicit   Arabic         LTR       Unshaped       ON
      6     Implicit   Arabic         RTL       Unshaped       ON
      7(*)  Visual     Arabic    Contextual(*)  Unshaped-Lig   OFF
      8     Visual     Arabic         RTL       Shaped         OFF
      9     Visual    Passthru        RTL       Shaped         ON
     10     Implicit             Contextual-L                  ON
     11     Implicit             Contextual-R                  ON

Note:

(*) Field orientation is left-to-right (LTR) when the first alphabetic character is a Latin one, and right-to-left (RTL) when it is a bidirectional (RTL) character. Characters are unshaped, but LamAlef ligatures are kept, and not broken into constituents.

DB2 Universal Database Implementation of Bidirectional Support

Bidirectional layout transformations are implemented in DB2 Universal Database Version 5.2 using the new CCSID definitions. For the new BiDi-specific CCSIDs, layout transformations are performed instead of or in addition to code page conversions. To use this support, the DB2BIDI registry variable must be set to YES. By default, this variable is not set. This variable is used by the server for all conversions, and can only be set when the server is started. Setting DB2BIDI to YES may have some performance impact because of additional checking and layout transformations.

To specify a specific bidirectional CCSID in non-DRDA environment, select the appropriate CCSID from the above table that matches the characteristics of your client, and set DB2CODEPAGE to that value. If you already have a connection to the database, you must issue a TERMINATE command and connect again to make the new setting of DB2CODEPAGE take effect. If you select a CCSID which is not correct for code page or string type of your client platform, results would be unexpected. If you select an incompatible CCSID (ie, Hebrew CCSID for connection to an Arabic database or vice-versa), or if DB2BIDI has not been set for the server, you will receive an error message when you try to connect.

For DRDA environments, if the HOST EBCDIC platform also supports these bidirectional CCSIDs, you need to only set DB2CODEPAGE as mentioned above. However, if HOST platform does not support these CCSIDs, you must specify a CCSID override for the HOST database server that you are connecting to. This is necessary because, in DRDA environment, code page conversions and layout transformations are performed by the receiver of data. However, if HOST server does not support these bidirectional CCSIDs, it does not perform layout transformation on the data that it receives from DB2 UDB. If you use a CCSID override, the DB2 UDB client performs layout transformation on the outbound data as well. For details of how to set a CCSID override, please refer to DB2 Connect Release Notes.

CCSID override is not supported for cases where the HOST EBCDIC platform is the client and DB2 UDB is the server.

Collating Sequences

The database manager compares character data using a collating sequence. This is an ordering for a set of characters that determines whether a particular character sorts higher, lower, or the same as another.
Note: Character string data defined with the FOR BIT DATA attribute, or BLOB data, is sorted using the binary sort sequence.

For example, a collating sequence can be used to indicate that lowercase and uppercase versions of a particular character are to be sorted equally.

The database manager allows databases to be created with custom collating sequences. The following sections help you determine and implement a particular collating sequence for a database.

Overview

In a database, each single-byte character is represented internally as a unique number between 0 and 255, (in hexadecimal notation, between X'00' and X'FF'). This number is referred to as the code point of the character. A collating sequence is a mapping between the code point and the desired position of each character in a sorted sequence. The numeric value of the position is called the weight of the character in the collating sequence. The simplest collating sequence is one where the weights are identical to the code points. This is called the identity sequence.

For example, consider the characters B (X'42'), and b (X'62'). If, according to the collating sequence table, they both have a sort weight of X'42' (B), then they collate the same. If the sort weight for B is X'9E' and the sort weight for b is X'9D', then b will be sorted before B. Actual weights depend on the collating sequence table used which depends on the code set and locale. Note that a collating sequence table is not the same as a code page table which defines code points.

Consider the following example. In ASCII, the characters A through Z are represented by X'41' through X'5A'. To describe a collating sequence where these are sorted in order, and consecutively (no intervening characters), you can write X'41', X'42', ...X'59', X'5A'.

For multi-byte characters, the hexadecimal value of the multi-byte character is also used as the weight. For example, X'8260', X'8261' are the code points for double byte character A and B. In this case, you can write X'8260', X'8261' as the collating sequence for double byte characters A and B. These are also the code points for A and B.

The values of the weights in a collating sequence need not be unique. For example, you could give uppercase letters and their lowercase equivalents the same weight.

Specifying the collating sequence can be simplified if a collating sequence provides weights for all 256 code points. The weight of each character can be determined using the code point of the character. This is the method used to specify a collating sequence for the database manager: a string of 256 bytes, where the nth byte (starting with 0) contains the weight of code point n.

In the case of multi-byte character sets, DB2 uses the collation table which was specified at database creation time. If you require the multi-byte characters to sort the way they appear in their code point table, you must specify IDENTITY as your collation sequence when you create the database.
Note: For DBCS characters in GRAPHIC fields, the sort sequence is always IDENTITY without regard to the collation sequence specified at database creation time.

Character Comparisons

Once a collating sequence is established, character comparison is performed by comparing the weights of two characters, instead of directly comparing their code point values.

If weights that are not unique are used, characters that are not identical may compare equally. Because of this, string comparison must be a two-phase process:

Compare the characters of each string based on their weights.
If step 1 yielded equality, compare the characters of each string based on their code point values.

If the collating sequence contains 256 unique weights, only the first step is performed. If the collating sequence is the identity sequence only the second step is performed. In either case, there is a performance benefit.

For more information on character comparisons, see the SQL Reference.

Case Independent Comparisons

To perform character comparisons that are independent of whether they are upper or lower case, you can use the TRANSLATE function to select and compare mixed case column data by translating it to upper case, but only for the purposes of comparison. Consider the following data:

  Abel
  abels
  ABEL
  abel
  ab
  Ab

For the following select statement:

SELECT c1 FROM T1 WHERE TRANSLATE(c1) LIKE 'AB%'

you would receive the following results:

  ab
  Ab
  abel
  Abel
  ABEL
  abels

Note: You could also set the select as in the following view v1, and then make all your comparisons against the view (in upper case) and your inserts into the table in mixed case:

  CREATE VIEW v1 AS SELECT TRANSLATE(c1) FROM t1

At the database level, you can set the collating sequence as part of the CREATE DATABASE API. This allows you to decide if 'a' is processed before 'A', or if 'A' is processed after 'a', or if they are processed with equal weighting. This will make them equal when collating or sorting using the ORDER BY clause. If you have two values of 'a' and 'A', 'A' will always come before 'a', because in all senses they are equal, so the only difference upon which to sort is the hexadecimal value.

Thus if you issue SELECT c1 FROM t1 WHERE c1 LIKE 'ab%', you receive the following output:

  ab
  abel
  abels

If you issue SELECT c1 FROM t1 WHERE c1 LIKE 'A%', you receive the following output:

  Abel
  Ab
  ABEL

If you issue SELECT c1 FROM t1 ORDER BY c1, you receive the following:

  ab
  Ab
  abel
  Abel
  ABEL
  abels

Thus, you may want to consider using the scalar function TRANSLATE(), as well as the CREATE DATABASE API. Note that you can only specify a collating sequence using the CREATE DATABASE API. You cannot specify a collating sequence from the Command Line Processor. For information on the TRANSLATE() function, see the SQL Reference. For information on the CREATE DATABASE API see the API Reference.

You can also use the UCASE function as follows, but note that DB2 performs a table scan instead of using an index for the select:

    SELECT * FROM EMP WHERE UCASE(JOB) = 'NURSE'

Specifying a Collating Sequence

The collating sequence for a database is specified at database creation time. Once the database has been created, the collating sequence cannot be changed.

The CREATE DATABASE API accepts a data structure called the Database Descriptor Block (SQLEDBDESC). You can define your own collating sequence within this structure.

To specify a collating sequence for a database:

Pass the desired SQLEDBDESC structure, or
Pass a NULL pointer. The collating sequence of the operating system (based on current country code and code page) is used. This is the same as specifying SQLDBCSS equal to SQL_CS_SYSTEM (0).

The SQLEDBDESC structure contains:

SQLDBCSS

A 4-byte integer indicating the source of the database collating sequence. Valid values are:

SQL_CS_SYSTEM: The collating sequence of the operating system (based on current country code and code page) is used.
SQL_CS_USER: The collating sequence is specified by the value in the SQLDBUDC field.
SQL_CS_NONE: The collating sequence is the identity sequence. Strings are compared byte for byte, starting with the first byte, using a simple binary comparison.

Note:

These constants are defined in the SQLENV include file.

SQLDBUDC

A 256-byte field. The nth byte contains the sort weight of the nth character in the code page of the database. If SQLDBCSS is not equal to SQL_CS_USER, this field is ignored.

Sample Collating Sequences

Several sample collating sequences are provided (as include files) to facilitate database creation using the EBCDIC collating sequences instead of the default workstation collating sequence.

The collating sequences in these include files can be specified in the SQLDBUDC field of the SQLEDBDESC structure. They can also be used as models for the construction of other collating sequences.

Other Concerns

Once a collating sequence is defined, all future character comparisons for that database will be performed with that collating sequence. Except for character data defined as FOR BIT DATA or BLOB data, the collating sequence will be used for all SQL comparisons and ORDER BY clauses, and also in setting up indexes and statistics. For more information on how the database collating sequence is used, see the section on String Comparisons in the SQL Reference, S10J-8165-01.

Potential problems may occur in the following cases:

An application merges sorted data from a database with application data that was sorted using a different collating sequence.
An application merges sorted data from one database with sorted data from another, but the databases have different collating sequences.
An application makes assumptions about sorted data that are not true for the relevant collating sequence. For example, numbers collating lower than alphabetics might or might not be true for a particular collating sequence.

A final point to remember is that the results of any sort based on a direct comparison of characters will only match the results of a query ordered using an identity collating sequence.

[ Top of Page | Previous Page | Next Page | Table of Contents | Index ]

[ DB2 List of Books | Search the DB2 Books ]