The database manager does not, in general, restrict the character set available to an application except as noted below.
Each combined Single-Byte Character Set (SBCS) or Double-Byte Character Set
(DBCS) code page allows for both single- and double-byte character code
points. This is usually accomplished by reserving a subset of the 256
available code points of a mixed code table for single-byte characters, with
the remainder of the code points either undefined or allocated to the first
byte of double-byte code points. These code points are shown in the
following table.
Table 134. Mixed Character Set Code Points
Country | Supported Mixed Code Page | Code Points for Single-byte Characters | Code Points for First Byte of Double-Byte Characters |
---|---|---|---|
Japan | 932, 943 | x00-7F, xA1-DF | x81-9F, xE0-FC |
Japan | 942 | x00-80, xA0-DF, xFD-FF | x81-9F, xE0-FC |
Taiwan | 938 (*) | x00-7E | x81-FC |
Taiwan | 948 (*) | x00-80, FD, FE | x81-FC |
Korea | 949 | x00-7F | x8F-FE |
Taiwan | 950 | x00-7E | x81-FE |
China | 1381 | x00-7F | x8C-FE |
Korea | 1363 | x00-7F | x81-FE |
China | 1386 | x00 | x81-FE |
Notes:
|
Code points not assigned to either category above are not defined, and are processed as single-byte undefined code points.
Within each implied DBCS code table, there are 256 code points available as the second byte for each valid first byte. Second byte values can have any value from 0x40 to 0x7E and from 0x80 to 0xFE. Note that in DBCS environments, DB2 does not perform validity checking on individual double-byte characters.
Each EUC code page allows for both single-byte character code points, and
up to three different sets of multi-byte character code points. This is
accomplished by reserving a subset of the 256 available code points of each
implied SBCS code page identifier for single-byte characters. The
remainder of the code points is undefined, allocated as an element of a
multi-byte character, or allocated as a single-shift introducer of a
multi-byte character. These code points are shown in the following
tables.
Table 135. Japanese EUC Code Points
Group | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
---|---|---|---|---|
G0 | x20-7E | n/a | n/a | n/a |
G1 | xA1-FE | xA1-FE | n/a | n/a |
G2 | x8E | xA1-FE | n/a | n/a |
G3 | x8E | xA1-FE | xA1-FE | n/a |
Table 136. Traditional Chinese EUC Code Points
Group | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
---|---|---|---|---|
G0 | x20-7E | n/a | n/a | n/a |
G1 | xA1-FE | xA1-FE | n/a | n/a |
G2 | x8E | xA1-FE | xA1-FE | xA1-FE |
G3 | n/a | n/a | n/a | n/a |
Table 137. Korean EUC Code Points
Group | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
---|---|---|---|---|
G0 | x20-7E | n/a | n/a | n/a |
G1 | xA1-FE | xA1-FE | n/a | n/a |
G2 | n/a | n/a | n/a | n/a |
G3 | n/a | n/a | n/a | n/a |
Table 138. Simplified Chinese EUC Code Points
Group | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
---|---|---|---|---|
G0 | x20-7E | n/a | n/a | n/a |
G1 | xA1-FE | xA1-FE | n/a | n/a |
G2 | n/a | n/a | n/a | n/a |
G3 | n/a | n/a | n/a | n/a |
Code points not assigned to the categories shown above are not defined, and are treated as single-byte undefined code points.
The basic character set that may be used in database names consists of the single-byte uppercase and lowercase Latin letters (A...Z, a...z), the Arabic numerals (0...9) and the underscore character (_). This list of letters is augmented with the three special characters #, @ and $ to provide compatibility with host database products. However, these special characters should be used with care in an NLS environment because they are not included in the NLS host (EBCDIC) invariant character set.
When naming database objects (such as tables and views), program labels, host variables, cursors and statements alphabetics from the extended character set may also be used. For example, those letters with diacritical marks. The available characters depend on the code page in use and if you are using the database in a multiple code page environment, you must ensure that all code pages support any alphabetics you plan on using from the extended character set. See the SQL Reference for a discussion of delimited identifiers which can be used in SQL statements and can also contain characters outside the extended character set.
In DBCS environments, the extended character set consists of all the characters in the basic character set, plus those identified as a letter or digit as follows:
The coding of SQL statements is not language dependent. SQL is a programming language and, like other programming languages such as C, it is language invariant. The SQL keywords must be typed as shown, although they may be typed in uppercase, lowercase, or mixed case. The names of database objects, host variables and program labels that occur in an SQL statement cannot contain characters outside the database manager extended character set as described above.
The following BiDi attributes are required for correct handling of Bidirectional data on different platforms:
- Text type (LOGICAL vs VISUAL) - Shaping (SHAPED vs UNSHAPED) - Orientation (RIGHT-TO-LEFT vs LEFT-TO-RIGHT) - Numeral shape (ARABIC vs HINDI) - Symmetric swapping (YES or NO)
Defaults on different platforms are not the same, problems appear when DB2 data is sent from one platform to another. For example, Windows platforms use LOGICAL UNSHAPED data, while data on OS/390 is usually in SHAPED VISUAL format. Therefore, without any support for bidirectional attributes, data sent from DB2 for OS/390 to DB2 UDB on a Windows 32-bit operating systems may display incorrectly.
DB2 supports bidirectional data attributes through special bidirectional Coded Character Set Identifiers (CCSIDs). The following bidirectional CCSIDs have been defined and are implemented with DB2 UDB:
CCSID ] Code ] String ] Page ] Type -------+--------+---------- 00420 420 4 00424 424 4 08612 420 5 08616 424 6 62208 856 4 62209 862 4 62210 916 4 62211 424 5 00856 856 5 62213 862 5 00916 916 5 01255 1255 5 01046 1046 5 00864 864 5 01089 1089 5 01256 1256 5 62220 856 6 62221 862 6 62222 916 6 62223 1255 6 62224 420 6 62225 864 6 62226 1046 6 62227 1089 6 62228 1256 6 62235 424 10 62236 856 10 00862 862 10 62238 916 10 62239 1255 10 62240 424 11 62241 856 11 62242 862 11 62243 916 11 62244 1255 11
Where CDRA String Types are defined:
String ] Text ] Numerical ] Orientation ] Shaping ] Symmetrical Type ] Type ] Shape ] ] ] Swapping ---------+-------+------------+-------------+-----------+------------- 4 Visual Arabic LTR Shaped OFF 5 Implicit Arabic LTR Unshaped ON 6 Implicit Arabic RTL Unshaped ON 7(*) Visual Arabic Contextual(*) Unshaped-Lig OFF 8 Visual Arabic RTL Shaped OFF 9 Visual Passthru RTL Shaped ON 10 Implicit Contextual-L ON 11 Implicit Contextual-R ON
Note: | (*) Field orientation is left-to-right (LTR) when the first alphabetic character is a Latin one, and right-to-left (RTL) when it is a bidirectional (RTL) character. Characters are unshaped, but LamAlef ligatures are kept, and not broken into constituents. |
Bidirectional layout transformations are implemented in DB2 Universal Database Version 5.2 using the new CCSID definitions. For the new BiDi-specific CCSIDs, layout transformations are performed instead of or in addition to code page conversions. To use this support, the DB2BIDI registry variable must be set to YES. By default, this variable is not set. This variable is used by the server for all conversions, and can only be set when the server is started. Setting DB2BIDI to YES may have some performance impact because of additional checking and layout transformations.
To specify a specific bidirectional CCSID in non-DRDA environment, select the appropriate CCSID from the above table that matches the characteristics of your client, and set DB2CODEPAGE to that value. If you already have a connection to the database, you must issue a TERMINATE command and connect again to make the new setting of DB2CODEPAGE take effect. If you select a CCSID which is not correct for code page or string type of your client platform, results would be unexpected. If you select an incompatible CCSID (ie, Hebrew CCSID for connection to an Arabic database or vice-versa), or if DB2BIDI has not been set for the server, you will receive an error message when you try to connect.
For DRDA environments, if the HOST EBCDIC platform also supports these bidirectional CCSIDs, you need to only set DB2CODEPAGE as mentioned above. However, if HOST platform does not support these CCSIDs, you must specify a CCSID override for the HOST database server that you are connecting to. This is necessary because, in DRDA environment, code page conversions and layout transformations are performed by the receiver of data. However, if HOST server does not support these bidirectional CCSIDs, it does not perform layout transformation on the data that it receives from DB2 UDB. If you use a CCSID override, the DB2 UDB client performs layout transformation on the outbound data as well. For details of how to set a CCSID override, please refer to DB2 Connect Release Notes.
CCSID override is not supported for cases where the HOST EBCDIC platform is the client and DB2 UDB is the server.
The database manager compares character data using a collating sequence. This is an ordering for a set of characters that determines whether a particular character sorts higher, lower, or the same as another.
Note: | Character string data defined with the FOR BIT DATA attribute, or BLOB data, is sorted using the binary sort sequence. |
For example, a collating sequence can be used to indicate that lowercase and uppercase versions of a particular character are to be sorted equally.
The database manager allows databases to be created with custom collating sequences. The following sections help you determine and implement a particular collating sequence for a database.
In a database, each single-byte character is represented internally as a unique number between 0 and 255, (in hexadecimal notation, between X'00' and X'FF'). This number is referred to as the code point of the character. A collating sequence is a mapping between the code point and the desired position of each character in a sorted sequence. The numeric value of the position is called the weight of the character in the collating sequence. The simplest collating sequence is one where the weights are identical to the code points. This is called the identity sequence.
For example, consider the characters B (X'42'), and b (X'62'). If, according to the collating sequence table, they both have a sort weight of X'42' (B), then they collate the same. If the sort weight for B is X'9E' and the sort weight for b is X'9D', then b will be sorted before B. Actual weights depend on the collating sequence table used which depends on the code set and locale. Note that a collating sequence table is not the same as a code page table which defines code points.
Consider the following example. In ASCII, the characters A through Z are represented by X'41' through X'5A'. To describe a collating sequence where these are sorted in order, and consecutively (no intervening characters), you can write X'41', X'42', ...X'59', X'5A'.
For multi-byte characters, the hexadecimal value of the multi-byte character is also used as the weight. For example, X'8260', X'8261' are the code points for double byte character A and B. In this case, you can write X'8260', X'8261' as the collating sequence for double byte characters A and B. These are also the code points for A and B.
The values of the weights in a collating sequence need not be unique. For example, you could give uppercase letters and their lowercase equivalents the same weight.
Specifying the collating sequence can be simplified if a collating sequence provides weights for all 256 code points. The weight of each character can be determined using the code point of the character. This is the method used to specify a collating sequence for the database manager: a string of 256 bytes, where the nth byte (starting with 0) contains the weight of code point n.
In the case of multi-byte character sets, DB2 uses the collation table which was specified at database creation time. If you require the multi-byte characters to sort the way they appear in their code point table, you must specify IDENTITY as your collation sequence when you create the database.
Note: | For DBCS characters in GRAPHIC fields, the sort sequence is always IDENTITY without regard to the collation sequence specified at database creation time. |
Once a collating sequence is established, character comparison is performed by comparing the weights of two characters, instead of directly comparing their code point values.
If weights that are not unique are used, characters that are not identical may compare equally. Because of this, string comparison must be a two-phase process:
If the collating sequence contains 256 unique weights, only the first step is performed. If the collating sequence is the identity sequence only the second step is performed. In either case, there is a performance benefit.
For more information on character comparisons, see the SQL Reference.
To perform character comparisons that are independent of whether they are upper or lower case, you can use the TRANSLATE function to select and compare mixed case column data by translating it to upper case, but only for the purposes of comparison. Consider the following data:
Abel abels ABEL abel ab Ab
For the following select statement:
SELECT c1 FROM T1 WHERE TRANSLATE(c1) LIKE 'AB%'
you would receive the following results:
ab Ab abel Abel ABEL abels
Note: | You could also set the select as in the following view v1, and then make all your comparisons against the view (in upper case) and your inserts into the table in mixed case: |
CREATE VIEW v1 AS SELECT TRANSLATE(c1) FROM t1
At the database level, you can set the collating sequence as part of the CREATE DATABASE API. This allows you to decide if 'a' is processed before 'A', or if 'A' is processed after 'a', or if they are processed with equal weighting. This will make them equal when collating or sorting using the ORDER BY clause. If you have two values of 'a' and 'A', 'A' will always come before 'a', because in all senses they are equal, so the only difference upon which to sort is the hexadecimal value.
Thus if you issue SELECT c1 FROM t1 WHERE c1 LIKE 'ab%', you receive the following output:
ab abel abels
If you issue SELECT c1 FROM t1 WHERE c1 LIKE 'A%', you receive the following output:
Abel Ab ABEL
If you issue SELECT c1 FROM t1 ORDER BY c1, you receive the following:
ab Ab abel Abel ABEL abels
Thus, you may want to consider using the scalar function TRANSLATE(), as well as the CREATE DATABASE API. Note that you can only specify a collating sequence using the CREATE DATABASE API. You cannot specify a collating sequence from the Command Line Processor. For information on the TRANSLATE() function, see the SQL Reference. For information on the CREATE DATABASE API see the API Reference.
You can also use the UCASE function as follows, but note that DB2 performs a table scan instead of using an index for the select:
SELECT * FROM EMP WHERE UCASE(JOB) = 'NURSE'
The collating sequence for a database is specified at database creation time. Once the database has been created, the collating sequence cannot be changed.
The CREATE DATABASE API accepts a data structure called the Database Descriptor Block (SQLEDBDESC). You can define your own collating sequence within this structure.
To specify a collating sequence for a database:
The SQLEDBDESC structure contains:
Note: | These constants are defined in the SQLENV include file. |
Several sample collating sequences are provided (as include files) to facilitate database creation using the EBCDIC collating sequences instead of the default workstation collating sequence.
The collating sequences in these include files can be specified in the SQLDBUDC field of the SQLEDBDESC structure. They can also be used as models for the construction of other collating sequences.
Once a collating sequence is defined, all future character comparisons for that database will be performed with that collating sequence. Except for character data defined as FOR BIT DATA or BLOB data, the collating sequence will be used for all SQL comparisons and ORDER BY clauses, and also in setting up indexes and statistics. For more information on how the database collating sequence is used, see the section on String Comparisons in the SQL Reference, S10J-8165-01.
Potential problems may occur in the following cases:
A final point to remember is that the results of any sort based on a direct comparison of characters will only match the results of a query ordered using an identity collating sequence.