KERMIT PROTOCOL FOR JAPANESE TEXT FILE TRANSFER

                       Christine Gianone, Frank da Cruz
                              The Kermit Project
                             Columbia University

                              Dr. Hirofumi Fujii
              Japan National Laboratory for High Energy Physics

                               30 September 1999

               [ Updating the original draft of 28 March 1991 ]

ABSTRACT

Several different Kanji computer codes exist, so a unified method is needed
for transferring Kanji text between computers that use different codes.
Several methods are examined, and one is chosen.

This proposal does not address transfer of Japanese files composed only of
Roman and Katakana single-byte characters, for which well-defined mechanisms
already exist.


BACKGROUND

The Kermit protocol transfers text files using a common intermediate
representation for text.  The protocol was extended in 1989-90 to allow for
use of different standard character sets within Kermit packets.  To transfer
text written in a particular language, the sending program translates from its
local character set to the standard set for that language, and the receiver
translates from the standard set to its own local set.  Thus, each Kermit
program needs to know only its local character sets and the standard ones, and
does not need to know about nonstandard sets used by other computers.

MS-DOS Kermit 3.0 for the IBM PC, C-Kermit 5A for UNIX and VAX/VMS, and IBM
mainframe Kermit 4.2 for MVS/TSO and VM/CMS were the first to implement
character-set translation; this included most European languages written with
Roman and Cyrillic alphabets (and, in some cases, also Greek, Hebrew, and
Japanese Katakana).

Extending the protocol to the full Japanese writing system, however, poses
special problems because it requires a mixture of three distinct character
sets -- Roman, Katakana, and Kanji -- and because the Kanji character set is
very large compared to Western alphabets or syllabaries.


CHARACTER SETS USED IN JAPAN

Japanese standards exist for three different character sets:

1. JIS Roman, ISO 646 Japanese Version, ISO registration number 14.  A
94-character single-byte set identical to US ASCII except in positions 05/12
(ASCII backslash replaced by Yen sign) and 07/14 (ASCII tilde replaced by
macron or overbar).  Hereafter referred to as Roman.

2. JIS Katakana, ISO registration number 13.  A 94-character single-byte set
containing Katakana characters in columns 2 through 5, with columns 6 and 7
unused.  JIS X 0201 specifies the combination of (1) and (2) into a
single-byte 8-bit character set.

3. JIS X 0208 Multiple-Byte Character Set, ISO Registration Number 87.  A
two-byte character set in which each byte is a 7-bit value, and the
high-order ("8th") bit of each byte is unused.  JIS X 0208 includes not only
Kanji (approximately 6400 characters), but also Roman, Hiragana, Katakana,
Greek, and Cyrillic characters, for a total of approximately 6900 defined
characters.  The non-Kanji JIS X 0208 characters are double width and not
normally used.  Additional space is reserved for nonstandard characters
("Gaiji").
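
As an illustration of the 8-bit combination described in item 2, the
following Python sketch decodes JIS X 0201 bytes, treating values below 80
(hex) as Roman and values A1-DF (hex) as Katakana.  The function name and the
choice of Unicode code points as the output (halfwidth Katakana starting at
U+FF61) are assumptions made for this example and are not part of the
proposal.

```python
def decode_jis_x0201(data):
    """Decode 8-bit JIS X 0201 bytes to a Python string (illustrative only)."""
    out = []
    for b in data:
        if b == 0x5C:
            out.append("\u00a5")            # position 05/12: Yen sign
        elif b == 0x7E:
            out.append("\u203e")            # position 07/14: overline
        elif b < 0x80:
            out.append(chr(b))              # otherwise identical to ASCII
        elif 0xA1 <= b <= 0xDF:
            # Katakana occupies columns 2-5 of the right half.
            # Assumption: map onto the Unicode halfwidth block at U+FF61.
            out.append(chr(0xFF61 + (b - 0xA1)))
        else:
            raise ValueError("byte %02X is not defined in JIS X 0201" % b)
    return "".join(out)
```

Bytes in columns 6 and 7 of the right half (E0-FF) are rejected, matching the
statement that those columns are unused.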

Some Japanese computers use entirely different character sets, for example the
EBCDIC Kanji that is used on IBM and similar mainframes.  Most Japanese
computers, however, use a combination of the three standard character sets.
Different methods are used to allow characters from these different sets to
coexist within a file.

Shift-JIS, commonly found on PCs, uses special byte values 81-9F (hex) and
E0-EF (hex, extended through FC by some vendors) as lead-ins for two-byte
Kanji sequences, of which the second byte lies in the range 40-7E or 80-FC
(so its 8th bit can be 0 or 1).  Bytes in the 00-7F range are
Roman single-byte characters, and bytes in the A1-DF range are Katakana
single-byte characters.  Shift-JIS shifts each 96-byte Kanji plane left by two
columns, so all control regions are filled with graphics, and also fills up
the unused columns in Katakana with Kanji.  Shift-JIS can be translated to
and from EUC by algorithm; no table is necessary.  Microsoft Code Page 932
and Hewlett Packard HP-15 are the same as Shift-JIS.  Shift-JIS is also used
on the Macintosh and on certain UNIX platforms such as Sony NEWS.
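
The table-free translation mentioned above can be sketched as follows for
two-byte Kanji codes.  This is a minimal Python illustration of the widely
published arithmetic conversion, not code taken from any Kermit program; the
function names are hypothetical, and the caller is assumed to pass a valid
two-byte pair.  EUC is taken here to be the JIS X 0208 byte pair with the 8th
bit of each byte set.

```python
def sjis_to_jis(s1, s2):
    """Map a two-byte Shift-JIS code to its 7-bit JIS X 0208 byte pair."""
    adjust = 1 if s2 < 0x9F else 0              # low trail bytes: odd row
    row_offset = 0x70 if s1 < 0xA0 else 0xB0    # two separate lead ranges
    if adjust:
        cell_offset = 0x20 if s2 > 0x7F else 0x1F   # trail byte 7F is skipped
    else:
        cell_offset = 0x7E
    return ((s1 - row_offset) << 1) - adjust, s2 - cell_offset


def jis_to_sjis(j1, j2):
    """Inverse of sjis_to_jis."""
    row_offset = 0x70 if j1 < 0x5F else 0xB0
    if j1 % 2:
        cell_offset = 0x20 if j2 > 0x5F else 0x1F
    else:
        cell_offset = 0x7E
    return ((j1 + 1) >> 1) + row_offset, j2 + cell_offset


def sjis_to_euc(s1, s2):
    """Shift-JIS pair -> EUC pair (JIS bytes with the 8th bit set)."""
    j1, j2 = sjis_to_jis(s1, s2)
    return j1 | 0x80, j2 | 0x80


def euc_to_sjis(e1, e2):
    """EUC pair -> Shift-JIS pair (strip the 8th bit, then fold rows)."""
    return jis_to_sjis(e1 & 0x7F, e2 & 0x7F)
```

Two JIS rows are folded into each Shift-JIS lead byte, which is why the
conversion needs only shifts and offsets rather than a lookup table.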

JIS-7 embeds ISO 2022 character set designation sequences in the text to
switch among double-byte Kanji and single-byte Roman/Katakana.  All Kanji
bytes are encoded with the 8th bit set to zero.  HP-16 is the same as JIS-7.
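
To make the designation mechanism concrete, the Python sketch below splits a
JIS-7 byte stream into runs, each tagged with the character set in effect.
The escape sequences in the table (ESC ( J for JIS Roman, ESC ( I for JIS
Katakana, ESC $ @ and ESC $ B for JIS Kanji, ESC ( B for ASCII) are the
registered ISO 2022 designators as commonly used; they are not listed in this
document, so treat the table, the function name, and the assumed initial set
as assumptions of the example.  Shifts by SO/SI are not handled.

```python
ESC = 0x1B

# Assumed designator table: the two bytes that follow ESC -> set name.
DESIGNATORS = {
    b"(B": "ascii",
    b"(J": "roman",
    b"(I": "katakana",
    b"$@": "kanji",
    b"$B": "kanji",
}


def split_jis7(data, initial="roman"):
    """Return a list of (set_name, bytes) runs from a JIS-7 byte string."""
    runs = []
    current = initial
    buf = bytearray()
    i = 0
    while i < len(data):
        if data[i] == ESC:
            seq = bytes(data[i + 1:i + 3])
            if seq not in DESIGNATORS:
                raise ValueError("unsupported escape sequence at offset %d" % i)
            if buf:                      # close the run in the old set
                runs.append((current, bytes(buf)))
                buf = bytearray()
            current = DESIGNATORS[seq]
            i += 3
        else:
            buf.append(data[i])
            i += 1
    if buf:
        runs.append((current, bytes(buf)))
    return runs
```

Every byte in a "kanji" run has its 8th bit clear, so a receiver must track
the current designation to know whether a given byte is half of a Kanji code
or a single Roman character.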

AT&T EUC (Extended Unix Code) for Japan (sometimes called JAE, for Japanese
Application Environment) sets the 8th bit of each Kanji byte to 1, allowing
Kanji bytes to be easily distinguished from Roman (ASCII) bytes, whose 8th
bits are 0.  A single-shift mechanism is used to select single-byte Katakana
characters:

  0XXXXXXX
    A Roman character (control or graphic).  Stands alone.

  10001110
    Single Shift 2 (SS2).  This means the next byte is a JIS Katakana
    character.  The following byte also has its high-order bit set to 1.

  100XXXXX
    A C1 control character.  C1 controls other than SS2 can be used to
    designate Gaiji.

  1YYXXXXX
    (YY is not 00)  The first byte of a 2-byte Kanji code.  The second
    byte also has its high-order bit set to 1.

This scheme is compliant with ISO 2022 (JIS X 0202) as used in the 8-bit
environment, with JIS Roman designated to G0 and invoked in GL, JIS Kanji
designated to G1 and invoked in GR, and JIS Katakana designated to G2 and
invoked on a per-character basis with SS2.
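
The lead-byte classes listed above can be turned into a simple scanner.  The
Python sketch below is one possible reading, not a reference implementation:
the function name and token labels are hypothetical, and the single-shift
byte is a parameter whose default, 8E (hex), is the value deployed EUC-JP
systems use to introduce a Katakana byte.  That default is an assumption of
this example.

```python
def tokenize_euc(data, single_shift=0x8E):
    """Split EUC bytes into (kind, payload) tokens by inspecting lead bytes."""
    tokens = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            # 0XXXXXXX: a Roman character, stands alone.
            tokens.append(("roman", data[i:i + 1]))
            i += 1
        elif lead < 0xA0 and lead != single_shift:
            # 100XXXXX: any other C1 control.
            tokens.append(("c1", data[i:i + 1]))
            i += 1
        else:
            # Single shift or Kanji lead byte: one more byte must follow,
            # and it must have its high-order bit set.
            if i + 1 >= len(data) or data[i + 1] < 0x80:
                raise ValueError("bad or missing trail byte at offset %d" % i)
            if lead == single_shift:
                tokens.append(("katakana", data[i + 1:i + 2]))
            else:
                tokens.append(("kanji", data[i:i + 2]))
            i += 2
    return tokens
```

Because the class of every character is determined by its first byte alone,
the scanner needs no state beyond its position in the buffer.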


ALTERNATIVES

Kermit's transfer character set for Japanese should be a national or
international standard, or closely related to one.  This leaves us with the
following choices:

1. JIS X 0208 (Level 1)

   Advantages:
    . Simplicity.  This character set contains all the characters of Roman 
      and Katakana as well as Kanji and all characters are the same size.
    . Clarity.  JIS X 0208 has an ISO registration number that can be used
      as an announcer.
   Disadvantages:
    . Noninvertibility: the distinction between single-byte and double-byte
      Roman and Katakana characters is lost.
    . Transmission overhead of representing each Roman (ASCII) value in two
      bytes.
    . No computer uses JIS X 0208 by itself, so all Kermit programs will
      have to translate between file and transfer character sets.

2. Japanese EUC

   Advantages:
    . Matches common usage, which mixes three character sets within a file.
    . Half-width Roman and Katakana are not sacrificed.
    . No translation necessary for many computer systems, such as UNIX and
      VMS, which already use EUC.
   Disadvantages:
    . Variable-length characters.
    . High transmission overhead for 8-bit values in 7-bit environment.

3. ISO 10646 / UNICODE

   (Not ready in 1991, see below)

4. A combination of JIS Roman, JIS Katakana, and JIS Kanji, with ISO
   2022 designators and shifts (ISO 2022-JP):

   Advantages:
    . Efficient transmission in the 7-bit environment.
   Disadvantages:
    . Complexity.  Kermit programs must fully implement ISO 2022.
    . Kermit's ISO 2022 extension has never been implemented [the idea
      was later dropped].

Japanese EUC was chosen as the transfer character-set for Japanese text after
consultation among various constituencies in Japan.  Implementations appeared
in all major Kermit programs in 1991, with appropriate file character sets
(JIS-7, Shift-JIS, DEC Kanji, various EBCDIC Kanjis) for each platform.  The
transmission penalty on 7-bit connections was addressed by Kermit's Locking
Shift option, which also benefits other "right-handed" character sets, such as
ISO 8859 Cyrillic, Hebrew, Greek, etc.
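
The resulting arrangement, in which each side converts only between its own
file character set and the EUC transfer set, can be sketched with Python's
standard codecs.  This is an illustration only: the helper names are
hypothetical, and Python converts through Unicode internally, which is not
how the Kermit programs of 1991 performed the translation.

```python
TRANSFER_CHARSET = "euc_jp"   # Japanese EUC, as chosen above


def to_transfer(file_bytes, file_charset):
    """Sender side: local file character set -> transfer character set."""
    return file_bytes.decode(file_charset).encode(TRANSFER_CHARSET)


def from_transfer(packet_bytes, file_charset):
    """Receiver side: transfer character set -> local file character set."""
    return packet_bytes.decode(TRANSFER_CHARSET).encode(file_charset)


text = "\u65e5\u672c\u8a9e Kermit"             # sample text for the demo
on_pc = text.encode("shift_jis")               # file as stored on a PC
on_wire = to_transfer(on_pc, "shift_jis")      # what travels in the packets
on_host = from_transfer(on_wire, "iso2022_jp") # file written on a 7-bit host
```

Neither side needs to know the other's file character set; the sender knows
only Shift-JIS and EUC, and the receiver knows only EUC and its own 7-bit
encoding.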

Kermit name: JAPAN-EUC
ISO Registration Numbers: 14 (JIS Roman, left half of JIS X 0201),
  87 (JIS X 0208), 13 (JIS Katakana, right half of JIS X 0201).
Kermit Designator: I14/87/13.

  Unicode/ISO10646 support was added in 1999, with UCS-2 and UTF-8 allowed
  as both file and transfer character sets.  In principle any Japanese
  character-set can be converted to Unicode, and Unicode can be converted
  to any Japanese character set (with the obvious potential for loss of
  non-JIS characters).


REFERENCES

1. Gianone, Christine M., "A Kermit Protocol Extension for International
   Character Sets", Columbia University, April 1990.
2. JIS X 0208 Multiple-Byte Character Set.
3. JIS X 0201 Single-Byte Character Set.
4. ISO 2022 "... Code Extension Techniques" (also JIS X 0202).

(End)