Sun Dec  5 14:34:25 1993
 
	 A KERMIT PROTOCOL EXTENSION FOR INTERNATIONAL CHARACTER SETS
 
				  **********
   NOTE: This is a work in progress, and will be updated from time to time.
				  **********

			     Christine M. Gianone
			  cmg@watsun.cc.columbia.edu
		 Manager, Kermit Development and Distribution
 
				Frank da Cruz
			  fdc@watsun.cc.columbia.edu
			  Manager, Network Planning
 
	     Columbia University Center for Computing Activities
			    612 West 115th Street
			   New York, NY 10025, USA
 
			       DRAFT NUMBER 7.1
 			          Dec 5, 1993
 
ABSTRACT
 
An extension to the presentation layer of the Kermit file transfer protocol
is proposed to allow transfer of non-English-language text files between
unlike computers by substitution of standard character sets other than ASCII
in Kermit's text-file transfer data packets.  Methods for selection,
announcement, and use of these character sets are described.  The reader is
assumed to be familiar with the Kermit file transfer protocol and with basic
computing terminology.  The relevant ANSI and ISO character-set related
standards are summarized in Appendix B of this document.
 
This is a nearly final draft.  The protocol and many of the commands described
in this document have been successfully implemented in major Kermit programs
including MS-DOS Kermit 3.0, C-Kermit 5A for UNIX and VAX/VMS, and IBM
Mainframe Kermit 4.2 for VM/CMS, MVS/TSO, MUSIC, and CICS.  Special thanks
to John Chandler and Hirofumi Fujii for extensive contributions to this
draft, and to John Klensin for his comments and support.
 
SUMMARY OF CHANGES SINCE DRAFT #5, April, 1990
 
 - Abandonment of the two-level concept.  Mixed languages will be
   handled by using ISO 10646 or UNICODE as the transfer character set.  The
   details remain to be specified.
 - Abandonment of CSN 36 91 03, Czechoslovak Standard alphabet as a transfer
   character set.  Czech is adequately covered by ISO 8859-2.
 - Adoption of Japanese EUC as the transfer character set for Japanese
   text files, rather than JIS X 0208.
 - Explanation of Japanese EUC added to Appendix B.
 - Reference to Kermit's new locking shift transport protocol.
 - Removal of (unworkable) design for user-defined translations.
 - Addition of mechanism for automatic translation-table selection.
 - Addition of notion of "translation goal" and related commands.
 - Deletion of irrelevant or redundant appendices.
 - Addition of an annotated References section.
 - Short sections added on terminology and notation.
 - Note: Table I moved to Appendix B, so table numbers are out of order.
 
SUMMARY OF CHANGES SINCE DRAFT #4, August, 1989
 
 - Changes for Level 1 only, to reflect experience in writing the code
   to implement it for MS-DOS Kermit 3.0, C-Kermit 5A, and Kermit 370 4.2.
   Level 2 is on hold indefinitely pending ISO 10646 & Unicode developments.
 - Abandonment of separate attributes for encoding and character set.
 - Change all references to ASCII as I2 into I6 (ISO Registration Number).
 - Change description of SET LANGUAGE to remove side effects.
 - Differentiation of SET TRANSFER CHARACTER ASCII and TRANSPARENT.
 - The section on terminal emulation has not been changed, even though
   this subject needs detailed treatment in this document.
 
SUMMARY OF CHANGES SINCE DRAFT #3, July 20, 1989
 
 - Expanded & more precise definition of Kermit's character set designators
 - Simplification of the syntax of the (former) SET TRANSFER-SYNTAX command
 - Addition of SET LANGUAGE command
 - Clarification of Kermit's behavior when it receives an unknown character set
 - Addition of Appendix F to specify how each Kermit Level is invoked
 - Correction of numerous typographical and other errors
 
ACKNOWLEDGEMENTS
 
Many thanks to these people for their helpful and constructive comments during
the drafting process.  In most cases, their suggestions or the information
they provided have been incorporated into this or previous drafts.
 
  John Chandler (Harvard/Smithsonian Center for Astrophysics, USA)
  Alan Curtis (University of London, UK)
  Joe Doupnik (Utah State University, USA)
  Hirofumi Fujii (Japan National Laboratory of High Energy Physics, Tokyo)
  John Klensin (Massachusetts Institute of Technology, USA)
  Ken-ichiro Murakami (Nippon Telephone and Telegraph Research Labs, Tokyo)
  Vladimir Novikov (VNIIPAS, Moscow, USSR)
  Jacob Palme (Stockholm University, Sweden)
  Andre Pirard (University of Liege, Belgium)
  Paul Placeway (Ohio State University, USA)
  Gisbert W. Selke (WIdO, Bonn, Germany)
  Fridrik Skulason (University of Iceland, Reykjavik, Iceland)
  Johan van Wingen (Leiden, Netherlands)
  Konstantin Vinogradov (ICSTI, Moscow, USSR)
  Amanda Walker (InterCon Systems Corp, USA)
 
Thanks also to the following people for organizing meetings or conferences
in their countries at which the issues of this proposal were discussed:
 
  Kohichi Nishimoto (Nihon DEC, Tokyo, Japan)
  Juri Gornostaev and A. Butrimenko (ICSTI, Moscow, USSR)
 
and thanks also to those who attended these gatherings!
 
Thanks to the Kermit developers who have implemented this extension in their
Kermit programs:

  John Chandler  (Kermit-370)
  Frank da Cruz  (C-Kermit)
  Joe Doupnik    (MS-DOS Kermit)
  Hirofumi Fujii (C-Kermit, MS-DOS Kermit, and NEC PC9801 Kermit)
 
Finally, thanks to other experts who provided valuable information:

  Jerry Andersen, IBM
  Lloyd Anderson, Ecological Linguistics
  Joe Becker, Xerox Corporation and UNICODE Consortium
  James Do, Mentor Graphics, San Jose, CA
  Edwin Hart, Johns Hopkins University Applied Physics Laboratory


NOTATION

This document is written in plain 7-bit US ASCII, and to be understood
correctly it should be displayed in plain 7-bit US ASCII.  The notation:

  <xxx>

is used to express a non-ASCII or non-graphic character, where "xxx" is
replaced by the name of the character, for example:

  <ESC>       (All capital letters: the name of a control character)

or:

  <A-grave>   (Lower or mixed case: a letter with a diacritical mark)

In other places (which should be clear from the context), the same notation is
used to denote a parameter to a Kermit command, for example:

  <filename>

to stand for the name of any file.


TERMINOLOGY

A "character" is the minimum unit of a writing system: a letter, a digit, a
punctuation mark, an ideogram, without regard to the style of rendering except
for capitalization in scripts where that is possible, and without regard to
computer encoding.

A "character set" is a particular, specified group of characters, for example
(and most typically) all the letters, digits, and punctuation marks needed for
a particular writing system.

A "coded character set" is the internal computer representation of a character
set, in which each character is assigned a unique code, often with the
addition of special control codes.  In this document, "character set" and
"coded character set" are used synonymously unless otherwise noted.

"Code page" is the term used by IBM and Microsoft to mean "coded character
set".

A "code point" is the association between a character and its encoding in a
particular character set.

An "octet" is a computer storage unit of 8 bits.

A "byte" is an octet, unless otherwise noted.

The word "translation" is used loosely in this document to denote conversion
between character set encodings, not translation between languages or any
other higher-level notion.  When characters are intentionally replaced by
different characters, the word "transliteration" is used.

 
STATEMENT OF THE PROBLEM
 
The Kermit file transfer protocol has always been able to transfer text files
between unlike computers (e.g. a UNIX system with ASCII stream text files and
an IBM mainframe with EBCDIC record-oriented text files).  To do the text
file code conversion, Kermit transfers text in ASCII.  However, ASCII
includes only enough letters and symbols for English.
 
There are now computers capable of representing the characters of other
languages: Roman letters with diacritical marks, Cyrillic letters, Hebrew,
Arabic, and Greek characters; Chinese, Japanese, and Korean ideograms.
However, different computer manufacturers use different codes for these
characters.  For example, the IBM PS/2 and the Apple Macintosh both have
character sets that are "8-bit ASCII".  When the character value is 32-126,
the character is (normally) a standard ASCII graphic (printable) character.
When the value is 128 or higher, it is a "special" character.  Unfortunately,
the PC and the Macintosh assign different special characters to these values.
Here are just a few examples:
 
   Value     PS/2 Character      Macintosh Character
    138       Small e grave       Small a diaeresis
    143       Capital A ring      Small e grave
    144       Capital E acute     Small e circumflex
    136       Small e circumflex  Small a grave
 
When a file contains "8-bit ASCII", basic Kermit transfers it without any
character translation.  Therefore, a text file written in French, German,
Italian, or Norwegian transferred between a PS/2 and a Macintosh will contain
the wrong characters when it arrives at its destination: the PS/2's e-grave
becomes a-diaeresis on the Macintosh, etc.
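
The mismatch can be demonstrated with a short Python sketch.  It is an
illustration only and not part of the protocol.  It assumes that Python's
"cp437" and "mac_roman" codecs are acceptable stand-ins for the PS/2 and
Macintosh character sets in the table above; the function name is
hypothetical.

```python
# Illustration only: decode the same byte values under two 8-bit sets.
# Assumption: Python's "cp437" and "mac_roman" codecs stand in for the
# PS/2 and Macintosh character sets discussed in the text.
import unicodedata

def names(value):
    """Return the character names one byte value has on each machine."""
    raw = bytes([value])
    pc = unicodedata.name(raw.decode("cp437"))
    mac = unicodedata.name(raw.decode("mac_roman"))
    return pc, mac

for value in (138, 143, 144, 136):
    pc, mac = names(value)
    print(value, "|", pc, "|", mac)
```

Each row printed corresponds to one row of the table: one byte value, two
different characters.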
 
There are many computer vendors in the world and nobody controls what codes
they use to represent characters.  Without a standard protocol for
transferring non-ASCII text, each computer would have to know the codes of
all the other computers in order for correct transfer of non-English text
files to occur between all combinations of unlike systems.
 
To complicate matters, many computers now support more than one character
set.  IBM mainframes have not only "standard" US EBCDIC, but also several
EBCDIC-based Country Extended Code Pages (CECPs) for the support of West
European languages, Hebrew, Kanji, etc.  The IBM PC and PS/2 have a variety
of ASCII-based 8-bit code pages for the same purpose.  These character sets
are a welcome addition because they allow users of these computers to create,
display, and print documents in languages other than English.  Unfortunately,
the computer's file system keeps no record of which character set is used in
each file.
 
IBM is not the only source of private character sets.  The Apple Macintosh has
many character sets and fonts.  DEC supports its own multinational character
set as well as private encodings for Greek, Hebrew, etc.  The NeXT workstation
has its own unique character set.  The same is true of Data General, Atari,
Commodore, and many other computer manufacturers.  In the USSR, up to five
different Cyrillic character sets are in use.  In Japan, there are many
different encodings for Roman, Katakana, and Kanji characters.  China and
Taiwan use different encodings for Chinese characters.


NORMAL KERMIT FILE TRANSFER SYNTAX
 
The Kermit file transfer protocol makes a distinction between text and binary
files.  Binary files are transmitted with no translation or conversion.  For
text files, the Kermit protocol defines a standard intermediate
representation ("transfer syntax") for text files, namely ASCII characters
with carriage return and linefeed (CRLF) after each line, so text can be
stored in useful fashion on any computer to which it is transferred.  Each
Kermit program knows how to translate from the local text-file storage
conventions to ASCII/CRLF syntax, and vice versa.  This is the basic,
required, and default mode of operation for any Kermit program.
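
As a minimal sketch of this basic transfer syntax, the following Python
functions convert between a local text convention and ASCII/CRLF.  The
sketch assumes a UNIX-like local convention in which each line ends with a
single linefeed; the function names are hypothetical and are not taken from
any Kermit program.

```python
# Sketch of the basic ASCII/CRLF transfer syntax, assuming a UNIX-like
# local convention in which each line ends with a single linefeed.
def local_to_transfer(text):
    """Local text (LF line ends) to transfer syntax (7-bit ASCII, CRLF)."""
    return text.replace("\n", "\r\n").encode("ascii")

def transfer_to_local(data):
    """Transfer syntax (7-bit ASCII, CRLF) back to local text (LF)."""
    return data.decode("ascii").replace("\r\n", "\n")
```

In this sketch, text containing a character outside ASCII cannot be encoded
at all (Python raises UnicodeEncodeError), which mirrors the limitation that
motivates this proposal.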
 
 
INTERNATIONAL KERMIT TRANSFER SYNTAX
 
This proposal adds a new mechanism that permits the use of character sets
other than ASCII in file transfer.  These additional character sets are taken
from recognized national or international standards, such as the ISO 8859
Latin Alphabets.
 
Using a standard character set (other than ASCII), it is possible to transfer
a text file written in a language other than English between unlike
computers, and it is also possible to transfer a text file containing more
than one language.  For example, Latin Alphabet 1 can represent a file
containing a mixture of Italian, Norwegian, French, German, English, and
Icelandic.
 
The character set used in a text file stored in a particular computer is
called the "file character set" (FCS).  When the characters in a text file
can be represented by a single standard character set, that character set can
be used in place of ASCII in Kermit's transfer syntax.  This is called the
"transfer character set" (TCS).  Whatever the transfer character set, there
must be a mapping between the local file character set and the transfer
character set.  That is, there must be a pair of translation functions in the
program: one from the local file character set to the transfer character set,
and one from the transfer set to the local set:
 
         COMPUTER A                                COMPUTER B
    +------------------+                      +------------------+
    | +-------------+  |                      |  +-------------+ |
    | | Translation |  |      Transfer        |  | Translation | |
    | | Function:   |--------------------------->| Function:   | |
    | | FCS to TCS  |  |    Character Set     |  | TCS to FCS  | |
    | +-------------+  |                      |  +-------------+ |
    |       ^          |                      |        |         |
    |       |          |                      |        v         |
    |  Kermit Program  |                      |  Kermit Program  |
    |      SEND        |                      |     RECEIVE      |
    +------------------+                      +------------------+
	    ^                                          |
	    |                                          v
    +------------------+                      +------------------+
    |  Local File      |                      |  Local File      |
    |  Character Set A |                      |  Character Set B |
    +------------------+                      +------------------+

The use of common, standard transfer character sets means that each Kermit
program only has to know about its own local character sets and a small
number of standard ones.
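
The pair of translation functions in the diagram can be sketched as follows.
This is an illustration under stated assumptions, not a definitive
implementation: Computer A is assumed to store files in IBM code page 437,
Computer B in the Macintosh character set, and Latin Alphabet 1 is assumed
as the transfer character set.  Python codecs stand in for the translation
tables a Kermit program would contain, and the function names are
hypothetical.

```python
# Sketch of the two translation functions in the diagram.
# Assumptions: Computer A uses IBM code page 437, Computer B uses the
# Macintosh character set, and Latin Alphabet 1 is the transfer
# character set.  Python codecs stand in for Kermit's translation tables.
TCS = "latin_1"

def fcs_to_tcs(file_bytes, fcs):
    """Sender: file character set to transfer character set."""
    return file_bytes.decode(fcs).encode(TCS, errors="replace")

def tcs_to_fcs(packet_bytes, fcs):
    """Receiver: transfer character set to file character set."""
    return packet_bytes.decode(TCS).encode(fcs, errors="replace")

on_a = bytes([138])                      # e-grave in CP437
on_wire = fcs_to_tcs(on_a, "cp437")      # e-grave in Latin Alphabet 1
on_b = tcs_to_fcs(on_wire, "mac_roman")  # e-grave in the Macintosh set
```

Neither side needs to know the other's file character set; each knows only
its own set and the transfer set.  Characters with no equivalent in the
target set are replaced by "?" in this sketch.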

International transfer syntax is an optional feature for Kermit programs, and
is designed to interoperate (with, of course, no claim to correct translation)
with Kermit programs that do not support it.
 

SPECIFYING THE FILE CHARACTER SET

The following command allows the Kermit user to specify the local file
character set:
 
  SET FILE CHARACTER-SET <file-character-set-name>
 
The file character set name is normally a system-dependent item.  Some
computers have only one character set, in which case the SET FILE
CHARACTER-SET command is unnecessary.
 
This command will be required on computers where more than one file character
set is used.  Such sets include private (corporate) character sets and the
7-bit national variants allowed by ISO Standard 646 (see Appendix B).
 
A consistent, or at least sensible, naming convention should be used for
private character sets.

The following names are recommended for the 7-bit national character sets:
ASCII, BRITISH, CUBAN, DANISH, DUTCH, FINNISH, FRENCH, FRENCH-CANADIAN,
GERMAN, HUNGARIAN, ITALIAN, JAPANESE-ROMAN, NORWEGIAN, PORTUGUESE, SPANISH,
SWEDISH, and SWISS (note: most of these are ISO-646 sets, but several of them
are private 7-bit sets).

The Apple character sets might include APPLE-STANDARD, APPLE-QUICKDRAW, and
APPLE-SYMBOL.  The DEC Multinational Character Set can be called
DEC-MULTINATIONAL, DEC Greek would be DEC-GREEK, DEC Hebrew would be
DEC-HEBREW, etc.  The NeXT character set can be NEXT-MULTINATIONAL.  The Data
General international character set can be DATA-GENERAL-MULTINATIONAL, and so
on.  Later, when these companies add new and no doubt unique character sets,
these can be called NEXT-GREEK, NEXT-HEBREW, DATA-GENERAL-GREEK,
DATA-GENERAL-HEBREW, etc.

For the IBM character sets (code pages), the notation CPnnn is used, where nnn
is the code page number: CP037, CP437, CP500, CP850, etc.  EBCDIC should be
used for "standard" USA EBCDIC.  An alternative notation, more in keeping with
the ones above, would be something like IBM-PC-STANDARD, IBM-PC-MULTINATIONAL,
IBM-PC-PORTUGUESE, IBM-370-MULTINATIONAL, IBM-370-USA, IBM-370-JAPAN, etc.
But because there are often several code pages that fit one such description,
the CPnnn notation is preferred.

These are simply samples and guidelines for naming conventions for corporate
character sets.  File character set names should be both precise and mnemonic
when possible but, as in the IBM case, precision should take precedence.

In countries like the USSR, character sets are not associated with particular
companies, but have grown up as a matter of usage in several different
computing environments, or have grown out of several different generations of
standards.  In such cases, it makes the most sense to stick to common usage.
USSR character sets include KOI-7, KOI-8, DKOI, CP866 (Microsoft Cyrillic),
ALT-CYRILLIC ("Alternative Cyrillic"), and CYRILLIC (ISO 8859-5).

In Japan, a mixture of standard (JIS), modified standard, and corporate
character sets is used: JIS-7, JIS-8, SHIFT-JIS, JAPAN-EUC, DEC-KANJI,
FUJITSU-KANJI, HITACHI-KANJI, etc.

Example: Consider a computer where the ASCII character set is used for
programming and the German ISO 646 variant is used for text.
The German phrase:

   Gr<u-diaeresis><ess-zet>e aus K<o-diaeresis>ln

would be rendered in ASCII as "Gr}~e aus K|ln", and the ASCII C-language
phrase "{~a[x]}" would appear as:
 
  <a-diaeresis><ess-zet>a<A-diaeresis>x<U-diaeresis><u-diaeresis>
 
in German ISO 646 (ess-zet is the German double-s character, similar in
appearance to Greek beta).  The German-speaking user would want Kermit to
interpret the local file characters as German (SET FILE CHARACTER-SET GERMAN)
in the former case, and as ASCII (SET FILE CHARACTER-SET ASCII) in the latter.
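
The example above can be reproduced with a short Python sketch.  The
replacement table is our reading of the German ISO 646 variant and should be
checked against the standard before being relied upon; the names used are
hypothetical.  Characters are written as Unicode escapes so that the listing
itself remains plain 7-bit ASCII.

```python
# Illustration only: how the same 7-bit codes read under ASCII and under
# the German ISO 646 variant.  The table lists the positions at which the
# German variant is assumed to differ from ASCII.
GERMAN_FROM_ASCII = str.maketrans({
    "@": "\u00a7",   # section sign
    "[": "\u00c4",   # A-diaeresis
    "\\": "\u00d6",  # O-diaeresis
    "]": "\u00dc",   # U-diaeresis
    "{": "\u00e4",   # a-diaeresis
    "|": "\u00f6",   # o-diaeresis
    "}": "\u00fc",   # u-diaeresis
    "~": "\u00df",   # ess-zet
})

def read_as_german(seven_bit_text):
    """Show 7-bit codes as a German ISO 646 display would render them."""
    return seven_bit_text.translate(GERMAN_FROM_ASCII)
```

Applied to "Gr}~e aus K|ln", the function yields the intended German phrase;
applied to the C-language fragment "{~a[x]}", it yields the unintended
rendering shown above.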

 
SPECIFYING THE TRANSFER CHARACTER SET
 
To select the transfer character set for file transfer, the user enters the
command:
 
  SET TRANSFER CHARACTER-SET <name>
 
where <name> is the name of a standard character set.  If the name is
TRANSPARENT, Kermit does no character set conversion at all, but it still does
text record format conversion.  For ASCII-based systems, this is equivalent to
Kermit's normal, basic mode of operation.
 
If a name other than TRANSPARENT is given, and FILE TYPE is set to TEXT,
Kermit translates between the current file character set and the named
transfer character set when constructing or deciphering file data packets.

If the transfer character set is ASCII, Kermit converts between the current
file character set and 7-bit ASCII.  This mode of operation is roughly
equivalent to Kermit's basic mode of operation on non-ASCII based systems like
IBM mainframes.  But if the local file character set contains accented Roman
characters, the accents are dropped in the transfer character set, for example
a-acute becomes simply a.  (But see SET LANGUAGE, described later.)
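
The dropping of accents can be sketched in Python as follows.  This is an
illustration only: it assumes Unicode decomposition as the means of finding
each base letter, whereas a Kermit program would more likely use a fixed
translation table.  The function name is hypothetical.

```python
# Sketch of translating accented Roman letters to 7-bit ASCII by dropping
# the accents.  Assumption: Unicode decomposition is used to find each
# base letter; this is not how any particular Kermit program does it.
import unicodedata

def to_ascii(text):
    """Drop diacritical marks; replace anything still non-ASCII by '?'."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed
                       if not unicodedata.combining(c))
    return stripped.encode("ascii", errors="replace").decode("ascii")
```

Letters that are not a base letter plus an accent, such as ess-zet, have no
ASCII equivalent under this method and come out as "?".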
 
Other transfer character sets must be chosen from among approved national or
international standards.  The sets shown in Table 2 are recommended.  The
criteria for including a character set in this table are:
 
1. 7-bit US ASCII (= ISO-646 US version) is included, for compatibility
   with the original Kermit protocol and the hundreds of programs that 
   implement it.
 
2. An 8-bit single-byte character set, such as those in the ISO 8859 series,
   may be included if it is registered, as in (4) below.
 
3. A multibyte character set may be included, if it is registered as in (4).
 
4. The set must be listed in the ISO International Register of Character Sets
   under the provisions of ISO Standard 2375 (see Appendix A), so it has a
   unique registration number and designating escape sequence with which the
   sending Kermit program can identify the character set to the receiving
   Kermit program.  (An exception to this provision is made for Japanese EUC,
   which is a combination of two registered standards.)  Allowance is made for
   the possibility of other registration authorities, should they appear.
 
5. The set must be a national or international standard graphic character
   set, intended for use in computer text processing or programming (as
   opposed to Videotex, Teletex, OCR, device control, or other applications).
   This category may include standard line-drawing or technical character sets
   which fit the other criteria.
 
Note in particular that the national variants of ISO 646 are not included,
since these are covered adequately by the ISO Latin alphabets.
 
Standard character sets containing "composed characters", such as CCITT T.61,
in which an accented letter is represented by a two-character sequence (for
example, c-cedilla is encoded as a cedilla character followed by a "c"
character), are not included at this time.  The issue of composed versus
precomposed characters will be addressed later.

Standard "Kermit names" (for use with the SET TRANSFER CHARACTER-SET command)
are given to these character sets so they may be referred to uniformly in all
Kermit implementations.  These names are chosen to be mnemonic so users don't
have to remember cryptic designations like "ISO-8859-3".  The choice of single
words like "CYRILLIC" implies that there will not be more than one transfer
syntax for Cyrillic text.  However, if standards change in the future, it will
be possible to add further identifying material to these names, e.g.
"CYRILLIC-2", "CYRILLIC-ANCIENT", etc.
 
The Kermit names are English, as this is the language of the standards
themselves.  The Kermit commands are English words, and this document is
written in English.  Non-English user interface issues are beyond the scope
of this document.
 
_____________________________________________________________________________
 
            Table 2: Standard Character Sets
 
US 7-bit ASCII.  English, Latin, Gaelic without accents, Dutch without
  y-diaeresis, German without umlauts (vowels marked by diaeresis) or ess-zet.
  Kermit name: ASCII.
  ISO Registration Number: 6.
  Kermit Designator: none (this is the default transfer character set).
 
ISO 8859-1, Latin Alphabet 1.  Danish, Dutch, English, Faeroese, Finnish,
  French, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish,
  and Swedish.
  Kermit name: LATIN1.
  ISO Registration Number: 100.
  Kermit Designator: I6/100.
 
ISO 8859-2, Latin Alphabet 2.  Albanian, Czech, English, German, Hungarian,
  Polish, Romanian, Serbocroatian (Croatian), Slovak, and Slovene.
  Kermit name: LATIN2.
  ISO Registration Number: 101.
  Kermit Designator: I6/101.
 
ISO 8859-3, Latin Alphabet 3.  Afrikaans, Catalan, Dutch, English, Esperanto,
  French, Galician, German, Italian, Maltese, Spanish, and Turkish.
  Kermit name: LATIN3.
  ISO Registration Number: 109.
  Kermit Designator: I6/109.
 
ISO 8859-4, Latin Alphabet 4.  Danish, English, Estonian, Finnish, German,
  Greenlandic, Lappish (Sami), Latvian, Lithuanian, Norwegian, and Swedish.
  Kermit name: LATIN4.
  ISO Registration Number: 110.
  Kermit Designator: I6/110.
 
ISO 8859-5, the Latin/Cyrillic Alphabet.  Bulgarian, Byelorussian, English,
  Macedonian, Russian, Serbocroatian (Serbian), and Ukrainian
  (Compatible with USSR GOST Standard 19768-1987 and ECMA-113).
  Kermit name: CYRILLIC.
  ISO Registration Number: 144.
  Kermit Designator: I6/144.
 
ISO 8859-6, the Latin/Arabic Alphabet.
  Kermit name: ARABIC.
  ISO Registration Number: 127.
  Kermit Designator: I6/127.
 
ISO 8859-7, the Latin/Greek Alphabet.
  Kermit name: GREEK.
  ISO Registration Number: 126.
  Kermit Designator: I6/126.
 
ISO 8859-8, the Latin/Hebrew Alphabet.
  Kermit name: HEBREW.
  ISO Registration Number: 138.
  Kermit Designator: I6/138.
 
ISO DIS 8859-9, Latin Alphabet 5, in which the Icelandic letters Thorn and
  Eth plus upper and lowercase Y acute from Latin Alphabet 1 are replaced by
  six other letters needed for Turkish.  Danish, Dutch, English, Faeroese,
  Finnish, French, German, Irish, Italian, Norwegian, Portuguese, Spanish,
  Swedish, and Turkish.
  Kermit name: LATIN5.
  ISO Registration Number: 148.
  Kermit Designator: I6/148.

JIS X 0201, a 1-byte code for Japanese Katakana, used in conjunction
  with a slightly modified ASCII (backslash is replaced by Yen sign,
  tilde by overbar).
  Kermit name: KATAKANA.
  ISO Registration Numbers: 14 (Roman), 13 (Katakana).
  Kermit Designator: I14/13.
 
Japanese EUC
  A variable-length code containing ASCII and Japanese Katakana in their JIS
  X 0201 representations, plus 2-byte JIS X 0208.  JIS X 0208, in turn,
  includes Japanese Kanji, Katakana, Hiragana, Roman, Greek, and Russian
  characters, plus special symbols, etc.  ASCII codes are single bytes with
  their 8th bit set to zero.  JIS X 0208 codes are double bytes with the
  8th bit of each byte set to one.  JIS X 0201 Katakana bytes are preceded
  by Single Shift 2 (see Appendix B).  This mixture allows single-width Roman
  and Katakana characters to coexist with double-width JIS X 0208 characters,
  a common practice in many Japanese computing environments.
  Kermit name: JAPAN-EUC.
  ISO Registration Numbers: 14 (JIS X 0201 Roman), 87 (JIS X 0208),
  13 (JIS X 0201 Katakana).
  Kermit Designator: I14/87/13.
 
Chinese Standard GB 2312-80, a 2-byte code for Chinese.
  Kermit name: CHINESE.
  ISO Registration Number: 58.
  Kermit Designator: I6/58.
 
KS C 5601 (1989), a 2-byte code for Korean.
  Kermit name: KOREAN.
  ISO Registration Number: 149.
  Kermit Designator: I6/149.
 
TCVN 5712:1993, an ISO 2022-compliant pair of single-byte sets for 
  Vietnamese, one for uppercase letters, the other for lowercase.
  Kermit name: VIETNAMESE.
  ISO Registration Number: 180.
  Kermit Designator: I6/180.

ISO/IEC 10646-1.  International Standard 10646,
  Information Processing -- Multiple-Octet Coded Character Set, 1993.

            Table 2: Standard Character Sets
_____________________________________________________________________________
 
    BEWARE: The Latin-4 alphabet is confused. The original ECMA 94 standard
    was designed for the Scandinavian and Baltic languages and thus included
    the character A-ring (necessary for Swedish and Lappish/Sami), but some
    editions of the ISO Registry substitute L-acute (not used by any of the
    covered languages).
 
    NOTE: CNS 11643 (Taiwan) is not included because (a) one Chinese
    transfer character set should be sufficient, and (b) CNS 11643 does not
    show up in the ISO Register.  The issue of "Han Unification" (combining
    Chinese, Japanese, and Korean ideograms into a single code set) is not
    addressed by this proposal, except insofar as it occurs in the base
    multilingual plane (BMP) of ISO 10646.

    Until and unless Kermit programs are updated to take advantage of ISO
    10646, additional transfer character sets must be added to Kermit's
    repertoire for languages with writing systems not yet covered: Burmese,
    Thai, Lao, Khmer, Armenian, Georgian, Amharic, Sinhalese, Tibetan,
    Mongolian, Cherokee, many African languages, etc.
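
The Kermit designators in Table 2 can be taken apart mechanically.  The
sketch below reads a designator as the letter "I" followed by ISO
registration numbers separated by slashes.  That reading is inferred from
the table entries and is an assumption, as are the function names; the
name table is copied from Table 2.

```python
# Sketch: split a Kermit designator from Table 2 into registration numbers
# and look up the Kermit name.  The syntax is inferred from the table.
KERMIT_NAMES = {
    "I6/100": "LATIN1", "I6/101": "LATIN2", "I6/109": "LATIN3",
    "I6/110": "LATIN4", "I6/144": "CYRILLIC", "I6/127": "ARABIC",
    "I6/126": "GREEK", "I6/138": "HEBREW", "I6/148": "LATIN5",
    "I14/13": "KATAKANA", "I14/87/13": "JAPAN-EUC", "I6/58": "CHINESE",
    "I6/149": "KOREAN", "I6/180": "VIETNAMESE",
}

def registration_numbers(designator):
    """Return the ISO registration numbers named by a designator."""
    if not designator.startswith("I"):
        raise ValueError("not an ISO-registry designator: " + designator)
    return [int(field) for field in designator[1:].split("/")]

def kermit_name(designator):
    """Return the Kermit name for a designator, or None if unknown."""
    return KERMIT_NAMES.get(designator)
```
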

The ISO Latin alphabets are 8-bit character sets whose left half is identical
with ASCII, and whose right half contains the characters required for
languages other than English.  All accented letters are "precomposed", i.e.
single code points.  The ISO registration number refers only to the right half
of each of these character sets, but each of these sets must be used in its
entirety, because the unaccented Roman letters, the digits, and the common
punctuation marks appear only in the ASCII left half, which is ALWAYS (unless
otherwise noted) US ASCII, ISO Registration Number 6.  The Kermit
character-set name refers to the two halves combined as a single set.
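
The relationship between the two halves can be checked with a short Python
sketch.  It is an illustration only, using Python's ISO 8859 codecs as
assumed stand-ins for the Kermit transfer character sets LATIN1, LATIN2,
and CYRILLIC; the names in the code are hypothetical.

```python
# Illustration only, using Python's ISO 8859 codecs as stand-ins for the
# Kermit transfer character sets LATIN1, LATIN2, and CYRILLIC.
SETS = {"LATIN1": "iso8859_1", "LATIN2": "iso8859_2",
        "CYRILLIC": "iso8859_5"}

def decode_in_all(value):
    """Decode one byte value under each character set."""
    return {name: bytes([value]).decode(codec)
            for name, codec in SETS.items()}

# Left half (values 0-127): every set gives the same character.
left_agrees = all(len(set(decode_in_all(v).values())) == 1
                  for v in range(128))

# Right half: the same byte value is a different character in each set.
right_sample = decode_in_all(0xE8)
```

Here the value 14/08 (232) is e-grave in LATIN1, c-caron in LATIN2, and the
Cyrillic letter sha in CYRILLIC, while every value in the left half decodes
identically in all three.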
 
A particular Kermit program need not incorporate all of these character sets.
In many cases, a single 8-bit character set will suffice, such as LATIN1 for
Western Europe, LATIN2 for Eastern European countries with Roman-alphabet
based writing systems, CYRILLIC for most of the USSR, and so on.
 
When a language is representable in more than one character set from this
table, as are English, German, Finnish, Turkish, etc., the character set
highest on the list which adequately represents the language should be
preferred.  More precisely, when a character set other than ASCII is to be
used in Kermit's transfer syntax, the ISO 8859 sets are preferred to other
registered sets which contain the same characters.  Within the ISO 8859
family, lower-numbered sets which contain the characters of interest are
preferred to higher-numbered sets which contain the same characters.  This
guideline maximizes the chance that any two particular Kermit programs will
interoperate.
 
For example, LATIN1 would be chosen for French, German, Italian, Spanish,
Danish, Dutch, Swedish, etc; LATIN3 for Turkish; JAPAN-EUC for Japanese text
that includes Kanji characters, KATAKANA for Japanese text that includes only
Roman and Katakana characters, etc.
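
The byte structure of Japanese EUC described in Table 2 can be observed with
the following Python sketch.  It is an illustration only and assumes that
Python's "euc_jp" codec matches that description; the helper name is
hypothetical.  Characters are written as Unicode escapes to keep the listing
in 7-bit ASCII.

```python
# Illustration only: the byte structure of Japanese EUC, observed through
# Python's "euc_jp" codec (assumed to match the description in Table 2).
SS2 = 0x8E  # Single Shift 2

def euc(text):
    """Encode text in Japanese EUC."""
    return text.encode("euc_jp")

roman = euc("A")          # ASCII: one byte, 8th bit zero
kanji = euc("\u6f22")     # a JIS X 0208 Kanji: two bytes, 8th bits one
katakana = euc("\uff71")  # JIS X 0201 Katakana A: SS2, then one byte
```

A file containing only Roman and Katakana characters gains nothing from the
two-byte codes, which is why KATAKANA is suggested for such text.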
 
Unfortunately, but unavoidably, the burden of choosing the best transfer
character set must be placed upon the user.  If a file containing a mixture of
English, Finnish, and Latvian must be transferred, the user must find a
character set that can adequately represent all three languages, in this case
Latin Alphabet 4.  A table like Table 3 should be provided in the user
documentation to help the user make this selection.
 
_____________________________________________________________________________
 
    Afrikaans    LATIN3                       Irish          LATIN1,5
    Albanian     LATIN2                       Italian        LATIN1,3,5
    Arabic       ARABIC                       Japanese Kanji JAPAN-EUC
    Bulgarian    CYRILLIC                     Japan.Katakana KATAKANA,JAPAN-EUC
    Byelorussian CYRILLIC                     Korean         KOREAN
    Catalan      LATIN3                       Lappish (Sami) LATIN4
    Chinese      CHINESE                      Latvian        LATIN4
    Czech        LATIN2                       Lithuanian     LATIN4
    Danish       LATIN1,4,5                   Macedonian     CYRILLIC
    Dutch        LATIN1,2,3,4,5               Maltese        LATIN3
    English      ASCII,LATIN1,2,3,4,5,etc     Norwegian      LATIN1,4,5
    Esperanto    LATIN3                       Polish         LATIN2
    Estonian     LATIN4                       Portuguese     LATIN1,5
    Faeroese     LATIN1,5                     Romanian       LATIN2
    Finnish      LATIN1,4,5                   Russian        CYRILLIC
    French       LATIN1,3,5                  *Serbocroatian  LATIN2, CYRILLIC
    Galician     LATIN3                       Slovak         LATIN2
    German       LATIN1,2,3,4,5               Slovene        LATIN2
    Greek        GREEK                        Spanish        LATIN1,5
    Greenlandic  LATIN4                       Swedish        LATIN1,4,5
    Hebrew       HEBREW                       Turkish        LATIN3,5
    Hungarian    LATIN2                       Ukrainian      CYRILLIC
    Icelandic    LATIN1
 
          Table 3: Preferred Transfer Syntax Character Sets
 
*If written in Cyrillic, this language is called Serbian.  If written
 in Roman letters, it is called Croatian.
_____________________________________________________________________________
 
Note: Table 3 is only a sample.  To produce a comprehensive and definitive
table would require a team of language experts.  The information in the
current table is based purely upon the claims made within the standards
themselves, in which there is no mention of languages like Farsi, Urdu, Welsh,
Cornish, Manx, Inuit, Old Church Slavonic, Armenian, Georgian, Tagalog,
Swahili, Latin, etc, nor definitions of exactly what is meant by terms like
"Greenlandic", "Irish", etc.  Nevertheless, it is the intention of this
proposal to support any language for which a computer character set can be
standardized.
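
A table like Table 3 can also be applied mechanically.  The following Python
sketch selects a transfer character set for a file containing several
languages, preferring the set that appears earliest in Table 2.  The data is
a small excerpt of Tables 2 and 3; the function name and data layout are
hypothetical and are not part of the protocol.

```python
# Sketch: pick a transfer character set for a file containing several
# languages.  The data is an excerpt of Tables 2 and 3 only.
TABLE2_ORDER = ["ASCII", "LATIN1", "LATIN2", "LATIN3", "LATIN4",
                "CYRILLIC", "ARABIC", "GREEK", "HEBREW", "LATIN5"]

TABLE3 = {
    # Table 3 ends the English entry with "etc"; CYRILLIC is included
    # here because Table 2 lists English under ISO 8859-5.
    "English": {"ASCII", "LATIN1", "LATIN2", "LATIN3", "LATIN4",
                "LATIN5", "CYRILLIC"},
    "Finnish": {"LATIN1", "LATIN4", "LATIN5"},
    "French": {"LATIN1", "LATIN3", "LATIN5"},
    "Latvian": {"LATIN4"},
    "Russian": {"CYRILLIC"},
    "Turkish": {"LATIN3", "LATIN5"},
}

def choose_transfer_set(languages):
    """Return the first set in Table 2 order that covers every language
    in the (non-empty) list, or None if no single set does."""
    common = set.intersection(*(TABLE3[lang] for lang in languages))
    for name in TABLE2_ORDER:
        if name in common:
            return name
    return None
```

For English, Finnish, and Latvian together the sketch selects LATIN4, as in
the example given earlier.  For French and Russian together it finds no
single set, which is the mixed-language case left to ISO 10646.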

 
OTHER NON-UNIVERSAL CHARACTER SETS

This section lists character sets that are not listed in Table 2, but that
are likely candidates for eventual inclusion therein (i.e. after they are
registered with the ISO).

ISO 6438, Extended Roman for African Languages.
  More information needed.

ISCII-1991 (IS 13194-1991), Indian Script Code for Information Interchange.
  Supports the ten official Indian scripts derived from Brahmi: Devanagari,
  Gurmukhi, Gujarati, Bengali, Assamese, Oriya, Telugu, Kannada, Malayalam,
  and Tamil, with Roman transliteration.  A series of single-byte codes with
  00/00-07/15 = ASCII, different right halves for each script.  All of the
  right halves are structurally identical to each other to facilitate
  automatic transliteration and display of alternate alphabets using the same
  system software.  As Perso-Arabic scripts have a different alphabet, a
  different standard is envisaged for them.  ISCII-1991 is a successor to
  earlier codes ISSCII-83 and ISCII-88 announced by the Department of
  Electronics, Government of India.  Bureau of Indian Standards, Manak Bhavan,
  09 Bahadur Shah Zafar Marg, New Delhi 110002.  No ISO registration number.

(Add others...)


THE UNIVERSAL CHARACTER SET

Though ISO Standard 10646 was approved in 1993, it will continue to undergo
change as national standards bodies evolve and engage in the ISO
process, and it will take many years before it replaces the many existing
private and standard character sets in data processing and communication.
Therefore there is no intention to drop support in the Kermit protocol for the
standard character sets listed above at any time in the foreseeable future.
ISO 10646 can be added (in at least one form, most likely a compacted
version of the two-byte Basic Multilingual Plane) to Kermit's list of transfer
character sets.


IMPLEMENTATION
 
Character set translation can be added to existing Kermit programs with
a minimum of effort.  The following steps are required for each Kermit program:
 
1. Add the SET FILE TYPE { BINARY, TEXT } command, if the program doesn't
   have it already.  SET FILE TYPE TEXT enables text-file character set
   conversion.  SET FILE TYPE BINARY disables conversions of all kinds, but
   does not destroy the file and transfer character-set selections (2 and 3
   below), so that a subsequent SET FILE TYPE TEXT command will still be able
   to use them.
 
2. Add the SET FILE CHARACTER-SET <name> command.  The set of <names> should
   include ASCII or EBCDIC (as appropriate, used for program source, etc) plus
   the names of any "national" or special character sets that are used on the
   particular computer.
 
3. Add the SET TRANSFER CHARACTER-SET <name> command.  The set of <names>
   should include TRANSPARENT and ASCII plus the names of one or more other
   standard character sets from Table 2 which contain the characters from the
   computer's local character set(s).
 
4. Add translation tables (or functions) between each compatible pair of
   character sets in (2) and (3).  For each pair, two translation tables are
   necessary: one from the local file character set to the transfer character
   set, and one from the transfer set to the local one.
 
5. Add SHOW commands to let the user find out what character sets are
   available, and which ones are currently selected, for the transfer syntax
   and for local files.  The exact syntax of this command will vary.  In
   some Kermit implementations, every SET command has a corresponding SHOW
   command, in which case it will be possible to SHOW FILE CHARACTER-SET and
   SHOW TRANSFER CHARACTER-SET.  In others, related SET parameters are lumped
   together into broader categories for purposes of SHOW, for example SHOW
   FILE would show all file-related parameters; SHOW PROTOCOL would show all
   protocol-related parameters.
 
Any particular Kermit program can support several (perhaps many) file
character sets (FCS) and transfer character sets (TCS).  No particular
combination of them should be forbidden.  If a useful translation between,
say, Hebrew and Katakana can be devised, there is no reason the user should
not be allowed to select it.

However, programs that support large numbers of file and transfer character
sets must bow to the limitations of the computer's architecture and memory
space, as well as the knowledge and patience of the programmer.  Hence, purely
as a matter of implementation, certain combinations of FCS and TCS --
preferably the ones that would be least frequently used -- can remain
unsupported.  In that case, the SET { FILE, TRANSFER } CHARACTER-SET command
that causes the conflict can issue an error message or switch automatically
to a combination (if any) that makes sense.
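
The table arrangement in step 4 and the handling of unsupported pairs might
be sketched as follows in Python.  This is only an illustration: the names
make_table and select_pair are hypothetical, the single CP850/Latin-1 entry
(a-diaeresis) stands in for a complete table, and no existing Kermit program
is claimed to be organized this way.

```python
# Hypothetical registry of translation tables, keyed by
# (file character set, transfer character set).
IDENTITY = bytes(range(256))

def make_table(overrides):
    """Return a 256-byte table that is the identity except for overrides."""
    table = bytearray(IDENTITY)
    for src, dst in overrides.items():
        table[src] = dst
    return bytes(table)

# Two tables per supported pair, one for each direction (step 4).
# Only one character is shown: CP850 0x84 <-> Latin-1 0xE4 (a-diaeresis).
TO_TCS = {("CP850", "LATIN1"): make_table({0x84: 0xE4})}
FROM_TCS = {("CP850", "LATIN1"): make_table({0xE4: 0x84})}

def select_pair(fcs, tcs):
    """Return (file-to-transfer, transfer-to-file) tables for a pair.

    Raises ValueError for an unsupported combination, which a SET
    command could report to the user as an error message.
    """
    if fcs == tcs or tcs == "TRANSPARENT":
        return IDENTITY, IDENTITY
    key = (fcs, tcs)
    if key not in TO_TCS:
        raise ValueError("no translation between %s and %s" % (fcs, tcs))
    return TO_TCS[key], FROM_TCS[key]

out_table, in_table = select_pair("CP850", "LATIN1")
latin1_bytes = b"p\x84iv\x84\x84".translate(out_table)
```

Keeping one table per direction makes each translation a single indexed
lookup per byte; the alternative of switching automatically to a sensible
combination would replace the ValueError with a fallback choice.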

Optionally, several additional related commands can be included:
 
6. The command SET LANGUAGE may be added to allow the program to apply
   heuristics in the translation process that would not otherwise be possible.
   See discussion below.
 
7. Commands for modifying, loading, and saving translation tables (to be
   specified in a future draft of this document).
 
8. Once the new commands and translation tables are in place, it is simple to
   add a TRANSLATE command, to translate a local file from one character set
   to another, using a transfer character set as an intermediate step.  With
   this command, Kermit may be used as a character-set conversion utility for
   local files (see Appendix D).
 
9. Commands governing automatic pairing of file and transfer character set
   and setting the goal for translation, described below.

Translation occurs only in the data field of the D packets.  Packet control
fields are not translated, nor are the data fields of any other kind of
packet, including F (filename) packets.  (Filename packets cannot be
translated because the attribute packet that announces the file's character
set does not arrive until after the F packet.)  As always, IBM Mainframe
Kermit is a special case, since most character strings must be translated
between EBCDIC and ASCII.  Nonetheless, the rule applies even there, as long
as we take "translation" to mean the specific translation between the transfer
and file character sets, rather than the standard ASCII/EBCDIC conversion.
 
Internally, the Kermit program that is sending a file:
 
1. Reads characters (one or more bytes) or lines of text from the file.
 
2. Translates each character from the FILE CHARACTER-SET to the TRANSFER
   CHARACTER-SET, applying any selected and applicable special rules or goals,
   and converting the record format if necessary.
 
3. Follows the negotiated lower-level encoding options: control prefixing,
   shifting, and compression.
 
4. Assembles and sends the packet.
 
The Kermit program that is receiving a file:
 
1. Reads an incoming data packet.
 
2. Decodes the packet data according to the negotiated prefixing, shifting,
   and compression options.
 
3. Translates the resulting characters from the TRANSFER CHARACTER-SET to the
   FILE CHARACTER-SET, applying any selected and applicable special rules or
   goals, converting the record format if necessary.
 
4. Writes the translated characters to the output file.
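
The ordering of these steps (translate first, then apply the lower-level
encoding; decode first, then translate) can be sketched in Python.  This is
a simplified illustration under stated assumptions: "#" and "&" are taken as
the control and 8th-bit prefix characters, 8th-bit prefixing is assumed to
be in effect, and compression, locking shifts, and record-format conversion
are left out.  The function names and the partial Finnish table are
hypothetical.

```python
CTL, EBQ = ord("#"), ord("&")   # assumed control and 8th-bit prefixes

def encode(data):
    """Apply 8th-bit and control prefixing to already-translated bytes."""
    out = bytearray()
    for b in data:
        if b & 0x80:                    # 8th bit set: prefix, then strip it
            out.append(EBQ)
            b &= 0x7F
        if b < 32 or b == 127:          # control character: prefix, make printable
            out += bytes([CTL, b ^ 64])
        elif b in (CTL, EBQ):           # a literal prefix character
            out += bytes([CTL, b])
        else:
            out.append(b)
    return bytes(out)

def decode(field):
    """Undo encode()."""
    out = bytearray()
    it = iter(field)
    for b in it:
        high = 0
        if b == EBQ:
            high = 0x80
            b = next(it)
        if b == CTL:
            b = next(it)
            if 63 <= b <= 95:           # was a control character
                b ^= 64
        out.append(b | high)
    return bytes(out)

def make_table(overrides):
    table = bytearray(range(256))
    for src, dst in overrides.items():
        table[src] = dst
    return bytes(table)

# Partial, illustrative table: the six national letters of the Finnish
# ISO 646 variant mapped to Latin-1 (other positions left unchanged).
FINNISH_TO_LATIN1 = make_table({0x5B: 0xC4, 0x5C: 0xD6, 0x5D: 0xC5,
                                0x7B: 0xE4, 0x7C: 0xF6, 0x7D: 0xE5})

def sender_data_field(file_bytes, to_tcs):
    return encode(file_bytes.translate(to_tcs))        # sender steps 2, 3

def receiver_file_bytes(data_field, from_tcs):
    return decode(data_field).translate(from_tcs)      # receiver steps 2, 3

field = sender_data_field(b"p{iv{{", FINNISH_TO_LATIN1)
```

Because translation happens before prefixing on the sender and after it on
the receiver, the translation tables never need to know which prefix
characters were negotiated.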
 
 
EXAMPLE
 
To transfer a Finnish-language text file from a computer that uses the Finnish
ISO 646 national variant to an IBM PS/2, and to store the file using the
PS/2's Multilingual Code Page:
 
  On the sending computer:               On the receiving computer:
    SET FILE TYPE TEXT                     SET FILE TYPE TEXT
    SET FILE CHARACTER-SET FINNISH         SET TRANSFER CHARACTER-SET LATIN1
    SET TRANSFER CHARACTER-SET LATIN1      SET FILE CHARACTER-SET CP850
    SEND filename                          RECEIVE
 
The file sender translates from Finnish ISO 646 to Latin Alphabet 1, the
most appropriate transfer character set (see Table 3), and the file receiver
translates from Latin-1 to Code Page 850.

To transfer a C-language source program between the same two computers:
 
  On the sending computer:               On the receiving computer:
    SET FILE TYPE TEXT                     SET FILE TYPE TEXT
    SET TRANSFER CHARACTER-SET ASCII       SET FILE CHARACTER-SET ASCII
    SET FILE CHARACTER-SET ASCII           SET TRANSFER CHARACTER-SET ASCII
    SEND filename                          RECEIVE
 
Here all translations are from ASCII to ASCII, hence no translation at all.


LANGUAGE-SPECIFIC TRANSLATIONS

When national or international text must be translated into ASCII, information
is necessarily lost.  ASCII does not include accented or non-Roman letters.
For readability, accented letters can be converted to their unaccented
counterparts, but that can introduce ambiguities or mistakes (to use Andr'e
Pirard's example: "a la francaise" without accents means "has the French
girl").  If we know that the text is written in a specific language, sometimes
certain language-specific rules can be applied to reduce the loss of
information.

For example, consider text containing the y-diaeresis character.  It is
acceptable to render y-diaeresis as "ij" if the language is Dutch, but not
otherwise (yielding "Rijksmuseum" -- correct spelling -- rather than
"Ryksmuseum").  Similarly, o-diaeresis can be rendered as "oe" in German or
Swedish but not in English ("co<o-diaeresis>peration").

The command for selecting language-specific translation rules is:

   SET LANGUAGE <name>

where <name> is the (English) name of the language, for example ITALIAN,
NORWEGIAN, PORTUGUESE.

Example: The command SET LANGUAGE GERMAN would allow the Kermit program, when
translating from Latin-1 or the German ISO 646 variant into ASCII, to render:

   Gr<u-diaeresis><ess-zet>e aus K<o-diaeresis>ln

as "Gruesse aus Koeln" (correct German) rather than "Gruse aus Koln" (Gruse
means something entirely different from Gruesse -- something like "scum"
rather than "greetings").
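
A minimal Python sketch of how SET LANGUAGE might steer a translation into
ASCII is shown here.  The rule tables are assumptions made for illustration
(they cover only the characters mentioned in this section), and the names
LANGUAGE_RULES, GENERIC_RULES, and to_ascii are hypothetical.  Characters
are written as Unicode escapes so that the example itself stays in ASCII.

```python
import unicodedata

# Language-specific replacements; anything not listed falls back to the
# generic rule of dropping the diacritical mark.
LANGUAGE_RULES = {
    "GERMAN": {"\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",
               "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",
               "\u00df": "ss"},
    "DUTCH": {"\u00ff": "ij"},
}
GENERIC_RULES = {"\u00df": "s"}     # ess-zet has no accent to strip

def to_ascii(text, language=None):
    rules = LANGUAGE_RULES.get(language, {})
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        elif ch in rules:
            out.append(rules[ch])
        elif ch in GENERIC_RULES:
            out.append(GENERIC_RULES[ch])
        else:
            # Decompose and keep the base letter if it is ASCII.
            base = unicodedata.normalize("NFD", ch)[0]
            out.append(base if ord(base) < 128 else "?")
    return "".join(out)

greeting = "Gr\u00fc\u00dfe aus K\u00f6ln"
with_language = to_ascii(greeting, "GERMAN")    # "Gruesse aus Koeln"
without_language = to_ascii(greeting)           # "Gruse aus Koln"
```

The same function renders y-diaeresis as "ij" only when the language is
DUTCH, and as plain "y" otherwise.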
 

TRANSLATION MECHANISMS
 
When translating from one character set to another, there are two possibly
conflicting goals:

1. Readability (R): Achieving a translation that makes the most sense to
   the reader.

2. Invertibility (I): Achieving a translation that can be translated back
   to the original character set without loss or distortion of information.

When readability is desired, nonmatching characters are converted to the
closest matching character, for example Latin-1 e-grave becomes simply e in
ASCII.  But now "e" represents two different characters in the translation, so
invertibility is lost.  When no sensible counterpart exists, a special "this
can't be translated" character is used (a unique character if possible,
otherwise a question mark "?").  When this special character is used for more
than one purpose, invertibility is lost.

Invertibility is possible only when both character sets are the same size.
When invertibility is desired, the characters of the intersection of the two
sets are paired together: A in one set to A in the other, A-grave in one set
to A-grave in the other, etc.  The members of the two sets of differences
between the two character sets are paired together in a way that gives every
character a unique translation in each direction.  The exact method for
pairing is problematic, and frequently a particular pair makes no sense at
all, for example "L-with-stroke" with "Vulgar fraction 3/4".  Any such pairing
will give an invertible translation, but to achieve the most useful
translation it is necessary to examine all the character sets involved.  To
illustrate, Latin Alphabet 1 lacks the OE digraph character, which is found
in the DEC Multinational Character Set, the Apple Quickdraw character set,
and the NeXT character set, though at different code points in each.
Ideally, each of these character sets should map OE digraph into the same
Latin-1 code point.
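
The pairing procedure just described can be sketched in Python.  This is an
illustration only: each character set is modeled as a list whose index is
the code point and whose value is a character name, the function names are
hypothetical, and the leftover characters are paired simply in code order,
which is exactly the kind of arbitrary choice discussed above.

```python
def build_invertible(src, dst):
    """Return table t with t[code in src] = code in dst, a one-to-one map.

    src and dst must be the same size and contain no duplicates.
    """
    if len(src) != len(dst):
        raise ValueError("invertibility needs sets of the same size")
    dst_code = {ch: code for code, ch in enumerate(dst)}
    table = [None] * len(src)
    # 1. Pair the intersection: same character on both sides.
    for code, ch in enumerate(src):
        if ch in dst_code:
            table[code] = dst_code[ch]
    # 2. Pair the two sets of differences, here simply in code order.
    used = {c for c in table if c is not None}
    spare = [c for c in range(len(dst)) if c not in used]
    unmatched = [c for c in range(len(src)) if table[c] is None]
    for s, d in zip(unmatched, spare):
        table[s] = d
    return table

def invert(table):
    inverse = [0] * len(table)
    for s, d in enumerate(table):
        inverse[d] = s
    return inverse

# Toy four-character "sets" standing in for real 256-character ones.
set_a = ["A", "OE digraph", "L-with-stroke", "B"]
set_b = ["B", "A", "Vulgar fraction 3/4", "Vulgar fraction 1/2"]
forward = build_invertible(set_a, set_b)
backward = invert(forward)
```

In the toy example "OE digraph" ends up paired with "Vulgar fraction 3/4":
invertible, but meaningless to a reader, which is why the choice of pairs
deserves coordination across all the character sets involved.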

Let's look at a few common translation scenarios.

1. From a 7-bit set to a different 7-bit set, e.g. from ISO 646 Spanish
   version to ASCII (or vice versa).  The two sets do not contain the same
   characters.  Here we must choose between readability and invertibility.  To
   achieve readability in the Spanish-to-ASCII direction, we strip diacritical
   marks (n-tilde becomes simply n).  To achieve invertibility (at least in
   this case), we make no translation at all.

2. From a 7-bit set to an 8-bit set.  The 7-bit sets are usually ASCII or an
   ISO 646 national variant.  Normally, all the characters from the 7-bit set
   are also present in the 8-bit set, and there is no R vs I conflict.
   Otherwise, we must choose between R and I.  Normal example: ASCII (and most
   ISO 646 national variants) to Latin-1 -- here we satisfy both R and I.
   Bizarre example: ISO 646 Swiss national variant to ISO Latin / Arabic --
   here we must choose between R and I.

3. From an 8-bit set to another 8-bit set.  The common case here is converting
   between one of the corporate "extended ASCII" sets (DEC, IBM, Apple, NeXT,
   Data General, Commodore, etc) and ISO Latin-1.  The two sets share a large
   percentage of common characters.  How do we handle the characters that
   differ?  Again, we must choose between R and I.  To complicate this case,
   the IBM, Apple, and NeXT sets use the forbidden (by ISO standards) C1
   control-character area for graphics characters; in this case there must be
   a mapping between graphics and C1 controls.
 
4. From an 8-bit set to a 7-bit set.  For example, from Latin-1 to ASCII or to
   an ISO 646 national set.  Here we are forced to accept a great deal of
   information loss.  We cannot possibly achieve invertibility, so we should
   aim for maximum readability.  The SET LANGUAGE command can be used to help.

5. From a single-byte character set to a multibyte character set.  Most
   multibyte character sets include ASCII and sometimes several other
   alphabets (such as Greek and Cyrillic in JIS X 0208).  Here we translate
   each character into its equivalent, if it has one, and if not we pick some
   unique nonsense value to ensure the translation is invertible (for the
   single-byte set).

6. From a multibyte set to a single-byte set, for example Japanese EUC
   into Latin-1 (or Latin/Cyrillic, Latin/Greek, or even ASCII).  Here we
   lose the vast majority of characters -- there is no hope for a readable
   or even a sensible translation.  The only way to translate Kanji into
   (say) ASCII is to replace ideograms by words, and that is beyond the
   scope of a simple character-set conversion scheme.  Hence, we normally
   replace ideograms by the "this can't be translated" character.

7. From one national multibyte set to another.  These sets are for Chinese,
   Japanese, and Korean, and have at least a large number of ideograms (Han
   characters) in common, and probably also Roman characters.  How to
   translate among them is an item for study by language experts: by shape,
   by meaning, etc.

How do we choose between readability and invertibility?  It depends on what
the user needs at a particular moment.  We (Kermit designers and programmers)
can give the user the ability to make this choice.  Or we can make the choice
for them, knowing full well that whatever our choice, it will be wrong.

To give the user a choice -- at the expense of increased size and complexity
of the program itself and of the user interface -- the following command can
(optionally) be included:

   SET TRANSFER TRANSLATION { INVERTIBLE, READABLE }

The existence of this command requires a dual set of translation tables and/or
functions -- one optimized for invertibility (totally invertible if the two
character sets are the same size), the other for readability.  When a Kermit
program handles many character sets, this can result in a significant increase
in program size.

When this command is not provided, the bias of the translation mechanisms --
readability or invertibility -- must be clearly stated in the user
documentation.  All else being equal, the bias should be towards
invertibility; if an invertible translation is possible (i.e. the two
character sets are of the same size), it should be provided.  This ensures
round-trip consistency PROVIDED the same invertible tables are always used.

It is the programmer's choice whether translation is accomplished by tables or
by functions that implement translation algorithms, or a combination of both.
Functions provide maximum flexibility and tend to reduce program size, at some
cost in execution overhead.  Tables provide greatest speed, but generally with
greater cost in program size.


THE POLITICS OF INVERTIBILITY

If two character sets are the same size and contain the same repertoire of
characters, translation is simply a matter of rearranging code points.  But
when two character sets intended to serve the same language or group of
languages differ on non-alphabetic code points or in other minor ways,
arbitrary decisions must be made in assigning the nonoverlapping characters
from the two sets.  Who makes those decisions?

The classic example is the translation between IBM Code Page 850 (the
"Multilingual Code Page") and ISO Latin-1.  Because IBM assigns graphics
characters to its C1 area, it has 32 more graphics characters than Latin-1.
Most of these are line- and box-drawing characters sprinkled throughout
the code page.  How should these be paired with Latin-1's C1 set?

Such decisions are beyond the scope of the national and international
standards activities, and they should not be made by Kermit designers or
programmers.  These tables (or algorithms) are most appropriately furnished by
the creators of each private character set.  This lends the appropriate
"official" air, and encourages the makers of all software packages that need
such a translation to use the official one so all such applications on a
particular computer can interoperate.

IBM has specified an invertible translation between certain of its code pages
and ISO Latin-1 in its Character Data Representation Architecture (CDRA).

Similarly, Apple should specify the translation between Quickdraw and Latin-1.
Microsoft should specify the translation between CP866 and the Latin/Cyrillic
alphabet.  And so on.  In the absence of such vendor-provided translations,
Kermit programmers are forced to produce their own, but should continue to
press vendors for official versions.

Eventually, the actual contents of each invertible translation table or
algorithm should be specified in a document or set of documents to accompany
this proposal, or references to the relevant corporate standards should be
listed in Appendix G.

Before leaving this topic, let's also remember to encourage designers of
computer operating systems to RECORD THE CHARACTER SET in a text file's
directory entry, so Kermit or any other application program can find out what
it is automatically without requiring the user to identify it manually.  (Of
course, this begs the larger question of recording the file type as well...
item for further study.)


THE POLITICS OF READABILITY

Similarly, we can ask: Who decides what is readable?  Transliteration of a
language like Greek or Russian into ASCII can be done in many different ways,
depending upon -- among other things -- the language spoken by the person
reading the result.  For example, the surname of a former leader of the USSR
has only six letters when written in Cyrillic; transliterated into English,
the name is "Khrushchev", and into German, "Chruschtschow".

There are few, if any, widely recognized standards for transliteration, and
yet it is often desirable.  Newspapers and magazines, library catalogers,
immigration bureaus, and many other organizations have procedures for
transliterating "foreign" writing systems.  Not just in "ASCII-speaking"
lands, but everywhere: Russian names are written in Arabic newspapers, Hebrew
names in Greek journals, English names on Chinese passports, Korean
publications in Vietnamese library catalogs.

When a translation function is optimized for readability -- and some must be
-- the designer must consider whether to force a particular kind of
readability on the user, or to give the user a choice.  The precise mechanism
for this (if indeed any such mechanism can be precise!) is another topic for
further study: How to best transliterate from Language A in Writing System B
to Language X in Writing System Y?


USER-DEFINED TRANSLATIONS
 
It should be possible for users to override the decisions made by Kermit
programmers regarding the bias of the translation mechanism or its particular
details, as well as to add totally new translations, by introducing their
own translation tables or functions.

Methods for doing this will be described in a future draft of this document.
This is primarily a user-interface design issue.
 
***
 
How to do user-defined translations:
 SET FILE CHARACTER-SET USER-DEFINED
 SET XFER CHARACTER-SET <valid-xfer-charset>
 SET USER-TRANSLATION FROM <tcs> xxx yyy ; for incoming files
 SET USER-TRANSLATION TO   <tcs> yyy xxx ; for outbound files
 Applies to <tcs> + USER-DEFINED FCS.
 Can have one pair for each TCS.
 Now announcements work right, etc etc.
 DUMP USER-TRANSLATION <tcs>  [ <file> ]  ; list tables (to file)
For C-Kermit, we have 4 TCS's, so 4 x 2 x 256 = 2K bytes, not bad:
 . Add FC_USER FCS
 . Add 6 tables (not supported for Kanji)
 . Initialize each table to identity function
 . Add 8 functions (even for TRANSPARENT, but NULL for Kanji)
 . Figure out a way of telling user whether table has been defined.
 . Add SHOW USER-TRANSLATION <tcs> [ <file> ]
   (= a bunch of SET USER-TRANSL commands, with comments)
 . Add DUMP USER-TRANSLATION <tcs> [ <file> ]
   (= just the numbers, comma-separated? space-separated? one per line?)
 . Add LOAD USER-TRANSLATION <tcs> <file>
   (= read table written by DUMP, watch out for value and table-size overflow) 
 . Add some kind of built-in test pattern?
***

ATTRIBUTE PACKETS
 
We want to accommodate as many computers as possible with a minimum of
programming effort, but this approach places a burden on the user in the
form of new commands and the confusion that results if the user forgets to
issue these commands.
 
This protocol extension does not require support for Kermit File Attribute
Packets, whose use is negotiated in the Kermit Initialization exchange, but
their use is recommended; the user's burden can be alleviated if the sending
Kermit program uses an attribute field to inform the receiving Kermit of the
character set used in the file data packets.  The receiving program can accept
or refuse the file based on whether it supports the specified character set.
If the receiving program refuses a file, the user can override this refusal,
for example, if a long file contains only a word or two in an unknown
character set.  The most common user-override is the command SET ATTRIBUTES
OFF.  However, this also disables other desirable effects of attribute
packets, such as prenotification of file size.  Therefore, it is desirable to
let the user specify exactly which attributes are to be "turned off", e.g.,
SET ATTRIBUTES ENCODING OFF.
 
When the transfer character set is ASCII (or TRANSPARENT when sent from an
ASCII-based system), the Encoding attribute should have the traditional value
of "A" (for ASCII): "*!A".
 
In order for the sender to inform the receiver of transfer character sets
other than ASCII, a new value for the Encoding attribute ("*") is defined,
namely "C", which is substituted for the normal value "A" (ASCII).  "C" means
that the actual character set is specified as an operand which begins with a
single letter that designates the character set registration authority, e.g.
I for ISO, followed by a registration-authority-specific identifier, as in:
 
  Ixxx/yyy
 
where the letter "I" (for ISO) is followed by a pair of ISO registration
numbers for the character set, xxx for the "left half" and yyy for the right,
expressed in decimal ASCII digits, for example:
 
  +---+---+---+--------+
  | * | ' | C | I6/100 |
  +---+---+---+--------+
 
where "*" is code for the Encoding Attribute (or transfer syntax), "'" is the
length of its value.  In this case, "CI6/100" is 7 characters long, and
"'" is the printable encoding for 7 (7 + 32 = 39, the ASCII code for "'").
The character "C" means "I'm using the specified Character set", and
"I6/100" specifies the character set: "ISO registration number 6", i.e. US
ASCII, in the left half, and ISO registration number 100, which is the right
half of Latin-1, in the right.  The "I" stands for ISO, and is included to
allow for the possibility of other character set registration authorities.
Designators for each character set are given in Table 2, labeled "Kermit
Designator".
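
The construction of the Encoding attribute can be sketched in a few lines
of Python.  The function names are hypothetical; the only facts relied on
are those stated above (attribute code "*", a length expressed as a
printable character by adding 32, and the value "A" or "C" followed by the
designator).

```python
def tochar(n):
    """Printable encoding of a small number: add 32."""
    if not 0 <= n <= 94:
        raise ValueError("value does not fit in one printable character")
    return chr(n + 32)

def encoding_attribute(designator=None):
    """Build the Encoding attribute for an attribute packet data field.

    With no designator the traditional ASCII value "A" is used;
    otherwise the value is "C" followed by the Kermit designator.
    """
    value = "A" if designator is None else "C" + designator
    return "*" + tochar(len(value)) + value

ascii_attr = encoding_attribute()             # "*!A"
latin1_attr = encoding_attribute("I6/100")    # "*'CI6/100"
```

The length character is computed rather than hard-coded, so longer
designators such as the three-part one used for Japanese EUC are handled by
the same code.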
 
Japanese EUC is a special case, because it is a mixture of single-byte JIS X
0201 (two character sets) and double-byte JIS X 0208.  Its Kermit designator
is I14/87/13: Japanese Roman in G0, Japanese Kanji in G1, Japanese Katakana
in G2 (Katakana characters are indicated by SS2 in the data -- the SS2 is
considered part of the file). 
 
In the event that a character set standard changes, but keeps the same
registration number, the registration number for the new character set should
be preceded by a non-numeric character which indicates the revision number: @
(atsign) = 1, A=2, B=3, and so on (as suggested in ISO 2022).  For example,
"I@6/B100" would indicate an 8-bit single-byte character set having Revision
1 of ASCII as its left half and Revision 3 of Latin-1 as its right.  Note:
"Revision 1" does not mean the original version, but rather the first
revision AFTER the original version.  The Kermit designator for an original
version does not have a revision indicator.
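
To make the notation concrete, here is a small Python sketch that splits a
designator into its registration authority and a list of (registration
number, revision) pairs, where revision 0 stands for the original version.
The function name and the returned structure are hypothetical, and the
sketch is meant only to illustrate the notation; a real program could
equally well look up the whole designator string in a table.

```python
import re

# Optional revision character (@, A, B, ...) followed by decimal digits.
_PART = re.compile(r"([@A-Z]?)(\d+)$")

def parse_designator(designator):
    authority, body = designator[0], designator[1:]
    parts = []
    for piece in body.split("/"):
        match = _PART.match(piece)
        if match is None:
            raise ValueError("malformed designator: " + designator)
        rev_char, number = match.groups()
        # "@" is revision 1, "A" is 2, "B" is 3; no character means original.
        revision = ord(rev_char) - ord("@") + 1 if rev_char else 0
        parts.append((int(number), revision))
    return authority, parts

latin1 = parse_designator("I6/100")        # ("I", [(6, 0), (100, 0)])
revised = parse_designator("I6/B100")      # ("I", [(6, 0), (100, 3)])
```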
 
The form of the character-set designator was chosen because the standards
currently provide no single code to designate an 8-bit character set in its
entirety.  Each half of the character set has its own registration number.
For example, ISO 8859-1 (Latin-1) is a single 8-bit character set, but
registration number 100 only refers to its right half.  Registration number 6
denotes ASCII, which is used as the left half of all ISO 8859 character sets.
 
To promote maximum interoperability among extended Kermit programs, the
Kermit designator should be treated as a character string, to be looked up in
a small table, rather than as a flexible mechanism to be used for piecing
together character sets from an arbitrary assortment of left and right
halves.  However, the Ixxx/yyy notation leaves open this possibility should
it become desirable at a later time.
 
In the event that a new class of registration numbers appears, for example, to
denote a single-byte 8-bit character set in its entirety rather than just its
left or right half, a different initial letter will be used in the designator,
even if the registration authority is the ISO.  In the event that other
character-set registration authorities appear, they too can be assigned their
own unique Kermit designator prefixes (for example, "K" for Kermit Development
and Distribution), to avoid ambiguity from conflict of registration numbers.
For the present, standards organizations like ANSI and CCITT are not treated
as separate registration authorities, because their character sets are also
registered by the ISO.  Should these organizations adopt character sets that
have no ISO counterpart, then special Kermit designator prefixes will be
assigned for them.
 
Based on the attribute information, the receiver may accept or reject the
file, using Kermit's normal attribute response mechanism.  To accept, it puts
a "Y" as the first character of the data field of the acknowledgement to the
attribute packet.  To refuse, it puts an "N" instead of a "Y", followed by
"*".  If the file is refused in this manner, the sending Kermit should respond
by sending a "Z" (end-of-file) packet containing a "D" (for Discard) in its
data field.
 
The behavior of the receiving Kermit program when an unknown character set
is announced to it is governed by the command SET UNKNOWN-CHARACTER-SET.
SET UNKNOWN-CHARACTER-SET KEEP means that it should not reject the file, but
store it the best way it can (e.g., without translating any characters);
DISCARD means that the file should be rejected.
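
The receiver's decision can be sketched in Python as follows.  The table of
known encodings and the function name are hypothetical; the reply strings
"Y" and "N*" are those described above, and the second element of the
result simply records whether translation should be applied.

```python
# Encoding attribute values this hypothetical receiver can translate.
KNOWN_ENCODINGS = {"A": "ASCII", "CI6/100": "LATIN1"}

def attribute_reply(encoding_value, unknown_character_set):
    """Return (reply data, translate?) for an announced encoding.

    unknown_character_set is the SET UNKNOWN-CHARACTER-SET value,
    either "KEEP" or "DISCARD".
    """
    if encoding_value in KNOWN_ENCODINGS:
        return "Y", True
    if unknown_character_set == "KEEP":
        return "Y", False       # accept, store without translation
    return "N*", False          # refuse, naming the Encoding attribute

reply, translate = attribute_reply("CI6/100", "DISCARD")
```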
 
 
AUTOMATIC SELECTION OF FILE CHARACTER-SET BY THE FILE RECEIVER
 
When a file arrives whose transfer character-set is announced in the attribute
packet, it is desirable to include a mechanism to allow the receiving Kermit
program to select the most appropriate file character-set automatically.
Similarly, if the user gives a SET FILE CHARACTER-SET command, it would be
desirable to switch to an appropriate TRANSFER CHARACTER-SET automatically,
and vice-versa.  Any such mechanism should also include a "manual override"
to let the user disable it.

Suppose, for example, an MS-DOS Kermit program that is about to receive a file
has CP437 as its FILE CHARACTER-SET, but the arriving file is announced as
CYRILLIC.  The receiving Kermit can (a) translate the Cyrillic characters into
ASCII using a transliteration scheme (like "Short KOI" phonetic
transcription), or (b) switch its file character set to one that contains the
greatest number of characters that are also in the transfer character set, in
this case CP866.

We can design Kermit programs to supply translations between every possible
combination of file and transfer character set.  Or we can allow only certain
combinations, for example Roman-to-Roman, Cyrillic-to-Cyrillic,
Hebrew-to-Hebrew, etc.

In the former case, it is the user's responsibility to choose the most useful
combination.  In the latter, the receiving Kermit must either reject the file
when the file character set is not valid for the incoming transfer character
set (or accept it without translation, depending on the setting of
UNKNOWN-CHARACTER-SET), or else switch to an appropriate file character set
automatically.

An optional automatic switching mechanism, configurable by the user, can be
provided by the following command:

SET SEND AUTOMATIC-TRANSLATION { OFF, ON, <FCS> [ <TCS> ] }
    Automatic translation action when sending files.
    OFF means don't automatically switch translations.
    ON means enable automatic translation.
    <FCS> <TCS> means: If the current file character set is <FCS>, then use
    <TCS> as the transfer character set.  If <TCS> is omitted, automatic
    selection of a transfer character set for <FCS> is not done, and the
    current transfer character set is used.  In either case, any previous
    entry for <FCS> is superseded.

SET RECEIVE AUTOMATIC-TRANSLATION { OFF, ON, <TCS> [ <FCS> ] }
    Automatic translation action when receiving files.
    OFF means don't automatically switch translations.
    ON means enable automatic translation.
    <TCS> <FCS> means: if the announced transfer character set of the incoming
    file is <TCS>, then use <FCS> as the file character set.  If <FCS> is
    omitted, automatic selection of a file character set for <TCS> is not
    done, and the current file character set is used.  In either case, any
    previous entry for <TCS> is superseded.

Any number of these commands can be given.  Their effect is to build a pair of
lookup tables.  When AUTOMATIC-TRANSLATION is OFF, or the character set is not
found in these tables, the prevailing settings are used.  ON can be used to
enable any tables that had been previously disabled by OFF.

The programmer may preload the Kermit program with a default set of tables.
However, the default AUTOMATIC-TRANSLATION setting in both directions should
be OFF.
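
One of the two lookup tables might be modeled like this in Python.  The
class and method names are hypothetical, and the sketch assumes that adding
an entry does not by itself turn the mechanism ON, a point the description
above leaves open.

```python
class AutomaticTranslation:
    """Lookup table behind SET { SEND, RECEIVE } AUTOMATIC-TRANSLATION."""

    def __init__(self):
        self.enabled = False        # recommended default is OFF
        self.entries = {}

    def set(self, first, second=None):
        if first == "OFF":
            self.enabled = False
        elif first == "ON":
            self.enabled = True
        else:
            # Any previous entry for this character set is superseded.
            # None means "no automatic selection for this set".
            self.entries[first] = second

    def choose(self, announced, current):
        """Pick the set to use, falling back to the prevailing setting."""
        if not self.enabled:
            return current
        return self.entries.get(announced) or current

receive = AutomaticTranslation()
receive.set("CYRILLIC", "CP866")
receive.set("ON")
chosen = receive.choose("CYRILLIC", "CP437")    # "CP866"
```

For RECEIVE the key is the announced transfer character set and the value a
file character set; for SEND the roles are reversed, so a second instance
of the same structure would serve.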


INTEROPERABILITY WITH UNEXTENDED KERMIT PROGRAMS
 
Extended Kermit programs must be fully interoperable with unextended ones.
When the file sender is extended and the receiver is not, the receiver ignores
the encoding attribute and stores the file data as received, but after
applying any required record-format conversions.  In case the sender's
encoding attribute causes problems for the receiver, the sending Kermit should
have an option to omit this attribute: SET ATTRIBUTE ENCODING OFF (or as a
last resort, SET ATTRIBUTES OFF altogether).  The sender has the option of
translating from a local file character set to any desired transfer character
set, including ASCII, that will be useful on the receiving computer.
 
When the file receiver is extended and the sender is not, the receiver has the
option of translating the received characters to a local file character set.
This will be useful if the character set used in the packets corresponds with
one of the receiver's transfer character sets, and it requires the user to
manually inform the receiving Kermit of both the transfer and the file
character sets.
 
In other cases, the extended Kermit's TRANSLATE command can be used to
pre- or postprocess a file to achieve the desired results if the desired
translations are available.

 
PERFORMANCE
 
There is nothing in this proposal that affects the performance of the Kermit
file transfer protocol.  The efficiency of file transfer is the same with
or without this extension.

However, it is recognized that transfer of 8-bit text will not always be
efficient.  Since the special characters have their 8th bits set to one, there
will be a lot of 8th-bit prefixing in the 7-bit environment -- the higher the
proportion of special characters to ASCII characters, the lower the
efficiency.  For "left-handed" languages like Italian, Norwegian, and
Portuguese (in which the preponderance of text characters are ASCII), the
impact is negligible.  For "right-handed" languages like Russian, Greek,
Hebrew, and Arabic, where characters come from the right half of the character
set, efficiency will be poor in the 7-bit environment.  The situation is even
worse for Japanese EUC, in which all Kanji bytes have their 8th bit set to 1.
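
The effect can be estimated with a rough model.  The Python sketch below
assumes the simplest possible accounting, in which every byte with its 8th bit
set costs exactly one extra prefix byte; it ignores control-character
prefixing, repeat counts, and packet overhead, so it is an approximation and
not a measurement of any Kermit program.  The function names are hypothetical.

```python
def prefixed_length(data: bytes) -> int:
    """Bytes needed once every 8-bit byte gains a one-byte prefix."""
    return len(data) + sum(1 for b in data if b & 0x80)


def expansion(data: bytes) -> float:
    """Ratio of prefixed length to original length (1.0 = no overhead)."""
    return prefixed_length(data) / len(data)


# Italian "perche" with e-grave/e-acute style accent: 1 of 6 bytes is 8-bit.
italian = "perch\u00e9".encode("latin-1")
# Russian "privet" in ISO 8859-5: all 6 bytes are 8-bit.
russian = "\u043f\u0440\u0438\u0432\u0435\u0442".encode("iso8859_5")

print(prefixed_length(italian), prefixed_length(russian))
```

Under this model the Italian sample grows from 6 to 7 bytes, while the
Russian sample doubles from 6 to 12.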
 
For this reason, it is recommended that Kermit programs that implement
transfer character sets for non-Roman-based writing systems also include
Kermit's locking shift protocol, which is specified and analyzed in a separate
document.
 
 
TERMINAL EMULATION
 
While not part of the Kermit file transfer protocol, terminal emulation is an
essential feature of many Kermit programs.  It is hoped that all of Kermit's
terminal emulators will evolve along the lines of the ISO standards described
in the Appendices.  In some cases, this is already a fact, insofar as DEC
VT200 and 300 series terminals already follow these standards and Kermit
programs are available that emulate these terminals.
 
The following Kermit commands are recommended for terminal emulation:
 
SET TERMINAL TYPE <name>
  Identify the type of terminal to be emulated, for example VT320.
 
SET TERMINAL BYTESIZE <number>
  Tell how many bits of each arriving character are to be displayed on the
  screen.  This command is used to protect the user from parity bits sent by
  the host during terminal emulation, even when PARITY is set to NONE, so the
  normal setting is 7.  SET TERMINAL BYTESIZE 8 allows reception of 8-bit
  bytes.
 
SET TERMINAL CHARACTER-SET <remote-character-set> [ <local-character-set> ]
  Tell how to translate characters during terminal emulation.  The
  <remote-character-set> denotes the codes sent by, and expected by, the
  remote host.  The <local-character-set>, if given, specifies the character
  codes generated by the local keyboard and displayed on the local screen.  If
  the <local-character-set> is not specified, the current FILE CHARACTER-SET
  is assumed.  Since it is likely that neither one of the two character sets
  is a standard (TRANSFER) character set, the terminal emulator cannot always
  use Kermit's built-in file translation tables or functions directly.
  However, it is often possible to use them in a two-step process, using one
  of Kermit's transfer character sets as an intermediary.
 
SET TERMINAL TRANSLATION { INVERTIBLE, READABLE }
  Specifies the desired style of character translation to use during
  terminal emulation.

SET LANGUAGE
  Should not apply to terminal emulation -- characters should not be added
  or deleted during translation, because that would interfere with the
  formatting of the screen.

SET TERMINAL DIRECTION { LEFT-TO-RIGHT, RIGHT-TO-LEFT }
  Specifies the direction of screen writing during terminal emulation.
  RIGHT-TO-LEFT can be used for Hebrew and Arabic.
 
SET TERMINAL LOCKING-SHIFT { ON, OFF }
  Specifies whether the terminal emulator should use locking shifts
  (normally SO/SI) when sending and receiving 8-bit data in the 7-bit
  communications environment.  This behavior is built in to certain
  terminal emulators (such as VT220, VT320); this command is for use
  with terminal emulators that do not have this capability built in.
 
SET TRANSLATION INPUT \aaa \bbb
  or
SET TERMINAL TRANSLATION \aaa \bbb
  Specify that when the character \aaa is received from the communication
  medium, it should be translated to \bbb before display on the screen.
  Many such commands can be given, allowing the user to form a custom-made
  terminal character set.

SET KEY <code> <value>
  Specify that when the key whose code is <code> is pressed, the Kermit
  program sends the specified <value>.  Many such commands can be given,
  allowing the user to customize the keyboard for any desired character set.
  The <value> can be a single character or a string of characters.

Terminal character-set translation should be used in screen capture (session
logging), non-transparent screen-print operations, and "raw uploading" of
text files (TRANSMIT command, when FILE TYPE is TEXT).  Character-set
translation should NOT be used in scripting commands such as INPUT and OUTPUT.
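
The two-step process mentioned under the terminal character-set command
(remote set to a transfer character set, then transfer character set to local
set) amounts to composing two 256-entry tables.  The Python sketch below is
one way to picture it.  It borrows Python's codecs to stand in for Kermit's
built-in tables, and the choice of code page 850 as the remote set, Latin-1
as the intermediary, and Macintosh Roman as the local set is purely an
assumption made for illustration.

```python
def build_table(src: str, dst: str, fallback: int = ord("?")) -> bytes:
    """256-entry table mapping each byte of codec src to a byte of codec dst."""
    table = bytearray(256)
    for code in range(256):
        try:
            table[code] = bytes([code]).decode(src).encode(dst)[0]
        except (UnicodeDecodeError, UnicodeEncodeError):
            table[code] = fallback      # no equivalent in the target set
    return bytes(table)


REMOTE, INTERMEDIARY, LOCAL = "cp850", "latin_1", "mac_roman"

remote_to_tcs = build_table(REMOTE, INTERMEDIARY)
tcs_to_local = build_table(INTERMEDIARY, LOCAL)

# Composing the two steps once gives a single table for use per character.
remote_to_local = remote_to_tcs.translate(tcs_to_local)
```

Characters missing from the intermediary are lost in the first step even if
both end sets contain them, which is one reason the two-step approach is not
always usable.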
 
 
APPENDIX A: STANDARDS
 
ANSI X3.4-1986, "Coded Character Sets - 7-bit American Standard Code for
  Information Interchange" (US ASCII), is the 7-bit code currently used by
  Kermit for transferring text files.
 
ISO 646 (1983) (= ECMA-6), "Information Processing - ISO 7-bit Coded Character
  Sets for Information Interchange", gives us a 7-bit character set equivalent
  to ASCII with provision for substituting "national characters" in selected
  positions.
 
ISO 4873 (1986), "Information Processing - ISO 8-bit Code for Information
  Interchange - Structure and Rules for Implementation", defines 8-bit
  character sets, their graphic and control regions, and how to extend an
  8-bit character set by using multiple intermediate graphics sets.
 
ANSI X3.134.1 (1991), "8-Bit ASCII - Structure and Rules", the USA equivalent
  of ISO 4873.

ISO 2022 (1986) (= ECMA-35), "Information Processing - ISO 7-bit and 8-bit
  Coded Character Sets - Code Extension Techniques", describes how to use
  8-bit character sets in both 7-bit and 8-bit environments, and how to switch
  among different character sets.
 
ISO International Register of Coded Character Sets to be Used with Escape
  Sequences.  This is the source of the ISO registration numbers.
 
ISO 2375 (1985) "Data Processing - Procedure for Registration of Escape
  Sequences".  The procedure by which a character set gets into the above
  register and has a registration number and designating escape sequence
  assigned to it.
 
JIS X 0202, "Code Extension Techniques for Use with the Code for Information
  Interchange", the Japanese counterpart of ISO 2022.
 
ISO 6429-1983, "C1 Control Character Set".

ANSI X3.41-1974, "Code Extension Techniques for Use with the 7-Bit Coded
  Character Set of the American National Standard Code for Information
  Interchange", describes 7- and 8-bit codes and extension techniques in
  approximately the same manner as ISO 4873 and ISO 2022.  (Now obsolete?)
 
ISO 8859 (1987-present) (see Table 6 for ECMA equivalents), "Information
  Processing - 8-Bit Single-Byte Coded Graphic Character Sets", defines the
  actual 8-bit character sets to be used for many of the world's languages.
  The left half of each of these is the same as ASCII and ISO 646 IRV.  Each
  character, including those with diacritics, is represented by a single byte.
 
ANSI X3.134.2 (1991), "7-Bit and 8-Bit ASCII Supplemental Multilingual
  Graphic Character Set", the USA equivalent of ISO 8859-1.

JIS X 0201, Japanese Roman / Katakana set (need full reference).

JIS X 0208, Japanese Kanji set (need full reference).

JIS X 0212, Japanese Kanji set (superset of JIS X 0208, reportedly not in
  use yet, need full reference).

ISO is the International Organization for Standardization.  ANSI is the
American National Standards Institute.  ECMA is the European Computer
Manufacturers Association.  JIS means Japanese Industrial Standard.
 
The ISO/ECMA standards discussed in this proposal may be obtained free of
charge in their ECMA form by writing to:
 
  ECMA Headquarters
  Rue du Rhone 114
  CH-1204 Geneva
  SWITZERLAND
 
Be sure to specify the title and the ECMA number of each standard requested.
 
In general, the ISO member body from each country acts as the local sales
agent for ISO Standards in that country, for example ANSI in the USA:
 
  Sales Department
  American National Standards Institute
  1430 Broadway
  New York, NY  10018
  Telephone 212-354-3300
 
Each such organization has its own arrangements for disseminating printed
documents.  ANSI sells them for US dollars; organizations in other countries
may either sell them for local currency or give them away, depending on how
they are funded to operate.
 
ISO standards and CCITT recommendations can also be ordered from the UN
bookstore, but not free of charge:
 
  United Nations Bookstore
  United Nations Building
  New York, NY  10017
 
CCITT recommendations are also available by mail order from ANSI.
 
CCITT recommendations are also available via anonymous FTP on the Internet
from host BRUNO.CS.COLORADO.EDU or DIGITAL.RESOURCE.ORG in the directory
/pub/standards/ccitt/.
 
 
APPENDIX B: HOW THE STANDARDS WORK
 
ASCII and ISO 646 give us a 128-character 7-bit character set.  This set is
divided into two parts:
 
  1. 33 "control characters" (characters 0 through 31, and character 127).
  2. 95 "graphic characters" (32-126).
 
Graphic characters make ink appear on the page or phosphor glow on the screen.
Control characters are used as fillers or format effectors and for
transmission or device control.  The ASCII / ISO-646 IRV character set is
shown in Figure 1, arranged in a table of 16 rows and 8 columns.  The graphic
characters are shown literally (except SP stands for the space character), the
control characters by name (control character names and functions are defined
in ISO 646).
 
_____________________________________________________________________________
 
      00  01  02  03  04  05  06  07
     +---+---+---+---+---+---+---+---+
  00 |NUL DLE| SP  0   @   P   `   p |
  01 |SOH DC1| !   1   A   Q   a   q |
  02 |STX DC2| "   2   B   R   b   r |
  03 |ETX DC3| #   3   C   S   c   s |
  04 |EOT DC4| $   4   D   T   d   t |
  05 |ENQ NAK| %   5   E   U   e   u |
  06 |ACK SYN| &   6   F   V   f   v |
  07 |BEL ETB| '   7   G   W   g   w |
  08 |BS  CAN| (   8   H   X   h   x |
  09 |HT  EM | )   9   I   Y   i   y |
  10 |LF  SUB| *   :   J   Z   j   z |
  11 |VT  ESC| +   ;   K   [   k   { |
  12 |FF  FS | ,   <   L   \   l   | |
  13 |CR  GS | -   =   M   ]   m   } |
  14 |SO  RS | .   >   N   ^   n   ~ |
  15 |SI  US | /   ?   O   _   o  DEL|
     +---+---+---+---+---+---+---+---+
 
  Figure 1: The ASCII / ISO-646 International
     Reference Version 7-bit Character Set
_____________________________________________________________________________
 
Characters are often referred to by their column and row position in this type
of table.  For example, character 05/08 in Figure 1 is "X".  Columns 00-01,
plus character 07/15, comprise the control set.  Columns 02-07, minus
character 07/15, comprise the graphics.
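
The column/row notation is simply the character code split into its
high-order and low-order 4 bits, each written in decimal.  A minimal Python
sketch follows; the function names are hypothetical.

```python
def to_col_row(code: int) -> str:
    """Format a character code (0-255) as column/row, e.g. 88 -> '05/08'."""
    col, row = divmod(code, 16)
    return f"{col:02d}/{row:02d}"


def from_col_row(position: str) -> int:
    """Inverse of to_col_row: '05/08' -> 88."""
    col, row = (int(part) for part in position.split("/"))
    return col * 16 + row


print(to_col_row(ord("X")))          # 05/08
print(from_col_row("07/15"))         # 127 (DEL)
```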
 
ISO Standard 646 allows for national variant 7-bit character sets in which
certain non-alphanumeric ASCII graphic characters are replaced by "national
characters".  The character positions in which replacements are permitted,
along with the replacements used by four of the ten ISO 646 national variants,
are shown in Table B-1.
 
_____________________________________________________________________________
 
Column/Row   ASCII         German        Finnish      Norwegian    French
 
  04/00      at-sign       section       at-sign      at-sign      a-grave
  05/11      left-bracket  A-diaeresis   A-diaeresis  AE-digraph   degree
  05/12      backslash     O-diaeresis   O-diaeresis  O-slash      c-cedilla
  05/13      right-bracket U-diaeresis   A-circle     A-circle     section
  05/14      circumflex    circumflex    U-diaeresis  circumflex   circumflex
  06/00      accent-grave  accent-grave  e-acute      accent-grave accent-grave
  07/11      left-brace    a-diaeresis   a-diaeresis  ae-digraph   e-acute
  07/12      vertical-bar  o-diaeresis   o-diaeresis  o-slash      u-grave
  07/13      right-brace   u-diaeresis   a-circle     a-circle     e-grave
  07/14      tilde         ess-zet       u-diaeresis  tilde        diaeresis
 
     Table B-1: Selected ISO 646 National Variants, Differences from ASCII
_____________________________________________________________________________
 
The ISO-registered 7-bit national sets are listed in Table B-2.
_____________________________________________________________________________

                                       ISO
Description                            Reg.#

  International Reference Version        2
  British Version, BSI 4730              4
  USA Version, ANSI X3.4-1986            6
  Swedish Version, SEN 850200/B         10
  Japanese Version, Roman Chars         14
  Italian Version                       15
  Spanish Version                       17
  German Version                        21
  Norwegian Version, NS 4551            60
  French Version, NF Z 62010            69
  Portuguese Version                    84
  Hungarian Version, HS 7795/3          86 
  Cuba National Standard NC 99-10:81   151
  Finnish (DEC Private)                 --
  French Canadian (DEC Private)         --
  Swiss (DEC Private)                   --

     Table B-2:  National 7-Bit Character Sets
_____________________________________________________________________________


8-bit character sets are described in ISO 4873 and related standards (see
Appendix A).  An 8-bit character set has two sides.  Each side has a control
set and a graphics set.  The "left half" consists of the control set C0 and
the graphics set GL (Graphics Left).  GL has 94 characters, and corresponds to
ASCII (and ISO 646 IRV) positions 02/01-07/14.  SP (space) and DEL are
special: they are pieces of the template (the upper left and lower right
corners, respectively) into which any 94-character graphic set must fit.

All the characters in the left half (C0, GL, SP, and DEL) have their
high-order, or 8th, bit set to zero, and are therefore representable in 7
bits.  The "right half" consists of the control set C1 and the graphics set GR
(Graphics Right).  All characters in the right half have their 8th bits set to
one.  Figure 2 shows the layout of an 8-bit character set, with C1 occupied
by the ISO 6429 control character set.
 
_____________________________________________________________________________
 
     <--C0--> <---------GL---------->  <--C1--> <---------GR---------->
       00  01  02  03  04  05  06  07    08  09  10  11  12  13  14  15
     +---+---+---+---+---+---+---+---+ +---+---+---+---+---+---+---+---+
  00 |NUL DLE| SP  0   @   P   `   p | |    DCS|---+                   |
  01 |SOH DC1| !   1   A   Q   a   q | |    PU1|                       |
  02 |STX DC2| "   2   B   R   b   r | |    PU2|                       |
  03 |ETX DC3| #   3   C   S   c   s | |    STS|                       |
  04 |EOT DC4| $   4   D   T   d   t | |IND CCH|                       |
  05 |ENQ NAK| %   5   E   U   e   u | |NEL MW |                       |
  06 |ACK SYN| &   6   F   V   f   v | |SSA SPA|                       |
  07 |BEL ETB| '   7   G   W   g   w | |ESA EPA|                       |
  08 |BS  CAN| (   8   H   X   h   x | |HTS    |      (special         |
  09 |HT  EM | )   9   I   Y   i   y | |HTJ    |       graphics)       |
  10 |LF  SUB| *   :   J   Z   j   z | |VTS    |                       |
  11 |VT  ESC| +   ;   K   [   k   { | |PLD CSI|                       |
  12 |FF  FS | ,   <   L   \   l   | | |PLU ST |                       |
  13 |CR  GS | -   =   M   ]   m   } | |RI  OSC|                       |
  14 |SO  RS | .   >   N   ^   n   ~ | |SS2 PM |                       |
  15 |SI  US | /   ?   O   _   o  DEL| |SS3 APC|                   +---|
     +---+---+---+---+---+---+---+---+ +---+---+---+---+---+---+---+---+
     <--C0--> <---------GL---------->  <--C1--> <---------GR---------->
 
                      Figure 2: An 8-Bit Character Set
_____________________________________________________________________________
 
GR character sets can have either 94 or 96 characters.  A 94-character GR set
begins in position 10/01 and ends in position 15/14, with Space (SP) occupying
position 10/00 and DEL in position 15/15, just like GL (the corners shown in
GR in the diagram).  A 96-character set has graphic characters in all 96
positions, 10/00 through 15/15.
 
An 8-bit alphabet, therefore, has up to 94 + 96 = 190 graphic characters.
This number is sufficient to represent the characters in many of the world's
written languages, but not necessarily sufficient to represent all the graphic
symbols required in a given application, for instance a multi-language
document.
 
To represent a greater number of graphic characters, ISO 4873 defines four
"intermediate sets" of graphic characters, of either 94 or 96 characters each.
These are called G0, G1, G2, and G3.  The G0 set never has more than 94
graphic characters, and G1-G3 can have up to 96 each.  Therefore there can be
up to:
 
  94 + (3 x 96) = 382
 
graphics characters simultaneously within the repertoire of a given device,
assuming all are single-byte sets.
 
These intermediate graphics sets are kept in tables in the memory of the
terminal or computer.  One of the intermediate sets (usually G0) is assigned
to GL, and (in the 8-bit communications environment) another may be assigned
to GR.  When the terminal or computer receives a data byte, the numeric value
of its bits denotes the position of the character in GL or GR.  For example,
the byte 01000001 binary = 65 decimal = 04/01 = uppercase A in ASCII.  In the
8-bit environment, any byte with its 8th bit set to zero is from GL, and a
byte with its 8th bit set to one is from GR.
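
The rule for deciding which region a byte falls in can be sketched as
follows.  This is an illustrative Python fragment with a hypothetical
function name; it reports SP and DEL separately, and it labels everything in
columns 10-15 as GR, which is strictly accurate only for 96-character sets.

```python
def region(byte: int) -> str:
    """Classify an 8-bit code as C0, GL, C1, GR, SP, or DEL."""
    right_half = bool(byte & 0x80)      # 8th bit set?
    low7 = byte & 0x7F
    if low7 < 0x20:                     # columns 00-01 or 08-09
        return "C1" if right_half else "C0"
    if not right_half and low7 == 0x20:
        return "SP"
    if not right_half and low7 == 0x7F:
        return "DEL"
    return "GR" if right_half else "GL"


for code in (0x41, 0x1B, 0x8E, 0xFC):
    print(f"{code:08b}", region(code))
```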
 
A language like English can be represented adequately by ASCII in GL, because
all the required characters fit there.  When a language has more than 94
characters, one of two techniques is used to represent all the characters:
 
  1. For alphabetic languages, put ASCII (or the ISO-646 IRV) in GL and
     the special characters (like accented letters) in GR.  French, German,
     and Russian are examples.
 
  2. For languages with many symbols (e.g. where a symbol is assigned
     to each word, rather than to each sound), represent each character
     with multiple bytes rather than one byte.  Japanese Kanji, for example,
     uses a 2-byte code.  A multibyte code may be assigned to G0, G1, G2, or
     G3, just like a single-byte code.
 
How do we assign actual character sets to G0-G3, and how do we associate the
intermediate character sets with the active character set?
 
Selection of character sets is accomplished using special control characters
and escape sequences embedded within the data stream as described in ISO
Standard 2022.  An ESCAPE SEQUENCE is used to DESIGNATE a particular alphabet
(such as Roman, Cyrillic, Hebrew, Arabic, Kanji, etc) to a particular
intermediate graphics set (G0, G1, G2, or G3).  A SHIFT FUNCTION is used to
INVOKE a particular intermediate graphics set into GL or GR.  In programmer's
terms, GL and GR are pointers into the array of tables G0..G3, and the shift
functions simply change the values of these pointers.
 
In our discussion, we use the following notation (numbers are decimal unless
otherwise noted):
 
  <ESC> Escape (ASCII 27, character 01/11)
  <SP>  Space  (ASCII 32, character 02/00)
  <SO>  Shift Out (Ctrl-N, ASCII 14, character 00/14)
  <SI>  Shift In  (Ctrl-O, ASCII 15, character 00/15)
 
Table 5 shows the alphabet designation functions for single-byte and
multi-byte character sets in both the 7-bit and 8-bit environments.  The
character which is substituted for "F" identifies the actual character set to
be used.
 
_____________________________________________________________________________
 
  Escape
 Sequence     Function                                         Invoked By
 
  <ESC>(F     assigns 94-character graphics set "F" to G0.     SI or LS0
  <ESC>)F     assigns 94-character graphics set "F" to G1.     SO or LS1
  <ESC>*F     assigns 94-character graphics set "F" to G2.     SS2 or LS2
  <ESC>+F     assigns 94-character graphics set "F" to G3.     SS3 or LS3
  <ESC>-F     assigns 96-character graphics set "F" to G1.     SO or LS1
  <ESC>.F     assigns 96-character graphics set "F" to G2.     SS2 or LS2
  <ESC>/F     assigns 96-character graphics set "F" to G3.     SS3 or LS3
  <ESC>$(F    assigns multibyte character set "F" to G0.       SI or LS0
  <ESC>$)F    assigns multibyte character set "F" to G1.       SO or LS1
  <ESC>$*F    assigns multibyte character set "F" to G2.       SS2 or LS2
  <ESC>$+F    assigns multibyte character set "F" to G3.       SS3 or LS3
 
             Table 5: Escape Sequences for Alphabet Designation
_____________________________________________________________________________
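
The regularity of Table 5 can be captured in a few lines.  The Python sketch
below builds a designation sequence from the set size, the target
intermediate set, and the final character "F".  The function and table names
are hypothetical, and the sketch covers only the combinations listed in
Table 5.

```python
ESC = b"\x1b"

# (set size, intermediate set number) -> intermediate character, per Table 5.
_INTERMEDIATE = {
    (94, 0): b"(", (94, 1): b")", (94, 2): b"*", (94, 3): b"+",
    (96, 1): b"-", (96, 2): b".", (96, 3): b"/",
}


def designate(final: str, g: int, size: int = 94, multibyte: bool = False) -> bytes:
    """Return the escape sequence that designates set `final` to G<g>."""
    if multibyte:
        size = 94       # Table 5 lists multibyte sets with the 94-set characters
    try:
        intermediate = _INTERMEDIATE[(size, g)]
    except KeyError:
        raise ValueError(f"Table 5 has no {size}-character designator for G{g}")
    prefix = b"$" if multibyte else b""
    return ESC + prefix + intermediate + final.encode("ascii")


print(designate("B", 0))                     # ASCII to G0
print(designate("A", 1, size=96))            # Latin Alphabet No. 1 to G1
print(designate("B", 1, multibyte=True))     # JIS X 0208 to G1
```

Note that a 96-character set cannot be designated to G0, which the lookup
table reflects by having no (96, 0) entry.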
 

Table 6 shows the escape sequences used to designate the appropriate parts of
each of the registered character sets discussed in this proposal to G1 (except
that ASCII is designated to G0, which is the normal situation).  It is
important to note that the final letter of the escape sequence is not always
sufficient to designate a character set.  For example, Czech Standard and JIS
Katakana are both designated by letter I, but the two can be distinguished by
the intermediate characters of the escape sequence, which specify whether the
set is single- or multibyte, or, when both sets are single-byte, whether there
are 94 or 96 characters.
 
_____________________________________________________________________________
 
                            Escape    ISO          ECMA        ISO/ECMA
 Alphabet Name              Sequence  Reference    Reference   Registration
 
  ASCII (ANSI X3.4-1986)    <ESC>(B   ISO 646 IRV  ECMA-6        6
  Latin Alphabet No. 1      <ESC>-A   ISO 8859-1   ECMA-94     100
  Latin Alphabet No. 2      <ESC>-B   ISO 8859-2   ECMA-94     101
  Latin Alphabet No. 3      <ESC>-C   ISO 8859-3   ECMA-94     109
  Latin Alphabet No. 4      <ESC>-D   ISO 8859-4   ECMA-94     110
  Latin/Cyrillic            <ESC>-L   ISO 8859-5   ECMA-113    144
  Latin/Arabic              <ESC>-G   ISO 8859-6   ECMA-114    127
  Latin/Greek               <ESC>-F   ISO 8859-7   ECMA-118    126
  Latin/Hebrew              <ESC>-H   ISO 8859-8   ECMA-121    138
  Latin Alphabet No. 5      <ESC>-M   ISO 8859-9   ECMA-128    148
* Math/Technical Set        <ESC>-K   ????         ????        143
  Chinese (CAS GB 2312-80)  <ESC>$)A  none         none         58
  Japanese (JIS X 0208)     <ESC>$)B  none         none         87
  JIS-Katakana (JIS X 0201) <ESC>)I   none         none         13
  JIS-Roman (JIS X 0201)    <ESC>)J   none         none         14
  Korean (KS C 5601-1989)   <ESC>$)C  none         none        149
 
   Table 6: Alphabets, Selectors, Standards, and Registration Numbers
_____________________________________________________________________________
 
* A math/technical set is clearly needed to handle the IBM PC, DEC VT-series,
  and other math/technical/line-drawing characters, but there is apparently
  no such standard set at this time (ISO 6862? ISO DIS 10367?)

Tables 7 and 8 show the shift functions that are used to invoke the
intermediate character sets.  These shift functions may be either locking or
single.  "Locking shift" is like shift-lock on a typewriter.  It means that
all subsequent characters until the next shift are to be taken from the
designated intermediate character set.  "Single shift" applies only to the
character (either single or multibyte) that follows it immediately, but single
shift functions are only available for the G2 and G3 sets.  Locking shift
functions remain in effect across alphabet changes.
 
In the 7-bit environment, only one character set, GL, can be active at a time.
The active character set can be selected from among the intermediate sets
G0-G3 by the shifts shown in Table 7.  Control characters from C0 are
transmitted as-is, and those from the C1 set are sent as <ESC> followed by
the character whose value is that of the C1 character minus 64.  For example,
the C1 character 10000001 binary (129 decimal) becomes <ESC>A (129 - 64 = 65
= "A").
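
The 7-bit representation of a C1 control and its inverse can be sketched in
Python as follows.  The function names are hypothetical.

```python
ESC = 0x1B


def c1_to_7bit(code: int) -> bytes:
    """Convert a C1 control (128-159) to its two-byte 7-bit form."""
    if not 0x80 <= code <= 0x9F:
        raise ValueError("not a C1 control character")
    return bytes([ESC, code - 64])


def c1_from_7bit(sequence: bytes) -> int:
    """Convert <ESC> plus a character in columns 04-05 back to a C1 control."""
    if len(sequence) != 2 or sequence[0] != ESC or not 0x40 <= sequence[1] <= 0x5F:
        raise ValueError("not a 7-bit C1 sequence")
    return sequence[1] + 64


print(c1_to_7bit(129))       # <ESC>A
print(c1_to_7bit(0x8E))      # SS2 (08/14) becomes <ESC>N
```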
 
_____________________________________________________________________________
 
 Shift  Representation  Name              Function
 
  SI       Ctrl-O       Shift In          invoke G0 into GL
  SO       Ctrl-N       Shift Out         invoke G1 into GL
  LS2      <ESC>n       Locking Shift 2   invoke G2 into GL
  LS3      <ESC>o       Locking Shift 3   invoke G3 into GL
  SS2      <ESC>N       Single Shift 2    select single character from G2
  SS3      <ESC>O       Single Shift 3    select single character from G3
 
               Table 7: Shifts Used in the 7-Bit Environment
_____________________________________________________________________________
 
ISO 2022 also allows for an alternative C0 set in which the SS2 function is
assigned to the 7-bit control character EM (Control-Y, 01/09).  This set must
be designated by ESC 2/1 4/12 ("The C0 set of control characters of ISO 646
with EM replaced by SS2", ISO Registration number 140).  This set is not in
common use.

In the 8-bit environment two character sets, GL and GR, can be active at once.
A GL character is selected by a byte whose 8th bit is zero, and a GR character
by a byte whose eighth bit is one.  The actual character sets assigned to GL
and GR are selected by the shifts shown in Table 8.  Control characters from
both the C0 and C1 sets are sent as is.
 
_____________________________________________________________________________
 
 Shift  Representation  Name                   Function
 
  LS0      Ctrl-O       Locking Shift 0        invoke G0 into GL
  LS1      Ctrl-N       Locking Shift 1        invoke G1 into GL
  LS2      <ESC>n       Locking Shift 2        invoke G2 into GL
  LS3      <ESC>o       Locking Shift 3        invoke G3 into GL
  LS1R     <ESC>~       Locking Shift 1 Right  invoke G1 into GR
  LS2R     <ESC>}       Locking Shift 2 Right  invoke G2 into GR
  LS3R     <ESC>|       Locking Shift 3 Right  invoke G3 into GR
  SS2       08/14       Single Shift 2         select single character from G2
  SS3       08/15       Single Shift 3         select single character from G3
 
             Table 8: Shifts Used in the 8-Bit Environment
_____________________________________________________________________________
 
So we have a 3-tiered system.  At the bottom tier lie all the world's coded
character sets.  We can designate up to four of them, one to each of the
intermediate graphics sets G0, G1, G2, and G3 using the escape sequences shown
in Tables 5 and 6.  The terminal or computer keeps each of the selected
intermediate sets in memory.  There is also one active set, composed of GL and
GR.  The intermediate sets are invoked to GL or GR (one at a time) by the
shifts SO, SI, LS0, LS1, etc, shown in Tables 7 and 8.  A simplified diagram
for the 8-bit environment is shown in Figure 3 (see ISO 2022 for detailed
diagrams of both the 7-bit and 8-bit environments).  On a more sophisticated
output device, Figure 3 would contain numerous arrows pointing upwards to
demonstrate the operation of the designators and shifts.
 
_____________________________________________________________________________
 
                   +--+--------+  +--+--------+
                   |C0|   GL   |  |C1|   GR   |
                   |  |        |  |  |        |                  8-Bit
                   |  |        |  |  |        |                  Code
                   |  |        |  |  |        |                  In Use
                   +--+--------+  +--+--------+
 
 
         LS0          LS1,LS1R      LS2,LS2R      LS3,LS3R       Shifts
                                      SS2           SS3
       +--------+    +--------+    +--------+    +--------+      Intermediate
       |        |    |        |    |        |    |        |      Graphics
       |   G0   |    |   G1   |    |   G2   |    |   G3   |      Sets
       |        |    |        |    |        |    |        |
       +--------+    +--------+    +--------+    +--------+
                                                                 Alphabet
                                                                 Designation
 <ESC>(B      <ESC>-A      <ESC>-B      <ESC>-L      <ESC>$)B    Sequences
                                                    +---------+
+--------+   +--------+   +--------+   +--------+  +--------+ |  The world's
| ISO    |   | ISO    |   |  ISO   |   |  ISO   |  | JIS X  | |  registered
| 646IRV |   | Latin  |   |  Latin |   |  Latin |  | 0208   | |  character
|(ASCII) |   | 1      |   |  2     |   |Cyrillic|  | Kanji  | +  sets
+--------+   +--------+   +--------+   +--------+  +--------+
 
          Figure 3: The ISO 2022 Character Set Selection Mechanisms
_____________________________________________________________________________
 
For example, the following sequence could be used to transmit the German word
"<u-diaeresis>bern<a-diaeresis>chtig" using Latin Alphabet 1 in the 7-bit
environment:
 
  <ESC>(B<ESC>-A<SO>|<SI>bern<SO>d<SI>chtig
 
where:
 
  <ESC>(B   designates ASCII to G0
  <ESC>-A   designates the right half of Latin Alphabet 1 to G1
  <SO>      invokes G1 to GL
  |         is character 07/12, but since G1 is invoked to GL, it really
              denotes character 15/12, which is <u-diaeresis>
  <SI>      invokes G0 to GL
  bern      are characters from G0, which is invoked in GL
  <SO>      invokes G1 to GL
  d         is character 06/04, but since G1 is invoked to GL, it really
              denotes character 14/04, which is <a-diaeresis>
  <SI>      invokes G0 to GL
  chtig     are characters from G0
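
The walk-through above can be expressed as a small state machine in which GL
is an index into the array of intermediate sets.  The Python sketch below is
deliberately minimal: it recognizes only the two designation sequences and
the two shifts used in this example, and its names are hypothetical.

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F

# Designation sequences understood by this sketch (those of the example only).
DESIGNATORS = {b"(B": (0, "ascii"), b"-A": (1, "latin1-right")}


def decode_7bit(stream: bytes) -> str:
    g = [None, None, None, None]    # intermediate sets G0..G3
    gl = 0                          # which intermediate set is invoked in GL
    out = []
    i = 0
    while i < len(stream):
        byte = stream[i]
        if byte == ESC:
            # Unknown sequences raise KeyError; a real decoder needs more.
            slot, name = DESIGNATORS[stream[i + 1:i + 3]]
            g[slot] = name
            i += 3
            continue
        if byte == SO:
            gl = 1                  # invoke G1 into GL
        elif byte == SI:
            gl = 0                  # invoke G0 into GL
        elif byte < 0x20 or g[gl] == "ascii":
            out.append(chr(byte))
        elif g[gl] == "latin1-right":
            # Same column/row position, but in the right half of Latin-1.
            out.append(bytes([byte | 0x80]).decode("latin-1"))
        else:
            raise ValueError(f"G{gl} has no character set designated")
        i += 1
    return "".join(out)


word = decode_7bit(b"\x1b(B\x1b-A\x0e|\x0fbern\x0ed\x0fchtig")
```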
 
The same word could be transmitted in the 7-bit environment using single
shifts, if Latin Alphabet 1 were designated to G2 (or G3):
 
  <ESC>(B<ESC>.A<ESC>N|bern<ESC>Ndchtig
 
(where <ESC>.A designates the 96-character right half of Latin-1 to G2, as
in Table 5, and <ESC>N is Single Shift 2).
 
In the 8-bit environment it could be transmitted using no shifts at all:
 
  <ESC>(B<ESC>-A<u-diaeresis>bern<a-diaeresis>chtig
 
The designation escape sequences are transmitted only at the beginning of a
session and need not be repeated after the initial designations are made,
unless an intermediate set (G0-G3) is to be recycled.
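
The three encodings of the sample word can be generated mechanically.  The
Python sketch below produces the locking-shift, single-shift, and 8-bit
forms for any Latin-1 text that contains no C1 controls.  The function names
are hypothetical.  The single-shift form uses <ESC>.A, the 96-character G2
designator from Table 5, and the 8-bit form assumes, as the example does,
that G1 is already invoked into GR.

```python
ESC, SO, SI = b"\x1b", b"\x0e", b"\x0f"


def _latin1(text: str) -> bytes:
    data = text.encode("latin-1")
    if any(0x80 <= b <= 0x9F for b in data):
        raise ValueError("C1 controls are outside the scope of this sketch")
    return data


def encode_locking(text: str) -> bytes:
    """7-bit form: ASCII in G0, Latin-1 right half in G1, SO/SI shifts."""
    out = bytearray(ESC + b"(B" + ESC + b"-A")
    in_g1 = False
    for b in _latin1(text):
        high = b >= 0xA0
        if high != in_g1:
            out += SO if high else SI
            in_g1 = high
        out.append(b & 0x7F)
    if in_g1:
        out += SI                   # leave G0 invoked at the end
    return bytes(out)


def encode_single(text: str) -> bytes:
    """7-bit form: Latin-1 right half in G2, one SS2 per special character."""
    out = bytearray(ESC + b"(B" + ESC + b".A")
    for b in _latin1(text):
        if b >= 0xA0:
            out += ESC + b"N"       # Single Shift 2 in its 7-bit form
        out.append(b & 0x7F)
    return bytes(out)


def encode_8bit(text: str) -> bytes:
    """8-bit form: no shifts needed once the designations are made."""
    return ESC + b"(B" + ESC + b"-A" + _latin1(text)


sample = "\u00fcbern\u00e4chtig"
for encoder in (encode_locking, encode_single, encode_8bit):
    print(encoder.__name__, len(encoder(sample)))
```

Locking shifts cost two bytes per run of special characters, while single
shifts cost two bytes per special character, so the better choice depends on
how the special characters are distributed in the text.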
 
To understand the three-tiered design of ISO 2022, imagine a computer
programmed to display a mixture of character sets on its screen.  A large
collection of fonts might be stored on the disk, one font per file.  These are
the character sets of the bottom tier.  When a font is needed, it will be read
from the disk and stored in memory in an array, for rapid access.  If several
fonts are needed, they will be stored in several arrays.  These arrays are the
intermediate character sets, G0-G3.  When a data byte arrives to be displayed,
the actual graphic representation is taken from GL or GR (depending on the
byte's 8th bit).  GL is associated with one of the intermediate graphic sets,
and GR with another.  If no more than four character sets are used, then each
one needs to be read from the disk only once, and display is rapid and
efficient thereafter.
 
Perhaps the most common application of ISO 2022 shifting techniques is with
the Japanese EUC (Extended UNIX Code) character set, which combines JIS X
0201 (which in turn consists of an ASCII-like Roman alphabet in the left half
and Japanese Katakana characters in the right) and JIS X 0208 (a double-byte
Japanese Kanji character set).  EUC encoding is used not only in data
communications, but also in files, e-mail, etc.  EUC is used as follows:

  Left half of JIS X 0201 (Roman, similar to ASCII) is designated to G0.
  JIS X 0208 (Kanji) is designated to G1.
  Right half of JIS X 0201 (Katakana) is designated to G2.
  G0 is initially invoked to GL.
  G1 is initially invoked to GR.

In the 8-bit environment, any byte with its 8th bit equal to zero is a Roman
G0 graphic or a C0 control character.  A byte with its 8th bit equal to 1 and
low-order 7 bits falling in the graphic range is the first byte of a Kanji
character pair.  Others are C1 controls.  The C1 control character SS2
selects the subsequent single byte from the Katakana set.

In the 7-bit environment, SO and SI are used to shift G1 in and out of GL,
and Kanji bytes are transmitted without their high-order bits.  C1 controls,
including SS2, are transmitted in their 2-byte 7-bit form (SS2 becomes
<ESC>N).
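
The byte classification and the 7-bit form just described can be sketched in
Python as follows.  This is a minimal illustration under stated assumptions,
not the code of any Kermit program: the function names split_euc and
euc_to_7bit are hypothetical, the Kanji range is taken to be A1-FE
(hexadecimal), and a trailing <SI> is added so that the stream ends with G0
in GL.

```python
ESC, SO, SI, SS2 = 0x1B, 0x0E, 0x0F, 0x8E

def split_euc(data):
    """Split 8-bit EUC bytes into (kind, bytes) units."""
    units, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                       # Roman graphic or C0 control
            units.append(("roman", data[i:i + 1]))
            i += 1
        elif b == SS2:                     # next byte is Katakana (G2)
            units.append(("katakana", data[i + 1:i + 2]))
            i += 2
        elif 0xA1 <= b <= 0xFE:            # first byte of a Kanji pair
            units.append(("kanji", data[i:i + 2]))
            i += 2
        elif b <= 0x9F:                    # any other C1 control
            units.append(("c1", data[i:i + 1]))
            i += 1
        else:
            raise ValueError("byte %#x is not valid in this sketch" % b)
    return units

def euc_to_7bit(data):
    """Re-encode 8-bit EUC for a 7-bit path, as described above."""
    out, g1_in_gl = bytearray(), False
    for kind, raw in split_euc(data):
        if kind == "kanji":
            if not g1_in_gl:
                out.append(SO)             # shift G1 (Kanji) into GL
                g1_in_gl = True
            out += bytes(c & 0x7F for c in raw)
        elif kind == "katakana":
            out += bytes([ESC, ord("N"), raw[0] & 0x7F])   # SS2 as <ESC>N
        elif kind == "c1":
            out += bytes([ESC, raw[0] - 0x40])             # 2-byte 7-bit form
        else:
            if g1_in_gl and raw[0] >= 0x20:
                out.append(SI)             # shift G0 (Roman) back into GL
                g1_in_gl = False
            out += raw
    if g1_in_gl:
        out.append(SI)
    return bytes(out)

# One Roman letter, one Katakana byte after SS2, one Kanji pair.
sample = b"A\x8e\xb1\xb4\xc1"
print([kind for kind, _ in split_euc(sample)])
print(euc_to_7bit(sample))
```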


ANNOUNCING ISO 2022 FACILITIES
 
A large portion of ISO 2022 is devoted to describing how 8-bit characters may
be transmitted on a 7-bit communication path, for example when parity is in
use.  In the 7-bit environment, there is only GL -- no GR.  Therefore, all
characters are transmitted with their 8th bit removed, and shifts are used to
specify which intermediate set they belong to.
 
In fact, there are many possible ways to use the ISO 2022 code extension
facilities within both 7-bit and 8-bit environments.  For example, the sender
may inform the receiver in advance whether G1, G2, or G3 will be used, etc., so
that the receiver can allocate the appropriate resources.  At the beginning of
any particular data transfer, the facilities that actually will be used can be
announced with a sequence of the form <ESC><SP>F, where F is replaced by an
ISO 2022 announcer.  Several of the most important ones are described here.
Table 9 lists all the defined announcers in summary form.  For details, see
ISO 2022.
 
<ESC><SP>A means that only the G0 set will be used, invoked into GL.  No
  shift functions will be used.  In the 8-bit environment, GR is not used.
  In other words, only a single 7-bit character set is used.
 
<ESC><SP>B means the G0 and G1 sets will be used with locking shifts.  In the
  7-bit environment <SI> invokes G0 into GL, <SO> invokes G1 into GL.  In the
  8-bit environment, LS0 invokes G0 into GL, LS1 invokes G1 into GL.  In other
  words, two character sets are used, with characters from both sets always
  sent as 7-bit values, with locking shifts used to specify the 8th bit.
 
<ESC><SP>C means that G0 and G1 will be used in the 8-bit environment, with G0
  invoked in GL and G1 in GR.  No locking shift functions are used.  In other
  words, a single 8-bit character set is used, with all 8 bits transmitted as
  data.  GL is selected when the character's 8th bit is zero, GR is selected
  when the 8th bit is one.
 
<ESC><SP>D means that G0 and G1 will be used, with locking shifts in the
  7-bit environment only.  In the 7-bit environment, <SI> invokes G0 into GL
  and <SO> invokes G1 into GL.  In the 8-bit environment, G0 is invoked in GL
  and G1 in GR, and all 8 bits of each character are transmitted with no
  shifts.
 
<ESC><SP>L means that Level 1 of ISO 4873 will be used.  That is, a single
  8-bit character set with C0, G0, C1, and G1, with no shift functions.
  This is like <ESC><SP>C.
 
<ESC><SP>M means that Level 2 of ISO 4873 will be used.  This is equivalent
  to Level 1, with the addition of G2 and G3.  Characters from G2 and G3 are
  invoked only by the single-shift functions SS2 and SS3.
 
<ESC><SP>N means that Level 3 of ISO 4873 will be used.  This is equivalent
  to Level 2 with the addition of the locking shift functions LS1R, LS2R, and
  LS3R. (Note that ISO 4873 does not concern itself with the 7-bit
  environment, and therefore does not discuss the use of LS0, LS1, LS2, or
  LS3.)
 
_____________________________________________________________________________
 
Esc Sequence  7-Bit Environment          8-Bit Environment
 
<ESC><SP>A    G0->GL                     G0->GL
<ESC><SP>B    G0-(SI)->GL, G1-(SO)->GL   G0-(LS0)->GL, G1-(LS1)->GL
<ESC><SP>C    (not used)                 G0->GL, G1->GR
<ESC><SP>D    G0-(SI)->GL, G1-(SO)->GL   G0->GL, G1->GR
<ESC><SP>E    Full preservation of shift functions in 7 & 8 bit environments
<ESC><SP>F    C1 represented as <ESC>F   C1 represented as <ESC>F
<ESC><SP>G    C1 represented as <ESC>F   C1 represented as 8-bit quantity
<ESC><SP>H    All graphic character sets have 94 characters
<ESC><SP>I    All graphic character sets have 94 or 96 characters
<ESC><SP>J    In a 7 or 8 bit environment, a 7 bit code is used
<ESC><SP>K    In an 8 bit environment, an 8 bit code is used
<ESC><SP>L    Level 1 of ISO 4873 is used
<ESC><SP>M    Level 2 of ISO 4873 is used
<ESC><SP>N    Level 3 of ISO 4873 is used
<ESC><SP>P    G0 is used in addition to any other sets:
              G0 -(SI)-> GL              G0 -(LS0)-> GL
<ESC><SP>R    G1 is used in addition to any other sets:
              G1 -(SO)-> GL              G1 -(LS1)-> GL
<ESC><SP>S    G1 is used in addition to any other sets:
              G1 -(SO)-> GL              G1 -(LS1R)-> GR
<ESC><SP>T    G2 is used in addition to any other sets:
              G2 -(LS2)-> GL             G2 -(LS2)-> GL
<ESC><SP>U    G2 is used in addition to any other sets:
              G2 -(LS2)-> GL             G2 -(LS2R)-> GR
<ESC><SP>V    G3 is used in addition to any other sets:
              G3 -(LS3)-> GL             G3 -(LS3)-> GL
<ESC><SP>W    G3 is used in addition to any other sets:
              G3 -(LS3)-> GL             G3 -(LS3R)-> GR
<ESC><SP>Z    G2 is used in addition to any other sets:
              SS2 invokes a single character from G2
<ESC><SP>[    G3 is used in addition to any other sets:
              SS3 invokes a single character from G3
 
                     Table 9: ISO 2022 Announcer Summary
_____________________________________________________________________________
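
A receiver might recognize announcers at the start of a transfer along the
following lines.  This Python sketch is only an illustration: the function
name read_announcers is hypothetical, the ANNOUNCERS table covers only some
of the entries of Table 9, and the descriptions are paraphrased from it.

```python
ESC, SP = 0x1B, 0x20

# A few of the announcers from Table 9, keyed by final character.
ANNOUNCERS = {
    "A": "G0 only, invoked in GL; no shift functions",
    "B": "G0 and G1 with locking shifts",
    "C": "G0 in GL and G1 in GR (8-bit environment); no locking shifts",
    "D": "G0 and G1; locking shifts in the 7-bit environment only",
    "L": "Level 1 of ISO 4873",
    "M": "Level 2 of ISO 4873",
    "N": "Level 3 of ISO 4873",
    "Z": "G2 used via single shift SS2",
    "[": "G3 used via single shift SS3",
}

def read_announcers(data):
    """Strip leading <ESC><SP>F sequences; return (finals, rest)."""
    finals, i = [], 0
    while len(data) >= i + 3 and data[i] == ESC and data[i + 1] == SP:
        finals.append(chr(data[i + 2]))
        i += 3
    return finals, data[i:]

# Two announcers, followed by a designation and some text.
finals, rest = read_announcers(b"\x1b C\x1b Z\x1b(Btext")
for final in finals:
    print(final, "-", ANNOUNCERS.get(final, "not covered by this sketch"))
```

The remaining bytes (here the designation <ESC>(B and the text) are returned
untouched, so that announcement handling stays separate from designation and
shift handling.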
 

STANDARD VERSUS PRIVATE CHARACTER SETS

Most of the popular private 8-bit character sets, notably (but not only) the
IBM PC code pages and the Apple Macintosh character sets, differ from the
standard character sets in four important ways:

 1. The repertoire of characters is different.
 2. The encoding of characters is different.
 3. The C1 area is sometimes used for graphics, which is forbidden by the
    standards. 
 4. In some cases, even the C0 area is used for graphics.

However, most of these character sets conform to the requirement that the
left half be identical with US ASCII.
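
Points 3 and the ASCII-compatibility remark can be observed directly.  The
Python sketch below uses the code page tables shipped with Python's standard
codecs ("cp437", "mac_roman", "latin-1") as stand-ins for the vendors' own
tables, which is an assumption; the function names are hypothetical.

```python
import unicodedata

def c1_graphics(codec):
    """Return the bytes 0x80-0x9F that the codec decodes to graphics."""
    found = []
    for b in range(0x80, 0xA0):
        try:
            ch = bytes([b]).decode(codec)
        except UnicodeDecodeError:
            continue                          # byte undefined in this set
        if unicodedata.category(ch) != "Cc":  # "Cc" means control character
            found.append(b)
    return found

def left_half_is_ascii(codec):
    """True if the graphic codes 0x20-0x7E decode as in US ASCII."""
    return all(bytes([b]).decode(codec) == chr(b) for b in range(0x20, 0x7F))

for codec in ("latin-1", "cp437", "mac_roman"):
    print(codec, len(c1_graphics(codec)), left_half_is_ascii(codec))
```

For the standard set the list of C1 graphics is empty, while for the two
private sets it is not; all three agree with ASCII in the left half.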


APPENDIX C:  (Deleted)
 
 
APPENDIX D: SUMMARY OF KERMIT COMMANDS RELATED TO CHARACTER SET TRANSLATION
 
 
SET FILE TYPE { BINARY, TEXT }
 
    BINARY means no translation, and overrides all other file-related
    commands, including SET TRANSFER.
 
    TEXT is the default.  Enables file transfer character set translation,
    depending on the setting of SET TRANSFER.
 
SET FILE CHARACTER-SET <name>
 
    Effective only when file type is TEXT.
    Tell Kermit what character set the file is coded in,
    or what character set to translate an incoming file to.
 
SET TRANSFER { CHARACTER-SET <name>, LOCKING-SHIFT { ON, OFF, FORCED } }
 
    CHARACTER-SET <name>
      Invoke file transfer character set translation.  <name> is
      TRANSPARENT, ASCII, LATIN1, LATIN2, ..., CYRILLIC, JAPAN-EUC, etc.
 
    LOCKING-SHIFT { ON, OFF, FORCED }
      Enable, disable, or force locking-shift transport protocol for
      efficient transfer of 8-bit data in the 7-bit communications
      environment.  Normally enabled.  Used only if both Kermit programs
      agree in the feature negotiation phase to use it (essentially, if
      PARITY is not NONE, and they both have locking-shift capability).
 
SET LANGUAGE <name>
 
    This command informs the program which language is being translated,
    to allow for special language-based transliteration rules, such as
    replacing a-diaeresis by ae.
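
    A language-based rule of this kind might look like the following Python
    sketch.  The rule table, the function name to_ascii, and the use of "?"
    for characters without a rule are all assumptions made for illustration;
    they are not the rules of any particular Kermit program.

```python
# Hypothetical rule tables; a real implementation would define its own.
LANGUAGE_RULES = {
    "german": {
        "\u00e4": "ae",   # a-diaeresis
        "\u00f6": "oe",   # o-diaeresis
        "\u00fc": "ue",   # u-diaeresis
        "\u00c4": "Ae",   # A-diaeresis
        "\u00d6": "Oe",   # O-diaeresis
        "\u00dc": "Ue",   # U-diaeresis
        "\u00df": "ss",   # sharp s
    },
}

def to_ascii(text, language=None):
    """Transliterate text to ASCII using the rules of the given language."""
    rules = LANGUAGE_RULES.get(language, {})
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)                 # already ASCII
        else:
            out.append(rules.get(ch, "?")) # no rule: substitute "?"
    return "".join(out)

print(to_ascii("\u00fcbern\u00e4chtig", "german"))
print(to_ascii("\u00fcbern\u00e4chtig"))
```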
 
SET { TRANSFER, TERMINAL } TRANSLATION { INVERTIBLE, READABLE }

    Specify the goal of the specified translation: invertibility or
    readability.
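
    The difference between the two goals can be stated in terms of the
    translation table itself: an invertible table maps every code to a
    distinct code, so a reverse table exists.  The Python sketch below
    illustrates this for 256-entry single-byte tables; the function names
    and the two sample tables are hypothetical.

```python
def is_invertible(table):
    """True if the 256-entry table maps every byte to a distinct byte."""
    return len(table) == 256 and len(set(table)) == 256

def invert(table):
    """Build the reverse table of an invertible translation."""
    if not is_invertible(table):
        raise ValueError("translation is not invertible")
    reverse = bytearray(256)
    for source, target in enumerate(table):
        reverse[target] = source
    return bytes(reverse)

identity = bytes(range(256))

# Hypothetical "readable" table: a-diaeresis (0xE4) is shown as "a",
# so two different source bytes now share one target.
readable = bytearray(identity)
readable[0xE4] = ord("a")

# Hypothetical invertible table: swap two codes instead of merging them.
swapped = bytearray(identity)
swapped[0xE4], swapped[0x84] = 0x84, 0xE4

data = b"\xe4 and a"
assert not is_invertible(readable)
assert data.translate(swapped).translate(invert(swapped)) == data
```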

SET UNKNOWN-CHARACTER-SET { KEEP, CANCEL }
 
    Tell the file receiver whether to keep or cancel an incoming file that
    contains an unknown character set.  KEEP is the default.
 
SET { SEND, RECEIVE } AUTOMATIC-TRANSLATION { ON, OFF, <set1> [ <set2> ] }

    Enable or disable automatic selection of a file transfer
    translation table in the indicated direction, or specify a pair of
    character sets to be used: given <set1>, automatically translate to
    <set2>.  Default in both directions is OFF.

SET ATTRIBUTES { ON, OFF }
SET ATTRIBUTE <name-of-attribute> { ON, OFF }
 
    Enables or disables processing of attribute packets, or specific
    attribute fields such as DATE, ENCODING, LENGTH, etc.
 
SET TERMINAL { CHARACTER-SET, DIRECTION, LOCKING-SHIFT, TRANSLATION }
 
    Specifies terminal emulation character-set translation, screen writing
    direction, locking shift usage, and translation goal.
 
SHOW { CHARACTER-SETS, LANGUAGE, FILE, TRANSFER, PROTOCOL, TERMINAL }
 
    Display what character sets, translation tables, and languages are
    available, and which ones are currently selected.
 
TRANSLATE <file1> <file2> [ <file1-character-set> [ <file2-character-set> ] ]
 
    Copies local file <file1> to local file <file2>, translating <file1> from
    <file1-character-set> to <file2-character-set>.  If <file1-character-set>
    is not specified, the current FILE CHARACTER-SET is used.  If
    <file2-character-set> is not specified, the current TRANSFER CHARACTER-SET
    is used.  Note that this command can be used to convert between two
    different FILE CHARACTER-SETS, in which case an appropriate TRANSFER
    CHARACTER-SET can be used in an intermediate step.
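
    The two-step conversion through a transfer character set can be sketched
    in Python as follows.  This is an illustration, not the TRANSLATE
    command itself: the function name translate and its parameters are
    hypothetical, and Python's built-in codecs ("cp437", "mac_roman",
    "latin-1") stand in for Kermit's translation tables.

```python
def translate(data, file1_set, file2_set, transfer_set="latin-1"):
    """Convert bytes from file1_set to file2_set via transfer_set."""
    # Step 1: first file character set -> transfer character set.
    intermediate = data.decode(file1_set).encode(transfer_set)
    # Step 2: transfer character set -> second file character set.
    return intermediate.decode(transfer_set).encode(file2_set)

pc_text = b"\x81bern\x84chtig"     # u-diaeresis and a-diaeresis in CP437
mac_text = translate(pc_text, "cp437", "mac_roman")
print(mac_text)
```

    In this sketch a character that has no place in the transfer character
    set raises UnicodeEncodeError in step 1, which mirrors the fact that
    only the repertoire common to all three sets survives the conversion.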
 
 
APPENDIX E:  (Deleted)
 

APPENDIX F:  (Deleted)

 
APPENDIX G:  OFFICIAL CHARACTER SET TRANSLATIONS

Apple: ???

Atari: ???

IBM:

IBM lists its character sets in the following manuals:

  "Graphic Character Identification System, Graphic Character Global
    Identifier (GCGID) Structure", C-H 3-3220-055, 1989 (Internal Use Only).
  "Registry of Graphic Character Sets and Code Pages", C-H 3-3220-050
    (Internal Use Only).
  
The translations between its corporate code pages and ISO standard character
sets are given in the following manuals:

  "SAA Character Data Representation Architecture (CDRA)"
    Executive Overview: GC09-1392-00 (15 pages)
    Level-1, Reference: SC09-1390-00 (64 pages)
    Level-1, Registry:  SC09-1391-00 (tables, 720 pages)
 
In particular, IBM has adopted ISO 8859-1 Latin Alphabet 1 as IBM Code Page
0819, and publishes its official, invertible translations between this code
page and various private IBM code pages (such as CP850 and CECP500), as
well as invertible or noninvertible translations between many other pairs of
IBM code pages.  From these, it is possible to infer other translations, for
example between Code Page 437 and Latin-1.

Commodore: ???

Data General: ???

Digital Equipment Corporation: ???

Microsoft: ???

(Much work is needed on this section...)


REFERENCES:
 
The standards listed in Appendix A, the documents in Appendix G, plus:
 
CCITT Recommendation T.61, "Character Repertoire and Coded Character Sets for
the International Teletex Service", Geneva (1980, amended at
Malaga-Torremolinos 1984).

Chandler, John, "IBM System/370 Kermit User's Guide", version 4.2, (1991)
(Internet: watsun.cc.columbia.edu:kermit/b/ik[cmtx]ker.{doc,ps}).  For VM/CMS,
MVS/TSO, CICS, and MUSIC.  Detailed description of how to use Kermit's
character set translation facilities in the IBM mainframe environment.
 
da Cruz, Frank, "Kermit, A File Transfer Protocol", Digital Press (1987).
The specification of the Kermit file transfer protocol before the addition
of this extension.
 
Do, James, Ngo^ Thanh Nha`n, Hoa`ng Nguye^n, "A proposal for Vietnamese
character encoding standards in a unified text processing framework", Computer
Standards & Interfaces 14 (1992) 3-12, Elsevier North-Holland.

Gianone, Christine M., "It's Time to Prepare for International Computing",
PC Week, October 2, 1989.
 
Gianone, Christine M., "Using MS-DOS Kermit", Second Edition, Digital Press
(1991).  Chapter 13 describes how to use the character set translation
facilities of MS-DOS Kermit 3.0 and later on IBM PCs, PS/2s, and compatibles,
for both terminal emulation and file transfer.  Also included are character
set and conversion tables for many Roman and Cyrillic character sets.
 
Gianone, Christine M., and Frank da Cruz, "C-Kermit User Guide", version 5A
(1991) (Internet: watsun.cc.columbia.edu:kermit/sw/ckuker.{doc,ps}).
Description of the terminal and file transfer character set translation
features of C-Kermit 5A for UNIX and VAX/VMS.
 
Gianone, Christine M., and Frank da Cruz, "A Locking Shift Mechanism for the
Kermit File Transfer Protocol", unpublished paper, Columbia University,
October 1991 (watsun.cc.columbia.edu:kermit/e/lshift.txt).  A Kermit protocol
extension for transferring 8-bit text efficiently in the 7-bit communication
environment.

Hart, Edwin (ed.), "ASCII and EBCDIC Character Set and Code Issues in a
Systems Applications Architecture", SSD #366, SHARE Inc., Chicago, IL, USA
(June 1989).  Commonly called the "SHARE White Paper".  A cogent description
of the problems of character set translation in the IBM computing environment,
with recommendations adopted by SHARE, an international, voluntary
organization of users of IBM systems.

IBM System/370 Reference Summary, IBM GX20-1850-6.  The definitive US-ASCII /
US-EBCDIC translation table.

ISO 639, "Code for Representation of Names of Languages" (1988).  Useful for
naming language-related symbols in Kermit programs.

ISO 3166, "Country Codes" (1988 + Registration Newsletter updates).  Useful
for naming country-related symbols in Kermit programs.

ISO/IEC 10646-1:1993, Multiple-Octet Coded Character Set.  The universal
character set.

Pirard, Andr'e, "Guidelines to Use 8-Bit Character Codes", University of
Liege, Belgium, unpublished paper on character set translation problems,
written from the West European perspective, listing numerous suggested
invertible translation tables.
Files: watsun.cc.columbia.edu:kermit/charsets/iso8859.networking and
iso8859.moretran.

"The Unicode Standard, Worldwide Character Encoding", Version 1.0, Volume 1,
Addison-Wesley (1991).

[End of ISOK6.TXT]