******************** NOTE: THIS VERSION IS OBSOLETE, AND HAS BEEN SUPERSEDED BY ISOK6.TXT. ********************

A KERMIT PROTOCOL EXTENSION FOR INTERNATIONAL CHARACTER SETS

Christine Gianone
Manager, Kermit Development and Distribution
Columbia University Center for Computing Activities
612 West 115th Street
New York, NY 10025, USA

DRAFT NUMBER 5
APRIL 25, 1990

ABSTRACT

A two-level extension to the presentation layer of the Kermit file transfer protocol is proposed to allow transfer of non-English-language text files between unlike computers. Level 1 allows substitution of single character sets other than ASCII in Kermit's normal text-file transfer syntax. Level 2 specifies a new transfer syntax in which multiple character sets may be used, along with mechanisms for switching among them as defined in ISO Standard 2022.

This is still a DRAFT proposal. Readers with knowledge of real-world multi-alphabet applications and file formats are urged to comment on the suitability of this proposal.

It is assumed the reader is familiar with the Kermit file transfer protocol. It is also assumed that the reader is familiar with ISO Standards 4873 and 2022, but these are summarized in Appendix B.

SUMMARY OF CHANGES SINCE DRAFT #4, August, 1989

 - Changes for Level 1 only, to reflect experience in writing the code to implement it for MS-DOS Kermit 3.0, C-Kermit 5A, and Kermit 370 4.2. Level 2 is on hold indefinitely pending ISO 10646 & Unicode developments.
 - Abandonment of separate attributes for encoding and character set.
 - Change all references to ASCII as I2 into I6.
 - Change description of SET LANGUAGE to remove side effects.
 - Differentiation of SET TRANSFER CHARACTER ASCII and TRANSPARENT.
 - The section on terminal emulation has not been changed, even though this subject needs detailed treatment in this document.

SUMMARY OF CHANGES SINCE DRAFT #3, July 20, 1989

 - Expanded & more precise definition of Kermit's character set designators
 - Simplification of the syntax of the (former) SET TRANSFER-SYNTAX command
 - Addition of SET LANGUAGE command
 - Clarification of Kermit's behavior when it receives an unknown character set
 - Addition of Appendix F to specify how each Kermit Level is invoked
 - Correction of numerous typographical and other errors

ACKNOWLEDGEMENTS

Many thanks to these people for their helpful and constructive comments on the first three drafts. In most cases, their suggestions or the information they provided have been incorporated into this or previous drafts.

  John Chandler (Harvard/Smithsonian Center for Astrophysics, USA)
  Alan Curtis (University of London, UK)
  Frank da Cruz (Columbia University, USA)
  Joe Doupnik (Utah State University, USA)
  Hirofumi Fujii (Japan National Laboratory of High Energy Physics, Tokyo)
  John Klensin (Massachusetts Institute of Technology, USA)
  Ken-ichiro Murakami (Nippon Telephone and Telegraph Research Labs, Tokyo)
  Vladimir Novikov (VNIIPAS, Moscow, USSR)
  Jacob Palme (Stockholm University, Sweden)
  Andre Pirard (University of Liege, Belgium)
  Paul Placeway (Ohio State University, USA)
  Gisbert W. Selke (University of Bonn, West Germany)
  Fridrik Skulason (University of Iceland, Reykjavik)
  Johan van Wingen (Leiden, Netherlands)
  Konstantin Vinogradov (ICSTI, Moscow, USSR)
  Amanda Walker (InterCon Systems Corp, USA)

Thanks also to the following people for organizing meetings or conferences in their countries at which the issues of this proposal were discussed:

  Kohichi Nishimoto (Nihon DEC, Tokyo, Japan)
  Juri Gornostaev and A. Butrimenko (ICSTI, Moscow, USSR)

and thanks also to those who attended these gatherings!

STATEMENT OF THE PROBLEM

Kermit has always been able to transfer text files between unlike systems (e.g. a UNIX system with ASCII stream text files and an IBM mainframe with EBCDIC record-oriented text files).
To do the text file code conversion, Kermit transfers text in ASCII. But ASCII only includes enough letters and symbols for English. There are now computers capable of representing the characters of other languages: Roman letters with diacritical marks, Cyrillic letters, Hebrew, Arabic, and Greek characters, Japanese and Chinese ideograms. But different computer manufacturers use different codes for these characters.

For example, the IBM PS/2 and the Apple Macintosh have character sets that are "8-bit ASCII". When the character value is 32-127, the character is (normally) a standard ASCII graphic (printable) character. When the value is 128 or higher, it is a special character. But the PC and the Macintosh assign different special characters to these values. Here are just a few examples:

  Value   PS/2 Character       Macintosh Character
   138    Small e grave        Small a umlaut
   143    Capital A ring       Small e grave
   144    Capital E acute      Small e circumflex
   136    Small e circumflex   Small a grave

When a file contains "8-bit ASCII", Kermit presently transfers it without any character translation. Therefore, a text file written in French, German, Italian, or Norwegian transferred between a PS/2 and a Macintosh will contain the wrong characters when it arrives at its destination: the PS/2's e-grave becomes a-umlaut on the Macintosh, etc.

The problem is compounded when a file is composed of characters from more than one character set, for example a Japanese text file that contains Kanji, Katakana, and Roman characters.

There are many computer vendors in the world and nobody controls what codes they use to represent characters. Without a standard protocol for transferring non-ASCII text, each computer would have to know the codes of all the other computers in order for correct transfer of non-English text files to occur between unlike systems.

NORMAL KERMIT FILE TRANSFER SYNTAX

The Kermit file transfer protocol makes a distinction between text and binary files.
Binary files are transmitted with no translation or conversion. For text files, the Kermit protocol defines a standard transfer syntax, namely ASCII characters with carriage return and linefeed (CRLF) after each line, so that text may be stored in useful fashion on any computer to which it is transferred. Each Kermit program knows how to translate from the local text-file storage conventions to ASCII/CRLF syntax, and vice versa. This is the basic, required, and default mode of operation for any Kermit program, and it will be referred to as Kermit's "Normal" or "Level 0" syntax.

EXPANDED KERMIT TRANSFER SYNTAX

This proposal adds two additional levels of transfer syntax, Levels 1 and 2.

Level 1 permits the use of a single character set other than ASCII in the transfer syntax. These additional character sets are taken from recognized national or international standards, such as ISO 8859-1 (Latin Alphabet 1), JIS X 0208 (Japanese), etc. By using a standard character set (other than ASCII), it is possible to transfer a text file written in a language other than English, and it is also possible to transfer a text file containing more than one language. For example, Latin Alphabet 1 can represent a file containing a mixture of Italian, Norwegian, French, German, English, and Icelandic.

Level 2 allows a mixture of character sets to transfer mixed-language text that requires characters from more than one standard character set, for example a document written in Russian, French, and Greek.

The additional levels are optional features for Kermit programs, except that Level 2 should not be provided without Level 1.

The following discussion applies to text-file transfer only. When the Kermit user has selected binary file transfer, none of the text-file conversions discussed here apply.
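
The Normal (Level 0) syntax and the text/binary distinction described in this section can be sketched as follows. This is only an illustration, not part of the proposal: the function names are hypothetical, and the local system is assumed to store ASCII text with a bare linefeed after each line, as UNIX does.

```python
def encode_for_transfer(data: bytes, file_type: str = "text") -> bytes:
    """Convert local file contents to the Level 0 transfer syntax.

    Assumption: the local text format is ASCII with LF line terminators.
    """
    if file_type == "binary":
        # Binary files are sent with no translation or conversion.
        return data
    # At Level 0 the text must be representable in 7-bit ASCII.
    text = data.decode("ascii")
    # Normalize existing CRLF pairs first so they are not doubled,
    # then terminate every line with CRLF.
    return text.replace("\r\n", "\n").replace("\n", "\r\n").encode("ascii")


def decode_from_transfer(data: bytes, file_type: str = "text") -> bytes:
    """Convert Level 0 transfer syntax back to the assumed local format."""
    if file_type == "binary":
        return data
    return data.decode("ascii").replace("\r\n", "\n").encode("ascii")
```

A system with a different local format (for example EBCDIC records) would replace only the body of the text branch; the binary branch stays a pass-through.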

EXPANDED SYNTAX, LEVEL 1

When all the characters in a text file can be represented by a single character set, then that character set can be used in place of ASCII in Kermit's transfer syntax.

Whatever the transfer character set, there must be a mapping between the local file character set and the character set of the common transfer syntax. That is, there must be a pair of translation functions in the program, one from the local file character set to the transfer character set, and one from the transfer set to the local set.

Until now, many Kermit programs have lacked such a translation function, because the local file character set was the same as the transfer character set, namely ASCII. But there have always been exceptions. For example, IBM System/370 mainframe Kermit must translate between ASCII and its local EBCDIC character set.

To complicate matters, many computers now support a variety of character sets. IBM mainframes have not only "standard" US EBCDIC, but also several EBCDIC-based Country Extended Code Pages (CECPs) for the support of West European languages, Hebrew, etc. The IBM PC and PS/2 have a variety of ASCII-based 8-bit code pages for the same purpose.

These character sets are a welcome addition, because they allow users of these computers to create, display, and print documents in languages other than English. Unfortunately, it is usually the case that the computer's file system keeps no record of which character set is used in each file. For this reason, the following command should be provided to allow the Kermit user to specify the local file character set:

  SET FILE CHARACTER-SET <name>

The file character set name is a system-dependent item. Some computers have only one character set, in which case the SET FILE CHARACTER-SET command would be unnecessary.

This command will be necessary on computers that use the "national replacement characters" allowed by ISO Standard 646.
This standard specifies a 7-bit character set equivalent to ASCII, but with national variants in which certain non-alphanumeric ASCII graphic characters are replaced by "national characters", as shown in Table 1.

_____________________________________________________________________________

  Column/Row  ASCII          German        Finnish   Norwegian     French
  04/00       at-sign        section       at-sign   at-sign       a-grave
  05/11       left-bracket   A-umlaut      A-umlaut  AE-digraph    degree
  05/12       backslash      O-umlaut      O-umlaut  O-slash       c-cedilla
  05/13       right-bracket  U-umlaut      A-circle  A-circle      section
  06/00       accent-grave   accent-grave  e-acute   accent-grave  accent-grave
  07/11       left-brace     a-umlaut      a-umlaut  ae-digraph    e-acute
  07/12       vertical-bar   o-umlaut      o-umlaut  o-slash       u-grave
  07/13       right-brace    u-umlaut      a-circle  a-circle      e-grave
  07/14       tilde          ess-zet       u-umlaut  tilde         umlaut

                 Table 1: ISO 646 Usage in Selected Countries
_____________________________________________________________________________

(See Figure 1 in Appendix B for an explanation of column/row notation.)

For example, the German phrase "Grüße aus Köln" would be rendered in ASCII as "Gr}~e aus K|ln", and the ASCII C-language phrase "{~a[x]}" would become "äßaÄxÜü" in German ISO 646. The German user would want Kermit to interpret the local file characters as German in the former case, and as ASCII in the latter.

SPECIFYING THE TRANSFER CHARACTER SET:

To select Level 1, the user enters the command:

  SET TRANSFER CHARACTER-SET <name>

where <name> is the name of a standard character set other than ASCII.

If the name is TRANSPARENT, then Kermit does no character set conversion at all, but it still may do text record format conversion. For ASCII-based systems, this is equivalent to Kermit's normal, basic mode of operation.

If a name other than TRANSPARENT is given, and FILE TYPE is set to TEXT, then Kermit will translate between the current file character set and the named transfer character set during all packet operations.

If the transfer character set is ASCII, then Kermit converts between the current file character set and 7-bit ASCII. This mode of operation is roughly equivalent to Kermit's basic mode of operation on non-ASCII based systems like IBM mainframes. If the local file character set contains accented characters, the accents are dropped in the transfer character set, for example a-acute becomes simply a. (But see SET LANGUAGE, described later.)

Other transfer character sets must be chosen from among approved national or international standards. As a starting point, the sets shown in Table 2 are recommended. The criteria for including a character set in this table are:

 1. 7-bit ASCII (= ISO-646 International Reference Version, IRV) is included, for compatibility with the original Kermit protocol and the hundreds of programs that implement it.

 2. An 8-bit single-byte character set, such as those in the ISO 8859 series, is included if it is registered as in (4) below.

 3. A multibyte character set may be included, if it is registered as in (4).

 4. The set must be listed in the ISO International Register of Character Sets under the provisions of ISO Standard 2375 (see Appendix A), so that it has a unique registration number and designating escape sequence, in order that the sending Kermit program can identify the character set to the receiving Kermit program. Allowance is made for the possibility of other registration authorities, should they appear.

 5. The set must be a national or international standard graphic character set, intended for use in computer text processing or programming (as opposed to Videotex, Teletex, OCR, device control, or other applications). This category may include line-drawing or technical character sets which fit the other criteria.

Note in particular that the national variants of ISO 646 are not included, since these are covered adequately by ASCII and the ISO Latin alphabets.
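
To illustrate why the ISO 646 national variants need not be transfer character sets, here is a minimal sketch of translating a German ISO 646 file to the LATIN1 transfer set and back. The mapping is the German column of Table 1; the function names are hypothetical and the sketch is not taken from any Kermit implementation.

```python
# German ISO 646 national replacement characters (German column of Table 1).
# Keys are the ASCII graphics that occupy the same code positions.
GERMAN_NRC = {
    "@": "\u00a7",   # 04/00 section sign
    "[": "\u00c4",   # 05/11 A-umlaut
    "\\": "\u00d6",  # 05/12 O-umlaut
    "]": "\u00dc",   # 05/13 U-umlaut
    "{": "\u00e4",   # 07/11 a-umlaut
    "|": "\u00f6",   # 07/12 o-umlaut
    "}": "\u00fc",   # 07/13 u-umlaut
    "~": "\u00df",   # 07/14 ess-zet
}
_TO_LATIN1 = str.maketrans(GERMAN_NRC)
_FROM_LATIN1 = str.maketrans({v: k for k, v in GERMAN_NRC.items()})


def german_nrc_to_latin1(data: bytes) -> bytes:
    """Sender side: file character set GERMAN, transfer character set LATIN1."""
    return data.decode("ascii").translate(_TO_LATIN1).encode("latin-1")


def latin1_to_german_nrc(data: bytes) -> bytes:
    """Receiver side: raises UnicodeEncodeError for characters that the
    7-bit German set cannot represent."""
    return data.decode("latin-1").translate(_FROM_LATIN1).encode("ascii")
```

The same bytes interpreted with the file character set ASCII would pass through unchanged, which is why the user has to tell Kermit which interpretation applies.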

Standard "Kermit names" (for use with the SET TRANSFER CHARACTER-SET command) are given to these character sets so that they may be referred to uniformly in all Kermit implementations. These names are chosen to be mnemonic, so that users don't have to remember cryptic designations like "ISO-8859-3". The choice of single words like "CYRILLIC" implies that there will not be more than one Level-1 transfer syntax for Cyrillic text. However, if these standards change in the future, it will be possible to add further identifying material to these names, e.g. "CYRILLIC2", "CYRILLIC3", etc.

The mnemonicity of the Kermit names is based upon English, as this is the language of the standards themselves. The Kermit commands are English words, and this document is written in English. Once we have solved the problem of transferring non-English text files between unlike computers, we may begin to consider the challenge of language-independent user interfaces and documentation!

_____________________________________________________________________________

                 Table 2: Standard 8-Bit Character Sets

US 7-bit ASCII, for English, Latin, Gaelic, German without umlauts or ess-zet, etc.
  Kermit name: ASCII.  ISO Registration Number: 6.
  Kermit Designator: none (this is the default transfer alphabet).

ISO 8859-1, Latin Alphabet 1, for Dutch, English, Faeroese, Finnish, French, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish, and Swedish.
  Kermit name: LATIN1.  ISO Registration Number: 100.
  Kermit Designator: I6/100.

ISO 8859-2, Latin Alphabet 2, for Albanian, Czech, English, German, Hungarian, Polish, Romanian, Serbocroatian (Croatian), Slovak, and Slovene.
  Kermit name: LATIN2.  ISO Registration Number: 101.
  Kermit Designator: I6/101.

ISO 8859-3, Latin Alphabet 3, for Afrikaans, Catalan, English, Esperanto, French, Galician, German, Italian, Maltese, and Turkish.
  Kermit name: LATIN3.  ISO Registration Number: 109.
  Kermit Designator: I6/109.

ISO 8859-4, Latin Alphabet 4, for Danish, English, Estonian, Finnish, German, Greenlandic, Lappish, Latvian, Lithuanian, Norwegian, and Swedish.
  Kermit name: LATIN4.  ISO Registration Number: 110.
  Kermit Designator: I6/110.

ISO 8859-5, the Latin/Cyrillic Alphabet, for Bulgarian, Byelorussian, English, Macedonian, Russian, Serbocroatian (Serbian), and Ukrainian (compatible with USSR GOST Standard 19768-1987 and ECMA-113).
  Kermit name: CYRILLIC.  ISO Registration Number: 144.
  Kermit Designator: I6/144.

ISO 8859-6, the Latin/Arabic Alphabet.
  Kermit name: ARABIC.  ISO Registration Number: 127.
  Kermit Designator: I6/127.

ISO 8859-7, the Latin/Greek Alphabet.
  Kermit name: GREEK.  ISO Registration Number: 126.
  Kermit Designator: I6/126.

ISO 8859-8, the Latin/Hebrew Alphabet.
  Kermit name: HEBREW.  ISO Registration Number: 138.
  Kermit Designator: I6/138.

ISO DIS 8859-9, Latin Alphabet 5, in which six Icelandic letters from Latin Alphabet 1 are replaced by six other letters needed for Turkish.
  Kermit name: LATIN5.  ISO Registration Number: 148.
  Kermit Designator: I6/148.

CSN 36 91 03, Czechoslovak Standard alphabet.
  Kermit name: CZECH.  ISO Registration Number: 139.
  Kermit Designator: I6/139.

JIS X 0201, a 1-byte code for Japanese Katakana, used in conjunction with a slightly modified ASCII (backslash is replaced by Yen sign, tilde by overbar).
  Kermit name: KATAKANA.  ISO Registration Numbers: 14 (Roman), 13 (Katakana).
  Kermit Designator: I14/13.

JIS X 0208, a 2-byte code containing Japanese Kanji, Katakana, Hiragana, Roman, Greek, and Russian characters, plus special symbols, etc. All characters in this set are displayed in double width; therefore it is commonly used in conjunction with JIS X 0201 so that Katakana and Roman characters and digits may be displayed in single width.
  Kermit name: KANJI.  ISO Registration Number: 87.
  Kermit Designator: M87.

Chinese Standard GB 2312-80, a 2-byte code for Chinese.
  Kermit name: CHINESE.  ISO Registration Number: 58.
  Kermit Designator: M58.

KS C 5601 (1987), a 2-byte code for Korean.
  Kermit name: KOREAN.  ISO Registration Number: 149.
  Kermit Designator: M149.
_____________________________________________________________________________

The ISO Latin alphabets and the Czech character set are 8-bit character sets whose left half is identical with ASCII, and whose right half contains the special characters. The ISO registration number refers only to the right half of each of these character sets. But each of these sets must be used in its entirety, because the unaccented Roman letters, the digits, and the punctuation marks appear only in the ASCII left half, which is ALWAYS (unless otherwise noted) US ASCII, ISO Registration Number 6. The Kermit character-set name refers to the two halves combined as a single set.

A particular Kermit program need not incorporate all of these character sets. In many cases, a single 8-bit character set will suffice, such as LATIN1 for Western Europe, LATIN2 for Eastern European countries with Roman-alphabet based languages, LATIN4 for Scandinavia, CYRILLIC for most of the USSR, etc.

When a language is representable in more than one character set from this table, as are English, German, Finnish, Czech, Turkish, etc., the character set highest on the list which adequately represents the language should be preferred. More precisely, when a character set other than ASCII is to be used in Kermit's transfer syntax, the ISO 8859 sets are preferred to other registered sets which contain the same characters. Within the ISO 8859 family, lower-numbered sets which contain the characters of interest are preferred to higher-numbered sets which contain the same characters. This guideline maximizes the chance that any two particular Kermit programs will recognize the same character sets.
For example, LATIN1 would be chosen for French, LATIN1 for German (because it represents German better than ASCII), LATIN5 for Turkish (because it represents Turkish better than LATIN3), KANJI or KATAKANA for Japanese (because none of the ISO 8859 sets contain Japanese characters), etc.

Unfortunately, but unavoidably, the burden of choosing the best transfer syntax character set must be placed upon the user. If a file containing a mixture of Finnish, English, and Danish must be transferred, the user must find a character set that can adequately represent all three languages, in this case Latin Alphabet 4. A table like Table 3 should be provided in the user documentation to help the user make this selection.

_____________________________________________________________________________

  Arabic      ARABIC                     Italian         LATIN1,3
  Bulgarian   CYRILLIC                   Kanji           KANJI
  Chinese     CHINESE                    Katakana        KATAKANA, KANJI
  Czech       CZECH, LATIN2              Korean          KOREAN
  Danish      LATIN4                     Latvian         LATIN4
  Dutch       LATIN1,2,3,4               Lithuanian      LATIN4
  English     ASCII,LATIN1,2,3,4,5,etc   Norwegian       LATIN1,4
  Esperanto   LATIN3                     Polish          LATIN2
  Estonian    LATIN4                     Portuguese      LATIN1
  Finnish     LATIN1,4                   Romanian        LATIN2
  Flemish     LATIN1,2,3,4,5             Russian         CYRILLIC
  French      LATIN1,3,5                *Serbocroatian   LATIN2, CYRILLIC
  German      LATIN1,2,3,4,5             Slovak          LATIN2
  Greek       GREEK                      Spanish         LATIN1
  Hebrew      HEBREW                     Swedish         LATIN1,4
  Hungarian   LATIN2                     Turkish         LATIN5,3
  Icelandic   LATIN1                     Ukrainian       CYRILLIC

            Table 3: Preferred Transfer Syntax Character Sets

  *If written in Cyrillic, this language is called Serbian. If written in Roman letters, it is called Croatian.
_____________________________________________________________________________

Note that Table 3 is only a sample. To produce a comprehensive and definitive table would require a team of language experts.
The information in the current table is based purely upon the claims made within the standards themselves, in which there is no mention of languages like Albanian, Erse, Farsi, Urdu, Welsh, Cornish, Manx, Inuit, Old Church Slavonic, Armenian, Georgian, Tagalog, Swahili, Latin, Vietnamese, etc, nor definitions of exactly what is meant by terms like "Greenlandic", "Irish", etc. Obviously, it is the intention of this proposal to support any language for which a computer character set can be standardized.

IMPLEMENTATION OF LEVEL 1

The Level-1 Kermit extension can be added to existing Kermit programs with a minimum of effort. The following steps are required for each Kermit program:

 1. Add the SET FILE TYPE { BINARY, TEXT } command, if the program doesn't have it already. SET FILE TYPE TEXT enables text-file character set conversion at all levels. SET FILE TYPE BINARY disables conversions of all kinds, but does not destroy the file and transfer character-set selections (2 and 3 below), so that a subsequent SET FILE TYPE TEXT command will still be able to use them.

 2. Add the SET FILE CHARACTER-SET command. The set of names should include ASCII or EBCDIC (as appropriate, used for program source, etc) plus the names of any "national" or special character sets that are used on this particular computer.

 3. Add the SET TRANSFER CHARACTER-SET command. The set of names should include TRANSPARENT and ASCII plus the names of one or more other standard character sets from Table 2 which contain the characters from the computer's local character set(s).

 4. Add translation tables (or functions) between each pair of character sets in (2) and (3). For each pair, two translation tables are necessary: one from the local file character set to the transfer character set, and one from the transfer set to the local one.

 5. Add SHOW commands to let the user find out what character sets are available, and which ones are currently selected, for the transfer syntax and for local files.
    The exact syntax of this command will vary. In some Kermit implementations, every SET command has a corresponding SHOW command, in which case it will be possible to SHOW FILE CHARACTER-SET and SHOW TRANSFER CHARACTER-SET. In others, related SET parameters are lumped together into broader categories for purposes of SHOW, for example SHOW FILE would show all file-related parameters; SHOW PROTOCOL would show all protocol-related parameters.

Optionally, several additional related commands may be included:

 6. The command SET LANGUAGE may be added to allow the program to apply heuristics in the translation process that would not otherwise be possible (see discussion of German below). The choices for SET LANGUAGE should include ASCII (for transferring program source code, etc) plus the names of any languages the particular Kermit implementation supports, such as ITALIAN, NORWEGIAN, PORTUGUESE.

 7. To allow for user-defined character-set translations, also add the LOAD TRANSLATION-TABLE, SHOW TRANSLATION-TABLE, and DUMP TRANSLATION-TABLE commands (described in the next section).

 8. Once the new commands and translation tables are in place, it is simple to add a TRANSLATE command, to translate a local file from one character set to another. With this command, Kermit may be used as a character-set conversion utility for local files.
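
Steps 4 and 8 above can be sketched as follows. This is a rough illustration under stated assumptions, not the method of any existing Kermit program: the 256-entry tables are generated from Python's built-in "cp850" and "latin-1" codecs rather than written out by hand, characters with no counterpart fall back to "?", and the names `build_table`, `TABLES`, and `translate` are hypothetical.

```python
def build_table(src: str, dst: str, fallback: bytes = b"?") -> bytes:
    """Build a 256-entry table: position n holds the dst code for src code n."""
    table = bytearray()
    for code in range(256):
        char = bytes([code]).decode(src)
        try:
            table += char.encode(dst)
        except UnicodeEncodeError:
            # No equivalent in the destination set: substitute the fallback.
            table += fallback
    return bytes(table)


# One table per direction for each (file set, transfer set) pair (step 4).
TABLES = {
    ("CP850", "LATIN1"): build_table("cp850", "latin-1"),
    ("LATIN1", "CP850"): build_table("latin-1", "cp850"),
}


def translate(data: bytes, from_set: str, to_set: str) -> bytes:
    """A TRANSLATE-style conversion of local data (step 8)."""
    if from_set == to_set:
        return data
    return data.translate(TABLES[(from_set, to_set)])
```

For example, `translate(b"\x8a", "CP850", "LATIN1")` maps the code page's e-grave (138) to the Latin-1 e-grave (232), while a CP850 box-drawing character has no Latin-1 equivalent and comes out as "?".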

LEVEL-1 EXAMPLE

To transfer a Finnish-language text file from a computer that uses the Finnish ISO 646 national replacement set to an IBM PS/2, and to store the file using the PS/2's Multilingual Code Page:

  On the sending computer:             On the receiving computer:

  SET FILE TYPE TEXT                   SET FILE TYPE TEXT
  SET FILE CHARACTER-SET FINNISH       SET TRANSFER CHARACTER-SET LATIN1
  SET TRANSFER CHARACTER-SET LATIN1    SET FILE CHARACTER-SET CP850
  SEND filename                        RECEIVE

To transfer a C-language source program between the same two computers:

  On the sending computer:             On the receiving computer:

  SET FILE TYPE TEXT                   SET FILE TYPE TEXT
  SET TRANSFER CHARACTER-SET ASCII     SET FILE CHARACTER-SET ASCII
  SET FILE CHARACTER-SET ASCII         SET TRANSFER CHARACTER-SET ASCII
  SEND filename                        RECEIVE

To illustrate the value of the SET LANGUAGE command, consider German text containing the Ess-Zet character and vowels with umlauts. It is acceptable to render Ess-Zet as "ss", and to render a vowel with umlaut as the same vowel without an umlaut but followed by "e". But this should not necessarily be done for languages other than German. The command SET LANGUAGE GERMAN would allow the Kermit program to perform these functions when translating from Latin-1 or German NRC into ASCII, so that "Grüße aus Köln" would become "Gruesse aus Koeln" (correct German) rather than "Gruse aus Koln" (Gruse means something entirely different from Gruesse -- something like "scum" rather than "greetings").

TRANSLATION TABLES

In many cases, translation tables will be 1-for-1. That is, the two character sets are the same size, and each character from one set can be found in the other set. In such cases, the translation table need be only a list of numbers, in which position "n" in the table contains the translation for character number "n".

In some cases, the two character sets will be the same size, but certain characters from one will be lacking in the other, and/or vice versa.
For example, IBM Code Page 850 and the Apple Macintosh sets are both "8-bit ASCII", but the IBM set lacks the Macintosh's Y-umlaut, and the Macintosh set lacks IBM's Y-acute.

In other cases the character sets will be different sizes. We have long been familiar with this problem when translating between 7-bit ASCII and 8-bit EBCDIC. In Japan, there must be translations between single-byte Roman, Greek, and Cyrillic characters and the two-byte JIS X 0208 character set.

It is recommended that translation tables built into Kermit programs be as general and useful as possible, substituting the closest possible character when an exact match is not available. For instance, when translating from French Latin-1 or NRC to ASCII, accented letters should normally be translated into the corresponding unaccented letters: a-acute becomes a, etc.

It is a matter of choice to the programmer whether translation be accomplished by tables or by functions which implement translation algorithms, or a combination of both. Functions provide maximum flexibility and tend to reduce program size, at some cost in execution overhead. Tables provide greatest speed, with generally greater cost in program size.

It is further recommended that the actual contents of each translation table eventually be specified in this standard, and that translations be invertible.

USER-DEFINED TRANSLATIONS

It should be possible for users to alter Kermit's translation tables or to add new ones, without having to change the program's source code. For example, in certain situations it might be preferable to have a-grave rendered in ASCII as "a", but in others as "`a", "a`", or even "?". It is also possible that new character sets will appear which are unknown to the Kermit program. For these reasons, a standard format is suggested for translation tables, together with a LOAD TRANSLATION-TABLE command to allow the user to add new character sets to a Kermit program's repertoire, or to alter current translations.
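
As a sketch of translation by function rather than by table, the following shows closest-character substitution into ASCII, a user-supplied override, and the German rule that SET LANGUAGE GERMAN would enable. Everything here is an assumption made for illustration: the function name, the `language` and `overrides` parameters, and the use of Unicode decomposition to find the unaccented base letter are not part of the proposal.

```python
import unicodedata

# Language-specific renderings, applied only when the language is selected.
GERMAN_RULES = {
    "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",   # a-, o-, u-umlaut
    "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",   # A-, O-, U-umlaut
    "\u00df": "ss",                                   # ess-zet
}


def to_ascii(text, language=None, overrides=None):
    """Render text in 7-bit ASCII, substituting the closest character."""
    rules = {}
    if language == "GERMAN":
        rules.update(GERMAN_RULES)
    if overrides:
        rules.update(overrides)          # user-defined translations win
    out = []
    for char in text:
        if char in rules:
            out.append(rules[char])
        elif ord(char) < 128:
            out.append(char)             # already ASCII
        else:
            # Drop the accent: keep the base letter of the decomposition.
            base = unicodedata.normalize("NFD", char)[0]
            out.append(base if ord(base) < 128 else "?")
    return "".join(out)
```

Because several source characters can map to the same ASCII string, a translation of this kind is not invertible; invertibility is only achievable between sets of equal size.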

Each table within a program is assigned an arbitrary tablename. For example, LATIN1-CP850 could be the name for the Latin-1 to CP850 table in the PS/2, and CP850-LATIN1 could be the name for the table in the other direction. To load a replacement table, the user would issue the command:

  LOAD TRANSLATION-TABLE <tablename>

where <tablename> is the name to assign to the new table. If a table with that name already exists, it is replaced.

A suggested layout for a loadable translation table is given in Appendix C. A Kermit program, upon loading one of these files, would set up the translation table, add the names of the table and of the character sets themselves to the appropriate keyword tables, and so on.

So that the translation-table related commands can also be effective for built-in translation tables, it is recommended that the built-in tables be designed in the same format as the loadable tables.

Two additional commands should be furnished to allow the user to get information about the currently loaded tables:

  SHOW TRANSLATION-TABLE

which would give summary information, and:

  DUMP TRANSLATION-TABLE

which would write out a translation table (even a built-in one) in the form shown in Appendix C, so that it could be edited and loaded again.

ATTRIBUTE PACKETS AT LEVEL 1

The objective of Kermit's Level-1 extension is to accommodate as many computers as possible with a minimum of programming effort. But this approach places a burden on the user in the form of new commands and the confusion which results if the user forgets to issue these commands.

Level 1 does not require support for Kermit File Attribute Packets, whose use is negotiated in the Kermit Initialization exchange. But the user's burden can be alleviated if the sending Kermit program uses an attribute field to inform the receiving Kermit of the character set to be used in the transfer syntax. The receiving program can accept or refuse the file based on whether it supports the specified character set.

If the receiving program refuses a file, the user can override this refusal, for example, if a long file contains only a word or two in an unknown character set. The most common user-override is the command SET ATTRIBUTES OFF. However, this also disables other desirable effects of attribute packets, such as prenotification of file size. Therefore, it is desirable to let the user specify exactly which attributes are to be "turned off", e.g. SET ATTRIBUTES CHARACTER-SET OFF.

When the transfer character set is ASCII (or TRANSPARENT when sent from an ASCII-based system), the Encoding attribute alone will suffice, with a value of "A" (for ASCII): "*!A".

In order for the sender to inform the receiver of transfer alphabets other than ASCII, a new value for the Encoding attribute ("*") is defined, namely "C", which is substituted for the normal value "A" (ASCII). "C" means that the actual character set is specified by an operand that follows it. The operand begins with a single letter that designates the character set registration authority, e.g. I for ISO, followed by a registration-authority-specific identifier, as in:

  Ixxx/yyy

where the letter "I" (for ISO) is followed by a pair of ISO registration numbers for the character set, xxx for the "left half" and yyy for the right, expressed in decimal ASCII digits, for example:

  +---+---+---+--------+
  | * | ' | C | I6/100 |
  +---+---+---+--------+

where "*" is the code for the Encoding attribute (or transfer syntax) and "'" is the length (ASCII 39 - 32 = 7) of its value. The first character of the value, "C", means "I'm using the specified character set". The six characters "I6/100" that follow mean "ISO registration number 6", i.e. US ASCII, in the left half, and ISO registration number 100, which is the right half of Latin-1, in the right. The "I" stands for ISO, and is included to allow for the possibility of other character set registration authorities.
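
The layout of the Encoding attribute described above can be checked with a short sketch. The helper names `tochar` and `unchar` follow the usual Kermit convention of adding or subtracting 32 to make a small number printable; the other two function names are hypothetical.

```python
def tochar(n: int) -> str:
    """Make a small number printable by adding 32 (7 -> "'")."""
    return chr(n + 32)


def unchar(c: str) -> int:
    """Inverse of tochar."""
    return ord(c) - 32


def encoding_attribute(designator=None) -> str:
    """Build the Encoding attribute field.

    With no designator the value is "A" (ASCII); otherwise it is "C"
    followed by the character-set designator, e.g. "I6/100".
    """
    value = "A" if designator is None else "C" + designator
    return "*" + tochar(len(value)) + value


def parse_encoding_attribute(field: str):
    """Return (encoding, designator or None) from an Encoding field."""
    if not field.startswith("*"):
        raise ValueError("not an Encoding attribute")
    length = unchar(field[1])
    value = field[2:2 + length]
    if len(value) != length or length < 1:
        raise ValueError("truncated attribute value")
    return value[0], (value[1:] or None)
```

Here `encoding_attribute("I6/100")` yields the nine characters `*'CI6/100`, and `encoding_attribute()` yields `*!A`, matching the two cases in the text.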
Designators for each character set are given in Table 2, labeled "Kermit Designator". For self-contained ISO standard multibyte character sets, the Kermit Designator starts with the letter "M", rather than "I", to indicate (a) that it is a multibyte, rather than single-byte, set and (b) that there is no "left half", i.e. "M" is always followed by a single ISO registration number.

In the event that a character set standard changes, but keeps the same registration number, the registration number for the new character set should be preceded by a non-numeric character which indicates the revision number: @ (atsign) = 1, A=2, B=3, and so on (as suggested in ISO 2022). For example "I@6/B100" would indicate an 8-bit single-byte character set having Revision 1 of ASCII as its left half and Revision 3 of Latin-1 as its right. Note: "Revision 1" does not mean the original version, but rather the first revision AFTER the original version. The Kermit designator for an original version does not have a revision indicator.

The form of the character-set designator was chosen because the standards currently provide no single code to designate an 8-bit character set in its entirety. Each half of the character set has its own registration number. For example, ISO 8859-1 (Latin-1) is a single 8-bit character set, but registration number 100 only refers to its right half. Registration number 6 denotes ASCII, which is used as the left half of all ISO 8859 character sets.

To promote maximum interoperability among extended Kermit programs, the Kermit designator should be treated as a character string, to be looked up in a small table, rather than as a flexible mechanism to be used for piecing together character sets from an arbitrary assortment of left and right halves. However, the Ixxx/yyy notation leaves open this possibility should it become desirable at a later time.
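For readers who want to see the designator syntax spelled out, here is a minimal Python sketch of a parser. The names (parse_designator, Half, Designator) are hypothetical, the grammar is inferred only from the description above, and "M87" is used purely as an illustrative multibyte designator. As the text recommends, a real implementation would more likely look the whole string up in a small table.

```python
import re
from typing import NamedTuple, Optional

# prefix letter, optional "left half" (revision + number + "/"), right half
_DESIGNATOR = re.compile(r"^([A-Z])(?:([@A-Z]?)(\d+)/)?([@A-Z]?)(\d+)$")


class Half(NamedTuple):
    registration: int  # registration number
    revision: int      # 0 = original version, 1 = first revision, ...


class Designator(NamedTuple):
    prefix: str            # registration authority or class, e.g. "I" or "M"
    left: Optional[Half]   # None when the designator has no left half
    right: Half


def _revision(letter: str) -> int:
    """Map '' to 0, '@' to 1, 'A' to 2, 'B' to 3, and so on."""
    return 0 if letter == "" else ord(letter) - ord("@") + 1


def parse_designator(text: str) -> Designator:
    match = _DESIGNATOR.match(text)
    if match is None:
        raise ValueError("not a recognizable designator: %r" % text)
    prefix, left_rev, left_num, right_rev, right_num = match.groups()
    left = None if left_num is None else Half(int(left_num), _revision(left_rev))
    return Designator(prefix, left, Half(int(right_num), _revision(right_rev)))


print(parse_designator("I6/100"))
print(parse_designator("I@6/B100"))
print(parse_designator("M87"))
```

The second call reports revision 1 for the left half and revision 3 for the right, following the @ = 1, A = 2, B = 3 convention.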
In the event that a new class of registration numbers appears, for example, to denote a single-byte 8-bit character set in its entirety rather than just its left or right half, a different initial letter will be used in the designator, even if the registration authority is the ISO. In the event that other character-set registration authorities appear, they too can be assigned their own unique Kermit designator prefixes (for example, "K" for Kermit Development and Distribution), to avoid ambiguity from conflict of registration numbers. For the present, standards organizations like ANSI and CCITT are not treated as separate registration authorities, because their character sets are also registered by the ISO. Should these organizations adopt character sets that have no ISO counterpart, then special Kermit designator prefixes will be assigned for them. Based on the attribute information, the receiver may accept or reject the file, using Kermit's normal attribute response mechanism. To accept, it puts a "Y" as the first character of the data field of the acknowledgement to the attribute packet. To refuse, it puts an "N" instead of a "Y", followed by "*". If the file is refused in this manner, the sending Kermit should respond by sending a "Z" (end-of-file) packet containing a "D" (for Discard) in its data field. The behavior of the receiving Kermit program when an unknown character set is announced to it is governed by the command SET UNKNOWN-CHARACTER-SET. SET UNKNOWN-CHARACTER-SET KEEP means that it should not reject the file, but store it the best way it can (e.g. without translating any characters), DISCARD means that the file should be rejected. It is recognized that there are presently Kermit implementations in the USSR, Japan, and elsewhere that use character sets other than the ones listed in Table 2 in their transfer syntax, and/or sets that are not listed in the International Register. 
It is recommended that these Kermit programs be converted to use the recommended standard character sets, or if there is a strong reason why this cannot be done, that these character sets be registered with Kermit Development and Distribution at Columbia University.

LEVEL 1 PERFORMANCE

Level 1 can be used to transfer files containing special characters when character-set switching is not required. However, Level-1 transfer will not always be efficient. Since the special characters have their 8th bits set to one, there will be a lot of 8th-bit prefixing in the 7-bit environment -- the higher the proportion of special characters to ASCII characters, the lower the efficiency. For a language like Russian, in which all letters come from the right half of the character set, efficiency will be poor in the 7-bit environment.

Therefore, even though Russian (and Greek, Hebrew, and Arabic) are served by Level 1 of this proposal, files encoded in these character sets can be transferred more efficiently using the facilities of Level 2 in the 7-bit communication environment. See Table 4.

In the future, a separate proposal will address this problem in a general way, independent of transfer syntax, by specifying a locking shift mechanism to be used in Kermit's packet encoding.

EXPANDED SYNTAX, LEVEL 2: MULTIPLE CHARACTER SETS

DO NOT PAY MUCH ATTENTION TO THIS SECTION. IT WILL PROBABLY NEVER BE IMPLEMENTED IN ITS PRESENT FORM BECAUSE COMPUTER FILE SYSTEMS SIMPLY DO NOT EXIST THAT ALLOW MIXTURES OF CHARACTER SETS WITHIN A SINGLE FLAT FILE. IN ANY CASE, THE EMERGING ISO-10646 AND UNICODE STANDARDS MAY RENDER THIS DISCUSSION OBSOLETE ANYWAY.

Suppose there is a computer that can store a file containing characters from many languages. It may do this by using a multibyte character code, or by embedding some kind of control information in the file to mark each change of character set.

One such computer is the Xerox Star and its successors, described by Joseph D.
Becker in the Scientific American articles "Multilingual Word Processing" (July 1984) and "Arabic Word Processing" (July 1987). The Star stores textual data intermixed with special codes. A byte of all 1's means "alphabet shift", followed by another byte or two to identify the alphabet. Another, more limited, example is a computer using one of the AT&T Extended UNIX Codes (EUC), such as JAE for Japan. In this code, a byte with its high order bit set to zero is ASCII. If it is set to one, then it is either a 1-byte Kanji or 1-byte Katakana character, or (if it has a certain special value) a shift character indicating that the next two bytes are a Kanji character. (See N. Takahashi and W. Krone, "The Language Problem", UNIX Review, February 1987.) A third example is an IBM PC or PS/2 running a commercial word processor which uses the PC's graphics adapter to display characters from different alphabets (Roman, Greek, etc) in different renditions (bold, italic, underlined). A multilanguage word processor file may contain not only alphabet information, but also formatting and rendition information. The format of these word processor files is proprietary, and differs from product to product. A final example is a simple IBM PC "8-bit ASCII" text file which also contains the PC's line-drawing characters. These characters have no equivalents in the ISO Latin alphabets, and so at least two standard character sets would be required to represent these files during transmission. Now suppose we want to transfer a multi-language text file from one computer to a different kind of computer. Since there will be a growing need to do this, and a growing number of computers and applications that will support multi-language text in incompatible ways, it is clearly impractical to require each computer to know the formats and codes of each other computer. 
Once again, a standard common intermediate representation, or transfer syntax, is required so that each Kermit program need only know the codes and formats used on its own computer, plus the transfer syntax. But unlike Kermit's normal transfer syntax, and unlike Kermit's Level-1 extended transfer syntax, the multi-language syntax must embody an in-band mechanism for identifying character sets and switching among them.

Fortunately, these mechanisms are already well-defined in the host-terminal communications environment, and they can be readily adapted to Kermit file transfer. The mechanisms proposed are defined in the following international standards:

  ISO 4873, "Information Processing - ISO 8-bit code for information interchange - Structure and rules for implementation"

  ISO 2022, "Information Processing - ISO 7-bit and 8-bit coded character sets - Code extension techniques"

  ISO 2375, "Data Processing - Procedure for registration of Escape Sequences"

  ISO "International Register of coded Character Sets to be Used with Escape Sequences"

These standards are summarized in Appendix B, "How the Standards Work". The following discussion assumes familiarity with these standards, so consult the appendix if necessary.

KERMIT MULTI-CHARACTER-SET FILE TRANSFER

Level 2 Kermit syntax is intended for transferring multilanguage files that cannot be adequately represented in a single standard character set. The new "international" transfer syntax preannounces character sets by their ISO registration numbers, designates them by their registered escape sequences, and invokes them by single or locking shifts as defined in ISO Standard 2022.

ENABLING LEVEL-2 TRANSFER SYNTAX

Level-2 transfer syntax is selected when the user issues the command SET TRANSFER INTERNATIONAL. The user may return to Levels 0 or 1 using the SET TRANSFER CHARACTER-SET command, or SET TRANSFER NORMAL (see Appendix F).
PROTOCOL PRIOR TO DATA TRANSFER

It is strongly recommended that any Kermit program which is to use international syntax also support file attribute (A) packets. These are used for two purposes: (1) to inform the receiver that international syntax will be used and with which ISO-2022 facilities, and (2) to preannounce the file's character sets. This will give the receiver the opportunity to refuse files that it cannot translate, and to allocate the necessary resources for those files which it can accept.

In Level 2, the value of the encoding attribute "*" should be the uppercase letter "I" (for International), optionally followed by one or more ISO-2022 announcer letters (the letter after <ESC><SP>), as listed in Appendix B, for example "IBZ" to declare that G0 and G1 will be used with locking shifts, and G2 with single shifts.

  +---+---+-----+
  | * | # | IBZ |
  +---+---+-----+

In addition, the sender may (but is not required to) preannounce the transfer syntax character sets by listing them in the new attribute, "2", "Character Sets". The value of the character-sets attribute is a comma-separated list of Kermit character set designators. For example:

  +---+---+----------------------+
  | 2 | 4 | I2/100,I2/127,I2/144 |
  +---+---+----------------------+

where "2" is the character-set attribute, "4" is the length of the following value (in this case "4" = ASCII 52 - 32 = 20) and the next 20 bytes list ISO character sets numbers 100, 127, and 144, each number prefixed by "I" to denote the ISO registration authority, and "2/" to indicate that the left half of the character set is ASCII.

If the sending Kermit can ascertain a file's character sets easily, it should send this information in the attribute packet. Otherwise preannouncement of character sets could require a time-consuming scan through the file prior to sending, which is undesirable for large files not only because it reduces Kermit's efficiency, but also because it could cause the entire Kermit session to time out.
Therefore, preannouncement of character sets is not required.

The receiver may accept or refuse a file using Kermit's normal attribute reply mechanism. When accepting the file, it should include, at minimum, the "*" attribute in its acceptance, so that the sender will know that the receiver understands international syntax. When refusing a file, it should indicate what caused the problem: "*" means it can't do international transfer syntax, but "2" (without "*") means that one or more of the announced character sets are unknown. If the file is refused in this manner, the sending Kermit can issue an informative message to the user, and the user can find some other way to transfer the file (for example, binary mode, or normal text mode with pre- and postprocessing, or even by loading a new translation table).

DATA TRANSFER PROTOCOL

Transfer of a multi-character-set text file in international transfer syntax by Kermit is similar to transfer of a 7-bit ASCII text file, except that it may contain embedded control characters and escape sequences to identify and switch between character sets.

The file sender translates the file's characters (if necessary) into one or more registered alphabets, embedding character-set designation and shifting codes in the data stream, and terminates lines of text (records) with CRLF as in ASCII text mode. The file receiver translates from international transfer syntax into the format demanded by the local system or application. All of this occurs before Kermit packet encoding by the sender, and after Kermit packet decoding by the receiver.

ISO 2022 states that "at the beginning of information interchange, except where the interchanging parties have agreed otherwise, all designations shall be defined by use of the appropriate escape sequences, and the shift status shall be defined by the use of the appropriate locking-shift functions."
Kermit programs should "agree otherwise" that the default G0 character set is the US-ASCII/ISO-646-IRV (International Reference Version) 7-bit character set; thus international transfer syntax can be identical to Normal Kermit transfer syntax when transferring 7-bit text files. There are no defaults for G1, G2, or G3, in the interest of fairness to all countries and peoples.

When the text contains characters outside the ASCII range, an escape sequence from Table 5 must be issued, designating the alphabet to which they belong (using the identification letters shown in Table 5) to the desired intermediate character set G0, G1, G2, or G3. This sequence must be given before the first occurrence of a character in that alphabet. If no such sequence is given, then the receiver treats all characters as ASCII data, including <ESC>, the shift characters, and bytes with their 8th bits set to one. In other words, the file transfer behaves in the normal Kermit fashion for text files.

Since ISO 8859 character sets are subject to revision from time to time, an alphabet selector may be preceded by <ESC>&F, where F is the revision number (@ = 1, A = 2, B = 3, etc). For example, <ESC>&@<ESC>-A means Latin Alphabet Number One, Revision One. (This information is from ISO 2022 6.3.13.)

ISO 2022 escape sequences are inserted into the data, and are indistinguishable by the Kermit packet encoder/decoder from the data itself. Therefore these escape sequences may be broken across packets, just as any other data may be.

UNKNOWN ALPHABETS

It is not required that the sender preannounce all of a file's character sets prior to transfer. Suppose a file contains a mixture of alphabets, some known to the receiver, others not. At some point, an alphabet designator arrives which the receiving Kermit does not recognize. Should the receiving Kermit cancel the file transfer, or accept the unknown code? A new command is provided to let the user control what happens in this situation: SET UNKNOWN-ALPHABET {KEEP, CANCEL}.
If the user elects CANCEL, then the receiver will behave as if the user had manually cancelled the file, i.e. it will put the character "X" in the data field of its next acknowledgement, and the sender (assuming it supports this feature) will stop sending the file.

If the user elects KEEP, the file will be accepted in its entirety. But the unknown code should be marked in case the user wants to fix it afterwards. To do this, the receiving program accepts the designator for the unknown alphabet and stores it in the file as data, with subsequent characters stored untranslated. When the unknown character set is shifted out of (or the end of file arrives), the receiving Kermit stores the ISO-2022 Coding Method Delimiter, <ESC>d, and resumes translation. If the unknown alphabet is shifted back into, the designating escape sequence is stored again, and the process resumes. A list of the designators of the unknown alphabets should be recorded in the transaction log (if there is one), for later reference.

The default behavior should be "KEEP". This command should also be effective at Level 1, where it would simply prevent the receiving Kermit from refusing a file on the basis of the character set used to transfer it.

LOCAL FILE REPRESENTATION

This proposal assumes nothing about the representation of the file on the local storage medium. It may be ASCII, EBCDIC, a proprietary word processor format, IBM code page, or anything else. It is an implementation "detail" for the Kermit programmer to convert between the local file representation for multi-alphabet text files, and Kermit's file transfer syntax.

In some cases, the file itself (or its directory entry) might contain the necessary identifying information, in which case the sending Kermit program can automatically emit the appropriate escape sequences during file transfer. In others, the user will have to tell the sending program how the file is encoded.
The suggested command is:

  SET FILE TYPE <type>

where <type> specifies how the file is (or when receiving, is to be) encoded on disk. This will necessarily be highly dependent on the system's conventions, or the conventions of the applications to be used with the file (e.g. a multi-language word processing program). Possibilities for <type> might include application names like WORDPERFECT, XYWRITE, NOTA-BENE, MACWRITE, ALEPH-BET, PC-HANGUL.

BREAKING THE RULES

If the local file is not encoded according to ISO 2022 rules, it may contain <ESC>, <SO>, and <SI> characters. It is up to the Kermit program to know what these characters mean in the context of the file's format, and to either strip them from the file or translate them to something else. The ISO 2022 rules forbid the use of these characters as data to be transferred.

If a file is to be transferred using international syntax, and it contains any of the control characters significant to this syntax, namely <ESC>, <SO>, <SI>, <SS2>, or <SS3>, then such characters should be prefixed during transmission with Datalink Escape, <DLE>, C0 character 01/00 (Control-P). Furthermore, if <DLE> itself occurs in the data, it should also be prefixed with <DLE>.

All shifting and escape characters are subject to normal Kermit encoding rules. Therefore, if a file contains an <SO> character, it must be sent as <DLE><SO>, normally "#P#N". If it contains an <SS2>, and it is to be transmitted in the 7-bit environment, the encoding will be <DLE> followed by <8th-bit-prefix> and the 7-bit form of <SS2> (normally "#P&#N") -- that is, five characters to transmit one!

Also note that a file containing data that happens to correspond to a character-set designator (e.g. "<ESC>-X") could confound later efforts at reconstruction when SET UNKNOWN-ALPHABET KEEP is in effect. In this circumstance, this character sequence can be distinguished from genuine alphabet designators that were inserted into the file by the SET UNKNOWN-ALPHABET KEEP feature in one of two ways: (1) by lacking a matching <ESC>d (coding method delimiter), or (2) by not being listed in the Kermit transaction log.

LEVEL-2 PERFORMANCE

Kermit programs may use the full range of ISO 2022 code extension techniques, including use of G0, G1, G2, and G3 in both the 7-bit and 8-bit environments, with both single-byte and multibyte character sets. In the general case, G0 will be used for ASCII and English, G1 for the special characters of the "native language" of the local country or region, G2 for a third language, and G3 for a fourth. Additional character sets may be swapped in and out of G2 and G3 as required.

Transmission of 8-bit data in the 7-bit environment is accomplished by Kermit using 8th-bit prefixing, which is an optional feature of the Kermit protocol. However, most popular implementations of Kermit do include this feature. If a Kermit program cannot do 8th-bit prefixing, then it must operate in the ISO 2022 7-bit environment, shifting GL among the intermediate graphics sets G0-G3.

If the Kermit program can do 8th-bit prefixing, the choice of the ISO 2022 7-bit or 8-bit environment is entirely independent of the communication channel. Selection of the ISO 2022 7-bit or 8-bit environment should be made on other grounds, such as transmission efficiency or program simplicity. For example, if the ISO 2022 8-bit environment is used on a 7-bit channel, then Kermit will have to do 8th-bit prefixing.
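To make the efficiency trade-off concrete, the following Python sketch counts the characters Kermit would place in its packets for the same data sent two ways on a 7-bit channel: in the ISO 2022 8-bit environment (every right-half byte costs an 8th-bit prefix) and in the 7-bit environment (SO/SI locking shifts, which are themselves control-prefixed). It is a simplified illustration, not a full packet encoder. The function names are hypothetical, "#" and "&" are assumed as the control and 8th-bit prefixes, repeat-count prefixing is ignored, G1 is assumed to be designated already, and every byte below 0x80 is treated as belonging to G0.

```python
CTL_PREFIX = "#"    # assumed control prefix
BIT8_PREFIX = "&"   # assumed 8th-bit prefix
SO, SI = 0x0E, 0x0F


def kermit_encode_7bit(data: bytes) -> str:
    """Apply 8th-bit and control prefixing, as needed on a 7-bit channel."""
    out = []
    for b in data:
        if b & 0x80:                      # right-half byte: prefix it, strip the bit
            out.append(BIT8_PREFIX)
            b &= 0x7F
        ch = chr(b)
        if b < 0x20 or b == 0x7F:         # control character: prefix and "controllify"
            out.append(CTL_PREFIX + chr(b ^ 0x40))
        elif ch in (CTL_PREFIX, BIT8_PREFIX):
            out.append(CTL_PREFIX + ch)   # a literal prefix character must be quoted
        else:
            out.append(ch)
    return "".join(out)


def locking_shift_7bit(data: bytes) -> bytes:
    """Rewrite 8-bit data for the 7-bit environment using SO/SI locking shifts."""
    out = bytearray()
    shifted = False
    for b in data:
        high = bool(b & 0x80)
        if high != shifted:
            out.append(SO if high else SI)
            shifted = high
        out.append(b & 0x7F)
    if shifted:
        out.append(SI)                    # leave the stream shifted back in to G0
    return bytes(out)


german = b"gef\xe4hrlich"                                # Latin-1, one accented letter
russian = bytes(b | 0x80 for b in b"`PW^gP`^RP]]kY")     # Latin/Cyrillic, all right-half

for name, text in (("German", german), ("Russian", russian)):
    prefixed = kermit_encode_7bit(text)
    shifted = kermit_encode_7bit(locking_shift_7bit(text))
    print(name, len(prefixed), len(shifted))
# German 11 14
# Russian 28 18
```

Under these assumptions an isolated accented letter is cheaper with 8th-bit prefixing, while a run of right-half characters is cheaper with locking shifts. The same routine also yields "#P#N" for a DLE-prefixed SO and "#P&#N" for a DLE-prefixed SS2.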
On a 7-bit communication channel, the best choice of ISO 7-bit or 8-bit environment depends on the nature of the data to be transferred. If there is little or no 8-bit data (as in English text), it doesn't matter. If there is frequent shifting between 7-bit and 8-bit characters (as in French or Portuguese), then single shifts would tend to be more efficient than locking shifts, and Kermit's 8th-bit prefixing is equivalent to a single shift. Therefore, use the ISO 8-bit environment and let Kermit do the prefixing. If there are long strings of 8-bit characters, as in "right-sided" languages like Russian, Greek, Arabic, and Hebrew, then locking shifts are more efficient -- use the ISO 7-bit environment. In Japan, many computer systems use at least three character sets, Roman (close to ASCII), Katakana (a 1-byte code), and Kanji (a 2-byte code). Kanji is specified in JIS X 0208, which also includes Roman, Hiragana, Katakana, and some other character sets, but these are double width and not normally used. Roman characters are usually taken from the left half of JIS X 0201, and Katakana from the right half. Japanese text frequently shifts between Roman, Kana, and Kanji, and therefore requires three active character sets, for example G0 (Roman), G1 (Kana), and G2 or G3 (Kanji). In the 8-bit environment, data transfer can be quite efficient: locking shifts are used to shift GL between Roman and Kana, and any bytes with the 8th bit set to one automatically invoke Kanji in GR as a multi-byte character set. In the 7-bit environment, locking shifts would also be used to select Kanji. Note that locking shifts are more efficient in this case than Kermit 8th-bit prefixing because Kanji characters consist of more than one byte, and tend to occur in runs. For Japanese, therefore, it is better to use the ISO 7-bit environment on a 7-bit communication channel. The situation is summarized in Table 4. 
_____________________________________________________________________________

                            ISO 2022 Environment
                   7-bit                          8-bit
        +------------------------------+-----------------------------+
        | Recommended for right-       | Recommended for 2-sided     |
 7-bit  | sided languages like Greek,  | languages like French,      |
 data   | Russian, Arabic, Hebrew.     | German, etc.  Use Kermit's  |
 path   | Use ISO 2022 locking shifts. | 8th-bit prefix for special  |
        | Also for Japanese.           | characters.                 |
        +------------------------------+-----------------------------+
        | No reason to use ISO 7-bit   | Clear transmission of 8-bit |
 8-bit  | environment on a clear 8-bit | characters.  Use for both   |
 data   | communication channel.       | left- and right-sided       |
 path   | OK for 7-bit ASCII, though.  | languages, and Japanese.    |
        |                              |                             |
        +------------------------------+-----------------------------+

              Table 4:  Selecting ISO 7- vs 8-Bit Environment
_____________________________________________________________________________

The user should have control over whether the ISO-2022 7-bit or 8-bit environment is used. To allow this, the command SET TRANSFER INTERNATIONAL may be extended as follows:

  SET TRANSFER INTERNATIONAL [ {7, 8} ]

which means that an optional final field may be included to specify the 7- or 8-bit ISO-2022 environment. The default should be 8, since it is the most efficient method in most cases.

If Kermit -- at all levels -- offered locking shifts in addition to single shifts, then international syntax could always proceed in the 8-bit environment, and this would simplify implementation considerably. A proposal on locking shifts for Kermit is forthcoming.

FILE TRANSFER SYNTAX EXAMPLES

A simple 7-bit ASCII text file can be transmitted in the normal Kermit manner for text files, without any escapes or shifts, even in international syntax.
A text file containing characters from a language or languages covered by a single alphabet other than ASCII can be transferred exactly like an ASCII text file, except that the attribute, if used, would denote the character set, e.g. "*!C2&I2/100" for Latin-1. In the 7-bit environment, international syntax can be used to cut down on Kermit's 8th-bit prefixing overhead, in which case the attributes might look like "*#IBJ2&I2/144", and any strings of GR characters would be preceded by LS1 and transmitted with their high-order bits set to zero.

A multi-character-set text file will require an escape sequence to identify each alphabet. The attribute packet would show international encoding, optionally including the ISO 2022 facilities announcers, and the character sets, as in "*#ICK2-I2/100,I2/144".

In the 7-bit environment, <SO> and <SI> are used to shift between the G0 and G1 sets. In the absence of any specific designators, the G0 set is presumed to be ASCII. Example:

  A dangerous German word is "gef<ESC>-A<SO>d<SI>hrlich".

In this case, the only extended character is the umlaut-a in "gefaehrlich" (where ae is how to write umlaut-a without an umlaut). <ESC>-A designates Latin-1 into G1, <SO> shifts GL out to G1, "d" is the left-half equivalent of umlaut-a, and <SI> shifts GL back in to G0.

For clarity and consistency with the ISO-2022 recommendations, it is recommended that the text begin with explicit character set designations, and then explicitly shift into the G0 set, rather than defaulting to it:

  <ESC>(B<ESC>-A<SI>A dangerous German word is "gef<SO>d<SI>hrlich".

A text file containing characters from multiple ISO 8859 alphabets requires a designation sequence for each alphabet. In the 7-bit environment, SO and SI can be used to shift between G0 and G1 of the current alphabet, and <ESC>(B can be used to select G0 of any of the alphabets, since these are all the same. For example, the following text contains the same word in English, French, and Russian:

  <ESC>-ADisappointed, d<SO>ig<SI>u, <ESC>-L<SO>`PW^gP`^RP]]kY<SI>.
The first escape sequence assigns Latin Alphabet No. 1 to G1, and the subsequent <SO> and <SI> shifts apply to its G0 and G1 set, which is used to form the English and French words. The second escape sequence assigns the Latin/Cyrillic 96-character set to G1, and the subsequent shifts apply to this new set.

Another 7-bit example, in which the same word is repeated in English, Russian, and German, shows how a locking shift remains in effect when the alphabet is changed. We begin in Latin/Cyrillic, start with an English word from G0, shift to G1 for the Russian word, and while still in G1 switch to Latin Alphabet No. 1 for German to get the umlaut-A at the beginning of Aenderung (where Ae = umlaut-uppercase-A), and shift back to G0 for the rest of the word:

  <ESC>-LAlteration <SO>_U`UTU[ZP <ESC>-AD<SI>nderung.

Some rules and hints to remember:

 1. In the 8-bit communication environment, always use 8-bit character transmission -- it's more efficient.

 2. There can be no more than four character sets designated at one time. Generally designate ASCII to G0, the most frequently-used non-ASCII set to G2, less frequently used sets to G3 and G1. If a file has more than four sets, swap the least frequently used sets in and out of G3 and G1.

 3. Single shifts can only be used with G2 and G3. This is why G2 and G3 are preferred to G1.

 4. Only two character sets can be invoked at once in the 8-bit communication environment, and only one in the 7-bit environment.

SPECIAL EFFECTS

Today, most multi-alphabet files are produced by proprietary text processing programs. These programs have many functions besides switching among alphabets. They may also endow text with special attributes such as boldface, italic, underline, super- or subscript, color, etc, and render characters in a variety of type styles and sizes. Each text processing program may have its own unique formats and conventions. These special effects are not addressed by this proposal.
Nevertheless, it is likely that a multi-alphabet file produced by a text processing program also contains special effects. In order for a Kermit program to send a multi-alphabet file, it must have detailed knowledge of the file's format and coding conventions. Therefore, the Kermit program should be able to strip out the special effects, and send only the text. Otherwise the result would be meaningless when received on an unlike system or for use with a different application. (When transferring such files between like systems or compatible applications, Kermit binary mode transfers will suffice.)

At some future time, it might be possible to adapt one of the popular document description languages to the Kermit protocol, so that Kermit will be able to transfer formatted documents between unlike systems and applications. Presently, there are many competing would-be standards including IBM DCA and DIA, DEC DDIF, US Navy DIF, and Postscript. There are also two ISO standards emerging in this area: Standard Generalized Markup Language (ISO 8879, 9069, and 9573), and Office Document Architecture (ISO 8613). This is an area for further study.

TERMINAL EMULATION

While not part of the Kermit file transfer protocol, terminal emulation is a feature of many Kermit programs. It is hoped that these terminal emulators will evolve along the lines of the ISO standards mentioned above. In some cases, this is already a fact, insofar as DEC VT300 series terminals already follow these standards and Kermit programs are beginning to emulate these terminals.

Kermit should be as easy to use as possible, but should still give the user the ability to specify exactly what character codes are in use for both terminal emulation and file transfer. There should also be a consistent set of commands for all Kermit programs.
APPENDIX A: STANDARDS

ANSI X3.4 (1986), "Coded Character Sets - 7-bit American Standard Code for Information Interchange" (US ASCII), is the 7-bit code currently used by Kermit for transferring text files.

ISO 646 (1983) (= ECMA-6), "Information Processing - ISO 7-bit Coded Character Sets for Information Interchange", gives us a 7-bit character set equivalent to ASCII with provision for substituting "national characters" in selected positions.

ISO 4873 (1986) (= ECMA-43), "Information Processing - ISO 8-bit Code for Information Interchange - Structure and Rules for Implementation", defines 8-bit character sets, their graphic and control regions, and how to extend an 8-bit character set by using multiple intermediate graphics sets.

ISO 2022 (1986) (= ECMA-35), "Information Processing - ISO 7-bit and 8-bit Coded Character Sets - Code Extension Techniques", describes how to use 8-bit character sets in both 7-bit and 8-bit environments, and how to switch among different character sets and alphabets.

ISO International Register of Coded Character Sets to be Used with Escape Sequences. This is the source of the ISO registration numbers.

ISO 2375 (1985), "Data Processing - Procedure for Registration of Escape Sequences". The procedure by which a character set gets into the above register and has a registration number and designating escape sequence assigned to it.

JIS X 0202, "Code Extension Techniques for Use with the Code for Information Interchange", the Japanese counterpart of ISO 2022.

ANSI X3.41-1974, "Code Extension Techniques for Use with the 7-Bit Coded Character Set of the American National Standard Code for Information Interchange", describes 7- and 8-bit codes and extension techniques in approximately the same manner as ISO 4873 and ISO 2022.

ISO 8859 (1987-present) (see Table 6 for ECMA equivalents), "Information Processing - 8-Bit Single-Byte Coded Graphic Character Sets", defines the actual 8-bit character sets to be used for many of the world's languages.
The left half of each of these is the same as ASCII and ISO 646 IRV. Each character, including those with diacritics, is represented by a single byte.

ISO is the International Organization for Standardization, ANSI is the American National Standards Institute, ECMA is the European Computer Manufacturers Association. JIS means Japanese Industrial Standard.

The ISO/ECMA standards discussed in this proposal may be obtained free of charge in their ECMA form by writing to:

  ECMA Headquarters
  Rue du Rhone 114
  CH-1204 Geneva
  SWITZERLAND

Be sure to specify the title and the ECMA number of each standard requested.

In general, the ISO member body from each country acts as the local sales agent for ISO Standards in that country, for example ANSI in the USA:

  Sales Department
  American National Standards Institute
  1430 Broadway
  New York, NY 10018
  Telephone 212-354-3300

Each such organization has its own arrangements for disseminating printed documents. ANSI sells them for US dollars; organizations in other countries may either sell them for local currency or give them away, depending on how they are funded to operate.

ISO standards and CCITT recommendations can also be ordered from the UN bookstore, but not free of charge:

  United Nations Bookstore
  United Nations Building
  New York, NY 10017

CCITT recommendations are also available from ANSI.

APPENDIX B: HOW THE STANDARDS WORK

ASCII and ISO 646 give us a 128-character 7-bit character set. This set is divided into two parts:

  1. 33 "control characters" (characters 0 through 31, and character 127).

  2. 95 "graphic characters" (32-126). "Graphics" means printing characters -- characters that make ink appear on the page or phosphor glow on the screen (as opposed to pixel- or line-oriented picture graphics), plus the space character.

The ASCII / ISO-646 IRV character set is shown in Figure 1, arranged in a table of 16 rows and 8 columns.
_____________________________________________________________________________

        00  01  02  03  04  05  06  07
      +---+---+---+---+---+---+---+---+
   00 |NUL DLE| SP  0   @   P   `   p |
   01 |SOH DC1| !   1   A   Q   a   q |
   02 |STX DC2| "   2   B   R   b   r |
   03 |ETX DC3| #   3   C   S   c   s |
   04 |EOT DC4| $   4   D   T   d   t |
   05 |ENQ NAK| %   5   E   U   e   u |
   06 |ACK SYN| &   6   F   V   f   v |
   07 |BEL ETB| '   7   G   W   g   w |
   08 |BS  CAN| (   8   H   X   h   x |
   09 |HT  EM | )   9   I   Y   i   y |
   10 |LF  SUB| *   :   J   Z   j   z |
   11 |VT  ESC| +   ;   K   [   k   { |
   12 |FF  FS | ,   <   L   \   l   | |
   13 |CR  GS | -   =   M   ]   m   } |
   14 |SO  RS | .   >   N   ^   n   ~ |
   15 |SI  US | /   ?   O   _   o  DEL|
      +---+---+---+---+---+---+---+---+

   Figure 1: The ASCII / ISO-646 International Reference Version
             7-bit Character Set
_____________________________________________________________________________

Characters are often referred to by their column and row position in this type of table. For example, character 05/08 in Figure 1 is "X". Columns 00-01, plus character 07/15, comprise the control set. Columns 02-07, minus character 07/15, comprise the graphics.

8-bit character sets are described in ISO 4873. An 8-bit character set has two sides. Each side has a control set and a graphics set. The "left half" consists of the control set C0 and the graphics set GL (Graphics Left). GL has 94 characters, and corresponds to ASCII (and ISO 646 IRV) positions 02/01-07/14. SP (space) and DEL are not considered part of GL. All the characters in the left half have their high-order, or 8th, bit set to zero, and are therefore representable in 7 bits.

The "right half" consists of the control set C1 and the graphics set GR (Graphics Right). All characters in the right half have their 8th bits set to one. Figure 2 shows the layout of an 8-bit character set.
_____________________________________________________________________________

       <--C0--> <---------GL---------->    <--C1--> <---------GR---------->
        00  01  02  03  04  05  06  07      08  09  10  11  12  13  14  15
      +---+---+---+---+---+---+---+---+   +---+---+---+---+---+---+---+---+
   00 |NUL DLE| SP  0   @   P   `   p |   |    DCS|---+                   |
   01 |SOH DC1| !   1   A   Q   a   q |   |    PU1|                       |
   02 |STX DC2| "   2   B   R   b   r |   |    PU2|                       |
   03 |ETX DC3| #   3   C   S   c   s |   |    STS|                       |
   04 |EOT DC4| $   4   D   T   d   t |   |IND CCH|                       |
   05 |ENQ NAK| %   5   E   U   e   u |   |NEL MW |                       |
   06 |ACK SYN| &   6   F   V   f   v |   |SSA SPA|                       |
   07 |BEL ETB| '   7   G   W   g   w |   |ESA EPA|                       |
   08 |BS  CAN| (   8   H   X   h   x |   |HTS    |      (special         |
   09 |HT  EM | )   9   I   Y   i   y |   |HTJ    |       graphics)       |
   10 |LF  SUB| *   :   J   Z   j   z |   |VTS    |                       |
   11 |VT  ESC| +   ;   K   [   k   { |   |PLD CSI|                       |
   12 |FF  FS | ,   <   L   \   l   | |   |PLU ST |                       |
   13 |CR  GS | -   =   M   ]   m   } |   |RI  OSC|                       |
   14 |SO  RS | .   >   N   ^   n   ~ |   |SS2 PM |                       |
   15 |SI  US | /   ?   O   _   o  DEL|   |SS3 APC|                   +---|
      +---+---+---+---+---+---+---+---+   +---+---+---+---+---+---+---+---+
       <--C0--> <---------GL---------->    <--C1--> <---------GR---------->

   Figure 2: An 8-Bit Character Set
_____________________________________________________________________________

GR character sets can have either 94 or 96 characters. A 94-character GR set begins in position 10/01 and ends in position 15/14, with Space (SP) occupying position 10/00 and DEL in position 15/15, just like GL (the corners shown in GR in the diagram). A 96-character set has graphic characters in all 96 positions, 10/00 through 15/15.

An 8-bit alphabet, therefore, has up to 94 + 96 = 190 graphic characters. This number is sufficient to represent the characters in many of the world's written languages, but not necessarily sufficient to represent all the graphic symbols required in a given application, for instance a multi-language document.

To represent a greater number of graphic characters, ISO 4873 defines four "intermediate sets" of graphic characters, of either 94 or 96 characters each. These are called G0, G1, G2, and G3.
The G0 set never has more than 94 graphic characters, and G1-G3 can have up to 96 each. Therefore there can be up to:

  94 + (3 x 96) = 382

graphics characters simultaneously within the repertoire of a given device.

These intermediate graphics sets are kept in tables in the memory of the terminal or computer. One of the intermediate sets (usually G0) is assigned to GL, and (in the 8-bit communications environment) another may be assigned to GR. When the terminal or computer receives a data byte, the numeric value of its bits denotes the position of the character in GL or GR. For example, the byte 01000001 binary = 65 decimal = 04/01 = uppercase A in ASCII. In the 8-bit environment, any byte with its 8th bit set to zero is from GL, and a byte with its 8th bit set to one is from GR.

A language like English can be represented adequately by ASCII in GL, because all the required characters fit there. When a language has more than 94 characters, two techniques are used to represent all the characters:

  1. For alphabetic languages, put ASCII (or the ISO-646 IRV) in GL and the special characters (like accented letters) in GR. French, German, and Russian are examples.

  2. For languages with many symbols (e.g. where a symbol is assigned to each word, rather than to each sound), represent each character with multiple bytes rather than one byte. Japanese Kanji, for example, uses a 2-byte code. A multibyte code may be assigned to G0, G1, G2, or G3, just like a single-byte code.

How do we assign actual character sets to G0-G3, and how do we associate the intermediate character sets with the active character set? Selection of character sets is accomplished using special control characters and escape sequences embedded within the data stream as described in ISO Standard 2022.

An ESCAPE SEQUENCE is used to DESIGNATE a particular alphabet (such as Roman, Cyrillic, Hebrew, Arabic, Kanji, etc) to a particular intermediate graphics set (G0, G1, G2, or G3).
A SHIFT FUNCTION is used to INVOKE a particular intermediate graphics set into GL or GR.

In programmer's terms, GL and GR are pointers into the array of tables G0..G3, and the shift functions simply change the values of these pointers.

In our discussion, we use the following notation (numbers are decimal unless otherwise noted):

  <ESC>  Escape (ASCII 27, character 01/11)
  <SP>   Space (ASCII 32, character 02/00)
  <SO>   Shift Out (Ctrl-N, ASCII 14, character 00/14)
  <SI>   Shift In (Ctrl-O, ASCII 15, character 00/15)

Table 5 shows the alphabet designation functions for single-byte and multi-byte character sets in both the 7-bit and 8-bit environments. The character which is substituted for "F" identifies the actual character set to be used.
_____________________________________________________________________________

  Escape Sequence  Function                                        Invoked By
  <ESC>(F          assigns 94-character graphics set "F" to G0.    SI or LS0
  <ESC>)F          assigns 94-character graphics set "F" to G1.    SO or LS1
  <ESC>*F          assigns 94-character graphics set "F" to G2.    SS2 or LS2
  <ESC>+F          assigns 94-character graphics set "F" to G3.    SS3 or LS3
  <ESC>-F          assigns 96-character graphics set "F" to G1.    SO or LS1
  <ESC>.F          assigns 96-character graphics set "F" to G2.    SS2 or LS2
  <ESC>/F          assigns 96-character graphics set "F" to G3.    SS3 or LS3
  <ESC>$(F         assigns multibyte character set "F" to G0.      SI or LS0
  <ESC>$)F         assigns multibyte character set "F" to G1.      SO or LS1
  <ESC>$*F         assigns multibyte character set "F" to G2.      SS2 or LS2
  <ESC>$+F         assigns multibyte character set "F" to G3.      SS3 or LS3

  Table 5: Escape Sequences for Alphabet Designation
_____________________________________________________________________________

Table 6 shows the escape sequences used to designate the appropriate parts of each of the registered character sets discussed in this proposal to G1 (except that ASCII is designated to G0, which is the normal situation). It is important to note that the final letter of the escape sequence is not always sufficient to designate a character set.
For example, Czech Standard and JIS Katakana are both designated by letter I. But the two can be distinguished by the intermediate characters of the escape sequence, which specify whether the set is single- or multibyte, or, when both sets are single-byte, whether there are 94 or 96 characters.
_____________________________________________________________________________

                               Escape    ISO          ECMA       ISO/ECMA
  Alphabet Name                Sequence  Reference    Reference  Registration
  ASCII (ANSI X3.4-1986)       <ESC>(B   ISO 646 IRV  ECMA-6       2
  Latin Alphabet No. 1         <ESC>-A   ISO 8859-1   ECMA-94    100
  Latin Alphabet No. 2         <ESC>-B   ISO 8859-2   ECMA-94    101
  Latin Alphabet No. 3         <ESC>-C   ISO 8859-3   ECMA-94    109
  Latin Alphabet No. 4         <ESC>-D   ISO 8859-4   ECMA-94    110
  Latin/Cyrillic               <ESC>-L   ISO 8859-5   ECMA-113   144
  Latin/Arabic                 <ESC>-G   ISO 8859-6   ECMA-114   127
  Latin/Greek                  <ESC>-F   ISO 8859-7   ECMA-118   126
  Latin/Hebrew                 <ESC>-H   ISO 8859-8   ECMA-121   138
  Latin Alphabet No. 5         <ESC>-M   ISO 8859-9   ECMA-128   148
  Czech Standard CSN 369 03    <ESC>-I   none         none       139
  * Math/Technical Set         <ESC>-K   ????         ????       143
  Chinese (CAS GB 2312-80)     <ESC>$)A  none         none        58
  Japanese (JIS X 0208)        <ESC>$)B  none         none        87
  JIS-Katakana (JIS X 0201)    <ESC>)I   none         none        13
  JIS-Roman (JIS X 0201)       <ESC>)J   none         none        14
  Korean (KS C 5601-1987)      <ESC>$)C  none         none       149

  Table 6: Alphabets, Selectors, Standards, and Registration Numbers
_____________________________________________________________________________

* A math/technical set is clearly needed to handle the IBM PC, DEC VT-series, and other math/technical/line-drawing characters, but there is apparently no such standard set at this time.

Tables 7 and 8 show the shift functions that are used to invoke the intermediate character sets. These shift functions may be either locking or single. "Locking shift" is like shift-lock on a typewriter. It means that all subsequent characters until the next shift are to be taken from the designated intermediate character set. "Single shift" applies only to the character (either single or multibyte) that follows it immediately, but single shift functions are only available for the G2 and G3 sets. Locking shift functions remain in effect across alphabet changes.

In the 7-bit environment, only one character set, GL, can be active at a time. The active character set can be selected from among the intermediate sets G0-G3 by the shifts shown in Table 7. Control characters from C0 are transmitted as-is, and those from the C1 set are sent prefixed by <ESC> followed by the character value, minus 64. For example, the C1 character 10000001 binary (129 decimal) becomes <ESC>A (129 - 64 = 65 = "A").
_____________________________________________________________________________

  Shift  Representation  Name             Function
  SI     Ctrl-O          Shift In         invoke G0 into GL
  SO     Ctrl-N          Shift Out        invoke G1 into GL
  LS2    <ESC>n          Locking Shift 2  invoke G2 into GL
  LS3    <ESC>o          Locking Shift 3  invoke G3 into GL
  SS2    <ESC>N          Single Shift 2   select single character from G2
  SS3    <ESC>O          Single Shift 3   select single character from G3

  Table 7: Shifts Used in the 7-Bit Environment
_____________________________________________________________________________

In the 8-bit environment two character sets, GL and GR, can be active at once. A GL character is selected by a byte whose 8th bit is zero, and a GR character by a byte whose 8th bit is one. The actual character sets assigned to GL and GR are selected by the shifts shown in Table 8. Control characters from both the C0 and C1 sets are sent as is.
_____________________________________________________________________________

  Shift  Representation  Name                   Function
  LS0    Ctrl-O          Locking Shift 0        invoke G0 into GL
  LS1    Ctrl-N          Locking Shift 1        invoke G1 into GL
  LS2    <ESC>n          Locking Shift 2        invoke G2 into GL
  LS3    <ESC>o          Locking Shift 3        invoke G3 into GL
  LS1R   <ESC>~          Locking Shift 1 Right  invoke G1 into GR
  LS2R   <ESC>}          Locking Shift 2 Right  invoke G2 into GR
  LS3R   <ESC>|          Locking Shift 3 Right  invoke G3 into GR
  SS2    08/14           Single Shift 2         select single character from G2
  SS3    08/15           Single Shift 3         select single character from G3

  Table 8: Shifts Used in the 8-Bit Environment
_____________________________________________________________________________

So we have a 3-tiered system. At the bottom tier lie all the world's coded character sets. We can designate up to four of them, one to each of the intermediate graphics sets G0, G1, G2, and G3 using the escape sequences shown in Tables 5 and 6. The terminal or computer keeps each of the selected intermediate sets in memory. There is also one active set, composed of GL and GR. The intermediate sets are invoked to GL or GR (one at a time) by the shifts SO, SI, LS0, LS1, etc, shown in Tables 7 and 8.

A simplified diagram for the 8-bit environment is shown in Figure 3 (see ISO 2022 for detailed diagrams of both the 7-bit and 8-bit environments). On a more sophisticated output device, Figure 3 would contain numerous arrows pointing upwards to demonstrate the operation of the designators and shifts.
_____________________________________________________________________________

                    +--+--------+   +--+--------+
                    |C0|   GL   |   |C1|   GR   |
                    |  |        |   |  |        |        8-Bit
                    |  |        |   |  |        |        Code
                    |  |        |   |  |        |        In Use
                    +--+--------+   +--+--------+

                 LS0        LS1,LS1R    LS2,LS2R    LS3,LS3R     Shifts
                                        SS2         SS3
              +--------+  +--------+  +--------+  +--------+     Intermediate
              |        |  |        |  |        |  |        |     Graphics
              |   G0   |  |   G1   |  |   G2   |  |   G3   |     Sets
              |        |  |        |  |        |  |        |
              +--------+  +--------+  +--------+  +--------+

                                                          Alphabet Designation
               <ESC>(B    <ESC>-A    <ESC>-B    <ESC>-L    <ESC>$)B  Sequences
 +---------+  +--------+ +--------+ +--------+ +--------+ +--------+
 | The world's|  ISO   | |  ISO   | |  ISO   | |  ISO   | | JIS X  |
 | registered | 646IRV | | Latin  | | Latin  | | Latin  | |  0208  |
 | character  |(ASCII) | |   1    | |   2    | |Cyrillic| | Kanji  |
 + sets       +--------+ +--------+ +--------+ +--------+ +--------+

   Figure 3: The ISO 2022 Character Set Selection Mechanisms
_____________________________________________________________________________

For example, the following sequence would be used to transmit the German word "übernächtig" using Latin Alphabet 1 in the 7-bit environment:

  <ESC>(B<ESC>-A<SO>|<SI>bern<SO>d<SI>chtig

where:

  <ESC>(B  designates ASCII to G0
  <ESC>-A  designates the right half of Latin Alphabet 1 to G1
  <SO>     invokes G1 to GL
  |        is character 07/12, but since G1 is invoked to GL, it really denotes character 15/12, which is u-umlaut
  <SI>     invokes G0 to GL
  bern     are characters from G0, which is invoked in GL
  <SO>     invokes G1 to GL
  d        is character 06/04, but since G1 is invoked to GL, it really denotes character 14/04, which is a-umlaut
  <SI>     invokes G0 to GL
  chtig    are characters from G0

The same word could be transmitted in the 7-bit environment using single shifts, if Latin Alphabet 1 were designated to G2 (or G3):

  <ESC>(B<ESC>.A<ESC>N|bern<ESC>Ndchtig

(where <ESC>.A designates the 96-character right half of Latin-1 to G2, as in Table 5, and <ESC>N is Single Shift 2).
In the 8-bit environment it could be transmitted using no shifts at all:

  <ESC>(B<ESC>-Aübernächtig

The designation escape sequences are transmitted only at the beginning of a session and need not be repeated after the initial designations are made, unless an intermediate set (G0-G3) is to be recycled.

To understand the three-tiered design of ISO 2022, imagine a computer programmed to display a mixture of character sets on its screen. A large collection of fonts might be stored on the disk, one font per file. These are the character sets of the bottom tier. When a font is needed, it will be read from the disk and stored in memory in an array, for rapid access. If several fonts are needed, they will be stored in several arrays. These arrays are the intermediate character sets, G0-G3. When a data byte arrives to be displayed, the actual graphic representation is taken from GL or GR (depending on the byte's 8th bit). GL is associated with one of the intermediate graphic sets, and GR with another. If no more than four character sets are used, then each one needs to be read from the disk only once, and display is rapid and efficient thereafter.

ANNOUNCING ISO 2022 FACILITIES

A large portion of ISO 2022 is devoted to describing how 8-bit characters may be transmitted on a 7-bit communication path, for example when parity is in use. In the 7-bit environment, there is only GL -- no GR. Therefore, all characters are transmitted with their 8th bit removed, and shifts are used to specify which intermediate set they belong to.

In fact, there are many possible ways to use the ISO 2022 code extension facilities within both 7-bit and 8-bit environments. For example, the sender may inform the receiver in advance whether G1, G2, or G3 will be used, etc, so that the receiver can allocate the appropriate resources.
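The pointer model of the 7-bit environment described above can be sketched in Python. This is only an illustration under stated assumptions, not part of the proposal: the names INTERMEDIATES, to_unicode, and decode_7bit are hypothetical, only ASCII and the right half of Latin Alphabet 1 are recognized, multibyte designators are left out, and the single-shift sample uses the 96-character G2 designator from Table 5.

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F

# Intermediate character of a designating sequence -> (G-set index, set size),
# following Table 5.  Multibyte ("$") designators are left out of this sketch.
INTERMEDIATES = {
    "(": (0, 94), ")": (1, 94), "*": (2, 94), "+": (3, 94),
    "-": (1, 96), ".": (2, 96), "/": (3, 96),
}


def to_unicode(charset, code):
    """Map a 7-bit code from one designated set to a Unicode character."""
    if charset == (94, "B"):      # ASCII
        return chr(code)
    if charset == (96, "A"):      # right half of Latin Alphabet 1
        return chr(code | 0x80)   # restore the 8th bit
    raise ValueError(f"unsupported or undesignated character set: {charset}")


def decode_7bit(data):
    """Decode a 7-bit ISO 2022 byte string using only the sets above."""
    g = [(94, "B"), None, None, None]   # G0..G3; ASCII assumed in G0
    gl = 0                              # GL "points" at one of G0..G3
    single = None                       # G-set index of a pending single shift
    out = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte == ESC:
            kind = chr(data[i + 1])
            if kind in INTERMEDIATES:           # designation: ESC I F
                index, size = INTERMEDIATES[kind]
                g[index] = (size, chr(data[i + 2]))
                i += 3
                continue
            if kind == "n":                     # LS2: invoke G2 into GL
                gl = 2
            elif kind == "o":                   # LS3: invoke G3 into GL
                gl = 3
            elif kind == "N":                   # SS2: next character from G2
                single = 2
            elif kind == "O":                   # SS3: next character from G3
                single = 3
            else:
                raise ValueError(f"unsupported escape sequence: ESC {kind}")
            i += 2
            continue
        if byte == SO:                          # invoke G1 into GL
            gl = 1
        elif byte == SI:                        # invoke G0 into GL
            gl = 0
        elif byte < 0x20:                       # other C0 controls pass through
            out.append(chr(byte))
        else:
            index = gl if single is None else single
            single = None
            out.append(to_unicode(g[index], byte))
        i += 1
    return "".join(out)


# The two 7-bit encodings of the sample word: locking shifts, then single shifts.
locking = b"\x1b(B\x1b-A\x0e|\x0fbern\x0ed\x0fchtig"
single_shift = b"\x1b(B\x1b.A\x1bN|bern\x1bNdchtig"
```

Calling decode_7bit on either sample should give the same 11-character word, since both byte strings differ only in how the G-set holding the accented letters is invoked.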
At the beginning of any particular data transfer, the facilities that actually will be used can be announced with a sequence of the form <ESC><SP>F, where F is replaced by an ISO 2022 announcer. Several of the most important ones are described here. Table 9 lists all the defined announcers in summary form. For details, see ISO 2022.

<ESC><SP>A means that only the G0 set will be used, invoked into GL. No shift functions will be used. In the 8-bit environment, GR is not used. In other words, only a single 7-bit character set is used.

<ESC><SP>B means the G0 and G1 sets will be used with locking shifts. In the 7-bit environment <SI> invokes G0 into GL, <SO> invokes G1 into GL. In the 8-bit environment, LS0 invokes G0 into GL, LS1 invokes G1 into GL. In other words, two character sets are used, with characters from both sets always sent as 7-bit values, with locking shifts used to specify the 8th bit.

<ESC><SP>C means that G0 and G1 will be used in the 8-bit environment, with G0 invoked in GL and G1 in GR. No locking shift functions are used. In other words, a single 8-bit character set is used, with all 8 bits transmitted as data. GL is selected when the character's 8th bit is zero, GR is selected when the 8th bit is one.

<ESC><SP>D means that G0 and G1 will be used with locking shifts. In the 7-bit environment, <SI> invokes G0 into GL and <SO> invokes G1 into GL. In the 8-bit environment, all 8 bits of each character are transmitted with no shifts.

<ESC><SP>L means that Level 1 of ISO 4873 will be used. That is, a single 8-bit character set with C0, G0, C1, and G1, with no shift functions. This is like <ESC><SP>C.

<ESC><SP>M means that Level 2 of ISO 4873 will be used. This is equivalent to Level 1, with the addition of G2 and G3. Characters from G2 and G3 are invoked only by the single-shift functions SS2 and SS3.

<ESC><SP>N means that Level 3 of ISO 4873 will be used. This is equivalent to Level 2 with the addition of the locking shift functions LS1R, LS2R, and LS3R.
(Note that ISO 4873 does not concern itself with the 7-bit environment, and therefore does not discuss the use of LS0, LS1, LS2, or LS3.)
_____________________________________________________________________________

  Esc Sequence  7-Bit Environment          8-Bit Environment
  <ESC><SP>A    G0->GL                     G0->GL
  <ESC><SP>B    G0-(SI)->GL, G1-(SO)->GL   G0-(LS0)->GL, G1-(LS1)->GL
  <ESC><SP>C    (not used)                 G0->GL, G1->GR
  <ESC><SP>D    G0-(SI)->GL, G1-(SO)->GL   G0->GL, G1->GR
  <ESC><SP>E    Full preservation of shift functions in 7 & 8 bit environments
  <ESC><SP>F    C1 represented as <ESC>F   C1 represented as <ESC>F
  <ESC><SP>G    C1 represented as <ESC>F   C1 represented as 8-bit quantity
  <ESC><SP>H    All graphic character sets have 94 characters
  <ESC><SP>I    All graphic character sets have 94 or 96 characters
  <ESC><SP>J    In a 7 or 8 bit environment, a 7 bit code is used
  <ESC><SP>K    In an 8 bit environment, an 8 bit code is used
  <ESC><SP>L    Level 1 of ISO 4873 is used
  <ESC><SP>M    Level 2 of ISO 4873 is used
  <ESC><SP>N    Level 3 of ISO 4873 is used
  <ESC><SP>P    G0 is used in addition to any other sets:
                G0 -(SI)-> GL              G0 -(LS0)-> GL
  <ESC><SP>R    G1 is used in addition to any other sets:
                G1 -(SO)-> GL              G1 -(LS1)-> GL
  <ESC><SP>S    G1 is used in addition to any other sets:
                G1 -(SO)-> GL              G1 -(LS1R)-> GR
  <ESC><SP>T    G2 is used in addition to any other sets:
                G2 -(LS2)-> GL             G2 -(LS2)-> GL
  <ESC><SP>U    G2 is used in addition to any other sets:
                G2 -(LS2)-> GL             G2 -(LS2R)-> GR
  <ESC><SP>V    G3 is used in addition to any other sets:
                G3 -(LS3)-> GL             G3 -(LS3)-> GL
  <ESC><SP>W    G3 is used in addition to any other sets:
                G3 -(LS3)-> GL             G3 -(LS3R)-> GR
  <ESC><SP>Z    G2 is used in addition to any other sets:
                SS2 invokes a single character from G2
  <ESC><SP>[    G3 is used in addition to any other sets:
                SS3 invokes a single character from G3

  Table 9: ISO 2022 Announcer Summary
_____________________________________________________________________________

APPENDIX C: PRELIMINARY DESIGN FOR LOADABLE TRANSLATION TABLES

Note the word "PRELIMINARY". This design will be refined as attempts are made to program it.

The translation table is specified in a file written entirely in printable ASCII, with line divisions as shown. Numbers are represented as ASCII decimal digits.

  Line        Contents
  1.          Name of this table
  2.          The word "COMMON" or "LOCAL"
  3.          Name of SOURCE character set (translating FROM)
  4.          Number of bytes per character of source set (1, 2, 3, 1-2, etc)
  5.          Number of characters per plane of source set (94, 96, 128)
  6.          Name of TARGET character set (translating TO)
  7.          Number of bytes per character of target set (1, 2, 3, 1-2, etc)
  8.          Number of characters per plane of target set (94, 96, 128)
  9.          Designating sequence for COMMON character set.
  10.         Version number of common character set (blank if none)
  11.         Registration number of common character set (e.g. I2/100, blank if none)
  12.         Direction of writing (Left-to-right, Right-to-left, Upwards, etc)
  13.         Number of entries in the translation table.
  14.         Count of lines, n, between this line and beginning of translation table.
  15 - 15+n.  Reserved for future use.
  n+16...     The translation table itself.

Line 2 is either COMMON or LOCAL, and applies to the SOURCE character set. LOCAL means that the source character set is local, and the target character set is common, i.e. the one used during transmission in the transfer syntax. COMMON means vice-versa.

Line 3 gives the name of the source character set, which is either local or common, depending on line 2.

Line 4 specifies the number of bytes per character in the source character set. For example, 1 for ASCII, ISO Latin-1, etc, 2 for JIS X 0208, etc. The notation "1-2" means that a character can be either one or two bytes, as in (for instance) CCITT T.61, where "A" is the single character "A", but "`A" is the single character A-grave.

Line 5 specifies the number of "characters per plane". In a single-byte character set, there is one plane, in a multibyte set there are many. In the ISO world, an important distinction is made between 94-character sets and 96-character sets. See Appendix B for a fuller explanation.

Lines 6-8 are like lines 3-5, but for the target character set. If the source set was local, the target set is common, and vice versa.
Lines 9-11 give further information about the standard, COMMON character set:

Line 9 specifies the designating sequences required to assign the set to G0, G1, G2, and G3 (see Table 6), in that order, with the bytes written as decimal numbers, each byte separated by a space, and each sequence separated by a comma. For example, the entry for a 94-character set whose final designating letter is "B" would look like this:

  27 40 66, 27 41 66, 27 42 66, 27 43 66

If a character set cannot be assigned to G0 (as is the case with a 96-character set), then the first entry would be left blank (the final letter here is A):

  , 27 45 65, 27 46 65, 27 47 65

Line 10 gives the revision number of the common character set, as described in the "Data Transfer Protocol" section of the description of Level 2. This should be blank if the character set has not been revised, @ (atsign) for the first revision, A for the second revision, B for the third revision, etc.

Line 11 gives the Kermit designator of the common character set, such as I2/100 for ISO Latin Alphabet 1.

Line 12, direction of writing, has nothing to do with file transfer, but is included in case the same table is also to be used with terminal emulation. The actual notation should be the letter L (Left-to-right), R (Right-to-left), U (Upwards), D (Downwards), or B (Boustrophedon, i.e. alternating L and R).

Line 13 is self explanatory.

Line 14 allows for future expansion of this "information header".

Lines 15 through the end contain the translation table itself. Each of these lines contains a pair of characters or strings in ASCII decimal representation, with the members of the pair separated by a comma, followed optionally by a comment, like "Uppercase A Circumflex":

  <source>, <target> ; <comment>

Each byte of a character is separated by a space, for example:

  231, 135 ; c Cedilla (Latin-1 to CP850)
  228, 97 101 ; Latin-1 a-umlaut to ASCII "ae"
  97 101, 228 ; "ae" to Latin-1 a-umlaut (dangerous!)
  123 456, 234 567 ; Something from a pair of 2-byte character sets

The character pair is listed, rather than a single value (as in most translation tables) to allow for special translations like a-umlaut to "ae". There is no rule against having different numbers of bytes on either side of the comma. There is also no requirement to always have the same number of bytes on the left or right side of the comma, nor to have every position filled. If a position is vacant, the program should take some kind of default action, like substituting a question mark.

APPENDIX D: SUMMARY OF NEW KERMIT COMMANDS

SET FILE TYPE { BINARY, TEXT, <format> }
  BINARY means no translation, and overrides all other file-related commands, including SET TRANSFER.
  TEXT is the default. Enables Level 0, 1, or 2 transfer syntax, depending on the setting of SET TRANSFER.
  <format> means any application-specific format known to the Kermit program, like WORDPERFECT. The meaning of such a command is system- and/or application-dependent.

SET FILE CHARACTER-SET <name>
  Effective only when file type is TEXT. Tells Kermit what character set the file is coded in, or what character set to translate an incoming file to.

SET TRANSFER { CHARACTER-SET <name>, INTERNATIONAL [{7,8}], NORMAL }
  CHARACTER-SET <name> invokes the Level 1 extension, unless the <name> is TRANSPARENT or ASCII.
  INTERNATIONAL invokes the Level 2 extension. 7 or 8 specifies the ISO-2022 7- or 8-bit environment.
  NORMAL - SET TRANSFER NORMAL is a synonym for SET TRANSFER CHARACTER-SET ASCII.

SET LANGUAGE <name>
  This command informs the program which language is being translated, to allow for special language-based translation tricks, such as a-umlaut => ae.

SET UNKNOWN-CHARACTER-SET { KEEP, CANCEL }
  Tells the file receiver whether to keep or cancel an incoming file that contains an unknown character set. KEEP is the default.

LOAD TRANSLATION-TABLE <filename>
  Load a new translation table, or overlay an existing one, from the specified file.
SHOW TRANSLATION-TABLE <name>
  Show information about the named translation table. If <name> is omitted, show information about all translation tables.

DUMP TRANSLATION-TABLE <name> <filename>
  Write the contents of the named table to the specified file, in a format compatible with the LOAD TRANSLATION-TABLE command.

DROP TRANSLATION-TABLE <name>
  Remove the named translation table from Kermit's memory and command keyword tables.

SET ATTRIBUTES { ON, OFF }
SET ATTRIBUTE <name> { ON, OFF }
  Enables or disables processing of attribute packets, or specific attribute fields such as DATE, CHARACTER-SET, LENGTH, etc.

SHOW { CHARACTER-SETS, TRANSLATION-TABLES, LANGUAGE }
  Display what character sets, translation tables, and languages are available, and which ones are currently selected.

TRANSLATE <file1> <file2>
  Copies local file <file1> to local file <file2>, translating from the current file character-set into the current transfer character-set.

APPENDIX E: ESCAPE SEQUENCES AND CONTROL CHARACTERS FOR KERMIT LEVEL-2 TRANSFER SYNTAX

1. Designation of character sets. The final letter "F" denotes the character set, e.g. "A" for ISO Latin-1.

  <ESC>(F   assigns 94-character graphics set "F" to G0.
  <ESC>)F   assigns 94-character graphics set "F" to G1.
  <ESC>*F   assigns 94-character graphics set "F" to G2.
  <ESC>+F   assigns 94-character graphics set "F" to G3.
  <ESC>-F   assigns 96-character graphics set "F" to G1.
  <ESC>.F   assigns 96-character graphics set "F" to G2.
  <ESC>/F   assigns 96-character graphics set "F" to G3.
  <ESC>$(F  assigns multibyte character set "F" to G0.
  <ESC>$)F  assigns multibyte character set "F" to G1.
  <ESC>$*F  assigns multibyte character set "F" to G2.
  <ESC>$+F  assigns multibyte character set "F" to G3.

2. Shift functions:

  Character(s)  Name    Function
  Ctrl-O        SI,LS0  Shift In (invoke G0 to GL)
  Ctrl-N        SO,LS1  Shift Out (invoke G1 to GL)
  <ESC>n        LS2     Locking Shift 2 (invoke G2 to GL)
  <ESC>o        LS3     Locking Shift 3 (invoke G3 to GL)
  <ESC>~        LS1R    Locking Shift 1 Right (invoke G1 to GR)
  <ESC>}        LS2R    Locking Shift 2 Right (invoke G2 to GR)
  <ESC>|        LS3R    Locking Shift 3 Right (invoke G3 to GR)
  <ESC>N        SS2     Single Shift 2, 7-bit version, single char from G2
  08/14         SS2     Single Shift 2, 8-bit version, single char from G2
  <ESC>O        SS3     Single Shift 3, 7-bit version, single char from G3
  08/15         SS3     Single Shift 3, 8-bit version, single char from G3

3. Coding method delimiter: When receiving text in an unknown character set, store the character set designator, then store the untranslated characters, and terminate with the coding method delimiter.

  <ESC>d

4. Special characters in data: If any of the following characters appear in the data itself, they must be prefixed during transmission with <DLE>, datalink escape, 01/00, Control-P:

  <SO>   00/14
  <SI>   00/15
  <DLE>  01/00
  <ESC>  01/11
  <SS2>  08/14
  <SS3>  08/15

APPENDIX F: SELECTING LEVEL 0, LEVEL 1, AND LEVEL 2

A Kermit program operates in Level 0 by default. The transfer character set is ASCII, and if Attribute packets are used, the encoding attribute is "*!A", and the character-set attribute "2" is not used.

To enter Level 0:

  SET TRANSFER CHARACTER-SET ASCII enters Level 0 from any level. SET TRANSFER NORMAL does exactly the same thing. The two commands are synonyms.

To enter Level 1:

  SET TRANSFER CHARACTER-SET <name>, where <name> is not ASCII, enters Level 1 from any level. If Attribute packets are used, the encoding attribute is "*!C" and the character-set attribute must be specified. If Level 1 is entered, exited, and entered again, the transfer character set must be respecified.

To enter Level 2:

  SET TRANSFER INTERNATIONAL enters Level 2 from any level. Attribute packets should be used at this level.
  The encoding attribute is "*xIyyy" where "x" is the number of characters to follow, "I" signifies international transfer syntax, and "yyy" is 0 or more ISO-2022 facility announcers. The character-set attribute may also be included, if the program knows in advance which character sets are in the file. In this case the Kermit character-set designators for each set are listed, separated by commas, for example "2-I2/100,I2/144".

APPENDIX G: SIMPLIFIED FLOW DIAGRAM OF KERMIT TRANSFER SYNTAX OPTIONS

SET FILE TYPE BINARY (overrides SET TRANSFER command)
 |  |
 N  Y--> Transfer file unmodified. END.
 |
Text mode. Three possibilities:

SET TRANSFER CHARACTER-SET TRANSPARENT (or ASCII)
 |  |
 N  Y--> LEVEL 0: Transfer syntax is ASCII with CRLF as line terminator.
 |       Sending program translates from local format to transfer syntax,
 |       Receiving program translates from transfer syntax to local format.
 |       END.
 |
SET TRANSFER CHARACTER-SET LATIN1 (any single character set other than ASCII)
 |  |
 N  Y--> LEVEL 1: Transfer syntax is specified character set with CRLFs.
 |       Sender translates from local format to specified character set.
 |       Receiver translates from specified character set to local format.
 |       END.
 |
File composed of more than one character set: SET TRANSFER INTERNATIONAL
 |  |
 N  Y--> LEVEL 2: Transfer syntax is ISO-2022. Assumes that sender can
 |       identify the different character sets in the local file, and
 |       can translate them to registered character sets if necessary.
 |        |
 |       Sender specifies encoding ("*") to be International ("I"),
 |       and lists ISO-2022 announcers. Sender also optionally lists the
 |       alphabets to be used in new character-set ("2") attribute.
 |        |
 |       Receiver agrees to these facilities and alphabets?
 |        |  |
 |        Y  N --> Receiver rejects the file, indicating "*" and/or "2"
 |        |        as the reason. END.
 |        |
 |       Receiver accepts the file.
 |        |
 |       Transfer begins. Sender translates from local file format to
 |       the character sets of the transfer syntax, using ISO-2022
 |       announcers, designators, and shifts to switch among them.
 |        |
 |       2 --> Receiver heeds announcers, designators, and shifts, and
 |       translates from the indicated character sets to local
 |       representation.
 |        |
 |       If the receiver encounters an alphabet it does not know, it
 |       will act according to the SET UNKNOWN-CHARACTER-SET command:
 |        |     |
 |       KEEP  CANCEL --> Reject the file by putting an X (Cancel
 |        |               File) code in the data field of its
 |        |               Acknowledgement. END.
 |        |
 |       (default) Continue to receive the file, but store the
 |       designator for the unknown alphabet along with the
 |       untranslated characters from that alphabet, until the next
 |       known alphabet is encountered. Mark the end of the
 |       untranslated material with <ESC>d. Warn user.
 |        |
 |       END.
 |
Reserved for future (END)
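
The level selection in Appendices F and G can also be written as a small Python sketch. The function names select_level, encoding_attribute, and tochar are hypothetical and used only for illustration. The sketch assumes that the length field "x" of the encoding attribute is formed by adding 32 to the count, which is consistent with the "!" (length 1) in "*!A" and "*!C" but is not spelled out in this proposal.

```python
def tochar(n):
    """Printable encoding of a small number (assumed here: add 32)."""
    return chr(n + 32)


def select_level(file_type, transfer, charset="ASCII"):
    """Return None for binary transfers, otherwise the level 0, 1, or 2.

    file_type: "BINARY" or "TEXT"                           (SET FILE TYPE)
    transfer:  "NORMAL", "CHARACTER-SET", "INTERNATIONAL"   (SET TRANSFER)
    charset:   transfer character set, used with "CHARACTER-SET"
    """
    if file_type == "BINARY":          # overrides the SET TRANSFER command
        return None
    if transfer == "INTERNATIONAL":    # several character sets, ISO-2022
        return 2
    if transfer == "NORMAL" or charset in ("ASCII", "TRANSPARENT"):
        return 0
    return 1                           # one character set other than ASCII


def encoding_attribute(level, announcers=""):
    """Build the "*" (encoding) attribute string for a given level."""
    value = {0: "A", 1: "C", 2: "I" + announcers}[level]
    return "*" + tochar(len(value)) + value
```

For example, select_level("TEXT", "CHARACTER-SET", "LATIN1") selects Level 1, for which encoding_attribute(1) yields "*!C" as stated in Appendix F.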