© 2010 SMP Marketing • ISSN 1420-3693 • www.localization.org

Multilingual API for internationalization of the network services

Borka Jerman-Blazic, Andrej Gogala, Laboratory for Open Systems and Networks, Slovenia

This paper describes an attempt to provide internationalized network services by introducing a tool that transforms character set codes in different network applications. The tool helps the end user by converting received or sent data encoded in character sets that the user's system does not support. It takes the user's cultural environment into account and provides conversion customized to the user's needs and cultural requirements. The tool was developed within Project C3 of the TERENA WG-CHAR Task Force. Currently, the tool caters for the scripts used in Europe, but the model developed and the API implemented can be used for any other cultural environment. The basic model makes use of the UNICODE character set as a universal basis for any kind of character set conversion.


1. Introduction

The current networking world is in transition between the 8-bit coded character set scheme, which supports the required national letters by means of the switching technique, and the 16-bit coding provided by the Unicode scheme.

The major problem with services that provide transport mechanisms for a large number of different character sets lies in the inconsistency of the applied data interchange standards and in the input/output (rendering) problems caused by the diversity of user equipment. We may say that the network services and the protocols dealing with different character sets are mature enough and that additional effort is required on the user side. In this paper we discuss an attempt to provide a solution on the user side in the form of a tool for character set transformation and conversion. The tool consists of a general API and the coded character set tables. It helps the end user by converting received or sent data coded in character sets that the user's system does not support. The tool takes the user's cultural environment into account and provides conversion customized to the user's needs and cultural requirements. It was developed within Project C3 of the TERENA WG-CHAR Task Force, in cooperation with KTH, Stockholm, Sweden, and DKUUG, Denmark. Currently, the tool caters for the scripts used in Europe, but the model and the API developed and implemented in the tool can be used for any other cultural environment.

2. The applied model

The C3 system (the coded character set tables and the API) is designed to support conversion between any pair of a wide selection of coded character sets used in Europe in the most efficient way.

Coded character set conversion is the kind of data conversion in which a source text, encoded in one coded character set, the source character set, is transformed to a target text, encoded in another coded character set, the target character set. Ideally, the text should be the same, only the method of encoding being changed.

A coded character set here means a function (in the mathematical sense) which, for any sequence of encoding units, tells whether the sequence is legal and, if so, gives the sequence of characters it represents. An encoding unit is any ordered group of bits of a certain length; normally it is a byte, a group of 8 bits. The set of characters that can be represented is called the character repertoire of the coded character set.
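
The definition above can be sketched in Python as a function from a sequence of encoding units to either a character sequence or a rejection. The table `TOY_SET` and the names `decode_units` and `repertoire` are assumptions made for illustration; they are not part of the C3 API.

```python
from typing import Optional, Sequence

# Hypothetical toy coded character set: encoding unit (byte) -> character.
# Real sets define far more units; this one exists only for illustration.
TOY_SET = {0x41: "A", 0x42: "B", 0xC4: "\u00c4"}  # A, B, Ä


def decode_units(units: Sequence[int], table: dict) -> Optional[str]:
    """Return the characters represented by `units`, or None if the
    sequence is not legal in the coded character set."""
    chars = []
    for unit in units:
        if unit not in table:
            return None  # illegal sequence of encoding units
        chars.append(table[unit])
    return "".join(chars)


def repertoire(table: dict) -> set:
    """The character repertoire: every character the set can represent."""
    return set(table.values())
```

With this toy set, `decode_units([0x41, 0xC4], TOY_SET)` yields "AÄ", while a sequence containing the undefined unit 0x99 is reported as illegal.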

The tool is intended to be used in the following fields of applications:

  • Conversion of different coded character sets in text for integration in MIME and X.400 MUA/MTA systems, netnews, X.500 directory services, WWW and network information retrieval (NIR) services,
  • Conversion in terminal communication, when the terminal program runs on a computer with another coded character set than the host,
  • Conversion in transfer of text between incompatible word-processing or text-processing systems by means of plain text files,
  • Conversion of character coded data within one computer system in connection with a change of that system's principal coded character set, e.g. from a 7-bit character set to an 8-bit character set.

3. Basic Features of the System

When converting text from one coded character set to another, the first necessary task is to transform to the target character set the encoding units of those characters of the source text that are also present in the target character set. This only requires accurate information, for every possible encoding unit of both character sets, about which character it stands for in that character set. The conversion in this case is exact: the source characters are re-encoded in the target character set without distortion.

This information about the definition of a character set is included in the elementary table for this character set in the C3 system. By combining the elementary tables for the source character set and the target character set by a simple algorithm, a working table is created that is used for the direct transformation of the sequence of encoding units in the source text to the sequence of encoding units in the target text.
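
A minimal sketch of this table combination follows. It assumes that an elementary table maps UCS code points to one-byte encoding units; the table layout and the names `build_working_table` and `convert` are hypothetical, and only a few entries of ISO 8859-1 and IBM code page 437 are shown. Non-exact fallbacks are left out here.

```python
# Hypothetical elementary tables: UCS code point -> encoding unit (one byte).
# The few entries shown follow ISO 8859-1 and IBM code page 437.
LATIN1 = {0x0041: 0x41, 0x00E9: 0xE9, 0x00E4: 0xE4, 0x00FC: 0xFC}
CP437 = {0x0041: 0x41, 0x00E9: 0x82, 0x00E4: 0x84, 0x00FC: 0x81}


def build_working_table(source: dict, target: dict) -> dict:
    """Combine two elementary tables into a direct map from source
    encoding unit to target encoding unit.

    Only characters present in both sets are covered (the exact case).
    """
    return {
        src_unit: target[ucs]
        for ucs, src_unit in source.items()
        if ucs in target
    }


def convert(data: bytes, working: dict) -> bytes:
    """Re-encode `data` unit by unit; raises KeyError for uncovered units."""
    return bytes(working[unit] for unit in data)


working = build_working_table(LATIN1, CP437)
converted = convert(bytes([0x41, 0xE9]), working)  # "Aé" in ISO 8859-1
```

In this sketch the UCS code points serve only as the join key while the working table is built; the conversion itself maps source bytes straight to target bytes.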

This design makes it possible to convert any coded character set supported by the C3 system to any other such character set. How this is to be done is fully specified by N elementary tables, compared to the N(N-1) conversion tables usually used in a brute-force approach. The basic table used is UNICODE [16].

In the character sets world, an 8-bit target character set cannot fully cover the repertoire of every source character set. For these cases the C3 system provides non-exact representations of the source characters. The non-exact conversion is implemented with a choice of three different types of conversion:

Conversion type 1 -- one-to-one conversion: Preserves the length of lines and data fields by always transforming one source character to one target character.

Conversion type 2 -- legible conversion: Gives the user as much information about the original character as possible by means of a legible representation, sometimes using more than one character.

Conversion type 3 -- reversible conversion: Uses a one-to-many representation of characters not available in the target character set, which is designed to make reconversion to the original character set possible without any information loss or distortion.
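
The difference between the three types can be illustrated with a small sketch that converts text to a 7-bit target. The fallback strings, the "&" marker used for the reversible form, and the names `FALLBACKS` and `convert_text` are illustrative assumptions, not entries from the actual C3 tables.

```python
# Hypothetical fallback entries for characters missing from a 7-bit target.
# Each entry lists (one-to-one, legible, reversible) representations.
FALLBACKS = {
    "\u00e4": ("a", "ae", "&a:"),  # ä
    "\u00e9": ("e", "e", "&e'"),   # é
}

ONE_TO_ONE, LEGIBLE, REVERSIBLE = 0, 1, 2


def convert_text(text: str, conv_type: int) -> str:
    """Convert `text` to ASCII, using the chosen non-exact conversion
    type for characters that ASCII lacks.

    A complete reversible scheme would also have to escape the marker
    character itself; that step is omitted in this sketch.
    """
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)                        # exact conversion
        else:
            out.append(FALLBACKS[ch][conv_type])  # non-exact conversion
    return "".join(out)
```

For the word "Bär", type 1 gives "Bar" (same length as the source), type 2 gives "Baer", and type 3 gives "B&a:r", from which the original character can be recovered.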

The C3 system does not change the interpretation of a character, or how it is converted, because of its context in the source text. For the most important case in which conversion sensitive to source context is needed (the conversion between different ways of representing line breaks), a special mechanism is provided. In addition, there is a mode of using the C3 system in which special byte sequences in the source text, inserted by a pre-processor, change the conversion options on the fly.
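
Line breaks need context because CR followed by LF forms one line break, not two. The paper does not describe the C3 mechanism in detail, so the following is only a sketch of the general idea; the function name `convert_line_breaks` is an assumption.

```python
def convert_line_breaks(data: bytes, target: bytes = b"\n") -> bytes:
    """Rewrite CR LF, lone CR, and lone LF as `target`."""
    out = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        if byte == 0x0D:  # CR
            # Context matters: CR LF is a single line break.
            if i + 1 < len(data) and data[i + 1] == 0x0A:
                i += 1  # skip the LF that belongs to this CR
            out += target
        elif byte == 0x0A:  # lone LF
            out += target
        else:
            out.append(byte)
        i += 1
    return bytes(out)
```

All other bytes pass through unchanged, so this step can run independently of the character conversion proper.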

When the target character set does not contain the letters of the script used in the source text, e.g. if the source text is written in Cyrillic letters but the target character set is purely Latin-script, the non-exact conversion applied by the C3 system is a transliteration or transcription of the source script into the script available in the target character set. The C3 system thus includes transliteration from the Cyrillic script to the Latin script.

The best choice of non-exact representations depends on several factors, among which the most important ones are the language of the source text (e.g. transliteration rules are usually not identical for Russian and Serbian) and the cultural environment for which the target text is intended (e.g. different transliteration systems are used in English-speaking countries, in Germany, and in the Scandinavian countries). To cater for the different conventions preferred by different user groups, a set of parameters is provided, for which the user can set the values in order to get the most adequate conversion for a particular cultural environment. This constitutes a run-time localization feature of the C3 system.

The elementary table of a character set includes full information about the non-exact conversions, in the three different conversion types, of all characters outside the character set. The modifications of the non-exact conversions needed for different values of the conversion factors are specified in overlay tables, which are applied to the elementary tables before the construction of the working table.
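
The overlay idea can be sketched as a dictionary update applied before conversion. The factor name "german", the transliteration strings, and the names `BASE_FALLBACKS`, `OVERLAYS`, `apply_overlay` and `transliterate` are illustrative assumptions; the real C3 tables and factor values may differ.

```python
# Hypothetical legible fallbacks for two Cyrillic letters in a Latin-only
# target set, plus an overlay keyed by a cultural-environment factor.
BASE_FALLBACKS = {"\u0448": "sh", "\u0447": "ch"}  # ш, ч
OVERLAYS = {
    "german": {"\u0448": "sch", "\u0447": "tsch"},
}


def apply_overlay(base: dict, factor: str) -> dict:
    """Return a copy of `base` with the overlay for `factor` applied.
    Unknown factor values leave the defaults unchanged."""
    table = dict(base)
    table.update(OVERLAYS.get(factor, {}))
    return table


def transliterate(text: str, table: dict) -> str:
    """Replace characters found in `table`; pass all others through."""
    return "".join(table.get(ch, ch) for ch in text)
```

Because the overlay is applied to a copy, the default table stays intact and several factor settings can coexist.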

The design of elementary tables for the coded character sets supported by the C3 system is facilitated by the prototype table. This table gives the preferred representation of every character included in any of the supported character sets by means of a string of characters in the common sub-repertoire of all character sets.

4. Features list

The major C3 features are:

  • Full generality: conversion can be done in any direction between any pair of the coded character sets included in the system.
  • Approximate conversion when exact conversion is impossible: There is no arbitrary identification of different characters in the source and the target character sets. If the target character set lacks a source character, the best possible replacement character or string is used.
  • Can handle not only simple 7-bit and 8-bit coded character sets, but also advanced character sets such as the 16-bit ISO 10646 character set/UNICODE (on implementation level 1) and stateful character sets like ISO 6937/T.61. Incomplete character sets, character sets lacking control characters, non-deterministic character sets, and ambiguous character sets are also supported.
  • Easy to use for the unsophisticated user (by means of carefully chosen defaults).
  • Flexible and fully configurable for the sophisticated user/system administrator/application developer.
  • Conversion parameters control the exact conversions performed. Different needs or restrictions in different situations are easily handled by means of:
    • the three conversion types (one-to-one, legible, reversible)
    • separate specification of the conversion of line breaks
    • the factor system (for varying cultural expectations affecting preferable approximate conversions).
  • Easy to customize: The conversion tables use a format optimized for human readability, which uses only the subset of ISO 10646 consisting of the 82 graphic characters available in all coded character sets (hexadecimal values are used to refer to other characters). Different full sets of conversion tables can be used in parallel.
  • Simple to extend: To add a new coded character set, only provide a definition table for it and approximate conversions for any character in it that isn't included in any already defined coded character set.
  • Scalable: To fully define the N(N-1) possible conversion paths between N different coded character sets, only N+1 conversion tables are needed. How conversion is to be done is defined by means of ISO 10646/UNICODE as a common interface, but the actual conversion is a direct transformation from source character set to target character set, not involving an ISO 10646 representation as an intermediate step. Temporary files are not needed.
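
As a quick check of the scaling claim, the two counts can be computed for the 39 coded character sets that this paper reports for Europe. The function names are hypothetical, and reading "N+1" as N definition tables plus one shared approximation table is an interpretation of the text.

```python
def pairwise_tables(n: int) -> int:
    """Directed conversion paths between n coded character sets."""
    return n * (n - 1)


def c3_tables(n: int) -> int:
    """n definition tables plus one shared approximation table."""
    return n + 1
```

For n = 39 this gives 1482 pairwise conversion paths against 40 tables.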

5. Innovative elements in C3

The C3 tool differs from other coded character set conversion tools in several innovative elements. The approximation table is the most innovative element in the C3 approach to character set conversion. For each character in any of the character sets for which definition tables are given, it specifies how the character is to be represented approximately (by fall-back) in the target character set if that character set does not include it. Several alternative representations are specified for some characters, to take advantage of the different character repertoires of different target character sets.

The conversion tables use only the invariant part of ASCII. To indicate other characters, the hexadecimal form of the coded representations in UCS is used. No information specific to a certain coded character set is included in the approximation table.

The approximation table defines three types of conversion which the user can choose from: Type 1 converts one source character to one target character (best for tables and fields with length restrictions). Type 2 converts characters to a more understandable approximate representation, which may consist of one or a few target characters (best for prose). Type 3 is a reversible one-character-to-many-characters conversion, which is based on the mnemonics defined by RFC 1345.

A special problem in character set conversion is the treatment of control characters. The basic view taken in the design of the C3 system is that the abstract control characters (with encoding units in the intervals 0-31 and 127-159 of ISO 8-bit character sets) may be used to represent different control functions in different contexts. Control functions are very often represented by sequences of control characters and graphic characters instead of simple control characters, which would make conversion sensitive to source context necessary; the C3 system therefore does not attempt conversion of control function representations. Instead, only the encoding units of individual abstract control characters are converted.

In most coded character sets the encoding units are simply bytes. For more complicated character sets like T.61 (or ISO 6937, known for its non-constant length of coding per character or its application of the ISO 2022 extension technique), the characters are best treated as a combination of one state (depending on the previous encoding units in the text) and one byte. The C3 system fully supports the conversion of such stateful coded character sets. The C3 system supports a character repertoire formed by including all characters of the 39 coded character sets which are used in Europe (this includes the 65 control characters possible in an ISO 8-bit character set in the C0, DEL, and C1 areas), together with a limited number of additional characters from the European subset of UNICODE, currently being designed by CEN/TC304. This repertoire is called the United Repertoire.
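
The "one state plus one byte" view of a stateful character set can be sketched as follows. The decoder is modeled on the ISO 6937 / T.61 convention that a non-spacing diacritic byte precedes its base letter; the two diacritic byte values and the name `decode_stateful` are assumptions made for this sketch, not values taken from the C3 tables.

```python
import unicodedata

# Non-spacing diacritic bytes in the style of ISO 6937 / T.61, where the
# diacritic precedes its base letter. Both byte values are assumptions.
DIACRITICS = {0xC2: "\u0301", 0xC8: "\u0308"}  # acute, diaeresis


def decode_stateful(data: bytes) -> str:
    """Decode with one piece of state: the pending diacritic, if any."""
    out = []
    pending = None  # state carried over from the previous encoding unit
    for byte in data:
        if byte in DIACRITICS:
            if pending is not None:
                raise ValueError("two diacritic bytes in a row")
            pending = DIACRITICS[byte]
        elif byte < 0x80:
            ch = chr(byte)
            if pending is not None:
                # (state, byte) together denote one accented character
                ch = unicodedata.normalize("NFC", ch + pending)
                pending = None
            out.append(ch)
        else:
            raise ValueError(f"unsupported byte 0x{byte:02X}")
    if pending is not None:
        raise ValueError("diacritic at end of input")
    return "".join(out)
```

The same byte 0x65 therefore decodes to "e" or to "é" depending on the state left behind by the preceding byte.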

The Common Sub-repertoire of the coded character sets consists of the 83 invariant graphic characters (including SP) of ISO 646:1992. The elementary table in the C3 system for a coded character set indicates for each character of the United Repertoire if it is included in the character set or not. If it is, the table specifies its encoding unit. If it is not, the table specifies what non-exact representation of the character is to be used when converting text to this character set. This is done for each of the three conversion types. The characters of the United Repertoire are specified by means of their encoding unit in UCS-2 (ISO 10646-1:1993).

6. C3 and network applications

C3 is incorporated in two mailing systems: Z-mail and EXMH (copyrighted by Xerox). EXMH is a TCL/TK-based interface to the MH mail system. It is capable of receiving and presenting MIME messages [13]. Presenting text in various character sets requires appropriate fonts. By calling C3, EXMH can convert a received message to a selected character set. The target character set and the type of conversion can be set up in advance, so the user is not interrupted while viewing mail.

Another use for C3 in EXMH is for messages coded in national versions of 7-bit ASCII (ISO 646 IRV). This type of coding is still heavily used, especially for e-mail, in many European countries. By using this simple coded character set, the messages are protected from distortion when they pass through unknown gateways. After the message is received, the text part is converted to the character set usually used on the system. The incorporation of C3 in WWW clients/servers is under way.
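
The conversion of such a 7-bit national message can be sketched for a Swedish ISO 646 variant, in which the bracket and brace positions carry national letters. Only the six letter positions are mapped here, other national-use positions pass through unchanged, and the name `swedish_7bit_to_latin1` is an assumption; this is not how EXMH actually calls C3.

```python
# Letter positions of a Swedish 7-bit ISO 646 variant mapped to Unicode
# characters. Only these six positions are handled in this sketch.
SWEDISH_7BIT = {
    0x5B: "\u00c4", 0x5C: "\u00d6", 0x5D: "\u00c5",  # [ \ ] -> Ä Ö Å
    0x7B: "\u00e4", 0x7C: "\u00f6", 0x7D: "\u00e5",  # { | } -> ä ö å
}


def swedish_7bit_to_latin1(data: bytes) -> bytes:
    """Convert a 7-bit Swedish message body to ISO 8859-1."""
    chars = []
    for byte in data:
        if byte > 0x7F:
            raise ValueError("not 7-bit data")
        chars.append(SWEDISH_7BIT.get(byte, chr(byte)))
    return "".join(chars).encode("latin-1")
```

A message body received as `R{ksm|rg}s` is thus displayed as "Räksmörgås" on a system whose usual character set is ISO 8859-1.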

7. Concluding remarks

There are many other conversion systems in the character set codes world, but they differ from the C3 system in their coverage of scripts and in functionality. In general they provide conversion between a set of selected 8-bit character sets with the same or similar character repertoire; the missing characters in the converted data are represented with some "error" characters or some "similar" replacements. The C3 system is different because it provides conversion between any pair of the supported character sets and offers three conversion types (one-to-one, legible, and reversible), with reconversion if required. The main property of the C3 system that brings it close to the requirements for an internationalized product is the transliteration facility and the selectivity in the representations of characters defined or chosen by the end user.

Additional important features of the C3 system reflect the requirements for portability of such a system and its integration in applications such as SMTP, the PP mail gateway system and terminal emulation. This is provided by the system API, which enables simultaneous conversion of several source texts. The API is also designed to support context-sensitive texts. We can also say that the major advantage of the C3 system is the handling of the defined subset of UCS (implementation level 1), the UTF-2 data format, Quoted-Printable, Base64 and other byte-oriented data formats. Some features in the API are also designed for the support and handling of representation-ambiguous character sets and flexible treatment of data errors.

The API for the C3 system is written in C for UNIX. The concept used follows the reference descriptions used in "Xlib Reference Manual for version 11" of the X Window System. The implementations for MSDOS and MAC as well as the integration in SMTP, PP mail system gateway and terminal emulation are underway.

Future extensions of the C3 system are planned towards handling bi-directional text, conversion between logical and visual encoding of Thai/Lao text, and implementation of level 3 of UCS.

References

[1] ISO 646, 3ed, 1992: "ISO 7-Bit Coded Character Set"

[2] ISO 2022, 3ed, 1986: "Code Extension Technique"

[3] ISO 4873, 2ed, 1986: "ISO 8-Bit Code Structure and Rules"

[4] ISO 6937, 1991: "Coded Character Sets for Text Communication"

[5] ISO 8859-x, 1ed, 1987: "8-Bit Single-Byte Coded Graphic Character"

[6] ISO 10646: "Multiple Octet Coded Character Set"

[7] Unicode: "16-bit multi-lingual character set code"

[8] ISO 6937, 1991: "Coded Character Sets for Text Communication"

[9] ENV 41 503: "European graphic character repertoires and their coding"

[10] ENV 41 508: "East European graphic character repertoires"

[11] Working documents of the RARE WG-CHAR and C3 TF, 1993

[12] Olle Jarnefors: Status Report: "Overall design of C3", 1993

[13] Nathaniel Borenstein, Ned Freed, RFC, "MIME - Multipurpose Internet Mail Extensions"

[14] Keld Simonsen, RFC on Mnemonics 1345, 1992

[15] B.Jerman-Blazic, "Character handling and computer communications, in User needs in information technology standards", eds. C.D.Evans, B.L.Meek, R.S.Walker, Butterworth Heinemann 1992.

[16] European Subsets of ISO/IEC 10646-1, CEN/TC304 N393.


The latest C3 distribution and other C3 information are available on the World Wide Web at http://www.nada.kth.se/i18n/c3/ or by anonymous FTP from ftp.nada.kth.se, directory "pub/i18n/c3", i.e. ftp://ftp.nada.kth.se/pub/i18n/c3/

Email addresses:

c3-questions@nada.kth.se: Questions, comments, bug reports, etc.
c3-info-request@nada.kth.se: Subscription to info-about-C3 list
c3-request@nada.kth.se: Subscription to discussion-about-C3 list
c3@nada.kth.se: Contribution to discussion-about-C3 list
The C3 Task Force within TERENA (Trans European Research and Academic Networks Association) consists of: Borka Jerman-Blazic jerman-blazic@ijs.si, Olle Jarnefors ojarnef@admin.kth.se, Peter Svanberg psv@nada.kth.se, Keld Simonsen keld@dkuug.dk.


Borka Jerman-Blazic is chair of the Laboratory for Open Systems and Networks at the Jozef Stefan Institute, Ljubljana, Slovenia. She teaches a postgraduate course on Telecommunication Services at the Faculty for Economics, University of Ljubljana. She is a member of the ARNES (Slovenian academic network) Steering Committee, a member of the TERENA Technical Committee and convenor of the TERENA WG on Internationalization. She is project leader of the C3 Task Force. She chairs the national standardization committee on information technologies (JTC1) and represents Slovenia in ISO JTC1 SC2, JTC1 SC22WG20 and CEN TC 304. Her research currently focuses on network applications and issues related to internationalization of software applications. She has been coordinator of the ex-Yugoslav part of the COSINE project and General Secretary of YUNAC, the ex-Yugoslav academic and research network. She is the author of more than 100 published scientific papers.



