Some Issues Associated with Handling Double-Byte Character Sets
Part of the challenge of localizing for a writing system depends on whether its characters are single-byte or multi-byte. That is, one must consider whether the format is the single-byte (8-bit) character format used in the ASCII-based environments of European and English-speaking locales, or a multi-byte format, in which a single character, European or Asian, is represented by one, two, or more bytes in a code set.

To further complicate matters, writing systems differ in size and complexity. A writing system may be considered "simple," as with Roman, Greek, or Cyrillic. Alternatively, it may be large but "non-complex." The Japanese, PRC Chinese, Taiwanese, and Korean systems, with approximately 6,900, 7,500, 13,800, and 8,200 official characters, respectively, fall into this large, non-complex classification. Hebrew, Arabic, and Southeast Asian languages are considered complex because, although the actual character sets may be relatively small, the writing systems are bidirectional and/or contextual.

For instance, although Hebrew is a right-to-left writing system, numbers are written from left to right, so a sentence containing numbers will have script running in both directions. Quotations from a left-to-right writing system also keep their direction within Hebrew text. In these situations, a split cursor might be shown in the display window to indicate where the next entry would be placed in either direction. Contextual writing systems are ones like Arabic, in which the glyph of a character may take up to four different shapes, depending on whether it stands alone or at the beginning, in the middle, or at the end of a word.

All of the above pose challenges for a programmer struggling to write the best possible software for a particular market. Input methods must be inventive when dealing with a writing system that contains thousands of characters: even a keyboard on which each key represented ten characters would require hundreds of keys.
The input of characters into a text processing system for these character sets therefore requires some ingenuity, and work is still being done on this front. One method for Japanese that has been relatively popular is pronunciation-based; the front-end processor works like a spell checker in that it produces a list of possible selections. One types the romanized representation of a character (romaji), which is converted to a kana character. This shows up in one of three windows on-screen. After the kana is entered, a list of corresponding kanji, that is, homonyms, is displayed. Since there are many, many homonyms in Japanese, the next step is to pick the correct kanji from a candidate window, which may show dozens of possible kanji, all having the same pronunciation. The selected kanji is then transferred to the main working window at the insertion point. This is a very tedious process, but it was the best available for years.

Recently, in-line conversion has become popular because it eliminates one of the windows; the conversion and selection appear in the main window, in the text being processed. However, the basic process is still the same. For Taiwanese Chinese, which recognizes about 13,800 official characters, even this method is arduous, and research is being done that may result in a mix of several input methods, including voice recognition.

Because of the complexity of these writing systems, much thought must also be given to the methods employed for collating. For instance, the ideographic languages are collated using a) radicals (basic "root" forms that would correspond roughly to European morphemes); b) number of strokes (marks added to the radical in a specified order to make differentiated characters); and c) phonetic sequences using romanized spelling. In dictionaries, characters have collating numbers attached.
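The radical-then-stroke ordering described above can be sketched in Python as a sort on a composite key. This is only an illustration under stated assumptions: the small table stands in for a dictionary's collating numbers, listing a handful of kanji with their Kangxi radical numbers and total stroke counts, and the names `RADICAL_STROKE`, `collation_key`, and `collate` are hypothetical.

```python
# Illustrative table (an assumption, not a real dictionary file):
# character -> (Kangxi radical number, total stroke count).
RADICAL_STROKE = {
    "森": (75, 12),
    "火": (86, 4),
    "休": (9, 6),
    "木": (75, 4),
    "人": (9, 2),
    "林": (75, 8),
    "水": (85, 4),
}


def collation_key(ch):
    """Return (radical, strokes): group by radical first, then by stroke count."""
    return RADICAL_STROKE[ch]


def collate(chars):
    """Sort characters by radical number, breaking ties by stroke count."""
    return sorted(chars, key=collation_key)


print("".join(collate(RADICAL_STROKE)))  # 人休木林森水火
```

In this sketch a character that is missing from the table raises a `KeyError`, so the table has to cover every character that may need to be sorted.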
However, when user-defined characters must be accepted, the dictionaries must also be modified to allow these to be collated as well.

The Japanese language uses phonetic units called hiragana and katakana (together, the two are called kana) in addition to kanji and romaji. Kana are the native Japanese additions to a modified Chinese system of characters. They are used for a variety of purposes: to indicate parts of speech, to show that a word is a "loan-word" from a foreign language, and so on. As such, they cannot be eliminated from the language. Kana are sorted in a fashion similar to alphabetical order using the gojuonzu, which means "fifty sounds." Within each of the hiragana and katakana sets, there is a standard sequence of sounds that is analogous to alphabetical order, and collating follows this sequence. There are other rules as well: hiragana characters precede their katakana equivalents; shorter words precede longer ones, as in English (e.g., "base" before "baseball"); and unaccented kana (those with no vocalization mark, such as one indicating a long vowel sound) come before those with accent marks.

Other character sets raise different collating issues. The single-case nature of Arabic (it has no upper or lower case) simplifies some matters, and the connecting character tatweel is not collated because it has no significance as a word. However, there are character codes for ligatures such as lam-alef, which do complicate the collating process. Arabic words are first sorted in code order, excluding Arabic vowels. Then groups of words with matching consonants are sorted in order, this time with the vowel characters included.

Hebrew is also a single-case language, which eliminates the problems associated with upper- and lower-case sorting. However, there are three Hebrew character sets in use, and all three contain both Latin and Hebrew characters.
Collating rules must therefore be available for each Hebrew character set. The Latin characters should be collated using the rules for their parent sets. For example, DEC collates Latin characters in the DEC Hebrew 7-bit set by ASCII sequence, whereas for Latin characters in the DEC Hebrew 8-bit set, the DEC Multinational collating sequence is employed. Characters are collated alphabetically within each Hebrew set; Latin characters always come first.

These languages do have some characteristics in common. For instance, displays of many double-byte characters must be larger than those to which a European reader is normally accustomed. The complexity of the characters demands a larger point size for sheer legibility. A Chinese character with 20 or more strokes might closely resemble another character except for the placement of a single stroke; if that stroke is unclear because it is too small, the character, and the word that contains it, may be misread.

Another issue is the lack of delimiters in some of these languages. For example, there are no spaces between words in Japanese. Obviously, this makes character wrapping a much more interesting situation for a programmer. Since there is no hyphenation in Japanese, it is important that words, which typically consist of two or perhaps three kanji, not be broken by inappropriate character wrapping. Especially in a mixed environment, where part of a multi-byte character might contain the same code as a single-byte terminator or delimiter, all parsing of the input stream should be character-based rather than byte-based.

One method of limiting the parsing problems described above would be to globalize all programs using a double-byte environment. This would eliminate the problems generated by mixed environments and would also preclude the headaches entailed in converting a program from a single-byte format to a double-byte format when that software is being localized for Asian markets.
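The byte-based parsing hazard described above can be made concrete with a short Python sketch. It assumes, purely for illustration, that the text is stored in cp932 (Microsoft's variant of Shift-JIS), a mixed single- and double-byte code set in which the second byte of the kanji 表 is 0x5C, the same value as the single-byte backslash. The backslash as a field delimiter and the function names are likewise assumptions for the example.

```python
# Assumption: text is stored in cp932, a mixed single-/double-byte code set.
# The kanji "表" encodes as 0x95 0x5C; its second byte equals the backslash.
ENCODING = "cp932"
DELIMITER = "\\"


def split_bytewise(raw):
    """Naive byte-based split: can cut a double-byte character in half."""
    return raw.split(DELIMITER.encode(ENCODING))


def split_charwise(raw):
    """Character-based split: decode to characters, then find the delimiter."""
    return raw.decode(ENCODING).split(DELIMITER)


raw = "表\\あ".encode(ENCODING)  # two fields, "表" and "あ", one delimiter

print(split_bytewise(raw))  # three fragments; the first kanji is broken
print(split_charwise(raw))  # ['表', 'あ']
```

The byte-based version finds a "delimiter" inside the first kanji and returns three fragments, while the character-based version returns the two intended fields. A uniform double-byte environment of the kind proposed above avoids this class of error when text is processed in two-byte units.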
The strongest objection to such an approach from Anglocentric developers is that it would require too much storage space: it would double the amount of space required for every program. However, the ability to mix languages without regard to how many bits are involved offers the opportunity to advance the globalization of information interchange. This ability is becoming widely recognized as being worth the price of doubled storage space. The result is projects like the Unicode project, which is "...the proposal for a fixed-width multi-byte, multilingual character encoding," as described by Joseph Becker of Xerox. But that is another article...!