Getting Ready for the Japanese Market
In the April '93 issue of the Forum Newsletter, Jan wrote about the general concerns involved in dealing with DBCS (Double-Byte Character Sets). The following article expands upon that discussion by addressing some background material necessary for localizing and internationalizing for the Japanese language. This discussion will be continued in future articles.

Here is the scenario: You've decided to commit to the Japanese market by internationalizing and/or localizing for Japan. You've heard horror stories like the one about localizing Lotus 1-2-3 into Japanese: that it cost US$5 million and a considerable amount of time and effort, not to mention pain. Well, it can truly be a nightmare, but the nightmare can be avoided if you understand what you are attempting to do.

The basic background information you need about the written Japanese language can be broken down into many topics. Some of them are the following:
Japanese output methods will not be discussed in this article, but this topic will be covered in a future issue. This article also does not cover the conversions from single-byte to double-byte code that non-internationalized programs will require in order to handle Japanese. The various hardware systems will also be discussed in future articles, along with other subjects.

Japanese Writing Systems
The first facts with which you must be acquainted concern the writing system used in Japanese. Japanese text requires a multibyte encoding on a computer, and the writing system includes four scripts. Characters from each script can be mixed within a single sample of text, and often within the same sentence! They are hiragana, katakana, kanji, and romaji.
Hiragana and katakana are the native Japanese, phonetically based scripts. Each character in hiragana can be mapped to a character in katakana. Together, there are approximately 208 characters in these two systems; each system has 46 basic characters, but vocalization (or voicing) marks expand the number in each to approximately 105.

In the average text, approximately 60% will be hiragana, 10% katakana, and 30% kanji. [1] (I will discuss kanji below.) The percentages vary according to the text. For example, while there may be a higher percentage of kanji in technical material involving the physical sciences, katakana is widely used in documents like software manuals because the computer field is less established. As a result, authors end up using many loan-words transliterated into katakana because there are no suitable terms expressed in kanji.

Hiragana and katakana serve different purposes. Hiragana, which can be considered the cursive set, is used to indicate syntax (the hiragana for wa following a word would indicate that the word is probably the subject of a sentence, for instance). It is also used when new terms have no equivalent in kanji, which is a common state of affairs in the fast-moving world of high technology, as I mentioned above. Hiragana is the first script that Japanese children learn in school. While text can be written entirely in hiragana, an educated reader would assume that such text was written by a child or by someone uneducated. In addition, because homonyms are extremely common in Japanese, a reader might take longer to work through hiragana-only text, since the reader would at times have to decide among the dozens of kanji that a particular hiragana character might represent.

Katakana, which can be considered the "printed" or block set, is used to indicate that a word is derived from a foreign language.
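The one-to-one correspondence between hiragana and katakana can be sketched in a few lines of Python. This is only an illustration, and it assumes Unicode code points, where the katakana block sits 0x60 positions above the hiragana block; the article itself does not prescribe Unicode, and the function names are hypothetical.

```python
# Assumption: Unicode code points. Hiragana U+3041..U+3096 line up with
# katakana U+30A1..U+30F6 at a constant offset of 0x60.
HIRAGANA_FIRST = 0x3041
HIRAGANA_LAST = 0x3096
KANA_OFFSET = 0x60


def hiragana_to_katakana(text):
    """Replace each hiragana character with its katakana counterpart."""
    out = []
    for ch in text:
        code = ord(ch)
        if HIRAGANA_FIRST <= code <= HIRAGANA_LAST:
            out.append(chr(code + KANA_OFFSET))
        else:
            out.append(ch)  # kanji, romaji and punctuation pass through
    return "".join(out)


def katakana_to_hiragana(text):
    """Replace each katakana character with its hiragana counterpart."""
    out = []
    for ch in text:
        code = ord(ch)
        if HIRAGANA_FIRST + KANA_OFFSET <= code <= HIRAGANA_LAST + KANA_OFFSET:
            out.append(chr(code - KANA_OFFSET))
        else:
            out.append(ch)
    return "".join(out)


print(hiragana_to_katakana("せっとあっぷ"))  # セットアップ
```

Characters outside the two ranges, such as kanji or the katakana long-vowel mark, are deliberately left untouched.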
Loan-words are numerous, and the Japanized version of a word can look strange to someone who speaks the language from which the word was borrowed. For example, the English words "setup diagram" become setto-appu-daiyaguramu. Katakana is also used in headlines, in telegrams, in advertising, and sometimes just to give emphasis to an idea. In some situations, such as a computer interface where there are many loan-words and words without kanji representations, text written entirely in katakana might be acceptable. However, in general it is unwise to use all katakana in text because it looks unnatural, and cheap, to a Japanese reader.

About 30% of the average non-technical Japanese text is written in kanji. [2] Kanji are ideographic characters borrowed from Chinese. Kanji are written using strokes. A single stroke is the mark made while a writing instrument remains on the paper. Once the instrument is lifted, the next mark made counts as the next stroke. Strokes are written in a predetermined order and can be straight lines, curves, or angles. Stroke counts of kanji range from one to dozens.

Each of the thousands of kanji can represent an individual word or complex sound and can be combined with others to form other words. The Japanese use about 7000 on a regular basis, while the Chinese use about 40,000. In many cases, the Japanese version of a Chinese character looks very different from its parent because the Japanese have simplified the stroke count. The typical Japanese word is made up of two kanji and a hiragana character.

Because there are so many of them, kanji pose special challenges for programmers. Because an 8-bit byte can distinguish only 256 characters, code sets for Japanese require more than 8 bits and contain characters that occupy two or more bytes. A wide character is 16 bits or larger, greater than the normal 7- or 8-bit byte. Despite the similarity of the names, display width and wide characters are not the same thing.
There are half-width characters and full-width characters. Generally, the ASCII characters used in the West are half-width, while most Japanese characters are full-width. The determination of half- versus full-width is based on whether a character occupies half or all of a square. Imagine that every full-width kanji or kana character is assigned the same amount of display space, much as in mono-spaced English fonts. Visualized this way, English characters appear to use half the display width of kanji.

Because of the limitations of ASCII, the characters first displayed on early Japanese computer monitors were half-width katakana. Although half-width katakana at first used only one byte and full-width characters used two bytes, this correspondence no longer holds true. Half-width katakana can be encoded as double-byte characters using EUC, Unicode or ISO 10646 encodings. However, no full-width characters can be encoded in a single byte. To some extent, with the emergence of advanced scalable font technologies, these distinctions are no longer relevant.

The fourth writing system is romaji (Roman), in which phonetic Japanese is represented by alphanumeric characters. Upper case, lower case, and punctuation are included and used in the ways familiar to a US reader. However, there are no letters with diacritical marks. Romaji are also used for expressions written as they appear in languages that use the Roman writing system, without converting them into loan-words expressed in katakana. A Westerner with no knowledge of Japanese can use romaji for help with the pronunciation of kanji and kana, and there are dictionaries that use romaji sorted in alphabetical order to aid in looking up kanji.

There are two ways to write romaji when transliterating Japanese into the Roman alphabet: the kunrei or Nipponsiki method and the Hepburn method. The main difference is the way in which some Japanese pronunciations are transliterated.
The sound represented by the Western "shi" is spelled "si" in the Nipponsiki method and "shi" in the Hepburn. (Nipponsiki would be spelled Nipponshiki using the Hepburn method.) The choice between these methods must be taken into consideration when planning sort ordering for romaji.

Japanese Character Set Standards
Unlike English, which has no character set per se because of the relatively minute number of characters in its alphabet, Japanese uses several. Unfortunately, there is no universally recognized character set standard like ASCII. There are about 10,000 characters in Japanese, which include the 1,945 Johyoh kanji, the set taught in the Japanese school system. The Johyoh kanji comprise the kanji required for basic literacy. They are also known as "non-electronic" characters.

In addition, there are the "electronic" character sets for use in computers and word processors. These were created to allow for the exchange of Japanese text between computers. The first electronic character set standard was created by the Japanese in 1978. The 1990 edition of the standard, Code of the Japanese Graphic Character Set for Information Interchange (JIS X 0208-1990), provides the basic "electronic set" of 6879 characters, including 6355 kanji. There are three different versions of this character set, each released in a different year, each containing a different number of characters, and each not 100% compatible with the others.

The kanji in the set are divided into two levels: JIS level 1, in which kanji are arranged by ON (Chinese) reading (or pronunciation), and JIS level 2, in which kanji are arranged by radical (or root ideograph) and number of strokes. Together they make up approximately 99% of all kanji in common use. There is also an extended character set (JIS X 0212-1990) of 6067 characters.

In addition, there are what are called "corporate character sets." These are derived from JIS X 0208-1990 and JIS X 0212-1990.
Corporate character sets are versions of the basic electronic set that were created by individual computer manufacturers. Predictably, these are also not 100% compatible with one another. KUTEN, which means "row and cell" (or literally, "ward and point"), is a machine-independent way of indexing, rather than encoding, the characters in JIS X 0208-1990, JIS X 0212-1990, and the corporate character sets derived from them.

Japanese Encoding Methods
Encoding is the method of mapping a character to a numeric value. In Japan, there is no single widely recognized encoding method. However, there are three methods in common use: JIS, Shift-JIS, and EUC.
JIS encoding, the most basic Japanese encoding method, is modal. In other words, it uses escape sequences of one or more characters to signal a change in mode. This change can be a shift between one- and two-byte modes, between character sets, or between different versions of the same character set.

Shift-JIS (abbreviated SJIS, also called MS Kanji) is the encoding method most used on Japanese PCs. In JIS, the ASCII and DBC codes overlap, so an escape sequence is needed to turn off the DBC code so that the two codes can coexist. In the SJIS encoding method, the DBC first-byte ranges and the ASCII characters are separated into different code spaces. SJIS does not require escape sequences (it is a non-modal encoding), and thus uses less space than modal and fixed-width encodings. Interestingly, because of its popularity for internal coding, Shift-JIS was used not only for Japanese PCs but also for KanjiTalk, the Apple Macintosh Japanese operating system.

The EUC standard mixes the ASCII, JIS X 0201, JIS X 0208 and JIS X 0212 character sets. EUC-JIS uses ASCII as Code set 0, JIS X 0208 as Code set 1, JIS X 0201 as Code set 2, and JIS X 0212 as Code set 3. This is also known as the EUC packed format encoding space.

There is also JIS-Roman, which is the Japanese equivalent of ASCII. Most terminals support either ASCII or JIS-Roman; most Japanese software supports JIS-Roman rather than ASCII. Terminals that support only JIS-Roman display the ASCII backslash as the JIS-Roman yen sign.

The three main encodings allow for a mixture of one- and two-byte characters. There are other encodings that allow for three- and four-byte characters as well. To shift between single- and double-byte character modes, escape sequences may be required. In addition, there are "wide characters." A wide character can range from 16 to 32 bits, while double-byte characters are, of course, 16 bits.
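The relationship between a KUTEN (row and cell) index and the three encodings can be sketched as follows. This is an illustrative sketch rather than a complete converter: it covers only two-byte JIS X 0208 characters, the function names are hypothetical, and the arithmetic is the commonly documented conversion rather than anything specified in this article. Python's built-in `iso2022_jp`, `shift_jis` and `euc_jp` codecs are used only as a cross-check.

```python
def kuten_to_jis(row, cell):
    """7-bit JIS bytes: row and cell each offset by 0x20.
    In a data stream these bytes must follow an escape sequence."""
    return bytes([row + 0x20, cell + 0x20])


def kuten_to_euc(row, cell):
    """EUC code set 1: the JIS bytes with the high bit set (offset 0xA0)."""
    return bytes([row + 0xA0, cell + 0xA0])


def kuten_to_sjis(row, cell):
    """Shift-JIS: two rows share each lead byte, and the trail byte
    skips 0x7F so that it never collides with that control code."""
    lead = (row + 1) // 2 + (0x80 if row <= 62 else 0xC0)
    if row % 2 == 1:
        trail = cell + 0x3F + (1 if cell >= 64 else 0)
    else:
        trail = cell + 0x9E
    return bytes([lead, trail])


# The kanji 漢 sits at row 20, cell 33 of JIS X 0208.
row, cell = 20, 33
print(kuten_to_jis(row, cell).hex())   # 3441
print(kuten_to_euc(row, cell).hex())   # b4c1
print(kuten_to_sjis(row, cell).hex())  # 8abf

# Cross-check against Python's codecs. The modal JIS stream wraps the
# two bytes in escape sequences; the other two encodings do not.
assert kuten_to_jis(row, cell) in "漢".encode("iso2022_jp")
assert "漢".encode("euc_jp") == kuten_to_euc(row, cell)
assert "漢".encode("shift_jis") == kuten_to_sjis(row, cell)
```

Note that the JIS bytes 0x34 0x41 are also the ASCII codes for "4A", which is why the modal encoding needs escape sequences to tell the two apart.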
Included in class libraries is a

Japanese Input Methods
Obviously, with such a large character base, the input of Japanese text requires some creativity. It would certainly be impractical and extremely inefficient to have a keyboard with hundreds, let alone thousands, of character keys. Therefore, a two-step process has been developed and is in general use.

The first step is to input hiragana or transcribed Japanese using an FEP (front end processor; Microsoft calls it the IME, or input method editor). This input is then converted to kanji in the form of candidate characters. Because of the large number of homonyms (there are often dozens of kanji with the same pronunciation or "reading"), the candidate characters are displayed and the user then selects the correct kanji. (When the user selects an incorrect kanji, this is the rough equivalent of a spelling error in English, which should be caught during proofreading.) The selected kanji is then placed in the text being processed. The FEP handles both of these steps. At this point, the type of keyboard hardware is unimportant.

The conversion of hiragana to kanji is handled by "conversion dictionaries" in a process similar to key-value lookup. These conversion dictionaries are sometimes customizable, perhaps allowing the user to have only the most frequently used kanji appear as candidate characters. In addition, the input can be in different units: single kanji, compound kanji (most Japanese words are composed of at least two kanji and one or more hiragana), and kanji phrases.

A method growing in popularity is pen-based input. No keyboard is needed; instead, you use hardware devices that let you enter handwritten characters. Since the Japanese are taught to write kanji and kana in very strict ways, most of one person's input is very similar to another's, and similar to what has been "taught" to the operating system of that device. This method allows the two-step front end conversion process to be skipped.
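The two-step conversion described above (look up candidates for a reading, then let the user choose) can be sketched as a key-value lookup. Everything here is hypothetical: the dictionary holds only two sample readings, and the `usage_counts` parameter merely stands in for the kind of customization a real FEP might offer.

```python
# Hypothetical, tiny conversion dictionary: hiragana reading -> kanji candidates.
CONVERSION_DICTIONARY = {
    "はし": ["橋", "箸", "端"],  # bridge, chopsticks, edge
    "かみ": ["紙", "神", "髪"],  # paper, god, hair
}


def candidates(reading, usage_counts=None):
    """Step 1: look up the kanji candidates for a hiragana reading.

    If usage_counts is given, more frequently chosen kanji come first.
    An unknown reading is returned unchanged as its own only candidate.
    """
    found = list(CONVERSION_DICTIONARY.get(reading, []))
    if usage_counts:
        # list.sort() is stable, so ties keep their dictionary order.
        found.sort(key=lambda kanji: -usage_counts.get(kanji, 0))
    return found or [reading]


def convert(reading, choice=0, usage_counts=None):
    """Step 2: place the candidate the user selected into the text."""
    return candidates(reading, usage_counts)[choice]


print(candidates("はし"))                # ['橋', '箸', '端']
print(candidates("はし", {"箸": 5}))     # ['箸', '橋', '端']
print(convert("かみ", choice=1))         # 神
```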
Additionally, one can input characters using the numeric pad by entering the code point value of the character. Predictably, this is less than efficient.

Laying Out Japanese Text
Japanese can be written in two ways: horizontally, from left to right, or vertically, with columns running from right to left. Technical documentation is usually set in the Western manner: from left to right, from top to bottom. Since vertical orientation may cause problems with Western software, it is fortunate that this writing style is acceptable. Meanwhile, most novels and popular reading are set in the traditional right to left, top to bottom vertical style.

Because of the differences in orientation, a kanji may require 90 degree rotation when it is laid out. Or, it may need to be placed in a different position in the "m-square." (Each character in Japanese may be visualized as being placed in a grid or matrix of squares, each of which is approximately the size of the letter "m.")

There are no delimiters for words in Japanese (there is not even a totally agreed-upon definition of what constitutes a word). However, Japanese does use commas, periods, dashes and parentheses in much the same way as English, and there are spaces between sentences. (The punctuation marks are rotated 90 degrees in vertically laid out text.) Otherwise, a sentence is an uninterrupted flow of characters. Not surprisingly, this situation poses challenges for programmers.