© 2010 SMP Marketing • ISSN 1420-3693 • www.localization.org



New Character Sets for Asia

Ken Lunde, Adobe Systems

A truly magnificent thing has happened. The software industry has finally made the move to Unicode, and all major operating system, application, and font developers have embraced this standard. With such wonderful news, what else is there to consider in terms of character set standards and their encodings? In a word, plenty! While it is true that other emerging character set standards often induce a change to Unicode for compatibility purposes (this is considered a good thing), they often pose other issues, either legal or business-related.


Three emerging character set standards come to mind. They are GB 18030-2000 (Mainland China), Hong Kong SCS (Hong Kong), and JIS X 0213:2000 (Japan). Let's consider each one briefly.

In 2000, Mainland China published the national (and compulsory) standard GB 18030-2000, which is an extension of GBK (which itself is an extension of GB 2312-80), and defines 28,468 characters. Because the standard is compulsory, software sold into Mainland China must conform to it. Both GB 2312-80 and GBK fit into the framework of mixed one- and two-byte encodings. GB 18030-2000 is best thought of as GBK with "CJK Unified Ideographs Extension A" (6,582 Chinese characters) added. Because there was no more room left in GBK encoding, a new encoding that uses up to four bytes to represent each character was developed. The one- and two-byte portions of this new encoding are identical to GBK encoding in terms of character allocation (with a couple of minor exceptions), but a new four-byte region has been added. This four-byte region extends from 0x81308130 to 0xFE39FE39, which provides enough code points to cover all 17 planes of Unicode. This is, in fact, by design, and allows GB 18030-2000 to be forward-compatible with Unicode. So, as Unicode adds new characters to the BMP or Supplementary Planes, such as with Unicode Version 3.1 or 3.2, there are equivalent code points already available for them in the encoding of GB 18030-2000, specifically in the four-byte region.
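The capacity claim for the four-byte region can be checked with a little arithmetic. The sketch below is only an illustration: the byte ranges come from the endpoints quoted above, while the starting point 0x90308130 for the Supplementary Planes and the purely linear mapping are assumptions that are not stated in this article. The function names are hypothetical.

```python
# Byte ranges of the GB 18030 four-byte region, derived from the endpoints
# quoted in the text (0x81308130 through 0xFE39FE39).
LEAD = range(0x81, 0xFF)   # bytes 1 and 3: 0x81..0xFE, 126 values
DIGIT = range(0x30, 0x3A)  # bytes 2 and 4: 0x30..0x39, 10 values

UNICODE_CODE_POINTS = 17 * 0x10000  # 17 planes of 65,536 code points each


def four_byte_capacity() -> int:
    """Number of code points available in the four-byte region."""
    return len(LEAD) * len(DIGIT) * len(LEAD) * len(DIGIT)


def supplementary_to_gb18030(code_point: int) -> bytes:
    """Map U+10000..U+10FFFF onto four bytes.

    Assumption (not from the article): the Supplementary Planes are laid
    out linearly, starting at 0x90308130.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("expected a Supplementary Plane code point")
    # Linear index within the four-byte region; 12600 = 10 * 126 * 10.
    index = (0x90 - 0x81) * 12600 + (code_point - 0x10000)
    b1, rest = divmod(index, 12600)
    b2, rest = divmod(rest, 1260)
    b3, b4 = divmod(rest, 10)
    return bytes([0x81 + b1, 0x30 + b2, 0x81 + b3, 0x30 + b4])


print(four_byte_capacity())   # 1587600
print(UNICODE_CODE_POINTS)    # 1114112
print(supplementary_to_gb18030(0x10000).hex())  # 90308130
```

Because 1,587,600 exceeds 1,114,112, the region has room for every Unicode code point. The result can also be compared with Python's built-in `gb18030` codec, for example `chr(0x10000).encode("gb18030")`.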

In order to accommodate Chinese characters that are specific to Hong Kong, the Hong Kong government first developed GCCS (Government Chinese Character Set) in the mid-1990s, then further refined it as SCS (Supplementary Character Set) close to the year 2000. While Hong Kong SCS support is not mandated in order to sell software in Hong Kong, selling software to the Hong Kong government requires its support. Hong Kong SCS is based on Big Five, and adds 4,817 characters, for 18,312 characters in total. As with GB 18030-2000, compatibility with Unicode is important. Unicode Version 3.1 now includes full support for Hong Kong SCS through its "CJK Unified Ideographs Extension B" (42,711 Chinese characters). While the Big Five representation of Hong Kong SCS still fits within the one- and two-byte encoding framework, its Unicode representation now requires four bytes for each character encoded in Extension B, regardless of which UTF encoding is used. Clearly, if you plan to sell software to the Hong Kong government, supporting Hong Kong SCS is a requirement. Selling to the rest of Hong Kong is still possible in the context of Big Five.
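The four-byte point about Extension B is easy to confirm. The following is a minimal sketch, assuming U+20000 (the first code point of the Extension B block) as a sample character; the helper name `encoded_lengths` is hypothetical.

```python
def encoded_lengths(ch: str) -> dict:
    """Return the byte length of one character in each UTF encoding form.

    The "-be" codec variants are used so that no byte order mark is
    added to the output.
    """
    return {name: len(ch.encode(name))
            for name in ("utf-8", "utf-16-be", "utf-32-be")}


# U+20000 lies outside the BMP, so UTF-16 needs a surrogate pair.
print(encoded_lengths("\U00020000"))
# {'utf-8': 4, 'utf-16-be': 4, 'utf-32-be': 4}

# A BMP ideograph such as U+4E00, for comparison.
print(encoded_lengths("\u4e00"))
# {'utf-8': 3, 'utf-16-be': 2, 'utf-32-be': 4}
```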

JIS X 0213:2000 was designed as an extension of JIS X 0208:1997, but also as a replacement for JIS X 0212-1990. JIS X 0208:1997 is considered the most basic Japanese character set standard, and enumerates 6,879 characters, 6,355 of which are kanji (Chinese characters). JIS X 0212-1990 was meant to be an extension of JIS X 0208:1997, but it never gained enough popularity, although it did manage to get into Unicode during its earliest stages. Thus, JIS X 0213:2000 was born, which adds 4,344 characters, and fits within the framework of Shift-JIS encoding. Barely. It is not yet clear whether demand for JIS X 0213:2000 is high enough to force software developers to embrace this standard. Apple was the first company to embrace it, by providing full support in Mac OS Version 10.1.

Now for the character set details. Of its 4,344 characters, 1,249 are kanji in JIS Level 3, 2,436 are kanji in JIS Level 4, and the remaining 659 are symbols. In the context of Shift-JIS encoding, JIS X 0208:1997 and JIS X 0213:2000 combine to almost fill up the Shift-JIS encoding space. Shift-JIS encoding, by definition, contains 11,280 two-byte code points. JIS X 0208:1997 has 6,879 characters, and adding the 4,344 characters of JIS X 0213:2000 brings the total to 11,223. This leaves only 57 unassigned two-byte code points!

JIS X 0213:2000 obliterates various vendor-defined extensions to JIS X 0208:1997, such as those made by IBM, Fujitsu, and Apple. It also consumes the entire user-defined range, which affects other developers. The so-called "NEC Row 13" is now considered part of JIS X 0213:2000, except for a handful of characters that duplicate JIS X 0208:1997 characters. In the context of Unicode, full support for JIS X 0213:2000 is provided in Version 3.2.

Is JIS X 0213:2000 support necessary for success in the Japanese market? No. Or, at least not yet. Supporting JIS X 0208:1997 is still sufficient for the Japanese market. Market demand may eventually force the need for JIS X 0213:2000 in software products.
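The Shift-JIS figures in the JIS X 0213:2000 discussion can be re-derived in a few lines. This is a sketch only: the character counts are taken from the article, whereas the lead-byte and trail-byte ranges are the commonly documented Shift-JIS ranges and are an assumption here, since the article does not list them.

```python
# Assumed Shift-JIS two-byte structure (not given in the article):
# lead bytes 0x81..0x9F and 0xE0..0xFC, trail bytes 0x40..0x7E and 0x80..0xFC.
LEAD_BYTES = list(range(0x81, 0xA0)) + list(range(0xE0, 0xFD))   # 31 + 29
TRAIL_BYTES = list(range(0x40, 0x7F)) + list(range(0x80, 0xFD))  # 63 + 125

# Character counts quoted in the article.
JIS_X_0208 = 6879
LEVEL_3_KANJI = 1249
LEVEL_4_KANJI = 2436
SYMBOLS = 659

two_byte_code_points = len(LEAD_BYTES) * len(TRAIL_BYTES)
jis_x_0213 = LEVEL_3_KANJI + LEVEL_4_KANJI + SYMBOLS
combined = JIS_X_0208 + jis_x_0213
unassigned = two_byte_code_points - combined

print(two_byte_code_points)  # 11280
print(jis_x_0213)            # 4344
print(combined)              # 11223
print(unassigned)            # 57
```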

As this short article demonstrates, character sets continue to be developed in the CJKV locales. It is also important to understand that the national bodies that develop these character sets strive to incorporate them into future versions of Unicode. Compatibility with Unicode is clearly paramount, primarily because it is the closest thing we have to a universal character set.


Ken Lunde is a Senior Computer Scientist at Adobe Systems, and is the author of "CJKV Information Processing" (O'Reilly, 1999). He can be reached at lunde@adobe.com.



