A Look at China’s New GB 18030 Character Set Standard
In his article in the March 22, 2002, issue of The LISA Newsletter, Ken Lunde introduced the new Chinese national character set standard, GB 18030:2000. His piece raises several interesting (or at least important) questions concerning the implementation of GB 18030, including: What are the requirements for GB 18030 certification? How do you add support for GB 18030? Does your favorite application platform or programming language support GB 18030?

Products released in the People's Republic of China (PRC) after March 17, 2000, are considered (for the sake of GB 18030 compliance) to be "new products." This includes updates to existing products released after that date. Products released prior to March 17, 2000, are "historical products" and are exempt from the conformance requirement. Any product (including upgrades) released between March 17, 2000, and August 31, 2001, must be updated to conform to the standard, and any product released on or after September 1, 2001, must be certified. Conformance requirements are discussed later in the article.

GB 18030 has several advantages over its predecessors, GBK and GB 2312:1980:
Before the adoption of GB 18030, you could not represent Western European texts using diacritics in either of the earlier Chinese encodings. With GB 18030 you can represent not only these but also thousands of other characters found outside Unicode plane 0, the Basic Multilingual Plane (BMP).

GB 18030 defines a multibyte character encoding, meaning each character is represented by a variable number of bytes: one, two, or four, depending on the character. The complexity of multibyte string processing is well known among localization engineers. GB 18030 is particularly complex because you cannot tell from looking at a single byte whether it is part of a two-byte or a four-byte sequence; you must also examine the byte that follows it.

Given the alignment between GB 18030 and Unicode, the cleanest way to support GB 18030 in your applications is to fully support Unicode 3.1, using UTF-16 as the internal character representation. This is the route that all of the major operating system and library developers have taken. However, using UTF-16 does not eliminate the variable-width character problem. GB 18030 contains characters in plane 2 of Unicode, which must be encoded using surrogate pairs in UTF-16. Hence any character outside the BMP still requires two 16-bit code units (four bytes instead of two), though these characters rarely occur except in personal names or classical texts. Alternatively, you can use UTF-32, which gives you simpler string handling (since all characters are fixed-width) at the cost of memory (since most characters fit in two bytes). Very often the choice of representation will depend on what level of support for Unicode and GB 18030 is available on the platforms for which you develop.

Microsoft Windows XP and Microsoft Windows 2000 provide support for GB 18030 through a freely available support package. Both operating system releases have been approved by the Chinese government for sale in China. Windows NT 4.0 and earlier are exempt from the law as they were released before the February 2000 limit.
Windows Millennium has not been approved and may not be available for sale in China. Microsoft did not create a system locale for GB 18030, but a code page identifier has been defined to facilitate transcoding to and from Unicode.

Apple Mac OS X version 10.1 includes support in the Text Encoding Converter for converting between GB 18030 and Unicode. Mac OS X uses Unicode (UTF-16) as its internal text representation. Apple is expected to release a full GB 18030-2000 soon.

Sun Microsystems fully supports GB 18030 in Solaris 8 2/02, which has been approved by Beijing. Support exists at the OS level for the input, display, and conversion of GB 18030 encoded text to and from Unicode. Many applications will automatically benefit from this support, though some may need modification.

Oracle9i supports GB 18030 for database and national character sets. Whether to store textual data in GB 18030 or in another encoding (such as UTF-16) is a complex question beyond the scope of this article. You need to consider the applications and programming languages that will interface with the data. Oracle's Oracle9i Globalization Support Guide provides a thorough overview of the issues.

Developers using C and C++ have several options: the most recent versions of Basis Technology's commercial Rosette Core Library for Unicode, IBM's open source International Components for Unicode (ICU), and the GNU Project's libiconv library all support GB 18030. Recent versions of glibc include support for GB 18030 through the iconv API, so Linux developers may have GB 18030 support out of the box.

Sun added support for GB 18030 in the Java 2 platform version 1.4. Regrettably, Java 1.4 does not fully treat its 16-bit char values as UTF-16 (it has no notion of a surrogate pair as a single character), so full support of GB 18030 may be lacking. IBM's International Components for Unicode for Java (ICU4J) provides support for UTF-16 and UTF-32 strings, so a Java application can get full support for GB 18030 by using ICU4J for string handling.
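To make the variable-width behavior discussed earlier more concrete, here is a minimal Python sketch. Python is not one of the platforms surveyed in this article; it is used only because its standard library ships a `gb18030` codec. The helper names `gb18030_sequence_length` and `split_gb18030` are hypothetical, and the byte ranges they check are an assumption based on the commonly documented layout of the encoding, not a quotation from the standard. The sketch shows why a lead byte alone is not enough to determine a character's length, and compares the storage cost of the same characters in GB 18030, UTF-16, and UTF-32.

```python
def gb18030_sequence_length(data: bytes, i: int) -> int:
    """Return the byte length of the GB 18030 character starting at data[i].

    Assumed layout: a single byte for 0x00-0x7F; otherwise a lead byte in
    0x81-0xFE followed by either a byte in 0x30-0x39 (four-byte form) or a
    byte in 0x40-0x7E / 0x80-0xFE (two-byte form).
    """
    lead = data[i]
    if lead <= 0x7F:
        return 1
    if not 0x81 <= lead <= 0xFE or i + 1 >= len(data):
        raise ValueError(f"invalid or truncated sequence at offset {i}")
    # The lead byte is ambiguous; only the second byte settles the length.
    second = data[i + 1]
    if 0x30 <= second <= 0x39:
        return 4
    if 0x40 <= second <= 0xFE and second != 0x7F:
        return 2
    raise ValueError(f"invalid second byte at offset {i + 1}")


def split_gb18030(data: bytes) -> list:
    """Split a GB 18030 byte string into one bytes object per character."""
    chunks = []
    i = 0
    while i < len(data):
        n = gb18030_sequence_length(data, i)
        if i + n > len(data):
            raise ValueError(f"truncated sequence at offset {i}")
        chunks.append(data[i:i + n])
        i += n
    return chunks


# Sample text: ASCII "A", Latin n with tilde (U+00F1), a common Han
# character (U+4E2D), and a plane 2 Han character (U+20000).
text = "A\u00f1\u4e2d\U00020000"
encoded = text.encode("gb18030")

chunks = split_gb18030(encoded)
gb_lengths = [len(c) for c in chunks]
utf16_lengths = [len(ch.encode("utf-16-be")) for ch in text]
utf32_lengths = [len(ch.encode("utf-32-be")) for ch in text]

for ch, gb, u16, u32 in zip(text, gb_lengths, utf16_lengths, utf32_lengths):
    print(f"U+{ord(ch):04X}: GB 18030 {gb} bytes, "
          f"UTF-16 {u16} bytes, UTF-32 {u32} bytes")
```

With these four sample characters, the plane 2 character takes four bytes in every one of the three encodings, while the BMP characters take one, two, or four bytes in GB 18030, two in UTF-16, and four in UTF-32. Production code should rely on a tested converter such as ICU or iconv instead of hand-written range checks like these.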
For new products there are three classes of conformance: A+, A, and B:
Historical products are rated as Class C and are considered to be "non-conforming." For information on submitting a product for conformance testing, write to: Standard Conformity Testing Center for Information Products
is a Senior Computational Linguist at Basis Technology Corp., a provider of globalization solutions based in Cambridge, Massachusetts. He develops software for Chinese, German, and Korean language analysis. He can be reached at tree@basistech.com.