© 2008 SMP Marketing • ISSN 1420-3693 • www.localization.org
A Look at China’s New GB 18030 Character Set Standard

Tom Emerson, Basis Technology Corp.

In his article in the March 22, 2002, issue of The LISA Newsletter, Ken Lunde introduced the new Chinese national character set standard, GB 18030:2000. His piece raises several interesting (or at least important) questions concerning the implementation of GB 18030, including: What are the requirements for GB 18030 certification? How do you add support for GB 18030? Does your favorite application platform or programming language support GB 18030?


Products released in the People's Republic of China (PRC) after March 17, 2000, are considered (for the sake of GB 18030 compliance) to be "new products." This includes updates to existing products released after that date. Products released prior to March 17, 2000, are "historical products" and are exempt from the conformance requirement.

Any product (including upgrades) released between March 17, 2000, and August 31, 2001, must be updated to conform to the standard, and any product released on or after September 1, 2001, must be certified. Conformance requirements are discussed later in the article.

GB 18030 has several advantages over its predecessors, GBK and GB 2312:1980:

  1. It contains all of the Han characters defined in Unicode 3.1: those in the CJK Unified Ideographs block and in CJK Unified Ideographs Extensions A (in plane 0) and B (in plane 2). This ideograph coverage allows texts utilizing characters from Taiwan's CNS 11643 or Hong Kong's Supplementary Character Set (HKSCS) to be encoded. The latter is particularly important since Hong Kong and Macau (which reverted to Chinese control in 1997 and 1999, respectively) each use traditional form characters, many of which were not encoded in Unicode 1.1 and hence are not part of GBK.
  2. It contains all of the scripts used by the minority languages of China, including Yi, Mongolian, Tibetan, and Uyghur. Each of these scripts is in active use within its respective region, so a standard character encoding for them is needed to facilitate the storage of text and its interchange with Beijing.
  3. It can encode all 17 planes of Unicode without modifying the encoding scheme. This allows texts using virtually any of the world's scripts to be consistently encoded.

Before the adoption of GB 18030 you could not represent Western European texts using diacritics in either of the Chinese encodings. With GB 18030 not only can you support these, but also thousands of other characters found outside plane 0.
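As a rough illustration of this coverage, the following sketch uses the `gb18030` codec that ships with Python's standard library. The codec and the sample characters are choices made here for illustration and are not referenced in the article. It encodes one character from each of several parts of Unicode, checks that each survives a round trip, and reports how many bytes each one occupies.

```python
# Sample characters picked for illustration only.
samples = {
    "ASCII letter": "A",
    "Han character (also in GB 2312)": "\u4e2d",
    "Latin letter with diaeresis": "\u00f6",
    "Plane 2 ideograph (Extension B)": "\U00020000",
}


def gb18030_width(ch: str) -> int:
    """Return the number of bytes Python's gb18030 codec uses for one character."""
    return len(ch.encode("gb18030"))


for label, ch in samples.items():
    encoded = ch.encode("gb18030")
    # Decoding must give back exactly the character we started with.
    assert encoded.decode("gb18030") == ch
    print(f"{label}: U+{ord(ch):04X} -> {encoded.hex()} ({gb18030_width(ch)} bytes)")
```

The accented Latin letter and the plane 2 ideograph both come out as four-byte sequences, while the ASCII letter and the common Han character take one and two bytes.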

GB 18030 defines a multibyte character encoding, meaning each character is represented by a variable number of bytes: one, two, or four depending on the character. The complexity of multibyte string processing is well known amongst localization engineers. GB 18030 is particularly complex because you cannot tell from looking at a particular byte whether it is part of a two-byte or four-byte sequence.
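The ambiguity can be resolved by looking at the byte that follows a lead byte. The sketch below assumes the byte ranges from the published GB 18030 code layout (they are not spelled out in this article): single bytes 0x00-0x7F, two-byte sequences with a second byte in 0x40-0x7E or 0x80-0xFE, and four-byte sequences whose second and fourth bytes fall in 0x30-0x39. The function names are hypothetical and the validation is deliberately minimal.

```python
def gb18030_sequence_length(data: bytes, pos: int = 0) -> int:
    """Return 1, 2, or 4: the length of the GB 18030 sequence starting at data[pos]."""
    first = data[pos]
    if first <= 0x7F:
        return 1  # single-byte (ASCII) range
    if not 0x81 <= first <= 0xFE:
        raise ValueError(f"invalid lead byte 0x{first:02X} at offset {pos}")
    if pos + 1 >= len(data):
        raise ValueError("truncated sequence")
    second = data[pos + 1]
    if 0x30 <= second <= 0x39:
        # A digit-range second byte signals a four-byte sequence.
        if pos + 3 >= len(data):
            raise ValueError("truncated four-byte sequence")
        third, fourth = data[pos + 2], data[pos + 3]
        if not (0x81 <= third <= 0xFE and 0x30 <= fourth <= 0x39):
            raise ValueError(f"invalid four-byte sequence at offset {pos}")
        return 4
    if 0x40 <= second <= 0x7E or 0x80 <= second <= 0xFE:
        return 2
    raise ValueError(f"invalid second byte 0x{second:02X} at offset {pos + 1}")


def split_gb18030(data: bytes) -> list:
    """Split a GB 18030 byte string into one bytes object per character."""
    pieces = []
    pos = 0
    while pos < len(data):
        length = gb18030_sequence_length(data, pos)
        pieces.append(data[pos:pos + length])
        pos += length
    return pieces
```

Note that this only works when scanning forward from a known character boundary. Starting from an arbitrary offset in the middle of a buffer, a byte such as 0x35 could be an ASCII digit or the second or fourth byte of a four-byte sequence.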

Given the alignment between GB 18030 and Unicode, the cleanest way to support GB 18030 in your applications is to fully support Unicode 3.1, using UTF-16 as the internal character representation. This is the route that all of the major operating system and library developers have taken.

However, using UTF-16 does not eliminate the multibyte character problem. GB 18030 contains characters in plane 2 of Unicode, which must be encoded using surrogate pairs in UTF-16. Hence any character outside of the BMP still requires a longer sequence (two 16-bit code units, or four bytes instead of two), though these characters rarely occur except in personal names or classical texts. Alternatively, you can use UTF-32, giving you simpler string handling (since all characters are fixed-width) while sacrificing memory (since most characters will fit in two bytes). Very often the choice of representation will be based on what level of support for Unicode and GB 18030 is available on the platforms for which you develop.
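The trade-off can be seen with a small sketch in Python, whose standard codecs are used here purely for illustration. The helper name `utf16_code_units` is hypothetical. A BMP character occupies one 16-bit code unit in UTF-16, a plane 2 character occupies a surrogate pair, and UTF-32 spends four bytes on every character.

```python
def utf16_code_units(text: str) -> list:
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


bmp_char = "\u4e2d"          # a Han character in plane 0
plane2_char = "\U00020000"   # first code point of CJK Extension B, in plane 2

for ch in (bmp_char, plane2_char):
    units = utf16_code_units(ch)
    utf32_len = len(ch.encode("utf-32-be"))
    print(
        f"U+{ord(ch):04X}: UTF-16 units {[f'{u:04X}' for u in units]}, "
        f"UTF-32 bytes {utf32_len}"
    )
```

Code that indexes a UTF-16 buffer by code unit therefore still has to recognize values in the surrogate range 0xD800-0xDFFF and treat each pair as one character.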

Microsoft Windows XP and Microsoft Windows 2000 provide support for GB 18030 with a freely available support package. Both operating system releases have been approved by the Chinese government for sale in China. Windows NT 4.0 and earlier are exempt from the law as they were released before the March 17, 2000, cutoff. Windows Millennium has not been approved and may not be available for sale in China.

Microsoft did not create a system locale for GB 18030, but a code page identifier has been defined to facilitate transcoding to and from Unicode.

Apple Mac OS X version 10.1 includes support in the Text Encoding Converter for converting between GB 18030 and Unicode. Mac OS X uses Unicode (UTF-16) as the internal text representation. Apple is expected to release full GB 18030-2000 support soon.

Sun Microsystems fully supports GB 18030 in Solaris 8 2/02, which has been approved by Beijing. Support exists at the OS level for the input and display of GB 18030 encoded text and for its conversion to and from Unicode. Many applications will automatically benefit from this support, though some may need modification.

Oracle9i supports GB 18030 for database and national character sets. Whether to store textual data in GB 18030 or in another encoding (such as UTF-16) is a complex question beyond the scope of this article. You need to consider the applications and programming languages that will be interfacing with the data. Oracle's Oracle9i Globalization Support Guide provides a thorough overview of the issues.

Developers using C and C++ have several options available: the most recent versions of Basis Technology's commercial Rosette Core Library for Unicode, IBM's open source International Components for Unicode (ICU), and the GNU Project's libiconv library all support GB 18030. Recent versions of glibc include support for GB 18030 through the iconv API, so Linux developers may have GB 18030 support out of the box.
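Whichever conversion library you choose, input read from a file or socket rarely arrives aligned to character boundaries, so a buffer may end in the middle of a two-byte or four-byte sequence. The following sketch shows the usual remedy, a stateful decoder that carries incomplete sequences over to the next chunk. It uses the incremental decoder from Python's standard `codecs` module as a stand-in for the C libraries named above; the helper name `decode_chunks` is hypothetical.

```python
import codecs


def decode_chunks(chunks, encoding: str = "gb18030") -> str:
    """Decode an iterable of byte chunks that may split multibyte sequences."""
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = []
    for chunk in chunks:
        # Bytes of an incomplete trailing sequence are held back by the decoder
        # and combined with the start of the next chunk.
        parts.append(decoder.decode(chunk))
    # final=True makes the decoder raise an error if input ended mid-sequence.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


text = "GB 18030: \u4e2d\U00020000"
data = text.encode("gb18030")

# Cut the stream one byte into the final four-byte sequence.
chunks = [data[:-3], data[-3:]]
assert decode_chunks(chunks) == text
```

Decoding each chunk independently with `bytes.decode` would fail on the first chunk here, because it ends with a lone lead byte.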

Sun added support for GB 18030 in the Java 2 platform version 1.4. Regrettably, Java treats each 16-bit char as a complete character rather than as a UTF-16 code unit, so characters outside the BMP are not handled properly and full support of GB 18030 may be lacking. IBM's International Components for Unicode for Java (ICU4J) provides support for UTF-16 and UTF-32 strings, so a Java application can get full support for GB 18030 by using ICU4J for string handling.

For new products there are three classes of conformance: A+, A, and B:

A+  The product supports the input, output, edit, and display of all characters in GB 18030, including the ethnic minority scripts: Mongolian, Tibetan, Yi, and Uyghur. The product is considered to be in "full conformity with GB 18030."
A   The product supports the input, output, edit, and display of all characters in GB 18030, excluding the ethnic minority scripts. The product must not corrupt the ethnic minority characters even if it cannot display them. Such products are considered to be in "conformity with GB 18030."
B   Products that were released between March 17, 2000, and August 31, 2001, that satisfy the requirements for a Class A rating are given Class B and are considered to be in "basic conformity with GB 18030."

Historical products are rated as Class C and are considered to be "non-conforming."

For information on submitting a product for conformance testing, write to:

Standard Conformity Testing Center for Information Products
#1 Andingmen Dong Da Jie
Beijing, China
Tel: 84029573 or 84029792
Fax: 64007681


Tom Emerson is a Senior Computational Linguist at Basis Technology Corp., a provider of globalization solutions based in Cambridge, Massachusetts. He develops software for Chinese, German, and Korean language analysis. He can be reached at tree@basistech.com.



