How Unicode is Conquering the World
The transition strategy to Unicode
Unicode already surrounds most computer users, whether they realize it or not. François Yergeau provides a good look at the present and at how users can go about moving toward fuller implementation of Unicode.
The localization industry has long suffered at the hands of the character set problem. Any localization specialist who has strayed away from Western European languages on a single computing platform has been bitten by the complexities of this hairy subject, and by the related subject of fonts.
The problem is not new. Its sources date back to the dawn of the computer age: the need and desire to get computers to process text, the existence of numerous languages, and the creativity of computer scientists and engineers. The result has been, as is well known, a plethora of text encoding schemes: the Babel tower of languages multiplied by the variety of computing platforms, compounded by the rapid evolution of the latter. And the result of that, in turn, is a lot of problems in data exchange, data durability, software porting and, of course, localization.

A little history
Problems call for solutions. In the case at hand, people started suffering enough to begin devising a good solution in the 1980s. This is when the Unicode effort for a Universal Character Set started. 1991 saw the publication of Unicode 1.0 [1]. In 1993, the Unicode Consortium and the ISO sub-committee on character sets merged their efforts, resulting in the publication of Unicode 1.1 and ISO/IEC 10646:1993. Since then, new versions have been published periodically (3.1 is the latest [2], aligned with ISO/IEC 10646:2000 [3]), each one adding new characters so as to get closer to the goal of a character set that encompasses all of the world’s written languages. Ten years after version 1.0, Unicode is well on its way to conquering the world.
Knowingly or not, most computer users have been exposed to Unicode, use or generate data in Unicode, and, even when legacy encodings* are involved, often interact with the data through Unicode-based software.
Case in point: Microsoft was an early member of the Unicode Consortium as well as an early adopter. Windows NT, initially released in 1993, is entirely Unicode-based (as are of course its descendants Windows 2000 and the forthcoming Windows XP). The Office suite has been slower to adopt Unicode, but many of its components (notably Word) have been Unicode-based since the 97 release, and the others have been or will soon be made Unicode-based. Many other major Microsoft applications (Internet Explorer, SQL Server, etc.) have also adopted the standard. Many users today are therefore literally bathed in Unicode, often without realizing it. Microsoft has stopped designing code pages for newly supported languages; these languages are supported only through Unicode. This alone clearly indicates that those in the localization industry who have not learned to deal with Unicode will need to do so shortly!
One of the design principles of Unicode was Han Unification. This principle says that a given Han character (Chinese, Japanese, Korean or Vietnamese) should have only one code point in Unicode, even though two or more languages use it, possibly with some variation in shape details. If the shape differences are so large that readers are unlikely to recognize the character, however, each distinct variant gets its own code point. This is very much analogous to how we have never had separate code points for roman, italic, serif or sans-serif A (regardless of language), but it runs counter to the long-standing habit of having separate character sets, and therefore separate code points, for each Asian language. This policy has generated a significant amount of opposition to Unicode in Asia, especially in Japan.
Much of that pushback, however, tends to dissolve when Japanese opponents learn that their beloved JustSystem Ichitaro word processor, a best seller in Japan, has been Unicode-based for about five years. Despite such evidence, the belief that Unicode is inadequate for Asian languages is still very much alive; see for instance the recent diatribe at http://www.hastingsresearch.com/net/04-unicode-limitations.shtml (but don’t miss the response from one of Unicode’s technical directors at http://slashdot.org/article.pl?sid=01/06/06/0132203).
Of course, JustSystem and Microsoft are far from alone in having adopted Unicode. Every major database supports it in one form or another; Web browsers have supported it for some time and the major ones are now based on it; a serious Linux internationalization effort is under way (see http://www.li18nux.org/), in good part based on Unicode. In fact, almost all the household names in the software industry have something to do with Unicode today. For an unfortunately incomplete list, see http://www.unicode.org/unicode/onlinedat/products.html.

Reference Processing Model
But just how does one go about applying Unicode today, in this period of transition when data and programs constitute a mixed bag of Unicode and non-Unicode? The answer lies in good part in a notion called the Reference Processing Model, which in turn has its roots in the mid-nineties effort to internationalize the World Wide Web.
Back then, the Web revolved entirely around HTML, but HTML was defined to be based on the ISO/IEC 8859-1 (a.k.a. ISO Latin-1) character set. In principle, this denied Web access to any language not supported by this very limited encoding, i.e. to most languages. Things had to move, formally or not, and they did. Problems and ambiguities cropped up when people used other encodings in HTML. For instance, HTML defines the Numeric Character Reference (NCR) ‘&#233;’ as an alternate representation of character number 233. But is it number 233 in the encoding used by the page, or in the official ISO Latin-1? Opinions diverged and interoperability suffered. The Reference Processing Model, first embodied in RFC 2070 [4], was formalized to deal with these issues. It goes roughly as follows:
It is noteworthy that Unicode’s coverage today is wide enough that ‘Unicode-compatible encoding’ means just about any encoding. In fact, almost all important legacy encodings served as sources for the Unicode standard.
As mentioned above, this model was first formalized for HTML in RFC 2070; it was later built into the W3C’s HTML [4][5], built into the very first (and still current) version of XML [6], and is a crucial part of the W3C’s Character Model for the World Wide Web [7].
Formally, this model applies mostly to Web specifications. Nevertheless, a growing number of applications use it, at least informally; these applications implement a Unicode text model and transcode anything that they read or write in legacy encodings. Unfortunately, some other applications that claim to support Unicode encodings do so the other way around: when presented with Unicode input, they transcode it to their legacy code page (thereby losing any character not in that code page), do the processing in that code page and transcode the output back to Unicode. This strategy offers the advantage of very quick Unicode enablement, but the limitations are obvious. Buyers beware! ‘Unicode-enabled’ may not mean all that it seems to imply!

Unicode and Fonts
Localization specialists are very much aware of the issues with fonts, some of which they encounter every day when localizing to languages using foreign scripts. Does Unicode change the picture? The short answer is that it has already done so; a longer answer is that more change is happening now or coming soon.
First, the changes already under our belts: the TrueType specification has been based on Unicode from day one. Since TrueType is used by MacOS, Windows and, increasingly, by various Unix systems, this is an area where Unicode is already prevalent. Technically, this reliance on Unicode means that TrueType fonts normally contain a table mapping the glyphs (character images) in the font to Unicode characters.
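To make the idea of a Unicode mapping table concrete, here is a minimal Python sketch. It is a toy model only: a real TrueType mapping table is a binary structure inside the font file, whereas here it is reduced to a plain dictionary. The table contents and the names `TOY_CMAP`, `glyphs_for` and `covers` are invented for illustration; `.notdef` is the conventional name of a font's "missing glyph".

```python
# Toy stand-in for a font's Unicode mapping table (hypothetical contents):
# Unicode code point -> glyph name.
TOY_CMAP = {
    0x0041: "A",
    0x00E9: "eacute",
    0x03A9: "Omega",
}

NOTDEF = ".notdef"  # glyph shown when the font has no mapping for a character


def glyphs_for(text, cmap=TOY_CMAP):
    """Return one glyph name per character, using NOTDEF for unmapped ones."""
    return [cmap.get(ord(ch), NOTDEF) for ch in text]


def covers(text, cmap=TOY_CMAP):
    """True if every character of the text is mapped by the font."""
    return all(ord(ch) in cmap for ch in text)
```

A renderer built this way can tell, character by character, whether a given font is able to display a piece of Unicode text.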
One short-term problem is that not all toolmakers and font designers are aware of that aspect or understand it correctly; consequently, we end up with fonts that lie about their contents. Many fonts claim (through their Unicode mapping table) to contain characters in the 0-255 range (Latin-1), but they actually contain other stuff. Until applications have been migrated to Unicode, there is actually an incentive for fonts to lie in this way, when the font is meant to be used with an application that supports only Latin-1. This is the old font-mapping trick to access foreign characters. For reasons not to go that route, see ‘<FONT FACE> considered harmful’ at http://babel.alis.com/web_ml/html/fontface.html.
A simple one-to-one mapping from Unicode characters to glyphs in a font, as provided by the TrueType specification, is not good enough to cover complete Unicode rendering. In general the mapping needs to be many-to-many, as exemplified by the case of Arabic: in this script, each letter varies in shape depending on context (one-to-many), and there is a compulsory, orthographically required ligature in which two letters (lam and alif) are imaged by one glyph (many-to-one). In fact, since this ligature glyph changes shape in context, we actually have a case of many-to-many mapping with something as mundane and widespread as Arabic. The newer OpenType font specification, based on TrueType and inheriting its Unicode aspects, provides additional mapping tables to deal with these cases. This change is happening now, as OpenType spreads.
One often hears about Unicode fonts. What does that mean? One possible meaning would be fonts that support Unicode, by having Unicode mapping tables as described above. Most often, however, the term is used to mean a font that covers all of Unicode. This is not bad in itself, but the problem is that people often conclude from the existence of the notion that such fonts are required for proper Unicode rendering support. This is not true!
Such pan-Unicode fonts, as I prefer to call them, do exist ‡, but they are huge, unwieldy and harder and harder to develop as Unicode expands. Only simplistic applications running on top of simplistic rendering engines need them, and then they are needed only to obtain extensive Unicode coverage. Even those simplistic applications can make use of any Unicode-mapped font (not only pan-Unicode fonts) to render Unicode text, provided they are willing to limit their rendering capabilities to the subset of the Unicode repertoire provided by the font. For many applications this is enough, and it allows Unicode to be used where a legacy encoding was used before, without changing fonts. The TrueType fonts provided with Windows and MacOS have long had Unicode mapping tables; they are Unicode-ready.
To benefit more fully from the large Unicode character repertoire without resorting to pan-Unicode fonts, a technique variously called virtual fonts, font cascading or, more recently, font linking is emerging. This works by grouping fonts covering different portions of Unicode into classes, and by upgrading the rendering engine so that, when called upon to render a string in a given font, it will resort to other fonts in the same class to render characters not in the given font. The Alis Tango browser may have been the first to implement this technique for Unicode rendering, providing multilingual Web browsing way back in 1995 by using only the standard set of Windows TrueType fonts. Others have followed suit and the technique is spreading (try it in your browser!). It has been formalized somewhat in the W3C’s CSS2 style sheet specification [8]. Some day soon, hopefully, font linking will be provided by the basic rendering facilities of operating systems, application developers will stop re-inventing it, and users (and localizers) will no longer need to switch fonts all the time, except when they really mean it.
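The font linking idea described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the algorithm of any particular browser or of CSS2: each font is reduced to the set of code points it maps, the font names and coverage ranges in `FONT_CLASS` are made up, and the helper names `pick_font` and `split_into_runs` are hypothetical.

```python
from itertools import groupby

# A hypothetical class of linked fonts, each reduced to its Unicode coverage.
FONT_CLASS = {
    "LatinFont": set(range(0x0020, 0x0100)),     # Basic Latin + Latin-1
    "GreekFont": set(range(0x0370, 0x0400)),     # Greek block
    "CyrillicFont": set(range(0x0400, 0x0500)),  # Cyrillic block
}


def pick_font(ch, requested, font_class=FONT_CLASS):
    """Use the requested font if it maps ch, else another font of the class."""
    cp = ord(ch)
    if cp in font_class.get(requested, ()):
        return requested
    for name, coverage in font_class.items():
        if cp in coverage:
            return name
    return None  # no font in the class can render this character


def split_into_runs(text, requested, font_class=FONT_CLASS):
    """Split text into (font, substring) runs, one run per change of font."""
    tagged = ((pick_font(ch, requested, font_class), ch) for ch in text)
    return [
        (font, "".join(ch for _, ch in group))
        for font, group in groupby(tagged, key=lambda pair: pair[0])
    ]
```

Called on a mixed Latin, Greek and Cyrillic string with `"LatinFont"` as the requested font, `split_into_runs` yields a sequence of runs that a rendering engine could draw one after the other, switching fonts on the user's behalf.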
Endnotes
* In this paper, the term legacy encoding designates any character encoding except the standard encodings of Unicode, even though many of these are actually very current.
† Transcoding is the process of transforming a piece of encoded text from one encoding to another.
‡ The defunct Bitstream Cyberbit and the newer Arial Unicode MS are the most well known. The latter weighs in at 23 Mbytes, placing it squarely in the heavyweight category.

References
[1] The Unicode Consortium. The Unicode Standard, Version 1.0. Reading, MA, Addison-Wesley Publishing Company, 1991. ISBN 0-201-56788-1.
[2] The Unicode Consortium. The Unicode Standard, Version 3.1.0. Unicode Standard Annex #27: Unicode 3.1 (which amends The Unicode Standard, Version 3.0). 2001-03-23. http://www.unicode.org/unicode/reports/tr27/.
[3] ISO/IEC 10646-1:2000, Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane. (See http://www.iso.ch/cate/d29819.html.)
[4] F. Yergeau, G. Nicol, G. Adams, M. Dürst, Internationalization of the Hypertext Markup Language, IETF RFC 2070, January 1997. (See http://www.ietf.org/rfc/rfc2070.txt.)
[5] Dave Raggett, Arnaud Le Hors, Ian Jacobs, Eds., HTML 4.0 Specification, W3C Recommendation 18-Dec-1997. (See http://www.w3.org/TR/REC-html40-971218/.)
[6] Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler, Eds., Extensible Markup Language (XML) 1.0, W3C Recommendation. (See http://www.w3.org/TR/REC-xml.)
[7] Martin J. Dürst, François Yergeau, Misha Wolf, Asmus Freytag, Tex Texin, Eds., Character Model for the World Wide Web 1.0, W3C Working Draft. (See http://www.w3.org/TR/charmod/.)
[8] Bert Bos, Håkon Wium Lie, Chris Lilley, Ian Jacobs, Eds., Cascading Style Sheets, level 2 (CSS2 Specification), W3C Recommendation. (See http://www.w3.org/TR/REC-CSS2.)

About the author
François Yergeau is one of the pioneers of Internet internationalization. Active in ISO, IETF and W3C standardization, he was editor or author of several Internet RFCs and W3C specs. He regularly fulfills speaking engagements around the world, represents Alis on the Advisory Committee of the World Wide Web Consortium (W3C), and is a member of the board of directors of the Centre international pour le développement de l’inforoute en français (CIDIF). He holds B.Sc. and Ph.D. degrees in physics. François can be reached at FYergeau@alis.com.