© 2010 SMP Marketing • ISSN 1420-3693 • www.localization.org

We Need to Be in China Now!
(Or, How PayPal Migrated to Unicode)

Scott Atwood, Development Technical Lead for Global Foundations Engineering, PayPal

PayPal is now available in 103 markets worldwide, with a local presence in 13 countries outside of the U.S. However, this has not always been the case. PayPal’s experience shows that even a company with a large codebase, large quantities of data and a high transaction volume can successfully migrate to Unicode. In this very candid article, Scott Atwood (Development Technical Lead for PayPal’s Global Foundations Engineering) explains how his team overcame several obstacles to enable the company’s global reach.

Editor’s Note: If you need help with a Unicode migration project, then contact LISA at lisa@lisa.org, so that we can help you. Several of our Members have already been through this scenario and can guide you through the process. Or, consider joining LISA, so that you can network directly yourself. Visit http://www.lisa.org/info/membership.html.


When eBay bought PayPal in 2002, PayPal was a leading online payment company, but not a global one. Why? Because, even in today’s more globalized world, payment systems vary from country to country and culture to culture, as do the accounting systems upon which they are based. No American would ever think of going to 7-Eleven to pay for and pick up a package from Amazon.com, but that’s precisely the preferred method in Japan. Localizing for payment systems can’t be resolved through a few meetings with a localization services provider or even a large contract with an IT consulting firm.

Fast-forward to today: PayPal is now available in 103 markets worldwide, with a local presence in 13 countries outside of the U.S.: Australia, Austria, Belgium, Canada, China, France, Germany, Ireland, Italy, the Netherlands, Spain, Switzerland and the U.K. Having a well-internationalized codebase, including ubiquitous Unicode support, has enabled PayPal's success in expanding to new markets around the world.

But PayPal hasn't always been a global company with strong Unicode support. It was founded in 1998 in Silicon Valley. Like many start-ups, PayPal initially focused exclusively on the U.S. market. At that time, the company followed the usual model of not expending any effort to lay the foundation for later internationalization or localization of its code base. In the beginning, all text-handling code was written with the implicit assumption that every character was one byte, and that all strings were in ISO-8859-1 encoding.
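To illustrate why the one-byte-per-character assumption matters, here is a minimal Python sketch. It is not PayPal's code; the sample name and the helper `splits_a_character` are assumptions made for the example. It shows that byte length and character count agree in ISO-8859-1 but not in UTF-8, so code that cuts strings at a byte offset can break a character in half.

```python
def splits_a_character(text: str, encoding: str, byte_limit: int) -> bool:
    """Return True if cutting the encoded text at byte_limit breaks a character."""
    truncated = text.encode(encoding)[:byte_limit]
    try:
        truncated.decode(encoding)
        return False
    except UnicodeDecodeError:
        return True


name = "Müller"  # sample value chosen for illustration

# Character count, ISO-8859-1 byte count, UTF-8 byte count.
print(len(name), len(name.encode("iso-8859-1")), len(name.encode("utf-8")))

# Cutting after two bytes is harmless in ISO-8859-1 but splits "ü" in UTF-8.
print(splits_a_character(name, "iso-8859-1", 2))
print(splits_a_character(name, "utf-8", 2))
```

Any code that indexes, truncates or measures text by bytes has to be found and reviewed before the encoding changes, which is one reason a migration of this kind touches so much of a codebase.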

PayPal’s Initial Steps to “Go Global”

This foundation was sufficient for the U.S. market and served PayPal well during the first several years of its existence. However, as the company grew and began to consider expanding outside of its domestic market, it became clear that it would eventually be necessary to change the way that textual data was processed and stored. Many of the fastest-growing economies use languages that are not supported by the Windows-1252 character set, so in order to operate globally, PayPal would have to migrate to Unicode, and soon.

In late 2002, PayPal still had a local presence only in the United States, but it already enjoyed a broad international audience. The company was poised to begin expanding internationally, first into the United Kingdom, and then into other Western European markets.

Editor's Note: At this time, approximately 25% of PayPal's transactions involved at least one party outside of the U.S., and around 2 million of its 17.8 million accounts were held outside of the U.S. Unicode would enable this business to flow more smoothly by allowing languages to be mixed in a simple and straightforward way. For example, if someone in Germany wanted to transact with someone in China, Unicode could support a mix of both German and Chinese text, while neither Windows-1252 nor GB2312 (the character set standard for Mainland China) could. In addition, PayPal wanted to move beyond international transactions to intra-national transactions, in which people resident in a particular country could pay for their items in their own currency within their own country. In the fall of that same year, PayPal was acquired by eBay. The latter already had operations in 20 countries at that point.
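The point about mixing languages can be checked with a short Python sketch. The sample string and the helper `can_encode` are assumptions made for this illustration, and `cp1252` and `gb2312` are the names of Python's codecs for Windows-1252 and GB2312.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Illustrative text mixing German and Chinese.
mixed = "Straße in Köln, 你好"

for encoding in ("cp1252", "gb2312", "utf-8"):
    print(encoding, can_encode(mixed, encoding))
```

Windows-1252 has no Chinese characters and GB2312 lacks letters such as "ß" and "ö", so only the Unicode encoding can hold the whole string.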

Now that the company was ready to begin its international expansion, the Engineering Team realized that it was important to begin the migration to Unicode as soon as possible. Text processing code was ubiquitous throughout the PayPal system, and the system continued to grow in size and complexity. The team understood that as time passed, the cost and complexity of a migration to Unicode would also increase.

PayPal's roadmap for international expansion at that time included only those Western European countries that could be supported by the existing ISO-8859-1-based infrastructure. Even though there was no immediate business need for Unicode, the Engineering Team knew that it would be needed once the company decided to enter markets like China. They also recognized that the migration would be a much simpler task now than in the future. Therefore, in November of 2002, the Engineering Team began a project to migrate to Unicode. Initially, there was a small team of two engineers dedicated to the migration project.

As the project progressed, it became clear that the scope of effort required for the migration had been underestimated. The project would need to touch nearly all components of the PayPal system, but no other engineering teams were able to allocate any time or resources to work on the project. In addition, there was no cross-team plan to coordinate efforts across the company.

The Unicode Migration Project Is Suspended Indefinitely

In August of 2003, the project was suspended indefinitely. The management team understood that delaying the project would increase its eventual size and complexity, but the existing Windows-1252-based system was sufficient for supporting all the localizations to new markets that were on the product roadmap at that time. Without an immediate business need, PayPal decided not to expend any more time or effort on the Unicode migration.

Although the project was suspended, the team had made some progress over the ten months it had been in existence. The system had been modified to accept only plain ASCII data for most user input. This was done to limit the growth of non-ASCII data in the database, which would ease the task of converting the database to Unicode. The existing non-ASCII data in the database had been identified and analyzed. The team tested a number of character set detection packages for use in analyzing the data in the database, and eventually decided on the UniversalCharsetDetection library from the Mozilla project. The team also identified the areas of the system that would be impacted by the transition, estimated the approximate effort required to accommodate Unicode in those areas, and came up with a database conversion strategy. At that time, the amount of non-ASCII data in the database was quite small, and it would have been possible to convert it all to Unicode during one of the normal downtime windows for database maintenance.
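An ASCII-only input restriction of the kind described above might look like the following Python sketch. The function name `is_plain_ascii` is hypothetical, and nothing here reflects how PayPal actually implemented the check.

```python
def is_plain_ascii(value: str) -> bool:
    """Return True if every character falls in the 7-bit ASCII range."""
    return all(ord(ch) < 128 for ch in value)


# Sample inputs chosen for illustration.
for candidate in ("John Smith", "Jürgen Müller"):
    verdict = "accepted" if is_plain_ascii(candidate) else "rejected"
    print(candidate, verdict)
```

ASCII bytes mean the same thing in ISO-8859-1, Windows-1252 and UTF-8, so records that pass this check need no conversion later.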

PayPal Enters Western Europe

PayPal's first international market was the United Kingdom, which launched in September of 2003. In addition to the ASCII repertoire used in the U.S., the U.K. required the '£' symbol. To avoid increasing the non-ASCII data in the database, this character was stored in the database as the HTML character entity '&pound;'.
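As a rough illustration of this workaround, the Python sketch below replaces the pound sign with an HTML entity before storage, so the stored value stays pure ASCII. The entity name `&pound;` and the helper `escape_pound` are assumptions for this example; the standard library's `html.unescape` reverses the substitution.

```python
import html


def escape_pound(text: str) -> str:
    """Replace the pound sign with an ASCII-only HTML entity."""
    return text.replace("£", "&pound;")


stored = escape_pound("Total: £10.00")  # sample value chosen for illustration
print(stored)
print(html.unescape(stored))
```

The trade-off is that the stored text is only correct when rendered as HTML; any other consumer of the data has to decode the entity first.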

The next market was Germany, where PayPal launched in February 2004. The character repertoire of Germany provided more challenges than that of the U.K., since it required a minimum of seven additional non-ASCII characters to properly represent the German character set. By this time, the Unicode project had been suspended for several months, so it was decided to partially lift the restriction on non-ASCII characters in the database. Users would be permitted to enter the characters required for the German language, and those characters would be added to the database in the ISO-8859-1 encoding. As additional Western European markets were added, additional characters were added to the permitted repertoire. Of course, this meant that the number of non-ASCII records in the database began to grow again, which would complicate the eventual efforts to convert the database to Unicode.
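A permitted repertoire that grows market by market can be sketched in Python as follows. The seven German characters listed are the umlauts and the sharp s; whether that matches PayPal's actual list is an assumption, as are the names `make_validator` and `to_storage_bytes`.

```python
# Seven non-ASCII characters commonly needed for German (assumed list).
GERMAN_EXTRAS = "äöüÄÖÜß"


def make_validator(extra_chars: str):
    """Build a check that allows ASCII plus an explicit set of extra characters."""
    allowed = set(extra_chars)

    def is_permitted(value: str) -> bool:
        return all(ord(ch) < 128 or ch in allowed for ch in value)

    return is_permitted


def to_storage_bytes(value: str) -> bytes:
    """Encode accepted text as ISO-8859-1, one byte per character."""
    return value.encode("iso-8859-1")


is_permitted_de = make_validator(GERMAN_EXTRAS)
print(is_permitted_de("Müller, Hauptstraße 1"))
print(is_permitted_de("你好"))
print(to_storage_bytes("Müller"))
```

Adding a market then means extending the allowed set, but every character admitted this way becomes one more non-ASCII byte that must be converted later.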

The Project Is Revived for China

The Unicode migration project was revived in June 2004 when PayPal decided to expand its service to China. The project was restarted with three engineers and an initial estimate of 280 days. By the time the project was completed in November 2005, over 20 engineers were dedicated to the project, with additional contributions from nearly every Engineering Team in the company. In the end, the total engineering effort was approximately 1500 person-days, and over 3000 source code files were changed.

Because the scope of this project was so large, the project had to be executed in multiple phases. During the first phase, the internal implementation of the core text processing library was replaced with an implementation based on Unicode. The interfaces were changed as little as possible to minimize the impact on the rest of the code. Rather than implementing a full Unicode text processing library, the team decided to use the UnicodeString and related classes from IBM's International Components for Unicode (ICU). PayPal was already using ICU extensively for many of its other internationalization needs, so it was a natural choice to leverage it for Unicode support as well. The foundations for Unicode were laid in this phase, but all storage, display and processing of text were still based on the ISO-8859-1 encoding. This phase of the project was released on the PayPal web site in February 2005.

Migrating the Database

By this time, so much non-ASCII data had been added to the database that it was no longer feasible to convert all of it during a database maintenance window. Instead, the data would have to be converted while the site remained active. In order to accommodate this new approach, the database access code was rewritten to handle a mixture of both ISO-8859-1 encoded and UTF-8 encoded data during the second phase of the project. When reading from the database, the database access layer would now attempt to interpret text as if it were encoded in UTF-8, and if that failed, it would treat the text as ISO-8859-1. In addition, the database access layer would write all new and updated records in UTF-8. This would ensure that no additional data would be added to the database that would need to be converted. During this phase of the project, the output encoding of all PayPal pages was changed to UTF-8, the first change visible to PayPal customers. This phase of the project was released in May 2005.
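The read and write behavior described above can be sketched in a few lines of Python. This is a simplified illustration rather than PayPal's database access layer, and the function names `decode_db_text` and `encode_db_text` are hypothetical.

```python
def decode_db_text(raw: bytes) -> str:
    """Interpret stored bytes as UTF-8, falling back to ISO-8859-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte value to a character, so this cannot fail.
        return raw.decode("iso-8859-1")


def encode_db_text(value: str) -> bytes:
    """Write all new and updated text as UTF-8."""
    return value.encode("utf-8")


legacy = "Müller".encode("iso-8859-1")  # record written before the change
fresh = encode_db_text("Müller")        # record written after the change

print(decode_db_text(legacy) == decode_db_text(fresh))
```

The fallback works because non-ASCII ISO-8859-1 text is rarely valid UTF-8: a lone byte such as 0xFC ("ü") is rejected by a strict UTF-8 decoder. The rare legacy strings that do happen to form valid UTF-8 sequences would be misread, which is a known limitation of this kind of heuristic.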

Shortly after this code was released, the database administrators began the process of converting database records to UTF-8. Approximately 55 million non-ASCII records were converted. To convert the records, the database administrators used the Oracle csscan tool to determine which tables contained non-ASCII data. Next, they executed a PayPal scan script over these tables to determine which records within these tables contained non-ASCII data. The encoding of each non-ASCII text record was detected using the Mozilla UniversalCharsetDetection library, with the user's country of residence and email address top level domain name used as hints. The text was converted to UTF-8 and stored in the original table, while the original data and detected encoding were saved in a temporary table for data recovery purposes.
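The conversion steps above can be sketched end to end in Python with an in-memory SQLite database. Everything here is an assumption made for illustration: the table and column names, the sample rows, and above all `detect_encoding`, which is a toy stand-in for a real charset detector and does not reflect the Mozilla library's API. It only shows how hints from the user's country and email domain might steer detection.

```python
import sqlite3


def detect_encoding(raw: bytes, country: str, email_tld: str) -> str:
    """Toy detector: returns 'utf-8' for text that needs no conversion."""
    try:
        raw.decode("utf-8")
        return "utf-8"  # pure ASCII or already converted
    except UnicodeDecodeError:
        pass
    hints = {country.lower(), email_tld.lower()}
    candidates = ["gb2312", "cp1252"] if "cn" in hints else ["cp1252"]
    for encoding in candidates:
        try:
            raw.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return "iso-8859-1"  # decodes any byte sequence


def convert_table(conn: sqlite3.Connection) -> int:
    """Convert legacy rows to UTF-8 in place, saving originals for recovery."""
    rows = conn.execute(
        "SELECT id, name, country, email_tld FROM users"
    ).fetchall()
    converted = 0
    for rec_id, raw, country, tld in rows:
        encoding = detect_encoding(raw, country, tld)
        if encoding == "utf-8":
            continue
        # Keep the original bytes and the detected encoding for data recovery.
        conn.execute(
            "INSERT INTO users_backup (id, original, detected) VALUES (?, ?, ?)",
            (rec_id, raw, encoding),
        )
        conn.execute(
            "UPDATE users SET name = ? WHERE id = ?",
            (raw.decode(encoding).encode("utf-8"), rec_id),
        )
        converted += 1
    conn.commit()
    return converted


def make_demo_db() -> sqlite3.Connection:
    """Build a small database holding a mix of legacy and UTF-8 rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users "
        "(id INTEGER PRIMARY KEY, name BLOB, country TEXT, email_tld TEXT)"
    )
    conn.execute(
        "CREATE TABLE users_backup "
        "(id INTEGER PRIMARY KEY, original BLOB, detected TEXT)"
    )
    conn.executemany(
        "INSERT INTO users VALUES (?, ?, ?, ?)",
        [
            (1, b"John Smith", "US", "com"),
            (2, "Jürgen Müller".encode("cp1252"), "DE", "de"),
            (3, "李明".encode("gb2312"), "CN", "cn"),
            (4, "Zoë".encode("utf-8"), "NL", "nl"),
        ],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    demo = make_demo_db()
    print("rows converted:", convert_table(demo))
```

Because rows that already decode as UTF-8 are skipped, the script can be rerun safely while the site stays live and new UTF-8 records keep arriving.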

As expected, the vast majority of text was detected as Windows-1252 or ISO-8859-1. For critical tables containing name and address data, we examined by hand all data that was detected as a character set other than Windows-1252 or ISO-8859-1. Approximately 7,000 such records were examined, and any data that was incorrectly converted was manually corrected. We didn't attempt to validate or correct any data that was detected as Windows-1252, since there were so many records of this type. Also, because we had limited the amount of non-ASCII data that users could enter since November 2002, most legitimate non-ASCII, non-Windows-1252 data was more than two-and-a-half years old and came from a time before PayPal had localized for any international markets.

In the final phase of the Unicode migration project, released in June of 2005, the default character set for internal text processing was changed to UTF-8, and Chinese characters were enabled for user input. The final product had remarkably few bugs, and only the most astute user of our services would be aware of the enormous changes that had occurred behind the scenes. The hard work paid off. Less than a month later, PayPal released a fully localized service for China.

Success!

At the time it was completed, the Unicode migration project was one of the largest single projects that had ever been completed at PayPal, and it was highly successful by every measure. Having an immediate business need helped the technology team drive the project to completion with strong management support. Having a Unicode-based infrastructure paid off immediately by enabling PayPal’s expansion into the China market. PayPal’s experience shows that even a company with a large codebase, large quantities of data and a high transaction volume can successfully make the change to Unicode.


Scott Atwood is a Development Technical Lead on the Global Foundations Engineering Team at PayPal, which is responsible for localization. Previously, he was a Senior Engineer on the China Localization Team, and an Engineer on the Unicode Migration Project. Atwood can be reached at satwood@paypal.com.



