February 07, 2005

Avoid MSHTML for localization projects

I have recently seen some HTML files translated using a process that involves Microsoft’s HTML editing system MSHTML, and have found that this system is seriously flawed in how it deals with HTML files. These flaws could break many web localization projects, and MSHTML should be avoided if at all possible.

Most of these problems arise because MSHTML doesn’t leave the structure of HTML content alone, but instead (apparently) loads the structure and then regenerates it. Specific problems include the following:

  1. MSHTML adds tags that should not be added. For example, it adds <TBODY> tags to all tables, whether or not they are required. If <tbody> is styled via CSS, this could change the appearance of tables. It also tends to add DOCTYPE declarations, even when one already exists in the file. This last behavior is both inexplicable and a major problem.
  2. MSHTML changes all tags to upper-case. This would not be a major problem, except that XHTML requires that all tags be lower-case.
  3. MSHTML removes quotes around some HTML attribute values. Omitting these quotes was a common practice with HTML 4, but even then it was considered poor coding practice. XHTML requires all attribute values to be quoted. Together with the previous point, this means that MSHTML will not generate valid XHTML.
  4. MSHTML does not understand PHP tags embedded in HTML content. I found that about half of the embedded PHP content in the files I was looking at was corrupted by MSHTML, typically by conversion of the angle brackets surrounding the tags into entities like &lt;. This conversion breaks the code severely, since executable code in the file is (a) no longer recognized as code, and (b) visible in the browser source, a potentially severe security flaw.
  5. MSHTML apparently prefers local links to relative ones: links to files are recoded to refer to local copies rather than their proper server location.
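
To make the list above concrete, here is a rough sketch of how a translated file could be screened for these symptoms before it goes back to the client. It is only a heuristic built on regular expressions, not a real HTML parser, and the function name, the check names and the choice to look for file: URLs as the sign of "local" links are all my own assumptions for illustration. Expect false positives on unusual markup.

```python
import re

# Each pattern is a rough heuristic for one symptom from the list above.
# The names are hypothetical labels, not part of any tool.
CHECKS = {
    # Upper-case element names such as <TBODY> or </TD>.
    "uppercase_tags": re.compile(r"</?[A-Z][A-Z0-9]*\b"),
    # An attribute whose value does not start with a quote, e.g. border=0.
    "unquoted_attributes": re.compile(r"<[^<>]*\s[\w:-]+=[^\s\"'<>]+[^<>]*>"),
    # PHP opening tags whose angle bracket was turned into an entity.
    "escaped_php": re.compile(r"&lt;\?(?:php|=)"),
    # Links rewritten to point at a local copy (assumed to show up as file: URLs).
    "local_links": re.compile(r"""(?:href|src)\s*=\s*["']?file:""", re.IGNORECASE),
}
DOCTYPE = re.compile(r"<!DOCTYPE\b", re.IGNORECASE)


def find_symptoms(html: str) -> list:
    """Return the names of the symptoms detected in the given HTML text."""
    found = [name for name, pattern in CHECKS.items() if pattern.search(html)]
    # More than one DOCTYPE declaration suggests one was added on top of the original.
    if len(DOCTYPE.findall(html)) > 1:
        found.append("duplicate_doctype")
    return found


if __name__ == "__main__":
    sample = '<TABLE border=0><TBODY><TR><TD>Hello</TD></TR></TBODY></TABLE>'
    print(find_symptoms(sample))
```

Running the same check on both the source file and the translated file is more useful than running it on the translated file alone, since a symptom that appears only after translation points at the tool rather than at the original author.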

Taken together, these are serious flaws for anything but the most casual HTML editing, and they can cause major problems in a localization framework. MSHTML should be avoided, and the built-in document profiles and tag protection of translators’ workbenches (like Trados, SDLX, etc.) should be used instead. While these may have their own problems at times, they do at least leave document structure alone and don’t try to interpret and rebuild it.

Posted at February 7, 2005 01:09 PM
