TMX 2.0 Specification Draft

OSCAR Public Committee Draft - 2009 March 9



This draft of TMX 2.0 is released for public comment. Public feedback is encouraged and should be sent to Arle R. Lommel <arle@lisa.org> for consideration by April 10, 2009.

Editors:

Rodolfo M. Raya <rmraya@maxprograms.com>
Arle R. Lommel <arle@lisa.org>

Previous Editors:

Yves Savourel
Alan K. Melby

Copyright © The Localisation Industry Standards Association [ LISA ] 1997-2009. All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to LISA.

The limited permissions granted above are perpetual and will not be revoked by LISA or its successors or assigns.

This document and the information contained herein is provided on an "AS IS" basis and LISA DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.


Abstract

This document defines version 2.0 of the Translation Memory eXchange format (TMX). The purpose of the TMX format is to provide a standard method to describe translation memory data that is being exchanged among tools and/or translation vendors, while introducing little or no loss of critical data during the process.

Status of this Document

This document constitutes an initial draft for discussion. Comments may be sent to tmx2@lisa.org.

Table of Contents


Abstract

1. Introduction
1.1. XML Conformance
1.2. Character Encoding
1.3. Extensibility
1.3.1. Extension Points

2. General Structure
2.1. Header
2.2. Body

3. Detailed Specifications
3.1. Elements
3.1.1. Structural Elements
3.1.2. Inline Elements
3.2. Attributes
3.2.1. TMX Attributes
3.2.2. XML Namespace Attributes

4. Content Markup
4.1. Overview
4.2. Representing Inline Elements
4.2.1. When the Content of Tags Is Available in the Translation Memory
4.2.2. When the Content of Tags Is Unavailable in the Translation Memory

5. TMX Compliance
5.1 Validation of TMX Files

6. Changes Since Previous Version (Non-Normative)
6.1 Upgrading TMX Files

Appendices
A. Sample Document
B. XML Schema for TMX
C. Glossary
D. References
Normative
Non-Normative


1. Introduction

TMX is defined in two parts:

  • A specification of the format of the container (the higher-level elements that provide information about the file as a whole and about entries). In TMX, an entry consisting of aligned segments of text in two or more languages is called a Translation Unit (the <tu> element).

  • A specification of a low-level meta-markup format for the content of a segment of translation-memory text. In TMX, an individual segment of translation-memory text in a particular language is denoted by a <seg> element. See the section on Content Markup for more details.

1.1. XML Conformance

TMX is XML-conformant. The TMX vocabulary is defined using an XML Schema (see Appendix B). It also uses various third-party standards for date/time and language codes. See the References section for more details.

TMX files are intended to be created automatically by export routines and processed automatically by import routines. TMX files are well-formed XML documents that can be processed without explicit reference to the TMX Schema. However, a valid TMX file must conform to the TMX Schema, and any TMX file about which there are concerns should be verified against the TMX Schema using a validating XML parser.

Since XML syntax is case-sensitive, any XML application must define casing conventions. All element and attribute names of TMX are defined in lowercase.

The namespace URI for TMX 2.0 is defined as "http://www.lisa.org/tmx20". For example, TMX used in another (non-TMX) XML document would appear something like this:

<?xml version="1.0" encoding="utf-8"?>
<myformat xmlns:tmx="http://www.lisa.org/tmx20">
<data>
  <tmx:tmx version="2.0">
    <tmx:header ...
       ... TMX data ... 
    </tmx:body>      				
  </tmx:tmx>
</data>
</myformat>
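
The following non-normative sketch shows one way a consumer might locate TMX data embedded in a host document by matching on the namespace URI rather than on the prefix. The host element names and the header attribute values are illustrative assumptions; only the namespace URI comes from this section.

```python
import xml.etree.ElementTree as ET

TMX_NS = "http://www.lisa.org/tmx20"  # namespace URI given in this section

# Hypothetical host document; <myformat>, <data> and the header values
# are illustrative only.
HOST_DOC = """<?xml version="1.0" encoding="utf-8"?>
<myformat xmlns:tmx="http://www.lisa.org/tmx20">
<data>
  <tmx:tmx version="2.0">
    <tmx:header creationtool="ExampleTool" creationtoolversion="1.0"
                segtype="sentence" o-tmf="example" adminlang="en"
                srclang="en" datatype="plaintext"/>
    <tmx:body/>
  </tmx:tmx>
</data>
</myformat>
"""


def find_embedded_tmx(xml_text):
    """Return every TMX root element found in a host XML document."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    # Element.iter() also yields the root itself when its tag matches.
    return list(root.iter("{%s}tmx" % TMX_NS))
```

Matching on the expanded name means the lookup still works if the host document binds the TMX namespace to a prefix other than "tmx".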
		

1.2. Character Encoding

TMX files are always in Unicode. They can use any of three encoding methods: UTF-16 (16-bit files), UTF-8 (8-bit files), or ISO-646 [a.k.a. US-ASCII] (7-bit files).

In all non-7-bit cases, unlike in HTML, only the following five character entity references are allowed: &amp; (&), &lt; (<), &gt; (>), &apos; ('), and &quot; ("). For 7-bit files, extended (non-ASCII) characters are always represented by numeric character references. For example: &#x0396; or &#918; for GREEK CAPITAL LETTER ZETA.
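
As a non-normative illustration of the 7-bit rule, Python's built-in "xmlcharrefreplace" error handler produces decimal numeric character references for every non-ASCII character. The function name is hypothetical, and the sketch assumes that markup-significant characters such as & and < have already been escaped by the XML serializer.

```python
def to_seven_bit(text):
    """Encode already-serialized XML text as US-ASCII bytes.

    Non-ASCII characters become decimal numeric character references.
    Characters such as & and < are NOT escaped here.
    """
    return text.encode("ascii", errors="xmlcharrefreplace")
```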

Note that proper UTF-16 files always start with the Unicode byte-order mark (BOM), U+FEFF, which appears as the byte sequence FE FF in “big-endian” files and FF FE in “little-endian” files. UTF-8 files may (but are not required to) begin with the UTF-8 BOM (EF BB BF).

Since all XML processors must accept the UTF-8 and UTF-16 encodings and since US-ASCII is a subset of UTF-8, a TMX document can omit the encoding declaration in the XML declaration, although its inclusion is recommended. Note, however, that for accurate character set detection, UTF-16 files must begin with the BOM. Applications that support TMX must be able to read files stored in all three of these encodings (including both UTF-16 byte orders) regardless of the encoding the tools use internally. They must also correctly interpret UTF-8 files beginning with the optional BOM.
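
The detection rules above can be sketched as follows. This is a minimal, non-normative example: the function name is hypothetical, it looks only at the leading bytes, and it ignores any encoding declaration, which a full XML parser would also take into account.

```python
import codecs


def decode_tmx_bytes(data):
    """Return (encoding, text) for the raw bytes of a TMX file."""
    if data.startswith(codecs.BOM_UTF8):
        # Optional UTF-8 BOM (EF BB BF): strip it while decoding.
        return "utf-8", data.decode("utf-8-sig")
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        # The generic "utf-16" codec reads the BOM to pick the byte order.
        return "utf-16", data.decode("utf-16")
    # No BOM: UTF-8, which also covers 7-bit US-ASCII files.
    return "utf-8", data.decode("utf-8")
```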

In addition, if the source database or application generating a TMX file uses character codes in the Private Use Area (PUA) of Unicode (code points U+E000–U+F8FF), it must convert those code points to their corresponding numeric character references in TMX files. For example, if a source document uses the “fft” ligature found in certain Adobe OpenType fonts at code point U+E097 in the Private Use Area, the corresponding TMX document would represent this character as &#xE097;. This process is required since many text-processing tools do not support the PUA. Inclusion of such character references in TMX files may necessitate additional negotiation between the creator and receiver of the file if such code points are to be properly interpreted. Such negotiations are outside the scope of the TMX standard, and use of the PUA is discouraged when possible.
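
A minimal, non-normative sketch of the Private Use Area conversion is shown below. The function name is hypothetical, and the sketch assumes it is applied to text that has already been serialized as XML, so that the ampersand it introduces is not escaped a second time.

```python
def escape_private_use(text):
    """Replace PUA characters (U+E000-U+F8FF) with hexadecimal
    numeric character references; leave everything else unchanged."""
    out = []
    for ch in text:
        if 0xE000 <= ord(ch) <= 0xF8FF:
            out.append("&#x%04X;" % ord(ch))
        else:
            out.append(ch)
    return "".join(out)
```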

1.3. Extensibility

TMX provides a mechanism for the exchange of translation memory data, not application-specific features or data. Transferring data alone may not transfer the knowledge of how to process data. As a result, although TMX provides a rich set of elements for exchanging Translation Memory data, sometimes it may be necessary to extend TMX vocabulary using XML Namespaces in order to support functions needed for specific tasks.

It is possible to add non-TMX elements, as well as attributes and attribute values, to any TMX document. All foreign elements and attributes added to a TMX file must be defined using an XML Schema. All XML Schemas declared in a TMX document must be made available to permit validation of the foreign constructs included in the file.

Although TMX offers this extensibility mechanism, in order to avoid difficulty in processing and increase interoperability between tools, it is strongly recommended to use TMX capabilities whenever possible, rather than to create non-standard user-defined elements or attributes.

Applications that depend on the TMX format for exchanging Translation Memory data are not required to understand or support non-TMX elements or attributes. A TMX application can safely ignore foreign elements or attributes present in a TMX document.

1.3.1. Extension Points

TMX supports the use of foreign XML elements in the following elements: <body>, <header>, <internal-file>, <tu> and <tuv>.

Foreign attributes can be added to any TMX element, provided that the attribute name is fully qualified with the corresponding namespace prefix.
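
Because applications may safely ignore foreign constructs, an importer might simply discard them before processing. The following non-normative sketch assumes a TMX document whose elements are in the TMX 2.0 namespace; the function names are hypothetical. It leaves the content of <internal-file> untouched, since the embedded SRX rules are themselves in a non-TMX namespace.

```python
import xml.etree.ElementTree as ET

TMX_NS = "http://www.lisa.org/tmx20"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def _is_foreign(name):
    """True for a qualified name outside the TMX and xml: namespaces."""
    if not name.startswith("{"):
        return False  # unqualified names are treated as TMX names
    namespace = name[1:].split("}", 1)[0]
    return namespace not in (TMX_NS, XML_NS)


def strip_foreign(elem):
    """Remove foreign child elements and attributes, in place."""
    if elem.tag == "{%s}internal-file" % TMX_NS:
        return elem  # keep the embedded SRX document as-is
    for child in list(elem):
        if _is_foreign(child.tag):
            elem.remove(child)
        else:
            strip_foreign(child)
    for name in [a for a in elem.attrib if _is_foreign(a)]:
        del elem.attrib[name]
    return elem
```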


2. General Structure

A TMX document is enclosed in a <tmx> root element. The <tmx> element contains two elements: <header> and <body>.

2.1. Header

The <header> contains meta-data about the document. In addition to its attributes, <header> can also store document-level information in <note> elements. Any SRX 2.0-format representation of segmentation rules used to generate a TMX file must be included in the <header> using a <segmentation> element. The <segmentation> element does not need to be used if no such rules are available in SRX 2.0 format or if no segmentation rules apply (e.g., because the source file was pre-segmented).

2.2. Body

The <body> contains the collection of translation units (the <tu> elements). This collection is in no specific order.

Each <tu> element contains at least one translation unit variant (the <tuv> element). Each <tuv> contains the segment and the information pertaining to that segment for a given language. (Note that if fewer than two <tuv> elements appear in a <tu> element, that <tu> element is considered to be incomplete. Incomplete <tu> elements may be needed for some applications, although they would not generally be useful for translation memory (TM) applications.)

The text itself is stored in the <seg> element, while <note> allows for storage of additional information specific to each <tuv>.

A segment can contain content markup elements. The <itag> element allows the location of native inline codes to be indicated, along with their relationship to each other (e.g., paired tags). It also provides the optional capability to encapsulate native inline codes. The <hi> element allows for the addition of extra markup not related to existing inline codes. The <sub> element, used inside encapsulated inline code, allows for the delimitation of translatable text within markup (e.g., the value of an HTML alt attribute).

See the Sample Document section for an example of a TMX document.
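
The structure described in this section can be walked with a few lines of code. The sketch below is non-normative: the sample data, the header attribute values, and the function name are assumptions made for illustration, and the elements are written without a namespace prefix for brevity.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Illustrative sample; the attribute values are placeholders.
SAMPLE = """<tmx version="2.0">
  <header creationtool="ExampleTool" creationtoolversion="1.0"
          segtype="sentence" o-tmf="example" adminlang="en"
          srclang="en" datatype="plaintext"/>
  <body>
    <tu tuid="t1">
      <tuv xml:lang="en"><seg>Press the red button.</seg></tuv>
      <tuv xml:lang="hu"><seg>Nyomja a piros gombot.</seg></tuv>
    </tu>
    <tu tuid="t2">
      <tuv xml:lang="en"><seg>Untranslated text.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def summarize(tmx_text):
    """Return (tuid, {language: text}, incomplete) for each <tu>."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.find("body").findall("tu"):
        variants = tu.findall("tuv")
        texts = {tuv.get(XML_LANG): "".join(tuv.find("seg").itertext())
                 for tuv in variants}
        # A <tu> with fewer than two <tuv> elements is incomplete.
        units.append((tu.get("tuid"), texts, len(variants) < 2))
    return units
```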


3. Detailed Specifications

3.1. Elements

TMX elements are divided into two main categories: the structural elements (the container), and the inline elements (the content markup).

3.1.1. Structural Elements

The structural elements are the following:


<body>

Body - The <body> element encloses the main data: the set of <tu> elements contained in the file.

Required attributes:

None.

Optional attributes:

None.

Contents:

Zero, one or more <tu> elements and
Zero, one or more non-TMX elements, in any order.


<context>

Context Information - The <context> element describes the context of a <tu>. The purpose of this context information is to allow certain pieces of text to have different translations depending on where they came from. For example, the translation of a piece of text may differ depending on whether it appears in a web form, a dialog, an Oracle form, or a Lotus form. This information is thus required by a translator when working on the file. Likewise, the information may be used by any tool that attempts to leverage the text automatically. Note that the local context (i.e., text that surrounds a given <tu> element) is indicated using the group and g-order attributes.

Required attributes:

context-type.

Optional attributes:

None.

Contents:

Text.
Suggested values taken from the TBX Basic specification for use in a software localization environment include, but are not limited to, Menu item, Dialog box, Group box, Text box, Combo box, Combo box element, Check box, Tab, Push button, Radio button, Spin box, Progress bar, Slider, Informative message, Interactive message, ToolTip, Table text, and User-defined type.


<external-file>

External file - The <external-file> element specifies the location of the actual SRX file being referenced. The required href attribute provides a URL to the file. The crc attribute accepts a value that can be used to assure the integrity of the file. The optional uid attribute allows a unique ID to be assigned to the file.

Required attributes:

href.

Optional attributes:

crc, uid.

Contents:

Empty.


<header>

File header - The <header> element contains information pertaining to the whole document.

Required attributes:

creationtool, creationtoolversion, segtype, o-tmf, adminlang, srclang, datatype.

Optional attributes:

o-encoding, creationdate, creationid, changedate, changeid.

Contents:

Zero, one or more <note> elements, followed by
Zero or one <segmentation> element, followed by
Zero, one or more non-TMX elements.


<internal-file>

Internal file - The <internal-file> element contains the actual SRX file with the segmentation rules used when generating the TMX document.

Required attributes:

None.

Optional attributes:

None.

Contents:

One SRX file embedded using SRX namespace.


<note>

Note - The <note> element is used for comments.

Required attributes:

None.

Optional attributes:

creationdate, creationid, changedate, changeid, o-encoding, xml:lang.

Contents:

Text.


<seg>

Segment - The <seg> element contains the text of the given segment. There is no length limitation to the content of a <seg> element. If the optional xml:space attribute is set to "preserve", all spacing and line-breaking characters are significant within a <seg> element.

Required attributes:

None.

Optional attributes:

xml:space.

Contents:

Text data,
Zero, one or more of the following elements: <hi> and <itag>.
They can be in any order.


<segmentation>

Segmentation - The <segmentation> element points to or contains the SRX segmentation rules that were used in the generation of the TMX file.

Required attributes:

None.

Optional attributes:

None.

Contents:

Either exactly one <internal-file> or one <external-file> element.


<tmx>

TMX document - The <tmx> element encloses all the other elements of the document.

Required attributes:

version.

Contents:

One <header> followed by
One <body> element.


<tu>

Translation unit - The <tu> element contains the data for a given translation unit.

Required attributes:

None.

Optional attributes:

tuid, o-encoding, datatype, usagecount, lastusagedate, creationtool, creationtoolversion, creationdate, creationid, changedate, segtype, changeid, o-tmf, srclang, group, g-order.

Contents:

Zero, one or more <note> or <context> elements in any order, followed by
One or more <tuv> elements, followed by
Zero, one or more non-TMX elements.


<tuv>

Translation Unit Variant - The <tuv> element specifies text in a given language.

Required attributes:

xml:lang.

Optional attributes:

o-encoding, datatype, usagecount, lastusagedate, creationtool, creationtoolversion, creationdate, creationid, changedate, changeid, o-tmf, xml:space.

Contents:

Zero, one or more <note> elements, followed by
One <seg> element, followed by
Zero, one or more non-TMX elements.


3.1.2. Inline Elements

The inline elements are the elements that can appear inside a segment. See also the Content Markup section for more information.

The inline elements are the following:


<hi>

Highlight - The <hi> element delimits a section of text that has special meaning, such as a terminological unit, a proper name, an item that should not be modified, etc. It can be used for various processing tasks such as indicating to a Machine Translation tool proper names that should not be translated, for terminology verification, or to mark suspect expressions after a grammar checking.

Required attributes:

type.

Optional attributes:

x, comment.

Contents:

Text data,
Zero, one or more <itag> elements.


<itag>

Internal tag - The <itag> element is used to indicate the position of native internal markup used in segments. This element replaces the now-deprecated <bpt>, <ept>, <it>, <ph>, and <ut> elements. The <itag> element can also encapsulate application file format markup if this information is stored in a translation memory application (see the section on content markup below), and must do so when the creating application stores this information. If this information is not stored in an application, this element appears as an empty XML element.

Required attributes:

type,
x

Optional attributes:

pos, assoc, equiv-text.

Contents:

(May be empty),
Code data,
One or more <sub> elements.


<sub>

Sub-flow - The <sub> element is used to delimit sub-flow text inside a sequence of native code, for example: the definition of a footnote or the text of a title attribute in an HTML anchor element. The <sub> element may only be used within <itag> elements.

Here are some examples (in the TMX versions, the sub-flow text is the content of the <sub> element):

Footnote in RTF:

Original RTF:

Elephants{\cs16\super \chftn {\footnote \pard\plain \s15\widctlpar \f4\fs20
{\cs16\super \chftn } An elephant is a very large animal. }} are big.

TMX with content mark-up:

Elephants<itag type="fnote" x="1">{\cs16\super \chftn {\footnote \pard\plain \s15\widctlpar \f4\fs20
{\cs16\super \chftn } <sub type="fnote"> An elephant is a very large animal. </sub>}}</itag> are big.

TMX without content mark-up:

Elephants<itag type="fnote" x="1"><sub type="fnote"> An elephant is a very large animal. </sub></itag> are big.

Index marker in RTF:

Original RTF:

Elephants{\pard\plain \widctlpar
\v\f4\fs20 {\xe { Big animal \bxe }}} are big.

TMX with content mark-up:

Elephants<itag type="index" x="1">{\pard\plain \widctlpar
\v\f4\fs20 {\xe {<sub type="index"> Big animal </sub>\bxe }}}</itag> are big.

TMX without content mark-up:

Elephants<itag type="index" x="1"><sub type="index"> Big animal </sub></itag> are big.

Text of an attribute in an HTML element:

Original HTML:

See the <a title=" Go to Notes "
href="notes.htm">Notes</a> for more details.

TMX with content mark-up:

See the <itag x="1" pos="start" type="link">&lt;a title="<sub type="link"> Go to Notes </sub>"
href="notes.htm"></itag>Notes<itag x="1" pos="end">&lt;/a></itag> for more details.

Note that sub-flows are related to segmentation and can cause interoperability issues when one tool uses sub-flow within its main segment, while another extracts the sub-flow text as an independent segment. Resolving these differences is beyond the scope of TMX and users may expect some loss of leverage in cases involving sub-flow, although tool developers may implement processes to minimize data loss caused by this issue.

Required attributes:

type.

Optional attributes:

datatype.

Contents:

Text data,
Zero, one or more <itag> elements.
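
As a non-normative illustration of the sub-flow examples above, the following sketch separates the main-flow text of a segment from its sub-flow text. The function name is hypothetical, elements are assumed to be unprefixed, and <itag> elements nested inside a <sub> are not treated specially.

```python
import xml.etree.ElementTree as ET


def split_subflows(seg):
    """Return (main_text, subflow_texts) for a parsed <seg> element."""
    main, subflows = [], []

    def walk(elem):
        main.append(elem.text or "")
        for child in elem:
            if child.tag == "itag":
                # Native code is not main-flow text; keep only sub-flows.
                subflows.extend("".join(s.itertext())
                                for s in child.iter("sub"))
            else:
                walk(child)  # e.g. <hi>: its text is main-flow text
            main.append(child.tail or "")

    walk(seg)
    return "".join(main), subflows
```

A tool that stores sub-flows as independent segments could use the second return value as the source of those segments, which is one place where the leverage loss mentioned above may arise.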


3.2. Attributes

This section lists the various attributes used in the TMX elements.

3.2.1. TMX Attributes


adminlang

Administrative language - Specifies the default language for the administrative and informative element <note>.

Value description:

A language code as described in [RFC 4646]. Unlike the other TMX attributes, the values for adminlang are not case-sensitive.

Default value:

Undefined.

Used in:

<header>.


assoc

Association - Indicates whether an <itag> element is associated with the text that precedes it, the text that follows it, or both.

Value description:

"p" (the element is associated with the text preceding the element), "f" (the element is associated with the text following the element), or "b" (the element is associated with the text on both sides). Note: The assoc attribute should not be confused with the x attribute, which is used to indicate pairing of <itag> elements within a segment and their correlation to corresponding markup elements used in other <tuv> elements within a single <tu> element.

Default value:

Undefined.

Used in:

<itag>.


changedate

Change date - Specifies the date of the last modification of the element.

Value description:

Date in [ISO 8601] Format. The recommended pattern to use is: YYYYMMDDThhmmssZ
Where: YYYY is the year (4 digits), MM is the month (2 digits), DD is the day (2 digits), hh is the hours (2 digits), mm is the minutes (2 digits), ss is the seconds (2 digits), and Z indicates the time is UTC time. For example:

date="20020125T210600Z"
is January 25, 2002 at 9:06pm GMT
is January 25, 2002 at 2:06pm US Mountain Time
is January 26, 2002 at 6:06am Japan time
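
The recommended pattern maps directly onto a strptime format string. The following non-normative sketch parses and writes such values; the function names are hypothetical, and the fixed UTC offsets used in the usage note stand in for US Mountain Standard Time (UTC-7) and Japan time (UTC+9).

```python
from datetime import datetime, timedelta, timezone

TMX_DATE_FORMAT = "%Y%m%dT%H%M%SZ"  # YYYYMMDDThhmmssZ


def parse_tmx_date(value):
    """Parse a TMX date string into a timezone-aware UTC datetime."""
    parsed = datetime.strptime(value, TMX_DATE_FORMAT)
    return parsed.replace(tzinfo=timezone.utc)


def format_tmx_date(moment):
    """Format a timezone-aware datetime in the recommended pattern."""
    return moment.astimezone(timezone.utc).strftime(TMX_DATE_FORMAT)
```

For example, parse_tmx_date("20020125T210600Z").astimezone(timezone(timedelta(hours=9))) yields the Japan-time reading of the sample value.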

Default value:

Undefined.

Used in:

<header>, <tu>, <tuv>.


changeid

Change identifier - Specifies the identifier of the user who modified the element last.

Value description:

Text.

Default value:

Undefined.

Used in:

<header>, <tu>, <tuv>.


comment

Comment - A comment attached to the element.

Value description:

Text.

Default value:

Undefined.

Used in:

<hi>.


context-type

Context type - The context-type attribute specifies the context and the type of resource or style of the data of a given element. For example, to define if it is a label, or a menu item in the case of resource-type data, or the style in the case of document-related data.

Value description:

Text without spaces. Pre-defined values are as follows:

database

Indicates database content.

element

Indicates the content of an element within an XML document.

elementtitle

Indicates the name of an element within an XML document.

linenumber

Indicates the line number from the sourcefile (see context-type="sourcefile") where the source text is found.

numparams

Indicates the number of parameters contained within the source text.

paramnotes

Indicates notes pertaining to the parameters in the source text.

record

Indicates the content of a record within a database.

recordtitle

Indicates the name of a record within a database.

sourcefile

Indicates the original source file from which the TMX file is created.

In addition, user-defined values can be used with this attribute. A user-defined value must start with an "x-" prefix.
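
A non-normative sketch of a check for context-type values follows. The function name is hypothetical, and rejecting a bare "x-" with nothing after the prefix is an assumption that goes beyond the text above.

```python
PREDEFINED_CONTEXT_TYPES = frozenset({
    "database", "element", "elementtitle", "linenumber", "numparams",
    "paramnotes", "record", "recordtitle", "sourcefile",
})


def is_valid_context_type(value):
    """True for a pre-defined value or a user-defined "x-" value."""
    if not value or any(ch.isspace() for ch in value):
        return False  # the value must be text without spaces
    if value in PREDEFINED_CONTEXT_TYPES:
        return True
    # Assumption: a user-defined value needs something after "x-".
    return value.startswith("x-") and len(value) > 2
```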

Default value:

Undefined.

Used in:

<context>.


creationdate

Creation date - Specifies the date of creation of the element.

Value description:

Date in [ISO 8601] Format. The recommended pattern to use is: YYYYMMDDThhmmssZ
Where: YYYY is the year (4 digits), MM is the month (2 digits), DD is the day (2 digits), hh is the hours (2 digits), mm is the minutes (2 digits), ss is the seconds (2 digits), and Z indicates the time is UTC time. For example:

date="20020125T210600Z"
is January 25, 2002 at 9:06pm GMT
is January 25, 2002 at 2:06pm US Mountain Time
is January 26, 2002 at 6:06am Japan time

Default value:

Undefined.

Used in:

<header>, <tu>, <tuv>.


creationid

Creation identifier - Specifies the identifier of the user who created the element.

Value description:

Text.

Default value:

Undefined.

Used in:

<header>, <tu>, <tuv>.


creationtool

Creation tool - Identifies the tool that created the TMX document. Its possible values are not specified by the standard but each tool provider should publish the string identifier it uses.

Value description:

Text.

Default value:

Undefined.

Used in:

<header>, <tu>, <tuv>.


creationtoolversion

Creation tool version - Identifies the version of the tool that created the TMX document. Its possible values are not specified by the standard but each tool provider should publish the string identifier it uses.

Value description:

Text.

Default value:

Undefined.

Used in:

<header>, <tu>, <tuv>.


crc

Cyclic redundancy checking - A private value used to verify data as it is returned to the producer. The generation and verification of this number is tool-specific.

Value description:

Number (possibly not decimal).

Default value:

Undefined.

Used in:

<external-file>.


datatype

Data type - Specifies the type of data contained in the element. Different processes may be applied to the data depending on the value of the datatype attribute.

Value description:

Text.

It is highly recommended that developers use official MIME types (as defined in IETF RFC 2046 and registered in the IANA list of MIME types) where possible as datatype values. In addition, the values provided in the following table for datatype may be used for compatibility with the XLIFF specification or for localization-specific formats that lack official, distinctive MIME type values (or for which the applicable MIME types are insufficiently specific). Note that, in some instances, more general MIME types that may apply are not provided in this table.

datatype value      Description                                         Equivalent MIME type
(XLIFF compatible)
------------------  --------------------------------------------------  ------------------------
unknown             undefined (default)
alptext             WinJoust data.
cdf                 Channel Definition Format.
cmx                 Corel CMX Format.
cpp                 C and C++ style text.
dita                Darwin Information Typing Architecture (DITA)
hptag               HP-Tag.
html                HTML, DHTML, etc.                                   text/html
interleaf           Interleaf documents.
ipf                 IPF/BookMaster.
java                Java, source and property files.                    application/java
javascript          JavaScript, ECMAScript scripts.                     application/x-javascript
lisp                Lisp.                                               application/x-lisp
mif                 FrameMaker MIF, MML, etc.                           application/x-frame
opendocument        Open Document file.                                 (see note below)
opentag             OpenTag data.
pascal              Pascal, Delphi style text.                          text/pascal
plaintext           Plain text.                                         text/plain
pm                  PageMaker.                                          application/x-pagemaker
resx                Windows .NET resources.
rtf                 Rich Text Format.                                   application/rtf
sgml                SGML.                                               text/sgml
stf-f               S-Tagger for FrameMaker.
stf-i               S-Tagger for Interleaf.
transit             Transit data.
vbscript            Visual Basic scripts.
winres              Windows resources from RC, DLL, EXE.
xliff               XLIFF (XML Localization Interchange File Format).
xml                 XML.                                                text/xml
xptag               Quark XPressTag.

Note on opendocument: There are a variety of MIME types for Open Document files, depending on the exact type of Open Document file. Open Document MIME types begin with application/vnd.oasis.opendocument.

Used in:

<header>, <tu>, <tuv>, <sub>.


equiv-text

Equivalent text - Indicates the equivalent text to substitute in place of an inline tag.

The following example shows use of the attribute to specify that an HTML <br /> tag is to be interpreted as a linefeed character, using both full and empty <itag> elements.

Version 1, with content:

<itag x="1" equiv-text="linefeed character">&lt;br /&gt;</itag>

Version 2, empty element:

<itag x="1" equiv-text="linefeed character" />

Value description:

Text.

Used in:

<itag>.


group

Group identifier - Indicates that a given <tu> element belongs to a logical group of related translation units.

Value description:

Text without spaces.

Used in:

<tu>


g-order

Group order - Defines the order of the <tu> within a given logical group. Used together with the group attribute.

In the following portion of a TMX file, the group attribute shows that the three <tu> elements are part of a logical group in the source document. The g-order attribute shows in which order they occurred within that logical grouping.

<tu group="p0001" g-order="1" datatype="plaintext">
     <tuv xml:lang="hu">
          <seg>Nyomja a piros gombot.</seg>
     </tuv>
     <tuv xml:lang="en">
          <seg>Press the red button.</seg>
     </tuv>
</tu>
<tu group="p0001" g-order="2" datatype="plaintext">
     <tuv xml:lang="hu">
          <seg>Az inditás után, nézze meg, hogy elég az olajnyomás.</seg>
     </tuv>
     <tuv xml:lang="en">
          <seg>After starting it, make sure that the oil pressure is sufficient</seg>
     </tuv>
</tu>
<tu group="p0001" g-order="3" datatype="plaintext">
     <tuv xml:lang="hu">
          <seg>Ha nem elég, a gépet ki kell kapcsolni.</seg>
     </tuv>
     <tuv xml:lang="en">
          <seg>If it is not sufficient, you must shut the machine down.</seg>
     </tuv>
</tu>

Value description:

Number starting at 1 and incremented in steps of 1 unit. Must be unique within each logical group defined with the group attribute. Its initial value is reset to 1 in each logical group.
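
A consumer that needs the local context of a translation unit can rebuild each logical group by sorting on g-order. The following non-normative sketch uses a hypothetical function name, assumes unprefixed element names, and enforces only the uniqueness rule stated above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def order_groups(body):
    """Map each group name to its <tu> elements sorted by g-order."""
    groups = defaultdict(dict)
    for tu in body.iter("tu"):
        name = tu.get("group")
        if name is None:
            continue  # this unit does not belong to a logical group
        raw = tu.get("g-order")
        if raw is None or not raw.isdigit():
            raise ValueError("group %r: missing or non-numeric g-order" % name)
        order = int(raw)
        if order in groups[name]:
            raise ValueError("group %r: duplicate g-order %d" % (name, order))
        groups[name][order] = tu
    return {name: [members[k] for k in sorted(members)]
            for name, members in groups.items()}
```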


href

Hypertext reference - The "href" attribute contains a valid URL that describes the location of a file.

Value description:

Text.

Default value:

Undefined.

Used in:

<external-file>.


lastusagedate

Last usage date - Specifies the last time the content of a <tu> or <tuv> element was used in the original translation memory environment.

Value description:

Date in [ISO 8601] Format. The recommended pattern to use is: YYYYMMDDThhmmssZ
Where: YYYY is the year (4 digits), MM is the month (2 digits), DD is the day (2 digits), hh is the hours (2 digits), mm is the minutes (2 digits), ss is the seconds (2 digits), and Z indicates the time is UTC time. For example:

date="20020125T210600Z"
is January 25, 2002 at 9:06pm GMT
is January 25, 2002 at 2:06pm US Mountain Time
is January 26, 2002 at 6:06am Japan time

Default value:

Undefined.

Used in:

<tu>, <tuv>.


o-encoding

Original encoding - As stated in the Character Encoding section, all TMX documents are in Unicode. However, it is sometimes useful to know what code set was used to encode text that was converted to Unicode for purposes of interchange. The o-encoding attribute specifies the original or preferred code set of the data of the element in case it is to be re-encoded in a non-Unicode code set.

Value description:

One of the [IANA] recommended "charset identifiers", if possible.

Default value:

Undefined.

Used in:

<header>, <tu>, <tuv>, <note>.


o-tmf

Original translation memory format - Specifies the format of the translation memory file from which the TMX document or segment thereof has been generated.

Value description:

Text.

Default value:

Undefined.

Used in:

<header>, <tu>, <tuv>.


pos

Position - Indicates whether a tag replaced by <itag> was the start or the end tag of a matched pair.

Value description:

"start" or "end" (Note that if pos is not specified it is assumed that the markup represented by <itag> is unpaired, e.g., an XHTML <br /> tag.)

Default value:

Undefined.

Used in:

<itag>
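
The pairing implied by pos and x can be checked mechanically. The sketch below is non-normative: the function name is hypothetical, elements are assumed to be unprefixed, and only the counts of start and end tags per x value are compared, not their order within the segment.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def unpaired_itags(seg):
    """Return the x values whose start and end <itag> counts differ."""
    starts, ends = Counter(), Counter()
    for itag in seg.iter("itag"):
        pos = itag.get("pos")
        if pos == "start":
            starts[itag.get("x", "")] += 1
        elif pos == "end":
            ends[itag.get("x", "")] += 1
        # No pos attribute: the markup is unpaired, nothing to match.
    return sorted(x for x in set(starts) | set(ends)
                  if starts[x] != ends[x])
```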


segtype

Segment type - Specifies the kind of segmentation used in the <tu> element. If a <tu> element does not have a segtype attribute specified, it uses the one defined in the <header> element.

The "block" value is used when the segment does not correspond to one of the other values; for example, it may be desirable in some instances to store a chapter composed of several paragraphs in a single <tu>.

The rules on how the text was segmented can be carried in a Segmentation Rules eXchange (SRX) document.

Value description:

"block", "paragraph", "sentence", or "phrase".

Default value:

Undefined.

Used in:

<header>, <tu>.


srclang

Source language - Specifies the language of the source text. In other words, the <tuv> holding the source segment will have its xml:lang attribute set to the same value as srclang (except if srclang is set to "*all*"). If a <tu> element does not have a srclang attribute specified, it uses the one defined in the <header> element.

Value description:

A language code as described in [RFC 4646], or the value "*all*" if any language can be used as the source language. Unlike the other TMX attributes, the values for srclang are not case-sensitive.

Default value:

Undefined.

Used in:

<header>, <tu>.
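
The inheritance rule for srclang can be sketched as follows. This is a non-normative example with a hypothetical function name and unprefixed element names; it compares language codes by exact, case-insensitive match only and returns None when srclang is "*all*" or no variant matches.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def source_variant(tu, header):
    """Return the <tuv> holding the source segment of a <tu>, or None."""
    # A srclang on <tu> overrides the one defined in <header>.
    srclang = tu.get("srclang") or header.get("srclang")
    if srclang is None or srclang == "*all*":
        return None  # any language may serve as the source
    for tuv in tu.findall("tuv"):
        lang = tuv.get(XML_LANG)
        if lang is not None and lang.lower() == srclang.lower():
            return tuv
    return None
```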


tuid

Translation unit identifier - Specifies an identifier for the <tu> element. Its value must be unique within the file.

Value description:

Text without spaces.

Default value:

Undefined.

Used in:

<tu>.


type

Type - Specifies the kind of data an <itag>, <hi>, or <sub> element represents.

Value description:

Text. Depends on the element where the attribute is used.

The recommended values for the type attribute, when used in an <itag> element are as follows. Note that some values of type should logically be used only for pairs of tags.

bold

Bold.

color

Color change.

dulined

Double-underlined.

emphasis

Emphasis.

font

Font change.

italic

Italic.

link

Linked text.

scap

Small caps.

strong

Strong.

struct

XML/SGML structure.

ulined

Underlined.

xliff-bpt

XLIFF <bpt> tag.

xliff-g

XLIFF <g> tag.

index

Index marker.

date

Date.

time

Time.

fnote

Footnote.

enote

End-note.

alt

Alternate text.

image

Image.

pb

Page break.

lb

Line break.

cb

Column break.

inset

Inset.

xliff-bx

XLIFF <bx/> tag.

xliff-ex

XLIFF <ex/> tag.

xliff-it

XLIFF <it> tag.

xliff-ph

XLIFF <ph> tag.

xliff-x

XLIFF <x/> tag.

The recommended values for the type attribute, when used in <hi>, are as follows:

abbrev

Indicates the marked text is an abbreviation.

abbreviated-form

ISO-12620 2.1.8: A term resulting from the omission of any part of the full term while designating the same concept.

abbreviation

ISO-12620 2.1.8.1: An abbreviated form of a simple term resulting from the omission of some of its letters (e.g. 'adj.' for 'adjective').

acronym

ISO-12620 2.1.8.4: An abbreviated form of a term made up of letters from the full form of a multi-word term strung together into a sequence pronounced only syllabically (e.g. 'radar' for 'radio detecting and ranging').

appellation

ISO-12620: A proper-name term, such as the name of an agency or other proper entity.

collocation

ISO-12620 2.1.18.1: A recurrent word combination characterized by cohesion in that the components of the collocation must co-occur within an utterance or series of utterances, even though they do not necessarily have to maintain immediate proximity to one another.

common-name

ISO-12620 2.1.5: A synonym for an international scientific term that is used in general discourse in a given language.

datetime

Indicates the marked text is a date and/or time.

equation

ISO-12620 2.1.15: An expression used to represent a concept based on a statement that two mathematical expressions are, for instance, equal as identified by the equal sign (=), or assigned to one another by a similar sign.

expanded-form

ISO-12620 2.1.7: The complete representation of a term for which there is an abbreviated form.

formula

ISO-12620 2.1.14: Figures, symbols or the like used to express a concept briefly, such as a mathematical or chemical formula.

head-term

ISO-12620 2.1.1: The concept designation that has been chosen to head a terminological record.

initialism

ISO-12620 2.1.8.3: An abbreviated form of a term consisting of some of the initial letters of the words making up a multi-word term or the term elements making up a compound term when these letters are pronounced individually (e.g. 'BSE' for 'bovine spongiform encephalopathy').

international-scientific-term

ISO-12620 2.1.4: A term that is part of an international scientific nomenclature as adopted by an appropriate scientific body.

internationalism

ISO-12620 2.1.6: A term that has the same or nearly identical orthographic or phonemic form in many languages.

logical-expression

ISO-12620 2.1.16: An expression used to represent a concept based on mathematical or logical relations, such as statements of inequality, set relationships, Boolean operations, and the like.

materials-management-unit

ISO-12620 2.1.17: A unit used to track an object.

name

Indicates the marked text is a name.

near-synonym

ISO-12620 2.1.3: A term that represents the same or a very similar concept as another term in the same language, but for which interchangeability is limited to some contexts and inapplicable in others.

part-number

ISO-12620 2.1.17.2: A unique alphanumeric designation assigned to an object in a manufacturing system.

phrase

Indicates the marked text is a phrase.

phraseological-unit

ISO-12620 2.1.18: Any group of two or more words that form a unit, the meaning of which frequently cannot be deduced based on the combined sense of the words making up the phrase.

protected

Indicates the marked text should not be translated.

romanized-form

ISO-12620 2.1.12: A form of a term resulting from an operation whereby non-Latin writing systems are converted to the Latin alphabet.

set-phrase

ISO-12620 2.1.18.2: A fixed, lexicalized phrase.

short-form

ISO-12620 2.1.8.2: A variant of a multi-word term that includes fewer words than the full form of the term (e.g. 'Group of Twenty-four' for 'Intergovernmental Group of Twenty-four on International Monetary Affairs').

sku

ISO-12620 2.1.17.1: Stock keeping unit, an inventory item identified by a unique alphanumeric designation assigned to an object in an inventory control system.

standard-text

ISO-12620 2.1.19: A fixed chunk of recurring text.

symbol

ISO-12620 2.1.13: A designation of a concept by letters, numerals, pictograms or any combination thereof.

synonym

ISO-12620 2.1.2: Any term that represents the same or a very similar concept as the main entry term in a term entry.

synonymous-phrase

ISO-12620 2.1.18.3: Phraseological unit in a language that expresses the same semantic content as another phrase in that same language.

term

Indicates the marked text is a term.

transcribed-form

ISO-12620 2.1.11: A form of a term resulting from an operation whereby the characters of one writing system are represented by characters from another writing system, taking into account the pronunciation of the characters converted.

transliterated-form

ISO-12620 2.1.10: A form of a term resulting from an operation whereby the characters of an alphabetic writing system are represented by characters from another alphabetic writing system.

truncated-term

ISO-12620 2.1.8.5: An abbreviated form of a term resulting from the omission of one or more term elements or syllables (e.g. 'flu' for 'influenza').

variant

ISO-12620 2.1.9: One of the alternate forms of a term.

Any of the suggested values listed in the tables above can be used with the <sub> element.

In addition, user-defined values can be used with this attribute. A user-defined value must start with an "x-" prefix.

Default value:

Undefined.

Used in:

<itag>, <hi>, <sub>.
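
The following non-normative sketch shows one way a tool might classify a type value against the lists above. The function name classify_type and its three return labels are hypothetical; the value sets are copied from this section.

```python
# Values listed for <itag> in this section.
ITAG_TYPES = set("""
bold color dulined emphasis font italic link scap strong struct ulined
xliff-bpt xliff-g index date time fnote enote alt image pb lb cb inset
xliff-bx xliff-ex xliff-it xliff-ph xliff-x
""".split())

# Values listed for <hi> in this section.
HI_TYPES = set("""
abbrev abbreviated-form abbreviation acronym appellation collocation
common-name datetime equation expanded-form formula head-term initialism
international-scientific-term internationalism logical-expression
materials-management-unit name near-synonym part-number phrase
phraseological-unit protected romanized-form set-phrase short-form sku
standard-text symbol synonym synonymous-phrase term transcribed-form
transliterated-form truncated-term variant
""".split())

# <sub> may use any value from either list.
LISTED = {"itag": ITAG_TYPES, "hi": HI_TYPES, "sub": ITAG_TYPES | HI_TYPES}

def classify_type(element_name, value):
    """Return "listed", "user-defined", or "unlisted" for a type value."""
    if value in LISTED[element_name]:
        return "listed"
    # User-defined values must start with the "x-" prefix.
    if value.startswith("x-") and len(value) > 2:
        return "user-defined"
    return "unlisted"

print(classify_type("itag", "lb"))
print(classify_type("sub", "x-title"))
print(classify_type("hi", "bold"))
```

A value reported as "unlisted" is neither one of the suggested values for that element nor a properly prefixed user-defined value.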


uid

Unique ID - The "uid" attribute is used to provide a unique ID to identify the file that contains the segmentation rules used when generating the TMX document.

Value description:

Text.

Default value:

Undefined.

Used in:

<external-file>.


usagecount

Usage count - Specifies the number of times a <tu> or the content of the <tuv> element has been accessed in the original TM environment.

Value description:

Number.

Default value:

Undefined.

Used in:

<tu>, <tuv>.


version

TMX version - The version attribute indicates the version of the TMX format to which the document conforms.

Value description:

Fixed text: the major version number, a period, and the minor version number. For example: version="2.0".

Default value:

"2.0"

Used in:

<tmx>.


x

Tag match - The x attribute is used to match inline <itag> and <hi> elements between each <tuv> element of a given <tu> element and to facilitate pairing of <itag> elements within a <tuv> element. This mechanism facilitates the pairing of allied codes in source and target text, even if the order of code occurrence differs between the two because of the translation syntax. Note that <itag> elements representing logically paired tags must share identical values of the x attribute but will differ in their pos attribute, as shown in the examples below.

Also note that, due to differences between languages, not all values of x found in one <tuv> element will necessarily be found across all <tuv> elements within a single <tu>, and that values of the type attribute may differ on <itag> elements in different <tuv> elements that share x values. For example, an English source text might have two instances of italic text that correspond to one span of bold text in a Spanish translation, or a translation may have formatting not found in the source text. Appropriate use of the type attribute, along with use of matching x values, can improve reuse in such circumstances.

The following example shows how x can be used to indicate pairs of tags and matches of tags across languages:

<seg>link to <itag pos="start" type="link" x="1">&lt;a href="www.mysite.com" title="<sub type="x-title">my site</sub>"&gt;</itag>my web site<itag pos="end" x="1">&lt;/a&gt;</itag>, and this is <itag type="image" x="2">&lt;img src="john.gif" alt="<sub type="alt">John's picture</sub>"/&gt;</itag> John.</seg>

<seg>enlace a <itag pos="start" type="link" x="1">&lt;a href="www.mysite.com/es" title="<sub type="x-title">mi sitio</sub>"&gt;</itag>mi sitio web<itag pos="end" x="1">&lt;/a&gt;</itag>, y este es <itag type="image" x="2">&lt;img src="juan.gif" alt="<sub type="alt">foto de Juan</sub>"/&gt;</itag> Juan.</seg>

The corresponding examples from a translation memory tool that does not store markup would be:

<seg>link to <itag pos="start" type="link" x="1"><sub type="x-title">my site</sub></itag>my web site<itag pos="end" x="1" />, and this is <itag type="image" x="2"><sub type="alt">John's picture</sub></itag> John.</seg>

<seg>enlace a <itag pos="start" type="link" x="1"><sub type="x-title">mi sitio</sub></itag>mi sitio web<itag pos="end" x="1" /> y este es <itag type="image" x="2"><sub type="alt">foto de Juan</sub></itag> Juan.</seg>

Value description:

Number starting at 1 and incremented in steps of 1. Within a given <seg> element, the value of the x attribute must be unique for each <hi> element and for each <itag> element that lacks a pos attribute or has a value of "start" for the pos attribute. Its initial value is reset to 1 in every <seg> element.

Default value:

Undefined.

Used in:

<itag>, <hi>.
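
The uniqueness rule and the cross-language matching described for the x attribute can be sketched, non-normatively, as follows. The function names x_problems and shared_x are hypothetical, and the sample uses empty <itag> elements without the TMX namespace.

```python
import xml.etree.ElementTree as ET

def x_problems(seg):
    """Report x values repeated on <hi>, unpaired <itag>, or start <itag> elements."""
    seen = set()
    problems = []
    for el in seg.iter():
        if el.tag not in ("itag", "hi") or el.get("x") is None:
            continue
        if el.tag == "itag" and el.get("pos") == "end":
            continue  # an end tag shares the x value of its start tag
        x = el.get("x")
        if x in seen:
            problems.append(f'x="{x}" is repeated on <{el.tag}>')
        seen.add(x)
    return problems

def shared_x(tu):
    """Return the x values that occur in every <tuv> of a <tu>."""
    per_tuv = [
        {el.get("x") for el in tuv.iter() if el.tag in ("itag", "hi") and el.get("x")}
        for tuv in tu.findall("tuv")
    ]
    return set.intersection(*per_tuv) if per_tuv else set()

tu = ET.fromstring(
    '<tu>'
    '<tuv xml:lang="en"><seg>link to <itag pos="start" type="link" x="1"/>my web site'
    '<itag pos="end" x="1"/>, and this is <itag type="image" x="2"/> John.</seg></tuv>'
    '<tuv xml:lang="es"><seg>enlace a <itag pos="start" type="link" x="1"/>mi sitio web'
    '<itag pos="end" x="1"/> y este es <itag type="image" x="2"/> Juan.</seg></tuv>'
    '</tu>'
)
for seg in tu.iter("seg"):
    print(x_problems(seg))
print(sorted(shared_x(tu)))
```

Because not every x value has to appear in every <tuv>, shared_x reports only the values that can be aligned across all languages; values missing from one language are not treated as errors here.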


3.2.2. XML Namespace Attributes

xml:lang

Language - The "xml:lang" attribute specifies the locale of the text of a given element.

Value description:

A language code as described in [RFC 4646]. This declared value is considered to apply to all elements within the content of the element where it is specified, unless overridden with another instance of the xml:lang attribute. Unlike the other TMX attributes, the values for xml:lang are not case-sensitive. For more information see the section on xml:lang in the XML specification.

Default value:

Undefined.

Used in:

<tuv>, <note>.


xml:space

White spaces - The "xml:space" attribute specifies how white spaces (ASCII spaces, tabs and line-breaks) should be treated.

Value description:

default or preserve. The value default signals that an application's default white-space processing modes are acceptable for this element; the value preserve indicates the intent that applications preserve all the white space. This declared intent is considered to apply to all elements within the content of the element where it is specified, unless overridden with another instance of the xml:space attribute. For more information see the section on xml:space in the XML specification.

Default value:

default.

Used in:

<seg>
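
The inheritance behaviour described above (a declared value applies to all contained elements unless overridden) can be sketched, non-normatively, as a simple tree walk. The function name effective_space is hypothetical, and the sample omits the TMX namespace.

```python
import xml.etree.ElementTree as ET

XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

def effective_space(root):
    """Return a dict mapping every element to its effective xml:space value."""
    result = {}

    def walk(element, inherited):
        # An explicit xml:space overrides the inherited value.
        value = element.get(XML_SPACE, inherited)
        result[element] = value
        for child in element:
            walk(child, value)

    walk(root, "default")  # "default" is the default value of the attribute
    return result

tuv = ET.fromstring(
    '<tuv xml:lang="en">'
    '<seg xml:space="preserve">two  spaces <hi type="term" x="1">kept</hi></seg>'
    '</tuv>'
)
spaces = effective_space(tuv)
for element, value in spaces.items():
    print(element.tag, value)
```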


4. Content Markup

4.1. Overview

TM systems use a variety of methods of marking up or representing internal formatting. Formats are constantly evolving, and new formats are introduced on a regular basis. Attempting to collect, interpret, disseminate and maintain finite descriptions of each formatting tag used at any given time by TM systems is not possible. In addition, TM tools may or may not include the actual content of formatting tags in their databases; that is, some tools may include only markers for the location of tags, relying instead on the presence of formatting tags in source files to insert tags when the memory is used to provide leverage against actual files that are to be translated.

At present, the best way to deal with these native codes in general is to delimit them by a specific set of elements that convey where they begin and end, and possibly additional information about what they are (bold, italic, footnote, etc.). (Note, however, that in some cases inline content markup may be left unencapsulated to meet specific needs. Guidance about how best to represent markup for specific needs and cases is beyond the scope of this standard.)

The element <sub> is provided to delimit (potentially translatable) sub-flow text within a sequence of native codes. For instance, if the text content of a footnote is defined within the footnote marker code, it may be demarked with the <sub> element.

4.2. Representing Inline Elements

TMX provides a mechanism for indicating the position of inline markup and encapsulating this markup when it is available to the tool creating a TMX file.

4.2.1. When the Content of Tags Is Available in the Translation Memory

When a TM tool contains the content of tags, this information must be included in the TMX file produced by that tool. Including this information allows other tools capable of using this information to do so. Tools that do not store tags internally may discard the content of tags but should instead include appropriate tag markers in translation memories generated from TMX files that do include this information.

Inline markup in translation memory data is stored by using the <itag> element to surround the markup (with any < or & characters converted to their corresponding character entities, &lt; and &amp;). The <itag> element can take one of three forms:

  1. If the element encloses the start tag in a set of paired tags, <itag> is given the value of "start" for the pos attribute. For example: This text contains an <itag pos="start" x="1">&lt;em></itag>HTML start tag. (This usage replaces the <bpt> element found in previous versions of TMX.)

  2. If the element encloses an end tag, it is given the value of "end" for the pos attribute, and the value of its x attribute must agree with the value of x found in the <itag> element that encloses the corresponding start tag. For example: It is finished <itag pos="end" x="1">&lt;/em></itag> in this text. (This usage replaces the <ept> element found in previous versions of TMX.)

  3. If the element encloses unpaired content (such as an XHTML <br /> tag) it is not given a value for the pos attribute. (This usage replaces the <ut> element found in previous versions of TMX.)

Note that if an <itag> element encapsulates paired markup whose corresponding start or end markup is not present in the same <seg> element, it should have a unique value for the x attribute. Correct use of the pos attribute will enable TMX-compliant applications to correctly interpret the tag content as a start or end tag that has been isolated. (This ability replaces the functionality of the <it> (isolated tag) element found in previous versions of TMX.)
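
The three forms of <itag>, and the handling of isolated tags, can be sketched non-normatively for a simple HTML fragment. The function name encapsulate, the regular expression, and the HTML_TYPES mapping are assumptions made for illustration. The sketch does not extract sub-flow text into <sub> elements, treats a void tag written without a trailing slash (such as <br>) as a start tag, and escapes > as well as < and &.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical mapping from HTML element names to type values.
HTML_TYPES = {"em": "emphasis", "strong": "strong", "b": "bold", "i": "italic",
              "u": "ulined", "a": "link", "img": "image", "br": "lb"}

# Matches simple tags; attribute values containing < or > are not supported.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*?(/?)>")

def encapsulate(fragment):
    """Wrap each native tag of an HTML fragment in an <itag> element."""
    out, stack, next_x, last = [], [], 1, 0
    for match in TAG.finditer(fragment):
        out.append(escape(fragment[last:match.start()]))
        last = match.end()
        is_end, name, is_empty = match.group(1), match.group(2).lower(), match.group(3)
        if is_end and stack and stack[-1][0] == name:
            x = stack.pop()[1]               # end tag reuses the x of its start tag
        else:
            x, next_x = next_x, next_x + 1   # start, unpaired, or isolated end tag
            if not is_end and not is_empty:
                stack.append((name, x))
        if is_end:
            pos = ' pos="end"'
        elif is_empty:
            pos = ""                         # unpaired content carries no pos
        else:
            pos = ' pos="start"'
        kind = HTML_TYPES.get(name, "x-html-" + name)
        out.append(f'<itag{pos} x="{x}" type="{kind}">{escape(match.group(0))}</itag>')
    out.append(escape(fragment[last:]))
    return "<seg>" + "".join(out) + "</seg>"

result = encapsulate("There were <em>many</em> French ships involved.")
print(result)
ET.fromstring(result)  # raises ParseError if the result is not well-formed
```

An end tag with no open start tag in the fragment, such as the one in "model CR245 only.</strong>", receives a fresh x value and pos="end", which corresponds to the isolated case described above.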

4.2.2. When the Content of Tags Is Unavailable in the Translation Memory

When a TM tool does not contain the content of tags, empty <itag> elements are used in a manner otherwise identical to the case in which markup is encapsulated. Note that even where empty <itag> elements are otherwise used, an <itag> element with separate start and end tags must be used if sub-flow (marked with the <sub> element) is represented, because the <sub> element has to be enclosed. For example, a translatable title attribute in an HTML <a> element would be represented by something like <itag pos="start" x="4" type="link"><sub type="x-title">Site title</sub></itag>.

Examples:

  1. Paired tags

    Source text:

    <p>link to <em><a href="www.mysite.com" title="My Site">my web site</a></em>.</p>

    Text with encapsulated content markup:

    <seg>link to <itag pos="start" x="1" type="emphasis">&lt;em></itag><itag pos="start" x="2"
    type="link">&lt;a href="www.mysite.com" title="<sub type="x-title">My Site</sub>"&gt;
    </itag>my web site<itag pos="end" x="2">&lt;/a&gt;</itag><itag pos="end" x="1">&lt;/em></itag>.</seg>

    Text without encapsulated content markup:

    <seg>link to <itag pos="start" x="1" type="emphasis" /><itag pos="start" x="2" type="link">
    <sub type="x-title">My Site</sub></itag>
    my web site<itag pos="end" x="2" /><itag pos="end" x="1" />.</seg>
  2. Paired tags

    Source text:

    <p>There were <em>many</em> French ships involved.</p>

    Text with encapsulated content markup:

    <seg>There were <itag pos="start" x="1" type="emphasis">&lt;em></itag>many<itag pos="end"
    x="1">&lt;/em></itag> French ships involved.</seg>

    Text without encapsulated content markup:

    <seg>There were <itag pos="start" x="1" type="emphasis" />many<itag pos="end" x="1" /> French 
    ships involved.</seg>
  3. Unpaired tags

    Source text:

    This is <br /><img src="john.gif" alt="John's picture"/> John.

    Text with encapsulated content markup:

    ...<seg>This is <itag type="lb" x="1" equiv-text="linefeed">&lt;br /></itag><itag type="image"
    x="2">&lt;img src="john.gif" alt="<sub type="alt">John's picture</sub>"/&gt;</itag> John.</seg>

    Text without encapsulated content markup:

    ...<seg>This is <itag type="lb" x="1" equiv-text="linefeed" /><itag type="image" x="2"><sub
    type="alt">John's picture</sub></itag> John.</seg>
  4. Text with a paired tag whose pair is not found in the same segment

    Source text:

    This warning applies to users of model CR245 only.</strong>

    Text with encapsulated content markup:

    This warning applies to users of model CR245 only.<itag pos="end" type="strong" x="1">&lt;/strong></itag>

    Text without encapsulated content markup:

    This warning applies to users of model CR245 only.<itag pos="end" type="strong" x="1" />

Note that both methods of representing inline markup are considered valid TMX. TMX-compliant tools that do not store source-format markup in their databases may simply discard encapsulated markup that they are unable to use. TMX-compliant tools that do store markup and receive files without encapsulated source-format markup may require access to source files and additional processing to properly interpret these files and should not simply discard empty <itag> elements.
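
As a non-normative sketch, a tool that does not store source-format markup might reduce encapsulated <itag> elements to their empty form while keeping any <sub> sub-flow, along the following lines. The function name strip_native_codes is hypothetical, and the sample omits the TMX namespace.

```python
import xml.etree.ElementTree as ET

def strip_native_codes(seg):
    """Remove native code text from every <itag>, keeping <sub> sub-flows."""
    for itag in seg.iter("itag"):
        itag.text = None          # code text before the first child
        for sub in itag.findall("sub"):
            sub.tail = None       # code text that followed the sub-flow
    return seg

seg = ET.fromstring(
    '<seg>This is <itag type="lb" x="1" equiv-text="linefeed">&lt;br /&gt;</itag>'
    '<itag type="image" x="2">&lt;img src="john.gif" alt="'
    '<sub type="alt">John\'s picture</sub>"/&gt;</itag> John.</seg>'
)
strip_native_codes(seg)
print(ET.tostring(seg, encoding="unicode"))
```

The attributes of each <itag> (type, x, pos, equiv-text) and the translatable text around it are left untouched, so the position of every tag is still recorded.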


5. TMX Compliance

TMX compliance is defined as follows:

  • Given:

    • An original document with inline codes (for example an HTML file) translated by a tool XYZ.

    • The translation memory of that document saved in TMX format, using <itag> elements as described in the section on Representing Inline Elements.

    • The segmentation rules in SRX format used to break blocks of source text into smaller fragments, either embedded in the TMX document or referenced in an <external-file> element.

  • Assuming:

    • The translated segments do not have more or fewer tags than the source segments.

    • All non-TMX elements and attributes have been removed from the TMX file.

The tool XYZ supports TMX Export if the TMX document created by tool XYZ contains all the information required to re-create the translated document without loss of text, data or formatting.

The tool XYZ supports TMX Import if any TMX document containing all the information required to re-create the translated document (possibly created by a TMX Export compliant tool) can be imported into tool XYZ and effectively be used to re-create the translated document without loss of text, data or formatting.

Tools that offer both import and export features must support both TMX Import and TMX Export to be TMX compliant.

Whenever possible, the original formatting information should be included in the exported TMX file, enclosed in <itag> elements.

Because many translation memory tools do not store source markup in their databases (and instead extract markup from source files at translation time), it may not be possible to include the original source formatting codes in inline elements. In such cases, the inline elements must still be present in the correct places in the form of empty <itag> elements and they must comply with the section on Representing Inline Elements.

5.1 Validation of TMX Files

A cross-platform utility that validates TMX documents against the TMX Schema, and also verifies whether they follow the requirements described in this document, is included as part of the TMX 2.0 specification.

The source code of the validation tool is available for download from OSCAR’s web site.


6. Changes Since Previous Version (Non-Normative)

The main changes in this version (2.0) relative to the previous version (1.4b) are as follows:

  • TMX 2.0 is based on an XML Schema instead of a DTD.

  • New elements. The following elements were added to the TMX standard: <context>, <itag>, <segmentation>, <internal-file> and <external-file>.

  • Removed elements. The following elements were removed from the TMX standard: <bpt>, <ept>, <it>, <ph>, <map>, <prop>, <ude>, <ut>.

  • New attributes. The following attributes were incorporated: xml:space, comment, context-type, crc, group, g-order, href, equiv-text.

  • A new set of unified and simplified rules for representing inline elements was designed. See the section on Representing Inline Elements for more details.

  • Attribute type marked as required in all inline elements.

  • Replaced implementation levels 1 and 2 with a unique level of compliance. TMX files must include all the necessary inline data to re-create the translation of source documents (optionally requiring the actual source document at processing time) to be considered TMX compliant. See section TMX Compliance for more details.

  • Required uniqueness of tuid attribute within a TMX file.

  • Added a new attribute, pos (position) for use in inline markup to allow recording of the type of the position (i.e., start or end tag) encapsulated by that element.

  • Values of the datatype attribute are now mandated to be from the list of MIME types. The previously existing values are now listed for compatibility with XLIFF or for use when MIME types are insufficiently specific for language-processing purposes.

  • All metamarkup from previous versions of TMX was eliminated in favor of a single tag, <itag>, which indicates the location of tags in the source document and, optionally, can also encapsulate the content of those tags, if this information is available to the application creating a TMX file.

6.1 Upgrading TMX Files

It should be possible to upgrade a valid TMX 1.4b file to 2.0 by:

  1. Removing any DOCTYPE declaration from the file

  2. Changing the value of version attribute from "1.4" to "2.0"

  3. Removing all TMX 1.4 elements and attributes that have been deprecated in TMX 2.0 (e.g., <ut>)

  4. Converting all <prop> elements to attributes using another XML namespace. If the content of <prop> elements is too complex to be represented in attributes, the use of elements from another XML namespace may be required to represent them fully.

  5. Replacing old-style metamarkup (e.g., <bpt>/<ept> pairs) with <itag> elements as necessary to comply with the section on Representing Inline Elements.
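
Steps 1, 2 and 5 above can be sketched non-normatively as follows. The function names upgrade_seg and upgrade are hypothetical and the input is an abbreviated TMX 1.4b fragment. The sketch makes several assumptions: a 1.4b x value is reused where present, otherwise a fresh number is allocated; <ph> and <ut> are both rewritten as unpaired <itag> elements; type values are carried over unchanged; and neither the TMX 2.0 namespace nor the <prop> conversion of step 4 is handled.

```python
import xml.etree.ElementTree as ET

def upgrade_seg(seg):
    """Rewrite TMX 1.4b inline elements in one <seg> as <itag> elements."""
    inline = [el for el in seg.iter() if el.tag in ("bpt", "ept", "it", "ph", "ut")]
    used = {el.get("x") for el in inline if el.get("x")}
    counter = 0

    def fresh_x():
        """Allocate the next number not already used as an x value."""
        nonlocal counter
        counter += 1
        while str(counter) in used:
            counter += 1
        used.add(str(counter))
        return str(counter)

    x_for_i = {}  # 1.4b pairing attribute i -> x value of the start tag
    for el in inline:
        old = dict(el.attrib)
        el.attrib.clear()
        if el.tag == "bpt":
            x = old.get("x") or fresh_x()
            x_for_i[old.get("i")] = x
            el.set("pos", "start")
        elif el.tag == "ept":
            x = x_for_i.get(old.get("i")) or fresh_x()
            el.set("pos", "end")
        elif el.tag == "it":
            x = old.get("x") or fresh_x()
            # 1.4b uses pos="begin"; the examples in this document use pos="start".
            el.set("pos", "start" if old.get("pos") == "begin" else "end")
        else:  # <ph> and <ut> carry unpaired codes, so no pos attribute
            x = old.get("x") or fresh_x()
        el.set("x", x)
        if "type" in old:
            el.set("type", old["type"])
        el.tag = "itag"
    return seg

def upgrade(tmx14_text):
    root = ET.fromstring(tmx14_text)  # ElementTree does not keep the DOCTYPE
    root.set("version", "2.0")
    for seg in root.iter("seg"):
        upgrade_seg(seg)
    return ET.tostring(root, encoding="unicode")

old_doc = (
    '<!DOCTYPE tmx SYSTEM "tmx14.dtd">'
    '<tmx version="1.4"><header/><body><tu><tuv xml:lang="en"><seg>'
    'Text in <bpt i="1" x="1" type="italic">&lt;i&gt;</bpt>italics'
    '<ept i="1">&lt;/i&gt;</ept><ph type="lb">&lt;br/&gt;</ph>'
    '</seg></tuv></tu></body></tmx>'
)
new_doc = upgrade(old_doc)
print(new_doc)
```

A production converter would also need to validate the result against the TMX 2.0 schema, since carried-over type values may not be among the values listed for <itag>.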


A. Sample Document

<?xml version="1.0" encoding="UTF-8"?>
<tmx version="2.0" 
   xmlns="http://www.lisa.org/tmx20"
   xsi:schemaLocation="http://www.lisa.org/tmx20 tmx20.xsd"
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xmlns:xyz="urn:myApps:xyz">
   <header creationtool="Sample Creator" creationtoolversion="1.1.1" 
      segtype="block" o-tmf="unknown" adminlang="en-US" srclang="*all*" datatype="x-sample">
      <segmentation>
         <internal-file xyz:myattribute="custom rules">
            <!-- Segmentation rules in SRX 2.0 format -->
            <srx:srx version="2.0" xmlns:srx="http://www.lisa.org/srx20">
               <srx:header segmentsubflows="yes" cascade="yes">
                  <srx:formathandle type="start" include="no"/>
                  <srx:formathandle type="end" include="yes"/>
                  <srx:formathandle type="isolated" include="yes"/>
               </srx:header>
               <srx:body>
                  <srx:languagerules>
                     <srx:languagerule languagerulename="Default">
                        <!-- Common rule for most languages -->
                        <srx:rule break="yes">
                           <srx:beforebreak>[\.\?!]+</srx:beforebreak>
                           <srx:afterbreak>\s</srx:afterbreak>
                        </srx:rule>
                     </srx:languagerule>
                  </srx:languagerules>
                  <srx:maprules>
                     <!-- Common breaking rules -->
                     <srx:languagemap languagepattern=".*" languagerulename="Default"/>
                  </srx:maprules>
               </srx:body>
            </srx:srx>
         </internal-file>
      </segmentation>
      <!-- Other elements -->
      <xyz:other />
   </header>   
   <body>
      <!-- Paired codes with translatable text -->
      <tu srclang="en-US" datatype="html" tuid="sample1">
         <tuv xml:lang="en-US" datatype="html">
            <seg>link to <itag type="link" x="1" pos="start">&lt;a href="www.mysite.com" 
            title="<sub type="x-title">my site</sub>"&gt;</itag>my web site<itag 
            pos="end" x="1" type="link">&lt;/a&gt;</itag>.</seg>
         </tuv>
         <tuv xml:lang="es" datatype="html">
            <seg>enlace a <itag type="link" x="1" pos="start">&lt;a href="www.mysite.com/es" 
            title="<sub type="x-title">mi sitio</sub>"&gt;</itag>mi sitio web<itag pos="end" 
            x="1" type="link">&lt;/a&gt;</itag>.</seg>
         </tuv>
      </tu>
      <!-- Paired codes without translatable text -->
      <tu datatype="rtf">
         <context context-type="x-my-context">text formatting options</context>
         <tuv xml:lang="en">
            <seg>Text in <itag type="italic" x="1" pos="start"/>italics<itag type="italic" x="1" pos="end"/>.</seg>
         </tuv>
         <tuv xml:lang="fr">
            <seg>Texte en <itag type="italic" x="1" pos="start"/>italiques<itag type="italic" x="1" pos="end"/>.</seg>
         </tuv>
      </tu>
      <!-- Standalone sequence with translatable text -->
      <tu datatype="html">
         <tuv xml:lang="en-US">
            <seg>This is <itag type="image">&lt;img src="john.gif" alt="<sub type="alt">John's 
            picture</sub>"/&gt;</itag> John.</seg>
         </tuv>
         <tuv xml:lang="es">
            <seg>Este es <itag type="image">&lt;img src="juan.gif" alt="<sub type="alt">foto 
            de Juan</sub>"/&gt;</itag> Juan.</seg>
         </tuv>
      </tu>
      <!-- Standalone sequence without translatable text -->
      <tu>
         <tuv xml:lang="en">
            <seg>text displayed in <itag type="lb" equiv-text="&#0010;"/> two lines.</seg>
         </tuv>
         <tuv xml:lang="es">
            <seg>texto en <itag type="lb" equiv-text="&#0010;"/> dos líneas.</seg>
         </tuv>
      </tu>
      <!-- Notes -->
      <tu tuid="90293837" creationid="jean-claude" srclang="zh-CN" segtype="phrase">
         <note>Salutations</note>
         <note>Machine translation</note>
         <tuv xml:lang="en">
            <seg>Hello!</seg>
         </tuv>         
         <tuv o-encoding="BIG5" xml:lang="zh-CN">
            <note>Enable Unicode support for viewing this entry.</note>
            <seg>你好!</seg>
         </tuv>
      </tu>
      <!-- Untranslatable text -->
      <tu o-tmf="xliff" creationdate="20060125T210600Z" changedate="20060315T130700Z" 
      creationid="ted@mail.com">
         <tuv xml:lang="en" xml:space="default">
            <seg><hi type="protected" comment="product name">Ultrabalancer</hi> support 
            is excellent.</seg>
         </tuv>
         <tuv xml:lang="es" xml:space="default">
            <seg>El soporte de <hi type="protected">Ultrabalancer</hi> es excelente.</seg>
         </tuv>
      </tu>
      <!-- Foreign elements -->
      <xyz:database>main server</xyz:database>
      <xyz:purpose>general</xyz:purpose>
      <!-- grouped segments -->
      <tu group="numbers" g-order="1" datatype="plaintext" creationdate="20060125T210600Z">
         <tuv xml:lang="fr">
            <seg>un</seg>
         </tuv>
         <tuv xml:lang="de">
            <seg>eins</seg>
         </tuv>
         <tuv xml:lang="en">
            <seg>one</seg>
         </tuv>
      </tu>
      <tu group="numbers" g-order="2" datatype="plaintext">
         <tuv xml:lang="de">
            <seg>zwei</seg>
         </tuv>
         <tuv xml:lang="fr">
            <seg>deux</seg>
         </tuv>
         <tuv xml:lang="en">
            <seg>two</seg>
         </tuv>
      </tu>
      <tu group="numbers" g-order="3" datatype="plaintext">
         <tuv xml:lang="en">
            <seg>three</seg>
         </tuv>
         <tuv xml:lang="de">
            <seg>drei</seg>
         </tuv>
         <tuv xml:lang="fr">
            <seg>trois</seg>
         </tuv>
      </tu>     
   </body>
</tmx>


B. XML Schema for TMX

The XML Schema for TMX is available at: http://www.lisa.org/tmx/tmx20.xsd.

<?xml version="1.0" encoding="UTF-8"?>
<!--
  Document     : tmx20.xsd
  Version      : 1.0
  Created on   : December 2, 2006
  Author       : Rodolfo M. Raya (rmraya@maxprograms.com)
  Modified     :  February 18, 2009 by Rodolfo M. Raya (rmraya@maxprograms.com)
                  July 23, 2008 by Arle Lommel (arle@lisa.org)
  Description  : This XML Schema defines the structure of TMX 2.0
  Status       : Preliminary draft
  
  Copyright © The Localisation Industry Standards Association [LISA] 1997-2009. 
  All Rights Reserved.
-->
<xs:schema xmlns:tmx="http://www.lisa.org/tmx20" targetNamespace="http://www.lisa.org/tmx20"
    xml:lang="en" xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
    <xs:import namespace="http://www.w3.org/XML/1998/namespace"
        schemaLocation="http://www.w3.org/2001/xml.xsd" />
    <!--
    ================================================== 
     Restrictions
    ================================================== 
    -->
    <!-- Restrictions for segtype attribute -->
    <xs:simpleType name="segtypes">
        <xs:restriction base="xs:token">
            <xs:enumeration value="block" />
            <xs:enumeration value="paragraph" />
            <xs:enumeration value="sentence" />
            <xs:enumeration value="phrase" />
        </xs:restriction>
    </xs:simpleType>
    <!-- Restrictions for xml:space attribute -->
    <xs:simpleType name="space">
        <xs:restriction base="xs:token">
            <xs:enumeration value="default" />
            <xs:enumeration value="preserve" />
        </xs:restriction>
    </xs:simpleType>
    <!-- Restrictions for assoc attribute -->
    <xs:simpleType name="assoc_type">
        <xs:restriction base="xs:token">
            <xs:enumeration value="p" />
            <xs:enumeration value="f" />
            <xs:enumeration value="b" />
        </xs:restriction>
    </xs:simpleType>

    <!-- Restrictions for type attribute when used in <itag>  -->
    <xs:simpleType name="placeholder_type">
        <xs:restriction base="xs:token">
            <xs:enumeration value="bold" />
            <xs:enumeration value="color" />
            <xs:enumeration value="dulined" />
            <xs:enumeration value="emphasis" />
            <xs:enumeration value="font" />
            <xs:enumeration value="italic" />
            <xs:enumeration value="link" />
            <xs:enumeration value="scap" />
            <xs:enumeration value="strong" />
            <xs:enumeration value="struct" />
            <xs:enumeration value="ulined" />
            <xs:enumeration value="xliff-bpt" />
            <xs:enumeration value="xliff-g" />
            <xs:enumeration value="index" />
            <xs:enumeration value="date" />
            <xs:enumeration value="time" />
            <xs:enumeration value="fnote" />
            <xs:enumeration value="enote" />
            <xs:enumeration value="alt" />
            <xs:enumeration value="image" />
            <xs:enumeration value="pb" />
            <xs:enumeration value="lb" />
            <xs:enumeration value="cb" />
            <xs:enumeration value="inset" />
            <xs:enumeration value="xliff-bx" />
            <xs:enumeration value="xliff-ex" />
            <xs:enumeration value="xliff-it" />
            <xs:enumeration value="xliff-ph" />
            <xs:enumeration value="xliff-x" />
        </xs:restriction>
    </xs:simpleType>
    <!-- Restrictions for type attribute when used in <hi> -->
    <xs:simpleType name="term_type">
        <xs:restriction base="xs:token">
            <xs:enumeration value="abbrev" />
            <xs:enumeration value="abbreviated-form" />
            <xs:enumeration value="abbreviation" />
            <xs:enumeration value="acronym" />
            <xs:enumeration value="appellation" />
            <xs:enumeration value="collocation" />
            <xs:enumeration value="common-name" />
            <xs:enumeration value="datetime" />
            <xs:enumeration value="equation" />
            <xs:enumeration value="expanded-form" />
            <xs:enumeration value="formula" />
            <xs:enumeration value="head-term" />
            <xs:enumeration value="initialism" />
            <xs:enumeration value="international-scientific-term" />
            <xs:enumeration value="internationalism" />
            <xs:enumeration value="logical-expression" />
            <xs:enumeration value="materials-management-unit" />
            <xs:enumeration value="name" />
            <xs:enumeration value="near-synonym" />
            <xs:enumeration value="part-number" />
            <xs:enumeration value="phrase" />
            <xs:enumeration value="phraseological-unit" />
            <xs:enumeration value="protected" />
            <xs:enumeration value="romanized-form" />
            <xs:enumeration value="set-phrase" />
            <xs:enumeration value="short-form" />
            <xs:enumeration value="sku" />
            <xs:enumeration value="standard-text" />
            <xs:enumeration value="symbol" />
            <xs:enumeration value="synonym" />
            <xs:enumeration value="synonymous-phrase" />
            <xs:enumeration value="term" />
            <xs:enumeration value="transcribed-form" />
            <xs:enumeration value="transliterated-form" />
            <xs:enumeration value="truncated-term" />
            <xs:enumeration value="variant" />
        </xs:restriction>
    </xs:simpleType>
    <!-- Restrictions for context-type attribute -->
    <xs:simpleType name="context_type">
        <xs:restriction base="xs:token">
            <xs:enumeration value="database" />
            <xs:enumeration value="element" />
            <xs:enumeration value="elementtitle" />
            <xs:enumeration value="linenumber" />
            <xs:enumeration value="numparams" />
            <xs:enumeration value="paramnotes" />
            <xs:enumeration value="record" />
            <xs:enumeration value="recordtitle" />
            <xs:enumeration value="sourcefile" />
        </xs:restriction>
    </xs:simpleType>
    <!-- Restrictions for date values -->
    <xs:simpleType name="date_type">
        <xs:restriction base="xs:string">
            <!-- YYYYMMDDThhmmssZ -->
            <xs:pattern value="[1-2][0-9][0-9][0-9][0-1][0-9][0-3][0-9]T[0-2][0-9][0-5][0-9][0-5][0-9]Z"/>
        </xs:restriction>
    </xs:simpleType>
    <!-- Restrictions for user-defined attribute values -->
    <xs:simpleType name="Custom">
        <xs:restriction base="xs:string">
            <xs:pattern value="x-[^\s]+" />
        </xs:restriction>
    </xs:simpleType>
    <!--
    ================================================== 
    Structural Elements     
    ================================================== 
    -->
    <!-- Base Document Element -->
    <xs:element name="tmx">
        <xs:complexType>
            <xs:sequence>
                <xs:element ref="tmx:header" />
                <xs:element ref="tmx:body" />
            </xs:sequence>
            <xs:attribute name="version" use="required">
                <xs:simpleType>
                    <xs:restriction base="xs:string">
                        <xs:enumeration value="2.0" />
                    </xs:restriction>
                </xs:simpleType>
            </xs:attribute>
            <xs:anyAttribute namespace="##any" processContents="lax" />
        </xs:complexType>
    </xs:element>
    <!-- Body -->
    <xs:element name="body">
        <xs:complexType>
            <xs:choice minOccurs="0" maxOccurs="unbounded">
                <xs:element ref="tmx:tu" />
                <xs:any namespace="##other" processContents="lax" />
            </xs:choice>
            <xs:anyAttribute namespace="##any" processContents="lax" />
        </xs:complexType>
    </xs:element>
    <!-- Context Information -->
    <xs:element name="context">
        <xs:complexType mixed="true">
            <xs:attribute name="context-type" use="required">
                <xs:simpleType>
                    <xs:union memberTypes="tmx:context_type tmx:Custom" />
                </xs:simpleType>
            </xs:attribute>
            <xs:anyAttribute namespace="##any" processContents="lax" />
        </xs:complexType>
    </xs:element>
    <!-- External File -->
    <xs:element name="external-file">
        <xs:complexType>
            <xs:attribute name="href" use="required" />
            <xs:attribute name="crc" />
            <xs:attribute name="uid" />
            <xs:anyAttribute namespace="##any" processContents="lax" />
        </xs:complexType>
    </xs:element>
    <!-- Header -->
    <xs:element name="header">
        <xs:complexType>
            <xs:sequence>
                <xs:choice minOccurs="0" maxOccurs="unbounded">
                    <xs:element ref="tmx:note" />
                </xs:choice>
                <xs:element minOccurs="0" ref="tmx:segmentation" />
                <xs:any maxOccurs="unbounded" minOccurs="0" namespace="##other"
                    processContents="lax" />
            </xs:sequence>

            <xs:attribute name="creationtool" use="required" />
            <xs:attribute name="creationtoolversion" use="required" />
            <xs:attribute name="segtype" use="required" type="tmx:segtypes" />
            <xs:attribute name="o-tmf" use="required" />
            <xs:attribute name="adminlang" use="required" />
            <xs:attribute name="srclang" use="required" />
            <xs:attribute name="datatype" use="required" />
            <xs:attribute name="o-encoding" />
            <xs:attribute name="creationdate" type="tmx:date_type"/>
            <xs:attribute name="creationid" />
            <xs:attribute name="changedate" type="tmx:date_type"/>
            <xs:attribute name="changeid" />
            <xs:anyAttribute namespace="##any" processContents="lax" />
        </xs:complexType>
    </xs:element>
    <!-- Internal File -->
    <xs:element name="internal-file">
        <xs:complexType mixed="true">
            <xs:sequence>
                <xs:any maxOccurs="1" minOccurs="1" namespace="http://www.lisa.org/srx20"
                    processContents="lax" />
            </xs:sequence>            
            <xs:anyAttribute namespace="##any" processContents="lax" />
        </xs:complexType>
    </xs:element>
    <!-- Note -->
    <xs:element name="note">
        <xs:complexType mixed="true">
            <xs:attribute name="o-encoding" />
            <xs:attribute ref="xml:lang" />
            <xs:attribute name="creationdate" type="tmx:date_type"/>
            <xs:attribute name="creationid" />
            <xs:attribute name="changedate" type="tmx:date_type"/>
            <xs:attribute name="changeid" />
            <xs:anyAttribute namespace="##any" processContents="lax" />
        </xs:complexType>
    </xs:element>
    <!-- Segment -->
    <xs:element name="seg">
        <xs:complexType mixed="true">
            <xs:choice minOccurs="0" maxOccurs="unbounded">
                <xs:element ref="tmx:itag" />
                <xs:element ref="tmx:hi" />
            </xs:choice>
            <xs:attribute ref="xml:space" default="default" />
            <xs:anyAttribute namespace="##any" processContents="lax" />
        </xs:complexType>
    </xs:element>
    <!-- Segmentation -->
    <xs:element name="segmentation">
        <xs:complexType>
            <xs:choice>
                <xs:element ref="tmx:internal-file" />
                <xs:element ref="tmx:external-file" />
            </xs:choice>
            <xs:anyAttribute namespace="##any" processContents="lax" />
        </xs:complexType>
    </xs:element>
    <!-- Translation Unit -->
    <xs:element name="tu">
        <xs:complexType>
            <xs:sequence>
                <xs:choice minOccurs="0" maxOccurs="unbounded">
                    <xs:element ref="tmx:note" />
                    <xs:element ref="tmx:context" />
                </xs:choice>
                <xs:element ref="tmx:tuv" minOccurs="2" maxOccurs="unbounded" />
                <xs:any maxOccurs="unbounded" minOccurs="0" namespace="##other"
                    processContents="lax" />
            </xs:sequence>
            <xs:attribute name="tuid" />
            <xs:attribute name="o-encoding" />
            <xs:attribute name="datatype" />
            <xs:attribute name="usagecount">
                <xs:simpleType>
                    <xs:restriction base="xs:integer">
                        <xs:minInclusive value="0" />
                    </xs:restriction>
                </xs:simpleType>                
            </xs:attribute>
            <xs:attribute name="lastusagedate" type="tmx:date_type"/>
            <xs:attribute name="creationtool" />
            <xs:attribute name="creationtoolversion" />
            <xs:attribute name="creationdate" type="tmx:date_type"/>
            <xs:attribute name="creationid" />
            <xs:attribute name="changedate" type="tmx:date_type"/>
            <xs:attribute name="segtype" type="tmx:segtypes" />
            <xs:attribute name="changeid" />
            <xs:attribute name="o-tmf" />
            <xs:attribute name="srclang" />
            <xs:attribute name="group" />
            <xs:attribute name="g-order">
                <xs:simpleType>
                    <xs:restriction base="xs:integer">
                        <xs:minInclusive value="1" />
                    </xs:restriction>
                </xs:simpleType>
            </xs:attribute>
            <xs:anyAttribute namespace="##any" processContents="lax" />
        </xs:complexType>
    </xs:element>
    <!-- Translation Unit Variant -->
    <xs:element name="tuv">
        <xs:complexType>
            <xs:sequence>
                <xs:choice minOccurs="0" maxOccurs="unbounded">
                    <xs:element ref="tmx:note" />
                </xs:choice>
                <xs:element ref="tmx:seg" />
                <xs:any maxOccurs="unbounded" minOccurs="0" namespace="##other"
                    processContents="lax" />
            </xs:sequence>
            <xs:attribute ref="xml:lang" use="required" />
            <xs:attribute name="o-encoding" />
            <xs:attribute name="datatype" />
            <xs:attribute name="usagecount">
                <xs:simpleType>
                    <xs:restriction base="xs:integer">
                        <xs:minInclusive value="0" />
                    </xs:restriction>
                </xs:simpleType>
            </xs:attribute>
            <xs:attribute name="lastusagedate" type="tmx:date_type"/>
            <xs:attribute name="creationtool" />
            <xs:attribute name="creationtoolversion" />
            <xs:attribute name="creationdate" type="tmx:date_type"/>
            <xs:attribute name="creationid" />
            <xs:attribute name="changedate" type="tmx:date_type"/>
            <xs:attribute name="o-tmf" />
            <xs:attribute name="changeid" />
            <xs:anyAttribute namespace="##any" processContents="lax" />
        </xs:complexType>
    </xs:element>
    <!--
    ================================================== 
     Content Markup 
    ================================================== 
    -->
    <!-- Highlight -->
    <xs:element name="hi">
        <xs:complexType mixed="true">
            <xs:choice minOccurs="0" maxOccurs="unbounded">
                <xs:element ref="tmx:itag" />
            </xs:choice>
            <xs:attribute name="x">
                <xs:simpleType>
                    <xs:restriction base="xs:integer">
                        <xs:minInclusive value="1" />
                    </xs:restriction>
                </xs:simpleType>
            </xs:attribute>
            <xs:attribute name="type" use="required">
                <xs:simpleType>
                    <xs:union memberTypes="tmx:term_type tmx:Custom" />
                </xs:simpleType>
            </xs:attribute>
            <xs:attribute name="comment" />
            <xs:anyAttribute namespace="##any" processContents="lax" />
        </xs:complexType>
    </xs:element>
    <!-- Internal Tag -->
    <xs:element name="itag">
        <xs:complexType mixed="true">
            <xs:sequence>
                <xs:element minOccurs="0" maxOccurs="unbounded" ref="tmx:sub" />
            </xs:sequence>
            <xs:attribute name="x">
                <xs:simpleType>
                    <xs:restriction base="xs:integer">
                        <xs:minInclusive value="1" />
                    </xs:restriction>
                </xs:simpleType>
            </xs:attribute>
            <xs:attribute name="assoc" type="tmx:assoc_type" />
            <xs:attribute name="equiv-text" />
            <xs:attribute name="pos">
                <xs:simpleType>
                    <xs:restriction base="xs:token">
                        <xs:enumeration value="start" />
                        <xs:enumeration value="end" />
                    </xs:restriction>
                </xs:simpleType>
            </xs:attribute>
            <xs:attribute name="type" use="required">
                <xs:simpleType>
                    <xs:union memberTypes="tmx:placeholder_type tmx:Custom" />
                </xs:simpleType>
            </xs:attribute>
            <xs:anyAttribute namespace="##any" processContents="lax" />
        </xs:complexType>
    </xs:element>
    <!-- Subflow -->
    <xs:element name="sub">
        <xs:complexType mixed="true">
            <xs:sequence minOccurs="0" maxOccurs="unbounded">
                <xs:element ref="tmx:itag" />
            </xs:sequence>
            <xs:attribute name="datatype" />
            <xs:attribute name="type" use="required">
                <xs:simpleType>
                    <xs:union memberTypes="tmx:placeholder_type tmx:term_type tmx:Custom" />
                </xs:simpleType>
            </xs:attribute>
            <xs:anyAttribute namespace="##any" processContents="lax" />
        </xs:complexType>
    </xs:element>    
</xs:schema>
<!-- End -->
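
Note (non-normative): The date_type and Custom simple types in the schema above constrain only the shape of a value. The pattern for date_type, for example, cannot reject a date such as February 30. The following Python sketch shows one way a tool might apply a stricter check to date values and test user-defined values before writing them. The function names parse_tmx_date, format_tmx_date and is_custom_value are hypothetical and are not defined by this specification.

```python
import re
from datetime import datetime, timezone

# Shape only: YYYYMMDDThhmmssZ with a year between 1000 and 2999,
# as implied by the leading [1-2] in the schema pattern.
_DATE_SHAPE = re.compile(r"[12][0-9]{7}T[0-9]{6}Z")

# In XML Schema patterns, \s covers space, tab, newline and carriage return.
_CUSTOM = re.compile(r"x-[^ \t\n\r]+")

_DATE_FORMAT = "%Y%m%dT%H%M%SZ"


def parse_tmx_date(value):
    """Return an aware UTC datetime, or None if value is not a real date."""
    if not _DATE_SHAPE.fullmatch(value):
        return None
    try:
        parsed = datetime.strptime(value, _DATE_FORMAT)
    except ValueError:
        # Right shape, but not a real calendar date or time of day.
        return None
    return parsed.replace(tzinfo=timezone.utc)


def format_tmx_date(moment):
    """Render an aware datetime in the YYYYMMDDThhmmssZ form."""
    return moment.astimezone(timezone.utc).strftime(_DATE_FORMAT)


def is_custom_value(value):
    """Check a user-defined value: 'x-' followed by non-space characters."""
    return _CUSTOM.fullmatch(value) is not None


if __name__ == "__main__":
    print(parse_tmx_date("20090309T143000Z"))
    print(parse_tmx_date("20090230T000000Z"))
    print(is_custom_value("x-myformat"))
```

The shape test runs before the calendar check because strptime alone tolerates some inputs with fewer digits than the fixed-width form requires.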


C. Glossary

OSCAR

OSCAR stands for Open Standards for Container/Content Allowing Re-use. OSCAR is a special interest group of LISA.

UTC

UTC stands for Coordinated Universal Time.

XML

XML stands for Extensible Markup Language. XML is a simplified and restricted subset of Standard Generalized Markup Language (SGML).

XML Schema

A description of a type of XML document, typically expressed in terms of constraints on the structure and content of documents of that type, above and beyond the basic syntax constraints imposed by XML itself. An XML schema provides a view of the document type at a relatively high level of abstraction.


D. References

Normative
[IANA Charsets]

IANA Names for Character Sets. IANA (Internet Assigned Numbers Authority), August 2001.

[MIME Media Types]

IANA MIME Media Types. IANA (Internet Assigned Numbers Authority), 2007.

[ISO 8601]

Representation of dates and times. ISO (International Organization for Standardization), December 2000.

[RFC 2046]

Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types. IETF (Internet Engineering Task Force), November 1996.

[RFC 4646]

Tags for Identifying Languages. IETF (Internet Engineering Task Force), September 2006. This document, in combination with RFC 4647, replaces RFC 3066, which replaced RFC 1766.

[SRX 2.0]

Segmentation Rules Exchange (SRX) is an XML-based standard for description of the ways in which translation and other language-processing tools segment text for processing.

[XML 1.0]

Extensible Markup Language (XML) 1.0 Second Edition. W3C (World Wide Web Consortium), October 2000.

[XML Namespaces]

Namespaces in XML. W3C (World Wide Web Consortium), August 2006.


Non-Normative
[ISO]

International Organization for Standardization Web site.

[LISA]

Localisation Industry Standards Association Web site.

[Unicode]

Unicode Consortium Web site.

[W3C]

World Wide Web Consortium Web site.