TMX 2.0 Specification Draft
OSCAR Public Committee Draft - 2009 March 9
This draft of TMX 2.0 is released for public comment. Public
feedback is encouraged and should be sent to Arle R. Lommel
<arle@lisa.org> for consideration by April 10, 2009.
Editors:
Rodolfo M. Raya <rmraya@maxprograms.com>, Arle R. Lommel <arle@lisa.org>
Previous Editors:
Yves Savourel, Alan K. Melby
Copyright © The Localisation Industry Standards Association [LISA] 1997-2009. All Rights Reserved.
This document and translations of it may be copied and furnished to others, and
derivative works that comment on or otherwise explain it or assist in its implementation
may be prepared, copied, published and distributed, in whole or in part, without
restriction of any kind, provided that the above copyright notice and this paragraph are
included on all such copies and derivative works. However, this document itself may not
be modified in any way, such as by removing the copyright notice or references to LISA.
The limited permissions granted above are perpetual and will not be revoked by LISA or
its successors or assigns.
This document and the information contained herein is provided on an "AS IS" basis and
LISA DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY
WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY
IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Abstract
This document defines version 2.0 of the Translation Memory eXchange format (TMX). The
purpose of the TMX format is to provide a standard method to describe translation memory
data that is being exchanged among tools and/or translation vendors, while introducing
little or no loss of critical data during the process.
Status of this Document
This document constitutes an initial draft for discussion. Comments may be sent to
tmx2@lisa.org.
Table of Contents
Abstract
1. Introduction
1.1. XML Conformance
1.2. Character Encoding
1.3 Extensibility
1.3.1. Extension Points
2. General Structure
2.1. Header
2.2. Body
3. Detailed Specifications
3.1. Elements
3.1.1. Structural Elements
3.1.2. Inline Elements
3.2. Attributes
3.2.1. TMX Attributes
3.2.2. XML Namespace Attributes
4. Content Markup
4.1. Overview
4.2. Representing Inline Elements
4.2.1. When the Content of Tags Is Available in the Translation Memory
4.2.2. When the Content of Tags Is Unavailable in the Translation Memory
5. TMX Compliance
5.1 Validation of TMX Files
6. Changes Since Previous Version (Non-Normative)
6.1 Upgrading TMX Files
Appendices
A. Sample Document
B. XML Schema for TMX
C. Glossary
D. References
Normative
Non-Normative
1. Introduction
TMX is defined in two parts:
A specification of the format of the container (the higher-level elements
that provide information about the file as a whole and about entries). In TMX,
an entry consisting of aligned segments of text in two or more languages is
called a Translation Unit (the <tu> element).
A specification of a low-level meta-markup format for the content of a
segment of translation-memory text. In TMX, an individual segment of
translation-memory text in a particular language is denoted by a <seg> element. See the section on Content Markup for more details.
1.1. XML Conformance
TMX is XML-conformant. The TMX vocabulary is defined using an XML Schema (see Appendix B). It also uses various third-party standards for
date/time and language codes. See the References
section for more details.
TMX files are intended to be created automatically by export routines and processed
automatically by import routines. TMX files are well-formed XML
documents that can be processed without explicit reference to the TMX Schema. However, a
valid TMX file must conform to the TMX Schema, and any TMX file
about which there are concerns should be verified against the TMX Schema using a
validating XML parser.
Since XML syntax is case sensitive, any XML application must define casing
conventions. All element and attribute names in TMX are defined in
lowercase.
The namespace URI for TMX 2.0 is defined as "http://www.lisa.org/tmx20". For example,
TMX used in another (non-TMX) XML document would appear something like this:
<?xml version="1.0" encoding="utf-8"?>
<myformat xmlns:tmx="http://www.lisa.org/tmx20">
<data>
<tmx:tmx version="2.0">
<tmx:header ...
... TMX data ...
</tmx:body>
</tmx:tmx>
</data>
</myformat>
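As a rough illustration of how a consumer might locate TMX content embedded in another vocabulary, the following Python sketch searches a host document by namespace URI rather than by prefix. The host element names are the placeholders from the example above; the function name and the empty header and body are assumptions made for brevity, not part of this specification.

```python
import xml.etree.ElementTree as ET

TMX_NS = "http://www.lisa.org/tmx20"  # namespace URI given in this section

# Hypothetical host document. The header's required attributes are omitted
# for brevity, so this fragment would not validate against the TMX Schema.
HOST_DOC = b"""<?xml version="1.0" encoding="utf-8"?>
<myformat xmlns:tmx="http://www.lisa.org/tmx20">
  <data>
    <tmx:tmx version="2.0">
      <tmx:header/>
      <tmx:body/>
    </tmx:tmx>
  </data>
</myformat>"""


def find_embedded_tmx(root):
    """Return every <tmx> element in the TMX 2.0 namespace below root."""
    return root.findall(f".//{{{TMX_NS}}}tmx")


root = ET.fromstring(HOST_DOC)
embedded = find_embedded_tmx(root)
print(len(embedded), embedded[0].get("version"))
```

Matching on the namespace URI means the lookup keeps working if the host document binds the TMX namespace to a prefix other than "tmx".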
1.2. Character Encoding
TMX files are always in Unicode. They can use any of three encoding methods: UTF-16
(16-bit files), UTF-8 (8-bit files) or ISO-646 [a.k.a. US-ASCII] (7-bit files).
In all non 7-bit cases, unlike in HTML, only the following five character entity
references are allowed: &amp; (&), &lt; (<), &gt; (>), &apos; ('), and &quot; ("). For 7-bit files, extended
(non-ASCII) characters are always represented by numeric character references. For
example: &#x0396; or &#918; for a GREEK CAPITAL LETTER ZETA.
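A minimal Python sketch of the 7-bit case follows. It relies on the standard "xmlcharrefreplace" error handler, which emits decimal numeric character references; the function name is an assumption, and the sketch does not escape markup characters such as & and <, which a real exporter would handle separately.

```python
def to_seven_bit(text: str) -> bytes:
    """Encode text as US-ASCII, replacing each non-ASCII character
    with a decimal numeric character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace")


# U+0396 is written with an escape so that this source stays 7-bit too.
print(to_seven_bit("\u0396eta"))
```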
Note that proper UTF-16 files always start with the Unicode byte-order mark (BOM),
U+FEFF, which is stored as the byte sequence FE FF in “big-endian” files and FF FE in
“little-endian” files. UTF-8 files may (but are not required to) begin with
the UTF-8 BOM (EF BB BF).
Since all XML processors must accept the UTF-8 and UTF-16 encodings and since US-ASCII
is a subset of UTF-8, a TMX document can omit the encoding declaration in the XML
declaration, although its inclusion is recommended. Note, however, that for accurate
character set detection, UTF-16 files must begin with the BOM. Applications that support
TMX must be able to read files stored in all three of these
encodings (including both UTF-16 byte orders) regardless of the encoding the tools use
internally. They must also correctly interpret UTF-8 files beginning with the optional
BOM.
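The reading requirements above can be sketched in Python as a BOM check followed by a decode. This is only an illustration under stated assumptions: the function names are hypothetical, and the sketch ignores the encoding declaration in the XML declaration, which a full XML parser would also consult.

```python
import codecs


def sniff_tmx_encoding(data: bytes) -> str:
    """Pick a Python codec name from the leading bytes of a TMX file."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # UTF-8 with the optional BOM
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"     # this codec reads the BOM to choose the byte order
    return "utf-8"          # also covers US-ASCII, a subset of UTF-8


def decode_tmx(data: bytes) -> str:
    """Decode raw TMX bytes to text, dropping any BOM."""
    return data.decode(sniff_tmx_encoding(data))


print(decode_tmx(codecs.BOM_UTF16_BE + "<tmx/>".encode("utf-16-be")))
```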
In addition, if the source database or application generating a TMX file uses
character codes in the Private Use Area (PUA) of Unicode (code points U+E000–U+F8FF) it
must convert those code points to their corresponding numeric character
references in TMX files. For example, if a source document uses the “fft” ligature found
in certain Adobe OpenType fonts at code point U+E097 in the Private Use Area, the
corresponding TMX document would represent this character as &#xE097;. This
process is required since many text-processing tools do not support the PUA. Inclusion
of such character references in TMX files may necessitate additional negotiation between
the creator and receiver of the file if such code points are to be properly interpreted.
Such negotiations are outside the scope of the TMX standard and use of the PUA is
discouraged when possible.
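One way an exporter might perform this conversion is sketched below in Python. The function name is an assumption; the range is the one stated above (U+E000 to U+F8FF), and the output uses hexadecimal numeric character references.

```python
import re

# Private Use Area of the Basic Multilingual Plane, as given above.
_PUA = re.compile("[\ue000-\uf8ff]")


def escape_private_use(text: str) -> str:
    """Replace each PUA character with a hexadecimal character reference."""
    return _PUA.sub(lambda m: f"&#x{ord(m.group()):X};", text)


print(escape_private_use("o\ue097"))
```

The substitution has to run after ordinary XML escaping of the text, otherwise the ampersand of each generated reference would itself be escaped.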
1.3 Extensibility
TMX provides a mechanism for the exchange of translation memory data, not
application-specific features or data. Transferring data alone may not transfer the
knowledge of how to process data. As a result, although TMX provides a rich set of
elements for exchanging Translation Memory data, sometimes it may be necessary to extend
TMX vocabulary using XML Namespaces in order to support
functions needed for specific tasks.
It is possible to add non-TMX elements, as well as attributes and attribute values, to
any TMX document. All foreign elements and attributes added to a TMX file must be
defined using an XML Schema. All XML Schemas declared in a TMX document must be made
available to permit validation of the foreign constructs included in the file.
Although TMX offers this extensibility mechanism, in order to avoid difficulty in
processing and increase interoperability between tools, it is strongly recommended to
use TMX capabilities whenever possible, rather than to create non-standard user-defined
elements or attributes.
Applications that depend on the TMX format for exchanging Translation Memory data are
not required to understand or support non-TMX elements or attributes. A TMX application
can safely ignore foreign elements or attributes present in a TMX document.
1.3.1. Extension Points
TMX supports the use of foreign XML elements in the following elements: <body>, <header>, <internal-file>, <tu>
and <tuv>.
Foreign attributes can be added to any TMX element, provided that the attribute name
is fully qualified with the corresponding namespace prefix.
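The following Python sketch shows one way an importer could separate TMX content from foreign content so that the latter can be ignored. It assumes the TMX elements are in the TMX 2.0 namespace; the extension namespace "http://example.com/my-ext", its "my" prefix, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

TMX_NS = "http://www.lisa.org/tmx20"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical <tu> carrying one foreign attribute and one foreign element.
SAMPLE = """<tu xmlns="http://www.lisa.org/tmx20"
    xmlns:my="http://example.com/my-ext" tuid="t1" my:status="reviewed">
  <tuv xml:lang="en"><seg>Hello</seg></tuv>
  <tuv xml:lang="fr"><seg>Bonjour</seg></tuv>
  <my:audit user="jane"/>
</tu>"""


def split_children(parent):
    """Return (TMX children, foreign children) of an element."""
    tmx_children, foreign = [], []
    for child in parent:
        target = tmx_children if child.tag.startswith(f"{{{TMX_NS}}}") else foreign
        target.append(child)
    return tmx_children, foreign


def foreign_attributes(elem):
    """Return namespace-qualified attributes other than xml:lang and xml:space."""
    return {
        name: value
        for name, value in elem.attrib.items()
        if name.startswith("{") and not name.startswith(f"{{{XML_NS}}}")
    }


tu = ET.fromstring(SAMPLE)
tmx_children, foreign = split_children(tu)
print(len(tmx_children), len(foreign), foreign_attributes(tu))
```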
2. General Structure
A TMX document is enclosed in a <tmx> root
element. The <tmx> element contains two elements:
<header> and <body>.
2.1. Header
The <header> contains meta-data about the
document. In addition to its attributes, <header> can also store document-level information in <note> elements. Any SRX
2.0-format representation of segmentation rules used to generate a TMX file must be
included in the <header> using a <segmentation> element. The <segmentation> element does not need to be
used if no such rules are available in SRX 2.0 format or if no segmentation rules apply
(e.g., because the source file was pre-segmented).
2.2. Body
The <body> contains the collection of
translation units (the <tu> elements). This
collection is in no specific order.
Each <tu> element contains at least one
translation unit variant (the <tuv> element).
Each <tuv> contains the segment and the
information pertaining to that segment for a given language. (Note that if fewer than
two <tuv> elements appear in a <tu> element, that <tu> element is
considered to be incomplete. Incomplete <tu> elements may be needed for some
applications, although they would not generally be useful for translation memory (TM)
applications.)
The text itself is stored in the <seg> element,
while <note> allows for storage of additional
information specific to each <tuv>.
A segment can contain markup content elements: The <itag> element allows for the location of native inline codes
to be indicated, along with their relationship to each other (e.g., paired tags). It
also provides the optional capability to encapsulate native inline codes. The
<hi> element allows for the addition of extra
markup not related to existing inline codes. And the <sub> element, used inside encapsulated inline code, allows
for the delimitation of translatable text within markup (e.g., the value of an HTML
alt attribute).
See the Sample Document section for an example of a TMX
document.
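To make the nesting concrete, the Python sketch below builds a small document with the structure described in this section and checks whether a translation unit is incomplete. All attribute values are placeholders, the helper names are assumptions, and the elements are created without a namespace to keep the sketch short.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_minimal_tmx():
    """Build <tmx> with one <header> and a <body> holding one <tu>."""
    tmx = ET.Element("tmx", {"version": "2.0"})
    ET.SubElement(tmx, "header", {
        "creationtool": "ExampleTool",   # placeholder values throughout
        "creationtoolversion": "1.0",
        "segtype": "sentence",
        "o-tmf": "ExampleTM",
        "adminlang": "en",
        "srclang": "en",
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    tu = ET.SubElement(body, "tu", {"tuid": "tu1"})
    for lang, text in (("en", "Press the red button."),
                       ("hu", "Nyomja a piros gombot.")):
        tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
        ET.SubElement(tuv, "seg").text = text
    return tmx


def is_incomplete(tu):
    """A <tu> with fewer than two <tuv> children is considered incomplete."""
    return len(tu.findall("tuv")) < 2


tmx = build_minimal_tmx()
print(ET.tostring(tmx, encoding="unicode"))
```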
3. Detailed Specifications
3.1. Elements
TMX elements are divided into two main categories: the structural elements (the
container), and the inline elements (the content markup).
Structural elements
<body>, <context>, <external-file>, <header>, <internal-file>, <note>, <seg>,
<segmentation>, <tmx>, <tu>, <tuv>.
Inline elements
<hi>, <itag>, <sub>.
3.1.1. Structural Elements
The structural elements are the following:
<body>
Body - The <body> element encloses the main data,
the set of <tu> elements that are comprised within
the file.
Required attributes:
None.
Optional attributes:
None.
Contents:
Zero, one or more <tu> elements and
Zero, one or more non-TMX elements, in any order.
<context>
Context Information - The <context> element
describes the context of a <tu>. The purpose of
this context information is to allow certain pieces of text to have different
translations depending on where they came from. The translation of a piece of text may
differ if it is a web form or a dialog or an Oracle form or a Lotus form for example.
This information is thus required by a translator when working on the file. Likewise,
the information may be used by any tool proposing to automatically leverage the text
successfully. Note that the local context (i.e., text that surrounds a given <tu> element) is indicated using the group and g-order attributes.
Required attributes:
context-type.
Optional attributes:
None.
Contents:
Text. Suggested values taken from the TBX Basic specification for use in a
software localization environment include, but are not limited to, Menu item, Dialog
box, Group box, Text box, Combo box, Combo box element, Check box, Tab, Push button,
Radio button, Spin box, Progress bar, Slider, Informative message, Interactive message,
ToolTip, Table text, and User-defined type.
<external-file>
External file - The <external-file> element
specifies the location of the actual SRX file being referenced.
The required href attribute provides a URL to the file. The
crc attribute accepts a value that can be used to assure
the integrity of the file. The optional uid attribute allows a
unique ID to be assigned to the file.
Required attributes:
href.
Optional attributes:
crc, uid.
Contents:
Empty.
<header>
File header - The <header> element contains
information pertaining to the whole document.
Required attributes:
creationtool, creationtoolversion, segtype, o-tmf, adminlang, srclang, datatype.
Optional attributes:
o-encoding, creationdate, creationid, changedate, changeid.
Contents:
Zero, one or more <note> elements, followed by
Zero or one <segmentation>
element, followed by Zero, one or more non-TMX elements.
<internal-file>
Internal file - The <internal-file> element
contains the actual SRX file with the segmentation rules used
when generating the TMX document.
Required attributes:
None.
Optional attributes:
None.
Contents:
One SRX file embedded using SRX namespace.
<note>
Note - The <note> element is used for comments.
Required attributes:
None.
Optional attributes:
creationdate, creationid, changedate, changeid, o-encoding, xml:lang.
Contents:
Text.
<seg>
Segment - The <seg> element contains the text of
the given segment. There is no length limitation to the content of a <seg>
element. If the optional xml:space attribute is set to "preserve", all spacing and
line-breaking characters are significant within a <seg> element.
Required attributes:
None.
Optional attributes:
xml:space.
Contents:
Text data, Zero, one or more of the following elements: <hi>, and <itag>. They can be in any order.
<segmentation>
Segmentation - The <segmentation> element points to
or contains the SRX segmentation rules that were used in the
generation of the TMX file.
Required attributes:
None.
Optional attributes:
None.
Contents:
Either exactly one <internal-file> or
one <external-file> element.
<tmx>
TMX document - The <tmx> element encloses all the
other elements of the document.
Required attributes:
version.
Contents:
One <header> followed by One
<body> element.
<tu>
Translation unit - The <tu> element contains the
data for a given translation unit.
Required attributes:
None.
Optional attributes:
tuid, o-encoding, datatype, usagecount, lastusagedate, creationtool, creationtoolversion,
creationdate, creationid, changedate, segtype, changeid, o-tmf, srclang, group,
g-order.
Contents:
Zero, one or more <note> or <context> elements in any order, followed
by One or more <tuv> elements,
followed by Zero, one or more non-TMX elements.
<tuv>
Translation Unit Variant - The <tuv> element
specifies text in a given language.
Required attributes:
xml:lang.
Optional attributes:
o-encoding, datatype,
usagecount, lastusagedate, creationtool, creationtoolversion, creationdate, creationid, changedate, changeid, o-tmf, xml:space.
Contents:
Zero, one or more <note> elements, followed
by One <seg> element, followed by
Zero, one or more non-TMX elements.
3.1.2. Inline Elements
The inline elements are the elements that can appear inside a segment. See also the
Content Markup section for more
information.
The inline elements are the following:
<hi>
Highlight - The <hi> element delimits a section of
text that has special meaning, such as a terminological unit, a proper name, an item
that should not be modified, etc. It can be used for various processing tasks such as
indicating to a Machine Translation tool proper names that should not be translated, for
terminology verification, or to mark suspect expressions after a grammar checking.
Required attributes:
type.
Optional attributes:
x, comment.
Contents:
Text data, Zero, one or more <itag> elements
<itag>
internal tag - The <itag> element is used to
indicate the position of native internal markup used in segments. This element replaces
the now-deprecated <bpt>, <ept>, <it>, <ph>, and <ut>
elements. The <itag> element can also encapsulate application file format markup
if this information is stored in a translation memory application (see the section on
content markup below), and must do so when
the creating application stores this information. If this information is not stored in
an application, this element appears as an empty XML element.
Required attributes:
type, x
Optional attributes:
pos, assoc, equiv-text.
Contents:
(May be empty), Code data, One or more <sub> elements.
<sub>
Sub-flow - The <sub> element is used to delimit
sub-flow text inside a sequence of native code, for example: the definition of a
footnote or the text of title in a HTML anchor element. The <sub> element
may only be used within <itag> elements.
Here are some examples. In each TMX version the sub-flow text is enclosed in a <sub> element:
Footnote in RTF:
Original RTF:
Elephants{\cs16\super \chftn {\footnote \pard\plain \s15\widctlpar \f4\fs20 {\cs16\super \chftn } An elephant is a very large animal.}} are big.
TMX with content mark-up:
Elephants<itag type="fnote" x="1">{\cs16\super \chftn {\footnote \pard\plain \s15\widctlpar \f4\fs20 {\cs16\super \chftn } <sub type="fnote">An elephant is a very large animal.</sub>}}</itag> are big.
TMX without content mark-up:
Elephants<itag type="fnote" x="1"><sub type="fnote">An elephant is a very large animal.</sub></itag> are big.
Index marker in RTF:
Original RTF:
Elephants{\pard\plain \widctlpar \v\f4\fs20 {\xe {Big animal\bxe }}} are big.
TMX with content mark-up:
Elephants<itag type="index" x="1">{\pard\plain \widctlpar \v\f4\fs20 {\xe {<sub type="index">Big animal</sub>\bxe }}}</itag> are big.
TMX without content mark-up:
Elephants<itag type="index" x="1"><sub type="index">Big animal</sub></itag> are big.
Text of an attribute in an HTML element:
Original HTML:
See the <a title="Go to Notes" href="notes.htm">Notes</a> for more details.
TMX with content mark-up:
See the <itag x="1" pos="start" type="link">&lt;a title="<sub type="link">Go to Notes</sub>" href="notes.htm"&gt;</itag>Notes<itag x="1" pos="end">&lt;/a&gt;</itag> for more details.
Note that sub-flows are related to segmentation and can cause interoperability issues
when one tool uses sub-flow within its main segment, while another extracts the sub-flow
text as an independent segment. Resolving these differences is beyond the scope of TMX
and users may expect some loss of leverage in cases involving sub-flow, although tool
developers may implement processes to minimize data loss caused by this issue.
Required attributes:
type.
Optional attributes:
datatype.
Contents:
Text data, Zero, one or more <itag> elements
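As an illustration of how a tool might treat sub-flow, the Python sketch below parses the RTF footnote example and returns the main segment text and the sub-flow text separately. The helper names are assumptions, and extracting sub-flow as independent text is only one of the two tool behaviours mentioned above.

```python
import xml.etree.ElementTree as ET

# The footnote example from this section, wrapped in a <seg> element.
SEG = (
    r'<seg>Elephants<itag type="fnote" x="1">{\cs16\super \chftn '
    r'{\footnote \pard\plain \s15\widctlpar \f4\fs20 {\cs16\super \chftn } '
    r'<sub type="fnote">An elephant is a very large animal.</sub>}}</itag>'
    r' are big.</seg>'
)


def main_text(elem):
    """Text of elem with every <itag> (native code and sub-flow) skipped."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "hi":            # <hi> wraps ordinary text, so keep it
            parts.append(main_text(child))
        parts.append(child.tail or "")
    return "".join(parts)


def sub_flows(seg):
    """Text of each <sub> element, in document order."""
    return [main_text(sub) for sub in seg.iter("sub")]


seg = ET.fromstring(SEG)
print(main_text(seg))
print(sub_flows(seg))
```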
3.2. Attributes
This section lists the various attributes used in the TMX elements.
TMX attributes
adminlang, assoc, changedate, changeid, comment, context-type, creationdate, creationid, creationtool, creationtoolversion, crc, datatype, equiv-text, group, g-order, href, lastusagedate, o-encoding, o-tmf, pos, segtype, srclang, tuid, type, uid, usagecount, version, x.
XML namespace attributes
xml:lang, xml:space
3.2.1. TMX Attributes
adminlang
Administrative language - Specifies the default language for the
administrative and informative element <note>.
Value description:
A language code as described in [RFC 4646]. Unlike
the other TMX attributes, the values for adminlang are not case-sensitive.
Default value:
Undefined.
Used in:
<header>.
assoc
Association - Indicates whether an <itag> is associated with the text before it, the text after it, or both.
Value description:
"p" (the element is associated with the text preceding the element), "f" (the element
is associated with the text following the element), or "b" (the element is associated
with the text on both sides). Note: The assoc attribute should not be confused with the
x attribute, which is used to indicate pairing of <itag> elements within a segment and their correlation to
corresponding markup elements used in other <tuv>
elements within a single <tu> element.
Default value:
Undefined.
Used in:
<itag>.
changedate
Change date - Specifies the date of the last modification of the
element.
Value description:
Date in [ISO 8601] format. The recommended pattern to use is YYYYMMDDThhmmssZ,
where YYYY is the year (4 digits), MM is the month (2 digits), DD is the day
(2 digits), hh is the hours (2 digits), mm is the minutes (2 digits), ss is the
seconds (2 digits), and Z indicates that the time is UTC.
For example, date="20020125T210600Z" is:
January 25, 2002 at 9:06pm GMT
January 25, 2002 at 2:06pm US Mountain Time
January 26, 2002 at 6:06am Japan time
Default value:
Undefined.
Used in:
<header>, <tu>, <tuv>.
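The recommended pattern maps directly onto a strptime format string. The Python sketch below parses and formats such values and reproduces the three local times of the example; the function names are assumptions, and fixed UTC offsets stand in for US Mountain Time (UTC-7 in January) and Japan time (UTC+9).

```python
from datetime import datetime, timedelta, timezone

TMX_DATE_FORMAT = "%Y%m%dT%H%M%SZ"  # YYYYMMDDThhmmssZ


def parse_tmx_date(value: str) -> datetime:
    """Parse a TMX date value into a timezone-aware UTC datetime."""
    return datetime.strptime(value, TMX_DATE_FORMAT).replace(tzinfo=timezone.utc)


def format_tmx_date(moment: datetime) -> str:
    """Format an aware datetime in the recommended pattern, in UTC."""
    return moment.astimezone(timezone.utc).strftime(TMX_DATE_FORMAT)


stamp = parse_tmx_date("20020125T210600Z")
mountain = stamp.astimezone(timezone(timedelta(hours=-7)))
japan = stamp.astimezone(timezone(timedelta(hours=9)))
print(stamp.isoformat(), mountain.isoformat(), japan.isoformat())
```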
changeid
Change identifier - Specifies the identifier of the user who
modified the element last.
Value description:
Text.
Default value:
Undefined.
Used in:
<header>, <tu>, <tuv>.
comment
Comment - A comment in a tag.
Value description:
Text.
Default value:
Undefined.
Used in:
<hi>.
context-type
Context type - The context-type attribute specifies the context
and the type of resource or style of the data of a given element. For example, to define
if it is a label, or a menu item in the case of resource-type data, or the style in the
case of document-related data.
Value description:
Text without spaces. Pre-defined values are as follows:
database | Indicates database content.
element | Indicates the content of an element within an XML document.
elementtitle | Indicates the name of an element within an XML document.
linenumber | Indicates the line number from the sourcefile (see context-type="sourcefile") where the source text is found.
numparams | Indicates the number of parameters contained within the source text.
paramnotes | Indicates notes pertaining to the parameters in the source text.
record | Indicates the content of a record within a database.
recordtitle | Indicates the name of a record within a database.
sourcefile | Indicates the original source file from which the TMX file is created.
In addition, user-defined values can be used with this attribute. A user-defined value
must start with an "x-" prefix.
Default value:
Undefined.
Used in:
<context>.
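A possible check for context-type values is sketched below in Python. It reads the rules above as: no white space, and either one of the pre-defined values or a user-defined value starting with "x-". The function and constant names are assumptions.

```python
PREDEFINED_CONTEXT_TYPES = frozenset({
    "database", "element", "elementtitle", "linenumber", "numparams",
    "paramnotes", "record", "recordtitle", "sourcefile",
})


def is_valid_context_type(value: str) -> bool:
    """Accept pre-defined values and user-defined values prefixed with x-."""
    if not value or any(ch.isspace() for ch in value):
        return False
    return value in PREDEFINED_CONTEXT_TYPES or value.startswith("x-")


print(is_valid_context_type("sourcefile"), is_valid_context_type("x-screen-id"))
```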
creationdate
Creation date - Specifies the date of creation of the element.
Value description:
Date in [ISO 8601] format. The recommended pattern to use is YYYYMMDDThhmmssZ,
where YYYY is the year (4 digits), MM is the month (2 digits), DD is the day
(2 digits), hh is the hours (2 digits), mm is the minutes (2 digits), ss is the
seconds (2 digits), and Z indicates that the time is UTC.
For example, date="20020125T210600Z" is:
January 25, 2002 at 9:06pm GMT
January 25, 2002 at 2:06pm US Mountain Time
January 26, 2002 at 6:06am Japan time
Default value:
Undefined.
Used in:
<header>, <tu>, <tuv>.
creationid
Creation identifier - Specifies the identifier of the user who
created the element.
Value description:
Text.
Default value:
Undefined.
Used in:
<header>, <tu>, <tuv>.
creationtool
Creation tool - Identifies the tool that created the TMX
document. Its possible values are not specified by the standard but each tool provider
should publish the string identifier it uses.
Value description:
Text.
Default value:
Undefined.
Used in:
<header>, <tu>, <tuv>.
creationtoolversion
Creation tool version - Identifies the version of the tool that
created the TMX document. Its possible values are not specified by the standard but each
tool provider should publish the string identifier it uses.
Value description:
Text.
Default value:
Undefined.
Used in:
<header>, <tu>, <tuv>.
crc
Cyclic redundancy checking - A private value used to verify data
as it is returned to the producer. The generation and verification of this number is
tool-specific.
Value description:
Number (possibly not decimal).
Default value:
Undefined.
Used in:
<external-file>.
datatype
Data type - Specifies the type of data contained in the element.
Different processes may be applied to the data depending on the value of the datatype
attribute.
Value description:
Text.
It is highly recommended that developers use official MIME types (as defined in IETF RFC 2046 and registered in the IANA list of MIME types) where possible as datatype
values. In addition, the values provided in the following table for datatype may be used
for compatibility with the XLIFF specification or for localization-specific formats that
lack official, distinctive MIME type values (or for which the applicable MIME types are
insufficiently specific). Note that, in some instances, more general MIME types that may
apply are not provided in this table.
datatype value (XLIFF compatible) | Description | Equivalent MIME type
unknown | undefined (default) | -
alptext | WinJoust data. | -
cdf | Channel Definition Format. | -
cmx | Corel CMX Format. | -
cpp | C and C++ style text. | -
dita | Darwin Information Typing Architecture (DITA) | -
hptag | HP-Tag. | -
html | HTML, DHTML, etc. | text/html
interleaf | Interleaf documents. | -
ipf | IPF/BookMaster. | -
java | Java, source and property files. | application/java
javascript | JavaScript, ECMAScript scripts. | application/x-javascript
lisp | Lisp. | application/x-lisp
mif | Framemaker MIF, MML, etc. | application/x-frame
opendocument | Open Document file. | (There are a variety of MIME types for Open Document files, depending on the exact type of Open Document file. Open Document MIME types begin with application/vnd.oasis.opendocument.)
opentag | OpenTag data. | -
pascal | Pascal, Delphi style text. | text/pascal
plaintext | Plain text. | text/plain
pm | PageMaker. | application/x-pagemaker
resx | Windows .NET resources. | -
rtf | Rich Text Format. | application/rtf
sgml | SGML. | text/sgml
stf-f | S-Tagger for FrameMaker. | -
stf-i | S-Tagger for Interleaf. | -
transit | Transit data. | -
vbscript | Visual Basic scripts. | -
winres | Windows resources from RC, DLL, EXE. | -
xliff | XLIFF (XML Localization Interchange File Format). | -
xml | XML. | text/xml
xptag | Quark XPressTag. | -
Used in:
<header>, <tu>, <tuv>, <sub>.
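A tool that needs a MIME type for a datatype value could use a simple lookup built from the table above, as in this Python sketch. Only the rows that list a single equivalent MIME type are included; the function name is an assumption, and values containing "/" are passed through on the assumption that they are already MIME types.

```python
# Rows of the datatype table that list one equivalent MIME type.
DATATYPE_TO_MIME = {
    "html": "text/html",
    "java": "application/java",
    "javascript": "application/x-javascript",
    "lisp": "application/x-lisp",
    "mif": "application/x-frame",
    "pascal": "text/pascal",
    "plaintext": "text/plain",
    "pm": "application/x-pagemaker",
    "rtf": "application/rtf",
    "sgml": "text/sgml",
    "xml": "text/xml",
}


def mime_for_datatype(value: str):
    """Return a MIME type for a datatype value, or None if none is listed."""
    if "/" in value:  # assumed to be a MIME type already
        return value
    return DATATYPE_TO_MIME.get(value)


print(mime_for_datatype("rtf"), mime_for_datatype("winres"))
```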
equiv-text
Equivalent text - Indicates the equivalent text to substitute in
place of an inline tag.
The following example shows use of the attribute to specify that an HTML <br />
tag is to be interpreted as a linefeed character, using both full and empty
<itag> elements.
Version 1, with content:
<itag x="1" equiv-text="linefeed character">&lt;br /&gt;</itag>
Version 2, empty element:
<itag x="1" equiv-text="linefeed character" />
Value description:
Text.
Used in:
<itag>.
group
Group identifier - indicates that a given <tu> element belongs to a logical group of related translation
units.
Value description:
Text without spaces.
Used in:
<tu>
g-order
Group order - defines the order of the <tu> within a given logical group. Used together with the group attribute.
In the following portion of a TMX file the group attribute
shows that the three <tu> elements are part of a logical group in
the source document. The g-order attribute shows in which
order they occurred within that logical grouping.
<tu group="p0001" g-order="1" datatype="plaintext">
  <tuv xml:lang="hu">
    <seg>Nyomja a piros gombot.</seg>
  </tuv>
  <tuv xml:lang="en">
    <seg>Press the red button.</seg>
  </tuv>
</tu>
<tu group="p0001" g-order="2" datatype="plaintext">
  <tuv xml:lang="hu">
    <seg>Az inditás után, nézze meg, hogy elég az olajnyomás.</seg>
  </tuv>
  <tuv xml:lang="en">
    <seg>After starting it, make sure that the oil pressure is sufficient.</seg>
  </tuv>
</tu>
<tu group="p0001" g-order="3" datatype="plaintext">
  <tuv xml:lang="hu">
    <seg>Ha nem elég, a gépet ki kell kapcsolni.</seg>
  </tuv>
  <tuv xml:lang="en">
    <seg>If it is not sufficient, you must shut the machine down.</seg>
  </tuv>
</tu>
Value description:
Number starting at 1 and incremented in steps of 1 unit. Must be unique within each
logical group defined with the group attribute. Its initial
value is reset to 1 in each logical group.
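Because the body itself is unordered, a consumer that wants the original sequence has to rebuild it from group and g-order. The Python sketch below does so and rejects groups whose g-order values are not exactly 1, 2, 3 and so on, which is one reading of the value description above. The function name is an assumption, every grouped <tu> is assumed to carry g-order, and the sample uses single-variant units for brevity.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Units deliberately listed out of order.
BODY = """<body>
  <tu group="p0001" g-order="2"><tuv xml:lang="en"><seg>After starting it, make sure that the oil pressure is sufficient.</seg></tuv></tu>
  <tu group="p0001" g-order="1"><tuv xml:lang="en"><seg>Press the red button.</seg></tuv></tu>
  <tu group="p0001" g-order="3"><tuv xml:lang="en"><seg>If it is not sufficient, you must shut the machine down.</seg></tuv></tu>
</body>"""


def ordered_groups(body):
    """Map each group name to its <tu> elements sorted by g-order."""
    members = defaultdict(list)
    for tu in body.iter("tu"):
        group = tu.get("group")
        if group is not None:
            members[group].append((int(tu.get("g-order")), tu))
    result = {}
    for name, pairs in members.items():
        pairs.sort(key=lambda pair: pair[0])
        orders = [order for order, _ in pairs]
        if orders != list(range(1, len(orders) + 1)):
            raise ValueError(f"group {name!r} has g-order values {orders}")
        result[name] = [tu for _, tu in pairs]
    return result


groups = ordered_groups(ET.fromstring(BODY))
for tu in groups["p0001"]:
    print(tu.get("g-order"), tu.findtext("tuv/seg"))
```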
href
Hypertext reference - The "href" attribute contains a valid URL
that describes the location of a file.
Value description:
Text.
Default value:
Undefined.
Used in:
<external-file>.
lastusagedate
Last usage date - Specifies the last time the content of a
<tu> or <tuv> element was used in the original translation memory
environment.
Value description:
Date in [ISO 8601] format. The recommended pattern to use is YYYYMMDDThhmmssZ,
where YYYY is the year (4 digits), MM is the month (2 digits), DD is the day
(2 digits), hh is the hours (2 digits), mm is the minutes (2 digits), ss is the
seconds (2 digits), and Z indicates that the time is UTC.
For example, date="20020125T210600Z" is:
January 25, 2002 at 9:06pm GMT
January 25, 2002 at 2:06pm US Mountain Time
January 26, 2002 at 6:06am Japan time
Default value:
Undefined.
Used in:
<tu>, <tuv>.
o-encoding
Original encoding - As stated in the Encoding section, all TMX documents are in Unicode. However, it is sometimes
useful to know what code set was used to encode text that was converted to Unicode for
purposes of interchange. The o-encoding attribute specifies the original or preferred
code set of the data of the element in case it is to be re-encoded in a non-Unicode code
set.
Value description:
One of the [IANA] recommended "charset
identifier", if possible.
Default value:
Undefined.
Used in:
<header>, <tu>, <tuv>, <note>.
o-tmf
Original translation memory format - Specifies the format of the
translation memory file from which the TMX document or segment thereof has been
generated.
Value description:
Text.
Default value:
Undefined.
Used in:
<header>, <tu>, <tuv>.
pos
Position - Indicates whether a tag replaced by <itag> was the start or the end tag of a matched pair.
Value description:
"start" or "end" (Note that if pos is not specified it is assumed that the markup
represented by <itag> is unpaired,
e.g., an XHTML <br /> tag.)
Default value:
Undefined.
Used in:
<itag>
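The Python sketch below shows how a checking tool might use pos together with x to report start and end tags that have no partner in the same segment. It is a lint-style report under an assumption: this draft does not state that every start tag must be closed within the same <seg>, so the findings are candidates for review rather than validity errors. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def find_unmatched_itags(seg):
    """Report pos="start"/"end" itags whose x value has no partner."""
    pending = set()   # x values of start tags still waiting for an end tag
    findings = []
    for itag in seg.iter("itag"):
        x, pos = itag.get("x"), itag.get("pos")
        if pos == "start":
            pending.add(x)
        elif pos == "end":
            if x in pending:
                pending.discard(x)
            else:
                findings.append(f"end tag x={x} has no preceding start tag")
        # itags without pos are treated as unpaired markup and skipped
    findings.extend(f"start tag x={x} has no end tag" for x in sorted(pending))
    return findings


good = ET.fromstring(
    '<seg>See the <itag x="1" pos="start" type="link"/>Notes'
    '<itag x="1" pos="end" type="link"/> for more details.'
    '<itag x="2" type="lb"/></seg>'
)
bad = ET.fromstring('<seg><itag x="1" pos="end" type="bold"/>text</seg>')
print(find_unmatched_itags(good), find_unmatched_itags(bad))
```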
segtype
Segment type - Specifies the kind of segmentation used in the
<tu> element. If a <tu> element does not have a segtype attribute specified, it
uses the one defined in the <header> element.
The "block" value is used when the segment does not correspond to one of the other
values, for example it may be desirable in some instances to store a chapter composed of
several paragraphs in a single <tu>.
The rules on how the text was segmented can be carried in a Segmentation Rules
eXchange (SRX) document.
Value description:
"block", "paragraph", "sentence", or "phrase".
Default value:
Undefined.
Used in:
<header>, <tu>.
srclang
Source language - Specifies the language of the source text. In
other words, the <tuv> holding the source segment
will have its xml:lang attribute set to the same value as
srclang (except if srclang is set to "*all*"). If a <tu> element does not have a srclang attribute specified, it uses
the one defined in the <header> element.
Value description:
A language code as described in [RFC 4646], or the
value "*all*" if any language can be used as the source language. Unlike the other TMX
attributes, the values for srclang are not case-sensitive.
Default value:
Undefined.
Used in:
<header>, <tu>.
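The fallback from <tu> to <header> described for srclang (and for segtype) can be sketched in Python as follows. The helper names and the sample values are assumptions; language codes are compared without regard to case, as the value description above allows.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical document; other required header attributes omitted for brevity.
DOC = """<tmx version="2.0">
  <header srclang="EN-US" segtype="sentence"/>
  <body>
    <tu segtype="paragraph">
      <tuv xml:lang="en-us"><seg>Hello</seg></tuv>
      <tuv xml:lang="fr"><seg>Bonjour</seg></tuv>
    </tu>
  </body>
</tmx>"""


def effective_attribute(tu, header, name):
    """Value set on the <tu>, or else the one defined in the <header>."""
    value = tu.get(name)
    return value if value is not None else header.get(name)


def source_tuv(tu, header):
    """Return the <tuv> holding the source segment, or None for "*all*"."""
    srclang = effective_attribute(tu, header, "srclang")
    if srclang is None or srclang.lower() == "*all*":
        return None
    for tuv in tu.findall("tuv"):
        if (tuv.get(XML_LANG) or "").lower() == srclang.lower():
            return tuv
    return None


root = ET.fromstring(DOC)
header, tu = root.find("header"), root.find("body/tu")
print(effective_attribute(tu, header, "segtype"), source_tuv(tu, header).findtext("seg"))
```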
tuid
Translation unit identifier - Specifies an identifier for the
<tu> element. Its value must be unique within
the file.
Value description:
Text without spaces.
Default value:
Undefined.
Used in:
<tu>.
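A uniqueness check for tuid takes only a few lines; the Python sketch below counts the values and returns those used more than once. The function name is an assumption, and the sample <tu> elements are left empty for brevity.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical body; <tuv> children omitted for brevity.
BODY = '<body><tu tuid="a"/><tu tuid="b"/><tu tuid="a"/><tu/></body>'


def duplicate_tuids(body):
    """Return the tuid values that occur on more than one <tu>."""
    counts = Counter(
        tu.get("tuid") for tu in body.iter("tu") if tu.get("tuid") is not None
    )
    return sorted(tuid for tuid, count in counts.items() if count > 1)


print(duplicate_tuids(ET.fromstring(BODY)))
```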
type
Type - Specifies the kind of data an <itag> element represents.
Value description:
Text. Depends on the element where the attribute is used.
The recommended values for the type attribute, when used in an <itag> element are as follows. Note that some values of type
should logically be used only for pairs of tags.
bold | Bold.
color | Color change.
dulined | Double-underlined.
emphasis | Emphasis.
font | Font change.
italic | Italic.
link | Linked text.
scap | Small caps.
strong | Strong.
struct | XML/SGML structure.
ulined | Underlined.
xliff-bpt | XLIFF <bpt> tag.
xliff-g | XLIFF <g> tag.
index | Index marker.
date | Date.
time | Time.
fnote | Footnote.
enote | End-note.
alt | Alternate text.
image | Image.
pb | Page break.
lb | Line break.
cb | Column break.
inset | Inset.
xliff-bx | XLIFF <bx/> tag.
xliff-ex | XLIFF <ex/> tag.
xliff-it | XLIFF <it> tag.
xliff-ph | XLIFF <ph> tag.
xliff-x | XLIFF <x/> tag.
The recommended values for the type attribute, when used in <hi>, are as follows:
| abbrev | Indicates the marked text is an abbreviation. |
| abbreviated-form | ISO-12620 2.1.8: A term resulting from the omission of any part of the full term while designating the same concept. |
| abbreviation | ISO-12620 2.1.8.1: An abbreviated form of a simple term resulting from the omission of some of its letters (e.g. 'adj.' for 'adjective'). |
| acronym | ISO-12620 2.1.8.4: An abbreviated form of a term made up of letters from the full form of a multi-word term strung together into a sequence pronounced only syllabically (e.g. 'radar' for 'radio detecting and ranging'). |
| appellation | ISO-12620: A proper-name term, such as the name of an agency or other proper entity. |
| collocation | ISO-12620 2.1.18.1: A recurrent word combination characterized by cohesion in that the components of the collocation must co-occur within an utterance or series of utterances, even though they do not necessarily have to maintain immediate proximity to one another. |
| common-name | ISO-12620 2.1.5: A synonym for an international scientific term that is used in general discourse in a given language. |
| datetime | Indicates the marked text is a date and/or time. |
| equation | ISO-12620 2.1.15: An expression used to represent a concept based on a statement that two mathematical expressions are, for instance, equal as identified by the equal sign (=), or assigned to one another by a similar sign. |
| expanded-form | ISO-12620 2.1.7: The complete representation of a term for which there is an abbreviated form. |
| formula | ISO-12620 2.1.14: Figures, symbols or the like used to express a concept briefly, such as a mathematical or chemical formula. |
| head-term | ISO-12620 2.1.1: The concept designation that has been chosen to head a terminological record. |
| initialism | ISO-12620 2.1.8.3: An abbreviated form of a term consisting of some of the initial letters of the words making up a multi-word term or the term elements making up a compound term when these letters are pronounced individually (e.g. 'BSE' for 'bovine spongiform encephalopathy'). |
| international-scientific-term | ISO-12620 2.1.4: A term that is part of an international scientific nomenclature as adopted by an appropriate scientific body. |
| internationalism | ISO-12620 2.1.6: A term that has the same or nearly identical orthographic or phonemic form in many languages. |
| logical-expression | ISO-12620 2.1.16: An expression used to represent a concept based on mathematical or logical relations, such as statements of inequality, set relationships, Boolean operations, and the like. |
| materials-management-unit | ISO-12620 2.1.17: A unit used to track an object. |
| name | Indicates the marked text is a name. |
| near-synonym | ISO-12620 2.1.3: A term that represents the same or a very similar concept as another term in the same language, but for which interchangeability is limited to some contexts and inapplicable in others. |
| part-number | ISO-12620 2.1.17.2: A unique alphanumeric designation assigned to an object in a manufacturing system. |
| phrase | Indicates the marked text is a phrase. |
| phraseological-unit | ISO-12620 2.1.18: Any group of two or more words that form a unit, the meaning of which frequently cannot be deduced based on the combined sense of the words making up the phrase. |
| protected | Indicates the marked text should not be translated. |
| romanized-form | ISO-12620 2.1.12: A form of a term resulting from an operation whereby non-Latin writing systems are converted to the Latin alphabet. |
| set-phrase | ISO-12620 2.1.18.2: A fixed, lexicalized phrase. |
| short-form | ISO-12620 2.1.8.2: A variant of a multi-word term that includes fewer words than the full form of the term (e.g. 'Group of Twenty-four' for 'Intergovernmental Group of Twenty-four on International Monetary Affairs'). |
| sku | ISO-12620 2.1.17.1: Stock keeping unit, an inventory item identified by a unique alphanumeric designation assigned to an object in an inventory control system. |
| standard-text | ISO-12620 2.1.19: A fixed chunk of recurring text. |
| symbol | ISO-12620 2.1.13: A designation of a concept by letters, numerals, pictograms or any combination thereof. |
| synonym | ISO-12620 2.1.2: Any term that represents the same or a very similar concept as the main entry term in a term entry. |
| synonymous-phrase | ISO-12620 2.1.18.3: Phraseological unit in a language that expresses the same semantic content as another phrase in that same language. |
| term | Indicates the marked text is a term. |
| transcribed-form | ISO-12620 2.1.11: A form of a term resulting from an operation whereby the characters of one writing system are represented by characters from another writing system, taking into account the pronunciation of the characters converted. |
| transliterated-form | ISO-12620 2.1.10: A form of a term resulting from an operation whereby the characters of an alphabetic writing system are represented by characters from another alphabetic writing system. |
| truncated-term | ISO-12620 2.1.8.5: An abbreviated form of a term resulting from the omission of one or more term elements or syllables (e.g. 'flu' for 'influenza'). |
| variant | ISO-12620 2.1.9: One of the alternate forms of a term. |
Any of the suggested values listed in the tables above can be used with the <sub> element.
In addition, user-defined values can be used with this attribute. A user-defined value
must start with an "x-" prefix.
Default value:
Undefined.
Used in:
<itag>, <hi>, <sub>.
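A tool that wants to warn about unexpected type values could accept either a recommended value or a user-defined value that starts with "x-". The Python sketch below shows one way to express that test. The helper name is hypothetical, the caller supplies the set of recommended values, and the requirement of at least one non-space character after the prefix follows the pattern used for user-defined values in the XML Schema.

```python
import re

# User-defined values: the "x-" prefix followed by one or more
# characters that are not white space.
USER_DEFINED = re.compile(r"x-\S+")


def is_acceptable_type(value, recommended):
    """Accept a recommended value or a user-defined "x-" value."""
    return value in recommended or USER_DEFINED.fullmatch(value) is not None
```

With recommended set to {"bold", "italic", "link"}, the values "link" and "x-title" would be accepted, while "title" would be reported.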
uid
Unique ID - The "uid" attribute is used to provide a unique ID to
identify the file that contains the segmentation rules used when generating the TMX
document.
Value description:
Text.
Default value:
Undefined.
Used in:
<external-file>.
usagecount
Usage count - Specifies the number of times a <tu> or the content of the <tuv> element has been accessed in the original TM
environment.
Value description:
Number.
Default value:
Undefined.
Used in:
<tu>, <tuv>.
version
TMX version - The version attribute indicates the version of the
TMX format to which the document conforms.
Value description:
Fixed text: the major version number, a period, and the minor version number. For
example: version="2.0".
Default value:
"2.0"
Used in:
<tmx>.
x
Tag match - The x attribute is used to match inline <itag> and <hi> elements between each <tuv> element of a given <tu> element and to facilitate pairing of <itag> elements within a <tuv>
element. This mechanism facilitates the pairing of allied codes in source and target
text, even if the order of code occurrence differs between the two because of the
translation syntax. Note that <itag> elements
representing logically paired tags must share identical values of the x attribute but
will differ in their pos attribute, as shown in the examples below.
Also note that, due to differences between languages, not all values of x found in one
<tuv> element will necessarily be found across all
<tuv>s within a single <tu>, and that values of the type attribute may differ on <itag> elements in different <tuv>s that share x values. For example, an English source text might
have two instances of italic text that correspond to one span of bold text in a Spanish
translation, or a translation may have formatting not found in the source text.
Appropriate use of the type attribute, along with use of matching x values can improve
reuse in such circumstances.
The following example shows how x can be used to indicate pairs of tags and matches of
tags across languages:
|
<seg>link to <itag pos="start" type="link" x="1">&lt;a href="www.mysite.com"
title="<sub type="x-title">my site</sub>"></itag>my web site<itag pos="end"
x="1">&lt;/a></itag>, and this is <itag type="image" x="2">&lt;img src="john.gif"
alt="<sub type="alt">John's picture</sub>"/></itag> John.</seg>
<seg>enlace a <itag pos="start" type="link" x="1">&lt;a href="www.mysite.com/es"
title="<sub type="x-title">mi sitio</sub>"></itag>mi sitio web<itag pos="end"
x="1">&lt;/a></itag>, y este es <itag type="image" x="2">&lt;img src="juan.gif"
alt="<sub type="alt">foto de Juan</sub>"/></itag> Juan.</seg>
|
The corresponding examples from a translation memory tool that does not store markup
would be:
|
<seg>link to <itag pos="start" type="link" x="1"><sub
type="x-title">my site</sub></itag>my web site<itag pos="end" x="1" />, and
this is <itag type="image" x="2"><sub
type="alt">John's picture</sub></itag> John.</seg>
<seg>enlace a <itag pos="start" type="link" x="1"><sub
type="x-title">mi sitio</sub></itag>mi sitio web<itag pos="end" x="1" />, y este
es <itag type="image" x="2"><sub
type="alt">foto de Juan</sub></itag> Juan.</seg>
|
Value description:
Number starting at 1 and incremented in steps of 1. Within a given <seg> element, the value of the x attribute must be
unique for each <hi> element or <itag> element that lacks a pos attribute or has a value of "start"
for the pos attribute. Its initial value is reset to 1 in every <seg> element.
Default value:
Undefined.
Used in:
<itag>, <hi>.
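The uniqueness rule for x can be checked per <seg>. The Python sketch below is a minimal illustration, not a complete validator: it only looks for repeated x values among <hi> elements and <itag> elements that lack pos or have pos="start". The function name is hypothetical, and the sketch assumes a segment given as a string with element names that carry no namespace prefix.

```python
import xml.etree.ElementTree as ET


def duplicate_x_values(seg_xml):
    """Return the x values used more than once by opening inline elements."""
    seg = ET.fromstring(seg_xml)
    seen = set()
    duplicates = set()
    for el in seg.iter():
        if el.tag not in ("itag", "hi"):
            continue
        # An end tag shares the x value of its start tag, so skip it here.
        if el.tag == "itag" and el.get("pos") == "end":
            continue
        x = el.get("x")
        if x is None:
            continue
        if x in seen:
            duplicates.add(x)
        seen.add(x)
    return sorted(duplicates)
```

A segment in which a start tag and its end tag share x="1" yields an empty list, whereas an unpaired <itag> and a <hi> that both carry x="1" would be reported.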
3.2.2. XML Namespace Attributes
xml:lang
Language - The "xml:lang" attribute specifies the locale of the
text of a given element.
Value description:
A language code as described in [RFC 4646]. This
declared value is considered to apply to all elements within the content of the element
where it is specified, unless overridden with another instance of the xml:lang
attribute. Unlike the other TMX attributes, the values for xml:lang are not
case-sensitive. For more information see the section on xml:lang in the XML
specification.
Default value:
Undefined.
Used in:
<tuv>, <note>.
xml:space
White spaces - The "xml:space" attribute specifies how white
spaces (ASCII spaces, tabs and line-breaks) should be treated.
Value description:
default or preserve. The value
default signals that an application's default white-space processing
modes are acceptable for this element; the value preserve indicates the
intent that applications preserve all the white space. This declared intent is
considered to apply to all elements within the content of the element where it is
specified, unless overridden with another instance of the xml:space attribute. For more
information see the section on xml:space in the XML specification.
Default value:
default.
Used in:
<seg>
4. Content Markup
4.1. Overview
TM systems use a variety of methods of marking up or representing internal formatting.
Formats are constantly evolving, and new formats are introduced on a regular basis.
Attempting to collect, interpret, disseminate and maintain finite descriptions of each
formatting tag used at any given time by TM systems is not possible. In addition, TM
tools may or may not include the actual content of formatting tags in their databases,
i.e., some tools may include only markers for the location of tags, relying instead on
the presence of formatting tags in source files to insert tags when the memory is used
to provide leverage against actual files that are to be translated.
At present, the best way to deal with these native codes in general is to delimit them
by a specific set of elements that convey where they begin and end, and possibly
additional information about what they are (bold, italic, footnote, etc.). (Note,
however, that in some cases inline content markup may be left unencapsulated to meet
specific needs. Guidance about how best to represent markup for specific needs and cases
is beyond the scope of this standard.)
The element <sub> is provided to delimit
(potentially translatable) sub-flow text within a sequence of native codes. For
instance, if the text content of a footnote is defined within the footnote marker code,
it may be demarked with the <sub> element.
4.2. Representing Inline Elements
TMX provides a mechanism for indicating the position of inline markup and
encapsulating this markup when it is available to the tool creating a TMX file.
When a TM tool contains the content of tags, this information must be included in the
TMX file produced by that tool. Including this information allows other tools capable of
using this information to do so. Tools that do not store tags internally may discard the
content of tags but should instead include appropriate tag markers in translation
memories generated from TMX files that do include this information.
Inline markup in translation memory data is stored by using the <itag> element to surround the markup (with any < or &
characters converted to their corresponding character entities, &lt; and
&amp;). The <itag> element can take one of
three forms:
If the element encloses the start tag in a set of paired tags, <itag> is given the value of "start" for the pos
attribute. For example: This text contains an <itag pos="start"
x="1">&lt;em></itag>HTML start tag. (This usage replaces the
<bpt> element found in previous versions of TMX.)
If the element encloses an end tag, it is given the value of "end" for the pos
attribute, and the value of its x attribute must agree with the value of x found
in the <itag> element that encloses the
corresponding start tag. For example: It is finished<itag pos="end"
x="1">&lt;/em></itag> in this text. (This usage replaces the
<ept> element found in previous versions of TMX.)
If the element encloses unpaired content (such as an XHTML <br /> tag)
it is not given a value for the pos attribute. (This usage replaces the
<ut> element found in previous versions of TMX.)
Note that if an <itag> element encapsulates paired
markup for which the corresponding start or end markup is not present in the same <seg> element, it should have a unique value for the x
attribute. Correct use of the pos attribute will enable TMX-compliant applications to
correctly interpret the tag content as a start or end tag that has been isolated. (This
ability replaces the functionality of the <it> (isolated tag) element found in
previous versions of TMX.)
When a TM tool does not contain the content of tags, empty <itag> elements are used in a manner otherwise identical to the case
in which markup is encapsulated. Note that even if empty <itag> elements are otherwise used, start and end tag versions must
be used if sub-flow (marked with the <sub> element) is
represented. For example, a translatable title attribute in an HTML <a> element
would be represented by something like <itag x="4"
type="title"><sub type="title">Site
title</sub></itag>.
Examples:
Paired tags
Source text:
|
|
<p>link to <em><a href="www.mysite.com" title="My Site">my web site</a></em>.</p>
|
Text with encapsulated content markup:
|
|
<seg>link to <itag pos="start" x="1" type="emphasis">&lt;em></itag><itag pos="start" x="2"
type="link">&lt;a href="www.mysite.com" title="<sub type="x-title">My Site</sub>">
</itag>my web site<itag pos="end" x="2">&lt;/a></itag><itag pos="end" x="1">&lt;/em></itag>.</seg>
|
Text without encapsulated content markup:
|
|
<seg>link to <itag pos="start" x="1" type="emphasis" /><itag pos="start" x="2" type="link">
<sub type="x-title">My Site</sub></itag>
my web site<itag pos="end" x="2" /><itag pos="end" x="1" />.</seg>
|
Paired tags
Source text:
|
|
<p>There were <em>many</em> French ships involved.</p>
|
Text with encapsulated content markup:
|
|
<seg>There were <itag pos="start" x="1" type="emphasis">&lt;em></itag>many<itag pos="end"
x="1">&lt;/em></itag> French ships involved.</seg>
|
Text without encapsulated content markup:
|
|
<seg>There were <itag pos="start" x="1" type="emphasis" />many<itag pos="end" x="1" /> French
ships involved.</seg>
|
Unpaired tags
Source text:
|
|
This is <br /><img src="john.gif" alt="John's picture"/> John.
|
Text with encapsulated content markup:
|
|
...<seg>This is <itag type="break" x="1" equiv-text="linefeed">&lt;br /></itag><itag type="image"
x="2">&lt;img src="john.gif" alt="<sub type="alt">John's picture</sub>"/></itag> John.</seg>
|
Text without encapsulated content markup:
|
|
...<seg>This is <itag type="break" x="1" equiv-text="linefeed" /><itag type="image" x="2"><sub
type="alt">John's picture</sub></itag> John.</seg>
|
Text with a paired tag whose pair is not found in the same segment
Source text:
|
|
This warning applies to users of model CR245 only.</strong>
|
Text with encapsulated content markup:
|
|
This warning applies to users of model CR245 only.<itag pos="end" type="strong" x="1">&lt;/strong></itag>
|
Text without encapsulated content markup:
|
|
This warning applies to users of model CR245 only.<itag pos="end" type="strong" x="1" />
|
Note that both methods of representing inline markup are considered valid TMX.
TMX-compliant tools that do not store source-format markup in their databases may simply
discard encapsulated markup that they are unable to use. TMX-compliant tools that do
store markup and receive files without encapsulated source-format markup may require
access to source files and additional processing to properly interpret these files and
should not simply discard empty <itag> elements.
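As a minimal sketch of how a tool might produce both forms described in this section, the Python code below builds <itag> elements with the standard xml.etree.ElementTree module. The helper names are hypothetical. Native markup is stored as element text, so the serializer converts < and & to character entities on output; this module also converts > to an entity, which this section does not require but which is still well-formed XML. Passing no markup yields the empty form used by tools that do not store markup.

```python
import xml.etree.ElementTree as ET


def make_itag(markup, x, type_, pos=None):
    """Build an <itag>; markup=None gives the empty (marker-only) form."""
    attrs = {"type": type_, "x": str(x)}
    if pos is not None:
        attrs["pos"] = pos  # "start" or "end"; omitted for unpaired content
    el = ET.Element("itag", attrs)
    if markup:
        el.text = markup  # escaped automatically when serialized
    return el


def build_example(with_markup=True):
    """Encode: There were <em>many</em> French ships involved."""
    seg = ET.Element("seg")
    seg.text = "There were "
    start = make_itag("<em>" if with_markup else None, 1, "emphasis", "start")
    start.tail = "many"
    end = make_itag("</em>" if with_markup else None, 1, "emphasis", "end")
    end.tail = " French ships involved."
    seg.extend([start, end])
    return seg


if __name__ == "__main__":
    print(ET.tostring(build_example(True), encoding="unicode"))
    print(ET.tostring(build_example(False), encoding="unicode"))
```

Repeating the type attribute on the end tag is a choice made in this sketch; the examples in this section omit it on most end tags.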
5. TMX Compliance
TMX compliance is defined as follows:
Given:
An original document with inline codes (for example an HTML file)
translated by a tool XYZ.
The translation memory of that document saved in TMX format, using
<itag> elements as described
in the section on Representing Inline
Elements.
The segmentation rules in SRX format used to break blocks of source
text into smaller fragments, either embedded in the TMX document or
referenced in an <external-file> element.
Assuming:
The tool XYZ supports TMX Export if the TMX document created by
tool XYZ contains all the information required to re-create the translated document
without loss of text, data or formatting.
The tool XYZ supports TMX Import if any TMX document containing
all the information required to re-create the translated document (possibly created by a
TMX Export compliant tool), can be imported in tool XYZ and effectively be used to
re-create the translated document without loss of text, data or formatting.
Tools that offer both import and export features must support both TMX Import and TMX
Export to be TMX compliant.
Whenever possible, the original formatting information should be included in the
exported TMX file, enclosed in <itag> elements.
Because many translation memory tools do not store source markup in their databases
(and instead extract markup from source files at translation time), it may not be
possible to include the original source formatting codes in inline elements. In such
cases, the inline elements must still be present in the correct places in the form of
empty <itag> elements and they must comply with the
section on Representing Inline Elements.
5.1 Validation of TMX Files
A cross-platform utility that validates TMX documents against the TMX Schema and also
verifies that they follow the requirements described in this document is included as part
of the TMX 2.0 specification.
The source code of the validation tool is available for download from OSCAR’s web site.
6. Changes Since Previous Version (Non-Normative)
The main changes in this version (2.0) relative to the previous version (1.4b) are as
follows:
TMX 2.0 is based on an XML Schema instead of a
DTD.
New elements. The following elements were added to the TMX
standard: <context>, <itag>, <segmentation>, <internal-file> and <external-file>.
Removed elements. The following elements were removed
from the TMX standard: <bpt>, <ept>,
<it>, <ph>, <map>,
<prop>, <ude>, <ut>.
New attributes. The following attributes were
incorporated: xml:space, comment, context-type, crc, group, g-order, href, equiv-text.
A new set of unified and simplified rules for representing inline elements was
designed. See the section on Representing
Inline Elements for more details.
Attribute type marked as required in all inline
elements.
Replaced implementation levels 1 and 2 with a unique level of compliance. TMX
files must include all the necessary inline data to re-create the translation of
source documents (optionally requiring the actual source document at processing
time) to be considered TMX compliant. See section TMX
Compliance for more details.
Required uniqueness of tuid attribute within a TMX
file.
Added a new attribute, pos (position) for use in
inline markup to allow recording of the type of the position (i.e., start or end
tag) encapsulated by that element.
Values of the datatype attribute are now mandated to be
from the list of MIME types. The previously existing values are now listed for
compatibility with XLIFF or for use when MIME types are insufficiently specific
for language-processing purposes.
All metamarkup from previous versions of TMX was eliminated in favor of a
single tag, <itag>, which indicates the
location of tags in the source document and, optionally, can also encapsulate
the content of those tags, if this information is available to the application
creating a TMX file.
6.1 Upgrading TMX Files
It should be possible to upgrade a valid TMX 1.4b file to 2.0 by:
Removing any DOCTYPE declaration from the file
Changing the value of the version attribute from
"1.4" to "2.0"
Removing all TMX 1.4 elements and attributes that have been deprecated in TMX
2.0 (e.g., <ut>)
Converting all <prop> elements to attributes using another XML
namespace. If the content of <prop> elements is too complex to be
represented in attributes, the use of elements from another XML namespace may be
required to represent them fully.
Replacing old-style metamarkup (e.g., <bpt>/<ept>
pairs) with <itag> elements as necessary to
comply with the section on Representing Inline
Elements.
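Part of the upgrade steps listed above can be sketched in Python as follows. This is an illustration under stated assumptions, not a complete converter: it only sets the version attribute and rewrites <bpt>/<ept> pairs as <itag> elements. It assumes that the pairing number carried by the i attribute of TMX 1.4b can be reused as the x value, which overwrites any x value already present on <bpt> and may need adjustment where x is also used to match tags across <tuv> elements. Handling of <prop>, <ph>, <it> and <ut>, and of the TMX 2.0 namespace, is left out. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def upgrade_inline(root):
    """Rewrite 1.4b <bpt>/<ept> pairs as <itag> elements, in place."""
    root.set("version", "2.0")
    for seg in root.iter("seg"):
        types = {}  # pairing number (1.4b "i" attribute) -> type of start tag
        for el in seg.iter():
            if el.tag == "bpt":
                pair = el.attrib.pop("i")
                types[pair] = el.get("type")
                el.tag = "itag"
                el.set("pos", "start")
                el.set("x", pair)  # assumption: reuse "i" as the x value
            elif el.tag == "ept":
                pair = el.attrib.pop("i")
                el.tag = "itag"
                el.set("pos", "end")
                el.set("x", pair)
                # Copy the type of the matching start tag when it is known.
                if types.get(pair) is not None:
                    el.set("type", types[pair])
    return root
```

The encapsulated markup itself is left untouched, since only the wrapping element changes.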
A. Sample Document
<?xml version="1.0" encoding="UTF-8"?>
<tmx version="2.0"
xmlns="http://www.lisa.org/tmx20"
xsi:schemaLocation="http://www.lisa.org/tmx20 tmx20.xsd"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xyz="urn:myApps:xyz">
<header creationtool="Sample Creator" creationtoolversion="1.1.1"
segtype="block" o-tmf="unknown" adminlang="en-US" srclang="*all*" datatype="x-sample">
<segmentation>
<internal-file xyz:myattribute="custom rules">
<!-- Segmentation rules in SRX 2.0 format -->
<srx:srx version="2.0" xmlns:srx="http://www.lisa.org/srx20">
<srx:header segmentsubflows="yes" cascade="yes">
<srx:formathandle type="start" include="no"/>
<srx:formathandle type="end" include="yes"/>
<srx:formathandle type="isolated" include="yes"/>
</srx:header>
<srx:body>
<srx:languagerules>
<srx:languagerule languagerulename="Default">
<!-- Common rule for most languages -->
<srx:rule break="yes">
<srx:beforebreak>[\.\?!]+</srx:beforebreak>
<srx:afterbreak>\s</srx:afterbreak>
</srx:rule>
</srx:languagerule>
</srx:languagerules>
<srx:maprules>
<!-- Common breaking rules -->
<srx:languagemap languagepattern=".*" languagerulename="Default"/>
</srx:maprules>
</srx:body>
</srx:srx>
</internal-file>
</segmentation>
<!-- Other elements -->
<xyz:other />
</header>
<body>
<!-- Paired codes with translatable text -->
<tu srclang="en-US" datatype="html" tuid="sample1">
<tuv xml:lang="en" datatype="html">
<seg>link to <itag type="link" x="1" pos="start">&lt;a href="www.mysite.com"
title="<sub type="x-title">my site</sub>"></itag>my web site<itag
pos="end" type="link" x="1">&lt;/a></itag>.</seg>
</tuv>
<tuv xml:lang="es" datatype="html">
<seg>enlace a <itag type="link" x="1" pos="start">&lt;a href="www.mysite.com/es"
title="<sub type="x-title">mi sitio</sub>"></itag>mi sitio web<itag pos="end"
type="link" x="1">&lt;/a></itag>.</seg>
</tuv>
</tu>
<!-- Paired codes without translatable text -->
<tu datatype="rtf">
<context context-type="x-my-context">text formatting options</context>
<tuv xml:lang="en">
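<seg>Text in <itag type="italic" x="1" pos="start" />italics<itag type="italic" x="1" pos="end" />.</seg>
</tuv>
<tuv xml:lang="fr">
<seg>Texte en <itag type="italic" x="1" pos="start" />italiques<itag type="italic" x="1" pos="end" />.</seg>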
</tuv>
</tu>
<!-- Standalone sequence with translatable text -->
<tu datatype="html">
<tuv xml:lang="en-US">
<seg>This is <itag type="image">&lt;img src="john.gif" alt="<sub type="alt">John's
picture</sub>"/></itag> John.</seg>
</tuv>
<tuv xml:lang="es">
<seg>Este es <itag type="image">&lt;img src="juan.gif" alt="<sub type="alt">foto
de Juan</sub>"/></itag> Juan.</seg>
</tuv>
</tu>
<!-- Standalone sequence without translatable text -->
<tu>
<tuv xml:lang="en">
<seg>text displayed in <itag type="lb" equiv-text="
"/> two lines.</seg>
</tuv>
<tuv xml:lang="es">
<seg>texto en <itag type="lb" equiv-text="
"/> dos líneas.</seg>
</tuv>
</tu>
<!-- Notes -->
<tu tuid="90293837" creationid="jean-claude" srclang="zh-CN" segtype="phrase">
<note>Salutations</note>
<note>Machine translation</note>
<tuv xml:lang="en">
<seg>Hello!</seg>
</tuv>
<tuv o-encoding="BIG5" xml:lang="zh-CN">
<note>Enable Unicode support for viewing this entry.</note>
<seg>你好!</seg>
</tuv>
</tu>
<!-- Untranslatable text -->
<tu o-tmf="xliff" creationdate="20060125T210600Z" changedate="20060315T130700Z"
creationid="ted@mail.com">
<tuv xml:lang="en" xml:space="default">
<seg><hi type="protected" comment="product name">Ultrabalancer</hi> support
is excellent.</seg>
</tuv>
<tuv xml:lang="es" xml:space="default">
<seg>El soporte de <hi type="protected">Ultrabalancer</hi> es excelente.</seg>
</tuv>
</tu>
<!-- Foreign elements -->
<xyz:database>main server</xyz:database>
<xyz:purpose>general</xyz:purpose>
<!-- grouped segments -->
<tu group="numbers" g-order="1" datatype="plaintext" creationdate="20060125T210600Z">
<tuv xml:lang="fr">
<seg>un</seg>
</tuv>
<tuv xml:lang="de">
<seg>eins</seg>
</tuv>
<tuv xml:lang="en">
<seg>one</seg>
</tuv>
</tu>
<tu group="numbers" g-order="2" datatype="plaintext">
<tuv xml:lang="de">
<seg>zwei</seg>
</tuv>
<tuv xml:lang="fr">
<seg>deux</seg>
</tuv>
<tuv xml:lang="en">
<seg>two</seg>
</tuv>
</tu>
<tu group="numbers" g-order="3" datatype="plaintext">
<tuv xml:lang="en">
<seg>three</seg>
</tuv>
<tuv xml:lang="de">
<seg>drei</seg>
</tuv>
<tuv xml:lang="fr">
<seg>trois</seg>
</tuv>
</tu>
</body>
</tmx>
|
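A document such as the sample above can be read with any namespace-aware XML parser. The Python sketch below, which uses the standard xml.etree.ElementTree module, collects the text of each <seg> by language. The function name is hypothetical, and the sketch assumes the TMX 2.0 namespace URI shown in the sample document. Note that the collected text includes any markup encapsulated in <itag> elements.

```python
import xml.etree.ElementTree as ET

TMX_NS = {"tmx": "http://www.lisa.org/tmx20"}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def segments_by_language(tmx_text):
    """Return one {xml:lang: segment text} dictionary per <tu>."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.findall("tmx:body/tmx:tu", TMX_NS):
        variants = {}
        for tuv in tu.findall("tmx:tuv", TMX_NS):
            seg = tuv.find("tmx:seg", TMX_NS)
            if seg is not None:
                # itertext() joins the text of <seg> and of its inline children.
                variants[tuv.get(XML_LANG)] = "".join(seg.itertext())
        units.append(variants)
    return units
```

Foreign elements placed in <body>, such as the xyz:database element in the sample, are ignored because only elements in the TMX namespace are selected.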
B. XML Schema for TMX
The XML Schema for TMX is available at: http://www.lisa.org/tmx/tmx20.xsd.
<?xml version="1.0" encoding="UTF-8"?>
<!--
Document : tmx20.xsd
Version : 1.0
Created on : December 2, 2006
Author : Rodolfo M. Raya (rmraya@maxprograms.com)
Modified : February 18, 2009 by Rodolfo M. Raya (rmraya@maxprograms.com)
July 23, 2008 by Arle Lommel (arle@lisa.org)
Description : This XML Schema defines the structure of TMX 2.0
Status : Preliminary draft
Copyright © The Localisation Industry Standards Association [LISA] 1997-2009.
All Rights Reserved.
-->
<xs:schema xmlns:tmx="http://www.lisa.org/tmx20" targetNamespace="http://www.lisa.org/tmx20"
xml:lang="en" xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
<xs:import namespace="http://www.w3.org/XML/1998/namespace"
schemaLocation="http://www.w3.org/2001/xml.xsd" />
<!--
==================================================
Restrictions
==================================================
-->
<!-- Restrictions for segtype attribute -->
<xs:simpleType name="segtypes">
<xs:restriction base="xs:token">
<xs:enumeration value="block" />
<xs:enumeration value="paragraph" />
<xs:enumeration value="sentence" />
<xs:enumeration value="phrase" />
</xs:restriction>
</xs:simpleType>
<!-- Restrictions for xml:space attribute -->
<xs:simpleType name="space">
<xs:restriction base="xs:token">
<xs:enumeration value="default" />
<xs:enumeration value="preserve" />
</xs:restriction>
</xs:simpleType>
<!-- Restrictions for assoc attribute -->
<xs:simpleType name="assoc_type">
<xs:restriction base="xs:token">
<xs:enumeration value="p" />
<xs:enumeration value="f" />
<xs:enumeration value="b" />
</xs:restriction>
</xs:simpleType>
<!-- Restrictions for type attribute when used in <itag> -->
<xs:simpleType name="placeholder_type">
<xs:restriction base="xs:token">
<xs:enumeration value="bold" />
<xs:enumeration value="color" />
<xs:enumeration value="dulined" />
<xs:enumeration value="emphasis" />
<xs:enumeration value="font" />
<xs:enumeration value="italic" />
<xs:enumeration value="link" />
<xs:enumeration value="scap" />
<xs:enumeration value="strong" />
<xs:enumeration value="struct" />
<xs:enumeration value="ulined" />
<xs:enumeration value="xliff-bpt" />
<xs:enumeration value="xliff-g" />
<xs:enumeration value="index" />
<xs:enumeration value="date" />
<xs:enumeration value="time" />
<xs:enumeration value="fnote" />
<xs:enumeration value="enote" />
<xs:enumeration value="alt" />
<xs:enumeration value="image" />
<xs:enumeration value="pb" />
<xs:enumeration value="lb" />
<xs:enumeration value="cb" />
<xs:enumeration value="inset" />
<xs:enumeration value="xliff-bx" />
<xs:enumeration value="xliff-ex" />
<xs:enumeration value="xliff-it" />
<xs:enumeration value="xliff-ph" />
<xs:enumeration value="xliff-x" />
</xs:restriction>
</xs:simpleType>
<!-- Restrictions for type attribute when used in <hi> -->
<xs:simpleType name="term_type">
<xs:restriction base="xs:token">
<xs:enumeration value="abbrev" />
<xs:enumeration value="abbreviated-form" />
<xs:enumeration value="abbreviation" />
<xs:enumeration value="acronym" />
<xs:enumeration value="appellation" />
<xs:enumeration value="collocation" />
<xs:enumeration value="common-name" />
<xs:enumeration value="datetime" />
<xs:enumeration value="equation" />
<xs:enumeration value="expanded-form" />
<xs:enumeration value="formula" />
<xs:enumeration value="head-term" />
<xs:enumeration value="initialism" />
<xs:enumeration value="international-scientific-term" />
<xs:enumeration value="internationalism" />
<xs:enumeration value="logical-expression" />
<xs:enumeration value="materials-management-unit" />
<xs:enumeration value="name" />
<xs:enumeration value="near-synonym" />
<xs:enumeration value="part-number" />
<xs:enumeration value="phrase" />
<xs:enumeration value="phraseological-unit" />
<xs:enumeration value="protected" />
<xs:enumeration value="romanized-form" />
<xs:enumeration value="set-phrase" />
<xs:enumeration value="short-form" />
<xs:enumeration value="sku" />
<xs:enumeration value="standard-text" />
<xs:enumeration value="symbol" />
<xs:enumeration value="synonym" />
<xs:enumeration value="synonymous-phrase" />
<xs:enumeration value="term" />
<xs:enumeration value="transcribed-form" />
<xs:enumeration value="transliterated-form" />
<xs:enumeration value="truncated-term" />
<xs:enumeration value="variant" />
</xs:restriction>
</xs:simpleType>
<!-- Restrictions for context-type attribute -->
<xs:simpleType name="context_type">
<xs:restriction base="xs:token">
<xs:enumeration value="database" />
<xs:enumeration value="element" />
<xs:enumeration value="elementtitle" />
<xs:enumeration value="linenumber" />
<xs:enumeration value="numparams" />
<xs:enumeration value="paramnotes" />
<xs:enumeration value="record" />
<xs:enumeration value="recordtitle" />
<xs:enumeration value="sourcefile" />
</xs:restriction>
</xs:simpleType>
<!-- Restrictions for date values -->
<xs:simpleType name="date_type">
<xs:restriction base="xs:string">
<!-- YYYYMMDDThhmmssZ -->
<xs:pattern value="[1-2][0-9][0-9][0-9][0-1][0-9][0-3][0-9]T[0-2][0-9][0-5][0-9][0-5][0-9]Z"/>
</xs:restriction>
</xs:simpleType>
<!-- Restrictions for user-defined attribute values -->
<xs:simpleType name="Custom">
<xs:restriction base="xs:string">
<xs:pattern value="x-[^\s]+" />
</xs:restriction>
</xs:simpleType>
<!--
==================================================
Structural Elements
==================================================
-->
<!-- Base Document Element -->
<xs:element name="tmx">
<xs:complexType>
<xs:sequence>
<xs:element ref="tmx:header" />
<xs:element ref="tmx:body" />
</xs:sequence>
<xs:attribute name="version" use="required">
<xs:simpleType>
<xs:restriction base="xs:string">
<xs:enumeration value="2.0" />
</xs:restriction>
</xs:simpleType>
</xs:attribute>
<xs:anyAttribute namespace="##any" processContents="lax" />
</xs:complexType>
</xs:element>
<!-- Body -->
<xs:element name="body">
<xs:complexType>
<xs:choice minOccurs="0" maxOccurs="unbounded">
<xs:element ref="tmx:tu" />
<xs:any namespace="##other" processContents="lax" />
</xs:choice>
<xs:anyAttribute namespace="##any" processContents="lax" />
</xs:complexType>
</xs:element>
<!-- Context Information -->
<xs:element name="context">
<xs:complexType mixed="true">
<xs:attribute name="context-type" use="required">
<xs:simpleType>
<xs:union memberTypes="tmx:context_type tmx:Custom" />
</xs:simpleType>
</xs:attribute>
<xs:anyAttribute namespace="##any" processContents="lax" />
</xs:complexType>
</xs:element>
<!-- External File -->
<xs:element name="external-file">
<xs:complexType>
<xs:attribute name="href" use="required" />
<xs:attribute name="crc" />
<xs:attribute name="uid" />
<xs:anyAttribute namespace="##any" processContents="lax" />
</xs:complexType>
</xs:element>
<!-- Header -->
<xs:element name="header">
<xs:complexType>
<xs:sequence>
<xs:choice minOccurs="0" maxOccurs="unbounded">
<xs:element ref="tmx:note" />
</xs:choice>
<xs:element minOccurs="0" ref="tmx:segmentation" />
<xs:any maxOccurs="unbounded" minOccurs="0" namespace="##other"
processContents="lax" />
</xs:sequence>
<xs:attribute name="creationtool" use="required" />
<xs:attribute name="creationtoolversion" use="required" />
<xs:attribute name="segtype" use="required" type="tmx:segtypes" />
<xs:attribute name="o-tmf" use="required" />
<xs:attribute name="adminlang" use="required" />
<xs:attribute name="srclang" use="required" />
<xs:attribute name="datatype" use="required" />
<xs:attribute name="o-encoding" />
<xs:attribute name="creationdate" type="tmx:date_type"/>
<xs:attribute name="creationid" />
<xs:attribute name="changedate" type="tmx:date_type"/>
<xs:attribute name="changeid" />
<xs:anyAttribute namespace="##any" processContents="lax" />
</xs:complexType>
</xs:element>
<!-- Internal File -->
<xs:element name="internal-file">
<xs:complexType mixed="true">
<xs:sequence>
<xs:any maxOccurs="1" minOccurs="1" namespace="http://www.lisa.org/srx20"
processContents="lax" />
</xs:sequence>
<xs:anyAttribute namespace="##any" processContents="lax" />
</xs:complexType>
</xs:element>
<!-- Note -->
<xs:element name="note">
<xs:complexType mixed="true">
<xs:attribute name="o-encoding" />
<xs:attribute ref="xml:lang" />
<xs:attribute name="creationdate" type="tmx:date_type"/>
<xs:attribute name="creationid" />
<xs:attribute name="changedate" type="tmx:date_type"/>
<xs:attribute name="changeid" />
<xs:anyAttribute namespace="##any" processContents="lax" />
</xs:complexType>
</xs:element>
<!-- Segment -->
<xs:element name="seg">
<xs:complexType mixed="true">
<xs:choice minOccurs="0" maxOccurs="unbounded">
<xs:element ref="tmx:itag" />
<xs:element ref="tmx:hi" />
</xs:choice>
<xs:attribute ref="xml:space" default="default" />
<xs:anyAttribute namespace="##any" processContents="lax" />
</xs:complexType>
</xs:element>
<!-- Segmentation -->
<xs:element name="segmentation">
<xs:complexType>
<xs:choice>
<xs:element ref="tmx:internal-file" />
<xs:element ref="tmx:external-file" />
</xs:choice>
<xs:anyAttribute namespace="##any" processContents="lax" />
</xs:complexType>
</xs:element>
<!-- Translation Unit -->
<xs:element name="tu">
<xs:complexType>
<xs:sequence>
<xs:choice minOccurs="0" maxOccurs="unbounded">
<xs:element ref="tmx:note" />
<xs:element ref="tmx:context" />
</xs:choice>
<xs:element ref="tmx:tuv" minOccurs="2" maxOccurs="unbounded" />
<xs:any maxOccurs="unbounded" minOccurs="0" namespace="##other"
processContents="lax" />
</xs:sequence>
<xs:attribute name="tuid" />
<xs:attribute name="o-encoding" />
<xs:attribute name="datatype" />
<xs:attribute name="usagecount" >
<xs:simpleType>
<xs:restriction base="xs:integer">
<xs:minInclusive value="0" />
</xs:restriction>
</xs:simpleType>
</xs:attribute>
<xs:attribute name="lastusagedate" type="tmx:date_type"/>
<xs:attribute name="creationtool" />
<xs:attribute name="creationtoolversion" />
<xs:attribute name="creationdate" type="tmx:date_type"/>
<xs:attribute name="creationid" />
<xs:attribute name="changedate" type="tmx:date_type"/>
<xs:attribute name="segtype" type="tmx:segtypes" />
<xs:attribute name="changeid" />
<xs:attribute name="o-tmf" />
<xs:attribute name="srclang" />
<xs:attribute name="group" />
<xs:attribute name="g-order">
<xs:simpleType>
<xs:restriction base="xs:integer">
<xs:minInclusive value="1" />
</xs:restriction>
</xs:simpleType>
</xs:attribute>
<xs:anyAttribute namespace="##any" processContents="lax" />
</xs:complexType>
</xs:element>
<!-- Translation Unit Variant -->
<xs:element name="tuv">
<xs:complexType>
<xs:sequence>
<xs:choice minOccurs="0" maxOccurs="unbounded">
<xs:element ref="tmx:note" />
</xs:choice>
<xs:element ref="tmx:seg" />
<xs:any maxOccurs="unbounded" minOccurs="0" namespace="##other"
processContents="lax" />
</xs:sequence>
<xs:attribute ref="xml:lang" use="required" />
<xs:attribute name="o-encoding" />
<xs:attribute name="datatype" />
<xs:attribute name="usagecount" >
<xs:simpleType>
<xs:restriction base="xs:integer">
<xs:minInclusive value="0" />
</xs:restriction>
</xs:simpleType>
</xs:attribute>
<xs:attribute name="lastusagedate" type="tmx:date_type"/>
<xs:attribute name="creationtool" />
<xs:attribute name="creationtoolversion" />
<xs:attribute name="creationdate" type="tmx:date_type"/>
<xs:attribute name="creationid" />
<xs:attribute name="changedate" type="tmx:date_type"/>
<xs:attribute name="o-tmf" />
<xs:attribute name="changeid" />
<xs:anyAttribute namespace="##any" processContents="lax" />
</xs:complexType>
</xs:element>
<!--
==================================================
Content Markup
==================================================
-->
<!-- Highlight -->
<xs:element name="hi">
<xs:complexType mixed="true">
<xs:choice minOccurs="0" maxOccurs="unbounded">
<xs:element ref="tmx:itag" />
</xs:choice>
<xs:attribute name="x">
<xs:simpleType>
<xs:restriction base="xs:integer">
<xs:minInclusive value="1" />
</xs:restriction>
</xs:simpleType>
</xs:attribute>
<xs:attribute name="type" use="required">
<xs:simpleType>
<xs:union memberTypes="tmx:term_type tmx:Custom" />
</xs:simpleType>
</xs:attribute>
<xs:attribute name="comment" />
<xs:anyAttribute namespace="##any" processContents="lax" />
</xs:complexType>
</xs:element>
<!-- Internal Tag -->
<xs:element name="itag">
<xs:complexType mixed="true">
<xs:sequence>
<xs:element minOccurs="0" maxOccurs="unbounded" ref="tmx:sub" />
</xs:sequence>
<xs:attribute name="x">
<xs:simpleType>
<xs:restriction base="xs:integer">
<xs:minInclusive value="1" />
</xs:restriction>
</xs:simpleType>
</xs:attribute>
<xs:attribute name="assoc" type="tmx:assoc_type" />
<xs:attribute name="equiv-text" />
<xs:attribute name="pos">
<xs:simpleType>
<xs:restriction base="xs:token">
<xs:enumeration value="start" />
<xs:enumeration value="end" />
</xs:restriction>
</xs:simpleType>
</xs:attribute>
<xs:attribute name="type" use="required">
<xs:simpleType>
<xs:union memberTypes="tmx:placeholder_type tmx:Custom" />
</xs:simpleType>
</xs:attribute>
<xs:anyAttribute namespace="##any" processContents="lax" />
</xs:complexType>
</xs:element>
<!-- Subflow -->
<xs:element name="sub">
<xs:complexType mixed="true">
<xs:sequence minOccurs="0" maxOccurs="unbounded">
<xs:element ref="tmx:itag" />
</xs:sequence>
<xs:attribute name="datatype" />
<xs:attribute name="type" use="required">
<xs:simpleType>
<xs:union memberTypes="tmx:placeholder_type tmx:term_type tmx:Custom" />
</xs:simpleType>
</xs:attribute>
<xs:anyAttribute namespace="##any" processContents="lax" />
</xs:complexType>
</xs:element>
</xs:schema>
<!-- End -->
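The two xs:pattern facets in the schema above (date_type and Custom) can be checked outside a schema validator. The following is a minimal sketch, not part of the schema; the function names are hypothetical and chosen for illustration only. It assumes that XSD patterns must match the whole value and that XSD \s covers only space, tab, line feed and carriage return. The date_type pattern is a coarse filter: it accepts digit combinations such as month 13 or hour 59, so the sketch adds an optional calendar check on top of it.

```python
import re
from datetime import datetime

# Transcriptions of the two xs:pattern facets from the schema.
# XSD patterns are implicitly anchored, so re.fullmatch is used below.
DATE_TYPE = re.compile(
    r"[1-2][0-9][0-9][0-9][0-1][0-9][0-3][0-9]T[0-5][0-9][0-5][0-9][0-5][0-9]Z"
)
# XSD \s means space, tab, LF or CR; Python's \s is broader, so the
# character class is spelled out explicitly.
CUSTOM = re.compile(r"x-[^ \t\n\r]+")


def matches_date_type(value: str) -> bool:
    """True if value satisfies the date_type pattern (YYYYMMDDThhmmssZ)."""
    return DATE_TYPE.fullmatch(value) is not None


def matches_custom(value: str) -> bool:
    """True if value satisfies the Custom pattern (x- plus non-whitespace)."""
    return CUSTOM.fullmatch(value) is not None


def is_real_utc_timestamp(value: str) -> bool:
    """Stricter check (an assumption, not required by the schema):
    the value must match the pattern and name a real calendar date and time."""
    if not matches_date_type(value):
        return False
    try:
        datetime.strptime(value, "%Y%m%dT%H%M%SZ")
    except ValueError:
        return False
    return True
```

A tool might apply matches_date_type when reading attributes such as creationdate or changedate, and matches_custom when a context-type or type value is not one of the enumerated values.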
C. Glossary
OSCAR
LISA special interest group (Open Standards for Container/Content Allowing Re-use).
UTC
UTC stands for Coordinated Universal Time.
XML
XML stands for Extensible Markup Language. XML is a simplified and restricted subset
of Standard Generalized Markup Language (SGML).
XML Schema
A description of a type of XML document, typically expressed in terms of constraints
on the structure and content of documents of that type, above and beyond the basic
syntax constraints imposed by XML itself. An XML schema provides a view of the document
type at a relatively high level of abstraction.
D. References
Normative
[IANA Charsets]
IANA Names for Character Sets. IANA (Internet Assigned Numbers Authority), August 2001.
[MIME Media Types]
IANA MIME Media Types. IANA (Internet Assigned Numbers Authority), 2007.
[ISO 8601]
Representation of dates and times. ISO (International Organization for Standardization),
December 2000.
[RFC 2046]
Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types. IETF (Internet
Engineering Task Force), November 1996.
[RFC 4646]
RFC 4646 Tags for the Identification of Languages. IETF (Internet Engineering Task
Force), September 2006. This document, in combination with RFC 4647, replaces RFC 3066,
which replaced RFC 1766.
[SRX 2.0]
Segmentation Rules Exchange (SRX) is an XML-based standard for describing how
translation and other language-processing tools segment text for processing.
[XML 1.0]
Extensible Markup Language (XML) 1.0 Second Edition. W3C (World Wide Web Consortium),
October 2000.
[XML Namespaces]
Namespaces in XML. W3C (World Wide Web Consortium), August 2006.
Non-Normative
[ISO]
International Organization for Standardization Web site.
[LISA]
Localisation Industry Standards Association Web site.
[Unicode]
Unicode Consortium Web site.
[W3C]
World Wide Web Consortium Web site.