Wednesday, November 8, 2017

Taking Translation Metadata Beyond Translation Memory Descriptors

This is a guest post on translation metadata by Luigi Muzii. Some may recall his previous post: The Obscure and Controversial Importance of Metadata. Luigi's view of translation metadata is much broader and more all-encompassing than most descriptions we see in the translation industry, which usually only reference TM descriptors. Beyond descriptors of the TM itself, metadata can also cover the various kinds of projects, the kinds of TM, the translators used, higher levels of an ontological organization, client feedback, profitability and other parameters that are crucial to developing meaningful key performance indicators (KPIs).

As we head into the world of AI-driven efficiencies, the quality of your data and the quality and sophistication of its management become significantly more strategic and important. I have observed over the years that LSPs struggle to gather data for MT engine training and that for many, if not most, the data sits in an unstructured and unorganized mass on network drives, where one is lucky to even see intelligible naming conventions and consistent formats. Many experts now say the data is even more important than the ML algorithms, which will increasingly become commodities. Look at the current hotshot on the MT technology block: Neural MT, which already has 4+ algorithmic foundations available for the asking (OpenNMT, TensorFlow, Nematus and Facebook's Fairseq). I bet more will appear and that the success of NMT initiatives will be driven more by the data than the algorithm.

Despite the hype, we should understand that deep learning algorithms are increasingly going to be viewed as commodities. It's the data where the real value is.

Good metadata implementations will also help to develop meaningful performance indicators and, as Luigi says, could very well be elemental to disintermediation. Getting data right is of strategic value, and IMO that is why something like DeepL is such a formidable entrant. DeepL very likely has data in a much more organized structure that is metadata rich and can be re-deployed in any/many combinations with speed and accuracy. Data organization, I expect, will become a means and instrument for LSPs to develop strategic advantage as good NMT solution platforms become ubiquitous.
 ** ------------------- **

If you do not read about the subject of this post here, you are hardly likely to read about it elsewhere: many people love to talk and write about metadata, but rarely actually care about it. This is because, usually, no one is willing to spend much of their time filling out forms. Although this is undoubtedly a boring task, there is nothing trivial in assembling compact and yet comprehensive data to describe a job, however simple or complex, small or huge.

On the other hand, this monitoring and documenting activity is a rather common task for any project manager. In fact, in project management, a project charter must always be compiled stating scope, goals, stakeholders, and outlining roles and responsibilities. This document serves as a reference for the statement of work defining all tasks, timelines, and deliverables.

When part of a larger project, translation is managed as a specific task, but this does not exempt the team in charge from collecting and providing the data needed to execute this task. This data ranges from working instructions to project running time, from team members involved to costs, etc. Yet even LSPs and translation buyers, who might benefit from documenting jobs and procedures whatever the type and size of the project or the task, often skip this step.

The data describing these projects and tasks is called metadata, and the information it provides can be used for ongoing or future discovery, identification or management. Metadata can sometimes be captured by computers, but more often it has to be created manually. Alas, translation project managers and translators often neglect to create this metadata, or they do not create enough of it, or what they create is not accurate enough; this makes metadata scarce and partial, and rapidly irrelevant.

The Importance of Metadata

On the other hand, metadata is critical for the extraction of business intelligence from workflows and processes. In fact, to produce truly useful stats and get practical KPIs, automatically-generated data is insufficient for any business inference whatsoever and the collation of relevant data is crucial for any measurement efforts to be effective.

The objective of this measurement activity is all about reducing uncertainty, which is critical to business. Translation could well be a very tiny fraction of a project, and although it is small, no buyer is willing to put a project at risk on independent variables that are not properly understood. Therefore, to avoid guessing, buyers require factual data to assess their translation effort, to budget it, and to evaluate the product they will eventually receive.

Every LSP should then first be capable of identifying what is important from the customer’s perspective, to make its efforts more efficient, cost-effective, and insightful. In this respect, measurements enable a company to have an accurate pulse on the business.

Measurements should be taken against pre-specified benchmarks to derive indicators and align daily activities with strategic goals, and analytics are essential to unlocking relevant insights, with data being the lifeblood of analytics. At the same time, measurements allow buyers to assess vendor capability and reliability.

ERP and CRM systems are commonly used in most industries to gather, review and adjust measurements. TMSs are the lesser equivalent of those systems in the translation industry.
In a blog post dating back to 2011, Kirti Vashee asked why there are so many TMS systems (given the size of the industry and the average size of its players), each one with a tiny installed base. The answer was in the question itself: because every LSP and corporate localization department thinks that its translation project management process is so unique that it can only be properly automated by creating a new TMS.

The Never-ending and Unresolved Standards Initiatives

More or less the same thing happens when it comes to any industry discussion on standards, with any initiative starting with an overstated claim and an effort focused on covering every single aspect of the topic addressed, no matter how vague or huge, in contrast with the spirit of standardization, which should result from a general consensus on straightforward, lean, and flexible guidelines.

In the same post, Kirti Vashee also reported that Jaap van der Meer predicted at LISA’s final standards summit event that GMS/TMS would disappear over time, in favor of plug-ins to other systems. Apparently, he also said that TMs would be dead in 5 years or less. Niels Bohr is often misquoted as saying that predictions are always hard, especially about the future.

While translation tools as we have known them for almost three decades have now lost centrality, they are definitely not dead, and GMS/TMS systems have not disappeared either. Three years from now, we will see whether Grant Straker’s prediction that a third of all translation companies would disappear by 2020 due to technology disruption proves right.

Technology has been lowering costs, but it is not responsible for increasing margin erosion. People who cannot make the best use of technology are. The next big thing in the translation industry might, in fact, be the long-announced and long-awaited disintermediation. Having made a significant transition to the cloud, and having learned how to exploit and leverage data, companies in every industry are moving to API platforms. As usual, the translation industry is reacting quite slowly and randomly. This is essentially another consequence of the industry’s fragmentation, which also leads industry players into the contradiction of considering their business too unique to be properly automated, due to its creative and artistic essence, and yet trying to standardize every aspect of it.

In fact, ISO 17100, ASTM F2575-14 and even ISO 18587 on post-editing of machine translation each contain a special annex or a whole chapter on project specifications and registration or parameters, while a technical specification, ISO/TS 11669, has been issued on this topic.

Unfortunately, in most cases, all these documents reflect the harmful conflation of features with requirements that is typical of the translation industry. Another problem is the confusion coming from the lack of agreement on the terms used to describe the steps in the process. Standards did not solve this problem, thus proving essentially uninteresting for industry outsiders.

The overly grand ambitions of any new initiative are a primary reason for them being doomed to irrelevance, while gains may be made by starting with smaller, less ambitious goals.

Why Metadata is Important

Metadata is one of the pillars for disintermediation, along with protocols governing how it is managed and exchanged between systems, and the exact exchange format.

In essence, metadata follows the partitioning of the translation workflow into its key constituents:
  • Project (the data that is strictly relevant to its management);
  • Production (the data that pertains to the translation process);
  • Business (the transaction-related data).
Metadata in each area can then be divided into essential and ancillary (optional). To know which metadata is essential in each area, find where and how it can be used.

KPIs typically fall within the scope of metadata, especially project metadata, and their number depends on available and collectible data. Most of the data to “feed” a KPI dashboard can, in fact, be retrieved from a job ticket, and the more detailed a job ticket is, the more accurate the indicators are.
For example, from a combination of project, production and business metadata, KPIs can be obtained to better understand which language pairs, customers, services and domains are most profitable. Cost-effectiveness can also be measured through cost, quality and timeliness indicators.

A process quality indicator may be computed out of other performance indicators such as the rate of orders fulfilled in full and on time, the average time from order to customer receipt, the percentage of units coming out of a process with no rework and/or the percentage of items inspected requiring rework.
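To make this concrete, here is a minimal sketch of how three of these component indicators could be derived from a handful of job records. The `Job` class, its field names, and the sample dates are hypothetical and only serve the illustration; how the components are weighted into a single process quality indicator is deliberately left open.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Job:
    # Hypothetical minimal record; the field names are assumptions.
    order_date: date
    due_date: date
    delivery_date: date
    reworks: int  # number of rework rounds the job needed


def on_time_rate(jobs: List[Job]) -> float:
    """Share of jobs delivered on or before the due date."""
    return sum(j.delivery_date <= j.due_date for j in jobs) / len(jobs)


def avg_turnaround_days(jobs: List[Job]) -> float:
    """Average number of days from order to delivery."""
    return sum((j.delivery_date - j.order_date).days for j in jobs) / len(jobs)


def no_rework_rate(jobs: List[Job]) -> float:
    """Share of jobs that came out of the process with no rework."""
    return sum(j.reworks == 0 for j in jobs) / len(jobs)


# Invented sample data, for illustration only.
jobs = [
    Job(date(2017, 10, 2), date(2017, 10, 6), date(2017, 10, 5), 0),
    Job(date(2017, 10, 3), date(2017, 10, 10), date(2017, 10, 12), 1),
    Job(date(2017, 10, 9), date(2017, 10, 13), date(2017, 10, 13), 0),
    Job(date(2017, 10, 16), date(2017, 10, 20), date(2017, 10, 19), 2),
]

print(f"On-time rate:    {on_time_rate(jobs):.0%}")
print(f"Avg. turnaround: {avg_turnaround_days(jobs):.2f} days")
print(f"No-rework rate:  {no_rework_rate(jobs):.0%}")
```

Note that all three figures rest on just four fields per job, which is why an accurately filled job ticket matters more than a sophisticated tool.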

The essential metadata allowing for the computation of basic translation KPIs might be the following:
  • Project
      • Unique identifier
      • Project name
      • Client’s name
      • Client’s contact person
      • Order date
      • Start date
      • Due date
      • Delivery date
      • PM’s name
      • Vendor name(s)
      • Status
      • Rework(s)
  • Production
      • Source language
      • Target language(s)
      • Scope of work (type of service(s))
      • TM
      • Percentage of TM used
      • MT
      • Term base
      • Style guide
      • QA results
  • Business
      • Volume
      • Initial quotation
      • Agreed fee
      • Discount
      • Currency
      • Expected date of payment
      • Actual date of payment
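
As an illustration, the fields listed above could be modeled as a simple job ticket structure along the following lines. This is only a sketch: the class names, field names, types, and sample values are all assumptions made for the example, not part of any standard or of any TMS.

```python
from dataclasses import asdict, dataclass
from datetime import date
from decimal import Decimal
from typing import List, Optional


@dataclass
class ProjectData:
    project_id: str
    name: str
    client: str
    client_contact: str
    order_date: date
    start_date: date
    due_date: date
    pm: str
    vendors: List[str]
    status: str
    delivery_date: Optional[date] = None  # unknown until delivery
    reworks: int = 0


@dataclass
class ProductionData:
    source_language: str
    target_languages: List[str]
    services: List[str]                   # scope of work
    tm: Optional[str] = None
    tm_used_pct: Optional[float] = None   # percentage of TM used
    mt: Optional[str] = None
    term_base: Optional[str] = None
    style_guide: Optional[str] = None
    qa_results: Optional[str] = None


@dataclass
class BusinessData:
    volume: int                           # e.g. source words
    initial_quotation: Decimal
    agreed_fee: Decimal
    currency: str
    discount: Decimal = Decimal("0")
    expected_payment_date: Optional[date] = None
    actual_payment_date: Optional[date] = None


@dataclass
class JobTicket:
    project: ProjectData
    production: ProductionData
    business: BusinessData


# Invented sample ticket, for illustration only.
ticket = JobTicket(
    project=ProjectData(
        project_id="P-001", name="User manual", client="ACME",
        client_contact="J. Doe", order_date=date(2017, 10, 2),
        start_date=date(2017, 10, 3), due_date=date(2017, 10, 10),
        pm="A. Rossi", vendors=["Vendor A"], status="in progress",
    ),
    production=ProductionData(
        source_language="en", target_languages=["it"],
        services=["translation", "revision"], tm_used_pct=35.0,
    ),
    business=BusinessData(
        volume=12000, initial_quotation=Decimal("1000.00"),
        agreed_fee=Decimal("960.00"), currency="EUR",
        discount=Decimal("40.00"),
    ),
)

# A nested dict is a convenient starting point for export or KPI processing.
record = asdict(ticket)
```

Fields that are only known later in the job's life, such as the delivery date or the actual date of payment, are optional here so that a ticket can be created at order time and completed as the job progresses.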

Although translation may be seen as a sub-task of a larger project, it may also be seen as a project itself. This is especially true when a translation is broken down into chunks to be apportioned to multiple vendors for multiple languages or even for a single language in case of large assignments and limited time available.

In this case, the translation project is split into tasks and each task is allotted to a work package (WP). Each WP is then assigned a job ticket with a group ID so that all job tickets pertaining to a project can eventually be consolidated for any computations.

This will allow for associating a vendor and the relevant cost(s) to each WP for subsequent processing.
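
A rough sketch of this consolidation step might look as follows. The `WorkPackage` class, its fields, and the sample figures are hypothetical; the only idea taken from the text is that every work package carries the group ID of its parent project, so that volumes, vendors, and costs can be rolled up per project.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class WorkPackage:
    group_id: str         # shared by all job tickets of one project
    wp_id: str
    vendor: str
    target_language: str
    words: int
    vendor_cost: float


def consolidate(packages: List[WorkPackage]) -> Dict[str, dict]:
    """Roll up word counts, costs and vendors per group ID."""
    totals = defaultdict(lambda: {"words": 0, "cost": 0.0, "vendors": set()})
    for wp in packages:
        entry = totals[wp.group_id]
        entry["words"] += wp.words
        entry["cost"] += wp.vendor_cost
        entry["vendors"].add(wp.vendor)
    return dict(totals)


# Invented sample data: one large project split across three vendors,
# plus a small single-vendor project.
packages = [
    WorkPackage("P-001", "WP1", "Vendor A", "de", 4000, 320.0),
    WorkPackage("P-001", "WP2", "Vendor B", "de", 6000, 480.0),
    WorkPackage("P-001", "WP3", "Vendor C", "fr", 10000, 750.0),
    WorkPackage("P-002", "WP1", "Vendor A", "it", 2000, 150.0),
]

summary = consolidate(packages)
for group_id, entry in sorted(summary.items()):
    print(group_id, entry["words"], entry["cost"], sorted(entry["vendors"]))
```

Once the vendor costs are consolidated per project, comparing them with the agreed fee from the business metadata gives a per-project margin.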

Most of the above metadata can be automatically generated by a computer system to populate the fields of a job ticket. This information might then be sent along with the processed job (in the background) as an XML, TXT, or CSV file, and stored and/or exchanged between systems.
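
As a sketch of what such an exchange file could look like, the snippet below writes a flat job ticket to XML and to CSV using only the Python standard library. The element and column names (`jobTicket`, `project_id`, and so on) and the sample values are invented for the example; no existing schema or exchange standard is implied.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical flat job ticket; keys and values are assumptions.
ticket = {
    "project_id": "P-001",
    "source_language": "en",
    "target_language": "it",
    "volume": "12000",
    "currency": "EUR",
    "agreed_fee": "960.00",
}


def to_xml(ticket: dict) -> str:
    """Serialize one ticket as a flat XML document."""
    root = ET.Element("jobTicket")
    for key, value in ticket.items():
        ET.SubElement(root, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


def to_csv(tickets: list) -> str:
    """Serialize several tickets as CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(tickets[0].keys()))
    writer.writeheader()
    writer.writerows(tickets)
    return buffer.getvalue()


xml_text = to_xml(ticket)
csv_text = to_csv([ticket])
print(xml_text)
print(csv_text)
```

The mechanics are trivial; what makes exchange between systems work is agreeing on the field names and their meaning.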

To date, the mechanisms for compiling job tickets are not standardized in TMS systems, and metadata is often labeled differently too. And yet, the many free Excel-based KPI tools available to process this kind of data basically confirm that this is not a complicated task.

To date, however, TMS systems do not seem to pay much attention to KPIs and to the processing of project data and focus more on language-related metadata. In fact, translation tools and TMS systems all add different types of metadata to every translation unit during processing. This is because metadata is used only for basic workflow automation, to identify and search translatable and untranslatable resources, provide translatable files to suitable translators, identify which linguistic resources have been used, what status a translation unit has, etc. Also, the different approach every technology provider adopts to manipulate the increasingly common XLIFF format makes metadata exchange virtually impossible; indeed, data, as well as metadata, are generally stripped away when fully compliant XLIFF files are produced.

This article is meant as a position paper to present my opinion about a topic that has recently risen to prominence and is now in the spotlight thanks to GALA’s TAPICC initiative, for which I’m volunteering, in the hope of putting the debate on practical and factual tracks.

For any advice on this and other topics related to authoring, translation, and the associated technologies, the author can be reached via email or Skype.


This supplement to the post contains some excerpts from the annexes to the two major industry standards, ISO 17100 Translation Services — Requirements for translation services and ISO/TS 11669 Translation projects — General guidance.

The first excerpt comes from Annex B (Agreements and project specifications) and Annex C (Project registration and reporting) to ISO 17100. The second excerpt comes from clause 6.4 Translation parameters of ISO/TS 11669.

These data items are perfectly suitable candidates for ancillary (optional) metadata.
All excerpts are provided for further investigation and comments.

ISO 17100

Annex B (Agreements and project specifications)

  1. scope,
  2. copyright,
  3. liability,
  4. confidentiality clauses,
  5. non-disclosure agreements (NDAs),
  6. languages,
  7. delivery dates,
  8. project schedule,
  9. quotation and currency used,
  10. terms of payment,
  11. use of translation technology,
  12. materials to be provided to the TSP by the client,
  13. handling of feedback,
  14. warranties,
  15. dispute resolution,
  16. choice of governing law.

Annex C (Project registration and reporting)

  1. unique project identifier,
  2. client’s name and contact person,
  3. dated purchase order and commercial terms, including quotations, volume, deadlines and delivery details,
  4. agreement and any ancillary specifications or related elements, as listed in Annex B,
  5. composition of the TSP project team and contact-person,
  6. source and target language(s),
  7. date(s) of receipt of source language content and any related material,
  8. title and description of source [language] content,
  9. purpose and use of the translation,
  10. existing client or in-house terminology or other reference material to be used,
  11. client’s style guide(s),
  12. information on any amendments to the commercial terms and changes to the translation project.

ISO/TS 11669

Translation parameters

  1. source characteristics
      a. source language
      b. text type
      c. audience
      d. purpose
  2. specialized language
      a. subject field
      b. terminology
  3. volume
  4. complexity
  5. origin
  6. target language information
      a. target language
      b. target terminology
  7. audience
  8. purpose
  9. content correspondence
  10. register
  11. file format
  12. style
      a. style guide
      b. style relevance
  13. layout
  14. typical production tasks
      a. preparation
      b. initial translation
      c. in-process quality assurance
          i. self-checking
          ii. revision
          iii. review
          iv. final formatting
          v. proofreading
  15. additional tasks
  16. technology
  17. reference materials
  18. workplace requirements
  19. permissions
      a. copyright
      b. recognition
      c. restrictions
  20. submissions
      a. qualifications
      b. deliverables
      c. delivery
      d. deadline
  21. expectations
      a. compensation
      b. communication



Luigi Muzii has been in the "translation business" since 1982 and has been a business consultant since 2002, in the translation and localization industry through his firm. He focuses on helping customers choose and implement best-suited technologies and redesign their business processes for the greatest effectiveness of translation and localization-related work.

This link provides access to his other blog posts.


  1. Thanks, Luigi. Great overview. Missing, inadequate or incorrect metadata, and the misuse of what is captured, certainly are responsible for the woes you describe and many more. You list Project, Production and Business as three top-level metadata categories. I think segment-level TU-metadata should be another.

    If we look at speech recognition as an example, every speech utterance (speech unit) is accurately tagged with metadata that affects the technology's performance: male/female, age, regional accent, etc. TMs have the mechanism to preserve this TU-level metadata (tags and attributes) but all too often, it is missing, irrelevant, or simply wrong. These erroneous data are responsible for the poor performance of many MT systems.

    For most of the 30-year history of TMs, TU-metadata added little or no value. Translators and agencies grew lazy. Despite corpus linguistics having been the de facto standard for 10 years, few practitioners have changed to capture accurate TU-metadata. If we want corpus-based MT systems (SMT and NMT) to work better, and there's plenty of room for improvement, it's time we start tagging TUs with accurate metadata.

  2. With TAPICC, I'm reminded of the continual struggle this industry faces in creating a new, over-arching standard to replace the 10 standards that came before it, only to end up with 11 standards that one may opt to choose. :)

  3. This comment is a reply to Tom Hoar’s and Stephen Holmes’s comments.
    Several months ago, in a previous post, here (in English) and on my own blog (in Italian), I had already addressed the metadata topic in a wider and more generic perspective. Then, I stressed how language data is nearly useless without relevant metadata, which allows language data to be collected, organized, cleaned and explored for potential re-use, especially with MT engine training.
    Translation memories have made the “datafication” of translation possible, and this, in turn, enabled phrase-based and now neural machine translation. Indeed, statistical and neural machine translation are the real and most useful technology we have, and it is transformative (I might dare say disruptive.)
    At the very same level, the translation industry has been investing most of its resources, as well as the best of them, in a position war, digging trenches and laying barbed wire. Translators, LSPs, and tech vendors are equally responsible (and should be openly blamed) for the absurd self-defeating outcome, the no-man’s land of TMX. Not only is interoperability an illusion, it is now a delusion. What’s worse, the increasingly vast amount of valuable language data that is still being produced is fated to irrelevance. In the best scenarios, it can be exploited for less than a half of its potential.
    TM tools do indeed have the capacity to add and store TU-level metadata, but translators are not taught the importance of metadata and, even less, its use and how to leverage it. The commands envisioning and shaping trenches and validating excavation are mostly in the universities. The reason is quite simple: corpus linguistics and translation studies (traductologie, Uebersetzungswissenschaft, aargh!) are academically more profitable.
    The poor state of TMX is a direct consequence of the same mindset, which leads industry players to produce useless standards (e.g. the quality standards condescendingly aiming at getting rid of a general, globally recognized standard considered too costly and ineffectual) only doomed to remain unnoticed or poorly considered when not disparaged by customers.
    This does not mean, however, that when a new initiative, like TAPICC, is launched no effort should be made to make it work. Hope is always the last to die. Even after 35+ years in the industry.