Tuesday, June 16, 2020

Understanding Machine Translation Quality & Risk Prediction

Much of the commentary available on the use of machine translation (MT) in the translation industry today focuses heavily on assessing MT output quality, comparing TM vs. MT, and overall MTPE process management issues. However, the quality discussion remains muddy, and there are very few clear guidelines for making the use of MT consistently effective. There is much discussion of edit distance, post-editing productivity, and measurement processes like DQF, but much less about understanding the training and source corpora and developing strategies to make an MT engine produce better output. While the use of MT in localization use cases continues to expand as generic MT output quality improves, it is worth noting that MT is much more likely to deliver greater value in use cases other than localization.

It is estimated that trillions of words a day are translated by "free" MT portals across the world. This activity points to a huge need for language translation that goes far beyond the normal focus of the translation and localization industry. While a large portion of this use comes from consumers, a growing portion involves enterprise users.

ROI on MT use cases outside of localization tends to be much higher 

The largest users of enterprise MT today tend to be focused on eCommerce and eDiscovery use cases. Alibaba, Amazon, and eBay translate billions of words a month. eDiscovery tends to focus on the following use cases where many thousands of documents and varied data sources have to be quickly reviewed and processed:
  • Cross-border litigation usually related to intellectual property violations, product liability, or contract disputes.
  • Pharmacovigilance (PV or PhV), also known as drug safety, is the pharmacological science relating to the collection, detection, assessment, monitoring, and prevention of adverse effects of pharmaceutical products (and of pandemic-like diseases, e.g., Covid-19 incident reports).
  • National Security and Law Enforcement Surveillance of social media and targeted internet activity to identify and expose bad actors involved with drugs, terrorism, and other criminal activities.
  • Corporate information governance and compliance.
  • Customer Experience (CX) related communications and content.

Thus, as the use of MT becomes more strategic and pervasive, we also see a need for new kinds of tools and capabilities that can assist in the optimization process. This is a guest post by Adam Bittlingmayer, a co-founder of ModelFront, a developer of a new breed of machine learning-driven tools that assist and enhance MT initiatives across all the use cases described above. ModelFront describes what they do as follows: in research terms, we've built "massively multilingual black-box deep learning models for quality estimation, quality evaluation, and filtering", and productized it.

I met with Adam Bittlingmayer, co-founder of ModelFront, to talk about predicting translation risk and to share his experience automating translation at scale, both at giants like Facebook and Google and at startups like ModelFront, which is developing specialized capabilities to make MT translation risk prediction more efficient and effective.

The tools they provide are valuable in the development of better MT engines by doing the following:
  • Filtering parallel corpora used in training MT engines
  • Comparing MT engines with detailed error profiles
  • Rapid and more comprehensive translation error detection and correction
  • Enhanced human-machine collaboration infrastructure that is particularly useful in high-volume MT use cases

In his post, Adam explains some important terms that are often conflated and gives you a feel for MT development from the perspective of someone who has done this at scale at Google. Capabilities like these can help developers add value to any MT initiative, and they go well beyond the simple data-cleaning routines that many LSPs use. From my understanding, these tools can help good engines get better, but they are not magic that can suddenly improve the shoddy engines that many LSPs continue to build. Adam also provides some comparisons with leading localization industry tools so that readers can better understand ModelFront's capabilities and focus.

A key characteristic of the ModelFront platform, beyond its scalability, is the control it gives users to implement high levels of production automation. These processes can be embedded and integrated into MT development and production pipelines to enable better MT outcomes across a much larger range of use cases. Although ModelFront is already being used in localization MTPE use cases within highly automated and integrated translation management workflows, I believe the real potential for added value lies beyond the typical localization purview.

An example of how this kind of technology can add value even in early experimentation can be seen in recent research by Amazon on using quality estimation for subtitling. The researchers categorized subtitle translations into three categories: good translations, which are fine as is and need no further improvement; loose translations, which may require human post-edits; and bad translations, which need a "complete rewrite."

The researchers worked with 30,000 video subtitle files in English and their corresponding translations in French, German, Italian, Portuguese and Spanish for their experiments. They found that their DeepSubQE model was accurate in its estimations more than 91% of the time for all five languages. 
This would mean that human efforts can be focused on a much smaller set of data and thus yield a much better overall quality in less time.
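As a minimal sketch of how such a three-way split translates into workflow savings (the thresholds and scores below are illustrative, not Amazon's):

```python
def route_subtitle(qe_score):
    """Route a subtitle translation by its quality-estimation score.

    Thresholds are illustrative; in practice they are tuned per
    language pair on labeled data.
    """
    if qe_score >= 0.9:
        return "good"   # fine as is, ship raw MT
    if qe_score >= 0.5:
        return "loose"  # send to human post-editing
    return "bad"        # complete rewrite

# Only the middle band needs post-editors; the rest is handled automatically.
scores = [0.97, 0.62, 0.31, 0.88, 0.95]
to_post_edit = [s for s in scores if route_subtitle(s) == "loose"]
print(to_post_edit)  # [0.62, 0.88]
```

Human reviewers then see only the segments where their effort actually changes the outcome.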


Confidence scoring, quality estimation, and risk prediction

What is the difference between a quality estimation, a confidence score, and translation risk prediction?

Microsoft, Unbabel, and Memsource use the research-standard term quality estimation, while Smartling calls its feature a quality confidence score. There are even research papers that talk about confidence estimation or error prediction, and tools that use a fuzzy match percentage. Parallel data filtering approaches like Zipporah or LASER use a quality score or similarity score.

ModelFront uses risk prediction.

They're overlapping concepts and often used interchangeably - the distinctions are as much about tradition, convention, and use case as about inputs and outputs. They are all basically a score from 0.0 to 1.0 or 0% to 100%, at sequence-level precision or greater. Unlike BLEU, HTER, METEOR, or WER, they do not require a golden human reference translation.

We're interested in language, so we know the nuances in naming are important.

Confidence scoring

A machine translation confidence score is typically used for a machine translation's own bet about its own quality on the input sequence. A higher score correlates with higher quality.

It is typically based on internal variables of the translation system - a so-called glassbox approach. So it can't be used to compare systems or to assess human translation or translation memory matches.
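A common glassbox signal is the decoder's own per-token log-probabilities, length-normalized and mapped back to probability space. This toy sketch (the normalization scheme is one of several in use) shows why the score is tied to the system that produced it:

```python
import math

def glassbox_confidence(token_logprobs):
    """Length-normalized sequence probability as a 0-1 confidence score.

    Glassbox: the per-token log-probabilities come from the decoder's
    own internals, so the score is only meaningful for that one system
    and cannot score human translation or another engine's output.
    """
    if not token_logprobs:
        return 0.0
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_logprob)

# A confidently decoded sentence vs. a shaky one.
print(round(glassbox_confidence([-0.05, -0.10, -0.02]), 2))  # 0.94
print(round(glassbox_confidence([-1.20, -2.50, -0.90]), 2))  # 0.22
```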

Quality estimation

A machine translation quality estimate is based on a sequence pair - the source text and the translation text. Like a confidence score, a higher score correlates with higher quality.

It implies a pure supervised black-box approach, where the system learns from labeled data at training time but knows nothing about how the translation was produced at run time. It also implies the scoring of machine translation only.
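To make that interface concrete, here is a toy black-box scorer. Real quality estimation replaces these crude heuristics with a model trained on human-labeled pairs; the point is the signature - only the sequence pair goes in:

```python
def blackbox_qe(source, translation):
    """Toy stand-in for a learned quality-estimation model.

    Black-box: it sees only the (source, translation) pair and knows
    nothing about how the translation was produced, so it can score
    output from any MT system.
    """
    src_tokens = source.split()
    tgt_tokens = translation.split()
    if not tgt_tokens:
        return 0.0
    # Implausible length ratios are a weak but real quality signal.
    ratio = min(len(src_tokens), len(tgt_tokens)) / max(len(src_tokens), len(tgt_tokens))
    # An untranslated copy of the source is suspicious.
    copy_penalty = 0.5 if source.strip() == translation.strip() else 1.0
    return ratio * copy_penalty

print(blackbox_qe("The cat sat on the mat", "Le chat est assis sur le tapis"))
```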

This term is used in research literature and conferences, like the WMT shared task, and it is also the most common term in the context of the approach pioneered at Unbabel and Microsoft - safely auto-approving raw machine translation for as many segments as possible.

It's often contrasted with quality evaluation - a corpus-level score.

In practice, usage varies - researchers do talk about unsupervised and glassbox approaches to quality estimation, and about word-level quality estimation, and there's no reason that quality estimation could not be used for more tasks, like quality evaluation or parallel data filtering.

Risk prediction

A translation risk prediction is also based on a sequence pair - the source text and the translation text. A higher score correlates with a higher risk.

Like quality estimation, it implies a pure black-box approach. Unlike quality estimation, it can also be used for everything from parallel data filtering to quality assurance of human translation to corpus- or system-level quality evaluation.

Why did we introduce yet another name? Risk prediction is the term used at ModelFront because it's the most correct and it's what clients actually want, across all use cases.

Often it's impossible to say whether a translation is of high or low quality, because the input sequence is ambiguous or noisy. When the English Apple is translated into Spanish as Manzana or left as Apple, it makes no sense to say that both are low quality or medium quality - one of them is probably perfect. But it does make sense to say that, without more context, both are risky.

We also wanted our approach to explicitly break away from quality estimation's focus on post-editing distance or effort and from CAT tools' focus on rules-based translation memory matching, and to be future-proof as use cases and technology evolve.

ModelFront's risk prediction system will grow to include risk types and rich phrase- and word-level information.

Options for translation quality and risk

How to build or buy services, tools or technology for measuring translation quality and risk

Measuring quality and risk are fundamental to successful translation at scale. Both human and machine translation benefit from sentence-level and corpus-level metrics.

Metrics like BLEU are based on string distance to the human reference translations and cannot be used for new incoming translations, nor for the human reference translations themselves.
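A toy illustration of why: even a stripped-down, unigram-only cousin of BLEU needs a reference argument in its signature, which new incoming translations don't have.

```python
from collections import Counter

def unigram_precision(hypothesis, reference):
    """Reference-based scoring in the spirit of BLEU (unigrams only).

    The reference parameter is the point: without a trusted human
    translation to compare against, the metric cannot be computed.
    """
    hyp_counts = Counter(hypothesis.split())
    ref_counts = Counter(reference.split())
    # Clipped overlap: each hypothesis word counts only as often
    # as it appears in the reference.
    overlap = sum(min(n, ref_counts[w]) for w, n in hyp_counts.items())
    return overlap / max(1, sum(hyp_counts.values()))

print(unigram_precision("the cat sat on a mat",
                        "the cat sat on the mat"))  # 5/6 = 0.833...
```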

What are the options if you want to build or buy services, tools or technology for measuring the quality and risk of new translations?


Human evaluation

Whether just an internal human evaluation in a spreadsheet, user-reported quality ratings, an analysis of translator post-editing productivity and effort, or full post-editing, professional human linguists and translators are the gold standard.

There is significant research on human evaluation methods, and quality frameworks like MQM-DQF and even quality management platforms like TAUS DQF and ContentQuo for standardizing and managing human evaluations, as well as translators and language service providers offering quality reviews or continuous human labeling.


Translation platforms

Translation tools like Memsource, Smartling, and GlobalLink have features for automatically measuring quality bundled into their platforms. Memsource's feature is based on machine learning.


Rule-based QA tools

Xbench, Verifika, and LexiQA directly apply exhaustive, hand-crafted linguistic rules, configurations, and translation memories to catch common translation errors, especially human translation errors.

They are integrated into existing tools, and their outputs are predictable and interpretable. LexiQA is unique in its partnerships with web-based translation tools and its API.

Open-source libraries

If you have the data and the machine learning team and want to build your own system based on machine learning, there is a growing set of open-source options.

The most notable quality estimation frameworks are OpenKiwi from Unbabel and DeepQuest from the research group led by Lucía Specia. Zipporah from Hainan Xu and Philipp Koehn is the best-known library for parallel data filtering.

The owners of those repositories are also key contributors to and co-organizers of the WMT shared tasks on Quality Estimation and Parallel Corpus Filtering.

Massively multilingual libraries and pre-trained models like LASER are a surprisingly effective unsupervised approach to parallel data filtering when combined with other techniques like language identification, regexes, and round-trip translation.
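As a sketch of what such a combined filter can look like - the similarity argument stands in for a LASER-style cross-lingual score, and the other checks are cheap heuristics of the kind mentioned above (thresholds are illustrative):

```python
import re

def keep_pair(src, tgt, similarity, min_similarity=0.8):
    """Decide whether a (source, target) pair survives data filtering.

    `similarity` stands in for a LASER-style cross-lingual cosine
    similarity; the heuristics catch noise the similarity can miss.
    """
    if not src.strip() or not tgt.strip():
        return False  # empty side
    if src.strip() == tgt.strip():
        return False  # untranslated copy
    src_len, tgt_len = len(src.split()), len(tgt.split())
    if max(src_len, tgt_len) > 2 * min(src_len, tgt_len):
        return False  # implausible length ratio
    if re.findall(r"\d+", src) != re.findall(r"\d+", tgt):
        return False  # numbers don't line up across the pair
    return similarity >= min_similarity

print(keep_pair("Order 66 ships today.",
                "La commande 66 part aujourd'hui.", 0.91))  # True
```

Each check is weak alone, but stacked together they remove a surprising amount of noise before any model is trained.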

Internal systems

Unbabel, eBay, Microsoft, Amazon, Facebook, and others invest in in-house quality estimation research and development for their own use, mainly for the content that flows through their platforms at scale.

The main goal is to use raw machine translation for as much as possible, whether in an efficient hybrid translation workflow for localization or customer service, or simply to limit catastrophes on user- and business-generated content that is machine-translated by default.

Their approaches are based on machine learning.

Systems accessible as APIs, consoles or on-prem

ModelFront is the first and only API for translation risk prediction based on machine learning. With a few clicks or a few lines of code, you can access a production-strength system.

Our approach is developed fully in-house, extending ideas from the leading researchers in quality estimation and parallel data filtering, and from our own experience inside the leading machine translation provider.

We've productized it and made it accessible and useful to more players - enterprise localization teams, language service providers, platform and tool developers and machine translation researchers.

We have built into our APIs security, scalability, support for 100+ languages and 10K+ language pairs, handling of locales, encodings, formatting, tags, and file formats, integrations with the top machine translation API providers, and automated customization.

We provide our technology as an API and console for convenience, as well as on-prem deployments.

We continuously invest in curated parallel datasets and manually-labeled datasets and track emerging risk types as translation technology, use cases, and languages evolve.

1 comment:

  1. These last posts about MT are a fantastic read, thank you. Very interesting. Here is my two cents: I think that definitely there is a wall between MT developers and the "last mile" workers: post-editors.

    MT developers will benefit if they share a bit more of their technology with said workers. I mean, MT is the result of very refined pattern recognition technology, right? So, why not approach the correction and optimization with some of the same tools? Help the post-editors, empower them to leverage technology more efficiently and actually advance and gain new marketable skills.

    One specific example: regular expressions. A translator or editor that learns regex can save time, work more efficiently, perform advanced search & replace and many other things. Here, the apparently simple looking Notepad++ reveals itself as a far more powerful tool than traditional CAT tools.

    The sky is the limit. I can imagine a cyberpunk post-editor that is writing Python scripts on the fly to clean huge amounts of text data, detecting errors, running custom QA processes, all thanks to an ecosystem of NLP APIs that go far beyond traditional tasks.