Much of the commentary available on the use of machine translation (MT) in the translation industry today focuses heavily on assessing MT output quality, comparing TM vs. MT, and overall MTPE process management issues. However, the quality discussion remains muddy, and there are very few clear guidelines for making the use of MT consistently effective. There is much discussion about Edit Distance, post-editing productivity, and measurement processes like DQF, but much less about understanding the training and source corpora and developing strategies to make an MT engine produce better output. While the use of MT in localization continues to expand as generic MT output quality improves, it is worth noting that MT is much more likely to deliver greater value in use cases other than localization.
ROI on MT use cases outside of localization tends to be much higher:
- Cross-border litigation, usually related to intellectual property violations, product liability, or contract disputes.
- Pharmacovigilance (PV or PhV), also known as drug safety: the pharmacological science relating to the collection, detection, assessment, monitoring, and prevention of adverse effects of pharmaceutical products (and pandemic-like diseases, e.g. Covid-19 incident reports).
- National security and law enforcement: surveillance of social media and targeted internet activity to identify and expose bad actors involved with drugs, terrorism, and other criminal activities.
- Corporate information governance and compliance.
- Customer Experience (CX) related communications and content.
Thus, as the use of MT becomes more strategic and pervasive, we also see a need for new kinds of tools and capabilities that can assist in the optimization process. This is a guest post by Adam Bittlingmayer, a co-founder of ModelFront, a developer of a new breed of machine learning-driven tools that assist and enhance MT initiatives across all the use cases described above. ModelFront describes what they do as: In research terms, we've built "massively multilingual black-box deep learning models for quality estimation, quality evaluation, and filtering", and productized it. Their capabilities include:
- Filtering parallel corpora used in training MT engines
- Comparison of MT engines with detailed error profiles
- Rapid and more comprehensive translation error detection & correction capabilities
- Enhanced man-machine collaboration infrastructure that is particularly useful in high volume MT use cases
In one related study, researchers worked with 30,000 video subtitle files in English and their corresponding translations in French, German, Italian, Portuguese, and Spanish. They found that their DeepSubQE model was accurate in its estimations more than 91% of the time for all five languages.
=====
Confidence scoring, quality estimation, and risk prediction
What is the difference between a quality estimation, a confidence score, and translation risk prediction?
Microsoft, Unbabel, and Memsource use the research-standard term quality estimation, while Smartling calls its feature a quality confidence score. There are even research papers talking about confidence estimation or error prediction, and tools that use a fuzzy match percentage. Parallel data filtering approaches like Zipporah or LASER use a quality score or similarity score.
ModelFront uses risk prediction.
They're overlapping concepts and often used interchangeably - the distinctions are as much about tradition, convention, and use case as about inputs and outputs. They are all basically a score from 0.0 to 1.0 or 0% to 100%, at sequence-level precision or greater. Unlike BLEU, HTER, METEOR, or WER, they do not require a golden human reference translation.
We're interested in language, so we know the nuances in naming are important.
Confidence scoring
A machine translation confidence score typically represents a machine translation system's own bet about the quality of its output for a given input sequence. A higher score correlates with higher quality.
It is typically based on internal variables of the translation system - a so-called glass-box approach. So it can't be used to compare systems or to assess human translation or translation memory matches.
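To make the glass-box idea concrete, here is a minimal sketch that derives a confidence score from a translation model's own token probabilities. The specific model (an open MarianMT checkpoint via Hugging Face transformers) is purely an illustrative assumption; production systems compute this from their own internals.

```python
# A minimal glass-box confidence sketch: the geometric mean of the model's own
# token probabilities for the translation it just produced. The MarianMT
# checkpoint used here is an illustrative assumption, not any vendor's system.
import torch
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-es"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

source = "The agreement enters into force next year."
inputs = tokenizer(source, return_tensors="pt")

# Generate a translation (greedy, for simplicity) and keep the per-step scores.
out = model.generate(
    **inputs, num_beams=1, do_sample=False,
    output_scores=True, return_dict_in_generate=True,
)
step_logprobs = model.compute_transition_scores(
    out.sequences, out.scores, normalize_logits=True
)

# Higher average log-probability -> higher confidence, mapped back to 0.0-1.0.
confidence = torch.exp(step_logprobs.mean()).item()
translation = tokenizer.decode(out.sequences[0], skip_special_tokens=True)
print(f"{translation!r} confidence={confidence:.2f}")
```

Because the score comes from the system's own internals, it says nothing about a translation produced by a different engine or by a human, which is exactly the limitation noted above.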
Quality estimation
A machine translation quality estimate is based on a sequence pair - the source text and the translation text. Like a confidence score, a higher score correlates with higher quality.
It implies a pure supervised black-box approach, where the system learns from labeled data at training time but knows nothing about how the translation was produced at run time. It also implies the scoring of machine translation only.
This term is used in research literature and conferences, like the WMT shared task and is also the most common term in the context of the approach pioneered at Unbabel and Microsoft - safely auto-approving raw machine translation for as many segments as possible.
It's often contrasted with quality evaluation - a corpus-level score.
In practice, usage varies - researchers do talk about unsupervised and glass-box approaches to quality estimation, and about word-level quality estimation, and there's no reason that quality estimation could not be used for more tasks, like quality evaluation or parallel data filtering.
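As a rough illustration of the supervised black-box setup, the toy sketch below scores (source, translation) pairs with a simple regressor trained on hypothetical human quality labels. The LaBSE encoder and the tiny training set are assumptions for illustration only; real QE systems such as OpenKiwi use dedicated architectures and far more data.

```python
# A toy black-box quality estimator: it learns from labeled
# (source, translation, score) examples and sees nothing of the MT system
# that produced the translations. LaBSE features and labels are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import Ridge

encoder = SentenceTransformer("sentence-transformers/LaBSE")

def pair_features(pairs):
    src = encoder.encode([s for s, _ in pairs])
    tgt = encoder.encode([t for _, t in pairs])
    # Source embedding, target embedding, and their absolute difference.
    return np.hstack([src, tgt, np.abs(src - tgt)])

# Hypothetical human quality labels in [0, 1] for each pair.
train_pairs = [
    ("The cat is sleeping.", "El gato está durmiendo."),
    ("Delete the file.", "Eliminar el gato."),
]
train_labels = [0.95, 0.2]

qe_model = Ridge().fit(pair_features(train_pairs), train_labels)
print(qe_model.predict(pair_features([("Close the door.", "Cierra la puerta.")])))
```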
Risk prediction
A translation risk prediction is also based on a sequence pair - the source text and the translation text. A higher score correlates with a higher risk.
Like quality estimation, it implies a pure black-box approach. Unlike quality estimation, it can also be used for everything from parallel data filtering to quality assurance of human translation to corpus- or system-level quality evaluation.
Why did we introduce yet another name? Risk prediction is the term used at ModelFront because it is the most accurate description and it is what clients actually want, across all use cases.
Often it's impossible to say if a translation is of high quality or low quality because the input sequence is ambiguous or noisy. When the English "Apple" is translated to Spanish as "Manzana" or as "Apple", it makes no sense to say that both are low quality or medium quality - one of them is probably perfect. But it does make sense to say that, without more context, both are risky.
We also wanted our approach to explicitly break away from quality estimation's focus on post-editing distance or effort and CAT tools' focus on rules-based translation memory matching, and to be future-proof as use cases and technology evolve.
ModelFront's risk prediction system will grow to include risk types and rich phrase- and word-level information.
Options for translation quality and risk
How to build or buy services, tools or technology for measuring translation quality and risk
Measuring quality and risk is fundamental to successful translation at scale. Both human and machine translation benefit from sentence-level and corpus-level metrics.
Metrics like BLEU are based on string distance to the human reference translations and cannot be used for new incoming translations, nor for the human reference translations themselves.
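For contrast, here is a small sketch with the sacreBLEU library showing why BLEU only works once a reference exists; the sentences are made up for illustration.

```python
# BLEU compares a hypothesis against a golden human reference; without a
# reference there is nothing to compare to. Sentences are illustrative only.
import sacrebleu

hypotheses = ["The contract takes effect next year."]
references = [["The agreement enters into force next year."]]  # one reference stream

print(sacrebleu.corpus_bleu(hypotheses, references).score)

# A brand-new incoming translation has no reference yet, so no BLEU score is
# possible - which is exactly the gap quality estimation and risk prediction fill.
```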
What are the options if you want to build or buy services, tools or technology for measuring the quality and risk of new translations?
Humans
Whether it's just an internal human evaluation in a spreadsheet, user-reported quality ratings, an analysis of translator post-editing productivity and effort, or full post-editing, professional human linguists and translators are the gold standard.
There is significant research on human evaluation methods, and quality frameworks like MQM-DQF and even quality management platforms like TAUS DQF and ContentQuo for standardizing and managing human evaluations, as well as translators and language service providers offering quality reviews or continuous human labeling.
Features
Translation tools like Memsource, Smartling, and GlobalLink have features for automatically measuring quality bundled in their platforms. Memsource's feature is based on machine learning.
Tools
Xbench, Verifika, and LexiQA directly apply exhaustive, hand-crafted linguistic rules, configurations and translation memories to catch common translation errors, especially human translation errors.
They are integrated into existing tools, and their outputs are predictable and interpretable. LexiQA is unique in its partnerships with web-based translation tools and its API.
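To give a flavor of the kind of hand-crafted rule such tools apply, here is a tiny, simplified check that flags segments whose numbers do not match between source and target; real tools ship far more exhaustive and locale-aware rule sets, so this is only a sketch.

```python
# A simplified rules-based QA check: flag a segment if the digits in the source
# do not all reappear in the translation. Real QA tools also handle locale-
# specific number formats, dates, tags, terminology, and much more.
import re

NUMBER = re.compile(r"\d+(?:[.,]\d+)?")

def numbers_mismatch(source: str, target: str) -> bool:
    return sorted(NUMBER.findall(source)) != sorted(NUMBER.findall(target))

print(numbers_mismatch("Take 2 tablets every 8 hours.",
                       "Tome 2 comprimidos cada 8 horas."))      # False: numbers preserved
print(numbers_mismatch("Take 2 tablets every 8 hours.",
                       "Tome dos comprimidos cada ocho horas."))  # True: flagged for review
```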
Open-source libraries
If you have the data and the machine learning team and want to build your own system based on machine learning, there is a growing set of open-source options.
The most notable quality estimation frameworks are OpenKiwi from Unbabel and DeepQuest from the research group led by Lucía Specia. Zipporah from Hainan Xu and Philipp Koehn is the best-known library for parallel data filtering.
The owners of those repositories are also key contributors to and co-organizers of the WMT shared tasks on Quality Estimation and Parallel Corpus Filtering.
Massively multilingual libraries and pre-trained models like LASER are a surprisingly effective unsupervised approach to parallel data filtering when combined with other techniques like language identification, regexes, and round-trip translation.
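A minimal sketch of that idea, assuming the laserembeddings package as a convenient wrapper around LASER (its models must be downloaded separately), combined with a simple cosine-similarity threshold:

```python
# Unsupervised parallel data filtering sketch: keep a sentence pair only if its
# LASER embeddings are close enough. Uses the laserembeddings wrapper; models
# must first be fetched with `python -m laserembeddings download-models`.
import numpy as np
from laserembeddings import Laser

laser = Laser()

def keep_pair(src, tgt, src_lang="en", tgt_lang="es", threshold=0.8):
    v1 = laser.embed_sentences([src], lang=src_lang)[0]
    v2 = laser.embed_sentences([tgt], lang=tgt_lang)[0]
    cosine = float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
    return cosine >= threshold

print(keep_pair("Turn off the engine.", "Apague el motor."))           # likely True
print(keep_pair("Turn off the engine.", "El perro corre al parque."))  # likely False
```

In practice this is combined with the other techniques mentioned above, such as language identification and round-trip translation, to catch pairs an embedding similarity alone would miss.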
Internal systems
Unbabel, eBay, Microsoft, Amazon, Facebook, and others invest in in-house quality estimation research and development for their own use, mainly for the content that flows through their platforms at scale.
The main goal is to use raw machine translation for as much as possible, whether in an efficient hybrid translation workflow for localization or customer service, or just to limit catastrophes on user- and business-generated content that is machine translated by default.
Their approaches are based on machine learning.
Systems accessible as APIs, consoles or on-prem
ModelFront is the first and only API for translation risk prediction based on machine learning. With a few clicks or a few lines of code, you can access a production-strength system.
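As a rough idea of what that looks like in code, here is a hedged sketch of an HTTP call to a risk-prediction endpoint. The URL, authentication, parameter names, and response shape are illustrative assumptions, not ModelFront's documented API; consult their documentation for the real contract.

```python
# Illustrative only: the endpoint, authentication, field names, and response
# format below are assumptions, not ModelFront's actual API contract.
import requests

payload = {
    "source_language": "en",
    "target_language": "es",
    "rows": [
        {"original": "Apple", "translation": "Manzana"},
        {"original": "Apple", "translation": "Apple"},
    ],
}

resp = requests.post(
    "https://api.example-risk-prediction.com/v1/predict",  # hypothetical URL
    headers={"Authorization": "Bearer YOUR_API_TOKEN"},     # hypothetical auth
    json=payload,
    timeout=30,
)
resp.raise_for_status()

for row in resp.json().get("rows", []):
    # Expect something like a risk score between 0.0 and 1.0 per segment.
    print(row)
```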
Our approach is developed fully in-house, extending ideas from the leading researchers in quality estimation and parallel data filtering, and from our own experience inside the leading machine translation provider.
We've productized it and made it accessible and useful to more players - enterprise localization teams, language service providers, platform and tool developers and machine translation researchers.
We have built into our APIs security, scalability, support for 100+ languages and 10K+ language pairs, handling of locales, encodings, formatting, tags, and file formats, integrations with the top machine translation API providers, and automated customization.
We provide our technology as an API and console for convenience, as well as on-prem deployments.
We continuously invest in curated parallel datasets and manually-labeled datasets and track emerging risk types as translation technology, use cases, and languages evolve.
These last posts about MT are a fantastic read, thank you. Very interesting. Here are my two cents: I think there is definitely a wall between MT developers and the "last mile" workers: post-editors.
MT developers will benefit if they share a bit more of their technology with said workers. I mean, MT is the result of very refined pattern recognition technology, right? So, why not approach the correction and optimization with some of the same tools? Help the post-editors, empower them to leverage technology more efficiently and actually advance and gain new marketable skills.
One specific example: regular expressions. A translator or editor that learns regex can save time, work more efficiently, perform advanced search & replace and many other things. Here, the apparently simple looking Notepad++ reveals itself as a far more powerful tool than traditional CAT tools.
The sky is the limit. I can imagine a cyberpunk post-editor that is writing Python scripts on the fly to clean huge amounts of text data, detecting errors, running custom QA processes, all thanks to an ecosystem of NLP APIs that go far beyond traditional tasks.