The Challenge of Defining Translation Quality
The subject of "translation quality" has always been a challenging communication issue for the translation industry. It is particularly difficult to explain the concept in a straightforward way to an industry outsider or to a customer whose primary focus is building business momentum in international markets, and who is not familiar with localization-industry translation-quality-speak. Nowadays these customers tend to focus on creating and managing the dynamic, ever-changing content that enhances a global customer's digital journey, rather than the static content that is the more typical focus of localization managers. Thus, the conventional way in which LSPs discuss translation quality is not very useful to them. Since every LSP claims to deliver the "best quality" or "high quality" translations, it is difficult for these buyers to tell the difference in this service aspect from one provider to another. The quality claims of competing vendors thus essentially cancel each other out.
These customers also differ in other ways. They need large volumes of content translated rapidly at the lowest possible cost, yet at a quality level that is useful to the customer in digital interactions with the enterprise. Across millions of digital interactions with enterprise content, the linguistic perfection of translations is not a meaningful or achievable goal, given the volume, short shelf-life, and instant-turnaround expectations a digital customer will have.
As industry observer and critic Luigi Muzii describes it:
"Because of the intricacies related to the intrinsic nature of languages, objective measurement of translation quality has always been a much researched and debated topic that has borne very little fruit. The notion of understood quality level remains unsolved, together with any kind of generally accepted and clearly understood quality assessment and measurement."
The industry response to this need for a better definition of translation quality is deeply colored by the localization mindset, and thus we see the emergence of approaches like the Dynamic Quality Framework (DQF). Many critics consider DQF too cumbersome and detailed to implement for the fast-flowing modern content streams needed for a superior digital experience. While it can be useful in some limited localization use-case scenarios, it will surely confound and frustrate enterprise managers who are more focused on digital transformation imperatives. The ability to rapidly and cost-effectively handle and translate large volumes of DX-relevant content is increasingly a higher priority and needs a new and different view of quality monitoring. The quality of the translation does matter in delivering superior DX, but it has a lower priority than speed, cost, and digital agility.
While machines do most of the translation on the planet today, this does not mean that there is no role for higher value-added human translation (increasingly supported by CAT tools). If the content is a critical and high-impact communication, most of us understand that human oversight is critical for success in the business mission. And if translation involves finesse, nuance, and high art, it is probably best to leave the "translating" computers completely out of the picture.
However, in this age of digitally-driven business transformation and momentum, competent MT solutions are essential to the enterprise's mission. Increasingly, more and more content is translated and presented to target customers without EVER going through any post-editing modification. The business value of the translation is often defined by its utility to the consumer in a digital journey, basic understandability, availability on demand, and the overall CX impact, rather than linguistic perfection. Generally, usable accuracy and timely delivery matter more than perfect grammar and fluency. The phrase "good enough" is used both disparagingly and as a positive attribute for translation output that is useful to a customer even in a less-than-"perfect" state.
So we have a situation today where the term translation quality is often meaningless even in "human translation", because it cannot be described to an inexperienced buyer of translation services (or regular human beings) in a clear, objective, and consistently measurable way. Comparing different human translations of the same source material is often an exercise in frustration, or subjective preference at best. Every sentence can have multiple correct, accurate translations, so how do we determine which is the best? Since every LSP in the industry claims to provide the "best quality", such a claim is useless to a buyer who does not wish to wade through discussions of error counts, error categories, and error-monitoring dashboards that are sometimes used to illustrate translation quality.
Defining Machine Translation Output Quality
The MT development community has also had difficulty establishing a meaningful and widely useful comparative measurement for translation quality. Fortunately, they had assistance from the National Institute of Standards & Technology (NIST) and developed a methodology to compare the translation quality of multiple competing MT systems under carefully controlled evaluation protocols. NIST used a variant of BLEU scores and other measures of precision, recall, adequacy, and fluency to compare different MT systems rapidly in a standardized and transparent manner. Their efforts probably helped to establish BLEU as a preferred scoring methodology for rating both evolving and competing MT systems.
The competitive evaluation approach works when multiple systems are compared under carefully monitored test protocols, but it becomes less useful when an individual developer announces "huge improvements" in BLEU scores, as it is easy to make extravagant claims of improvement that are not easily validated. Some independent evaluations used by many today provide comparisons where several systems may actually have trained on the test sets: this is the equivalent of giving a student the exam with the answers before a formal test. This makes some publicly available comparisons done by independent parties somewhat questionable and misleading. Other reference-test-set-based metrics like hLEPOR, METEOR, chrF, ROUGE, and others are plagued by similar problems. These automated measurements are all useful, but they are unreliable indicators of absolute quality.
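To make the metric discussion concrete, here is a toy sketch of how BLEU works: clipped n-gram precision combined with a brevity penalty. This is a simplified single-reference, sentence-level version for illustration only; real evaluations use corpus-level aggregation, multiple references, and standardized tokenization, and the function below is my own sketch, not NIST's implementation.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def toy_bleu(candidate, reference, max_n=4):
    """Toy single-reference BLEU: geometric mean of clipped n-gram
    precisions (n = 1..max_n), scaled by a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngram_counts(cand, n)
        ref_ngrams = ngram_counts(ref, n)
        clipped = sum((cand_ngrams & ref_ngrams).values())  # clip to reference counts
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(clipped / total)
    if min(precisions) == 0:  # any empty n-gram overlap zeroes the score
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty discourages very short candidates.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * geo_mean

print(toy_bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
print(toy_bleu("the cat sat on a mat", "the cat sat on the mat") < 1.0)  # True
```

Even this toy version shows why BLEU rewards surface overlap with one particular reference: a different but equally correct translation of the same sentence can score poorly.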
Best practices today suggest that a combination of multiple automated measures needs to be used together with human assessments of MT output to really get a handle on the relative quality of different MT systems. Again, this quickly gets messy as soon as we start asking annoying questions like:
- What kind of content are we testing?
- Are we sure that these MT systems have not trained on the test data?
- What kind of translators are evaluating the different sets of MT output?
- How do these evaluators determine what is better and worse when comparing different correct translations?
- How many sentences are needed to make a meaningful assessment and draw accurate conclusions when comparing multiple MT systems' performance on the same source material?
So we see that conducting an accurate evaluation is difficult and messy, and it is easy to draw wrong conclusions from easy-to-make errors in the evaluation process.
However, in the last few years, several MT developers have claimed to produce MT systems that have achieved human parity. This has been especially true with the advent of neural machine translation. These claims are useful for creating a publicity buzz among ignorant journalists and fear amongst some translators, but usually disappoint anybody who looks more closely.
Achieving human parity for machine translation is an important milestone of machine translation research. However, the idea of computers achieving human quality level is generally considered unattainable and triggers negative reactions from the research community and end-users alike. This is understandable, as previous similar announcements have turned out to be overly optimistic.
The goal of achieving human parity has become a way to say that MT systems have gotten significantly better, as this Microsoft communication shows. I too was involved with the SDL claim of having "cracked Russian", which is yet another broad claim stating that human parity has been reached 😧.
Many, who are less skeptical than I am, will interpret that an MT engine that claims to have achieved human parity can ostensibly produce translations of equal quality to those produced by a human translator. This can indeed be true on a small subset of carefully selected test material, but alas we find that it is not usually true in general for much of what we submit with high expectations to these allegedly human parity MT engines. This is the unfortunate history of MT: over-promising and underdelivering. MT promises are so often empty promises 😏.
While many in the translation and research communities feel a certain amount of outrage over these exaggerated claims (based on MT output they see in the results of their own independent tests) it is useful to understand what supporting documentation is used to make these claims.
We should understand that at least among some MT experts there is no deliberate intent to deceive, and it is possible to do these evaluations with enough rigor and competence to make a reasonable claim of breakthrough progress, even if it falls short of the blessed state of human parity.
There are basically two definitions of human parity generally used to make this claim.
Definition 1. If a bilingual human judges the quality of a candidate translation produced by a human to be equivalent to one produced by a machine, then the machine has achieved human parity.
Definition 2. If there is no statistically significant difference between human quality scores for a test set of candidate translations from a machine translation system and the scores for the corresponding human translations, then the machine has achieved human parity.
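Definition 2 implies a concrete statistical procedure: collect human quality scores for MT output and for human translations of the same sentences, then test whether the mean difference could be due to chance. Below is a hypothetical sketch of such a test, a centered paired bootstrap on per-sentence scores; the function name and data are my own illustration, not taken from any of the studies discussed here.

```python
import random
import statistics

def parity_p_value(mt_scores, human_scores, n_boot=5000, seed=0):
    """Two-sided bootstrap p-value for the null hypothesis that MT and
    human translations receive the same average human quality score.
    Scores are paired per source sentence (e.g. 1-5 adequacy ratings)."""
    assert len(mt_scores) == len(human_scores)
    rng = random.Random(seed)
    diffs = [m - h for m, h in zip(mt_scores, human_scores)]
    observed = statistics.mean(diffs)
    # Center the differences so the null (no difference) holds, then count
    # how often a resampled mean is as extreme as the observed one.
    centered = [d - observed for d in diffs]
    extreme = 0
    for _ in range(n_boot):
        sample = [rng.choice(centered) for _ in range(len(centered))]
        if abs(statistics.mean(sample)) >= abs(observed):
            extreme += 1
    return extreme / n_boot

# Identical score distributions: no significant difference, i.e. "parity" by Definition 2.
print(parity_p_value([4, 5, 3, 4] * 5, [4, 5, 3, 4] * 5))  # 1.0
```

Note that a large p-value is only a failure to detect a difference; with small samples and noisy raters, that is weak evidence of parity, which is exactly the problem discussed below.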
Again, the devil is in the details, as the data and the people used in making the determination can vary quite dramatically. There are (50?) shades of grey rather than black-and-white facts in most cases. The most challenging issue is that human judges and evaluators are at the heart of the assessment process. These evaluators can vary in competence and expertise, ranging from bilingual subject matter experts and professionals to low-cost crowdsourced workers who earn pennies per evaluation. The other big problem is the messy, inconsistent, irrelevant, and biased data underlying the assessments.
Ensuring objective, consistent human evaluation is necessary but difficult to sustain on the required continuous and ongoing basis. If the underlying data used in an evaluation are fuzzy and unclear, we actually move toward obfuscation and confusion rather than clarity. This can be the scientific equivalent of fake news. MT engines evolve over time, and the better the feedback, the faster the evolution, if developers know how to use this feedback to drive continuous improvements.
Again, as Luigi Muzii states:
The problem with human evaluation is bias. The red-pen syndrome. Indeed, human evaluation of machine translation is known for being expensive, time-consuming, and often biased, and yet it is supposed to overcome the drawbacks introduced by the limited accuracy and approximation of automatic evaluation. ... Indeed, translation quality metrics have become more and more complex and overly detailed, and always seem to be based on the error-catching [and categorization] approach that has proved costly and unreliable thus far.
Useful Issues to Understand
While the parity claims can be roughly true for a small sample of evaluated sentences, it is difficult to extrapolate parity to a broader range of content because it is simply not possible to do machine translation output evaluation on an MT scale (millions of sentences). Some of the same questions that obfuscate quality discussions with human translation services also apply to MT. If we cannot define what a "good translation" is for a human, how is it possible to do this for a mindless, common-sense-free machine, where instruction and direction need to be explicit and clear?
Here are some validation and claim-verification questions that can help an observer understand the extent to which parity has been reached, or expose the deceptive marketing spin that may motivate the claims.
What was the test data used in the assessments?
MT systems are often tested and scored on news domain data which is most plentiful. This may not correlate well with system performance on the typical content in the global enterprise content domain. A broad range of different types of content needs to be included to make claims as extravagant as having reached human parity.
What is the quality of the reference test set?
In some cases, researchers found that the test sets had been translated and then back-translated with MTPE into the original source language. This could mean the content of the test sets was simplified from a linguistic perspective, and thus easier to machine translate. Ideally, test sets should be created by expert humans, contain original source material, and not consist of data translated from another language.
Who produced the reference human translations being used and compared?
The reference translations against which all judgments will be made should be "good" translations. Easily said, but not so easily done. If competent human translators produce the reference translations for the test set, the test process will be expensive. Thus, it is often more financially expedient to use MT or cheap translators to produce the test material. This can cause a positive bias for widely used MT systems like Google Translate.
How much data was used in the test to make the claim?
Often human assessments are done with as little as 50 sentences, and automated scoring is rarely done with more than 2,000 sentences. Thus, drawing conclusions on how any MT system will handle the next million sentences it will process is risky, and likely to be overly optimistic. For example, when an MT developer says that over 90% of the system’s output has been labeled as a human translation by professional translators, they may be looking at a sample of only 100 or so sentences. To then claim that human parity has been reached is perhaps overreaching.
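A simple confidence-interval calculation illustrates why such small samples are shaky ground. The sketch below computes a normal-approximation interval for a hypothetical "90% judged as human" result on 100 sentences; the function name and numbers are mine, for illustration only.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """95% normal-approximation confidence interval for a proportion."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

# Hypothetical: 90 of 100 sentences judged to be human translations.
lo, hi = proportion_ci(90, 100)
print(round(lo, 3), round(hi, 3))  # 0.841 0.959
```

Even ignoring rater noise and content-selection bias, the sample alone leaves roughly a ±6-point uncertainty band, and it says nothing about how the system behaves on the next million sentences from a different domain.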
Who is making the judgments and what are their credentials?
It is usually cost-prohibitive to use expert professional translators to make the judgments and thus evaluators are often acquired on crowdsourcing platforms where evaluator and translator competence is not easily ascertained.
It can be seen that doing an evaluation properly is a significant and expensive task, and MT developers have to do this continuously while building a system. The process needs to be efficient, fast, and consistent. It is often only possible to run such careful tests on the most mission-critical projects, and it is not realistic to follow all these rigorous protocols for typical low-ROI enterprise projects. This is why BLEU and other "imperfect" automated quality scores are so widely used: they provide developers with continuous feedback in a fast and cost-efficient manner, if they are done with care and rigor. Recently there has been much discussion about testing on documents, to assess understanding of context rather than just sentences. This will add complexity, cost, and difficulty to an already difficult evaluation process, and IMO will yield very small incremental benefits in evaluative and predictive accuracy. There is a need to balance improved process recommendations with cost and the benefit of improved predictability.
The Academic Response
Some findings from this report in summary:
“Professional translators showed a significant preference for human translation, while non-expert [crowdsourced] raters did not”.
“Human evaluation methods which are currently considered best practice fail to reveal errors in the output of strong NMT systems”
The authors recommend the following design changes to MT developers in their evaluation process:
- Appoint professional translators as raters
- Evaluate documents, not sentences
- Evaluate fluency on top of adequacy
- Do not heavily edit reference translations for fluency
- Use original source texts
Most developers would say that implementing all these recommendations would make the evaluation process prohibitively expensive and slow. The researchers here do agree, and welcome further studies into “alternative evaluation protocols that can demonstrate their validity at a lower cost.” Process changes need to be practical and reasonably possible, and again we see the need to balance improved process benefits with cost and improved predictability.
What Would Human Parity MT Look Like?
MT developers should refrain from making claims of achieving human parity until there is clear evidence that this is happening at scale. Most current claims of achieving parity are based on laughably small samples of 100 or 200 sentences. I think it would be useful to the user community at large for MT developers to refrain from making these claims until they can show all of the following:
- 90% or more of a large sample (>100,000 or even 1M sentences) is accurate and fluent and truly looks like it was translated by a competent human
- Catch obvious errors in the source and possibly even correct these before attempting to translate
- Handle variations in the source with consistency and dexterity
- Have at least some nominal amount of contextual referential capability
Note that these are things we would expect without question from an average translator. So why not from the super-duper AI machine?
Until we reach the point where all of the above is true, it would be useful to CLEARLY state the boundary limits of the claim with key parameters underlying the claim. Such as:
- How large the test set was (e.g. 90% of 50 sentences where parity was achieved)
- Descriptions of what kind of source material was tested
- How varied the test material was: sentences, paragraphs, phrases, etc...
- Who judged, scored, and compared the translations
For example, an MT developer might state a parity claim as follows:
We found that 45 of 50 original human-sourced sentences translated by the new MT system were judged by a team of three crowdsourced translator/raters as indistinguishable from the translations produced by two professional human translators. Based on this data, we claim the system has achieved "limited human parity".
Until the minimum set of capabilities is shown at the MT scale (>100,000 or even 1M sentences) we should tell MT developers to STFU and give us the claim parameters in a simple, clear, summarized way, so that we can weigh the reality of the data versus the claim for ourselves.
I am also skeptical that we will achieve human parity by 2029 as some "singularity" enthusiasts have been saying for over a decade.
"There is not the slightest reason to believe in a coming singularity. Sheer processing power [and big data] is not pixie dust that magically solves all your problems." — Steven Pinker
"… I’m skeptical, though, about science-fiction scenarios played out in the virtual reality of our imaginations. The imagined futures of the past have all been confounded by boring details: exponential costs, unforeseen technical complications, and insuperable moral and political roadblocks. It remains to be seen how far artificial intelligence and robotics will penetrate into the workforce. (Driving a car is technologically far easier than unloading a dishwasher, running an errand, or changing a baby.) Given the tradeoffs and impediments in every other area of technological development, the best guess is: much farther than it has so far, but not nearly so far as to render humans obsolete."
Recently some in the Singularity community have admitted that "language is hard" as you can see in this attempt to explain why AI has not mastered translation yet.
Michael Housman, a faculty member of Singularity University, explained that the ideal scenario for machine learning and artificial intelligence is something with fixed rules and a clear-cut measure of success or failure. He named chess as an obvious example, and noted machines were able to beat the best human Go player. This happened faster than anyone anticipated because of the game’s very clear rules and limited set of moves.
Housman elaborated, “Language is almost the opposite of that. There aren’t as clearly-cut and defined rules. The conversation can go in an infinite number of different directions. And then of course, you need labeled data. You need to tell the machine to do it right or wrong.”
Housman noted that it’s inherently difficult to assign these informative labels. “Two translators won’t even agree on whether it was translated properly or not,” he said. “Language is kind of the wild west, in terms of data.”
Perhaps, we need to admit that human parity MT at scale is not a meaningful or achievable goal. If it is not possible to have a super-competent human translator capable of translating anything and everything with equal ease, why do we presume a machine could?
Perhaps what we really need is an MT platform that can rapidly evolve in quality with specialized human feedback. Post-editing (MTPE) today is generally NOT a positive experience for most translators. But human interaction with the machine can be a significantly better and positive experience. Developing interactive and highly responsive MT systems that can assist, learn, and improve the humdrum elements of translation tasks instantaneously might be a better research focus. This may be a more worthwhile goal than having a God-like machine that can translate anything and everything at human parity.
Even in the AI-will-solve-all community, we know that "language is hard", so maybe we need more focus on improving the man-machine interface, the quality of the interaction, and more sophisticated collaborative models. Rapid evolution, intuitive and collaborative interaction, and instant learning seem like a more promising vision to me than crawling all the data on the web and throwing machine-learning pixie dust at your ten-trillion-word TM training data.
Getting to a point where the large majority of translators ALWAYS WANT TO USE MT, because it simply makes the work easier, more pleasant, and more efficient, is perhaps a better focus for the future. I would also bet that this different vision is a more likely path to better MT systems that consistently produce better output over millions of sentences.