The Challenge of Defining Translation Quality
As industry observer and critic Luigi Muzii describes it: "Because of the intricacies related to the intrinsic nature of languages, objective measurement of translation quality has always been a much researched and debated topic that has borne very little fruit. The notion of understood quality level remains unsolved, together with any kind of generally accepted and clearly understood quality assessment and measurement."
Defining Machine Translation Output Quality
- What test data are we using?
- Are we sure that these MT systems have not trained on the test data?
- What kind of translators are evaluating the different sets of MT output?
- How do these evaluators determine what is better and worse when comparing different correct translations?
- How many sentences are needed to make a meaningful assessment and draw accurate conclusions when comparing multiple MT systems' performance on the same source material?
Achieving human parity for machine translation is an important milestone of machine translation research. However, the idea of computers achieving human quality level is generally considered unattainable and triggers negative reactions from the research community and end-users alike. This is understandable, as previous similar announcements have turned out to be overly optimistic.
Definition 1. If a bilingual human judges the quality of a candidate translation produced by a human to be equivalent to one produced by a machine, then the machine has achieved human parity.

Definition 2. If there is no statistically significant difference between human quality scores for a test set of candidate translations from a machine translation system and the scores for the corresponding human translations, then the machine has achieved human parity.
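Definition 2 can be operationalized with a standard significance test. Below is a minimal sketch using a paired permutation test on segment-level quality scores; the scores, the 0-100 scale, and the sample size are illustrative assumptions, not figures from any cited study:

```python
import random
import statistics

def paired_permutation_test(mt_scores, human_scores, n_perm=10_000, seed=0):
    """Two-sided paired permutation test on the mean score difference.

    Under Definition 2, parity would be claimed when this test FAILS
    to reject the null hypothesis of equal mean quality scores.
    """
    rng = random.Random(seed)
    diffs = [m - h for m, h in zip(mt_scores, human_scores)]
    observed = statistics.mean(diffs)
    count = 0
    for _ in range(n_perm):
        # Randomly flip the sign of each paired difference.
        permuted = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.mean(permuted)) >= abs(observed):
            count += 1
    return count / n_perm  # p-value

# Hypothetical adequacy scores (0-100) for the same 20 source segments.
mt = [78, 82, 75, 90, 68, 85, 77, 80, 72, 88, 79, 81, 74, 86, 70, 83, 76, 84, 73, 87]
ht = [80, 85, 74, 92, 70, 86, 78, 82, 75, 90, 80, 83, 76, 88, 71, 85, 77, 86, 74, 89]
p = paired_permutation_test(mt, ht)
print(f"p-value: {p:.4f}")
```

Note that "no significant difference" on a small, easy test set is weak evidence: with few segments the test simply lacks the power to detect real quality gaps, which is exactly the sample-size question raised above.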
The problem with human evaluation is bias: the red-pen syndrome. Indeed, human evaluation of machine translation is known for being expensive, time-consuming, and often biased, and yet it is supposed to overcome the drawbacks introduced by the limited accuracy and approximation of automatic evaluation. ... Indeed, translation quality metrics have become more and more complex and overly detailed, and always seem to be based on the error-catching [and categorization] approach that has proved costly and unreliable thus far.
Useful Issues to Understand
The Academic Response
- Appoint professional translators as raters
- Evaluate documents, not sentences
- Evaluate fluency on top of adequacy
- Do not heavily edit reference translations for fluency
- Use original source texts
What Would Human Parity MT Look Like?
- 90% or more of a large sample (>100,000 or even 1M sentences) judged accurate and fluent, truly looking as if translated by a competent human
- Catch obvious errors in the source and possibly even correct these before attempting to translate
- Handle variations in the source with consistency and dexterity
- Have at least some nominal amount of contextual referential capability
Any claim of parity should also disclose:
- How large the test set was (e.g., 90% of 50 sentences where parity was achieved)
- What kind of source material was tested
- How varied the test material was: sentences, paragraphs, phrases, etc.
- Who judged, scored, and compared the translations
We found that 45 of a sample of 50 original, human-sourced sentences translated by the new MT system were judged by a team of three crowdsourced translator/raters to be indistinguishable from the translations produced by two professional human translators. Based on this data, we claim the system has achieved "limited human parity".
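A quick confidence-interval calculation shows why a 50-sentence sample supports only a very weak claim. The sketch below uses the Wilson score interval for a binomial proportion; the 45/50 figure comes from the hypothetical claim above, and the 45,000/50,000 comparison is an illustrative assumption:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    phat = successes / n
    denom = 1 + z**2 / n
    centre = (phat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(phat * (1 - phat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# The hypothetical claim: 45 of 50 sentences judged indistinguishable.
lo, hi = wilson_interval(45, 50)
print(f"45/50:            95% CI [{lo:.3f}, {hi:.3f}]")

# The same 90% rate on a sample 1,000x larger.
lo2, hi2 = wilson_interval(45_000, 50_000)
print(f"45,000/50,000:    95% CI [{lo2:.3f}, {hi2:.3f}]")
```

With 50 sentences the interval spans roughly 0.79 to 0.96, so the true "parity rate" could plausibly be anywhere in that wide range; only at the scale of tens of thousands of sentences does the estimate tighten enough to support a strong claim.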
"There is not the slightest reason to believe in a coming singularity. Sheer processing power [and big data] is not pixie dust that magically solves all your problems."Steven Pinker
"… I’m skeptical, though, about science-fiction scenarios played out in the virtual reality of our imaginations. The imagined futures of the past have all been confounded by boring details: exponential costs, unforeseen technical complications, and insuperable moral and political roadblocks. It remains to be seen how far artificial intelligence and robotics will penetrate into the workforce. (Driving a car is technologically far easier than unloading a dishwasher, running an errand, or changing a baby.) Given the tradeoffs and impediments in every other area of technological development, the best guess is: much farther than it has so far, but not nearly so far as to render humans obsolete."
Recently some in the Singularity community have admitted that "language is hard" as you can see in this attempt to explain why AI has not mastered translation yet.
Michael Housman, a faculty member of Singularity University, explained that the ideal scenario for machine learning and artificial intelligence is something with fixed rules and a clear-cut measure of success or failure. He named chess as an obvious example and noted machines were able to beat the best human Go player. This happened faster than anyone anticipated because of the game’s very clear rules and limited set of moves.
Housman elaborated, “Language is almost the opposite of that. There aren’t as clearly-cut and defined rules. The conversation can go in an infinite number of different directions. And then of course, you need labeled data. You need to tell the machine to do it right or wrong.”
Housman noted that it’s inherently difficult to assign these informative labels. “Two translators won’t even agree on whether it was translated properly or not,” he said. “Language is kind of the wild west, in terms of data.”
Perhaps, we need to admit that human parity MT at scale is not a meaningful or achievable goal. If it is not possible to have a super-competent human translator capable of translating anything and everything with equal ease, why do we presume a machine could?
Perhaps what we really need is an MT platform that can rapidly evolve in quality with specialized human feedback. Machine translation post-editing (MTPE) today is generally NOT a positive experience for most translators. But human interaction with the machine can be a significantly better and more positive experience. Developing interactive and highly responsive MT systems that can assist, learn, and improve the humdrum elements of translation tasks instantaneously might be a better research focus. This may be a more worthwhile goal than having a God-like machine that can translate anything and everything at human parity.
Even in the AI-will-solve-all community, we know that "language is hard," so maybe we need more focus on improving the man-machine interface and the quality of the interaction, and on finding more sophisticated collaborative models. Rapid evolution, intuitive and collaborative interaction, and instant learning seem like a more promising vision to me than crawling all the data on the web and throwing machine learning pixie dust at your ten-trillion-word TM training data.
Getting to a point where the large majority of translators ALWAYS WANT TO USE MT because it simply makes the work easier, more pleasant, and more efficient is perhaps a better focus for the future. I would also bet that this different vision is a more likely path to better MT systems that consistently produce better output over millions of sentences.