This post is, to a great extent, a belated summary of highlights from the AMTA2020 virtual conference, which I felt was one of the best (in terms of signal-to-noise ratio) held in the last ten years. Of course, I can only speak to the sessions I attended, and I am aware that many technical sessions I did not attend were also widely appreciated by others. This post is also a way to summarize many of the key challenges and issues facing MT today, and thus a good way to review the SOTA of the technology as this less-than-wonderful year ends.
The State of Neural MT
I think that 2020 is the year that Neural MT became just MT. Just regular MT. It is superfluous to add "neural" anymore because most of the MT discussion and applications you see today are NMT-based, and saying "neural MT" would be like saying "car transportation." It might still be useful to say SMT or RBMT to point out use of an approach that is no longer mainstream, but it is less and less necessary to say neural MT. While NMT was indeed a big, even HUGE, leap forward, we have reached a phase where much of the best research and discussion is focused on superior implementation and application of NMT, rather than simply on using NMT. There are many open-source NMT toolkits available, and NMT is clearly the preferred MT methodology in use today, even though neither SMT nor RBMT is completely dead, and some still argue that these older approaches are better for certain specialized kinds of problems.
However, while NMT is a significant step forward in improving generic MT output quality, there are still many challenges and hurdles ahead. Getting back to AMTA2020, one of the sessions (C3) looked specifically at the most problematic NMT errors across many different language combinations and provided a useful snapshot of the situation. The chart below summarizes the most common kinds of translation errors found across many language combinations. We see that while the overall level of MT output acceptability has increased, many of the same challenges remain. Semantic confusion around word ambiguity, unknown words, and dialectal variants continues to be challenging. NMT has a particular problem with phantom or hallucinated text: it sometimes simply creates content that is not in the source. But we should be clear that the proportion of useful and acceptable translations continues to climb and is strikingly "good" in some cases.
MT Quality Assessment
BLEU is still widely used today to describe and document progress in MT model development, even though it is widely understood to be inadequate for measuring quality changes with NMT models in particular. However, BLEU provides long-term progress milestones for developers, and I think the use of BLEU scores in that context still has some validity, assuming proper experimental rigor is followed. BLEU works for this task because it is relatively easy to set up and implement.
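For illustration, here is a minimal sketch of the kind of BLEU bookkeeping this implies, using the sacrebleu library; the test-set file names are hypothetical placeholders:

```python
# Minimal BLEU tracking between model iterations (a sketch, not a full harness).
# Assumes: pip install sacrebleu; "newstest.hyp" and "newstest.ref" are
# hypothetical one-sentence-per-line files for a fixed held-out test set.
import sacrebleu

with open("newstest.hyp", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("newstest.ref", encoding="utf-8") as f:
    references = [line.strip() for line in f]

# corpus_bleu takes the hypotheses plus a list of reference streams
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")  # log per checkpoint; the trend is what matters
```

The experimental rigor mentioned above mostly amounts to keeping the test set, references, and scoring configuration frozen across runs, so that score movements reflect the model and nothing else.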
The use of BLEU to compare MT systems from different vendors, using public-domain test sets, is more problematic: my feeling is that it will lead to erroneous conclusions and sub-optimal system selection. To put it bluntly, it is a bullshit exercise that appears scientific and structured but is laden with deeply flawed assumptions and ignorance.
None of the allegedly "superior metric" replacements have really taken root, because they simply don't add enough accuracy or precision to warrant the overhead, extra effort, and experimental befuddlement. Human evaluation feedback is now a core requirement for any serious MT system development, because it is still the best way to accurately understand relative MT system quality and to measure development progress against specific use scenarios. The SOTA today is still multiple automated metrics plus human assessment when accuracy is a concern. As of this writing, I think any comparison system that spits out relative rankings of MT systems without meaningful human oversight is suspect and should be quickly dismissed.
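As a sketch of the automated half of that practice, here is how scoring one output with several metrics side by side can look, assuming the sacrebleu library (which also implements chrF and TER); the strings are toy data for illustration:

```python
# Score the same system output with several automated metrics at once.
# Assumes: pip install sacrebleu. Toy data only; the human-assessment half
# of the practice obviously cannot be scripted.
import sacrebleu

hypotheses = ["the cat sat on the mat"]
references = [["the cat sits on the mat"]]  # one reference stream

print("BLEU:", sacrebleu.corpus_bleu(hypotheses, references).score)
print("chrF:", sacrebleu.corpus_chrf(hypotheses, references).score)
print("TER :", sacrebleu.corpus_ter(hypotheses, references).score)
```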
However, the need for better metrics to help both developers and users quickly understand the relative strengths and weaknesses of multiple potential MT systems is even more urgent today. If a developer has two or three close variants of a potential production MT system, how do they tell which one to commit to? Understanding how a production system improves or degrades over time is also very valuable. A better metric would, at a minimum, need to:
- Determine the ongoing improvement or degradation of a production MT system
- Differentiate between multiple high-performing systems with better accuracy than has been possible with other metrics (one common statistical approach to this is sketched below)
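One widely used way to separate close systems today is paired bootstrap resampling (Koehn, 2004). Below is a minimal sketch, again assuming sacrebleu is installed; the function name and variables are hypothetical:

```python
# Paired bootstrap resampling: do systems A and B really differ, or are they
# statistically tied on this test set? A sketch; assumes pip install sacrebleu.
import random
import sacrebleu

def paired_bootstrap(hyps_a, hyps_b, refs, n_samples=1000, seed=42):
    """Return the fraction of resampled test sets on which A outscores B."""
    rng = random.Random(seed)
    n = len(refs)
    wins_a = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        sample_a = [hyps_a[i] for i in idx]
        sample_b = [hyps_b[i] for i in idx]
        sample_r = [refs[i] for i in idx]
        if (sacrebleu.corpus_bleu(sample_a, [sample_r]).score
                > sacrebleu.corpus_bleu(sample_b, [sample_r]).score):
            wins_a += 1
    return wins_a / n_samples
```

If A wins on, say, 95% or more of the resamples, the difference is probably real; a result near 50% means the two variants are indistinguishable on that test set, and the choice has to be made on other grounds, including human review.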
The Translator-Computer Interface
Another highlight was discussion of what a next-generation translator-computer interface might look like. Such an interface is promising for several reasons:
- It allows a much richer and more flexible interaction with the computer for any translation-related task.
- It naturally belongs in the cloud and is likely to offer the most powerful user assistance experience in the cloud setting.
- It can be connected to many translation assistance capabilities like Linguee, dictionaries, terminology, synonym, and antonym databases, MT, and other translator reference aids, transforming the current TM-focused desktop.
- It creates the possibility of a much more interactive, translator-driven adaptation model for next-generation MT systems that can learn from each interaction.