Tuesday, December 22, 2020

Association for Machine Translation in the Americas (AMTA2020) Conference Highlights

This post is largely a belated summary of the highlights from the AMTA2020 virtual conference, which I felt was one of the best held in the last ten years (in terms of signal-to-noise ratio). Of course, I can only speak to the sessions I attended, and I am aware that there were many technical sessions I missed that were also widely appreciated by others. This post is also a way to summarize many of the key challenges and issues facing MT today, and is thus a good way to review the state of the art (SOTA) of the technology as this less-than-wonderful year ends.

The State of Neural MT

I think that 2020 is the year that Neural MT became just MT. Just regular MT. It is superfluous to add "neural" anymore because most of the MT discussions and applications that you see today are NMT-based, and it would be like saying Car Transportation. It might still be useful to say SMT or RBMT to point out an approach that is not the current mainstream, but it is less necessary to say neural MT anymore. While NMT was indeed a big or even HUGE leap forward, we have reached a phase where much of the best research and discussion is focused on superior implementation and application of NMT, rather than on simply using NMT. There are many open-source NMT toolkits available and it is clearly the preferred MT methodology in use today, even though neither SMT nor RBMT is completely dead. Some still argue that these older approaches are better for certain specialized kinds of problems.

However, while NMT is a significant step forward in improving generic MT output quality, there are still many challenges and hurdles ahead. Getting back to AMTA2020, one of the sessions (C3) talked specifically about the most problematic NMT errors and provided a useful snapshot of the situation. The chart below is a summary of the most common kinds of translation errors found across many different language combinations. We see that while the overall level of MT output acceptability has increased, many of the same challenges remain. Semantic confusion around word ambiguity, unknown words, and dialectal variants continues to be challenging. NMT has a particular problem with phantom or hallucinated text - it sometimes simply creates stuff that is not in the source. But we should be clear that the proportion of useful and acceptable translations continues to climb and is strikingly "good" in some cases.

A concern for all the large public MT portals that translate billions of words an hour is the continuing possibility of catastrophic errors that are offensive, insensitive, or just simply outlandish. Some of these are listed below, taken from a presentation by Mona Diab, a GWU/Facebook researcher who gave a very interesting overview of something she called "faithful" translation.

This is a particularly urgent issue for platforms like Facebook and Twitter that face huge volumes of social media commentary on political and social events. Social media, in case you did not know, is increasingly the way that much of the world consumes news.

The following slides show what Mona was pointing to when she talked about "Faithfulness", and I recommend that readers look at her whole presentation, which is available here. MT on social media can be quite problematic, as shown in the next chart.

She thus urged the community to find better ways to assess and determine acceptable or accurate MT quality, especially in high-volume social media translation settings. Her presentation provided many examples of problems and described a need for a more semantically accurate measure that she calls "Faithful MT". Social media is an increasingly important target of translation focus, and we have seen the huge impact that commentary in social media can have on consumer buying behavior, political opinion, brand identity, brand reputation, and even political power outcomes. A modern enterprise, commercial or government, that does not monitor ongoing relevant social media feedback is walking blind, and is likely to face unfortunate consequences from this lack of foresight.

Mona Diab's full presentation is available here and is worth a look, as I think it defines several key challenges for the largest users of MT in the world. She mentioned that Facebook processes 20B+ translation transactions per day, which could mean anywhere from 100 billion to 2 trillion words a day. This volume will only increase as more of the world comes online, and could double in as little as a year.
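As a rough check on that range: 100 billion to 2 trillion words a day works out to an average of about 5 to 100 words per transaction. Those per-transaction averages are inferred from the arithmetic only and were not quoted in the presentation. A minimal sketch of the calculation:

```python
# Back-of-envelope check of the daily word volume quoted above.
# The 5- and 100-word averages are assumptions implied by the range,
# not numbers taken from the presentation.
TRANSACTIONS_PER_DAY = 20_000_000_000  # "20B+" translation transactions per day


def words_per_day(avg_words_per_transaction: int) -> int:
    """Total words translated per day for an assumed average transaction length."""
    return TRANSACTIONS_PER_DAY * avg_words_per_transaction


low = words_per_day(5)     # assumed short comment or status update
high = words_per_day(100)  # assumed longer post
print(f"{low:,} to {high:,} words per day")
```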

Another keynote that was noteworthy (for me) was the presentation by Colin Cherry of Google Research: "Research stories from Google Translate’s Transcribe Mode". He found a way to present his research in a style that was both engaging and compelling. The slides are available here, but without his talk track they are barely a shadow of the presentation I watched. Hopefully, AMTA will make the video available.

Chris Wendt from Microsoft also provided insight into the enterprise use of MT and showed some interesting data in his keynote.  He also gave some examples of catastrophic errors and had this slide to summarize the issues.

He pointed out that in some language combinations it is possible to use "Raw MT" across many more types of content than in others, because these combinations tend to perform better across much more content variation. I am surprised by how many LSPs still overlook this basic fact, i.e. not all MT language combinations are equivalent.

He showed a ranking of the "best" language pair combinations (as in closest to human references) that is probably most meaningful for localization users, but could also be useful to others who want to understand roughly how MT system quality rates by language.

Normally, vendor presentations at conferences have too much sales emphasis and too little information content to be interesting. I was thus surprised by the Intento and Systran presentations, which were both content-rich and educational. A welcome contrast to the mostly lame product descriptions we normally see.

While MT technology presentations focused on large-scale use cases (i.e. NOT localization) are making progress in great strides, my feeling is that the localization presentations were inching along progress-wise, with post-editing management, complicated data analysis, and review tool themes that really have not changed very much in ten years. A quick look at the most-read blog posts on eMpTy Pages also confirmed that a post I wrote in 2012 on Post-Editing Compensation has made it into the Top 10 list for 2020. Localization use cases still struggle to eke out value from MT technology because it is simply not yet equivalent to human translation. There are clear use cases for both approaches (MT and HT), and it has always been my feeling that localization is a somewhat iffy use case that can only work for the most skilled practitioners who make long-term investments in building suitable translation production pipelines. The reality of MT use in localization still has too much emphasis on the wrong syllable. If I were able to find a cooperative MT developer team, I think I would be able to architect a better production flow and man-machine engagement model than much of what I have seen over the years. I hope I get the chance to do this in 2021.

MT Quality Assessment

BLEU is still widely used today to describe and document progress with MT model development, even though it is widely understood to be inadequate for measuring quality changes, with NMT models in particular. However, BLEU provides long-term progress milestones for developers, and I think the use of BLEU scores in that context still has some validity, assuming that proper experimental rigor is followed. BLEU works for this task because it is relatively easy to set up and implement.
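To give a sense of how little machinery is involved, here is a minimal sketch of a corpus-level BLEU calculation. It is a simplification made for illustration: it assumes a single reference per segment and plain whitespace tokenization, and the function names are my own. Real evaluations normally rely on a standardized implementation such as sacreBLEU, precisely because tokenization and other small choices change the score.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU (0-100): one reference per segment, whitespace tokens."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipping: an n-gram is credited at most as often as it occurs in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    # Without smoothing, any n-gram order with zero matches drives the score to zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the clipped n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: output shorter than the reference is penalized.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


score = corpus_bleu(["the cat sat on a mat"], ["the cat sat on the mat"])
print(f"BLEU: {score:.1f}")
```

Note that the score only counts surface n-gram overlap with the reference, which is why a perfectly acceptable translation that uses different wording can score poorly.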

The use of BLEU to compare MT systems from different vendors using public domain test sets is more problematic - my feeling is that it will lead to erroneous conclusions and sub-optimal system selection. To put it bluntly, it is a bullshit exercise that appears scientific and structured but is laden with deeply flawed assumptions and ignorance.

None of the allegedly "superior metric" replacements have really taken root because they simply don't add enough additional accuracy or precision to warrant the overhead, extra effort, and experimental befuddlement. Human evaluation feedback is now a core requirement for any serious MT system development because it is still the best way to accurately understand relative MT system quality and determine progress in development related to specific use scenarios. The SOTA today is still multiple automated metrics + human assessments when accuracy is a concern. As of this writing, I think any comparison system that spits out relative rankings of MT systems without meaningful human oversight is suspect and should be quickly dismissed.

However, the need for better metrics to help both developers and users quickly understand the relative strengths and weaknesses of multiple potential MT systems is even more urgent today. If a developer has 2 or 3 close variants of a potential production MT system, how do they tell which is the best one to commit to? Understanding how a production system improves or degrades over time is also very valuable.
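One widely used way to approach the "which close variant is better" question, though not one discussed in the sessions I attended, is paired bootstrap resampling over segment-level scores. The sketch below assumes you already have a per-segment score (from any metric or from human ratings) for two systems on the same test set; the function name and the example scores are hypothetical.

```python
import random


def paired_bootstrap(scores_a, scores_b, samples=1000, seed=0):
    """Fraction of resampled test sets on which system A's total score beats system B's."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same segments")
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    n = len(scores_a)
    wins = 0
    for _ in range(samples):
        # Resample segment indices with replacement, using the same indices for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / samples


# Hypothetical per-segment scores for two close system variants.
variant_a = [0.71, 0.64, 0.80, 0.55, 0.62, 0.77, 0.69, 0.58]
variant_b = [0.70, 0.66, 0.78, 0.54, 0.63, 0.75, 0.70, 0.57]
print(f"A beats B in {paired_bootstrap(variant_a, variant_b):.0%} of resamples")
```

A win rate close to 50% suggests the test set cannot separate the two variants, which is exactly the situation where human assessment earns its keep.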

Unbabel presented their new metric, COMET, and provided some initial results on its suitability for solving the challenge described above, that is, successfully ranking several high-performing MT systems.

The Unbabel team seems very enthusiastic about the potential for COMET to help:
  • Determine the ongoing improvement or degradation of a production MT system
  • Differentiate between multiple high-performing systems with better accuracy than has been possible with other metrics
Both of these can apparently be done with less human involvement, and better automated feedback on these two issues is clearly of high value. It is not clear to me how much overhead is involved in using the metric, as we do not have much experience of its use outside of Unbabel. It is being put into open source and will possibly attract a broader user community, at least among sophisticated LSPs. They stand to gain the most from a better understanding of the value of retraining, and from better comparisons of multiple MT systems than are possible with BLEU, chrF, and hLEPOR. I hope to dig deeper into understanding COMET in 2021.

The Translator-Computer Interface

One of the most interesting presentations I saw at AMTA was by Nico Herbig from DFKI on what he called the multi-modal interface for post-editing. I felt it was striking enough that I asked Nico to contribute a guest post on this blog. That post has been the most popular one over the last two months and can be read at the link below. His work has also been covered in more detail by Jost Zetzsche in his newsletter.

While Nico focused on the value of the multi-modal system to the post-editing task at AMTA, it has great value and applicability for any translation-related task. A few things stand out for me about this initiative:
  • It allows a much richer and more flexible interaction with the computer for any translation-related task.
  • It naturally belongs in the cloud and is likely to offer the most powerful user assistance experience in the cloud setting.
  • It can be connected to many translation assistance capabilities like Linguee, dictionaries, terminology, synonym, and antonym databases, MT, and other translator reference aids to transform the current TM-focused desktop.
  • It creates the possibility of a much more interactive and translator-driven interaction/adaptation model for next-generation MT systems that can learn with each interaction.

I wish you all a Happy Holiday season and a Happy, Healthy, and Prosperous New Year.

(Image credit: SkySafari app and some fun facts)