eMpTy Pages: The Continuing Evolution of Enterprise MT

I recently hosted a round table panel discussion organized by Multilingual with three leading enterprise MT players, to discuss and share thoughts on the state of Enterprise MT today. The panelists are all firmly established in the enterprise MT arena, and a key objective of the session was to contrast the focus and differing requirements of enterprise MT from the generic consumer portal MT that is so easily available today.

MT technology has become pervasive, and today more than a trillion words are translated every single day by the many free MT portals on the web. However, the MT needs of the enterprise are different and much more specialized. Generic MT technology is not always adequate for the many new uses that a global enterprise might have for high volume translation. In this panel discussion, we talk to three leaders working in the Enterprise MT space about:

Customization and adaptation requirements for the enterprise
Issues in measuring and understanding relative MT output quality
The outlook for continuing improvements in core MT technology
Use cases for MT beyond localization

The MultiLingual Summer Series: Meaningful Conversations with Thought Leaders has other interesting sessions coming up in August.

The panelists present in the session included:

Chris Wendt graduated as Diplom-Informatiker from the University of Hamburg, Germany, and subsequently spent a decade on software internationalization for a multitude of Microsoft products including Windows, Internet Explorer, MSN, and Bing – bringing these products to market with equal functionality worldwide. Since 2020 he has been leading the group of language services in the Azure Cognitive Services family, bringing natural language processing capabilities to its customers.

Alon Lavie leads and manages Unbabel’s US AI lab based in Pittsburgh, and provides strategic leadership for the AI R&D teams company-wide. For almost 20 years (1996-2015) Alon was a Research Professor at the Language Technologies Institute at Carnegie Mellon University, where he continues to serve as an adjunct professor.

Joern Wuebker is Director of Research at Lilt, the AI-powered enterprise translation software and services company. Prior to joining Lilt, Joern earned a Ph.D. in computer science from RWTH Aachen, where he focused on machine translation research.

The complete session is available as a video on the Multilingual site at the links shown above; however, it requires a viewer to sit through the full session. While there are some solutions that allow video content to be indexed and annotated, and thus make content discovery easier, these are not in wide use yet. I therefore thought it would be useful to provide subject-related links to the different sections of the full session in this post, so that a viewer can focus selectively on the subject matter of greatest interest. I have also summarized some of the important points in my own words and added new comments where I thought they might add clarity.

What are the characteristics and requirements of enterprise MT? How does it differ from generic MT?

Strong focus on data security and privacy of customer data
The adaptability and customization possibilities available to tailor the MT system to unique customer requirements
Service level requirements to meet corporate IT requirements: Scalability, Batch vs. Interactive
Matching and adjusting MT quality to different enterprise use case requirements: correct translation in context and proper handling of relevant content types especially when MT is used without downstream human adjustment
Handling of varied content types (structured documents) and formats in different business workflows
Maintaining data privacy in dynamic workflows and data interactions
The growing need for interactivity and dynamic adaptation to meet organizational requirements as rapidly and efficiently as possible

What do we mean by adaptation and customization to meet unique organizational needs?

According to Lilt, citing a TAUS study, offline adaptation can often be ineffective. Lilt favors dynamic and interactive human-in-the-loop adaptation, which seems especially well suited to localization use cases but could be less useful for very high volume use cases like eDiscovery and eCommerce
The TAUS COVID MT study should not be used to draw larger conclusions
Enterprise content is constantly changing and so MT engines need to evolve and cannot be static for best outcomes when using the technology
Adaptation should be driven by content variations and evolution
Microsoft has seen that, on average, their customers gain 10+ BLEU points from their customization efforts
Handling of unique terminology, style, and branding correctly is increasingly seen as a critical need for the enterprise

How do we handle the problem that there never seems to be enough data to properly customize or adapt an MT engine to unique customer requirements?

Lilt claims they don't need large data volumes to adapt an engine, but rather they need very specific and relevant in-context data that is current and reflects the immediate focus to enable effective adaptation
Adaptation can start with just a few key terms and evolve with TM and/or human feedback that can be gathered over time
Context and high relevance are now increasingly being seen as more important than data volume
Make the customization process demand-driven. Focus your correction and improvement processes on the content that matters the most
Data manufacturing is of dubious value but there are some synthetic data strategies that make some sense e.g. back translations which can be "surprisingly effective".
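
To make the back-translation idea concrete: monolingual text in the target language is machine-translated back into the source language, and the resulting (synthetic source, genuine target) pairs are added to the training data. The sketch below is only a minimal illustration under stated assumptions, not any panelist's pipeline. The names back_translate_corpus and reverse_translate are hypothetical, and the toy dictionary stands in for a real target-to-source MT system.

```python
from typing import Callable, List, Tuple


def back_translate_corpus(
    target_sentences: List[str],
    reverse_translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Build synthetic (source, target) pairs from monolingual target text.

    reverse_translate is a hypothetical target-to-source MT function.
    The target side stays human-written; only the source side is synthetic.
    """
    pairs = []
    for target in target_sentences:
        synthetic_source = reverse_translate(target)
        # Skip empty outputs so they do not pollute the training data.
        if synthetic_source.strip():
            pairs.append((synthetic_source, target))
    return pairs


# Toy stand-in for a German-to-English model (assumption, illustration only).
_TOY_DE_EN = {"Hallo Welt": "Hello world", "Guten Morgen": "Good morning"}


def toy_reverse_translate(sentence: str) -> str:
    """Return a canned translation, or an empty string if none is known."""
    return _TOY_DE_EN.get(sentence, "")


synthetic_pairs = back_translate_corpus(
    ["Hallo Welt", "Guten Morgen", "Unbekannter Satz"], toy_reverse_translate
)
```

In a real setting the reverse model's output would usually also be filtered for quality before being mixed with genuine parallel data.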

How do we address the challenges around evaluating MT quality given the increasing awareness and recognition that BLEU scores are not as useful with NMT?

Understand the source data and choose the right metrics to determine accuracy. At Lilt, they use "next word accuracy" as a key measure to determine quality evolution progress
The purpose of the MT deployment needs to be properly understood first. Does the MT solve the problem? Does the machine translation improve user experience even though it is not linguistically perfect?
BLEU is better suited to laboratory evaluation than to real-world evaluation, but it still has value to developers
Human evaluation has to be involved to get a better picture of the MT quality reality, but it is expensive and slow
Unbabel uses MQM based error evaluation techniques to develop improved quality predictors
Unbabel has developed a neural-based evaluation approach that is showing great promise with an initiative called COMET. Details will be coming forth soon.
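
As a rough illustration of what an MQM-based error evaluation produces: annotators mark errors by severity, and the weighted penalty is normalized by the length of the text. The sketch below is an assumption-laden simplification, not Unbabel's implementation. The severity weights (1, 5, 10) and the 0-100 scaling are illustrative choices that real MQM deployments configure for themselves, and mqm_style_score is a hypothetical name.

```python
from collections import Counter
from typing import Iterable

# Illustrative severity weights (assumption; MQM deployments set their own).
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}


def mqm_style_score(error_severities: Iterable[str], word_count: int) -> float:
    """Return 100 minus the weighted error penalty per 100 words.

    error_severities holds one severity label per annotated error.
    An unknown label raises KeyError rather than being silently ignored.
    """
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    counts = Counter(error_severities)
    penalty = sum(SEVERITY_WEIGHTS[sev] * n for sev, n in counts.items())
    return 100.0 - 100.0 * penalty / word_count


# Two minor errors and one major error in a 200-word translation.
score = mqm_style_score(["minor", "minor", "major"], word_count=200)
```

Scores like this, collected from human annotation, are the kind of signal that can then be used to train or validate automatic quality predictors.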

How does Lilt measure MT output quality improvements given that their engine is changing dynamically?

Lilt measures every sentence and learns from every single sentence
They can tell from the improvements in "next word accuracy" and see how it evolved from an initial engine
Lilt does not see GPT-3 next word prediction as a threat to their approach, as it is monolingual and requires a huge model that would make it impractical
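
The session did not spell out exactly how "next word accuracy" is computed, so the following is only one plausible reading, sketched under that assumption: for each position in the translation the human finally confirmed, ask the engine for its next word given the source and the confirmed prefix, and count how often it matches. The names next_word_accuracy and predict_next are hypothetical, and the toy predictor stands in for a real interactive MT engine.

```python
from typing import Callable, List, Sequence


def next_word_accuracy(
    source: str,
    reference_tokens: Sequence[str],
    predict_next: Callable[[str, List[str]], str],
) -> float:
    """Fraction of positions where the predicted next word equals the
    word the translator actually confirmed at that position."""
    if not reference_tokens:
        return 0.0
    hits = 0
    for i, expected in enumerate(reference_tokens):
        prefix = list(reference_tokens[:i])  # words confirmed so far
        if predict_next(source, prefix) == expected:
            hits += 1
    return hits / len(reference_tokens)


def toy_predictor(source: str, prefix: List[str]) -> str:
    """Toy engine (assumption): always proposes a fixed hypothesis."""
    hypothesis = ["the", "cat", "sleeps"]
    return hypothesis[len(prefix)] if len(prefix) < len(hypothesis) else ""


# The engine gets "the" and "cat" right but proposes "sleeps" for "sits".
accuracy = next_word_accuracy(
    "die Katze sitzt", ["the", "cat", "sits"], toy_predictor
)
```

Tracked over time on the same kind of content, a rising value would indicate that the dynamically adapting engine is predicting more of what the translator wants.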

How do we handle MT use cases beyond localization where potentially billions of words are being translated and a human-in-the-loop deployment is not practical?

Raw MT can be used to make much more content available to global customers even if the value of this is unknown. The enterprise can make more content available to identify what content has the greatest value in different markets
User-generated content (UGC) is hard to customize around and generic engines may work just as well
Communication scenarios (e.g. chatbots, live support) are usually much more important to the business mission than UGC, even though the content has a very short shelf life and quality measurement therefore has to happen in real time
Communication content strategies vary for inbound (UGC) and outbound (customer support). Outbound communications can involve human-in-the-loop (HITL) if the content turnaround times allow this or if the content needs it.
Realtime quality estimation capabilities become a much more critical enabling element for other use cases especially with social media and user forum communication where it is desirable to use raw MT.
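
One way to picture the role of real-time quality estimation in these workflows is as a router: segments whose estimated quality clears a threshold go out as raw MT, and the rest are held for a human in the loop. The sketch below assumes a quality estimator that returns a score between 0 and 1. The estimator, the 0.8 threshold, and the name route_by_quality are all hypothetical and do not describe any panelist's system.

```python
from typing import Callable, Dict, List, Tuple

Segment = Tuple[str, str]  # (source text, raw MT output)


def route_by_quality(
    segments: List[Segment],
    estimate_quality: Callable[[str, str], float],
    threshold: float = 0.8,
) -> Tuple[List[Segment], List[Segment]]:
    """Split segments into (publish as raw MT, send to human review)."""
    publish_raw: List[Segment] = []
    send_to_human: List[Segment] = []
    for source, mt_output in segments:
        if estimate_quality(source, mt_output) >= threshold:
            publish_raw.append((source, mt_output))
        else:
            send_to_human.append((source, mt_output))
    return publish_raw, send_to_human


# Toy scores standing in for a real QE model (assumption).
_TOY_SCORES: Dict[str, float] = {
    "Thanks for contacting us.": 0.93,
    "Your refund is cat.": 0.21,
}


def toy_estimator(source: str, mt_output: str) -> float:
    """Look up a canned score; unknown outputs are treated as low quality."""
    return _TOY_SCORES.get(mt_output, 0.0)


raw, review = route_by_quality(
    [
        ("Danke für Ihre Nachricht.", "Thanks for contacting us."),
        ("Ihre Rückerstattung ist unterwegs.", "Your refund is cat."),
    ],
    toy_estimator,
)
```

Where turnaround times do not allow any human step, the same score could instead be used to suppress or flag low-confidence output.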

What is the difference between linguistic steering and post-editing?

Linguistic steering focuses on high-density language patterns rather than the full corpus, to ensure that critical word patterns are properly learned and handled. PEMT, by contrast, is a full corpus evaluation strategy
Unbabel develops dedicated test suites to ensure that the quality is acceptable for the linguistic material that really matters
Raw MT can create the danger of making a catastrophic mistake and experts need to find ways to identify and handle these kinds of errors
Lilt feels that they could handle the translation of a 25M word corpus through their interactive and dynamic adaptation, with predictions getting better and better along the way. Lilt can push the boundaries of HITL to handle a very large corpus.

What is the impact of "ground-breaking NLP research (BERT/GPT-3)" on the world of MT?

Text-generation technologies are very cool but they actually do NOT really understand the text
Catastrophic failure will be common and easy to achieve :-)
Large LM approaches can help with document-level context but they are mostly monolingual
Also, these approaches are not as effective as dedicated NMT engines on handling the translation task
These new breakthroughs can be used to leverage other downstream NLP tasks
The primary issue is that they are mostly monolingual at the moment, but they could be valuable as they become more multilingual
COMET is built on top of some of these initiatives
Multilingual training experiments have shown that there can be improvements for low resource languages even if they are unrelated

Audience question: Given Cybernetics 2.0, how crucial is human feedback for NMT?

Human feedback will always be required
Language is a living thing and will continue to evolve
Consider COVID-19: how many MT engines knew about this even 6 months ago?

Do you see continuously improving NMT displacing ever greater numbers of human translators?

Machines are already translating 1000X what humans are doing, so this is already true in terms of volume, but humans are the primary communicators
The roles of translators are evolving as technology improves but there are a finite number of human translators and HT will always be needed
"Human Translation" is not monolithic, and there are many kinds of translators, thus, we should be careful to not lump all translators as equivalent, as there is a wide range in terms of competence and expertise
MT cannot replace competent, subject matter expert translators who understand the communication intent and the semantic core, and who comprehend the communication impact of a translation
Competent HT will always be the final measure for quality
As long as there are humans there will be a need for human translation.

This last question evoked some spirited social media response and I saw this comment being retweeted many times on Twitter.

The full session can be viewed below, and I will add a link to the KUDO version, which will also provide interpretation in Chinese, Italian, Russian, and Spanish, when it becomes available.

Tuesday, August 11, 2020
