Tuesday, August 11, 2020

The Continuing Evolution of Enterprise MT

I recently hosted a round table panel discussion organized by MultiLingual with three leading enterprise MT players, to discuss and share thoughts on the state of enterprise MT today. The panelists are all firmly established in the enterprise MT arena, and a key objective of the session was to contrast the focus and differing requirements of enterprise MT with the generic consumer portal MT that is so easily available today.

MT technology has become pervasive, and today more than a trillion words are translated every single day by the many free MT portals on the web. However, the MT needs of the enterprise are different and much more specialized. Generic MT technology is not always adequate for the many new uses that a global enterprise might have for high volume translation. In this panel discussion, we talk to three leaders working in the Enterprise MT space about:

  • Customization and adaptation requirements for the enterprise
  • Issues in measuring and understanding relative MT output quality
  • The outlook for continuing improvements in core MT technology
  • Use cases for MT beyond localization

The MultiLingual Summer Series: Meaningful Conversations with Thought Leaders has other interesting sessions coming up in August.

The panelists present in the session included:

Chris Wendt graduated as Diplom-Informatiker from the University of Hamburg, Germany, and subsequently spent a decade on software internationalization for a multitude of Microsoft products including Windows, Internet Explorer, MSN, and Bing – bringing these products to market with equal functionality worldwide. Since 2020 he has been leading the group of language services in the Azure Cognitive Services family, bringing natural language processing capabilities to its customers.
Alon Lavie leads and manages Unbabel’s US AI lab based in Pittsburgh, and provides strategic leadership for the AI R&D teams company-wide. For almost 20 years (1996-2015) Alon was a Research Professor at the Language Technologies Institute at Carnegie Mellon University, where he continues to serve as an adjunct professor.
Joern Wuebker is Director of Research at Lilt, the AI-powered enterprise translation software and services company. Prior to joining Lilt, Joern earned a Ph.D. in computer science from RWTH Aachen, where he focused on machine translation research.

The complete session is available as a video on the MultiLingual site at the links shown above; however, it requires the viewer to sit through the full session. While there are some solutions that allow video content to be indexed and annotated, making content discovery easier, these are not yet in wide use. I therefore thought it would be useful to provide subject-related links to the different sections of the full session in this post, so a viewer can focus selectively on the subject matter of greatest interest. I have also summarized some of the important points in my own words and added new comments where I thought they might add clarity.

What are the characteristics and requirements of enterprise MT? How does it differ from generic MT?

  • Strong focus on data security and privacy of customer data
  • The adaptability and customization possibilities available to tailor the MT system to unique customer requirements
  • Service level requirements to meet corporate IT requirements: Scalability, Batch vs. Interactive
  • Matching and adjusting MT quality to different enterprise use case requirements: correct translation in context and proper handling of relevant content types especially when MT is used without downstream human adjustment
  • Handling of varied content types (structured documents) and formats in different business workflows
  • Maintaining data privacy in dynamic workflows and data interactions
  • The growing need for interactivity and dynamic adaptation to meet organizational requirements as rapidly and efficiently as possible

What do we mean by adaptation and customization to meet unique organizational needs? 

  • According to Lilt (referring to a TAUS study), offline adaptation can often be ineffective; they favor dynamic, interactive human-in-the-loop adaptation, which seems especially well suited to localization use cases but may be less useful for very high-volume use cases like eDiscovery and eCommerce
  • The TAUS COVID MT study should not be used to draw larger conclusions
  • Enterprise content is constantly changing and so MT engines need to evolve and cannot be static for best outcomes when using the technology
  • Adaptation should be driven by content variations and evolution
  • Microsoft has seen that on average their customers get 10+ BLEU improvements from their customization efforts
  • Handling of unique terminology, style, and branding correctly is increasingly seen as a critical need for the enterprise

How do we handle the problem that there never seems to be enough data to properly customize or adapt an MT engine to unique customer requirements?

  • Lilt claims they don't need large data volumes to adapt an engine, but rather they need very specific and relevant in-context data that is current and reflects the immediate focus to enable effective adaptation
  • Adaptation can start with just a few key terms and evolve with TM and/or human feedback that can be gathered over time
  • Context and high relevance are now increasingly seen as more important than data volume
  • Make the customization process demand-driven. Focus your correction and improvement processes on the content that matters the most
  • Data manufacturing is of dubious value, but some synthetic data strategies make sense, e.g. back translation, which can be "surprisingly effective".
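The "surprisingly effective" back-translation idea mentioned above can be sketched in a few lines: authentic target-language text is paired with machine-generated source text, so the model always trains toward clean, human-written target output. The `reverse_translate` lexicon below is a hypothetical toy stand-in for a real target-to-source MT system.

```python
def reverse_translate(sentence):
    """Hypothetical target->source MT system; here a toy
    word-for-word lookup standing in for a real model."""
    toy_lexicon = {"hallo": "hello", "welt": "world"}
    return " ".join(toy_lexicon.get(word, word) for word in sentence.split())

def back_translate(monolingual_target):
    """Pair each authentic target sentence with a machine-generated
    source sentence, yielding synthetic (source, target) training pairs."""
    return [(reverse_translate(t), t) for t in monolingual_target]

print(back_translate(["hallo welt"]))  # [('hello world', 'hallo welt')]
```

The key design point is the asymmetry: translation errors land on the source side of the synthetic pair, where they act as noise, while the target side the model learns to produce remains authentic.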

How do we address the challenges around evaluating MT quality given the increasing awareness and recognition that BLEU scores are not as useful with NMT?

  • Understand the source data and choose the right metrics to determine accuracy. At Lilt, "next word accuracy" is a key measure used to track quality progress
  • The purpose of the MT deployment needs to be properly understood first: does the MT solve the problem? Does the machine translation improve the user experience even though it is not linguistically perfect?
  • BLEU is better suited to laboratory evaluation than to real-world evaluation, but it still has value to developers
  • Human evaluation has to be involved to get a better picture of real MT quality, but it is expensive and slow
  • Unbabel uses MQM-based error evaluation techniques to develop improved quality predictors
  • Unbabel has developed a neural-based evaluation approach that is showing great promise with an initiative called COMET. Details will be coming forth soon.
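For readers unfamiliar with the metric being critiqued here, sentence-level BLEU is the brevity-penalized geometric mean of modified n-gram precisions. The minimal, unsmoothed implementation below is for illustration only; real toolkits add smoothing and careful tokenization, which is part of why BLEU comparisons across papers are fragile.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Unsmoothed sentence-level BLEU: geometric mean of modified
    n-gram precisions (n = 1..max_n) times a brevity penalty.
    Returns 0.0 if any precision is zero."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((h & r).values())        # clipped n-gram matches
        precisions.append(overlap / max(sum(h.values()), 1))
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty punishes hypotheses shorter than the reference
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
```

The hard zero when any 4-gram precision is zero is one reason the metric behaves poorly on short or fluent-but-divergent NMT output, which motivates the learned alternatives discussed above.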

How does Lilt measure MT output quality improvements given that their engine is changing dynamically?

  • Lilt measures every sentence and learns from every single sentence
  • They can tell from the improvements in "next word accuracy" and see how it evolved from an initial engine
  • GPT-3 next word prediction is not a threat to Lilt's approach, as it is monolingual and requires a huge model that would make it impractical
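As a rough illustration of a "next word accuracy" style metric (the exact definition Lilt uses was not detailed in the session), one can measure how often a model's top next-token prediction matches what the translator actually produced at each position. The `toy_predictor` below is a hypothetical stand-in for an adaptive MT model.

```python
def next_word_accuracy(reference_tokens, predict_next):
    """Fraction of positions where the model's top prediction for the
    next token matches the token the translator actually produced."""
    hits = 0
    for i in range(1, len(reference_tokens)):
        if predict_next(reference_tokens[:i]) == reference_tokens[i]:
            hits += 1
    return hits / max(len(reference_tokens) - 1, 1)

def toy_predictor(prefix):
    """Hypothetical stand-in for an adaptive MT model's prediction."""
    bigrams = {"the": "cat", "cat": "sat"}
    return bigrams.get(prefix[-1], "<unk>")

print(next_word_accuracy(["the", "cat", "sat"], toy_predictor))  # 1.0
```

Because every confirmed sentence updates the engine, this number can be tracked over time to show how far the adapted system has moved from the initial baseline.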

How do we handle MT use cases beyond localization where potentially billions of words are being translated and a human-in-the-loop deployment is not practical?

  • Raw MT can be used to make much more content available to global customers even if the value of this is unknown. The enterprise can make more content available to identify what content has the greatest value in different markets
  • User-generated content (UGC) is hard to customize around and generic engines may work just as well
  • Communication scenarios are usually much more important to the business mission than UGC; an enterprise often has very short shelf-life content, so quality measurement happens in real time, e.g. chatbots and live support
  • Communication content strategies vary for inbound (UGC) and outbound (customer support). Outbound communications can involve human-in-the-loop (HITL) if the content turnaround times allow this or if the content needs it.
  • Real-time quality estimation becomes a much more critical enabling element for other use cases, especially social media and user forum communication, where it is desirable to use raw MT

What is the difference between linguistic steering and post-editing?

  • Linguistic steering focuses on high-density language patterns rather than the full corpus, to ensure that critical word patterns are properly learned and handled; PEMT, in contrast, is a full-corpus evaluation strategy
  • Unbabel develops dedicated test suites to ensure that the quality is acceptable for the linguistic material that really matters
  • Raw MT can create the danger of making a catastrophic mistake and experts need to find ways to identify and handle these kinds of errors
  • Lilt feels that they could handle the translation of a 25M-word corpus through their interactive and dynamic adaptation, as predictions get better and better. Lilt can push the boundaries of HITL to handle a very large corpus.
  • Text-generation technologies are very cool but they actually do NOT really understand the text
  • Catastrophic failure will be common and easy to achieve :-)
  • Large LM approaches can help with document-level context but they are mostly monolingual 
  • Also, these approaches are not as effective as dedicated NMT engines on handling the translation task
  • These new breakthroughs can be used to leverage other downstream NLP tasks
  • The primary issue is that they are mostly monolingual at the moment, but they could be valuable as they become more multilingual
  • COMET is built on top of some of these initiatives
  • Multilingual training experiments have shown that there can be improvements for low resource languages even if they are unrelated
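A dedicated test suite of the kind described above, aimed at catching catastrophic errors in raw MT, can start as simply as checking required terminology and forbidden patterns in the output. The glossary term and blocklist pattern below are hypothetical examples, not anything a panelist described.

```python
import re

def run_test_suite(mt_output, required_terms, forbidden_patterns):
    """Return a list of failures: required terms missing from the
    output, or forbidden regex patterns that appear in it."""
    failures = []
    for term in required_terms:
        if term.lower() not in mt_output.lower():
            failures.append(f"missing required term: {term}")
    for pattern in forbidden_patterns:
        if re.search(pattern, mt_output, re.IGNORECASE):
            failures.append(f"forbidden pattern matched: {pattern}")
    return failures

# Hypothetical brand glossary and legal blocklist
print(run_test_suite("Contoso Cloud is now available",
                     required_terms=["Contoso Cloud"],
                     forbidden_patterns=[r"\bfree of charge\b"]))  # []
```

Checks like these target only the linguistic material that really matters, which is exactly what distinguishes steering-style evaluation from full-corpus post-editing.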

Audience question: Given Cybernetics 2.0, how crucial is human feedback for NMT?

  • Human feedback will always be required 
  • Language is a living thing and will continue to evolve
  • Consider COVID-19: how many MT engines knew about this even 6 months ago?

Do you see continuously improving NMT displacing ever greater numbers of human translators?

  • Machines are already translating 1000X what humans are doing, so this is already true in terms of volume, but humans are the primary communicators
  • The roles of translators are evolving as technology improves but there are a finite number of human translators and HT will always be needed
  • "Human Translation" is not monolithic, and there are many kinds of translators; we should be careful not to treat all translators as equivalent, as there is a wide range of competence and expertise
  • MT cannot replace competent, subject matter expert translators who understand the communication intent, the semantic core and comprehend the communication impact of a translation
  • Competent HT will always be the final measure for quality 
  • As long as there are humans there will be a need for human translation.
This last question evoked some spirited social media response and I saw this comment being retweeted many times on Twitter.

The full session can be viewed below, and I will add a link to the KUDO version, which will also provide interpretation in Chinese, Italian, Russian, and Spanish, when it becomes available.
