Tuesday, May 30, 2023

Understanding Adaptive Machine Translation

Machine translation has been around for over 70 years and has made steady progress in tackling what many consider to be one of the most difficult challenges in computing and artificial intelligence. We have seen the approach to this challenge change and evolve, and MT has become much more widely used, especially since the advent of neural MT.

The deep learning neural net methods used in Neural MT have led to significant improvements in output quality, especially in terms of improved fluency, and have encouraged much wider use of machine translation in global business-driving applications.

While the momentum of Neural MT is well understood and recognized as a major advance in state-of-the-art (SOTA) machine translation, it is surprising that Adaptive MT has not had a greater impact.

This is especially true in the enterprise and professional translation market, where Adaptive MT can address specific and unique business needs much more effectively than alternatives. This paper explains why, even in the age of large language models, Adaptive MT remains a critical technology for global enterprises and professional users.

To better understand the value of Adaptive MT systems, it is useful to present a contrast to the typical generic static systems that most are familiar with.

The Typical Generic Static MT Engine

Generic, relatively static MT engines are intended for use by a large number of people. These engines do not change how a phrase is translated until the next major update. Creating a generic baseline engine requires considerable cost, complexity, and data, which is why it is not done frequently. The diagram above displays the usual process involved in developing and producing a static engine.

A key characteristic of these static engines is that they do not evolve quickly because they require large new data sources to drive improvements, data that is not readily available, and thus generic static engines are typically updated no more than once a year.

On any given day, the major generic MT engine portals (Google, Microsoft, Baidu) allow hundreds of millions of people to translate material of interest. We have already reached the point where 99% or more of the translation done on the planet is done by computers, thanks to these generic engines.

In professional or business settings, the demands for using Machine Translation (MT) are quite particular. Generally, generic MT engines need to be tweaked and fine-tuned to cater to company- or project-specific language usage and terminology. This process of adjusting the MT engine to suit corporate requirements is known as customization or adaptation.

For example, if we consider the needs of IKEA, Pfizer, Airbnb, and Samsung, it is clear that they all have very different needs in terms of subject domain focus, style, and critical terminology and would be better served by enterprise-optimized MT than by a generic, one-size-fits-all MT solution.

Customization or adaptation of MT models to the correct terminology is necessary for successful outcomes with MT use in most enterprise or professional use settings.

The typical MT customization process using static engines is described below. The customization effort and process is a scaled-down version of the generic engine development process. Typically, it requires the collection and incorporation of enterprise translation memory relevant to the use case into the generic model via a scaled-down "training process."

Even when sufficient training data is available, this effort yields only limited or coarse optimization. The optimization is considered coarse because the training data available to perform it is typically minuscule compared to the base data used in the generic engine. There is little value in training an engine with very limited data, as its performance would barely differ from the generic baseline.

Thus, many attempts to use MT in professional settings face data scarcity problems. Limited data availability reduces the potential impact of adaptation. To further complicate matters, it is usually necessary to build separate engines for each use case, e.g., customer support, marketing, and legal would all be optimized separately.

Since many global enterprises have multiple product lines and businesses that cross multiple domains (TVs, semiconductors, PCs, home appliances), a large number of MT engines is often needed to cover global business needs, and all of them must be managed and maintained. This management burden is often not understood at the outset when localization teams embark on their MT journey. The complexity also creates a lot of room for error, as the data behind each engine can easily get out of sync over time.

Over time, many enterprise MT initiatives can be characterized by several problems that are common to users of these static MT systems. These problems are summarized below in order of frequency and importance:

  1. Ongoing scarcity of training data: Static models require a lot of data to drive improvements. There is little value in retraining a model until new or corrective data volumes reach critical levels.
  2. Tedious MTPE experience: Post-editors must repeatedly correct the same errors because these MT engines do not regularly improve, often leading to worker dissatisfaction.
  3. MT model management overhead and complexity: There are too many models to manage and maintain, which can lead to misalignment errors.
  4. Communication issues: These typically arise between the MT development team and the localization team members and translators, who have very different views of the overall process.
  5. Context insensitivity: Sentence- and document-level context is typically missing from these custom models.

From a technical standpoint, static MT systems often show a significant disconnect between the encoding (training) and decoding (inference) stages of model deployment. Adaptive MT, on the other hand, aims to bridge the gap between model creation (training) and deployment (inference), thus providing more effective support to expert users such as translators and linguists.

The Adaptive MT Experience

The static MT approach makes sense for large ad-supported portals where the majority (99%+) of the millions of users will use the MT systems without attempting modification or customization.

In contrast, the adaptive MT approach makes more sense for those enterprise and professional translators who almost always attempt to modify the behavior of the generic model to meet the specific and unique needs of a business use case.

ModernMT is an adaptive MT technology solution designed from the ground up to enable and encourage immediate and continuous adaptation to changing business needs. It is designed to support and enhance the professional translator's work process and increase translation leverage and productivity. This is the fundamental difference between an adaptive MT solution like ModernMT and static generic MT systems.

“Simplicity is the ultimate sophistication”

Leonardo da Vinci

While the ModernMT adaptive MT engine also has a basic generic engine underlying its capabilities, it is designed to work instantly with any available translation memory resources and to learn instantly from corrective linguistic feedback.

This is done without any user intervention or action to "train" the system. The user simply points to any available TM and it is used if it is relevant to the translation task at hand. Thus, while many struggle to use MT in an environment where use case requirements are constantly changing, this adaptive MT system uses memories, corrective feedback, and overall context gathered from both the memories and the overall document.
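To make the idea of "using a TM only if it is relevant" more concrete, here is a minimal sketch in Python. It is not ModernMT's actual mechanism, which is not described in this article. The `TranslationMemory` class, its methods, and the 0.6 similarity threshold are all hypothetical, and simple string similarity from the standard library stands in for whatever relevance measure a real adaptive system uses.

```python
from difflib import SequenceMatcher


class TranslationMemory:
    """Toy in-memory TM; a hypothetical stand-in, not ModernMT's implementation."""

    def __init__(self):
        self.units = []  # list of (source, target) pairs

    def add(self, source, target):
        # A new TM unit or correction is visible to the very next lookup:
        # no retraining step sits between feedback and use.
        self.units.append((source, target))

    def relevant(self, source, threshold=0.6, limit=3):
        """Return up to `limit` (score, source, target) units similar to `source`."""
        scored = []
        for tm_source, tm_target in self.units:
            score = SequenceMatcher(None, source.lower(), tm_source.lower()).ratio()
            if score >= threshold:
                scored.append((score, tm_source, tm_target))
        # Highest similarity first.
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:limit]


tm = TranslationMemory()
tm.add("Press the power button.", "Appuyez sur le bouton d'alimentation.")
tm.add("Our refund policy has changed.", "Notre politique de remboursement a changé.")

# Only the unit that resembles the new sentence is selected as context;
# the unrelated refund-policy unit falls below the threshold.
matches = tm.relevant("Press the power button twice.")
```

In this sketch the selected units would be handed to the translation engine as context for the sentence at hand. The point is only that relevance is decided per request at translation time, rather than baked in during a separate training run.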

As the use of MT grows in the enterprise, the benefits of an adaptive MT infrastructure continue to accrue, as the management and maintenance of the many production MT systems require nothing more than the organization of TM assets and the provision of continuous corrective feedback to drive continuous improvements in system performance.

Thus, content creators and linguistically informed users can be the primary drivers of the ongoing system evolution. Because the underlying continuous improvement process is always active in the background, there is no need for any technology process management by these users. Translation issues that may arise in widespread use can be quickly identified and corrected by linguists without the need for support from MT technology experts.

New large-scale use cases can be deployed rapidly by focusing human translation effort on the most relevant and most frequently occurring content. Adaptive MT technology allows for evolutionary approaches that ensure continuous improvement.

Independent market research points to some key factors that are often overlooked by those attempting to deploy MT in professional and enterprise environments. Surveys conducted by Common Sense Advisory and Nimdzi show that most LSPs/Enterprises struggle to deploy MT in production for three key reasons:

  1. Inability to produce MT output at the required quality levels, most often due to a lack of the training data needed for meaningful improvement.
  2. Inability to properly estimate the effort and cost of deploying MT in production.
  3. Ever-changing project needs, combined with static MT that cannot easily adapt to new requirements, which create a mismatch of skills, data, and competencies.

Given these difficulties, it is worth considering the key requirements for a production-ready MT system. Why do so many still fail with MT?

One reason for failure is that many LSPs and localization managers select the "best" MT system for their production needs using automated quality metrics such as BLEU, edit distance, hLEPOR, and COMET, without any understanding of how MT engines improve and evolve.

The scores can be helpful for MT system developers in enhancing and refining their systems. However, globalization managers who rely solely on these scores to choose the "best" system may overlook significant limitations of this selection method.

Ideally, the "best" MT system would be determined by a team of competent translators who would run directly relevant content through the MT system after establishing a structured and repeatable evaluation process. This is slow, expensive, and difficult, even if only a small sample of 250 sentences is evaluated.

Thus, automated measurements (metrics) that attempt to score translation adequacy, fluency, precision, and recall must often be used. They attempt to do what is best done by competent bilingual humans. These scoring methodologies are always an approximation of what a competent human assessment would determine, and can often be incorrect or misleading, especially with static test sets.
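A small worked example shows how a reference-based score can mislead. The sketch below implements a plain word-level edit distance, one of the simplest metrics of this kind, and normalizes it by the reference length. The function names and the sample sentences are illustrative assumptions; production metrics such as BLEU or COMET are considerably more elaborate.

```python
def word_edit_distance(hypothesis, reference):
    """Levenshtein distance over whitespace-separated tokens."""
    hyp, ref = hypothesis.split(), reference.split()
    # previous[j] = edits needed to turn the hypothesis words seen so far
    # into the first j reference words.
    previous = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        current = [i]
        for j, ref_word in enumerate(ref, start=1):
            substitution = previous[j - 1] + (hyp_word != ref_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def normalized_edit_rate(hypothesis, reference):
    """Edits per reference word; 0.0 means identical to the reference."""
    ref_len = len(reference.split())
    if ref_len == 0:
        return 0.0 if not hypothesis.split() else 1.0
    return word_edit_distance(hypothesis, reference) / ref_len


reference = "press the power button to restart the device"
paraphrase = "push the on switch to reboot the unit"

print(normalized_edit_rate(reference, reference))   # 0.0
print(normalized_edit_rate(paraphrase, reference))  # 0.625
```

The paraphrase is a perfectly usable translation of the same instruction, yet it is penalized for five of eight words because it does not match the single reference. A score like this measures closeness to one particular test set, not fitness for a particular production use.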

This approach of ranking different MT systems by scores based on opaque and possibly irrelevant reference test sets has several problems. These problems include:

  • These scores do not represent production performance.
  • These scores are typically obtained on static MT systems and do not capture a system's ability to improve.
  • The results are an old snapshot of a constantly changing scene. If you change the angle or focus, the results change.
  • Small differences in scores are often meaningless, and most users would be hard-pressed to explain what these small numerical differences might mean.
  • The score is an approximate measure of system performance at a historical point in time and is generally not a reliable predictor of future performance.
  • These scores are unable to capture the dynamic evolution typical of an adaptive MT system.
  • Generic, static systems often score higher on these rankings initially but this does not reflect that they are much more difficult to tune and adapt to unique, company-specific requirements.
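The gap between a one-time snapshot and behavior over time can be illustrated with a toy simulation. Everything in the sketch below is hypothetical: `StaticEngine` and `AdaptiveEngine` are stand-ins that use exact-match lookups, whereas real adaptive systems generalize from feedback rather than memorizing sentences. The sketch only shows why a single measurement cannot distinguish a system that learns from one that does not.

```python
class StaticEngine:
    """Hypothetical engine whose output never changes between updates."""

    def __init__(self, baseline):
        self.baseline = dict(baseline)

    def translate(self, source):
        return self.baseline.get(source, source)

    def learn(self, source, corrected):
        pass  # corrections are ignored until the next batch retraining


class AdaptiveEngine(StaticEngine):
    """Hypothetical engine that reuses post-edits immediately."""

    def __init__(self, baseline):
        super().__init__(baseline)
        self.corrections = {}

    def translate(self, source):
        return self.corrections.get(source, super().translate(source))

    def learn(self, source, corrected):
        self.corrections[source] = corrected


def segments_needing_edits(engine, batches):
    """Return, per batch, how many segments a post-editor had to fix."""
    counts = []
    for batch in batches:
        fixes = 0
        for source, reference in batch:
            if engine.translate(source) != reference:
                fixes += 1
                engine.learn(source, reference)  # feed the correction back
        counts.append(fixes)
    return counts


baseline = {"Sign in": "Signer", "Log out": "Déconnexion"}
batch = [("Sign in", "Se connecter"), ("Log out", "Déconnexion")]
batches = [batch, batch, batch]  # the same content recurs over three jobs

static_counts = segments_needing_edits(StaticEngine(baseline), batches)
adaptive_counts = segments_needing_edits(AdaptiveEngine(baseline), batches)
print(static_counts)    # [1, 1, 1]
print(adaptive_counts)  # [1, 0, 0]
```

Measured on the first batch alone, the two engines are indistinguishable. Only an evaluation that runs over successive jobs with feedback applied reveals that one of them stops repeating the error.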

When choosing MT systems for production use, relying solely on score-based rankings can lead to suboptimal or even incorrect choices. This approach is often used because NMT system performance is difficult to understand and can be shrouded in mystery. However, automated metrics are not always reliable and should not be the only factor considered in making purchase decisions. Using scores to justify choices may be a "lazy buyer" strategy that fails to fully account for the complexity involved in selecting the best MT system for a given purpose.

But the failure of so many LSPs with MT technology suggests that this approach may not be the best way forward to achieve production-ready and production-grade MT technology. So what criteria are more relevant in the context of identifying production-grade MT technology? The following criteria are much more likely to lead to technology choices that make long-term sense. For example:

  • The speed with which an MT system can be tuned and adapted to unique corporate content. Systems that require complex training efforts by technology specialists will slow the globalization team’s responsiveness.
  • The ease with which the system can be adapted to unique corporate needs. The need to have expensive consulting resources or dedicated MT technology staff on hand and ready to go greatly reduces the agility and responsiveness of the globalization team.
  • An automated and robust MT model improvement process as corrective feedback and improved data resources are brought to bear.
  • The overall complexity of MT system management. Complexity rises sharply when multiple vendors are used, as they may have different maintenance and optimization procedures. This suggests that it is better to focus on one or two partners and build expertise through deep engagement.
  • The ability of a system to enable startup work even if little or no data is available.
  • A straightforward process to correct any problematic or egregious translation errors. Many large static systems need large volumes of correction data to override such errors.
  • The availability of expert resources to manage specialized enterprise use cases and trained human resources (linguists) to help prime and prepare MT systems for large-scale deployment.

It is now common knowledge that machine learning-based AI systems are only as good as the data they use. One of the keys to long-term success with MT is to build a virtuous data collection system that refines MT performance and ensures continuous improvement. The existence of such a system would encourage more widespread adoption and enable the enterprise to become multilingual at scale. This would allow the enterprise to break down language as a barrier to global business success.

One should not assume that all adaptive MT systems follow the same technological approach. A closer look at other adaptive MT solutions shows that real-time, in-context adaptation can be achieved through various architectural methods. However, as more buyers come to realize that the responsiveness of the MT system matters more than a static COMET score on a random test set, evaluation strategies will evolve. It will be more beneficial to assess which systems can adapt with the least effort.

The ModernMT approach to adaptation is to bring the encoding and decoding phases of model deployment much closer together, allowing dynamic and active human-in-the-loop corrective feedback that is not so different from the in-context corrections and prompt modifications we are seeing with large language models.

It is possible that in the future, as Large Language Models (LLMs) become more cost-effective, scalable, secure, and controllable, they could be utilized to enhance SOTA adaptive MT models. This could improve both core translation quality and output fluency, either as stand-alone solutions or more likely as hybrid models that work with MT purpose-focused models that are yet to come.

Although LLMs have demonstrated their effectiveness in certain high-resource languages, their performance in lower-resource languages is notably poor according to initial evaluations. This outcome is not surprising, given that LLMs are not specifically tailored for translation tasks. The challenge lies in the fact that LLMs rely on extensive data caches for each language, and the significant data volumes required for improvement are often difficult to locate. Therefore, resolving this issue will not be a quick or simple process.

In contrast, ModernMT just announced support for 200 languages that can all immediately benefit from the continuous improvement infrastructure that underlies the technology, and begin the steady quality improvement process that is described in this article.

It has become evident that real-time systems that can enhance performance and swiftly respond to informed feedback from experts are highly favored to tackle the task of large-scale automated language translation.

Once customers have experienced the advantages of dynamic adaptive systems, they are unlikely to revert to the complexities, inconveniences, and slowly improving output quality of batch-trained static MT systems.