
Wednesday, December 18, 2024

The Evolving LLM Era and its Potential Impact

With the advent of Large Language Models (LLMs), exciting new possibilities are available. However, we also see a large volume of vague and poorly defined claims of "using AI" by practitioners with little or no experience with machine learning technology and algorithms.

The noise-to-signal (hype-to-reality) ratio has never been higher, and much of the hype fails to meet the requirements of real business production use cases. Aside from the data privacy issues, copyright problems, and potential misuse of LLMs by bad actors, hallucinations and reliability issues also continue to plague LLMs.


Enterprise users expect production IT infrastructure output to be reliable, consistent, and predictable on an ongoing basis, but there are very few use cases where this is currently possible with LLM output. The situation is evolving, and many expect that the expert use of LLMs could have a dramatic and favorable impact on current translation production processes.


There are several areas in and around the machine translation task where LLMs can add considerable value to the overall language translation process. These include the following:

  • LLM translations tend to be more fluent and capture more contextual information, albeit in a smaller set of languages
  • Source text can be improved and enhanced before translation to produce better-quality translations
  • LLMs can carry out quality assessments on translated output and identify different types of errors
  • LLMs can be trained to take corrective actions on translated output to raise overall quality
  • LLM MT is easier to adapt dynamically and can avoid the large re-training that typical static NMT models require



At Translated, we have been carrying out extensive research and development over the past 18 months into these very areas, and the initial results are extremely promising, as outlined in our recent whitepaper.

The chart below shows some evidence of our progress with LLM MT. It compares Google (static), DeepL (static), Lara RAG-tuned LLM MT, GPT-4o (5-shot), and ModernMT (TM access) for nine high-resource languages. These results for Lara are expected to improve further. 





One approach involves using independent LLM modules to handle each category separately. The other approach is to integrate these modules into a unified workflow, allowing users to simply submit their content and receive the best possible translation. This integrated process includes MTQE as well as automated review and post-editing.

While managing these tasks separately can offer more control, most users prefer a streamlined workflow that focuses on delivering optimal results with minimal effort, with the different technology components working efficiently behind the scenes.

LLM-based machine translation will need to be secure, reliable, consistent, predictable, and efficient for it to be a serious contender to replace state-of-the-art (SOTA) NMT models.

This transition is underway but will need more time to evolve and mature.

Thus, SOTA Neural MT models may continue to dominate MT use in any enterprise production scenarios for the next 12-15 months, except where the highest quality automated translation is required. 

Currently, LLM MT makes the most sense in settings where high throughput, high volume, and a high degree of automation are not a requirement and where high quality can be achieved with reduced human review costs enabled by language AI.

Translators are already using LLMs for high-resource languages for all the translation-related tasks previously outlined. In the author's opinion, a transition period is quite plausible in which NMT and LLM MT are used together or separately for different tasks in new LLM-enriched workflows. NMT will likely perform high-volume, time-critical production work as shown in the chart below.



In the scenario shown above, information triage is at work. High-volume content is initially processed by an adaptive NMT model, followed by an efficient MTQE process that sends a smaller subset to an LLM for cleanup and refinement. These corrections can be sent back to improve the MT model and increase the quality of the MTQE (not shown in the diagram above).
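The triage flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `translate`, `estimate_quality`, and `refine` callables, the `Segment` record, and the 0.8 threshold are hypothetical placeholders, not ModernMT or Translated APIs. Any real deployment would substitute its own MT, MTQE, and LLM services.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Segment:
    source: str
    translation: str
    qe_score: float
    route: str  # "publish" or "llm_refined"


def triage(
    sources: List[str],
    translate: Callable[[str], str],
    estimate_quality: Callable[[str, str], float],
    refine: Callable[[str, str], str],
    threshold: float = 0.8,  # hypothetical cut-off, to be tuned per use case
) -> Tuple[List[Segment], List[Tuple[str, str]]]:
    """Translate everything, then send only low-scoring output to the LLM."""
    results: List[Segment] = []
    corrections: List[Tuple[str, str]] = []  # feedback for the MT/MTQE models
    for source in sources:
        draft = translate(source)
        score = estimate_quality(source, draft)
        if score >= threshold:
            results.append(Segment(source, draft, score, "publish"))
        else:
            refined = refine(source, draft)
            results.append(Segment(source, refined, score, "llm_refined"))
            corrections.append((source, refined))
    return results, corrections


# Toy stand-ins so the sketch runs end to end (assumptions, not real models).
def toy_translate(source: str) -> str:
    return source.upper()


def toy_quality(source: str, draft: str) -> float:
    return 0.9 if len(source.split()) <= 3 else 0.5


def toy_refine(source: str, draft: str) -> str:
    return draft + " [refined]"


segments, feedback = triage(
    ["hello world", "a much longer sentence here"],
    toy_translate,
    toy_quality,
    toy_refine,
)
for seg in segments:
    print(seg.route, "|", seg.translation)
```

The `corrections` list stands in for the feedback loop mentioned above: the refined segments could be returned to the adaptive MT model and the MTQE model as new training signal.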

However, as LLMs get faster and it is easier to automate sequences of tasks, it may be possible to embed both an initial quality assessment and an automated post-editing step together for an LLM-based process to manage.


An emerging trend among LLM experts is the use of agents. Agentic AI and the use of agents in large language models (LLMs) represent a significant evolution in artificial intelligence, moving beyond simple text generation to create autonomous, goal-driven systems capable of complex reasoning and task execution. 

AI agents are systems that use LLMs as their core controller to autonomously pursue complex goals and workflows with minimal human supervision. 

They potentially combine several key components:

  • An LLM core for language understanding and generation
  • Memory modules for short-term and long-term information retention
  • Planning capabilities for breaking down tasks and setting goals
  • Some ability to iterate to a goal
  • Tools for accessing external information and executing actions
  • Interfaces for interacting with users or other systems
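To make the components listed above more concrete, here is a minimal sketch of the "iterate to a goal" behavior with a simple memory of past attempts. The `generate`, `assess`, and `revise` callables are hypothetical stand-ins for LLM calls and tools, and the target score and step limit are arbitrary assumptions. Real agent frameworks are considerably more elaborate.

```python
from typing import Callable, List, Tuple

Memory = List[Tuple[str, float]]  # history of (draft, score) pairs


def run_agent(
    task: str,
    generate: Callable[[str], str],
    assess: Callable[[str, str], float],
    revise: Callable[[str, str, Memory], str],
    target_score: float = 0.75,  # hypothetical goal
    max_steps: int = 5,          # hypothetical safety limit
) -> Tuple[str, float, Memory]:
    """Draft, assess, and revise until the goal is met or steps run out."""
    draft = generate(task)
    score = assess(task, draft)
    memory: Memory = [(draft, score)]
    steps = 0
    while score < target_score and steps < max_steps:
        draft = revise(task, draft, memory)
        score = assess(task, draft)
        memory.append((draft, score))
        steps += 1
    # Return the best attempt seen, which is not always the last one.
    best_draft, best_score = max(memory, key=lambda entry: entry[1])
    return best_draft, best_score, memory


# Toy stand-ins so the loop can be run without any model (assumptions only).
def toy_generate(task: str) -> str:
    return "draft"


def toy_assess(task: str, draft: str) -> float:
    return min(1.0, 0.25 * (1 + draft.count("+")))


def toy_revise(task: str, draft: str, memory: Memory) -> str:
    return draft + "+"


best, best_score, history = run_agent(
    "translate a sentence", toy_generate, toy_assess, toy_revise
)
print(best, best_score, len(history))
```

The step limit matters in practice: without it, an agent that never reaches its target score would loop indefinitely and keep incurring LLM costs.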

One approach involves using independent LLM agents to address each of the categories below as separate and discrete steps.

The other approach is to integrate these steps into a unified and robust workflow, allowing users to simply submit content and receive the best possible output through an AI-managed process. This integrated workflow would include source cleanup, MTQE, and automated post-editing. Translated is currently evaluating both approaches to identify the best path forward in different production scenarios.



Agentic AI systems offer several advanced capabilities, including:

  • Autonomy: Ability to take goal-directed actions with minimal oversight
  • Reasoning: Contextual decision-making and weighing tradeoffs
  • Adaptive planning: Dynamically adjusting goals and plans as conditions change
  • Natural language understanding: Comprehending and following complex instructions
  • Workflow optimization: Efficiently moving between subtasks to complete processes

A thriving and vibrant open-source community will be a key requirement for ongoing progress. The open-source community has been continually improving the capabilities of smaller models and challenging the notion that scale is all you need. We see an increase in recent models that are smaller and more efficient but still capable and are thus often preferred for deployment.

All signs point to an exciting future where the capabilities of technology to enhance and improve human communication and understanding get better, and we are likely to see major advances in bringing an increasing portion of humanity into the digital sphere for productive, positive engagement and interaction.

Tuesday, December 17, 2024

The Evolution of AI Translation Technology

 Translated Srl is a pioneer in using MT in professional translation settings at a production scale. The company has a long history of innovation in the effective use of MT technology (an early form of AI) in production settings. It has deployed MT extensively across much of its professional translation workload for over 15 years and has acquired considerable expertise in doing this efficiently and reliably.

Machine Translation IS Artificial Intelligence

One of the main drivers behind language AI has been the ever-increasing content volumes needed in global enterprise settings to deliver exceptional global customer experience. The rationale behind the use of language AI in the translation context has always been to amplify the ability of stakeholders to produce higher volumes of multilingual content more efficiently and at increasingly higher quality levels. 

Consequently, we are witnessing a progressive human-machine partnership where an increasing portion of the production workload is being transferred to machines as technology advances.

Research analysts have pointed out that even as recently as 2022-23, LSPs and localization departments struggled with using generic (static) MT systems in enterprises for the following reasons:

  1. Inability to produce MT output at the required quality levels, most often due to a lack of the training data needed to see meaningful improvement.
  2. Inability to properly estimate the effort and cost of deploying MT in production.
  3. Static MT cannot adapt easily to the ever-changing needs and requirements of different projects, which creates a mismatch of skills, data, and competencies.

The Adaptive MT Innovation

In contrast to much of the industry, Translated has been a first mover in the production use of adaptive MT, going back to the Statistical MT era. The adaptive MT approach is an agile and highly responsive way to deploy MT in enterprise settings, as it is particularly well-suited to rapidly changing enterprise use case scenarios.

From the earliest days, ModernMT was designed to be a useful assistant to professional translators to reduce the tedium of the typical post-editing (MTPE) work process. This focus on building a productive and symbiotic human-machine relationship has resulted in a long-term trend of continued improvement and efficiency.


ModernMT is an adaptive MT technology solution designed from the ground up to enable and encourage immediate and continuous adaptation to changing business needs. It is designed to support and enhance the professional translator's work process and increase translation leverage and productivity beyond what translation memory alone can. It is a continuous learning system that improves with ongoing corrective feedback. This is the fundamental difference between an adaptive MT solution like ModernMT and static generic MT systems.

The ModernMT approach to MT model adaptation is to bring the encoding and decoding phases of model deployment much closer together, allowing dynamic and active human-in-the-loop corrective feedback, which is not so different from the in-context corrections and prompt modifications we are seeing being used with large language models today.

It is now common knowledge that machine learning-based AI systems are only as good as the data they use. One of the keys to long-term success with MT is to build a virtuous data collection system that refines MT performance and ensures continuous improvement. This high-value data collection effort has been underway at Translated for over 15 years and is a primary reason why ModernMT outperforms competitive alternatives.

This is also a reason why it makes sense to channel translation-related work through a single vendor so that an end-to-end monitoring system can be built and enhanced over time. This is much more challenging to implement and deploy in multi-vendor scenarios. 


The existence of such a system encourages more widespread adoption of automated translation and enables the enterprise to become efficiently multilingual at scale. Such a technological foundation allows the enterprise to break down language as a barrier to global business success.


The MT Quality Estimation & Integrated Human-In-The-Loop Innovation

As MT content volumes rapidly increase in the enterprise, it becomes more important to make the quality management process more efficient, as human review methods do not scale easily. It is useful for any multilingual-at-scale initiative to rapidly identify the MT output that most needs correction and to focus critical corrective feedback primarily on these lower-quality outputs. This enables the MT system to improve continually and raises overall quality across a large content volume.

The basic idea is to enable the improvement process to be more efficient by immediately focusing 80% of the human corrective effort on the 20% lowest-scoring segments. Essentially, the 80:20 rule is a principle that helps individuals and companies prioritize their efforts to achieve maximum impact with the least amount of work. This leveraged approach allows overall MT quality, especially in very large-scale or real-time deployments, to improve rapidly.
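As a rough illustration of this selection step, the sketch below ranks segments by their MTQE score and returns the lowest-scoring fraction for human review. The function name, the `(segment, score)` pair format, and the assumption that a lower score means lower estimated quality are all illustrative choices, not a description of any particular MTQE product.

```python
from typing import List, Tuple

ScoredSegment = Tuple[str, float]  # (segment text or id, MTQE score)


def select_for_review(
    scored_segments: List[ScoredSegment],
    review_fraction: float = 0.2,
) -> List[ScoredSegment]:
    """Return the lowest-scoring fraction of segments, worst first.

    Assumes a lower MTQE score means lower estimated quality.
    """
    if not 0 < review_fraction <= 1:
        raise ValueError("review_fraction must be in (0, 1]")
    ranked = sorted(scored_segments, key=lambda pair: pair[1])
    n_review = max(1, round(len(ranked) * review_fraction))
    return ranked[:n_review]


# Ten toy segments with scores 0.0, 0.1, ..., 0.9.
corpus = [(f"segment-{i}", i / 10) for i in range(10)]
for segment, score in select_for_review(corpus, review_fraction=0.2):
    print(segment, score)
```

With `review_fraction=0.01`, the same function applied to a million scored sentences would return the 10,000 lowest-scoring ones.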

Human review at a global content scale is costly and probably a physical impossibility because of the ever-increasing volumes. As the use of MT expands across the enterprise to drive international business momentum and as more automated language technology is used, MTQE technology offers enterprises a way to identify the content that needs the least and the most human review and attention before it is released into the wild.


When a million sentences of customer-relevant content need to be published using MT, MTQE is a means to identify the ~10,000 sentences that most need human corrective attention to ensure that global customers receive acceptable quality across the board.

This informed identification of problems that need human attention is essential for a more efficient allocation of resources and improved productivity. This process enables much more content to be published without risking brand reputation, while ensuring that desired quality levels are achieved. In summary, MTQE is a useful risk management strategy as volumes climb.

Pairing content with lower MTQE scores into a workflow that connects a responsive, continuously learning adaptive MT system like ModernMT with expert human editors creates a powerful translation engine. This combination allows for handling large volumes of content while maintaining high translation quality.

When a responsive adaptive MT system is integrated with a robust MTQE system and a tightly connected human feedback loop, enterprises can significantly increase the volume of published multilingual content.

The conventional method, involving various vendors with different and distinct processes, is typically slow and prone to errors. However, this sluggish and inefficient method is frequently employed to enhance the quality of MT output, as shown below.


MTQE technology aims to pinpoint errors quickly and concentrate on minimizing the size of the data set requiring corrective feedback. The business goal centers on swiftly identifying and rectifying the most problematic segments.

Speed and guaranteed quality at scale are highly valued deliverables. Innovations that decrease the volume of data requiring review and reduce the risk of translation errors are crucial to the business mission.


Using an adaptive rather than a generic MTQE process extends the benefit of this technology further by reducing the amount of content that needs careful review.

The traditional model of post-editing everything is now outdated.

The new approach entails translating everything and then only revising the worst and most erroneous parts to ensure an acceptable level of quality.

For example, suppose that reviewing the 40% of sentences with the lowest scores from a generic MTQE model identifies 60% of the major problems in a corpus. An adaptive MTQE model informed by customer data can identify 90% of the major translation problems by reviewing only the 20% of sentences with the lowest scores.
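The comparison in this example amounts to measuring what share of the major errors falls inside the reviewed portion of the corpus. The sketch below computes that share on synthetic data that was constructed by hand to mirror the percentages in the text. It is not real evaluation data, and the function name and input format are assumptions made for illustration.

```python
from typing import List, Tuple


def major_error_recall(
    segments: List[Tuple[float, bool]],
    review_fraction: float,
) -> float:
    """Share of major errors found when reviewing the lowest-scoring fraction.

    Each item is (mtqe_score, has_major_error); lower score = worse quality.
    """
    ranked = sorted(segments, key=lambda item: item[0])
    n_review = round(len(ranked) * review_fraction)
    total_errors = sum(1 for _, has_error in ranked if has_error)
    if total_errors == 0:
        return 0.0
    caught = sum(1 for _, has_error in ranked[:n_review] if has_error)
    return caught / total_errors


def synthetic_corpus(error_ranks: set) -> List[Tuple[float, bool]]:
    """Build 100 toy segments; rank i gets score i/100 (hand-made data)."""
    return [(i / 100, i in error_ranks) for i in range(100)]


# Synthetic, hand-placed errors: 10 major errors in each 100-segment corpus.
generic = synthetic_corpus({0, 5, 10, 20, 30, 39, 50, 60, 70, 80})
adaptive = synthetic_corpus({0, 1, 2, 3, 5, 8, 12, 15, 19, 60})

print(major_error_recall(generic, 0.4))   # generic model, review 40%
print(major_error_recall(adaptive, 0.2))  # adaptive model, review 20%
```

In practice the `has_major_error` labels would come from a human-annotated sample, which is what makes this kind of measurement a useful check on any MTQE model before relying on it at scale.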

This innovation greatly enhances the overall efficiency. The chart below shows how a process that integrates adaptive MT, MTQE, and focused human-in-the-loop (HITL) work together to build a continuously improving translation production platform.


The capability to enhance the overall quality of translation in a large, published corpus by analyzing less data significantly boosts the efficiency and utility of automated translation. An improvement process based on Machine Translation Quality Estimation (MTQE) is a form of technological leverage that benefits large-scale translation production.


The Evolving LLM Era and Potential Impact 

The emergence of Large Language Models (LLMs) has opened up thrilling new opportunities. However, there is also a significant number of vague and ill-defined claims of "using AI" by individuals with minimal experience in machine learning technologies and algorithms. The disparity between hype and reality is at an all-time high, with much of the excitement not living up to the practical requirements of real business use cases. Beyond concerns of data privacy, copyright, and the potential for misuse by malicious actors, issues of hallucinations and reliability persistently challenge the deployment of LLMs in production environments.

Enterprise users expect their IT infrastructure to consistently deliver reliable and predictable outcomes. However, this level of consistency is not currently easily achievable with LLM output. As the technology evolves, many believe that expert use of LLMs could significantly and positively impact current translation production processes.




Comparing MT System Performance

 The advantages of a dynamic adaptive MT system are clarified in this post. Most static MT systems need significant upfront investment to enable adaptation. Adaptive systems like ModernMT have a natural advantage since the system is so easily adapted to customer domain and data.


Machine Translation (MT) system evaluation is necessary for enterprises considering expanding the use of automated translation to meet the growing information and communication needs of engaging the global customer. Managers need to understand which MT system is best for their specific use case and language combination, and which MT system will improve the fastest with their data and with the least effort to perform best for the intended use case.

What is the best MT system for my specific use case, and this language combination?

The comparative evaluation of the quality performance of MT systems has been problematic and often misleading because the typical research approach has been to assume that all MT systems work in the same way.

Thus, comparisons by “independent” third parties are generally made at the lowest common denominator level, i.e., the static or baseline version of the system. Focusing on the static baseline makes it easier for a researcher to line up and rank different systems but penalizes highly responsive MT systems that are designed to respond immediately to the user's focus and requirements and to optimize around user content.

Which MT system is going to improve the fastest with my unique data and require the least amount of effort to get the best performance for my intended use case?

Ideally, a meaningful evaluation would test a model on its potential capabilities with new and unseen data as it is expected that a model should do well on data it has been trained on and knows.

However, many third-party evaluations use generic test data that is scoured from the web and slightly modified. Thus, data leakage is always possible as shown in the center diagram below.

Issues like data leakage and sampling bias can cause AI to give faulty predictions or produce misleading rankings. Since there is no reliable way to exclude test data contained in the training data, this problem is not easily solved. Data leakage will cause overly optimistic results (high scores) that will not be validated or seen in production use.


This issue is also a challenge when comparing LLMs, especially since much of what LLMs are tested on is data that these systems have already seen and trained on. Some key examples of the problems that data leakage causes in machine translation evaluations include:

  1. Overly optimistic performance estimates: The model has already seen some of the test data during training, which gives a false impression of how well it will perform on real, unseen data.
  2. Poor real-world performance: Models that suffer from data leakage often fail to achieve anywhere near the same level of performance when deployed on real-world data. The high scores do not translate to the real world.
  3. Misleading comparisons between models: If some models evaluated on a dataset have data leakage while others do not, fair comparisons become impossible and the best approaches cannot be identified. The leaky models will seem superior but not legitimately so.
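When the training or translation memory data is available to the evaluator, a basic overlap check is one way to catch the most obvious leakage. The sketch below is a minimal example that flags test segments matching training segments after light normalization. The normalization rules and function names are assumptions chosen for illustration, and the check only finds exact matches after normalization, so paraphrased or lightly edited leaks would go undetected. It also cannot help with public MT systems or LLMs whose training data is not disclosed.

```python
import re
import unicodedata
from typing import Iterable, List


def normalize(segment: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", segment).casefold()
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())


def find_leaked_segments(
    test_segments: Iterable[str],
    training_segments: Iterable[str],
) -> List[str]:
    """Return test segments that also appear in the training data."""
    seen = {normalize(segment) for segment in training_segments}
    return [s for s in test_segments if normalize(s) in seen]


def leakage_rate(test_segments: List[str], training_segments: List[str]) -> float:
    """Fraction of the test set that overlaps with the training data."""
    if not test_segments:
        return 0.0
    leaked = find_leaked_segments(test_segments, training_segments)
    return len(leaked) / len(test_segments)


training = ["Click the Save button.", "Open the file menu"]
test = ["click the save button", "Select a layer to edit.", "Open  the File menu!"]

print(find_leaked_segments(test, training))
print(leakage_rate(test, training))
```

The same check can be used the other way around when preparing an evaluation: any test segment it flags is removed so that the test set has no overlap with the adaptation data.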

In addition, the evaluation and ranking of MT systems done by third parties is typically done using an undisclosed and confidential "test data" set that attempts to cover a broad range of generic subject matter. This approach may be useful for users who intend to use the MT system as a generic, one-size-fits-all tool but is less useful for enterprise users who want to understand how different MT systems might perform on their subject domain and content in different use cases.

Rankings on generic test data are often of little use for predicting actual performance in the enterprise domain. If the test data is not transparent, how can an enterprise buyer be confident that the rankings are valid for their use cases? These often irrelevant scores are used to select an MT system for production work, and the resulting selections are often sub-optimal.

Unfortunately, enterprises looking for the ideal MT solution have been limited to third-party rankings that focus primarily on comparing generic (static) versions of public MT systems, using undisclosed, confidential test data sets that are irrelevant or unrelated to enterprise subject matter.

With the proliferation of MT systems in the market, translation buyers are often bewildered by the range of MT system options and thus resort to using these rankings to make MT system selections without understanding the limitations of the evaluation and ranking process.

What is the value of scores that provide no insight or detail on what the scores and rankings are based on? 

Best practices suggest that users have visibility on what data is used to calculate the score for it to be meaningful or relevant.

Thus, Translated recently undertook some MT comparison research to answer the following questions:

  1. What is the quality performance of an easily tuned and agile adaptive MT system compared to generic MT systems that require special adaptation efforts to accommodate and tune to typical enterprise content?
  2. Can a comparative analysis be done using public-domain enterprise data so that a realistic enterprise case can be evaluated, and so that others can replicate, reproduce, and verify the results?
  3. Can this evaluation be done transparently, by making test scripts publicly available so other interested parties can replicate and reproduce the results?
  4. Additionally, can the evaluation process be easily modified so that comparative performance on other data sets can also be tested?
  5. Can we provide a better, more accurate comparison of ModernMT's out-of-the-box capabilities against the major MT alternatives available in the market?

This evaluation further validates and reinforces what Gartner, IDC, and Common Sense Advisory have already said about ModernMT being a leader in enterprise MT. 

The evaluation described in this post provides a deeper technical foundation to illustrate ModernMT's responsiveness and ability to quickly adapt to enterprise subject matter and content.


Evaluation Methodology Overview

Translated Srl commissioned Achim Ruopp of Polyglot Technology LLC to find viable evaluation data and establish an easily reproducible process that could be used to periodically update the evaluation and/or enable others to replicate, reproduce, or otherwise modify the evaluation. He chose the data and developed the procedural outline for the evaluation. This is a typical enterprise use case where MT performance on specialized corporate domain material needs to be understood before deployment in a production setting. It is understood that some of the systems can potentially be further customized with specialized training efforts, but this analysis shows how the systems compare when no such effort is made on any of the systems under review.

The process followed by Achim Ruopp in his analysis is shown below:

  • Identify evaluation data and extract the available data for the languages that were of primary interest and that had approximately the same volume of data. The data came from a public dataset from Autodesk, the 3D design, engineering, and construction software company, which provides high-quality software UI and documentation translations created by post-editing machine translations. The language pairs were:
    • US English → German, 
    • US English → Italian, 
    • US English → Spanish, 
    • US English → Brazilian Portuguese, and 
    • US English → Simplified Chinese 
  • Clean and prepare data into two data sets:
    • 1) ~10,000 segments of TM data for each language pair and,
    • 2) a Test Set with 1,000 segments that had no overlap with the TM data
  • The evaluation aimed to measure the accuracy and speed of the out-of-the-box adaptation of ModernMT to the IT domain and contrast this with generic translations from four major online MT services (Amazon Translate, DeepL, Google Translate, and Microsoft Translator). This is representative of many translation projects in enterprise settings. A zero-shot output score for GPT-4 was also added to show how the leading LLM scores against leading NMT solutions. Thus, the “Test Set” was run through all these systems and three versions of ModernMT (static baseline, adaptive, and adaptive with dynamic access to reference TM). Please note that many “independent evaluations” that compare multiple MT systems focus ONLY on the static version of ModernMT, a configuration that would rarely be used in practice.
  • The MT output was scored using three widely used MT output quality indicators that are based on a reference Test Set. These include:
    • COMET – A measure of semantic similarity that achieves state-of-the-art levels of correlation with human judgment and is the most commonly used metric in current expert evaluations.
    • SacreBLEU – A measure of syntactic similarity that compares the token-based similarity of the MT output with the reference segment and averages it over the whole corpus. Despite many shortcomings, it is possibly the most popular metric used in MT evaluation.
    • TER – A measure of syntactic similarity that measures the number of edits (insertions, deletions, shifts, and substitutions) required to transform a machine translation into a reference translation. This is a measurement that is popular in the localization industry.
  • The results and scores produced are presented in detail in this report in a series of charts with some limited commentary. The summary is shown below. The objective was to understand how ModernMT performs relative to the other alternatives and provide a more accurate out-of-the-box picture. Thus, the focus of this evaluation remains on how systems perform without any training or customization effort. It is representative of the results a user would see with virtually no effort beyond pointing to a translation memory.
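To give a sense of what an edit-based metric like TER measures, here is a simplified sketch that counts word-level insertions, deletions, and substitutions and divides by the number of reference words. It deliberately omits the shift operation that real TER includes, so its numbers will not match true TER scores, and it is not the scoring code used in this evaluation. Established metric implementations should be used for any actual evaluation.

```python
from typing import List, Sequence


def word_edit_distance(hypothesis: Sequence[str], reference: Sequence[str]) -> int:
    """Word-level Levenshtein distance (insert, delete, substitute; no shifts)."""
    previous = list(range(len(reference) + 1))
    for i, hyp_word in enumerate(hypothesis, start=1):
        current = [i]
        for j, ref_word in enumerate(reference, start=1):
            substitution = 0 if hyp_word == ref_word else 1
            current.append(
                min(
                    previous[j] + 1,                 # delete a hypothesis word
                    current[j - 1] + 1,              # insert a reference word
                    previous[j - 1] + substitution,  # keep or substitute
                )
            )
        previous = current
    return previous[-1]


def corpus_edit_rate(hypotheses: List[str], references: List[str]) -> float:
    """Total edits divided by total reference words (lower is better)."""
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references must have the same length")
    total_edits = sum(
        word_edit_distance(hyp.split(), ref.split())
        for hyp, ref in zip(hypotheses, references)
    )
    total_reference_words = sum(len(ref.split()) for ref in references)
    if total_reference_words == 0:
        raise ValueError("references contain no words")
    return total_edits / total_reference_words


mt_output = ["the cat sat on mat", "hello world"]
reference = ["the cat sat on the mat", "hello world"]
print(corpus_edit_rate(mt_output, reference))
```

Whitespace tokenization is another simplification here. It would not work for Simplified Chinese, one of the target languages in this evaluation, which needs a proper tokenizer.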

Summary Results


  • This is the first proper evaluation and comparison of ModernMT's out-of-the-box adaptive MT model (with access to a small translation memory, but not trained) against leading generic (or static) public MT systems.
  • The comparison shows that ModernMT outperforms generic public MT systems using data from an Autodesk public dataset, where translation performance was measured for translation from US English to German, Italian, Spanish, Brazilian Portuguese, and Simplified Chinese using COMET, SacreBLEU, and TER scoring.
  • ModernMT achieves these results without any overt training effort, simply by dynamically using and referencing relevant translation memory (TM) when available.
  • A state-of-the-art LLM (GPT-4) failed to outperform the production NMT systems in most of the tests in this evaluation.
  • The evaluation and comparison tools and research data are in the public domain. Interested observers can replicate the research with their own data.

The effortless improvements in ModernMT show why comparisons to the static version of the system are meaningless.

Why is MT evaluation so difficult?

Language is one of the most nuanced, elaborate, and sophisticated mediums used by humans to communicate, share, and gather knowledge. It is filled with unwritten and unspoken context, emotion, and intention that is not easily contained in the data used to train machines on how to understand and translate human language. Thus, machines can only approach language at a literal textual string level and will likely always struggle with finesse, insinuation, and contextual subtleties that require world knowledge and common sense. Machines have neither.

Thus, while it is difficult to do this kind of evaluation with absolute certainty, it is still useful to get a general idea. MT systems will tend to do well on material that is exactly like the material they train on and function almost like translation memory in this case. Both MT system developers and enterprise users need to have some sense of what system might perform best for their purposes.

It is common practice to test MT system performance on material it has not already memorized to get a sense of what system performance will be in real-life situations. Thus quick and dirty quality evaluations provided by BLEU, COMET, and TER can be useful even though they are never as good as expert, objective human assessments. These metrics are used because human assessment is expensive and slow and also difficult to do consistently and objectively over time.


To get an accurate sense of how an MT system might perform on new and unseen data it is worth considering how these factors could undermine any absolute indication of any one system being “better” or “worse” than any other.

  • Language translation for any single sentence does not have a single correct answer. Many different translations could be useful, adequate, and correct for the purpose at hand.
  • It is usually recommended that a varied but representative set of 1,000 to 2,000 segments/sentences be used in an evaluation. Since MT systems will be compared and scored against this “gold standard”, the Test Set should be professionally translated. This can cost $1,500 to $2,500 per language, so a Test Set for 20 languages can cost $30,000 to $50,000. This cost often leads evaluators to use MT to reduce the expense of creating the Test Set, which builds in a bias toward the MT system (typically Google) used to produce this data.
  • There is no definitive way to ensure that there is no overlap between the training data and the test data so data leakage can often undermine the accuracy of the results.
  • It is easier to use generic tests but the most useful performance indicators in production settings will always be with carefully constructed test sentences of actual enterprise content (that are not contained in the training set).

Automated quality evaluation metrics like COMET are indeed useful but the experts in the community now realize that these scores have to be used together with competent human assessments to get an accurate picture of the relative quality of different MT systems. Using automated scores alone is not advised.


What matters most?

This post explores some broader business issues that should also be considered when considering MT quality.

While much attention is given to comparative rankings of different MT systems, one should ask how useful this is in understanding how any particular MT system will perform on any enterprise-specific use case. Scores on generic test sets do not accurately predict how a system will perform on enterprise content in a highly automated production setting.

The rate at which an MT system improves for specific enterprise content with the least possible effort is arguably the most important criterion for MT system selection.

Ideally, improvement should be seen on a daily or at least weekly basis.

So instead of asking what COMET score System A has on its EN > FR system, it is important to ask other questions that are more likely to ensure successful outcomes. The answers to the following questions will likely lead to much better MT system selections.

  • How quickly will this system adapt to my unique customer content?
  • How much data will I need to provide to see it perform better on my content and use case?
  • How easy is it to integrate the system with my production environment?
  • How easy or difficult is it to set up a system that continues to improve and learn from ongoing corrective feedback?
  • How easy or difficult is it to manage and maintain my optimized systems on an ongoing basis?
  • Can I automate the ongoing MT model improvement process?
  • Ongoing improvements are driven both by technology enhancements and by expert human feedback. Are both of these available from this vendor?

Please follow this link for a detailed report on this evaluation and more detailed analysis and commentary on understanding MT evaluation from a more practical and business-success-focused perspective.

Monday, December 16, 2024

ModernMT Introduces Adaptive Quality Estimation (MTQE)

 As MT quality improves, MT use expands to publishing millions of words monthly to improve global customer experience. MTQE can quickly identify potential problems to focus MTPE only on the most problematic sections and quickly publish large volumes of global CX-enhancing content safely.


Historically, the path to achieving quality in professional language translation work is to involve multiple humans in the creation and validation of every translated segment. This multi-human translation production process is known as TEP or Translate > Edit > Proof. The way to guarantee the best translation quality has always been to provide a quality review by a second and sometimes a third person. When this process works well it produces “good quality” translation, but this approach also has serious limitations:

1) it is an ad-hoc process with constantly changing humans that can result in the same mistakes happening again, and,

2) it is a time-consuming, miscommunication-prone, and costly process that is difficult to scale as volumes increase.

The TEP model has been the foundation for much of the professional translation work done over the last 20 years and is still the production model used for much of the translation work managed by localization groups. While this is a historical fact, the landscape for professional business translation has been changing in two primary ways:

1)  The volumes of content that need to be translated to be successful in international business settings are continually increasing,

2)  The need for, and use of, machine translation and more automation are increasing to cope with the ever-increasing demand and the need for much faster turnaround on translation projects.

One solution to this problem is to increase the use of machine translation and post-edit the output (MTPE or PEMT). This is an attempt to reproduce part of the entirely human TEP process described above with a machine starting the process. This approach has met with limited success, and many LSPs and localization managers struggle to find an optimal MT process due to the following issues:

Uneven or poor machine translation quality: The automation can only be successful when the MT provides a useful and preferably continuously improving first draft submitted for human approval or refinement. MT quality varies by language and few LSPs and localization managers know how to engineer and optimize MT systems to perform optimally for their specific needs. Recent surveys by researchers show that LSPs (and localization managers) still struggle to meet quality expectations and estimate cost and efforts when using MT.

Translator resistance: As MTPE is a machine output-driven process, and typically paid at lower unit rates, many translators are loath to do this kind of work without assurances that the MT will be of adequate quality to assure fair overall compensation. Low-quality MT is much more demanding to correct, and thus translators find that their compensation is negatively impacted when they work with it. The converse is also true: many translators have found that high-quality adaptive MT work results in higher-than-expected compensation due to the continuous improvement in the MT output and overall system responsiveness.

Lack of standardization: there is currently no standardization in the post-editing process, which can lead to inconsistencies in the quality of the final translation.

Training and experience: Post-editing MT requires a different skill set than traditional translation, and post-editors need to be trained accordingly. The translator versus post-editing task remains a source of friction in an industry that depends heavily on skillful human input, largely due to improper work specification, and compensation-related concerns.

Cost: Post-editing can be expensive, especially for large volumes of text. This can be a significant obstacle for companies that need to translate large amounts of content since it is often assumed that all the MT output must be reviewed and edited.


MT Quality Evaluation vs MT Quality Estimation

But as we move forward and expand the use of machine translation to make ever-increasing volumes of content multilingual, we see the need for two kinds of quality assessment tools that can be useful to any enterprise that seeks to be multilingual at scale.

1) Quality Evaluation scores provide a quality assessment of multiple versions of an MT system and may be used by MT system developers to better understand the impact of changing development strategies. Commonly used evaluation metrics include BLEU, COMET, TER, and ChrF, which all use a human reference test set (the gold standard) to calculate a quality score for each MT system’s performance, and are well understood by developers.  

 These scores are useful to developers to find optimal strategies in the system development process but unfortunately, these scores are also used by “independent” researchers who seek to sell aggregation software to less informed buyers and localization managers who usually have limited understanding of the scores, the test sets, and the opaque process used to generate the scores. Thus, buyers will often make sub-optimal and naïve choices in MT system selection.

2) Quality Estimation scores, on the other hand, are quality assessments made by the machine without using reference translations or actively requiring humans in the loop. It is an assessment of quality made by a machine itself on how good or bad a machine-translated output segment is.  MTQE can serve as a valuable tool for risk management in high-volume translation scenarios where human intervention is limited or impractical due to the volume of translations or speed of delivery. MTQE enables efficiency and minimizes potential risks associated with using raw MT because it directs attention to the most likely problematic translations, and reduces the need to look at all the automated translations.
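To make the distinction concrete, here is a minimal sketch of what "reference-based" means in practice. It is a deliberately simplified, TER-style score (word-level edit distance divided by reference length) and is not the official TER algorithm, which also accounts for block shifts; the function names and example sentences are illustrative assumptions. The key point is the second argument: an evaluation metric cannot be computed without a human reference, whereas a quality estimation model scores the MT output using only the source and the translation.

```python
def word_edit_distance(hyp_words, ref_words):
    """Minimum number of word insertions, deletions, and substitutions
    needed to turn hyp_words into ref_words (Levenshtein distance)."""
    prev = list(range(len(ref_words) + 1))
    for i, hyp_word in enumerate(hyp_words, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref_words, start=1):
            cost = 0 if hyp_word == ref_word else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def simplified_ter(hypothesis: str, reference: str) -> float:
    """Edits per reference word; 0.0 means identical, lower is better.
    Simplification: real TER also allows phrase shifts."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must not be empty")
    return word_edit_distance(hypothesis.split(), ref_words) / len(ref_words)


# Hypothetical MT output scored against a hypothetical human reference.
score = simplified_ter("the cat sits on mat", "the cat sat on the mat")
print(f"simplified TER: {score:.3f}")
```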

Interest in MTQE has gained momentum as the use of MT has increased, as it allows rapid error detection in large volumes of MT output, thus enabling rapid and focused error correction strategies to be implemented.

Another way to understand MTQE is to more closely examine the difference in training data used in developing an MT engine versus the data used in building a QE model. An MT system is trained on large volumes of source and target sentence pairs or segments or what is generally called translation memory.

An MTQE system is trained on pairs of original MT output and corrected sentences, which are also compared to the original source (ground truth) to identify error patterns.  The MTQE validation process seeks to confirm that there is a high level of agreement between a machine's quality prediction of machine output and human quality assessment of that same output.

Quality estimation is a method for predicting the quality of a translation without having to compare it to a human reference set. Quality estimation uses machine learning methods to assign quality scores to machine-translated segments, and since it works through machine learning it can be used in dynamic, live situations.  Quality estimation can predict quality at various levels of text, including at the level of the word, phrase, sentence, or even document, but is used most commonly at a segment level.


What is T-QE?

The current or traditional process used to improve adaptive machine translation quality uses one of two methods:

1)      random segments are selected and reviewed by professional translators or,

2)      every segment has to be reviewed by a translator to ensure the required quality.

However, as MT content volumes rapidly increase in the enterprise, it becomes more important to make this process more efficient, as these human review methods do not scale easily. It is useful to the production process to rapidly identify those segments that most need human attention, and focus critical corrective feedback primarily on these problem segments to enable the MT system to continually improve and ensure overall improved quality on a large content volume.

The MT Quality Estimator (T-QE) streamlines the system improvement process by providing a quality score for each segment, thus identifying those segments that most need human review, rather than depending only on random segment selection, or requiring that each segment be reviewed.

The basic idea is to enable the improvement process to be more efficient by immediately focusing 80% of the human corrective effort on the 20% lowest-scoring segments. Essentially, the 80:20 rule is a principle that helps individuals and companies prioritize their efforts to achieve maximum impact with the least amount of work. This approach allows overall MT quality, especially in very large-scale or real-time deployments, to improve rapidly.
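The selection step described above can be sketched in a few lines. This is an illustration only: it assumes each segment already has a quality score where higher means better, and the segment IDs, scores, and the `select_for_review` helper are hypothetical rather than part of any T-QE interface.

```python
def select_for_review(scored_segments, review_percent=20):
    """Return the lowest-scoring `review_percent` of segments, worst first.

    scored_segments: list of (segment_id, qe_score) tuples.
    Assumption: a higher qe_score means better estimated quality.
    """
    if not 0 < review_percent <= 100:
        raise ValueError("review_percent must be in (0, 100]")
    # Integer ceiling division so that small batches still yield a review set.
    n_review = (len(scored_segments) * review_percent + 99) // 100
    ranked = sorted(scored_segments, key=lambda item: item[1])
    return ranked[:n_review]


# Hypothetical scores for ten segments.
scores = [("s1", 0.91), ("s2", 0.42), ("s3", 0.78), ("s4", 0.35), ("s5", 0.88),
          ("s6", 0.95), ("s7", 0.67), ("s8", 0.81), ("s9", 0.59), ("s10", 0.73)]

review_queue = select_for_review(scores)
print([segment_id for segment_id, _ in review_queue])  # ['s4', 's2']
```

Ranking by score rather than sampling at random is the whole point: the review budget stays the same size, but it is spent on the segments most likely to contain errors.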

The MT Quality Estimator assists in solving this challenge by providing an MT quality score for each translated segment, directly within Matecat or via an API.

The MT Quality Estimator at Translated was validated by taking many samples (billions of segments) of different types of content of varying source quality and comparing the correlation between the T-QE scores and human quality assessments.
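As an illustration of this kind of validation, the sketch below computes a Pearson correlation between machine-predicted scores and human ratings for the same segments. The source does not say which correlation statistic Translated used, so Pearson is an assumption here, and the sample scores are invented for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 items)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: QE scores (0-1) and human ratings (1-5) per segment.
qe_scores = [0.92, 0.35, 0.78, 0.51, 0.66]
human_ratings = [5, 2, 4, 2, 3]

r = pearson(qe_scores, human_ratings)
print(f"correlation between QE scores and human ratings: {r:.2f}")
```

A value close to 1 indicates that the machine's predictions rank segments much as human reviewers do, which is the property that makes the scores usable for triage.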

The initial tests conducted by the ModernMT team suggest that T-QE scores are more accurate predictors for high-quality segments. It was also noted that lower-quality segments contained more UGC, had longer sentences, and were in general noisier.



The Key Benefits of MT Quality Estimation

Human review at a global content scale is costly and probably a physical impossibility because of the ever-increasing volumes. As the use of MT expands across the enterprise to drive international business momentum and as more automated technology is used, MTQE offers enterprises a way to identify the content that needs the least attention and the content that needs the most before it is released into the wild.

MTQE is an effective means to manage risk when an enterprise wishes to go multilingual at scale. Quality estimation can predict the quality of a given machine translation, allowing for corrections to be made before the final translation is published. MTQE identifies high-quality MT output that does not require human post-editing and thus makes it easier to focus attention on the lower-quality content, allowing for faster turnaround times and increased efficiency.

When a million sentences of customer-relevant content need to be published using MT, MTQE is a means to identify the ~10,000 sentences that most need human corrective attention to ensure that global customers receive acceptable quality across the board.

This informed identification of problems that need to be submitted for human attention is essential to allow for a more efficient allocation of resources and improved productivity. This process enables much more content to be released to global customers without risking brand reputation, while ensuring that desired quality levels are achieved.

When MTQE is paired with a highly responsive MT system, like ModernMT, it can accelerate the rate at which large volumes of customer-relevant content can be released and published for a growing global customer base.

MTQE provides great value in identifying the content that needs more attention and also identifying the content that can be used in its raw MT form, thus speeding up the rate at which new content can be shared with a global customer base.

“We believe that localization value comes from offering the right balance between quality and velocity,” says Conchita Laguardia, Senior Technical Program Manager at Citrix, and “the main benefit QE gives is the ability to release content faster and more often.”

Other ways that MTQE ratings can be used include:

  • Informing an end user or a localization manager about the overall estimated quality of translated content at a corpus level,
  • Identifying different kinds of matches in translation memory, e.g., an In-Context Exact (ICE) match is a type of translation match that guarantees a high level of appropriateness by the match having been previously translated in the same context. It is an exact match that occurs in exactly the same context, that is, the same location in a paragraph, which is better than a 100% match and better than fuzzy matches of 80% or less. These different types of TM matches can be processed in differently optimized localization workflows to maximize efficiency and productivity and are useful even in traditional localization work.
  • Deciding if a translation is ready for publishing or if it requires human post-editing,
  • Highlighting problematic content that needs to be revised and changed.

Routing content with lower MTQE scores into a workflow that is linked to a responsive, continuously learning, adaptive MT system like ModernMT makes for a powerful translation engine that can make large volumes of content multilingual without compromising overall translation quality.


Effective MTQE systems allow the enterprise to produce fast, higher-quality translations at low cost and safely increase the use of “raw MT”.

The MT Quality Estimator at Translated has been trained on a dataset comprising over 5 billion sentences from parallel corpora (source, MT output, and corrected output) and professional translations in various fields and languages. The AI identifies and learns the error correction patterns by training on these billions of sentences, and provides a reliable prediction of which segments are most likely to need no correction, thus efficiently directing translators to those low-scoring segments that are most likely to need correction. MTQE can be combined with ModernMT to automatically provide an overall MT quality score for a custom adaptive model, as well as a quality score for MT suggestions within Matecat.

When combined with a highly responsive MT system like ModernMT, it is also possible to improve the overall output quality of a custom MT model by focusing human review only on those sentences that fall below a certain quality score.
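A threshold-based workflow of this kind might look like the following sketch. The threshold value, segment IDs, and function names are assumptions chosen for illustration; in practice the cut-off would be tuned per content type and risk tolerance. The same scores also yield the corpus-level quality indicator mentioned earlier.

```python
def route_segments(scored_segments, publish_threshold=0.80):
    """Split segments into those publishable as raw MT and those needing review.

    scored_segments: list of (segment_id, qe_score); higher is better (assumed).
    publish_threshold: hypothetical cut-off, to be tuned per use case.
    """
    publish, review = [], []
    for segment_id, score in scored_segments:
        if score >= publish_threshold:
            publish.append(segment_id)
        else:
            review.append(segment_id)
    return publish, review


def corpus_score(scored_segments):
    """Average segment score, as a rough corpus-level quality indicator."""
    if not scored_segments:
        raise ValueError("no segments to score")
    return sum(score for _, score in scored_segments) / len(scored_segments)


# Hypothetical batch of scored segments.
batch = [("a", 0.93), ("b", 0.61), ("c", 0.85), ("d", 0.40)]

publish, review = route_segments(batch)
print("publish as raw MT:", publish)     # ['a', 'c']
print("send to human review:", review)   # ['b', 'd']
print(f"corpus-level score: {corpus_score(batch):.4f}")
```

Corrections made on the review set can then be fed back to the adaptive MT system, so the share of segments falling below the threshold should shrink over time.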

 

Salvo Giammarresi, head of localization of Airbnb, a company that has been beta-testing the service, says:

“Thanks to T-QE, Airbnb can systematically supervise the quality of content generated by users, which is processed through our custom MT models. This allows us to actively solicit professional translator reviews for critical content within crucial areas. This is vital to ensure that we are providing our clients with superior quality translations where it truly matters”.


Ongoing Evolution: Adaptive Quality Estimation

The ability to quickly identify errors and focus on reducing the size of the overall data set that needs to receive corrective feedback is an important goal of the MTQE technology. The focus is on identifying the most problematic segments and correcting them quickly. 

Any innovation that reduces the amount of data that needs to be reviewed to improve a larger corpus is valuable.

Thus, while the original MTQE error identification process uses the most common error patterns learned from the 5 billion-sentence generic dataset, the ModernMT team is also exploring the benefits of applying the adaptive approach to MTQE segment prediction.

The impact of this innovation is significant. The following hypothetical example illustrates the potential impact and reflects the experience of early testing. (This will, of course, vary depending on the dataset and data volume.)

For example, suppose that reviewing the 40% of sentences with the lowest scores from the generic MTQE model identifies 60% of the major problems in a corpus. With an adaptive MTQE model tuned on customer data, reviewing only the 20% of sentences with the lowest scores could identify 90% of the major problems in the same corpus.

This ability to improve the overall quality of the published corpus by looking at less data dramatically increases the efficiency of the MTQE-based improvement process. 
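The arithmetic behind this hypothetical example is worth spelling out. The short sketch below uses only the illustrative percentages quoted above, plus an assumed corpus size of one million sentences, to compare how many problems are found per unit of review effort.

```python
def review_yield(problems_found_pct, corpus_reviewed_pct):
    """Share of major problems found per share of corpus reviewed."""
    return problems_found_pct / corpus_reviewed_pct


CORPUS_SIZE = 1_000_000  # assumed corpus size, for illustration only

# Percentages taken from the hypothetical example in the text.
generic_yield = review_yield(problems_found_pct=60, corpus_reviewed_pct=40)
adaptive_yield = review_yield(problems_found_pct=90, corpus_reviewed_pct=20)

generic_reviewed = CORPUS_SIZE * 40 // 100   # sentences reviewed, generic model
adaptive_reviewed = CORPUS_SIZE * 20 // 100  # sentences reviewed, adaptive model

print(f"generic:  {generic_reviewed:,} sentences reviewed, yield {generic_yield}")
print(f"adaptive: {adaptive_reviewed:,} sentences reviewed, yield {adaptive_yield}")
print(f"adaptive is {adaptive_yield / generic_yield:.0f}x more efficient")
```

In this scenario the adaptive model finds half again as many problems while reviewing half as many sentences, which works out to three times the yield per sentence reviewed.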

This is technological leverage that benefits large-scale translation production.

T-QE is primarily designed and intended for high-volume enterprise users but is also available to translators in Matecat and to enterprises via API. 

 Please contact info@modernmt.com for more information. 


The Importance of User-Generated Content (UGC) and Listening to the Customer

 As the importance of establishing an ever-expanding digital corporate presence to build, enhance, and improve the customer experience for both B2C and B2B customers has gained momentum, companies are realizing the growing importance of what is known as User Generated Content (UGC).

Consumers trust authentic, unpaid recommendations from real customers more than any other type of content.

UGC consists of content such as text, videos, images, and reviews that are generated by real customers, influencers, and independent individuals rather than by the brands themselves. It is important to note that any modifications made to this content should only aim to enhance clarity, conciseness, or formality without altering the original message or quotes. This content focuses on customer experiences and includes reviews, testimonials, case studies, guest posts, comments in online communities and forums, collaborative webinars, podcasts, hosted events, social media posts, and PR campaigns, as well as partner, distributor, and vendor promotions. It can be utilized in numerous ways to educate both new and current customers about the potential brand experience.

UGC is clear evidence of direct customer feedback, often unsolicited. It is the voice of the customer in its purest form. The value and impact of UGC are even greater in eCommerce settings where this content is widely understood to be a primary driver for conversions and purchase motivation.

In the B2B context, UGC is more than just reviews and case studies, and should be considered to be "any content others create related to your business".

UGC is important in modern digital marketing for many reasons, as summarized below:

  • Authenticity: UGC is a more authentic and experiential form of content than corporate content because it is created by customers, free from artificial embellishments or supervision by brands. Consumers tend to trust UGC more than traditional advertising, and it serves as a contemporary variation of word-of-mouth marketing, a force that has always played a significant role in influencing consumer purchasing decisions.
  • Social Proof: UGC offers social proof that impacts the buyer's journey. It builds consumer confidence and is an extremely efficient strategy for a brand to influence its audience and convert them into customers. In simpler terms, social proof is the equivalent of a reference in a B2B setting or someone else's stamp of approval. UGC also facilitates community-building, which can result in greater loyalty and advocacy.
  • Unlimited Authentic and Unfiltered Content: UGC offers brands unrestricted, genuine, and unedited content to improve brand awareness and strengthen brand reputation. Brands that implement UGC show their willingness to engage in a two-way discussion, fostering more trusted and engaged relationships with consumers.
  • Cost-Effective: Generating marketing content can be a time-consuming and expensive process for an enterprise, which is why UGC is quickly becoming a critical component of digital marketing campaigns.
  • Increased Engagement and conversions: User engagement increases due to user-generated content, which is directly correlated with conversions. User-generated content validates and legitimizes your marketing message, leading to an increased likelihood of user conversion and higher sales.

While some marketers still believe that branded content is more trustworthy or preferable to user-generated content, research suggests otherwise. Customers consider authentic user-generated content (UGC) the most trustworthy content in both B2C and B2B contexts.


UGC has many benefits for businesses. Authentic and uncensored content can establish trust and credibility, as customers are more likely to believe and engage with content from peers and independent observers than from the brand itself. 

Today, most customers are cautious of claims of superiority made by brands and actively seek information from like-minded customers and independent observers to better understand the product or service during the buyer and customer journey.


Additionally, it is a cost-effective way for a business to create trusted content that can favorably influence engagement and build stronger relationships with customers at various stages in the buyer and customer journeys.

Furthermore, UGC provides valuable insights into customers' experiences and perspectives and enables the enterprise to engage with customers more deeply and effectively. Statistics show that consumers find UGC 9.8x more impactful than influencer content, and 79% of people say UGC highly impacts their purchasing decisions. Some of the most recent research also confirms that consumers rank authentic UGC as the most trustworthy content in their buyer journey.


Here are some recent statistics from reputable sources on the value and impact of UGC:

  • 64% of consumers agree that when a brand they like and use re-shares content by customers, they are more likely to share content about the brand or its products.
  • 76% of consumers have purchased a product because of someone else’s recommendation before.
  • 72% of consumers believe that reviews and testimonials submitted by customers are more credible than the brand talking about their products.
  • A study by Bazaarvoice showed that websites with UGC can see an increase of 29% in web conversions, a 20% increase in return visitors, and a 90% increase in time spent on-site.
  • Research by BrightLocal indicated that 79.69% of consumers look at ratings and reviews before making a purchase.
  • 6 in 10 marketers report that their audience engages more with UGC in marketing and communications channels than branded content.
  • 75.78% of consumers have used social media to search for or discover products, brands, and experiences. 
  • Three-quarters or more of travelers were active on at least one social media platform in 2019.
  • Cost-per-click has been seen to decrease by 50% with the addition of user-generated content in social media ads.
  • The majority of millennials, 66%, book their travel trips using their smartphone. A higher majority, 74%, said that they use their smartphone for research related to their travels. Again the most trusted content tends to be UGC and peer commentary on travel experience.

These statistics show that User Generated Content (UGC) is a valuable tool for marketers to establish trust, engagement, and loyalty with their audiences. Engaging with UGC helps marketers listen to their customers, understand their needs, and collaborate with them as co-marketers to create more compelling content. This engagement strategy enables marketers to attract new customers, foster brand loyalty, and increase customer satisfaction.

However, research indicates that many businesses still struggle to comprehend, utilize, and harness the potential of fast-moving, high-impact UGC content. Furthermore, most marketing organizations remain focused on developing and disseminating brand messages, rather than actively monitoring and engaging with the ongoing stream of customer feedback across social media and the internet.


The Translation Challenge & Perspective

As can be expected, the volume of user-generated data is constantly increasing in the modern era, and the challenge for the modern enterprise is to track it in all its most relevant variants and to set up translation production processes for the most important and relevant content.

According to World Economic Forum estimations, by 2025, the amount of data created by humans each day will be about 463 exabytes (one exabyte is equal to one billion gigabytes). As of 2021, we produce over 500 million tweets, ~300 billion emails, and 4 million gigabytes of Facebook data every single day.

While this data has been concentrated primarily in G7 economies in the past, this is expected to shift significantly as economic growth continues to surge in the Global South and South Asia over the next two decades. As a result, global business leaders must master the skills to listen, share, communicate, translate, and comprehend various content streams in an expanding array of languages. The languages that hold the utmost relevance at present may not retain the same level of significance in the upcoming decades.

This will require leading global businesses to be multilingual along all of the following content dimensions:

Social Media Content: As social media grows into a better search engine, it’s up to marketers to create searchable content. Many buyers request user-generated content along their buying journey and this should be easily accessible as they peruse and investigate your site. Here are some examples of B2B use of social media as a digital marketing channel.

Multilingual Email Content: Personalized email content that enables quick and effortless retrieval of User Generated Content (UGC) and reviews, and prompts customers to share their feedback for future content development.

Digital Advertising: There is a clear trend towards more video/audio content, along with a strong preference for access to genuine user-generated reviews, forums, and discussions.

Web Content: Customers crave reviews from others with similar needs. The inclusion of visual reviews on your website and product pages, in addition to user-generated content, can create the feedback loop necessary to satisfy your audience's desires.

Brand Content: Branded content mixed with relevant and specific user-generated content that addresses the evaluation issues raised by many customers is crucial. However, numerous consumers only consult branded content after they have already satisfied themselves with other customer opinion data. Buyers are 4-6 times more likely to purchase from purpose-driven companies that they advocate for through UGC and word-of-mouth referrals. Moreover, the addition of UGC in social media ads has been shown to decrease cost-per-click by 50%, and 6 out of 10 marketers report that their audience engages more frequently with UGC in marketing and communications channels than with branded content.

The truth is that today, the #1 marketing channel used by most companies is social media and the brand's website is the second most used marketing channel, especially in B2C settings.

Measuring the success of a UGC campaign involves tracking key performance indicators (KPIs) that align with overall business goals. These can vary by language and can thus help to identify the most and least receptive markets. Here are some KPIs and metrics to consider when evaluating the success of a UGC campaign:

  1. Engagement Metrics: Monitor likes, comments, shares, and clicks to understand the impact of UGC on audience engagement.
  2. Reach and Impressions: Measure the number of people who see your UGC and the total number of times it's displayed.
  3. UGC Volume: Track the total number of user-generated posts, reviews, or other content forms associated with your brand.
  4. Conversion Rates: Analyze how UGC influences customer behavior, such as driving traffic to your website, increasing sales, or prompting sign-ups for newsletters.
  5. Content Performance Metrics: Track metrics tied to specific goals, pieces of content, or distribution channels, such as impressions, reach, engagement, clicks, conversions, sales, revenue, or customer loyalty.
  6. ROI Calculation: Consider factors like content creation costs, budget spent on paid social ads, the value of your visual content library, cost per click (CPC), and overall conversions when calculating the ROI of your UGC campaign.

To be able to participate effectively in the global market, an enterprise will need not only the most streamlined and efficient translation production capabilities but also infrastructure and processes that continually improve and adapt to changing customer requirements.

This is precisely the solution that has been developed by Translated to enable any global enterprise to meet this content deluge challenge successfully. This is a solution and a technology that has been developed in close collaboration with clients who have focused on serving customers who have expressed a preference for having multilingual content access at scale, particularly for more dynamic real-time UGC which informs evaluation and purchase decisions.


Unveiling Hyper Adaptive ModernMT


Translated recently announced a new model of ModernMT, its adaptive machine translation (MT) system. The new model, called Hyper Adaptive, enables companies to translate billions of words at ultra-fast speeds without compromising quality. It is domain-specific and designed for use cases such as translating user-generated content, datasets for multilingual large language models, and web content for data mining activities.

In recent years, companies have approached Translated with requests to leverage the accuracy of ModernMT's adaptive MT system to quickly translate specialized, unique content and high volumes of ongoing content. While a generic adaptive MT model can handle the request to some extent, it is not designed to translate millions of words per minute in a specific domain.

Hyper Adaptive solves this issue by using sophisticated compression techniques and training the MT model for specific use cases based on the customer's previous translations and translation memories (TMs) to ensure high-quality performance even at a scale of many billions of words a month.

The resulting highly specialized MT model is much smaller and more efficient than a generic adaptive model and can process content at ultra-fast speeds, in as little as 50ms for a typical sentence. To illustrate the performance capability at Translated's dedicated data centers: it can translate the entire English Wikipedia (4.4 billion words) into another language in less than a day, a sustained rate of just over 3 million words per minute. By training directly using customer data, the Hyper Adaptive model achieves translation accuracy equal to or better than state-of-the-art custom adaptive MT models.
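The throughput figures can be sanity-checked with simple arithmetic. The sketch below uses only the numbers quoted above (4.4 billion words, 3 million words per minute, one day); the function names are illustrative.

```python
WIKIPEDIA_WORDS = 4_400_000_000  # English Wikipedia size quoted in the text
MINUTES_PER_DAY = 24 * 60


def minutes_to_translate(total_words, words_per_minute):
    """Time needed to translate a corpus at a constant throughput."""
    return total_words / words_per_minute


def required_words_per_minute(total_words, minutes_available):
    """Throughput needed to finish a corpus within a time budget."""
    return total_words / minutes_available


hours_at_3m = minutes_to_translate(WIKIPEDIA_WORDS, 3_000_000) / 60
rate_for_one_day = required_words_per_minute(WIKIPEDIA_WORDS, MINUTES_PER_DAY)

print(f"at exactly 3M words/min: {hours_at_3m:.1f} hours")
print(f"rate needed to finish in 24 hours: {rate_for_one_day:,.0f} words/min")
```

At exactly 3 million words per minute the job takes about 24.4 hours, so finishing within a day implies a sustained rate slightly above 3 million words per minute. The quoted rate is best read as a rounded figure.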

When very high throughput is required, MT systems typically have to compromise on output quality: there is a trade-off between quality and throughput. In contrast, this solution helps companies maintain high quality even when translating massive volumes of content at ultra-high speeds.

In some specific use cases, such as dynamically changing user-generated content, combining the dynamically learning adaptive MT model with ongoing professional translator corrective feedback can further improve the quality of the MT output over time. 

Even though the model is optimized for throughput, it is still adaptive, and thus it continues to improve after initial training through ongoing corrective feedback and the addition of new TMs that match the company's style.

As agile global enterprises scale up to translating billions of words a month, solutions like Hyper Adaptive ModernMT allow continuous daily improvement while still translating billions of words of relevant UGC into over 200 languages every day.

"We designed the Hyper Adaptive model to enable the translation of content that has never been translated before. Its language coverage allows companies to reach over 99% of the world's population in their own language. Hyper Adaptive is one more step towards global understanding."

Marco Trombetti – Translated CEO

Integration and Costs

Like all other ModernMT models, the Hyper Adaptive model can be integrated into the translation workflow via API. Costs vary depending on the use case, the amount of data to be translated, and the amount and quality of existing translations and TMs. Existing Translated customers can contact their account manager to get a new service quote.
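For readers planning a high-volume integration, the general shape of such a workflow can be sketched as follows. This is a minimal sketch under stated assumptions, not the ModernMT API: `translate_batch` is a hypothetical stand-in for whatever client call the real service exposes, and the batch size is arbitrary. It only illustrates grouping segments into batches before sending them to an MT endpoint.

```python
from typing import Callable, Iterable, Iterator, List


def batched(segments: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` segments."""
    batch: List[str] = []
    for segment in segments:
        batch.append(segment)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


def translate_all(
    segments: Iterable[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 2,
) -> List[str]:
    """Translate every segment, one batch per call, preserving order."""
    translations: List[str] = []
    for batch in batched(segments, batch_size):
        translations.extend(translate_batch(batch))
    return translations


def fake_translate_batch(batch: List[str]) -> List[str]:
    """Hypothetical stand-in for a real MT API call (assumption)."""
    return [f"[it] {segment}" for segment in batch]


result = translate_all(["Hello", "Good morning", "Thanks"], fake_translate_batch)
print(result)
```

Passing the translation call in as a parameter keeps the batching logic independent of any particular MT provider, so the stand-in can be replaced with a real client without changing the rest of the workflow.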

"Thanks to the Hyper Adaptive model, user-generated content on Airbnb has reached an unprecedented level of quality, greatly improving the experience for our user base. The real-time, high-quality translation of UGC has helped Airbnb foster a stronger sense of community among our hosts and guests, which has had a tremendous impact on our business."

Salvo Giammarresi – Head of Localization at Airbnb