<h1>An Overview of ModernMT V7</h1><p>eMpTy Pages, by Kirti Vashee, December 8, 2023</p><p> Serious MT technology development requires ongoing efforts and
research to continually improve the performance of systems and to
address important emerging requirements as the use of MT expands.
Researchers have been working on MT for over 70 years, and success requires a sustained and continuing effort.</p><h3 style="text-align: left;"><b><span style="font-size: medium;">These efforts approach the goal of producing output as close as possible to human quality in multiple ways, and the improvement strategies can be summarized as follows:</span></b></h3><ol><li>Acquire <strong>better and higher volumes of relevant training data.</strong>
Any AI initiative is highly dependent on the quality and volume of the
training data that is used to teach the machine to properly perform the
task.</li><li>Evaluate <strong>new algorithms that may be more effective</strong>
in extracting improved performance from available training data. We have seen data-driven MT technology evolve from Statistical MT (SMT)
to various forms of Neural MT (NMT) using different forms of deep
learning. The Transformer architecture, which also powers LLMs such as GPT-4, is the state of the art in NMT today.</li><li>Use <strong>more powerful computing resources</strong>
to dig deeper into the data to extract more learning. As the demand for
translation grows with the massive increases in content and
ever-expanding volumes of user-generated content (UGC), it becomes increasingly important for MT to handle massive scale. Today there are global enterprises translating billions of words a month into a growing portfolio of languages, and thus scalability is now a key requirement for enterprise MT solutions. Some researchers use more
computing during the training phase of the MT model development process
as there can be quality advantages gained at inference from doing this
extra-intensive training. </li><li>Build <strong>more responsive and integrated human-machine collaboration processes </strong>to
ensure that expert human feedback is rapidly incorporated into the core
data used to tune and improve these MT engines. While the benefits
gained from more and better data, improved algorithms, and more
computing resources are useful, <strong>the integration of expert human feedback into the MT model's continuous learning is a distinctive advantage that allows an MT model to significantly outperform models where only data, algorithms, and compute are used.</strong></li><li><strong>Add special features</strong>
that address the unique needs of large groups of users, or use cases
that are being deployed. As the use of MT continues to build momentum in the enterprise, many specialized requirements emerge, e.g., enforcement of specific terminology for brand integrity, profanity filters to avoid egregious MT errors, and improved document-specific content awareness.</li></ol><p>All these different
approaches share the goal of improving MT output quality, and progress along all of these fronts is needed to achieve the best results. </p><h1 style="text-align: left;"><span style="color: #2b00fe;">The
ModernMT development team continually pursues improvements along all these fronts, and ModernMT V7 is the result of several measured advances across many of these dimensions. </span></h1><p></p><p>As machine translation (MT) continues to
evolve and expand beyond the traditional use case areas such as
e-commerce, global collaboration, and customer care, those interested in
the expanding future of localization are now also looking to use
generative artificial intelligence (AI) and, in particular, large
language models (LLMs) such as OpenAI’s GPT.</p><p><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgtyv5789L4AS0ZwEiQj3LcdZsgNJAsDLf8F-UB3luo0-1X_AUp-HaORcdcUnaUhCE0dR0-DCyhPjQUV1HiDgciEwUVcXB1l6lV9S_inks4m33aEYGOlRAF3D1suefkE-bxhoDTWwib2mbC3NVhTMYspkjaasxftHc8hi5mXIyjRQOD-0ujT8N9O3IUzL-o/s1600/Senza-titolo-2.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="680" data-original-width="1600" height="272" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgtyv5789L4AS0ZwEiQj3LcdZsgNJAsDLf8F-UB3luo0-1X_AUp-HaORcdcUnaUhCE0dR0-DCyhPjQUV1HiDgciEwUVcXB1l6lV9S_inks4m33aEYGOlRAF3D1suefkE-bxhoDTWwib2mbC3NVhTMYspkjaasxftHc8hi5mXIyjRQOD-0ujT8N9O3IUzL-o/w640-h272/Senza-titolo-2.png" width="640" /></a></p><p>Unlike typical Neural MT, LLMs prioritize fluency over accuracy. But
while LLMs show promising results in improving the fluency of
translations, they can also produce confabulations (hallucinations),
i.e., output that is inaccurate or unrelated to the input data, and thus
require careful monitoring and oversight to ensure accuracy.</p><p>With the <a href="https://translated.com/modernmt-7-with-trust-attention?ref=blog.modernmt.com" rel="noreferrer noopener">latest release of ModernMT (V7)</a>, Translated has introduced <strong>a novel technique to increase the accuracy of neural MT models</strong>, called “Trust Attention,” which can also be used to <strong>address reliability within generative AI models</strong>.</p><h3 style="text-align: left;"><span style="color: #2b00fe;">The
design and implementation of Trust Attention were inspired by how the
human brain prioritizes trusted sources in the learning process, linking
the origin of data to its impact on translation quality.</span></h3><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYWZo3-MNXhn-Gyy-s-ysRMjLd1SfWBoEZPCjPIl4ZvQkr6TxOllT48fj_nqaLgXurRxs53zypSMUGfr8FpEy4cB3zm-Evw9inuGpq8Kw8flDcynb_FeKdMwsA83YojTrIsY-ViXXifRj3GdqKdrSjraatBRnibivp0CS8JkgTN6HLWdCNuvF9Og7hdykS/s1600/trust-attention-to-boost-quality-01.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="457" data-original-width="1600" height="114" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYWZo3-MNXhn-Gyy-s-ysRMjLd1SfWBoEZPCjPIl4ZvQkr6TxOllT48fj_nqaLgXurRxs53zypSMUGfr8FpEy4cB3zm-Evw9inuGpq8Kw8flDcynb_FeKdMwsA83YojTrIsY-ViXXifRj3GdqKdrSjraatBRnibivp0CS8JkgTN6HLWdCNuvF9Og7hdykS/w400-h114/trust-attention-to-boost-quality-01.png" width="400" /></a></div><br /><div><p>ModernMT V7 preferentially uses the most trusted data (identified by
users), and thus the highest-quality and most valuable training data has
the greatest influence on how a model performs. This is in stark contrast to most MT models, which make no distinction based on data quality and thus rely on statistical density alone as the primary driver of model performance. </p><p>The Trust Attention capability
prioritizes its learning based on data value and importance, much as humans sift through multiple sources of information to
identify the most trustworthy and reliable ones. Data extracted from
translations performed and reviewed by professional translators is
always preferred over other data, especially unverified translation
memory content acquired from web crawling, which is typically used by
most MT systems today.</p><h3 style="text-align: left;"><strong><span style="color: #2b00fe;">The
development team at ModernMT considers Trust Attention to be as
significant an innovation as Dynamic Adaptive MT engines. It is the
kind of feature that can dramatically improve MT system performance for
different use cases when properly used.</span></strong></h3><p></p><p>According to an evaluation by professional translators, done to validate the beneficial impact, <a href="https://blog.modernmt.com/modernmt-introduces-trust-attention-to-improve-mt-quality/" rel="noreferrer noopener">Trust Attention alone improves MT quality by up to 42%</a>,
and by an average of 16.5% across the top 50 languages.
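</p><p>Translated has not published the internals of Trust Attention, but the idea of tying a training example's influence to the trustworthiness of its origin can be illustrated with a simple provenance-weighted loss. The origin categories, trust scores, and function below are a hypothetical sketch, not ModernMT's actual implementation:</p>

```python
# Hypothetical sketch of provenance-weighted training, in the spirit of
# Trust Attention. The origin labels and trust scores are invented for
# illustration; ModernMT's real mechanism is not public.

# Assumed trust scores per data origin (higher = more trusted).
TRUST = {
    "professional_reviewed": 1.0,  # translated and reviewed by professionals
    "professional": 0.8,           # professionally translated, unreviewed
    "client_tm": 0.5,              # unverified client translation memory
    "web_crawl": 0.2,              # parallel data mined from web crawling
}

def trust_weighted_loss(per_sample_losses, origins):
    """Average per-sample losses, scaling each by the trust of its origin,
    so trusted data contributes more to the gradient signal."""
    weights = [TRUST[origin] for origin in origins]
    total = sum(w * loss for w, loss in zip(weights, per_sample_losses))
    return total / sum(weights)

# A reviewed segment and a web-crawled segment: with these assumed scores,
# the reviewed one carries five times the weight of the crawled one.
loss = trust_weighted_loss([1.0, 3.0], ["professional_reviewed", "web_crawl"])
```

<p>Under a weighting like this, the aggregate loss stays much closer to the trusted sample's loss, which is one concrete sense in which the origin of data can be linked to its impact on translation quality.</p><p>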
Interestingly, even many high-resource languages, such as Italian and
Spanish, showed significant improvements (in the 30% range) in human
evaluations.</p></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPyU-R8iWG04gQMFuRN9rXxKnX_M9x1vfoOFQFLhDm76PhLusnbC-yxbPdvx-gfzIDfl2Qt1JwAp4QxTiUQGBthyphenhyphenVomGmf1NXShOHjL1c_1Ta7R-pMHTpXn-gLyjkEbWGAbv4G1r4q_55lYh13EuZKou3Kb8o7YBrcqdGbr7mgczwphwW-q7EeElyBYdiw/s1600/trust-attention-to-boost-quality.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="557" data-original-width="1600" height="139" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPyU-R8iWG04gQMFuRN9rXxKnX_M9x1vfoOFQFLhDm76PhLusnbC-yxbPdvx-gfzIDfl2Qt1JwAp4QxTiUQGBthyphenhyphenVomGmf1NXShOHjL1c_1Ta7R-pMHTpXn-gLyjkEbWGAbv4G1r4q_55lYh13EuZKou3Kb8o7YBrcqdGbr7mgczwphwW-q7EeElyBYdiw/w400-h139/trust-attention-to-boost-quality.png" width="400" /></a></div><br /><h3 id="modernmt-v7-new-features-up-to-60-better-mt-quality"><strong>ModernMT V7 New Features: Up to 60% Better MT Quality</strong></h3><p>ModernMT V7 is the evolution of Translated’s renowned adaptive MT system, recognized as a <a href="https://translated.com/machine-translation-leader-IDC?ref=blog.modernmt.com" rel="noreferrer noopener">leader in the Machine Translation Software Vendor Assessment</a>
for enterprises by IDC Marketscape 2022, and as “the most advanced
implementation of responsive MT for enterprise use” in CSA Research’s
2023 <a href="https://csa-research.com/ModernMT?ref=blog.modernmt.com" rel="noreferrer noopener">Vendor Briefing</a>.</p><p>In addition to Trust Attention, ModernMT V7 includes several other new features that <strong>further enhance the reliability and dependability of MT output</strong>. Here are the most impactful:</p><ul><li><a href="https://translated.com/brand-specific-terminology-in-modernmt?ref=blog.modernmt.com" rel="noreferrer noopener"><strong>Advanced Terminology Control</strong></a>: Along with its ability to learn the client’s terminology from past translations, ModernMT now provides companies with <strong>self-managed glossary control to ensure brand and context-specific terminology consistency</strong>.
Explicit terminology enforcement was less critical in the past
because the dynamic adaptive MT technology acquires terminology very
effectively even without this feature.</li><li><strong>DataClean AI</strong>: V7 relies on a new sanitization algorithm that identifies and removes poor-quality data to refine the training material and <strong>reduce the likelihood of hallucinations</strong>.
Close examination of errors over many years has provided clues about the root causes of anomalous output from MT engines. This learning and
related benefits also transfer to LLM-based MT engines should they
become more viable in the future.</li><li><a href="https://translated.com/expanded-document-context?ref=blog.modernmt.com" rel="noreferrer noopener"><strong>Expanded Context</strong></a>:
ModernMT can now leverage up to 100,000 words of document context (four times more than GPT-4) to preserve style and terminology preferences, <strong>providing unparalleled document-specific accuracy</strong> in MT suggestions and offering controls to solve persistent problems such as <strong>gender bias and inconsistent terminology</strong>.</li><li><strong>Profanity Filter</strong>: V7 masks words in translation suggestions that could be regarded as inappropriate in the target language, <strong>minimizing the possibility of cultural offenses</strong>.</li></ul><div><br /></div><h3 style="text-align: left;"><span style="color: #2b00fe;">The
combined effect of all the improvements and innovations described above
has a significant impact on the overall performance and capabilities of
ModernMT.</span></h3><h3 style="text-align: left;"><span style="color: #2b00fe;">The MT quality is now considered to be </span><strong><span style="color: #2b00fe;">45% to 60% better than the previous version according to systematic human evaluations</span></strong><span style="color: #2b00fe;">.</span></h3><div><span style="color: #2b00fe;"><br /></span></div><p></p><p>These
improvements have greatly reduced the Time to Edit (TTE) for MT
suggestions. At the end of July, the aggregate TTE measured across tens
of thousands of samples showed a 20% reduction, reaching a record low of
1.74 seconds. This milestone indicates an acceleration towards <a href="https://translated.com/speed-to-singularity?ref=blog.modernmt.com" rel="noreferrer noopener">singularity in translation</a>, a trend further supported by preliminary TTE data collected continuously since the 1.74 seconds record was established.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgFJCumBvrYfTQHyMR2GiDD92N7HiTTe1wj_sMvOhxNE96AISgYFm7ROh-LZXDjkTrw8KuDTDJx8cqSxQnGLlCf7K_1dWQAAukkEA944SGhl8UMsDnUy_mz-ps6uPQFtybLC721YN3EoJxm7DvovniAKkJGZcitaFejMW0OSGBxHDcZ6CfYvLkHgiaf2iu1/s1999/Senza-titolo-2-copia.jpg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="941" data-original-width="1999" height="189" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgFJCumBvrYfTQHyMR2GiDD92N7HiTTe1wj_sMvOhxNE96AISgYFm7ROh-LZXDjkTrw8KuDTDJx8cqSxQnGLlCf7K_1dWQAAukkEA944SGhl8UMsDnUy_mz-ps6uPQFtybLC721YN3EoJxm7DvovniAKkJGZcitaFejMW0OSGBxHDcZ6CfYvLkHgiaf2iu1/w400-h189/Senza-titolo-2-copia.jpg" width="400" /></a></div><h1 style="text-align: left;"><strong>The Hallmark of the Symbiosis Between Translators and MT</strong></h1><p>ModernMT V7 is <strong>available in 200 languages</strong> and <a href="https://blog.modernmt.com/modernmt-significantly-expands-language-coverage/" rel="noreferrer noopener">covers all the fastest-growing economies</a>
likely to emerge over the next 20 years. Its hallmark is the ability of
the MT model to learn from corrections in real time, enabling a <strong>powerful collaboration between the expertise of professional translators and the speed and capacity of MT</strong>. </p><p>Thanks
to this unique approach, combined with Translated’s vast community of
professional translators and leading AI-enabled localization solutions (<a href="https://translated.com/gartner-market-guide-ai-translation-services-2022?ref=blog.modernmt.com" rel="noreferrer noopener">Gartner 2022</a>), Airbnb was able to <a href="https://drive.google.com/file/d/1ob9AHgqRFJc_vkNK6JKbYgk2TTf97be7/view?usp=sharing&ref=blog.modernmt.com" rel="noreferrer noopener">ditch the translate button</a>
and simply make multilingual content pervasive and comprehensive across the platform, becoming one of the top 3 global brands (<a href="https://globalbydesign.com/2023/02/25/the-best-25-global-websites-from-the-2023-web-globalization-report-card/?ref=blog.modernmt.com" rel="noreferrer noopener">Global by Design 2023</a>).</p><p>Success stories like that of Airbnb and others, along with market
research that shows the ever-growing demand for more multilingual
content, have led Translated to estimate that once MT reaches what is
commonly referred to as “parity with human translation” (<a href="https://translated.com/singularity-in-AI-impact-on-translation-industry?ref=blog.modernmt.com" rel="noreferrer noopener">singularity in translation</a>), we can expect <strong>a 100-fold increase in MT requests alongside a 10-fold growth in demand for professional translations</strong>.</p><p>We
are entering a new era in which significantly larger volumes of content
will be translated automatically. In this scenario, professional
translators play an increasingly important role, not only in guiding the
MT through the adaptive process but also in ensuring that the key
messages are appropriately conveyed. By engaging the best translators
with the best adaptive MT, companies can now take on projects that
simply weren’t feasible before.</p><h3 id="towards-llms-for-translation" style="text-align: left;"><strong>Moving Towards LLMs for Translation</strong></h3><p>Recently, Translated conducted a <strong>large-scale study to compare the performance of the most advanced MT systems with LLMs in terms of enterprise readiness</strong>.
The findings showed real potential for LLMs, particularly in terms of
more fluent translation quality, and also revealed areas where
improvements are needed. Based on this research, Translated believes
elements of <strong>both MT systems and LLMs will be critical as we move forward</strong>, and plans to provide in-depth insights into using LLMs in translation in the coming weeks and months.</p><p>Comments by John Tinsley of Translated SRL on LLM-based translation in November 2023:</p><p>❗ LLMs - the new default for machine translation ❗</p><p>I've seen a lot of commentary along these lines over the past few months. I've also seen a lot of well-articulated commentary, not strictly opposing this line, but with added nuance and context (a challenge on the internet!)</p><p>I wanted to offer my two cents, from being at the forefront of these developments through actually building the software, and from having many conversations with clients.</p><p>In summary, today, LLMs are not fit for purpose as a drop-in replacement for MT for enterprises.</p><p>More broadly, any general-purpose GPT application will find it super challenging to outperform a purpose-built enterprise solution that considers an entire workflow in a holistic way (note, the purpose-built solution could be GPT-based itself, but with a much narrower scope).</p><p>🧠 As a concrete example, at Translated, we've built a version of ModernMT that uses GPT-4 as a drop-in replacement for our Transformer model (while retaining the framework in ModernMT that allows us to do real-time adaptation). We've also built, and continue to test, a version of ModernMT with other open-source LLMs fine-tuned for translation.</p><p>While we find that they perform well in terms of quality on some content types and some languages, it's far from unanimous across the board. And that's just quality. Other critical enterprise factors such as speed, cost, and, importantly, information security, are just not there yet. Similarly, language coverage for LLMs is a challenge as there are large discrepancies in performance, particularly for content generation.</p><p>I appreciate there's a lot of downward pressure today to use AI across workflows, particularly in localization teams for translation and content creation. Let me hop on my soapbox to give you some information that might help with those conversations...</p>
BlinkMacSystemFont, "Segoe UI", Roboto, "Helvetica Neue", "Fira Sans", Ubuntu, Oxygen, "Oxygen Sans", Cantarell, "Droid Sans", "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Emoji", "Segoe UI Symbol", "Lucida Grande", Helvetica, Arial, sans-serif" style="background-color: white; font-size: 14px;">📣 If you're using MT, you're already using very advanced AI! 📣</span><span color="rgba(0, 0, 0, 0.9)" face="-apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", Roboto, "Helvetica Neue", "Fira Sans", Ubuntu, Oxygen, "Oxygen Sans", Cantarell, "Droid Sans", "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Emoji", "Segoe UI Symbol", "Lucida Grande", Helvetica, Arial, sans-serif" style="background-color: white; border: var(--artdeco-reset-base-border-zero); box-sizing: inherit; font-size: 14px; line-height: inherit; margin: var(--artdeco-reset-base-margin-zero); outline: var(--artdeco-reset-base-outline-zero); padding: var(--artdeco-reset-base-padding-zero); vertical-align: var(--artdeco-reset-base-vertical-align-baseline);"><br style="box-sizing: inherit; line-height: inherit;" /></span><span color="rgba(0, 0, 0, 0.9)" face="-apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", Roboto, "Helvetica Neue", "Fira Sans", Ubuntu, Oxygen, "Oxygen Sans", Cantarell, "Droid Sans", "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Emoji", "Segoe UI Symbol", "Lucida Grande", Helvetica, Arial, sans-serif" style="background-color: white; border: var(--artdeco-reset-base-border-zero); box-sizing: inherit; font-size: 14px; line-height: inherit; margin: var(--artdeco-reset-base-margin-zero); outline: var(--artdeco-reset-base-outline-zero); padding: var(--artdeco-reset-base-padding-zero); vertical-align: var(--artdeco-reset-base-vertical-align-baseline);"><br style="box-sizing: inherit; line-height: inherit;" /></span><span color="rgba(0, 0, 0, 0.9)" face="-apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", Roboto, "Helvetica Neue", "Fira Sans", Ubuntu, Oxygen, "Oxygen Sans", 
Cantarell, "Droid Sans", "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Emoji", "Segoe UI Symbol", "Lucida Grande", Helvetica, Arial, sans-serif" style="background-color: white; font-size: 14px;">You probably already know that the T in GPT stands for Transformer. But did you know that the Transformer was invented at Google in 2017...specifically for machine translation!? So what we're seeing today is a repurposing of that technology for a different application (generative AI) other than translation.</span><span class="white-space-pre" color="rgba(0, 0, 0, 0.9)" face="-apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", Roboto, "Helvetica Neue", "Fira Sans", Ubuntu, Oxygen, "Oxygen Sans", Cantarell, "Droid Sans", "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Emoji", "Segoe UI Symbol", "Lucida Grande", Helvetica, Arial, sans-serif" style="background-color: white; border: var(--artdeco-reset-base-border-zero); box-sizing: inherit; font-size: 14px; line-height: inherit; margin: var(--artdeco-reset-base-margin-zero); outline: var(--artdeco-reset-base-outline-zero); padding: var(--artdeco-reset-base-padding-zero); vertical-align: var(--artdeco-reset-base-vertical-align-baseline); white-space: pre;"> </span><span color="rgba(0, 0, 0, 0.9)" face="-apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", Roboto, "Helvetica Neue", "Fira Sans", Ubuntu, Oxygen, "Oxygen Sans", Cantarell, "Droid Sans", "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Emoji", "Segoe UI Symbol", "Lucida Grande", Helvetica, Arial, sans-serif" style="background-color: white; border: var(--artdeco-reset-base-border-zero); box-sizing: inherit; font-size: 14px; line-height: inherit; margin: var(--artdeco-reset-base-margin-zero); outline: var(--artdeco-reset-base-outline-zero); padding: var(--artdeco-reset-base-padding-zero); vertical-align: var(--artdeco-reset-base-vertical-align-baseline);"><br style="box-sizing: inherit; line-height: inherit;" /></span><span color="rgba(0, 0, 0, 0.9)" 
face="-apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", Roboto, "Helvetica Neue", "Fira Sans", Ubuntu, Oxygen, "Oxygen Sans", Cantarell, "Droid Sans", "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Emoji", "Segoe UI Symbol", "Lucida Grande", Helvetica, Arial, sans-serif" style="background-color: white; border: var(--artdeco-reset-base-border-zero); box-sizing: inherit; font-size: 14px; line-height: inherit; margin: var(--artdeco-reset-base-margin-zero); outline: var(--artdeco-reset-base-outline-zero); padding: var(--artdeco-reset-base-padding-zero); vertical-align: var(--artdeco-reset-base-vertical-align-baseline);"><br style="box-sizing: inherit; line-height: inherit;" /></span><span color="rgba(0, 0, 0, 0.9)" face="-apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", Roboto, "Helvetica Neue", "Fira Sans", Ubuntu, Oxygen, "Oxygen Sans", Cantarell, "Droid Sans", "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Emoji", "Segoe UI Symbol", "Lucida Grande", Helvetica, Arial, sans-serif" style="background-color: white; font-size: 14px;">There will come a day, possibly soon, when it's better across the board to use LLMs for translation. When that happens, it will become the standard and people will stop talking about it. 
Just like when Neural MT came on the scene ~6 years ago.</span><span color="rgba(0, 0, 0, 0.9)" face="-apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", Roboto, "Helvetica Neue", "Fira Sans", Ubuntu, Oxygen, "Oxygen Sans", Cantarell, "Droid Sans", "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Emoji", "Segoe UI Symbol", "Lucida Grande", Helvetica, Arial, sans-serif" style="background-color: white; border: var(--artdeco-reset-base-border-zero); box-sizing: inherit; font-size: 14px; line-height: inherit; margin: var(--artdeco-reset-base-margin-zero); outline: var(--artdeco-reset-base-outline-zero); padding: var(--artdeco-reset-base-padding-zero); vertical-align: var(--artdeco-reset-base-vertical-align-baseline);"><br style="box-sizing: inherit; line-height: inherit;" /></span><span color="rgba(0, 0, 0, 0.9)" face="-apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", Roboto, "Helvetica Neue", "Fira Sans", Ubuntu, Oxygen, "Oxygen Sans", Cantarell, "Droid Sans", "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Emoji", "Segoe UI Symbol", "Lucida Grande", Helvetica, Arial, sans-serif" style="background-color: white; border: var(--artdeco-reset-base-border-zero); box-sizing: inherit; font-size: 14px; line-height: inherit; margin: var(--artdeco-reset-base-margin-zero); outline: var(--artdeco-reset-base-outline-zero); padding: var(--artdeco-reset-base-padding-zero); vertical-align: var(--artdeco-reset-base-vertical-align-baseline);"><br style="box-sizing: inherit; line-height: inherit;" /></span><span color="rgba(0, 0, 0, 0.9)" face="-apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", Roboto, "Helvetica Neue", "Fira Sans", Ubuntu, Oxygen, "Oxygen Sans", Cantarell, "Droid Sans", "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Emoji", "Segoe UI Symbol", "Lucida Grande", Helvetica, Arial, sans-serif" style="background-color: white; font-size: 14px;">When it happens, Translated will have already deployed it in ModernMT and worked out the best way for you to 
adapt it to your business. We already have a lot of ideas. We already have a lot of data from the testing I mentioned earlier. And in the meantime, we still have what I believe to be the most complete enterprise translation solution available.</span><span color="rgba(0, 0, 0, 0.9)" face="-apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", Roboto, "Helvetica Neue", "Fira Sans", Ubuntu, Oxygen, "Oxygen Sans", Cantarell, "Droid Sans", "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Emoji", "Segoe UI Symbol", "Lucida Grande", Helvetica, Arial, sans-serif" style="background-color: white; border: var(--artdeco-reset-base-border-zero); box-sizing: inherit; font-size: 14px; line-height: inherit; margin: var(--artdeco-reset-base-margin-zero); outline: var(--artdeco-reset-base-outline-zero); padding: var(--artdeco-reset-base-padding-zero); vertical-align: var(--artdeco-reset-base-vertical-align-baseline);"><br style="box-sizing: inherit; line-height: inherit;" /></span></p><p><br /></p><p><br /></p><figure class="kg-card kg-image-card kg-width-full"><br /></figure>Kirti Vasheehttp://www.blogger.com/profile/16795076802721564830noreply@blogger.com0tag:blogger.com,1999:blog-6748877443699290050.post-56637582727753179232023-12-07T17:38:00.000-08:002023-12-07T17:39:51.655-08:00Prioritization of Trustworthy Data in NMT Model Development<p> </p><h3 style="text-align: left;">ModernMT: A History of Innovation and Evolution</h3><p>Neural
machine translation (NMT) has had impressive evolutionary progress over
the last five years, showing continually improving performance in
accuracy. This progress is especially marked with the dynamically adaptive NMT models like ModernMT, where small amounts of ongoing corrective expert feedback result in continuously improving MT
output quality.</p><p>The historical track record with ModernMT has been
so impressive that it did not seem unreasonable to point out that
ModernMT's performance across billions of samples and many languages
was <a href="https://blog.modernmt.com/the-march-towards-singularity/">approaching singularity in production-use</a>
scenarios. This is a point at which human editors are unable to tell
whether a sample comes from a human or a machine, since the two are so close in quality and style.</p><p>NMT technology continues to evolve and
improve with recent updates that provide much richer and more granular
document-level contextual awareness. Document-level adaptation in
machine translation has been a core design intention with ModernMT from
the outset. This originally involved referencing similar sentences in
translation memories and using these to influence new translation
requests. </p><p>Despite the success and pioneering nature of this
approach, early implementations faced challenges: translators struggled
with issues such as gender bias and inconsistent terminology due to the
distance between the segment they were working on and its related
context.</p><p>By taking into account all edits within an individual
document, even those in completely different or distant segments, the MT
model is now able to provide document-specific translation suggestions.
This development significantly reduces the need for repeated
corrections of elements such as pronouns. This has greatly eased the
amount of corrective work needed to address gender bias errors and
modify incorrect terminology.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfxrBd9TAuzLZqSs7eu2_bxNeBow12yf54kiM1ow4pJsvDknFxfNQ7FpHPKp2MrkVe1_zpl2Kij4EdeL0ZpaUUIatS4mkEF8lTjFDH10X5Mwe7zdGdZN8bFLYAr_XBziKpDuFNd-IC0wMAJJ4dvVwaXzZlTHQOxLWmQQ3zbP6mcvFAUtYltLob3wr4m_fa/s1600/trust-attention-to-boost-quality-01.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="457" data-original-width="1600" height="114" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfxrBd9TAuzLZqSs7eu2_bxNeBow12yf54kiM1ow4pJsvDknFxfNQ7FpHPKp2MrkVe1_zpl2Kij4EdeL0ZpaUUIatS4mkEF8lTjFDH10X5Mwe7zdGdZN8bFLYAr_XBziKpDuFNd-IC0wMAJJ4dvVwaXzZlTHQOxLWmQQ3zbP6mcvFAUtYltLob3wr4m_fa/w400-h114/trust-attention-to-boost-quality-01.png" width="400" /></a></div><br /><h3 style="text-align: left;">The Emergence of LLM-Based Translation Models</h3><p>In
the summer of 2023, we are at an interesting juncture in the
development of AI-based language translation technology, where we now
see that Large Language Models (LLMs) are also an emerging technological
approach to having machines perform the language translation task. LLMs
are particularly impressive in handling idioms and enhancing the
fluency of machine translations.</p><p><strong>However, at this point,
there are still serious issues with latency, with high training and inference costs, and, most importantly, with the trustworthiness of the output produced by
Generative AI models like GPT-4.</strong> These issues will need to be addressed for Gen AI models to be viable in production-use translation settings. There is also the issue of poor performance in low-resource languages and a bias toward better performance with systems that translate into English.</p><p>The
AI product team at Translated continues to research and investigate the
possibilities for continued improvement of pure NMT models, hybrid NMT
and Gen AI models, as well as pure Gen AI models. <strong>Special
consideration is given to ensure that any major improvements made in
existing NMT model technology can also be leveraged in the future with
potentially production-use capable Gen AI translation models.</strong></p><p>AI
systems are trained on large datasets found on the internet, data that
can be of varied quality and reliability. If the data used for training
is biased or of poor quality, it can lead to biased or unreliable AI
outputs, and we have seen that one of the biggest obstacles to the
widespread use of Gen AI in mission-critical applications has been the
high levels of problematic and fluent, but untrustworthy output.</p><p>Better
data validation and verification can indeed improve the trustworthiness
of AI output. Data validation involves ensuring that the data used to
train and evaluate AI models is accurate, consistent, and representative
of the real-world scenarios the AI system will encounter. This can be
done through data cleaning, data preprocessing techniques, and careful
selection of training data.</p><p><br /></p><h1 style="text-align: left;"><span style="color: #2b00fe;">The Importance of Data Quality</span></h1><p>With this in mind, ModernMT Version 7 introduces a significant upgrade to its core adaptive machine translation (MT) system. <strong>This
new version introduces Trust Attention, a novel technique inspired by
how human researchers prioritize information from trusted sources</strong>, and the V7 model preferentially uses identified trustworthy data in both training and inference.</p><h3 style="text-align: left;"><strong><span style="color: #2b00fe;">This
innovation is the first of a long-term thematic effort focused on
improving data quality being undertaken at Translated, to ensure that
data quality and trustworthiness is a pervasive and comprehensive
attribute of all new translation AI initiatives.</span></strong></h3><p>Translated
has realized from a large number of independent evaluations and
internal testing over the years that this focus on data quality enables
ModernMT to compare favorably in quality performance evaluations to
many other better-funded public generic MT engines produced by Google,
Microsoft, and others.</p><p>The company has developed a robust data
governance framework to define data quality standards, processes, and
roles over the last decade. This helps create a culture of data quality
and ensures that data management practices are aligned with
organizational efficiency goals and technology improvements. </p><p>This
culture, together with close long-term collaboration with translators, ensures that ongoing data replenishment is of the highest quality and
systematically identifies and removes lower-quality data. Finally, <strong>regularly
measuring and monitoring data quality metrics helps to identify and
address potential issues before they impact AI performance. </strong></p><h3 style="text-align: left;"><span style="color: #2b00fe;">Trust
Attention is possible because of the long-term investment in developing
a data-quality culture that produces the right data to feed innovation
in new AI technologies.</span></h3><p></p><p>While it is common practice in the
industry to use automated, algorithm-driven methods for data validation and verification, Translated’s 20 years of
experience working with human translators show that human-verified data
is the most trustworthy data available to drive the learning of language
AI models. </p><p><strong>This human-verified data foundation is
precisely the most influential driver of preferential learning in the
ModernMT Version 7 models.</strong> Automated cleaning and verification
are valid ways to enhance data quality in machine learning applications,
but 10 years of experience show that human-verified data provide a
performance edge that is not easily matched by large-scale automated
cleaning and verification methods.</p><p><strong>Human quality assessments comparing ModernMT V6 output with V7 output show that the use of Trust Attention improves translation quality in as much as 42% of evaluated samples.</strong> It is interesting to note that many high-resource languages like
Spanish, Chinese, and Italian also saw major improvements near the 30%
range in human evaluations. </p><p>Human evaluations and judgments are
corroborated by concurrent BLEU and COMET score measurements which are
also used to ensure that conclusions being drawn by introducing new
technology are accurate and trustworthy.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgANsn3ACUFgzS6kIGlbwayAnRLItO9LhnsaFTFGsfB8KgkI1q5Ph0mhZZpLGGSKkLlP8BsTFpmssOpNM_2OAGoYsubHeZCAkeb13qUBwB566HIbmpnlIwwW-ik6234vhflKXFWcfgz1f1S_4TepGBMGPpOR509TX-F0sFOZ189J3ZUq24NxaaLvTwtgv5n/s1600/trust-attention-to-boost-quality.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="557" data-original-width="1600" height="139" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgANsn3ACUFgzS6kIGlbwayAnRLItO9LhnsaFTFGsfB8KgkI1q5Ph0mhZZpLGGSKkLlP8BsTFpmssOpNM_2OAGoYsubHeZCAkeb13qUBwB566HIbmpnlIwwW-ik6234vhflKXFWcfgz1f1S_4TepGBMGPpOR509TX-F0sFOZ189J3ZUq24NxaaLvTwtgv5n/w400-h139/trust-attention-to-boost-quality.png" width="400" /></a></div><br /><p>The following is a sample of MT output from the ModernMT V7 system
compared to the previous V6. Three independent professional reviewers
were shown two randomized samples of a translation of the same source
segment and asked to judge if one was better, no different, or worse. The chart above
shows how often the V7 translation was preferred by a majority of the reviewers by language.</p><p>Examples below show sample sentences from English to Brazilian Portuguese and Simplified Chinese.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFVb8ko0oXcrxewhKb0-3-P8-LgmtGpQdm5wNAi7psE_a5DJ6P2TsT3IfK-ym5WnZfyghLTi_A8LspdWFtYkMXGakTc06xKAF97tk-feBFefFU-8n9rKb_csuj1x0lBslhM-QOwA6MrFYWA31GJiUV6YzZ-GJ7pRhiu3xVkD7sc16VY8-RO9gEG430oEZs/s960/B-PT.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="540" data-original-width="960" height="225" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFVb8ko0oXcrxewhKb0-3-P8-LgmtGpQdm5wNAi7psE_a5DJ6P2TsT3IfK-ym5WnZfyghLTi_A8LspdWFtYkMXGakTc06xKAF97tk-feBFefFU-8n9rKb_csuj1x0lBslhM-QOwA6MrFYWA31GJiUV6YzZ-GJ7pRhiu3xVkD7sc16VY8-RO9gEG430oEZs/w400-h225/B-PT.png" width="400" /></a></div><br /><h1 style="text-align: left;"><span style="color: #2b00fe;">“If 80 percent of our work is data preparation, then ensuring data quality is the important work of a machine learning team.”</span></h1><p>Andrew Ng, Professor of AI at Stanford University and founder of <a href="https://www.deeplearning.ai/the-batch/issue-84/?ref=blog.modernmt.com"><em>DeepLearning.AI</em></a></p><p><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTPsvscBYKCY5yFFizlfxkRpzme-WxXAqWzV51up7CQ2SaeNe27wWx6tNCQamVH3ejIX1-iEL7ACE5-O_DQZcx1Nxk5fJUOh6IQItY2N6XmVFmXggJ_e_AVc-T9b9Nz-yzhXIKJ7I1M9Yf1VzHI-zdY26vT7PmqDwzH4yrAYDsRSNV9zftwOvenJp3XezA/s960/ZH-V7.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="540" data-original-width="960" height="225" 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTPsvscBYKCY5yFFizlfxkRpzme-WxXAqWzV51up7CQ2SaeNe27wWx6tNCQamVH3ejIX1-iEL7ACE5-O_DQZcx1Nxk5fJUOh6IQItY2N6XmVFmXggJ_e_AVc-T9b9Nz-yzhXIKJ7I1M9Yf1VzHI-zdY26vT7PmqDwzH4yrAYDsRSNV9zftwOvenJp3XezA/w400-h225/ZH-V7.png" width="400" /></a></div><br /><h1 style="text-align: left;">How is Trust Attention Different?</h1><p>“Garbage
in, garbage out” (GIGO) is a concept in computing and artificial
intelligence (AI) that highlights the importance of input data quality.
It means that if the input data to a system, such as an AI model or
algorithm, is of poor quality, inaccurate, or irrelevant, the system’s
output will also be of poor quality, inaccurate, or irrelevant.</p><p>This
concept is particularly significant for AI systems built on machine learning and deep learning models, which rely heavily on the
data used for training and validation. If the training data is biased,
incomplete, or contains errors, the AI model will likely produce
unreliable or biased results.</p><h2 id="all-data-is-not-equally-important"><br /></h2><h3 style="text-align: left;">All Data Is Not Equally Important</h3><p>Traditional
MT systems generally are not able to distinguish between trustworthy
data and lower-quality training material during the training process,
and typically all the data has equal weight. Thus, high-quality data and
high-volume noisy data can have essentially the same amount of impact
on how a translation model will perform. </p><h1 style="text-align: left;"><span style="color: #2b00fe;">Trust Attention allows an engine to prioritize more trustworthy data and have this data influence ongoing model behavior more heavily.</span></h1><p></p><p>ModernMT now uses a <strong>first-of-its-kind weighting system to enable primary learning from high-quality, trusted, and verified data </strong>– translations performed and/or reviewed by professional translators – over unverified data that is acquired from the Web.</p><p>As with adaptive MT, Translated looked to established human practices to develop this new technique. <strong>In
any serious research, humans collect and sift through multiple
information sources to identify and assign preferential status to the
most trustworthy and reliable data sources</strong>. </p><p><strong>ModernMT
V7 similarly identifies the most valuable training data and prioritizes
its learning based on certified and verified data by modeling this
human behavior.</strong> This certification and verification is not an automated, machine-led process; rather, it is an expert human validation
that raises the trustworthiness of the data.</p><h1 style="text-align: left;"><span style="color: #2b00fe;"><strong>This focus on
prioritizing the use of trusted, verified data is a major step forward
in the development of enterprise-focused MT technology</strong>. </span></h1><p>The
efforts made to identify and build repositories of high-quality data
will also be useful in the future if there is indeed a shift to Gen
AI-based language translation models.</p><p>Today, there is considerable
discussion regarding the application of large language models in
translation. Traditional NMT models tend to perform better on the accuracy dimension, though they can be less fluent than human translators; LLMs, by contrast, often win on fluency, even though they frequently produce misleading output due to hallucinations (generative fabrication).</p><p><strong>Trust Attention methodology, deployed in LLMs, will also enhance the accuracy of generative models,
reducing the chances of random fabrication and confabulation errors.
This could set the stage for an emerging era of new machine translation
methodologies, one that combines the accuracy of dynamic adaptive NMT
with the fluency of Gen AI models.</strong></p><p>ModernMT Version 7
also introduces a data-cleaning AI that minimizes the likelihood of
hallucinations, making it valuable for companies seeking greater
accuracy in high-volume automated translation use cases, and is also
useful for translators integrating MT into their workflow.</p><p>John
Tinsley, VP of AI Solutions at Translated, added, "We are confident that
these new data validation and verification techniques can also improve
accuracy in generative AI systems, paving the way for the next
generation of machine translation."</p><p>The introduction of this new approach is a major step forward for companies seeking <strong>greater accuracy in the translation of large volumes of content</strong> or requiring a <strong>high degree of customization</strong> of the MT engine, as well as for translators integrating MT into their workflow.</p><h3 style="text-align: left;"><span style="color: #2b00fe;">The
combined impact of these multiple innovations provides global
enterprises with a superior platform to rapidly transform generic
engines into highly tuned enterprise-specific translation engines.</span></h3>Kirti Vasheehttp://www.blogger.com/profile/16795076802721564830noreply@blogger.com0tag:blogger.com,1999:blog-6748877443699290050.post-89122137645766694262023-12-05T11:07:00.000-08:002023-12-05T11:07:57.415-08:00The English-Centric Bias of Large Language Models<p> The internet is the primary source of information, economic opportunity, and community for many worldwide. However, <strong>the
automated systems that increasingly mediate our interactions online —
such as chatbots, content moderation systems, and search engines — are
primarily designed for and work far more effectively in English than in
the world’s other 7,000 languages.</strong></p><p>It is clear to anyone
who works with LLMs and multilingual models, that there are now many
powerful and impressive LLM models available for generating natural and
fluent texts in English. While there has been substantial hype around
the capabilities and actual potential value of a wide range of
applications and use cases, the benefits have been most pronounced for
English-speaking users.</p><p>It is also now increasingly being
understood that achieving the same level of quality and performance for
other languages, even the ones that are widely spoken, is not an easy
task. <strong>AI chatbots are less fluent in languages other than
English and are thus threatening to amplify the existing language bias
in global commerce, knowledge access, basic internet research, and
innovation.</strong></p><p>In the past, it has been difficult to develop
AI systems — and huge language models in particular — in languages other
than English because of what is known as the<strong> resourcedness gap.</strong></p><h3 style="text-align: left;"><strong><span style="color: #2b00fe;">The
resourcedness gap describes the asymmetry in the availability of
high-quality digitized text that can serve as training data for a large
language model and generative AI solutions in general.</span></strong></h3><p></p><p>English
is an extremely highly resourced language, whereas other languages,
including those used predominantly in the Global South,<a href="https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00447/109285/Quality-at-a-Glance-An-Audit-of-Web-Crawled?ref=blog.modernmt.com"> often have fewer examples of high-quality text</a> (if any at all) on which to train language models.</p><h3 style="text-align: left;"><strong><span style="color: #2b00fe; font-size: large;">English-speaking
users have a better user experience with generative AI than users who
speak other languages, and the current models will only amplify this
English bias further.</span></strong></h3><p><a href="https://www.springboard.com/blog/data-science/machine-learning-gpt-3-open-ai/?ref=blog.modernmt.com#:~:text=Although%20GPT%2D3's%20training%20data,on%20the%20language%20translation%20task.">It is estimated</a>
that GPT-3's training data consists of more than 90% English text. It did include some foreign-language text, but not enough to ensure that model performance across different languages is consistent. GPT-3 was
the foundation model used to build ChatGPT and though we do not know
what data was used in GPT-4 we can safely assume that no major sources
of non-English data have been acquired, primarily because it is not
easily available.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgokDmnNufyx4ibbQruRbml4oUEQFefN4osIkhpiv0HBSKV0zvLKUaVbAAlq5FZOxqyvlZeoJjvgHTrMWDC8_rmPiOUmKEo9xQp9eLTQF_NvVN-IpY_F8_KPD36cz8efAjhdl_3NXFX2evFOmQreya4y3oKkSuhCRT6gr0JV69vtGq_llsP3uLiGC9-gmWe/s2000/The-Problem-of-English-Dominated-Generative-AI_converted-1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="ategories of language resourcedness. Languages divided into different levels of resourcedness, according to labeled and unlabeled datasets available as of 2020" border="0" data-original-height="865" data-original-width="2000" height="173" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgokDmnNufyx4ibbQruRbml4oUEQFefN4osIkhpiv0HBSKV0zvLKUaVbAAlq5FZOxqyvlZeoJjvgHTrMWDC8_rmPiOUmKEo9xQp9eLTQF_NvVN-IpY_F8_KPD36cz8efAjhdl_3NXFX2evFOmQreya4y3oKkSuhCRT6gr0JV69vtGq_llsP3uLiGC9-gmWe/w400-h173/The-Problem-of-English-Dominated-Generative-AI_converted-1.jpg" width="400" /></a></div><strong><span style="font-size: x-small;">Source:<a href="https://cdt.org/wp-content/uploads/2023/05/non-en-content-analysis-primer-051223-1203.pdf?ref=blog.modernmt.com"><strong> Lost in Translation Large Language Models in Non-English Content Analysis</strong></a></span></strong><div><b><br /></b></div><div><p>Researchers like Pascale Fung and others have pointed out the
difficulty for many global customers because of the dominance of English
in eCommerce. <strong>It is much easier to get information about products in English in online marketplaces than it is in any other language.</strong></p><p>Fung,
director of the Center for AI Research at the Hong Kong University of
Science and Technology, who herself speaks seven languages, sees this
bias even in her research field. <strong>“If you don’t publish papers in
English, you’re not relevant,” she says. “Non-English speakers tend to
be punished professionally.”</strong></p><p>The<a href="https://www.sigmoid.com/blogs/gpt-3-all-you-need-to-know-about-the-ai-language-model/?ref=blog.modernmt.com"> following table</a> describes the source data for the training corpus of GPT-3 which is the data foundation for ChatGPT:</p><p></p><table border="0" cellpadding="0" cellspacing="0" class="MsoNormalTable" style="border-collapse: collapse; mso-padding-alt: 0in 0in 0in 0in; mso-yfti-tbllook: 1184;">
<tbody><tr>
<td style="background: #E2CC6F; border: solid gainsboro 1.0pt; mso-border-alt: solid gainsboro .75pt; padding: 7.5pt 15.0pt 7.5pt 15.0pt;">
<p align="center" class="MsoNormal" style="line-height: normal; margin-bottom: 0in; text-align: center;"><strong><span color:black="" mso-bidi-font-family:calibri="" mso-bidi-theme-font:minor-latin="" mso-font-kerning:0pt="" mso-ligatures:none="" new="" roman="" times=""><span style="font-size: x-small;">Datasets<o:p></o:p></span></span></strong></p>
</td>
<td style="background: #E2CC6F; border-left: none; border: solid gainsboro 1.0pt; mso-border-alt: solid gainsboro .75pt; mso-border-left-alt: solid gainsboro .75pt; padding: 7.5pt 15.0pt 7.5pt 15.0pt;">
<p align="center" class="MsoNormal" style="line-height: normal; margin-bottom: 0in; text-align: center;"><strong><span color:black="" mso-bidi-font-family:calibri="" mso-bidi-theme-font:minor-latin="" mso-font-kerning:0pt="" mso-ligatures:none="" new="" roman="" times=""><span style="font-size: x-small;">Quantity (Tokens)<o:p></o:p></span></span></strong></p>
</td>
<td style="background: #E2CC6F; border-left: none; border: solid gainsboro 1.0pt; mso-border-alt: solid gainsboro .75pt; mso-border-left-alt: solid gainsboro .75pt; padding: 7.5pt 15.0pt 7.5pt 15.0pt;">
<p align="center" class="MsoNormal" style="line-height: normal; margin-bottom: 0in; text-align: center;"><strong><span color:black="" mso-bidi-font-family:calibri="" mso-bidi-theme-font:minor-latin="" mso-font-kerning:0pt="" mso-ligatures:none="" new="" roman="" times=""><span style="font-size: x-small;">Weight in Training Mix<o:p></o:p></span></span></strong></p>
</td>
<td style="background: #E2CC6F; border-left: none; border: solid gainsboro 1.0pt; mso-border-alt: solid gainsboro .75pt; mso-border-left-alt: solid gainsboro .75pt; padding: 7.5pt 15.0pt 7.5pt 15.0pt; width: 172.9pt;" width="231">
<p align="center" class="MsoNormal" style="line-height: normal; margin-bottom: 0in; text-align: center;"><strong><span color:black="" mso-bidi-font-family:calibri="" mso-bidi-theme-font:minor-latin="" mso-font-kerning:0pt="" mso-ligatures:none="" new="" roman="" times=""><span style="font-size: x-small;">Epochs elapsed when
training for 300 BN tokens<o:p></o:p></span></span></strong></p>
</td>
</tr>
<tr style="height: 42.15pt; mso-yfti-irow: 1;">
<td style="border-top: none; border: solid gainsboro 1.0pt; height: 42.15pt; mso-border-alt: solid gainsboro .75pt; mso-border-top-alt: solid gainsboro .75pt; padding: 7.5pt 15.0pt 7.5pt 15.0pt;" valign="top">
<p align="center" class="MsoNormal" style="line-height: normal; margin-bottom: 0in; text-align: center;"><strong><span color:="" mso-bidi-font-family:calibri="" mso-bidi-theme-font:minor-latin="" mso-font-kerning:0pt="" mso-ligatures:none="" new="" roman="" times=""><span style="font-size: x-small;">Common Crawl
(filtered)<o:p></o:p></span></span></strong></p>
</td>
<td style="border-bottom: solid gainsboro 1.0pt; border-left: none; border-right: solid gainsboro 1.0pt; border-top: none; height: 42.15pt; mso-border-alt: solid gainsboro .75pt; mso-border-left-alt: solid gainsboro .75pt; mso-border-top-alt: solid gainsboro .75pt; padding: 7.5pt 15.0pt 7.5pt 15.0pt;" valign="top">
<p align="center" class="MsoNormal" style="line-height: normal; margin-bottom: 0in; text-align: center;"><span color:="" mso-bidi-font-family:calibri="" mso-bidi-theme-font:minor-latin="" mso-font-kerning:0pt="" mso-ligatures:none="" new="" roman="" times=""><span style="font-size: x-small;">410 BN<o:p></o:p></span></span></p>
</td>
<td style="border-bottom: solid gainsboro 1.0pt; border-left: none; border-right: solid gainsboro 1.0pt; border-top: none; height: 42.15pt; mso-border-alt: solid gainsboro .75pt; mso-border-left-alt: solid gainsboro .75pt; mso-border-top-alt: solid gainsboro .75pt; padding: 7.5pt 15.0pt 7.5pt 15.0pt;" valign="top">
<p align="center" class="MsoNormal" style="line-height: normal; margin-bottom: 0in; text-align: center;"><span color:="" mso-bidi-font-family:calibri="" mso-bidi-theme-font:minor-latin="" mso-font-kerning:0pt="" mso-ligatures:none="" new="" roman="" times=""><span style="font-size: x-small;">60%<o:p></o:p></span></span></p>
</td>
<td style="border-bottom: solid gainsboro 1.0pt; border-left: none; border-right: solid gainsboro 1.0pt; border-top: none; height: 42.15pt; mso-border-alt: solid gainsboro .75pt; mso-border-left-alt: solid gainsboro .75pt; mso-border-top-alt: solid gainsboro .75pt; padding: 7.5pt 15.0pt 7.5pt 15.0pt; width: 172.9pt;" valign="top" width="231">
<p align="center" class="MsoNormal" style="line-height: normal; margin-bottom: 0in; text-align: center;"><span color:="" mso-bidi-font-family:calibri="" mso-bidi-theme-font:minor-latin="" mso-font-kerning:0pt="" mso-ligatures:none="" new="" roman="" times=""><span style="font-size: x-small;">0.44<o:p></o:p></span></span></p>
</td>
</tr>
<tr>
<td style="border-top: none; border: solid gainsboro 1.0pt; mso-border-alt: solid gainsboro .75pt; mso-border-top-alt: solid gainsboro .75pt; padding: 7.5pt 15.0pt 7.5pt 15.0pt;" valign="top">
<p align="center" class="MsoNormal" style="line-height: normal; margin-bottom: 0in; text-align: center;"><strong><span color:="" mso-bidi-font-family:calibri="" mso-bidi-theme-font:minor-latin="" mso-font-kerning:0pt="" mso-ligatures:none="" new="" roman="" times=""><span style="font-size: x-small;">WebText2<o:p></o:p></span></span></strong></p>
</td>
<td style="border-bottom: solid gainsboro 1.0pt; border-left: none; border-right: solid gainsboro 1.0pt; border-top: none; mso-border-alt: solid gainsboro .75pt; mso-border-left-alt: solid gainsboro .75pt; mso-border-top-alt: solid gainsboro .75pt; padding: 7.5pt 15.0pt 7.5pt 15.0pt;" valign="top">
<p align="center" class="MsoNormal" style="line-height: normal; margin-bottom: 0in; text-align: center;"><span color:="" mso-bidi-font-family:calibri="" mso-bidi-theme-font:minor-latin="" mso-font-kerning:0pt="" mso-ligatures:none="" new="" roman="" times=""><span style="font-size: x-small;">19 BN<o:p></o:p></span></span></p>
</td>
<td style="border-bottom: solid gainsboro 1.0pt; border-left: none; border-right: solid gainsboro 1.0pt; border-top: none; mso-border-alt: solid gainsboro .75pt; mso-border-left-alt: solid gainsboro .75pt; mso-border-top-alt: solid gainsboro .75pt; padding: 7.5pt 15.0pt 7.5pt 15.0pt;" valign="top">
<p align="center" class="MsoNormal" style="line-height: normal; margin-bottom: 0in; text-align: center;"><span color:="" mso-bidi-font-family:calibri="" mso-bidi-theme-font:minor-latin="" mso-font-kerning:0pt="" mso-ligatures:none="" new="" roman="" times=""><span style="font-size: x-small;">22%<o:p></o:p></span></span></p>
</td>
<td style="border-bottom: solid gainsboro 1.0pt; border-left: none; border-right: solid gainsboro 1.0pt; border-top: none; mso-border-alt: solid gainsboro .75pt; mso-border-left-alt: solid gainsboro .75pt; mso-border-top-alt: solid gainsboro .75pt; padding: 7.5pt 15.0pt 7.5pt 15.0pt; width: 172.9pt;" valign="top" width="231">
<p align="center" class="MsoNormal" style="line-height: normal; margin-bottom: 0in; text-align: center;"><span color:="" mso-bidi-font-family:calibri="" mso-bidi-theme-font:minor-latin="" mso-font-kerning:0pt="" mso-ligatures:none="" new="" roman="" times=""><span style="font-size: x-small;">2.90<o:p></o:p></span></span></p>
</td>
</tr>
<tr>
<td style="border-top: none; border: solid gainsboro 1.0pt; mso-border-alt: solid gainsboro .75pt; mso-border-top-alt: solid gainsboro .75pt; padding: 7.5pt 15.0pt 7.5pt 15.0pt;" valign="top">
<p align="center" class="MsoNormal" style="line-height: normal; margin-bottom: 0in; text-align: center;"><strong><span color:="" mso-bidi-font-family:calibri="" mso-bidi-theme-font:minor-latin="" mso-font-kerning:0pt="" mso-ligatures:none="" new="" roman="" times=""><span style="font-size: x-small;">Books1<o:p></o:p></span></span></strong></p>
</td>
<td style="border-bottom: solid gainsboro 1.0pt; border-left: none; border-right: solid gainsboro 1.0pt; border-top: none; mso-border-alt: solid gainsboro .75pt; mso-border-left-alt: solid gainsboro .75pt; mso-border-top-alt: solid gainsboro .75pt; padding: 7.5pt 15.0pt 7.5pt 15.0pt;" valign="top">
<p align="center" class="MsoNormal" style="line-height: normal; margin-bottom: 0in; text-align: center;"><span color:="" mso-bidi-font-family:calibri="" mso-bidi-theme-font:minor-latin="" mso-font-kerning:0pt="" mso-ligatures:none="" new="" roman="" times=""><span style="font-size: x-small;">12 BN<o:p></o:p></span></span></p>
</td>
<td style="border-bottom: solid gainsboro 1.0pt; border-left: none; border-right: solid gainsboro 1.0pt; border-top: none; mso-border-alt: solid gainsboro .75pt; mso-border-left-alt: solid gainsboro .75pt; mso-border-top-alt: solid gainsboro .75pt; padding: 7.5pt 15.0pt 7.5pt 15.0pt;" valign="top">
<p align="center" class="MsoNormal" style="line-height: normal; margin-bottom: 0in; text-align: center;"><span color:="" mso-bidi-font-family:calibri="" mso-bidi-theme-font:minor-latin="" mso-font-kerning:0pt="" mso-ligatures:none="" new="" roman="" times=""><span style="font-size: x-small;">8%<o:p></o:p></span></span></p>
</td>
<td style="border-bottom: solid gainsboro 1.0pt; border-left: none; border-right: solid gainsboro 1.0pt; border-top: none; mso-border-alt: solid gainsboro .75pt; mso-border-left-alt: solid gainsboro .75pt; mso-border-top-alt: solid gainsboro .75pt; padding: 7.5pt 15.0pt 7.5pt 15.0pt; width: 172.9pt;" valign="top" width="231">
<p align="center" class="MsoNormal" style="line-height: normal; margin-bottom: 0in; text-align: center;"><span color:="" mso-bidi-font-family:calibri="" mso-bidi-theme-font:minor-latin="" mso-font-kerning:0pt="" mso-ligatures:none="" new="" roman="" times=""><span style="font-size: x-small;">1.90<o:p></o:p></span></span></p>
</td>
</tr>
<tr>
<td style="border-top: none; border: solid gainsboro 1.0pt; mso-border-alt: solid gainsboro .75pt; mso-border-top-alt: solid gainsboro .75pt; padding: 7.5pt 15.0pt 7.5pt 15.0pt;" valign="top">
<p align="center" class="MsoNormal" style="line-height: normal; margin-bottom: 0in; text-align: center;"><strong><span color:="" mso-bidi-font-family:calibri="" mso-bidi-theme-font:minor-latin="" mso-font-kerning:0pt="" mso-ligatures:none="" new="" roman="" times=""><span style="font-size: x-small;">Books2<o:p></o:p></span></span></strong></p>
</td>
<td style="border-bottom: solid gainsboro 1.0pt; border-left: none; border-right: solid gainsboro 1.0pt; border-top: none; mso-border-alt: solid gainsboro .75pt; mso-border-left-alt: solid gainsboro .75pt; mso-border-top-alt: solid gainsboro .75pt; padding: 7.5pt 15.0pt 7.5pt 15.0pt;" valign="top">
<p align="center" class="MsoNormal" style="line-height: normal; margin-bottom: 0in; text-align: center;"><span color:="" mso-bidi-font-family:calibri="" mso-bidi-theme-font:minor-latin="" mso-font-kerning:0pt="" mso-ligatures:none="" new="" roman="" times=""><span style="font-size: x-small;">55 BN<o:p></o:p></span></span></p>
</td>
<td style="border-bottom: solid gainsboro 1.0pt; border-left: none; border-right: solid gainsboro 1.0pt; border-top: none; mso-border-alt: solid gainsboro .75pt; mso-border-left-alt: solid gainsboro .75pt; mso-border-top-alt: solid gainsboro .75pt; padding: 7.5pt 15.0pt 7.5pt 15.0pt;" valign="top">
<p align="center" class="MsoNormal" style="line-height: normal; margin-bottom: 0in; text-align: center;"><span color:="" mso-bidi-font-family:calibri="" mso-bidi-theme-font:minor-latin="" mso-font-kerning:0pt="" mso-ligatures:none="" new="" roman="" times=""><span style="font-size: x-small;">8%<o:p></o:p></span></span></p>
</td>
<td style="border-bottom: solid gainsboro 1.0pt; border-left: none; border-right: solid gainsboro 1.0pt; border-top: none; mso-border-alt: solid gainsboro .75pt; mso-border-left-alt: solid gainsboro .75pt; mso-border-top-alt: solid gainsboro .75pt; padding: 7.5pt 15.0pt 7.5pt 15.0pt; width: 172.9pt;" valign="top" width="231">
<p align="center" class="MsoNormal" style="line-height: normal; margin-bottom: 0in; text-align: center;"><span color:="" mso-bidi-font-family:calibri="" mso-bidi-theme-font:minor-latin="" mso-font-kerning:0pt="" mso-ligatures:none="" new="" roman="" times=""><span style="font-size: x-small;">0.43<o:p></o:p></span></span></p>
</td>
</tr>
<tr>
<td style="border-top: none; border: solid gainsboro 1.0pt; mso-border-alt: solid gainsboro .75pt; mso-border-top-alt: solid gainsboro .75pt; padding: 7.5pt 15.0pt 7.5pt 15.0pt;" valign="top">
<p align="center" class="MsoNormal" style="line-height: normal; margin-bottom: 0in; text-align: center;"><strong><span color:="" mso-bidi-font-family:calibri="" mso-bidi-theme-font:minor-latin="" mso-font-kerning:0pt="" mso-ligatures:none="" new="" roman="" times=""><span style="font-size: x-small;">Wikipedia<o:p></o:p></span></span></strong></p>
</td>
<td style="border-bottom: solid gainsboro 1.0pt; border-left: none; border-right: solid gainsboro 1.0pt; border-top: none; mso-border-alt: solid gainsboro .75pt; mso-border-left-alt: solid gainsboro .75pt; mso-border-top-alt: solid gainsboro .75pt; padding: 7.5pt 15.0pt 7.5pt 15.0pt;" valign="top">
<p align="center" class="MsoNormal" style="line-height: normal; margin-bottom: 0in; text-align: center;"><span color:="" mso-bidi-font-family:calibri="" mso-bidi-theme-font:minor-latin="" mso-font-kerning:0pt="" mso-ligatures:none="" new="" roman="" times=""><span style="font-size: x-small;">3 BN<o:p></o:p></span></span></p>
</td>
<td style="border-bottom: solid gainsboro 1.0pt; border-left: none; border-right: solid gainsboro 1.0pt; border-top: none; mso-border-alt: solid gainsboro .75pt; mso-border-left-alt: solid gainsboro .75pt; mso-border-top-alt: solid gainsboro .75pt; padding: 7.5pt 15.0pt 7.5pt 15.0pt;" valign="top">
<p align="center" class="MsoNormal" style="line-height: normal; margin-bottom: 0in; text-align: center;"><span color:="" mso-bidi-font-family:calibri="" mso-bidi-theme-font:minor-latin="" mso-font-kerning:0pt="" mso-ligatures:none="" new="" roman="" times=""><span style="font-size: x-small;">3%<o:p></o:p></span></span></p>
</td>
<td style="border-bottom: solid gainsboro 1.0pt; border-left: none; border-right: solid gainsboro 1.0pt; border-top: none; mso-border-alt: solid gainsboro .75pt; mso-border-left-alt: solid gainsboro .75pt; mso-border-top-alt: solid gainsboro .75pt; padding: 7.5pt 15.0pt 7.5pt 15.0pt; width: 172.9pt;" valign="top" width="231">
<p align="center" class="MsoNormal" style="line-height: normal; margin-bottom: 0in; text-align: center;"><span color:="" mso-bidi-font-family:calibri="" mso-bidi-theme-font:minor-latin="" mso-font-kerning:0pt="" mso-ligatures:none="" new="" roman="" times=""><span style="font-size: x-small;">3.40</span><span style="font-size: 12pt;"><o:p></o:p></span></span></p>
</td>
</tr>
</tbody></table><p>Understanding what data has been used to train GPT-3 is useful.<a href="https://jilltxt.net/right-now-chatgpt-is-multilingual-but-monocultural-but-its-learning-your-values/?ref=blog.modernmt.com"> This overview</a> provides some valuable details that also help us understand the English bias and US-centric perspective that these models have.</p><p>Fung and others are part of a global community of AI researchers<a href="https://arxiv.org/pdf/2302.04023.pdf?ref=blog.modernmt.com"> testing the language skills of ChatGPT</a>
and its rival chatbots, sounding the alarm and providing evidence
that they are significantly less capable in languages other than
English.</p><ul><li><strong>ChatGPT still lacks the ability to understand and generate sentences in low-resource languages.</strong> The performance disparity in low-resource languages limits the diversity and inclusivity of NLP.</li><li><strong>ChatGPT also lacks the ability to translate sentences in non-Latin script languages</strong>, despite the languages being considered high-resource.</li></ul><p>“One
of my biggest concerns is we’re going to exacerbate the bias for
English and English speakers,” says Thien Huu Nguyen, a University of
Oregon computer scientist who is also a leading researcher raising
awareness about<a href="https://arxiv.org/pdf/2304.05613.pdf?ref=blog.modernmt.com"> the often impoverished experience non-English speakers routinely have</a> with generative AI. Nguyen specifically points out:</p><p><strong>ChatGPT’s
performance is generally better for English than for other languages,
especially for higher-level tasks that require more complex reasoning
abilities </strong>(e.g., named entity recognition, question answering,
common sense reasoning, and summarization). The performance differences
can be substantial for some tasks and lower-resource languages.</p><ul><li><strong>ChatGPT can perform better with English prompts</strong> even though the task and input texts are intended for other languages.</li><li><strong>ChatGPT<a href="https://arxiv.org/pdf/2304.05613.pdf?ref=blog.modernmt.com"><strong> performed substantially</strong></a> worse at answering factual questions or summarizing complex text in non-English languages</strong> and was more likely to fabricate information.</li></ul><p>
</p><p>The research points clearly to the English bias of the most popular LLMs: the AI systems are good at<a href="https://arxiv.org/pdf/2304.04675.pdf?ref=blog.modernmt.com"> translating other languages into English</a>, but they struggle with rewriting English into other languages, especially languages with non-Latin scripts, such as Korean.</p><h3 style="text-align: left;"><strong><span style="color: #2b00fe;">“51.3%
of pages are hosted in the United States. The countries with the
estimated 2nd, 3rd, and 4th largest English-speaking populations—India,
Pakistan, Nigeria, and The Philippines—have only 3.4%, 0.06%, 0.03%,
0.1% the URLs of the United States, despite having many tens of millions
of English speakers.”</span></strong></h3><div style="text-align: left;"><span style="font-size: x-small;">(<a href="http://arxiv.org/abs/2104.08758.?ref=blog.modernmt.com"><strong>Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus</strong></a>, 2021, p. 4)</span></div><p>The
chart below takes a deeper dive into the linguistic makeup of the
Common Crawl data by the Common Sense Advisory research team.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgvwX_8yBUthWrZlpExxG8sTR9LDajBVhhvu_xP964h56FOZ3iAiF0uZIesNCpjaso3U0vugKBRj-k8Ea7NoesZAbMTOK795VAvGOuSPubH4ajbUhqor3TmgIQNMK0HeO_lEqYAT5TN45GLhzkrlcYvqrHj8p9RvXxCgc5O94_QCSRvmFX2wR9EZVua0mnI/s1600/CSA-GAI-Data-1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="840" data-original-width="1600" height="210" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgvwX_8yBUthWrZlpExxG8sTR9LDajBVhhvu_xP964h56FOZ3iAiF0uZIesNCpjaso3U0vugKBRj-k8Ea7NoesZAbMTOK795VAvGOuSPubH4ajbUhqor3TmgIQNMK0HeO_lEqYAT5TN45GLhzkrlcYvqrHj8p9RvXxCgc5O94_QCSRvmFX2wR9EZVua0mnI/w400-h210/CSA-GAI-Data-1.png" width="400" /></a></div><br /><p>Recently though, researchers and technology companies have attempted
to extend the capabilities of large language models into languages other
than English by building what are called multilingual language models.
Instead of being trained on text from only one language, multilingual
language models are trained on text from dozens or hundreds of languages
at once.</p><p>Researchers posit that multilingual language models can
infer connections between languages, allowing them to apply word
associations and underlying grammatical rules learned from languages
with more text data available to train on (in particular English) to
those with less.</p><p>Languages vary widely in resourcedness, or the
volume, quality, and diversity of text data they have available to train
language models on. <strong>English is the highest-resourced language by multiple orders of magnitude,</strong> but Spanish, Chinese, German, and a handful of other languages have sufficiently high resources to build language models.</p><p>However, they are still expected to be lower in quality than English language
models. Medium-resource languages, with fewer but still high-quality
data sets, such as Russian, Hebrew, and Vietnamese, and low-resource
languages, with almost no training data sets, such as Amharic, Cherokee,
and Haitian Creole, have too little text for training large language
models.</p><p>However, there are many challenges and complexities
involved in developing multilingual and multicultural LLMs that can
cater to the diverse needs and preferences of different communities.
Multilingual language models are still usually trained
disproportionately on English language text and thus end up transferring
values and assumptions encoded in English into other language contexts
where they may not belong.</p><h3 style="text-align: left;"><span style="color: #2b00fe; font-size: large;">Most
remedial approaches rely on acquiring large amounts of non-English data
to add to the core training data and reduce the English bias in current
LLMs; such data is not easily found and is often non-existent, certainly
not at the scale, volume, and diversity of English training data.</span></h3><p>English is the
closest thing there is to a global lingua franca. It is the dominant
language in science, popular culture, higher education, international
politics, and global capitalism; it has the<a href="https://www.ethnologue.com/?ref=blog.modernmt.com"> most total speakers</a> and the third-most first-language speakers.</p><p><strong>The
bias in the NLP research community is evident in the chart below. ACL
paper abstracts mention English far more often than any other
language by a factor of 11X to 80X!</strong></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjSVBR1c1osMYC4_N5tRPbP3-APqBkHwjs2-31PZLZm6EPIadfyECAiTDMsLOTGVZwtg1rNeWN0mqDwntyaTfa9nc2N7z0hgGWmfaX6iPqwclhk1WRdfe4ZIB8kdtUk5puruIIBbnD0lUNr907a-ciitjicE0uQiD_tWdZJPYaYpk8xODh2PWiwmqMGhDg3/s1600/The-Problem-of-English-Dominated-Generative-AI_converted2-1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="848" data-original-width="1600" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjSVBR1c1osMYC4_N5tRPbP3-APqBkHwjs2-31PZLZm6EPIadfyECAiTDMsLOTGVZwtg1rNeWN0mqDwntyaTfa9nc2N7z0hgGWmfaX6iPqwclhk1WRdfe4ZIB8kdtUk5puruIIBbnD0lUNr907a-ciitjicE0uQiD_tWdZJPYaYpk8xODh2PWiwmqMGhDg3/s320/The-Problem-of-English-Dominated-Generative-AI_converted2-1.jpg" width="320" /></a></div><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><figcaption><b><strong style="white-space-collapse: preserve;">Languages mentioned in paper abstracts. Top most mentioned languages in abstracts of papers published by the Association for Computational Linguistics, May 2022-January 2023.</strong></b></figcaption></figure><p>Recent<a href="https://www.wired.com/story/spooked-by-chatgpt-us-lawmakers-want-to-create-an-ai-regulator/?ref=blog.modernmt.com"> US congressional hearings also focused on this language-bias problem</a>
when Senator Alex Padilla (a native Spanish speaker) of California
questioned the CEO of OpenAI about improving the experience for the
growing population of non-English users even in the US and said: <strong>“These
new technologies hold great promise for access to information,
education, and enhanced communication, and we must ensure that language
doesn’t become a barrier to these benefits.”</strong></p><h3 style="text-align: left;"><strong><span style="color: #2b00fe;">However,
the fact remains, and OpenAI clearly states that the majority of the
underlying training data used to power ChatGPT (and most other LLMs)
came from English and that the company’s efforts to fine-tune and study
the performance of the model are primarily focused on English </span><span style="color: red;">“with a
US-centric point of view.”</span></strong></h3><p>This also results in the models performing better on tasks that involve
going from Language X to English than on tasks that involve going from
English to Language X. Because of the data scarcity and substantial
costs involved in correcting this it is not likely to change soon.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiIu7sTDzBbzUt1zGtoBYTX90qENF5N_u2ynftzh369vwDrmMMTTV5ZENuj54YkPXwB1o_RHeafOONOCo_ndnJOiez15zaMNgFz_egxgpyGxgfxP5JIrqoPldprgOsYUmU-2oondUNHe2iZzcpUxR-C5o0hMiR6T9JhInHGIqo-4vuPBGE7_dpZv2e4Team/s1600/The-Problem-of-English-Dominated-Generative-AI_converted3-1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="692" data-original-width="1600" height="173" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiIu7sTDzBbzUt1zGtoBYTX90qENF5N_u2ynftzh369vwDrmMMTTV5ZENuj54YkPXwB1o_RHeafOONOCo_ndnJOiez15zaMNgFz_egxgpyGxgfxP5JIrqoPldprgOsYUmU-2oondUNHe2iZzcpUxR-C5o0hMiR6T9JhInHGIqo-4vuPBGE7_dpZv2e4Team/w400-h173/The-Problem-of-English-Dominated-Generative-AI_converted3-1.jpg" width="400" /></a></div><p>Because the training text data sets used to train GPT models also
have some other languages mixed in, the generative AI models do pick up
some capability in other languages. However, their knowledge is not
necessarily comprehensive, and in a development
approach that implicitly assumes that scale is all you need, most
languages simply do not have enough scale in training data to perform at
the same levels as English.</p><p>This is likely to change over time to
some extent, and already the Google PaLM model claims to be able to
handle more languages, but early versions show only very small
incremental improvements in a very few select languages.</p><h3 style="text-align: left;"><strong><span style="color: red; font-size: large;">Each
new language that is "properly supported" will require a separate set
of guardrails and controls to minimize problematic model behavior.</span></strong></h3><p></p><p>Thus,
beyond the monumental task of finding massive amounts of non-English
text and re-training the base generative AI model from scratch,
researchers are also trying other approaches, e.g., creating new data
sets of non-English text to accelerate the development of truly
multilingual models, or generating synthetic data from what is
available in high-resource languages like English or Chinese; <strong>both are less effective than simply having adequate data volume in the low-resource language in the first place.</strong></p><p>Nguyen
and other researchers say they would also like to see AI developers pay
more attention to the data sets they feed into their models and better
understand how that affects each step in the building process, not just
the final results. So far, which data and languages end up in models
has been a "random process," Nguyen says.</p><p><strong>So when you
make a prompt request in English, it draws primarily from all the
English language data it has. When you make a request in traditional
Chinese, it draws primarily from the Chinese language data it has. How
and to what extent these two piles of data inform one another or the
resulting outcome is not clear, but at present, experiments show that
they are at least quite independent.</strong></p><p>The training data
for these models were collected through long-term web crawling
initiatives, and a lot of it was pretty random. More rigorous controls
to reach certain thresholds of content for each language (as Google
tried to do with PaLM) could improve the quality of non-English output.
It is also possible that more carefully collected and curated data that
is better balanced linguistically could improve performance across more
languages.</p><h1 style="text-align: left;"><br />The T-LM (Translated Language Model) Offering</h1><p></p><p><strong>The fundamental problems of data acquisition and limited, sub-optimal accessibility described above could take years to resolve</strong>.
Thus, Translated Srl is introducing a way to address the needs of a
larger global population interested in using GPT-4 for content creation,
content analysis, basic research, and content refinement in their
preferred language.</p><p>The following chart shows the improved
performance available with T-LM across several languages. Users can
expect performance to keep improving
as they provide corrective feedback daily.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhatf4l1kkbE659IwLJILcvaPUbP6Cv4Gwscf9Yn3lCpuE7Pr__ZNahZpNDfEpKvRKfTn7eiSuDOpapUj2hfHayvSe-d76uVvLS3yyobtpSCSsG9jV9Vfbj9cvQHjyqxqu7SP7pckVm5SJr_Xp8NfO3Vzwn11lKQZo51Tdt-K3Z6UnRMYV51khjGOntevBm/s1600/The-Problem-of-English-Dominated-Generative-AI_converted4-1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="692" data-original-width="1600" height="173" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhatf4l1kkbE659IwLJILcvaPUbP6Cv4Gwscf9Yn3lCpuE7Pr__ZNahZpNDfEpKvRKfTn7eiSuDOpapUj2hfHayvSe-d76uVvLS3yyobtpSCSsG9jV9Vfbj9cvQHjyqxqu7SP7pckVm5SJr_Xp8NfO3Vzwn11lKQZo51Tdt-K3Z6UnRMYV51khjGOntevBm/w400-h173/The-Problem-of-English-Dominated-Generative-AI_converted4-1.jpg" width="400" /></a></div><p>Combining the power of the state-of-the-art adaptive machine
translation technology with OpenAI's latest language model will empower users<a href="https://blog.modernmt.com/modernmt-significantly-expands-language-coverage/"> across 200 languages</a> to explore the capabilities of GPT-4 in their preferred non-English language and achieve superior performance.</p><p>T-LM will help unlock the full potential of GPT-4 for businesses around the world. It provides companies with a <strong>cost-effective solution to create and restructure content and do basic content research in 200 languages</strong>, bridging the performance gap between GPT-4 in English and non-English languages.</p><p>A detailed overview of<a href="https://blog.modernmt.com/modernmt-significantly-expands-language-coverage/"> the 200 specific languages and their importance in the changing global dynamics is described here</a>.</p><h3 style="text-align: left;"><strong><span style="color: #2b00fe;">Many
users have documented and reported sub-optimal performance from Bing
Chat when they query in Spanish rather than English.</span></strong></h3><p></p><p>In
a separate dialog, when queried in English, Bing Chat correctly
identified Thailand as the rumored location for the next season of the TV
show<a href="https://www.wired.com/story/white-lotus-scene-epitomized-2022/?ref=blog.modernmt.com"> <em>White Lotus</em></a>,
but answered only “somewhere in Asia” when the query was translated into
Spanish, says Solis, who runs a consultancy called Orainti that helps
websites increase visits from search engines.</p><p>Other discussions point out that<a href="https://www.reddit.com/r/ChatGPT/comments/134i1rp/does_chat_gpt_perform_much_worse_in_other/?onetap_auto=true&ref=blog.modernmt.com"> ChatGPT performs sub-optimally in most languages other than English.</a> Techcrunch also ran some tests to demonstrate that<a href="https://techcrunch.com/2023/04/26/why-chatgpt-lies-in-some-languages-more-than-others/?ref=blog.modernmt.com"> ChatGPT has lesser performance in non-English</a> languages.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj9PhXBZuQ0qdtgPRwweMU3zupE6LYzmGQHvP_Hk1c4-nzeycfXY8URMwiIsxSEicMH2xWfhcStYZ2mj6QslFBRooZ_up_6WIoRQxcVYgHs-huHSOF9cp2d8dy5fOaCI4iIbVyv5uj-4bbZxfhDW6p6s0Lt0xAMuItVgpgN9ZOkPO8IB3XzrAN7EkiZH6e_/s953/Twitter%20msg.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="925" data-original-width="953" height="389" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj9PhXBZuQ0qdtgPRwweMU3zupE6LYzmGQHvP_Hk1c4-nzeycfXY8URMwiIsxSEicMH2xWfhcStYZ2mj6QslFBRooZ_up_6WIoRQxcVYgHs-huHSOF9cp2d8dy5fOaCI4iIbVyv5uj-4bbZxfhDW6p6s0Lt0xAMuItVgpgN9ZOkPO8IB3XzrAN7EkiZH6e_/w400-h389/Twitter%20msg.png" width="400" /></a></div><p>Additionally, using GPT-4 in non-English languages can <strong>cost up to 15 times more</strong>
(see the charts below). Research has shown that speakers of certain
languages may be overcharged for language models while obtaining poorer
results, indicating that tokenization may play a role in both the cost
and effectiveness of language models.<a href="https://www.researchgate.net/figure/Estimated-cost-per-language-family-script-relative-to-English-The-language-families-are_fig2_370981335?ref=blog.modernmt.com"> This study shows the difference in cost by language family</a>, which can be significantly higher than for English.</p><p><a href="https://blog.yenniejun.com/p/all-languages-are-not-created-tokenized?ref=blog.modernmt.com">Independent researchers point out how token counts for the same prompt vary across</a>
languages, and that some languages consistently require more tokens. Hindi and Bengali (spoken by over 800 million people combined) resulted in a median token length of about 5 times
that of English. The ratio is 9 times that of English for Armenian and
over 10 times that of English for Burmese. <span style="color: #2b00fe;">In other words, <strong>to express the same prompt or sentiment, some languages require up to 10 times more tokens</strong>.</span></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-mmkl8xcJ9jqHOyTLFlCSh6Qnly9Auw8fiw4tTiIDO2bVhJ8oaQ5TtMKWnyW-cgtNuoJxIsQ5frsDsgjT7o8cInLx-NhwBWEkzsubY0nypQu_lh5yLreIOf45ICzFToUE_hUbfXDe_zutfrHjVCdWHX83R3mcECwdFslO0BgnjhFE-ady-Gi6aAni-yXC/s1009/Tokens-in-LLM-1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="450" data-original-width="1009" height="179" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-mmkl8xcJ9jqHOyTLFlCSh6Qnly9Auw8fiw4tTiIDO2bVhJ8oaQ5TtMKWnyW-cgtNuoJxIsQ5frsDsgjT7o8cInLx-NhwBWEkzsubY0nypQu_lh5yLreIOf45ICzFToUE_hUbfXDe_zutfrHjVCdWHX83R3mcECwdFslO0BgnjhFE-ady-Gi6aAni-yXC/w400-h179/Tokens-in-LLM-1.png" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><figcaption><span style="font-size: xx-small;"><b><strong style="white-space-collapse: preserve;">Source:</strong></b><a href="https://blog.yenniejun.com/p/all-languages-are-not-created-tokenized?ref=blog.modernmt.com"><b><strong style="white-space-collapse: preserve;"> All languages are NOT created (tokenized) equal</strong></b></a></span></figcaption><figcaption><br /></figcaption><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi1494jXzAB5Gb2Hu6bR47NfLk9-6yYeWE6AwGRyr-KTiuF4MI8U6t_omFYyqNQm0OsEDDQl1tTR5zlRQKzOBjj743Rt0kE84B5PCpVQH9We_795fygVJjQFgkYc6558dQJ0S2TARS4DgkeD2ARHWqhljDF76htXVJshaiO5OJrJTzb8Pn57XCmcHZZyI7K/s862/Median-Token-Length-vs-English-1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="436" data-original-width="862" height="203" 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi1494jXzAB5Gb2Hu6bR47NfLk9-6yYeWE6AwGRyr-KTiuF4MI8U6t_omFYyqNQm0OsEDDQl1tTR5zlRQKzOBjj743Rt0kE84B5PCpVQH9We_795fygVJjQFgkYc6558dQJ0S2TARS4DgkeD2ARHWqhljDF76htXVJshaiO5OJrJTzb8Pn57XCmcHZZyI7K/w400-h203/Median-Token-Length-vs-English-1.png" width="400" /></a></div><br /><figcaption><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><figcaption><b><strong style="white-space-collapse: preserve;"><span style="font-size: x-small;">To express the same sentiment, some languages require up to 10 times more tokens</span></strong></b></figcaption></figure><p><br /></p><h3 style="text-align: left;">Implications of tokenization language disparity</h3><p style="text-align: left;">Overall, requiring more tokens (to tokenize the same message in a different language) means:</p><ul><li style="text-align: left;">Non-English users are limited in how much information they can put in the prompt (because the context window is fixed).</li><li style="text-align: left;">It is more costly as generally more tokens are needed for equivalent prompts.</li><li style="text-align: left;">It is slower and takes longer to run and often results in more fabrication and other errors. </li></ul><h3 style="text-align: left;"><strong><span style="color: #2b00fe;">OpenAI’s
models are increasingly being used in countries where English is not
the dominant language. According to SimilarWeb.com, the United States accounted for only 10% of the traffic sent to ChatGPT in Jan-March 2023. India, Japan, Indonesia, and France each have user populations almost as large as the US user base. </span></strong></h3><p style="text-align: left;">Translated's
T-LM service integrates the company’s award-winning adaptive machine
translation (ModernMT) with GPT-4 to bring advanced generative AI
capabilities to every business in<a href="https://blog.modernmt.com/modernmt-significantly-expands-language-coverage/"> the languages spoken by 95% of the world's population.</a>
This approach also lowers the cost of using GPT-4 in languages other
than English, since the pricing model is based on text segmentation
(tokenization) that is optimized for English. Because all prompts submitted to GPT-4 are in English, the billing is equivalent
to the more favorable and generally lower-cost English tokenization.
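The billing arithmetic can be sketched in a few lines. All numbers below are hypothetical placeholders (the per-1K-token rate, the prompt size, and the 5x token inflation drawn from the Hindi example earlier), not actual OpenAI pricing; the sketch only illustrates the multiplier effect of billing at English token counts.

```python
# Illustrative sketch of English-based billing. All numbers are
# hypothetical; real GPT-4 pricing and token counts vary by model,
# language, and tokenizer version.

def cost(tokens: int, rate_per_1k: float) -> float:
    """Cost of a request billed per 1,000 tokens."""
    return tokens / 1000 * rate_per_1k

RATE = 0.03           # hypothetical rate in $ per 1K tokens
english_tokens = 200  # tokens for a prompt written in English
inflation = 5         # e.g. Hindi ~5x, Burmese 10x+ (per the study cited above)

# Sending the non-English prompt directly vs. billing at the English count:
direct = cost(english_tokens * inflation, RATE)
via_tlm = cost(english_tokens, RATE)

print(f"direct: ${direct:.3f}, via T-LM: ${via_tlm:.3f}, ratio: {direct/via_tlm:.0f}x")
```

With a 10x token inflation (as reported for Burmese), the same arithmetic yields a 10x cost ratio, which is how the "up to 15 times more" figure arises for the worst-tokenized languages.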
T-LM therefore always uses the English token count for
billing.</p></figcaption></div></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhl46JZUgzhMXb80Kkksts6R4EhjSUVU37NscxRp-_-cdTaF9vkUURrDxyEAgVsLLRQgS3MKtUSlSse-1WwrWW-embiwY0gKwjLuhxrdDoIzM31lWZArYd3Rhr2C-nS6cpMzNLIB6Lfi2RBnYaS-7mzAKPvbSvCBkdPoKBGE24AwadTkHBS3Pr2TrBl_GmZ/s1600/gpt4-vs-tlm.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="848" data-original-width="1600" height="213" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhl46JZUgzhMXb80Kkksts6R4EhjSUVU37NscxRp-_-cdTaF9vkUURrDxyEAgVsLLRQgS3MKtUSlSse-1WwrWW-embiwY0gKwjLuhxrdDoIzM31lWZArYd3Rhr2C-nS6cpMzNLIB6Lfi2RBnYaS-7mzAKPvbSvCBkdPoKBGE24AwadTkHBS3Pr2TrBl_GmZ/w400-h213/gpt4-vs-tlm.png" width="400" /></a></div><p>The<a href="https://blog.modernmt.com/understanding-adaptive-machine-translation/"> Adaptive ModernMT technology</a>,
unlike most other MT technology available today, can learn and improve
dynamically and continuously with ongoing corrective feedback daily.
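In simplified form, the feedback loop looks like this. This is a toy simulation, not ModernMT's actual API; a real adaptive engine updates model context rather than a lookup table, but the loop (translate, correct, improve) has the same shape, and all names below are illustrative.

```python
# Toy model of adaptive MT: corrections feed back into future output.

class AdaptiveTranslator:
    def __init__(self, baseline):
        self.baseline = baseline  # stand-in for a generic (static) MT engine
        self.corrections = {}     # source text -> human-approved translation

    def translate(self, source: str) -> str:
        # Prefer a translation a human editor has already approved.
        if source in self.corrections:
            return self.corrections[source]
        return self.baseline(source)

    def feedback(self, source: str, corrected: str) -> None:
        # Each post-edit immediately improves subsequent output.
        self.corrections[source] = corrected


mt = AdaptiveTranslator(baseline=lambda s: s.upper())  # placeholder "engine"
first = mt.translate("hello")   # baseline output: "HELLO"
mt.feedback("hello", "hola")    # editor supplies the correct translation
second = mt.translate("hello")  # adaptive result: "hola"
```

A static system corresponds to the `baseline` alone: the same input always yields the same output, no matter how often users correct it.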
Thus, users who work with T-LM can drive continuous improvements in
output produced from GPT-4 by providing corrective feedback on the
translations produced by T-LM. This is not possible with the static MT systems in common use, where users are limited to generic system performance.</p><p><strong>T-LM
addresses the performance disparity experienced by non-English users by
translating the initial prompt from the source language into English, and the GPT-4 output back into the user's language, using a specialized model optimized for the linguistic characteristics typical of prompts.</strong></p><p>T-LM
combines GPT-4 with ModernMT, an adaptive machine translation engine,
to offer near English-level GPT-4 performance in 200 languages. </p><ul><li>T-LM
works by translating non-English prompts into English, executing them
using GPT-4, and translating the output back to the original language,
all using the ModernMT adaptive machine translation.</li><li>T-LM is available to enterprises via an API and to consumers through a ChatGPT plugin.</li></ul><h3 style="text-align: left;"><strong><span style="color: #800180;">The
result is a more uniform language model performance capability across
many languages and enhanced GPT-4 performance in non-English languages.</span></strong></h3><p></p><ul><li>Customers can optionally use their existing ModernMT keys to employ adaptive models within GPT-4.</li><li>An
indirect benefit of T-LM is that it can cost up to 15x less than GPT-4 alone, thanks to a reduced number of billed tokens: GPT-4 counts significantly more tokens in non-English languages, while T-LM always bills at the English token count.</li></ul><p>Therefore,
Translated's integration with OpenAI enhances GPT-4's performance in
non-English languages by combining GPT-4 with the ModernMT adaptive
machine translation, resulting in a more uniform language model
capability across languages and lower costs.</p><h3 style="text-align: left;"><strong><span style="color: #800180;">Use
cases for T-LM include assisting global content creation teams in a
broad range of international commerce-related initiatives, allowing
companies from Indonesia, Africa, and various parts of India to make
their products visible in online eCommerce platforms to US and EU
customers, providing better multilingual customer support, and making global
user-generated content visible and understandable in the customer’s
language.</span></strong></h3><p><br /><strong>T-LM can be used in many text analysis tasks needed in business settings</strong>, e.g., <strong>breaking down and explaining complicated topics</strong>,
outlining blog posts, sentiment analysis, personalized responses to
customers, summarization, creating email sales campaign material, or
suggesting answers to customer agents.</p><h3 style="text-align: left;"><strong><span style="color: #2b00fe;">T-LM
works together with GPT-4 to create a wide range of written content or augment existing content to give it a different tone, softening or professionalizing the language, to automate content creation and transformation while providing a fast and engaging user experience. This is now possible in the 200 languages that ModernMT supports.</span></strong></h3><p></p><p>“There are many ways GPT-4 can produce
‘draft’ text that meets the length and style desired, which can then be
reviewed by the user,”<a href="https://www.computerworld.com/article/3687614/how-enterprises-can-use-chatgpt-and-gpt-3.html?ref=blog.modernmt.com"> Gartner said in a report on how to use GPT-4.</a>
“Specific uses include drafts of marketing descriptions, letters of
recommendation, essays, manuals or instructions, training guides, social
media or news posts.”</p><p>T-LM will allow students around the world to access knowledge content, using GPT-4 as a research assistant to reach a much larger pool of information. In education, GPT-4 can be
used to create personalized learning experiences, as a tutor would. And,
in healthcare, chatbots and applications can provide simple language
descriptions of medical information and treatment recommendations.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhB5aFym2CFsAVRQcha5V5-IqCCcC_3dW60jjrFNSAOzisNiTlLb4hVp55IqT_sIMjyDcM7NQIY8yrXLSO2SzZus_5bc-bOGnxGhaNcwA_odDbNeDRawUcjD6lID_YmqQtXYaqskqk-Tn-ILMwz2dehxSI1tm76i12jxjHLuAgc_FSOLpEJFaJzC1DiySF_/s1549/mateTavola-disegno-1@3x-100-1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1549" data-original-width="1000" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhB5aFym2CFsAVRQcha5V5-IqCCcC_3dW60jjrFNSAOzisNiTlLb4hVp55IqT_sIMjyDcM7NQIY8yrXLSO2SzZus_5bc-bOGnxGhaNcwA_odDbNeDRawUcjD6lID_YmqQtXYaqskqk-Tn-ILMwz2dehxSI1tm76i12jxjHLuAgc_FSOLpEJFaJzC1DiySF_/w259-h400/mateTavola-disegno-1@3x-100-1.jpg" width="259" /></a></div><br /><p><strong>T-LM will enhance the ability of large and SME businesses to
engage in new international business by assisting with basic communication and understanding, and by providing more complete
documentation on business proposals using the strengths of both GPT-4
and T-LM working together.</strong></p><p>T-LM is available now through API. More information on the service can be found at<a href="https://translatedlabs.com/gpt?ref=blog.modernmt.com"> translatedlabs.com/gpt</a>.</p>Kirti Vasheehttp://www.blogger.com/profile/16795076802721564830noreply@blogger.com2tag:blogger.com,1999:blog-6748877443699290050.post-33465115057106744482023-06-23T15:27:00.000-07:002023-06-23T15:27:42.544-07:00The MT languages that will matter most over the next 50 years<p> Translated recently announced that ModernMT now supports 200
languages, setting a new benchmark in the industry. No other commercial
MT service currently supports such an extensive range of languages. By
expanding its language coverage to a potential reach of 6.5 billion
people, Translated enhances the ability of enterprises to create
stronger connections with their users and customers worldwide, fostering
better communication and understanding of a larger global customer
base.</p><p>We are rapidly moving to a world where a global enterprise needs to expand the scope and nature of its
communication and information sharing with the modern digital customer.
There is a relationship between content strategy, e-commerce, and MT, as
much of this new content that enhances the customer experience is
constantly changing, and there is great value in making it multilingual
to enable engagement with a broader global customer base.</p><p><strong>The modern digital-first customer demands and expects much more relevant information from every organization they interact with.</strong>
A lack of needed information can easily cause a potential customer to walk away from a brand that may otherwise be a very good fit for their needs.
We know that today:</p><ul><li>The modern buyer and customer journey has many digital touchpoints.</li><li>Global-savvy
companies are increasingly moving to a business model focused on
customer needs, attempting to serve as much information as needed to
improve the global customer experience.</li><li><strong>Companies are
translating everything that might be useful to a customer, not just what
is mandated by local commercial regulations</strong>.</li></ul><p>So,
what is required of a globally focused
business to be successful with an ever-growing global customer base?</p><p>The list of actions recommended by globalization consultants on <strong>best practices for providing relevant information to the modern digital-first customer includes all of the following actions:</strong></p><ul><li><strong>Personalize</strong> communication and content to their interests and needs</li><li>Provide easy access to information through <strong>self-service portals</strong> and knowledge bases</li><li><strong>Utilize chatbots and AI technology</strong> to provide instant and accurate answers to common questions</li><li><strong>Collect customer feedback </strong>to continuously improve and adjust information and communication strategies</li><li><strong>Offer various communication channels</strong> for different preferences, such as email, phone, social media, and messaging apps.</li></ul><p><strong>A
global enterprise can immediately establish a comprehensive digital
presence in international markets if they solve the translation
challenge and make relevant content multilingual at scale. </strong>The
notion of digital-first applies both to the global enterprise and the
global customer. Digital-first also means that it is much easier for a
company to start engaging with a global customer base. A digital-first
globalization strategy allows the enterprise to expand rapidly and be
customer-centric across the world.</p><p><strong>Never has it been more
critical for a company to translate its content into many different
languages to quickly establish a relationship with global customers</strong>.
The sheer volume and broad scope of the task require that the translation challenge be handled in a way that is efficient, streamlined, and scalable. </p><p>To achieve a comprehensive digital
presence, a global enterprise needs to focus on solving the translation
challenge and making all relevant content multilingual rather than
trying to minimize the translated content volume. </p><div style="text-align: left;"><span style="font-size: large;"><span style="color: #2b00fe;"><b>The modern era requirement in the digital age is to "translate everything".</b></span></span></div><div style="text-align: left;"><span style="font-size: large;"><b><span style="color: #2b00fe;"><br /></span><span style="color: #2b00fe;">Achieving this goal necessitates utilizing <a href="https://blog.modernmt.com/understanding-adaptive-machine-translation/">an adaptable machine translation technology</a> that consistently learns and enhances itself, allowing corporations to interact with individuals in developing economies worldwide.</span></b></span></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhNVQEVXIQ5xQ4w2qANxEDGltOYrQR7Rq4ujzJuI6J_3OMvL6GSmKwwv9J-Wk6l4q7-xgPZrtiRPks39G8s6PzldH5zyNX2dlfziTTYUDuEtZTfIVmBSH2ptrgprGBEi-7q4nUG48WdkZC0Q9QB3kV5buVXFBMPB1gV52wsYupRWM-hFiJ-LeSVzaYeTREM/s2362/Risorsa-22.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1296" data-original-width="2362" height="220" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhNVQEVXIQ5xQ4w2qANxEDGltOYrQR7Rq4ujzJuI6J_3OMvL6GSmKwwv9J-Wk6l4q7-xgPZrtiRPks39G8s6PzldH5zyNX2dlfziTTYUDuEtZTfIVmBSH2ptrgprGBEi-7q4nUG48WdkZC0Q9QB3kV5buVXFBMPB1gV52wsYupRWM-hFiJ-LeSVzaYeTREM/w400-h220/Risorsa-22.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Source: The World in 2050 Study by PwC<br /><div style="text-align: left;"><br /></div></td></tr></tbody></table><br /><div><h1 style="text-align: left;"><strong>The Changing Customer Requirements</strong></h1><p>The
primary motivation for translating enterprise content is to enable and
drive international revenue. Translation of all relevant content is
necessary for building an international business. While translating product packaging and basic usage instructions is mandated, there is now a growing need to translate more dynamic content: a mix of unstructured internal corporate content and external content such as user impressions, reviews, and feedback from social media and influencers about a company's product offerings.</p><p>The digital landscape and audience of the 21st century present a challenging environment for modern global enterprises. </p><div style="text-align: left;"><span style="color: #2b00fe; font-size: large;"><b>Most
internet users do not search for products, but rather they seek answers
to questions. If an enterprise can provide useful answers, potential
customers may develop trust and possibly become advocates and product
champions.</b></span></div><p></p><p>If an enterprise can educate customers on the general subject domain, not just product-specific subject matter, potential customers may begin to trust its content; and if it provides a good customer experience after purchase, they may even advocate for its products.</p><p><strong>Generally, no potential customer is
searching and hoping to find a website that is merely a corporate ad,
filled with self-congratulating content on how great the company thinks
it is and how wonderful it thinks its products are</strong>.</p><p>Unfortunately, this kind of crap content is still common on many corporate websites, which are filled with marketing-speak. Social
media technologies have facilitated an ongoing, real-time dialogue that
has reversed the traditional direction of conversations between brands
and their customers. </p><div style="text-align: left;"><span style="color: #2b00fe; font-size: large;"><b>Consumers are now leading the conversation and brands need to listen. </b></span></div><p></p><div style="text-align: left;"><span style="color: #2b00fe; font-size: large;"><b>Trusted,
authentic user-generated content (UGC) has a significant influence on
purchasing decisions, especially in online marketplaces. </b></span></div><p></p><p>New buyers trust the shared authentic experience of other real customers more than any slick pitch created by the corporation. </p><div style="text-align: left;"><span style="color: #2b00fe; font-size: large;"><b>Customers want to know what does not work well, as much as what does, BEFORE they buy.</b></span></div><p></p><p><strong>Marketing-speak
refers to the use of clichés, buzzwords, and empty superlatives in
marketing content. It is typically self-congratulatory in tone and often consists of empty assertions of being “the best” without
meaningful context and trustworthy references. </strong></p><p>Marketing-speak or corporate-speak is often found in press releases, brochures, white papers, and sales letters.</p><p><br /></p><h1 style="text-align: left;">Changing Macroeconomic Trends</h1></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_ikkCgoUfEmrYMWi3Ln9CllqfuCvXKRqtH6C1jzSwEUnuVBCKDjnOuKTmIhLVIIfh63zI4Vcs6oRXViEU5LSYjqJj6-HX23w-01HrQ7s5btO1U6LsNBH2yC5o5uuH1nLnLfnWQqptUc_WXkiHMnb2P0mSU4KSS7NAIdUNL1rGXc2K2898Ro4r0ETztNDJ/s960/GS%20View.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="540" data-original-width="960" height="225" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_ikkCgoUfEmrYMWi3Ln9CllqfuCvXKRqtH6C1jzSwEUnuVBCKDjnOuKTmIhLVIIfh63zI4Vcs6oRXViEU5LSYjqJj6-HX23w-01HrQ7s5btO1U6LsNBH2yC5o5uuH1nLnLfnWQqptUc_WXkiHMnb2P0mSU4KSS7NAIdUNL1rGXc2K2898Ro4r0ETztNDJ/w400-h225/GS%20View.png" width="400" /></a></div><p>The world is also changing both in terms of geographical shifts in
international trade opportunities and relative economic power. The
relative importance of emerging and developing economies is growing, and
any global enterprise wishing to be relevant 10 years from now needs to
understand this shift. </p><p style="text-align: left;"><span style="font-size: large;"><b><span style="color: #2b00fe;">The global economy is gradually moving away from a G7-dominated perspective. Examining these patterns clarifies why this language expansion project is crucial at this moment.</span> </b></span></p><p></p><p>This comparison between the G7 and E7 economies done by <a href="https://www.pwc.com/gx/en/research-insights/economy/the-world-in-2050.html?ref=blog.modernmt.com">Pricewaterhouse Coopers</a> (PwC) for “The World in 2050” study, provides a capsule view of these trends. <strong>PwC estimates six of the seven largest economies in the world are projected to be emerging economies in 2050 </strong>led by China (1st), India (2nd), and Indonesia (4th).</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZoH3WrAmOx4IFeySzwnU9xcE8lugyeoOdJG5zgm0KJsPqcbefogjCPyXkLhGFStuPgF0Q00_assjXHsZLRnUxGn4yaPlNxFG0t1SlC0XiDKDwQod19oKAZsm43JQRwZ_x2UT9uI9vJNXEqU0W6I0ewwzZGKE38cClr_hUz1yVsENSXdC9j7XRa4dwwUJD/s2347/Risorsa-27.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1237" data-original-width="2347" height="211" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZoH3WrAmOx4IFeySzwnU9xcE8lugyeoOdJG5zgm0KJsPqcbefogjCPyXkLhGFStuPgF0Q00_assjXHsZLRnUxGn4yaPlNxFG0t1SlC0XiDKDwQod19oKAZsm43JQRwZ_x2UT9uI9vJNXEqU0W6I0ewwzZGKE38cClr_hUz1yVsENSXdC9j7XRa4dwwUJD/w400-h211/Risorsa-27.png" width="400" /></a></div><br /><div>Over the coming decades, emerging economies will drive global growth.
Vietnam, India, and Bangladesh could be three of the fastest-growing
larger economies over this period. <strong>This growth momentum directly
relates to the languages that are growing in importance. By 2050, PwC
projects that the G7’s share of world GDP will fall to only around 20%,
while the E7 will increase their share to almost 50% of global GDP at
Purchasing Power Parity (PPPs). </strong>This means that the top 10 Indic languages which include Bengali become strategic growth opportunities.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTv76lg8JZG-cQr54A0Vngo0MeYY_p3wLd3UxBN7v2ysNM9TvfESZQLBRO4Y9zAzTwckfCZqQPfxFv82TZxyUs8AulXR211nLB8epq5SPsuHoFb_Z67WlY-SwNLsshYZlXAIY5kw8pOWT_52bKeQyNPoubQzpIOPUJ7HU2_RyHUowmhmM6cqC55QEyaPQC/s894/200-LPs-Overview-2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="685" data-original-width="894" height="306" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTv76lg8JZG-cQr54A0Vngo0MeYY_p3wLd3UxBN7v2ysNM9TvfESZQLBRO4Y9zAzTwckfCZqQPfxFv82TZxyUs8AulXR211nLB8epq5SPsuHoFb_Z67WlY-SwNLsshYZlXAIY5kw8pOWT_52bKeQyNPoubQzpIOPUJ7HU2_RyHUowmhmM6cqC55QEyaPQC/w400-h306/200-LPs-Overview-2.jpg" width="400" /></a></div><br /><div>The chart above shows estimates of three of the fastest-growing
economies with expected improvements in global rankings. In contrast,
three of the fastest-falling countries are Australia (from 19<sup>th</sup> to 28<sup>th</sup>), Italy (12<sup>th</sup> to 21<sup>st</sup>), and Spain (15<sup>th</sup> to 26<sup>th</sup>).</div><div><p>Some other highlights from the <a href="https://www.pwc.com/gx/en/world-2050/assets/pwc-world-in-2050-slide-pack-feb-2017.pdf?ref=blog.modernmt.com">PwC “The World in 2050” research</a> include:</p><ul><li><strong>The
top 15 fastest-growing economies over the next 30 years will all be
developing and emerging market economies according to PwC projections</strong></li><li>Europe’s share of the world economy at PPPs could fall from around 15% to 9% by 2050</li><li>Brazil and Mexico could be larger than Japan and Germany by 2050</li><li>India could increase its share of world GDP at PPPs by 8% to 15% by 2050</li><li>China’s share of world GDP at PPPs could increase to around 20% by 2050</li></ul><p><strong>Rising
incomes in emerging markets will open up great opportunities for
businesses with sufficiently flexible and patient strategies for these
fast-evolving markets.</strong> As the purchasing power of an increasing
portion of the population grows in these regions, so will consumption as a rising middle class emerges.</p><p>Other research
also points to the rise of Asian economies, especially in South and Southeast Asia, as Chinese GDP growth slows with the demographic impact of the One Child policy over the next
two decades.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghSopx8hwOb1ew3tPfOZUeXEZL5MPqGE-g1GORFCNe_UsUPUSRId94mP5kzi-drrY7N5r7Z13VbtkfCsolLM0WcqiHpmLMgG4xjbb7unmvsvv0iUtGyEPWVCbBCEATA1YxXMpJ4ZHgAaDsrcDai16QuhCTIn53QMkDLNquxw9PvZxR4CBZmfEfi7SnpOeD/s792/200-LPs-Overview2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="538" data-original-width="792" height="271" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghSopx8hwOb1ew3tPfOZUeXEZL5MPqGE-g1GORFCNe_UsUPUSRId94mP5kzi-drrY7N5r7Z13VbtkfCsolLM0WcqiHpmLMgG4xjbb7unmvsvv0iUtGyEPWVCbBCEATA1YxXMpJ4ZHgAaDsrcDai16QuhCTIn53QMkDLNquxw9PvZxR4CBZmfEfi7SnpOeD/w400-h271/200-LPs-Overview2.jpg" width="400" /></a></div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZZ5uFtV5QrrvILERSNfpjHWSzBKpumgRIWRd6xNy5AnDGOK3Mw6P6JlU_yzgeQ5AQ3JfEnu_FM0w5HgVnKSGzpkxAwQpBpgWqTFsyLswzo1lqvUi2RBS_k8O6nvvRWX03uQa55fInSw6n3Ubo4pfuabCe1UJbYuAA4XPCIV_1sZUGs3Rx57cdrtj21Ou9/s805/200-LPs-Overview-1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="649" data-original-width="805" height="323" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZZ5uFtV5QrrvILERSNfpjHWSzBKpumgRIWRd6xNy5AnDGOK3Mw6P6JlU_yzgeQ5AQ3JfEnu_FM0w5HgVnKSGzpkxAwQpBpgWqTFsyLswzo1lqvUi2RBS_k8O6nvvRWX03uQa55fInSw6n3Ubo4pfuabCe1UJbYuAA4XPCIV_1sZUGs3Rx57cdrtj21Ou9/w400-h323/200-LPs-Overview-1.jpg" width="400" /></a></div><p>Goldman Sachs was one of the first to point out the emerging rise of the
BRIC economies 15 years ago. This forecast has been more accurate for
the Asian economies but in their <a href="https://www.goldmansachs.com/insights/pages/gs-research/the-path-to-2075-slower-global-growth-but-convergence-remains-intact/report.pdf?ref=blog.modernmt.com">latest research</a>, they expect that growth will be more evenly distributed even though Asian economies will still dominate.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEheFibmxy-0vLmmkjj70rsFmL1E832qgpIWxh2POehloTXaYxNw7ygovnjPuVYN-i-VY-yoGA5hPqgg681kKq9DMpN6AM1UMkXIDjMcSC2OAH3cY7OubPOPMAeYongM2R1BZ_ceAayhYvnVFSfOKI-898uyUvbIR6Rlpjk5ElG87bROL57XNVO_vm8rNe2W/s1119/200-LPs-Overview4.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="589" data-original-width="1119" height="210" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEheFibmxy-0vLmmkjj70rsFmL1E832qgpIWxh2POehloTXaYxNw7ygovnjPuVYN-i-VY-yoGA5hPqgg681kKq9DMpN6AM1UMkXIDjMcSC2OAH3cY7OubPOPMAeYongM2R1BZ_ceAayhYvnVFSfOKI-898uyUvbIR6Rlpjk5ElG87bROL57XNVO_vm8rNe2W/w400-h210/200-LPs-Overview4.jpg" width="400" /></a></div><p><strong>China, Vietnam, Uganda, Indonesia, and India are projected to be among the fastest-growing economies by 2030 according to the <a href="https://atlas.cid.harvard.edu/growth-projections?ref=blog.modernmt.com">Harvard Growth Lab projections</a>.</strong>
Their research also factors in the ability of the country to develop
complex production capabilities and finds that countries that have
diversified their production into more complex sectors, like Vietnam and
China, are those that will experience the fastest growth in the coming
decade.</p><p>The Harvard Growth Lab has identified three growth poles using their <a href="https://atlas.cid.harvard.edu/growth-projections?ref=blog.modernmt.com">Economic Complexity Index</a> (ECI) which they believe is a much better predictor of economic growth prospects. </p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgViz1s7Adr73LAje1sFjV3M1rfRwxe1d_4APbC-N7--wHmfIG_WAIqQYYdoAmPQw6bRjTfG9abz1YEow_OIJNj0dIPdWqJPJV7hXbMjfces4fatxoHVzq68renuYfSD9_mgqwJvXfcGVnpd1WvKvB078OX0A8e6LiiYr251LRyCncEp7V7n_uh19G__dQc/s1079/200-LPs-Overview2-1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="837" data-original-width="1079" height="310" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgViz1s7Adr73LAje1sFjV3M1rfRwxe1d_4APbC-N7--wHmfIG_WAIqQYYdoAmPQw6bRjTfG9abz1YEow_OIJNj0dIPdWqJPJV7hXbMjfces4fatxoHVzq68renuYfSD9_mgqwJvXfcGVnpd1WvKvB078OX0A8e6LiiYr251LRyCncEp7V7n_uh19G__dQc/w400-h310/200-LPs-Overview2-1.jpg" width="400" /></a></div><p>They state that several Asian economies already hold the necessary
economic complexity to drive the fastest growth over the coming
decade, led by <strong>China, Cambodia, Vietnam, Indonesia, Malaysia, </strong>and <strong>India</strong>.
In East Africa, several economies are expected to experience rapid
growth, though this is driven more by population growth than gains in
economic complexity, and this includes <strong>Uganda, Tanzania, </strong>and <strong>Mozambique</strong>.
They also found several Eastern European countries, including Georgia, Lithuania, Belarus, Armenia, Latvia, and Romania, ranking high on a per capita basis because of improvements in their ECI.</p><p>According to the IMF’s recent World Economic Outlook on Africa, five of the world’s fastest-growing economies are <strong>Angola, Ethiopia, Nigeria, Kenya, and South Africa</strong>. However,
many experts say that for most of Africa, the business opportunity for global enterprises lies further in the future, perhaps in the 10-to-20-year time frame, as a properly supported demographic dividend (improved infrastructure, education, and health services) starts to kick in. However, there are some exceptions, as shown in the Growth
Lab chart above.</p><p>This growth momentum data correlates with macroeconomic business potential, but a global business must assess the market opportunity by weighing several factors beyond the presence of a large and growing potential customer base. Apart from product fit and the basic organizational infrastructure needed to serve customers around the world, several other macroeconomic factors also need to be considered. These
include all of the following:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEizmhPPP5gyuSwvpTRA2UDyqNgJbpyOM5NaDlHjtFI2Sfjcgeb5TVYpFNuHGe8SQeYSS1S-n9WMLipB6lZHDtBdp7wT_fjvhTf6nUuWZx5jzsvneWNk5z6LQLqy13wil0n9wY4nSV2CWEg3dJY50YaGV0pl9jFUTO8hIgGtIfXoQxU-FNGf4JLGTFE1Wx0x/s1131/ModernMT-Expands-Language-Coverage-to-200-Languages-04.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="479" data-original-width="1131" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEizmhPPP5gyuSwvpTRA2UDyqNgJbpyOM5NaDlHjtFI2Sfjcgeb5TVYpFNuHGe8SQeYSS1S-n9WMLipB6lZHDtBdp7wT_fjvhTf6nUuWZx5jzsvneWNk5z6LQLqy13wil0n9wY4nSV2CWEg3dJY50YaGV0pl9jFUTO8hIgGtIfXoQxU-FNGf4JLGTFE1Wx0x/w400-h170/ModernMT-Expands-Language-Coverage-to-200-Languages-04.png" width="400" /></a></div><br /><p>After this analysis has been done the best-fitting products and services can be presented to the new market. <strong>Market
viability tests can often be run initially by creating a digital presence and storefront to assess interest and better define
implementation issues.</strong><strong> </strong></p><div class="separator" style="clear: both; text-align: center;"><strong><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIyZp_NKMAUnqYDJoYRkgKIVKhraPP9RzjdQXzFYFs443p46dXUXPB2z_WXyWdOzUUfQBvOyB_Lh1ut7NZQieyydFIcA2ynkp0vOS7H86XUpCdxEM_Y1jREg3ZKgV_kEXJm34gIOgHrU2td1CuIB0ZoF7xnrnmqclB3a-fDi26oZI6R47b0H-uS997_QLS/s1020/ModernMT-Expands-Language-Coverage-to-200-Languages-05.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="251" data-original-width="1020" height="99" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIyZp_NKMAUnqYDJoYRkgKIVKhraPP9RzjdQXzFYFs443p46dXUXPB2z_WXyWdOzUUfQBvOyB_Lh1ut7NZQieyydFIcA2ynkp0vOS7H86XUpCdxEM_Y1jREg3ZKgV_kEXJm34gIOgHrU2td1CuIB0ZoF7xnrnmqclB3a-fDi26oZI6R47b0H-uS997_QLS/w400-h99/ModernMT-Expands-Language-Coverage-to-200-Languages-05.png" width="400" /></a></strong></div><strong><br /></strong><p></p><p><strong>It is at this point that the relevant content for the buyer
and customer journeys, and the associated translation issues, come into focus. This is
what ModernMT and the service and process infrastructure at Translated
are designed to address.</strong></p><h1 style="text-align: left;"><strong>The ModernMT Language Expansion</strong></h1><p>One
of the motivating ideas behind Translated's language expansion efforts
is to help our enterprise customers reach more of the world's
population. By expanding language coverage to potentially reach 6.5
billion native speakers, Translated enables companies to forge stronger
connections with global users and customers by building the core
translation infrastructure needed to communicate with, listen to, and share information with
new customer groups.</p><p><strong>This initial launch and introduction of these new languages is the beginning of an evolutionary process</strong>.
Translated has ensured that the current output quality of these new language systems is at least equal to, or better than, that of systems produced by Big Tech companies, and has added high-quality training data, where available, to immediately improve the performance of “low-resource”
languages.</p><p><strong>The expectation and plan behind this launch are to enable these language systems <a href="https://blog.modernmt.com/understanding-adaptive-machine-translation/">to start improving immediately using the highly adaptive ModernMT technology</a>.</strong></p><p>History
shows that because of the volume of production work, greater data
availability, and ongoing activity around the high-resource languages,
those MT systems can reach levels of accuracy where discussions of
human-equivalent performance are possible. Thus, we begin this journey
with a large set of new languages. In addition to this gradual
improvement effort, ongoing fundamental research will continue to
increase the rate at which these systems can and will improve.</p><p style="text-align: left;"><span style="color: #2b00fe; font-size: large;"><b>Digital
leadership in emerging markets will require that enterprises translate
tens of millions of words a month so that they can listen to, communicate with, understand, and share relevant information with these
customers.</b></span></p><p>For the first time, 30 new languages are
supported in the market, leapfrogging directly to the more powerful
adaptive MT technology. Among the new languages now supported by
ModernMT are Bengali, Punjabi, and Javanese, which, together with all the other newly added languages, are spoken by over 2 billion people
worldwide. Many of these languages have high commercial potential,
enabling companies to connect and engage with some of the
fastest-growing economic regions in the world.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhzs_KmvwDd_uVic8PRIGQ3UojiA5I31cECs6JxXRSOBjmxNcls200s8sXsMrdeHZt7mEbkGp7d9yGrc3B8NXxPSbt4EnQ13Wuh32ka825yhfvbEHioMhiR4tPU59GxjxFAYJShpNOyyit1fqKO6qbg7314GXxXOlMLnIYdNs8rs0DVBbdu0nHWvtyIDKhu/s1017/ModernMT-Expands-Language-Coverage-to-200-Languages-06.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="306" data-original-width="1017" height="120" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhzs_KmvwDd_uVic8PRIGQ3UojiA5I31cECs6JxXRSOBjmxNcls200s8sXsMrdeHZt7mEbkGp7d9yGrc3B8NXxPSbt4EnQ13Wuh32ka825yhfvbEHioMhiR4tPU59GxjxFAYJShpNOyyit1fqKO6qbg7314GXxXOlMLnIYdNs8rs0DVBbdu0nHWvtyIDKhu/w400-h120/ModernMT-Expands-Language-Coverage-to-200-Languages-06.png" width="400" /></a></div><br /><p>In the customer-centric world of the future, it is also important
that key tools in the localization technology stack can easily
interact and connect to <a href="https://blog.modernmt.com/understanding-adaptive-machine-translation/">superior continuous learning tools like ModernMT.</a> To further enable this we have also added API connectivity to <a href="https://www.blackbird.io/?ref=blog.modernmt.com">Blackbird.io</a>
, an Integration Platform as a Service (iPaaS). The reach of this inter-application connectivity will continue to grow. This will
allow ModernMT to ingest data from and export translated data back to a
growing set of TMS, CMS, Marketing Automation, CDP, Analytics, QE, and
Storage solutions needed in modern CX-related automation deployments. </p><p><strong>Unfortunately, many TMS systems of yesteryear still cannot interact with <a href="https://blog.modernmt.com/understanding-adaptive-machine-translation/">fast-evolving adaptive MT systems</a>; they trap enterprise data, creating tech debt that undermines global success. Buyers need to be wary of such systems and move to open-source approaches that allow the greatest flexibility and agility.</strong></p><p>Marco Trombetti was <a href="https://multilingual.com/modernmt-marco-qa/?ref=blog.modernmt.com">interviewed by Multilingual</a> about
this announcement, where he explains that data scarcity and enabling the new languages to work with the adaptive MT architecture were the two
main challenges that had to be overcome to achieve this.</p><p>He also said, <strong><em>“We envision that this effort is merely the
first step, and while 200 languages may appear substantial, it is not an
extraordinary figure. We are at the beginning, and <a href="https://jlmr-zgpvh.maillist-manage.net/click/19d1d8cf00784053/19d1d8cf00783fc0?ref=blog.modernmt.com">we plan to refine adaptive MT support</a> for these languages in the coming months, as well as for numerous others.</em></strong>”</p><h3 id="for-a-full-listing-of-the-languages-supported-by-modernmt-httpswwwmodernmtcomapilanguages">For a full listing of the languages supported by ModernMT: <a href="https://www.modernmt.com/api/?ref=blog.modernmt.com#languages">https://www.modernmt.com/api/#languages</a></h3><p><br /></p></div>Kirti Vasheehttp://www.blogger.com/profile/16795076802721564830noreply@blogger.com0tag:blogger.com,1999:blog-6748877443699290050.post-28921842648032841602023-05-30T13:52:00.003-07:002023-05-30T14:13:26.346-07:00Understanding Adaptive Machine Translation<p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;">Machine translation has been around for over 70 years and has made steady progress in tackling what many consider to be one of the most difficult challenges in computing and artificial intelligence. 
We have seen the approach to this challenge change and evolve, and MT has become much more widely used, especially since the advent of neural MT.</p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;">The deep learning neural net methods used in Neural MT have led to significant improvements in output quality, especially in terms of improved fluency, and have encouraged much wider use of machine translation in global business-driving applications.</p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;">While the momentum of Neural MT is well understood and recognized as a major advance in state-of-the-art (SOTA) machine translation, it is surprising that Adaptive MT has not had a greater impact.</p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 
#0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;">This is especially true in the enterprise and professional translation market, where Adaptive MT can address specific and unique business needs much more effectively than alternatives. This paper explains why, even in the age of large language models, it remains a critical technology for global enterprises and professional users.</p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; margin: 0px 0px 10px;"><span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;"><span style="color: #2b00fe; font-size: medium;">To better understand the value of Adaptive MT systems, it is useful to present a contrast to the typical generic static systems that most are familiar with.</span></span></p><p 
style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; margin: 0px 0px 10px;"><span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;"></span></p><h3 style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 20px; font-weight: 500; line-height: 1.1; margin-bottom: 10px; margin-top: 20px; text-align: center;"><span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; 
--tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;">The Typical Generic Static MT Engine</span></h3><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;"></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgEcbN_zDLu-VMQvyuHSi4LHsmFzCf52p7v1lR-xia7_iUJmRWxApDn1h7A2QrFX6F5_ADaKDA4WJapYePtlQ_M-njurVqcFcqhizqGX_hQWu-5AydOXxEpGL9htn6qtB9fbfOv612j9VC2sWeyaycmiedvaX5fXnxKVdJlJ-8aDWGb1jx0MKx43xemWg/s2958/Understanding-Adaptive-Machine-Translation_1%20(1).png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1125" data-original-width="2958" height="153" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgEcbN_zDLu-VMQvyuHSi4LHsmFzCf52p7v1lR-xia7_iUJmRWxApDn1h7A2QrFX6F5_ADaKDA4WJapYePtlQ_M-njurVqcFcqhizqGX_hQWu-5AydOXxEpGL9htn6qtB9fbfOv612j9VC2sWeyaycmiedvaX5fXnxKVdJlJ-8aDWGb1jx0MKx43xemWg/w400-h153/Understanding-Adaptive-Machine-Translation_1%20(1).png" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: left;"><span style="color: #333333; font-family: Helvetica Neue, Helvetica, Arial, sans-serif;"><span style="font-size: 14px;">The MT engines that are generic 
and relatively static are intended for use by a large number of people. These engines do not alter how a phrase is translated until the next major update. Creating a generic baseline engine requires a considerable amount of effort in terms of cost, complexity, and data, which is why it is not done frequently. The diagram above displays the usual process involved in developing and producing a static engine.</span></span></div><div class="separator" style="clear: both; text-align: left;"><span style="color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px;"><br /></span></div><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; margin: 0px 0px 10px;"><span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;"><span style="color: #2b00fe; font-size: medium;">A key characteristic of these static engines is that they do not evolve quickly because they require large new data sources to drive improvements, data that is not readily available, and thus generic static engines are typically updated no more than once a year.</span></span></p><p 
style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;">On any given day, the major generic MT engine portals (Google, Microsoft, Baidu) allow hundreds of millions of people to translate material of interest. We have already reached the point where 99% or more of the translation done on the planet is done by computers, thanks to these generic engines.</p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;">In professional or business settings, the demands for using Machine Translation (MT) are quite particular. <b>Generally, generic MT engines need to be tweaked and fine-tuned to cater to company- or project-specific language usage and terminology. 
</b>This process of adjusting the MT engine to suit corporate requirements is known as customization or adaptation.</p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;">For example, if we consider the needs of IKEA, Pfizer, Airbnb, and Samsung, it is clear that they all have very different needs in terms of subject domain focus, style, and critical terminology and would be better served by enterprise-optimized MT than by a generic, one-size-fits-all MT solution.</p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; margin: 0px 0px 10px;"><span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; 
--tw-translate-y: 0; box-sizing: border-box; font-weight: 700;"><span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; background-color: white; box-sizing: border-box;"><span style="color: #2b00fe; font-size: large;">Customization or adaptation of MT models to the correct terminology is necessary for successful outcomes with MT use in most enterprise or professional use settings.</span></span></span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;">The typical MT customization process using static engines is described below. The customization effort and process is a scaled-down version of the generic engine development process. 
Typically, it requires the collection and incorporation of enterprise translation memory relevant to the use case into the generic model via a scaled-down "training process."</p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;">This effort results in limited or coarse optimization if sufficient training data resources are available. The optimization is considered coarse because the training data available to perform the optimization is typically minuscule compared to the base data used in the generic engine. 
<span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;">There is little value in training an engine with limited data as there would be no difference in performance from the generic baseline.</span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;"><span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;"><span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; 
--tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; background-color: white; box-sizing: border-box;"></span></span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;"><span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;">Thus, many attempts to use MT in professional settings face data scarcity problems. 
Limited data availability limits and reduces the potential impact of adaptation.</span> To further complicate matters, it is usually necessary to build separate engines for each different use case, e.g., customer support, marketing, and legal would all be optimized separately.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhNTP54_fVtTjyOV7tVcZzs448Kvz9mkoGeQxL7WCZKnotLTJeXDB7LHPtZhCmMo-K7Zx7sI7gt9u7XIP9sIA78ZQBBIXwjJ537uz-xSGCfFIAKQuj5kcjQizPSAusyd5l8Hz6Z9YnD__7umRe0yi3B1dwEeyMql8tbp35WBzal9yVNQf-XrVDfkKUP8w/s3283/Understanding-Adaptive-Machine-Translation_2-2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1224" data-original-width="3283" height="149" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhNTP54_fVtTjyOV7tVcZzs448Kvz9mkoGeQxL7WCZKnotLTJeXDB7LHPtZhCmMo-K7Zx7sI7gt9u7XIP9sIA78ZQBBIXwjJ537uz-xSGCfFIAKQuj5kcjQizPSAusyd5l8Hz6Z9YnD__7umRe0yi3B1dwEeyMql8tbp35WBzal9yVNQf-XrVDfkKUP8w/w400-h149/Understanding-Adaptive-Machine-Translation_2-2.png" width="400" /></a></div><div><span face=""Helvetica Neue", Helvetica, Arial, sans-serif" style="background-color: white; color: #333333; font-size: 14px;">Since many global enterprises have multiple product lines and businesses that cross multiple domains (TVs, semiconductors, PCs, home appliances) this will often result in a large number of MT engines needed to cover global business needs. 
As a result, </span><span face=""Helvetica Neue", Helvetica, Arial, sans-serif" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; background-color: white; box-sizing: border-box; color: #333333; font-size: 14px; font-weight: 700;">it is often necessary to manage and maintain many MT engines. </span><span face=""Helvetica Neue", Helvetica, Arial, sans-serif" style="background-color: white; color: #333333; font-size: 14px;">This management burden is often not understood at the outset when localization teams embark on their MT journey. This complexity also creates a lot of room for error and misalignment as data alignment can easily get out of sync over time.</span></div><div><span face="Helvetica Neue, Helvetica, Arial, sans-serif" style="color: #333333;"><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhXe54SD747PM-JzA-3KXBh1TCTFTHBv-IFMHERd7hzHqTQDMFjbi2NHB_EGxOTKegWRY0WIdzKpeeya05XAdy_rrX1ckny3aTDa9Euk1Gao64lvU8OcJktsrsR7kSHQCvMS0Fc_BzepPW03W5nP9wcgjZFo5LVUuQ-6ItVI1-A4zwcg3rAZDADFyzTOw/s2729/Understanding-Adaptive-Machine-Translation_3.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1482" data-original-width="2729" height="217" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhXe54SD747PM-JzA-3KXBh1TCTFTHBv-IFMHERd7hzHqTQDMFjbi2NHB_EGxOTKegWRY0WIdzKpeeya05XAdy_rrX1ckny3aTDa9Euk1Gao64lvU8OcJktsrsR7kSHQCvMS0Fc_BzepPW03W5nP9wcgjZFo5LVUuQ-6ItVI1-A4zwcg3rAZDADFyzTOw/w400-h217/Understanding-Adaptive-Machine-Translation_3.png" width="400" 
/></a></div></span><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;">Over time, many enterprise MT initiatives can be characterized by several problems that are common to users of these static MT systems. These problems are summarized below in order of frequency and importance:</p><ol style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin-bottom: 10px; margin-top: 0px;"><li style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box;"><span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); 
--tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;">Ongoing scarcity of training data: </span>Static models require a lot of data to drive improvements.<span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;"> There is little value in retraining a model until new or corrective data volumes reach critical levels.</span></li><li style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box;"><span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; 
--tw-translate-y: 0; box-sizing: border-box; font-weight: 700;">Tedious MTPE experience: </span>Post-editors must repeatedly correct the same errors because these MT engines do not regularly improve, often leading to worker dissatisfaction.</li><li style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box;"><span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;">MT model management overhead and complexity: </span>There are too many models to manage and maintain, which can lead to misalignment errors.</li><li style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box;"><span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; 
--tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;">Communication issues: </span>These typically arise between the MT development team and the localization team members and translators, who have very different views of the overall process.</li><li style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box;"><span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;">Context insensitivity: </span>Sentence- and document-level context is typically missing from these custom models.</li></ol><div><span face="Helvetica Neue, Helvetica, Arial, sans-serif" style="color: #333333;"><div class="separator" style="clear: both; text-align: center;"><a 
href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhatQzWhe2t3fnwlhAkVZ5CZfCsJKd8wzeu1iIVvbbs_SE5U5h0-iO1AuGXdgBTSuwibAW3jNm8rabvSSwlX0I-nSdY5rufhdkw1RFfV9g84lBEzlGNbp-vS2Kta4e8d8EzfIo7heYngKwP5vjfcZl6jdFaCYG51f-bu4TvfQV2l8EIqIM8YECB98kQvA/s2985/Understanding-Adaptive-Machine-Translation_4-1.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1380" data-original-width="2985" height="185" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhatQzWhe2t3fnwlhAkVZ5CZfCsJKd8wzeu1iIVvbbs_SE5U5h0-iO1AuGXdgBTSuwibAW3jNm8rabvSSwlX0I-nSdY5rufhdkw1RFfV9g84lBEzlGNbp-vS2Kta4e8d8EzfIo7heYngKwP5vjfcZl6jdFaCYG51f-bu4TvfQV2l8EIqIM8YECB98kQvA/w400-h185/Understanding-Adaptive-Machine-Translation_4-1.png" width="400" /></a></div><br /></span><span style="color: #2b00fe; font-family: Helvetica Neue, Helvetica, Arial, sans-serif; font-size: medium;"><span style="background-color: white;"><i>From a technical standpoint, static MT systems often have a significant disparity between the encoding (training) and decoding (inference) stages of model deployment, resulting in a notable disconnect. 
<b>Adaptive MT, on the other hand, aims to bridge the gap between the two phases of model creation (training) and deployment (inference), thus providing more effective support to expert users such as translators and linguists.</b></i></span></span></div><div><em style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; background-color: white; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px;"><span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;"><br /></span></em></div><h2 style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 25px; line-height: 1.1; margin-bottom: 10px; margin-top: 20px;">The 
Adaptive MT Experience</h2><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;">The <span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;">static MT approach makes sense for large ad-supported portals</span> where the majority (99%+) of the millions of users will use the MT systems without attempting modification or customization.</p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;">In contrast,<span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 
246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;"> the adaptive MT approach makes more sense for those enterprise and professional translators who almost always attempt to modify the behavior of the generic model to meet the specific and unique needs of a business use case.</span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;"><span style="color: #333333;">ModernMT is an adaptive MT technology solution designed from the ground up to enable and encourage immediate and continuous adaptation to changing business needs.</span><span style="color: #2b00fe;"> <span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;">It is designed to support and enhance the professional translator's work 
process and increase translation leverage and productivity. </span></span><span style="color: #333333;">This is the fundamental difference between an adaptive MT solution like ModernMT and static generic MT systems.</span></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVjEMbCK827vCJkRvkhlfomgkN2R_IPtHP_LaMdIcln2rYVxAU6Afasng-5AboBT9aOP444TtLiBlt7Cn6i6KI_YxMh1nYbHUoJlkuNq0v1oj8wZESMINnH9ghynE3SBjkzdSTmXwOKsYY-gnb28vuLwe7wy-HuGUBsaXdIATSGNQlQXxmZQL3_aKYeQ/s3414/Preferred-Updated-Translation_5.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1469" data-original-width="3414" height="173" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVjEMbCK827vCJkRvkhlfomgkN2R_IPtHP_LaMdIcln2rYVxAU6Afasng-5AboBT9aOP444TtLiBlt7Cn6i6KI_YxMh1nYbHUoJlkuNq0v1oj8wZESMINnH9ghynE3SBjkzdSTmXwOKsYY-gnb28vuLwe7wy-HuGUBsaXdIATSGNQlQXxmZQL3_aKYeQ/w400-h173/Preferred-Updated-Translation_5.png" width="400" /></a></div><br /><h2 data-mce-style="text-align: center;" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 25px; font-weight: 500; line-height: 1.1; margin-bottom: 10px; margin-top: 20px; text-align: center;"><span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; 
--tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;"><span style="color: #2b00fe;">“Simplicity is the ultimate sophistication”</span></span></h2><p data-mce-style="text-align: center;" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px; text-align: center;"><span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;"><span style="color: #2b00fe;">Leonardo da Vinci</span></span></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiY5EHlnxbuNfxLbuUkWROkNwCX9CxERxftog3kRDOGFTwxFFvhhsG6cgUa-uHO5IsurmFZQDCa-ZoeeRi_xX1koOtaRGdy_UwGvZTAmInPA3-WH47nMB7nJ0XBDue2eTYv0QfWuguNVc94XGPHrnEBIjQgAvjqe99hnKCpSgJZL8mVWNoNJCDN_IxI8Q/s2662/Understanding-Adaptive-Machine-Translation_6.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1479" data-original-width="2662" 
height="223" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiY5EHlnxbuNfxLbuUkWROkNwCX9CxERxftog3kRDOGFTwxFFvhhsG6cgUa-uHO5IsurmFZQDCa-ZoeeRi_xX1koOtaRGdy_UwGvZTAmInPA3-WH47nMB7nJ0XBDue2eTYv0QfWuguNVc94XGPHrnEBIjQgAvjqe99hnKCpSgJZL8mVWNoNJCDN_IxI8Q/w400-h223/Understanding-Adaptive-Machine-Translation_6.png" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;">While the ModernMT adaptive MT engine also has a basic generic engine underlying its capabilities, <span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;">it is designed to work instantly with any available translation memory resources and to learn instantly from corrective linguistic feedback.</span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; 
--tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;">This is done without any user intervention or action to "train" the system. The user simply points to any available TM and <b>it is used if it is relevant to the translation task at hand.</b> Thus, while many struggle to use MT in an environment where use case requirements are constantly changing, <span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;">this adaptive MT system draws on translation memories, corrective feedback, and context gathered from both the memories and the document as a whole.</span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; 
href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmTpc-YebJqGXhLAAJOUE8owftnWGUUIE-eAcMkR8ko7MHWpmUZJkgLwh3IFVi0WttIT3MPtW6qv781St6Yf_5a4auRn9qUzYkhxqcr61qhlwv16Ho73bNJBGnr2oAof88AXs4emvuoHoHWp2gaEtTu1E-oIFxCII67cPQXwq1xeq1oCuPpw1-oHkfqQ/s3355/Understanding-Adaptive-Machine-Translation_7.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1520" data-original-width="3355" height="181" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmTpc-YebJqGXhLAAJOUE8owftnWGUUIE-eAcMkR8ko7MHWpmUZJkgLwh3IFVi0WttIT3MPtW6qv781St6Yf_5a4auRn9qUzYkhxqcr61qhlwv16Ho73bNJBGnr2oAof88AXs4emvuoHoHWp2gaEtTu1E-oIFxCII67cPQXwq1xeq1oCuPpw1-oHkfqQ/w400-h181/Understanding-Adaptive-Machine-Translation_7.png" width="400" /></a></div><br /><p></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;"><span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;">As the use of MT grows in the enterprise, the benefits of an adaptive MT infrastructure continue to accrue, as the management and 
maintenance of the many production MT systems require nothing more than the organization of TM assets and the provision of ongoing corrective feedback to drive continuous improvements in system performance.</span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;">Thus, content creators and linguistically informed users can be the primary drivers of the ongoing system evolution. Because the underlying continuous improvement process is always active in the background, there is no need for any technology process management by these users.<span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;"> Translation issues that may arise in widespread use can be quickly identified and corrected by linguists without the need for support from MT technology experts.</span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; 
--tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;"><span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;"></span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;"><span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;">New large-scale use cases can be rapidly rolled out by 
focusing human translation effort on the most relevant and most frequently occurring content. Adaptive MT technology enables an evolutionary approach that ensures continuous improvement.</span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;"></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBGtwyf3uyCiu1_kmUCWDqMCDJ3vmFvMdwWSHHuW-gmABHqnlygPrTprUbR5UrZyFNo_fOffBf94dR4r98UPq4de3LUvuXcSG2UR5x3DYjiZ4-GjJBx9z7D7PkvO4rMxRKz3uNJLlt39B626NhqhdhgUejbdtMHLZRnz8NjwLWi45GFkOaJMqqm4s4Bw/s3025/Understanding-Adaptive-Machine-Translation_8.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1570" data-original-width="3025" height="208" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBGtwyf3uyCiu1_kmUCWDqMCDJ3vmFvMdwWSHHuW-gmABHqnlygPrTprUbR5UrZyFNo_fOffBf94dR4r98UPq4de3LUvuXcSG2UR5x3DYjiZ4-GjJBx9z7D7PkvO4rMxRKz3uNJLlt39B626NhqhdhgUejbdtMHLZRnz8NjwLWi45GFkOaJMqqm4s4Bw/w400-h208/Understanding-Adaptive-Machine-Translation_8.png" width="400" /></a></div><br /><p></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; 
--tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;">Independent market research points to some key factors that are often overlooked by those attempting to deploy MT in professional and enterprise environments. <span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;">Surveys conducted by Common Sense Advisory and Nimdzi show that most LSPs/Enterprises struggle to deploy MT in production </span>for three key reasons:</p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;"><span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 
#0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;"></span></p><ol style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin-bottom: 10px; margin-top: 0px;"><li style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box;"><span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;">Inability to produce MT output at the required quality levels</span>. 
This is most often due to a lack of the training data needed for meaningful improvement.</li><li style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box;"><span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;">Inability to properly estimate the effort and cost of deploying MT</span> in production.</li><li style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box;"><span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 
0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;">The ever-changing needs and requirements of different projects </span>with static MT that cannot adapt easily to new requirements create a mismatch of skills, data, and competencies.</li></ol><div><span face="Helvetica Neue, Helvetica, Arial, sans-serif" style="color: #333333;"><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWefwGT-yTWnC8TPGtSRUME_XaQFnKucGGAiQk9xeU93plAM2UlKEk-VjQMBGQirCamMl5PCVoeY3L2wTTz5SeSQOvt7rUhGBfCMmZpf1RTBH4wTpYvp2sy95CSYTGMZabf6kf_vqP3_9mh_skgTPV8VIb0eNuPrPR7gxp-jhbM1eAIx4UrFiIuyJlUA/s2898/Understanding-Adaptive-Machine-Translation_9.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1551" data-original-width="2898" height="214" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWefwGT-yTWnC8TPGtSRUME_XaQFnKucGGAiQk9xeU93plAM2UlKEk-VjQMBGQirCamMl5PCVoeY3L2wTTz5SeSQOvt7rUhGBfCMmZpf1RTBH4wTpYvp2sy95CSYTGMZabf6kf_vqP3_9mh_skgTPV8VIb0eNuPrPR7gxp-jhbM1eAIx4UrFiIuyJlUA/w400-h214/Understanding-Adaptive-Machine-Translation_9.png" width="400" /></a></div></span><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;">Given these difficulties, it is worth considering the key requirements for a production-ready MT system. 
<span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;">Why do so many still fail with MT?</span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;"><span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;">One reason for failure is that many LSPs and localization managers have used automated metrics to select the "best" MT system for their production needs without having any understanding of how MT engines improve and evolve. 
</span>Automated MT quality metrics such as BLEU, Edit Distance, hLEPOR, and COMET typically serve as the sole basis for this selection.</p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;">The scores can be helpful for MT system developers in enhancing and refining their systems. However, globalization managers who rely solely on this method to choose the "best" system may overlook significant limitations and fail to select the most suitable MT system.</p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;"><span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; 
--tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;"><span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box;">Ideally, the "best" MT system would be determined by a team of competent translators who would run</span> <span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box;">directly relevant content through the MT system after establishing a structured and repeatable evaluation process. 
</span></span>This is slow, expensive, and difficult, even if only a small sample of 250 sentences is evaluated.</p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;">Thus, automated measurements (metrics) that attempt to score translation adequacy, fluency, precision, and recall must often be used. They attempt to do what is best done by competent bilingual humans.<span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;"><span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box;"> These scoring methodologies are always an approximation of what a competent human assessment would determine, and can 
</span>often<span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box;"> be incorrect or misleading</span>, especially with static Test Sets.</span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;">This approach of ranking different MT systems by scores based on opaque and possibly irrelevant reference test sets has several problems. 
These problems include:</p><ul style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin-bottom: 10px; margin-top: 0px;"><li style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box;"><span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;">These scores do not represent production performance.</span></li><li style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 
0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box;">These scores are <span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;">typically obtained on static MT systems and do not capture a system's ability to improve.</span></li><li style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box;"><span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;">The results are an OLD snapshot of a constantly changing scene</span>. 
If you change the angle or focus, the results would change.</li><li style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box;"><span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;">Small differences in scores are often meaningless</span>, and most users would be hard-pressed to explain what these small numerical differences might mean.</li><li style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box;">The score is an approximate measure of system performance at a historical point in time and is generally not a reliable predictor of future performance.</li><li style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; 
--tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box;"><span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;">These scores are unable to capture the dynamic evolution typical of an adaptive MT system.</span></li><li style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box;">Generic, static systems often score higher on these rankings initially but this does not reflect that they are much more difficult to tune and adapt to unique, company-specific requirements.</li></ul><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; 
--tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;"><b>When choosing MT systems for production use, relying solely on score-based rankings can lead to suboptimal or even incorrect choices. </b>This approach is often used because NMT system performance is difficult to understand and can be shrouded in mystery. However, automated metrics are not always reliable and should not be the only factor considered in making purchase decisions. Using scores to justify choices may be a "lazy buyer" strategy that fails to fully account for the complexity involved in selecting the best MT system for a given purpose.</p></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIrpBtvxEr0tD8OHBhhpHeX9v6qH4-Ql4SEl73ljJZLSJhZlDuo7hssxXS0dV4sTWvFlgPTbV6_IiQ8yVEg6jC1g1h_myaOcJxsR2Zxv16DBJtNh4gYc0KJ1N6GKKM7aXpy7SFS72HyG94zmw_DShh2yHQlj3hELAOtVb6VbYdziw-L0ZUbFbHwJy1Ag/s3492/Understanding-Adaptive-Machine-Translation_10.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1347" data-original-width="3492" height="154" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIrpBtvxEr0tD8OHBhhpHeX9v6qH4-Ql4SEl73ljJZLSJhZlDuo7hssxXS0dV4sTWvFlgPTbV6_IiQ8yVEg6jC1g1h_myaOcJxsR2Zxv16DBJtNh4gYc0KJ1N6GKKM7aXpy7SFS72HyG94zmw_DShh2yHQlj3hELAOtVb6VbYdziw-L0ZUbFbHwJy1Ag/w400-h154/Understanding-Adaptive-Machine-Translation_10.png" width="400" /></a></div><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; 
--tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;">But the failure of so many LSPs with MT technology suggests that this approach may not be the best way forward to achieve production-ready and production-grade MT technology. <span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;">So what criteria are more relevant in the context of identifying production-grade MT technology?</span> The following criteria are much more likely to lead to technology choices that make long-term sense. 
For example:</p><ul style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin-bottom: 10px; margin-top: 0px;"><li style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box;">The <span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;">speed with which an MT system can be tuned and adapted</span> to unique corporate content. 
Systems that require complex training efforts by technology specialists will slow the globalization team’s responsiveness.</li><li style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box;">The <span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;">ease with which the system can be adapted</span> to unique corporate needs. The need to have expensive consulting resources or dedicated MT technology staff on hand and ready to go greatly reduces the agility and responsiveness of the globalization team.</li><li style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box;">An <span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; 
--tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;">automated and robust MT model improvement process </span>as corrective feedback and improved data resources are brought to bear.</li><li style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box;">The <span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;">complexity of MT system management</span> increases exponentially when multiple vendors are used as they may have different maintenance and optimization procedures. 
<span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;">This suggests that it is better to focus on one or two partners and build expertise through deep engagement.</span></li><li style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box;">The ability of a system to <span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;">enable work to start even if little or no data is available.</span></li><li style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; 
--tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box;">A <span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;">straightforward process to correct any problematic or egregious translation errors</span>. Many large static systems need large volumes of correction data to override such errors.</li><li style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box;">The <span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;">availability of expert resources to manage specialized enterprise use cases</span> and trained human resources (linguists) to help prime and prepare MT systems for large-scale 
deployment.</li></ul><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;">It is now common knowledge that machine learning-based AI systems are only as good as the data they use. <span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;">One of the keys to long-term success with MT is to build a virtuous data collection system that refines MT performance and ensures continuous improvement.</span> The existence of such a system would encourage more widespread adoption and enable the enterprise to become multilingual at scale. 
This would allow the enterprise to break down language as a barrier to global business success.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiOnlF3lWWnYxRD6d4jrbW1X4Z22Bihbt6KJUqt5H36RVH_pfA9OQGjDz4s22M6la1yI0WkpbGLuXoMVK8Q1C0TiD_2mLnp678nDoqwp1D_g-wq76_rhWKP3OfRvcVArioeS0DaWN0NnL9eRIpKr61PrQCqowV-jMgoAtzBJskwk4V4x5vDQiSXeo7nKQ/s2910/Understanding-Adaptive-Machine-Translation_11.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1337" data-original-width="2910" height="147" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiOnlF3lWWnYxRD6d4jrbW1X4Z22Bihbt6KJUqt5H36RVH_pfA9OQGjDz4s22M6la1yI0WkpbGLuXoMVK8Q1C0TiD_2mLnp678nDoqwp1D_g-wq76_rhWKP3OfRvcVArioeS0DaWN0NnL9eRIpKr61PrQCqowV-jMgoAtzBJskwk4V4x5vDQiSXeo7nKQ/s320/Understanding-Adaptive-Machine-Translation_11.png" width="320" /></a></div><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;">One should not assume that all adaptive MT systems follow the same technological approach. Upon closer examination, it is apparent that real-time, in-context adaptation can be achieved through several different architectural strategies. 
However, as more buyers come to realize that the responsiveness of the MT system holds greater importance than a static COMET score on a random test set, evaluation strategies will evolve. It will be more beneficial to assess which systems can adapt most readily with minimal effort.</p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;"><span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-weight: 700;">The ModernMT approach to adaptation is to bring the encoding and decoding phases of model deployment much closer together, allowing dynamic and active human-in-the-loop corrective feedback that is not so different from the in-context corrections and prompt modifications we are seeing with large language models.</span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; 
--tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;">It is possible that in the future, as Large Language Models (LLMs) become more cost-effective, scalable, secure, and controllable, they could be used to enhance SOTA adaptive MT models. This could improve both core translation quality and output fluency, either as stand-alone solutions or, more likely, as hybrid systems working alongside purpose-built MT models yet to come.</p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;">Although LLMs have demonstrated their effectiveness in certain high-resource languages, their performance in lower-resource languages is notably poor according to initial evaluations. This outcome is not surprising, given that LLMs are not specifically tailored for translation tasks. The challenge lies in the fact that LLMs rely on extensive data caches for each language, and the significant data volumes required for improvement are often difficult to locate. 
Therefore, resolving this issue will not be a quick or simple process.</p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;">In contrast, ModernMT just <a data-mce-href="https://translated.com/adaptive-machine-translation-200-languages?ref=blog.modernmt.com" href="https://translated.com/adaptive-machine-translation-200-languages?ref=blog.modernmt.com" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #337ab7; text-decoration-line: none;">announced support for 200 languages</a> that can all immediately benefit from the continuous improvement infrastructure that underlies the technology, and begin the steady quality improvement process that is described in this article.</p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; 
--tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 0px 0px 10px;">It has become evident that real-time systems that can enhance performance and swiftly respond to informed feedback from experts are highly favored to tackle the task of large-scale automated language translation.</p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; box-sizing: border-box; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; margin: 0px 0px 10px;"><span style="color: #2b00fe; font-size: large;"><span style="background-color: white;"><b>Once customers have experienced the advantages of dynamic adaptive systems, they are unlikely to revert to the complexities, inconveniences, and slowly improving output quality of batch-trained static MT systems.</b></span></span></p><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: left;"><br /></div></div>Kirti Vasheehttp://www.blogger.com/profile/16795076802721564830noreply@blogger.com0tag:blogger.com,1999:blog-6748877443699290050.post-79598938334398723762023-05-21T15:23:00.014-07:002023-05-26T15:59:39.950-07:00The Limits of Language AI <p><i><span style="color: #cc0000; font-family: trebuchet;">This is an article that I originally wrote about 18 months ago that was published in the <a href="https://imminent.translated.com/annual-report-2022">Imminent 2022 Annual Report,</a> long before ChatGPT was 
announced, and was revamped and updated to be published in the <a href="https://multilingual.com/issues/april-2023/the-limits-of-ai-with-language/">Multilingual Issue of April 2023</a>. </span></i></p><p><i><span style="color: #cc0000; font-family: trebuchet;">None of the main arguments have changed with the introduction of ChatGPT. All the structural problems identified with LLMs years ago are still present and have not been alleviated with the introduction of either ChatGPT or GPT-4. </span></i></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj4bB9u5hu_hRjPU7kZa6f4YzpLF8YJAiuN1UcDyIMEuckYyJDJ-ld7IPNpUEF5Ze3Ht1eWXZQrB0yTm9Gzuyqxaje4U4utDs1aTX1KSUuewVkQ9197uXfghAypDMKdeMEgsT7M5gsmSx9N8WJMvfgpNJPqF8ou1wn9ykSFU-zPcNFsbJcDsi9wru1Dcg/s1536/214_Vashe_Cover-1536x864.jpg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="864" data-original-width="1536" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj4bB9u5hu_hRjPU7kZa6f4YzpLF8YJAiuN1UcDyIMEuckYyJDJ-ld7IPNpUEF5Ze3Ht1eWXZQrB0yTm9Gzuyqxaje4U4utDs1aTX1KSUuewVkQ9197uXfghAypDMKdeMEgsT7M5gsmSx9N8WJMvfgpNJPqF8ou1wn9ykSFU-zPcNFsbJcDsi9wru1Dcg/s320/214_Vashe_Cover-1536x864.jpg" width="320" /></a></div><br /><p><span style="font-family: inherit; font-size: 16px;">Large language models (LLMs) are all the rage nowadays and it is almost impossible to get away from the news frenzy around ChatGPT, BingGPT, and Bard. There is much talk about reaching artificial general intelligence, (AGI) but should we be worried that the machine will shortly take over all kinds of knowledge work, including translation? 
Is language really something that machines can master?</span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">Machine learning applications around natural language data have been in full swing for over five years now. In 2022, natural language processing (NLP) research announced breakthroughs in multiple areas, but especially around improving neural machine translation (NMT) systems and natural language generation (NLG) systems like the Generative Pre-trained Transformer 3 (GPT-3) and ChatGPT, a chat-enabled variation of GPT-3 which can produce human-like, if inconsistent, digital text. It predicts the next word given a text history, and often the generated text is relevant and useful. 
This is because it has trained on billions of sentences and has the ability to often glean the most relevant material related to the prompt from the data it has seen.</span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">GPT-3 and other LLMs can generate algorithm-written text often nearly indistinguishable from human-written sentences, paragraphs, articles, short stories, and more. They can even generate software code that draws on troves of previously seen code examples. 
This suggests that these systems could be helpful in many text-heavy business applications and possibly enhance enterprise-to-customer interactions involving textual information in various forms.</span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;"><br /></span></p><p class="p1" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: 
inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">The original hype and excitement around GPT-3 have triggered multiple similar initiatives across the world, and we see today that the massive scale of GPT-3, built with 175 billion parameters, has already been overshadowed by several other models that are even larger — Gopher from DeepMind has been built with 280 billion parameters and claims better performance in most benchmark tests used to evaluate the capabilities of these models. </span></p><p class="p1" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><b><span style="font-family: inherit;">More recently, <a href="https://multilingual.com/tag/chatgpt" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; 
-webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; color: #2ea3f2; letter-spacing: inherit; line-height: 1em; margin: 0px; outline: 0px; padding: 0px 0px 10px; text-align: inherit; text-decoration-line: none; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;">ChatGPT has taken the world by storm</a> and, amid all the hype, many knowledge workers are fearful of displacement, even though we see the same problems with all LLMs: a lack of common sense, the absence of understanding, and the constant danger of misinformation and hallucinations. <span style="color: red;">Not to mention the complete disregard for data privacy and copyright in their creation.</span> </span></b></p><div style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-align: left; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;"><b><span style="font-size: 16px;">GPT-4 parameter and training data overviews have been kept secret by OpenAI, which has now decided to cash in on the LLM gold rush, but many estimate parameters to be in the trillion-plus range. 
</span></b></span></div><div style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-align: left; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;"><b><span style="font-size: 16px;">It is worth noting that OpenAI's originally stated ethos has faded now, and one wonders if it was ever taken seriously. Their original mission statement was:</span></b> </span></div><blockquote><span style="color: #2b00fe; font-family: inherit;"><b>"Our goal is to advance digital intelligence in the
way that is most likely to benefit humanity as a whole, unconstrained by
a need to generate a financial return. Since our research is free from
financial obligations, we can better focus on a positive human impact." </b></span></blockquote><p></p><p style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-align: left; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit; font-size: medium;"><b>In an age where we face bullshit everywhere we turn, why should this be an exception?</b></span></p><div style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-align: left; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit; font-size: 16px;">The hype around some of these “breakthrough” capabilities inevitably raises questions about the increasing role of language AI capabilities in a growing range of knowledge work. Are we likely to see an increased presence of machines in human language-related work? 
Is there a possibility that machines can replace humans in a growing range of language-related work?</span></div><p class="p1" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">A current trend in LLM development is to design ever-larger models in an attempt to reach new heights, but no company has rigorously analyzed which variables affect the power of these models. These models can often produce amazing output but also have a fairly high level of completely wrong factual “hallucinations” where the machine simply pulls together random text elements in so confident a manner as to often seem valid to an untrained eye. But many critics are saying that larger models are unlikely to solve the problems that have been identified — namely, the textual fabrication of false facts, the absence of comprehension, and common sense.</span></p><blockquote style="font-size: x-large;"><b><span style="color: #2b00fe; font-family: inherit;">"ChatGPT “wrote” grammatically flawless but flaccid copy. It served up enough bogus search results to undermine my faith in those that seemed sound at first glance. 
It regurgitated bargain-bin speculations about the future of artificial intelligence. "</span></b></blockquote><blockquote><b><span style="color: #2b00fe; font-family: inherit;"><span style="text-align: right;">Trey Popp, </span><a href="https://thepenngazette.com/alien-minds-immaculate-bullshit-outstanding-questions/">Alien Minds, Immaculate Bullshit, Outstanding Questions</a></span></b></blockquote><p></p><div class="-trigger"><span style="font-family: inherit;"><br /></span></div><p class="p1" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">The initial euphoria is giving way to an awareness of the problems that are also inherent in LLMs and an understanding that adding more data and more computing power cannot and will not solve the toxicity and bias problems that have been uncovered. Critics are saying that scale does not seem to help much when it comes to “understanding,” and building GPT-4 with 100 trillion parameters, at a huge expense, may not help at all. 
The toxicity and bias that are inherent in these systems will not be easily overcome without strategies that involve more than simply adding more data and applying more computing cycles. However, what these strategies might be is not yet clear, though many say this will require looking beyond machine learning.</span></p><p class="p1" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">GPT-3 and other LLMs can be fooled into creating incorrect, racist, sexist, and biased content devoid of common sense. 
The model’s output depends on its input: garbage in, garbage out.</span></p><p class="p1" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;"><b><span style="color: #2b00fe; font-size: large;">"Just Calm Down About GPT-4 Already and stop confusing performance with competence. 
</span></b><b><span style="color: #2b00fe; font-size: large;">What the large language models are good at is saying what an answer should <em>sound like</em>, which is different from what an answer should <em>be</em>."</span></b></span></p><p class="p1" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-align: right; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><b><span style="color: #2b00fe; font-family: inherit; font-size: medium;"><a href="https://spectrum-ieee-org.cdn.ampproject.org/c/s/spectrum.ieee.org/amp/gpt-4-calm-down-2660261157"> Rodney Brooks</a><span> </span></span></b></p><p class="p1" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; 
background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">Techniques like reinforcement learning from human feedback (RLHF) can help build guardrails against the most egregious errors, but they also narrow the scope of possible right answers. Many say this technique cannot solve all the problems that can emerge from algorithmically produced text, as there are too many unpredictable scenarios.</span></p><p class="p1" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">If you dig deeper, you discover that although its output is grammatical, and even impressively idiomatic, its comprehension of the world is often seriously off. You can never really trust what it says. 
Unreliable AI, deployed pervasively, could create societal problems on a grand scale.</span></p><p class="p1" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">Despite the occasional or even frequent ability to produce human-like outputs, ML algorithms are at their core only complex mathematical functions that map observations to outcomes. They can forecast patterns that they have previously seen and explicitly learned from. 
Therefore, they are only as good as the data they were trained on, and they begin to break down as real-world data deviates from the examples seen during training.</span></p><p class="p1" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">In December 2021, an incident with Amazon Alexa exposed the problem that language AI products have. Alexa told a child to essentially electrocute herself (touch a live electrical plug with a penny) as part of a challenge game. 
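The point that a learned model is only as good as its training data, and degrades once inputs drift outside that data, can be illustrated with a small sketch. The data and model here are hypothetical (a straight line fitted to a quadratic "world"), chosen only to show how predictions hold up inside the training range and collapse outside it:

```python
# Minimal sketch with hypothetical data: a model that fits its training
# region well can still be wildly wrong off-distribution.

def fit_line(xs, ys):
    """Ordinary least squares for y ~ a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my - b * mx, b

world = lambda x: x * x            # the "real" relationship being learned
xs = [i / 10 for i in range(11)]   # training inputs only cover [0, 1]
a, b = fit_line(xs, [world(x) for x in xs])

predict = lambda x: a + b * x
err_in = abs(predict(0.5) - world(0.5))    # inside the training range
err_out = abs(predict(10.0) - world(10.0)) # far outside it

print(err_in < 0.2, err_out > 50)  # → True True
```

Inside the training range the fit looks convincing; far outside it, the same model is badly wrong. That gap between in-distribution and out-of-distribution behavior is exactly what the paragraph above describes.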
This incident — and many others with LLMs — show that these algorithms lack comprehension and common sense, and can make nonsensical suggestions that could be dangerous or even life-threatening.</span></p><p class="p1" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">“No current AI is remotely close to understanding the everyday physical or psychological world, what we have now is an approximation to intelligence, not the real thing, and as such it will never really be trustworthy,” said Gary Marcus in response.</span></p><p class="p1" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; 
background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">Large pre-trained statistical models can do almost anything, at least enough for a proof of concept, but there is little they can do reliably because they skirt the required foundations.</span></p><p class="p1" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">Thus we see an increasing acknowledgment from the AI community that language is indeed a hard problem — one that cannot necessarily be solved by using more data and algorithms alone, and other strategies will need to be employed. This does not mean that these systems cannot be useful. 
Indeed, they can be very useful, but they have to be used with care and human oversight, at least until machines have more robust comprehension and common sense.</span></p><p class="p1" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">We already see that machine translation (MT) today is ubiquitous, and by many estimates is responsible for 99.5% or more of all language translation done on the planet on any given day. But we also see that MT is used mostly to translate material that is voluminous, short-lived, and transitory, and that would never get translated if the machine were not available. 
Trillions of words are translated by MT daily, yet when it matters, there is always human oversight on translation tasks that may have a high impact, or where there is greater potential risk or liability from mistranslation.</span></p><p class="p1" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">While machine learning use cases continue to expand dramatically, there is also an increasing awareness that a human-in-the-loop is often necessary since the machine lacks comprehension, cognition, and common sense.</span></p><p class="p1" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; 
background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; margin: 0px; outline: 0px; padding: 0px; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><b><span style="color: #2b00fe; font-family: inherit;">As Rodney Brooks, the co-founder of iRobot, said in a post entitled <em style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; letter-spacing: inherit; margin: 0px; outline: 0px; padding: 0px; text-align: inherit; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;">An Inconvenient Truth About AI</em>: </span></b></p><p class="p1" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; 
background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; margin: 0px; outline: 0px; padding: 0px; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><b><span style="color: #2b00fe; font-family: inherit;"><br /></span></b></p><p class="p1" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; margin: 0px; outline: 0px; padding: 0px; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><b><span style="color: #2b00fe; font-family: inherit; font-size: large;">“Just about every successful deployment of AI has either one of two expedients: It has a person somewhere in the loop, or the cost of failure, should the system blunder, is very low.”</span></b></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; 
-webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;"><br /></span></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhTTKP2vjEuQ2mfaQqmYLp9kvps4J40jyodrpNhicdC-h016y2JJZS5KwtBo7shoPOHGjAPi_Lax14uaQkJu5dkrZc2QFUGoVu5hYVyeuih1UrxztuqCbcXQeSLk0qBqtVAUr_xahfxjgRutWe_nXegugZFIKF3wS3KBXwSByYg6u250BRrMKb4DCuZww/s1200/214_Vashee_Fig1.jpg" style="margin-left: auto; margin-right: auto;"><span style="font-family: inherit;"><img border="0" data-original-height="594" data-original-width="1200" height="198" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhTTKP2vjEuQ2mfaQqmYLp9kvps4J40jyodrpNhicdC-h016y2JJZS5KwtBo7shoPOHGjAPi_Lax14uaQkJu5dkrZc2QFUGoVu5hYVyeuih1UrxztuqCbcXQeSLk0qBqtVAUr_xahfxjgRutWe_nXegugZFIKF3wS3KBXwSByYg6u250BRrMKb4DCuZww/w400-h198/214_Vashee_Fig1.jpg" width="400" /></span></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><span style="font-family: inherit; font-size: 14px;">Fig. 1. 
Linguistic communication of thoughts by speaker and listener</span></td></tr></tbody></table><span style="font-family: inherit;"><br /><br /></span><div style="text-align: left;"><span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-family: inherit; font-size: large; font-weight: 700; margin: 0px; outline: 0px; padding: 0px; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;">What is it about human language that makes it a challenge for machine learning?</span></div><div style="text-align: left;"><span style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; 
font-family: inherit; font-size: large; font-weight: 700; margin: 0px; outline: 0px; padding: 0px; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><br /></span></div><div><p class="p1" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">Members from the singularity community summarized the problem quite neatly. 
They admit that “language is hard” when they explain why <a href="https://multilingual.com/issues/february-2023/the-great-gap-will-mt-ever-be-on-par-with-human-translators/" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; color: #2ea3f2; letter-spacing: inherit; line-height: 1em; margin: 0px; outline: 0px; padding: 0px 0px 10px; text-align: inherit; text-decoration-line: none; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;">AI has not mastered translation yet</a>. Machines perform best in solving problems that have binary outcomes. 
</span></p><p class="p1" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;"><a href="https://singularityhub.com/2018/03/04/why-hasnt-ai-mastered-language-translation/">Michael Housman, a faculty member of Singularity University,</a> explained that the ideal scenario for machine learning and artificial intelligence is something with fixed rules and a clear-cut measure of success or failure. He named chess as an obvious example, and noted that machines have also beaten the best human Go player. This happened faster than anyone anticipated because of the game’s very clear rules and limited set of moves. 
</span></p><div style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-align: left; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit; font-size: large;"><b><span style="color: #2b00fe;">Machine learning works best when there is one or a defined and limited set of correct answers</span>.</b></span></div><p class="p1" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">Housman elaborated, <b>“Language is almost the opposite of that. There aren’t as clearly-cut and defined rules. The conversation can go in an infinite number of different directions</b>. And then, of course, you need labeled data. 
You need to tell the machine to do it right or wrong.”</span></p><p class="p1" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><b><span style="font-family: inherit;">Housman noted that it’s inherently difficult to assign these informative labels. “Two translators won’t even agree on whether it was translated properly or not,” he said. 
“Language is kind of the wild west, in terms of data.”</span></b></p><p class="p1" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><b><span style="color: #2b00fe; font-family: inherit; font-size: large;">Another issue is that language is surrounded by layers of situational and life context, intent, emotion, and feeling. The machine simply cannot extract all these elements from the words contained in a sentence or even by looking at hundreds of millions of sentences. 
</span></b></p><p class="p1" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">The same sequence of words could have multiple different semantic implications. What lies between the words is what provides the more complete semantic perspective, and this is learning that machines cannot extract from a sentence. 
The proper training data to solve language simply does not exist and will likely never exist even though current models seem to have largely solved the syntax problem with increasing scale.</span></p><p class="p1" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><b><span style="font-family: inherit;">Concerning GPT-3/4 and other LLMs: The trouble is that you have no way of knowing in advance which formulations will or won’t give you the right answer. GPT’s fundamental flaws remain. 
Its performance is unreliable, causal understanding is shaky, and incoherence is a constant companion.</span></b></p><p class="p1" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">Hopefully, we are now beginning to understand that adding more data does not solve the overall problem, even though it appears to have largely solved the syntax issue. 
More data makes for a better, more fluent approximation to language; it does not make for trustworthy intelligence.</span></p><p class="p1" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">The claim that these systems are early representations of machine sentience or AGI is particularly problematic, and some critics are quite vocal in their criticism of these overreaching pronouncements and forecasts.</span></p><p class="p1" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; 
border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;"><a href="https://www.technologyreview.com/2020/08/22/1007539/gpt3-openai-language-generator-artificial-intelligence-ai-opinion/">Summers-Stay said this about GPT-3:</a> “[It’s] odd, because it doesn’t ‘care’ about getting the right answer to a question you put to it. It’s more like an improv actor who is totally dedicated to their craft, never breaks character, and has never left home but only read about the world in books. Like such an actor, when he doesn’t know something, he will just fake it. You wouldn’t trust an improv actor playing a doctor to give you medical advice.”</span></p><p class="p1" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><b><span style="font-family: inherit;">Ian P. McCarthy said, “A liar is someone who is interested in the truth, knows it, and deliberately misrepresents it. 
In contrast, a bullshitter has no concern for the truth and does not know or care what is true or is not.” Gary Marcus and Ernest Davis characterize GPT-3 and new variants as “fluent spouters of bullshit” that, even with all the data, are not reliable interpreters of the world.</span></b></p><p class="p1" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">For example, Alberto Romero says: “The truth is these systems aren’t masters of language. They’re nothing more than mindless ‘stochastic parrots.’ <b>They don’t understand a thing about what they say, and that makes them dangerous. 
They tend to ‘amplify biases and other issues in the training data’ and regurgitate what they’ve read before, but that doesn’t stop people from ascribing intentionality to their outputs.</b> GPT-3 should be recognized for what it is: a dumb — even if potent — language generator, and not as a machine so close to us in humanness as to call it ‘self-aware.’”</span></p><p class="p1" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">The most compelling explanation that I have seen on why language is hard for machine learning is by Walid Saba, founder of Ontologik.Ai. <a href="https://thegradient.pub/machine-learning-wont-solve-the-natural-language-understanding-challenge/">Saba points out that</a> Kenneth Church, a pioneer in the use of empirical methods in NLP i.e. 
using data-driven, corpus-based, statistical, and machine learning (ML) methods, was interested only in solving simple language tasks. The motivation was never to suggest that this technique could somehow unravel how language works; rather, he meant, “It is better to do something simple than nothing at all.” </span></p><p class="p1" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">However, subsequent generations mistook this empirical, data-driven approach, originally intended only to find practical solutions to simple tasks, for a paradigm that would scale into full natural language understanding (NLU).</span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: 
antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">This has led to widespread interest in the development of LLMs and what he calls <b>“a futile attempt at trying to approximate the infinite object we call natural language by trying to memorize massive amounts of data.” </b></span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">While he sees some value in data-driven ML approaches for some NLP tasks (summarization, topic extraction, search, clustering, NER), he sees this approach as irrelevant for natural language understanding (NLU), where the goal is to recover the one and only one thought that a speaker is trying to convey. 
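</span></p>

One rough way to see why "approximately correct" is a weak target when the goal is recovering exactly one intended thought: if each ambiguous choice in a sentence (a word sense, a referent, an attachment) is resolved with some high but imperfect accuracy, the chance of getting the whole reading exactly right decays geometrically. The 95% figure below is an arbitrary illustration, not a measured number:

```python
# Being approximately right at each ambiguous choice is not enough when
# "understanding" means recovering exactly the intended reading:
# small per-choice error rates compound across a sentence or document.

def exact_recovery_probability(per_choice_accuracy: float, n_choices: int) -> float:
    """Probability that every one of n independent ambiguity
    resolutions (word senses, referents, attachments) is correct."""
    return per_choice_accuracy ** n_choices

for n in (5, 20, 50):
    p = exact_recovery_probability(0.95, n)
    print(f"{n} choices at 95% each -> {p:.2f} chance the whole reading is right")
```

At 20 choices the probability of a fully correct reading has already fallen to roughly one in three, which is fine for tasks like topic extraction but fatal for "understanding."

<p style="font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em;"><span style="font-family: inherit;">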
Machine learning works on the specified NLP tasks above because they are consistent with the probably approximately correct (PAC) paradigm that underlies all machine learning approaches, but he insists that this is not the right approach for “understanding” and NLU.</span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">He explains that there are three reasons why NLU or "understanding" is so difficult for machine learning:</span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: 
initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span color="inherit" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-family: inherit; font-size: medium; font-weight: 700; letter-spacing: inherit; margin: 0px; outline: 0px; padding: 0px; text-align: inherit; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;">1. 
The missing text phenomenon (MTP) is believed to be at the heart of all challenges in NLU.</span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">In human communication, <b>an utterance by a speaker has to be decoded to get to the specific meaning intended, by the listener, for understanding to occur.</b> There is often a reliance on common background knowledge so that communication utterances do not have to spell out all the context. That is, for effective communication, we do not say what we can assume we all know! This genius optimization process that humans have developed over 200,000 years of evolution works quite well, precisely because we all know what we all know. 
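</span></p>

A standard Winograd-schema pair (a stock example from the NLU literature, not from this post) makes the missing-text point concrete: the fact that decides what "it" refers to is not in the sentence at all, but in unstated world knowledge:

```python
# The missing-text problem in miniature: two sentences that differ by a
# single word, where the correct referent of "it" flips, and where the
# deciding fact (big objects don't fit into small containers) appears
# nowhere in the text itself.

s1 = "The trophy did not fit in the suitcase because it was too big"
s2 = "The trophy did not fit in the suitcase because it was too small"

# Intended referents, supplied by human world knowledge:
referent = {s1: "trophy", s2: "suitcase"}

# The surface forms differ by exactly one word...
diff = set(s1.split()) ^ set(s2.split())
print("surface difference:", diff)  # {'big', 'small'}

# ...yet the meaning of "it" flips completely.
print("referent flips:", referent[s1], "->", referent[s2])
```

Nothing in the distributional statistics of these two nearly identical strings carries the container-size knowledge a reader silently applies; that knowledge is exactly the "missing text."

<p style="font-size: 16px; margin: 0px; outline: 0px; padding: 0px;"><span style="font-family: inherit;">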
</span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;"><br /></span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">But this is where the problem is in NLU: machines don’t know what we leave out because they don’t know what we all 
know. The net result? NLU is difficult because <b>a software program cannot understand the thoughts behind our linguistic utterances if it cannot somehow “uncover” all that stuff that humans leave out</b> and implicitly assume in their linguistic communication. What we say is a fraction of all that we might have thought of before we speak.</span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;"><br /></span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; 
border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;"><br /></span></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiej0kdVAVgueuLAAAOcpWm1DtHCOPWIMJzltIxkNQDT-lI1DdOrDvz27QDBaiBUsa0tL2SV0XiGp0ERxRy7OsZOwkMAYLPgGKbVLXL4SCmo798vdYkZxxGjUXcpMN1xk10VEOYx7vnxRZoofL2G_uP-__4LqR67d3YrR2dGnIWUhnsDRhOoxgENtAH1A/s1536/214_Vashee_Fig2-1-1536x1081.jpg" style="margin-left: auto; margin-right: auto;"><span style="font-family: inherit;"><img border="0" data-original-height="1081" data-original-width="1536" height="281" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiej0kdVAVgueuLAAAOcpWm1DtHCOPWIMJzltIxkNQDT-lI1DdOrDvz27QDBaiBUsa0tL2SV0XiGp0ERxRy7OsZOwkMAYLPgGKbVLXL4SCmo798vdYkZxxGjUXcpMN1xk10VEOYx7vnxRZoofL2G_uP-__4LqR67d3YrR2dGnIWUhnsDRhOoxgENtAH1A/w400-h281/214_Vashee_Fig2-1-1536x1081.jpg" width="400" /></span></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><span style="font-family: inherit; font-size: 14px;">Fig. 2: Deep learning to create $30T in market cap value by 2037? 
(Source: ARK Invest).</span></td></tr></tbody></table><span style="font-family: inherit;"><br /></span><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;"><br /></span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span color="inherit" style="--tw-border-spacing-x: 0; 
--tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-family: inherit; font-size: medium; font-weight: 700; letter-spacing: inherit; margin: 0px; outline: 0px; padding: 0px; text-align: inherit; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;">2. ML approaches are not relevant to NLU: ML is compression, and language understanding requires uncompressing.</span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: 
inherit;"><b><br style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; box-sizing: border-box; letter-spacing: inherit;" /></b>Our ordinary spoken language is highly (if not optimally) compressed. The challenge is in uncompressing (or uncovering) the missing text. Even in human communications, faulty uncompressing can lead to misunderstanding, and <b>machines do not have the visual, spatial, physical, societal, cultural, and historical context, all of which remain in the common understanding but unstated zone to enable understanding. This is also true to a lesser extent for written communication.</b></span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">What the above says 
is the following: machine learning is about discovering a generalization of lots of data into a single function. Natural language understanding, on the other hand, and due to MTP (missing text phenomena), requires intelligent “uncompressing” techniques that would uncover all the missing and implicitly assumed general knowledge text. Thus, he claims machine learning and language understanding are incompatible and contradictory. This is a problem that is not likely to be solved by 1000X more data and computing.</span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;"><br /></span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; 
background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span color="inherit" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-family: inherit; font-size: medium; font-weight: 700; letter-spacing: inherit; margin: 0px; outline: 0px; padding: 0px; text-align: inherit; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;">3. 
Statistical insignificance: ML is essentially a paradigm that is based on finding patterns (correlations) in the data.</span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">Thus, the hope in that paradigm is that there are statistically significant differences to capture the various phenomena in natural language. Using larger data sets assumes that ML will capture all the variations. However, renowned cognitive scientist <b>George Miller said: “To capture all syntactic and semantic variations that an NLU system would require, the number of features [data] a neural network might need is more than the number of atoms in the universe!” </b>The moral here is this: Statistics cannot capture (nor even approximate) semantics even though increasing scale appears to have success with learning syntax. 
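</span></p>

A deliberately simple sketch of the gap between surface statistics and meaning; sequence-aware models do of course use word order, so this only illustrates the weaker point that matching counts is not the same as representing semantics:

```python
from collections import Counter

# Two sentences with identical word statistics but different meanings:
# any model of the bag-of-words (unigram) statistics assigns them the
# same representation, so those statistics alone cannot carry the semantics.

a = "man bites dog"
b = "dog bites man"

print(Counter(a.split()) == Counter(b.split()))  # True: same word counts
print(a == b)                                    # False: different meanings
```

Richer statistics (n-grams, attention over sequences) push this failure further out rather than eliminating it, which is Miller's point about the number of features required.

<p style="font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em;"><span style="font-family: inherit;">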
This fluency is what we see as "confidence" in the output.</span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;"><a href="https://www.thoughtco.com/pragmatics-language-1691654">Pragmatics studies how context contributes to meaning.</a> Pragmatist George Herbert Mead argued that communication is more than the words we use: “It involves the all-important social signs people make when they communicate.” <b>Now, how could an AI system access contextual information? It simply does not exist in the data they train on. 
</b></span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit; font-size: large;"><b><span style="color: #2b00fe;">The key issue is that ML systems are fed words (the tip of the iceberg), and these words don’t contain the necessary pragmatic information of common knowledge. </span><span style="color: red;">Humans can express more than words convey because we share a reality. But AI algorithms don’t.</span><span style="color: #2b00fe;"> AI is faced with the impossible task of imagining the shape and contours of the whole iceberg given only a few 2D pictures of the tip of the iceberg. 
</span></b></span></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj7LqazcV2uWJAI_bPKwQrjKKd_mnsA0ai0JbynsbAZd49WxAGcH_aJXkNGpOKQEsVS3TD3asDYFunA1xX3ecvFJ_N7TqiwRC9o4J5lAoL4-1J8SRojpwimWqj0Es3Lk2xXaDeLSOJsgQjhLdiD7ZGEXdbEGj3bZh4cdZ3YO4ZjSVAuYjLgdpPFQ9nK2A/s960/Why%20Language%20is%20hard.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="540" data-original-width="960" height="225" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj7LqazcV2uWJAI_bPKwQrjKKd_mnsA0ai0JbynsbAZd49WxAGcH_aJXkNGpOKQEsVS3TD3asDYFunA1xX3ecvFJ_N7TqiwRC9o4J5lAoL4-1J8SRojpwimWqj0Es3Lk2xXaDeLSOJsgQjhLdiD7ZGEXdbEGj3bZh4cdZ3YO4ZjSVAuYjLgdpPFQ9nK2A/w400-h225/Why%20Language%20is%20hard.png" width="400" /></a></div><br /><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-align: center; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;">Most of the data needed to achieve "understanding" is not available </p><span style="font-family: inherit;">Philosopher Hubert Dreyfus, a leading 20th-century critic, <a href="https://www.nature.com/articles/s41599-020-0494-4">argued against current approaches to 
AI</a>, saying that most human expertise comes in the form of tacit knowledge — experiential and intuitive knowledge that can’t be directly transmitted or codified and is thus inaccessible to machine learning. Language expertise is no different, and it is precisely the pragmatic dimension that is most deeply intertwined with tacit knowledge.</span><p></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="color: #2b00fe; font-family: inherit; font-size: medium;"><b>To summarize: we transmit highly compressed linguistic utterances that need a mind to interpret and “uncover” all the background information. 
This multi-modal, multi-contextual uncompression leads to “understanding</b>.” <b>Both communication and understanding require that humans fill in the unspoken and unwritten words needed to reach comprehension.</b></span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="color: #2b00fe; font-family: inherit; font-size: medium;"><b><br /></b></span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; margin: 0px; outline: 0px; padding: 0px 
0px 1em; text-align: center; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="color: #2b00fe; font-family: inherit; font-size: large;"><b>Languages are the external artifacts that we use to encode the infinite number of thoughts we might have. </b></span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-align: center; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="color: #2b00fe; font-family: inherit; font-size: large;"><b><br /></b></span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; 
border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">In so many ways, then, in building ever-larger language models, <b>machine learning and data-driven approaches are trying to chase infinity in a futile attempt to find something that is not even “there” in the data.</b></span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">Another criticism focuses more on the “general intelligence” claims being made about AI by people like OpenAI. Each of our AI techniques manages to replicate some aspects of what we know about human intelligence. But putting it all together and filling the gaps remains a major challenge. 
In his book, data scientist<a href="https://bdtechtalks.com/2021/03/29/ai-algorithms-representations-herbert-roitblat/"> Herbert Roitblat provides an in-depth review </a>of different branches of AI and describes why each of them falls short of the dream of creating general intelligence.</span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;"><b>The common shortcoming across all AI algorithms is the need for predefined representations,</b> Roitblat asserts. Once we discover a problem and can represent it in a computable way, we can create AI algorithms that can solve it, often more efficiently than ourselves. It is, however, the undiscovered and unrepresentable problems that continue to elude us: the so-called edge cases. 
There are always problems outside the known set, and thus there are problems that models cannot solve.</span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">“These language models are significant achievements, but they are not general intelligence,” Roitblat said. <b>“Essentially, they model the sequence of words in a language. They are plagiarists with a layer of abstraction. Give it a prompt, and it will create a text that has the statistical properties of the pages it has read, but no relation to anything other than the language.</b> It solves a specific problem, like all current artificial intelligence applications. It is just what it is advertised to be — a language model. 
That’s not nothing, but it is not general intelligence.”</span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">“Intelligent people can recognize the existence of a problem, define its nature, and represent it,” Roitblat writes. “They can recognize where knowledge is lacking and work to obtain that knowledge. 
<b>Although intelligent people benefit from structured instructions, they are also capable of seeking out their own sources of information.” </b></span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-align: center; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><b><span style="color: #2b00fe; font-family: inherit; font-size: large;">In a sense, humans are optimized to solve unseen and new problems by acquiring and building the knowledge base needed to address these new problems.</span></b></p><h1 style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; 
border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-align: left; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span color="inherit" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-family: inherit; font-weight: 700; letter-spacing: inherit; margin: 0px; outline: 0px; padding: 0px; text-align: inherit; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><br /></span></h1><div style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-align: left; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span color="inherit" style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; 
--tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-family: inherit; font-size: x-large; font-weight: 700; letter-spacing: inherit; margin: 0px; outline: 0px; padding: 0px; text-align: inherit; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;">The Path Forward</span></div><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">Machine learning is being deployed across a wide range of industries, solving many narrowly focused problems and, when well implemented with relevant data, creating substantial economic value. 
This trend will likely only build momentum.</span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">Some experts say that we are only at the beginning of a major value-creation cycle driven by machine learning that will have an impact as deep and as widespread as the development of the internet itself. 
The future giants of the world economy are likely to be companies that have leading-edge ML capabilities.</span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">However, we also know that AI lacks a theory of mind, common sense and causal reasoning, extrapolation capabilities, and a body, so it is far from being “better than us” at almost anything slightly complex or general. These are challenges that are not easily solved by deep learning approaches. 
We need to think differently and move beyond the assumption that more data and more compute will solve all our AI-related problems.</span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;"><b style="font-size: 16px;">“The great irony of common sense — and indeed AI itself — is that it is stuff that pretty much <a href="http://kv-emptypages.blogspot.com/2021/01/adding-commonsense-reasoning-to-natural.html">everybody knows, yet nobody seems to know </a>what exactly it is or how to build machines that possess it,” said Gary Marcus, </b>CEO and founder of Robust.AI. “Solving this problem is, we would argue, the single most important step towards taking AI to the next level. Common sense is a critical component to building AIs that can understand what they read; that can control robots that can operate usefully and safely in the human environment; that can interact with human users in reasonable ways. 
<b><span style="color: #2b00fe;"><span style="font-size: medium;">Common sense is not just the hardest problem for AI; in the long run, it’s also the most important problem.</span>”</span></b></span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><b><span style="font-family: inherit;">Common sense has been called the <a href="https://www.newyorker.com/tech/annals-of-technology/can-computers-learn-common-sense">“dark matter of AI”</a> — both essential and frustratingly elusive. That’s because common sense consists of implicit information: the broad (and broadly shared) set of unwritten assumptions and rules of thumb that humans automatically use to make sense of the world. 
Critics of over-exuberant AI claims frequently point out that two-year-old children have more common sense than existing deep-learning-based AI systems, whose “understanding” is often quite brittle.</span></b></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">Common sense is easier to detect than to define, and its implicit nature is difficult to represent explicitly. Gary Marcus suggests <a href="https://www.noemamag.com/deep-learning-alone-isnt-getting-us-to-human-like-ai/">combining traditional AI approaches with deep learning</a>: “First, classical AI is a framework for building cognitive models of the world that you can then make inferences over. The second thing is, classical AI is perfectly comfortable with rules. It’s a strange sociology right now in deep learning where people want to avoid rules. They want to do everything with neural networks and do nothing with anything that looks like classical programming. 
But some problems are solved this way that nobody pays attention to, like making a Google Maps route.</span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">We need both approaches. The machine-learning stuff is pretty good at learning from data, but it’s poor at representing the kind of abstraction that computer programs represent. Classical AI is pretty good at abstraction, but it all has to be hand-coded, and there is too much knowledge in the world to manually input everything. 
So it seems evident that what we want is some kind of synthesis that blends these approaches.”</span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">This is also the view of Yann LeCun (Head of AI, Meta) who won a Turing Prize and whose company has also released an open-source LLM. He does not believe that the current fine-tuning RLHF approaches can solve the quality problems we see today. 
The autoregressive models of today generate text by predicting the probability distribution of the next word in a sequence given the previous words in the sequence.</span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;"><b><a href="https://futurist.com/2023/02/13/metas-yann-lecun-thoughts-large-language-models-llms/">Autoregressive models are “reactive” and do not plan or reason,</a> according to LeCun. They make stuff up or retrieve stuff approximately, and this can be mitigated, but not fixed by human feedback. He sees LLMs as an “off-ramp” and not the destination of AI. 
LeCun has also said: “A system trained on language alone will never approximate human intelligence, even if trained from now until the heat death of the universe.”</b></span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">LeCun has also proposed that one of the most important challenges in AI today is devising learning paradigms and architectures allowing machines to supervise their own world-model learning and then use them to predict, reason, and plan.</span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; 
background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">Thus, when we consider the overarching human goal of understanding and wanting to be understood, we must admit that this is very likely always going to require a human in the loop, even when we get to building deep learning models with trillions of words. The most meaningful progress will be related to the value and extent of the assistive role that language AI will play in enhancing our ability to communicate, share, produce, and digest knowledge.</span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><b><span style="font-family: inherit;">Human-in-the-loop (HITL) is the process of leveraging machine power and enabling high-value human intelligence interactions to create continuously improving learning-based AI models. 
Active learning refers to humans handling low-confidence units and feeding improvements back into the model. Human-in-the-loop is broader, encompassing active learning approaches and data set creation through human labeling.</span></b></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><b><span style="color: #2b00fe; font-family: inherit;">HITL describes the process when the machine is unable to solve a problem based on initial training data alone and needs human intervention to improve both the training and testing stages of building an algorithm. Properly done, this creates an active feedback loop allowing the algorithm to give continuously better results with ongoing use and feedback. 
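The low-confidence routing described above can be sketched in a few lines of Python. Note that `route_segment`, its default threshold, and the `post_edit` callback are hypothetical names chosen for illustration, not an API from any particular MT product:

```python
def route_segment(mt_output, confidence, post_edit, threshold=0.8):
    """Active-learning triage: keep high-confidence MT output as-is;
    send low-confidence output to a human post-editor and return the
    correction so it can be fed back into the model as training data."""
    if confidence >= threshold:
        return mt_output, None            # machine output used directly, nothing new to learn
    corrected = post_edit(mt_output)      # the human-in-the-loop step
    return corrected, corrected           # the correction doubles as a new training pair

# Hypothetical usage: a weak segment is sent to a human post-editor
final, new_tm_entry = route_segment("Haus rot", 0.42, post_edit=lambda s: "rotes Haus")
```

Run at scale, the second return value is what gradually improves the system: every human correction becomes another unit of training data for the next model update.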
With language translation, the critical training data is translation memory.</span></b></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">However, the truth is that there is no existing training data set (TM) so perfect, complete, and comprehensive as to produce an algorithm that consistently produces perfect translations.</span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; margin: 0px; outline: 0px; padding: 
0px 0px 1em; text-align: center; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><b><span style="color: #2b00fe; font-family: inherit; font-size: large;">Again to quote Roitblat: “Like much of machine intelligence, the real genius [of deep learning] comes from how the system is designed, not from any autonomous intelligence of its own. Clever representations, including clever architecture, make clever machine intelligence.”</span></b></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;">This suggests that humans will remain at the center of complex, knowledge-based AI applications involving language even though the way humans work will continue to change. 
</span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;"><br /></span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><b><span style="font-family: inherit;">As the use of machine learning proliferates, there is an increasing awareness that humans working together with 
machines in an active-learning contribution mode can often outperform the possibilities of machines or humans alone. The future is more likely to be about how to make AI a useful assistant than it is about replacing humans.</span></b></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><span style="font-family: inherit;"><br /></span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-family: "Noto Serif", Georgia, "Times New 
Roman", serif; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><b>The contents of this blog are discussed in a live online interview format (with access to the recording) that Nimdzi shares on <a href="https://www.linkedin.com/events/thelimitsofaiwithlanguagefeat-k7062090703564099587/comments/">LinkedIn</a>. </b></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-family: "Noto Serif", Georgia, "Times New Roman", serif; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"><b><i>The actual interview starts 13 minutes after the initial ads.</i></b></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgb(59 130 246 / 0.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 #0000; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 #0000; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 #0000; --tw-shadow: 0 0 #0000; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; -webkit-font-smoothing: antialiased; background-attachment: initial; background-clip: initial; 
background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-radius: 0px; border: none; box-shadow: none; box-sizing: border-box; font-family: "Noto Serif", Georgia, "Times New Roman", serif; font-size: 16px; margin: 0px; outline: 0px; padding: 0px 0px 1em; text-shadow: inherit; transition: none 0s ease 0s; vertical-align: baseline;"></p><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen="" class="BLOG_video_class" height="266" src="https://www.youtube.com/embed/qRoqnloyg8M" width="320" youtube-src-id="qRoqnloyg8M"></iframe></div><br /><div class="separator" style="clear: both; text-align: center;"><br /></div><br /><b><br /></b><p></p></div>Kirti Vasheehttp://www.blogger.com/profile/16795076802721564830noreply@blogger.com1tag:blogger.com,1999:blog-6748877443699290050.post-43992859718567973742023-04-14T13:28:00.000-07:002023-04-14T13:28:05.844-07:00The Challenge of MT with Non-English Language Pairs<p><span style="font-size: medium;"> <b>TL;DR Summary:</b><i> Evidence of superior performance from MT systems built by direct
modeling between languages and avoiding the use of pivoting through
English</i></span></p><p><br /></p><p>As the momentum for widespread Enterprise MT use continues, we are
seeing increasing interest in the use of MT across language pairs
that do not involve English. This can be a problem, since much of
the long-term development of MT technology has been very
English-centric.</p><p><strong>Thus, MT most often works best in language combinations that go from or into English,</strong>
e.g., EN > FR, EN > DE, IT > EN, ZH > EN, or JP > EN. It
has also generally (though not strictly) been true that X > EN tends
to yield better results than EN > X. <strong>This is because there
simply is more data around English than any other language, both
bilingual text and especially monolingual text.</strong></p><p>But as
the global internet population changes, the importance of English
declines, and we see growing interest in developing
optimized MT technology between languages that do not include English.
This is true across the globe as MT is seen as a technology that can
help to build momentum within regional trade zones.</p><p><strong>The need for translations between non-English pairs has typically been managed by going through English as a pivot language.</strong>
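Why pivoting is risky comes down to simple arithmetic: each chained system preserves only a fraction of the source information, and those fractions multiply. A minimal Python sketch, using the illustrative per-system accuracy figures discussed later in this post (not real measurements):

```python
from math import prod

def pivot_accuracy(step_accuracies):
    """Expected fraction of 'correct' output after chaining MT systems:
    each step only preserves a fraction of the information it receives,
    so the per-system accuracy scores multiply."""
    return prod(step_accuracies)

# Illustrative scores: ES>EN = 0.85, EN>DE = 0.70
es_de_via_pivot = pivot_accuracy([0.85, 0.70])              # two-step pivot through English
round_trip = pivot_accuracy([0.80, 0.85, 0.70, 0.75])       # multiple chained iterations
```

Two reasonable-looking systems at 85% and 70% accuracy yield only about 59.5% when chained, which is exactly the degradation this post goes on to quantify.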
Thus, DE > IT would be accomplished in two steps, first using a DE
> EN system and then taking that output and using an EN > IT
system. This two-step journey also results in a degradation in output
quality as described below. This is beginning to change and there are an
increasing number of direct language combinations being developed
including the language pairs described in the Comparis case study
presented below.</p><p>It has often surprised me that some in the
translation industry use automated back translation as a way to check MT
quality, since from my vantage point it introduces problems similar to
those of pivoting. <strong>MT back translation should, by definition,
result in further deterioration of output quality as MT output will
often be something less than a perfect translation.</strong></p><p>This
point seems to evade many who advocate this method of evaluation, so let
us clarify with some simple mathematics, since math is one of the few
conceptual frameworks in which a proof carries real certainty. If
one has a <strong>“perfect MT system”</strong> then the Source and Target segments should be very close to each other if not the same. <strong>The perfect score would be 1, which would mean that all the information in the source would be present in the target.</strong> Thus, anything less than perfect would be less than 1. So mathematically we could state this as:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3IvgBiwox8W6XmrxLbivgTDh0L5phdr0YHarOVzrNoGCvu-U4he05uab2mbW6uQtU5wKWhrLFdrzonDUfXTjIl5kf4Feskd50crXDGXjEBXmwFbp-hONuKFNy2trkghpmDA5JTApSoCihugNWVEhrre124Fn27EPlMs9yyngBt66mua7iXqGcIYSBpw/s910/Perfect-1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="495" data-original-width="910" height="174" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3IvgBiwox8W6XmrxLbivgTDh0L5phdr0YHarOVzrNoGCvu-U4he05uab2mbW6uQtU5wKWhrLFdrzonDUfXTjIl5kf4Feskd50crXDGXjEBXmwFbp-hONuKFNy2trkghpmDA5JTApSoCihugNWVEhrre124Fn27EPlMs9yyngBt66mua7iXqGcIYSBpw/s320/Perfect-1.png" width="320" /></a></div>Thus, if we do a formal evaluation of the output of various MT systems <em>(each language direction should be considered a separate system) </em>and
find results like those in the table below, obtained by running 5,000
sentences through each system and scoring every translation as a
percentage “correct” in terms of
linguistic accuracy and precision.<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgALt35EhtEKz4MhhGWEntrOxUndq4VSRNlS5LKJnpOpY85YsnbVAcItu1VOVUVU8lvNrRSJkxEruOGA20Gid-kUVkbRkfgbBAEMRtHCPnRQHy4b6npMtx2V_yZHm6KsV_pPgFFJJe2y7Y_G8ofoS5POxaG7_AyBxvqmV_I8sVF-lXLMLZUd0TRU9CJTg/s960/Accuracy-1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="540" data-original-width="960" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgALt35EhtEKz4MhhGWEntrOxUndq4VSRNlS5LKJnpOpY85YsnbVAcItu1VOVUVU8lvNrRSJkxEruOGA20Gid-kUVkbRkfgbBAEMRtHCPnRQHy4b6npMtx2V_yZHm6KsV_pPgFFJJe2y7Y_G8ofoS5POxaG7_AyBxvqmV_I8sVF-lXLMLZUd0TRU9CJTg/s320/Accuracy-1.png" width="320" /></a></div><p>This evaluation gives us a sense of how these systems are likely to
perform with any new source material. But if we now chain the results
(to replicate the pivot effect) by making the output of one system the
input (source) of the next, we will find that the results get
progressively worse, e.g.:</p><h3 id="es-en-de-85-x-70-0595-or-595-correct" style="text-align: center;"><strong>ES > EN > DE = .85 x .70 = 0.595 or 59.5% correct</strong></h3><h3 id="en-de-en-7-x-75-0525-or-525-correct" style="text-align: center;"><strong>EN > DE > EN = .7 x .75 = 0.525 or 52.5% correct</strong></h3><p>Of
course, the real results will not follow this kind of mathematical
precision and may actually be slightly better or worse. However, in
general, this degradation will hold. So now if we take our example and
run it through multiple iterations, we should expect to see a very
definite degradation of the output as we see below.</p><h3 id="en-es-en-from-mt-de-en-8-x-85-x-7-x-75-357" style="text-align: center;"><strong>EN > ES > EN (from MT) > DE > EN = .8 x .85 x .7 x .75 = 35.7%</strong></h3><p>This
is exactly the strategy that has been used by content creators in what
can be called MT-based humor. The translation degradation works even
better if multiple iterations are done with a larger variety of
languages. Even more so when the languages are very different
linguistically.</p><p><a href="https://www.youtube.com/watch?v=Rte07sl3Vf4&ref=blog.modernmt.com">Here is an example</a> where you can see the source at the top of the video and the multi-pivoted translation at the bottom.</p><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen="" class="BLOG_video_class" height="266" src="https://www.youtube.com/embed/Rte07sl3Vf4" width="320" youtube-src-id="Rte07sl3Vf4"></iframe></div><br /><p>Here is <a href="https://www.youtube.com/watch?v=jN_qt5QlKQU&ref=blog.modernmt.com">a more recent example</a> that shows that even all the recent advances in NMT have not managed to reduce the degradation much if enough pivoting is done.</p><p>So,
while simple two-step pivoting is clearly not as damaging, the evidence
does suggest that it is worth enabling direct translations between
non-English language combinations whenever possible.</p><p>The following
case study, describing the Comparis experience, helps illustrate the
benefit of avoiding pivoting whenever possible and
provides insight into some of the issues that came up in a comparative
evaluation.</p><p><br /></p><div><h1 style="text-align: left;"><strong>The Comparis Case Study </strong></h1><h2 style="text-align: left;"><strong>(written by Daniele Giulianelli)</strong></h2><p>Comparis is the #1 consumer empowerment platform in Switzerland: we
help our users find the products that best fit their needs in different
fields: health insurance, car insurance, mortgage, consumer finance, and
many more. Since Switzerland is “multilingual by design” (it has 4
national languages), translation and localization have a very important
role in our company.</p><p>Our content is mostly written in German, and
then gets translated into French and Italian (in their Swiss locale),
and into English:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgq9EcBTqcUubECDRgpS1aMvHRsZ3fHoN2bDLCDEnXZe7BJXAUvxpPsErK4mn5uGNjD5wEVGXM8yeSqAmXQVB7nH6ZFZsWnznJ-g3jUOHiIZIuI23VIB83xVyiuI5uECdAi50Ifnf3ayN2EY5E-dLyfSdubzCWDmE8Ke99UlzdGkI9acW5JtiXELy0nVg/s725/ModernMT-comparison-02-1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="388" data-original-width="725" height="171" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgq9EcBTqcUubECDRgpS1aMvHRsZ3fHoN2bDLCDEnXZe7BJXAUvxpPsErK4mn5uGNjD5wEVGXM8yeSqAmXQVB7nH6ZFZsWnznJ-g3jUOHiIZIuI23VIB83xVyiuI5uECdAi50Ifnf3ayN2EY5E-dLyfSdubzCWDmE8Ke99UlzdGkI9acW5JtiXELy0nVg/s320/ModernMT-comparison-02-1.png" width="320" /></a></div><p>When starting a Machine Translation program, we had to face two main
issues: highly specialized in-domain content and small locales. Thus, generic MT was
not really an option, since the training data sets mostly refer to the
“bigger” locales (Germany, France, Italy). This would have resulted in a
huge post-editing effort and probably no real efficiency gain from the
use of MT.</p><p>The path to customization is not always easy. After
talking to some providers, it became clear that with our Translation
Memories (about 150,000 segments per language pair) it would have been
hard to have a significant impact on the quality of the MT output. Also,
a traditional “bulk” customization can be pretty expensive. <strong>And in most cases, there is no customization option for non-English language pairs, which are by far the most important for us.</strong></p><p>That’s how we chose to try ModernMT since it gave us the opportunity to:</p><ul><li>Customize “on the fly” by just adding TMX files.</li><li>Customize directly between non-English language pairs without pivoting.</li></ul><p>We
evaluated different MT solutions using ModelFront: the first candidates
were Google Translate, DeepL, and ModernMT. For the latter, we just
used one year (2020) of Translation Memory as a training data set for
customization. ModernMT proved to be the best solution, even with this
minimal customization:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjwGze446-ahD8Xozl3oaQVuLa3zjhywUANg_1I1xpV1wc8vbgsCPCQzN_vuyI7LLNy0NmcToyRiKJIFXgskQz5Sf5oQqW6GIrW93-B5XYKaqMyDG_1CEhmAz8BWcqlQj0h_sBDJ0j2f_fkmy4gQbhd60kDt7tTndX1fMvcjzJdwKa2vX_LB8eYdh7i1Q/s854/ModernMT-comparison-01-1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="405" data-original-width="854" height="152" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjwGze446-ahD8Xozl3oaQVuLa3zjhywUANg_1I1xpV1wc8vbgsCPCQzN_vuyI7LLNy0NmcToyRiKJIFXgskQz5Sf5oQqW6GIrW93-B5XYKaqMyDG_1CEhmAz8BWcqlQj0h_sBDJ0j2f_fkmy4gQbhd60kDt7tTndX1fMvcjzJdwKa2vX_LB8eYdh7i1Q/s320/ModernMT-comparison-01-1.png" width="320" /></a></div><div><br /></div>Here are a couple of examples to illustrate how ModernMT performs in our locales:</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhf58qLISe6n3dlklgLtE7a_PIffqxeLy37b1AC7lx8wi2b_cJKuibMczAIuElUvCuqTPZUkCVznTjlyiEvUCwdDmActUOXBJg8SsX2LqbZPMMAMqa8_AY1J0uQeCJV2lh0Q0b5-4ibhp29OZbO4YLHDhcuZ18xPD1_X0_T7mApFGrdfR1dt4lRTc-iMQ/s960/Comparis-Txt-1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="540" data-original-width="960" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhf58qLISe6n3dlklgLtE7a_PIffqxeLy37b1AC7lx8wi2b_cJKuibMczAIuElUvCuqTPZUkCVznTjlyiEvUCwdDmActUOXBJg8SsX2LqbZPMMAMqa8_AY1J0uQeCJV2lh0Q0b5-4ibhp29OZbO4YLHDhcuZ18xPD1_X0_T7mApFGrdfR1dt4lRTc-iMQ/s320/Comparis-Txt-1.png" width="320" /></a></div>Customizing with bilingual translation memory files (TMXs) without
pivoting also proved to be very efficient for MT performance on
in-domain topics:<div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgiWT1y9A1e-waCfiTfEdGzWYE332OyOJfw2yCEwNPJeK7_igdZ8YxDdoCLj5ljrvXCb0_73YAVIq61bcdoIWfiUSmdEps-Sbm0M14bhzotzMdMeCY2iKcoV41wxYFyblwZppaFn-sPdMLAneeOxLej4ihg8vX68ApWMzSwIaDMC2DYCbvOWwkHCqMJCA/s960/Comparis-Txt-2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="540" data-original-width="960" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgiWT1y9A1e-waCfiTfEdGzWYE332OyOJfw2yCEwNPJeK7_igdZ8YxDdoCLj5ljrvXCb0_73YAVIq61bcdoIWfiUSmdEps-Sbm0M14bhzotzMdMeCY2iKcoV41wxYFyblwZppaFn-sPdMLAneeOxLej4ihg8vX68ApWMzSwIaDMC2DYCbvOWwkHCqMJCA/s320/Comparis-Txt-2.png" width="320" /></a></div><p>After analyzing these results, we decided to move forward with
ModernMT. Another reason for choosing this provider has been of course
their <strong>adaptive</strong> technology. Right from the start we noticed that our machine translation <strong>learns </strong>our terminology and our style while we post-edit the MT output.</p><p>We
were even able to solve our peculiar issue with the tone of voice for
Swiss Italian, which is neither formal nor informal, just impersonal:
so instead of writing “you should compare insurances” we have “it is
important to compare insurances”. It took just a couple of weeks to have
it function in our MT output as we wanted.</p><p>All things considered, we observed almost a <strong>30% productivity boost </strong>in
our team across all languages. And we are confident that this trend
will continue, and that our post-editing effort will diminish. This will
give us more resources in our team to work on non-MT tasks like
advertising, slogans, and SEO optimization.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEikIPSz0ow_V3kwtDrNFyo_ZjmlsPo0v212ryXPstXtTf5vRiL2mFNboYT1errSBaXUFh_XCe3aTCYqatvpy9rOhfnHuBM5knyOzmkrMdfj-pSeA2FK1d0ix0Us0OWBTnQsa9FK3BxwIrS2ouddXn9kwSV5Bzi2xFPJuiMgtE8R3TybsP1qo0rY8hMBug/s960/Comparis-Txt-3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="540" data-original-width="960" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEikIPSz0ow_V3kwtDrNFyo_ZjmlsPo0v212ryXPstXtTf5vRiL2mFNboYT1errSBaXUFh_XCe3aTCYqatvpy9rOhfnHuBM5knyOzmkrMdfj-pSeA2FK1d0ix0Us0OWBTnQsa9FK3BxwIrS2ouddXn9kwSV5Bzi2xFPJuiMgtE8R3TybsP1qo0rY8hMBug/s320/Comparis-Txt-3.png" width="320" /></a></div><br /><p><b><i><br /></i></b></p><p><b><i>Daniele Giulianelli has been in the localization industry for over 10
years, focused on finance and insurance. He is currently the Leader of
Translations and Product Owner Newsroom at Comparis, working to
establish best practices for the challenges of in-house language
services, especially focusing on MT and multilingual SEO.</i></b></p><div><br /></div><div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVwemOuUsLQAh_pBzFbermki4huqblP-Yx7QpW6M8srPRhPdr3XlBwUzq1-7vDsNH9eNFfpp80oTDrDGhBc3utvFuYJqSUZ3bzxsPloGGDiKPhZDzaVV6-hzdaApjSIygJk4QTSqVnj6vGYZCfthpdkjjijrH6jEWqZutV80infDgR3AVAo1UTSkq6lQ/s685/Daniele-G.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="591" data-original-width="685" height="276" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVwemOuUsLQAh_pBzFbermki4huqblP-Yx7QpW6M8srPRhPdr3XlBwUzq1-7vDsNH9eNFfpp80oTDrDGhBc3utvFuYJqSUZ3bzxsPloGGDiKPhZDzaVV6-hzdaApjSIygJk4QTSqVnj6vGYZCfthpdkjjijrH6jEWqZutV80infDgR3AVAo1UTSkq6lQ/s320/Daniele-G.jpeg" width="320" /></a></div><br /><p><br /></p></div></div>Kirti Vasheehttp://www.blogger.com/profile/16795076802721564830noreply@blogger.com0tag:blogger.com,1999:blog-6748877443699290050.post-15313031965035152222023-02-17T12:27:00.015-08:002023-02-18T12:41:02.482-08:00The Problem With LangOps<p><span style="color: #660000;"><span style="font-family: georgia;"><i>This is a letter I wrote to the editor of Multilingual after reading several articles </i></span><i style="font-family: georgia;">focused on LangOps</i><i style="font-family: georgia;"> in <a href="https://multilingual.com/issues/december-2022/">the December 2022 issue</a>. This discussion <a href="https://www.linkedin.com/posts/multilingual-media_on-the-origin-of-langops-the-evolution-of-activity-7016494141467463680--08w/">started on LinkedIn</a> and Cameron invited the active contributors to formalize our comments and write a letter to the editor with alternate viewpoints.</i></span></p><p><i style="font-family: georgia;"><b><span style="color: #660000;">TLDR: LangOps is a term that refers to the vague use of "A.I." 
in/around localization or is nothing more than a way to describe the centralization of enterprise translation production processes.</span></b></i></p><p><i style="font-family: georgia;"><br /></i></p><p><span style="font-family: georgia;"></span></p><div class="separator" style="clear: both; text-align: center;"><span style="font-family: georgia;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijLO75K5Ts0YVBNwrYcOUo4OM1bRqotpLmfQ2zBGW906XU4A1jF8aO-8_BVPQ19azCIXai6pFuux9ZhkmOxIuntY-XETUnI6kehADLERaoblKOxakPbL-XjiTfqODmHIaXDUceJlfsI_zszBFEvG4GCG8-M-4HlVOC_ZTBqhdCvG8up7vlEcXVFldjog/s800/LangOps.jpeg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="450" data-original-width="800" height="225" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijLO75K5Ts0YVBNwrYcOUo4OM1bRqotpLmfQ2zBGW906XU4A1jF8aO-8_BVPQ19azCIXai6pFuux9ZhkmOxIuntY-XETUnI6kehADLERaoblKOxakPbL-XjiTfqODmHIaXDUceJlfsI_zszBFEvG4GCG8-M-4HlVOC_ZTBqhdCvG8up7vlEcXVFldjog/w400-h225/LangOps.jpeg" width="400" /></a></span></div><span style="font-family: georgia;"><br /><i><span style="color: #660000;">I carefully read all of the following <b>before</b> writing my letter to ensure that I had not somehow missed the boat. 
The basic question I am still left with after looking carefully through the LangOps material is "Where's the real substance of this concept/idea/word?"</span></i><i><br /></i></span><p></p><p></p><ul style="text-align: left;"><li> Article by Renato: <a href="https://multilingual.com/issues/december-2022/langops-the-vision-and-the-reality/">LangOps- The Vision & the Reality</a> </li><li>Article by Arthur Weitzel: LangOps: <a data-mce-href="https://draft.blogger.com/blog/post/edit/6748877443699290050/1531303196503515222#" href="https://multilingual.com/issues/december-2022/langops-pipe-dream-lsps-heaven-or-just-a-new-hashtag/">Pipe Dream, LSP´s Heaven or Just a New Hashtag?</a></li><li>Article by Andrew Warner: <a href="https://multilingual.com/issues/december-2022/on-the-origin-of-langops-the-evolution-of-the-localization-roadmap/">On the Origin of LangOps</a> <strong>- </strong>The evolution of the localization roadmap</li><li>Article by Miguel Cerna: <a href="https://www.linkedin.com/pulse/langsops-localization-integration-rather-than-dr-miguel-cerna-%E9%A9%AC%E5%85%8B%E7%BD%97/">LangOps and Localization</a> Integration rather than substitution?</li><li>Article by Riteba McCallum: <a data-mce-href="https://multilingual.com/?p=193963" href="https://multilingual.com/?p=193963">The LangOps Paradigm: Perceptions of machine translation within the translation industr</a>y </li><li>and of course the LangOps principles.</li></ul><p><br /></p><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen="" class="BLOG_video_class" height="266" src="https://www.youtube.com/embed/K6a-FLKMxUs" width="320" youtube-src-id="K6a-FLKMxUs"></iframe></div><div style="text-align: center;"><span style="color: #2b00fe;">Jump to 2' 50" to get to the relevant part</span></div><p><i><span style="color: #660000;">Here is a slightly ornamented version of <a href="https://multilingual.com/issues/february-2023/langops-wheres-the-beef/">the text of my letter to the 
editor which was published in the Multilingual February 2023 issue.</a> I include a version with emphasis (mine) so that others may also comment on this, and perhaps correct my misperception. </span></i></p><p><i><span style="color: #660000;">Special Thanks to <a href="https://www.linkedin.com/in/marjolein-groot-nibbelink/">Marjolein Groot Nibbelink </a>for taking the trouble to convert the letter to <a href="https://multilingual.com/?p=198946">a really well-read audio track that can be played back faster.</a></span></i></p><p><i><span style="color: #660000;"><a href="https://multilingual.com/?p=198946"><iframe height="150" src="https://609ac11b002ce7-92070854.castos.com/player/1415254" width="100%"></iframe>
</a></span></i></p><p><br /></p><p>Dear Multilingual Editor (Cameron),</p><p>After reading the various articles on
LangOps in the Multilingual December 2022 issue, I had hoped that I
would get a better sense of what LangOps is, and why it matters. But I
cannot say that this happened for me, and <b>I am not sure that I (or any
other reader) now have any more clarity on what LangOps is, beyond it being a
vendor buzzword</b> that remains fuzzy and amorphous because there is not
enough supporting evidence to document it properly. While there was much
discussion about why a new definition that went further than
localization is needed, there was not much that defined LangOps in more
concrete terms. <b>I suspect the fuzziness and lack of clarity that I felt
are true for many other readers as well.</b></p><h3 style="text-align: center;"><strong><span style="color: #2b00fe; font-size: large;">One is left asking, “Where’s the beef?” about this thing they call LangOps.</span></strong></h3><p>I
reviewed the articles in the magazine on the LangOps subject again
before writing this letter, to better identify the defining elements,
and to make sure I was fair and had not missed some obvious facts. My
intention with my comments here is to provide a coherent
critique of the subject matter, continuing a discussion that started with comments made by several readers about LangOps <a href="https://www.linkedin.com/posts/multilingual-media_on-the-origin-of-langops-the-evolution-of-activity-7016494141467463680--08w/">on LinkedIn.</a> </p><p>From my
reading, the articles in Multilingual were clearer on <b>Why</b> new
definitions are needed, but <b>less clear on the What [it is] or the How.</b></p><p>It
appears to me that the LangOps concept is another attempt by some
stakeholders in the industry to raise the profile of the translation
business, to make it more visible at the executive level, or to increase
the perceived value of the translation production process by imbuing it
with more complexity and mysterious undefined AI elements. However, in
the absence of specifics, it becomes just another empty buzzword that
creates more confusion than clarity for most of us, especially so for
new buyers. </p><p><b>It is difficult to see how any sponsor could take the
descriptions provided in this issue of Multilingual to a senior executive to ask for
funding, or even to explain what it is.</b></p><p>It is clear that as the
translation of some product and marketing content became recognized as a
valuable international business-driving activity, the need to scale,
organize and systematize it became more urgent and led to what most call
localization today. </p><p><b>Thus, localization, I think, refers to the many
processes, activities, and tools used to make language translation
more automated, structured, and systematic.</b> Most often this
work is related to relatively static content that is mandatory in
international markets, but recently it has expanded to include more
customer service and support content. It also sometimes includes
cultural adaptations that are made in addition to the basic
translation. </p><p>TMS systems have been central to the localization worldview
over the past decade, as they facilitate the development
and management of different workflows, monitor translation work, and
ease project management of distributed translation-related tasks (TEP).
It is also true that MT has been minimally used in hard-core
localization settings, as MT systems were not deemed to be accurate,
flexible, or simple enough to configure for use in this work.</p><p>By carefully reviewing the published Multilingual articles again, I
gathered that the following elements are being used to define what
LangOps is:</p><ul><li>There are AI-driven capabilities applied to certain localization processes <b>which are not defined,</b></li><li><strong>Centralization of all translation production activities across the enterprise,</strong></li><li>Introduction of “more” technology into existing localization workflows, <b>but what this is specifically, is unclear,</b></li><li>LangOps is said to be made up of cross-functional and inter-disciplinary teams, <b>but who and why is not clear,</b></li><li><strong>Possibly </strong>adding other value-added language tasks (sentiment analysis, summarization, chatbots) in addition to the translation. [This at least is clear].</li></ul><p style="text-align: center;"><span style="color: #2b00fe; font-size: large;"><strong>To my view, the only element here that is clear in the many descriptions [of LangOps] is that of the centralization of translation production.</strong> </span></p><p>The other elements used to describe what it is are kind of fuzzy and
hard to pin down. They can mean anything or could mean nothing since
vagueness is not easily pinned down. LangOps is another term, that is
possibly even worse than localization <em>(which confuses many regular people and many new customers)</em>
because it creates a communication problem. </p><p>How do you answer the
question, “What do you do?” in an elevator, a cab, at a party, on an
airplane, or with family and friends? As you can see, both Localization and
LangOps present opaque, obfuscating images to the regular human mind.</p><p><b><span style="color: #2b00fe;">Would
it not be so much easier to just say “Language Translation to Drive
International Business”? And then maybe add, “We use technology, tools,
and people to do it at a large scale efficiently.”</span></b></p><p>I would like to
suggest a different way to view the continuing evolution of business
translation. It is my feeling that the LangOps movement is
linking the growing number of MT use cases, which have more dynamic IT
connectivity and cross-organization collaboration implications, with a
need for a new definition.</p><p>We have now reached that perfect storm
moment where most B2C and B2B businesses recognize that they need a
substantial digital presence, that it is important to provide large
volumes of relevant content to serve and please their customers, and
that they need to listen to customers in social media, understand trends
faster, and communicate across the globe much faster. </p><p>This means that
successful businesses have to share, communicate, listen, and produce
translations at a much larger scale than they have had to in the past.
The core competencies of traditional localization work are less likely to be
useful with these new challenges. These new market requirements need a
shift away from TM and TMS-managed work to a more MT-centric view of the
world. The volume of translation increases from thousands of translated
words a month to millions or even billions of words a month to drive
successful international business outcomes in the modern era. </p><p>As
Generative AI improves and begins to be deployed in production customer
settings, we will only see the translation volumes grow another 10X or
100X. Thus, deep MT competence increasingly becomes a core requirement
to be in the enterprise translation business.</p><p>MT has been improving dramatically over the last five years in
particular, and it is not ridiculous to say that it is getting close to
human output in some special cases when systems are properly designed
and deployed by competent experts. </p><p><b>Competence means that experts can
quickly adapt and modify MT systems to produce useful output in the
20-30 different use cases where an enterprise faces an avalanche of text
and/or audiovisual content.</b> The new use cases go beyond the traditional
focus of localization in terms of content and process. We now need to
translate much more dynamic content related to customer services and
support, translate more active communications (chat, email, forums),
share more structured and unstructured content, pay more attention to
social media feedback, and just be more real-time and dynamic in
general.</p><p>The successful modern global enterprise listens,
understands, communicates, and actively shares content across the globe
to improve customer experience. <strong>Thus, I think it is fair to say
that we (the translation business) are moving to a more MT-centric world
from a previously TMS-centric world, and a critical skill needed today
is deep competence with MT.</strong> </p><p><b>Useful MT output means it helps
grow and drive international business, even though it may not be
linguistically “perfect”.</b> MT competence requires
moving far beyond choosing an MT system with the best BLEU or COMET
score. </p><p>MT Competence means you can find egregious errors <em>(MT & AI make these errors all the time)</em>
and instantly correct these problems to minimize damage. </p><p>MT Competence
means the skill and agility to respond to changing business needs and
new content types and the ability to rapidly modify MT systems as
needed. </p><p><span style="color: #2b00fe;"><b>Competence in managing rapid, responsive, deep adaptation of MT
systems will be a key requirement to actively participate as an
enterprise partner (not vendor) on a global stage very shortly.</b> </span></p><p>When
language translation is mission-critical and pervasive, the service
provider will likely evolve from being a vendor to being a partner. It
can also often mean that the scope of localization teams is greatly
expanded and becomes more mission-critical.</p><p>While I can see a
business reality where there is a Machine-First & Human-Optimized translation approach to content across the global enterprise, which requires responsive,
continuously improving MT, it also means moving beyond traditional MTPE
where clean-up crews come to reluctantly fix badly formed MT output
produced by inexperienced and incompetent MT practitioners. </p><p>However, the
lights start to dim for me when I think of "LangOps" being part of this
reality in any form whatsoever.</p><p>This continuing evolution of
business translation also probably means that there is a much more
limited role for the TMS, with its use restricted to some localization
(software and key documentation) workflows. The more common case as
translation volumes grow is to connect all Customer Experience
(CX)-related text directly into highly tuned, carefully adapted NMT
systems in high-performance low-latency IT infrastructure that is directly customer-facing, or customer accessible. </p><p><strong>Recent data I have seen on MT use across a broad swathe of enterprise users shows that as much as 95% of MT use completely bypasses the TMS</strong>.
Properly tuned expert-built MT engines do not need the unnecessary
overhead of a TMS system. The enterprise objective is to enable
translation at scale for everything that might require an instant, and
mostly (but not necessarily perfectly) accurate translation, as long as
it furthers and enhances any and every global business initiative and
communication. </p><p><strong>Speed and scale are more important and have a
more positive impact on international business success in many
CX-related use cases than perfect linguistic quality does.</strong> The enterprise executives understand this even though we as an industry might not.</p><p><strong>I am not aware of a single LangOps configuration or group on this earth</strong>
nor do I know of any enterprise that claims to have such an initiative, but I
can point to several massive-scale MT-driven translation engines around
the world e.g. Airbnb, Amazon, Alibaba, and eBay where billions of words
are translated regularly to drive international business and customer
delight and serve a growing international customer base. I am confident
we will see this pool of enterprise users grow beyond the eCommerce
markets.</p><p><b><span style="color: #2b00fe;">Thus, I see little value in promoting the concept of
LangOps: what actually seems to be happening is that more expert-tuned
enterprise MT is being used, and the share of MT in total
translation volumes continues to grow. </span></b></p><p>As this kind of responsive, highly
adaptive MT capability becomes more pervasive across an enterprise, it
also becomes a critical requirement for international business success.
The activities related to organizing and managing significantly more
dynamic content and translation volumes should not be mistaken for
something as vague as LangOps, as no organization I am aware of has the
building blocks or template to create such a vaguely defined function. I think that it
is more likely that Localization teams will evolve and the scope of
their activities will increase, perhaps as dramatically as we have seen at
Airbnb.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj5p5Chr_2GnRCq9WOdHKRENppJrH2ldSx6Q7pR9MrkvOQytXbKTOFSP6oWIw9ldj1a5e3X3H97TgHWXK7hdcZe5Yehri_vP7WEpGL2RUW4h0QJQIWMBFRvwtdLydMH8B57ysNsDjm7ZyljiO52sL3VOcooTld9_i6-9HfQ_-QJnzo8LOi_D87bbzJeVw/s1198/AIrbnb%20Revenue.jpg" style="margin-left: 1em; margin-right: 1em;"><img alt="Airbnb just booked its first annual profit in its near-15-year history, a whopping $1.9bn in 2022. It now appears to be in rarefied air, with its place as the de facto online marketplace for homestays and experiences, giving it a network effect that’s hard to compete with." border="0" data-original-height="1198" data-original-width="1198" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj5p5Chr_2GnRCq9WOdHKRENppJrH2ldSx6Q7pR9MrkvOQytXbKTOFSP6oWIw9ldj1a5e3X3H97TgHWXK7hdcZe5Yehri_vP7WEpGL2RUW4h0QJQIWMBFRvwtdLydMH8B57ysNsDjm7ZyljiO52sL3VOcooTld9_i6-9HfQ_-QJnzo8LOi_D87bbzJeVw/w400-h400/AIrbnb%20Revenue.jpg" width="400" /></a></div><p style="clear: both; text-align: center;"><span style="font-size: x-small;">Airbnb just booked its first annual profit in its near-15-year history, a
whopping $1.9bn in 2022. It now appears to be in rarefied air, with<strong data-stringify-type="bold"> its place as the de facto online marketplace for homestays and experiences,</strong> giving it a network effect that’s hard to compete with.</span></p><p style="clear: both; text-align: left;"><span style="color: #1d1c1d; font-size: x-small;"><span style="background-color: #f8f8f8; font-variant-ligatures: common-ligatures;"> </span></span></p><p>I did find all the articles on LangOps useful in
furthering my understanding, especially the ones by Riteba McCallum, and
Miguel Cerna, and my comments should not be mistaken as a wholesale
dismissal of the viewpoints presented. On the contrary, I think we have
much more agreement on many of the core issues discussed. Though I do
admit that I find the general concept of LangOps, as it has been painted,
to be a likely hindrance to our mutual future rather than a beneficial
concept to drive our success with globalization and international
business initiatives with our common customers.</p><p><br /></p><p>Respectfully Yours,</p><p>Kirti Vashee</p><p><br /></p><p><br /></p><p>Here is the LinkedIn article where the discussion began:
<iframe allowfullscreen="" frameborder="0" height="570" src="https://www.linkedin.com/embed/feed/update/urn:li:share:7016494140750245888" title="Embedded post" width="504"></iframe>
</p><p><br /></p><p>P.S. Maybe all I am saying is that LangOps just needs more cowbell 😄😄😄 to get the sound and the concept right?</p><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen="" class="BLOG_video_class" height="266" src="https://www.youtube.com/embed/cVsQLlk-T0s" width="320" youtube-src-id="cVsQLlk-T0s"></iframe></div><br /><p><br /></p>Kirti Vasheehttp://www.blogger.com/profile/16795076802721564830noreply@blogger.com14tag:blogger.com,1999:blog-6748877443699290050.post-6489066427577278972023-02-01T16:40:00.001-08:002023-02-01T16:40:29.653-08:00The March Towards AI Singularity and Why It Matters<p> </p><h3 id="why-progress-in-mt-is-a-good-proxy-of-progress-with-the-technological-singularity">Why progress in MT is a good proxy of progress with the technological singularity</h3><div><br /></div><div>For as long as machine translation technology has been around (now over 70 years),
there have been regular claims by its developers of
reaching “human equivalence”. <strong>However, until today we have not
had a claim that has satisfied practitioners in the professional
translation industry, who are arguably the most knowledgeable critics
around</strong>. For these users, the actual experience with MT has not
matched the many extravagant claims made by MT developers over
the years.</div><div><br /></div><div><strong><span style="color: #2b00fe;">This changes with the long-term study and translation production
data presented by Translated SRL at the AMTA conference which provides
the missing elements: a huge industrial-scale evidentiary sample
validated by a large group of professional translators across multiple
languages based on professional translation work done in real-world
production scenarios.</span></strong></div><div><strong><br /></strong></div><div>The historical difficulty in providing acceptable proof does not mean
that progress is not being made, but it is helpful to place these claims
in proper context and perspective to better understand what the
implications are for the professional and enterprise use of MT
technology. </div><div><br /></div><div><strong><span style="color: #ffa400;">The history of MT (machine translation) is unfortunately filled with empty promises</span></strong></div><div><strong><br /></strong></div><div>MT (<em>human language translation</em>) is considered among the most
difficult theoretical problems in AI, and thus we should not be
surprised that it is a challenge that has not yielded completely to the
continuing research efforts of MT technology experts over the decades.
Also, many experts have said that MT is a difficult enough challenge (<em>AI-complete:
because it requires a deep contextual understanding of the data, and
the ability to make accurate predictions based on that data</em>) that it is a good proxy for AGI (<em>Artificial
general intelligence is the ability of a machine process/agent to
understand or learn any intellectual task that a human being can</em>) and thus progress with MT can also mean that we are that much closer to reaching AGI.</div><div><br /></div><div><h1 style="text-align: left;">The Historical Lack of Compelling Evidence</h1></div><div><p>MT researchers are forced to draw conclusions on research progress
being made based on relatively small samples of non-representative data (<em>from the professional translation industry perspective</em>) that are evaluated by low-cost human "translators". The <a href="https://kv-emptypages.blogspot.com/2016/09/the-google-neural-machine-translation.html">Google Translate claims in 2016</a>
are an example of a major technology developer making
"human-equivalence" claims based on limited data that was possible
within the scope of the technology development process typical at the
time. </p><p>Namely, here are 200 sentences that amateur translators say
are as good as human translation, thus we claim we have reached human
equivalence with our MT.</p><p>Thus, while Google did indeed make
substantial progress with its MT technology, the evidence it provided to
make the extravagant claim lacked professional validation, was limited
only to a small set of news domain sentence samples, and was not
representative of the diverse and broad scope of typical professional
translation work which tends to be much more demanding and varied.</p><p>The
problem with these historical as-good-as-humans claims, from the
perspective of the professional industry, can be summarized as follows:</p><ol><li><strong>Very small samples of non-representative data</strong>:
Human equivalence is claimed on the basis of evaluations of a few news
domain segments where non-professional translators were unable to
discern meaningful differences between MT and human translations. The
samples used to draw these conclusions were typically based on no more
than a few hundred sentences.</li><li><strong>Automated quality metrics like BLEU were used to make performance claims: </strong>The
small samples of human evaluation were generally supported by larger sets (a
thousand or so sentences) where the quality was assessed by an
automatic reference-based score. There are many problems with these
automated quality scores as <a href="https://blog.modernmt.com/understanding-mt-quality-bleu-scores/">described here</a>,
and we now know that they miss much of the nuance and variation that is
typical in human language, resulting in erroneous conclusions, and at
best they are very rough approximations of competent human assessments. <strong>COMET
and other metrics are slightly better quality approximation scores but
still fall short of competent human assessments, which remain the
"gold standard" in assessing translation output quality.</strong> The
assessments of barely bilingual translators found in Mechanical Turk
settings and often used by MT researchers are likely to be quite
different from those of expert professional translators, whose reputations are
defined by their work product. <strong>Competent human assessments
("gold standard") are often at odds and different from the segments
suggested as the best-scoring ones based on metrics like COMET or
hLepor.</strong></li><li><strong>Overreaching extrapolations:</strong>
The limited evidence from these experiments was marketed as
“human-equivalence” by Google and others, and invariably ended up
disappointing professional translators and enterprise users, who quickly
witnessed the poor performance of these systems when they strayed away
from news domain content. <strong>Though these claims were not
deliberately deceptive, they were made to document progress from a
perspective that was much narrower than the scope and coverage typical
of professional translation work. </strong>There has never before been a claim
of improved MT quality performance backed by evidence at the huge scale (across 2
billion segments) presented by Translated SRL.</li></ol><div><br /></div><div><h3 style="text-align: left;"><span style="font-size: medium;">Translated SRL Finally Provides Compelling Evidence</span></h3></div></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_hr8sDS0YOphecnl85AOsAmib55rPF9uVesX0N919AzrZ1qJru_bNNjLHkwFwM1rDyj5D_eAM5dM-C3UcIHFm-v3vib7UjFctMexbdDAUHhnbWWt-Z4PM5zX5PSKUnBNWgxIm5ndKkx3fUkwY7tRRfxZuG8wF8ZRGOwwRnfgC7V8QSy7hleaM2Oz7Wg/s1980/data-showing-speed-to-singularity.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1040" data-original-width="1980" height="210" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_hr8sDS0YOphecnl85AOsAmib55rPF9uVesX0N919AzrZ1qJru_bNNjLHkwFwM1rDyj5D_eAM5dM-C3UcIHFm-v3vib7UjFctMexbdDAUHhnbWWt-Z4PM5zX5PSKUnBNWgxIm5ndKkx3fUkwY7tRRfxZuG8wF8ZRGOwwRnfgC7V8QSy7hleaM2Oz7Wg/w400-h210/data-showing-speed-to-singularity.jpg" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: left;"><p>The measurement used to describe ongoing progress with MT is Time To
Edit (TTE). This is a measurement made during routine production
translation work and represents the time required by the world’s
highest-performing professional translators to check and correct
MT-suggested translations. </p><p>Translated makes extensive use of MT
in their production translation work and has found that TTE is a much
better proxy for MT quality than measures like Edit Distance, COMET, or
BLEU. They have found that rather than using these automated score-based
metrics, <strong>it is more accurate and reliable to use a measurement
of the actual cognitive effort expended by professional translators
during the performance of production work. </strong></p><p><strong>Consistent
scoring and quality measurement are challenging in the production
setting because they are greatly influenced by varying content types,
translator competence, and changing turnaround time expectations</strong>.
A decade of careful monitoring of the production use of MT has yielded
the data shown above. Translators were not coerced to use MT and it was
only used when it was useful. </p><p>The data are compelling because of the following reasons:</p><ul><li>The <strong>sheer scale of the measurements across actual production work</strong>
is described in the link above. The chart focuses on measurements
across 2 billion edits where long-term performance data was available.
</li><li>The chart represents what has been <strong>observed over seven years, across multiple languages, measuring the experience of professional translators </strong>making about 2 billion segment edits under real-life production deadlines and delivery expectations.</li><li><strong>Over 130,000 carefully selected professional translators contributed to the summary measurements shown on the chart.</strong></li><li><strong>The
segments used in the measurements are all no TM match segments as this
represents the primary challenge in the professional use of MT.</strong></li><li>The broader ModernMT experience also shows that <strong>highly optimized MT systems for large enterprise clients are already outperforming the sample shown </strong>above which represents the most difficult use case of no TM match.</li><li><strong>A
very definite linear trend shows that, if the rate of progress continues as shown, it MAY be possible to produce MT segments that are as good as those produced by professional translators within this decade.</strong> This is the point of singularity: the point at which the time top professionals spend checking a translation produced by the MT is no different from the time spent checking a translation produced by their professional
colleagues which may or may not require editing.</li></ul><div><h3 id="it-is-important-to-understand-that-the-productivity-progress-shown-here-is-highly-dependent-on-the-superior-architecture-of-the-underlying-modernmt-technology-which-learns-dynamically-and-continuously-and-improves-on-a-daily-basis-based-on-ongoing-corrective-feedback-from-expert-translators-modernmt-output-has-thus-continued-to-steadily-improve-over-time-it-is-also-highly-dependent-on-the-operational-efficiency-of-the-overall-translation-production-infrastructure-at-translated-srl"><span style="color: #2b00fe;">It
is important to understand that the productivity progress shown here is highly dependent on the superior architecture of the underlying ModernMT technology, which learns dynamically and continuously, improving daily based on ongoing corrective feedback from expert translators. ModernMT output has thus continued to improve steadily over time. It is also highly dependent on the operational efficiency of the overall translation production infrastructure at Translated SRL.</span></h3></div><div><p>The <strong>virtuous data improvement cycle created by engaged expert translators </strong>providing regular corrective feedback supplies the right kind of data to drive ongoing improvements in MT output quality. <strong>This
improvement rate is not easily replicated by public MT engines and
periodic bulk customization processes that are typical in the industry.</strong></p><p>The
corrective input is professional peer revision during the translation process, and this expert human input "has control": it guides the ongoing improvement of the MT, not vice versa. <strong>While overall
data, computing, and algorithms are critical technological foundations
to ongoing success, expert feedback has a substantial impact on the
performance improvements seen in MT output quality.</strong> </p><p>The
final quality of translations delivered to customers is measured by a
metric called EPT (Errors per thousand words) which in most cases is 5
or even as low as 2 when two rounds of human review are used. <strong>The
EPT rating provides a customer-validated, objective measure of quality that is respected in the industry, even for a purely human translation product when no MT is used. </strong></p><p>There is a strong,
symbiotic, and mutually beneficial relationship between the MT and the
engaged expert translators who work with the technology. The process is
quite different from typical clean-up-the-mess PEMT projects with
customized static models, where the feedback loop is virtually
non-existent, and where the MT systems barely improve even with large
volumes of post-edited data.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjlFALvrGzVgz8CxuPGMQig-winKbEwXh3TpqGyt4S969BDlmAPZj55FasFyfwmi1WpfZuaZyWAMUo7GaF0-9bxl06BbOlb6f3cEXaT0aQob1o-tXYPE2YTojaJGnmpE9pPLSR4_6_Rd7LRRgbueEBAWoU-kvjeMHOFKz10ZhdZxRkaafhH3-Lzif6yog/s910/The-March-Towards-AI-Singularity-and-Why-It-Matters-33-1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="470" data-original-width="910" height="206" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjlFALvrGzVgz8CxuPGMQig-winKbEwXh3TpqGyt4S969BDlmAPZj55FasFyfwmi1WpfZuaZyWAMUo7GaF0-9bxl06BbOlb6f3cEXaT0aQob1o-tXYPE2YTojaJGnmpE9pPLSR4_6_Rd7LRRgbueEBAWoU-kvjeMHOFKz10ZhdZxRkaafhH3-Lzif6yog/w400-h206/The-March-Towards-AI-Singularity-and-Why-It-Matters-33-1.png" width="400" /></a></div><span style="font-size: x-small;"><div style="text-align: center;">Responsive, Continuously Improving MT Drives Engagement from Expert
Translators </div></span></div><div style="text-align: center;"><span style="font-size: x-small;">Who See Immediate Benefit During the Work Process</span></div><div style="text-align: center;"><br /></div><div><h3 style="text-align: left;">The Problem with Industry Standard Automated Metrics for MT Quality Assessment</h3><div><p>It has become fashionable in the last few years to use automated MT
quality measurement scores like BLEU, Edit Distance, hLepor, and COMET
as a basis to select the “best” MT systems for production work. And some
companies use different MT systems for different languages in an
attempt to maximize MT contributions to production translation needs. <strong>These scores are all useful for MT system developers to tune and improve MT systems; however, globalization managers who rely on them may overlook some rather obvious shortcomings of score-based MT selection. </strong></p><p>Here is a summary of the shortcomings of this best-MT-based-on-scores approach:</p><ol><li>These scores are typically based on <strong>measurements of static systems</strong>.
The score is ONLY meaningful on a certain day with a certain test set, and actual MT performance may be quite different from what the static
score might suggest. <strong>The score is a measurement of a historical point and is generally not a reliable predictor of future performance.</strong></li><li>Most enterprises need to adapt the system to their specific content/domain and thus <strong>the ability of a system to rapidly, easily, and efficiently adapt to enterprise content is usually much more important</strong> than any score on a given day.</li><li><strong>These scores do not and cannot factor in the daily performance improvements that would be typical of an adaptive, dynamically, and continuously improving system like ModernMT, which would most likely score higher every day it was actively used and provided with corrective feedback. Thus, they are of very limited value with such a system.</strong></li><li>These
scores can vary significantly with the test set used to generate them, and rankings can shift as test sets change. <strong>The cost of generating robust and relevant test sets often compromises the testing process, and the process can be gamed.</strong></li><li>Most of these scores are <strong>based on small test sets of only 500 or so sentences
and the actual experience in production use on customer data could vary
dramatically from what a score based on a tiny sample might suggest.</li><li>Averaged over many millions of segments, <strong>TTE (Time to Edit) gives an accurate quality estimate with low variance and is a more reliable indicator of quality issues in production MT use.</strong>
Machine translation researchers have had to rely on automated
score-based quality estimates such as the edit distance, or
reference-based quality scores like COMET and BLEU to get quick and
dirty MT quality estimates because they have not yet had the opportunity
to work with such large (millions of sentences) quantities of data
collected and monitored in production settings. </li><li>As enterprise use of MT evolves, the needs and expected capabilities of the system also change, and static scores become less and less relevant to those changing demands.</li><li>Also, such a score does not incorporate overall business requirements in an enterprise use scenario, where workflow-related, integration, and process-related factors may actually matter much more than small differences in scores.</li><li>Leading-edge research presented at EMNLP 2022 and similar conferences provides evidence that <strong>COMET-optimized
system rankings frequently do not match what “gold-standard” human
assessments would suggest as optimal. Properly done human assessments are more reliable in almost every area of NLP.</strong> The TTE
measurements described above inherently allow us to capture human
cognition impact and quality assessment at a massive scale in a way that
no score or QE metric can today.</li><li>Different MT systems respond
to adaptation and customization efforts in different ways. The benefit
or lack thereof from these efforts can vary greatly from system to
system, especially when a system is designed primarily to be generic. <strong>Adaptive MT systems like ModernMT are designed from the
outset to be tuned easily and quickly with small amounts of data to fit a
wide range of unique enterprise use cases</strong>. ModernMT is almost
never used without some adaptation effort, unlike generic public MT
systems like Google MT which are primarily used in a default generic
mode. </li></ol></div><p><br /></p><p><strong><span style="color: #2b00fe;"><strong>A “single point quality score” based on publicly sourced
sentences is simply not representative of the dynamically changing,
customized, and modified potential of an active and evolving enterprise </strong>adaptive <strong>MT system</strong> that is designed to be continuously adapted to unique customer use case requirements.</span></strong></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjkC5A73RJ3dy33gkMp2U_4FBqG7Gu1EynHgx0N1Yhh75uMcL3LPTDwYsT1ZO9Pl7lPA3dsx0kPbdif3dwB3GGdEiwgn560ytXXYxASq73njnSWGEx9bk2HbTc7h3OBoGs-KfPRt88HpS0b5Zw5e1lIiHbO-dpmKqcV2D2RBIIw8cdNWoX_mUE3bsZF8Q/s1249/The-March-Towards-AI-Singularity-and-Why-It-Matters-36.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="470" data-original-width="1249" height="150" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjkC5A73RJ3dy33gkMp2U_4FBqG7Gu1EynHgx0N1Yhh75uMcL3LPTDwYsT1ZO9Pl7lPA3dsx0kPbdif3dwB3GGdEiwgn560ytXXYxASq73njnSWGEx9bk2HbTc7h3OBoGs-KfPRt88HpS0b5Zw5e1lIiHbO-dpmKqcV2D2RBIIw8cdNWoX_mUE3bsZF8Q/w400-h150/The-March-Towards-AI-Singularity-and-Why-It-Matters-36.png" width="400" /></a></div><strong><p><strong><br /></strong></p>When it is necessary to compare two MT systems in a buyer
selection & evaluation process, double-blind A/B human evaluations
on actual client content would probably produce the most accurate and useful results, which are also better understood by executive and purchasing management.</strong><p></p><p><strong>Additionally, MT systems are not static: the models are
constantly being improved and evolving, and what was true yesterday in
quality comparisons may not be true tomorrow.</strong> For these reasons, understanding how the <strong>data, algorithms, and human processes</strong> around the technology interact is usually more important than any static score-based comparison snapshot. A more detailed <a href="https://blog.modernmt.com/understanding-mt-quality/">discussion of the overall MT system comparison issues is provided here</a>.</p><p>Conducting
accurate and consistent comparative testing of MT systems is difficult with either automated metrics or human assessments: both are easy to do badly and difficult to do well. We are also aware that the industry struggles in its communications with buyers about translation quality.
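</p><p>As one purely illustrative sketch (the <code>sign_test_p</code> helper and the example counts below are hypothetical, not a procedure described here), a double-blind A/B comparison can be reduced to counting blind segment-level preferences and applying a simple sign test:</p>

```python
from math import comb

def sign_test_p(wins_a: int, wins_b: int) -> float:
    """Two-sided sign test: probability of a preference split at least this
    lopsided if evaluators had no real preference (ties excluded beforehand)."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # Tail probability P(X >= k) under Binomial(n, 0.5), doubled for two sides.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Example: 100 blind judgments on client content, system A preferred 62 times.
p_value = sign_test_p(62, 38)  # a small p-value suggests the preference is not noise
```

<p>A result framed this way is also easy to explain to non-specialist stakeholders: "evaluators preferred system A in 62 of 100 blind comparisons" is far more tangible than a small difference in an automated score.</p><p>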
However, in most cases, properly done human A/B tests will yield much
more accurate results than automated metrics.</p><p> <b><span style="color: #2b00fe;">Questions to ask when looking at automated metrics: </span></b></p><p></p><ul style="text-align: left;"><li><b><span style="color: #2b00fe;">What specific data
was used to calculate the score? </span></b></li><li><b><span style="color: #2b00fe;">How similar or different is it from my
data? </span></b></li><li><b><span style="color: #2b00fe;">Can I see the data that was used? </span></b></li><li><b><span style="color: #2b00fe;">How easy or difficult is it to
adapt this MT system to my specific linguistic style and preferences? </span></b></li><li><b><span style="color: #2b00fe;"> How much effort is needed to teach this MT system to use my preferred
style and language? </span></b></li><li><b><span style="color: #2b00fe;">Will I need ML experts to do this or can my
translators drive this? </span></b></li><li><b><span style="color: #2b00fe;">Do small score differences really mean anything? </span></b></li><li><b><span style="color: #2b00fe;"> What happens to these scores if I make changes to the test set? </span></b></li><li><b><span style="color: #2b00fe;">How
quickly will this MT system improve as my translators provide daily
corrections? </span></b></li><li><b><span style="color: #2b00fe;">Do my translators accept these score-based rankings if I
show them the output from 3 different systems? </span></b></li><li><b><span style="color: #2b00fe;">Do my translators like
working with this MT system? </span></b></li><li><b><span style="color: #2b00fe;">Will I be forced to use less qualified translators because the best translators will decline to work with this MT system?</span></b></li></ul><p></p><h2 id="the-implications-of-continuously-improving-mt"><br /></h2><h3 style="text-align: left;">The Implications of Continuously Improving MT</h3><p>Modern
commerce is increasingly conducted through online marketplaces, and providing ever-larger volumes of relevant digital content to customers has become a key requirement for success. </p><p>As the volumes of content grow, the need for more
translation also grows substantially. Gone are the days when it was
enough for a global enterprise to provide limited, relatively static
localization content.</p><p><strong>Delivering superior customer
experience (CX) requires much more content to be made available to
global customers who have the same informational requirements as
customers in the HQ country do. A deep and comprehensive digital
presence that provides a broad range of relevant content to buyers and global customers may be even more important to success in international markets.</strong></p><p>The modern era requires huge
volumes of content to support the increasingly digital buyer and
customer journey. Thus, the need for high-quality, easily adapted
machine translation grows in importance for any enterprise with global
ambitions. </p><p>The success and relentless progress of the ModernMT
technology described here make it an ideal foundation for building a
rapidly growing base of multilingual content without compromising too
much on the quality of translations delivered to delight global
customers. <strong>This is the critical technology needed to allow an enterprise to go multilingual at scale: it makes it possible
to translate billions of words a month at relatively high quality.</strong></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5e63xB0ykqXYAj8ZxzqPTpGCZAKEz8lSFqDmfW9TPLvnKXFBx3LkAceOI3o2LPxCGLEIicH4jUy8y9ThY8kBrEKTU5spGmTiIaIUcrc_dNy7xjNWMxPbOvpfv-5GrCLtSKKnja7GmU384h4BMD85D31dyl7t2f0jqQsIbULrXJ06WJHPJCplmUMEL9A/s986/CX%20Age%203.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="569" data-original-width="986" height="231" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5e63xB0ykqXYAj8ZxzqPTpGCZAKEz8lSFqDmfW9TPLvnKXFBx3LkAceOI3o2LPxCGLEIicH4jUy8y9ThY8kBrEKTU5spGmTiIaIUcrc_dNy7xjNWMxPbOvpfv-5GrCLtSKKnja7GmU384h4BMD85D31dyl7t2f0jqQsIbULrXJ06WJHPJCplmUMEL9A/w400-h231/CX%20Age%203.jpg" width="400" /></a></div><br /><strong><br /></strong><p></p><p>The availability of adaptive, highly responsive MT also enables new kinds of knowledge sharing to take place.</p><p>A case in point: <a href="https://translated.com/breaking-language-barriers-in-healthcare">Unicamullus Medical University in Rome experimented with using ModernMT</a>
to translate their medical journal into several new languages and test
acceptance and usability. They were surprised to find that the MT
quality was much better than expected. The success of the initial tests
was promising enough to encourage the university to expand the experiment and make valuable medical journal content available in 28 languages.</p><p><strong>The
project also allows human corrective feedback to be added to the
publishing cycle when needed or requested. This machine-first and
human-optimized approach is likely to become an increasingly important
approach to large-scale translation needs when intelligent adaptive MT
is the foundation. </strong></p><p>It is quite likely that we will see possibly 1000X or more growth in the volume of content translated in the years to come, along with a growing use of adaptive, responsive MT systems like ModernMT, deeply integrated with active system-improving human feedback loops that can enable and drive this massive multilingual expansion. </p><p><strong>There is increasing
evidence that the best-performing AI systems across many areas in NLP
have a well-engineered and tightly integrated human-in-the-loop to
ensure optimal results in production use scenarios. The Translated SRL
experience with ModernMT is proof of what can happen when this is done
well.</strong></p><p>We should expect to see many more global companies
translating hundreds of millions of words a month in the near future to
serve their global customers. A future that will increasingly be
machine-first and human-optimized.</p><p>The following interview with
Translated CEO, Marco Trombetti, provides additional insight into the
progress that we have witnessed with MT over a decade of careful
observation and measurement. <strong>The interview highlights the many
steps taken to ensure that all the measurements are useful KPIs in a
professional translation services setting, which has been and will continue to be the most demanding performance arena for MT technology</strong>.
Marco also points out that ModernMT and new generative AI like ChatGPT share the same DNA, and that MT research has provided the critical technological building blocks that made these LLMs (Large Language
Models like ChatGPT) possible.</p><p> </p><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen="" class="BLOG_video_class" height="266" src="https://www.youtube.com/embed/4kSPH-748ac" width="320" youtube-src-id="4kSPH-748ac"></iframe></div><br /><p></p></div></div><br /><div><br /></div>Kirti Vasheehttp://www.blogger.com/profile/16795076802721564830noreply@blogger.com0