Wednesday, December 7, 2016

Localization and Language Quality

This is a guest post by David Snider, Globalization Architect at LinkedIn, reprinted here with permission. I thought the article was interesting because it points out that MT quality is now quite adequate for several types of enterprise applications, even though MT might very well be a force that influences and causes the "crapification" (a word I wish I had invented) of overall language quality. While this might seem like horror to some, for a lot of business content that has a very short shelf life, and value only while the information is current, this MT quality is sufficient for most of the people who have an interest in the specific content. While David thinks that language quality will improve, I doubt very much that this MT content will improve beyond what is possible with the raw technology itself. Business content that has value for a short time and is then forgotten simply cannot justify the effort to raise it to the level of "proper" written material.

If you go to the original post there are several comments that are worth reading as well.


People have been complaining recently about the decline of language quality (actually, they’ve been complaining for decades – or make that centuries!)  I have to admit that I sympathize: I’m from a generation that was taught to value good writing, and I still react with horror when I see obvious errors, like using “it’s” instead of “its”, or confusing ‘their’, ‘there’ and ‘they’re’.  (I’m even more horrified when I make mistakes myself, which happens more than I like to admit.)
But for my son’s generation? Not so much. Grammar, spelling, and punctuation aren’t that important to them; what matters is whether the other person understands them, and vice versa. My son is already 25 (wow, time flies!), so there’s another generation coming up behind him that’s even less concerned about ‘good’ writing; in fact, this new generation is so accustomed to seeing bad writing that for the most part they don’t even realize there are errors.  This makes for a vicious circle: people grow up surrounded by bad writing, so they, in turn, write badly, which in turn exacerbates the problem. I’ve heard this referred to as the ‘crapification of language’.


Why is this happening?


Ease of publishing: in the old days, the cost of publishing content - typesetting it, grinding up trees and making paper, printing the content onto the paper, binding it, shipping it to a store and selling it - was immense. For this reason most published content was thoroughly edited and proofread, as there was no second chance. So if you read printed content like books, magazines and newspapers, you were generally exposed to correct grammar, spelling and punctuation. Since most of what people read was correctly written (even if not always well-written), people who read a lot generally learned to write well. But now anyone can create and publish content, with no editing or proofreading. The result is just what you’d expect.

Informal communications: email, texting, twitter – they all favor speed, and when people are in a hurry quality usually suffers.

Machine-generated content: this includes content that’s created by computers – for example, Machine Generated support content created by piecing together user traffic about problems – as well as Machine Translated content. Machine Generated content, and especially MT content, is, as we localization people know, often of very poor quality.

What does this mean for Localization?


Being in the localization business myself, I want to tie this in to the effect on localization. In some ways this ‘crapification’ works against us: garbage in garbage out, after all, and if the source content is badly written then it’s harder for the translators to do a good job, be they humans or machines. But at the same time, this can work for us – especially when it comes to Machine Translation, where there are a couple of things that are making even raw MT more acceptable:

MT engine improvements: MT quality has steadily improved over the past 50 years (yes it’s been around at least that long!) Major improvements, like statistical MT and now neural MT, seem to occur every 10 years or so. Perfect human-quality MT is still ‘only 5 years out’ and will undoubtedly continue to be so for a long time, but quality is steadily improving.

User expectations: The good news for MT is that due to the crapification of language the expectations bar has been coming down, and people are much more willing to accept raw MT, warts and all. Despite the quality problems, more & more people are using web-based MT services like Google Translate, Bing Translator, etc., to read and write content in other languages.  As with texting above, they’re more concerned with content than with form: they’re OK with errors as long as they can understand the content or at least get the gist of it. This seems to be true even for countries that have traditionally had a high bar for language quality, like Japan and France. As shown in the chart below, we’ve already passed the point that raw MT is acceptable for some types of content. (Note that this chart is purely illustrative and is not based on hard data.)

Of course the bar remains high for things like legal documents, marketing content and of course your own personal homepage, but it’s getting lower for many other types of content, especially for things like support content (which many companies have been MTing for years), as well as for blogs and other informal content. In fact, the graph could be redrawn something like this:


Is there any hope for language quality?


As the quality of machine-generated and machine-translated content improves and as editing and proofing tools become better and more ubiquitous, the quality of all content will improve, until we approach the days of professionally edited and proofread books and magazines. As bad writing disappears and people grow accustomed to seeing well-written content, I think even unedited human language quality will start to curve back up again. (I’ve tried to capture this in the graphs above.)
So yes, I believe the crapification of language will slow and eventually reverse itself (hmm, unpleasant plumbing image there)! This doesn’t mean languages won’t continue to evolve, fortunately. That’s one of the things that make them so fascinating – and so challenging to translate.

Some key excerpts from the comments at the original post are listed below:

David Snider: The crapification 'helps' localization teams by allowing them to put more assets in the 'good enough' bucket. The real question for raw MT is: "is my customer better off having to fight their way through a badly-translated MT article and maybe get the help they need, or are they better off not getting the help at all?"
Jorge Russo dos Santos: I disagree that we will see content revert to a golden age of quality. I think that, as we see today, there are different quality levels for content, and that will continue to be the case. If anything, tolerance for poorly written content will probably increase as people consume more and more content, but there will be pockets where people will require premium content and will be willing to pay for it, either in the original language or in the localized language(s), and this will not be only for legal content.

Please go to the original posting to see all the comments at this link.

  David Snider, Globalization Architect at LinkedIn

Tuesday, November 29, 2016

The Critical Importance of Simplicity

This is a post by Luigi Muzii that was initially triggered by this post and this one, but I think it has grown into a broader comment on a key issue related to the successful professional use of MT i.e. the assessment of MT quality and the extent, scope, and management of the post-editing effort. Being able to get a quick and accurate assessment of the specific quality at any given time in a production use scenario is critical, but the assessment process itself cannot be so cumbersome and so complicated a process that the measurement effort becomes a new problem in itself.

While industry leaders and academics continue to develop well-meaning metrics like MQM and DQF that are very difficult to deploy efficiently and cost-effectively, most practitioners are left with BLEU and TER as the only viable and cost-effective measures. However, these easy-to-do metrics have well-known bias issues with RbMT and now with NMT. And given that this estimation issue is the “crux of the biscuit,” as Zappa would say, it is worth ongoing consideration and review, as doing this correctly is where MT success is hidden.
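To show why BLEU and TER count as "easy-to-do," both can be approximated in a few lines of plain Python. The sketch below is a simplified, add-one-smoothed sentence-level BLEU and a crude TER-style word edit rate; the real metrics (and the libraries that implement them, such as NLTK or SacreBLEU) involve considerably more machinery, so treat this purely as an illustration of how cheap these scores are to compute:

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    # Count all n-grams of length n in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(reference, hypothesis, max_n=4):
    # Clipped n-gram precisions, geometric mean, brevity penalty.
    # Add-one smoothing keeps short sentences from scoring zero.
    log_prec = 0.0
    for n in range(1, max_n + 1):
        ref = ngram_counts(reference, n)
        hyp = ngram_counts(hypothesis, n)
        clipped = sum(min(count, ref[gram]) for gram, count in hyp.items())
        total = max(sum(hyp.values()), 1)
        log_prec += math.log((clipped + 1) / (total + 1))
    brevity = min(1.0, math.exp(1 - len(reference) / max(len(hypothesis), 1)))
    return brevity * math.exp(log_prec / max_n)

def ter_like(reference, hypothesis):
    # A crude TER-style score: word-level edit distance / reference length.
    d = [[0] * (len(hypothesis) + 1) for _ in range(len(reference) + 1)]
    for i in range(len(reference) + 1):
        d[i][0] = i
    for j in range(len(hypothesis) + 1):
        d[0][j] = j
    for i in range(1, len(reference) + 1):
        for j in range(1, len(hypothesis) + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(reference)][len(hypothesis)] / max(len(reference), 1)

ref = "the cat sat on the mat".split()
hyp = "the cat sat on a mat".split()
print(round(sentence_bleu(ref, hyp), 3))
print(round(ter_like(ref, hyp), 3))
```

The convenience is also the bias problem: scores like these reward n-gram overlap with one reference, which is exactly why fluent-but-different RbMT and NMT output tends to be under-credited.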

Luigi's insistence on keeping this measurement simple sometimes makes him unpopular with academics and industry "experts", but I believe that this issue is so often at the heart of a successful or unsuccessful MT deployment that it bears repeated exposure and frequent re-examination as we inch our way toward measurement procedures more practical and useful than BLEU, which continues to confound discussions of real progress in improving MT quality.

KISS - "Keep it simple, stupid" - is a design principle noted by the U.S. Navy in 1960, stating that most systems work best if they are kept simple rather than made complicated.

The most profound technologies are those that disappear.
Mark Weiser
The Computer for the Twenty-First Century, Scientific American, 1991, pp. 66–75

The best way to understand the powers and limitations of a technology is to use it.

This can be easily shown for any general-purpose technology, and machine translation can now be considered as such. In fact, the major accomplishment we can credit to Google Translate is that of having popularized widespread translation activity using machine translation, something most celebrated academics, and supposedly influential professional bodies, have not been able to achieve after decades of trying.

The translation quality assessment debacle is emblematic i.e. the translation quality issue is, in many ways, representative of the whole translation community.  It has been debated for centuries, mostly at conferences where insiders — always the same people — talk amongst themselves. And the people attending conferences of one kind do not talk with people attending conferences of another kind.

This ill-conceived approach to quality assessment has claimed victims even among scientists working on automatic evaluation methods. Just recently, the nonsensical notion of a “perfect translation” regained momentum. Everybody even fleetingly involved in translation should know that there is nothing easier to show as flawed than the notion of a “perfect translation”, at least according to current assessment practices, which follow a typical confirmation-bias pattern. There is a difference between an “acceptable” or “usable” translation and any notion of a perfect one.

On the verge of the ultimate disruption, translation orthodoxy still dominates even the technology landscape, eradicating simplicity, the key principle of innovation.
The expected, and yet overwhelming, growth of content has long gone hand in hand with a demand for faster translation in an ever-growing number of language pairs, with machine translation being suggested as the one solution.

The issue remains unsolved, though, of providing buyers with an easy way to know whether the game is worth the candle. Instead, the translation community has so far been unable to provide unknowing buyers with anything but an intricate maze of categories and error typologies, weights and parameters, where even an experienced linguist can have a hard time finding their way around.

The still largely widespread claim that the industry should “educate the client” is the concrete manifestation of the typical information asymmetry affecting the translation sector. By inadvertently keeping the customer in the dark, translation academics, pundits, and providers cuddle the silly illusion of gaining respect and consideration for their roles, while they are simply shooting themselves in the foot.

When KantanMT’s Poulomi Choudhury highlights the importance of the central role that the Multidimensional Quality Metrics (MQM) is supposed to play, in all likelihood she is talking to her fellow linguists. However, typical customers simply want to know whether they have to spend further to refine a translation and — possibly — understand how much. Typical customers who are ignorant of the translation production process are not interested in the kind of KPIs that Poulomi describes, while they could be interested in a totally different set of KPIs, to assess the reliability of a prospective partner.

Possibly, the perverse complexity of unnecessarily intricate metrics for translation quality assessment is meant to hide the uncertainty and resulting ambiguity of theorists and the inability and failure of theory rather than to reassure customers and provide them with usable tools.

In fact, every time you try to question the cumbersome and flawed mechanism behind such metrics, the academic community closes like a clam.

In her post, Poulomi Choudhury suggests setting exact parameters for reviewers. Unfortunately, the inception of the countless fights between translators and reviewers, between translators and reviewers and terminologists, and between translators and reviewers and terminologists and subject experts and in-country reviewers gets lost in the mist of time.

Not only are reviewing and post-editing (PEMT) instructions a rare commodity, the same translation pundits who tirelessly flood the industry with pointless standards and intricate metrics — possibly without having spent a single hour in their life negotiating with customers — have not produced even a guideline skeleton to help practitioners develop such procedural overviews.

As implementing a machine translation platform is no stroll for DIY ramblers, writing PEMT guidelines is not straightforward either: it requires specific know-how and understanding, which recalls the rationale for hiring a consultant when working with MT.

For example, although writing instructions for post-editors might seem a once-only task, different engines, domains, and language pairs require different instructions to meet the needs of different PEMT efforts. Once written, these instructions must then be kept up-to-date as new engines, language pairs, or domains are implemented, so they vary continuously. Also, to help project managers assess the PEMT effort, these instructions should address the quality issue with guidelines, thresholds, and scores for raw translation. Obviously, they should be clear and concise, and this might very well be the hardest part.

As well as being related to the quality of the raw output, the PEMT effort is a measure that any customer should be able to easily understand as a direct indicator of the potential expenditures to achieve business goals. In this respect, it should be properly described and we should go with tools that help the customer financially estimate the amount of work required to achieve the desired quality level from a machine translation output.

Indeed, the PEMT effort depends on diverse factors such as the volume of content to process, the turnaround time, and the quality expectations for the finalized output. Most importantly, it depends on the suitability of source data and input for (machine) translation.

Therefore, however assessable through automatic measurements, PEMT effort can only be loosely estimated and projected. In this respect, KantanMT offers the finest tool combination for accurate estimates.
On the other hand, a downstream measurement of the PEMT effort by comparing the final post-edited translation with the raw machine translation output is reactive (just like the typical translation quality assessment practice) rather than predictive (that is business-oriented).

Also, a downstream compensation model requires an accurate measurement of the actual work performed to infer the percentage on the hourly rate from the edit distance, as no positive correlation exists between edit distance and actual throughput.
Nonetheless, tracking the PEMT effort can be useful if the resulting data is compared with estimates to derive a historical series. After all, that’s how data empowers us.
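As a purely hypothetical sketch of that last point, one way to build such a historical series is to log, per job, the upfront estimate alongside a downstream edit-distance-based measurement of how much the post-editor actually changed (here via Python's standard difflib; the job data and the 0-to-1 "effort" scale are illustrative assumptions, not a real compensation model):

```python
from difflib import SequenceMatcher

def pemt_effort(raw_mt: str, post_edited: str) -> float:
    # Downstream measurement: 1 - similarity ratio between the raw MT
    # output and the final post-edited text (0 = untouched, 1 = rewritten).
    return 1.0 - SequenceMatcher(None, raw_mt, post_edited).ratio()

# Illustrative job log: (upfront estimate, raw MT output, post-edited result).
jobs = [
    (0.10, "the contract are signed by both party",
           "the contract is signed by both parties"),
    (0.40, "click button for to saving file",
           "click the button to save the file"),
]

# Historical series: compare each downstream measurement with its estimate,
# so future estimates can be calibrated against accumulated data.
history = [{"estimated": estimate,
            "actual": round(pemt_effort(raw, edited), 2)}
           for estimate, raw, edited in jobs]

for record in history:
    print(record)
```

Note that, as the surrounding text cautions, an edit-distance figure like this says nothing about actual throughput or time spent; its value is only in the accumulated estimate-versus-actual series.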

Predictability is a major driver in any business, and it should come as no surprise, then, that translation buyers have no interest in dealing with the intricacy of quality metrics that are irredeemably prone to subjectivity, ambiguity, and misinterpretation, and, most importantly, are irrelevant to them. When it comes to business — and real money — gambling is never the first option, but the last resort. (KV: Predictability here would mean a defined effort ($ & time) that would result in a defined outcome (sample of acceptable output)).
On the other hand, more than a quarter of a century has passed since the introduction of CAT tools in the professional field, and many books and papers have been written about them, yet many still feel the urge to explain what they are. Maybe this makes sense for the few customers who are entirely new to translation, even though what they might be interested in knowing is just that providers will use some tool of the trade and spare them some money. And yet quality would remain a major concern, as a recent SDL study showed.

An introduction to CAT tools is at the very least curious when recipients are translation professionals or translation students about to graduate. Even debunking some still popular myths about CAT is just as curious, unless considering the number of preachers thundering from their virtual pulpits against the hazards of these instruments of the devil.

In this apocalyptic scenario, even a significant leap forward could go almost unnoticed. Lilt is an innovative translation tool, with some fabulous features, especially for professional translators. As Kirti Vashee points out, it is a virtual translator assistant. It also presents a few drawbacks, though.
Post-editing is the ferry to the singularity. It could be run interactively, or downstream on an entire corpus of machine translation output.

When fed with properly arranged linguistic data from existing translation memories, Lilt could also be an extraordinary post-editing tool on bilingual files. Unfortunately, the edits made by a single user only affect the dataset associated with that account and the task that is underway. In other words, Lilt is by no means a collaborative translation environment. Yet.

This means that, for Lilt to be effective with typically large PEMT jobs involving teams, accurate PEMT instructions are essential, and, most importantly, post-editors should strictly follow them. This is a serious issue. Computers never break rules, while free will enables humans to deviate from them.

Finally, although cloud computing is now usual in business, Lilt's cloud-only availability can still present a major problem to many translation industry players, whether because of the need for a fast Internet connection or because of the vexed (although repeatedly demystified) question of data protection for IP reasons, even though the computing resources required to process such vast amounts of data would hardly make sense for a typical SME to have.

In conclusion, when you start a business, it is usually to make money, and money is not necessarily bad if you do no evil, pecunia non olet. And money usually comes from buyers, whose prime requirement can be summarized as “Give me something I can understand.”

My ignorance will excuse me.


Luigi Muzii's profile photo

Luigi Muzii has been in the "translation business" since 1982 and has been a business consultant since 2002 in the translation and localization industry through his firm. He focuses on helping customers choose and implement the best-suited technologies and redesign their business processes for the greatest effectiveness of translation and localization related work.

This link provides access to his other blog posts.

Thursday, November 24, 2016

The Thanksgiving Myth

Thanksgiving is fundamentally about giving thanks, though, according to Wikipedia and what we are generally told in the US, it is associated with the Pilgrims, the Puritans, and a harvest festival. For Native Americans, the story of Thanksgiving is not a very happy one.

“Thanksgiving” has become a time of mourning for many Native People. It serves as a period of remembering how a gift of generosity was rewarded by theft of land and seed corn, extermination of many Native people from disease, and near total elimination of many more from forced assimilation. As celebrated in America, “Thanksgiving” is a reminder of 500 years of betrayal. To many Native Americans, the Thanksgiving Myth amounts to the settlers’ justification for the genocide of Indigenous peoples, and this official U.S. celebration marks the survival of early arrivals in a European invasion that culminated in the death of more than 10 million native people. Here is a view of the holiday from one Native American, who provides some background on the source of this darker view and also shares why she has chosen to approach it in another way, with a spirit of forgiveness.

Thanksgiving is also associated with hard-core shopping in the U.S., with something called Black Friday. However, in the modern era, where few are aware of the damage done to native cultures by the original settlers and broken treaties, it is essentially about feasting, football, shopping, and expressing gratitude. This is what most of my personal experience has been: football, shopping, and turkey (apparently 45 million turkeys will die).

While I have never resonated with the commercialism of the event, I have always felt that the celebration of gratitude is wonderful. Gratitude is an emotion expressing appreciation for what one has — as opposed to, for example, a consumer-driven emphasis on what one wants. Gratitude is getting a great deal of attention as a facet of positive psychology: Studies show that we can deliberately cultivate gratitude, and can increase our well-being and happiness by doing so. In addition, gratefulness—and especially expression of it to others -- is associated with increased energy, optimism, and empathy.

What Is Gratitude?

Robert Emmons, perhaps the world’s leading scientific expert on gratitude, argues that gratitude has two key components, which he describes in a Greater Good essay, “Why Gratitude Is Good.”

“First,” he writes, “it’s an affirmation of goodness. We affirm that there are good things in the world, gifts and benefits we’ve received.”

In the second part of gratitude, he explains, “we recognize that the sources of this goodness are outside of ourselves. … We acknowledge that other people -- or even higher powers, if you’re of a spiritual mindset—gave us many gifts, big and small, to help us achieve the goodness in our lives.”

Emmons and other researchers see the social dimension as being especially important to gratitude. “I see it as a relationship-strengthening emotion,“ writes Emmons, “because it requires us to see how we’ve been supported and affirmed by other people.”
Because gratitude encourages us not only to appreciate gifts but to repay them (or pay them forward), the sociologist Georg Simmel called it “the moral memory of mankind.”

As an immigrant to America, I have always felt that the Thanksgiving story I was told, about Pilgrims and "Indians" holding hands and smiling, was at least a little bit shaky based on my very limited knowledge of American history. It just never rang true to my mind. And while I feel that any day when a family and a community gather to give thanks is special and worthy of celebration, I think we should also acknowledge that the history we are told is suspect, as history is often written by the victors and not by men of even and truthful temperance. Part of giving thanks, it seems to me, is also to acknowledge the sacrifices of the ancestors who may have made one’s plenitude possible. This would include the Native Americans if you live in North America, as they have always regarded themselves as caretakers of the land rather than owners of it. The following statement is something that you will hear from many Native Americans about their ethos.

As America’s Host People, Native Americans are the keepers of the land, that is our sacred duty. Our responsibilities include bringing the land, the people, and the rest of creation back into harmony.
On this particular Thanksgiving, near the Standing Rock Sioux Reservation, we have yet another example of Native Americans standing up for what they believe is a sacred trust: to protect against the desecration of land they consider holy, and against potential damage to the largest drinking water supply in the region. This is yet another example of the betrayal of a treaty with the US government, as many believe that this should have been prevented by treaties already in place. From one perspective the issues are complex, as described here and in looking at the oil price economics driving the project. The world has been electrified by protests against the Dakota Access pipeline. Is this a new civil rights movement where environmental and human rights meet?

For the Elders leading the protest, there are three clear reasons to try to stop this:
  1. Prevent desecration of sacred burial grounds and what is considered “holy” land,
  2. Protect a major supply of natural drinking water from potential oil spill accidents,
  3. They have a treaty in place with the US government that was supposed to protect against commercial exploitation of protected land.

Those who think that the oil spill potential is overstated should take a look at how frequently these accidents do happen, and what happens when they do. Galveston Bay, a hub for oil traffic, for example, averages close to 300 oil spills of various sizes each year. As you may have guessed, Galveston is not known for its wonderful beach experience. The Exxon Valdez spill still has a negative impact 25 years later, and the environment and wildlife have yet to fully recover from the accident. The impact of the Deepwater Horizon spill, examined 5 years later, shows that while nature does have a recovery process, some things can take decades or longer even to understand the damage, let alone recover from it.

Here is a video of a 90,000 gallon spill in May 2016 that did not even make the daily news since these kinds of spills are so common.

So on this Thanksgiving, I also give thanks to those who oppose this pipeline and make a valiant attempt to stop the potential destruction of one of the largest natural drinking water supplies in the US. The Native American ethos also has a unique view of death in such a battle: when one battles and fights for the community's well-being and for the land, it is considered a noble death, since it is a sacrifice for the well-being of others. Robbie Robertson (of The Band) wonderfully captures the emotion that these “water protectors” must feel at Standing Rock in this live rendition of “It is a good day to die”, a quote attributed to Crazy Horse. This translation is an English bastardization of a common Sioux battle-cry, "Nake nula wauŋ welo!" The phrase really means, "I am ready for whatever comes." It was meant to show the warriors were not afraid of the battle or of dying in it. So... Crazy Horse probably shouted, "Hokahey! Nake nula wauŋ welo!"

I wish you all a warm and loving Thanksgiving as you express your gratitude for your plenitude.