Friday, February 17, 2023

The Problem With LangOps

This is a letter I wrote to the editor of Multilingual after reading several articles focused on LangOps in the December 2022 issue.  This discussion started on LinkedIn and Cameron invited the active contributors to formalize our comments and write a letter to the editor with alternate viewpoints.

TL;DR: LangOps is a term that either refers to the vague use of "AI" in and around localization, or is nothing more than a way to describe the centralization of enterprise translation production processes.

I carefully read all of the following before writing my letter to ensure that I had not somehow missed the boat.  The basic question I am still left with after looking carefully through the LangOps material is "Where's the real substance of this concept/idea/word?"


Here is a slightly ornamented version of the text of my letter to the editor which was published in the Multilingual February 2023 issue. I include a version with emphasis (mine) so that others may also comment on this, and perhaps correct my misperception. 

Special Thanks to Marjolein Groot Nibbelink for taking the trouble to convert the letter to a really well-read audio track that can be played back faster.

Dear Multilingual Editor (Cameron),

After reading the various articles on LangOps in the Multilingual December 2022 issue, I had hoped to come away with a better sense of what LangOps is and why it matters. But I cannot say that this happened for me, and I am not sure that I (or any other reader) have any more clarity on what LangOps is, beyond it being a vendor buzzword that remains fuzzy and amorphous because there is not enough supporting evidence to document it properly. While there was much discussion about why a new definition that goes further than localization is needed, there was little that defined LangOps in more concrete terms. I suspect the fuzziness and lack of clarity I felt are shared by many other readers as well.

One is left asking, “Where’s the beef?” about this thing they call LangOps.

I reviewed the articles in the magazine on the LangOps subject again before writing this letter, to better identify the defining elements, and to make sure I was fair and had not missed some obvious facts. My intention with my comments here is to hopefully provide a coherent critique of the subject matter, which started in discussion with comments made by several readers about LangOps on LinkedIn. 

From my reading, the articles in Multilingual were clearer on the Why (new definitions are needed), but less clear on the What [it is] or the How.

It appears to me that the LangOps concept is another attempt by some stakeholders in the industry to raise the profile of the translation business, to make it more visible at the executive level, or to increase the perceived value of the translation production process by imbuing it with more complexity and mysterious undefined AI elements. However, in the absence of specifics, it becomes just another empty buzzword that creates more confusion than clarity for most of us, especially so for new buyers. 

It is difficult to see how any sponsor could take the descriptions provided in this issue of Multilingual to a senior executive to ask for funding, or even to explain what it is.

It is clear that as the translation of some product and marketing content became recognized as a valuable international business-driving activity, the need to scale, organize and systematize it became more urgent and led to what most call localization today. 

Thus, localization, I think, refers to the many processes, activities, and tools used to make language translation more automated, structured, and systematic. Most often this work is related to relatively static content that is mandatory in international markets, but recently it has expanded to include more customer service and support content. It also sometimes includes cultural adaptations made in addition to the basic translation. 

TMS (translation management system) platforms have been central to the localization worldview over the past decade, as they facilitate the development and management of different workflows, monitor translation work, and ease project management of distributed translation-related tasks (TEP). It is also true that MT has been only minimally used in hard-core localization settings, as MT systems were not deemed accurate, flexible, or simple enough to configure for this work.

By carefully reviewing the published Multilingual articles again, I gathered that the following elements are being used to define what LangOps is:

  • AI-driven capabilities are applied to certain localization processes, though which processes is not defined,
  • Centralization of all translation production activities across the enterprise,
  • Introduction of “more” technology into existing localization workflows, but what this is specifically, is unclear,
  • LangOps is said to be made up of cross-functional and inter-disciplinary teams, but who and why is not clear,
  • Possibly adding other value-added language tasks (sentiment analysis, summarization, chatbots) in addition to the translation. [This at least is clear].

In my view, the only element here that is clear in the many descriptions [of LangOps] is the centralization of translation production. 

The other elements used to describe it are fuzzy and hard to pin down; they could mean anything or nothing, since vagueness resists definition. LangOps is another term, possibly even worse than localization (which already confuses many regular people and new customers), because it creates a communication problem. 

How do you answer the question, “What do you do?” in an elevator, a cab, at a party, on an airplane, or with family and friends? As you can see, both Localization and LangOps present opaque, obfuscating images to the regular human mind.

Would it not be so much easier to just say “Language Translation to Drive International Business”? And then maybe add, “We use technology, tools, and people to do it at a large scale efficiently.”

I would like to suggest a different way to view the continuing evolution of business translation. It is my feeling that the LangOps movement is linking the growing number of MT use cases, which have more dynamic IT connectivity, and cross-organization collaboration implications, with a need for a new definition.

We have now reached that perfect storm moment where most B2C and B2B businesses recognize that they need a substantial digital presence, that it is important to provide large volumes of relevant content to serve and please their customers, and that they need to listen to customers in social media, understand trends faster, and communicate across the globe much faster. 

This means that successful businesses have to share, communicate, listen, and produce translations at a much larger scale than they have in the past. The core competency of traditional localization work is less likely to be useful with these new challenges. These new market requirements demand a shift away from TM and TMS-managed work to a more MT-centric view of the world. The volume of translation increases from thousands of translated words a month to millions or even billions of words a month to drive successful international business outcomes in the modern era. 

As Generative AI improves and begins to be deployed in production customer settings, we will only see the translation volumes grow another 10X or 100X. Thus, deep MT competence increasingly becomes a core requirement to be in the enterprise translation business.

MT has been improving dramatically over the last five years in particular, and it is not ridiculous to say that it is getting close to human output in some special cases when systems are properly designed and deployed by competent experts. 

Competence means that experts can quickly adapt and modify MT systems to produce useful output in the 20-30 different use cases where an enterprise faces an avalanche of text and/or audiovisual content. The new use cases go beyond the traditional focus of localization in terms of content and process. We now need to translate much more dynamic content related to customer service and support, translate more active communications (chat, email, forums), share more structured and unstructured content, pay more attention to social media feedback, and operate in a more real-time, dynamic mode in general.

The successful modern global enterprise listens, understands, communicates, and actively shares content across the globe to improve customer experience. Thus, I think it is fair to say that we (the translation business) are moving to a more MT-centric world from a previously TMS-centric world, and a critical skill needed today is deep competence with MT. 

Useful MT output means it helps grow and drive international business, even though it may not be linguistically “perfect”. The requirement for MT competence requires moving far beyond choosing an MT system with the best BLEU or COMET score. 

MT Competence means you can find egregious errors (MT & AI make these errors all the time) and instantly correct these problems to minimize damage. 

MT Competence means the skill and agility to respond to changing business needs and new content types and the ability to rapidly modify MT systems as needed. 

Competence in managing rapid, responsive, deep adaptation of MT systems will be a key requirement to actively participate as an enterprise partner (not vendor) on a global stage very shortly. 

When language translation is mission-critical and pervasive, the service provider will likely evolve from being a vendor to being a partner. It can also often mean that the scope of localization teams is greatly expanded and becomes more mission-critical.

While I can see a business reality where there is a Machine-First & Human-Optimized translation approach to content across the global enterprise, which requires responsive, continuously improving MT, it also means moving beyond traditional MTPE, where clean-up crews come to reluctantly fix badly formed MT output produced by inexperienced and incompetent MT practitioners. 

However, the lights start to dim for me when I think of "LangOps" being part of this reality in any form whatsoever.

This continuing evolution of business translation also probably means a much more limited role for the TMS, or using it only for some localization (software and key documentation) workflows. The more common case, as translation volumes grow, is to connect all CX (customer experience)-related text directly into highly tuned, carefully adapted NMT systems running on high-performance, low-latency IT infrastructure that is directly customer-facing or customer-accessible. 

Recent data I have seen on MT use across a broad swathe of enterprise users shows that as much as 95% of MT use completely bypasses the TMS. Properly tuned, expert-built MT engines do not need the unnecessary overhead of a TMS. The enterprise objective is to enable translation at scale for everything that might require an instant, mostly (but not necessarily perfectly) accurate translation, as long as it furthers and enhances every global business initiative and communication. 

In many CX-related use cases, speed and scale matter more, and have a more positive impact on international business success, than perfect linguistic quality does. Enterprise executives understand this even if we as an industry do not.

I am not aware of a single LangOps configuration or group on this earth, nor do I know of any enterprise that claims to have such an initiative. But I can point to several massive-scale MT-driven translation engines around the world, e.g. Airbnb, Amazon, Alibaba, and eBay, where billions of words are translated regularly to drive international business and customer delight and serve a growing international customer base. I am confident we will see this pool of enterprise users grow beyond the eCommerce markets.

Thus, I see little value in promoting the concept of LangOps as what actually seems to be happening is that more expert-tuned enterprise MT is being used and we see the share of MT used to total translation volumes continue to grow. 

As this kind of responsive, highly adaptive MT capability becomes more pervasive across an enterprise, it also becomes a critical requirement for international business success. The activities related to organizing and managing significantly more dynamic content and translation volumes should not be mistaken to be something as vague as LangOps, as no organization I am aware of has the building blocks or template to create such a vaguely defined function. I think that it is more likely that Localization teams will evolve and the scope of their activities will increase, perhaps as dramatically as we have seen at Airbnb.

Airbnb just booked its first annual profit in its near-15-year history, a whopping $1.9bn in 2022. It now appears to be in rarefied air, with its place as the de facto online marketplace for homestays and experiences, giving it a network effect that’s hard to compete with.

I did find all the articles on LangOps useful in furthering my understanding, especially those by Riteba McCallum and Miguel Cerna, and my comments should not be mistaken for a wholesale dismissal of the viewpoints presented. On the contrary, I think we have much more agreement than disagreement on many of the core issues discussed. Though I do admit that I find the general concept of LangOps, as it has been painted, to be a likely hindrance to our mutual future rather than a beneficial concept to drive our success with globalization and international business initiatives with our common customers.

Respectfully Yours,

Kirti Vashee

Here is the LinkedIn article where the discussion began:

P.S.  Maybe all I am saying is that LangOps just needs more cowbell 😄😄😄 to get the sound and the concept right?

Wednesday, February 1, 2023

The March Towards AI Singularity and Why It Matters


Why progress in MT is a good proxy of progress with the technological singularity

For as long as machine translation technology has been around (now over 70 years), there have been regular claims by its developers of reaching “human equivalence”. However, until today, no such claim has satisfied practitioners in the professional translation industry, who are arguably the most knowledgeable critics around. For these users, the actual experience with MT has not matched the many extravagant claims made by MT developers over the years.

This changes with the long-term study and translation production data presented by Translated SRL at the AMTA conference, which provides the missing elements: a huge industrial-scale evidentiary sample, validated by a large group of professional translators across multiple languages, based on professional translation work done in real-world production scenarios.

The historical difficulty in providing acceptable proof does not mean that progress is not being made, but it is helpful to place these claims in proper context and perspective to better understand what the implications are for the professional and enterprise use of MT technology. 

The history of MT (machine translation) is unfortunately filled with empty promises

MT (human language translation) is considered among the most difficult theoretical problems in AI, so we should not be surprised that it is a challenge that has not yielded completely to the continuing research efforts of MT technology experts over the decades. Many experts consider MT to be AI-complete, because it requires a deep contextual understanding of the data and the ability to make accurate predictions based on that data. This makes it a good proxy for AGI (artificial general intelligence: the ability of a machine process/agent to understand or learn any intellectual task that a human being can), and thus progress with MT can also mean that we are that much closer to reaching AGI.

The Historical Lack of Compelling Evidence

MT researchers are forced to draw conclusions about research progress based on relatively small samples of non-representative data (from the professional translation industry's perspective) that are evaluated by low-cost human "translators". The Google Translate claims in 2016 are an example of a major technology developer making "human-equivalence" claims based on limited data that was possible within the scope of the technology development process typical at the time.

Namely, here are 200 sentences that amateur translators say are as good as human translation, thus we claim we have reached human equivalence with our MT.

Thus, while Google did indeed make substantial progress with its MT technology, the evidence it provided to make the extravagant claim lacked professional validation, was limited only to a small set of news domain sentence samples, and was not representative of the diverse and broad scope of typical professional translation work which tends to be much more demanding and varied.

The problem from the perspective of the professional industry with these historical as-good-as-humans claims can be summarized as follows:

  1. Very small samples of non-representative data: Human equivalence is claimed on the basis of evaluations of a few news domain segments where non-professional translators were unable to discern meaningful differences between MT and human translations. The samples used to draw these conclusions were typically based on no more than a few hundred sentences.
  2. Automated quality metrics like BLEU were used to make performance claims: The small samples of human evaluation were generally supported by larger sets (a thousand or so sentences) where quality was assessed by an automatic reference-based score. There are many problems with these automated quality scores as described here; we now know that they miss much of the nuance and variation that is typical in human language, resulting in erroneous conclusions, and at best they are very rough approximations of competent human assessment. COMET and other metrics are slightly better approximations but still fall short of competent human assessment, which remains the "gold standard" for judging translation output quality. The assessments of barely bilingual "translators" found in Mechanical Turk settings, often used by MT researchers, are likely to differ considerably from those of expert professional translators whose reputations are defined by their work product. Competent human assessments are often at odds with the segments suggested as the best-scoring ones by metrics like COMET or hLepor.
  3. Overreaching extrapolations: The limited evidence from these experiments was marketed as “human-equivalence” by Google and others, and invariably disappointed professional translators and enterprise users, who quickly witnessed the poor performance of these systems when they strayed away from news domain content. Though these claims were not deliberately deceptive, they were made to document progress from a perspective that was much narrower than the scope and coverage typical of professional translation work. There has never before been a claim of improved MT quality performance based on the huge scale (across 2 billion segments) presented by Translated SRL.

Translated SRL Finally Provides Compelling Evidence

The measurement used to describe ongoing progress with MT is Time To Edit (TTE). This is a measurement made during routine production translation work and represents the time required by the world’s highest-performing professional translators to check and correct MT-suggested translations.

Translated makes extensive use of MT in its production translation work and has found that TTE is a much better proxy for MT quality than measures like Edit Distance, COMET, or BLEU. Rather than relying on these automated score-based metrics, it is more accurate and reliable to measure the actual cognitive effort expended by professional translators during production work.
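As a rough illustration of the idea behind TTE, it can be thought of as editing time normalized by segment length, averaged over a large sample. The sketch below is mine, not Translated's actual implementation, and the field names and values are hypothetical:

```python
# Illustrative sketch of a Time To Edit (TTE) style calculation: seconds of
# professional editing time per word of MT output, averaged over many
# segments. Field names and values are hypothetical, not Translated's schema.

def time_to_edit(segments):
    """Average editing seconds per source word across a sample of segments.

    Each segment records 'edit_seconds' (time the translator spent checking
    and correcting the MT suggestion) and 'word_count'.
    """
    total_seconds = sum(s["edit_seconds"] for s in segments)
    total_words = sum(s["word_count"] for s in segments)
    return total_seconds / total_words

sample = [
    {"edit_seconds": 12.0, "word_count": 10},  # light correction
    {"edit_seconds": 4.0,  "word_count": 8},   # accepted almost as-is
    {"edit_seconds": 30.0, "word_count": 12},  # heavier rewrite
]

print(round(time_to_edit(sample), 2))  # → 1.53 seconds per word
```

The point of averaging over a very large sample is that individual variation (content type, translator speed, deadline pressure) washes out, leaving a stable trend signal.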

Consistent scoring and quality measurement are challenging in a production setting because they are greatly influenced by varying content types, translator competence, and changing turnaround-time expectations. A decade of careful monitoring of the production use of MT has yielded the data shown above. Translators were not coerced to use MT; it was used only when it was useful.

The data are compelling because of the following reasons:

  • The sheer scale of the measurements across actual production work is described in the link above. The chart focuses on measurements across 2 billion edits where long-term performance data was available.  
  • The chart represents what has been observed over seven years, across multiple languages, measuring the experience of professional translators making about 2 billion segment edits under real-life production deadlines and delivery expectations.
  • Over 130,000 carefully selected professional translators contributed to the summary measurements shown on the chart.
  • The segments used in the measurements are all no TM match segments as this represents the primary challenge in the professional use of MT.
  • The broader ModernMT experience also shows that highly optimized MT systems for large enterprise clients are already outperforming the sample shown above which represents the most difficult use case of no TM match.
  • A very definite linear trend shows that, if the rate of progress continues as shown, it MAY be possible to produce MT segments as good as those produced by professional translators within this decade. This is the point of singularity, at which the time top professionals spend checking a translation produced by MT is no different from the time spent checking a translation produced by their professional colleagues, which may or may not require editing.
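The linear-trend extrapolation in the last bullet can be sketched with a simple least-squares fit. The yearly TTE values and the parity threshold below are invented for illustration; they are not Translated's actual measurements:

```python
# Illustrative only: fit a straight line to yearly average TTE values and
# project when it would cross a "human parity" threshold. All numbers are
# made up for demonstration purposes.

def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept for y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

years = [2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022]
tte   = [4.5, 4.1, 3.8, 3.4, 3.1, 2.8, 2.5, 2.2]  # hypothetical sec/word

slope, intercept = linear_fit(years, tte)
parity_tte = 1.0  # hypothetical time to check a human colleague's work
crossing_year = (parity_tte - intercept) / slope
print(f"Projected parity year: {crossing_year:.0f}")
```

The fragility of this kind of projection is exactly why the scale and duration of the underlying measurement matter: a short or noisy series would make the crossing-year estimate swing wildly.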

It is important to understand that the productivity progress shown here is highly dependent on the superior architecture of the underlying ModernMT technology, which learns dynamically and continuously, improving on a daily basis from the ongoing corrective feedback of expert translators. ModernMT output has thus continued to improve steadily over time. It is also highly dependent on the operational efficiency of the overall translation production infrastructure at Translated SRL.

The virtuous data improvement cycle that is created by engaged expert translators providing regular corrective feedback provides the right kind of data to drive ongoing improvements in MT output quality. This improvement rate is not easily replicated by public MT engines and periodic bulk customization processes that are typical in the industry.

The corrective input is professional peer revision during the translation process - and this expert human input "has control," and guides the ongoing improvement of the MT, not vice versa. While overall data, computing, and algorithms are critical technological foundations to ongoing success, expert feedback has a substantial impact on the performance improvements seen in MT output quality.

The final quality of translations delivered to customers is measured by a metric called EPT (errors per thousand words), which in most cases is 5, or even as low as 2 when two rounds of human review are used. The EPT rating provides a customer-validated, objective measure of quality that is respected in the industry, even for purely human translation products when no MT is used.
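The arithmetic behind EPT is straightforward; the error and word counts in this small sketch are invented for illustration:

```python
# Illustrative sketch of an Errors Per Thousand words (EPT) calculation.
# The error and word counts below are invented for demonstration.

def errors_per_thousand(error_count, word_count):
    """Reviewer-identified errors per thousand translated words."""
    return error_count * 1000 / word_count

# e.g. 24 errors found by reviewers in a 12,000-word delivery
print(errors_per_thousand(24, 12_000))  # → 2.0
```

An EPT of 2 on a 12,000-word job therefore corresponds to roughly two dozen reviewer-flagged errors across the whole delivery.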

There is a strong, symbiotic, and mutually beneficial relationship between the MT and the engaged expert translators who work with the technology. The process is quite different from typical clean-up-the-mess PEMT projects with customized static models where the feedback loop is virtually non-existent, and where the MT systems barely improve even with large volumes of post-edited data.

Responsive, Continuously Improving MT Drives Engagement from Expert Translators Who See Immediate Benefit During the Work Process

The Problem with Industry Standard Automated Metrics for MT Quality Assessment

It has become fashionable in the last few years to use automated MT quality scores like BLEU, Edit Distance, hLepor, and COMET as a basis to select the “best” MT systems for production work, and some companies use different MT systems for different languages in an attempt to maximize MT contributions to production translation needs. These scores are all useful for MT system developers to tune and improve their systems; however, globalization managers may overlook some rather obvious shortcomings of this approach for MT selection purposes.

Here is a summary listing of the shortcomings of this best-MT-based-on-scores approach:

  1. These scores are typically based on measurements of static systems. The score is ONLY meaningful on a certain day with a certain test set and actual MT performance may be quite different from what the static score might suggest. The score is a measurement of a historical point and is generally not a reliable predictor of future performance.
  2. Most enterprises need to adapt the system to their specific content/domain and thus the ability of a system to rapidly, easily, and efficiently adapt to enterprise content is usually much more important than any score on a given day.
  3. These scores do not and can not factor in the daily performance improvements that would be typical of an adaptive, dynamically, and continuously improving system like ModernMT, which would most likely score higher every day it was actively used and provided with corrective feedback. Thus, they are of very limited value with such a system.
  4. These scores can vary significantly with the test set that is used to generate the score and scores can vary significantly as test sets are changed. The cost of generating robust and relevant test sets often compromises the testing process as the test process can be gamed.
  5. Most of these scores are only based on small test sets with only 500 or so sentences and the actual experience in production use on customer data could vary dramatically from what a score based on a tiny sample might suggest.
  6. Averaged over many millions of segments, TTE gives an accurate quality estimate with low variance and is a more reliable indicator of quality issues in production MT use. Machine translation researchers have had to rely on automated score-based quality estimates such as the edit distance, or reference-based quality scores like COMET and BLEU to get quick and dirty MT quality estimates because they have not yet had the opportunity to work with such large (millions of sentences) quantities of data collected and monitored in production settings.
  7. As enterprise use of MT evolves the needs and the expected capabilities of the system will also change and thus static scores become less and less relevant to the demands of changing needs.
  8. Also, such a score does not incorporate the importance of overall business requirements in an enterprise use scenario where other workflow-related, integration, and process-related factors may actually be much more important than small differences in scores.
  9. Leading-edge research presented at EMNLP 2022 and similar conferences provides evidence that COMET-optimized system rankings frequently do not match what “gold-standard” human assessments would suggest as optimal. Properly done human assessments are almost always more reliable across NLP. The TTE measurements described above inherently allow us to capture human cognition impact and quality assessment at a massive scale in a way that no score or QE metric can today.
  10. Different MT systems respond to adaptation and customization efforts in different ways. The benefit or lack thereof from these efforts can vary greatly from system to system especially when a system is designed to primarily be a generic system. Adaptive MT systems like ModernMT are designed from the outset to be tuned easily and quickly with small amounts of data to fit a wide range of unique enterprise use cases. ModernMT is almost never used without some adaptation effort, unlike generic public MT systems like Google MT which are primarily used in a default generic mode. 
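Two of the points above, score variance across test sets (point 4) and the limits of edit-distance-style metrics (point 6), can be seen with a toy example. The sketch below computes a crude normalized token-level edit distance for the same MT output against two different reference sets; it is a simplified illustration of the mechanics, not how any production metric is implemented:

```python
# Toy illustration: a normalized token-level edit-distance "score" for the
# same MT output against two different reference test sets, showing how the
# score moves with the test set. Simplified; not a production metric.

def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, 1):
        curr = [i]
        for j, tok_b in enumerate(b, 1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def normalized_score(hyp, ref):
    """1.0 = identical to the reference, 0.0 = completely different."""
    h, r = hyp.split(), ref.split()
    return 1.0 - edit_distance(h, r) / max(len(h), len(r))

mt_output = "the quick brown fox jumps over the lazy dog"
ref_a = "the quick brown fox jumps over the lazy dog"  # test set A
ref_b = "a fast brown fox leaps over the sleepy dog"   # test set B

print(normalized_score(mt_output, ref_a))              # → 1.0
print(round(normalized_score(mt_output, ref_b), 2))
```

The same output scores perfectly against one reference and poorly against an equally valid paraphrase, which is exactly the nuance-and-variation problem the automated metrics have.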

A “single point quality score” based on publicly sourced sentences is simply not representative of the dynamically changing, customized, and modified potential of an active and evolving enterprise adaptive MT system that is designed to be continuously adapted to unique customer use case requirements.

When it is necessary to compare two MT systems in a buyer selection and evaluation process, double-blind A/B human evaluations on actual client content would probably produce the most accurate and useful results, which are also better understood by executive and purchasing management.
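A minimal sketch of how such a double-blind A/B evaluation could be organized is shown below. The structure and names are illustrative assumptions, not a description of any particular tool: each segment pair is shown in a random left/right order so that neither the evaluator nor the session organizer knows which system produced which output, and votes are unblinded only at the end:

```python
import random

# Minimal sketch of a double-blind A/B comparison between two MT systems.
# Outputs are shown in random left/right order per segment; the hidden key
# recording which system is which is consulted only when tallying votes.
# Structure and names are illustrative.

def prepare_blind_pairs(outputs_x, outputs_y, rng):
    """Return (left, right, key) triples; key records the hidden order."""
    pairs = []
    for x, y in zip(outputs_x, outputs_y):
        if rng.random() < 0.5:
            pairs.append((x, y, ("X", "Y")))
        else:
            pairs.append((y, x, ("Y", "X")))
    return pairs

def tally(pairs, votes):
    """votes[i] is 'left' or 'right'; unblind and count wins per system."""
    counts = {"X": 0, "Y": 0}
    for (_, _, key), vote in zip(pairs, votes):
        winner = key[0] if vote == "left" else key[1]
        counts[winner] += 1
    return counts

rng = random.Random(42)  # fixed seed so the sketch is reproducible
system_x = ["out_x_1", "out_x_2", "out_x_3"]
system_y = ["out_y_1", "out_y_2", "out_y_3"]
pairs = prepare_blind_pairs(system_x, system_y, rng)
votes = ["left", "left", "right"]  # evaluator preferences, blind to systems
print(tally(pairs, votes))
```

In a real evaluation the segments would be actual client content, multiple evaluators would vote independently, and the per-system win counts would be checked for statistical significance before declaring a winner.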

Additionally, MT systems are not static: the models are constantly being improved and evolving, and what was true yesterday in quality comparisons may not be true tomorrow. For these reasons, understanding how the data, algorithms, and human processes around the technology interact is usually more important than any static score-based comparison snapshot.  A more detailed discussion of the overall MT system comparison issues is provided here.

Conducting accurate and consistent comparative testing of MT systems is difficult with either automated metrics or human assessments: both are easy to do badly and difficult to do well. We are aware that the industry struggles in its communications about translation quality with buyers. However, in most cases, properly done human A/B tests will yield much more accurate results than automated metrics.

Questions to ask when looking at automated metrics: 

  • What specific data was used to calculate the score? 
  • How similar or different is it from my data? 
  • Can I see the data that was used? 
  • How easy or difficult is it to adapt this MT system to my specific linguistic style and preferences? 
  • How much effort is needed to teach this MT system to use my preferred style and language? 
  • Will I need ML experts to do this or can my translators drive this? 
  • Do small score differences really mean anything? 
  • What happens to these scores if I make changes to the test set? 
  • How quickly will this MT system improve as my translators provide daily corrections? 
  • Do my translators accept these score-based rankings if I show them the output from 3 different systems? 
  • Do my translators like working with this MT system? 
  • Will I be forced to use less qualified translators if I use this MT system as the best translators will prefer to decline?

The Implications of Continuously Improving MT

Modern commerce increasingly happens through online marketplaces, and providing ever-larger volumes of relevant content digitally to customers has become an important requirement for success.

As the volumes of content grow, the need for more translation also grows substantially. Gone are the days when it was enough for a global enterprise to provide limited, relatively static localization content.

Delivering superior customer experience (CX) requires much more content to be made available to global customers, who have the same informational requirements as customers in the HQ country. A deep and comprehensive digital presence that provides a broad range of relevant content to buyers and global customers may be even more important to success in international markets.

The modern era requires huge volumes of content to support the increasingly digital buyer and customer journey. Thus, the need for high-quality, easily adapted machine translation grows in importance for any enterprise with global ambitions.

The success and relentless progress of the ModernMT technology described here make it an ideal foundation for building a rapidly growing base of multilingual content without compromising too much on the quality of translations delivered to delight global customers. This is critical technology needed to allow an enterprise to go multilingual at scale.  This means that it is possible to translate billions of words a month at relatively high quality.

The availability of adaptive, highly responsive MT also enables new kinds of knowledge sharing to take place.

A case in point: UniCamillus Medical University in Rome experimented with using ModernMT to translate its medical journal into several new languages to test acceptance and usability. They were surprised to find that the MT quality was much better than expected. The success of the initial tests was promising enough to encourage them to expand the experiment and make valuable medical journal content available in 28 languages.

The project also allows human corrective feedback to be added to the publishing cycle when needed or requested. This machine-first and human-optimized approach is likely to become an increasingly important approach to large-scale translation needs when intelligent adaptive MT is the foundation.  

It is quite likely that we will see 1000X or more growth in the content volume that is translated in the years to come, and also a growing use of adaptive and responsive MT systems like ModernMT, deeply integrated with active, system-improving human feedback loops that can enable and drive this massive multilingual expansion.

There is increasing evidence that the best-performing AI systems across many areas in NLP have a well-engineered and tightly integrated human-in-the-loop to ensure optimal results in production use scenarios. The Translated SRL experience with ModernMT is proof of what can happen when this is done well.

We should expect to see many more global companies translating hundreds of millions of words a month in the near future to serve their global customers.  A future that will increasingly be machine-first and human-optimized.

The following interview with Translated CEO, Marco Trombetti, provides additional insight into the progress that we have witnessed with MT over a decade of careful observation and measurement. The interview highlights the many steps taken to ensure that all the measurements are useful KPIs in a professional translation services setting which has been and will continue to be the most demanding arena of performance for MT technology. Marco also points out that ModernMT and new Generative AI like ChatGPT are made of the same DNA, and that MT research has provided the critical technological building blocks used to make these LLMs (Large Language Models like ChatGPT) possible.