Friday, November 26, 2021

The Carbon Footprint of Machine Learning

AI and machine learning (ML) news is everywhere we turn today, and the technology is impacting virtually every industry, from healthcare, finance, retail, agriculture, defense, and automobiles to social media. Many of these solutions are developed using a form of ML called neural networks, which are called “deep” when multiple layers of abstraction are involved. Deep Learning (DL) could be the most important software breakthrough of our time. Until recently, humans programmed all software. Deep learning is a form of artificial intelligence (AI) that uses data to write software, and it typically “learns” from large volumes of reference training data.

Andrew Ng, former Chief Scientist at Baidu and co-founder of Coursera, has called AI the new electricity. Much like the internet, deep learning will have broad and deep ramifications. Like the internet, deep learning is relevant for every industry, not just for the computing industry.

The internet made it possible to search for information, communicate via social media, and shop online. Deep learning enables computers to understand photos, translate language, diagnose diseases, forecast crops, and drive cars. The internet has been disruptive to media, advertising, retail, and enterprise software. Deep learning could change the manufacturing, automotive, health care, and finance industries dramatically.

By “automating” the creation of software, deep learning could turbocharge every industry, and today we see that it is transforming our lives in many ways. Deep learning is creating the next generation of computing platforms, for example:

  • Conversational Computers: Powered by AI, smart speakers answered 100 billion voice commands in 2020, 75% more than in 2019.
  • Self-Driving Cars: Waymo's autonomous vehicles have collected more than 20 million real-world driving miles across 25 cities, including San Francisco, Detroit, and Phoenix.
  • Consumer Apps: We are familiar with recommendation engines that learn from all our digitally recorded behavior and drive our product, service, and entertainment choices. They control the personalized ads we are exposed to and are the primary sources of revenue for Google, Facebook, and others, often using data that we did not realize was being collected and did not consent to share. But this can build market advantage: TikTok, for example, which uses deep learning for video recommendations, has outgrown Snapchat and Pinterest combined.

According to ARK Investment research, deep learning will add $30 trillion to the global equity market capitalization over the next 15-20 years. They estimate that the ML/DL-driven revolution is as substantial a transformation of the world economy as the shift from IT computing to internet platforms was in the late ’90s, and predict that deep learning will be the dominant source of market capital creation over the coming decades.

Three factors drive the advance of AI: algorithmic innovation, data, and the amount of computing capacity available for training. Though we are seeing substantial improvements in computing and algorithmic efficiency, data volumes are also increasing dramatically, and recent Large Language Model (LLM) innovations from OpenAI (GPT-3), Google (BERT), and others show that this approach has a significant impact on resource usage.

"If we were able to give the best 20,000 AI researchers in the world the power to build the kind of models you see Google and OpenAI build in language; that would be an absolute crisis, there would be a need for many large power stations." 
Andrew Moore, GM Cloud Ops Google Cloud

This energy-intensive workload has seen immense growth in recent years. Machine learning (ML) may become a significant contributor to climate change if this exponential trend continues. Thus, while there are many reasons to be optimistic about the technological progress we are making, it is also wise to both consider what can be done to reduce the carbon footprint, and take meaningful action to address this risk.

The Problem: Exploding Energy Use & The Growing Carbon Footprint

Lasse Wolff Anthony, one of the creators of Carbontracker and co-author of a study on AI power usage, believes this drain on resources is something the community should start thinking about now, as the compute needed for deep learning, the main driver of its energy cost, grew an estimated 300,000-fold between 2012 and 2018.

They estimated that training OpenAI’s giant GPT-3 text-generating model produced emissions akin to driving a car to the Moon and back, a distance of about 700,000 km or 435,000 miles. They estimate that the training run required roughly 190,000 kWh, which at the average carbon intensity of electricity in the United States would have produced 85,000 kg of CO2 equivalents. Other estimates are even higher.
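The arithmetic behind these figures is simple enough to check. The sketch below multiplies energy by grid carbon intensity, works backwards to the intensity implied by the two numbers quoted above, and converts the distance. The function names and the low-carbon intensity value are illustrative assumptions, not figures from the study.

```python
# Back-of-the-envelope check of the GPT-3 training figures quoted above.
# Only the 190,000 kWh, 85,000 kg, and 700,000 km values come from the text;
# everything else is a hypothetical illustration.

KM_PER_MILE = 1.609344  # exact definition of the international mile


def co2e_kg(energy_kwh: float, intensity_kg_per_kwh: float) -> float:
    """Emissions = energy consumed x carbon intensity of the electricity used."""
    return energy_kwh * intensity_kg_per_kwh


def km_to_miles(km: float) -> float:
    return km / KM_PER_MILE


TRAINING_ENERGY_KWH = 190_000   # from the text
REPORTED_CO2E_KG = 85_000       # from the text
ROUND_TRIP_KM = 700_000         # from the text

# Carbon intensity implied by the two reported numbers (kg CO2e per kWh).
implied_intensity = REPORTED_CO2E_KG / TRAINING_ENERGY_KWH

# Hypothetical low-carbon grid, chosen only to show how much the
# electricity source matters; not a figure from the study.
HYPOTHETICAL_LOW_CARBON_INTENSITY = 0.05

print(f"Implied intensity: {implied_intensity:.3f} kg CO2e/kWh")
print(f"Distance: {km_to_miles(ROUND_TRIP_KM):,.0f} miles")
print(
    "Same run on the hypothetical low-carbon grid: "
    f"{co2e_kg(TRAINING_ENERGY_KWH, HYPOTHETICAL_LOW_CARBON_INTENSITY):,.0f} kg CO2e"
)
```

The same energy budget produces very different emissions depending on where the electricity comes from, which is why the choice of power source matters as much as model size.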

“As datasets grow larger by the day, the problems that algorithms need to solve become more and more complex,” Benjamin Kanding, co-author of the study, added. “Within a few years, there will probably be several models that are many times larger than GPT-3.”

Training GPT-3 reportedly cost $12 million for a single training run. However, a single run is only possible after reaching the right configuration for GPT-3. Training the final deep learning model is just one of several steps in the development of GPT-3. Before that, the AI researchers had to gradually increase layers and parameters, and fiddle with the many hyper-parameters of the language model until they reached the right configuration. That trial-and-error gets more and more expensive as the neural network grows. We can’t know the exact cost of the research without more information from OpenAI, but one expert estimated it to be somewhere between 1.5X and 5X the cost of training the final model.

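This would put the cost of research and development between $11.5 million and $27.6 million, plus the overhead of parallel GPUs. (That range appears to start from a lower single-run estimate of about $4.6 million rather than the $12 million figure cited above.) This does not even include the cost of human expertise, which is also substantial.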
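The quoted range can be reproduced with a little arithmetic. The sketch below assumes that the total is one final training run plus 1.5X to 5X of that run in trial-and-error, an interpretation not spelled out in the text. Under it, $11.5 million to $27.6 million corresponds to a single-run cost of about $4.6 million, and the same multipliers applied to $12 million give a much higher range. The function name is hypothetical.

```python
# Rough R&D cost range: final training run plus trial-and-error overhead.
# Assumption (not stated in the text): total = final_run * (1 + overhead),
# with overhead between 1.5x and 5x of the final run.

def rd_cost_range(final_run_musd: float,
                  low_overhead: float = 1.5,
                  high_overhead: float = 5.0) -> tuple[float, float]:
    """Return (low, high) total cost in millions of USD, rounded to 0.1."""
    low = final_run_musd * (1 + low_overhead)
    high = final_run_musd * (1 + high_overhead)
    return round(low, 1), round(high, 1)


# A single-run cost of about $4.6M reproduces the range quoted in the text.
print(rd_cost_range(4.6))

# The same multipliers applied to the $12M single-run figure.
print(rd_cost_range(12.0))
```

Whichever starting point is used, the conclusion in the text holds: the research leading up to the final run costs several times more than the run itself.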

OpenAI has stated that while the training cost is high, the running costs would be much lower, but access will only be possible through an API, as few could invest in the hardware needed to run it regularly. Efforts to develop a potentially improved GPT-4, which is said to be 500+ times larger than GPT-3, are estimated to cost more than $100 million in training costs alone!

These costs mean that this kind of initiative can only be attempted by a handful of companies with huge market valuations. It also suggests that today’s AI research field is inherently non-collaborative. The research approach of “obtain the dataset, create model, beat present state-of-the-art, rinse, repeat” creates a big barrier to entry for new researchers and for researchers with limited computational resources.

Ironically, a company with the word “open” in its name has chosen not to release the architecture and the pre-trained model. The company has opted to commercialize the deep learning model instead of making it freely available to the public.

So how is this trend toward large models likely to progress? While advances in hardware and software have been driving down AI training costs by 37% per year, the size of AI models is growing much faster, at 10x per year. As a result, total AI training costs continue to climb. Researchers believe that the cost of training a state-of-the-art AI model is likely to increase 100-fold, from roughly $1 million today to more than $100 million by 2025. The training cost outlook from ARK Investment is shown below in log scale, where you can also see how the original NMT efforts compare to GPT-3.
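To see why falling unit costs do not offset growing model sizes, the two rates can be combined. The sketch below makes a deliberately naive assumption that is not in the source: that training cost scales in direct proportion to model size and that both annual rates hold unchanged. The function names are illustrative.

```python
import math

# Figures from the text.
UNIT_COST_DECLINE = 0.37   # training costs fall 37% per year
MODEL_SIZE_GROWTH = 10.0   # model size grows 10x per year


def net_annual_cost_factor(size_growth: float = MODEL_SIZE_GROWTH,
                           cost_decline: float = UNIT_COST_DECLINE) -> float:
    """Year-over-year multiplier on training cost, assuming cost is
    proportional to model size (a simplifying assumption)."""
    return size_growth * (1.0 - cost_decline)


def years_to_multiply(target: float, annual_factor: float) -> float:
    """Years of compounding needed to multiply cost by `target`."""
    return math.log(target) / math.log(annual_factor)


factor = net_annual_cost_factor()
print(f"Net cost growth per year: {factor:.1f}x")
print(f"Years to reach 100x: {years_to_multiply(100, factor):.1f}")
```

Under these naive assumptions, costs grow several-fold every year and a 100-fold increase arrives sooner than 2025, which suggests the forecast quoted above does not assume that both rates persist unchanged. Either way, efficiency gains alone do not come close to cancelling the growth in model size.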

Training a powerful machine-learning algorithm often means running huge banks of computers for days, if not weeks. The fine-tuning required to perfect an algorithm, for example by searching through different neural network architectures to find the best one, can be especially computationally intensive. For all the hand-wringing, though, it remains difficult to measure how much energy AI consumes and even harder to predict how much of a problem it could become.

There have been several efforts in 2021 to build even bigger models than GPT-3, all probably with a huge carbon footprint. But there is good news: Chinese tech giant Alibaba announced M6, a massive model that has 10 trillion parameters (more than 50x the size of GPT-3), and reported that it was trained with 1% of the energy consumption needed to train GPT-3!

Another graphic, shown below, illustrates the carbon impact generated by efforts to enhance deep learning models. All deep learning models are subject to ongoing work to reduce their error rates as a matter of course. An example with Neural MT is when a generic model is adapted or customized with new data.

Subsequent efforts to improve accuracy and reduce error rates in models often need additional data, re-training, and processing. The chart below shows how much energy is needed to reduce model error rates for image recognition on the ImageNet dataset. As we can see, the improvement process for large models has a substantial environmental impact that needs to be considered.

There is ongoing research, and there are new startups, focused on more efficient training and improvement techniques, lighter-footprint models that are only slightly less accurate, and more efficient hardware. All of these are needed and will hopefully help reverse or reduce the 300,000x growth in deep learning compute demand, and the energy use that comes with it, seen between 2012 and 2018.

Here is another site that lets AI researchers roughly calculate the carbon footprint of their algorithms.

And as the damage caused by climate change becomes more apparent, AI experts are increasingly troubled by those energy demands. Many of the deep learning initiatives shown in the first chart above are being conducted in the Silicon Valley area in Northern California. This is an area that has witnessed several catastrophic climate events in the recent past:

  • Seven of the ten largest recorded forest wildfires in California have happened in the last three years!
  • In October 2021, the Bay Area also witnessed a “bomb cyclone” rain event after a prolonged drought, which produced the largest 24-hour rainfall in San Francisco since the Gold Rush!

San Francisco skyline turns orange during wildfires in September 2020

The growing recognition of these impending problems is pushing big tech companies to implement carbon-neutral strategies. Many consumers are now demanding that their preferred brands take action to show awareness and move toward carbon neutrality.

Climate Neutral Certification gives businesses and consumers a path to a net-zero future and also builds brand loyalty and advocacy. A look at the list of committed public companies shows that this is now recognized as a move that enhances the brand and builds market momentum.

Uber and Hertz recently announced a dramatic expansion of their electric vehicle fleet and received much positive feedback from customers and the market.

Carbon Neutral Is The New Black

What Sustainability Means to Translated Srl

In the past few years, Translated’s energy consumption linked to AI tasks has increased exponentially, and it now accounts for two-thirds of the company’s total energy consumption. Training a translation model for a single language can produce as much CO2 as driving a car for thousands of kilometers.

A large model produces as much CO2 as hundreds of airline flights would. This is why Translated is pledging to become a completely carbon-neutral company.

"How? Water is among the cleanest energy sources out there, so we have decided to acquire one of the first hydroelectric power plants in Italy. This plant was designed by Albert Einstein’s father in 1895. We are adapting and renovating this historic landmark, and eventually, it will produce over 1 million kW of power a year, which will be sufficient to cover the needs of our data center, campus, and beyond."

Translated's electric plant located in Sannazzaro de’ Burgondi

Additionally, the overall architecture of ModernMT minimizes the need for large energy-intensive re-training and the need for maintenance of multiple client-specific models that is typical of most MT deployments today.

Global enterprises may have multiple subject domains and varied types of content so multiple optimizations are needed to ensure that MT performs well. Typical MT solutions require different MT engines for web content, technical product information, customer service & support content,  and user-generated and community content for each language.

ModernMT can handle all these adaptation variants with a single engine that can be differently optimized.  

  • ModernMT is a ready-to-run application that does not require any initial training phase. It incorporates user-supplied resources immediately without needing model retraining.
  • ModernMT learns continuously and instantly from user feedback and corrections made to MT output as production work is done. It produces output that improves by the day and even the hour in active-use scenarios.
  • The ModernMT system manages context automatically and does not require building domain-specific systems.

ModernMT is perhaps the most flexible and easy to manage enterprise-scale MT in the industry.

ModernMT’s goal is to deliver the quality of multiple custom engines by adapting to the provided context on the fly. This makes it much easier to manage on an ongoing basis as only a single engine is needed. This reduces the training need, hence carbon footprint, and makes it easier to manage and update over time.

A primary driver of improvement in ModernMT is the tightly integrated human-in-the-loop feedback process, which provides continuous improvements in model performance yet greatly reduces the need for large-scale retraining.

ModernMT is a relatively low footprint approach to continuously learning NMT that we hope to make even more energy efficient in the future.

Thursday, November 11, 2021

The Challenge of Using MT in Localization

We live in an era where MT performs more than 99% of all the translation done on the planet on any given day.

However, the adoption of MT by the enterprise is still nascent and still building momentum. Business enterprises have been slower to adopt MT even though national security and global surveillance-focused government agencies have used MT heavily. This adoption delay has mostly been because MT has to be adapted and tuned to perform better with very specific language used in specialized enterprise content.

Early enterprise adoption of MT was focused on eCommerce and customer support use-cases (IT, Auto, Aerospace) where huge volumes of technical support content made it a necessity to use MT technology to allow any possibility of translating the voluminous content in a timely and cost-effective manner to improve the global customer experience.

Microsoft was a pioneer who translated its widely used technical knowledge base to support an increasingly global customer base. The positive customer feedback for doing this has led to many other large IT and consumer electronics firms doing the same.

The adaptation of the MT system to perform better on enterprise content is a critical requirement in producing successful outcomes. In most of these early use-cases, MT was used to manage translation challenges where the content volumes were huge, i.e., millions of words a day or week. These were “either use MT or provide nothing” knowledge-sharing scenarios.

These enterprise-optimized MT systems have to adapt to the special terminology and linguistic style of the content they translate, and this customization has been a key element of success with any enterprise use of MT.

eBay was an early MT adopter in eCommerce and has stated often that MT is key in promoting cross-border trade. It was understood that “Machine translation can connect global customers, enabling on-demand translation of messages and other communications between sellers and buyers, and helps them solve problems and have the best possible experiences on eBay.”

A study by an MIT economist showed that after eBay improved its automatic translation program in 2014, commerce shot up by 10.9 percent among pairs of countries where people could use the new system.

Today we see that MT is a critical element of the global strategy for Alibaba, Amazon, eBay, and many other eCommerce giants.

Even in the COVID-ravaged travel market segment, MT is critical as we see with Airbnb, which now translates billions of words a month to enhance the international customer experience on their platform. In November 2021 Airbnb announced a major update to the translation capabilities of their platform in response to rapidly growing cross-border bookings and increasingly varied WFH activity.

“The real challenge of global strategy isn’t how big you can get, but how small you can get.”
Dennis Goedegebuure, former head of Global SEO at Airbnb.

However, MT use for localization use cases has trailed far behind these leading-edge examples, and even in 2021, we find that the adoption and active use of MT by Language Service Providers (LSPs) is still low. Much of the reason lies in the fact that LSPs work on hundreds or thousands of small projects rather than a few very large ones.

Early MT adopters tend to focus on large-volume projects to justify the investments needed to build adapted systems capable of handling the high-volume translation challenge.

What options may be available to increase adoption in the localization and professional business translation sectors?

At the MT Summit conference in August 2021, CSA's Arle Lommel shared survey data on MT use in the localization sector in his keynote presentation. He noted that while there has been an ongoing increase in adoption by LSPs there is considerable room to grow.

Arle specifically pointed out that a large number of LSPs who currently have MT capacity only use it for less than 15% of their customer workload and, “our survey reveals that LSPs, in general, process less than one-quarter of their [total] volume with MT.”

The CSA survey polled a cross-section of 170 LSPs (from their "Ranked 191" set of largest global LSPs) on their MT use and MT-related challenges. The quality of the sample is high and thus these findings are compelling.

The graphic below highlights the survey findings.

CSA Survey of MT Use at LSPs

When they probed further into the reasons behind the relatively low use of MT in the LSP sector they discovered the following:

  • 72% of LSPs report difficulty in meeting quality expectations with MT
  • 62% of LSPs struggle with estimating effort and cost with MT

Both of these causes point to the difficulty that most LSPs face with the predictability of outcomes with an MT project.

Arle reported that in addition to LSPs, many enterprises also struggle with meeting quality expectations and are often under pressure to use MT in inappropriate situations or face unrealistic ROI expectations from management. Thus, CSA concluded that while current-generation MT does well relative to historical practice, it does not (yet) consistently meet stakeholder requirements.

This apparent market reality validated by this representative sample is in stark contrast to what happens at Translated Srl, where 95% of all projects and all client work use MT (ModernMT) since it is a proven way to expedite and accelerate translation productivity.

Adaptive, continuously learning ModernMT has been proven to work effectively over thousands of projects with tens of thousands of translators.

This ability to properly use MT in an effective and efficient assistive role in production translation work has resulted in Translated being one of the most efficient LSPs in the industry, with the highest revenue per employee and high margins.

Another example of the typical LSP experience: a recent study by Charles University done with only 30 translators using 13 engines (EN>CS) concludes: "the previously assumed link between MT quality and post-editing time is weak and not straightforward." They also found that these translators had “a clear preference for using even imprecise TM matches (85–94%) over MT output."

This is hardly surprising, as getting MT to work effectively in production scenarios requires more than choosing the system with the best BLEU score.

Understanding The Localization Use Case For MT

Why is MT so difficult for LSPs to deploy in a consistently effective and efficient manner?

There are at least four primary reasons:

  1. The localization use case requires the highest quality MT output to drive productivity which is only possible with specialized expertise and effort,
  2. Most LSPs work on hundreds/thousands of smallish projects (relative to MT scale) that can vary greatly in scope and focus,
  3. Effective MT adaptation is complex,
  4. MT system development skills are not typically found in an LSP team.

MT Output Expectations

As the CSA survey showed, getting MT to consistently produce output of sufficient quality for use in production work is difficult. While using generic MT is quite straightforward, most LSPs have discovered that rapidly adapting and optimizing MT for production use is extremely difficult.

It is a matter of both MT system development competence and workflow/process efficiency. 

Many LSPs feel that success requires the development of multiple engines for multiple domains for each client, which is challenging since they don't have a clear sense of the effort and cost needed to achieve positive ROI.

If you don’t know how good your MT output will be, how do you plan for staffing PEMT work and calculate PEMT costs?

Thus, we see MT is only used when very large volumes of content are focused around a single subject domain or when a client demands it.

A corollary to this is that it requires deep expertise and understanding of NMT models to acquire the skills and data needed to raise MT output to useful high-quality levels consistently.

Project Variety & Focus

Most LSPs handle a large and varied range of projects that cover many subject domains, content types, and user groups on an ongoing basis. The translation industry has evolved around a Translate>Edit>Proof (TEP) model that has multiple tiers of human interaction and evaluation in a workflow.

Most LSPs struggle to adapt this historical people-intensive approach to an effective PEMT model which requires a deeper understanding of the interactions between data, process, and technology.

The biggest roadblock I have seen is that many LSPs get entangled in opaque linguistic quality assessment and estimation exercises, and completely miss the business value implications created by making more content multilingual. Localization is only one of several use-cases where translation can add value to the global enterprise's mission.

Typically, there is not enough revenue concentration around individual client subject domains, thus, it is difficult for LSPs to invest in building MT systems that would quickly add productivity to client projects. 

MT development is considered a long-term investment that can take years to yield consistently positive returns.

This perceived requirement to develop multiple engines for many domains for each client calls for an investment that cannot be justified by short-term revenue potential. MT projects, in general, need a higher level of comfort with outcome uncertainty, and handling hundreds of MT projects concurrently to service the business is too demanding a requirement for most LSPs.

MT is Complex

Many LSPs have dabbled with open-source MT (Moses, OpenNMT) or with AutoML and Microsoft Translator Hub, only to find that everything from data preparation to model tuning and quality measurement is complicated and requires deep expertise that is uncommon in the language industry.

While it is not difficult to get a rudimentary MT model built, it is a very different matter to produce an MT engine that consistently works in production use. For most LSPs, open-source and DIY MT is the path to a failed project graveyard.

Neural MT technology is evolving at a significantly faster pace than Statistical MT did. Staying abreast of the state-of-the-art (SOTA) requires a serious commitment, both in manpower and computing resources.

LSPs are familiar with translation memory technology that has barely changed in 25 years, but MT has changed dramatically over the same period. In recent years the neural network-based revolution has driven multiple open-source platforms to the forefront, and keeping abreast of the change is difficult.

NMT requires expertise not only around "big data", NMT algorithms, and open-source platform alternatives but also around understanding parallel processing hardware.

Today AI and Machine Learning (ML) are synonymous, and engineers with ML expertise are in high demand.

MT requires long-term commitment and investment before consistent positive ROI is available and few LSPs have an appetite for such investments.

Some say that an MT development team might be ready for prime-time production work only after they have built a thousand engines and have this experience to draw from. This competence-building experience seems to be a requirement for sustainable success.

Talent Shortage

Even if LSP executives are willing to make these strategic long-term investments, finding the right people has become increasingly hard. According to a recent survey by Gartner, executives see the talent shortage not just as a major hurdle to progressing organizational goals and business objectives, but also as a barrier that is preventing many companies from adopting emerging technologies.

The Gartner research, which is built on a peer-based view of the adoption plans for 111 emerging technologies from 437 global IT organizations over a 12- to 24-month time period, shows that talent shortage is the most significant adoption barrier for 64% of emerging technologies, compared with just 4% in 2020.

IT executives cited talent availability as the main adoption risk factor for the majority of IT automation technologies (75%) and nearly half of digital workplace technologies (41%).

But using technology early and effectively creates a competitive advantage. Bain estimates that “born-tech” companies have captured 54% of the total market growth since 2015. “Born-tech” companies are those with a tech-led strategy. Think Tesla in automobiles, Netflix in media, and Amazon in retail.

Technology has emerged as the primary disruptor and value creator across all sectors. The demand for data scientists and machine learning engineers is at an all-time high.

LSPs need to compete with the Global 2000 enterprises that offer more money and resources to the same scarce talent. Thus, we even see technical talent migrating out of translation services to “mainstream” industry.

There is a gold rush happening around well-funded ML-driven startups and enterprise AI initiatives. ML skills are being seen as critical to the next major evolution in value creation in the overall economy as the chart below shows.

This perception is driving huge demand for data scientists, ML engineers, and computational linguists who are all necessary to build momentum and produce successful AI project outcomes. The talent shortage will only get worse as more people realize that deep learning technology is fueling most of the value growth across the global economy.

Thus, it appears that MT is likely to remain an insurmountable challenge for most LSPs. The option for an LSP to start building robust state-of-the-art MT capabilities in 2021 is increasingly unlikely. 

Even the largest LSPs today have to use “best-of-breed” public systems rather than build internal MT competence. Strategies employed to do this typically depend on selecting MT systems based on BLEU, hLEPOR, TER, Edit Distance, or some other score-of-the-day, which again helps explain why less than 15% of the workload is handled with MT in production.

As CSA has discovered, LSP MT use has been largely unsuccessful because a good Edit Distance/hLEPOR/COMET score does not necessarily translate to responsiveness, ease of use, and adaptability of the MT system to the needs of the production localization use case.
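For readers unfamiliar with these scores, the sketch below shows what an edit-distance style metric measures: the number of word insertions, deletions, and substitutions needed to turn MT output into a reference translation, normalized by reference length. It is a simplified illustration with hypothetical function names; TER proper also allows block shifts, which are omitted here, and the sample sentences are invented.

```python
# Simplified word-level edit distance between MT output and a reference.
# This is an illustration only, not an implementation of official TER.

def word_edit_distance(hypothesis: str, reference: str) -> int:
    """Minimum number of word insertions, deletions, and substitutions
    needed to turn `hypothesis` into `reference` (Levenshtein distance)."""
    hyp, ref = hypothesis.split(), reference.split()
    previous = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        current = [i]
        for j, ref_word in enumerate(ref, start=1):
            substitution = previous[j - 1] + (0 if hyp_word == ref_word else 1)
            deletion = previous[j] + 1      # drop a hypothesis word
            insertion = current[j - 1] + 1  # add a reference word
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def normalized_edit_rate(hypothesis: str, reference: str) -> float:
    """Edits per reference word; 0.0 means the output needed no changes."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(hypothesis, reference) / ref_len


mt_output = "the cat sat on the mat"
post_edited = "the cat sat on a mat"
print(word_edit_distance(mt_output, post_edited))
print(round(normalized_edit_rate(mt_output, post_edited), 3))
```

A score like this summarizes one static test set at one point in time. It says nothing about how quickly a system adapts to corrections or how it fits into a production workflow, which is the gap described above.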

For MT to be useable on 95%+ of the production translation work done by an LSP, it needs to be reliable, flexible, manageable, rapidly adaptive, and continuously learning. MT needs to produce predictably useful output and be truly assistive technology for it to work in localization production work.

The contrast of the MT experience at Translated Srl is striking. ModernMT was designed from the outset to be useful to translators and created to collect the right kind of data needed to rapidly improve and assist in localization project-focused systems.

ModernMT is a blend of the right data, deep expertise in both localization processes and machine learning, and a respectful and collaborative relationship between translators and MT technologists. It is more than just an adaptive MT engine.

Translated has been able to overcome all of the challenges listed above using ModernMT, which today is possibly the only viable MT technology solution that is optimized for the core localization-focused business of LSPs.

ModernMT is an MT system optimized for LSP use. It can be used quickly and successfully by any LSP because no startup setup or training is needed: it is a simple "load TM and immediately use" model.

ModernMT Overview & Suitability for Localization

ModernMT is an MT system that is responsive, adaptable, and manageable in the typical localization production work scenario. It is an MT system architecture that is optimized for the most demanding MT use-case: localization. And it is thus able to handle many other use-cases which may have more volume but are less demanding on the output quality requirements.

ModernMT is a context-aware, incremental, and responsive general-purpose MT technology that is price competitive with the big MT portals (Google, Microsoft, Amazon) and is uniquely optimized for LSPs and any translation service provider, including individual translators.

It can be kept completely secure and private for those willing to make the hardware investments for an on-premise installation. It is also possible to develop a secure and private cloud instance for those who wish to avoid making hardware investments.

ModernMT overcomes technology barriers that hinder the wider adoption of currently available MT software by enterprise users and language service providers:

  • ModernMT is a ready-to-run application that does not require any initial training phase. It incorporates user-supplied resources immediately without needing upfront model training.
  • ModernMT learns continuously and instantly from user feedback and corrections made to MT output as production work is being done. It produces output that improves by the day and even the hour in active-use scenarios.
  • ModernMT is context-sensitive.
  • The ModernMT system manages context automatically and does not require building domain-specific systems.
  • ModernMT is easy to use and rapidly scales across varying domains, data, and user scenarios.
  • ModernMT has a data collection infrastructure that accelerates the process of filling the data gap between large web companies and the machine translation industry.
  • ModernMT is driven simply by the source sentence to be translated and, optionally, small amounts of contextual text or translation memory.

ModernMT’s goal is to deliver the quality of multiple custom engines by adapting to the provided context on the fly. This fluidity makes it much easier to manage on an ongoing basis as only a single engine is needed.

The translation process in ModernMT is quite different from common, non-adapting MT technologies. The models created with this tool do not merge all the parallel data into a single indistinguishable heap; separate containers for each data source are created instead and this is how it maintains the ability to adapt to hundreds of different contextual use scenarios instantly.
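The idea of keeping data sources in separate containers, rather than one merged heap, can be illustrated with a toy example. The sketch below is not ModernMT's actual implementation: the class, its methods, the word-overlap scoring, and the sample data are all hypothetical, chosen only to show how separate memories let a system decide at request time which data is most relevant to a given context.

```python
# Toy illustration of context-based selection over separate memories.
# Hypothetical names and scoring; not how ModernMT itself is implemented.

def tokens(text: str) -> list[str]:
    return text.lower().split()


class MemoryStore:
    """Keeps each data source in its own container instead of merging them."""

    def __init__(self) -> None:
        self._memories: dict[str, list[tuple[str, str]]] = {}

    def add(self, memory_name: str, source: str, target: str) -> None:
        """Add one translation unit; no retraining step is involved."""
        self._memories.setdefault(memory_name, []).append((source, target))

    def rank(self, context: str) -> list[tuple[str, float]]:
        """Score each memory by the share of context words it has seen,
        highest first."""
        context_words = set(tokens(context))
        scores = {}
        for name, pairs in self._memories.items():
            vocabulary = set()
            for source, _ in pairs:
                vocabulary.update(tokens(source))
            overlap = len(context_words & vocabulary)
            scores[name] = overlap / len(context_words) if context_words else 0.0
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)


store = MemoryStore()
store.add("legal", "the contract terminates upon notice", "(target text)")
store.add("support", "restart the router and check the cable", "(target text)")

for name, score in store.rank("please restart the router"):
    print(f"{name}: {score:.2f}")
```

Because each source stays distinct, a newly added translation unit affects the very next request, and the same engine can weight different memories for different documents. A merged corpus would lose that information.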

ModernMT consistently outperforms the big portals in MT quality comparisons done by independent third-party researchers, even on the static baseline versions of their systems.

ModernMT systems can easily outperform competitive systems once adaptation begins, and active corrective feedback immediately generates quality-improving momentum.

The following charts show how ModernMT is a consistent superior performer even as the quality measurement metrics change over multiple independent third-party evaluations conducted over the last three years. 

None of these metrics captures the ongoing and continuous improvements in output quality that are the daily experience of translators who work with dynamically improving ModernMT at Translated Srl.

Independent evaluations confirm that ModernMT quality improves faster on a COVID data set for English > German, as the chart below shows.

ModernMT was also the "top performer" on several other languages tested with COVID data.

As of Q4 2021, the COMET metric is widely considered a "better" score because it is more aligned with human assessments and also incorporates semantic similarity, and again ModernMT shines.

If the predictions about the transformative impact of the deep learning-driven revolution are true, DL will likely disrupt many industries including the translation industry. MT is a prime example of an opportunity lost by almost all the Top 20 LSPs.

While it is challenging to get MT working consistently in localization scenarios, ModernMT and Translated show that it is possible and that there are significant benefits when you do.

This success also shows that when you get MT properly working in professional translation work, you create competitive advantages that provide long-term business leverage.  The future of business translation increasingly demands collaborative working models with human services integrated with responsive adapted MT. The future for LSPs that do not learn to use MT effectively will not be rosy.

A detailed overview of ModernMT is provided here. It is easy to test it against other competitive MT alternatives, as the rapid adaptation capabilities can be easily seen by working with MateCat/Trados or with a supported TMS product (MemoQ) if that is preferred.

ModernMT is an example of an MT system that can work for both the LSP and the Translator. The ease of the "instant start experience" with Matecat + ModernMT is striking when compared to the typical plodding, laborious MT customization process we see elsewhere today. Try it and see.

Friday, November 5, 2021

The Human-In-The-Loop Driving MT Progress

 We live in an age where artificial intelligence (AI) is a term that is used extensively in conversations both in our personal and professional lives. In most cases, the term AI refers to specialized machine learning (ML) applications that “acquire knowledge” through a process called deep learning.

While great progress has been made in many ML cases, we also realize now that machine learning alone is unlikely to completely solve challenging problems, especially in natural language processing (NLP). The neural machine translation (NMT) capabilities today are impressive in contrast to historical MT offerings but still fall short of competent human translation.

Neural MT “learns to translate” by looking closely (aka "training") at large datasets of human-translated data. Deep learning is self-education for machines; you feed the system huge amounts of data, and it begins to discern complex patterns within the data.

But despite the occasional ability to produce human-like outputs, ML algorithms are at their core only complex mathematical functions that map observations to outcomes. They can forecast patterns that they have previously seen and explicitly learned from. Therefore, they’re only as good as the data they train on and start to break down as real-world data starts to deviate from examples seen during training.

Neural MT has made great progress indeed but is far from having solved the translation problem. Post-editing is still needed in professional settings and no responsible language services company would depend entirely on MT without human oversight or review.

We hear regularly about "big data" that is driving AI progress, but we are finding more and more cases where the current approach of deep learning and adding more data is not enough. The path to progress is unlikely to be brute force training of larger neural networks with deeper layers on more data.

Whilst deep learning excels at pattern recognition, it’s very poor at adapting to changing situations when even small modifications of the original case are encountered, and often has to be re-trained with large amounts of data from scratch. This is one reason we see so little production use of MT amongst LSPs.

In most cases, the AI learning process happens upfront and only takes place in the development phase. The model that is developed is then brought onto the market as a finished program. Continuous “learning” is neither planned nor does it always happen after a model is put into production use. This is also true of most public MT systems. While these systems are updated periodically, they are not easily able to learn and adapt to new, ever-changing production requirements.

Machine learning progress still falls short of human performance in NLP

Recently there has been much fanfare around huge pre-trained language model-based initiatives like BERT and GPT-3. This involves training a neural network model on an enormous amount of data and then adapting (“fine-tuning”) the model to a range of more specific NLP tasks that require classification, sequence labeling, or similar digesting of text, e.g., named entity recognition, question answering, sentiment analysis. GPT-3 can sometimes generate human-sounding textual responses to questions.

Researchers at Stanford have been most vocal in claiming that this is a “sweeping paradigm shift in AI”. They have coined a new term, “Foundation Models” to characterize this shift, but are being challenged by many experts.

Some examples of the counterview:

Jitendra Malik, a renowned expert in computer vision at Berkeley, said, “I am going to take a ... strongly critical role when we talk about them as the foundation of AI ... These models are castles in the air. They have no foundations whatsoever.”

Georgia Tech professor Mark Riedl wrote on Twitter, “Branding very large pre-trained neural language models as ‘foundation’ models is a brilliant … PR stunt. It presupposes them as inevitable to any future in AI”. But that doesn’t make it so.

The reality is that foundation model demos, at least in their current incarnations, are more like parlor tricks than genuine intelligence. They work impressively well some of the time but also frequently fail, in ways that are erratic, unsystematic, and even foolish. One recent model, for example, mistook an apple with the word “iPod” on a piece of paper for an actual iPod.

The initial enthusiasm for GPT-3 has been followed by increasing concern as people have realized how prone these systems are to producing unpredictable obscenity, prejudiced remarks, misinformation, and so forth. Some experts fear that GPT-3-like capabilities could even become the engine of massively scaled misinformation, churning out mediocre content that instigates increased societal dysfunction and polarization.

Large pre-trained statistical models can do almost anything, at least enough for a proof of concept, but there is little that they can do reliably—precisely because they skirt the foundations that are actually required.

OpenAI, which believes in the scaling hypothesis, is supposedly working on GPT-4, which is reported to have 100 trillion parameters, 500+ times the size of GPT-3, in a presumed attempt to achieve AGI. But critics are skeptical that increasing data and scale alone will be the answer.

As Grady Booch put it: “Foundational AI models are a dead end: they will never yield systems that understand; their maniacal focus on ‘moar data!’ is superficial; they grow at the expense of ignoring better architectures.”

Stuart Russell, professor at Berkeley and AI pioneer, argues that “focusing on raw computing power misses the point entirely […] We don’t know how to make a machine intelligent — even if it were the size of the universe.” Deep learning isn’t enough to achieve AGI.

We have seen that these two opposing views have also been true with machine translation. There were significant advances when we moved from Rule-Based MT to Statistical MT initially. The improvements plateaued after an initial forward leap, and then we discovered that more data is not always better, and this happened again with Neural MT. Good-ish, but not quite enough.

There are some who believe that NMT will replace human translators, but the reality in professional translation is not quite as shiny. Today we are much more aware that data quality and the “right data” matter more than volume alone. Human oversight is mandatory for most professional translations. Some say that setting up an active learning and corrective feedback process is a better way forward than brute force data and computing resource application.

What is human-in-the-loop (HITL) based human-machine collaboration?

Human-in-the-loop (HITL) is the process of leveraging the power of the machine and enabling high-value human intelligence interactions to create continuously improving machine learning-based AI models. Active learning generally refers to humans handling low-confidence units and feeding improvements back into the model. Human-in-the-loop is broader, encompassing active learning approaches as well as the creation of data sets through human labeling.

HITL describes the process used when the machine is unable to solve a problem based on initial training data alone and needs human intervention to improve both the training and testing stages of building an algorithm. Properly done, this creates an active feedback loop allowing the algorithm to give continuously better results with ongoing use and feedback.

ML “learns” by collecting “experience” from the contents of exemplary data sets, arranging these “experiences”, developing a complex model from it, and finally gaining “knowledge” from the patterns and laws that have emerged. In other words, machines learn by being trained — fed with data sets. Thus, “learning” is only as good as the data that they learn from.

The computer encodes this learning into an algorithm using neural net deep learning techniques. This algorithm is then used to convert new input data with the learned patterns embedded in the algorithms to hopefully generate acceptable and useful output. Public MT that produces “gist quality” output is an example of widely used NLP AI that “translates” trillions of words a day.

With language translation, the critical training data is translation memory. However, the truth is that there is no existing training data set (TM) that is so perfect, complete, and comprehensive as to produce an algorithm that consistently produces perfect translations.

While some MT systems can produce compelling output in a limited area of use (usually on new data that is similar to the training material used), the professional use of MT often requires ongoing human review and post-editing before widespread dissemination and business use of translated content.

Language is always evolving, and words can be combined in innumerable ways, which precludes the possibility that the machine algorithm will have seen every possible combination.

With most MT systems, ongoing “learning” is neither planned nor does it happen often after the initial development phase.

Also, adapting large generic models to unique enterprise use cases is often fraught with difficulty because developers lack insight into the volume, nature, and quality of the underlying base data.

While some systems may have periodic updates as new chunks of training data become available, in the interim post-editors are forced to repeatedly correct the same type of errors, over and over again. Thus, we see that many LSPs and translators tend to be averse to using MT and do so with reluctance.

AI models don't make predictions with 100% confidence as their "understanding" of data is largely based on statistics, which lacks the concept of certainty as humans use it in practice. To account for this inherent algorithmic uncertainty, some AI systems like ModernMT allow humans to directly interact with them to actively contribute relevant new learning.

As a consequence of this interaction (feedback), the machine keeps adjusting its "view of the world" and adapts to the new learning. This works much like teaching a child who points at a cat and says "woof woof": through repeated correction ("No, that's a cat"), the child learns to make the right association.

Human-in-the-loop aims to achieve what neither a human being nor a machine can achieve on their own. When a machine isn’t able to solve a problem, humans step in and intervene. This process results in the creation of a continuous feedback loop that produces output that is useful to the humans using the system.

With constant feedback, the algorithm learns and produces better results over time. Active and continuous feedback to improve existing learning and create new learning is a key element of this approach.

As Rodney Brooks, the co-founder of iRobot, said in a post entitled “An Inconvenient Truth About AI”:

 "Just about every successful deployment of AI has either one of two expedients: It has a person somewhere in the loop, or the cost of failure, should the system blunder, is very low."

In the translation context, with ModernMT, this means that the system is designed from the ground up to actively receive feedback and rapidly incorporate this into the existing model on a daily or even hourly basis.

This rapid and continuous feedback and learning loop produces better MT output. This is in contrast to most MT models, where corrective data is collected over many months or years and the model is then laboriously re-trained to learn from it, often with limited success as opaque baseline data dominates the model’s predictive behavior.

HITL refers to systems that allow humans to give direct feedback to a model for predictions below a certain level of confidence. This approach allows ModernMT to address the problem of quickly acquiring the “right” data for the specific translation task at hand. HITL within the ModernMT framework allows the system to perform best on the material that is currently in focus.
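A minimal sketch of this confidence-based routing might look like the following. The names `Prediction` and `ReviewLoop`, the 0.8 threshold, and the idea that the system exposes a single confidence number per segment are all assumptions for illustration; this is not a description of how ModernMT computes or uses confidence.

```python
from dataclasses import dataclass, field


@dataclass
class Prediction:
    source: str
    translation: str
    confidence: float  # assumed to be a score between 0 and 1


@dataclass
class ReviewLoop:
    threshold: float = 0.8  # hypothetical cut-off; tune per project
    corrections: list = field(default_factory=list)

    def route(self, prediction):
        """Send low-confidence output to a human, accept the rest."""
        if prediction.confidence >= self.threshold:
            return "auto_accept"
        return "human_review"

    def record_correction(self, prediction, corrected):
        """Keep the human fix so it can be fed back to the model."""
        self.corrections.append((prediction.source, corrected))


loop = ReviewLoop()
confident = Prediction("Good morning", "Buongiorno", confidence=0.95)
uncertain = Prediction("Restart the device", "<draft translation>", confidence=0.42)

decisions = [loop.route(confident), loop.route(uncertain)]
loop.record_correction(uncertain, "<translator's corrected segment>")
print(decisions, loop.corrections)
```

The design choice that matters is the second method: routing alone is just triage, while storing the correction is what turns the review step into a feedback loop.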

The HITL approach also enables ModernMT to rapidly acquire competence in sparse data situations as many enterprise use scenarios do not initially have the right training data available a priori.

An examination of the increasing research interest in HITL can be obtained through a Google Scholar search with the keywords: “human-in-the-loop” and “machine learning”. As the use of machine learning proliferates, there is an increasing awareness that humans working together with machines in an active learning contribution mode can often outperform the possibilities of machines or humans alone.

Effective HITL implementations allow the machine to capture an increasing amount of highly relevant knowledge and enhance the core application as ModernMT does with MT.

While there are some who talk about AI "sentience" and singularity, the reality is more sobering, as anyone who tries to ask Alexa, Siri, or the latest chatbot a question that goes beyond the simplest form can testify. Humans learn how to make Alexa work some of the time for simple things but mostly have to hear the AI say that it does not understand the question when probed for anything beyond simple database lookups.

AI lacks a theory of mind, common sense and causal reasoning, extrapolation capabilities, and a body, and so it is still extremely far from being “better than us” at almost anything slightly complex or general.

This also suggests that humans will remain at the center of complex, knowledge-based AI applications even though the way humans work will continue to change. The future is more likely to be about how to make AI be a useful assistant than it is about replacing humans. 

In language translation, we see that HITL MT systems like ModernMT enable humans to address a much broader range of translation challenges that can add significant value to a growing range of enterprise use cases.

ModernMT: Humans and Machines, Hand in Hand

ModernMT is a highly interactive and engaged MT architecture that has been built and refined over a decade with active feedback and learning from both translators and MT researchers. ModernMT is used intensively in all the production translation work done by Translated Srl. and was a functioning HITL machine learning system before the term was even coined.

This long-term engagement with translators and continuous feedback-driven improvement process also results in creating a superior training data set over the years. This superior data enables users to have an efficiency and quality advantage that is not easily or rapidly replicated. This is also the reason why ModernMT does so consistently well in third-party MT system comparisons, even though evaluators do not always measure its performance optimally. ModernMT simply has more informed translator feedback built into the system.

ModernMT is an "Instance-Based Adaptive MT" platform. This means that it can start adapting and tuning the MT output to a customer subject domain immediately, without a batch customization phase. There is no long-running (hours/days/weeks) data preparation and pre-training process needed upfront.

There is also no need to wait and gather a sufficient volume of corrective feedback to update and improve the MT engine on an ongoing basis. It is learning all the time. This also makes it an ideal MT capability for any LSP or translator or any competent bilingual human who can provide ongoing feedback to the system.

In the typical MT development scenario across the world, we see that MT developers and translators have minimal communication and interaction. The typical PEMT workflow involves low-paid translators/editors correcting MT output with little to no say in how the MT system works and responds to feedback. In the typical MT scenario, humans are the downstream clean-up crew after MT produces an initial messy draft.

Typical use of MT has infrequent human feedback once a model is produced and large data volumes of corrective feedback have to be collected slowly to properly train and update models to learn customer-specific language patterns, style, and terminology.

This is in dramatic contrast to the ModernMT development scenario, where there is an active and continuous dialog between MT developers and translators. This makes developers more aware of translator frustrations/needs and also teaches translators to provide actionable and concrete feedback on system output.

The understanding of the translation task and resulting directives that ongoing translator feedback brings to the table is an ingredient that most current MT systems lack.

Corrective feedback given to the MT system is dynamic and continuous and can have an immediate impact on the next sentence produced by the MT system. Over the years the ModernMT product evolution has been driven by changes to identify and reduce post-editing and translation cognition effort rather than optimizing BLEU scores as most other MT developers have done.

Recently, in sales presentations at MT Summit, several MT vendors claimed to have “human-in-the-loop” MT systems when presenting traditional PEMT (translator-as-slave) workflows. However, it is much easier to add those words on a slide than to implement the key set of functional requirements and capabilities that make HITL a reality.

Expert MT use is a result of the right data, the right process, and ML algorithms. In the localization use case, the "right" process is particularly important. “Like much of machine intelligence, the real genius [of deep learning] comes from how the system is designed, not from any autonomous intelligence of its own. Clever representations, including clever architecture, make clever machine intelligence,” Roitblat writes.

ModernMT is an example of a superior implementation that brings key elements together compellingly and consistently to solve enterprise translation problems efficiently and at scale.

The following is a summary of features in a well-designed HITL system, such as the one underlying ModernMT:

  • Easy setup and startup process for any and every new adapted MT system.
  • Active and continuous corrective feedback is rapidly processed so that translators can see the impact of corrections in real-time.
  • An MT system that is continuously training and improving with this feedback (by the minute, day, week, month).
  • Active communication and collaboration between translators and MT research and development to address high-friction problems for translators.
  • An inherently superior and continuously improving foundational training data set progressively vetted by humans.
  • Ongoing system evaluation from human feedback and assessment rather than from automated metrics like BLEU, hLepor, COMET, or TER.
  • Tightly integrated into the foundational CAT tools used by translators who provide the most valuable system-enhancing feedback.
  • Translators WANT-TO-USE MT for productivity benefits, unlike many PEMT scenarios where translators do NOT want to work with and actively avoid MT.
  • Multiple points of feedback and system improvement data build collaborative momentum.

As we look to the future of MT technology, it is increasingly apparent that progress will be more likely to come from HITL contributions than from algorithms, computing power, or even new large-scale data acquisition.

MT systems like ModernMT that easily and dynamically engage informed human feedback will learn what matters most to the production use of MT and improve specifically on the most relevant data.

The future of localization is likely to be increasingly a "machine-first, human-optimized" model, and dynamically better, more responsive machine performance will likely result in more positive and successful human interaction and engagement.

Until the day when perfect training data sets are available to train MT, properly designed feedback processing and dynamic model updating capabilities like those in ModernMT are much more likely to deliver the best and most useful professional-use MT performance.

This is a reprint of a post originally published here with small formatting changes.

Friday, October 22, 2021

Understanding Machine Translation Quality: A Review

This is a reprint of a post I wrote and already published here with some minor formatting changes made for emphasis. It is the first of a series of ongoing posts that will be published at that site and also shared here if it seems appropriate. 

For those who may seek or insist that I maintain a truly objective viewpoint, I should warn you that these posts will reflect my current understanding that ModernMT is truly a superior MT implementation for enterprise MT use. I will stress this often in future posts as I have not seen a better deployment of MT technology for professional business translation in the 15 years I have been involved with Enterprise MT.


Today we live in a world where machine translation (MT) is pervasive, and increasingly a necessary tool for any global enterprise that seeks to understand, communicate and share information with a global customer base.

It is estimated by experts that trillions of words are being translated daily with the aid of the many “free” generic public MT portals worldwide.

This is the first in a series of posts that will explore the issue of MT Quality in some depth, with several goals:

  • Explain why MT quality measurement is necessary,
  • Share best practices,
  • Expose common misconceptions,
  • Understand what matters for enterprise and professional use.

While much has been written on this subject already, it does not seem to have reduced the amount of misunderstanding and confusion around it. Thus, there is value in continued elucidation to ensure that greater clarity and understanding are achieved.

So let’s begin.

MT Quality and Why Does It Matter?

Machine Translation (MT) or Automated Translation is a process in which computer software “translates” text from one language to another without human involvement.

There are ten or more public MT portals available to do this in the modern era, and additionally, many private MT offerings are available to the modern enterprise to address their large-scale language translation needs. For this reason, the modern global enterprise needs an understanding of the relative strengths and weaknesses of the many offerings available in the marketplace.

Ideally, the “best” MT system would be identified by a team of competent translators who would run a diverse range of relevant content through the MT system after establishing a structured and repeatable evaluation process. 

This is slow, expensive, and difficult, even if only a small sample of 250 sentences is evaluated.

Thus, automated measurements that attempt to score translation adequacy, fluency, precision, and recall have to be used. They attempt to do what is best done by competent humans. This is done by comparing MT output to a human translation in what is called a Reference Test set. These reference sets cannot provide all the possible ways a source sentence could be correctly translated. Thus, these scoring methodologies are always an approximation of what a competent human assessment would determine, and can sometimes be wrong or misleading.
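To see why a single reference can mislead, consider a deliberately simple token-overlap score. This is a sketch for illustration only: it is not BLEU, hLepor, or any other published metric, and the function name `token_f1` and the sample sentences are made up for the example.

```python
from collections import Counter


def token_f1(hypothesis, reference):
    """Score MT output against one human reference using word overlap."""
    hyp_tokens = hypothesis.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped overlap: a reference word can only be matched as often as it occurs.
    matched = sum((Counter(hyp_tokens) & Counter(ref_tokens)).values())
    precision = matched / len(hyp_tokens) if hyp_tokens else 0.0
    recall = matched / len(ref_tokens) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


reference = "the cat sat on the mat"
literal = token_f1("the cat sat on the mat", reference)
paraphrase = token_f1("a cat was sitting on the rug", reference)

print(literal["f1"], round(paraphrase["f1"], 3))
```

The paraphrase is a perfectly acceptable rendering, yet it scores well under half of the literal match because it does not reuse the reference's words. Production metrics are far more sophisticated, but they share this basic dependence on the reference.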

Thus, identifying the “best MT” solution is not easily done. Consider the cost of evaluating ten different systems on twenty different language combinations with a human team versus automated scores. Even though it is possible to rank MT systems based on scores like BLEU and hLepor, they do not represent production performance. The scores are a snapshot of an ever-changing scene. If you change the angle or the focus, the results change.

A score is not a stable and permanent rating for an MT system. There is no single, magic MT solution that does a perfect job on every document or piece of content or language combination. Thus, the selection of MT systems for production use based on these scores can often be sub-optimal or simply wrong.

Additionally, MT technology is not static: the models are constantly being improved and evolving, and what was true yesterday in quality comparisons may not be true tomorrow.

For these reasons, understanding how the data, algorithms, and human processes around the technology interact is usually more important than any comparison snapshot.  

In fact, building expertise and close collaboration with a few MT providers is likely to yield better ROI and business outcomes than jumping from system to system based on transient and outdated quality score-based comparisons.

Two primary groups have an ongoing and continuing interest in measuring MT quality. They are:

  1. MT developers
  2. Enterprise buyers and LSPs

They have very different needs and objectives and it is useful to understand why this is so.

Measurements that may make sense for developers can often be of little or no value to enterprise buyers and vice versa. 


MT Developers

MT developers typically work on one model at a time, e.g.: English-to-French. They will repeatedly add and remove data from a training set, then measure the impact to eventually determine the optimal data needed.

They may also modify parameters on the training algorithms used, or change algorithms altogether, and then experiment further to find the best data/algorithm combinations using instant scoring metrics like BLEU, TER, hLepor, ChrF, Edit Distance, and COMET.

While such metrics are useful to developers, they should not be used to cross-compare systems, and have to be used with great care.  The quality scores from several (data/algorithm) combinations are calculated by comparing MT output from each of these systems (models) to a Human Reference translation of the same evaluation test data. The highest scoring system is usually considered the best one.
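The comparison loop a developer runs can be sketched roughly as follows. The example uses a word-level edit distance normalized by reference length, which is a simplified stand-in for TER (real TER also handles phrase shifts); the system names and sentences are invented. Since this is an error rate, the lowest score ranks first.

```python
def word_edit_distance(hypothesis, reference):
    """Minimum word insertions, deletions, and substitutions (Levenshtein)."""
    hyp, ref = hypothesis.split(), reference.split()
    previous = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        current = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            current.append(min(previous[j] + 1,          # delete a hypothesis word
                               current[j - 1] + 1,       # insert a reference word
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


def error_rate(hypothesis, reference):
    """Edits needed per reference word; 0.0 means identical."""
    ref_len = len(reference.split())
    return word_edit_distance(hypothesis, reference) / ref_len if ref_len else 0.0


def rank_systems(outputs, references):
    """Average the error rate per system and sort best (lowest) first."""
    averages = {}
    for name, hypotheses in outputs.items():
        rates = [error_rate(h, r) for h, r in zip(hypotheses, references)]
        averages[name] = sum(rates) / len(rates)
    return sorted(averages.items(), key=lambda item: item[1])


references = ["the contract ends in march", "please sign the form"]
outputs = {
    "system_a": ["the contract ends in march", "please sign form"],
    "system_b": ["contract finish in march", "sign the please form"],
}

ranking = rank_systems(outputs, references)
print(ranking)
```

Note that the ranking only holds for this test set and this reference; swap either and the order can change, which is exactly why such scores need careful handling.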

In summary, MT developers use automatically calculated scores that attempt to mathematically summarize overall precision, recall, adequacy, and fluency characteristics of an MT system into a numeric score. This is done to identify the best English-to-French system, as stated in our example, that they can build with available data and computing resources.

However, a professional human assessment may often differ from what these scores say.

In recent years, Neural MT (NMT) models have exposed that using these automated scoring metrics in isolation can lead to sub-optimal choices. Increasingly, human evaluators are also engaged to ensure that there is a correlation between automatically calculated scores and human assessments.

This is because the scores are not always reliable, and human rankings can differ considerably from score-based rankings. Thus, the quality measurement process is expensive, slow, and prone to many procedural errors, and sometimes even deceptive tactics.

Some MT developers test on training data which can result in misleadingly high scores. (I know of a few who do this!) The optimization process described above is essentially how the large public MT portals develop their generic systems, where the primary focus is on acquiring the right data, using the best algorithms, and getting the highest (BLEU) or lowest (TER) scores.
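Checking for this kind of contamination is straightforward in principle. The sketch below is one simple way to do it, assuming exact matches after lowercasing and whitespace normalization; the function names and sample segments are hypothetical, and a real check would likely also look for near-duplicates.

```python
def normalize(segment):
    """Lowercase and collapse whitespace so trivial variants still match."""
    return " ".join(segment.lower().split())


def leaked_segments(train_sources, test_sources):
    """Return test segments that also appear in the training data."""
    seen = {normalize(s) for s in train_sources}
    return [s for s in test_sources if normalize(s) in seen]


def leakage_ratio(train_sources, test_sources):
    """Share of the test set that the model has already seen."""
    if not test_sources:
        return 0.0
    return len(leaked_segments(train_sources, test_sources)) / len(test_sources)


train = ["The invoice is overdue.", "Restart the device."]
test = ["restart the  device.", "Open the settings menu."]

leaked = leaked_segments(train, test)
ratio = leakage_ratio(train, test)
print(leaked, ratio)
```

Any score reported on a test set with a non-trivial leakage ratio says more about memorization than about translation quality on new content.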

Enterprise Buyers and LSPs

Enterprise Buyers and LSPs usually have different needs and objectives. They are more likely to be interested in understanding which English-to-French system is the “best” among five or more commercially available MT systems under consideration.

Using automated scores like BLEU, hLepor, and TER does not make as much sense in this context. The typical enterprise/LSP is also interested in understanding which system can be “best” modified to learn enterprise terminology and language style.

Optimization around enterprise content and subject domain matters much more, and a comparison of generic (stock) systems can often be useless in the considered professional use context.

Many forget that many business problems require a combination of both MT and human translation to achieve the required level of output quality. Thus, a tightly linked human-in-the-loop (HITL) process to drive MT performance improvements has increasingly become a key requirement for most enterprise MT use cases.

Third-party consultants have compared generic (stock or uncustomized) engines and ranked MT solutions using a variety of test sets that may or may not be relevant to a buyer. These rankings are then often used to dynamically select different MT systems for different languages, but it is possible, and even likely, that the resulting choices are sub-optimal.

The ease, speed, and cost of tuning and adapting a generic (stock) MT system to enterprise content, terminology, and language style matter much more in this context, and comparisons should only be made after determining this aspect.

However, as generic system comparisons are much easier and less costly to do, TMS systems and middleware that select MT systems using these generic evaluation test scores often make choices based on irrelevant and outdated data, and these choices can thus be sub-optimal. This is a primary reason that so many LSP systems perform so poorly and why MT is so underutilized in this sector.

While NMT continues to gain momentum as the average water level keeps rising, there is still a great deal of naivete and ignorance in the professional translation community about MT quality assessment and MT best practices in general. The enterprise/LSP use of MT is much more demanding in terms of focused accuracy and sophistication in techniques, practices, and deployment variability, and few LSPs are capable or willing to make the investments needed to achieve ongoing competence as the state-of-the-art (SOTA) continues to evolve.

Dispelling MT Quality Misconceptions

1) Google has the “best” MT systems

This is one of the most widely held misconceptions. While Google does have excellent generic systems and broad language coverage, it is not accurate to say that they are always the best.

Google MT is complicated and expensive to customize for enterprise use cases, and there are significant data privacy and data control issues to be navigated.  Also, because Google has so much data underlying their MT systems, they are not easily customized by the relatively meager data volumes that most enterprises or LSPs have available. DeepL is often a favorite of translators, but also has limited customization and adaptation options.

ModernMT is a breakthrough neural MT system that is dynamically adaptive and continuously learning. As it is possibly the only MT system that learns and improves with every instance of corrective feedback in real-time, a comparative snapshot based on a static system is even less useful.

A properly implemented ModernMT system will improve rapidly with corrective feedback, and easily outperform generic systems on the enterprise-specific content that matters most. Enterprise needs are more varied, and rapid adaptability, data security, and easy integration into enterprise IT infrastructure typically matter most.

2) MT Quality ratings are static & permanent

MT systems managed and maintained by experts are updated frequently and thus snapshot comparisons are only true for a single test set at a point in time. These scores are a very rough historical proxy for overall system quality and capability, and deeper engagement is needed to better understand system capabilities.

For example, to make proper assessments with ModernMT, it is necessary to actively provide corrective feedback to see the system improve exactly on the content that you are most actively translating now. If multiple editors concurrently provide feedback, ModernMT will improve even faster. These score-based rankings do not tell you how responsive and adaptive an MT system is to your unique data.

TMS systems that switch to different MT systems via API for each language are of dubious value since selections are often based on static and outdated scores. Best practice suggests that efforts to improve an MT system's adaptation to enterprise content, domain, and language style yield higher value than MT system selection based on scores embedded in TMS systems and middleware.

3) MT quality ratings for all use cases are the same.

The MT quality discussion needs to evolve beyond targeting linguistic perfection as the final goal, or comparison of BLEU, TER, or hLepor scores, and proximity to human translation.

It is more important to measure the business impact and make more customer-relevant content multilingual across global digital interactions at scale. While it is always good to get as close to human translation quality as possible, this is simply not possible with the huge volumes of content that are being translated today.

There is evidence now that shows that for many eCommerce use scenarios, even gist translations that contain egregious linguistic errors can produce a positive business impact. In information triage scenarios typical in eDiscovery (litigation, pharmacovigilance, national security surveillance) the translation needs to be accurate on key search parameters but not on all the text. Translation of user-generated content (UGC) is invaluable to improving and understanding the customer experience and is also a primary influence on new purchase activity. None of these scenarios requires MT output of perfect linguistic quality to have a positive business impact and drive successful customer engagement.

4) The linguistic quality of MT output is the only way to assess the “best” MT system.

The linguistic quality of MT output is only one of several critical criteria needed for robust evaluation for an enterprise/LSP buyer. Enterprise requirements like the ease and speed of customization to the enterprise domain, data security and privacy, production MT system deployment options, integration into enterprise IT infrastructure, and overall MT system manageability and control also need to be considered.

Given that MT is rapidly becoming an essential tool for a globally agile enterprise, we need new ways to measure the quality and value of MT in global CX scenarios. In the scenarios where MT enables better communication, information sharing, and understanding of customer concerns on a global scale, we need new ways to measure success.  A closer examination of business impact reveals that the metrics that matter the most would be:

  • Increased global digital presence and footprint
  • Enhanced global communication and collaboration
  • Rapid response in all global customer service/support scenarios
  • Productivity improvement in localization use cases to enable more content to be delivered at higher quality
  • Improved conversion rates in eCommerce

And ultimately the measure that matters at the executive level is the measurably improved customer experience of every customer in the world. 

This is often more a result of process and deployment excellence than the reported semantic similarity scores of any individual MT system.

The reality today is that increasingly larger volumes of content are being translated and used with minimal or no post-editing.  The highest impact MT use cases may only post-edit a tiny fraction of the content they translate and distribute.

However, much of the discussion in the industry today still focuses on post-editing efficiency and quality estimation processes that assume all the content will be post-edited.

It is time for a new approach that easily enables tens of millions of words to be translated daily, in continuously learning MT systems that improve by the day and enable new communication, understanding, and collaboration with globally distributed stakeholders.

In the second post in this series, we will dig deeper into BLEU and other automated scoring methodologies and show why competent human assessments are still the most valuable feedback that can be provided to drive ongoing and continuous improvements in MT output quality.

Friday, June 11, 2021

Close Call - Observations on Productivity, Talent Shortages, & Human Parity MT

This is a guest post by Luigi Muzii, a frequent contributor to this blog. I wanted to make sure I had a chance to re-publish his thoughts on the MT human parity issue before he withdraws from blogging, and hopefully, this is not his last contribution. He has been a steady and unrelenting critic of many translation industry practices, mostly, I think, with the sincere hope of driving evolution and improvement in business practices. To my mind, his criticism always had the underlying hope that business processes and strategies in the translation industry would evolve to look more like other industries where service work is more respected and acknowledged, or more closely aligned to the business mission needs of clients. His acerbic tone and dense writing style have been criticized, but I have always appreciated his keen observation and unabashed willingness to expose bullshit, overused cliches, and platitudes in the industry. There is just too much Barney-love in the translation industry. Even though I don't always agree with him, it is refreshing to hear a counter opinion that challenges the frequent self-congratulation that we also see in this industry.

When I first came to the translation industry from the mainstream IT industry, I noticed that people in the industry were more world-wise, cultured, and even gentler than most I had encountered in the IT industry. However, the feel-good vibe engendered by the multicultural sensitivity also sustains a cottage industry character in the processes, technology, and communication style of this industry. People are much more tolerant of inefficiency and sub-optimal technology use. I noticed this especially from the technology viewpoint, as I entered the industry as a spokesperson for Language Weaver, which was an MT pioneer with data-driven MT technology, the first wave of "machine learning". I was amazed by the proliferation of shoddy in-house TMS systems and the insistence on keeping these mostly second-rate systems running. When a group of more professionally developed TMS systems emerged, these TMS vendors struggled to convince key players to adopt the improved technology. It is amazing that even companies that reach hundreds of millions of dollars in annual revenue still have processes and technology use profiles of late-stage cottage industry players. Even Jochen Hummel, the inventor of Trados (TM), has expressed surprise that a technology he developed in the 1980s is still around, and has stated openly that it should properly be replaced by some form of NMT!

The resistance to MT is a perfect example of a missed opportunity. Instead of learning to use it better, in a more integrated, knowledgeable, and value-adding way for clients, it has become another badly used tool whose adoption struggles along, and MT use is most frequently associated with inflicting pain and low compensation on the translators forced to work with these sub-optimal systems.

In an era where trillions of words are being translated by MT daily in public MT portals, the chart above should properly be titled "Clueless with MT". I would also change it to N=170 LSPs that don't know how to use MT. Most LSPs who claim to "do MT", even the really large ones, in fact do it really badly. The Translated - ModernMT deployment is, in my opinion, one of the very few examples of how to do MT right for the challenging localization use case. It is also the ONLY LSP user scenario I know where MT is used in 90% or more of all translation work done by the LSP. Why? Because it CONSISTENTLY makes work easier and more efficient, and, most importantly, translators consistently ask for access to the rapidly learning ModernMT systems. Rather than BLEU scores, a production scenario where translators regularly and fervently ask for MT access is the measure of success. It can only happen with superior engineering that understands and enhances the process. It also means that this LSP can process thousand-word projects with the same ease as it can process billions of words a month, and scale easily to trillions of words if needed. In my view, this is a big deal and that is what happens when you use technology properly. It is no surprise that most of the largest MT deployments in the world outside of the major Public MT Portals (eCommerce, OSI, eDiscovery) have little to no LSP involvement. Why would any sophisticated global enterprise be motivated to bring in an LSP that offers nothing but undifferentiated project management, dead-end discussions on quality measurement, and a decade-long track record of incompetent technology use?

Expert MT use is a result of the right data, the right process, and ML algorithms which are now commoditized. In the localization space, the "right" process is particularly important. “Like much of machine intelligence, the real genius [of deep learning] comes from how the system is designed, not from any autonomous intelligence of its own. Clever representations, including clever architecture, make clever machine intelligence,” Roitblat writes. I think it is fair to say that most MT use in the translation industry does not reach the level of "clever machine intelligence". It follows that most translation industry MT use projects would qualify as sub-optimal machine intelligence.

This, I felt, was a fitting introduction to Luigi's post. I hope he shows up once in a while in the future, as I don't know many others who are as willing as he is to point out "areas of improvement" for the community.



The Productivity Paradox

Economists have argued for decades that massively investing in office technologies would enormously boost productivity. However, already in 1994 authoritative studies had cast doubts on the reliability of certain projections. Recent studies reported that a 12 percent annual increase in the data processing budgets for U.S. corporations has yielded annual productivity gains of less than 2 percent.

The reason these gains have been much smaller than expected may lie in long-established business practices that hold them back by preventing knowledge workers from taking full advantage of ever-better tools to boost productivity, which proves the significance of the law of the instrument.

Therefore, to achieve the expected increases in productivity most business practices should change.

Word Rates v. Hour Rates

Translation pay has been based on per-word rates for over thirty years. The reasons are basically twofold. On the one hand, computer-aided translation tools have finally enabled buyers to understand (more or less) precisely what they have been paying for. On the other hand, computer-aided translation tools have made it possible to measure throughput and productivity (almost) objectively, thus helping statistics and projections.

Add to that the ability for buyers to request discounts based on the percentage of matches between a text and a translation memory and it instantly becomes obvious that it is not the translator’s time, expertise, or skills that they are buying and paying for.
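To make the discount mechanism concrete, here is a minimal sketch of how a fee might be computed from a translation memory analysis. The match bands, the discount factors, and the per-word rate are all illustrative assumptions, not figures from any real price list or from the author.

```python
# Hypothetical discount grid: share of the full per-word rate paid per match band.
# Bands and factors are illustrative assumptions, not industry-standard values.
RATE_FACTORS = {
    "exact": 0.25,       # 100% translation memory matches
    "fuzzy_high": 0.50,  # high fuzzy matches
    "fuzzy_low": 0.75,   # low fuzzy matches
    "new": 1.00,         # no usable match, paid at the full rate
}


def weighted_word_count(counts):
    """Convert raw word counts per match band into 'payable' words."""
    return sum(words * RATE_FACTORS[band] for band, words in counts.items())


def job_fee(counts, rate_per_word):
    """Fee for a job, given raw counts per band and a full per-word rate."""
    return round(weighted_word_count(counts) * rate_per_word, 2)


# A 10,000-word job as it might come out of a TM analysis (made-up numbers).
counts = {"exact": 2000, "fuzzy_high": 1500, "fuzzy_low": 500, "new": 6000}
payable_words = weighted_word_count(counts)
fee = job_fee(counts, rate_per_word=0.10)
```

With these assumed numbers, the 10,000 words in the file shrink to 7,625 payable words. Nothing in the calculation refers to the time, expertise, or skill involved, which is the point made above.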

Nevertheless, a translation assignment/project inevitably ends up involving a series of collateral tasks whose fee cannot be computed on a per-word basis.

The price LSPs charge buyers, then, includes the price for services for which they then pay vendors on a different basis. Similarly, in setting their own fees, these vendors include the compensation for non-productive or non-remunerative tasks. The word-rate fee, then, is also based on the time required to complete a certain task. In short, this means that even the conundrum of measured fees (word rate and hourly rate) v. fixed fees is pointless. The moment the parties agree on how to compute the fee, only measuring is left open. And when it comes to statistics and projections, this is of more interest to the supplier—specifically the middleman—than the buyer.
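The claim that a word rate is ultimately a time-based fee can be shown with a small back-of-the-envelope calculation. This is only a sketch: the function name, the target hourly income, the throughput, and the share of billable time are hypothetical values chosen for illustration.

```python
def word_rate_from_hourly(target_hourly, words_per_hour, billable_share):
    """Per-word rate needed to earn target_hourly across ALL working hours.

    billable_share is the fraction of working time actually spent on paid
    translation; the rest goes to collateral, non-remunerated tasks.
    All inputs are hypothetical illustration values.
    """
    if words_per_hour <= 0:
        raise ValueError("words_per_hour must be positive")
    if not 0 < billable_share <= 1:
        raise ValueError("billable_share must be in (0, 1]")
    return target_hourly / (words_per_hour * billable_share)


# If every hour were billable, 40 per hour at 400 words per hour is 0.10 per word.
rate_all_billable = word_rate_from_hourly(40, 400, 1.0)

# If only 80% of the time is billable, the same income requires a higher word rate.
rate_with_overhead = word_rate_from_hourly(40, 400, 0.8)
```

Under these assumptions the word rate rises from 0.10 to 0.125 once non-billable time is priced in, so cutting non-productive tasks leaves room either for a lower price or for a better margin.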

Reducing non-productive tasks would not only allow regaining margins and cutting the selling price, but would also free up resources to allocate to increasing efficiency through automation, and thus ultimately to productivity itself.

If anything, now more than ever, it is necessary to foster standardization and reach an agreement on reference models, metrics, and methods of measurement. The resulting standardization of exchange formats, data models, and metrics would help productivity and interoperability.

In fact, some tasks, like file preparation or, more precisely, the assembly of localization packages and kits, cannot be fully automated or stripped out of the translation/localization workflow, although they are indeed separate jobs. In this respect, standardization might also help automate such tasks. Nevertheless, when extensive and time-consuming, these tasks should be the buyer’s responsibility. Incidentally, given the traditionally poor consideration of buyers for the translation industry and their insufficient understanding of translation and localization and the related workflow, most of the problems associated with project setup and file preparation are attributable to sloppiness and immaturity. This includes job instructions requiring project teams to spend time reading through them.

On the other hand, some of these tasks, like quotation, are commonly part of project tasks while they should not be. So, for example, when quotations are formulated at the sales stage, any subsequent task relating to them can be (at least partially) automated. The same goes for instructions that might become mandatory workflow steps (when platforms allow for custom workflows) and checklists to run.

Skill, Labor Shortages, and Education

Here are a few questions for those who have designed or taught, or are now designing or teaching, translation and localization courses:

  • Have your lectures ever dealt with style guides and job instructions for students to learn how to follow them? 
  • Have you ever included in your assessments the degree of compliance with style guides and instructions during exams?

Customers and LSPs, to the same extent, have always been complaining about the lack of qualified language professionals.

At the TAUS Industry Summit 2017, Bodo Vahldieck, Sr. Localization Manager at VMware, expressed his frustration at not being able to find young talent willing and able to go and work with the “fantastic localization technology suites” at his company.

Sometime earlier, CommonSense Advisory had also raised the alarm on the talent shortage in the language service industry.

Even earlier, Inger Larsen, Founder & MD at Larsen Globalization Recruitment, a recruitment company for the translation industry, wrote an article titled "Why we still need more good translators", reporting the outcome of a small informal poll showing that the failure rate for translators taking professional test translations was about 70 percent, although they all were qualified translators, many of them with quite a lot of experience.

The talent shortage is no news, then, and lately many companies in other industries have been reporting hiring troubles. Apparently, Gresham’s Law [an economic principle commonly stated as "Bad money drives out good"] is ruling everywhere, not just in the translation space.

Actually, the labor shortage is a myth. The complaints of Domino’s Pizza CEO, Uber, and other companies are insubstantial because the simplest way to find enough labor is by offering higher wages. In doing so, new workers will enter the market and any labor shortages will quickly end. A rare case of true labor shortage in a free economy is when wages are so high that businesses cannot afford to pay them without going broke. But this would be like the dot-com bubble that led an entire economy to collapse.

Therefore, such complaints are most likely a sign that corporate executives have grown so accustomed to a low-wage economy that they believe anything else is abnormal.

But when bad resources have driven out good ones altogether, offering higher wages might not be enough and presents the risk of overpaying; even more so if the jobs available are very low-profile and can hardly be automated.

Interestingly, as part of a more comprehensive study, Citrix recently conducted a survey from which three key priorities emerged for knowledge workers:

  1. Complete flexibility in hours and location
    This means that, in response to skill shortages and to position themselves to win in the future, companies will have to leverage flexible work models and meet employees where they are. And yet, many still seem to be on a different path.
  2. Different productivity metrics
    Traditional productivity metrics will have to address the value delivered, not the volume i.e., companies will have to prioritize outcomes over output. Surprisingly, many companies claim this is already how they operate.
  3. Diversity
    A diverse workforce will become even more important as roles, skills, and company requirements change over time, although this will challenge current productivity metrics even further.

Machines Do Not Raise Wage Issues

If the linear decrease of pay in the face of the exponential growth of translation demand is puzzling, it is because we are accustomed to the fundamental market law: When demand increases, prices rise. But the technology lag that educational institutions and industry players generally show compared with other industries and, most importantly, with clients means that even the best resources do not keep up with productivity expectations, regardless of whether these are more or less reasonable. Also, the common failure of LSPs to differentiate, maximize efficiency, and reduce costs leads them to compete on price alone, which only exacerbates the situation, making translation and localization a commodity. Finally, the all too often unreasonable demands of LSPs, even more unreasonable than those of their customers, have been driving the best resources out of the industry. It is a vicious circle that makes productivity a myth and an illusion.

Productivity is a widely discussed subject that has got even more attention during the pandemic. As David J. Lynch recently put it in The Washington Post, “Greater productivity is the rare silver lining to emerge from the crucible of covid-19”. This eventually has kick-started a turn to automation, which is gradually spreading through structural shifts that will further spur it.

Lynch also pointed out that, assuming and not conceding that labor shortages actually exist and are a problem, after helping businesses survive, automation will help them attract labor to meet surging demand.

There is a general understanding that, during the pandemic, firms became more productive and learned to do more with less, even though, in this respect, the effect of technology has been fairly marginal, and less than that from purely organizational measures.

Anyway, according to a McKinsey study, investments in new technologies are going to accelerate through 2024 with an expectation of significant productivity growth. That is because automation is generally understood as different from office technologies or, more likely, because the organizational measures above are more challenging, cost more and are less tax-efficient. Or maybe because more and more businesses complaining of labor shortages are convinced that automation will allow them to fill orders they otherwise would have to turn down.

After all, this is exactly the approach of LSPs towards machine translation and even more so post-editing. But automation understood in this way is limited and distorted, and leads to an exacerbation of the effects of Gresham’s Law. On the other hand, many translators are still quite unconvinced by machine translation and see it as only slightly useful. This is due mostly to the negative policies of most LSPs and their widespread attitude towards automation, machine translation, and technology at large, which have repeatedly exposed LSPs and their vendors to the deadly effects of incompetently implemented and deployed machine translation systems whose only objective is to try and reduce translator compensation and safeguard margins.

Playing with Grown-ups

Experienced customers know that machine translation is no panacea [for translation challenges] and does not come cheap. True, online machine translation engines are free, but they are not suitable for business or professional use, requiring experienced linguists to exploit them for professional use. A corporate machine translation platform requires a substantial initial investment, plus specific know-how and resources, including a proper (substantial) amount of quality data to train models. Most importantly, it requires time and patience, which are traditionally a rare commodity in today’s business world.

The most coveted achievement of any LSP is to play in the same league as grown-ups, but grown-ups do not want to play with LSPs once they get to know them and learn that LSPs cannot help them find the best-suited machine translation system, or implement, train, and tune it, because they do not have the necessary know-how, ability, and resources. For the same reasons, they know they cannot outsource their machine translation projects to the LSPs themselves, no matter how hard these push their services in this field too.

Disenchantment, if not skepticism or outright distrust, is the consequence of LSPs not being attuned to the needs of clients, especially the bigwigs (the grown-ups), and the resulting lack of integration with their processes. Then again, clients have always been asking for understanding and integration and what have they got in response? A pointless post-editing standard.

LSPs are losing the continuous localization battle too. Rather than adjusting processes to the customer’s modus operandi, LSPs, and their reference consultants, blame customers for demanding that localization teams keep up with code and content as these are developed, before deployment. On the other hand, rather than streamlining their processes, LSPs try and stick hopelessly to the traditional clumsy ones. No wonder customers have trouble trusting LSPs.

Apparently, in fact, many LSPs are concerned about the effects of continuous localization on linguistic quality, when the kind of quality LSPs are accustomed to is exactly what they should forget. Not for nothing, a basic rule in the Agile model consists of using every new iteration to correct the errors made in the previous one.

If anything, it is odd that machine translation has not become predominant already and that clients and, more importantly, LSPs insist on maintaining working and payment models that are, to say the least, obsolete.

What if, for example, the idea around quality rapidly changes, and customer experience becomes the new paradigm?

This would reinforce the basis for wide-ranging service level agreements to cover a stable buyer-vendor relationship, first on the client-LSP side and then on the LSP-vendor side, with international payments going through a platform enabling the buyer to pay vendors in their preferred local currency. A clause in the agreement may require that the payees sign up with the platform and input their banking details and preferred currency.

Payment platforms already exist that allow clients to qualify for custom (flat) rates by submitting a pricing assessment form, and that connect with other systems through a web API translator via no-code applets based on an IFTTT (If This Then That) mechanism.

Payments are not easy, but they are worth getting right because they are the sore point paving the way for Gresham’s Law.

Perverted Debates

If the debate around rates and payments has never gone past the stage of rants and complaints, the one around quality has been intoxicating the translation space for years without leading to any significant outcome. Yet these debates still produce tons of academic publications around the same insubstantial fluff and generate thousands of lines of code just to keep repeating the same mistakes.

As long as machine translation was a subject confined to specialists, relatively objective metrics and models ruled the quality assessment process with the goal of improving the technology and the assessment metrics and models themselves.

After entering the mainstream, a few years ago, machine translation became marketing prey. Marketing people at machine translation companies started targeting translation industry players with improvements in automated evaluation metrics, typically BLEU, and the public with claims of "human parity". [And also the increasing use of bogus MT quality rankings done by third parties.]

Both are smoke and mirrors, though. On the one hand, automated metrics are no more than just the scores they deliver, and their implications are hard to grasp; also, they have been showing all their limitations with Neural MT models. On the other hand, no one has bothered to offer a consistent, unambiguous, and indisputable definition of ‘human parity’ other than the ones from the companies bragging they have achieved it.

Saying that machine translation output is “nearly indistinguishable from” or “equivalent to” a human translation is misleading and means almost nothing. Saying that a machine has achieved human parity if “There is no statistically significant difference between human quality scores for a test set of candidate translations from a machine translation system and the scores for the corresponding human translations” may sound more exhaustive and accurate, but comparisons depend anyway on the characteristics of input and output and on the conditions for comparison and evaluation.
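To see what the quoted definition amounts to in practice, here is a minimal sketch of one possible way to check for a "statistically significant difference" between paired human quality scores. It uses a sign-flip permutation test, which is my choice for illustration; the parity claims discussed here may rest on other tests, and the function name, the scores, and the 0.05 threshold are all assumptions.

```python
import random


def paired_permutation_test(mt_scores, human_scores, n_resamples=10000, seed=0):
    """Two-sided sign-flip permutation test on paired per-segment scores.

    Returns an approximate p-value for the hypothesis that the mean difference
    between MT and human translation scores is zero on THIS test set.
    """
    if len(mt_scores) != len(human_scores):
        raise ValueError("scores must be paired segment by segment")
    rng = random.Random(seed)
    diffs = [m - h for m, h in zip(mt_scores, human_scores)]
    observed = abs(sum(diffs) / len(diffs))
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            extreme += 1
    return (extreme + 1) / (n_resamples + 1)


# Made-up scores: MT rated consistently lower than the human translation.
p_different = paired_permutation_test([60] * 20, [80] * 20)

# Made-up scores: identical ratings, so no difference can be detected.
p_same = paired_permutation_test([70, 80, 90], [70, 80, 90])
```

A p-value above the chosen threshold only means that no difference was detected with these raters, on this test set, in this language pair. A small or unrepresentative sample will produce the same outcome, which is why the conditions of the comparison matter more than the headline.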

In other words, the questions to answer are, “Is every human capable of translating in any language pair? Can any human produce a translation of equivalent quality in any language pair? Can any human translate better than machines in any language pair?” And vice versa.

All too often, people, even in the professional translation space, tend to forget that machine translation is a narrow-AI application i.e., it focuses on one narrow task, with each language pair being a separate task. In other words, the singularity that would justify making the claim of  "human parity" is still afar, and not just in time, so much for Ray Kurzweil’s predictions or Elon Musk’s confidence in Neuralink’s development of a universal language and brain chip.

Using automatic MT quality scores as a marketing lever is therefore misleading because there are too many variables at play. Talking about "human parity" is misleading too because one should consider the conditions under which the assessment leading to certain statements has been conducted.

Now, it is quite reasonable for a client to ask a partner (as LSPs like to think of themselves) to help them correctly and fully interpret machine translation scores and certain catchphrases that may sound puzzling for vagueness or ambiguity.

Most clients, the largest ones anyway, are in a different league in terms of organizational maturity than their language service providers, and cannot understand the reason for the sloppiness and inefficiency they see in these would-be partners. And yet it is quite simple: The traditional, still common translation process model they follow is not sustainable even for mission-critical content. Incidentally, this brings us back to productivity, payments, Gresham’s law, and skill and labor shortages, all interrelated.

Not only are leaner, faster, and more efficient processes more necessary than ever, but a mutual understanding is also crucial. To help customers understand translation products and services, and value them accordingly, the people in this industry should drop the often obfuscating jargon that no client is interested in or willing to learn and decipher. Is this jargon part of the notorious information asymmetry?

A greater and more honest self-assessment is necessary, which the industry is, instead, dramatically lacking at all levels. And this possibly explains the greater interest in the machine translation market and industry rather than in the translation industry.



Luigi Muzii has been in the "translation business" since 1982 and has been a business consultant since 2002, in the translation and localization industry through his firm. He focuses on helping customers choose and implement best-suited technologies and redesign their business processes for the greatest effectiveness of translation and localization-related work.

This link provides access to his other blog posts.