Friday, June 11, 2021

Close Call - Observations on Productivity, Talent Shortages, & Human Parity MT

This is a guest post by Luigi Muzii, a frequent contributor to this blog. I wanted to make sure I had a chance to re-publish his thoughts on the MT human-parity issue before he withdraws from blogging, and hopefully this is not his last contribution. He has been a steady and unrelenting critic of many translation industry practices, mostly, I think, with the sincere hope of driving evolution and improvement in business practices. To my mind, his criticism has always carried the underlying hope that business processes and strategies in the translation industry would evolve to look more like those of other industries where service work is more respected and acknowledged, or would align more closely with the business-mission needs of clients. His acerbic tone and dense writing style have been criticized, but I have always appreciated his keen observation and unabashed willingness to expose bullshit, overused clichés, and platitudes in the industry. There is just too much Barney-love in the translation industry. Even though I don't always agree with him, it is refreshing to hear a counter-opinion that challenges the frequent self-congratulation we also see in this industry.

When I first came to the translation industry from the mainstream IT industry, I noticed that people in the industry were more worldly, cultured, and even gentler than most I had encountered in IT. However, the feel-good vibe engendered by this multicultural sensitivity also sustains a cottage-industry character in the processes, technology, and communication style of the industry. People are much more tolerant of inefficiency and sub-optimal technology use. I noticed this especially from the technology viewpoint, as I entered the industry as a spokesperson for Language Weaver, an MT pioneer in data-driven MT technology, the first wave of "machine learning". I was amazed by the proliferation of shoddy in-house TMS systems and the insistence on keeping these mostly second-rate systems running. When a group of more professionally developed TMS systems emerged, their vendors struggled to convince key players to adopt the improved technology. It is amazing that even companies that reach hundreds of millions of dollars in annual revenue still have the processes and technology-use profiles of late-stage cottage-industry players. Even Jochen Hummel, the inventor of Trados (TM), has expressed surprise that a technology he developed in the 1980s is still around, and has stated openly that it should properly be replaced by some form of NMT!

The resistance to MT is a perfect example of a missed opportunity. Instead of learning to use it better, in a more integrated, knowledgeable, and value-adding way for clients, it has become another badly used tool whose adoption struggles along, and MT use is most frequently associated with inflicting pain and low compensation on the translators forced to work with these sub-optimal systems.

In an era where trillions of words are being translated by MT daily in public MT portals, the chart above should properly be titled "Clueless with MT". I would also change it to N=170 LSPs that don't know how to use MT. Most LSPs who claim to "do MT", even the really large ones, in fact do it really badly. The Translated - ModernMT deployment is, in my opinion, one of the very few examples of how to do MT right for the challenging localization use case. It is also the ONLY LSP user scenario I know of where MT is used in 90% or more of all translation work done by the LSP. Why? Because it CONSISTENTLY makes work easier and more efficient, and, most importantly, translators consistently ask for access to the rapidly learning ModernMT systems. Rather than BLEU scores, a production scenario where translators regularly and fervently ask for MT access is the measure of success. It can only happen with superior engineering that understands and enhances the process. It also means that this LSP can process thousand-word projects with the same ease as it processes billions of words a month, and can scale easily to trillions of words if needed. In my view, this is a big deal, and it is what happens when you use technology properly. It is no surprise that most of the largest MT deployments in the world outside of the major public MT portals (eCommerce, OSI, eDiscovery) have little to no LSP involvement. Why would any sophisticated global enterprise be motivated to bring in an LSP that offers nothing but undifferentiated project management, dead-end discussions on quality measurement, and a decade-long track record of incompetent technology use?

Expert MT use is the result of the right data, the right process, and ML algorithms, which are now commoditized. In the localization space, the "right" process is particularly important. "Like much of machine intelligence, the real genius [of deep learning] comes from how the system is designed, not from any autonomous intelligence of its own. Clever representations, including clever architecture, make clever machine intelligence," Roitblat writes. I think it is fair to say that most MT use in the translation industry does not reach the level of "clever machine intelligence". It follows that most translation-industry MT projects would qualify as sub-optimal machine intelligence.

This, I felt, was a fitting introduction to Luigi's post. I hope he shows up once in a while in the future, as I don't know many others who are as willing to point out "areas of improvement" for the community as he is.



The Productivity Paradox

Economists have argued for decades that massive investment in office technologies would enormously boost productivity. However, as early as 1994, authoritative studies had cast doubt on the reliability of certain projections. Recent studies reported that a 12 percent annual increase in the data-processing budgets of U.S. corporations has yielded annual productivity gains of less than 2 percent.

The reason those gains have been so much smaller than expected may lie in long-established business practices that hold productivity back by restraining knowledge workers from taking full advantage of ever-better tools, proving the significance of the law of the instrument.

Therefore, to achieve the expected increases in productivity, most business practices should change.

Word Rates v. Hour Rates

Translation pay has been based on per-word rates for over thirty years. The reasons are basically twofold. On the one hand, computer-aided translation tools have finally enabled buyers to understand (more or less) precisely what they are paying for. On the other hand, those same tools have made it possible to measure throughput and productivity (almost) objectively, thus aiding statistics and projections.

Add to that the ability for buyers to request discounts based on the percentage of matches between a text and a translation memory and it instantly becomes obvious that it is not the translator’s time, expertise, or skills that they are buying and paying for.

Nevertheless, a translation assignment or project inevitably ends up involving a series of collateral tasks whose fee cannot be computed on a per-word basis.

The price LSPs charge buyers, then, includes the price for services for which they then pay vendors on a different basis. Similarly, in setting their own fees, these vendors include the compensation for non-productive or non-remunerative tasks. The word-rate fee, then, is also based on the time required to complete a certain task. In short, this means that even the conundrum of measured fees (word rate and hourly rate) v. fixed fees is pointless. The moment the parties agree on how to compute the fee, only measuring is left open. And when it comes to statistics and projections, this is of more interest to the supplier—specifically the middleman—than the buyer.
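To see how a word-based fee implicitly folds in the time spent on collateral tasks, consider a toy calculation. All rates, match bands, and discount percentages below are invented for illustration, not an actual pricing grid:

```python
# Hypothetical TM-match discount grid (percentages are invented)
DISCOUNTS = {"new": 1.00, "50-74%": 0.90, "75-99%": 0.60, "100%": 0.30}

def job_fee(word_counts, base_rate):
    """Total fee: words in each match band times the discounted per-word rate."""
    return sum(words * base_rate * DISCOUNTS[band]
               for band, words in word_counts.items())

def implied_hourly(fee, productive_hours, overhead_hours):
    """Spread the word-based fee over ALL hours, including unpaid collateral tasks."""
    return fee / (productive_hours + overhead_hours)

counts = {"new": 2000, "50-74%": 500, "75-99%": 1000, "100%": 500}
fee = job_fee(counts, base_rate=0.10)  # 0.10 per new word (hypothetical)
print(f"fee: {fee:.2f}")                                               # 320.00
print(f"implied hourly: {implied_hourly(fee, 6, 2):.2f}")              # 40.00
```

The second function makes the essay's point concrete: once the parties agree on how the fee is computed, the word rate and the effective hourly rate are just two views of the same number, and unpaid overhead hours silently dilute it.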

Not only would reducing non-productive tasks allow for regaining margins and cutting the selling price, but also for regaining productivity and resources to allocate for increasing efficiency through automation, thus ultimately productivity itself.

If anything, now more than ever, it is necessary to foster standardization and reach an agreement on reference models, metrics, and methods of measurement. The resulting standardization of exchange formats, data models, and metrics would help productivity and interoperability.

In fact, some tasks, like file preparation or, more precisely, the assembly of localization packages and kits, cannot be fully automated or stripped out of the translation/localization workflow, although they are indeed separate jobs. In this respect, standardization might also help automate such tasks. Nevertheless, when they are extensive and time-consuming, these tasks should be the buyer's responsibility. Incidentally, given buyers' traditionally poor regard for the translation industry and their insufficient understanding of translation and localization and the related workflow, most of the problems associated with project setup and file preparation are attributable to sloppiness and immaturity. This includes job instructions that require project teams to spend time reading through them.

On the other hand, some of these tasks, like quotation, are commonly treated as project tasks when they should not be. If, for example, quotations are formulated at the selling stage, any subsequent task relating to them can be (at least partially) automated. The same goes for instructions that might become mandatory workflow steps (when platforms allow for custom workflows) and for checklists to run.

Skill, Labor Shortages, and Education

Here are a few questions for those who have designed or design, have held, or hold translation and localization courses: 

  • Have your lectures ever dealt with style guides and job instructions for students to learn how to follow them? 
  • Have you ever included in your assessments the degree of compliance with style guides and instructions during exams?

Customers and LSPs alike have always complained about the lack of qualified language professionals.

At the TAUS Industry Summit 2017, Bodo Vahldieck, Sr. Localization Manager at VMware, expressed his frustration at not being able to find young talent willing and able to go and work with the “fantastic localization technology suites” at his company.

Sometime earlier, Common Sense Advisory had also raised the alarm about the talent shortage in the language services industry.

Even earlier, Inger Larsen, Founder & MD of Larsen Globalization Recruitment, a recruitment company for the translation industry, wrote an article titled Why we still need more good translators, reporting on the outcome of a small informal poll: the failure rate for translators taking professional test translations was about 70 percent, even though all were qualified translators, many of them with quite a lot of experience.

The talent shortage is nothing new, then, and lately many companies in other industries have been reporting hiring troubles. Apparently, Gresham's Law [an economic principle commonly stated as "bad money drives out good"] rules everywhere, not just in the translation space.

Actually, the labor shortage is a myth. The complaints of Domino's Pizza's CEO, Uber, and other companies are insubstantial, because the simplest way to find enough labor is to offer higher wages. When that happens, new workers enter the market and any labor shortage quickly ends. The rare case of a true labor shortage in a free economy is when wages are so high that businesses cannot afford to pay them without going broke. But that would be like the dot-com bubble that led an entire economy to collapse.

Therefore, such complaints are most probably a sign that corporate executives have grown so accustomed to a low-wage economy that they believe anything else is abnormal.

But when bad resources have driven out good ones altogether, offering higher wages might not be enough and carries the risk of overpaying; even more so if the jobs available are very low-profile and can hardly be automated.

Interestingly, as part of a more comprehensive study, Citrix recently conducted a survey from which three key priorities emerged for knowledge workers:

  1. Complete flexibility in hours and location
    This means that, in response to skill shortages and to position themselves to win in the future, companies will have to leverage flexible work models and meet employees where they are. And yet, many still seem to be on a different path.
  2. Different productivity metrics
    Traditional productivity metrics will have to address the value delivered, not the volume, i.e., companies will have to prioritize outcomes over output. Surprisingly, many companies claim this is already how they operate.
  3. Diversity
    A diverse workforce will become even more important as roles, skills, and company requirements change over time, although this will challenge current productivity metrics even further.

Machines Do Not Raise Wage Issues

If the linear decrease in pay in the face of exponential growth in translation demand is puzzling, it is because we are accustomed to the fundamental market law: when demand increases, prices rise. But the technology lag that educational institutions and industry players generally show compared with other industries and, most importantly, with clients means that even the best resources do not keep up with productivity expectations, regardless of whether those expectations are reasonable. Also, the common failure of LSPs to differentiate, maximize efficiency, and reduce costs leads them to compete on price alone, which only exacerbates the situation, making translation and localization a commodity. Finally, the all-too-often unreasonable demands of LSPs, even more unreasonable than those of their customers, have been driving the best resources out of the industry. It is a vicious circle that makes productivity a myth and an illusion.

Productivity is a widely discussed subject that has received even more attention during the pandemic. As David J. Lynch recently put it in The Washington Post, "Greater productivity is the rare silver lining to emerge from the crucible of covid-19". This has kick-started a turn to automation, which is gradually spreading through structural shifts that will further spur it.

Lynch also pointed out that, assuming without conceding that labor shortages actually exist and are a problem, automation will, after helping businesses survive, help them attract labor to meet surging demand.

There is a general understanding that, during the pandemic, firms became more productive and learned to do more with less, even though, in this respect, the effect of technology has been fairly marginal, and less than that from purely organizational measures.

Anyway, according to a McKinsey study, investments in new technologies will accelerate through 2024, with an expectation of significant productivity growth. That is because automation is generally understood as different from office technologies or, more likely, because the organizational measures above are more challenging, cost more, and are less tax-efficient. Or maybe because more and more of the businesses complaining of labor shortages are convinced that automation will allow them to fill orders they would otherwise have to turn down.

After all, this is exactly the approach of LSPs toward machine translation and, even more so, post-editing. But automation so understood is limited and distorted, and it exacerbates the effects of Gresham's Law. On the other hand, many translators are still quite unconvinced by machine translation and see it as only marginally useful. This is due mostly to the negative policies of most LSPs and their widespread attitude toward automation, machine translation, and technology at large, which have repeatedly exposed LSPs and their vendors to the deadly effects of incompetently implemented and deployed machine translation systems whose only objective is to reduce translator compensation and safeguard margins.

Playing with Grown-ups

Experienced customers know that machine translation is no panacea [for translation challenges] and does not come cheap. True, online machine translation engines are free, but they are not suitable for business or professional use without experienced linguists to exploit them. A corporate machine translation platform requires a substantial initial investment, plus specific know-how and resources, including a proper (substantial) amount of quality data to train models. Most importantly, it requires time and patience, traditionally rare commodities in today's business world.

The most coveted achievement of any LSP is to play in the same league as the grown-ups, but the grown-ups do not want to play with LSPs once they get to know them and learn that LSPs cannot help them find, implement, train, and tune the best-suited machine translation system, because LSPs lack the necessary know-how, ability, and resources. For the same reasons, they know they cannot outsource their machine translation projects to the LSPs themselves, no matter how insistently LSPs offer their services in this field too.

Disenchantment, if not skepticism or outright distrust, is the consequence of LSPs not being attuned to the needs of clients, especially the bigwigs (the grown-ups), and of the resulting lack of integration with their processes. Then again, clients have always asked for understanding and integration, and what have they got in response? A pointless post-editing standard.

LSPs are losing the continuous-localization battle too. Rather than adjusting processes to the customer's modus operandi, LSPs, and their reference consultants, blame customers for demanding that localization teams keep up with code and content as they are developed, before deployment. Rather than streamlining their processes, LSPs cling hopelessly to their traditional, clumsy ones. No wonder customers have trouble trusting LSPs.

Apparently, in fact, many LSPs are concerned about the effects of continuous localization on linguistic quality, when the kind of quality LSPs are accustomed to is exactly what they should forget. Not for nothing does a basic rule of the Agile model consist of using every new iteration to correct the errors made in the previous one.

If anything, it is odd that machine translation has not become predominant already and that clients and, more importantly, LSPs insist on maintaining working and payment models that are, to say the least, obsolete.

What if, for example, the idea around quality rapidly changes, and customer experience becomes the new paradigm?

This would reinforce the base for wide-ranging service level agreements to cover a stable buyer-vendor relationship, first on the client-LSP side and then on the LSP-vendor side, with international payments going through a platform enabling the buyer to pay vendors in their preferred local currency. A clause in the agreement may require that payees sign up with the platform and enter their banking details and preferred currency.

Payment platforms already exist that allow clients to qualify for custom (flat) rates by submitting a pricing assessment form, and that connect with other systems through a web API translator via no-code applets based on an IFTTT (If This Then That) mechanism.

Payments are not easy, but they are worth getting right, because payment is the sore point paving the road for Gresham's Law.

Perverted Debates

If the debate around rates and payments has never gone past the stage of rants and complaints, the one around quality has been intoxicating the translation space for years without leading to any significant outcome. Yet these debates still produce tons of academic publications around the same insubstantial fluff and generate thousands of lines of code just to keep repeating the same mistakes.

As long as machine translation was a subject confined to specialists, relatively objective metrics and models ruled the quality assessment process with the goal of improving the technology and the assessment metrics and models themselves.

After entering the mainstream, a few years ago, machine translation became marketing prey. Marketing people at machine translation companies started targeting translation industry players with improvements in automated evaluation metrics, typically BLEU, and the public with claims of "human parity". [And also the increasing use of bogus MT quality rankings done by third parties.]

Both are smoke and mirrors, though. On one side, automated metrics are no more than the scores they deliver, and their implications are hard to grasp; they have also been showing all their limitations with neural MT models. On the other, no one has bothered to offer a consistent, unambiguous, and indisputable definition of 'human parity' other than the definitions offered by the companies bragging that they have achieved it.
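To see concretely why an automated metric is "no more than the score it delivers", consider a toy bag-of-words precision in the spirit of BLEU (real BLEU adds higher-order n-grams and a brevity penalty, but similar pathologies remain). The sentences are invented for illustration. Two hypotheses of very different fidelity can receive the identical score:

```python
from collections import Counter

def unigram_precision(hypothesis, reference):
    """Toy, BLEU-style modified unigram precision:
    the share of hypothesis words that also occur in the reference,
    with per-word counts clipped to the reference counts."""
    hyp, ref = hypothesis.split(), reference.split()
    overlap = Counter(hyp) & Counter(ref)  # clipped word-count intersection
    return sum(overlap.values()) / len(hyp)

ref  = "the patient should not take this drug"
good = "the patient should not take this drug"
bad  = "the patient should take not this drug"  # scrambled word order, meaning at risk

print(unigram_precision(good, ref))  # 1.0
print(unigram_precision(bad, ref))   # 1.0 -- identical score, different meaning
```

The score alone cannot tell a faithful translation from a dangerous one; interpreting it requires exactly the context that marketing claims tend to omit.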

Saying that machine translation output is "nearly indistinguishable from" or "equivalent to" a human translation is misleading and means almost nothing. Saying that a machine has achieved human parity if "there is no statistically significant difference between human quality scores for a test set of candidate translations from a machine translation system and the scores for the corresponding human translations" may sound more exhaustive and accurate, but comparisons still depend on the characteristics of the input and output and on the conditions for comparison and evaluation.
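That statistical definition can be sketched in a few lines. This is a hedged illustration, not any vendor's actual protocol: a paired permutation test on hypothetical 1-5 adequacy ratings for MT output and human translations of the same test set. Under this narrow definition, "parity" is claimed precisely when no significant difference is detected, which by itself says nothing about the test set's difficulty or the raters' sensitivity:

```python
import random
import statistics

def parity_test(mt_scores, human_scores, n_resamples=10_000, seed=0):
    """Paired permutation test: is the mean score difference between
    MT and human translations statistically significant?"""
    rng = random.Random(seed)
    diffs = [m - h for m, h in zip(mt_scores, human_scores)]
    observed = statistics.mean(diffs)
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the pairing is arbitrary,
        # so randomly flip the sign of each paired difference.
        permuted = statistics.mean(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted) >= abs(observed):
            extreme += 1
    return observed, extreme / n_resamples

# Hypothetical 1-5 adequacy ratings on the same 8-segment test set
mt = [4.1, 3.8, 4.4, 3.9, 4.2, 4.0, 3.7, 4.3]
ht = [4.3, 4.0, 4.5, 4.1, 4.2, 4.4, 3.9, 4.4]
diff, p = parity_test(mt, ht)
print(f"mean difference {diff:+.3f}, p = {p:.3f}")
```

A large p-value here licenses a "parity" claim under the quoted definition, yet the same test on easier sentences, fewer segments, or less attentive raters would pass even more readily, which is exactly the essay's objection.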

In other words, the questions to answer are, “Is every human capable of translating in any language pair? Can any human produce a translation of equivalent quality in any language pair? Can any human translate better than machines in any language pair?” And vice versa.

All too often, people, even in the professional translation space, tend to forget that machine translation is a narrow-AI application, i.e., it focuses on one narrow task, with each language pair being a separate task. In other words, the singularity that would justify a claim of "human parity" is still far off, and not just in time; so much for Ray Kurzweil's predictions or Elon Musk's confidence in Neuralink's development of a universal language and brain chip.

Using automatic MT quality scores as a marketing lever is therefore misleading because there are too many variables at play. Talking about "human parity" is misleading too because one should consider the conditions under which the assessment leading to certain statements has been conducted.

Now, it is quite reasonable for a client to ask a partner (as LSPs like to think of themselves) to help them correctly and fully interpret machine translation scores and certain catchphrases that may sound puzzling for vagueness or ambiguity.

Most clients—the largest ones, anyway—are in a different league in terms of organizational maturity than their language service providers, and cannot understand the reason for the sloppiness and inefficiency they see in these would-be partners. And yet it is quite simple: the traditional, still common translation process model they follow is not sustainable even for mission-critical content. Incidentally, this brings us back to productivity, payments, Gresham's Law, and skill and labor shortages, all of which are interrelated.

Not only are leaner, faster, and more efficient processes more necessary than ever, but mutual understanding is also crucial. To help customers understand translation products and services, and value them accordingly, the people in this industry should forgo the often obfuscating jargon that no client is interested in or willing to learn and decipher. Is this jargon part of the notorious information asymmetry?

A greater and more honest self-assessment is necessary, which the industry is, instead, dramatically lacking at all levels. And this possibly explains the greater interest in the machine translation market and industry rather than in the translation industry.



Luigi Muzii has been in the "translation business" since 1982 and has been a business consultant since 2002, in the translation and localization industry through his firm. He focuses on helping customers choose and implement best-suited technologies and redesign their business processes for the greatest effectiveness of translation and localization-related work.

This link provides access to his other blog posts.

Monday, May 24, 2021

ModernMT: A Closer Look At An Emerging Enterprise MT Powerhouse

 As one observes the continuing evolution of MT use in the professional translation industry, we see that we have reached a point where we have some useful insights about producing successful outcomes in our use of MT. From my perspective as a long-term observer and expert analyst of enterprise MT use, some of these include:

  • Adaptation and customization of a generic MT engine done with expertise generally produces a better outcome than simply using a generic public MT system. 
  • Working with enhanced baseline engines built by experts is likely to produce better outcomes than dabbling with Open Source options with limited expertise. While it has gotten easier to produce MT systems with open-source platforms, real expertise requires long-term exposure and repeated experimentation. 
  • The algorithms underlying Neural MT have become largely commoditized and there is little advantage gained by jumping from one NMT platform to another.
  • More data is ONLY better if it is clean, relevant, and applicable to the enterprise use case in focus. It can be said today that (training) data often matters more than the algorithms used, but data quality and organization are critical factors for creating successful outcomes.
  • A large majority of translators still view MT with great skepticism and see it as marginally useful, mostly because of repeated exposure to incompetently deployed MT systems that are used to reduce translator compensation. Getting active and enthusiastic translator buy-in continues to be a challenge for most MT developers and getting this approval is a clear indicator of superior expertise.
  • Attempts to compare different MT systems are largely unsuccessful or misleading, as they are typically based on irrelevant test data or draw conclusions based on very small samples.
  • A large number of enterprise use cases are limited by scarce training data resources and thus adaptation and customization attempts have limited success.
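The small-sample problem behind misleading system comparisons is easy to demonstrate. A quick bootstrap sketch (with made-up segment-level quality scores, not real evaluation data) shows how wide the confidence interval on a system-level mean score becomes when a comparison rests on only a handful of segments:

```python
import random
import statistics

def bootstrap_ci(scores, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean score."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical per-segment quality scores (0-100) for one MT system
small = [72, 85, 64, 91, 78, 55, 88, 69, 74, 81]  # a 10-segment test set
large = small * 30                                 # same distribution, 300 segments

print(bootstrap_ci(small))  # wide interval: rankings from such samples are noise-prone
print(bootstrap_ci(large))  # much narrower interval
```

With an interval this wide on ten segments, two systems several points apart are statistically indistinguishable, which is why rankings drawn from tiny or irrelevant test sets deserve the skepticism expressed here.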

I have been skeptical of the validity of many of the comparisons of MT systems produced by LSPs and "independent" evaluators nowadays because of the questionable evaluation methodologies used. The evaluators often produce nice graphics, but just as often produce misleading results that need further investigation. However, these comparative evaluations can still be useful for getting a rough idea of the performance of these MT vendors' generic systems. Over the last few years, ModernMT has consistently shown up among the top-performing MT systems in many different evaluations, so I decided to sit down with the ModernMT team to better understand their technology and product philosophy and what might be driving this consistent performance advantage. The level of transparency and the forthcoming nature of the responses from the ModernMT team were refreshing in contrast to other conversations I have had with MT developers.

The MT journey here began over 10 years ago with Moses and Statistical MT, but unlike most other long-term MT initiatives I know of, this effort was very translator-centric right from its inception. The system was used heavily by translators who worked for Translated, and the MT systems were continually adapted and modified to meet the needs of production translators. This is a central design intention, and it is important not to gloss over it, as this is the ONLY MT initiative I know of where Translator Acceptance is used on an ongoing basis as the primary criterion for determining whether MT should be used for production work. The operations managers will simply not use MT if it does not add value to the production process or if it causes translator discontent. Over many years, the ongoing collaboration with translators at ModernMT has triggered MT system and process development changes to reach the current status quo, where the MT value-add and efficiency are clear to all stakeholders. This long-term collaboration between translators and MT developers, and the resulting system and process modifications, are a key reason why ModernMT does so well in both generic MT system comparisons and, especially, in adapted/customized MT comparisons.
Thus, translators who actively use the ModernMT platform most often do so through MateCat, an open-source CAT tool that ties MyMemory (a large free-access shared TM repository with around 50 billion words in it) together with ModernMT or other MT platforms. MT is presented to translators as an alternative to TM on a routine basis, and corrections are dynamically and systematically used to drive continuous improvement in the ModernMT engines. Trados and other CAT tools can also connect seamlessly to the ModernMT back-end, though these systems may see less immediate improvement in MT output quality. This has not stopped ~25,000 downloads of the ModernMT plugin for Trados on the SDL AppStore. Translators who do production work for Translated are often given the choice of using Google instead of ModernMT, but most have learned that ModernMT output improves rapidly from corrective feedback and that collaborative input is easier, and thus they tend to prefer it, as shown in the surveys below. Over the years, ModernMT product evolution has been driven by efforts to identify and reduce post-editing effort rather than by optimizing BLEU scores, as most others have done.

In contrast to most MTPE experiences, the individual translator experience here is characterized by the following:
  • A close and symbiotic relationship between a relevant translation memory and MT, even at the translator UX level
  • An MT system that is constantly updated and can potentially improve with every single interaction and unit of corrective feedback
  • Immediate project startup possibilities as no batch MT training process is necessary
  • Translator control over all steering data used in a project means very straightforward control over terminology and term consistency, mirroring the latest TMs and linguistic preferences
  • Corrective feedback given to the MT system is dynamic and continuous and can have an immediate impact on the next sentence produced by the MT system
  • One of very few MT systems available today that can provide a context-sensitive translation 
  • Measurable and palpable reduction in post-editing effort, and a better translator UX, compared to other MT platforms
  • Continuing free access to the CAT tool needed to integrate MT with TM and interact proactively with MT, with the option to use other highly regarded CAT tools if needed.

"Memory" here refers to user-supplied data (TMs and glossaries) used to tune the generic system to the needs of the current translation task.

Instance-Based Adaptation

ModernMT describes itself as an "Instance-Based Adaptive MT" platform. This means that it can start adapting and tuning the MT output to the customer subject domain immediately, without a batch customization phase. There is no long-running (hours/days/weeks) data preparation and pre-training process needed upfront. There is also no need to wait and gather a sufficient volume of corrective feedback to update and improve the MT engine on an ongoing basis. It is learning all the time. 

Rapid adaptation to customer-unique language and terminology is perhaps the single most critical requirement for a global enterprise, and this design is thus optimal for enterprises working with specialized and unique content. The same is true for LSPs. ModernMT can adapt the MT system with as little as a single sentence, though the results are better as more data is provided. The team told me that 100K words (10-12,000 sentences) would generally produce consistently good results, superior to any generic engine. The long-term impact of this close collaboration with translators, who provide ongoing corrections and feedback on requirements to improve efficiency and process workflow, combined with careful acquisition of the right kind of data, results in the kind of relative performance rankings that ModernMT now sees as a matter of course. One might even go so far as to say that they have built a sustainable competitive advantage.

I have always felt that a properly designed man-machine collaboration would very likely outperform an MT design approach that relies entirely on algorithms and/or data alone. We can see this is true from the comparative results of the large public MT portals, which probably have 100X or more of the resources and budget that ModernMT does. The understanding of the translation task, and the resulting direction that ongoing translator feedback brings to the table, is an ingredient that most current MT systems lack. Gary Marcus and other AI experts have been vocal in pointing out that machine learning and data alone are not the best way forward, and that more human steering and symbolic knowledge are needed for better outcomes.

Special Features

ModernMT is a context-aware machine translation product that learns from user corrections. There has recently been growing interest in the MT research community in bringing a greater degree of contextual awareness to MT systems, and ModernMT has been investing in this capability for some time. The current production version already implements it, and the feature continues to evolve in speed, efficiency, and capability.

The ModernMT Context Analyzer analyzes the entire text of a document to be translated, in milliseconds, before producing a translation. This analysis identifies the distinctive terminology and intrinsic style of the document. That information is then used to automatically select the most suitable of the private translation memories loaded by the user for that particular document, so the engine draws on the TM inventory that best reflects the right terminology and writing style. It is precisely this inventory that the MT engine leverages to customize the output in real-time, for each and every sentence of the document.
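The internals of the Context Analyzer are proprietary, but the underlying idea, ranking candidate memories by lexical similarity to the incoming document, can be sketched with a simple TF-IDF cosine comparison. This is illustrative only; the memory names and sample content below are hypothetical.

```python
import math
from collections import Counter

def vectorize(tokens, idf):
    # Weight each term's frequency by its inverse document frequency.
    tf = Counter(tokens)
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def pick_memory(document, memories):
    """Rank candidate translation memories by lexical similarity
    to the incoming document and return the best match."""
    docs = {name: text.lower().split() for name, text in memories.items()}
    n = len(docs)
    df = Counter(t for toks in docs.values() for t in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    query = vectorize(document.lower().split(), idf)
    scores = {name: cosine(query, vectorize(toks, idf))
              for name, toks in docs.items()}
    return max(scores, key=scores.get), scores

memories = {
    "medical_tm": "patient dosage clinical trial adverse reaction protocol",
    "legal_tm": "agreement party liability clause jurisdiction warranty",
    "it_tm": "server configure deployment firewall latency cluster",
}
best, scores = pick_memory("Configure the server firewall before deployment",
                           memories)
print(best)  # → it_tm
```

A production system would of course work with much larger memories, sub-document granularity, and far faster indexing, but the selection principle, "which of the user's TMs looks most like this document?", is the same.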

Because translators at Translated working with ModernMT regularly have the ability to compare its output with that of Google Translate, the developers can monitor translator preferences on an ongoing basis. This ensures that translators are always working with the MT output they find most useful, and that developers understand when their own engines need to be improved or enhanced. The following charts are based on feedback from translators during production work and show a very definite preference for the rapidly improving ModernMT engine output. This preference is seen in internal translator assessments made in production mode rather than on a selective test set, and it has also been confirmed by independent third-party assessments using both automated scores and human evaluations, which consistently show that ModernMT customizations regularly outperform most others. The forces driving this superior performance are the result of a design philosophy and long-term man-machine collaboration that cannot be easily replicated by others.

Recent comparative assessments done by independent third parties also confirm this preference using different evaluation methods that include both human and automated metrics as shown below. It is not unreasonable to presume that this performance advantage will remain intact for at least the short term.

Data Privacy

In response to a question on data privacy, Davide Caroselli, VP of Product, ModernMT responded: "Any content sent to ModernMT, whether a “TMX” memory or an MTPE correction from a professional translator, is saved in the user’s private data area. In fact, only you will be able to access your resources and make ModernMT adjust to them; in no way will another user be able to utilize that same inventory for his/her system, nor will ModernMT itself be able to use those contents, other than to exclusively offer your personalized translation service.

In addition, ModernMT uses state-of-the-art encryption technologies to provide its cloud services. Our data centers, employee processes and office operations are ISO 27001:2013 certified." 

On-Premise Capabilities

While the bulk of the current ModernMT customer base works with the secure cloud deployment, the team at ModernMT has also defined a range of on-premise deployment capabilities for those enterprises with the security, control, and assured data privacy needs that characterize some National Security, Financial, Legal, and Healthcare/Pharma industry requirements. The open-source foundations of much of the ModernMT infrastructure should make it particularly interesting to US Government Intelligence and Law-Enforcement agencies seeking large-scale multilingual data processing capabilities for eDiscovery and Social Media Surveillance applications.

Because ModernMT is a continuous-learning MT platform that learns dynamically from each correction, it requires more GPU infrastructure than some other on-premise solutions in the market. However, there is a strong focus on computational efficiency to minimize the IT footprint needed to deploy it on-premise, and based on information provided to me, its capabilities are quite similar to competitive alternatives in terms of both hardware requirements and software pricing. Hardware costs are linked to throughput expectations, with more hardware required for high-throughput requirements. As with most machine-learning-intensive capabilities, only enterprises with competent IT teams should undertake an internal deployment; most LSPs and localization departments will see a lower total cost of ownership with the cloud deployment.

Enterprise Readiness  

As ModernMT evolved from the localization world, it is already optimized for MT use cases where there is a significant need for a machine-first, human-optimized approach, a model we increasingly see preferred for the exploding volumes of localization content. The localization use case is possibly the most challenging MT use case out there, as it requires very high-quality initial output that translators are willing to work with, where it can be proven that the MT enhances productivity and efficiency. Localization demands the highest-quality MT output from the outset, whereas eDiscovery, social media surveillance, eCommerce, and customer service & support use cases are all more tolerant of lower MT output quality on much larger volumes of data. Very few MT developers have had success with the high-quality and rapid-responsiveness needs of the localization use case; many have tried and failed, which is why LSP adoption of MT is so low. ModernMT's success with this challenging use case positions them very well for other MT use cases, as their growing traction in those areas shows.

The ASTW case study illustrates the success of ModernMT in Intellectual Property (patent) and Life Science translations, where the ease of customization for complex terminology and morphology, the ability to learn continuously and quickly from corrective feedback, and a superior MTPE experience compared to other MT solutions have quickly made it a preferred solution.
"ModernMT is currently our favorite MT engine, especially in patent translations and in the Life Science sector, because it proves reliable, efficient, qualitatively better than its competitors, easily customizable and advantageous in terms of cost."

Domenico Lombardini, CEO ASTW

The eCommerce giants, eBay, Amazon, and Alibaba among them, understand the positive impact that translating huge volumes of catalog and user-generated CX content has on driving international revenue growth. ModernMT is now the MT engine driving the multilingual expansion of Airbnb web content and is translating many billions of words a month for them. User-generated content influences future customers, and there is great value in translating this content to drive and grow international business. Interestingly, ModernMT began this initiative with almost no translation memory and had to perform specialized heuristic analysis on Airbnb content to build the training material.

ModernMT has reached this point with very little investment in sales and marketing infrastructure. As that infrastructure builds out, I will be surprised if ModernMT does not continue to grow its enterprise presence, as enterprise buyers begin to understand that a tightly integrated, continuously learning man-machine collaborative platform is key to creating successful MT outcomes. I am aware that many other high-profile enterprise conversations are underway, and I expect that most enterprise buyers who evaluate the ModernMT platform will find it a preferred, cost-efficient way to implement large-scale MT solutions in a way that dramatically raises the likelihood of success.

Future Directions

Davide also mentioned to me that his team is very connected to the AI community in Italy and has been experimenting with GPT-3 and BERT, and will continue to do so until clear value-added applications that support and enhance their MT product emerge. ModernMT has a close relationship with Pi Campus and thus regular interaction with luminaries in the AI community, e.g. Lukasz Kaiser, who will be speaking about improvements in the Transformer architecture later this month.

The team also showed me demos of complex video content that had ModernMT-based automated dubbing from English to Italian injected into it. Apparently, Italy is one of the largest dubbing markets in the world. Who knew? Since my wife speaks Italian, I showed her some National Geographic content on geology, filled with complex terminology and scientific subject matter that she was shocked to find out had been done completely without human modification. The Translated team is exploring Speech Translation and I expect that they will be quality leaders here too.

ModernMT will continue to expand its connectivity to other translation and content management infrastructure to make it easier to get translation-worthy data in and out of their environment. They also continue to explore ways to make the ModernMT continuous training infrastructure more computationally efficient so that it can be more easily deployed on smaller footprint hardware. 

I expect we will see more and more of ModernMT on the enterprise MT stage from now on, as buyers realize that this is a significantly improved next-generation MT solution that is more likely to produce successful outcomes in digital transformation-related enterprise use scenarios. The ModernMT approach reduces the uncertainty that is so common with most MT-related initiatives and does it so seamlessly that most would not realize how sophisticated the underlying technology is until they attempt to replicate the functionality.

On a completely different note, I participated some months ago in responding to a question posed by  Luca Di Biase, the Imminent Research Director. He posed this same question to many luminaries in the translation industry, and also to me. The question has already triggered several discussions on Twitter.

“Is language a technology or a culture?”  

My response was as follows, but I think you may find the many other responses more interesting and complete if you go to this link or look at some of the other Twitter comments.
It is neither. Language is a means of communication and an information-sharing protocol that employs sounds, symbols, and gestures. Language can sometimes use technology to enable amplification, extend the reach of messages, and accelerate information and knowledge sharing. Language can create a culture when shared with(in) a group and used with well-understood protocols and norms. Intercultural communication can also mean cross-species communication, e.g., when communicating with dogs and horses.

Translated's Research Center has just released the Imminent publication, which has a distinctive style coupled with interesting content that I think most in the language industry would find compelling and worth a close look.

Monday, March 29, 2021

The Quest for Human Parity Machine Translation

The Challenge of Defining Translation Quality 

The subject of "translation quality" has always been a challenging communication issue for the translation industry. It is particularly difficult to explain this concept in a straightforward way to an industry outsider, or to a customer whose primary focus is building business momentum in international markets and who is not familiar with localization-industry translation-quality-speak. Nowadays such customers tend to focus on creating and managing the dynamic and ever-changing content that enhances a global customer's digital journey, rather than the static content that is the more typical focus of localization managers. Thus, the conventional way in which translation quality is discussed by LSPs is not very useful to them. Since every LSP claims to deliver the "best quality" or "high quality" translations, it is difficult for these buyers to tell the difference in this service aspect from one provider to another. The quality claims between vendors thus essentially cancel out.

These customers also differ in other ways. They need larger volumes of content to be translated rapidly at the lowest cost possible, yet at a quality level that is useful to the customer in digital interactions with the enterprise. For millions of digital interactions with enterprise content, the linguistic perfection of translations is not a meaningful and achievable goal given the volume, short shelf-life, and instant turnaround expectations a digital customer will have.
As industry observer and critic Luigi Muzii describes it:
"Because of the intricacies related to the intrinsic nature of languages, objective measurement of translation quality has always been a much researched and debated topic that has borne very little fruit. The notion of understood quality level remains unsolved, together with any kind of generally accepted and clearly understood quality assessment and measurement."
The industry response to this need for a better definition of translation quality is deeply colored by the localization mindset and thus we see the emergence of approaches like the Dynamic Quality Framework (DQF). Many critics consider it too cumbersome and detailed to implement in translating modern fast-flowing content streams needed for superior digital experience. While DQF can be useful in some limited localization use-case scenarios, it will surely confound and frustrate the enterprise managers who are more focused on digital transformation imperatives.  The ability to rapidly handle and translate large volumes of DX-relevant content cost-effectively is increasingly a higher priority and needs a new and different view on monitoring quality. The quality of the translation does matter in delivering superior DX but has a lower priority than speed, cost, and digital agility.

While machines do most of the translation done on the planet today, this does not mean that there is not a role for higher value-added human translation (increasingly supported by CAT tools). If the content is a critical and high-impact communication, most of us understand that human oversight is critical for success in the business mission. And if translation involves finesse, nuance, and high art, it is probably best to leave the "translating" computers completely out of the picture. 

However, in this age of digitally-driven business transformation and momentum, competent MT solutions are essential to the enterprise mission. Increasingly, more and more content is translated and presented to target customers without EVER going through any post-editing modification. The business value of the translation is often defined by its utility to the consumer in a digital journey, basic understandability, availability-on-demand, and the overall CX impact, rather than linguistic perfection. Generally, useable accuracy and timely delivery matter more than perfect grammar and fluency. The phrase "good enough" is used both disparagingly and as a positive attribute for translation output that is useful to a customer even in a less than “perfect” state.

So we have a situation today where the term translation quality is often meaningless even in "human translation" because it cannot be described to an inexperienced buyer of translation services (or regular human beings) in a clear, objective, and consistently measurable way. Comparing different human translation works of the same source material is often an exercise in frustration or subjective preference at best. Every sentence can have multiple correct, accurate translations, so how do we determine what is the best translation?  Since every LSP in the industry claims to provide the "best quality", such a claim is useless to a buyer who does not wish to wade through discussions on error counts, error categories, and error monitoring dashboards that are sometimes used to illustrate translation quality.

Defining Machine Translation Output Quality

The MT development community has also had difficulty establishing a meaningful and widely useful comparative measurement of translation quality. Fortunately, they had assistance from the National Institute of Standards & Technology (NIST), which developed a methodology to compare the translation quality of multiple competing MT systems under carefully controlled evaluation protocols. NIST used a variant of BLEU scores and other measures of precision, recall, adequacy, and fluency to compare different MT systems rapidly in a standardized and transparent manner. Their efforts probably helped to establish BLEU as a preferred scoring methodology for rating both evolving and competing MT systems.

The competitive evaluation approach works when multiple systems are compared under carefully monitored test protocols, but becomes less useful when an individual developer announces "huge improvements" in BLEU scores, as it is easy to make extravagant claims of improvement that are not easily validated. Some independent evaluations widely used today provide comparisons where several systems may actually have trained on the test sets: the equivalent of giving a student the exam answers before a formal test. This makes some publicly available comparisons done by independent parties somewhat questionable and misleading. Other reference-test-set-based measurements like hLepor, Meteor, chrF, and Rouge are plagued by similar problems. These automated measurements are all useful, but they are unreliable indicators of absolute quality.
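For readers unfamiliar with how these automated scores work, here is a minimal sentence-level sketch of the BLEU idea: the geometric mean of clipped n-gram precisions, multiplied by a brevity penalty. Real comparative evaluations use corpus-level implementations (e.g. sacreBLEU) with standardized tokenization; this toy version just shows why a paraphrase that a human would accept can score very poorly against a single reference.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Minimal sentence-level BLEU: geometric mean of clipped n-gram
    precisions times a brevity penalty (zero counts are smoothed)."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        # Clip each n-gram's count to its count in the reference.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * geo_mean

ref = "the cat sat on the mat"
print(bleu("the cat sat on the mat", ref))  # exact match scores 1.0
print(bleu("a cat is on a mat", ref))       # acceptable paraphrase scores near zero
```

This is exactly the weakness discussed above: BLEU rewards surface overlap with one reference, so a system that has seen the test set can score spectacularly while a legitimately different correct translation is penalized.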

Best practices today suggest that a combination of multiple automated measures needs to be used together with human assessments of MT output to really get a handle on the relative quality of different MT systems. Again, this quickly gets messy as soon as we start asking annoying questions like:
  • What are we testing on?
  • Are we sure that these MT systems have not trained on the test data? 
  • What kind of translators are evaluating the different sets of MT output?
  • How do these evaluators determine what is better and worse when comparing different correct translations?
  • How many sentences are needed to make a meaningful assessment and draw accurate conclusions when comparing multiple MT systems' performance on the same source material?
So we see that conducting an accurate evaluation is difficult and messy, and it is easy to draw wrong conclusions from easy-to-make errors in the evaluation process.

However, in the last few years, several MT developers have claimed to produce MT systems that have achieved human parity. This has been especially true with the advent of neural machine translation. These claims are useful for creating a publicity buzz among ignorant journalists and fear amongst some translators, but usually disappoint anybody who looks more closely.

I have been especially vocal in challenging the first of these broad human parity claims as seen here: The Google Neural Machine Translation Marketing Deception. The challenge is very specific and related to some specific choices in the research approach and how the supporting data was presented.  A few years later Microsoft claimed they reached human parity on a much narrower focus with their Chinese to English News system but also said: 
Achieving human parity for machine translation is an important milestone of machine translation research. However, the idea of computers achieving human quality level is generally considered unattainable and triggers negative reactions from the research community and end-users alike. This is understandable, as previous similar announcements have turned out to be overly optimistic. 
The goal of achieving human parity has become a way to say that MT systems have gotten significantly better, as this Microsoft communication shows. I was also involved with the SDL claim of having "cracked Russian", yet another broad claim that human parity has been reached 😧.

Many, who are less skeptical than I am, will interpret that an MT engine that claims to have achieved human parity can ostensibly produce translations of equal quality to those produced by a human translator. This can indeed be true on a small subset of carefully selected test material, but alas we find that it is not usually true in general for much of what we submit with high expectations to these allegedly human parity MT engines. This is the unfortunate history of MT: over-promising and underdelivering. MT promises are so often empty promises 😏. 

While many in the translation and research communities feel a certain amount of outrage over these exaggerated claims (based on MT output they see in the results of their own independent tests) it is useful to understand what supporting documentation is used to make these claims. 

We should understand that at least among some MT experts there is no deliberate intent to deceive, and it is possible to do these evaluations with enough rigor and competence to make a reasonable claim of breakthrough progress, even if it falls short of the blessed state of human parity. 

There are basically two definitions of human parity generally used to make this claim.
Definition 1. If a bilingual human judges the quality of a candidate translation produced by a human to be equivalent to one produced by a machine, then the machine has achieved human parity.

Definition 2. If there is no statistically significant difference between human quality scores for a test set of candidate translations from a machine translation system and the scores for the corresponding human translations, then the machine has achieved human parity.
Again the devil is in the details, as the data and the people used in making the determination can vary quite dramatically. There are (50?) shades of grey rather than black and white facts in most cases.  The most challenging issue is that human judges and evaluators are at the heart of the assessment process. These evaluators can vary in competence and expertise and can range from bilingual subject matter experts and professionals to low-cost crowdsourced workers who earn pennies per evaluation. The other big problem is the messy, inconsistent, irrelevant, biased data underlying the assessments.
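Definition 2 can be made concrete with a simple paired significance test over per-segment quality scores. The sketch below uses invented 1-100 adequacy ratings for ten segments (purely hypothetical numbers), and shows the mechanics a developer would use to declare "no statistically significant difference":

```python
import math
import statistics

def paired_t(human_scores, mt_scores):
    """Paired t-statistic over per-segment quality scores.
    Under Definition 2, 'parity' is claimed when |t| falls below
    the critical value, i.e., no significant difference is found."""
    diffs = [h - m for h, m in zip(human_scores, mt_scores)]
    n = len(diffs)
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation
    return mean / (sd / math.sqrt(n))

# Hypothetical adequacy ratings (1-100) for the same 10 segments:
human = [92, 88, 95, 90, 85, 93, 89, 91, 87, 94]
mt    = [93, 87, 95, 91, 84, 92, 90, 90, 88, 93]
t = paired_t(human, mt)
print(round(t, 2))
# With n=10, the two-sided 5% critical value is about 2.26, so a small
# |t| here means "no significant difference" -- note how little data
# such a parity claim can rest on.
```

The mechanics are sound, but as the text argues, the conclusion is only as good as the raters and the segments behind those scores: ten segments rated by crowdsourced workers can easily "prove" parity that vanishes on the next million sentences.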

Ensuring objective, consistent human evaluation is necessary but difficult to do on the required continuous and ongoing basis. If the underlying data used in an evaluation are fuzzy and unclear, we actually move toward obfuscation and confusion rather than clarity. This can be the scientific equivalent of fake news. MT engines evolve over time, and the better the feedback, the faster the evolution, provided developers know how to use this feedback to drive continuous improvements.

Again, as Luigi Muzii states:
The problem with human evaluation is bias. The red-pen syndrome.

Indeed, human evaluation of machine translation is known for being expensive, time-consuming, and often biased, and yet it is supposed to overcome the drawbacks introduced by the limited accuracy and approximation of automatic evaluation. ... Indeed, translation quality metrics have become more and more complex and overly detailed, and always seem to be based on the error-catching [and categorization] approach that has proved costly and unreliable thus far.  


Useful Issues to Understand 

While the parity claims can be roughly true for a small sample of evaluated sentences, it is difficult to extrapolate parity to a broader range of content because it is simply not possible to do machine translation output evaluation on an MT scale (millions of sentences). Some of the same questions that obfuscate quality discussions with human translation services also apply to MT. If we cannot define what a "good translation" is for a human, how is it possible to do this for a mindless, common-sense-free machine, where instruction and direction need to be explicit and clear?  

Here are some validation and claim verification questions that can help an observer to understand the extent to which parity has been reached or also expose deceptive marketing spin that may motivate the claims.

What was the test data used in the assessments? 
MT systems are often tested and scored on news domain data, which is the most plentiful. This may not correlate well with system performance on typical global enterprise content. A broad range of different content types needs to be included to make claims as extravagant as having reached human parity.

What is the quality of the reference test set?
In some cases, researchers found that the test sets had been translated and then back-translated with MTPE into the original source language. This could mean the content of the test sets is simplified from a linguistic perspective, and thus easier to machine translate. Ideally, test sets should be created by expert humans and should contain original source material, not data translated from another language.

Who produced the reference human translations being used and compared?
The reference translations against which all judgments will be made should be "good" translations. Easily said but not so easily done. If competent humans are creating the source test set sentences, the test process will be expensive. Thus, it is often more financially expedient to use MT or cheap translators to produce the test material.  This can cause a positive bias for widely used MT systems like Google Translate. 

How much data was used in the test to make the claim? 
Often human assessments are done with as few as 50 sentences, and automated scoring is rarely done with more than 2,000 sentences. Thus, drawing conclusions on how any MT system will handle the next million sentences it processes is risky and likely to be overly optimistic. For example, when an MT developer says that over 90% of the system’s output has been labeled as human translation by professional translators, they may be looking at a sample of only 100 or so sentences. To then claim that human parity has been reached is perhaps over-reaching.
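A quick confidence-interval calculation shows just how much such small samples over-reach. If 45 of 50 sentences were judged "human", the 95% Wilson score interval around that 90% figure is consistent with anything from roughly 79% to 96%:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# "90% judged indistinguishable from human" from a 50-sentence sample:
lo, hi = wilson_interval(45, 50)
print(f"{lo:.1%} - {hi:.1%}")  # roughly 78.6% - 95.7%
```

In other words, the observed 90% is statistically compatible with a true rate of about four errors in five sentences out of every twenty-five, which is a long way from parity at MT scale.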

Who is making the judgments and what are their credentials?
It is usually cost-prohibitive to use expert professional translators to make the judgments and thus evaluators are often acquired on crowdsourcing platforms where evaluator and translator competence is not easily ascertained. 

It can be seen that doing an evaluation properly would be a significant and expensive task, and MT developers have to do this continuously while building a system. The process needs to be efficient, fast, and consistent. It is often only possible to run such careful tests on the most mission-critical projects, and it is not realistic to follow all these rigorous protocols for typical low-ROI enterprise projects. This is why BLEU and other "imperfect" automated quality scores are so widely used: they provide developers with continuous feedback in a fast and cost-efficient manner, if they are computed with care and rigor. Recently there has been much discussion about testing on documents, to assess understanding of context rather than just sentences. This will add complexity, cost, and difficulty to an already difficult evaluation process, and IMO will yield very small incremental benefits in evaluative and predictive accuracy. There is a need to balance improved process recommendations against cost and the benefit of improved predictability.

The Academic Response

Recently, several academic researchers provided some feedback from their examination of these human-parity claims. Their study, “A Set of Recommendations for Assessing Human–Machine Parity in Language Translation”, is worth a look to see the many ways in which evaluations can go wrong. The study showed that human evaluation of MT quality depends on three factors: “the choice of raters, the availability of linguistic context, and the creation of reference translations.”

Some findings from this report in summary:

“Professional translators showed a significant preference for human translation, while non-expert [crowdsourced] raters did not”.

“Human evaluation methods which are currently considered best practice fail to reveal errors in the output of strong NMT systems”

The authors recommend the following design changes to MT developers in their evaluation process:
  • Appoint professional translators as raters
  • Evaluate documents, not sentences
  • Evaluate fluency on top of adequacy
  • Do not heavily edit reference translations for fluency
  • Use original source texts
Most developers would say that implementing all these recommendations would make the evaluation process prohibitively expensive and slow. The researchers agree, and welcome further studies into “alternative evaluation protocols that can demonstrate their validity at a lower cost.” Process changes need to be practical and reasonably possible; there is a need to balance improved process benefits against cost and improved predictability.

What Would Human Parity MT Look Like?

MT developers should refrain from making claims of achieving human parity until there is clear evidence that this is happening at scale. Most current claims of achieving parity are based on laughably small samples of 100 or 200 sentences. I think it would be useful to the user community at large for MT developers to refrain from making these claims until they can show all of the following:
  • 90% or more of a large sample (>100,000 or even 1M sentences) that are accurate and fluent and truly look like they were translated by a competent human
  • Catch obvious errors in the source and possibly even correct these before attempting to translate 
  • Handle variations in the source with consistency and dexterity
  • Have at least some nominal amount of contextual referential capability
Note that these are things we would expect without question from an average translator. So why not from the super-duper AI machine? 

Until we reach the point where all of the above is true, it would be useful to CLEARLY state the boundary limits of the claim with key parameters underlying the claim. Such as:
  • How large the test set was (e.g. 90% of 50 sentences where parity was achieved) 
  • Descriptions on what kind of source material was tested
  • How varied the test material was: sentences, paragraphs, phrases, etc.
  • Who judged, scored, and compared the translations
For example, an MT developer might state a parity claim as follows:
We found that 45/50 original human-sourced sentences in a sample translated by the new MT system were judged by a team of three crowdsourced translator/raters as indistinguishable from the translations produced by two professional human translators. Based on this data, we claim the system has achieved "limited human parity".
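A disclosure like the one above could be captured in a small structured record so that readers can compare claims on equal terms. This is only a sketch; the class and field names are my own invention, not any existing standard.

```python
from dataclasses import dataclass

@dataclass
class ParityClaim:
    """Hypothetical disclosure record for an MT 'human parity' claim."""
    sample_size: int          # how large the test set was
    judged_at_parity: int     # items judged indistinguishable from human work
    source_description: str   # what kind of source material was tested
    granularity: str          # "sentences", "paragraphs", "documents", ...
    raters: str               # who judged, scored, and compared the translations

    @property
    def parity_rate(self) -> float:
        return self.judged_at_parity / self.sample_size

    def summary(self) -> str:
        return (f"{self.judged_at_parity}/{self.sample_size} "
                f"({self.parity_rate:.0%}) of {self.granularity} from "
                f"{self.source_description}, judged by {self.raters}")

# The example claim from the text, expressed as a record:
claim = ParityClaim(
    sample_size=50,
    judged_at_parity=45,
    source_description="original human-sourced sentences",
    granularity="sentences",
    raters="three crowdsourced translator/raters",
)
print(claim.summary())
```

Forcing every claim through a template like this makes the tiny sample sizes and rater choices impossible to bury in marketing prose.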

Until this minimum set of capabilities is shown at MT scale (>100,000 or even 1M sentences), we should tell MT developers to STFU and give us the claim parameters in a simple, clear, summarized way, so that we can weigh the reality of the data against the claim for ourselves.  

I am also skeptical that we will achieve human parity by 2029 as some "singularity" enthusiasts have been saying for over a decade. 
"There is not the slightest reason to believe in a coming singularity. Sheer processing power [and big data] is not pixie dust that magically solves all your problems." 
Steven Pinker 
Elsewhere, Pinker also says:
"… I’m skeptical, though, about science-fiction scenarios played out in the virtual reality of our imaginations. The imagined futures of the past have all been confounded by boring details: exponential costs, unforeseen technical complications, and insuperable moral and political roadblocks. It remains to be seen how far artificial intelligence and robotics will penetrate into the workforce. (Driving a car is technologically far easier than unloading a dishwasher, running an errand, or changing a baby.) Given the tradeoffs and impediments in every other area of technological development, the best guess is: much farther than it has so far, but not nearly so far as to render humans obsolete."

Recently some in the Singularity community have admitted that "language is hard" as you can see in this attempt to explain why AI has not mastered translation yet.

Michael Housman, a faculty member of Singularity University, explained that the ideal scenario for machine learning and artificial intelligence is something with fixed rules and a clear-cut measure of success or failure. He named chess as an obvious example and noted machines were able to beat the best human Go player. This happened faster than anyone anticipated because of the game’s very clear rules and limited set of moves.

Housman elaborated, “Language is almost the opposite of that. There aren’t as clearly-cut and defined rules. The conversation can go in an infinite number of different directions. And then of course, you need labeled data. You need to tell the machine to do it right or wrong.”

Housman noted that it’s inherently difficult to assign these informative labels. “Two translators won’t even agree on whether it was translated properly or not,” he said. “Language is kind of the wild west, in terms of data.”

Perhaps, we need to admit that human parity MT at scale is not a meaningful or achievable goal. If it is not possible to have a super-competent human translator capable of translating anything and everything with equal ease, why do we presume a machine could?

Perhaps what we really need is an MT platform that can rapidly evolve in quality with specialized human feedback. Post-editing (MTPE) today is generally NOT a positive experience for most translators, but human interaction with the machine can be a significantly better and more positive one. Developing interactive, highly responsive MT systems that can assist, learn, and instantly take over the humdrum elements of translation tasks might be a better research focus, and a more worthwhile goal than a God-like machine that can translate anything and everything at human parity. 

Even in the AI-will-solve-all community, we know that "language is hard," so maybe we need more focus on improving the man-machine interface and the quality of the interaction, and on finding more sophisticated collaborative models. Rapid evolution, intuitive and collaborative interaction, and instant learning seem like a more promising vision to me than crawling all the data on the web and throwing machine-learning pixie dust at ten trillion words of TM training data. 

Getting to a point where the large majority of translators ALWAYS WANT TO USE MT because it simply makes the work easier, more pleasant, and more efficient is perhaps a better focus for the future. I would also bet that this different vision is a more likely path to MT systems that consistently produce better output over millions of sentences.