
Friday, October 22, 2021

Understanding Machine Translation Quality: A Review

This is a reprint of a post I wrote and already published here with some minor formatting changes made for emphasis. It is the first of a series of ongoing posts that will be published at that site and also shared here if it seems appropriate. 

For those who may seek or insist that I maintain a truly objective viewpoint, I should warn you that these posts will reflect my current understanding that ModernMT is a superior MT implementation for enterprise MT use. I will stress this often in future posts, as I have not seen a better deployment of MT technology for professional business translation in the 15 years I have been involved with enterprise MT.

==============

Today we live in a world where machine translation (MT) is pervasive, and increasingly a necessary tool for any global enterprise that seeks to understand, communicate and share information with a global customer base.

Experts estimate that trillions of words are translated daily with the aid of the many “free” generic public MT portals worldwide.

This is the first in a series of posts that will explore the issue of MT Quality in some depth, with several goals:

  • Explain why MT quality measurement is necessary,
  • Share best practices,
  • Expose common misconceptions,
  • Clarify what matters for enterprise and professional use.

While much has been written on this subject already, it does not seem to have reduced the misunderstanding and confusion that surround it. Thus, there is value in continued elucidation to ensure that greater clarity and understanding are achieved.

So let’s begin.




What Is MT Quality and Why Does It Matter?

Machine Translation (MT), or Automated Translation, is a process in which computer software “translates” text from one language to another without human involvement.

There are ten or more public MT portals available to do this in the modern era, and many private MT offerings are also available to the modern enterprise to address large-scale language translation needs. For this reason, the modern global enterprise needs an understanding of the relative strengths and weaknesses of the many offerings in the marketplace.

Ideally, the “best” MT system would be identified by a team of competent translators who would run a diverse range of relevant content through the MT system after establishing a structured and repeatable evaluation process. 

This is slow, expensive, and difficult, even if only a small sample of 250 sentences is evaluated.

Thus, automated measurements that attempt to score translation adequacy, fluency, precision, and recall have to be used. They attempt to do what is best done by competent humans, by comparing MT output to a human translation contained in what is called a reference test set. These reference sets cannot capture all the possible ways a source sentence could be correctly translated. Thus, these scoring methodologies are always an approximation of what a competent human assessment would determine, and can sometimes be wrong or misleading.
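To make this concrete, below is a minimal sketch of such a reference-based comparison, assuming the open-source sacrebleu package; the hypothesis and reference sentences are invented for illustration, and a real evaluation would use a much larger held-out test set.

```python
import sacrebleu

# MT output for a tiny illustrative test set (real sets are much larger).
hypotheses = [
    "The contract must be signed before the end of the month.",
    "Please restart the device if the error persists.",
]

# One human reference per segment; other correct translations exist, which is
# why these scores are only approximations of a human judgment.
references = [
    "The contract has to be signed before the end of the month.",
    "If the error persists, please restart the device.",
]

# sacrebleu expects a list of reference streams (multiple references allowed).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])

print(f"BLEU: {bleu.score:.1f}")  # corpus-level score on a 0-100 scale
print(f"chrF: {chrf.score:.1f}")
```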

Thus, identifying the “best MT” solution is not easily done. Consider the cost of evaluating ten different systems on twenty different language combinations with a human team versus automated scores. Even though it is possible to rank MT systems based on scores like BLEU and hLepor, these scores do not represent production performance. They are a snapshot of an ever-changing scene: if you change the angle or the focus, the results change.

A score is not a stable and permanent rating for an MT system. There is no single, magic MT solution that does a perfect job on every document or piece of content or language combination. Thus, the selection of MT systems for production use based on these scores can often be sub-optimal or simply wrong.

Additionally, MT technology is not static: the models are constantly being improved and evolving, and what was true yesterday in quality comparisons may not be true tomorrow.

For these reasons, understanding how the data, algorithms, and human processes around the technology interact is usually more important than any comparison snapshot.  

In fact, building expertise and close collaboration with a few MT providers is likely to yield better ROI and business outcomes than jumping from system to system based on transient and outdated quality score-based comparisons.

Two primary groups have an ongoing interest in measuring MT quality. They are:

  1. MT developers
  2. Enterprise buyers and LSPs

They have very different needs and objectives and it is useful to understand why this is so.

Measurements that may make sense for developers can often be of little or no value to enterprise buyers and vice versa. 

 

MT Developers

MT developers typically work on one model at a time, e.g.: English-to-French. They will repeatedly add and remove data from a training set, then measure the impact to eventually determine the optimal data needed.

They may also modify parameters on the training algorithms used, or change algorithms altogether, and then experiment further to find the best data/algorithm combinations using instant scoring metrics like BLEU, TER, hLepor, chrF, Edit Distance, and COMET.

While such metrics are useful to developers, they should not be used to cross-compare systems, and have to be used with great care.  The quality scores from several (data/algorithm) combinations are calculated by comparing MT output from each of these systems (models) to a Human Reference translation of the same evaluation test data. The highest scoring system is usually considered the best one.
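As a rough illustration of this developer loop, the sketch below scores a few hypothetical data/algorithm combinations against the same reference set and ranks them; the system names and outputs are invented, and the sacrebleu package is again assumed.

```python
import sacrebleu

references = [
    "The invoice must be paid within thirty days.",
    "Press the reset button to restore factory settings.",
]

# Hypothetical outputs from three candidate models (invented for illustration).
candidates = {
    "baseline": [
        "The invoice must be paid in thirty days.",
        "Press reset button for restore factory settings.",
    ],
    "more-training-data": [
        "The invoice must be paid within thirty days.",
        "Press the reset button to restore the factory settings.",
    ],
    "new-architecture": [
        "The bill has to be paid within 30 days.",
        "Push the reset key to restore factory settings.",
    ],
}

scores = {
    name: sacrebleu.corpus_bleu(outputs, [references]).score
    for name, outputs in candidates.items()
}

# The highest-scoring combination is usually kept -- but a human review may
# rank these candidates differently.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} BLEU = {score:5.1f}")
```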

In summary, MT developers use automatically calculated scores that attempt to mathematically summarize the overall precision, recall, adequacy, and fluency characteristics of an MT system into a numeric score. This is done to identify the best English-to-French system, as in our example, that they can build with available data and computing resources.

However, a professional human assessment may often differ from what these scores say.

In recent years, Neural MT (NMT) models have shown that using these automated scoring metrics in isolation can lead to sub-optimal choices. Increasingly, human evaluators are also engaged to ensure that there is a correlation between automatically calculated scores and human assessments.

This is because the scores are not always reliable, and human rankings can differ considerably from score-based rankings. Thus, the quality measurement process is expensive, slow, and prone to many procedural errors, and sometimes even deceptive tactics.

Some MT developers test on training data, which can result in misleadingly high scores. (I know of a few who do this!) The optimization process described above is essentially how the large public MT portals develop their generic systems, where the primary focus is on acquiring the right data, using the best algorithms, and getting the highest (BLEU) or lowest (TER) scores.
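A basic safeguard against this problem is to verify that the evaluation set does not overlap with the training corpus. Here is a minimal sketch of such a check; the file names and the one-segment-per-line format are assumptions for illustration.

```python
# Sanity check that test segments do not appear verbatim in the training data.
def load_segments(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

train_sources = load_segments("train.en")  # source side of training corpus
test_sources = load_segments("test.en")    # source side of evaluation set

overlap = train_sources & test_sources
if overlap:
    ratio = len(overlap) / len(test_sources)
    print(f"WARNING: {len(overlap)} test segments ({ratio:.1%}) are in the training data")
else:
    print("No exact overlap between training and test sources")
# Exact-match checks are only a first step; near-duplicates inflate scores too.
```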



Enterprise Buyers and LSPs

Enterprise Buyers and LSPs usually have different needs and objectives. They are more likely to be interested in understanding which English-to-French system is the “best” among five or more commercially available MT systems under consideration.

Using automated scores like BLEU, hLepor, and TER does not make as much sense in this context. The typical enterprise/LSP is additionally interested in understanding which system can “best” be modified to learn enterprise terminology and language style.

Optimization around enterprise content and subject domain matters much more, and a comparison of generic (stock) systems can often be useless in the considered professional use context.

It is often forgotten that many business problems require a combination of both MT and human translation to achieve the required level of output quality. Thus, a tightly linked human-in-the-loop (HITL) process to drive MT performance improvements has increasingly become a key requirement for most enterprise MT use cases.

Third-party consultants have compared generic (stock or uncustomized) engines and ranked MT solutions using a variety of test sets that may or may not be relevant to a buyer. These rankings are then often used to dynamically select different MT systems for different languages, but it is possible, and even likely, that they are making sub-optimal choices.

The ease, speed, and cost of tuning and adapting a generic (stock) MT system to enterprise content, terminology, and language style matter much more in this context, and comparisons should only be made after determining this aspect.

However, because generic system comparisons are much easier and less costly to do, TMS systems and middleware that select MT systems using these generic evaluation scores often make choices based on irrelevant and outdated data, and the results can thus be sub-optimal. This is a primary reason that so many LSP systems perform so poorly and why MT is so underutilized in this sector.

While NMT continues to gain momentum as the average water level keeps rising, there is still a great deal of naivete and ignorance in the professional translation community about MT quality assessment and MT best practices in general. The enterprise/LSP use of MT is much more demanding in terms of focused accuracy and sophistication in techniques, practices, and deployment variability, and few LSPs are capable or willing to make the investments needed to achieve ongoing competence as the state-of-the-art (SOTA) continues to evolve.


Dispelling MT Quality Misconceptions

1) Google has the “best” MT systems

This is one of the most widely held misconceptions. While Google does have excellent generic systems and broad language coverage, it is not accurate to say that they are always the best.

Google MT is complicated and expensive to customize for enterprise use cases, and there are significant data privacy and data control issues to be navigated.  Also, because Google has so much data underlying their MT systems, they are not easily customized by the relatively meager data volumes that most enterprises or LSPs have available. DeepL is often a favorite of translators, but also has limited customization and adaptation options.

ModernMT is a dynamically adaptive and continuously learning breakthrough neural MT system. As it is possibly the only MT system that learns and improves with every instance of corrective feedback in real time, a comparative snapshot that treats it as a static system is even less useful.

A properly implemented ModernMT system will improve rapidly with corrective feedback, and easily outperform generic systems on the enterprise-specific content that matters most. Enterprise needs are more varied, and rapid adaptability, data security, and easy integration into enterprise IT infrastructure typically matter most.

2) MT Quality ratings are static & permanent

MT systems managed and maintained by experts are updated frequently and thus snapshot comparisons are only true for a single test set at a point in time. These scores are a very rough historical proxy for overall system quality and capability, and deeper engagement is needed to better understand system capabilities.

For example, to make proper assessments with ModernMT, it is necessary to actively provide corrective feedback to see the system improve exactly on the content that you are most actively translating now. If multiple editors concurrently provide feedback, ModernMT will improve even faster. These score-based rankings do not tell you how responsive and adaptive an MT system is to your unique data.

TMS systems that switch to different MT systems via API for each language are of dubious value, since selections are often based on static and outdated scores. Best practice recommends that efforts to improve an MT system's adaptation to enterprise content, domain, and language style yield higher value than MT system selection based on scores embedded in TMS systems and middleware.

3) MT quality ratings for all use cases are the same.

The MT quality discussion needs to evolve beyond targeting linguistic perfection as the final goal, beyond comparisons of BLEU, TER, or hLepor scores, and beyond proximity to human translation.

It is more important to measure the business impact and make more customer-relevant content multilingual across global digital interactions at scale. While it is always good to get as close to human translation quality as possible, this is simply not possible with the huge volumes of content that are being translated today.

There is evidence now showing that for many eCommerce use scenarios, even gist translations that contain egregious linguistic errors can produce a positive business impact. In information triage scenarios typical in eDiscovery (litigation, pharmacovigilance, national security surveillance), the translation needs to be accurate on key search parameters but not on all the text. Translation of user-generated content (UGC) is invaluable to improving and understanding the customer experience and is also a primary influence on new purchase activity. None of these scenarios requires linguistically perfect MT output to have a positive business impact and drive successful customer engagement.

4) The linguistic quality of MT output is the only way to assess the “best” MT system.

The linguistic quality of MT output is only one of several critical criteria needed for a robust evaluation by an enterprise/LSP buyer. Enterprise requirements like the ease and speed of customization to the enterprise domain, data security and privacy, production MT system deployment options, integration into enterprise IT infrastructure, and overall MT system manageability and control also need to be considered.

Given that MT is rapidly becoming an essential tool for a globally agile enterprise, we need new ways to measure the quality and value of MT in global CX scenarios, where MT enables better communication, information sharing, and understanding of customer concerns on a global scale. A closer examination of business impact reveals that the metrics that matter most are:

  • Increased global digital presence and footprint
  • Enhanced global communication and collaboration
  • Rapid response in all global customer service/support scenarios
  • Productivity improvement in localization use cases to enable more content to be delivered at higher quality
  • Improved conversion rates in eCommerce

And ultimately the measure that matters at the executive level is the measurably improved customer experience of every customer in the world. 

This is often more a result of process and deployment excellence than the reported semantic similarity scores of any individual MT system.

The reality today is that increasingly larger volumes of content are being translated and used with minimal or no post-editing.  The highest impact MT use cases may only post-edit a tiny fraction of the content they translate and distribute.

However, much of the discussion in the industry today still focuses on post-editing efficiency and quality estimation processes that assume all the content will be post-edited.

It is time for a new approach that easily enables tens of millions of words to be translated daily, in continuously learning MT systems that improve by the day and enable new communication, understanding, and collaboration with globally distributed stakeholders.

In the second post in this series, we will dig deeper into BLEU and other automated scoring methodologies and show why competent human assessments are still the most valuable feedback that can be provided to drive ongoing and continuous improvements in MT output quality.

Monday, March 29, 2021

The Quest for Human Parity Machine Translation



The Challenge of Defining Translation Quality 


The subject of "translation quality" has always been a challenging communication issue for the translation industry. It is particularly difficult to explain this concept in a straightforward way to an industry outsider or a customer whose primary focus is building business momentum in international markets, and who is not familiar with localization industry translation-quality-speak. Nowadays they tend to focus on creating and managing the dynamic and ever-changing content that enhances a global customer's digital journey, rather than the static content that is the more typical focus of localization managers. Thus, the conventional way in which translation quality is discussed by LSPs is not very useful to these customers. Since every LSP claims to deliver the "best quality" or "high quality" translations, it is difficult for these buyers to tell the difference in this service aspect from one provider to another. The quality claims between vendors thus essentially cancel out.

These customers also differ in other ways. They need larger volumes of content to be translated rapidly at the lowest cost possible, yet at a quality level that is useful to the customer in digital interactions with the enterprise. For millions of digital interactions with enterprise content, the linguistic perfection of translations is not a meaningful or achievable goal, given the volume, short shelf-life, and instant turnaround expectations a digital customer will have.
As industry observer and critic Luigi Muzii describes it:
"Because of the intricacies related to the intrinsic nature of languages, objective measurement of translation quality has always been a much researched and debated topic that has borne very little fruit. The notion of understood quality level remains unsolved, together with any kind of generally accepted and clearly understood quality assessment and measurement."
The industry response to this need for a better definition of translation quality is deeply colored by the localization mindset and thus we see the emergence of approaches like the Dynamic Quality Framework (DQF). Many critics consider it too cumbersome and detailed to implement in translating modern fast-flowing content streams needed for superior digital experience. While DQF can be useful in some limited localization use-case scenarios, it will surely confound and frustrate enterprise managers who are more focused on digital transformation imperatives.  The ability to rapidly handle and translate large volumes of DX-relevant content cost-effectively is increasingly a higher priority and needs a new and different view on monitoring quality. The quality of the translation does matter in delivering superior DX but has a lower priority than speed, cost, and digital agility.

While machines do most of the translation on the planet today, this does not mean that there is no role for higher value-added human translation (increasingly supported by CAT tools). If the content is a critical and high-impact communication, most of us understand that human oversight is critical for success in the business mission. And if translation involves finesse, nuance, and high art, it is probably best to leave the "translating" computers completely out of the picture. 

However, in this age of digitally-driven business transformation and momentum, competent MT solutions are essential to the enterprise's mission. Increasingly, more and more content is translated and presented to target customers without EVER going through any post-editing modification. The business value of the translation is often defined by its utility to the consumer in a digital journey, basic understandability, availability-on-demand, and the overall CX impact, rather than linguistic perfection. Generally, usable accuracy and timely delivery matter more than perfect grammar and fluency. The phrase "good enough" is used both disparagingly and as a positive attribute for translation output that is useful to a customer even in a less than “perfect” state.


So we have a situation today where the term translation quality is often meaningless even in "human translation" because it cannot be described to an inexperienced buyer of translation services (or regular human beings) in a clear, objective, and consistently measurable way. Comparing different human translation works of the same source material is often an exercise in frustration or subjective preference at best. Every sentence can have multiple correct, accurate translations, so how do we determine what is the best translation?  Since every LSP in the industry claims to provide the "best quality", such a claim is useless to a buyer who does not wish to wade through discussions on error counts, error categories, and error monitoring dashboards that are sometimes used to illustrate translation quality.


Defining Machine Translation Output Quality


The MT development community has also had difficulty establishing a meaningful and widely useful comparative measurement for translation quality. Fortunately, they had assistance from the National Institute of Standards & Technology (NIST) and developed a methodology to compare the translation quality of multiple competing MT systems under carefully controlled evaluation protocols. NIST used a variant of BLEU scores and other measures of precision, recall, adequacy, and fluency to compare different MT systems rapidly in a standardized and transparent manner. Their efforts probably helped to establish BLEU as a preferred scoring methodology to rate both evolving and competing MT systems.

The competitive evaluation approach works when multiple systems are compared under carefully monitored test protocols, but becomes less useful when an individual developer announces "huge improvements" in BLEU scores, as it is easy to make extravagant claims of improvement that are not easily validated. Some independent evaluations used by many today provide comparisons where several systems may have actually trained on the test sets; this is the equivalent of giving a student the exam with the answers before a formal test. This makes some publicly available comparisons done by independent parties somewhat questionable and misleading. Other reference-based test-set measurements like hLepor, METEOR, chrF, and ROUGE are plagued by similar problems. These automated measurements are all useful, but they are unreliable indicators of absolute quality.

Best practices today suggest that a combination of multiple automated measures needs to be used together with human assessments of MT output to really get a handle on the relative quality of different MT systems. Again, this quickly gets messy as soon as we start asking annoying questions like:
  • What kind of content are we testing?
  • Are we sure that these MT systems have not trained on the test data? 
  • What kind of translators are evaluating the different sets of MT output?
  • How do these evaluators determine what is better and worse when comparing different correct translations?
  • How many sentences are needed to make a meaningful assessment and draw accurate conclusions when comparing multiple MT systems' performance on the same source material? (see the sketch below)
So, we see that conducting an accurate evaluation is difficult and messy, and it is easy to draw wrong conclusions from easy-to-make errors in the evaluation process.
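The sample-size question in particular is easy to underestimate. The sketch below uses bootstrap resampling over made-up segment-level scores (plain Python, no external packages) to show how much wider the uncertainty around an average quality score is with 50 segments than with 2,000.

```python
import random
import statistics

random.seed(7)

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    means = []
    for _ in range(n_resamples):
        resample = [random.choice(scores) for _ in scores]
        means.append(statistics.mean(resample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Made-up segment-level quality scores (e.g. 0-100 adequacy ratings) for the
# same hypothetical system, judged on a small and a larger test set.
small_set = [random.gauss(70, 15) for _ in range(50)]
large_set = [random.gauss(70, 15) for _ in range(2000)]

for name, scores in (("50 segments", small_set), ("2,000 segments", large_set)):
    lo, hi = bootstrap_ci(scores)
    print(f"{name}: mean = {statistics.mean(scores):.1f}, 95% CI = ({lo:.1f}, {hi:.1f})")
```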

However, in the last few years, several MT developers have claimed to produce MT systems that have achieved human parity. This has been especially true with the advent of neural machine translation. These claims are useful for creating a publicity buzz among ignorant journalists and fear amongst some translators, but usually disappoint anybody who looks more closely.

I have been especially vocal in challenging the first of these broad human parity claims, as seen here: The Google Neural Machine Translation Marketing Deception. The challenge was very specific and related to particular choices in the research approach and in how the supporting data was presented. A few years later, Microsoft claimed they had reached human parity on a much narrower focus with their Chinese-to-English news system, but also said:
Achieving human parity for machine translation is an important milestone of machine translation research. However, the idea of computers achieving human quality level is generally considered unattainable and triggers negative reactions from the research community and end-users alike. This is understandable, as previous similar announcements have turned out to be overly optimistic. 
The goal of achieving human parity has become a way to say that MT systems have gotten significantly better, as this Microsoft communication shows. I was also involved with the SDL claim of having "cracked Russian", which is yet another broad claim that human parity has been reached 😧.

Many who are less skeptical than I am will assume that an MT engine claiming to have achieved human parity can produce translations of equal quality to those produced by a human translator. This can indeed be true on a small subset of carefully selected test material, but alas we find that it is not usually true for much of what we submit with high expectations to these allegedly human-parity MT engines. This is the unfortunate history of MT: over-promising and under-delivering. MT promises are so often empty promises 😏.

While many in the translation and research communities feel a certain amount of outrage over these exaggerated claims (based on MT output they see in the results of their own independent tests) it is useful to understand what supporting documentation is used to make these claims. 

We should understand that at least among some MT experts there is no deliberate intent to deceive, and it is possible to do these evaluations with enough rigor and competence to make a reasonable claim of breakthrough progress, even if it falls short of the blessed state of human parity. 

There are basically two definitions of human parity generally used to make this claim.
Definition 1. If a bilingual human judges the quality of a candidate translation produced by a human to be equivalent to one produced by a machine, then the machine has achieved human parity.

Definition 2. If there is no statistically significant difference between human quality scores for a test set of candidate translations from a machine translation system and the scores for the corresponding human translations, then the machine has achieved human parity.
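Definition 2 implies a statistical test on paired human scores. The sketch below shows one common way such a test could be run, assuming the scipy package and a paired t-test over invented segment-level adequacy ratings; published parity studies use their own protocols and statistics, so this is only an illustration of the idea.

```python
from scipy.stats import ttest_rel

# Made-up adequacy ratings (0-100), paired per segment: the same raters scored
# the MT output and the human translation of each source segment.
mt_scores    = [82, 75, 90, 68, 88, 79, 85, 73, 91, 77]
human_scores = [85, 80, 89, 74, 90, 84, 83, 78, 92, 81]

stat, p_value = ttest_rel(mt_scores, human_scores)
print(f"paired t-test: t = {stat:.2f}, p = {p_value:.3f}")

# If p is above the chosen threshold (e.g. 0.05), the difference is not
# statistically significant -- the basis on which "parity" is often claimed.
# The outcome depends heavily on who the raters are and how large the sample is.
```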
Again the devil is in the details, as the data and the people used in making the determination can vary quite dramatically. There are (50?) shades of grey rather than black-and-white facts in most cases.  The most challenging issue is that human judges and evaluators are at the heart of the assessment process. These evaluators can vary in competence and expertise and can range from bilingual subject matter experts and professionals to low-cost crowdsourced workers who earn pennies per evaluation. The other big problem is the messy, inconsistent, irrelevant, and biased data underlying the assessments.


Ensuring objective, consistent human evaluation is necessary but difficult to do on the required continuous and ongoing basis. If the underlying data used in an evaluation are fuzzy and unclear, we actually move toward obfuscation and confusion rather than clarity. This can be the scientific equivalent of fake news. MT engines evolve over time, and the better the feedback, the faster the evolution, if developers know how to use this feedback to drive continuous improvements.

Again, as Luigi Muzii states:
The problem with human evaluation is bias. The red-pen syndrome.

Indeed, human evaluation of machine translation is known for being expensive, time-consuming, and often biased, and yet it is supposed to overcome the drawbacks introduced by the limited accuracy and approximation of automatic evaluation. ... Indeed, translation quality metrics have become more and more complex and overly detailed, and always seem to be based on the error-catching [and categorization] approach that has proved costly and unreliable thus far.  

 



Useful Issues to Understand 


While the parity claims can be roughly true for a small sample of evaluated sentences, it is difficult to extrapolate parity to a broader range of content because it is simply not possible to do machine translation output evaluation on an MT scale (millions of sentences). Some of the same questions that obfuscate quality discussions with human translation services also apply to MT. If we cannot define what a "good translation" is for a human, how is it possible to do this for a mindless, common-sense-free machine, where instruction and direction need to be explicit and clear?  

Here are some validation and claim-verification questions that can help an observer understand the extent to which parity has been reached, or expose the deceptive marketing spin that may motivate the claims.

What was the test data used in the assessments? 
MT systems are often tested and scored on news domain data, which is most plentiful. This may not correlate well with system performance on the content a typical global enterprise needs to translate. A broad range of different types of content needs to be included to make claims as extravagant as having reached human parity.

What is the quality of the reference test set?
In some cases, researchers found that the test sets had been translated, and then back-translated with MTPE into the original source language. This could mean the content of the test sets would be simplified from a linguistic perspective, and thus easier to machine translate. Ideally, test sets should be created by expert humans, contain original source material, and not consist of data translated from another language.

Who produced the reference human translations being used and compared?
The reference translations against which all judgments will be made should be "good" translations. Easily said but not so easily done. If competent humans create the test sets and reference translations, the test process will be expensive. Thus, it is often more financially expedient to use MT or cheap translators to produce the test material. This can cause a positive bias for widely used MT systems like Google Translate.

How much data was used in the test to make the claim? 
Often human assessments are done with as little as 50 sentences, and automated scoring is rarely done with more than 2,000 sentences. Thus, drawing conclusions on how any MT system will handle the next million sentences it will process is risky, and likely to be overly optimistic. For example, when an MT developer says that over 90% of the system’s output has been labeled as a human translation by professional translators, they may be looking at a sample of only 100 or so sentences. To then claim that human parity has been reached is perhaps overreaching.  
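A quick confidence-interval calculation shows why such small samples overreach. The back-of-the-envelope sketch below computes a Wilson score interval for a hypothetical "45 of 50 sentences judged human-like" result; it is an illustration, not a substitute for a properly designed study.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half

# Hypothetical claim: 45 of 50 test sentences judged indistinguishable from human.
lo, hi = wilson_interval(45, 50)
print(f"Observed proportion: {45 / 50:.0%} on 50 sentences")
print(f"95% confidence interval: {lo:.1%} to {hi:.1%}")
# The interval is wide, and it says nothing about the next million sentences,
# which may come from very different domains than the small test sample.
```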

Who is making the judgments and what are their credentials?
It is usually cost-prohibitive to use expert professional translators to make the judgments and thus evaluators are often acquired on crowdsourcing platforms where evaluator and translator competence is not easily ascertained. 

It can be seen that doing an evaluation properly would be a significant and expensive task, and MT developers have to do this continuously while building the system. The process needs to be efficient, fast, and consistent. It is often only possible to do such careful tests on the most mission-critical projects, and it is not realistic to follow all these rigorous protocols for typical low-ROI enterprise projects. This is why BLEU and other "imperfect" automated quality scores are so widely used. They provide developers with continuous feedback in a fast and cost-efficient manner if they are done with care and rigor. Recently there has been much discussion about testing on documents, to assess understanding of context rather than just sentences. This will add complexity, cost, and difficulty to an already difficult evaluation process, and IMO will yield very small incremental benefits in evaluative and predictive accuracy. There is a need to balance improved process recommendations against cost and the benefit of improved predictability.


The Academic Response


Recently, several academic researchers provided some feedback on their examination of these claims of human parity in MT. The study is called “A Set of Recommendations for Assessing Human–Machine Parity in Language Translation” and is worth a look to see the many ways in which evaluations can go wrong. The study showed that human evaluation of MT quality depends on three factors: “the choice of raters, the availability of linguistic context, and the creation of reference translations.”


Some findings from this report in summary:

“Professional translators showed a significant preference for human translation, while non-expert [crowdsourced] raters did not”.

“Human evaluation methods which are currently considered best practice fail to reveal errors in the output of strong NMT systems”

The authors recommend the following design changes to MT developers in their evaluation process:
  • Appoint professional translators as raters
  • Evaluate documents, not sentences
  • Evaluate fluency on top of adequacy
  • Do not heavily edit reference translations for fluency
  • Use original source texts
Most developers would say that implementing all these recommendations would make the evaluation process prohibitively expensive and slow. The researchers here do agree and welcome further studies into “alternative evaluation protocols that can demonstrate their validity at a lower cost.” Process changes need to be practical and reasonably possible, and we see again that there is a need to balance improved process benefits against cost and improved predictability.


What Would Human Parity MT Look Like?


MT developers should refrain from making claims of achieving human parity until there is clear evidence that this is happening at scale. Most current claims of achieving parity are based on laughably small samples of 100 or 200 sentences. I think it would be useful to the user community at large for MT developers to refrain from making these claims until their systems can do all of the following:
  • Produce accurate, fluent output for 90% or more of a large sample (>100,000 or even 1M sentences) that truly looks like it was translated by a competent human
  • Catch obvious errors in the source and possibly even correct these before attempting to translate
  • Handle variations in the source with consistency and dexterity
  • Have at least some nominal amount of contextual referential capability
Note that these are things we would expect without question from an average translator. So why not from the super-duper AI machine? 

Until we reach the point where all of the above is true, it would be useful to CLEARLY state the boundary limits of the claim with key parameters underlying the claim. Such as:
  • How large the test set was (e.g. 90% of 50 sentences where parity was achieved) 
  • Descriptions of what kind of source material was tested
  • How varied the test material was: sentences, paragraphs, phrases, etc...
  • Who judged, scored, and compared the translations
For example, an MT developer might state a parity claim as follows:
We found that a sample of 45/50 original human sourced sentences translated by the new MT system were judged by a team of three crowdsourced translator/raters as indistinguishable from the translations produced by two professional human translators.  Based on this data, we claim the system has achieved "limited human parity".

Until the minimum set of capabilities is shown at the MT scale (>100,000 or even 1M sentences) we should tell MT developers to STFU and give us the claim parameters in a simple, clear, summarized way, so that we can weigh the reality of the data versus the claim for ourselves.  

I am also skeptical that we will achieve human parity by 2029 as some "singularity" enthusiasts have been saying for over a decade. 
 
"There is not the slightest reason to believe in a coming singularity. Sheer processing power [and big data] is not pixie dust that magically solves all your problems." 
Steven Pinker 
Elsewhere, Pinker also says:
"… I’m skeptical, though, about science-fiction scenarios played out in the virtual reality of our imaginations. The imagined futures of the past have all been confounded by boring details: exponential costs, unforeseen technical complications, and insuperable moral and political roadblocks. It remains to be seen how far artificial intelligence and robotics will penetrate into the workforce. (Driving a car is technologically far easier than unloading a dishwasher, running an errand, or changing a baby.) Given the tradeoffs and impediments in every other area of technological development, the best guess is: much farther than it has so far, but not nearly so far as to render humans obsolete."

Recently some in the Singularity community have admitted that "language is hard" as you can see in this attempt to explain why AI has not mastered translation yet.

Michael Housman, a faculty member of Singularity University, explained that the ideal scenario for machine learning and artificial intelligence is something with fixed rules and a clear-cut measure of success or failure. He named chess as an obvious example and noted machines were able to beat the best human Go player. This happened faster than anyone anticipated because of the game’s very clear rules and limited set of moves.

Housman elaborated, “Language is almost the opposite of that. There aren’t as clearly-cut and defined rules. The conversation can go in an infinite number of different directions. And then of course, you need labeled data. You need to tell the machine to do it right or wrong.”

Housman noted that it’s inherently difficult to assign these informative labels. “Two translators won’t even agree on whether it was translated properly or not,” he said. “Language is kind of the wild west, in terms of data.”

Perhaps, we need to admit that human parity MT at scale is not a meaningful or achievable goal. If it is not possible to have a super-competent human translator capable of translating anything and everything with equal ease, why do we presume a machine could?

Perhaps what we really need is an MT platform that can rapidly evolve in quality with specialized human feedback. Post-editing (MTPE) today is generally NOT a positive experience for most translators. But human interaction with the machine can be a significantly better and more positive experience. Developing interactive and highly responsive MT systems that can assist, learn, and improve the humdrum elements of translation tasks instantaneously might be a better research focus. This may be a more worthwhile goal than having a God-like machine that can translate anything and everything at human parity.

Even in the AI-will-solve-all community, we know that "language is hard" so maybe we need more focus on improving the man-machine interface, and the quality of the interaction and finding more sophisticated collaborative models. Rapid evolution, intuitive and collaborative interaction, and instant learning seem like a more promising vision to me than crawling all the data on the web and throwing machine learning pixie dust at your ten trillion word TM training data. 

Getting to a point where the large majority of translators ALWAYS WANT TO USE MT because it simply makes the work easier, more pleasant, and more efficient is perhaps a better focus for the future.  I would bet also that this different vision will be a more likely path to better MT systems that consistently produce better output over millions of sentences.   

Friday, May 1, 2020

Evaluating Machine Translation Systems

This post is the first in a series of upcoming posts focusing on the issue of quality evaluation of multiple MT systems. MT system selection has become a more important issue in recent times as users and buyers realize that potentially multiple MT systems can be viable for their needs, but would like to develop better, more informed selection procedures.

I have also just ended my tenure at SDL, and this departure will also allow my commentary and opinion in this blog to be more independent and objective, from this point onwards. I look forward to looking more closely at all the most innovative MT solutions in the market today and providing more coverage on them.  

As NMT technology matures, it has become increasingly apparent to many buyers that traditional metrics like BLEU that are used to compare/rank different MT systems and vendors are now often inadequate for this purpose, even though these metrics are still useful to engineers who are focused on building a single MT system. It is now much more widely understood that best practice involves human evaluations used together with automated metrics. This combined scoring approach is a more useful input in conducting comparative evaluations of MT systems. To the best of my knowledge, there are very few in the professional translation world who do this well, and it is very much an evolving practice where the learning is happening now. Thus, I invite any readers who might be willing to share their insights into conducting consistent and accurate human evaluations to contact me about doing this here.

Most of the focus in the localization world's use of MT remains on MTPE efficiencies (edit distance, translator productivity), often without consideration of how the volume and useable quality might change and impact the overall process and strategy. While this focus has value, it misses the broader potential of MT and "leaves money on the table" as they say.

We should understand the questions that are most frequently asked:
  • What MT system would work best for our business purposes?
  • Is there really enough of a difference between systems to use anything but the lowest cost vendor?
  • Is there a better way to select MT systems than just looking at generic BLEU scores?
I have covered these questions to some extent in prior posts and I would recommend this post and this post to get some background on the challenges in understanding the MT quality big picture.

The COVID-19 pandemic is encouraging MT-use in a positive way. Many more brands now realize that speed, digital agility, and a greater digital presence matter in keeping customers and brands engaged. As NMT continues to improve, much of the "bulk translation market" will move to a production model where most of the work will be done by MT.  Translators who are specialists and true subject matter experts are unlikely to be affected by the technology in a negative way, but NMT is poised to penetrate standard/bulk localization work much more deeply, driving costs down as it does so.

This is a guest post and an unedited independent opinion from an LSP (Language Service Provider) and it is useful in providing us an example of the most common translation industry perspective on the subject of multiple MT system evaluations. It is interesting to note that the NMT advances over SMT are still not quite understood by some, even though the bulk of the research efforts and most new deployments have shifted to NMT. 

Most LSPs continue to stress that human translation is "better" than MT, which most of us on the technology side would not argue against, but this view loses something when we see that the real need today is to "translate" millions of words a day. This view also glosses over the fact that all translation tasks are not the same. Even in 2020, most LSPs continue to overlook that MT solves new kinds of translation problems that involve speed and volume, and that new skills are needed to really leverage MT in these new directions. There is also a tendency to position the choice as a binary MT vs. human translation decision, even though much of the evidence points to new man + machine models that provide an improved production approach. The translation needs of the future are quite different from those of the past, and I hope that more service providers in the industry start to recognize this.

I also think it is unwise for LSPs to start building their own MT systems, especially with NMT. The complexity, cost and expertise required are prohibitive for most. MT systems development should be left to real experts who do this on a regular and continuing basis. The potential for LSPs adding value is in other areas, and I hope to cover this in the coming posts.


Source: MasterWord



==============



It's not a secret that machine translation (MT) has taken the world by storm. Almost everyone now has had some experience with MT, mostly in the form of a popular translation app such as Google Translate. But MT comes in a variety of formats and is heavily utilized by businesses and institutions all over the world.

With that in mind, which MT system is best? Since MT comes in many colors, figuratively speaking, which one should you rely on if you decide to build your own MT system? We'll also talk more about translation quality and whether or not MT is suitable for specialized translations such as medical translation, a critical field now for any active translation company in light of the current coronavirus pandemic that has the whole world at a standstill.


What is Machine Translation?

Machine Translation, or MT, is software that is capable of translating text from a source language into a target language. Over the years, there have been multiple variations of MT, but there are three definitive types: Rules-Based Machine Translation (RBMT), Statistical Machine Translation (SMT), and Neural Machine Translation (NMT). Here's a quick rundown of their characteristics, including their pros and cons relative to each other:

  1. RBMT

Rules-Based Machine Translation is one of the earliest forms of MT. Its algorithm is language-based, meaning that for it to know how to translate one language into another, it must rely on input data in the form of a lexicon, grammar rules, and other linguistic fundamentals. The problem with RBMT systems is scaling them efficiently, as they become more complicated as more language rules are added. RBMT is also never ideal for obscure languages with minuscule data. With the development of more advanced MT systems over the years, RBMT has largely been superseded, as you'll see with its successor next.

  2. SMT

Statistical Machine Translation, compared to RBMT, is designed to translate languages using statistical algorithms. SMT works by being fed data in the form of bilingual text corpora; it is programmed to identify patterns in the data and form its translations from them. Patterns in this context mean how many times a certain word or phrase appears consistently in a certain context. This probability-learning model allows SMT systems to render relatively appropriate translations compared to RBMT. It's pretty much like 'if this is how it was done, then this is how it should be done'.

SMT must also be fed plenty of data, just like RBMT, but MT developers, including translation app developers, prefer SMT for its ease of setup thanks to the numerous open-source SMT systems available, its cost-effectiveness thanks to the free, quality parallel text corpora available online, its higher translation accuracy than RBMT, and its ease of scalability as the system grows.

But just like RBMT, SMT can't function well if it's fed insufficient or poorly structured parallel text corpora. That being said, it's not ideal for translating obscure languages.
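To make the "counting patterns" idea concrete, here is a toy sketch that estimates word-translation probabilities by relative co-occurrence frequency from a tiny invented parallel corpus. Real SMT toolkits such as Moses add proper word alignment, phrase extraction, and a language model; this only illustrates the intuition.

```python
from collections import Counter, defaultdict
from itertools import product

# Tiny sentence-aligned English-Spanish corpus, invented for illustration.
parallel_corpus = [
    ("the house", "la casa"),
    ("the green house", "la casa verde"),
    ("the book", "el libro"),
]

# Crude co-occurrence counts (no real word alignment is attempted here).
cooccur = defaultdict(Counter)
for en, es in parallel_corpus:
    for e, s in product(en.split(), es.split()):
        cooccur[e][s] += 1

def translation_prob(source_word, target_word):
    """P(target | source) estimated by relative co-occurrence frequency."""
    total = sum(cooccur[source_word].values())
    return cooccur[source_word][target_word] / total if total else 0.0

print(translation_prob("house", "casa"))   # 0.4 -- frequently co-occurs
print(translation_prob("house", "libro"))  # 0.0 -- never co-occurs
```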

  3. NMT

Neural Machine Translation is the latest development in MT. Think of it as an upgraded version of SMT whose abilities are now supplemented with artificial intelligence (AI), specifically deep learning. Not only is it capable of combing through data faster, it can also produce better outputs through constant trial and error. SMT learns in a similar way, but the difference, albeit a definitive one, is that NMT is able to do it much faster and more accurately. Google Translate made the switch from its old SMT system to NMT in 2016.

Its deep learning capability is such a game-changer that it's able to accomplish what RBMT and SMT could not: translating obscure regional languages. That's why Google Translate can cover over 100 languages, such as Somali and Gaelic. But its outputs are questionable, to say the least, as it needs some time to learn a language that has little reliable data available for it to use. However, the development of NMT just goes to show how far MT overall has evolved over the years.


What Makes A Good Machine Translation (MT) System?

There have been many MT systems over the years, and many are still in development. The ones that have survived the test of time are select variants of RBMT and most variants of SMT. NMT has quickly gained popularity and will slowly replace SMT as the years go by. What's generally expected of a good custom-built MT system is reliability and quality of output, pretty much like any other product or service out there.

If you're looking for a reliable metric, then BLEU (Bilingual Evaluation Understudy) is one of the most widely used MT evaluation metrics. BLEU rates MT systems on a scale from 0, being the worst, to 1, being the best. It rates how close the translated text is to a human reference translation. The more human-like and natural-sounding the translation is, the better the score.

That being said, every MT developer creates their system according to not only the developer's but also a client's specifications and linguistic needs, so no two of them are alike. But there are MT platforms that are widely used by multiple clients due to their flexibility in being adapted to the client's needs and their ease of use. Even with a variety of MT systems being developed over the years, one thing remains the same: MT systems have to learn from a lot of quality data and must be given the time to learn.

They say that machines are inherently dumb and that they're only as good as the job or data given to them. For MT, that notion still rings true to this day and will most likely keep ringing for decades to come. However, quality data isn't the only thing that makes a good MT system.

There are platforms in which MT is integrated with other processes in order for it to render quality, or at the very least passable, translations. Indeed, MT is a process unto its own, but its outputs, even with deep learning capabilities, are still not up to par with those of a professional translator. MT has to be integrated with other processes, namely computer-assisted translation (CAT) tools.

There are many CAT tools, but two of the most essential are a glossary tool and translation memory. A glossary is simply a database of terminology and approved translations. It's a very simple feature but a very important one, as it saves the translator a lot of time; they don't need to constantly check back and forth which translation is the right choice for the source text at hand.

A translation memory is also like a glossary, but it stores phrases and sentences. It also saves the translator valuable time, as many translations recycle the same language, such as user manuals, marketing collateral, etc. A translation memory also helps by providing consistent language within a given domain and language pair.
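As a rough illustration of how these two components behave, the sketch below implements a toy glossary lookup and a fuzzy translation-memory match using only the Python standard library; the entries and the match threshold are invented for illustration.

```python
from difflib import SequenceMatcher

# Toy glossary: approved term translations (entries are invented).
glossary = {
    "power supply": "fuente de alimentación",
    "user manual": "manual de usuario",
}

# Toy translation memory: previously translated segments (invented).
translation_memory = {
    "Turn off the device before cleaning.": "Apague el dispositivo antes de limpiarlo.",
    "Keep the user manual for future reference.": "Conserve el manual de usuario para futuras consultas.",
}

def tm_lookup(segment, threshold=0.75):
    """Return the best fuzzy TM match above the threshold, if any."""
    best_score, best_pair = 0.0, None
    for source, target in translation_memory.items():
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if score > best_score:
            best_score, best_pair = score, (source, target)
    return (best_score, best_pair) if best_score >= threshold else (best_score, None)

segment = "Keep this user manual for future reference."
score, match = tm_lookup(segment)
print(f"Fuzzy TM match ({score:.0%}): {match}")

# Glossary terms found in the segment are flagged for the translator.
hits = {term: translation for term, translation in glossary.items() if term in segment.lower()}
print(f"Glossary hits: {hits}")
```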


I Now Pronounce You Man and Machine

However, even with all the bells and whistles developers can equip an MT system with, is MT alone enough? Can MT alone produce the accurate, quality translations demanded by the clients of language services today? MT is part of the solution but doesn't comprise the complete picture. It sounds counterintuitive, but MT is best paired with a professional translator as a means of optimizing the translation process.

This unlikely union broke the predictions of many who saw MT giving professional translators a run for their money and driving translation companies out of business. Professional translators work with CAT tools because they help them churn out more words than ever before and be more consistent. Why the need for speed? Domo's latest report states that "2.5 quintillion bytes of data are created every single day". That's a lot of data, most of it is not in English, and that creates the rising demand for translation services.

Also, by having a translator work together with an MT system, the translator is doing the MT system a favor as well by constantly feeding back revisions for the MT to learn from and render better outputs and suggestions. All in all, it’s a highly productive and beneficial two-way street between a translator and an MT system.

Of course, this 'relationship' will be moot if the MT system wasn't developed to a satisfactory standard. That being said, developers have to take into account both translation clients and the translators themselves.

They have to ensure that the MT system will not only produce quality translations for clients but can also adapt to the needs of the translators using it. Being convenient to use and having a friendly UX design is one thing; being able to incorporate a translator's input and accurately replicate it in similar contexts is another.


What Do Professional Translation Services Have Over MT?

Specifically, what can a translation company that hires professional translators do better than artificial intelligence (AI)? Apart from translation quality and consistency, a professional translator has one advantage: they're human. It may sound cliche, but a human can understand nuances that no MT or AI can replicate; machines are light years away from doing so.

Unable to Understand Emotional, Cultural, and Social Nuances

As of now, there is no MT capable of accurately understanding jokes, slang, creative expressions, and so on. The abilities of MT shine brightly with formulaic sentences and predictable language conventions. But when confronted with the linguistic habits that are natural in everyday conversation, MT falls apart. This problem is even more pronounced at a global scale, since every culture and society has its own way of speaking, all the way down to highly distinct street lingo.

Unable to Process Linguistic Nuances

Parent languages are divided into regional vernaculars and dialects. When someone tries to translate English to Spanish with MT, the result is just generic Spanish with no local 'flavoring'. But if you're aiming for translations that ring true to how Spanish or Mexican people actually speak, then a professional translator with native-speaking ability is who you need. No MT system today is able to comprehend, let alone translate, linguistic nuances reliably.

Unable to Keep Up With Linguistic Trends

Languages change every day, with new words constantly being added to and removed from the lexicon of world languages. Humor, slang, and creative expressions are a testament to that notion. Even social media has given rise to new creative expressions in ways human society has never experienced before, with meme culture as one of the most notable examples. Even if NMT were somehow capable of keeping up, it would still need time for the data to accumulate before it could start translating. By that time, new slang would have already popped up.

Unable to Render Specialized and Highly Contextual Translations

What we mean by specialized here is text with highly nuanced terminology, such as the literary field, and also texts belonging to critical fields such as the legal, scientific, and medical sectors. Authors inherently embed their works with highly nuanced expressions and linguistic 'anomalies', so much so that there is no identifiable pattern any MT system can work with, since each author has their own voice.

The legal and medical sectors have their own language conventions that may seem formulaic on the surface, but the inherently specialized terminology and the risk involved in these fields mean that no margin of error can be given to MT. There are MT systems used in these sectors, but they are always paired with a professional legal translator or professional medical translator.


Developing Your Own MT System

Even with the quality issues and other imperfections associated with MT, the demand for machine translation services continues to grow. According to a report published in Market Watch, "The Global Machine Translation Market was valued at USD 550.46 million in 2019 and is expected to reach USD 1042.46 million by 2025, at a CAGR of 11.23% over the forecast period 2020 - 2025."

However, many are looking to develop their own in-house MT instead of 'borrowing' one from an external provider, and for good reason. If a translation company renders plenty of niche translations in a given year, then building its own MT system can be the most cost-effective investment, since there is no need to pay licensing fees to external MT providers.

Many industries have their own language conventions and jargon, mostly for internal communication. For example, legalese is perfectly comprehensible to lawyers but downright alien-sounding to those with little legal knowledge. Beyond that, individual businesses and organizations have language conventions that veer off from the industry norm. In that case, they may need to build their very own MT systems, especially if they are focusing on specific target foreign markets and audiences.

So out of the three approaches listed earlier, which one should you choose? It is most likely SMT, given its popularity and how much support it gets. Some have gone for a Hybrid MT that combines SMT and RBMT, but that is probably too intimidating for first-timers. If you want to make the big leap right from the start, then by all means go NMT if it meets your company's objectives.

Mind you, acquiring and training any MT system comes at a price and will take time. It takes time for glossaries and translation memories to develop, provided that the data used to feed the system is up to standard. For a translation company, that usually is not a problem, since its own document archives can be used in tandem with open-source parallel text corpora.


Can You Choose MT Over a Translation Company?

Not so long ago, instant language translation belonged to the category of futuristic science-fiction gadgets. In a sense it still does, although we have raised our standards: what we dream of now is instant voice interpretation, being able to conduct a seamless multilingual conversation with anyone without the awkward pauses. But back to reality. It is hard not to be impressed with the abilities of MT today, since we can easily witness them from our smartphones.

Even so, MT has plenty of flaws, as discussed earlier, that are hindering it from achieving serious widespread adoption. Be that as it may, MT as it stands today has its own perks. Although one should not rely too heavily on MT beyond certain thresholds, that does not mean you should not use it at all in specific situations. Here are some reasons why.

Cost

There are plenty of translation apps out there, such as Google Translate, as you might know already. Most of them are free, with premium subscriptions available to unlock more features. There are plenty of free translation plugins for website developers as well. Keep in mind that we are talking about generic translators here, not the specialized MT systems from external providers that carry licensing fees.

Speed and Convenience

In certain situations, people just want a translation the very moment they need it. Whether you are a language student or a traveling businessperson, MT is your answer: it is free, and you get results the moment you click the translate button. Even if it is not 100% accurate, it at least conveys the gist of the meaning.

For Generic, Repetitive, and Well-Resourced Languages

*Consider this pointer at your own risk.* MT can certainly be useful if you have non-contextual and predictable text at hand, such as simple and formulaic phrases. What you decide to do with the output is on you, whether you use it only as a reference or actually employ it in a professional setting. That being said, the best-quality translations come from well-resourced languages such as Spanish, German, French, etc. If you try translating even a simple phrase from English to Chinese, you are unlikely to get a similarly accurate translation, since English and Chinese have vastly different language rules and an unrelated linguistic history.


A Note on Translation Quality in the Context of the Coronavirus Pandemic

Despite the vast improvements to MT, quality is still a significant issue, and as you are aware, human translators are there to guarantee it. In no situation is quality more necessary than in global crisis communication, as made evident by the current coronavirus pandemic, specifically in the form of medical translation. Medical translation is a highly specialized and critical niche, in which the slightest mistranslation could lead to unfortunate and even fatal consequences.

Medical translation must be provided by specialized medical translators who have complete mastery of their language pair (e.g., English to Spanish, Spanish to English) and extensive familiarity with medical terminology, medical practices, and codes of ethics. They must undergo additional lengthy training before they can be classified as certified medical translators. That being said, are MT systems out of the picture?

There are MT systems that translate medical documents and medical research, but they must work under constant supervision from a certified medical translator. Connecting this to today's crisis, there has rarely been a time when speedy translation of medical research was more important. Medical scientists all over the world are working together to understand the COVID-19 virus, to come up with viable treatments and, eventually, a vaccine. With that in mind, medical translation is the bridge that makes this level of coordination between medical scientists around the world possible.


Final Takeaway

Will there be a future where MT is so advanced and so nearly human-like that professional translators become an endangered species? Judging by the pace of MT development over such a short period, it would not be unreasonable to believe in a future like that. However, that view does not pay enough attention to what is demanded from translation in the first place.

It is apparent now that MT is good at serving translation speed and optimization needs, but much of what constitutes quality still belongs to the hands, or rather the mind, of a professional translator. That union will likely last for the next few decades. But let us not hold ourselves to that prediction: perhaps a game-changing MT feature is just a few years away, or, if our prediction holds, decades away. Even then, that assumes our standards for translation, particularly on quality and human-ness, have not changed.



Author Bio:

Laurence Ian Sumando is a freelance writer penning pieces on business, marketing, languages, and culture.

Thursday, March 2, 2017

Lilt Labs Response to my Critique of their MT Evaluation Study

I had a chat with Spence Green earlier this week to discuss the critique I wrote of their comparative MT evaluation, in which I might have been a tad harsh. I think we were both able to see each other's viewpoints a little better, and I summarize the conversation below. This is followed by an MT Evaluation Addendum that Lilt has added to the published study to provide further detail on the specific procedures they followed in their comparative evaluation tests. These details should be helpful to those who want to replicate, or modify and then replicate, the test for themselves.

While I largely stand by what I said, I think it is fair to allow Lilt to respond to the criticism to the degree they wish. Some of my characterization may have been overly harsh (the sheep-wolf image, for example). My preference would have been for Lilt to write this response directly rather than having me summarize the conversation, but I hope I have captured the gist of our chat (which was mostly amicable) accurately and fairly.

The driving force behind the study was ongoing Lilt customer requests to know how the various MT options compare. Spence (Lilt) said that he attempted to model the evaluation along the lines of the NIST "unrestricted track" evaluations, and he stated repeatedly that they tried to be as transparent and open as possible so that others could replicate the tests for themselves. I did point out that one big difference here is that, unlike NIST, we have a company comparing itself to its competitors while also managing the evaluation. Clearly a conflict of interest, but mild compared to what is going on in Washington DC now. Thus, however well-intentioned and transparent the effort may be, the chances of protest are always going to be high with such an initiative.

Spence did express his frustration with how little understanding there is of (big) data in the localization industry, which makes these kinds of assessments, and any discussion of core data issues, problematic.

Some of the specific clarifications he provided are listed below:
  • SwissAdmin was chosen as it was the "least bad" data we could have used to conduct a test with some level of adaptation that everybody could replicate. Private datasets were not viable because the owners did not want to share the data with the larger community to enable test replication. We did argue over whether this data was really representative of localization content, but given the volume of data needed and the need to have it easily available to all, there was not a better data resource available. To repeat, SwissAdmin was the least bad data available. Spence pointed out that:
  1. Observe that the LREC14 paper has zero citations according to Google Scholar
  2. Compare adaptation gains to three different genres in their EMNLP15 paper.
  • It is clear that Google NMT is a "really good" system and sets a new bar for all the MT vendors to measure against, but Spence felt it was not accurate to say that Google is news-focused, as it has a much broader data foundation from the extensive web-crawling data acquisition that supports the engine. He also challenged my conclusion that, since GNMT was so good, other systems were not worth the effort. It is clear that an adaptation/customization effort with only 18,000 segments is unlikely to outperform Google, and we both agreed that most production MT systems in use will have much more data to support adaptation. (I will also mention that the odds of do-it-yourself Moses systems being able to compete on quality now are even lower, and that Moses practitioners should assess whether DIY is worth the time and resources at all, if they have not already realized this. Useful MT systems will almost by definition need an expert foundation and expert steering.)
  • He also pointed out that there are well-documented machine learning algorithms that can assess if the MT systems have certain data in their training sets and that these were used to determine that the SwissAdmin data was suitable for the test.
  • While they were aware of the potential for bias in the evaluation study, they made every effort to be as open as possible about the evaluation protocol and process. Others can replicate the test if they choose to.
  • Lilt provides an explanation of the difficulties associated with running the SDL Interactive system in a footnote in the Evaluation Addendum attached below.
  • We also agreed that 18,000 segments (used for adaptation/customization here) may not be quite enough to properly customize an MT engine, and that most successful MT engines use a much larger volume of data to produce clear superiority over GNMT and other continuously evolving public MT engines. This again points to the difficulty of doing such a test with "public" data: the odds of finding the right data, in sufficient volume, that everyone can use are generally not very good.
  • Microsoft NMT was not initially included in the test because it was not clear how to gain access to it. He pointed me to the developer documentation to show that the Google docs were much clearer about how to access the NMT systems. This lack of documentation may have been addressed since the test was run.
  • One of my independent observations, on who really "invented" Adaptive MT, also seemed troublesome to Spence. I chose to focus on Microsoft and SDL patents as proof that others were thinking about this and had well-developed ideas on how to implement it long before Lilt came into existence. However, he pointed out, quite correctly and much more accurately, that others were discussing this approach from as early as the 1950s, and that Martin Kay and Alan Melby, in particular, were discussing it in the 1970s. He pointed out a paper that details this and provides historical context on the foundational thinking behind Adaptive and Interactive MT. This suggests to me that any patent in this area is largely built on the shoulders of these pioneers. Another paper by Martin Kay from 1980 presents the basic Adaptive MT concept on page 18. He also made me aware of TransType (the first statistical interactive/adaptive system, a project that began in 1997 in Canada). For those interested, you can get details on the project here:

Finally, all other vendors are welcome to reproduce, submit, and post results. We even welcome an independent third-party taking over this evaluation. 
 Spence Green


It may be that some type of comparative evaluation will become more important for the business translation industry as users weigh different MT technology options, and it could possibly provide some insight into relative strengths and weaknesses. However, the NIST evaluation model is very difficult to implement in the business translation (localization) use case, and I am not sure it even makes sense here. There may be an opportunity for a more independent body with some MT expertise to provide a view on comparative options, but we should understand that MT systems can also be tweaked and adjusted to meet specific production goals, and that the entire MT system development process is dynamic and evolving in best-practice situations. Source data can and should be modified and analyzed to get better results, and systems should be boosted in weak areas after initial tests and continuously improved with active post-editor involvement to build long-term production advantage, rather than relying on this type of instant snapshot comparison. What may matter much more in a localization setting is how quickly and easily a basic MT system can be updated and enhanced to be useful in business translation production scenarios. A quick snapshot view has very low value in a user scenario where it is understood that any MT system needs more work than just throwing some TM at it BEFORE putting it into production.



-------------------------



Experimental Design
We evaluate all machine translation systems for English-French and English-German. We report case-insensitive BLEU-4 [2], which is computed by the mteval scoring script from the Stanford University open source toolkit Phrasal (https://github.com/stanfordnlp/phrasal). NIST tokenization was applied to both the system outputs and the reference translations.
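For readers who want a quick way to sanity-check this metric without installing Phrasal, here is a minimal sketch using the sacrebleu library as a stand-in for the mteval script (an assumption on my part, not the tooling used in the study); sacrebleu's default 13a tokenizer approximates NIST/mteval tokenization, and lowercase scoring gives the case-insensitive variant.

```python
# Minimal sketch: case-insensitive corpus BLEU-4 with sacrebleu, used here as a
# stand-in for Phrasal's mteval script (assumed substitute, not the study's tooling).
import sacrebleu

def score_system(hypotheses, references):
    """hypotheses and references are parallel lists of segment strings."""
    bleu = sacrebleu.corpus_bleu(
        hypotheses,
        [references],       # sacrebleu expects a list of reference streams
        lowercase=True,     # case-insensitive, as in the evaluation
        tokenize="13a",     # mteval-v13a rules, close to NIST tokenization
    )
    return bleu.score

# e.g. score_system(mt_outputs, reference_translations) -> a score in [0, 100]
```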

We simulate the scenario where the translator translates the evaluation data sequentially from the beginning to the end. We assume that she makes full use of the resources the corresponding solutions have to offer by leveraging the translation memory as adaptation data and by incremental adaptation, where the translation system learns from every confirmed segment.

System outputs and scripts to automatically download and split the test data are available at: https://github.com/lilt/labs.

System Training
Production API keys and systems are used in all experiments. Since commercial systems are improved from time to time, we record the date on which the system outputs were generated.

Lilt
The Lilt baseline system, available through the REST API with a production API key. The system can be reproduced with the following series of API calls (a scripted sketch follows below):
  • POST /mem/create   (create new empty Memory)
  • For each source segment in the test set:
    • GET /tr  (translate test segment)
Date: 2016-12-28
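As a rough illustration of that call sequence, the sketch below loops over the test set against a single empty Memory. Only the /mem/create and /tr endpoints come from the list above; the base URL, authentication style, and parameter names are my own assumptions and should be checked against Lilt's API documentation.

```python
# Sketch of the baseline run: create one empty Memory, then translate each test
# segment against it. Base URL, auth, and parameter names are assumptions.
import requests

API_BASE = "https://lilt.example/api"    # hypothetical base URL
API_AUTH = ("YOUR_PRODUCTION_KEY", "")   # hypothetical HTTP basic auth

def lilt_baseline(test_segments, srclang="en", trglang="de"):
    mem = requests.post(f"{API_BASE}/mem/create", auth=API_AUTH,
                        params={"srclang": srclang, "trglang": trglang}).json()
    outputs = []
    for src in test_segments:
        tr = requests.get(f"{API_BASE}/tr", auth=API_AUTH,
                          params={"memory_id": mem["id"], "source": src}).json()
        outputs.append(tr)
    return outputs
```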

Lilt adapted
The Lilt adaptive system, available through the REST API with a production API key. The system simulates a scenario in which an extant corpus of source/target data is added for training prior to translating the test set. The system can be reproduced with the following series of API calls:
  • POST /mem/create   (create new empty Memory)
  • For each source/target pair in the TM data:
    • POST /mem  (update Memory with source/target pair)
  • For each source segment in the test set:
    • GET /tr  (translate test segment)
Date: 2017-01-06

Lilt Interactive
The Lilt interactive, adaptive system, available through the REST API with a production API key. The system simulates a scenario in which an extant corpus of source/target data is added for training prior to translating the test set. To simulate feedback from a human translator, each reference translation for each source sentence in the test set is added to the Memory after decoding. The system can be reproduced with the following series of API calls (a scripted sketch follows below):
  • POST /mem/create   (create new empty Memory)
  • For each source/target pair in the TM data:
    • POST /mem  (update Memory with source/target pair)
  • For each source segment in the test set:
    • GET /tr  (translate test segment)
    • POST /mem (update Memory with source/target pair)
Date: 2017-01-04
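The interactive run adds two things to the baseline sketch: the TM is posted into the Memory before decoding, and after each decoded segment the reference translation is posted back to simulate a confirming translator. Again, only the endpoint names are taken from the addendum; the URL, auth, and parameter names are assumptions.

```python
# Sketch of the interactive, adaptive run. Endpoint names follow the list above;
# base URL, auth, and parameter names are assumptions for illustration.
import requests

API_BASE = "https://lilt.example/api"    # hypothetical
API_AUTH = ("YOUR_PRODUCTION_KEY", "")   # hypothetical

def lilt_interactive(tm_pairs, test_pairs, srclang="en", trglang="de"):
    """tm_pairs and test_pairs are lists of (source, target/reference) tuples."""
    mem = requests.post(f"{API_BASE}/mem/create", auth=API_AUTH,
                        params={"srclang": srclang, "trglang": trglang}).json()
    # Batch adaptation: add every TM pair to the Memory before decoding.
    for src, tgt in tm_pairs:
        requests.post(f"{API_BASE}/mem", auth=API_AUTH,
                      params={"memory_id": mem["id"], "source": src, "target": tgt})
    outputs = []
    for src, ref in test_pairs:
        # Decode, then feed the reference back to simulate translator confirmation.
        tr = requests.get(f"{API_BASE}/tr", auth=API_AUTH,
                          params={"memory_id": mem["id"], "source": src}).json()
        outputs.append(tr)
        requests.post(f"{API_BASE}/mem", auth=API_AUTH,
                      params={"memory_id": mem["id"], "source": src, "target": ref})
    return outputs
```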

Google
Google’s statistical phrase-based machine translation system. The system can be reproduced by querying the Translate API:
  • For each source segment in the test set:
    • GET https://translation.googleapis.com/language/translate/v2?model=base
Date: 2016-12-28

Google neural
Google’s neural machine translation system (GNMT). The system can be reproduced by querying the Premium API (a sketch covering both Google runs follows below):
  • For each source segment in the test set:
    • GET https://translation.googleapis.com/language/translate/v2?model=nmt
Date: 2016-12-28
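Both Google runs hit the same Translate v2 endpoint; only the model parameter changes between the phrase-based ("base") and neural ("nmt") systems. The sketch below assumes API-key authentication via the key query parameter and the standard v2 JSON response shape.

```python
# Sketch of the two Google runs: identical calls, with model="base" for the
# phrase-based system and model="nmt" for GNMT. Key handling and response
# parsing are assumptions based on the public v2 API.
import requests

GOOGLE_URL = "https://translation.googleapis.com/language/translate/v2"
GOOGLE_KEY = "YOUR_API_KEY"  # hypothetical

def google_translate(test_segments, target="de", model="base"):
    outputs = []
    for src in test_segments:
        resp = requests.get(GOOGLE_URL, params={
            "key": GOOGLE_KEY, "q": src, "source": "en",
            "target": target, "model": model,   # "base" or "nmt"
        }).json()
        outputs.append(resp["data"]["translations"][0]["translatedText"])
    return outputs
```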

Microsoft
Microsoft’s baseline statistical machine translation system. The system can be reproduced by querying the Text Translation API:
  • For each source segment in the test set:
    • GET /Translate
Date: 2016-12-28

Microsoft adapted
Microsoft’s statistical machine translation system. The system simulates a scenario in which an extant corpus of source/target data is added for training prior to translating the test set.  We first create a new general category project on Microsoft Translator Hub, then a new system within that project and upload the translation memory as training data. We do not provide any tuning or test data so that they are selected automatically. We let the training process complete and then deploy the system (e.g., with category id CATEGORY_ID). We then decode the test set by querying the Text Translation API, passing the specifier of the deployed system as category id:
  • For each source segment in the test set:
    • GET /Translate?category=CATEGORY_ID
Date: 2016-12-30 (after the migration of Microsoft Translator to the Azure portal)

Microsoft neural
Microsoft’s neural machine translation system, accessed by passing the generalnn category to the Text Translation API (a sketch covering the three Microsoft runs follows below):
  • For each source segment in the test set:
    • GET /Translate?category=generalnn
Date: 2017-02-20
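The three Microsoft runs differ only in the category passed to /Translate: none for the baseline, the deployed Hub system's CATEGORY_ID for the adapted run, and generalnn for the neural run. The host, authentication header, and response handling below are assumptions modeled on the legacy (pre-v3) Text Translation API and would need adjusting for the current service.

```python
# Sketch of the Microsoft runs: one GET /Translate per segment, with the category
# parameter selecting baseline (omitted), Hub-adapted (CATEGORY_ID), or neural
# ("generalnn"). Endpoint, auth header, and response handling are assumptions.
import requests

MS_URL = "https://api.microsofttranslator.com/v2/Http.svc/Translate"  # legacy endpoint (assumed)
MS_KEY = "YOUR_AZURE_KEY"  # hypothetical

def microsoft_translate(test_segments, target="de", category=None):
    outputs = []
    for src in test_segments:
        params = {"text": src, "from": "en", "to": target}
        if category:                      # e.g. CATEGORY_ID or "generalnn"
            params["category"] = category
        resp = requests.get(MS_URL, params=params,
                            headers={"Ocp-Apim-Subscription-Key": MS_KEY})
        outputs.append(resp.text)         # the legacy API returned XML-wrapped text
    return outputs
```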

Systran neural
Systran’s “Pure Neural” neural machine translation system. The system can be reproduced through the demo website. We manually copied and pasted the source into the website in batches of no more than 2000 characters (a batching sketch follows below). We verified that line breaks were respected and that batching had no impact on the translation result. This required considerable manual effort and was performed over the course of several days.
Date(s): en-de: 2016-12-29 - 2016-12-30; en-fr: 2016-12-30 - 2017-01-02
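Because the demo site only accepts limited input, batching matters. Here is a small sketch of how source segments can be packed into chunks of at most 2000 characters without ever splitting a line; the character limit comes from the description above, while the packing logic is my own illustration.

```python
# Pack whole source lines into batches of at most max_chars characters, keeping
# line breaks intact so output lines can be re-aligned with the input by position.
def batch_segments(segments, max_chars=2000):
    batches, current, length = [], [], 0
    for seg in segments:
        extra = len(seg) + (1 if current else 0)   # +1 for the newline separator
        if current and length + extra > max_chars:
            batches.append("\n".join(current))
            current, length = [], 0
            extra = len(seg)
        current.append(seg)
        length += extra
    if current:
        batches.append("\n".join(current))
    return batches
```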

SDL
SDL’s Language Cloud machine translation system. The system can be reproduced through a pre-translation batch task in Trados Studio 2017.
Date: 2017-01-03

SDL adapted
SDL’s “AdaptiveMT” machine translation system, which is accessed through Trados Studio 2017. The system can be reproduced by first creating a new AdaptiveMT engine specific to a new project and then pre-translating the test set. The new project is initialized with the TM data. We assume that the local TM data is propagated to the AdaptiveMT engine for online retraining. The pre-translation batch task is used to generate translations for all non-exact matches. Adaptation is performed on the TM content. In the adaptation-based experiments, we did not confirm each segment with a reference translation due to the amount of manual work that would have been required in Trados Studio 2017. (1)
(1) We were unable to produce an SDL interactive system comparable to Lilt interactive. We first tried confirming reference translations in Trados Studio. However, we found that model updates often require a minute or more of processing. Suppose that pasting the reference into the UI requires 15 seconds and the model update requires 60 seconds; for en-de, 1299 * 75 / 3600 = 27.1 hours would have been required to translate the test set. We then attempted to write interface macros to automate the translation and confirmation of segments in the UI, but the variability of the model updates and other UI factors, such as scrolling, prevented successful automation of the process. The absence of a translation API prevented crowd completion of the task with Amazon Mechanical Turk.

The Lilt adapted, Microsoft adapted, and SDL adapted systems are most comparable, as they were adapted in batch mode, namely by uploading all TM data, allowing training to complete, and then decoding the test set. Of course, other essential yet non-user-modifiable factors such as the baseline corpora, optimization procedures, and optimization criteria can and probably do differ.

Test Corpora
We defined four requirements for the test corpus:
  1. It is representative of typical paid translation work
  2. It is not used in the training data for any of the competing translation systems
  3. The reference translations were not produced by post-editing from one of the competing machine translation solutions
  4. It is large enough to permit model adaptation

Since all systems in the evaluation are commercial production systems, we could neither enforce a common data condition nor ensure the exclusion of test data from the baseline corpora, as in requirement (2). Nevertheless, in practice it is relatively easy to detect the inclusion of test data in a system’s training corpus via the following procedure (a minimal sketch of the outlier check follows the list):
  1. Select a candidate test dataset
  2. Decode test set with all unadapted systems and score with BLEU
  3. Identify systems that deviate significantly from the mean (in our case, by two standard deviations)
  4. If a system exists in (3):
    1. Sample a subset of sentences and compare the MT output to the references.
    2. If reference translations are present,
      1. Eliminate the candidate test dataset and go to (1)
  5. Accept the candidate test dataset
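A minimal sketch of step (3), assuming the unadapted BLEU scores have already been computed: take the mean and standard deviation across systems and flag anything more than two standard deviations away, which then triggers the manual comparison in step (4). The scores in the example are placeholders, not results from the study.

```python
# Flag unadapted systems whose BLEU on the candidate test set deviates from the
# mean by more than two standard deviations (placeholder numbers, not study data).
from statistics import mean, stdev

def flag_outliers(bleu_by_system, n_std=2.0):
    scores = list(bleu_by_system.values())
    mu, sigma = mean(scores), stdev(scores)
    return [name for name, s in bleu_by_system.items() if abs(s - mu) > n_std * sigma]

# The suspiciously high fourth system below would be flagged for manual inspection.
print(flag_outliers({"sys1": 21.5, "sys2": 23.0, "sys3": 24.1, "sys4": 57.3,
                     "sys5": 22.8, "sys6": 23.4, "sys7": 22.2, "sys8": 24.6}))
```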

Starting in November 2016, we applied this procedure to the eight public datasets described in Appendix B. The ninth corpus that we evaluated was SwissAdmin, which both satisfied our requirements and passed our data selection procedure.  

SwissAdmin [http://www.latl.unige.ch/swissadmin/] is a multilingual collection of press releases from the Swiss government from 1997-2013. We used the most recent press releases. We split the data chronologically, reserving the last 1300 segments of the 2013 articles as the English-German test data and the last 1320 segments as the English-French test set. Chronological splits are standard in MT research to account for changes in language use over time. The test sets were additionally filtered to remove a single segment that contained more than 200 tokens. The remainder of the articles from 2011 to 2013 were reserved as in-domain data for system adaptation.
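A sketch of that split, under the assumption that each aligned segment carries a date (or appears in document order): sort chronologically, hold out the final block as the test set, drop any test segment over 200 tokens, and keep the remainder as TM/adaptation data.

```python
# Chronological split with a length filter, as described above. The (date, source,
# target) tuple structure is an assumption for illustration.
def chronological_split(segments, test_size=1300, max_tokens=200):
    """segments: list of (date, source, target) tuples, already sentence-aligned."""
    segments = sorted(segments, key=lambda s: s[0])        # oldest -> newest
    tm_data, test_data = segments[:-test_size], segments[-test_size:]
    test_data = [s for s in test_data
                 if len(s[1].split()) <= max_tokens]       # drop over-long segments
    return tm_data, test_data
```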

SwissAdmin        #segments    #words (source / target)
en-de TM          18,621       548,435 / 482,692
en-de test        1,299        39,196 / 34,797
en-fr TM          18,163       543,815 / 600,585
en-fr test        1,319        40,139 / 44,874


Results

(Updated with Microsoft neural MT)

SwissAdmin (BLEU)     English->German    English->French
Lilt                  23.2               30.4
Lilt adapted          27.7               33.0
Lilt interactive      28.2               33.1
Google                23.7               31.9
Google neural         28.6               33.2
Microsoft             24.8               29.0
Microsoft adapted     27.6               29.8
Microsoft neural      23.8               30.7
Systran neural        24.2               31.0
SDL                   22.6               30.4
SDL adapted           23.8               30.4

Appendix B: Candidate Datasets
The following datasets were evaluated and rejected according to the procedure specified in the Test Corpora section:
  • JRC-Acquis
  • PANACEA English-French
  • IULA Spanish-English Technical Corpus
  • MuchMore Springer Bilingual Corpus
  • WMT Biomedical task
  • Autodesk Post-editing Corpus
  • PatTR
  • Travel domain data (from booking.com and elsewhere) crawled by Lilt