Tuesday, January 17, 2017

The Trouble With Competitive MT Output Quality Evaluations

Comparing and assessing the output quality of different MT systems has always been difficult to do right. "Right," in this context, means fair, reasonable, and accurate. The difficulty is closely related to the problems of measuring translation quality in general, which we discussed in this post. It is further aggravated when evaluating customized and/or adapted systems, since doing so requires special skills and real knowledge of each MT platform, in addition to time and money. The costs of doing this properly make it somewhat prohibitive.

BLEU is the established measurement of choice that we all use, but it is easy to deceive yourself, deceive others, and paint a picture that has the patina of scientific rigor yet is completely biased and misleading. BLEU, as we know, is deeply flawed, but we don't have anything better, especially for longitudinal studies, and if you use it carefully, it can provide some limited insight in a comparative evaluation.

In the days of the NIST competitive evaluations, the focus was on Chinese and Arabic to English (News Domain), and there were some clear and well-understood rules on how this should be done to enable fair competitive comparisons. Google was often a winner (i.e. highest BLEU score), but they sometimes "won" by using systems that took an hour to translate a single sentence, because they evaluated 1000X as many translation candidate options as their competitors to produce their best one. Kind of bullshit, right? More recently, we have WMT16, which attempts to go beyond the news domain, does more human evaluations, evaluates PEMT scenarios, and again controls the training data used by participants, to attempt to fairly assess the competitors. Both of these structured evaluation initiatives provide useful information if you understand the data, the evaluation process, and the potential bias, but both are also flawed in many ways, especially in the quality and consistency of human evaluations.

One big problem for any MT vendor in doing output quality comparisons with Google is that for a test to be meaningful, it has to use something that the Google MT system does not already have in its knowledge database (training set). Google crawls news sites extensively (and the internet in general) for bilingual text (TM) data, so the odds of finding data they have not seen are very low. If you give a college student all the questions and the answers that are on the test before they take the test, the probability is high that they will do well on that test. This is why Google generally scores better on news domain tests against most other MT vendors, as they likely have 10X to 1,000X the news data that anybody except for Microsoft and Baidu has. I have also seen MT vendors and ignorant LSPs show off unnaturally high BLEU scores by having an overlap between the training and test set data. The excitement dies quickly once you get to the actual data you want to translate, which the system has not seen before.
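For a system whose training data you actually control, the overlap problem can at least be checked before any score is published. What follows is a minimal sketch, not anyone's actual tooling: the function names and sample sentences are invented for illustration, and it only catches exact matches after light normalization. It cannot tell you anything about what a public system like Google has seen, and it will miss partial reuse or data that was used only in the language model.

```python
import re

def normalize(sentence: str) -> str:
    """Lowercase and drop punctuation so trivial variants of a sentence match."""
    return " ".join(re.findall(r"\w+", sentence.lower()))

def overlap_rate(train_sentences, test_sentences) -> float:
    """Return the fraction of test sentences that also occur in the training set."""
    if not test_sentences:
        return 0.0
    seen = {normalize(s) for s in train_sentences}
    hits = sum(1 for s in test_sentences if normalize(s) in seen)
    return hits / len(test_sentences)

# Invented sample data, for illustration only.
train = ["The council approved the budget.", "Press release: new tax rules."]
test = ["the council approved the budget", "A sentence the system has never seen."]

rate = overlap_rate(train, test)
print(f"{rate:.0%} of test sentences overlap with training data")
```

Any non-trivial overlap rate is a reason to distrust the resulting BLEU score, while a rate of zero from a check like this is only weak evidence that the test set is truly blind.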

Thus, when an MT technology vendor tells us that they want to create a lab to address the lack of independent and objective information on quality and performance, and create a place where “research and language professionals meet,” one should be at least a little bit skeptical, because there is a conflict of interest here, as Don DePalma pointed out. But, after seeing the first "fair and balanced" evaluation from the labs, I think it might not be over-reaching to say that this effort is neither fair nor balanced, except in the way that Fox News is. At the very least, we have gross self-interest pretending to be in the public interest, just like we now have with Trump in Washington D.C. But sadly, in this case, they actually point out that, even with customization/adaptation, Google NMT outperforms all the competitive MT alternatives, including their own. This is like shooting yourself in the hand and foot at the same time with a single bullet!

A List of Specific Criticisms

Those who read my blog regularly know that I regard the Lilt technology favorably, and see it as a meaningful MT technology advance, especially for the business translation industry. The people at Lilt seem to be nice, smart, competent people, and thus this "study" is surprising. Is this deliberately disingenuous, or did they just get really bad marketing advice to do what they did here?

Here is a listing of specific problems that would be clear to any observer who did a careful review of this study and its protocol.

Seriously, This is Not Blind Data.

The probability of this specific data being truly blind data is very low. The problem with ANY publicly available data is that it has a very high likelihood of having been used as training data by Google, Microsoft, and others. This is especially true for data that has been around as long as the SwissAdmin corpus has been. Many of the tests typically used to determine if the data has been used previously are unreliable, as the data may have been used partially, or only in the language model. As Lilt says: "Anything that can be scraped from the web will eventually find its way into some of the (public) systems," and any of the things I listed above will compromise the study if they have happened. If this data or something very similar is being used by the big public systems, it will skew the results and lead to erroneous conclusions. How can Lilt assert with any confidence that this data was not used by others, especially Google? If Lilt was able to find this data, why would Google or Microsoft not be able to as well, especially since the SwissAdmin corpus is described in detail in this LREC 2014 paper?

Quality Evaluation Scoring Inconsistencies: Apples vs. Oranges

  • The SDL results and test procedure seem to be particularly unfair and biased. They state that, "Due to the amount of manual labor required, it was infeasible for us to evaluate an “SDL Interactive” in which the system adapts incrementally to corrected translations." However, this infeasibility does not seem to prevent them from giving SDL a low BLEU score. The "adaptation" that was conducted was done in a way that SDL does not recommend for best results, so publishing such sub-optimal results is rude and unsportsmanlike conduct. Would it not be more reasonable to say it was not possible, and leave it blank?
  • Microsoft and Google released their NMT systems on the same day, November 15, 2016. (Click on the links to see). But Lilt chose to only use the Google NMT in their evaluation.
  • SYSTRAN has been updating their PNMT engines on a very regular basis and it is quite possible that the engine tested was not the most current or best-performing one. At this point in time, they are still focusing on improving throughput performance, and this means that lower quality engines may be used for random, free, public access for fast throughput reasons. 
  • Neither SYSTRAN nor SDL seems to have benefited from the adaptation, which is very suspicious, and should they not be given an opportunity to show this adaptation improvement as well?
  • Finally, one wonders how the “Lilt Interactive” score is processed. How many sentences have been reviewed to provide feedback to the engine? I am sure Lilt took great care to put their own best systems forward, but they also seemed to have been less careful and even seem to have executed sub-optimal procedures with all the others, especially SDL. So how can we trust the scores they come up with?

Customization Irregularities

This is still basically news or very similar-to-news domain content. After making a big deal about using content that "is representative of typical paid translation work," they basically chose data that is heavily news-like. Press releases are very news-like, and my review of some of the data suggests it also looks a lot like EU data, which is also in the training sets of public systems. News content is the default domain that public systems like Google and Microsoft are optimized for, and it is also a primary focus of the WMT systems. And for those who scour the web for training data, this domain has by far the greatest amount of relevant publicly available data. However, in the business translation world, which was supposedly the focus here, most domains that are relevant for customization are exactly UNLIKE the News domain. The precise reason enterprises need to develop customized MT solutions is that their language and vocabulary are different from what public systems tend to do well (namely news). The business translation world tends to focus on areas where there is very little public data to harvest, either due to domain-specificity – medical, automotive, engineering, legal, eCommerce etc. or due to company-specific terminology. So, basically, testing on news-like content does not say anything meaningful about the value of customization in a non-news domain. What it does say is that public generic systems do very well on news, which we already knew from years of WMT evaluations, which were done with much more experimental rigor and more equitable evaluation conditions.

Secondly, the directionality of the content matters a lot. In “real life”, a global enterprise generates content in a source language, where it is usually created from scratch by native speakers of that language, and needs it translated into one or more target languages. Therefore, this is the kind of source data that we should test if we are trying to recreate the localization market scenario. Unfortunately, this study does NOT do that (and to be fair this problem infects WMT and pretty much the whole academic field – I don’t mean to pick on Lilt!). The test data here started out as native Swiss German, and then was translated into English and French. In the actual test conducted, it was evaluated in the English⇒French and English⇒German directions, which means that the source input text was obtained from (human) translations, NOT native text. This matters. Microsoft and others have done many evaluations to show this. Even good human translations are quite different from true native content. In the case of English⇒French, both the source and the reference are translated content.

There is also the issue of questionable procedural methodology when working with competitive products. From everything I gathered in my recent conversations with SDL, it is clear that adaptation by importing some TM into Trados is a sub-optimal way to customize an MT engine in their current product architecture. It is even worse when you try to jam a chunk of TM into their adaptive MT system, as Lilt also admitted. One should expect very different, and sub-optimal, outcomes from this kind of effort, since the technology is designed to be used in an interactive mode for best results. I am also aware that most customization efforts with phrase-based SMT involve a refinement process, sometimes called hill-climbing.

Just throwing some data in, and taking a BLEU snapshot, and then concluding that this is a representative outcome for that platform is just wrong and misleading. Most serious customization efforts require days of effort at least, if not weeks to complete, prior to a production release.

Another problem when using human translated content as source or reference is that in today’s world, many human translators start with a Google MT backbone and post-edit. Sometimes the post-edit is very light. This holds true whether you crowd-source, use a low-cost provider such as unBabel (which explicitly specifies that they use Google as a backbone), or a full-service provider (which may not admit this, but that is what their contract translators are doing with or without their permission). The only way to get a 100% from-scratch translation is to physically lock the translator in an internet-free room! We already know from multi-reference data sets that there are many equally valid ways to translate a text. When the “human” reference is edited based on Google, the scores naturally favor Google output.
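To make the reference-bias point concrete, here is a toy illustration. It is a deliberately simplified, sentence-level BLEU-style score (unigram and bigram precision plus a brevity penalty), not the official BLEU implementation, and every sentence in it is invented for the example. The function names are hypothetical as well.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(hypothesis: str, reference: str, max_n: int = 2) -> float:
    """Simplified single-reference BLEU: clipped n-gram precision up to max_n."""
    hyp, ref = hypothesis.lower().split(), reference.lower().split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        total = sum(hyp_counts.values())
        # Clip each n-gram count by how often it occurs in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: punish hypotheses shorter than the reference.
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(sum(log_precisions) / max_n)

# Invented outputs from two hypothetical systems, A and B.
system_a = "the committee has approved the new budget on monday"
system_b = "on monday the committee approved the new budget"

# One reference written from scratch, one lightly post-edited from system A.
from_scratch = "the committee approved the new budget on monday"
post_edited_a = "the committee has approved the new budget this monday"

for name, ref in [("from scratch", from_scratch), ("post-edited from A", post_edited_a)]:
    print(name, round(simple_bleu(system_a, ref), 3), round(simple_bleu(system_b, ref), 3))
```

In this toy case, system B scores higher against the from-scratch reference, while system A scores higher against the reference that was post-edited from its own output. Both references are perfectly acceptable translations; the ranking flips purely because of where the reference came from.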

Finally,  the fact that the source data starts as Swiss German, rather than regular German may also be a minor problem. The differences between these German variants appear to be most pronounced when it is spoken rather than written, but Schriftsprache (written Swiss German) does seem to have some differences with standard high German. Wikipedia does state that: "Swiss German is intelligible to speakers of other Alemannic dialects, but poses greater difficulty in total comprehension to speakers of Standard German. Swiss German speakers on TV or in films are thus usually dubbed or subtitled if shown in Germany."

 Possible Conclusions from the Study

All this suggests that it is rather difficult for any MT vendor to conduct a competitive evaluation in a manner that would be considered satisfactory and fair to, and by, other MT vendor competitors. However, the study does provide some useful information:

  • Do NOT use News domain or news-like domain if you want to understand what the quality implications are for "typical translation work".  
  • Google has very good generic systems, which are also likely to be much better with News domain than with other specialized corporate content.
  • Comparative quality studies sponsored by an individual MT vendor are very likely to have a definite bias, especially on comparing customized systems.
  • According to this study, if these results were indeed ACTUALLY true, there would be little point in using anything other than Google NMT. However, it would be wrong to conclude that using Google would be better than properly using any of the customized options available, since, except for Lilt, we can presume they have not been optimally tuned. Lilt responded to my post comment on this point saying, "On slightly more repetitive and terminology-heavy domains we can usually observe larger improvements of more than 10% BLEU absolute by adaptation. In those cases, we expect that all adapted systems would outperform Google’s NMT."
  • Go to an independent agent (like me or TAUS) who has no vested interest other than to get accurate and meaningful results, which also means that everybody understands and trusts the study BEFORE they engage. A referee is necessary to ensure fair play, in any competitive sport as we all know from childhood.
  • It appears to me (only my interpretation and not a statement of fact) that Lilt's treatment of SDL was particularly unfair. In the stories of warring tribes in human literature, this usually is a sign that suggests one is particularly fearful of an adversary.  This intrigued me, so I did some exploration and found this patent which was filed and published years BEFORE Lilt came into existence.  The patent summary states: "The present technology relates generally to machine translation methodologies and systems, and more specifically, but not by way of limitation, to personalized machine translation via online adaptation, where translator feedback regarding machine translations may be intelligently evaluated and incorporated back into the translation methodology utilized by a machine translation system to improve and/or personalize the translations produced by the machine translation system."  This clearly shows that SDL was thinking about Adaptive MT long before Lilt. And, Microsoft was thinking about dynamic MT adaptation as far back as 2003. So who really came up with the basic idea of Adaptive MT technology? Not so easy to answer, is it?
  • Lilt has terrible sales and marketing advisors if they were not able to understand the negative ramifications of this "study", and did not try to adjust it or advise against publicizing it in its current form. For some of the people I talked to in my investigation, it even raises some credibility issues for the principals at Lilt.

 I am happy to offer Lilt an unedited guest post on eMpTy Pages if they care to, or wish to, respond to this critique in some detail rather than just through comments. In my eyes, they attempted to do something quite difficult and failed, which should not be condemned per se, but it should be acknowledged that the rankings they produced are not valid for "typical translation work".  We should also acknowledge that the basic idea behind the study is useful to many, even if this particular study is questionable in many ways. I could also be wrong on some of my specific criticisms, and am willing to be educated, to ensure that my criticism in this post is also fair. There is only value to this kind of discourse if it furthers the overall science and understanding of this technology, and my intent here is to question experiment fundamentals, and get to useful results, not bash on Lilt. It is good to see this kind of discussion beginning again, as it suggests that the MT marketplace is indeed evolving and maturing.


P.S. I have added the Iconic comments as a short separate post here to provide the perspective of MT vendors who perform deep, careful, system customization for their clients and who were not included directly in the evaluation.


  1. Perhaps you have offered one explanation of the gulf between the MT perspectives of MT advocates on the one hand and translators on the other.
    As a translator, I regularly get "blind data", i.e. sentences and sometimes even terminology that cannot previously be found on the Internet at all. In my subject domains (law/architecture etc.) this is compounded by the syntactical gap between my source language (German) and my target language (English). So when I run any sentences through an MT resource (such as GT), the result is often barely comprehensible (let alone near-human-quality).
    So thank you for putting your finger on the secret detail which often hoodwinks MT researchers (i.e. that they test systems using allegedly unknown text, but that the computer has often seen it all before anyway).

  2. Nice overview, Kirti. It is very hard to compare MT output for a full text. Even if human translators do the evaluation, sentence per sentence, and in a blind test-setup, there is not always a clear winner. Overall Google performs well, but so do others in some cases. I think it is a waste of time, all these competitive comparisons. For LSPs, what matters, is how the performance of the MT systems is on their own texts. And how the translators relate to it.
    What I do like about LILT, is how it uses the knowledge of the translator to improve the translation while the translator is translating. There's always the translation quality itself (one engine being a bit better than another one, for one sentence but maybe not for the next), but what often matters to the translator, is the help (s)he's getting when using the suggestions. I think LILT has done a very good job caring about a natural user experience for the translator. MT and TM blend smoothly. I must say also SDL has done a very nice job in Studio. The experience is different, but for translators who are used to translate with Trados Studio, it feels natural as well. All I really wanted to add to your overview: there is more to it than just the MT quality.
    When you compare cars, you don't only look how hard they can go, but also how safe they are, how comfortable, how clean...
    It is a pity we focus this much on something that is so very source text dependent! And maybe not that relevant for a translator who must work on it. Only for those who want to publish un-edited MT output may this kind of test be relevant. LILT and SDL Trados Studio are both made for editing.

    1. Very good points Gert -- the overall work experience is a much better foundation for choosing (an MT vendor) as this workflow can also be tailored by some of the MT vendors who specialize in deep customization but may do poorly in a superficial MT quality test based on instant customization and an instant BLEU score. But I think that this rapid ability to tune to editing feedback is going to be a big deal in the localization market.

  3. I'm posting the same comment here and in the dedicated space of the first post on this subject leaving to you to pick one.
    In his recent post on this very same topic, Don DePalma complains about the lack of a vendor-neutral MT Labs such as PC Labs that conducts systematic and unbiased testing.
    First of all, testing hardware at PC Labs is made easier by standards, i.e. testing parameters are agreed and shared upon.
    Secondly, hardware is always tested of the same range and type, i.e. technology, users, etc.
    Last, but not least, testing results are always presented in such a way that readers can understand and compare.
    Incidentally, as being an integral part of a tech magazine, PC Labs cannot be said to be above any suspicion, if only because the main source of income of the magazine itself and its publisher is in advertisers, who are the manufacturers of the tested hardware.
    That said, Lilt's effort is, in fact, a try. Does anybody remember the time when CSA's Arle Lommel was LISA's, then GALA's, and DFKI's and... Are standards, by accident, affected by bigwigs in the industry, whether in the academic, the production or the service side?
    Does anybody remember the time when SDL complained for Trados being irrespectful of agreed standards?
    Does anybody remember the time when Jaap van der Meer offered TAUS to be a watchdog for industry standards?
    Where are now those who could have made TAUS become the PC Labs of the translation industry? Maybe networking, shaking hands, smiling, and presenting at industry events. What about collaboration?
    Let me say that the duplicate efforts on DQF and MQM are yet another missed opportunity. Indeed, lost opportunities in the technology arena of the translation industry are countless, and this is yet another reason why the industry is not "appreciated."
    I look forward to finding all MT people, from Google to Microsoft, from Lilt to SDL, from KantanMT to Iconic, from Omniscien to Systran to Tauyou, etc., united in a common effort towards useful and reliable standards for measuring and assessing all MT-related issues, objectively and productively.