Wednesday, October 5, 2016

Feedback on the Google Neural MT Deception Post

There was an interesting discussion thread on Reddit about the Google deception post, with somebody using the alias oneasasum, that I thought was worth highlighting here, since it was the most coherent criticism of my original post.

Google makes MASSIVE progress on Machine Translation -- "We show that our GNMT system approaches the accuracy achieved by average bilingual human translators on some of our test sets."

This is a slightly cleaned-up version of just our banter. The whole thread, which you can see at the link above, also has other fun comments:

KV: Seriously exaggerated -- take a look at this for a more accurate overview: The Google Neural Machine Translation Marketing Deception

HE: You should have also posted this article, as you did on another Reddit forum:

That's a much better take, in my opinion.
I saw the blog posting myself the other day. This isn't marketing deception, and most of what this guy covers in his piece, I also covered in mine -- with the exception of pointing out the "60%" and "87%" claims as not being meaningful. (My title may have given you a different impression, however.)

People in NLP are not impressed by the advances in theory or algorithm, as the results amount to a repackaging of methods developed over the past two years by the wider academic community; but are impressed by the scale of the effort, and by the results. See, for example, what Yoav Goldberg said on Twitter -- he said he's impressed by the results:

The GNMT results are cool. the BLEU not so much, only the human evals. But this is very hard to compare to other systems.
Another example is Kyunghyun Cho, known for his work on neural machine translation:

“I am extremely impressed by their effort and success in making the inference of neural machine translation fast enough for their production system by quantized inference and their TPU,” Cho says.
The second thing I would say is that the research article is written by researchers, not Google marketing people. The Google marketing people have no sway over how researchers pitch their results in research articles. 

My read of what these researchers have written (and also what a Google software engineer or two wrote on Twitter, before deleting their comments), is that they are very excited by their work, and feel they have made genuine progress. What you are seeing is not "hype", but "excitement". But there is always a price to pay for showing emotion -- somebody will always try to bring you back down to earth.

The third thing I would say is that this is the first example of a large deployment of neural machine translation, according to Cho again:

But he confirmed that Google seems to be the first to publicly announce its use of neural machine translation in a translation product.
That, in and of itself, is praiseworthy.
The fourth thing I would say is to take with a grain of salt comments by people from either a competing product or school of thought. Perhaps this doesn't apply here; but it's still good to keep it in mind. An example of this might be something like the following: say you have one group working on classical knowledge representation using small data. And then say a machine learning method with large amounts of data makes progress on a problem they care about. What are they going to say? Are they going to say, "That's really great that we are now making progress on this old, stubborn problem!"? No, more likely they'll say, "That's just empty hype. They're nowhere near to solving that problem, and if they really want to make progress they'll drop what they're doing and use some classical knowledge representation."

KV: While the sheer scale of the initiative, both in terms of training data volume and the ability to provide translations to millions of users at production scale, is impressive, the actual translation quality results are really not that impressive and certainly do not warrant claims such as “Nearly Indistinguishable From Human Translation” and “GNMT reduces translation errors by more than 55%-85% on several major language pairs”.

The translation improvement claims based on the human evaluation are where the problem lies. The validity of the human evaluation is the biggest question mark about the whole report. This is well known to people in the MT research community, so to make the claims they did is disingenuous and even deceptive.

I agree they are doing it on a massive scale, but it is actually surprising that they seem to have gotten so little improvement in translation quality, as Rico Sennrich at the University of Edinburgh says in this post:

HE: Well, I suppose they will work harder next time to find a better way to measure the quality of their system. Again, I don't think they were trying to deceive.
One thing I would say, however, is that BLEU scores have problems, too. One problem is that even human translators sometimes have low BLEU scores (I had a better reference for this, but lost it, so will give this one):

Recent experiments computed so-called human BLEU scores, where a human reference translation scored against other human reference translations. Such human BLEU scores are barely higher (if at all) than BLEU scores computed for machine translation output, even though the human translations are better.
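
To make the point about human BLEU scores concrete, here is a minimal sketch of a simplified, BLEU-like score. It uses clipped unigram and bigram precision with a brevity penalty against a single reference, so it is not the standard corpus-level metric, which typically goes up to 4-grams and can use multiple references. The function names and the two example sentences are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU-like score against one reference.

    Assumption: whitespace tokenization, n-grams up to max_n, no smoothing.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0
        # Clip each candidate n-gram count by its count in the reference.
        matches = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        if matches == 0:
            return 0.0
        precisions.append(matches / total)
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(cand) > len(ref):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions.
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


# Two hypothetical human translations of the same source sentence.
human_a = "the cat sat on the mat"
human_b = "a cat was sitting on the rug"

print(simple_bleu(human_a, human_a))  # identical text scores 1.0
print(simple_bleu(human_a, human_b))  # a valid paraphrase scores far lower
```

Both sentences would be acceptable translations, yet one scored against the other lands well below 1.0 because the metric only rewards overlapping word sequences. That is the effect the quote above describes.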

KV: Absolutely, BLEU scores are deeply flawed, but they are UNDERSTOOD, and so they continue to be used, as all the other metrics are even worse. I have written about this on my blog.

So here is an example of the "nearly indistinguishable from human translation" GNMT output for a Chinese web page that I just translated with the new NMT engine. The page happens to talk about work that Baidu, Alibaba and Microsoft are doing. It is definitely better than looking at a page of Chinese characters (for me anyway) but clearly a very long way from human translation.

HE: Yes, I saw those examples. This guy had posted a link to Twitter, before he deleted it:

He is a software engineer at Google, and was very excited by the results. But, yes, those particular examples weren't great. Not clear whether they were a random sample, or a sample showing the range of quality. 

Also another fun thing that he noticed from the press coverage:

Here's a Technology Review article about it: Google’s New Service Translates Languages Almost as Well as Humans Can   

This quote is priceless:

“It can be unsettling, but we've tested it in a lot of places and it just works,” he [Googler and co-author on the paper Quoc Le] says.


Take a look at that Chinese newspaper sample above, which I ran today. Seriously, what are these guys smoking, and are they really so deluded? Yes, clearly they have done something that few can do in terms of using thousands of computers and solving a tough computing challenge. But it is of very little benefit to the guy who does not speak Chinese, as the English they produce is STILL pretty hard to follow. This is the source page. And this is the translation I got today from the super-duper human-like GNMT!

Original Chinese Text:

Google GNMT Translation:

Language service support "along the way" and the line and far


  1. Sadly, this Google announcement has overshadowed some announcements with much-needed academic discipline by the Moses team in Edinburgh. Marcin Junczys-Dowmunt, a researcher at Edinburgh, published this article on LinkedIn moments ago:

    It refers to a previous article that seems to have gone largely overlooked.

    An honest look at the numbers from Edinburgh and Google shows that NMT is an incremental improvement over phrase-based SMT. Any improvement is good news. Often improvement does not occur in a straight line. There are ups and downs.

    Are the current increments worth the cost to update or replace existing SMT systems? Only the user can decide, but they need clarity. Thank you, Kirti, for declaring the emperor wears no clothes.

    1. Thanks for bringing these references to my attention

    2. I'd missed them, too. They both point to an academic paper by Marcin, Tomasz Dwojak, and Hieu Hoang.

      They used an AmuNMT-based decoder to replace Moses. Their tests are sound and significant using the UN corpus (~10+ million segments).

      These reports, if I recall correctly, show improvements roughly at parity with Moses' current Neural Network LM and Bilingual Neural LM features, dating back to WMT14(?). I suggest we refer to these two as SMT/Neural hybrids. These hybrids demand significantly fewer resources, and we've had them running in our lab for 2 years on Linux desktop systems. Updating the open source foundations for cross-platform use will require significant investment. Maybe we'll get around to it in 2017.

  2. Translation technology has come a long way. It has recently received ever greater attention as its development continues rapidly, aiming at greater unification and perfection. Machine translation is at times so advanced that proofreading the text is the only remaining requirement.
    Yet this applies only to text without jargon.