Friday, October 7, 2016

Real and Honest Quality Evaluation Data on Neural Machine Translation

I just saw a Facebook discussion on the Google NMT announcements that explores some of the human evaluation issues, and thought I would add one more observation to this charade before I highlight a completely overlooked study that does provide some valuable insight into the possibilities of NMT (which I actually believe are real and substantial), even though it was done in a "small-scale" university setting.

Does anybody else think it is strange that none of the press and journalists gushing about the "indistinguishable from human translation" quality claimed by Google attempted to run even a single Chinese page through the new super-duper GNMT Chinese engine?

Take this post, for example, where the author seems to have swallowed the Google story hook, line, and sinker. There are of course 50 more like it. It took me translating just one page to realize that we are really knee-deep in bullshit, as I had difficulty getting even a gist understanding of my random sample Chinese web page.

So, is there any honest, unmanipulated data out there on what NMT can do?

I have not seen all the details of the SYSTRAN effort, but based on the sample output that I asked for and the general restraint (in spite of their clear enthusiasm) they showed during my conversations, I tend to believe that they have made real progress and can offer better technology to their customers. But this study was just pointed out to me, and I took a look even though the research team has a disclaimer about this being ongoing work where it is possible that some results and conclusions might change. I thought it deserved to be more visible, and thus I wrote this.

So here we have a reasonable and believable answer to the question:

Is Neural Machine Translation Ready for Deployment? A Case Study on 30 Translation Directions

This study was conducted at the University of Edinburgh and done entirely with the UN corpus. They had more than 10 million sentences per language, in common across the six United Nations core languages, which include Chinese. Now, this may sound like a lot of data to some, but the Google and Microsoft scale is probably larger by a factor of 10 or even 20. We should also understand that the computing resources available to this little research team are probably 1% to 5% of what Google can easily get access to. (More GPUs can sometimes mean you get an extra BLEU point or two.) So here we have the David and Goliath scenario in terms of resources, but interestingly, I think the inferences they draw are very similar to what SYSTRAN and Microsoft have also reported. The team reports:

"With the exception of fr-es and ru-en the neural system is always comparable or better than the phrase-based system. The differences where NMT is worse are very small. Especially in cases where Chinese is one of the languages in a language pair, the improvement of NMT over PB-SMT is dramatic with between 7 and 9 BLEU points....
We also see large improvements for translations out of and into Arabic. It is interesting to observe that improvements are present also in the case of the highest scoring translation directions, en-es and es-en."
The research team also admits that it is not clear what the implications might be for in-domain systems, and I look forward to their hopefully less deceptive human evaluation:
"Although NMT systems are known to generalize better than phrase-based systems for out-of-domain data, it was unclear how they perform in a purely in-domain setting which is of interest for any larger organization with significant resources of their own data, such as the UN or other governmental bodies. This work currently lacks human evaluation which we would like to supply in future versions."
The comparative PBSMT vs. NMT results are presented graphically below. The blue bar is SMT and the burgundy bar is NMT; I have highlighted the most significant improvements with arrows.

It is also interesting to note that when more processing power is applied to the problem, they do get some small improvement, but it is clearly a case of diminishing returns.

"Training the NMT system for another eight days always improves the performance of the NMT system, but gains are rather small between 0.4 and 0.7 BLEU. We did not observe any improvements beyond 2M iterations. It seems that stopping training after 8-10 days might be a viable heuristic with little loss in terms of BLEU."
They additionally share some research on the NMT decoding throughput problem and its resolution, which some may find useful. Again, to be clear (for the benefit of Mr. Mike), the scale described here is minuscule compared to the massive resources that Google, Microsoft, and Facebook probably use for deployment. But they show that NMT can be deployed without using GPUs or the Google TPUs for a fraction of the cost. If this research team sends me a translation of the test Chinese page I used on GNMT, I will share it with you so you can compare it to GNMT.

We can all admit that Google is doing this on a bigger scale, but from my vantage point, it seems that they are really not getting that much better results. As University of Edinburgh’s Rico Sennrich said in his Slator interview: “Given the massive scale of the models, and the resulting computational cost, it is in fact, surprising that they do not outperform recently published work—unfortunately, they only provide a comparison on an older test set, and against relatively old and weak baselines.” He also adds that the Edinburgh system outperformed the Google system in the WMT16 evaluation (which shows how well NMT systems, and the University of Edinburgh in particular, have been doing in comparative evaluations).

So what does this mean?

NMT is definitely here for real and is likely to continue improving, albeit incrementally. If you are an enterprise concerned about large-scale Chinese, Japanese, Korean, and Arabic translation, you should be looking at NMT technology or talking to MT vendors who have real NMT products. This technology may be especially valuable for those interested in scientific and technical knowledge content like patents and scientific papers.

Hopefully, the improvement claims in the future are more carefully measured and honest, so that we don't get translators all up in arms again after they see the actual quality that systems like the "near human quality" GNMT ZH-EN produce. The new NMT systems that are emerging, however, are already a definite improvement for the casual internet user who just wants better quality gisting.

SYSTRAN will shortly start providing examples of "adapted" NMT systems, which are instantly tuned versions of a generic NMT engine. If the promise I saw in my investigation is anywhere close to some of the Adaptive MT capabilities, NMT is a real game changer for the professional translation industry as well.

Remember, the real goal is not better NMT systems; rather, it is better quality automated translation that both supports production business translation work and allows users to quickly get an accurate sense of the meaning of foreign language content.

For those who think that this study is not an industrial-strength experiment, you may be interested to know that one of the researchers quietly published this, which shows that their expertise is very much in play at WIPO even though the training sets were very small. As he says:
"A few days after Google, WIPO (the World Intellectual Property Organization) just deployed its first two in-house neural machine translation (NMT) systems for Chinese-English and Japanese-English. Both systems are already available through the Patentscope Translation Assistant for Patent Texts."
Even at this initial stage, the NMT system BLEU scores show impressive gains, and these scores can only go up from here.

Japanese to English    SMT = 24.41    NMT = 35.99
Chinese to English     SMT = 28.59    NMT = 37.56
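
To put those two rows in perspective, here is a minimal sketch that works out the absolute and relative BLEU gains from the figures quoted above. The `bleu_gain` helper and the `scores` dictionary are names I made up for illustration; they are not part of anything WIPO or the research team published, and the only data used is the four scores listed above.

```python
# BLEU scores as quoted above for the two WIPO systems: (SMT, NMT).
scores = {
    "Japanese to English": (24.41, 35.99),
    "Chinese to English": (28.59, 37.56),
}


def bleu_gain(smt, nmt):
    """Return (absolute gain in BLEU points, relative gain in percent).

    Hypothetical helper for illustration only.
    """
    absolute = nmt - smt
    relative = 100.0 * absolute / smt
    return absolute, relative


for pair, (smt, nmt) in scores.items():
    absolute, relative = bleu_gain(smt, nmt)
    print(f"{pair}: +{absolute:.2f} BLEU ({relative:.1f}% relative to SMT)")
```

On these numbers the gain is roughly 11.6 BLEU points for Japanese and 9.0 for Chinese. Keep in mind that BLEU differences are only comparable on the same test set, so these gains should not be read side by side with scores from other evaluations.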

This system is live and is something you can try out right now at this link.

The research team whose work triggered this post and essentially wrote it includes Marcin Junczys-Dowmunt, Tomasz Dwojak, and Hieu Hoang.

P.S. I have some wise words coming up on the translation evaluation issue from a guest author next week. I would love to have more translators step forward with their comments on this issue or even volunteer to write a post on human evaluation of MT output.



  1. Jez, really, who cares? We have seen these claims now for the last 30 years. In my opinion these are all just efforts to get some funding, either internally or externally. I don't care; I have not seen any MT system that produces any usable results in my domain without serious pre- and post-editing.

    1. Actually many do -- probably every major company with an international business focus cares. As you can see, Facebook, eBay, Amazon, MSFT, GOOG, the National Security groups of every G20 country, and several others have enough of a need that they will fund and sponsor this technology till it reaches a level of acceptability and usability. Most of them have translation needs that go far beyond what could be done with a human-only approach, and this is not an either/or situation as many translators seem to feel it is. Translators who dismiss the technology as irrelevant seem to miss this basic fact again and again.

      This post may give you reason to care just a tiny bit more or maybe not: