
Thursday, January 26, 2017

An Examination of the Strengths and Weaknesses of Neural Machine Translation

As Neural MT gains momentum, we see more studies that explain why it is regarded as a major step forward, and we are now beginning to understand some of the specific reasons behind that momentum. This summary by Antonio Toral and Víctor M. Sánchez-Cartagena highlights the specific advantages NMT provides, using well-understood data and comparative systems. Their main findings are presented below, but I am also including an additional comment I noticed in the paper. The paper also provides BLEU scores for all the systems that were used, consistent with the scores shown here, and it is interesting that Russian is still a language where rule-based systems produce the highest scores in tests like this. The fact that NMT systems perform so well on translations going out of English should be especially interesting to the localization industry. What we need now is evidence of how NMT systems can be domain-adapted, and SYSTRAN will soon provide some details.

The fact that NMT systems do not do well on very long sentences can be managed by making those sentences shorter. I tend to write long sentences myself, but 40-45 words in a single sentence seems very long to me, and in a localization setting I think this can be managed.
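To make this concrete, here is a minimal sketch (mine, not the authors') of how a localization pipeline might flag and roughly split over-long source sentences before they reach an NMT engine. The 40-word threshold and the punctuation-based split are assumptions for illustration only; a production workflow would use a proper segmentation tool.

    import re

    MAX_WORDS = 40  # assumed threshold, based on the 40-45 word observation above

    def needs_splitting(sentence, max_words=MAX_WORDS):
        """Flag a source sentence long enough to risk degraded NMT output."""
        return len(sentence.split()) > max_words

    def naive_split(sentence):
        """Very rough split on sentence-internal punctuation (illustration only)."""
        parts = re.split(r"(?<=[;:,])\s+", sentence)
        return [p.strip() for p in parts if p.strip()]

    source = "An example of the kind of very long source sentence an author might write ..."
    segments = naive_split(source) if needs_splitting(source) else [source]
    # Each segment would then be sent to the MT engine separately and rejoined afterwards.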


"The best NMT system clearly outperforms the best PBMT system for all language directions out of English (relative improvements range from 5.5% for EN > RO to 17.6% for EN > FI) and the human evaluation (Bojar et al., 2016, Sec. 3.4) confirms these results. In the opposite direction, the human evaluation shows that the best NMT system outperforms the best PBMT system for all language directions except when the source language is Russian." 



System     CS     DE     FI     RO     RU

From EN
  PBMT     23.7   30.6   15.3   27.4   24.3
  NMT      25.9   34.2   18.0   28.9   26.0

Into EN
  PBMT     30.4   35.2   23.7   35.4   29.3
  NMT      31.4   38.7   -      34.1   28.2

BLEU scores of the best NMT and PBMT systems for each language pair at WMT16’s news translation task.
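As a quick sanity check (my own, not part of the paper), the relative improvements quoted above follow directly from these BLEU scores; relative improvement is simply (NMT - PBMT) / PBMT:

    pbmt_from_en = {"CS": 23.7, "DE": 30.6, "FI": 15.3, "RO": 27.4, "RU": 24.3}
    nmt_from_en = {"CS": 25.9, "DE": 34.2, "FI": 18.0, "RO": 28.9, "RU": 26.0}

    for lang in pbmt_from_en:
        rel = (nmt_from_en[lang] - pbmt_from_en[lang]) / pbmt_from_en[lang] * 100
        print(f"EN > {lang}: {rel:.1f}% relative BLEU improvement")

    # EN > RO comes out at 5.5% and EN > FI at 17.6%, matching the figures quoted from the paper.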

 

 

------------

A case study on 9 language directions

 

In a paper that will be presented at EACL in April 2017, we aim to shed light on the strengths and weaknesses of the newly introduced neural machine translation (NMT) paradigm. To do so we compare the translations produced by the best neural and phrase-based MT systems submitted to the news translation task at WMT16 for 9 language directions across a number of dimensions. The main findings are as follows:
  • Translations produced by NMT are considerably different from those produced by phrase-based systems. In addition, there is higher inter-system variability in NMT, i.e. outputs from pairs of NMT systems differ more from each other than outputs from pairs of phrase-based systems.
  • NMT outputs are more fluent. We corroborate the results of the manual evaluation of fluency at WMT16, which was conducted only for language directions into English, and we show evidence that this finding also holds for directions out of English.

  • NMT systems do more reordering than pure phrase-based ones but less than hierarchical systems. However, NMT reorderings are better than those of both types of phrase-based systems.

  • NMT performs better in terms of inflection and reordering. We confirm that the findings of Bentivogli et al. (2016) apply to a wide range of language directions. Differences regarding lexical errors are negligible. A summary of these findings can be seen in the next figure, which shows the reduction of error percentages by NMT over PBMT. The percentages shown are the averages over the 9 language directions covered.

Reduction of errors by NMT averaged over the 9 language directions covered 

  • NMT performs rather poorly on long sentences. This can be observed in the following figure, where we plot the translation quality obtained by NMT and by phrase-based MT for sentences of different lengths. Translation quality is measured with chrF, an automatic evaluation metric that operates at the character level. We use it because it has been shown at WMT to correlate better with human judgments than BLEU for morphologically rich languages (e.g. Finnish), while its correlation is on par with BLEU for languages with poorer morphology, e.g. English. While we only show results based on chrF in the paper, we ran the experiment with BLEU too, and the trends are the same, namely that NMT quality degrades with sentence length (a simplified sketch of this kind of per-length evaluation follows the figure below).

    Quality of NMT and PBMT for sentences of different length
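The following is a minimal sketch of this kind of per-length evaluation, not the authors' actual code. It assumes the sacrebleu package for its chrF implementation; the bucket size and exact grouping are my own choices and may differ from the paper's setup.

    from collections import defaultdict
    from sacrebleu.metrics import CHRF

    def chrf_by_length(sources, hypotheses, references, bucket_size=10):
        """Group segments by source length (in words) and compute corpus chrF per bucket."""
        buckets = defaultdict(lambda: ([], []))
        for src, hyp, ref in zip(sources, hypotheses, references):
            bucket = (len(src.split()) // bucket_size) * bucket_size
            buckets[bucket][0].append(hyp)
            buckets[bucket][1].append(ref)

        chrf = CHRF()
        return {
            f"{b}-{b + bucket_size - 1} words": chrf.corpus_score(hyps, [refs]).score
            for b, (hyps, refs) in sorted(buckets.items())
        }

    # Scoring the NMT and PBMT outputs separately with this function and comparing the
    # per-bucket values reproduces the kind of length-vs-quality curve described above.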

Antonio Toral and Víctor M. Sánchez-Cartagena. 2017. A Multifaceted Evaluation of Neural versus Phrase-Based Machine Translation for 9 Language Directions. arXiv, GitHub


Antonio Toral
Antonio Toral is an assistant professor in Language Technology at the University of Groningen and was previously a research fellow in Machine Translation at Dublin City University. He has over 10 years of research experience in academia, is the author of over 90 peer-reviewed publications and the coordinator of Abu-MaTran, a 4-year project funded by the European Commission.

Víctor M. Sánchez-Cartagena

Víctor M. Sánchez-Cartagena is a research engineer at Prompsit Language Engineering, where he develops solutions based on automatic inference of linguistic resources and where he also worked on increasing the industrial adoption of machine translation through Abu-MaTran, a 4-year project funded by the European Commission. He obtained his Ph.D. from the University of Alicante in 2015 with the dissertation "Building machine translation systems from scarce resources", having previously been a predoctoral researcher there for four years.
