Thursday, January 18, 2018

Literary Text: What Level of Quality can Neural MT Attain?

Here are some interesting results from guest writer Antonio Toral, who gave us a good, broad look at how NMT was doing relative to PBMT last year. His latest research investigates the potential of NMT to assist with the translation of literary texts. While NMT is still a long way from human quality, it is interesting to note that NMT very consistently beats SMT even at the BLEU score level. At the research level this is a big deal. Given that BLEU scores tend to favor SMT systems, this is especially promising, and the difference is probably even more striking when the output is compared by human reviewers.

I have also included another short post Antonio did on the detailed human review of NMT vs SMT output to show those who still doubt that NMT is the most likely way forward for any MT project today.


Neural networks have revolutionised the field of Machine Translation (MT). Translation quality has improved drastically over that of the previous dominant approach, statistical MT. This has been shown for several content types, including news, TED talks and United Nations documents. The natural next question is how neural MT fares on what has historically been perceived as the greatest challenge for MT, literary text, and specifically on its most common representative: novels.

We explore this question in a paper that will appear in the forthcoming Springer volume Translation Quality Assessment: From Principles to Practice. We built state-of-the-art neural and statistical MT systems tailored to novels by training them on around 1,000 books (over 100 million words) for English-to-Catalan. We then evaluated these systems automatically on 12 widely known novels that span from the 1920s to the present day; from J. Joyce’s Ulysses to the last Harry Potter. The results (Figure 1) show that neural MT outperforms statistical MT for every single novel, achieving remarkable results: an overall improvement of 3 BLEU points.

Figure 1: BLEU scores obtained by neural and statistical MT on the 12 novels

Can humans notice the difference between human and machine translations?


We asked native speakers to blindly rank human and machine translations for three of the novels. For two of them, around 33% of the translations produced by neural MT were perceived to be of equivalent quality to the translations by a professional human translator (Figure 2). This percentage is much lower for statistical MT, at around 19%. For the remaining book, both MT systems obtain lower results, but they are still favourable for neural MT: 17% for neural MT versus 8% for statistical MT.

Figure 2: Readers' perceptions of the quality of human versus machine translations for Salinger’s The Catcher in the Rye

How far are we?

Based on these ranks, we derived an overall score for human translations and the two MT systems (Figure 3). We take statistical MT as the departure point and human translation as the goal to be ultimately reached. Current neural MT technology has already covered around one fifth (20%) of the way: a considerable step forward compared to the previous MT paradigm, yet still far from human translation quality. The question now is whether neural MT can be useful [in future] to assist professional literary translators… To be continued.

Figure 3: Overall scores for human and machine translations
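
The "one fifth of the way" figure can be read as the share of the gap between statistical MT and human translation that neural MT closes. A minimal sketch of that arithmetic follows. The function name and the example scores are hypothetical and chosen only for illustration; the actual overall scores are those reported in the paper, which may derive them differently.

```python
def gap_covered(smt_score, nmt_score, human_score):
    """Fraction of the SMT-to-human gap that NMT has closed.

    0.0 means NMT is no better than SMT; 1.0 means NMT matches human quality.
    """
    gap = human_score - smt_score
    if gap == 0:
        raise ValueError("human and SMT scores are equal; the gap is undefined")
    return (nmt_score - smt_score) / gap


# Hypothetical overall scores, NOT the values from the paper.
smt, nmt, human = -1.0, -0.5, 1.5
print(f"NMT covers {gap_covered(smt, nmt, human):.0%} of the way")  # NMT covers 20% of the way
```

Normalising by the gap rather than quoting the raw score difference makes the result independent of the scale on which the overall scores happen to be expressed.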

 A. Toral and A. Way. 2018. What Level of Quality can Neural Machine Translation Attain on Literary Text? ArXiv.

Fine-grained Human Evaluation of Neural Machine Translation

In a paper presented last month (May 2017) at EAMT we conducted a fine-grained human evaluation of neural machine translation (NMT). This builds upon recent work that has analysed the strengths and weaknesses of NMT using automatic procedures (Bentivogli et al., 2016; Toral and Sánchez-Cartagena, 2017).

Our study concerns translation into a morphologically rich language (English-to-Croatian) and has a special focus on agreement errors. We compare three systems: standard phrase-based MT (PBMT) built with Moses; PBMT enriched with morphological information using factored models; and NMT. The errors produced by each system are annotated with a fine-grained tag set that contains over 20 error categories and is compliant with the Multidimensional Quality Metrics (MQM) taxonomy.
These are our main findings:
  1. NMT reduces the number of overall errors produced by PBMT by more than half (54%). Compared to factored PBMT, the reduction brought by NMT is also notable at 42%.
  2. NMT is especially effective on agreement errors (number, gender, and case), which are reduced by 72% compared to PBMT, and by 63% compared to factored PBMT.
  3. The only error type for which NMT underperformed PBMT is errors of omission, which increased by 40%.
F. Klubicka, A. Toral and V. M. Sánchez-Cartagena. 2017. Fine-grained human evaluation of neural machine translation. The Prague Bulletin of Mathematical Linguistics.

This shows that NMT greatly reduces errors in most categories, the exception being errors of omission.
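
The percentages in the findings above are relative changes in error counts against a baseline system. A minimal sketch of that calculation follows. The function name and the error counts are hypothetical, picked only so that the results match the reported 54% reduction and 40% increase; the real annotated counts are in the paper.

```python
def relative_reduction(baseline_errors, system_errors):
    """Relative reduction in errors versus a baseline.

    Positive values mean fewer errors than the baseline;
    negative values mean the system produced more errors.
    """
    if baseline_errors <= 0:
        raise ValueError("baseline error count must be positive")
    return (baseline_errors - system_errors) / baseline_errors


# Hypothetical counts, NOT taken from the paper.
overall = relative_reduction(baseline_errors=1000, system_errors=460)
omission = relative_reduction(baseline_errors=100, system_errors=140)

print(f"overall errors: {overall:.0%}")    # overall errors: 54%
print(f"omission errors: {omission:.0%}")  # omission errors: -40%
```

The negative value for omissions corresponds to the one category where NMT produced more errors than PBMT.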

 Antonio Toral
Antonio Toral is an assistant professor in Language Technology at the University of Groningen and was previously a research fellow in Machine Translation at Dublin City University. He has over 10 years of research experience in academia, is the author of over 90 peer-reviewed publications and is the coordinator of Abu-MaTran, a 4-year project funded by the European Commission.
