Thursday, January 18, 2018

Literary Text: What Level of Quality can Neural MT Attain?

Here are some interesting results from guest writer Antonio Toral, who gave us a good broad look at how NMT was doing relative to PBMT last year. His latest research investigates the potential of NMT to assist with the translation of literary texts. While NMT is still a long way from human quality, it is interesting to note that NMT very consistently beats SMT even at the BLEU score level. At the research level this is a big deal: given that BLEU scores tend to naturally favor SMT systems, this is especially promising, and the difference is probably even more striking when the output is compared by human reviewers.

I have also included another short post by Antonio on a detailed human review of NMT vs. SMT output, for those who still doubt that NMT is the most likely way forward for any MT project today.


Neural networks have revolutionised the field of Machine Translation (MT). Translation quality has improved drastically over that of the previous dominant approach, statistical MT. This has been shown to be the case for several content types, including news, TED talks and United Nations documents. At this point, we thus wonder how neural MT fares on what is historically perceived as the greatest challenge for MT, literary text, and specifically its most common representative: novels.

We explore this question in a paper that will appear in the forthcoming Springer volume Translation Quality Assessment: From Principles to Practice. We built state-of-the-art neural and statistical MT systems tailored to novels by training them on around 1,000 books (over 100 million words) for English-to-Catalan. We then evaluated these systems automatically on 12 widely known novels that span from the 1920s to the present day; from J. Joyce’s Ulysses to the last Harry Potter. The results (Figure 1) show that neural MT outperforms statistical MT for every single novel, achieving remarkable results: an overall improvement of 3 BLEU points.

Figure 1: BLEU scores obtained by neural and statistical MT on the 12 novels
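For readers unfamiliar with the metric, BLEU scores a translation by its n-gram overlap with a reference translation. The toy function below is a simplified, single-reference sketch of the idea; real evaluations such as the one above use established tools (e.g. sacrebleu), and the example sentences here are invented:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(hypothesis, reference, max_n=4):
    """Toy single-reference BLEU: geometric mean of modified n-gram
    precisions (n = 1..max_n) times a brevity penalty. Only a sketch;
    not a replacement for standard BLEU implementations."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        overlap = sum((ngrams(hyp, n) & ngrams(ref, n)).values())
        total = max(sum(ngrams(hyp, n).values()), 1)
        # Smooth zero counts so the geometric mean stays defined
        log_precisions.append(math.log(max(overlap, 1e-9) / total))
    # Brevity penalty punishes hypotheses shorter than the reference
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)

ref = "the old man walked slowly along the shore"
print(simple_bleu("the old man walked slowly along the shore", ref))  # 1.0
print(simple_bleu("old man walk slow along shore", ref))              # near 0
```

An exact match scores 1.0 (reported as 100 on the usual 0–100 scale), while missing words and wrong word forms quickly drive the score down, which is why a 3-point corpus-level gain is considered substantial.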

Can humans notice the difference between human and machine translations?


We asked native speakers to blindly rank human versus machine translations for three of the novels. For two of them, around 33% of the translations produced by neural MT were perceived to be of equivalent quality to the translations by a professional human translator (Figure 2). This percentage is much lower for statistical MT, at around 19%. For the remaining book, both MT systems obtain lower results, but the comparison still favours neural MT: 17% versus 8% for statistical MT.

Figure 2: Readers' perceptions of the quality of human versus machine translations for Salinger’s The Catcher in the Rye

How far are we?

Based on these rankings, we derived an overall score for the human translations and the two MT systems (Figure 3). We take statistical MT as the departure point and human translation as the goal to be ultimately reached. Current neural MT technology has already covered around one fifth (20%) of the way: a considerable step forward compared to the previous MT paradigm, yet still far from human translation quality. The question now is whether neural MT can be useful in future to assist professional literary translators… To be continued.
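As a back-of-the-envelope illustration of what "covering one fifth of the way" means, the fraction of the SMT-to-human gap closed by NMT can be computed as below. The scores here are invented for illustration only; the paper derives its actual scores from the human rankings:

```python
# Invented, illustrative scores (NOT the paper's rank-derived values):
# statistical MT is the departure point, human translation the goal.
smt_score, nmt_score, human_score = 0.30, 0.40, 0.80

# Fraction of the gap between SMT and human translation covered by NMT
progress = (nmt_score - smt_score) / (human_score - smt_score)
print(f"{progress:.0%}")  # 20%
```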

Figure 3: Overall scores for human and machine translations

A. Toral and A. Way. 2018. What Level of Quality can Neural Machine Translation Attain on Literary Text? arXiv.

Fine-grained Human Evaluation of Neural Machine Translation

In a paper presented last month (May 2017) at EAMT we conducted a fine-grained human evaluation of neural machine translation (NMT). This builds upon recent work that has analysed the strengths and weaknesses of NMT using automatic procedures (Bentivogli et al., 2016; Toral and Sánchez-Cartagena, 2017).

Our study concerns translation into a morphologically rich language (English-to-Croatian) and has a special focus on agreement errors. We compare three systems: standard phrase-based MT (PBMT) with Moses, PBMT enriched with morphological information using factored models, and NMT. The errors produced by each system are annotated with a fine-grained tag set that contains over 20 error categories and is compliant with the Multidimensional Quality Metrics (MQM) taxonomy.
These are our main findings:
  1. NMT reduces the overall number of errors produced by PBMT by more than half (54%). Compared to factored PBMT, the reduction brought by NMT is also notable, at 42%.
  2. NMT is especially effective on agreement errors (number, gender, and case), which are reduced by 72% compared to PBMT, and by 63% compared to factored PBMT.
  3. The only error type for which NMT underperformed PBMT is errors of omission, which increased by 40%.
F. Klubicka, A. Toral and V. M. Sánchez-Cartagena. 2017. Fine-grained human evaluation of neural machine translation. The Prague Bulletin of Mathematical Linguistics. [PDF | BibTeX]

This shows that NMT greatly reduces errors in most categories, with the notable exception of errors of omission.
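The per-category comparison behind these percentages can be sketched as follows. The error counts below are invented (the actual MQM-annotated counts are in the paper), chosen only to mirror the reported trends:

```python
# Invented error counts per MQM-style category (NOT the paper's data),
# mirroring the reported trends: agreement errors drop sharply under NMT,
# omission errors increase.
pbmt_errors = {"agreement": 100, "omission": 50, "mistranslation": 120}
nmt_errors = {"agreement": 28, "omission": 70, "mistranslation": 60}

for category in pbmt_errors:
    pbmt, nmt = pbmt_errors[category], nmt_errors[category]
    change = (nmt - pbmt) / pbmt  # relative change vs. the PBMT baseline
    print(f"{category}: {change:+.0%}")
# prints:
# agreement: -72%
# omission: +40%
# mistranslation: -50%
```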

 Antonio Toral
Antonio Toral is an assistant professor in Language Technology at the University of Groningen and was previously a research fellow in Machine Translation at Dublin City University. He has over 10 years of research experience in academia, is the author of over 90 peer-reviewed publications, and was the coordinator of Abu-MaTran, a 4-year project funded by the European Commission.

Tuesday, January 16, 2018

2018: Machine Translation for Humans - Neural MT

This is a guest post by Laura Casanellas @LauraCasanellas  describing her journey with language technology. She raises some good questions for all of us to ponder over the coming year.

Neural MT is all the rage and now appears in almost every translation industry discussion we see. It is sometimes depicted as a terrible job-killing force and sometimes as a savior; I would bet that it is neither. Hopefully, the hype subsides and we start focusing on solving the issues that enable high-value deployments. I have been interviewed by a few people about NMT technology in the last month, so expect to see even more on NMT, and we continue to see that GAFA and the Chinese/Korean giants (Baidu, Alibaba, Naver) are also introducing NMT offerings.

Open source toolkits for NMT proliferate, training data is easier to acquire, and hardware options for neural net and deep learning experimentation continue to expand. It is very likely that we will see even more generic NMT solutions appear in the coming year, but generic NMT solutions are often not suitable for professional translation use, for many reasons: especially the inability to properly secure data privacy, integrate the technology into carefully built existing production workflows, customize NMT engines for very specific subject domains, and implement the controls and feedback cycles that are critical to ongoing NMT use in professional translation scenarios. It is quite likely that many LSPs will waste time and resources with multiple NMT toolkits, only to find out that NMT is far from being a plug-and-play technology, and that real competence is not easily acquired without significant long-term investments in knowledge building. We are perhaps reaching a threshold year for the translation industry, where skillful use of MT and other kinds of effective automation is a requirement, both for business survival and for developing a sustainable competitive advantage.

The latest Multilingual magazine (January 2018) contains several articles on NMT technology, but unfortunately has no contributions from SDL and Systran, who I think are probably the companies most experienced with NMT use in the professional translation arena. I have pointed out many of the challenges that remain with NMT in previous posts on this blog, but the Multilingual articles define some of the interesting challenges more sharply and raised some points that were new to me, for example:

  • DFKI documented very specifically that even though NMT systems have lower BLEU scores, they exhibit fewer errors in most linguistic categories and are thus preferred by humans
  • DFKI also stated that terminology and tag management are major issues for NMT, and need to be resolved somehow to enable more professional deployments
  • Several people reported that using BLEU to compare NMT vs. SMT is unlikely to give meaningful results, but this is still often the means of comparison used in many cases
  • Capita TI reported that the cost of building an NMT engine is 50X that of an SMT engine, and the cost of running it is 70X the cost of an SMT engine
  • Experiments run at this stage of technology exploration by most in the professional translation world should not be seen as conclusive and final. Their results will often be more a reflection of the experimenters' lack of expertise than of the actual technology. As NMT expertise deepens and the obvious challenges are worked out, we should expect that NMT will become the preferred model even for Adaptive MT implementations.
  •  SMT took several years to mature and develop the ancillary infrastructure needed to enable MT deployments at scale. NMT will do this faster but it still does need some time for support infrastructure and key tools to be put in place. 
  • MT is a strategic technology that can provide long-term leverage but is most often unlikely to deliver ROI on a single project, and this, plus an unwillingness to acknowledge the complexity of do-it-yourself options, are key reasons that I think many LSPs will be left behind.

Anyway, these are exciting times, and it looks like things are about to get even more exciting.

I am responsible for all text that is in bold in this post.


2017 has been a year of reinvention. We thought we had it good, and then Neural MT came along.

Riding The Wave

I started in localization twenty years ago and I still feel like an outsider; I don’t have a translation degree, nor do I have a technical background; I am somebody who came to live in a foreign country, liked it, and had to find a career path there in order to be able to stay. Localization was one of the options; I tried it and it worked for me. This business has had many twists and turns and has been forced to adapt and be flexible with each one of them. I think I have done the same, changing and adapting with every new invention; I have tried to ride the wave.

There were already translation memories when I started, but I remember big changes in the way processes worked and, at each turn, more automation was embraced and implemented: I remember the jump from static translation dumps to on-demand localization and delivery, and the implementation of sophisticated automatic quality checks. I progressed and evolved, mirroring the industry, and, after a brief period as a translator, I moved on to work in different positions and departments within the localization workflow. This mobility has given me the opportunity to gain a good understanding of the industry’s main needs and problems.

Six years ago, I stumbled upon Machine Translation (MT). At the time it almost looked like chance, but having seen the evolution of the technology in this short period, now I know that I had it coming; we all did, we all do. It happened because a visionary head of localization requested the implementation of an MT program in their account. I was in the privileged position of being involved in that implementation, which meant that my colleagues and I could experiment with and experience Machine Translation output first hand. For somebody who can speak another language and who has a curious mind, this was a golden opportunity. For a couple of years, we evaluated MT output within an inch of its life: from a linguist’s point of view (error typology, human evaluation), using industry standards (Bleu, yes, Bleu, and others…), setting up productivity tests (how post-editing effort compares with translation effort), etc. We learned to deal with this new tool and we acquired experience that helped us set realistic expectations.

It feels like a lifetime ago. During the last few years, industry research has zoomed in on Machine Translation; as a consequence, a colossal amount of research has been done by industry and academia on the subject, as we all know.

And I still haven’t mentioned Neural MT (NMT).

The Wondrous NMT

Geeky as it sounds, from the point of view of Machine Translation, I can consider myself quite privileged, as I have experienced directly the change from Statistical Machine Translation (SMT) to Neural while working for a Machine Translation provider. Again, I was able to compare the linguistic output produced by the previous system (SMT) and the new one (NMT) and see the sometimes very subtle, but significant differences. 2017 was a very exciting year.

NMT only really began to be commercially implemented in the last year but, after all the media attention (including in blogs like this one) and the focus on industry and research forums, it feels as if it has been here forever. Everything moves very quickly these days; proof of this is that most (if not all) Machine Translation providers have already adopted this new technology in one way or another.

Technology Steals The Show

Technology is all around us, and it is stealing the show. I would love to do an experiment and ask an outsider to read articles and blog posts related to the localization industry for a month and then ask them, based on what they had read, what the level of technology adoption is in their opinion. I think they would say that the level of adoption, let’s focus on MT, is very high.

I see a different reality though; from my lucky position, I see that many companies in the industry are still hesitant, and maybe one of the reasons is fear. Fear of not fully understanding the implications of the implementation and the logistics of it, and, of course, fear of not really grasping how the technology works. It is easy to understand how Translation Memory (TM) leverage works, but Machine Translation is a different thing.

I have no doubt in my mind that in five years’ time the gap will be closed; but at the moment there is still a large, not so vocal, group of people who are not sure how to start. For them, it might feel a bit like a flu jab: it is painful, it may not really work, but most people are adopting it, so it kind of has to be done. All other companies seem to be adopting it, so they feel they need to do the same, but how? And that “how” includes questions like: how is this technology going to connect with my own workflow; do I use TMs as well; how do I make it profitable; what is my ROI going to be; how do I rate post-edited words; what if my trusted translators refuse to post-edit; and how many engines do I need: one per language, one per language and vertical, one per language and domain…?

MT for Humans

Many of the humans I have worked and dealt with are putting on a brave face, but sometimes they struggle with the concepts; a few years ago it was Bleu, now it is perplexity, epochs… Concepts and terms change very fast. For the industry to fully embrace this new technology, a bigger effort might be needed to bring it down to the human level. The head of a language company will probably know by now that NMT is the latest option, but might not really care to comprehend the intrinsic differences between one type of MT and the others. They might prefer to know what the output is like, how to implement it, and how to train their workforce (translators and everybody else in the company) on the technology from a practical point of view: is it going to affect the final quality; what does a Quality Manager or a Language Lead need to know about it; what about rates; can a Vendor Manager negotiate a blanket reduction for all languages and content types; how is it going to be incorporated into the production workflow?

I think 2018 is going to be the year of mass adoption and more and more professionals are going to try to figure out all these questions. Artificial intelligence is all around us, the new generations are growing with it, but today this new bridge created by progress is still being crossed by very many people. Not everybody is on the other side. Yet.

Dublin, 12.I.18

Laura Casanellas is a localization consultant specialised in the area of Machine Translation deployment. Originally from Spain, she has been living in Ireland for the last 20 years. During that time, Laura has worked in a variety of roles (Language Quality, Vendor Management, Content Management) and verticals (Games, Travel, IT, Automotive, Legal) and acquired extensive experience in all aspects related to Localization. Since 2011, Laura has specialized in Language Technology and Machine Translation; until last year, Laura worked as a Product Manager and head of Professional Services in KantanMT.

Outside of her professional life, she is interested in biodiversity, horticulture, apiculture, and sustainability.

The results of some of the evaluations mentioned in this post are collected in a number of papers:

Empirical evaluation of NMT and PBSMT quality for large-scale translation production
(2017) Shterionov, D., Nagle, P., Casanellas, L., Superbo, R., and O’Dowd, T.

Assumptions, expectations, and outliers in post-editing
 (2014) Laura Casanellas & Lena Marg: Assumptions, expectations, and outliers in post-editing. EAMT 2014, Dubrovnik

Connectivity, adaptability, productivity, quality, price: getting the MT recipe right
(2013) Laura Casanellas & Lena Marg: Connectivity, adaptability, productivity, quality, price: getting the MT recipe right. XIV Machine Translation Summit, Nice