This is largely a guest post by Manuel Herranz of Pangeanic, slightly abbreviated and edited from the original to make it more informational and less promotional. Last year we saw Facebook announce that they were going to shift all their MT infrastructure to a Neural MT foundation as rapidly as possible; this was later followed by NMT announcements from SYSTRAN, Google, and Microsoft. In the months since, we have seen many MT technology vendors also jump onto the NMT wagon, some with more conviction than others. The view for those who can go right into the black box and modify things (SDL, MSFT, GOOG, FB and possibly SYSTRAN) is, I suspect, quite different from that of those who use open-source components and have to perform a "workaround" on the output of these black-box components. Basically, I see two clear camps amongst MT vendors:
- Those who are shifting to NMT as quickly as possible (e.g. SYSTRAN)
- Those who are being much more selective, either "going hybrid" (SMT+NMT) or building both PB-SMT and NMT engines and choosing the better one (e.g. Iconic).
When SMT first emerged, many problems were noticed (relative to the old RBMT model), and it has taken many years to resolve some of them. The solutions that worked for SMT will not necessarily work for NMT, and in fact there is good reason to believe they will not, mostly because the pattern-matching technology in SMT is quite different: it is much better understood, and its behavior is more transparent, than NMT. The pattern detection and learning that happens inside NMT is still quite mysterious, and we are still learning which levers to pull to make adjustments and fix the strange errors we sometimes see. What can be carried forward easily are the data preparation, corpus analysis and data quality measures that have been built up over time. NMT is a machine learning (pattern-matching) technology that learns from the data you show it, and thus far that data is largely limited to translation memory and glossaries.
I am somewhat skeptical about the "hybrid NMT" stuff being thrown around by some vendors. The solutions to NMT problems and challenges are quite different from those of PB-SMT, and it makes much more sense to me to go completely one way or the other. I understand that some NMT systems do not yet exceed PB-SMT performance levels, and in such cases it is logical and smart to continue using the older systems. But given the overwhelming evidence from NMT research and actual user experience in 2017, I think it is pretty clear that NMT is the way forward across the board; it is a question of when, rather than if, for most languages. Adaptive MT may be an exception in the professional-use scenario because it learns in real time if you work with SDL or Lilt. While hybrid RBMT+SMT made some sense to me, hybrid SMT+NMT does not, and it triggers blips on my bullshit radar, as it reeks of marketing-speak rather than science. However, I do think that Adaptive MT built on an NMT foundation could be viable, and might well be the preferred model for MT in post-editing and professional translator scenarios for years to come. It is also my feeling that as these more interactive MT/TM capabilities become widespread, the relative value of pure TM tools will decline dramatically. But I am also going to bet that an industry outsider will drive this change, simply because real change rarely comes from people with sunk costs and vested interests. And surely somebody will come up with a better workbench for translators than standard TM matching, one which provides translation suggestions continuously and learns from ongoing interactions.
I am going to bet that the best NMT systems will come from those who go "all in" with NMT and solve NMT deficiencies without force-fitting old SMT-paradigm remedies onto NMT models or trying to go "hybrid", whatever that means.
The research data shared by those documenting their NMT experience is immensely valuable, as it helps everybody else move forward faster. I have summarized some of this in previous posts: The Problem with BLEU and Neural Machine Translation, An Examination of the Strengths and Weaknesses of Neural Machine Translation, and Real and Honest Quality Evaluation Data on Neural Machine Translation. The various posts on SYSTRAN's PNMT and the recent review of SDL's NMT also describe many of the NMT challenges.
In addition to the research data from Pangeanic in this post, there is also this from Iconic and ADAPT, where they state that mature PB-SMT systems will still outperform NMT systems in the use-case scenarios they tested, and finally the reconstruction strategy pointed out by Lilt, whose results are shown in the chart below. This approach apparently improves overall quality and also seems to handle long sentences better than other NMT systems that have been reported. I have seen other examples of "evidence" where SMT outperforms NMT, but I am wary of citing references where the research is not transparent or properly identified.
Source: Neural Machine Translation with Reconstruction
This excerpt from a recent TAUS post is also interesting, and points out that, ultimately, the data is essential to making any of this work:
Google Director of Research Peter Norvig said recently in a video about the future of AI/ML in general that although there is a growing range of tools for building software (e.g. the neural networks), "we have no tools for dealing with data", that is, tools to build data, and to correct, verify, and check it for bias, as its use in AI expands. In the case of translation, the rapid creation of an MT ecosystem is creating a new need to develop tools for "dealing with language data": improving data quality and scope automatically, by learning through the ecosystem, and transforming language data from today's sourcing problem ("where can I find the sort of language data I need to train my engine?") into a more automated supply line.

For me, this statement by Norvig is a pretty clear indication that perhaps the greatest value-add opportunities for NMT come from understanding, preparing and tuning the data that ML algorithms learn from. In the professional translation market, where MT output quality expectations are the highest, it makes sense that data is better understood and prepared. I have also seen that the state of the aggregate "language data" within most LSPs is pretty bad, maybe even atrocious. It would be wonderful if TMS systems could help improve this situation and provide a richer data management environment so that data can be better leveraged for machine learning processes. To do this we need to think beyond organizing data for TM and projects, but at this point we are still quite far from that. Better NMT systems will often come from better data, which is only possible if you can rapidly understand which data is most relevant (using metadata) and bring it to bear in a timely and effective way. There is also an excessive focus on TM, in my opinion. Focusing on the right kind of monolingual corpus can also provide great insight, and help drive strategies to generate and manufacture the "right kind" of TM to push MT initiatives further. But all this means that we need to get more comfortable working with billions of words and extracting what we need when a customer situation arises.
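To make the data-selection idea a little more concrete, here is a minimal sketch of how TM segments carrying metadata could be filtered down to the material most relevant for training a domain engine. The field names, file layout and thresholds are entirely hypothetical and only illustrate the principle, not any vendor's actual data store.

```python
# Hypothetical sketch: selecting relevant TM segments by metadata before MT training.
# Field names ("domain", "src", "tgt") and the JSONL layout are illustrative assumptions.
import json

def select_segments(jsonl_path, wanted_domains, min_len=3, max_len=80):
    """Yield (src, tgt) pairs whose metadata matches the requested domains."""
    seen = set()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            seg = json.loads(line)
            if seg.get("domain") not in wanted_domains:
                continue
            src, tgt = seg["src"].strip(), seg["tgt"].strip()
            # Drop empty, extreme-length, or duplicate pairs.
            if not (min_len <= len(src.split()) <= max_len):
                continue
            if (src, tgt) in seen:
                continue
            seen.add((src, tgt))
            yield src, tgt

# Example: pull only automotive and engineering segments for a domain engine.
# pairs = list(select_segments("tm_store.jsonl", {"automotive", "engineering"}))
```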
===============
The Pangeanic Neural Translation Project
So, time to recap and describe our experience with neural machine translation, with tests into 7 languages (Japanese, Russian, Portuguese, French, Italian, German, Spanish), and how Pangeanic has decided to shift all its efforts into neural networks and leave the statistical approach as a support technology for hybridization.

We selected the training sets from our SMT engines as clean data, trained new engines on exactly the same data with the neural approach, and ran a parallel human evaluation between the output of each system (the existing statistical machine translation engines and the new engines produced by the neural systems). We are aware that if data cleaning was very important in a statistical system, it is even more so with neural networks. We could not add additional material because we wanted to be certain that we were comparing exactly the same data trained with two different approaches.
A small percentage of bad or dirty data can have a detrimental effect on SMT systems, but if it is small enough, the statistics will take care of it and will not let it feed through the system (although it can also have a far worse side effect, which is lowering the statistics across certain n-grams).
We selected the same training data for languages which we knew were performing very well in SMT (French, Spanish, Portuguese), as well as for those that researchers and practitioners have known as "the hard lot": Russian as the example of a morphologically very rich language, and Japanese as a language with a radically different grammatical structure, where re-ordering (which is what hybrid systems have done) has proven to be the only way to improve.
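As a rough illustration of the kind of cleaning referred to above, here is a minimal sketch of filtering a parallel corpus before training, removing empty lines, exact duplicates and pairs whose length ratio suggests misalignment. File names and thresholds are assumptions for illustration, not Pangeanic's actual pipeline, and the whitespace word count is a simplification that a language like Japanese would need a proper segmenter for.

```python
# Minimal sketch of parallel-corpus cleaning before MT training.
# File names and thresholds are illustrative assumptions only.

def clean_parallel(src_path, tgt_path, out_src, out_tgt, max_ratio=3.0):
    seen = set()
    kept = dropped = 0
    with open(src_path, encoding="utf-8") as fs, \
         open(tgt_path, encoding="utf-8") as ft, \
         open(out_src, "w", encoding="utf-8") as os_, \
         open(out_tgt, "w", encoding="utf-8") as ot_:
        for src, tgt in zip(fs, ft):
            src, tgt = src.strip(), tgt.strip()
            # Naive whitespace counts; Japanese would need MeCab or similar.
            ls, lt = len(src.split()), len(tgt.split())
            # Drop empty lines, grossly mismatched lengths, and exact duplicates.
            if ls == 0 or lt == 0 or max(ls, lt) / max(min(ls, lt), 1) > max_ratio:
                dropped += 1
                continue
            if (src, tgt) in seen:
                dropped += 1
                continue
            seen.add((src, tgt))
            os_.write(src + "\n")
            ot_.write(tgt + "\n")
            kept += 1
    return kept, dropped

# kept, dropped = clean_parallel("train.en", "train.ja", "clean.en", "clean.ja")
```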
Japanese neural translation tests
Let's concentrate first on the neural translation results in Japanese, as they represent the quantum leap in machine translation we have all been waiting for. These results were presented at TAUS Tokyo last April. (See our previous post TAUS Tokyo Summit: improvements in neural machine translation in Japanese are real).

We used a large training corpus of 4.6 million sentences (that is nearly 60 million running words in English and 76 million in Japanese). In vocabulary terms, that meant 491,600 English words and 283,800 character-words in Japanese. Yes, our brains are able to "compute" that much and even more, if we add all types of conjugations, verb tenses, cases, etc. For testing purposes, we did what one is supposed to do in order not to inflate percentage scores and took out 2,000 sentences before training started. This is standard in all customization: a small sample is taken out so that the engine that is generated translates the kind of material it is likely to encounter. Any developer including the test corpus in the training set is likely to achieve very high scores (and will boast about it). But BLEU scores have always been about checking domain engines within MT systems, not across systems (among other things because the training sets have always been different, so a corpus containing many repetitions of the same or similar sentences will obviously produce higher scores). We also made sure that no sentences were repeated, and that even similar sentences had been stripped out of the training corpus, in order to achieve as much variety as possible. This may produce lower scores compared to other systems, but the results are cleaner and progress can be monitored very easily. This has been the way in academic competitions and has ensured good-quality engines over the years.

The standard automatic metric, BLEU, did not detect much difference between the NMT output and the SMT output. However, WER (word error rate) showed a new and distinct tendency.
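For readers unfamiliar with the metric, WER here is the standard word error rate: the edit distance between the MT output and the reference, normalized by reference length (the chart below breaks the results down by sentence length). A minimal sketch of how it can be computed follows; whitespace tokenization is assumed, which is a simplification for Japanese.

```python
# Minimal sketch of word error rate (WER): Levenshtein distance over tokens
# divided by reference length. Whitespace tokenization is assumed here;
# Japanese would require a segmenter such as MeCab before scoring.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits to turn the first i reference tokens into the first j hypothesis tokens
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: wer("the cat sat on the mat", "the cat sat on mat") -> 0.166...
```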
NMT shows better results in longer sentences in Japanese; SMT (trained as a 5-gram system) seems to be more certain in shorter sentences.
And this new and distinct tendency is what we picked up when the output was evaluated by human linguists. We used the Japanese LSP Business Interactive Japan to rank the output from a conservative point of view, from A to D: A being human-quality translation; B a very good output that only requires a very small percentage of post-editing; C an average output from which some meaning can be extracted but which requires serious post-editing; and D a very low-quality translation with no meaning. Interestingly, our trained statistical MT systems performed better than the neural systems on sentences shorter than 10 words. We can assume that the statistical systems are more certain in these cases, when they are only dealing with simple sentences with enough n-grams giving evidence of a good matching pattern.
We created an Excel sheet (below) for the human evaluators, with the original English on the left alongside the reference translation, followed by the neural translation, two columns for the ratings, and then the statistical output.
Neural vs. SMT EN>JP ranking comparison, showing the original English, the reference translation, the neural MT output and the statistical system output to the right.
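Once a sheet like this is filled in, turning the A-D ratings into the percentages discussed below is straightforward. Here is a minimal aggregation sketch; the column names are hypothetical, not Pangeanic's actual sheet layout, and pandas is assumed to be available.

```python
# Hypothetical sketch: aggregating human A-D ratings from an evaluation sheet.
# Column names ("rating_nmt", "rating_smt") are assumptions for illustration.
import pandas as pd

def rating_distribution(xlsx_path: str) -> pd.DataFrame:
    df = pd.read_excel(xlsx_path)
    # Share of each grade (A-D) per system, as percentages.
    dist = pd.DataFrame({
        "NMT": df["rating_nmt"].value_counts(normalize=True) * 100,
        "SMT": df["rating_smt"].value_counts(normalize=True) * 100,
    }).reindex(["A", "B", "C", "D"]).fillna(0).round(1)
    return dist

# print(rating_distribution("en_ja_eval.xlsx"))
```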
German, French, Spanish, Portuguese and Russian Neural MT results
The striking improvement came from the human evaluators themselves. The trend pointed to 90% of sentences being classed as A (perfect, naturally flowing translations) or B (containing all the meaning, with only minor post-editing required). The shift is remarkable in all language pairs, including Japanese, moving from an "OK experience" to remarkable acceptance. In fact, only 6% of sentences were classed as D ("incomprehensible/unintelligible") in Russian, 1% in French and 2% in German. Portuguese was independently evaluated by the translation company Jaba Translations.

This trend is not particular to Pangeanic. Several presenters at TAUS Tokyo pointed to ratings around 90% for Japanese using off-the-shelf neural systems, compared to carefully crafted hybrid systems. Systran, for one, confirmed that they are focusing only on neural research/artificial intelligence and leaving behind years of rule-based, statistical and hybrid efforts.
Systran's position is meritorious and very forward-thinking. Current papers and some MT providers still resist the fact that, despite all the work we have done over the years on other approaches, multimodal pattern recognition has gained the upper hand; it was only computing power and the use of GPUs for training that was holding it back.
Neural networks: Are we heading towards embedding artificial intelligence in the translation business?
BLEU may not be the best indication of what is happening with the new neural machine translation systems, but it is an indicator. We were aware of other experiments and results by other companies pointing in a similar direction. Still, although the initial results may have made us think it was of little use, BLEU remains a useful indicator; in any case, it was always an indicator of an engine's behavior, not a true measure of one overall system versus another. (See the Wikipedia article https://en.wikipedia.org/wiki/Evaluation_of_machine_translation).

Machine translation companies and developers face a dilemma, as they have to do without their existing research, connectors, plugins and automatic measuring techniques and build new ones. Building connectors and plugins is not so difficult; changing the core from Moses to a neural system is another matter. NMT is producing amazing translations, but it is still pretty much a black box. Our results show that some kind of hybrid system using the best features of an SMT system is highly desirable, and academic research is moving in that direction already, as happened with SMT itself some years ago.
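One way to read the "connectors are easy, the core is hard" point: if the translation core sits behind a thin, uniform interface, the surrounding connectors and plugins do not need to change when the engine does. The sketch below illustrates that separation with hypothetical class and method names; it is not any vendor's actual API.

```python
# Hypothetical sketch of isolating connectors/plugins from the MT core,
# so swapping Moses (SMT) for a neural engine does not break integrations.
from abc import ABC, abstractmethod
from typing import List

class TranslationEngine(ABC):
    @abstractmethod
    def translate(self, sentences: List[str], src: str, tgt: str) -> List[str]:
        ...

class MosesEngine(TranslationEngine):
    def translate(self, sentences, src, tgt):
        # Placeholder: a real implementation would call a Moses server here.
        raise NotImplementedError

class NeuralEngine(TranslationEngine):
    def translate(self, sentences, src, tgt):
        # Placeholder: a real implementation would call an NMT model or service here.
        raise NotImplementedError

def cat_tool_connector(engine: TranslationEngine, segments: List[str]) -> List[str]:
    """A connector depends only on the interface, not on the engine behind it."""
    return engine.translate(segments, src="en", tgt="ja")
```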
Yes, the translation industry is at the peak of the neural networks hype. But looking at the whole picture, and at how artificial intelligence (pattern recognition) is being applied in several other areas to produce intelligent reports, tendencies and data, NMT is here to stay, and it will change the game for many, as more content needs to be produced cheaply with post-editing, and at light speed when machine translation alone is good enough. Amazon and Alibaba are not investing millions in MT for nothing: they want to reach people in their own language with a high degree of accuracy and at a speed human translators cannot match.
Manuel Herranz is the CEO of Pangeanic. Collaboration with Valencia's Polytechnic research group and the Computer Science Institute led to the creation of the PangeaMT platform for translation companies. He worked as an engineer for Ford machine-tool suppliers and for Rolls Royce Industrial and Marine, handling training and documentation from the buyer's side at a time when translation memories had not yet appeared in the LSP landscape. After joining a Japanese group in the late 1990s, he became Pangeanic's CEO in 2004 and began his machine translation project in 2008, creating the first command-line versions of the first commercial application of Moses (Euromatrixplus); Pangeanic was the first LSP in the world to implement open-source Moses successfully in a commercial environment, including re-training features and tag handling before they became standard in the Moses community.