Monday, July 24, 2017

The Ongoing Neural Machine Translation Momentum

This is largely a guest post by Manuel Herranz of Pangeanic, slightly abbreviated and edited from the original to make it more informational and less promotional. Last year we saw Facebook announce that it was going to shift all of its MT infrastructure to a Neural MT foundation as rapidly as possible; this was later followed by NMT announcements from SYSTRAN, Google, and Microsoft. In the months since, we have seen many MT technology vendors also jump onto the NMT wagon, some with more conviction than others. The view for those who can go right into the black box and modify things (SDL, MSFT, GOOG, FB and possibly SYSTRAN) is, I suspect, quite different from that of those who use open-source components and have to perform a "workaround" on the output of these black-box components. Basically, I see there are two clear camps amongst MT vendors: 
  1. Those who are shifting to NMT as quickly as possible (e.g. SYSTRAN)
  2. Those who are being much more selective and either "going hybrid = SMT+NMT" or building both PB-SMT and NMT engines and choosing the better one (e.g. Iconic).
Pangeanic probably falls in the first group, based on the enthusiasm in this post. Whenever there is a paradigm shift in MT methodology, the notion of "hybrid" invariably comes up. A lot of people who don't understand the degree of coherence needed in the underlying technology generally assume this is a better way. Also, I think that sometimes the MT practitioner has too much investment sunk into the old approach and is reluctant to completely abandon the old for the new. SMT took many years to mature, and what we see today is an automated translation production pipeline that includes multiple models (translation, language, reordering, etc.) together with pre- and post-processing of translation data. The term hybrid is sometimes used to describe this overall pipeline because some of these pipeline steps can be linguistically informed. 

When SMT first emerged, many problems were noticed (relative to the old RBMT model), and it has taken many years to resolve some of them. The solutions that worked for SMT will not necessarily work for NMT and, in fact, there is good reason to believe they clearly will not, mostly because the pattern-matching technology in SMT is quite different: it is much better understood, and more transparent, than what happens in NMT. The pattern detection and learning that happens in NMT is much more mysterious and unclear at this point. We are still learning which levers to pull to make adjustments and fix the weird problems that we see. What can be carried forward easily are the data preparation, data and corpus analysis, and data quality measures that have been built up over time. NMT is a machine learning (pattern matching) technology that learns from the data that you show it; thus far that is limited to translation memory and glossaries.

I am somewhat skeptical about the "hybrid NMT" stuff being thrown around by some vendors. The solutions to NMT problems and challenges are quite different (from PB-SMT), and to me it makes much more sense to go completely one way or the other. I understand that some NMT systems do not yet exceed PB-SMT performance levels, and thus it is logical and smart to continue using the older systems in such a case. But given the overwhelming evidence from NMT research and actual user experience in 2017, I think the evidence is pretty clear that NMT is the way forward across the board. It is a question of when, rather than if, for most languages. Adaptive MT might be an exception in the professional use scenario because it is learning in real time if you work with SDL or Lilt. While hybrid RBMT and SMT made some sense to me, hybrid SMT+NMT does not make any sense to me and triggers blips on my bullshit radar, as it reeks of marketing-speak rather than science. However, I do think that Adaptive MT built on an NMT foundation might be viable, and could very well be the preferred model for MT for years to come in post-editing and professional translator use scenarios. It is also my feeling that as these more interactive MT/TM capabilities become more widespread, the relative value of pure TM tools will decline dramatically. But I am also going to bet that an industry outsider will drive this change, simply because real change rarely comes from people with sunk costs and vested interests. And surely somebody will come up with a better workbench for translators than standard TM matching, one which provides translation suggestions continuously and learns from ongoing interactions.

I am going to bet that the best NMT systems will come from those who go "all in" with NMT and solve NMT deficiencies without resorting to force-fitting old SMT paradigm remedies on NMT models or trying to go "hybrid", whatever that means.

The value of the research data from all those who are sharing their NMT experience is immense, as it provides data that is useful to everybody else in moving forward faster. I have summarized some of this in previous posts: The Problem with BLEU and Neural Machine Translation, An Examination of the Strengths and Weaknesses of Neural Machine Translation, and Real and Honest Quality Evaluation Data on Neural Machine Translation. The various posts on SYSTRAN's PNMT and the recent review of SDL's NMT also describe many of the NMT challenges.

In addition to the research data from Pangeanic in this post, there is also this from Iconic and ADAPT, where they basically state that a mature PB-SMT system will still outperform NMT systems in the use-case scenarios they tested, and finally, the reconstruction strategy pointed out by Lilt, whose results are shown in the chart below. This approach apparently improves overall quality and also seems to handle long sentences in NMT better than others have reported. I have seen other examples of "evidence" where SMT outperforms NMT, but I am wary of citing references where the research is not transparent or properly identified.

Source: Neural Machine Translation with Reconstruction

This excerpt from a recent TAUS post is also interesting, and points out that finally, the data is essential to making any of this work:
Google Director of Research Peter Norvig said recently in a video about the future of AI/ML in general that although there is a growing range of tools for building software (e.g. the neural networks), “we have no tools for dealing with data." That is: tools to build data, and correct, verify, and check them for bias, as their use in AI expands. In the case of translation, the rapid creation of an MT ecosystem is creating a new need to develop tools for “dealing with language data” – improving data quality and scope automatically, by learning through the ecosystem. And transforming language data from today’s sourcing problem (“where can I find the sort of language data I need to train my engine?”) into a more automated supply line.
For me this statement by Norvig is a pretty clear indication that perhaps the greatest value-add opportunities for NMT come from understanding, preparing and tuning the data that ML algorithms learn from. In the professional translation market where MT output quality expectations are the highest, it makes sense that data is better understood and prepared. I have also seen that the state of the aggregate "language data" within most LSPs is pretty bad, maybe even atrocious. It would be wonderful if the TMS systems could help improve this situation and provide a richer data management environment to enable data to be better leveraged for machine learning processes. To do this we need to think beyond organizing data for TM and projects, but at this point, we are still quite far from this. Better NMT systems will often come from better data, which is only possible if you can rapidly understand what data is most relevant (using metadata) and can bring it to bear in a timely and effective way. There is also an excessive focus on TM in my opinion. Focus on the right kind of monolingual corpus can also provide great insight, and help to drive strategies to generate and manufacture the "right kind" of TM to drive MT initiatives further. But this all means that we need to get more comfortable working with billions of words and extracting what we need when a customer situation arises.


The Pangeanic Neural Translation Project

So, time to recap and describe our experience with neural machine translation, with tests in seven languages (Japanese, Russian, Portuguese, French, Italian, German, Spanish), and how Pangeanic has decided to shift all its efforts to neural networks and leave the statistical approach as a support technology for hybridization.

We selected training sets from our SMT engines as clean data to train the same engines with the same data and run parallel human evaluation between the output of each system (existing statistical machine translation engines) and the new engines produced by neural systems. We are aware that if data cleaning was very important in a statistical system, it is even more so with neural networks. We could not add additional material because we wanted to be certain that we were comparing exactly the same data but trained with two different approaches.

A small percentage of bad or dirty data can have a detrimental effect on SMT systems, but if it is small enough, the statistics will take care of it and won’t let it feed through the system (although it can also have a far worse side effect, which is skewing the statistics across certain n-grams).
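The kind of cleaning pass this implies can be sketched in a few lines. This is a minimal, hypothetical illustration (the thresholds and sample pairs below are invented, not Pangeanic's actual pipeline): drop pairs with an empty side, overlong sentences, or a source/target length ratio that suggests misalignment.

```python
def clean_pairs(pairs, max_len=80, max_ratio=3.0):
    """Keep only sentence pairs that pass simple sanity checks."""
    kept = []
    for src, tgt in pairs:
        src_toks, tgt_toks = src.split(), tgt.split()
        if not src_toks or not tgt_toks:
            continue  # empty side: useless for training
        if len(src_toks) > max_len or len(tgt_toks) > max_len:
            continue  # overlong sentence
        ratio = len(src_toks) / len(tgt_toks)
        if ratio > max_ratio or ratio < 1.0 / max_ratio:
            continue  # length-ratio outlier: likely misalignment
        kept.append((src, tgt))
    return kept

pairs = [
    ("the engine was retrained", "el motor fue reentrenado"),
    ("hello", ""),            # dirty: empty target
    ("a", "x " * 50),         # dirty: length-ratio outlier
]
print(clean_pairs(pairs))    # only the first pair survives
```

Real pipelines add more filters (encoding checks, language identification, tag balancing), but even this much removes the pairs most likely to skew the statistics.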

We selected the same training data for languages which we knew were performing very well in SMT (French, Spanish, Portuguese) as well as for those that have been known to researchers and practitioners as “the hard lot”: Russian as the example of a morphologically very rich language, and Japanese as a language with a radically different grammatical structure, where re-ordering (that is what hybrid systems have done) has proven to be the only way to improve.


Japanese neural translation tests

Let’s concentrate first on the neural translation results in Japanese as they represent the quantum leap in machine translation we all have been waiting for. These results were presented at TAUS Tokyo last April. (See our previous post TAUS Tokyo Summit: improvements in neural machine translation in Japanese are real).

We used a large training corpus of 4.6 million sentences (that is nearly 60 million running words in English and 76 million in Japanese). In vocabulary terms, that meant 491,600 English words and 283,800 character-words in Japanese. Yes, our brains are able to “compute” all that much and even more, if we add all types of conjugations, verb tenses, cases, etc. For testing purposes, we did what one is supposed to do to avoid inflating percentage scores and took out 2,000 sentences before training started. This is standard in all customization – a small sample is taken out so the engine that is generated is tested on what it is likely to encounter. Any developer including the test corpus in the training set is likely to achieve very high scores (and will boast about them). But BLEU scores have always been about checking domain engines within MT systems, not across systems (among other things because the training sets have always been different, so a corpus containing many repetitions of the same or similar sentences will obviously produce higher scores). We also made sure that no sentences were repeated, and even similar sentences had been stripped out of the training corpus in order to achieve as much variety as possible. This may produce lower scores compared to other systems, but the results are cleaner and progress can be monitored very easily. This has been the way in academic competitions and has ensured good-quality engines over the years.
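The held-out procedure described above can be sketched simply (the corpus and sizes here are toy stand-ins, not our actual data): deduplicate first, then remove the test sample before training ever sees it.

```python
import random

def holdout_split(pairs, test_size=2000, seed=13):
    """Drop exact duplicate pairs, then carve out a held-out test set
    BEFORE training, so scores are not inflated by seen sentences."""
    unique = list(dict.fromkeys(pairs))  # dedupe, preserving order
    rng = random.Random(seed)            # fixed seed for reproducibility
    rng.shuffle(unique)
    return unique[test_size:], unique[:test_size]  # (train, test)

corpus = [(f"src {i}", f"tgt {i}") for i in range(10)] + [("src 0", "tgt 0")]
train, test = holdout_split(corpus, test_size=2)
print(len(train), len(test))  # 8 2  (the duplicate pair was removed)
```

A stricter version would also strip near-duplicates (e.g. pairs differing only in numbers or casing), which is what removing "even similar sentences" amounts to.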

The standard automatic metric used in SMT evaluation (BLEU) did not detect much difference between the NMT output and the SMT output. 

However, WER showed a new and distinct tendency:
NMT shows better results on longer sentences in Japanese, while SMT seems to be more certain on shorter sentences (training a 5-gram system)
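For reference, WER is just a word-level Levenshtein (edit) distance normalized by the reference length; a minimal sketch:

```python
def wer(reference, hypothesis):
    """Word Error Rate: minimum word edits (substitutions, insertions,
    deletions) to turn the hypothesis into the reference, divided by
    the number of reference words."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edits to align the first i ref words with the first j hyp words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion / 6 words
```

Because it penalizes every misplaced word rather than rewarding matched n-grams, WER can surface length-dependent differences between engines that an aggregate BLEU score smooths over.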

And this new distinct tendency is what we picked up when the output was evaluated by human linguists. We used the Japanese LSP Business Interactive Japan to rank the output from a conservative point of view, from A to D: A being human-quality translation, B a very good output that only requires a very small percentage of post-editing, C an average output where some meaning can be extracted but serious post-editing is required, and D a very low-quality translation with no meaning. Interestingly, our trained statistical MT systems performed better than the neural systems on sentences shorter than 10 words. We can assume that statistical systems are more certain in these cases, when they are only dealing with simple sentences with enough n-grams giving evidence of a good matching pattern.

We created an Excel sheet (below) for human evaluators with the original English to the left with the reference translation. The neural translation followed. Two columns were provided for the rating and then the statistical output was provided.

Neural-SMT EN>JP ranking comparison showing the original English, the reference translation, the neural MT output and the statistical system output to the right

German, French, Spanish, Portuguese and Russian Neural MT results

The shocking improvement came from the human evaluators themselves. The trend pointed to 90% of sentences being classed as A (perfect, naturally flowing translations) or B (containing all the meaning, with only minor post-editing required). The shift is remarkable in all language pairs, including Japanese, moving from an “OK experience” to remarkable acceptance. In fact, only 6% of sentences were classed as D (“incomprehensible/unintelligible”) in Russian, 1% in French and 2% in German. Portuguese was independently evaluated by the translation company Jaba Translations.

This trend is not particular to Pangeanic only. Several presenters at TAUS Tokyo pointed to ratings around 90% for Japanese using off-the-shelf neural systems compared to carefully crafted hybrid systems. Systran, for one, confirmed that they are focusing only on neural research/artificial intelligence and throwing away years of rule-based work, statistical and hybrid efforts.

Systran’s position is meritorious and very forward-thinking. Current papers and some MT providers still resist the fact that, despite all the work we have done over the years, multimodal pattern recognition has gained the upper hand. It was only computing power and the use of GPUs for training that was holding it back.

Neural networks: Are we heading towards the embedment of artificial intelligence in the translation business?

BLEU may not be the best indication of what is happening with the new neural machine translation systems, but it is an indicator. We were aware of other experiments and results by other companies pointing in a similar direction. Still, although the initial results may have made us think that there was no use for it, BLEU is a useful indicator – and in any case, it was always an indicator of an engine’s behavior, not a true measure of one overall system versus another. (See the Wikipedia article on BLEU.)

Machine translation companies and developers face a dilemma, as they have to do without their existing research, connectors, plugins and automatic measuring techniques and build new ones. Building connectors and plugins is not so difficult. Changing the core from Moses to a neural system is another matter. NMT is producing amazing translations, but it is still pretty much a black box. Our results show that some kind of hybrid system using the best features of an SMT system is highly desirable, and academic research is moving in that direction already – as happened with SMT itself some years ago.

Yes, the translation industry is at the peak of the neural network hype. But looking at the whole picture and how artificial intelligence (pattern recognition) is being applied in several other areas to produce intelligent reports, tendencies, and data, NMT is here to stay – and it will change the game for many, as more content needs to be produced cheaply with post-editing, and at light speed when machine translation alone is good enough. Amazon and Alibaba are not investing millions in MT for nothing – they want to reach people in their own language with a high degree of accuracy, and at a speed human translators cannot match.

Manuel Herranz is the CEO of Pangeanic. Collaboration with Valencia’s Polytechnic research group and the Computer Science Institute led to the creation of the PangeaMT platform for translation companies. He worked as an engineer for Ford machine-tool suppliers and for Rolls Royce Industrial and Marine, handling training and documentation from the buyer’s side when translation memories had not yet appeared in the LSP landscape. After joining a Japanese group in the late ’90s, he became Pangeanic’s CEO in 2004 and began his machine translation project in 2008, creating the first command-line versions of the first commercial application of Moses (Euromatrixplus). Pangeanic was the first LSP in the world to implement open-source Moses successfully in a commercial environment, including re-training features and tag handling before they became standard in the Moses community.


  1. The problem with language data is that most linguists (especially translators and even terminologists) are still not trained to deal with it, to handle it, to process it, to understand it. This is a serious issue that is holding back the so-called translation industry and undermining all translation-related professions.

    I'm afraid I can't agree with you on Andrew Joscelyne's vision of the inescapable success of a language data market. Even Microsoft has recently abandoned the idea of a data market for its Azure platform. For a data market to succeed, its prospective players should be aware of the importance of data and capable of estimating its value. Unfortunately, just as too many corporations do with people, LSPs (and their customers too) look at (their) data as an asset, maybe a valuable asset, but they don't treat it as such. And a major factor in consolidation in the industry is the acquisition of a company also for its data, even though I'd rather say for its resources as a whole (including the customer and the vendor base).

    All this should partly explain why "the state of the aggregate language data within most LSPs is pretty bad, maybe even atrocious." Of course, this also depends on the insufficient grasp of technology at too many LSPs. In fact, the exploitation of most translation technologies is, in most cases, delegated to vendors (i.e. freelancers). The conceptual backwardness of many TMSs is a reflection of this attitude. There is only one company, in my own experience, that pays attention to project data to run the kind of estimates, measurements and analyses needed today.

    Also, I've found a few more pitfalls in the TAUS article: the requisites for a "conceptual shift from Big Data to language data" are missing, and would possibly be late now, not least because language data has never been "big" in the Big Data sense. Also, to help this shift, the many language data repositories offering versions of the same data sets should consolidate, but this requires time and investments that I really can't see now. Another point is the assumption that data is not a commodity. Well, a data marketplace would make it exactly that.

    Finally, what I find really interesting in Manuel Herranz's reasoning is the idea that NMT is expected to be the killer application. Let me remain skeptical about this.

    1. Luigi, I am not sure how you concluded that I am in agreement with Andrew's post about the language data market. I am more than skeptical about "a language data market" because the notion of value around TM data is so vague and ambiguous as to make it an impossibility. With data, one man's meat is another man's poison.

      However, I am in complete agreement with Norvig on the importance of better data analysis, management and manufacturing tools. Especially language tools that go beyond TM and TMS, and assist in making the right kind of data available. Data that supports a growing range of machine learning initiatives. Basic TM is an increasingly low-value proposition. This broader-scope data has been completely ignored by the translation industry, and this may result in "the industry" becoming much less relevant in the market for business-driven translation work in future.

      For those who have not realized yet, free, generic MT is already a killer application that does more than 99% of the translation done on the planet. IMO NMT will likely take it to 99.9%

      This does not mean that HT disappears -- as BLS statistics show, the rise of widely used MT has been accompanied by a doubling of employment in translation. However, I think there will definitely be a bi-modal market: a premium market where real SME and deep translation competence matter, and a bulk-volume market where price is the key driver.

    2. Kirti, I'm afraid I haven't made myself clear.
      To make "the right kind of data available" a specific know-how is needed that is not provided to translation students, and often to language students either.
      At the recent TAUS Industry Summit, Jaap van der Meer was impressed with what he named "the Bodo dilemma," summing up the frustration at not finding talents willing to work with the fantastic localization technology suites available today. Those talents should obviously be "young," while no one in the industry is doing much to educate and train young talents, so certain skills can be found only in experienced people who acquired them on their own, possibly out of curiosity (or despair), and who are of no interest to the labor market. And we both know why.
      Basic TMs are what most translation industry players still ask for, especially LSPs. This means they are dooming the industry.
      I don't think that translation is going to vanish, while most translation-related professions will.
      Please, don't get caught in the "premium market hoax." There is no "premium market;" there could be highly remunerative customers in a few vertical segments (this is the key concept), but they will increasingly be fascinated and won over by MT.
      This will be NMT, of course, but definitely not now. Not even in 2018; maybe five years from now or more. And we both know that nowadays this is a very long time. May Keynes forgive me for this.
      Marketing is not the right side to view the future from, that's why I don't think NMT is or will be the killer application any time soon. And the Big Ones will make a clean sweep of most MT companies.

  2. Manuel, simple question. Your dev set is 2,000 segment pairs. You calculated TER and WER scores for each segment and report their cumulative scores for both SMT and NMT. Can you please report the simple count of how many dev pairs scored TER=0/WER=0? That is, these are predictive models. So, how many times in 2,000 segments did the respective engines exactly predict the expected result?

  3. This article makes some good points, but loses credibility by being unduly scornful of the advantages of SMT compared to NMT. Putting scare quotes around "evidence" where SMT outperforms NMT, as well as around "hybrid NMT" is unnecessarily pejorative. There is plenty of evidence that SMT will continue to beat NMT, especially in resource-poor language pairs, and criticising (all) "hybrid" models of NMT/SMT as "trigger[ing] blips on my bullshit radar" betrays a lack of scholarliness, and ignores the many cases which have shown hybrid models to work better than NMT (or SMT). I'm surprised at you, Manuel, I have to say ...

    1. Andy,

      Firstly, I have never claimed that this blog is anything but a forum for opinions (both mine and others), so it is possible that we may use words that don’t belong in scholarly papers. So it goes. I fully understand that I am wrong sometimes, and generally do admit it when I see that is so e.g. when NYT pointed out that Schuster (Google) was not happy with the exuberant marketing claims about GNMT.

      I am not sure how you determined that our statements imply there is scorn when SMT is compared with NMT. It is clear to all of us that we are at a transition stage where many SMT systems do indeed outperform NMT. But we are also aware that there are a growing number of cases where, given exactly the same data resources, NMT is CLEARLY outperforming many mature and carefully refined SMT systems. At this point in time both are equally valid (as is RBMT) and perhaps, in anything less than total customization, SMT may be superior for domain-adapted systems. We don’t have enough data to say with any great certainty. Systran uses them all (RBMT, SMT, RBMT+SMT hybrid, and NMT) to meet client requirements but is heavily focused on moving completely to NMT, since they have enough evidence to convince themselves that NMT is the best way forward.

      The term “workaround” is in quotes because it is exactly the word that is often used by vendors. I think that we may also differ on what “hybrid” means. To me it would have to be at the training level to truly count as hybrid. Otherwise, it is what I would consider a workaround. My issue with “hybrid” is its use for marketing purposes, when in fact we are talking about a workaround.

      It is also my sense that the major MT developers who are “all in” are using pure NMT strategies (which make more sense in a deep learning environment) rather than using the techniques that worked in SMT but make less sense with NMT models. It is possible that this will take longer, but I would expect that it is also likely to produce more robust solutions.

      Finally, it is also my sense that you are defending something that I am not attacking. You may wish to look at some of my older posts on NMT to see that my view is not so different from yours.

    2. Andy, I agree that there are advantages. That's why I asked my question. I expect Manuel's answer could demonstrate one of them. We tested NMT vs SMT in cooperation with Terence Lewis, starting from virtually identical NL-EN corpora and the exact same test set of 2,300 segments. Slate (SMT) generated 219 edit-distance-zero segments (9% ed-0) while his NMT engine generated 25 (1% ed-0). To our customers, the ed-0 segments are important because they represent the engine's capacity (potential) to reduce the translator's work to a cognitive exercise with zero mechanical effort. There are other real and consequential benefits of SMT over NMT in this bleeding-edge environment. I think Chris Wendt gave an excellent description of their current relationship in a podcast a few months ago. Paraphrasing, he said that big-data SMT has reached its quality limits, with little potential for improvement. NMT picks up at about where SMT is today and holds the promise of continued improvement in the future.
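      [The ed-0 count described in this comment is straightforward to compute: it is simply the number of segments where the engine's raw output exactly matches the final reference. A minimal sketch, with invented example segments rather than the actual NL-EN test set:]

```python
def count_ed0(references, hypotheses):
    """Count segments the engine predicted exactly (edit distance zero),
    ignoring surrounding whitespace."""
    return sum(ref.strip() == hyp.strip()
               for ref, hyp in zip(references, hypotheses))

references = ["the valve must be replaced", "close the cover", "press start"]
hypotheses = ["the valve must be replaced", "shut the cover", "press start"]
print(count_ed0(references, hypotheses))  # 2 of 3 segments need zero edits
```

      [On a 2,300-segment test set, 9% and 1% ed-0 correspond to the 219 and 25 segments quoted in the comment.]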

    3. Kirti, as it relates to this discussion, I offer my updated comment from another article.

      Our customers are translators who use SMT themselves. This is a feat that many in the MT community said was impossible. In their world, the only test that matters is how much effort MT increases or reduces their work today... not in some uncertain future.

      BLEU is a "closeness" score much like fuzzy matches. Fuzzy reports how closely the source segment matches a source segment in the TM. We use the BLEU score post-facto to report how closely the MT suggestion matches the translator's finished work. In the graphs above, the BLEU scores jump roughly 10% from SMT to NMT. Yet, would a translator be satisfied with source fuzzy closeness scores in these ranges?

      We embedded SMT into a desktop application without a requirement for big data. The translator converts his own TMs to make an SMT engine that serves only him. In this use-case, the translator regularly experiences BLEU scores in the high 70's to high 80's between the MT suggestion and their final work. Furthermore, the percent of ed-0 segments jumps to 25%, 30% and sometimes 45%. We've never seen the percent fall below 25% with a translator's personal TMs for any language pair. It's the same fundamental SMT technology, totally different user experience.

      For us, MT -- of any kind -- is not about the quality of the translation. The translator is responsible for the quality. For us, MT is about enhancing the individual translator's experience. A reduced workload is one way to measure their experience. The MT output's quality is another aspect of that experience, and there are many other aspects to consider.

      NMT algorithms will mature. Hardware support is also improving as Intel embeds GPU co-processors into the CPU, much like they did with floating-point co-processors in the early 1990s. When NMT technology is viable in a desktop application use case, it is certainly possible (probable?) that it could push the closeness scores even higher than we experience with today's desktop SMT. As of today, however, a change to NMT would degrade our customers' experience, and that's not acceptable.

  4. We've seen "hybrid", we've seen "workaround", now I'm going to bring in the term "pragmatic" in reference to my use of NMT.
    Being a translator-cum-coder who - in addition to licensing access to our Dutch-English/English-Dutch neural MT engines to translators - earns a living as an occasional LSP or an insourced project manager, I'm always looking for the most cost-effective way to complete a project by means of automated translation. Like Systran (and I have some fond memories of Peter Toma) whose commitment to NMT is borne out by the energy they devote to the OpenNMT project, I have enough evidence in the language pair of which I have specialist knowledge to come to the conclusion that NMT is probably the best way forward. "Forward" is the operative word here as we are certainly not there yet.
    I also use a variety of approaches to MT to meet my clients' needs, and have been known to run a complete job through RBMT & NMT and pick the best bits from each. In practice, translation jobs can be cross-domain. I am currently working out a strategy to handle a vast project that is multi-domain in terms of content. Much of it involves documents containing legal and administrative jargon, but there are also documents containing technical specifications, while another couple of documents contain lists of protected flora and fauna. I'm very happy with the results of my test runs and will be using NMT to process the legal and administrative documents. I know from experience that our RBMT + PBSMT set-up will currently cope better with the technical documents, and there will certainly be no time to do specialisation of our baseline NMT engine. As for the lists of protected flora and fauna, they are likely to go to a human translator! All the translated documents will find their way into a translation memory program where they will be reviewed by a professional translator.
    "End-to-end" neural machine translation of every kind of document is a noble research goal but in daily translation practice NMT is just one of the tools in the LSP's armoury. Our translation server logs show me that one specialist translator accessed our Neural MT Dutch-English server via our memoQ plugin around 6 hours a day last week, although I know that our NMT model has not been trained for her specific subject area. Whether she accepted, rejected or modified the MT proposals I have no idea, but she definitely kept coming back for more so something was useful. The best translation memory programs allow the display of results from numerous sources, and another translator has specifically asked me to display RBMT and NMT results alongside each other so that he can choose the best proposal (or reject them both).
    Customers want translations of the agreed quality in the required time frame at the specified cost. Whether these translations are hand-crafted by a hundred Cistercian monks toiling in their dimly lit cells or shot down fiber optic by an array of GPUs, they don't give a simian's posterior. RBMT, SMT, NMT and HT are all welcome at my party.