Friday, September 30, 2016

The Google Neural Machine Translation Marketing Deception

Recently I came upon this little tidbit and initially thought how wonderful (NMT is surely rising!), and decided to take a closer look and read through the research paper. This exercise left me a little uncertain, and I began to suspect that this is just another example of that never-ending refrain of the MT world, the empty promise. Without a doubt Google has made some real progress, but “Nearly Indistinguishable From Human Translation” and “GNMT reduces translation errors by more than 55%-85% on several major language pairs”? Hmmm, not really, not really at all, was what my pesky brain kept telling me, especially as I saw this announcement coming up again and again through many news channels, probably pushed heavily by the Google marketing infrastructure.

Surely the great Google of the original “Don’t Be Evil” ethos would not bullshit us thus. (In their 2004 founders' letter prior to their initial public offering, Larry Page and Sergey Brin explained that their "Don't be evil" culture prohibited conflicts of interest, and required objectivity and an absence of bias.) Apparently, Gizmodo already knew about the broken promise in 2012. My friend Roy told me that following Google's corporate restructuring under the conglomerate Alphabet Inc. in October 2015, the slogan was replaced in the Alphabet corporate code of conduct by the phrase "Do the right thing". However, to this day, the Google code of conduct still contains the phrase "Don't be evil". This ability to conveniently bend the rules (but not break the law) and make slippery judgment calls which are convenient to corporate interests is well described by Margaret Hodge in this little snippet. Clearly, Google knows how to push self-congratulating, mildly false content through the global news gathering and distribution system by using terms like research and breakthrough with somewhat shaky research data that includes math, sexy flowcharts, and many tables showing “important research data”. They are after all the kings of SEO. However, I digress.

The basic deception I speak of, and yes I do understand that those might be strong words, is the overstatement of the actual results, using questionable methodology, in what I would consider an arithmetical manipulation of the basic data, to support corporate marketing messaging spin (essentially to bullshit the casual, trusting, but naïve reader who is not aware of the shaky foundations at play here and of statistics in general). Not really a huge crime, but surely just a little bit evil and sleazy. Not quite the Wells Fargo, Monsanto, Goldman Sachs and Valeant & Turing Pharmaceuticals (medical drug price gouging) level of evil and sleazy, but give them time and I am sure they could rise to this level, and they quite possibly will step up their sleaze game if enough $$$s and business advantage issues are at stake. AI and machine learning can be used for all kinds of purposes, sleazy or not, as long as you have the right power and backing.

So basically I see three problems with this announcement:
  1. Arithmetic manipulation to create the illusion of huge progress. (Like my use of font size and bold to make the word huge seem more important than it is.)
  2. Questionable human evaluation methodology which produces rating scores that are then further arithmetically manipulated and used to support the claims of “breakthrough” progress. Humans are unlikely to rate 3 different translations of the same thing on a scale of 0 to 6 (crap to perfect) accurately and objectively. Ask them to do it 500 times and they are quite likely to give you pretty strange and illogical results. Take a look at just four pages of this side-by-side comparison and see for yourself. The only case I am aware of where this kind of a translation quality rating was done with any reliability and care was by Appen Butler Hill for Microsoft. But translation raters there were not made to make a torturous comparison of several versions of the same translation, and then provide an independent rating for each. Humans do best (if reliable and meaningful deductions are sought from the exercise) when asked to compare two translations and asked a simple and clear question like: “Which one is better?” Interestingly even the Google team noted that: “We observed that human raters, even though fluent in both languages, do not necessarily fully understand each randomly sampled sentence sufficiently”. Yes, seriously dude, are you actually surprised? Methinks that possibly there is a kind of stupor that sets in when one spends too much time in the big data machine learning super-special room, that numbs the part of the brain where common sense resides.
  3. Then taking this somewhat shaky research and making the claim that the GNMT is “Nearly Indistinguishable From Human Translation” and figuring out a way to present this as “55% to 85%” improvements.
The overall magic formula to make this happen here seems to be:

Marketing Deception = fn (small BLEU improvements, shaky human evaluation on a tiny amount of data that sort of support our claim, math formulas, very sexy multimedia flow chart, lots of tables and data to make it look like science, HUGE amount of hyperbole, SEO marketing so everybody publishes it, like it was actually a big breakthrough and meaningful research.)


Arithmetic Manipulation

So onto some specifics. Here is the table that provides the “breakthrough results” to make the over-the-top claims. They did not even bother to do the arithmetic correctly on the first row! Seriously, are they not allowed to use Excel or at least a calculator?
Take a look at the bar chart they provide below and tell me if any of the bars looks like there was a “55% to 85%” improvement from the blue line (PBMT) to the top of the green line (GNMT).  Do the green chunks look like they could possibly be 55% or more of the blue chunk? I surely don’t see it until I put my Google glasses on.

I think it is worth restating the original data in Table 10, in what to me is a much more forthright, accurate, reasonable, and less devious presentation, shown below, even though I remain deeply skeptical about the actual value of humans rating multiple translations of the same source sentence on a scale of 0 to 6. These restated results are also very positive, so I am not sure why one would need to overstate them unless there was some marketing directive behind it. Also, these results point out that the English <> Chinese system experienced the biggest gains, which both Microsoft and SYSTRAN have already confirmed, and which possibly explains why it is the only GNMT system in production. For those who believe that Google is the first to do this, this is not so: both Facebook and Microsoft have had production NMT systems running for some time now.
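
To make the arithmetic concrete, here is a small sketch of the two ways the same ratings can be turned into a percentage. The scores are made-up illustrative values on the 0 to 6 scale, not the figures from Table 10, and the function names are mine. As I read the paper, the headline figure divides the GNMT gain by the gap between the PBMT score and the human score; the plainer alternative divides the same gain by the PBMT score itself.

```python
def gap_closed(pbmt: float, gnmt: float, human: float) -> float:
    """Share of the PBMT-to-human gap that GNMT closes, in percent."""
    return 100.0 * (gnmt - pbmt) / (human - pbmt)


def plain_gain(pbmt: float, gnmt: float) -> float:
    """GNMT gain relative to the PBMT score itself, in percent."""
    return 100.0 * (gnmt - pbmt) / pbmt


# Hypothetical ratings on the 0-6 scale (illustrative only, not Table 10).
pbmt, gnmt, human = 4.0, 4.6, 5.0

print(f"gap closed: {gap_closed(pbmt, gnmt, human):.0f}%")  # gap closed: 60%
print(f"plain gain: {plain_gain(pbmt, gnmt):.0f}%")         # plain gain: 15%
```

The same 0.6-point gain reads as 60% or as 15% depending on the denominator, and the closer the old system already is to the human score, the more dramatic the first number becomes.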

When we look at how these systems are improving with the commonly used BLEU score metric, we see that the progress is much less striking. MT has been improving slowly and incrementally, as you can see from the EN > FR system data that was provided below. To put this in some context, Systran had an average of 5 BLEU points improvement on their NMT systems over the previous generation V8 systems. Of course, this is not the same training/test data and technically not exactly equivalent for absolute comparison, but the increase is relative to their old production systems and is thus a hyperbole-free statement of progress.
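
For readers who have not looked inside the metric, here is a minimal sketch of what a BLEU score measures: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and scaled by a brevity penalty. This is a simplified, single-reference, sentence-level version with whitespace tokenization and no smoothing, written by me for illustration. It is not the scoring script used by Google or SYSTRAN, and published scores are computed over a whole test set rather than one sentence.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Simplified single-reference, sentence-level BLEU on a 0-100 scale."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return 100.0 * bp * math.exp(sum(log_precisions) / max_n)


print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # 100.0
print(round(bleu("a b c d e", "a b c d f"), 1))
```

Because the score rewards only surface overlap with a reference, a few BLEU points tell you about incremental change, not about whether a translation reads like a human wrote it.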


The Questionable Human Side-by-Side Evaluation 

I have shared several articles in this blog about the difficulties of doing human evaluations of MT output. It is especially hard when humans are asked to provide some kind of a (purportedly objective) score to a candidate translation. While it may be easier to rank multiple translations from best to worst (like they do in WMT16), the research shows that this is an area plagued with problems. Problems here means it is difficult to obtain results that are objective, consistent and repeatable. This also means that one should be very careful about drawing sweeping conclusions from such questionable side-by-side human evaluations. This is also shown by some of the Google research results as shown below which are counter-intuitive.
The Google research team uses human rating since BLEU is not always reliable and has many flaws and most of us in the MT community feel that competent human assessment is a way to keep it real. But this is what Google say about the human assessment result: “Note that we have observed that human raters, even though fluent in both languages, do not necessarily fully understand each randomly sampled sentence sufficiently and hence cannot necessarily generate the best possible translation or rate a given translation accurately.” They provide a table to show some samples of where they disagree. Would that not be a clue to suggest that the side-by-side comparison is flawed? So what does this mean? The human rating was not really competent? The raters don’t understand the machine’s intent and process? That this is a really ambiguous task so maybe the results are kind of suspect even though you have found a way to get a numerical representation of a really vague opinion? Or all of the above? Maybe you should not show humans multiple translations of the same thing and expect them to score them consistently and accurately.
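
One common way to check whether human judgments are consistent is to have two raters judge the same items and compute a chance-corrected agreement score such as Cohen's kappa. The sketch below is my own illustration with invented labels for a simple "which one is better, A or B?" task; it is not something the Google paper reports, and the data is hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance, given each rater's own label frequencies.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical judgments: which of two translations (A or B) is better?
rater_1 = ["A", "A", "B", "B", "A", "B", "A", "A"]
rater_2 = ["A", "B", "B", "B", "A", "A", "A", "A"]
print(round(cohen_kappa(rater_1, rater_2), 2))
```

A kappa near 1 means the raters agree far more often than chance would predict; a value near 0 means their scores are hard to tell apart from guessing, which is the kind of check I would want to see before trusting a 0 to 6 rating.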

Could it be that they need a more reliable human assessment process and maybe they should call Juan Rowda and Silvio Picinini at eBay and ask them how to do this correctly or at least read their posts in this blog. Or maybe they can hire competent translators to guide this human evaluation and assessment process instead of assigning “human raters” a task that simply does not make sense, no matter how competent they are as translators.

In the grand scheme of things, the transgressions and claims made in this Google announcement are probably a minor deception, but I still think they should be challenged and exposed if possible, and if it is actually fair criticism. We live in a world where corporate malfeasance has become the norm of the day. Here we have a small example which could build into something worse. Monsanto, Wells Fargo, and Goldman Sachs do not have evil people (maybe just at the top), but they have a culture that rewards certain kinds of ethically challenged behavior if it benefits the company or helps you “make your numbers”. To me, this is an example in kind and tells you something about the culture at Google.

We are still quite a long way from “Nearly Indistinguishable From Human Translation”. We need to be careful about overstating the definite and clear progress that actually has been made in this case. For some reason, this (overstatement of progress) is something that happens over and over again in MT. Keep in mind that drawing such sweeping conclusions on a sample of 500 is risky with big data applications (probably 250 Million+ sentence pairs) even when the sample is really well chosen and the experiment has an impeccable protocol, which in this case is SIMPLY NOT TRUE for the human evaluation. The rating process is flawed to such an extent that we have to question some or many of the conclusions drawn here. The most trustworthy data presented here are the BLEU scores, assuming it is truly a blind test set and no Google glasses were used to peek into the test.
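
For anyone who wants to put a number on the sampling noise alone, here is a rough sketch that computes a mean rating and an approximate 95% confidence interval from a sample. The ratings are randomly generated placeholders, not Google's data, and the normal approximation is my simplifying assumption.

```python
import math
import random
import statistics


def mean_with_ci(ratings, z: float = 1.96):
    """Return (mean, half_width) of an approximate 95% confidence interval."""
    n = len(ratings)
    mean = statistics.fmean(ratings)
    # Standard error of the mean from the sample standard deviation.
    std_err = statistics.stdev(ratings) / math.sqrt(n)
    return mean, z * std_err


# Hypothetical ratings on the 0-6 scale for a 500-sentence sample.
random.seed(0)
ratings = [random.choice([3, 4, 5, 6]) for _ in range(500)]
mean, half_width = mean_with_ci(ratings)
print(f"mean {mean:.2f} +/- {half_width:.2f}")
```

Note what this does not cover: the interval only reflects random sampling error. It says nothing about raters who do not understand the sentences, or about a sample that was chosen to contain simple sentences, and those are the problems that worry me most here.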

This sentence was just pointed out to me after this post went live, thus, I am adding an update as an in place postscript. Nature (Journal of Science) provides a little more detail on the human testing process.
"For some other language pairs, the accuracy of the NMTS approached that of human translators, although the authors caution that the significance of the test was limited by its sample of well-crafted, simple sentences."
Would this not constitute academic fraud? So now we have them saying both that the "raters"  did not understand the sentences and that the sentences were essentially made simple i.e. rigged. To get a desirable and publishable result? For most in the academic community this would be enough to give pause and reason to be very careful about making any claims, but of course not for Google who looks suspiciously like they did engineer "the results" at a research level.

These results are consistent with what SYSTRAN has reported (described here), and actually are slightly less compelling at a BLEU score level than the results SYSTRAN has had, as I explained above. (Yes Mike, I know it is not the same training and test set.)

Now that I have gotten the rant off my chest, here are my thoughts on what this “breakthrough” might mean:
  • NMT is definitely proving to be a way to drive MT quality upward and forward but for now, is limited to those with deep expertise and access to huge processing and data resources.
  • NMT problems (training and inference speed, small vocabulary problem, missing words, etc.) will be solved sooner rather than later.
  • Experiment results like these should be interpreted with care, especially if they are based on such an ambiguous human rating score system. Don’t just read the headline and believe them, especially when they come from people with vested interests e.g. Google.
  • Really good MT always looks like human translation, but what matters is how many segments in a set of 100,000 look like a human translation. We should save our "nearly indistinguishable" comments for when we get closer to 90% or at least 70% of all these segments being almost human.
  • The success at Google, overstated though it is, has just raised the bar for both the Expert and especially the Moses DIY practitioners, whose approach makes even less sense now, since you could almost always do better with generic Google or Microsoft, which also has NMT initiatives underway and in production.
  • We now have several end-to-end NMT initiatives underway and close to release from Facebook, Microsoft, Google, Baidu, and Systran. For the short term, I still think that Adaptive MT is more meaningful and impactful to users in the professional translation industry, but as SYSTRAN has suggested, NMT "adapts" very quickly with very little effort with small volumes of human corrective input. This is a very important requirement for MT use in the professional world. If NMT is as responsive to corrective feedback as SYSTRAN is telling us, I think we are going to see a much faster transition to NMT.
As I said in a previous post on the Systran NMT: They have reason to be excited, but all of us in MT have been down this path before and as many of us know, the history of MT is filled with empty promises.

The marketing guy at Google who pushed this announcement in its current form should be asked to watch this video (NSFW link, do NOT click on it if you are easily offended) at least 108 times. The other guys should also watch it a few times too. Seriously, let’s not get carried away just yet. Let’s wait to hear from actual users and let’s wait to see how it works in production use scenarios before we celebrate.

As for the Google corporate motto, I think it has already been true for some time now that Google is at least slightly evil, and I recommend you watch the three-minute summary by Margaret Hodge to see what I mean. Sliding down a slippery slope is a lot easier than standing on a narrow steep ledge in high winds. In today's world, power and financial priorities rule over ethics, integrity and principle, and Google is just following the herd that includes their friends at Wells Fargo, Goldman Sachs, VW, and Monsanto. I said some time ago that a more appropriate motto for Google today might actually be: “You Give, We Take.” Sundar should walk around campus in a T-shirt (preferably one made by underpaid child labor in Bangladesh) with this new motto boldly emblazoned on it in some kind of a cool Google font. At least then Google marketing would not have to go through the pretense of having any kind of ethical, objective or non-biased core, which the current (original) motto forces them to contend with repeatedly. The Vedic long view of the human condition across eons says we are currently living in the tail end of the Kali Yuga, an age of darkness and falsehood and wrong values. An age when charlatans (Goldman Sachs, VW, Monsanto, Wells Fargo and Google) are revered and even considered prophets. Hopefully, this era ends soon.

Florian Faes has made a valiant effort to provide a fair and balanced view on these claims from a variety of MT expert voices. I particularly enjoyed the comments by Rico Sennrich of the University of Edinburgh, who cuts through the Google bullshit most effectively. For those who think that my rant is unwarranted, I suggest that you read the Slator discussion, as you will get a much more diverse opinion. Florian even has rebuttal comments from Mike Schuster at Google, whose responses sound more than a little bit like the spokespersons at Wells Fargo, VW and Goldman Sachs to me. Also, for the record, I don’t buy the Google claim “Our system’s translation quality approaches or surpasses all currently published results,” unless you consider only their own results. I am willing to bet $5 that both Facebook and Microsoft (and possibly Systran and Baidu) have equal or better technology. Slator is the best thing that has happened to the “translation industry” in terms of relevant current news and investigative journalism, and I hope that they will thrive and succeed.

I remain willing to stand corrected if my criticism is unfounded or unfair, especially if somebody from Google sets me straight. But I won’t hold my breath, not until the end of the Kali Yuga anyway.


Thursday, September 22, 2016

Comparing Neural MT, SMT and RBMT – The SYSTRAN Perspective

This is the second part of an interview with Jean Senellart (JAS), Global CTO and SYSTRAN SAS Director General. The first part can be found here: A Deep Dive into SYSTRAN’s Neural Machine Translation (NMT) Technology.

Those translators who accuse MT vendors of stealing or undermining their jobs should take note that SYSTRAN is the largest independent MT vendor, a position it has held for most of its existence, yet it has never generated more than €20 million in annual revenue. To me, this suggests that MT is mostly being used for different kinds of translation tasks and is hardly taking any jobs away. The problem has been much more related to unscrupulous or incompetent LSPs who used MT improperly in rate negotiations. MT is hugely successful for those companies who find other ways to monetize the technology, as I pointed out in The Larger Context Translation Market. This does suggest that MT has huge value in enabling global communication and commerce, and even in its less than perfect state is considered valuable by many who might otherwise need to acquire human translation services. If anything, MT vendors are the ones that are trying hardest to develop technology that is actually useful to professional translators, as the Lilt offering shows and as this new SYSTRAN offering also promises to be. The new MT technology is reaching a point where it is becoming rapidly responsive to corrective feedback and thus much more useful in professional translation use case scenarios. The forces affecting translator jobs and work quality are much more complex and harder to pin down, as I have also mentioned in the past.

In my conversations with Jean, I realized that he is one of the few people around in the "MT industry", who has deep knowledge and production use experience with all three MT paradigms. Thus, I tried to get him to share both his practical experience-based and philosophical perspectives about the three approaches. I found his comments fascinating and thought that it would be worth highlighting them separately in this post. Jean (through SYSTRAN) is unique in being one of the rare practitioners around, who has produced commercial release versions, of say French <> English MT systems, using all three MT paradigms. More if you count the SPE and NPE “hybrid” variants where more than one approach is used in a single production process.

Again in this post, I have tried as much as possible to keep Jean Senellart’s direct quotes in here to avoid any misinterpretation. My comments are in italics when they do occur within his quotes.


Comparing The Three Approaches; The Practical View

Some interesting comparative comments made by Jean about his actual implementation experience with the three methodologies (RBMT, SMT, NMT):

“The NMT approach is extremely smart to learn the language structure but is not as good at memorizing long lists of terminology as RBMT or SMT was. With RBMT, the terminology coverage was right, but the structure was clumsy – with SMT or SPE, we had an “in-between” situation where we got the illusion of fluency, but sometimes at the price of a complete mistranslation, and a strange ability to memorize huge lists but without any consideration of their linguistic nature. With NMT, the language structure is excellent, as if the neural network really deeply understands the grammar of the language – and introducing [greater] support of terminology was the missing link to the previous technologies.”

(This inability of NMT to handle large vocabulary lists is considered one of the main weaknesses of the technology currently. Here is another reference discussing this issue. However, it appears that SYSTRAN has developed some kind of a solution to address this issue.)

“What is interesting with NMT is it seems far more tolerant than PBSMT (Phrase-Based SMT) to noisy data. However, it does need far less data than PBSMT to learn – so we can afford to provide only data for which we know we have a good alignment quality. Regarding the domain or the quality [of MT output of] the languages, we are for the moment trying to be as broad as possible [rather than focusing on specialized domains].”

In terms of training data volume, JAS said: “This is still very empirical, but we can outperform Google or Microsoft MT on their best language pairs using only 5M translation units – and we have a very good quality (BLEU score about 45**) for languages like EN>FR with only 1M TU. I would say we need 1/5 of the data necessary to train SMT. Also, generic translation engines like Google or Bing Translate are using billions of words for their language models, here we need probably less than 1/100th.”

(**I think it bears saying that I fully expect that the BLEU here is measured with great care and competence, unlike what we see so often with Moses practitioners and LSPs in general who assume scores of 75+ are needed for the technology to be usable.

The ability of the new MT technology to improve rapidly with small amounts of good quality training data and small amounts of corrective feedback suggests that we may be approaching new thresholds in the use of MT for professional use.)

Comparing RBMT, SMT, and NMT: The Philosophical View

When I probed more deeply into the differences between these MT approaches, (since really SYSTRAN is the only company who has real experience in all 3), JAS said: “I would more compare them in terms of what they are trying to do, and on their ability to learn.” His explanation reflects his long-term experience and expertise and is worth careful reading. I have left the response in his own words as much as possible.

“RBMT, fundamentally (unlike the other two), has an ulterior motive: it attempts to describe the actual translation process. And by doing that, it has been trying to solve a far more complicated challenge than just machine translation; it tries to decompose [and deconstruct and analyze] and explain how a translation is produced. I still believe that this goal is the ultimate goal of any [automated translation] system. For many applications, in particular, language learning, but also for post-editing, the [automated] system would be far more valuable if it could produce not only the translation but also explain the translation.”

“In addition, RBMT systems are facing three main limitations in language [modeling] which are:

1) the intrinsic ambiguity of language for a machine, which is not the same for a human who has access to meaning [and a sense for the underlying semantics],

2) the exception-based grammar system, and

3) the huge, contextual and always expanding volume of terminology units.”

“Technically, RBMT systems might have different levels of complexity depending on the linguistic formalism being used, making it hard to compare with the others (SMT, NMT), so I would rather say that one of the main reasons for the limitations of a pure RBMT system lies in its [higher reaching] goal. The fact is that fully describing a language is a very complicated matter, and there is no language today for which we can claim a full linguistic description.”

“SMT came with an extraordinary ability to memorize from its exposure to existing translations – and with this ability, it brought a partial solution to the first challenge mentioned above, and a very good solution to the third one – the handling of terminology – however, it mostly ignores the modeling of the grammar. Technically, I think SMT is the most difficult of the 3 approaches, it combines many algorithms to optimize the MT process, and it is hard work to deal with the huge database of [relevant training] corpus.”

“NMT has access to the meaning and is dealing well with modeling of human language grammar. In terms of difficulty, NMT engines are probably the simplest to implement, a full training of an NMT engine involves only several thousand lines of code. The simplicity of implementation is, however, hiding the fact that we/nobody knows why it is so effective.”

“I would use an analogy of a human learning to drive a car to explain this more fully:

- The Rule-based approach will attempt to provide a full modeling of the car dynamic, on how the engine is connected to the wheel, on the effect of acceleration in the trajectory, etc. This is very complicated. (And possibly impossible to model in totality.)

- The Statistical approach, will use data from past experience and will try to compare a new situation with a past situation and will decide on the action based on this large database [of known experience]. This is a huge task and very difficult to implement. (And can only be as good as the database it learns from.)

- The Neural approach, with a limited access to the phenomenon involved, or with limited ability to remember, will build its own “thinking” system to optimize the driving experience, it will actually learn to drive the car, build reflexes – but will not be able to explain why and how such decisions are being made, and will not be able to leverage local knowledge – for instance that at a specific bend on the road in very specific weather condition, it had to anticipate [braking] because it is particularly dangerous, etc. This approach is surprisingly very simple and thanks to computation power evolution has become more accessible.”

“Today, this last approach (NMT) is clearly the most promising but will need to integrate the second (SMT) to be more robust, and eventually to deal with the first one (RBMT) to be able to not only make choices but also explain them.”


Comparing NMT to Adaptive MT

When probed about this, JAS said: “Adaptive MT is an innovative concept based on the current SMT paradigm – it is, however, a concept that is quite naturally embedded in the NMT paradigm; of course, there will be work needed to be done to make it work as nicely as the Lilt product does it. But, my point is that NMT (and not just from Systran) will bring a far more intuitive solution to this issue of continuous adaptive learning, because it is built for that: on a trained model, we can tune the model without any tricks with feedback of one single sentence – and produce a translation which immediately adapts to user input.”

The latest generation of MT technology, especially NMT and Adaptive MT, looks like a major step forward in enabling the expanding use of MT in professional translation settings. With continuing exploration and discovery in the fields of NLP, artificial intelligence and machine intelligence, I think we may be in for some exciting times ahead as these discoveries benefit MT research. Hopefully, the focus will shift to making new content multilingual and solving new kinds of translation challenges, especially in speech and video. I believe that we will see more of the kinds of linguistic steering activities we are seeing in motion at eBay and that there will always be a role for competent linguists and translators.

Jean Senellart, CEO, SYSTRAN SA

The first part of the SYSTRAN interview can be found at A Deep Dive into SYSTRAN’s Neural Machine Translation (NMT) Technology 

Post Script: There is a detailed article that describes the differences between these approaches on the SYSTRAN website, released a few weeks after this post was published.

Wednesday, September 21, 2016

A Deep Dive into SYSTRAN’s Neural Machine Translation (NMT) Technology

One of the wonderful things about my current independent status is the ability to engage deeply with other MT experts who were previously off limits because competing MT vendors don't usually chat with open hearts and open cloaks. MT is tough to do well and I think the heavy lifting should be left to people who are committed for the long run, and who are willing to play, invest and experiment in spite of regular failure. This is how humans who endure and persist, learn and solve complex problems.

This is Part 1 of a two-part post on the SYSTRAN NMT product announcement. The second part will focus on comparing NMT with RBMT and SMT and also with the latest Adaptive MT initiatives. It can be found here: Comparing Neural MT, SMT and RBMT – The SYSTRAN Perspective

Press releases are so filled with marketing-speak as to be completely useless to most of us. They have a lot of words but after you read them you realize you really don't know much more than you got from the headline. So, I recently had a conversation with Jean Senellart, Global CTO and SYSTRAN SAS Director General, to find out more about their new NMT technology. He was very forthcoming and responded to all my questions with useful details, anecdotes, and enthusiasm. The conversation only reinforced in my mind that "real MT system development" is something best left to experts, and not something that even large LSPs should dabble with. The reality and complexity of NMT development push the limits of MT even further away from the DIY mirage.

In the text below, I have put quotes around everything that I have gotten directly from SYSTRAN material or from Jean Senellart (JAS) to make it clear that I am not interpreting. I have done some minor editing to facilitate readability and "English flow" and added comments in italics within his quotes where this is done.


The New Product Line

JAS clarified several points about the overall evolution of the SYSTRAN product line.
SYSTRAN intends to keep all the existing MT system configurations they have in addition to the new NMT options. So they will have all of the following options:
  • RBMT :- the rule-based legacy technology
  • SMT :- Moses-based generation of engines that they have released for some language pairs over the last few years
  • SPE :- Statistical Post-Editing translation engines - that were introduced in 2007 as the first implementation combining Rule-Based plus Phrase-Based Statistical systems.
  • NMT:- is the purely neural machine translation engines that they just announced.
  • NPE :- stands for « Neural Post-Editing » and it is the replication of what they did in SPE using Phrase-Based machine translation, but now using Neural Machine Translation instead of SMT for the second step in the process. They are now using a neural network to correct and improve the output of a rule-based engine.
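
The SPE and NPE configurations both follow the same two-step shape: a first-pass engine produces a draft, and a second engine corrects that draft. A minimal sketch of that shape follows. The function names and the toy engines are my own hypothetical stand-ins, not SYSTRAN's API; a real system would call an RBMT engine and a trained statistical or neural post-editing model at these two points.

```python
from typing import Callable

Translator = Callable[[str], str]


def make_post_editing_pipeline(first_pass: Translator, post_editor: Translator) -> Translator:
    """Chain a first-pass engine with a second engine that corrects its output."""
    def translate(source: str) -> str:
        draft = first_pass(source)   # step 1: e.g. rule-based translation
        return post_editor(draft)    # step 2: statistical or neural correction
    return translate


# Toy stand-ins (hypothetical), used only to make the sketch runnable.
def toy_rule_based(source: str) -> str:
    lexicon = {"le": "the", "chat": "cat", "noir": "black"}
    return " ".join(lexicon.get(word, word) for word in source.split())


def toy_post_editor(draft: str) -> str:
    return draft.replace("cat black", "black cat")


npe = make_post_editing_pipeline(toy_rule_based, toy_post_editor)
print(npe("le chat noir"))  # the black cat
```

Swapping the second function is all it takes to move from an SPE-style to an NPE-style setup in this sketch, which is consistent with the description of NPE as a replication of SPE with a neural second step.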

They will preserve exactly the same set of APIs and features (like the support of a user dictionary) around these new NMT modules so that these historical linguistic investments are fully interchangeable across the product line.

JAS said: "From my intuition, there will still be situations where we will prefer to continue to offer the older solutions: for instance, when we will need high-throughput on a standard CPU server, or for low-resource languages for which we already have some RBMT solution, or for customers currently using heavily customized engines." However, they expect NMT to proliferate even in small-memory-footprint environments. Although they expect NMT to prevail eventually, they will keep the other options available for their existing customer base.

The NMT initiative focused on language pairs that were most important to their customers, were known to be historically difficult, or currently presented special challenges not easily solved with the legacy solutions. So, as expected, the initial focus was on EN<>FR, EN<>AR, EN<>ZH, EN<>KO, and FR<>KO. All of these already show promise, especially the KO<>EN and KO<>FR combinations, which showed the most dramatic improvements and can be expected to improve further as the technology matures.

However, DE<>EN is one of the most challenging language pairs, as Jean said: "we have found the way to deal with the morphology, but the compounding is still problematic. Results are not bad, though, but we don't have the same jump in quality yet for this language pair."


The Best Results

So where have they seen the most promising results? As Jean said: "The most impressive results I have seen are in complicated language pairs like English-Korean, however, even for Arabic-English, or French-English the difference of quality between our legacy engines, online engines, and this new generation is impressive.

What I found the most spectacular is that the translation is naturally fluent at the full sentence level - while we have been (historically used) to some feeling of local fluency but not sounding fully right at the sentence level. Also, there are some cases, where the translation is going quite away from the source structure - and we can see some real "rewriting" going on."

Here are some examples comparing KO>EN sentences with NMT, SYSTRAN V8 (the current generation) and Google:

And here are some examples of where the NMT seems to make linguistically informed decisions and changes the sentence structure away from the source to produce a better translation.


The Initial Release

When the NMT technology is released in October, SYSTRAN expects to release about 40 language pairs (mostly European and major Asian languages paired with English and French), with an additional 10 still in development to be released shortly after.

As JAS stated: "We will be delivering high-quality generic NMT engines that will be instantly ready for "specialization" (I am making a difference with customization (which implies training) because the nature of the adaptation to the customer domain is very different with NMT)."

Also very important for the existing customer base is that all the old dictionaries developed over many years for RBMT / SMT systems will be useful for NMT systems. As Jean confirmed: "Yes - all of our existing resources are being used in the training of the NMT engines. It is worth noting that dictionaries are not the only components from our legacy modules we are re-using; the morphological analysis and named entity recognition are also key parts of our models."

With regard to the User Interface for the new NMT products, JAS confirmed: "the first generation will fully integrate into the current translation infrastructure we have - we had to replace of course the back-end engines, but also some intermediate middle components. However, the GUI is preserved. We have started thinking about the next generation of UI which will fully leverage the new features of this technology, and will be targeting a release next year."

The official SYSTRAN marketing blurb states the following:
"SYSTRAN exploits the capacity NMT engines have to learn from qualitative data by allowing translation models to be enriched each time the user submits a correction. SYSTRAN has always sought to provide solutions adjusted to the terminology and business of its customers by training its engines on customer data. Today SYSTRAN offers a self-specialized engine, which is continuously learning on the data provided."


Driving MT Engine Improvements

Jean also informed me that NMT has a simple architecture, but the number of options available to tune the engines is huge, and he has not found one single approach that is suitable for all languages. Options that can make a significant difference include "type of tokenization, the introduction of additional features for instance for guiding the alignment, etc...

So far we have not found one single paradigm that works for all languages, and each language pair seems to have its own preference. What we can observe is that unlike SMT where the nature of the parameters was numerical and not really intuitive, here it seems that we can get major improvements by really considering the nature of the language pair we are dealing with."

So do these corrective changes require re-training, or is there an instant dictionary-like capability that works right away? "Yes - this is a cool new feature. We can introduce feedback to the engine, sentence by sentence. It does not need retraining, we are just feeding the extra sentence and the model instantly adapts. Of course, the user dictionary is also a quick and easy option. The ability of an NMT engine to "specialize" very easily and even to adapt from one single example is very impressive."


Detailed MT Quality Metrics

"What is interesting is that we get major score improvement for systems that have not been tuned for the metrics they are evaluated against - for instance, here are some results on English-Korean using the RIBES metric."


"In general, we see BLEU improvements of more than 5 points over current baselines."

"The most satisfying result, however, is that the human evaluation is always confirming the results - for instance, for the same language pair, when doing pairwise human ranking we obtained the following results. (RE is human reference translation, NM is NMT, BI is Bing, GO is Google, NA is Naver, and V8 our current generation). It reads: 'when system A was in a ranking comparison with system B (or the reference), how many times was it preferred by the human?'"
"What is interesting in the cross comparison is that we rank engines pair by pair: when we blindly show a Google and a V8 translation, we see which one the user prefers. The most interesting row, however, is the second one:

        RE      BI      GO      NA      V8
NM      46.4    74.5    73.9    72      63.1

When comparing NMT output with the human reference translation, 46% of the time NMT is preferred (which is not bad, that means about one sentence out of two, the human does not prefer the Reference HT over NMT!), when comparing NMT and Google - 74% of the time, the preference goes to NMT, etc..."
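
For readers who want to see how such a pairwise preference row is computed, here is a minimal sketch in Python. It is only an illustration of the reading described above, not SYSTRAN's evaluation code: the judgment format, the function name preference_rates, and the toy data are all assumptions invented for this example, and ties are not modeled.

```python
from collections import defaultdict


def preference_rates(judgments):
    """Return {(a, b): percentage of a-vs-b comparisons in which a was preferred}.

    `judgments` is an iterable of (system_a, system_b, winner) tuples, where
    `winner` is the label of the translation the evaluator preferred.
    This input format is hypothetical, chosen for illustration only.
    """
    wins = defaultdict(int)
    totals = defaultdict(int)
    for a, b, winner in judgments:
        if winner not in (a, b):
            raise ValueError(f"winner {winner!r} is not one of {a!r}, {b!r}")
        loser = b if winner == a else a
        # Count the comparison in both directions so that each cell
        # (row system vs. column system) can be looked up directly.
        totals[(a, b)] += 1
        totals[(b, a)] += 1
        wins[(winner, loser)] += 1
    return {pair: 100.0 * wins[pair] / n for pair, n in totals.items()}


# Toy data (made up): blind side-by-side judgments.
judgments = [
    ("NM", "GO", "NM"),
    ("NM", "GO", "NM"),
    ("NM", "GO", "GO"),
    ("NM", "GO", "NM"),
    ("NM", "RE", "RE"),
    ("NM", "RE", "NM"),
]
rates = preference_rates(judgments)
print(rates[("NM", "GO")])
```

In this toy data NM is preferred over GO in three of four comparisons, so the NM row would show 75 in the GO column, which is how the 73.9 in the row above should be read.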


The Challenges

The computing requirements have been described by many as a particular challenge. Even with GPUs, training an NMT engine is a long task. As Jean says: "and when we have to wait 3 weeks for a full training, we do need to be careful with the training workflow and explore as many options as possible in parallel."

"Artificial neural networks have a terrific potential but they also have limitations, particularly to understand rare words. SYSTRAN mitigates this weakness by combining artificial neural network and its current terminology technology that will feed the machine and improve its ability to translate."

"It is important to point out that graphic processing units (GPUs) are required to operate the new engine. Also, to quickly make this technology available, SYSTRAN will provide the market with a ready-to-use solution using an appliance (that is to say hardware and software integrated into a single offering). In addition, the overall trend is that desktops will integrate GPUs in the near future as some smartphones already do (the latest iPhone can manage neural models). As [server] size is becoming less and less of an issue, NMT engines will easily be able to run locally on an enterprise server."

As mentioned earlier, there are still some language pairs where the optimal NMT formula is still being unraveled, e.g. DE<>EN, but these are still early days. I think we can expect that the research community will zero in on these tough problems, and at some point at least partial solutions will be available even if complete solutions are not.


Production User Case Studies

When asked about real-life production use of any of the NMT systems, Jean provided two key examples.

"We have several beta-users - but two of them are most significant. For the first one, our goal is to translate a huge tourism related database from French to English, Chinese, Korean, and Spanish. We intend to use and publish the translation without post-editing. The challenge was to introduce support for named entity recognition in the model - since geographical entities were quite frequent [in the content] and a bit challenging for NMT. The best model was a generic model, meaning that we did not even have to adapt to a tourism model - and this seems to be a general rule, while in previous generation MT, the customization was doing 80% of the job, for NMT, the customization is only interesting and useful for slight final adaptation.

The second [use case]- is about technical documentation in English>Korean for an LSP. The challenge was that the available "in-domain" data was only 170K segments, which is not enough to train a full engine, but seems to be good enough to specialize a generic engine."

From everything I understand from my conversations, SYSTRAN is far along the NMT path, and miles ahead in terms of actually having something to show and sell, relative to any other MT vendor. They are not just writing puff pieces about how cool NMT is, to suggest some awareness of the technology. They have tested scores of systems and have identified many things that work and many that don't. Like many innovative things in MT, it takes at least a thousand attempts before you start developing real competence. They have been carefully measuring the relative quality improvements with competitive alternatives, which is always a sign that things are getting real. The product is not out yet, but based on my discussions so far, I can tell they have been playing for a while. They have reason to be excited, but all of us in MT have been down this path before and as many of us know, the history of MT is filled with empty promises. As the Wolf character warns us (NSFW link, do NOT click on it if you are easily offended) in the movie Pulp Fiction after fixing a somewhat impossible problem, let's not get carried away just yet. Let's wait to hear from actual users and let us wait to see how it works in more production use scenarios before we celebrate.

The goal of the MT developer community has always been to get really useful automated translation in a professional setting, since perfection, it seems, is a myth. SYSTRAN has seriously upped their ability to do this. They are getting continuously better translation output from the machine. If I were working with an enterprise with a significant interest in CJK <> E content, I would definitely take a closer look, as I have also gotten validation from Chris Wendt at Microsoft on their own success with NMT on J <> E content. I look forward to hearing more feedback about the NMT initiative at SYSTRAN, and if they keep me in the loop I will share it on this blog in future. I encourage you to come forward with your questions as it is a great way to learn and get to the truth, and Jean Senellart seems willing and able to share his valuable insights and experience.


Jean Senellart, CEO, SYSTRAN SA

Thursday, September 15, 2016

LSP Perspective - MT and Translators

This is a guest post by Deepan Patel at Milengo Ltd. This post is part of a series of upcoming posts that will provide varying LSP perspectives on MT technology. I would invite other LSPs who may have an interest in sharing a view to come forward and contact me to discuss different potential subjects. I may not always share the opinions of the guest writers, but I am a proponent of sharing differing views, and letting readers decide for themselves what makes most sense to them. I do not edit these opinions except for minor typos and basic reading flow related edits. I may sometimes highlight some statements in bold to highlight something I think is central to the view. I have added a presentation made by Deepan at TAUS, which is in the public domain, to provide more detail on his user case context. 




One of the key factors in Milengo’s success in establishing a robust and sustainable framework for machine translation (MT) has been to make translators an integral part of all our MT-related activities. Machine translation continues to be a divisive topic for translators for various reasons, and it is paramount for organizations that offer MT-based localization solutions to engage in respectful dialogue with translators. It is a prerequisite to ensuring successful implementation of any MT strategy. From my point of view, respectful dialogue entails addressing the very valid concerns that translators may have, especially on topics such as post-editing, in a balanced manner. 


There are many translators who for whatever reason do not want to post-edit machine-translated content at all and it is important to respect their reasons for this. I have spoken to many translators who simply do not like the discipline of post-editing because they feel that it introduces a sort of negative mindset into their work. Translation becomes less of a creative activity for them because the overt focus of post-editing is on error analysis and correction. 

A corollary to this is that translators can feel that post-editing requires them to lower their own expectations in terms of producing a highly polished and stylish translation. Whilst post-editing they find themselves fighting the urge to completely rewrite sentences so that the style of language corresponds to their own preferences, even in cases where only minor adjustments would be needed to produce a perfectly adequate sentence. To me these are perfectly reasonable viewpoints and we never seek to coerce translators into performing post-editing work, if they do not want to. 

A lot of translators have also expressed frustrations with the attitude of some of the language service providers (LSPs) with whom they work; having post-editing work thrust at them without any discussion on whether the original content selected for post-editing is actually suitable for MT, not being provided with clear enough directives on the key aspects to focus on during post-editing, and most importantly, no spirit of negotiation when it comes to establishing fair remuneration for the task at hand. 

I am surprised if LSPs do choose to engage with translators on post-editing assignments in the manner just described. In my opinion, such an approach can only serve to thoroughly alienate translators and is ultimately detrimental to the objective of successful MT implementation. 

When we started our own process several years ago of introducing MT into our service spectrum, we were (and still are) heavily dependent on the guidance of our translators in establishing the parameters under which MT could help to increase translator throughput. And perhaps even more importantly, our translators help us to recognize the scenarios where MT does not really add much, if any, value to the translation process. As a result, our own approach is much more focused on those scenarios where we are confident that MT makes sense, and we are consequently handling ever-increasing volumes of localization work from our clients with post-editing workflows. For all of this we have our translators to thank – not only for helping us to shape a consultative approach regarding MT with our clients, but of course without them we would never get the post-editing work completed! 

The key has been to involve translators at every stage of a given testing scenario. Much of the work that we undertake during testing and evaluation phases relates to the careful selection of bilingual content to be used during MT engine creation. Although we use and appreciate automated mechanisms for extraction, consolidation and ‘cleansing’ of engine training data, it should never be forgotten that we are still dealing with language after all, and that highly proficient linguists should also play a very valuable role in the data selection process. 

For example, we ask our translators to help us design bilingual test, tuning, and terminology sets to be used for engine creation. These are constructed based on analysis of the actual source content that is eventually intended for machine translation, and are really vital for us in being able to effectively benchmark the performance of any engine that we train. Once an initial working engine is in place, our translators help us to verify the automated evaluation metric scores generated during the training process, and to identify patterns of errors in the output which we seek to remedy as much as possible in subsequent engine re-trainings. Eventually, the trial post-editing runs with our translators help us to agree on reasonable throughput expectations and consequently a consensus on fair compensation for post-editing. 

Ultimately we are strong advocates of highly collaborative working models with translators when it comes to testing and eventually implementing MT for a given scenario. Having translators participate at every stage of a lengthy testing process means that they are in full possession of all relevant facts to make informed decisions about whether a given MT engine adds value to their translation work or not. Similarly, we (Milengo) are able to shape our own approach towards evaluating whether MT could work effectively or not for a given localization scenario based on the expert guidance of our translators. I really cannot overstate the value of translators for us in all our MT-related activities. 

About the writer 

Deepan Patel is Milengo Ltd’s MT Specialist and has been working in the localization industry for seven years. He is a Modern Languages graduate from the University of Oxford and a certified Memsource trainer.