Wednesday, October 19, 2016

SYSTRAN Releases Their Pure Neural MT Technology

SYSTRAN announced earlier this week that they are doing a “first release” of their Pure Neural™ MT technology for 30 language pairs. Given how good the Korean samples that I saw were, I am curious why Korean is not one of the languages that they chose to release.

"Let’s be clear, this innovative technology will not replace human translators. Nor does it produce translation which is almost indistinguishable from human translation"  ...  SYSTRAN BLOG

The language pairs being initially released are 18 in and out of English, specifically EN<>AR, PT-BR, NL, DE, FR, IT, RU, ZH, ES, and 12 in and out of French, FR<>AR, PT-BR, DE, IT, ES, NL. They claim these systems are the culmination of over 50,000 hours of GPU training, but they are very careful to say that they are still experimenting with and tuning these systems and will adjust them as they find ways to make them better.

They have also enrolled ten major customers in a beta program to validate the technology at the customer level, and I think this is where the rubber will meet the road and we will find out how it really works in practice.

The boys at Google (who should still be repeatedly watching that Pulp Fiction clip) should take note of their very pointed statement about this advance in the technology:

Let’s be clear, this innovative technology will not replace human translators. Nor does it produce translation which is almost indistinguishable from human translation – but we are convinced that the results we have seen so far mark the start of a new era in translation technologies, and that it will definitely contribute to facilitating communication between people.
Seriously, Mike (Schuster), that's all people expect: a statement that is somewhat close to reality.

They have made a good effort at explaining how NMT works and why they are excited, something they repeat throughout their marketing materials. (I have noticed that many who work with neural net based algorithms are still somewhat mystified by how they work.) They plan to explain NMT concepts in a series of forthcoming articles, which some of us will find quite useful, and they also provide some output examples that are helpful for understanding how the different MT methodologies approach language translation.

CSA Briefing Overview

In a recent briefing with Common Sense Advisory, they shared some interesting information about the company in general:
  • The acquisition of the Korean CSLi Co. has invigorated the technology development initiatives.
  • They have several large account wins including Continental, HP Europe, PwC and Xerox Litigation Services. These kinds of accounts are quite capable of translating millions of words a day as a normal part of their international operational needs.
  • Revenues are up over 20% compared to 2015, and they have established a significant presence in the eDiscovery area, which now accounts for 25% of overall revenue.
  • NMT technology improvements will be assessed by an independent third party (CrossLang) with long-term experience in MT evaluation, and who are not likely to say misleading things like "55% to 85% improvements in quality" as the boys at Google did.
  • SYSTRAN is contributing to an open-source project on NMT with Harvard University and will share detailed information about their technology there. 

Detailed Technical Overview

They have also supplied a more detailed technical paper, which I have yet to review carefully, but what struck me immediately on initial perusal was that the data volumes they are building their systems with are minuscule compared to what Google and Microsoft have available. However, the ZH > EN results did not seem substantially different from the amazing-NOT GNMT system. Some initial observations that I found interesting are highlighted below, but you should go to the paper to see the details:

Domain adaptation is a key feature for our customers — it generally encompasses terminology, domain and style adaptation, but can also be seen as an extension of translation memory for human post-editing workflows. SYSTRAN engines integrate multiple techniques for domain adaptation, training full new in-domain engines, automatically post-editing an existing translation model using translation memories, extracting and re-using terminology. With Neural Machine Translation, a new notion of “specialization” comes close to the concept of incremental translation as developed for statistical machine translation (Ortiz-Martínez et al., 2010).

What is encouraging is that adaptation or “specialization” is possible with very small volumes of data and can be run in a few seconds, which suggests it could become the equivalent of an Adaptive MT model.

Our preliminary results show that incremental adaptation is effective for even limited amounts of in-domain data (nearly 50k additional words). Constrained to use the original “generic” vocabulary, adaptation of the models can be run in a few seconds, showing clear quality improvements on in-domain test sets.
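
To make the idea more concrete, here is a minimal sketch of what such a quick "specialization" pass might look like in PyTorch: a generic encoder-decoder model gets a few extra passes over a small in-domain sample while the original vocabulary stays fixed. The checkpoint name, model interface, and hyperparameters below are my own illustrative assumptions, not SYSTRAN's actual implementation.

```python
# Illustrative sketch only: "specialize" a generic NMT model on a small
# in-domain corpus while keeping the original (generic) vocabulary fixed.
# Checkpoint name, model interface, and hyperparameters are assumptions.
import torch
import torch.nn as nn

model = torch.load("generic_nmt_model.pt")   # hypothetical pre-trained seq2seq model
PAD_ID = 0                                    # assumed padding token id

criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)
# A small learning rate and very few passes: the goal is a quick nudge toward
# the in-domain data (tens of thousands of words), not a full retraining run.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

def specialize(model, in_domain_batches, epochs=2):
    """Continue training on tiny in-domain data; returns the adapted model."""
    model.train()
    for _ in range(epochs):
        for src, tgt_in, tgt_out in in_domain_batches:   # tensors of token ids
            optimizer.zero_grad()
            logits = model(src, tgt_in)                  # assumed shape: (batch, len, vocab)
            loss = criterion(logits.reshape(-1, logits.size(-1)), tgt_out.reshape(-1))
            loss.backward()
            optimizer.step()
    return model
```

Because the architecture and vocabulary are untouched, a pass over roughly 50k words of in-domain text is a matter of seconds to minutes on a single GPU, which is presumably what makes the near-instant adaptation the paper describes plausible.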

Of course, the huge processing requirements of NMT remain a significant challenge, and perhaps they will have to follow Google and Microsoft, who both have new hardware approaches to this issue: the TPU (Tensor Processing Unit) and the programmable FPGAs that Microsoft recently announced to deal with this new class of AI-based machine learning applications.

For those who are interested, I ran a paragraph from my favorite Chinese news site and compared the Google “nearly indistinguishable from human translation” GNMT output with the SYSTRAN PNMT output. I really see no big differences in quality from my rigorous test, and clearly we can safely conclude that we are still quite far from human-range MT quality at this point in time.

 The Google GNMT Sample 


The SYSTRAN Pure NMT Sample

Where do we go from here?

I think the actual customer experience is what will determine the rate of adoption and uptake. Microsoft and a few others are well along the way with NMT too. I think SYSTRAN will provide valuable insights in December from the first beta users who actually try to use it in a commercial application. There is enough evidence now to suggest that if you want to be a long-term player in MT, you had better have real hands-on experience with NMT and not just post about how cool NMT is while using SEO words like machine learning and AI on your website.

The competent third-party evaluation SYSTRAN has planned is a critical proof statement that will hopefully provide valuable insight into what works and what needs to be improved at the MT output level. It will also give us more meaningful comparative data than the garbage Google has been feeding us. We should note that while the BLEU score jumps are not huge, human evaluations show that NMT output is often preferred by many who look at the output.

The ability of serious users to adapt and specialize the NMT engines for their specific in-domain needs is, I think, a really big deal. If this works as well as I am being told, it will quickly push PBSMT-based Adaptive MT (my current favorite) to the sidelines, but it is still too early to say this with anything but Google MT Boys certainty.

But after a five-year lull in the MT development world, with seemingly little to no progress, we finally have some excitement in machine translation, and NMT is still quite nascent. It will only get better and smarter.

Tuesday, October 11, 2016

The Importance & Difficulty of Measuring Translation Quality

This is another timely post describing the challenges of human quality assessment, by Luigi Muzii. As we saw from the recent deceptive Google NMT announcements, while there is a lot of focus on new machine learning approaches, we are still using the same quality assessment approach of yesteryear: BLEU. Not much has changed. It is well understood that this metric is flawed, but no useful replacement seems to be coming forward. This necessitates that some kind of human assessment also be made, and invariably this human review is also problematic. The best practices for these human assessments that I have seen are at Microsoft and eBay; the worst at many LSPs and Google. The key to effective procedures seems to be the presence of invested and objective linguists on the team, and a culture that has integrity and rigor without the cumbersome and excessively detailed criteria that the "Translation Industry" seems to create (DQF & MQM, for example). Luigi offers some insight on this issue that I feel is worth noting, as we need to make much more progress on the human assessment of MT output as well, not only to restrain Google from saying stupid shite like “Nearly Indistinguishable From Human Translation” but also to really understand whether we are making progress and to better understand what needs to be improved. MT systems can only improve if competent linguistic feedback is provided, as the algorithms will always need a "human" reference. The emphasis below is all mine and was not present in the original submission.


Dimensionally speaking, quality is a measurement, i.e. a figure obtained by measuring something.

Because of the intricacies inherent in the nature of languages, objective measurement of translation quality has always been a much-researched and debated topic that has borne very little fruit. The notion of a commonly understood quality level remains unresolved, together with any kind of generally accepted and clearly understood quality assessment and measurement.

Then along came machine translation and, since its inception, we have been facing the central issue of estimating the reliability and quality of MT engines. Quite obviously, this was done by comparing the quality of machine translated output to that of human reference data using statistical methods and models, or by having bilingual humans, usually linguists, evaluate the quality of machine translated output.

Ad hoc algorithms based on specific metrics, like BLEU, were developed to perform automatic evaluation and produce an estimate of the efficiency of the engine for tuning and evaluation purposes. The bias implicit in the selection of the reference model remains a major issue, though, as there is not only one single correct translation. There can be many correct translations.
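
To make the single-reference bias concrete, here is a tiny sketch using the sacrebleu library (the sentences are invented for illustration): a perfectly reasonable translation that merely words things differently from the lone reference scores poorly, and adding a second valid reference lifts the score without the translation changing at all.

```python
# Illustrative only: BLEU against one reference vs. two equally valid references.
# Requires: pip install sacrebleu
import sacrebleu

hypothesis = ["The committee approved the budget for next year."]

# A single correct reference, worded differently from the hypothesis.
one_reference = [["The panel signed off on the budget for the coming year."]]

# The same reference plus a second, equally correct one.
two_references = [
    ["The panel signed off on the budget for the coming year."],
    ["The committee approved next year's budget."],
]

print(sacrebleu.corpus_bleu(hypothesis, one_reference).score)   # low
print(sacrebleu.corpus_bleu(hypothesis, two_references).score)  # noticeably higher
```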

Human evaluation of machine translation has always been done in the same way as for human translations, with the same inconsistencies, especially when results are examined over time and when the evaluations are done by different people. The typical error-catching approach of human evaluation is irredeemably biased as long as errors are not defined uniquely and unambiguously, and if care is not taken to curb giving too much scope to the evaluator's subjective preferences.

The problem with human evaluation is bias. The red-pen syndrome.

Indeed, human evaluation of machine translation is known for being expensive, time-consuming and often skewed, yet it is supposed to overcome the drawbacks introduced by the limited accuracy and approximation of automatic evaluation. However, the many new quality measurement metrics proposed over the years, for all their complications, have not reduced the rough approximation we are still faced with. They have instead added to the confusion, because these new metrics are not well understood and introduce new kinds of bias. In fact, despite the many efforts made over the last few years, the overall approach has remained the same, with a disturbing inclination to move toward ever more detail rather than toward more streamlined approaches. For example, the new complexity arising from the integration of DQF and MQM has so far proven expensive, unreliable, and of limited value. Many know about the inefficiency and ineffectiveness of the SAE metrics once applied to the real world, with many new errors introduced by reviewers, together with many false positives. Indeed, translation quality metrics have become more and more complex and overly detailed, and always seem to be based on the error-catching approach that has proved costly and unreliable thus far. Automatic metrics can be biased too, especially when we assume that the human reference samples represent translation perfection, but at least they are fast, consistent, and convenient, and their shortcomings are widely known and understood.

People in this industry— and especially academics—seem to forget or ignore that every measurement must be of functional value to business, and that the plainer and simpler the measurement the better it is, enabling it to be easily grasped and easily used in a production mode.

On the other hand, just like human translation, machine translation is always of unknown quality, especially when rendered in a language unknown to the buyer, but it is intrinsically much more predictable and consistent when compared to human-translated projects with large batches of content, where many translators may be involved.

Effective upfront measurement helps to provide useful prior knowledge, thus reducing uncertainty, leading to well-informed decisions, and lessening the chance of deployment error. Ultimately, effective measurement helps to save money. Therefore, the availability of clear measures for rapid deployment is vital for any business using machine translation.

Also, any investment in machine translation is likely to be sizeable. Implementing a machine translation platform is a mid- to long-term effort requiring specialized resources and significant resilience to potentially frustrating outcomes in the interim. Effective measurements, including evaluation of outputs, provide a rational basis for selecting what improvements to make first.

In most cases, any measurement is only an estimate, a guess based on available information, made by approximation: it is almost correct and not intended to be exact.

In simple terms, the logic behind the evaluation of machine translation output is to get a few basic facts pinned down:

  1. The efficiency and effectiveness of the MT engines;

  2. The size of the effort required for further tuning the MT engine;

  3. The extent and nature of the PEMT effort.

Each measure is related to one or more strategic decisions. 

Automatic scores give at least some idea of the efficiency and effectiveness of engines. This is crucial for estimating the distance from the required and expected level of performance, and the time needed to close the gap.

For example, if using BLEU as the automatic assessment reference, 0.5-0.8 could be considered acceptable for full post-editing, 0.8 or higher for light post-editing.
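
As a toy illustration of how such a threshold could drive the routing decision (the cut-offs simply restate the numbers above, the handling of anything below 0.5 is my own assumption, and real thresholds would have to be validated per engine and language pair):

```python
# Toy routing rule: decide the post-editing mode from an automatic score.
# Thresholds mirror the illustrative BLEU bands above; the below-0.5 branch
# is an assumption, and everything needs per-engine validation.
def post_editing_mode(bleu: float) -> str:
    if bleu >= 0.8:
        return "light post-editing"
    if bleu >= 0.5:
        return "full post-editing"
    return "re-train the engine or translate from scratch"

for score in (0.42, 0.63, 0.85):
    print(f"BLEU {score:.2f} -> {post_editing_mode(score)}")
```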

Full post-editing consists of fixing machine-induced meaning (semantic) distortions, making grammatical and syntactic adjustments, checking terminology for untranslated terms that could possibly be new terms, and partially or completely rewriting sentences for target-language fluency. It is reserved for publishing and for providing high-quality input for engine training.

Light post-editing consists of adjusting mechanical errors, mainly capitalization and punctuation, replacing unknown words (possibly misspelled in the source text), removing redundant words or inserting missing ones, and ignoring all stylistic issues. It is generally used for content to be re-used in different contexts, possibly with further adaptation.

Detailed analytics can also offer an estimate of where improvements, edits, additions, replacements, etc. must be made, and this in turn helps in assessing and determining the effort required.

After a comprehensive analysis of automatic evaluation scores has been accomplished, machine translation outputs can then undergo human evaluation.

When it comes to human evaluation, a major issue is sampling. In fact, to be affordable, human evaluation must be done on small portions of the output, which must be homogeneous and consistent with the automatic score.
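
One simple way to obtain such samples, sketched below, is to bucket segments by their sentence-level automatic score and draw a handful from each bucket, so the material sent to evaluators stays small while remaining consistent with the score distribution (the bucket edges and per-bucket sample size are arbitrary illustrations):

```python
# Sketch: draw a small, score-stratified sample of MT output for human review.
# Bucket edges and per-bucket sample size are arbitrary illustrations.
import random

def stratified_sample(segments, per_bucket=20, edges=(0.0, 0.5, 0.8, 1.01)):
    """segments: iterable of (source, mt_output, sentence_level_score) tuples."""
    buckets = [[] for _ in range(len(edges) - 1)]
    for seg in segments:
        score = seg[2]
        for i in range(len(edges) - 1):
            if edges[i] <= score < edges[i + 1]:
                buckets[i].append(seg)
                break
    sample = []
    for bucket in buckets:
        sample.extend(random.sample(bucket, min(per_bucket, len(bucket))))
    return sample
```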

Once consistent samples have been selected, human evaluation could start with fluency, which is affected by grammar, spelling, choice of words, and style. To prevent bias, evaluators must be given a predefined restricted set of criteria to comply with when voting/rating whether samples are fluent or not.

Fluency refers to the target only, without taking the source into account, and its evaluation does not always require evaluators to be bilingual; indeed, it is often better that they are not. However, always consider that monolingual evaluation of the target text generally takes a relatively short time and produces judgments that are generally consistent across different people, but that the more instructions are provided to evaluators, the longer they take to complete their task and the less consistent the results are. The same samples would then be passed to bilingual evaluators for adequacy evaluation.
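
A quick sanity check on that consistency is to have evaluators rate an overlapping set of samples and compute pairwise agreement; below is a minimal sketch using scikit-learn's Cohen's kappa on invented binary fluent/not-fluent ratings.

```python
# Sketch: pairwise inter-rater agreement on binary fluency judgments (1 = fluent).
# Ratings are invented; requires scikit-learn.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

ratings = {
    "evaluator_A": [1, 1, 0, 1, 0, 1, 1, 0],
    "evaluator_B": [1, 1, 0, 0, 0, 1, 1, 0],
    "evaluator_C": [1, 0, 0, 1, 1, 1, 0, 0],
}

for (name_a, a), (name_b, b) in combinations(ratings.items(), 2):
    print(f"{name_a} vs {name_b}: kappa = {cohen_kappa_score(a, b):.2f}")
```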

Adequacy is defined as the amount of source meaning preserved in translation. This necessarily requires a comparative analysis of source and target texts, as adequacy can be affected by completeness, accuracy, and cleanup of training data. Consider using a narrow continuous measurement scale.

A typical pitfall of statistical machine translation is terminology. Human evaluation is useful for detecting terminology issues. However, that could mean that hard work is required to normalize the training data, realign terminology in each segment, and analyze and amend translation tables.

Remember that the number and magnitude of defects (errors) are not the best or the only way to assess quality in a translation service product. Perception can be equally important. When working with MT, in particular, the type and frequency of errors are pivotal, even though not all of these errors can be resolved. Take the Six Sigma model: what could be a reasonably expected level for an MT platform? Now take terminology in SMT and, possibly, in the near future, NMT. Will it be cost-effective to amend a very large training dataset so that the correct term(s) are always used? Implementing and running an MT platform is basically a cost-effectiveness problem. As we know, engines perform differently according to language pair and the amount and quality of training data. This means that a one-size-fits-all approach to TQA is unsuitable, and withholding an engine from production use might be better than insisting on trying to use or improve it, because the PEMT effort could be excessive. I don't think that the existing models and metrics, including DQF, can be universally applied.

However, they could be helpful once automatic scores prove that the engine can perform acceptably. In this case, defining specific categories for the errors that emerge from testing and operating engines, and that could potentially occur repeatedly, is the right path to further engine tuning and development. And this can't be done on the basis of abstract and often abstruse (at least to non-linguists) metrics.

Finally, to get a useful PEMT effort indicator that estimates the work an editor must do to get the content over a predetermined acceptance quality level (AQL), a weighted combination of correlation and dependence, precision and recall, and edit distance scores can be computed. In any case, the definition of AQLs is crucial for the effective implementation of a PEMT effort indicator, together with a full grasp of the analytics, which requires an extensive understanding of the machine translation platform and the training data.
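
A toy version of such a composite indicator might look like the sketch below, where a normalized edit distance and precision/recall-style scores are assumed to be already computed per segment; the weights and the acceptance ceiling are placeholders, not a recommended calibration.

```python
# Toy composite PEMT-effort indicator: a weighted mix of normalized edit distance
# and precision/recall-style scores. Weights and the AQL ceiling are placeholders.
def pemt_effort(edit_distance_norm, precision, recall, weights=(0.5, 0.25, 0.25)):
    """All inputs in [0, 1]; a higher value means more post-editing work expected."""
    w_ed, w_p, w_r = weights
    return w_ed * edit_distance_norm + w_p * (1.0 - precision) + w_r * (1.0 - recall)

AQL_EFFORT_CEILING = 0.35  # placeholder acceptance threshold

segment = {"edit_distance_norm": 0.28, "precision": 0.81, "recall": 0.77}
effort = pemt_effort(**segment)
verdict = "within AQL" if effort <= AQL_EFFORT_CEILING else "above AQL ceiling"
print(f"effort = {effort:.2f} ({verdict})")
```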

Many of these aspects, from a project management perspective, are covered in more detail in the TAUS PE4PM course.

This course also covers another important element of a post-editing project: the editor's involvement and remuneration. Especially in the case of full post-editing, post-editors could be asked to contribute to training an engine, and editors could prove extremely valuable on the path to achieving better performance.

Last but not least, the suitability of source text for machine translation and the tools to use in post-editing can make the difference between success and failure in the implementation of a machine translation initiative.

When a post-editing job comes to an LSP or a translator, nothing can be done at that point about the source text or the initial requirements. Any action that can be taken must be taken upstream, earlier in the process. In this respect, while predictive quality analysis at the translated-file level has already been implemented, although not fully substantiated yet, predictive quality analysis at the source-text level is still to come. It would be of great help to translation buyers in general, who could base their investment on reasonable measures, possibly within a standard business logic, and possibly improve their content for machine translation and translatability in general. NLP research is already evolving to provide feedback on a user's writing, reconstruct story lines, classify content, and assess style.

In terms of activities going on in the post-editing side of the world, adaptive machine translation will be a giant leap forward when every user's edits are made available to an entire community by permanently incorporating each user's evolving datasets into the master translation tables. The system thus improves continuously with ongoing use in a way that other MT systems do not. At the moment, Adaptive MT is restricted to Lilt and SDL (all talk so far) users. This means that it won't be available in the corporate settings where MT is more likely to be implemented, unless SDL software is already in use and/or IP is not an issue. Also, being very clear and objective about the rationale for implementing MT is essential to avoid being misled when interpreting and using analytics. Unfortunately, in most situations, this is not the case. For instance, if my main goal is speed, I should look in the analytics for something different from what I would look for if my goal were cutting translation costs or increasing consistency. Anyway, understanding the analytics is no laughing matter. But that is another kettle of fish.


Luigi Muzii has been in the "translation business" since 1982 and has been a business consultant since 2002, working in the translation and localization industry through his firm. He focuses on helping customers choose and implement best-suited technologies and redesign their business processes for the greatest effectiveness of translation and localization related work.

This link provides access to his other blog posts. 


Friday, October 7, 2016

Real and Honest Quality Evaluation Data on Neural Machine Translation

I just saw a Facebook discussion on the Google NMT announcements that explores some of the human evaluation issues, and I thought I would add one more observation to this charade before highlighting a completely overlooked study that does provide some valuable insight into the possibilities of NMT (which I actually believe are real and substantial), even though it was done in a "small-scale" university setting.

Does anybody else think it is strange that none of the press and journalists gushing about the "indistinguishable from human translation" quality claimed by Google attempted to run even a single Chinese page through the new super-duper GNMT Chinese engine?

Take this post, for example, where the author seems to have swallowed the Google story hook, line, and sinker. There are of course 50 more like it. It took translating just one page for me to realize that we are really knee-deep in bullshit, as I had difficulty getting even a gist understanding of my randomly sampled Chinese web page.

So, is there any honest, unmanipulated data out there on what NMT can do?

I have not seen all the details of the SYSTRAN effort, but based on the sample output that I asked for, and the general restraint (in spite of their clear enthusiasm) they showed during my conversations, I tend to believe that they have made real progress and can offer better technology to their customers. But this study was just pointed out to me, and I took a look even though the research team includes a disclaimer that this is ongoing work and that some results and conclusions might change. I thought it deserved to be more visible, and thus I wrote this.

So here we have a reasonable and believable answer to the question:

Is Neural Machine Translation Ready for Deployment? A Case Study on 30 Translation Directions

This was a study conducted at the University of Edinburgh, done entirely with the UN corpus. They had roughly 10 million or more sentences per language, in common across the six core United Nations languages, which include Chinese. Now, this may sound like a lot of data to some, but the Google and Microsoft scale is probably larger by a factor of 10 or even 20. We should also understand that the computing resources available to this little research team are probably 1% to 5% of what Google can easily get access to. (More GPUs can sometimes mean an extra BLEU point or two.) So here we have a David and Goliath scenario in terms of resources, but interestingly, I think the inferences they draw are very similar to what SYSTRAN and Microsoft have also reported. The team reports:

"With the exception of fr-es and ru-en the neural system is always comparable or better than the phrase-based system. The differences where NMT is worse are very small. Especially in cases where Chinese is one of the languages in a language pair, the improvement of NMT over PB-SMT is dramatic with between 7 and 9 BLEU points....
We also see large improvements for translations out of and into Arabic. It is interesting to observe that improvements are present also in the case of the highest scoring translation directions, en-es and es-en."
The research team also admits that it is not clear what the implications might be for in-domain systems and I look forward to their hopefully less deceptive human evaluation:
"Although NMT systems are known to generalize better than phrase-based systems for out-of- domain data, it was unclear how they perform in a purely in-domain setting which is of interest for any larger organization with significant resources of their own data, such as the UN or other governmental bodies. This work currently lacks human evaluation which we would like to supply in future versions."
The comparative PBSMT vs NMT results are presented graphically below. The blue bars are SMT and the burgundy bars are NMT; I have highlighted the most significant improvements with arrows.

It is also interesting to note that when more processing power is applied to the problem, they do get some small improvement but it is clearly a case of diminishing returns. 

"Training the NMT system for another eight days always improves the performance of the NMT system, but gains are rather small between 0.4 and 0.7 BLEU. We did not observe any improvements beyond 2M iterations. It seems that stopping training after 8-10 days might be a viable heuristic with little loss in terms of BLEU. "
They additionally share some research on the NMT decoding throughput problem and its resolution, which some may find useful. Again, to be clear (for the benefit of Mr. Mike), the scale described here is minuscule compared to the massive resources that Google, Microsoft, and Facebook probably use for deployment. But they show that NMT can be deployed without using GPUs or the Google TPUs, at a fraction of the cost. If this research team sends me a translation of the test Chinese page I used on GNMT, I will share it with you so you can compare it with the GNMT output.

We can all admit that Google is doing this on a bigger scale, but from my vantage point, it seems that they are really not getting much better results. As the University of Edinburgh's Rico Sennrich said in his Slator interview: “Given the massive scale of the models, and the resulting computational cost, it is in fact, surprising that they do not outperform recently published work—unfortunately, they only provide a comparison on an older test set, and against relatively old and weak baselines.” He also adds that the Edinburgh system outperformed the Google system in the WMT16 evaluation (which shows how well NMT systems, and the University of Edinburgh in particular, have been doing in comparative evaluations).

So what does this mean?

NMT is definitely here for real and is likely to continue improving, albeit incrementally. If you are an enterprise concerned about large-scale Chinese, Japanese, Korean, and Arabic translation, you should be looking at NMT technology or talking to MT vendors who have real NMT products. This technology may be especially valuable for those interested in scientific and technical knowledge content, such as patent and scientific paper related information.

Hopefully, future improvement claims will be more carefully measured and honest, so that we don't get translators all up in arms again after they see the actual quality that systems like the "near human quality" GNMT ZH-EN produce. The new NMT systems that are emerging, however, are already a definite improvement for the casual internet user who just wants better quality gisting.

SYSTRAN will shortly start providing examples of "adapted" NMT systems, which are instantly tuned versions of a generic NMT engine. If the promise I saw in my investigation comes anywhere close to current Adaptive MT capabilities, NMT is a real game changer for the professional translation industry as well.

Remember, the real goal is not better NMT systems; rather, it is better quality automated translation that both supports production business translation work and allows users to quickly get an accurate sense of the meaning of foreign language content.

For those who think that this study is not an industrial-strength experiment, you may be interested to know that one of the researchers quietly published this, which shows that their expertise is very much in play at WIPO, even though the training sets were very small. As he says:
"A few days after Google, WIPO (the World Intellectual Property Organization) just deployed its first two in-house neural machine translation (NMT) systems for Chinese-English and Japanese-English. Both systems are already available through the Patentscope Translation Assistant for Patent Texts."
Even at this initial stage, the NMT system BLEU scores show impressive gains, and these scores can only go up from here.

Japanese to English        SMT = 24.41      NMT = 35.99
Chinese to English          SMT = 28.59      NMT = 37.56

This system is live and is something you can try out right now at this link.

The research team whose work triggered this post and essentially wrote it includes Marcin Junczys-Dowmunt, Tomasz Dwojak, and Hieu Hoang.

P.S. I have some wise words coming up on the translation evaluation issue from a guest author next week. I would love to have more translators step forward with their comments on this issue or even volunteer to write a post on human evaluation of MT output.