Tuesday, October 8, 2019

Post-editese is real

Ever since machine translation was introduced into the professional translation industry, there have been questions about what the impact would be on the final delivered translation service product. For much of the history of MT, many translators claimed that while translation production work using a post-edited MT (PEMT) process was faster, the final product was not as good. The research suggests that this has been true from a strictly linguistic perspective, but many of us also know that PEMT worked quite successfully with technical content, especially with terminology and consistency, even in the days of SMT and RBMT.

As NMT systems proliferate, we are at a turning point, and I suspect that many more NMT systems will be seen as providing output that clearly enhances translator productivity, especially systems built by experts. NMT is also quite likely to influence the final output quality, and the difference between post-edited and unaided human translation is likely to become less prominent. This is what is meant by developers who claim to have achieved human parity: if competent human translators cannot tell whether the segments they review came from MT or not, we can make a limited claim of having achieved human parity. This does not mean that this will be true for every new sentence submitted to the system.

We should also understand that MT provides the greatest value in use scenarios where you have large volumes of content (millions rather than thousands of words), short turnaround times, and limited budgets. Increasingly, MT is used in scenarios where little or no post-editing is done, and by many informed estimates, we are already at a run rate of a trillion words a day going through MT engines. While post-editese may be an important consideration in localization use scenarios, this likely accounts for no more than 2% of all MT usage.

Enterprise MT use is rapidly moving into a phase where it is an enterprise-level IT resource. The modern global enterprise needs to enable and allow millions of words to be translated on demand in a secure and private way and needs to be integrated deeply into critical communication, collaboration, and content creation and management software.

The research presented by Antonio Toral below documents the impact of post-editing on the final output across multiple language combinations and MT systems.



==============

This is a summary of the paper “Post-editese: an Exacerbated Translationese” by Antonio Toral, which was presented at MT Summit 2019, where it won the best paper award.


Introduction


Post-editing (PE) is widely used in the translation industry, mainly because it leads to higher productivity than unaided human translation (HT). But, what about the resulting translation? Are PE translations as good as HT? Several research studies have looked at this in the past decade and there seems to be consensus: PE is as good as HT or even better (Koponen, 2016).

Most of these studies measure the quality of translations by counting the number of errors therein. Taking into account that there is more to quality than just the number of mistakes, we ask ourselves the following question instead: are there differences between translations produced with PE and HT? In other words, do the final outputs created via PE and HT have different traits?

Previous studies have unveiled the existence of translationese, i.e. the fact that HTs and original texts exhibit different characteristics. These characteristics can be grouped under the so-called translation universals (Baker, 1993) and fundamental laws of translation (Toury, 2012), namely simplification, normalization, explicitation and interference. Along this line of thinking, we aim to unveil the existence of post-editese (i.e. the fact that PEs and HTs exhibit different characteristics) by comparing PEs and HTs using a set of computational analyses that align with the aforementioned translation universals and laws of translation.

Data

We use three datasets in our experiments: Taraxü (Avramidis et al., 2014), IWSLT (Cettolo et al., 2015; Mauro et al., 2016) and Microsoft “Human Parity” (Hassan et al., 2018). These datasets cover five different translation directions and allow us to assess the effect of machine translation (MT) systems from 2011, 2015-16 and 2018 on the resulting PEs.

Analyses

Lexical Variety

We assess the lexical variety of a translation (HT, PE or MT) by calculating its type-token ratio (TTR), i.e. the number of types (unique words) divided by the number of tokens (total words):

TTR = number of types / number of tokens

In other words, given two equally long translations (in number of words), the one with the bigger vocabulary (higher number of unique words) has a higher TTR and is therefore considered lexically richer, or higher in lexical variety.
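As a concrete illustration, here is a minimal sketch of a type-token ratio computation with naive whitespace tokenization; it is not the paper's implementation, just the idea in code:

```python
# Minimal illustration: type-token ratio as unique words over total words.
def type_token_ratio(text: str) -> float:
    tokens = text.lower().split()   # naive whitespace tokenization
    types = set(tokens)             # unique word forms
    return len(types) / len(tokens) if tokens else 0.0

ht = "the committee approved the proposal after a long and heated debate"
mt = "the committee approved the proposal after the long and the long debate"
print(type_token_ratio(ht), type_token_ratio(mt))  # HT scores higher here
```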

The following figure shows the results for the Microsoft dataset for the Chinese-to-English direction (zh–en); the results for the other datasets follow similar trends and can be found in the paper. HT has the highest lexical variety, followed by PE, while the lowest value is obtained by the MT systems. A possible interpretation is as follows: (i) lexical variety is low in MT because these systems prefer the translation solutions that are frequent in their training data, and (ii) a post-editor will add lexical variety to some degree (the difference in the figure between MT and PE), but because MT primes him/her (Green et al., 2013), the resulting PE translation will not achieve the lexical variety of HT.


Lexical Density

The lexical density of a text indicates its amount of information and is calculated as the number of content words divided by the total number of words:

lexical density = number of content words / total number of words

where content words correspond to adverbs, adjectives, nouns, and verbs. Hence, given two equally long translations, the one with the higher number of content words would be considered to have higher lexical density, in other words, to contain more information.
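A hedged sketch of how such a measure could be computed with spaCy part-of-speech tags follows; the paper may use a different tagger and tag set, and the choice to exclude proper nouns is an assumption here:

```python
# Lexical density = content words / total words, using spaCy coarse POS tags.
import spacy

nlp = spacy.load("en_core_web_sm")
CONTENT_POS = {"ADV", "ADJ", "NOUN", "VERB"}  # content-word tags, per the text above

def lexical_density(text: str) -> float:
    tokens = [t for t in nlp(text) if not t.is_punct and not t.is_space]
    content = [t for t in tokens if t.pos_ in CONTENT_POS]
    return len(content) / len(tokens) if tokens else 0.0
```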

The following figure shows the results for the three translation directions in the Taraxü dataset: English-to-German, German-to-English and Spanish-to-German. The lexical density in HT is higher than in both PE and MT and there is no systematic difference between the latter two.

Length Ratio

Given a source text (ST) and a target text (TT), where TT is a translation of the ST (HT, PE or MT), we compute a measure of how different in length the TT is with respect to the ST, in essence the normalised absolute difference between the two lengths. This means that the bigger the difference in length between the ST and the TT (be it because the TT is shorter or longer than the ST), the higher the length ratio.
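A small hedged sketch of one plausible formulation, the normalised absolute difference in word counts; the paper's exact formula may differ:

```python
# Length-ratio style measure: how far the target length deviates from the
# source length, regardless of whether the target is shorter or longer.
def length_difference(source: str, target: str) -> float:
    src_len = len(source.split())
    tgt_len = len(target.split())
    return abs(src_len - tgt_len) / src_len if src_len else 0.0
```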

The following figure shows the results for the Taraxü dataset. The trend is similar to the one for lexical variety; that is, HT obtains the highest result, MT the lowest, and PE lies somewhere in between. We interpret this as follows: (i) MT results in a translation of similar length to that of the ST due to how the underlying MT technology works, and PE is primed by the MT output, while (ii) a translator working from scratch may translate more freely in terms of length.

Part-of-speech Sequences

Finally, we assess the interference of the source language on a translation (HT, PE or MT) by measuring how close the sequence of part-of-speech tags in the translation is to the typical part-of-speech sequences of the source language and to the typical part-of-speech sequences of the target language. If the sequences of a translation are similar to the typical sequences of the source language, that indicates that there is interference from the source language in the translation.

The following figure shows the results for the IWSLT dataset. The metric used is perplexity difference; the higher it is, the lower the interference (full details on the metric can be found in the paper). Again, we find a similar trend as in some of the previous analyses: HT gets the highest results, MT the lowest, and PE somewhere in between. The interpretation is again similar: MT outputs exhibit a large amount of interference from the source language; a post-editor gets rid of some of that interference, but the resulting translation still has more interference than an unaided translation.
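A hedged sketch of the general idea, using simple smoothed n-gram models over PoS tags with NLTK; the model order and the sign convention are assumptions, not the paper's implementation:

```python
# Train n-gram LMs over source-language and target-language PoS sequences,
# then compare perplexities of a translation's PoS sequence against both.
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import ngrams

ORDER = 3  # PoS trigrams (assumed order)

def train_pos_lm(pos_corpus):
    # pos_corpus: list of sentences, each a list of PoS tags, e.g. ["DET", "NOUN", ...]
    train, vocab = padded_everygram_pipeline(ORDER, pos_corpus)
    lm = Laplace(ORDER)
    lm.fit(train, vocab)
    return lm

def perplexity_difference(src_lm, tgt_lm, translation_pos):
    # Assumed convention: higher values mean the translation's PoS sequence is
    # further from source-language patterns, i.e. shows less interference.
    test = list(ngrams(pad_both_ends(translation_pos, n=ORDER), ORDER))
    return src_lm.perplexity(test) - tgt_lm.perplexity(test)
```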


Findings

The findings from our analyses can be summarised as follows in terms of HT vs PE:
  • PEs have lower lexical variety and lower lexical density than HTs. We link these to the simplification principle of translationese. Thus, these results indicate that post-editese is lexically simpler than translationese.
  • Sentence length in PEs is more similar to the sentence length of the source texts than sentence length in HTs. We link this finding to interference and normalization: (i) PEs have interference from the source text in terms of length, which leads to translations that follow the typical sentence length of the source language; (ii) this results in a target text whose length tends to become normalized.
  • Part-of-speech (PoS) sequences in PEs are more similar to the typical PoS sequences of the source language than PoS sequences in HTs. We link this to the interference principle: the sequences of grammatical units in PEs preserve to some extent the sequences that are typical of the source language.

In terms of the role of MT: we have considered not only HTs and PEs but also the MT outputs from the MT systems that were the starting point for producing the PEs. This was done to corroborate a claim in the literature (Green et al., 2013), namely that in PE the translator is primed by the MT output. We therefore expected to find trends similar to those found in PEs also in the MT outputs, and this was indeed the case in all four analyses. In some experiments, the results of PE were somewhere in between those of HT and MT. Our interpretation is that a post-editor improves the initial MT output, but due to being primed by the MT output, the result cannot attain the level of HT, and the footprint of the MT system remains in the resulting PE.

Discussion

As said in the introduction, we know that PE is faster than HT. The question I wanted to address was then: can PE not only be faster but also be at the level of HT quality-wise? In this study, this is looked at from the point of view of translation universals and the answer is clear: no. However, I'd like to point out three additional elements:
  1. The text types in the 3 datasets that I have used are news and subtitles, both of which are open-domain and could be considered, to a certain extent, "creative". I wonder what happens with technical texts, given their relevance for industry, and I plan to look at that in the future.
  2. As mentioned in the introduction, previous studies have compared HT vs PE in terms of the number of errors in the resulting translation. In all the studies I've encountered, PE is at the level of HT or even better. Thus, for technical texts where terminology and consistency are important, PE is probably better than HT. I thus find the choice between PE and HT to be a trade-off between consistency on the one hand and translation universals (simplification, normalization and interference) on the other.
  3. PE falls behind HT in terms of translation universals because MT falls behind HT in those terms. However, this may not be the case anymore in the future. For example, the paper shows that PE-NMT has less interference than PE-SMT, thanks to the better reordering in the former.




Antonio Toral is an Assistant Professor at the Computational Linguistics group, Center for Language and Cognition, Faculty of Arts, University of Groningen (The Netherlands). His research is in the area of Machine Translation. His main topics include resource acquisition, domain adaptation, diagnostic evaluation and hybrid approaches.


Related Work

Other work has previously looked at HT vs PE beyond the number of errors. The papers most closely related to this one are Bangalore et al. (2015), Carl and Schaeffer (2017), Czulo and Nitzke (2016), Daems et al. (2017) and Farrell (2018).

Bibliography


Avramidis, Eleftherios, Aljoscha Burchardt, Sabine Hunsicker, Maja Popovic, Cindy Tscherwinka, David Vilar, and Hans Uszkoreit. 2014. The taraxü corpus of human-annotated machine translations. In LREC, pages 2679–2682.

Baker, Mona. 1993. Corpus linguistics and translation studies: Implications and applications. Text and technology: In honor of John Sinclair, 233:250.

Bangalore, Srinivas, Bergljot Behrens, Michael Carl, Maheshwar Gankhot, Arndt Heilmann, Jean Nitzke, Moritz Schaeffer, and Annegret Sturm. 2015. The role of syntactic variation in translation and post-editing. Translation Spaces, 4(1):119–144.

Carl, Michael and Moritz Jonas Schaeffer. 2017. Why translation is difficult: A corpus-based study of non-literality in post-editing and from-scratch translation. Hermes, 56:43–57.

Cettolo, Mauro, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, Roldano Cattoni, and Marcello Federico. 2015. The IWSLT 2015 evaluation campaign. In IWSLT 2015, International Workshop on Spoken Language Translation.

Green, Spence, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In CHI 2013, pages 439–448.

Hassan, Hany, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, Will Lewis, Mu Li, Shujie Liu, Tie-Yan Liu, Renqian Luo, Arul Menezes, Tao Qin, Frank Seide, Xu Tan, Fei Tian, Lijun Wu, Shuangzhi Wu, Yingce Xia, Dongdong Zhang, Zhirui Zhang, and Ming Zhou. 2018. Achieving Human Parity on Automatic Chinese to English News Translation. https://arxiv.org/abs/1803.05567.

Koponen, Maarit. 2016. Is machine translation post-editing worth the effort? A survey of research into post-editing and effort. Journal of Specialised Translation, 25(25):131–148.

Cettolo, Mauro, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, Roldano Cattoni, and Marcello Federico. 2016. The IWSLT 2016 evaluation campaign. In International Workshop on Spoken Language Translation.

Toury, Gideon. 2012. Descriptive translation studies and beyond: Revised edition, volume 100. John Benjamins Publishing.

Wednesday, September 25, 2019

In a Funk about BLEU

This is a more fleshed-out version of a blog post by Pete Smith and Henry Anderson of the University of Texas at Arlington already published on SDL.com. They describe initial results from a research project they are conducting on MT system quality measurement and related issues. 

MT quality measurement, like human translation quality measurement, has been a difficult and challenging subject for both the translation industry and for many MT researchers and system developers. The most commonly used metric, BLEU, is now quite widely understood to be of especially limited value with NMT systems.

Most of the other text-matching NLP scoring measures are just as suspect, and practitioners are reluctant to adopt them as they are either difficult to implement, or the interpretation pitfalls and nuances of these other measures are not well understood. They all can generate a numeric score based on various calculations of Precision and Recall that need to be interpreted with great care. Most experts will say that the only reliable measures are those done by competent humans and increasingly best practices suggest that a trust-but-verify approach is better. There are many variations of superficially accurate measures available today, but on closer examination, they are all lacking critical elements to make them entirely reliable and foolproof.

So, as much as BLEU scores suck, we continue to use them since some, or perhaps even many of us understand them. Unfortunately, many still don't have a real clue, especially in the translation industry. 

I wonder sometimes if all this angst about MT quality measurement is much ado about nothing. We do, in fact, need rough indicators of MT quality to make judgments of suitability in business use cases, but taking these scores as final indicators of true quality is problematic. It is likely that the top 5, or even top 10, systems are essentially equivalent in terms of the MT quality impact on the business purpose. The real difference in business impact comes from other drivers: competence, experience, process efficiency and quality of implementation.

I would argue that even for localization use cases, the overall process design and other factors matter more than the MT output quality.

 As we have said before, technology has value when it produces favorable business outcomes, even if these outcomes can be somewhat challenging to measure with a precise and meaningful grade. MT is a technology that is seldom perfect, but even in its imperfection can provide great value to an enterprise with a global presence. MT systems with better BLEU or Lepor scores do not necessarily produce better business outcomes. I would argue that an enterprise could use pretty much any "serious" MT system without any impact on the final business outcome. 

This is most clear with eCommerce and global customer service and support use cases, where the use of MT can very rapidly yield a significant ROI. 

"eBay’s push for machine translation has helped the company increase Latin American exports by nearly 20%, according to researchers from the Massachusetts Institute of Technology, and illustrates the potential for increased commercial activity as translation technologies gain wider adoption in business."
MT deployment use case presentations shared by practitioners who have used MT to translate large volumes of knowledgebase support content show that what matters is whether the content helps customers across the globe get to answers that solve problems faster. Translation quality matters but only if it helps understandability. In the digital world, speed is crucial and often more important.

Some 100,000 buyers exchange a total of 2 billion translated text messages every week on the Alibaba.com global-trade platform. This velocity and volume of communication, enabled by MT, makes new levels of global commerce and trade possible. How many of these messages do you think are perfect translations?

A monolingual Live Support Agent who can service thousands of global customers a week, because he/she can quickly understand the question and send relevant and useful support content back to the customer using MT, is another example. The ability to do this in volume matters more than perfect linguistic quality.

So the selection of the right MT technology or solution will come down to more enterprise-relevant issues like:

  • Data Security & Privacy 
  • Adaptability to enterprise unique terminology and use cases
  • Scalability - from billions of words to thousands per hour 
  • Deployment Flexibility - On-premise, cloud or combinations of both
  • Integration with key IT infrastructure and platforms
  • Availability of expert consulting services for specialization 
  • Vendor focus on SOTA
  • MT system manageability
  • Cost 
  • Vendor reputation, profile and enterprise account management capabilities

Pete Smith will be presenting more details of his research study at SDL Connect next month.


===============


There is little debate: the machine translation research and practitioner communities are in a funk about BLEU. From recent webinars to professional interviews and scholarly publications, BLEU is being called on the carpet for its technical shortcomings in the face of a rapidly-developing field, as well as the lack of insight it provides to different consumers such as purchasers of MT services or systems.

BLEU itself is used widely, especially in the MT research community, as an outcome measure for evaluating MT. Yet even in that setting, there is considerable rethinking and re-evaluation of the metric, and BLEU has been an active topic of critical discussion and research for some years, including the challenges faced by evaluating automated translation across the language typology spectrum and especially in cases of morphologically rich languages. And the issue is not limited, of course, to machine translation—the metric is also a topic in NLP and natural language generation discussions generally.

BLEU’s strengths and shortcomings are well-known. At its core, BLEU is a string-matching algorithm for use in evaluating MT output and is not per se a measure of translation quality. That said, there is no doubt that automated or calculated metrics are of great value, as total global MT output approaches levels of one trillion words per day.

And few would argue that, in producing and evaluating MT or translation in general, context matters. A general-purpose, public-facing MT engine designed for broad coverage among users and use cases is just that—general-purpose, and likely more exposed to perennial source-language difficulties such as domain-specific style and terminology, informal language usage, regional language variation, and other issues.

It is no secret that many MT products are trained (at least initially) on publicly available research data and that there are, overall, real thematic biases in those datasets. News, current events, governmental and parliamentary data sets are available across a wide array of language pairs, as well as smaller amounts of data from domains such as legal, entertainment, and lecture source materials such as TED Talks. Increasingly, datasets are available in the IT and technical domains, but there are few public bilingual datasets available that are suitable for major business applications of MT technology such as e-commerce, communication, and collaboration, or customer service.

Researchers and applied practitioners have all benefited from these publicly-available resources. But the case for clarity is perhaps most evident in the MT practitioner community.

For example, enterprise customers hoping to purchase machine translation services face a dilemma: how might the enterprise evaluate an MT product or service for their particular domain, and with more nuance and depth than simply relying on marketing materials boasting scores or gains in BLEU or LEPOR? How might you evaluate major vendors of MT services specific to your use case and needs?

And as a complicating factor, we know an increasing amount about the “whys” and “hows” of fine-tuning general-purpose engines to better perform in enterprise cases such as e-commerce product listings, technical support knowledgebase content, social media analysis, and user feedback/reviews. In particular, raw “utterances” from customers and customer support personnel in these settings are authentic language, with all of its “messiness.”

The UTA research group has recently been exploring MT engine performance on customer support content, building a specialized test set compiled from source corpora including email and customer communications, communications via social media, and online customer support. In particular, we explored the utilization of automation and standard NLP-style pre-processing to rapidly construct a representative translation test set for the focused use case.

At the start, an initial set of approximately 3 million English sentence strings related to enterprise communication and collaboration was selected. The source corpora represented tasks such as email communication, customer communications, communications via social media, and online customer support.

Candidate sentence strings from these larger corpora were narrowed via a sentence clustering technique: a FastText model was trained on the input documents to capture both the semantic and non-semantic (linguistic) properties of the corpora. To give some sense of the linguistic features considered in string selection, corpus elements were parsed using the spaCy natural language processing library’s largest English model to extract features such as the number of “stop words”; the number of tokens that were punctuation, numbers, e-mail addresses, URLs, alpha-only, and out-of-vocabulary; the number of unique lemmas and orthographic forms; the number of named entities; the number of times each entity type, part-of-speech tag and dependency relation appeared in the text; and the total number of tokens. Dimensionality reduction and clustering were then used to arrive at 1050 English-language strings for the basic bespoke test set.
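As a rough illustration of that kind of pipeline, here is a hedged sketch; the model names, feature set and parameters below are illustrative assumptions, not the UTA group's actual code:

```python
# Embed candidate strings with a FastText model, add a few spaCy-derived
# linguistic features, reduce dimensionality, and cluster to pick one
# representative string per cluster for the test set.
import fasttext
import numpy as np
import spacy
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

nlp = spacy.load("en_core_web_lg")

def features(text, ft_model):
    doc = nlp(text)
    ling = [
        sum(t.is_stop for t in doc),   # stop words
        sum(t.is_punct for t in doc),  # punctuation tokens
        sum(t.like_num for t in doc),  # numeric tokens
        sum(t.is_oov for t in doc),    # out-of-vocabulary tokens
        len(doc.ents),                 # named entities
        len(doc),                      # total tokens
    ]
    return np.concatenate([ft_model.get_sentence_vector(text), ling])

def select_test_set(strings, corpus_path, n_clusters=1050):
    ft = fasttext.train_unsupervised(corpus_path, model="skipgram")
    X = np.vstack([features(s, ft) for s in strings])
    X = PCA(n_components=50).fit_transform(X)
    labels = KMeans(n_clusters=n_clusters).fit_predict(X)
    # keep the first string in each cluster as its representative
    return [strings[np.where(labels == c)[0][0]] for c in range(n_clusters)]
```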

The strings from the constructed set were translated into seven languages (French, German, Hindi, Korean, Portuguese, Russian, Spanish) by professional translators. The English source strings were then submitted as translation prompts in seven language pairs (English-French, English-German, English-Hindi, English-Korean, English-Portuguese, English-Russian, English-Spanish) to four major, publicly available MT engines via API or web interface, with the professional translations serving as references. At both the corpus and the individual string level, BLEU, METEOR, and TER scores were generated for each major engine and language pair (not all of the seven languages were represented in all engine products).
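For illustration, a minimal scoring sketch along these lines, using sacreBLEU for BLEU and TER; this is assumed tooling for the sketch, not necessarily what the study used:

```python
# Corpus-level BLEU and TER for one engine and language pair, given system
# outputs and professional reference translations aligned 1:1.
from sacrebleu.metrics import BLEU, TER

def score_engine(system_outputs, references):
    bleu = BLEU().corpus_score(system_outputs, [references])
    ter = TER().corpus_score(system_outputs, [references])
    return bleu.score, ter.score
```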

Our overall question was: does BLEU (or any of the other automated scores) support, say, the choice of engine A over engine B for enterprise purchase when the use case is centered on customer-facing and customer-generated communications? 

To be sure, the output scores presented a muddled picture. Composite scores of the general-purpose engines clustered within approximately 5-8 BLEU points of each other in most languages. And although we used a domain-specific test set, little in the results would have provided the enterprise-level customer with a clear path forward. As Kirti Vashee has pointed out recently, in responding effectively to the realities of the digital world, “5 BLEU points this way or that is negligible in most high-value business use cases.”

What are some of the challenges of authentic customer language? Two known challenges for MT are the formality or informality of language utterances and emotive content. The double punch of informal and emotion-laden customer utterances poses a particularly challenging case.

As we reviewed in a recent webinar, customer-generated strings in support conversations or online interactions present a translator with a variety of expressions of emotion, tone, humor, sarcasm, all embedded within a more informal and Internet-influenced style of language. Some examples included:

            Support…I f***ing hate you all. [Not redacted in the original.]
            Those "late in the day" deliveries go missing a lot.
            Nope didnt turn up…just as expected…now what dude?
            I feel you man, have a good rest of your day!
            Seriously, this is not OK.
            A bunch of robots who repeat the same thing over & over.
            #howdoyoustayinbusiness

Here one can quickly see how an engine trained primarily on formal, governmental or newspaper sources would be quickly challenged. But in early results, our attempts to unpack how MT may perform on emotive content (i.e., not news, legal, or technical content) have provided little insight to date. Early findings suggest surprisingly little interaction between standard sentiment and emotion ratings run on the individual test set strings (VADER positive, negative, neutral and composite scores, and IBM tone analysis) and variance in downstream BLEU scores.

Interestingly, as an aside in our early work, raw BLEU scores across languages for the entire test set did generally correlate comparatively highly with METEOR scores. Although this correlation is expected, the strength of the relationship was surprising in an NMT context, as high as r=.9 across 1000+ strings in a given language pair. If, as the argument goes, NMT brings strengths in fluency which includes elements METEOR scoring is, by design, more sensitive to (such as synonyms or paraphrasing), one might expect that correlation to be weaker. More broadly, these and other questions around automatic evaluation have a long history of consideration by the MT and WMT communities.
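For illustration, a hedged sketch of such a correlation check, assuming sacreBLEU for sentence-level BLEU and NLTK (3.6.6+, with the WordNet data installed) for METEOR; this is not the study's actual code:

```python
# Pearson's r between per-string BLEU and METEOR scores for one language pair.
from nltk.translate.meteor_score import meteor_score
from sacrebleu.metrics import BLEU
from scipy.stats import pearsonr

def bleu_meteor_correlation(hypotheses, references):
    bleu = BLEU(effective_order=True)  # smoothing suited to sentence-level scoring
    bleu_scores = [bleu.sentence_score(h, [r]).score
                   for h, r in zip(hypotheses, references)]
    # NLTK's METEOR expects pre-tokenized references and hypothesis
    meteor_scores = [meteor_score([r.split()], h.split())
                     for h, r in zip(hypotheses, references)]
    r, p = pearsonr(bleu_scores, meteor_scores)
    return r, p
```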

One clearly emerging practice in the field is to combine an automated metric such as BLEU along with human evaluation on a smaller data set, to confirm and assure that the automated metrics are useful and provide critical insight, especially if the evaluation is used to compare MT systems. Kirti Vashee, Alon Lavie, and Daniel Marcu have all written on this topic recently.

Thus, a more nuanced understanding of BLEU's value may be emerging: automated scores are initially most useful during MT research and system development, where they are by far the most widely cited standard. The recent Machine Translation Summit XVII in Dublin, for example, had almost 500 mentions or references to BLEU in the research proceedings alone.

But the measure may be less accurate or insightful when broadly comparing different MT systems in the practitioner world, and perhaps more insightful again to both researcher and practitioner when paired with human or other ratings. As one early MT researcher has noted, “BLEU is easy to criticize, but hard to get away from!”

Discussions at the recent TAUS Global Content Conference 2019 further developed the ideas of MT engine specialization in the context of the modern enterprise content workflow. Presenters such as SDL and others offered visions of future content development, personalization, and use in a multilingual world. These future workflows may contain hundreds or thousands of specialized, specially-trained and uniquely maintained automated translation engines and other linguistic algorithms, as content is created, managed, evaluated, and disseminated globally.

There is little doubt that the automated evaluation of translation will continue to play a key role in this emerging vision. However, a better understanding of the field’s de facto metrics and the broader MT evaluation process in this context is clearly imperative.

And what of use cases that continue to emerge, such as the possibility of intelligent or MT content in the educational space? The UTA research group is also exploring MT applications specific to education and higher education. For example, millions of users daily make use of learning materials such as MOOCs—educational content that attracts users across borders, languages, and cultures. A significant portion of international learners come to and potentially struggle with English-language content in edX or other MOOC courses—and thousands of MOOC offerings exist in the world’s languages, untranslated for English-speakers. What role might machine translation potentially play in this educational endeavor?




Dr. Pete Smith, Chief Analytics Officer, and Professor
Mr. Henry Anderson, Data Scientist
Localization and Translation Program
Department of Modern Languages and Office of University Analytics
The University of Texas at Arlington


Monday, August 5, 2019

Adapting Neural MT to Support Digital Transformation

We live in an era where digital transformation is increasingly recognized as a primary concern and a key focus of executive management teams in global enterprises. The stakes are high for businesses that fail to embrace change. Since 2000, just over half (52%) of Fortune 500 companies have either gone bankrupt, been acquired, or ceased to exist as a result of digital disruption. It is also estimated that 75% of today’s S&P 500 will be replaced by 2027, according to Innosight Research.
Responding effectively to the realities of the digital world has now become a matter of survival, as well as a means to build long-term competitive advantage.

When we consider what is needed to drive digital transformation in addition to structural integration, we see that large volumes of current, relevant, and accurate content that support the buyer and customer journey are critical to enhancing the digital experience both in B2C and B2B scenarios. 

Large volumes of relevant content are needed to enhance the customer experience in the modern digital era, where customers interact continuously with enterprises in a digital space, on a variety of digital platforms. To be digitally relevant in this environment, enterprises must increasingly be omni-market-focused and have large volumes of relevant content available, on a continuous basis, in every language and every market in which they participate.


This means the modern enterprise must create more content, translate more content and deliver more content on an ongoing basis to remain digitally relevant and visible. Traditional approaches to translating enterprise content simply cannot scale, and a new approach is needed. Addressing these translation challenges without automation is impossible, but what is required is a much more active man-machine collaboration, which we at SDL call machine-first, human-optimized. Thus, the need for a global enterprise to escalate its focus on machine translation (MT) is growing and has become much more urgent.

However, the days of using only generic MT to solve high-volume content translation challenges are over; the enterprise now needs to utilize MT in a much more optimal and agile manner across a range of different use cases to deploy an effective omni-market strategy.

A one-size-fits-all MT strategy will not enable the modern enterprise to deliver the critical content its target global markets need in an effective and optimal way.

Superior MT deployment requires ongoing and continuous adaptation of the core MT technology to varied use cases, subject domain, and customer-relevant content needs. MT deployment also needs to happen with speed and agility to deliver business advantage, and few enterprises can afford the long learning and development timelines required by any do-it-yourself initiative.

The MT Adaptation Challenge

Neural machine translation (NMT) has quickly established itself as the preferred model for most MT use cases today. Most experts now realize that MT performs best in industrial deployment scenarios when it is adapted and customized to the specific subject domain, terminology, and use case requirements. Generic MT is often not enough to meet key business objectives. However, successfully developing adapted NMT models is difficult for the following reasons:
  1. The sheer volume of training data required to build robust systems is typically in the range of hundreds of millions of words, which few enterprises will ever be able to accumulate and maintain. Models built with inadequate foundational data are sure to perform poorly and fail to meet business objectives or provide business value. Many AI initiatives fail or underperform because of data insufficiency.
  2. The available options to train NMT systems are complex, and almost all of them require that any training data used to adapt NMT systems be made available to the development platform, where it may be used to further enhance that platform. This raises serious data security and data privacy issues in this massively digital era, where data related to the most confidential customer interactions and product development initiatives needs to be translated on a daily basis. Customer interactions, sentiment, and service and support data are too valuable to be shared with open-source AI platforms.
  3. The cost of keeping abreast of state-of-the-art NMT technology is also high. For example, a current best-of-breed English-to-German NMT system requires tens of millions of training data segments, hundreds or even thousands of hours of GPU cycles, deep expertise to tune and adjust model parameters, and the know-how to bring it all together. It is estimated that this one system alone costs around $9,000 in training time on public cloud infrastructure and takes 40 days to train! These costs are likely to be higher if the developer does not have real expertise and is learning along the way. They can be reduced substantially by moving to an on-premise training setup and by working with a foundation system that has been set up by experts.
  4. NMT model development requires constant iteration and ongoing experimentation with varying data sets and tuning strategies. There is a certain amount of uncertainty in any model development, and outcomes cannot always be predicted upfront, so repeated and frequent updates should be expected. Computing costs can therefore escalate rapidly when using cloud infrastructure.
Given the buzz around NMT, many naïve practitioners jump into DIY (do-it-yourself) open-source options that are freely available, only to realize months or years later that they have nothing to show for their efforts. 

The many challenges of working with open-source NMT are covered here. While it is possible to succeed with open-source NMT, a sustained and ongoing research/production investment is required with very specialized human resources to have any meaningful chance of success.


The other option that enterprises employ to meet their NMT adaptation needs is to go to dedicated MT specialists and MT vendors, and there are significant costs associated with this approach as well. The ongoing updates and improvements usually come with direct costs associated with each individual effort. These challenges have limited the number of adapted and tuned NMT systems that can be deployed, and have also created resistance to deploying NMT systems more widely as generic system problems are identified.

The most informed practitioners are just beginning to realize that using BLEU scores to select MT systems is usually quite short-sighted. The business impact of 5 BLEU points this way or that is negligible in most high value business use cases, and use case optimization is usually much more beneficial and valuable to the business mission.


As a technology provider focused on enterprise MT needs, SDL already provides adaptation capabilities, which include:
  • Customer created dictionaries for instant self-service customization – suitable for specific terminology enforcement on top of a generic model. 
  • NMT model adaptation as a service, performed by the SDL MT R&D team.
This situation will now change and continue to evolve with the innovative new NMT adaptation solution being introduced by SDL, a hybrid of the MT vendor and DIY approaches that provides the best of both.


 

The Innovative SDL NMT Adaptation Solution

The SDL NMT Trainer solution provides the following:
  • Straightforward and simple NMT model adaptation without requiring users to be data scientists or experts.
  • Foundational data provided in the Adaptable Language Pairs to accelerate the development of robust, deployable systems.
  • On-premise training that completely precludes the possibility of any highly confidential training data that encapsulates customer interactions, information governance, product development and partner and employee communications ever leaving the enterprise.
  • Once created, the encrypted adapted models can be deployed easily on SDL on-premise or cloud infrastructure with no possibility of data leakage.
  • Multiple use cases and optimizations can be developed on a single language pair, and customers can re-train and adjust their models continuously as data becomes available or as new use cases are identified.
  • A pricing model that encourages and supports continuous improvement and experimentation on existing models and allows for many more use cases to be deployed on the same language combination. 
The initial release of the SDL On-Premise Trainer is the foundation of an ever-adapting machine translation solution that will grow in capability and continue to evolve with additional new features.


Research shows that NMT outcomes are highly dependent on the quality of the training data used: the cleaner the data, the better the adaptation will be. After this initial product release, SDL therefore plans to introduce an update later this year that leverages years of experience in translation memory management to include the automated cleaning steps required to make the training data as good as possible for neural MT model training.

The promise of the best AI solutions in the market is to continuously learn and improve with informed and structured human feedback, and the SDL technology is being architected to evolve and improve with this feedback. While generic MT serves the needs of many internet users who need to get a rough gist of foreign language content, the global enterprise needs MT solutions that perform optimally on critical terminology and are sensitive to linguistic requirements within the enterprise’s core subject domain. This solution enables customers to produce high-quality adaptations with minimal effort, in as short a time as possible, and thus make increasing volumes of critical DX content multilingual.

If you'd like to learn more about what's new in SDL's Adaptable Neural Language Pairs, click here.

Monday, June 17, 2019

The Challenge of Open Source MT

This is the raw, first draft, and slightly longer rambling version of a post already published on SDL.COM.

MT is considered one of the most difficult problems in the general AI and machine learning field. In artificial intelligence, the most difficult problems are informally known as AI-complete problems, implying that their difficulty is equivalent to that of solving the central artificial intelligence problem: making computers as intelligent as people. It is no surprise that humankind has been working on this problem for almost 70 years now and is still quite some distance from solving it.

“To translate accurately, a machine must be able to understand the text. It must be able to follow the author's argument, so it must have some ability to reason. It must have extensive world knowledge so that it knows what is being discussed — it must at least be familiar with all the same commonsense facts that the average human translator knows. Some of this knowledge is in the form of facts that can be explicitly represented, but some knowledge is unconscious and closely tied to the human body: for example, the machine may need to understand how an ocean makes one feel to accurately translate a specific metaphor in the text. It must also model the authors' goals, intentions, and emotional states to accurately reproduce them in a new language. In short, the machine is required to have a wide variety of human intellectual skills, including reason, commonsense knowledge and the intuitions that underlie motion and manipulation, perception, and social intelligence. Machine translation, therefore, is believed to be AI-complete.”
 
One of the myths that seems to prevail in the localization world today is that anybody with a hoard of translation memory data can easily develop and stand up an MT system using one of the many open-source toolkits or DIY (do-it-yourself) solutions available. We live in a time when open-source machine learning and AI development platforms are proliferating, so people believe that, given some data and a few computers, a functional and useful MT system can be developed. However, as many who have tried have found out, the reality is much more complicated, and the path to success is long, winding, and sometimes even treacherous. For an organization to successfully develop an open-source machine translation solution to deployable quality, a few critical elements are required:
  1. At least a basic competence with machine learning technology, 
  2. An understanding of the broad range of data needed and used in building and developing an MT system,
  3. An understanding of the proper data preparation and data optimization processes needed to maximize success,
  4. The ability to understand, measure and respond to successful and failure outcomes with model building that are very much part of the development process,
  5. An understanding of the additional support tools and connected data flow infrastructure needed to make MT deployable at enterprise scale.

The very large majority of open source MT efforts fail, in that they do not consistently produce output that is equal to, or better than, any easily accessed public MT solution, or they cannot be deployed in a robust and effective manner. 


This is not to say that it is impossible, but the investments and long-term commitment required for success are often underestimated or simply not properly understood. A case can always be made for private systems that offer greater control and security, even if they are generally less accurate than public MT options. However, in the localization industry, we see that if “free” MT solutions are available that are superior to an LSP-built system, translators will prefer to just use those. We also find that for the few self-developed MT systems that do produce useful output quality, larger systems integration and data integration issues are often an impediment, making them difficult to deploy with enterprise scale and robustness.

Some say that those who ignore the lessons of history are doomed to repeat its errors. Not so long ago, when the Moses SMT toolkits were released, we heard industry leaders claim, “Let a thousand MT systems bloom,” but in retrospect, did more than a handful survive beyond the experimentation phase?


Why is relying on open source difficult for enterprise use?


The state-of-the-art of machine translation and the basic technology is continuously evolving and practitioners need to understand and stay current with the research to have viable systems in deployment. A long, sustained and steady commitment is needed just to stay abreast.

If public MT can easily outperform home-built systems, there is little incentive for employees and partners to use the in-house systems, and we are likely to see rogue behavior where users reject the in-house system, or to see users forced to work with sub-standard systems. This is especially true for MT systems in localization use cases, where the highest output quality is demanded. Producing systems that consistently perform as required needs deep expertise and broad experience. An often overlooked reason for failure is that doing it yourself requires an understanding of, and some basic expertise with, the various elements in and around machine learning technology. Many do-it-yourselfers don’t know how to do any more than load TM into an open-source framework.

While open source does indeed provide access to the same algorithms, much of the real skill in building MT systems lies in the data analysis, data preparation, and data cleansing that ensure the algorithms learn from a sound quality foundation. The most skillful developers also understand the unique requirements of different use cases and may develop additional tools and processes to augment and enhance the MT-related tasks. Oftentimes the heavy lifting for many use cases is done outside and around the neural MT models, in understanding error patterns and developing strategies to resolve them.
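As a small, hedged illustration of that kind of data preparation work, here is a generic bitext hygiene pass; the filters and thresholds are assumptions for the sketch, not SDL's actual pipeline:

```python
# Basic parallel-corpus cleaning before MT training: drop empty segments,
# exact duplicates, overly long segments, and badly length-mismatched pairs.
def clean_bitext(pairs, max_len=200, max_ratio=3.0):
    seen, kept = set(), []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue                          # empty segment
        if (src, tgt) in seen:
            continue                          # exact duplicate
        s_len, t_len = len(src.split()), len(tgt.split())
        if s_len > max_len or t_len > max_len:
            continue                          # overly long segment
        if max(s_len, t_len) / max(1, min(s_len, t_len)) > max_ratio:
            continue                          # likely misaligned pair
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept
```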


Staying abreast is a challenge

Over the last few years, the understanding of what the “best NMT algorithms” are has changed regularly. A machine translation system that is deployed on an enterprise scale requires an “all in” long-term commitment or it will be doomed to be a failed experiment:

  • Building engineering teams that understand what research is most valid and relevant, and then upgrading and refreshing existing systems is a significant, ongoing and long-term investment. 
  • Keeping up with the evolution in the research community requires constant experimentation and testing that most practitioners will find hard to justify. 
  • Practitioners must know why and when to change as the technology evolves or risk being stuck with sub-optimal systems. 
Open-source initiatives that emerge in academic environments, such as Moses, also face challenges. They often stagnate when the key students that were involved in setting up initial toolkits graduate and are hired away. The key research team may also move on to other research that has more academic stature and potential. These shifting priorities can force DIY MT practitioners to switch toolkits at great expense, both in terms of time and redundant resource expenditures.
 
To better understand the issue of a basic open-source MT toolkit in the face of enterprise MT capability requirements, consider why an organization would choose to use an enterprise-grade content management system (CMS) to set up a corporate website instead of a tool like WordPress. While both systems could be useful in helping the organization build and deploy a corporate web presence, enterprise CMS systems are likely to offer specialized capabilities that make them much more suitable for enterprise use.
 
 
Deep expertise with MT is acquired over time by building thousands of systems across varied use cases and language combinations. Do we really believe that a DIY practitioner who builds a few dozen systems will have the same insight and expertise? Expertise and insight are acquired painstakingly over time. It is very easy "to do MT badly" and quite challenging to do it well.



 As the global communication, collaboration and content sharing imperatives demanded by modern digital transformation initiatives become well understood, many enterprises see that MT is now a critical technology building block that enables better DX. However, there are many specialized requirements including data security and confidentiality, adaptation to different business use cases, and the ability to deploy systems in a broad range of enterprise use scenarios. MT is increasingly a mission-critical technology for global business and requires the same care and attention that the selection of enterprise CMS, email, and database systems do. The issue of enterprise optimization is an increasingly critical element in selecting this kind of core technology.


What are the key requirements for enterprise MT?

There is more to successful MT deployment than simply being able to build an NMT model. A key requirement for successful MT development by the enterprise is long-term experience with machine learning research and technology at industrial scale in the enterprise use context.

With MT, actual business use case experience also matters since it is a technology that requires the combination of computational linguistics, data management, human translator interaction, and systems integration into organizational IT infrastructure for robust solutions to be developed. Best practices evolve from extensive and broad experience that typically takes years to acquire, in addition to success with hundreds, if not thousands, of systems.

The SDL MT engineering team has been a pioneer on data-driven MT technology since its inception with Statistical MT in the early 2000s and has been involved with a broad range of enterprise deployments in the public and private sectors. The deep expertise that SDL has built since then encompasses the combined knowledge gained in all of the following areas:

  • Data preparation for training and building MT engines, acquired through the experience of building thousands of engines across many language combinations for various use cases.
  • Deep machine learning techniques to assess and understand the most useful and relevant research in the NLP community for the enterprise context.
  • Development of tools and architectural infrastructure that allows rapid adoption of research breakthroughs, but still maintains existing capabilities in widely deployed systems.
  • Productization of breakthrough research for mission-critical deployability, which is a very different process from typical experimentation.
  • Pre- and post-processing infrastructure, tools and specialized capabilities that add value around core MT algorithms and enable systems to perform optimally in enterprise deployment settings. 
  • Ongoing research to adapt MT research for optimal enterprise use, e.g., using CPUs rather than GPUs to reduce deployment costs, as well as the system cost and footprint. 
  • Long-term efforts on data collection, cleaning, and optimization for rapid integration and testing with new algorithmic ideas that may emerge from the research community.
  • Close collaboration with translators and linguists to identify and solve language-specific issues, which enables unique processes to be developed to solve unique problems around closely-related languages. 
  • Ongoing interaction with translators and informed linguistic feedback on error patterns provide valuable information to drive ongoing improvements in the core technology.
  • Development of unique language combinations with very limited data availability (e.g., ZH to DE) by maximizing the impact of available data. Utilization of zero-shot translation (between language pairs the MT system has never seen) produces very low-quality systems through its very basic interlingua, but can be augmented and improved by intelligent and informed data supplementation strategies.
  • Integration with translation management software and processes to allow richer processing by linguistic support staff.
  • Integration with other content management and communication infrastructure to allow pervasive and secure implementation of MT capabilities in all text-rich software infrastructure and analysis tools.

The bottom line

The evidence suggests that embarking on a self-managed open-source-based MT initiative is for the very few who are ready to make the substantial long-term commitment and investments needed. Successful outcomes require investment in building expertise not only in machine learning but in many other related and connected areas. The same kinds of rules that apply to enterprise decisions on selecting email, content management and database systems should apply here. Properly executed, MT is a critical tool that enhances and expands the digital global footprint of the organization, and it should be treated with the same seriousness dedicated to any major strategic initiative.


Friday, April 26, 2019

Understanding MT Quality - What Really Matters?

This is the second post in our series on machine translation quality. Again, this is a rawer, slightly less polished variant of a version published on the SDL site. The first post focused on BLEU scores, which are often improperly used to make decisions about inferred MT quality, even though BLEU is clearly not the best metric from which to draw that inference.

The reality of many of these comparisons today is that scores based on publicly available (i.e. not blind) news domain tests are being used by many companies and LSPs to select MT systems which translate IT, customer support, pharma, financial services domain related content. Clearly, this can only result in sub-optimal choices.

The use of machine translation (MT) in the translation industry has historically been heavily focused on localization use cases, with the primary intention to improve efficiency, that is, speed up turnaround and reduce unit word cost. Indeed, machine translation post-editing (MTPE) has been instrumental in helping localization workflows achieve higher levels of productivity.




Many users in the localization industry select their MT technology based on two primary criteria:
  1. Lowest cost
  2. “Best quality” assessments based on metrics like BLEU, Lepor or TER, usually done by a third party
The most common way to assess the quality of MT system output is to use a string-matching score like BLEU. As we pointed out previously, equating a string-match score with the potential future translation quality of an MT system in a new domain is unwise and quite likely to produce disappointing results. BLEU and other string-matching scores offer the most value to research teams building and testing MT systems. When we further consider that scores based on old news domain content are being used to select systems for customer support content in IT and software subject domains, the practice seems doubly foolish.
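To make the point concrete, here is a small sketch, with invented example sentences, showing how a string-matching score rewards surface overlap rather than translation quality; sacreBLEU is assumed here purely for illustration:

```python
# Two adequate renderings of the same source get very different BLEU scores
# against a single reference, because BLEU only counts n-gram matches.
from sacrebleu.metrics import BLEU

reference = ["The parliament approved the budget on Tuesday."]
hyp_close = "The parliament approved the budget on Tuesday."
hyp_paraphrase = "Lawmakers passed the spending plan on Tuesday."

bleu = BLEU(effective_order=True)
print(bleu.sentence_score(hyp_close, reference).score)       # near 100
print(bleu.sentence_score(hyp_paraphrase, reference).score)  # much lower
```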

One problem with using news domain content is that it tends to lack tone and emotion. News stories discuss terrorism and new commercial ventures in almost exactly the same tone. As Pete Smith points out in the webinar linked below, in business communication and in customer service and support scenarios, tone really matters. Enterprises that can identify dissatisfied customers and address the issues that cause dissatisfaction are likely to be more successful. CX is all about tone and emotion, in addition to the basic literal translation.

Many users consider only the results of comparative evaluations – often performed by means of questionable protocols and processes, using test data that is invisible or not properly defined – to select which MT systems to adopt. Most frequently, such analyses produce a score table like the one shown below, which might lead users to believe they are using the “best-of-breed” MT solution since they selected the “top” vendor for each language pair (listed first in each row below).

English to French:   Vendor A – 46.5,  Vendor B – 45.2,  Vendor C – 43.5
English to Chinese:  Vendor C – 36.9,  Vendor A – 34.5,  Vendor B – 32.7
English to Dutch:    Vendor B – 39.5,  Vendor C – 37.7,  Vendor A – 35.5

While this approach looks logical at one level, it often introduces errors and undermines efficiency because of inconsistencies in how the different MT systems are tested and administered. Also, the suitability of the MT output for post-editing may be a key requirement for localization use cases, but it may be much less important in other enterprise use cases.




Assessing business value and impact


The first post in this blog series exposes many of the fallacies of automated metrics that use string-matching algorithms (like BLEU and Lepor). These are not reliable quality assessment techniques, as they only reflect the calculated precision and recall characteristics of text matches in a single test set, usually on material that is unrelated to the enterprise domain of interest.

The issues discussed challenge the notion that single-point scores can really tell you enough about long-term MT quality implications. This is especially true as we move away from the localization use case. In the use cases listed later in this post, speed, overall agility and responsiveness, and integration into customer-experience-related data flows matter much more; the translation quality variance measured by BLEU or Lepor may have little to no impact on what really matters.



The enterprise value-equation is much more complex and goes far beyond linguistic quality and Natural Language Processing (NLP) scores. To truly reflect the business value and impact, evaluation of MT technology must factor in non-linguistic attributes including:
  • Adaptability to business use cases
  • Manageability
  • Integration into enterprise infrastructure
  • Deployment flexibility   
To effectively link MT output to business value implications, we need to understand that although linguistic precision is an important factor, it often has a lower priority in high-value business use cases. This view will hopefully take hold as the purpose and use of MT is better understood in the context of a larger business impact scenario, beyond localization.

But what would more dynamic and informed approaches look like? MT evaluation certainly cannot be static, since systems must evolve as requirements change. Ideally, we would have a simple, single measure that tells us everything we need to know about an MT system, but today that is unfortunately not feasible; what is needed instead is a richer evaluation framework.




A more meaningful evaluation framework


While single-point scores do provide a rough and dirty sense of an MT system’s performance, it is more useful to focus testing efforts on specific enterprise use case requirements. This is also true for automated metrics, which means that scores based on news domain tests should be viewed with care since they are not likely to be representative of performance on specialized enterprise content. 

When rating different MT systems, it is essential to score key requirements for enterprise use, including:

  • Adaptability: Range of options and controls available to tune the MT system performance for very specific use cases. For example, optimization techniques applied to eCommerce catalog content should be very different from those applied to technical support chatbot content or multilingual corporate email systems.
  • Data privacy and security: If an MT system will be used to translate confidential emails or business strategy and tactics documents, the privacy and security requirements will differ greatly from those of a system that only handles product documentation. Some systems will harvest data for machine learning purposes, and it is important to understand this upfront.
  • Deployment flexibility: Some MT systems need to be deployed on-premises to meet legal requirements, such as is the case in litigation scenarios or when handling high-security data. 
  • Expert services: Having highly qualified experts to assist in the MT system tuning and customization can be critical for certain customers to develop ideal systems. 
  • IT integration: Increasingly, MT systems are embedded in larger business workflows to enable greater multilingual capabilities, for example, in communication and collaboration software infrastructures like email, chat and CMS systems.
  • Overall flexibility: Together, all these elements provide flexibility to tune the MT technology to specific use cases and develop successful solutions.

Ultimately, the most meaningful measures of MT success are directly linked to business outcomes and use cases. The definition of success varies by the use case, but most often, linguistic accuracy as an expression of translation quality is secondary to other measures of success. 


The integrity of the overall solution likely has much more impact than MT output quality in the traditional sense: not surprisingly, MT output quality could vary by as much as 10-20% on either side of the current BLEU score without affecting the true business outcome. Linguistic quality matters, but it is not the ultimate driver of successful business outcomes. In fact, in one reported eCommerce use case, improving output quality through post-editing actually reduced conversion rates on the post-edited sections, because that content was perceived as advertising-driven and thus less authentic and trustworthy.








True expressions of successful business outcomes for different use cases


Global enterprise communication and collaboration
  • Increased volume in cross-language internal communication and knowledge sharing with safeguarded security and privacy
  • Better monitoring and understanding of global customers 
  • Rapid resolution of global customer problems, measured by volume and degree of engagement
  • More active customer and partner communications and information sharing
Customer service and support
  • Higher volume of successful self-service across the globe
  • Easy and quick access to multilingual support content 
  • Increased customer satisfaction across the globe
  • The ability of monolingual live agents to service global customers regardless of the originating customer’s language 
eCommerce
  • Measurably increased traffic drawn by new language content
  • Successful conversions in all markets
  • Transactions driven by newly translated content
  • The stickiness of new visitors in new language geographies
Social media analysis
  • Ability to identify key brand impressions 
  • Easy identification of key themes and issues
  • A clear understanding of key positive and negative reactions
Localization
  • Faster turnaround for all MT-based projects
  • Lower production cost as a reflection of lower cost per word
  • Better MTPE experience based on post-editor ratings
  • Adaptability and continuous improvement of the MT system

A presentation and webinar that go into much more detail on this subject are available from Brightalk.


In upcoming posts in this series, we will continue to explore the issue of MT quality assessment from a broad enterprise needs perspective. More informed practices will result in better outcomes and significantly improved MT deployments that leverage the core business mission to solve high-volume multilingual challenges more effectively.

Wednesday, April 17, 2019

Understanding MT Quality: BLEU Scores

This is the first in a series of posts discussing various aspects of MT quality from the context of enterprise use and value, where linguistic quality is important, but not the only determinant of suitability in a structured MT technology evaluation process. A cleaner, more polished, and shorter studio version of this post is available here. You can consider this post a first draft, or the live stage performance (stream of consciousness) version.

What is BLEU (Bilingual Evaluation Understudy)?

As the use of enterprise machine translation expands, it becomes increasingly more important for users and practitioners to understand MT quality issues in a relevant, meaningful, and accurate way.
The BLEU score is a string-matching algorithm that provides basic output quality metrics for MT researchers and developers. In this first post, we will take a closer look at the BLEU score, which has probably been the most widely used MT quality assessment metric among MT researchers and developers over the last 15 years. While it is widely understood that the BLEU metric has many flaws, it continues to be a primary metric used to measure MT system output even today, in the heady days of Neural MT.
Firstly, we should understand that a fundamental problem with BLEU is that it DOES NOT EVEN TRY to measure “translation quality”, but rather focuses on STRING SIMILARITY (usually to a single human reference). What has happened over the years is that people have chosen to interpret this as a measure of the overall quality of an MT system. BLEU scores only reflect how a system performs on the specific set of test sentences used in the test. As there can be many correct translations, and most BLEU tests rely on test sets with only one reference translation, perfectly good translations are often scored poorly.
The scores do not reflect the potential performance of the system on other material that differs from the specific test material, and all inferences on what the score means should be made with great care, after taking a close look at the existing set of test sentences. It is very easy to use and interpret BLEU incorrectly and the localization industry abounds with examples of incorrect, erroneous, and even deceptive use.

Very simply stated, BLEU is a “quality metric” score for an MT system that attempts to measure the correspondence between a machine translation output and a human translation, with the understanding that "the closer a machine translation is to a professional human translation, the better it is" – this is the central idea behind BLEU. Scores are calculated for individual MT-translated segments—generally sentences—by comparing them with a set of good quality human reference translations. Most would consider BLEU scores more meaningful at a corpus level than at a sentence level.
BLEU gained popularity because it was one of the first MT quality metrics to report a high correlation with human judgments of quality. That notion has been challenged often, but after 15 years of attempts to displace it from prominence, the allegedly “improved” derivatives (METEOR, LEPOR) have yet to really unseat its dominance. BLEU, together with human assessment, remains the preferred metric of choice today.

A Closer, More Critical Examination of BLEU


BLEU is actually nothing more than a method to measure the similarity between two text strings. To infer that this metric, which has no linguistic consideration or intelligence whatsoever, can predict not only past “translation quality” performance, but also future performance is indeed quite a stretch.
Measuring translation quality is much more difficult because there is no absolute way to measure how “correct” a translation is. MT is a particularly difficult AI challenge because computers prefer binary outcomes, and translation rarely, if ever, has only one single correct outcome. Many “correct” answers are possible, and there can be as many “correct” answers as there are translators. The most common way to measure quality is to compare the output strings of automated translation to a human translation of the same sentence. The fact that one human translator will translate a sentence in a significantly different way than another human translator leads to problems when using these human references to measure “the quality” of an automated translation solution.
The BLEU metric scores a translation on a scale of 0 to 1. It attempts to measure adequacy and fluency in a way similar to how a human would: does the output convey the same meaning as the input sentence, and is the output fluent target language? The closer to 1, the more overlap there is with the human reference translation, and thus the better the system is judged to be. In a nutshell, BLEU measures how many words and word sequences overlap, giving higher weight to longer sequential matches. For example, a string of four words in the translation that matches the human reference translation (in the same order) has a stronger positive impact on the BLEU score than a one- or two-word match. It is very unlikely that you would ever score 1, as that would mean the compared output is exactly identical to the reference. However, it is also quite possible for an accurate translation to receive a low score simply because it uses different words than the reference. If several translators produce different but equally correct translations of the same sentence, and we select only one of them as our reference, all the other correct translations will score lower!
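
To make the n-gram matching behavior concrete, here is a minimal sketch using the open-source sacrebleu library (the sentences are invented for illustration, and the exact numbers will vary with sacrebleu version and settings): a near-verbatim hypothesis scores high, while an equally correct paraphrase scores much lower because it shares few word sequences with the single reference.

```python
# Minimal sketch, assuming sacrebleu is installed (pip install sacrebleu).
# Sentences are invented; scores are on sacrebleu's 0-100 scale.
import sacrebleu

reference = ["The patient should take two tablets daily with food."]

close_match = "The patient should take two tablets each day with food."
paraphrase = "Patients ought to swallow a pair of pills every day at mealtimes."

# High score: long stretches of identical word sequences match the reference.
print(sacrebleu.sentence_bleu(close_match, reference).score)

# Low score: a perfectly acceptable translation that simply uses different words.
print(sacrebleu.sentence_bleu(paraphrase, reference).score)
```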

How does BLEU work?

To conduct a BLEU measurement the following data is necessary:
  1. One or more human reference translations. (This should be data which has NOT been used in building the system (training data) and ideally should be unknown to the MT system developer. It is generally recommended that 1,000 or more sentences be used to get a meaningful measurement.) If you use too small a sample set you can sway the score significantly with just a few sentences that match or do not match well.
  2. Automated translation output of the exact same source data set.
  3. A measurement utility that performs the comparison and score calculation for you.
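
As a rough sketch of how these three ingredients come together, a measurement run with the sacrebleu library might look like the following (the file names are hypothetical; each file holds one sentence per line, aligned by line number, and sacrebleu is assumed to be installed):

```python
# Minimal sketch of a BLEU measurement run; file names are hypothetical.
import sacrebleu

with open("testset.ref.txt", encoding="utf-8") as f:
    references = [line.strip() for line in f]   # human reference translations
with open("testset.mt.txt", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]   # MT output for the same source sentences

assert len(references) == len(hypotheses)       # both files must cover the exact same test set

# corpus_bleu expects a list of reference "streams": one list per reference set.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.1f} on {len(hypotheses)} sentences")
```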

A few additional points to keep in mind about BLEU scores:
  • Studies have shown that there is a reasonably high correlation between BLEU and human judgments of quality when properly used.
  • BLEU scores are often stated on a scale of 0 to 100 to simplify communication, but this should not be confused with a percentage of accuracy.
  • Even two competent human translations of the exact same material may only score in the 0.6 or 0.7 range as they likely use different vocabulary and phrasing.
  • We should be wary of very high BLEU scores (in excess of 0.7) as it is likely we are measuring improperly or overfitting.

A sentence translated by MT may have 75% of its words overlap with one translator’s translation and only 55% with another translator’s translation; even though both human reference translations are technically correct, the one with the 75% overlap will give the automated translation a higher “quality” score. This is somewhat arbitrary. Random string-matching scores should not be equated with overall translation quality. Therefore, although humans are the true test of correctness, single human references do not provide an objective and consistent measurement for any meaningful notion of quality.
As would be expected, using multiple human references will always result in higher scores, as the MT output has more human variations to match against. NIST (the National Institute of Standards and Technology) used BLEU as an approximate indicator of quality in its annual MT competitions with four human reference sets, to ensure that some of the variance in human translation was captured and thus allow more accurate assessments of the MT solutions being evaluated. The NIST evaluations also defined the development, test, and evaluation process much more carefully and competently, so comparing MT systems under their rigor and purview was meaningful. This has not been true for many of the comparisons done since, and many recent comparisons are deeply flawed.
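
A small illustration of the multiple-reference effect (sentences invented, numbers indicative only): scoring the same MT output against two references at once gives it credit for matching either phrasing, so the score is typically higher than against either single reference alone.

```python
# Minimal sketch, assuming sacrebleu is installed; sentences are invented.
import sacrebleu

mt_output = ["He was taken calmly to the plane bound for Miami."]

ref_a = ["He was led calmly to the plane bound for Miami."]
ref_b = ["He seemed calm as he was taken to the aircraft that would fly him to Miami."]

print(sacrebleu.corpus_bleu(mt_output, [ref_a]).score)          # against reference A only
print(sacrebleu.corpus_bleu(mt_output, [ref_b]).score)          # against reference B only
print(sacrebleu.corpus_bleu(mt_output, [ref_a, ref_b]).score)   # against both: typically higher
```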

 

Why are automated MT quality assessment metrics needed?

Automated quality measurement metrics have always been important to the developers and researchers of data-driven MT technology, because of the iterative nature of MT system development and the need for frequent assessments during the development of the system. They provide rapid feedback on the effectiveness of continuously evolving research and development strategies.
Recently, we see that BLEU and some of its close derivatives (METEOR, NIST, LEPOR, and F-Measure) are also often used to compare the quality of different MT systems in enterprise settings. This can be problematic, as a “single point quality score” based on publicly sourced news domain sentences is simply not representative of the dynamically changing, customized, and modified potential of an active and evolving enterprise MT system. Such a score also does not incorporate overall business requirements in an enterprise use scenario, where workflow, integration, and process-related factors may actually be much more important than small differences in scores. Useful MT quality in the enterprise context will vary greatly, depending on the needs of the specific use case.
Most of us would agree that competent human evaluation is the best way to understand the output quality implications of different MT systems. However, human evaluation is generally slower, less objective, and likely to be more expensive and thus not viable in many production use scenarios when many comparisons need to be made on a constant and ongoing basis. Thus, automated metrics like BLEU provide a quick and often dirty quality assessment that can be useful to those who actually understand its basic mechanics. However, they should also understand its basic flaws and limitations and thus avoid coming to over-reaching or erroneous conclusions based on these scores.

There are two very different ways that such scores may be used:

  • R&D Mode: In comparing different versions of an evolving system during the development of the production MT system, and,
  • Buyer Mode: In comparing different MT systems from different vendors and deciding which one is the “best” one.

The MT System Research & Development Need: Data-driven MT systems could probably not be built without using some kind of automated measurement metric to measure ongoing progress. MT system builders are constantly trying new data management techniques, algorithms, and data combinations to improve systems, and thus need quick and frequent feedback on whether a particular strategy is working or not. Some form of standardized, objective, and relatively rapid means of assessing quality is a necessary part of the system development process in this technology. If this evaluation is done properly, the tests can also be useful over a longer period to understand how a system evolves over many years.

The MT Buyer Need: As there are many MT technology options available today, BLEU and its derivatives are sometimes used to select which MT vendor and system to use. The use of BLEU in this context is much more problematic and prone to producing erroneous conclusions, as comparisons are often being made between apples and oranges. The most common error in interpreting BLEU is a lack of awareness that a positive bias exists towards any MT system that has already seen and trained on the test data, or that has been used to develop the test data set.

Problems with BLEU


While BLEU is very useful to those who build and refine MT systems, its value as a way to compare totally different MT systems is much more limited. Such comparisons need to be done very carefully, if done at all, as the metric is easily and often manipulated to create the illusion of superiority.
“CSA Research and leading MT experts have pointed out for over a decade that these metrics [BLEU] are artificial and irrelevant for production environments. One of the biggest reasons is that the scores are relative to particular references. Changes that improve performance against one human translation might degrade it with respect to another… Approaches that emphasize usability and user acceptance take more effort than automatic scores but point the way toward a useful and practical discussion of MT quality. “
There are several criticisms of BLEU that should also be understood if you are to use the metric effectively. BLEU only measures direct word-by-word similarity and looks to match and measure the extent to which word clusters in two sentences or documents are identical. Accurate translations that use different words may score poorly since there is no match in the human reference. 
There is no understanding of paraphrases and synonyms, so scores can be somewhat misleading in terms of overall accuracy. You have to use the exact same words as the human reference translation to get credit, e.g.
"Wander" doesn't get partial credit for "stroll," nor "sofa" for "couch."
Also, nonsensical language that contains the right phrases in the wrong order can score high, e.g.
"Appeared calm when he was taken to the American plane, which will to Miami, Florida" would get the very same score as: "was being led to the calm as he was would take carry him seemed quite when taken".
A more recent criticism identifies the following problems:
  • It is an intrinsically meaningless score
  • It admits too many variations – meaningless and syntactically incorrect variations can score the same as good variations
  • It admits too few variations – it treats synonyms as incorrect
  • More reference translations do not necessarily help
These and other problems are described in this article and this critical academic review. The core problem is that word-counting scores like BLEU and its derivatives - the linchpin of the many machine-translation competitive comparisons - don't even recognize well-formed language, much less real translated meaning. Here is a more recent post that I highly recommend, as it very clearly explains other metrics, and shows why it also still makes sense to use BLEU in spite of its many problems.
For post-editing work assessments, there is a growing preference for Edit Distance scores, which more accurately reflect the effort involved, even though these too are far from perfect.
The problems are further exacerbated with Neural MT technology, which can often generate excellent translations that are quite different from the reference and thus score poorly. Thus, many have found that lower-scoring (by BLEU) NMT systems are clearly preferred over higher-scoring SMT systems when human evaluations are done. There are newer metrics and tools (chrF, SacreBLEU, ROUGE) attempting to improve on or standardize BLEU-style evaluation, but none have gathered significant momentum yet, and the best way to evaluate NMT system output today is still a well-structured human assessment.
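
For completeness, here is a short sketch of how some of these metrics can be computed side by side on the same data, assuming a recent sacrebleu release that exposes corpus_bleu, corpus_chrf, and corpus_ter (the sentences are invented; TER is an edit-distance-style metric, so lower is better, which also ties in with the Edit Distance preference for post-editing assessments mentioned above):

```python
# Minimal sketch; assumes a recent sacrebleu release with BLEU, chrF, and TER support.
import sacrebleu

hypotheses = ["The new firmware fixes the battery drain issue."]
references = [["The new firmware resolves the battery drain problem."]]

print("BLEU:", sacrebleu.corpus_bleu(hypotheses, references).score)   # word n-gram overlap, higher is better
print("chrF:", sacrebleu.corpus_chrf(hypotheses, references).score)   # character n-gram F-score, higher is better
print("TER: ", sacrebleu.corpus_ter(hypotheses, references).score)    # edit-distance based, lower is better
```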

What is BLEU useful for?

Modern MT systems are built by “training” a computer with examples of human translations. As more human translation data is added, systems should generally get better in quality. Often, new data can be added with beneficial results, but sometimes new data can have a negative effect, especially if it is noisy or otherwise “dirty”. Thus, system developers need to be able to measure the quality impact rapidly and frequently during development, to make sure they are actually improving the system.
BLEU gives developers a way “to monitor the effect of daily changes to their systems in order to weed out bad ideas from good ideas.” When used to evaluate the relative merit of different system-building strategies, BLEU can be quite effective, as it provides very quick feedback, which enables MT developers to rapidly refine and improve the translation systems they are building and to keep improving quality on a long-term basis.
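
As a rough sketch of that development loop (file names are hypothetical): keep a fixed, blind test set, score every new candidate system against the same references, and track the change relative to the current baseline before deciding whether to keep a given data or modeling change.

```python
# Minimal sketch of a regression check between two system versions; file names are hypothetical.
import sacrebleu

def corpus_score(hyp_path: str, ref_path: str) -> float:
    """BLEU of one system's output file against a fixed reference file (one sentence per line)."""
    with open(ref_path, encoding="utf-8") as f:
        refs = [line.strip() for line in f]
    with open(hyp_path, encoding="utf-8") as f:
        hyps = [line.strip() for line in f]
    return sacrebleu.corpus_bleu(hyps, [refs]).score

baseline = corpus_score("baseline.mt.txt", "testset.ref.txt")
candidate = corpus_score("candidate.mt.txt", "testset.ref.txt")
print(f"baseline {baseline:.1f} -> candidate {candidate:.1f} (delta {candidate - baseline:+.1f})")
# A consistent gain across several in-domain test sets suggests the change helps;
# a drop is a signal to investigate before promoting the candidate system.
```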

What is BLEU not useful for?


BLEU scores are always very directly related to a specific “test set” and a specific language pair. Thus, BLEU should not be used as an absolute measure of translation quality, because the score can vary even for a single language depending on the test set and the subject domain. In most cases, comparing BLEU scores across different languages is meaningless unless very strict protocols have been followed. Because of this, it is always recommended to use human translators to verify the accuracy of the metrics after systems have been built; most MT industry leaders will vet BLEU readings with human assessments before production use.
In competitive comparisons, it is important to carry out the comparison tests in an unbiased, scientific manner to get a true view of where you stand against competitive alternatives. The “test set” should be unknown (“blind”) to all the systems that are involved in the measurement. This is something that is often violated in many widely used comparisons today. If a system is trained with the sentences in the “test set” it will obviously do well on the test but probably not as well on data that it has not seen before. Many recent comparisons score MT systems on News Domain related test sets that may also be used in training by some MT developers. A good score on news domain may not be especially useful for an enterprise use case that is heavily focused on IT, pharma, travel or any domain other than news.
However, in spite of all the limitations identified above, BLEU continues to be a basic metric used by most, if not all, MT researchers today. Now, though, most expert developers regularly use human evaluation on smaller sets of data to ensure that the BLEU readings they obtain are true and meaningful. The MT community has found that supposedly improved metrics like METEOR and LEPOR have not really gained momentum. BLEU’s flaws and issues are more clearly understood, and its results are therefore more reliable, especially when used together with supporting human assessments. Also, many buyers today realize that MT system performance on their specific subject domains and translatable content for different use cases matters much more than how generic systems might perform on news stories.


In upcoming posts in this series, we will continue to explore the issue of MT quality from a broad enterprise needs perspective. More informed practices will result in better outcomes and significantly improved MT deployments that leverage the core business mission to solve high-volume multilingual challenges more effectively.