Wednesday, September 25, 2019

In a Funk about BLEU

MT quality measurement, like human translation quality measurement, has been a difficult and challenging subject for both the translation industry and many MT researchers and systems developers. The most commonly used metric, BLEU, is now quite widely understood to be of especially limited value with NMT systems.

Most of the other text-matching NLP scoring measures are just as suspect, and practitioners are reluctant to adopt them as they are either difficult to implement, or the interpretation pitfalls and nuances of these other measures are not well understood. They all can generate a numeric score based on various calculations of Precision and Recall that need to be interpreted with great care. Most experts will say that the only reliable measures are those done by competent humans and increasingly best practices suggest that a trust-but-verify approach is better. There are many variations of superficially accurate measures available today, but on closer examination, they are all lacking critical elements to make them entirely reliable and foolproof.

So, as much as BLEU scores suck, we continue to use them since some, or perhaps even many, of us understand them. Unfortunately, many still don't have a real clue, especially in the translation industry.

I wonder sometimes if all this angst about MT quality measurement is much ado about nothing. We do, in fact, need very rough indicators of MT quality to make judgments of suitability in business use cases, but taking these scores as final indicators of true quality is problematic. It is likely that the top 5, or even top 10, systems are essentially equivalent in terms of the MT quality impact on the business purpose. The real difference in business impact comes from other drivers: competence, experience, process efficiency and quality of implementation.

I would argue that even for localization use cases, the overall process design and other factors matter more than the MT output quality.

As we have said before, technology has value when it produces favorable business outcomes, even if these outcomes can be somewhat challenging to measure with a precise and meaningful grade. MT is a technology that is seldom perfect, but even in its imperfection can provide great value to an enterprise with a global presence. MT systems with better BLEU or LEPOR scores do not necessarily produce better business outcomes. I would argue that an enterprise could use pretty much any "serious" MT system without any impact on the final business outcome.

This is most clear with eCommerce and global customer service and support use cases, where the use of MT can very rapidly yield a significant ROI. 

"eBay’s push for machine translation has helped the company increase Latin American exports by nearly 20%, according to researchers from the Massachusetts Institute of Technology, and illustrates the potential for increased commercial activity as translation technologies gain wider adoption in business."

MT deployment use case presentations shared by practitioners who have used MT to translate large volumes of knowledgebase support content show that what matters is whether the content helps customers across the globe get to answers that solve problems faster. Translation quality matters, but only insofar as it helps understandability. In the digital world, speed is crucial and often more important.

Some 100,000 buyers exchange a total of 2 billion translated text messages every week on the global-trade platform. The velocity and volume of communication that MT makes possible enables new levels of global commerce and trade. How many of these messages do you think are perfect translations?

A monolingual Live Support Agent who can service thousands of global customers a week, because he or she can quickly understand the question and send relevant and useful support content back to a customer using MT, is another example. The ability to do this in volume matters more than perfect linguistic quality.

So the selection of the right MT technology or solution will come down to much more enterprise-relevant issues like:

  • Data Security & Privacy 
  • Adaptability to enterprise unique terminology and use cases
  • Scalability - from billions of words to thousands per hour 
  • Deployment Flexibility - On-premise, cloud or combinations of both
  • Integration with key IT infrastructure and platforms
  • Availability of expert consulting services for specialization 
  • Vendor focus on SOTA
  • MT system manageability
  • Cost 
  • Vendor reputation, profile and enterprise account management capabilities

Pete Smith will be presenting more details of his research study at SDL Connect next month.


There is little debate: the machine translation research and practitioner communities are in a funk about BLEU. From recent webinars to professional interviews and scholarly publications, BLEU is being called on the carpet for its technical shortcomings in the face of a rapidly-developing field, as well as the lack of insight it provides to different consumers such as purchasers of MT services or systems.

BLEU itself is used widely, especially in the MT research community, as an outcome measure for evaluating MT. Yet even in that setting, there is considerable rethinking and re-evaluation of the metric, and BLEU has been an active topic of critical discussion and research for some years, including the challenges faced by evaluating automated translation across the language typology spectrum and especially in cases of morphologically rich languages. And the issue is not limited, of course, to machine translation—the metric is also a topic in NLP and natural language generation discussions generally.

BLEU’s strengths and shortcomings are well-known. At its core, BLEU is a string matching algorithm for use in evaluating MT output and is not per se a measure of translation quality. That said, there is no doubt that automated or calculated metrics are of great value, as total global MT output approaches levels of one trillion words per day.
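To make the string-matching point concrete, here is a minimal sketch of a BLEU-style score in Python. It assumes whitespace tokenization, a single reference, and no smoothing; the function names `ngrams` and `sentence_bleu` are illustrative and not taken from any particular toolkit. Production implementations differ in tokenization and smoothing, so numbers from this sketch are not comparable with published scores.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) found in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU on whitespace tokens, no smoothing."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped matches: a candidate n-gram is credited at most as often
        # as it occurs in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if matches == 0 or total == 0:
            # Any zero precision drives the geometric mean to zero.
            return 0.0
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: candidates shorter than the reference are penalized.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)
```

Under these assumptions, a candidate identical to its reference scores 1.0, while a reasonable paraphrase such as "the feline rested on the rug" against the reference "the cat sat on the mat" scores 0.0, because the two share no three-word sequence. The score rewards surface overlap, not meaning.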

And few would dispute that, in producing and evaluating MT or translation in general, context matters. A general-purpose, public-facing MT engine designed for broad coverage among users and use cases is just that: general-purpose, and likely more challenged by perennial source language challenges such as specific domain style/terminology, informal language usage, regional language variations, and other issues.

It is no secret that many MT products are trained (at least initially) on publicly available research data and that there are, overall, real thematic biases in those datasets. News, current events, governmental and parliamentary data sets are available across a wide array of language pairs, as well as smaller amounts of data from domains such as legal, entertainment, and lecture source materials such as TED Talks. Increasingly, datasets are available in the IT and technical domains, but there are few public bilingual datasets available that are suitable for major business applications of MT technology such as e-commerce, communication, and collaboration, or customer service.

Researchers and applied practitioners have all benefited from these publicly-available resources. But the case for clarity is perhaps most evident in the MT practitioner community.

For example, enterprise customers hoping to purchase machine translation services face a dilemma: how might the enterprise evaluate an MT product or service for their particular domain, and with more nuance and depth than simply relying on marketing materials boasting scores or gains in BLEU or LEPOR? How might you evaluate major vendors of MT services specific to your use case and needs?

And as a complicating factor, we know an increasing amount about the “whys” and “hows” of fine-tuning general-purpose engines to better perform in enterprise cases such as e-commerce product listings, technical support knowledgebase content, social media analysis, and user feedback/reviews. In particular, raw “utterances” from customers and customer support personnel in these settings are authentic language, with all of its “messiness.”

The UTA research group has recently been exploring MT engine performance on customer support content, building a specialized test set compiled from source corpora including email and customer communications, communications via social media, and online customer support. In particular, we explored the utilization of automation and standard NLP-style pre-processing to rapidly construct a representative translation test set for the focused use case.

At the start, an initial set of approximately 3 million English sentence strings related to enterprise communication and collaboration were selected. Source corpora represented tasks such as email communication, customer communications, communications via social media, and online customer support.

Candidate sentence strings from these larger corpora were narrowed via a sentence clustering technique, training a FastText model on the input documents to capture both the semantic and non-semantic (linguistic) properties of the corpora. To give some sense of the linguistic features considered in string selection, corpora elements were parsed using the spaCy natural language processing library’s largest English model to consider features in a string such as the number of “stop words”; the number of tokens that were punctuation, numbers, e-mail addresses, URLs, alpha-only, and out-of-vocabulary; the number of unique lemmas and orthographic forms; the number of named entities; the number of times each entity type, part-of-speech tag and dependency relation appeared in the text; and the total number of tokens. Finally, dimensionality reduction and clustering were used to arrive at 1,050 English-language strings for the basic bespoke test set.
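As a rough illustration of the kind of per-string surface counts described above, the following Python sketch uses only regular expressions from the standard library. It is a stand-in, not the study's pipeline: it does not use spaCy or FastText, the stop-word list and the `surface_features` function are hypothetical, and lemmas, named entities, part-of-speech tags, dependency relations, and the later dimensionality reduction and clustering steps are not covered.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not spaCy's list).
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "you", "i", "this", "not"}

EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
URL_RE = re.compile(r"https?://\S+")
PUNCT_RE = re.compile(r"[^\w\s]")
# Try e-mail addresses and URLs first, then words, then single punctuation marks.
TOKEN_RE = re.compile(r"\S+@\S+\.\S+|https?://\S+|\w+|[^\w\s]")


def surface_features(text):
    """Count simple token-level features for one string."""
    tokens = TOKEN_RE.findall(text)
    feats = Counter(n_tokens=len(tokens))
    # Distinct lower-cased token forms, a crude proxy for orthographic variety.
    feats["n_unique_forms"] = len({tok.lower() for tok in tokens})
    for tok in tokens:
        if EMAIL_RE.fullmatch(tok):
            feats["n_email"] += 1
        elif URL_RE.fullmatch(tok):
            feats["n_url"] += 1
        elif tok.isdigit():
            feats["n_number"] += 1
        elif tok.isalpha():
            feats["n_alpha"] += 1
            if tok.lower() in STOP_WORDS:
                feats["n_stop"] += 1
        elif PUNCT_RE.fullmatch(tok):
            feats["n_punct"] += 1
    return feats
```

Feature vectors of this kind, one per string, are what a clustering step would then group, so that a small sample drawn across clusters still reflects the variety in the full corpus.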

The strings from the constructed set were translated into seven languages (French, German, Hindi, Korean, Portuguese, Russian, Spanish) by professional translators. Then the translated sentences from the test set were utilized as translation prompts in seven language pairs (English-French, English-German, English-Hindi, English-Korean, English-Portuguese, English-Russian, English-Spanish) by four major, publicly-available MT engines via API or web interface. At both the corpus as well as the individual string level, BLEU, METEOR, and TER scores were generated for each major engine and language pair (not all of the seven languages were represented in all engine products).
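For readers less familiar with TER, the sketch below shows the core idea in Python: the number of word-level edits needed to turn the MT output into the reference, divided by the reference length. It is deliberately simplified. Full TER also counts block shifts as single edits, which this version omits, so it behaves like a word error rate; the names `word_edit_distance` and `simple_ter` are illustrative, and whitespace tokenization is an assumption.

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance over token lists (insert, delete, substitute)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def simple_ter(hypothesis, reference):
    """Edits per reference word; lower is better. Shifts are NOT modeled."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference must not be empty")
    return word_edit_distance(hyp, ref) / len(ref)
```

Unlike BLEU and METEOR, where higher is better, an edit-rate score of 0.0 means the output matches the reference exactly, and values can exceed 1.0 when the output is much longer than the reference.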

Our overall question was: does BLEU (or any of the other automated scores) support, say, the choice of engine A over engine B for enterprise purchase when the use case is centered on customer-facing and customer-generated communications? 

To be sure, the output scores presented a muddled picture. Composite scores of the general-purpose engines clustered within approximately 5-8 BLEU points of each other in most languages. And although we used a domain-specific test set, little in the results would have provided the enterprise-level customer with a clear path forward. As Kirti Vashee has pointed out recently, in responding effectively to the realities of the digital world, “5 BLEU points this way or that is negligible in most high-value business use cases.”

What are some of the challenges of authentic customer language? Two known challenges to MT include the formality/informality of language utterances and emotive content. The double-punch of informality and emotion-laden customer utterances poses a particularly challenging case.

As we reviewed in a recent webinar, customer-generated strings in support conversations or online interactions present a translator with a variety of expressions of emotion, tone, humor, sarcasm, all embedded within a more informal and Internet-influenced style of language. Some examples included:

            Support…I f***ing hate you all. [Not redacted in the original.]
            Those late in the day deliveries go missing” a lot.
            Nope didnt turn upjust as expectednow what dude?
            I feel you man, have a good rest of your day!
            Seriously, this is not OK.
            A bunch of robots who repeat the same thing over & over.

Here one can quickly see how an engine trained primarily with formal, governmental or newspaper source would be quickly challenged. But in early results, our attempts to unpack the issues of how MT may perform on emotive content (i.e., not news, legal, or technical content) have provided little insight to date. Early findings suggest surprisingly little interaction between standard ratings of sentiment and emotion run on the test set individual strings (VADER positive, negative, neutral, composite and IBM tone analysis) and variance in downstream BLEU scores.

Interestingly, as an aside in our early work, raw BLEU scores across languages for the entire test set did generally correlate comparatively highly with METEOR scores. Although this correlation is expected, the strength of the relationship was surprising in an NMT context, as high as r=.9 across 1000+ strings in a given language pair. If, as the argument goes, NMT brings strengths in fluency which includes elements METEOR scoring is, by design, more sensitive to (such as synonyms or paraphrasing), one might expect that correlation to be weaker. More broadly, these and other questions around automatic evaluation have a long history of consideration by the MT and WMT communities.
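The correlation referred to here is the Pearson coefficient computed over paired per-string scores. A minimal Python sketch follows; the function name `pearson_r` and the sample values in the usage note are hypothetical and are not data from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)
```

For example, `pearson_r(bleu_scores, meteor_scores)` with two made-up lists such as `[0.10, 0.40, 0.35, 0.80]` and `[0.20, 0.50, 0.40, 0.90]` returns a value close to 1, meaning the two metrics rank those strings almost identically. A high r says the metrics agree with each other; it does not by itself say that either agrees with human judgment.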

One clearly emerging practice in the field is to combine an automated metric such as BLEU along with human evaluation on a smaller data set, to confirm and assure that the automated metrics are useful and provide critical insight, especially if the evaluation is used to compare MT systems. Kirti Vashee, Alon Lavie, and Daniel Marcu have all written on this topic recently.

Thus, the developing, more nuanced understanding of the value of BLEU may be that automated scores are most useful initially, during MT research and system development, where they are by far the most widely cited standard. The recent Machine Translation Summit XVII in Dublin, for example, had almost 500 mentions or references to BLEU in the research proceedings alone.

But this measure may be potentially less accurate or insightful when broadly comparing different MT systems within the practitioner world, and perhaps more insightful again to both researcher and practitioner when paired with human or other ratings. As one early MT researcher has noted, “BLEU is easy to criticize, but hard to get away from!”

Discussions at the recent TAUS Global Content Conference 2019 further developed the ideas of MT engine specialization in the context of the modern enterprise content workflow. Presenters such as SDL and others offered future visions of content development, personalization, and use in a multilingual world. These future workflows may contain hundreds or thousands of specialized, specially-trained and uniquely maintained automated translation engines and other linguistic algorithms, as content is created, managed, evaluated, and disseminated globally.

There is little doubt that the automated evaluation of translation will continue to play a key role in this emerging vision. However, a better understanding of the field’s de facto metrics and the broader MT evaluation process in this context is clearly imperative.

And what of use cases that continue to emerge, such as the possibility of intelligent or MT content in the educational space? The UTA research group is also exploring MT applications specific to education and higher education. For example, millions of users daily make use of learning materials such as MOOCs, educational content that attracts users across borders, languages, and cultures. A significant portion of international learners come to, and potentially struggle with, English-language content in edX or other MOOC courses, and thousands of MOOC offerings exist in the world’s languages, untranslated for English speakers. What role might machine translation potentially play in this educational endeavor?

This is a more fleshed-out version of a blog post by Pete Smith and Henry Anderson of the University of Texas at Arlington that has already been published elsewhere. They describe initial results from a research project they are conducting on MT system quality measurement and related issues.

Dr. Pete Smith, Chief Analytics Officer, and Professor
Mr. Henry Anderson, Data Scientist
Localization and Translation Program
Department of Modern Languages and Office of University Analytics
The University of Texas at Arlington