eMpTy Pages: Understanding MT Quality: BLEU Scores

This is the first in a series of posts discussing various aspects of MT quality in the context of enterprise use and value, where linguistic quality is important, but not the only determinant of suitability in a structured MT technology evaluation process.

What is BLEU (Bilingual Evaluation Understudy)?

As the use of enterprise machine translation expands, it becomes increasingly important for users and practitioners to understand MT quality issues in a relevant, meaningful, and accurate way.

The BLEU score is a string-matching algorithm that provides basic output quality metrics for MT researchers and developers. In this first post, we take a closer look at the BLEU score, which is probably the most widely used MT quality assessment metric among MT researchers and developers over the last 15 years. While it is widely understood that the BLEU metric has many flaws, it continues to be a primary metric used to measure MT system output even today, in the heady days of Neural MT.

Firstly, we should understand that a fundamental problem with BLEU is that it DOES NOT EVEN TRY to measure “translation quality”, but rather focuses on STRING SIMILARITY (usually to a single human reference). What has happened over the years is that people have chosen to interpret this as a measure of the overall quality of an MT system. BLEU scores only reflect how a system performs on the specific set of test sentences used in the test. As there can be many correct translations, and most BLEU tests rely on test sets with only one reference translation, perfectly good translations can often score poorly.

The scores do not reflect the potential performance of the system on other material that differs from the specific test material, and all inferences on what the score means should be made with great care, after taking a close look at the existing set of test sentences. It is very easy to use and interpret BLEU incorrectly and the localization industry abounds with examples of incorrect, erroneous, and even deceptive use.

Very simply stated, BLEU is a “quality metric” score for an MT system that is attempting to measure the correspondence between a machine translation output and that of a human with the understanding that "the closer a machine translation is to a professional human translation, the better it is" – this is the central idea behind BLEU. Scores are calculated for individual MT translated segments—generally sentences—by comparing them with a set of good quality human reference translations. Most would consider BLEU scores more accurate at a corpus level rather than at a sentence level.

BLEU gained popularity because it was one of the first MT quality metrics to report a high correlation with human judgments of quality. This notion has often been challenged, but after 15 years of attempts to displace it from prominence, the allegedly “improved” derivatives (METEOR, LEPOR) have yet to really unseat its dominance. BLEU and human assessment remain the metrics of choice today.

A Closer, More Critical Examination of BLEU

BLEU is actually nothing more than a method to measure the similarity between two text strings. To infer that this measurement, which has no linguistic consideration or intelligence whatsoever, can predict not only past “translation quality” performance, but also future performance is indeed quite a stretch.

Measuring translation quality is much more difficult than measuring string similarity because there is no absolute way to measure how “correct” a translation is. MT is a particularly difficult AI challenge because computers prefer binary outcomes, and translation rarely, if ever, has a single correct outcome. Many “correct” answers are possible, and there can be as many “correct” answers as there are translators. The most common way to measure quality is to compare the output strings of automated translation to a human translation text string of the same sentence. The fact that one human translator will translate a sentence in a significantly different way than another human translator leads to problems when using these human references to measure “the quality” of an automated translation solution.

The BLEU measure scores a translation on a scale of 0 to 1. The measurement attempts to capture adequacy and fluency in a way similar to how a human would, e.g. does the output convey the same meaning as the input sentence, and is the output good and fluent target language? The closer to 1, the more overlap there is with a human reference translation and thus the better the system is considered to be. In a nutshell, the BLEU score measures how many words overlap, giving higher scores to sequential words. For example, a string of four words in the translation that matches the human reference translation (in the same order) will have a positive impact on the BLEU score and is weighted more heavily (and scored higher) than a one- or two-word match. It is very unlikely that you would ever score 1, as that would mean that the compared output is exactly the same as the reference output. However, it is also possible that an accurate translation would receive a low score because it uses different words than the reference used. This potential problem arises whenever a sentence has several equally correct translations: if we select one of these translations for our reference set, all the other correct translations will score lower!
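To make the mechanics concrete, here is a minimal Python sketch of a single-reference, sentence-level BLEU-style score. It assumes whitespace tokenization, lowercasing, equal weights for n-grams up to length four, and no smoothing. The function names and example sentences are made up for illustration and are not taken from any particular toolkit, so absolute numbers will not match those of a real measurement utility.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Toy single-reference BLEU on a 0-to-1 scale (no smoothing)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped matches: an n-gram is credited at most as often as it
        # occurs in the reference.
        matches = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if matches == 0 or total == 0:
            return 0.0  # one empty n-gram order zeroes the geometric mean
        log_precision_sum += math.log(matches / total)
    # Brevity penalty: candidates shorter than the reference are scaled down.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_precision_sum / max_n)


reference = "the cat is sitting on the mat in the hall"
for output in (
    "the cat is sitting on the mat in the hall",  # identical to the reference
    "the cat is sitting on the rug in the hall",  # one word differs
    "a cat sits on the mat in the hallway",       # looser rewording
):
    print(f"{sentence_bleu(output, reference):.3f}  {output}")
```

Only the output that is identical to the reference reaches 1.0. The other two renderings score lower, and the looser rewording scores lowest, even though a human reader might well accept all three.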

How does BLEU work?

To conduct a BLEU measurement the following data is necessary:

One or more human reference translations. (This should be data which has NOT been used in building the system (training data) and ideally should be unknown to the MT system developer. It is generally recommended that 1,000 or more sentences be used to get a meaningful measurement.) If you use too small a sample set you can sway the score significantly with just a few sentences that match or do not match well.
Automated translation output of the exact same source data set.
A measurement utility that performs the comparison and score calculation for you.
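The three ingredients above map directly onto a small scoring routine. What follows is a rough sketch rather than a replacement for an established tool: it assumes one reference per segment, whitespace tokenization, and equal weights for n-grams up to length four, and `corpus_bleu`, `ngram_counts`, and the sample sentences are hypothetical names and data used only for illustration.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0 to 1): n-gram statistics are pooled over all segments."""
    if len(hypotheses) != len(references):
        raise ValueError("each MT output needs one reference for the same source segment")
    matches, totals = [0] * max_n, [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngram_counts(hyp_tokens, n), ngram_counts(ref_tokens, n)
            # Clipped matches, accumulated across the whole test set.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_mean = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_mean)


references = [
    "the committee approved the new budget on friday",
    "prices rose sharply in the second quarter",
]
mt_outputs = [
    "the committee approved the new budget friday",
    "prices increased sharply in the second quarter",
]
score = corpus_bleu(mt_outputs, references)
print(f"BLEU: {score:.3f} (or {100 * score:.1f} on the 0-100 scale)")
```

Note that the counts are pooled across the whole test set before the score is computed, instead of averaging per-sentence scores. With only two segments, as here, a single changed word moves the result noticeably, which is why a much larger test set is recommended.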

Studies have shown that there is a reasonably high correlation between BLEU and human judgments of quality when properly used.
BLEU scores are often stated on a scale of 0 to 100 to simplify communication but should not be confused with a percentage of accuracy.
Even two competent human translations of the exact same material may only score in the 0.6 or 0.7 range as they likely use different vocabulary and phrasing.
We should be wary of very high BLEU scores (in excess of 0.7) as it is likely we are measuring improperly or overfitting.

A sentence translated by MT may have 75% of the words overlap with one translator’s translation, and only 55% with another translator’s translation; even though both human reference translations are technically correct, the one with the 75% overlap with machine translation will provide a higher “quality” score for the automated translation. This is somewhat arbitrary. Random string matching scores should not be equated to overall translation quality. Therefore, although humans are the true test of correctness, they do not provide an objective and consistent measurement for any meaningful notion of quality.

As would be expected, using multiple human reference translations will generally result in higher scores, as the MT output has more human variations to match against. The NIST (National Institute of Standards and Technology) used BLEU as an approximate indicator of quality in its annual MT competitions with four human reference sets to ensure that some variance in human translation is captured, and thus allow more accurate assessments of the MT solutions being evaluated. The NIST evaluation also defined the development, test, and evaluation process much more carefully and competently, and thus comparing MT systems under their rigor and purview was meaningful. This has not been true for many of the comparisons done since, and many recent comparisons are deeply flawed.
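The effect of the reference choice, and of adding more references, can be seen in a small sketch of the n-gram matching step. This is an illustration under simple assumptions (whitespace tokenization, made-up sentences, hypothetical function names), showing only the clipped n-gram precision and not a full BLEU score.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Share of candidate n-grams that can be matched in the references (clipped)."""
    cand_counts = ngram_counts(candidate.split(), n)
    # For each n-gram, the credit ceiling is its highest count in any single reference.
    ceiling = Counter()
    for ref in references:
        for gram, count in ngram_counts(ref.split(), n).items():
            ceiling[gram] = max(ceiling[gram], count)
    matched = sum(min(count, ceiling[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return matched / total if total else 0.0


mt = "the meeting was postponed until next week"
ref_a = "the meeting was delayed until next week"
ref_b = "the session was postponed to the following week"

print(modified_precision(mt, [ref_a], 1))         # 6 of 7 words matched
print(modified_precision(mt, [ref_b], 1))         # 4 of 7 words matched
print(modified_precision(mt, [ref_a, ref_b], 1))  # all 7 words matched
```

The same MT output looks strong against one translator, mediocre against the other, and perfect at the word level once both references are available. Nothing about the MT output changed between the three measurements.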

Why are automated MT quality assessment metrics needed?

Automated quality measurement metrics have always been important to the developers and researchers of data-driven based MT technology, because of the iterative nature of MT system development, and the need for frequent assessments during the development of the system. They can provide rapid feedback on the effectiveness of continuously evolving research and development strategies.

Recently, we see that BLEU and some of its close derivatives (METEOR, NIST, LEPOR, and F-Measure) are also often used to compare the quality of different MT systems in enterprise use settings. This can be problematic as a “single point quality score” based on publicly sourced news domain sentences is simply not representative of the dynamically changing, customized, and modified potential of an active and evolving enterprise MT system. Also, such a score does not incorporate the importance of overall business requirements in an enterprise use scenario where other workflow, integration, and process-related factors may actually be much more important than small differences in scores. Useful MT quality in the enterprise use context will vary greatly, depending on the needs of the specific use-case.

Most of us would agree that competent human evaluation is the best way to understand the output quality implications of different MT systems. However, human evaluation is generally slower, less objective, and likely to be more expensive and thus not viable in many production use scenarios when many comparisons need to be made on a constant and ongoing basis. Thus, automated metrics like BLEU provide a quick and often dirty quality assessment that can be useful to those who actually understand its basic mechanics. However, they should also understand its basic flaws and limitations and thus avoid coming to over-reaching or erroneous conclusions based on these scores.

There are two very different ways that such scores may be used:

R&D Mode: In comparing different versions of an evolving system during the development of the production MT system, and,

Buyer Mode: In comparing different MT systems from different vendors and deciding which one is the “best” one.

The MT System Research & Development Need: Data-driven MT systems could probably not be built without using some kind of automated measurement metric to measure ongoing progress. MT system builders are constantly trying new data management techniques, algorithms, and data combinations to improve systems, and thus need quick and frequent feedback on whether a particular strategy is working or not. It is necessary to use some form of standardized, objective and relatively rapid means of assessing quality as part of the system development process in this technology. If this evaluation is done properly, the tests can also be useful over a longer period to understand how a system evolves over many years.

The MT Buyer Need: As there are many MT technology options available today, BLEU and its derivatives are sometimes used to select which MT vendor and system to use. The use of BLEU in this context is much more problematic and prone to erroneous conclusions, as comparisons are often being made between apples and oranges. The most common error in interpreting BLEU is a lack of awareness that the results may be positively biased towards one MT system because it has already seen and trained on the test data, or because it was used to develop the test data set.

Problems with BLEU

While BLEU is very useful to those who build and refine MT systems, its value as an effective way to compare totally different MT systems is much more limited. Such comparisons need to be done very carefully, if done at all, as the metric is easily and often manipulated to create the illusion of superiority.

“CSA Research and leading MT experts have pointed out for over a decade that these metrics [BLEU] are artificial and irrelevant for production environments. One of the biggest reasons is that the scores are relative to particular references. Changes that improve performance against one human translation might degrade it with respect to another… Approaches that emphasize usability and user acceptance take more effort than automatic scores but point the way toward a useful and practical discussion of MT quality.”

Excerpt from a CSA Blog on BLEU Misuse, April 2017

There are several criticisms of BLEU that should also be understood if you are to use the metric effectively. BLEU only measures direct word-by-word similarity and looks to match and measure the extent to which word clusters in two sentences or documents are identical. Accurate translations that use different words may score poorly since there is no match in the human reference.

There is no understanding of paraphrases and synonyms, so scores can be somewhat misleading in terms of overall accuracy. You have to use the exact same words as the human reference translation to get credit, e.g.

"Wander" doesn't get partial credit for "stroll," nor "sofa" for "couch."

Also, nonsensical language that contains the right phrases in the wrong order can score high, e.g.

"Appeared calm when he was taken to the American plane, which will to Miami, Florida" would get the very same score as: "was being led to the calm as he was would take carry him seemed quite when taken".

A more recent criticism identifies the following problems:

It is an intrinsically meaningless score
It admits too many variations – meaningless and syntactically incorrect variations can score the same as good variations
It admits too few variations – it treats synonyms as incorrect
More reference translations do not necessarily help
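Both the “too many variations” and the “too few variations” criticisms can be reproduced with a few lines of Python. The sketch below assumes whitespace tokenization and uses invented sentences and a hypothetical helper, `match_profile`, that lists the clipped n-gram matches on which a BLEU-style score is built.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def match_profile(candidate, reference, max_n=4):
    """Clipped n-gram matches for n = 1..max_n, as (matched, total) pairs."""
    cand, ref = candidate.split(), reference.split()
    profile = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngram_counts(cand, n), ngram_counts(ref, n)
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        profile.append((matched, sum(cand_counts.values())))
    return profile


# Too many variations: shuffling matched chunks leaves every count unchanged.
reference = "the cat sat on the mat while the dog slept"
readable = "the cat sat on a rug while the dog slept"
scrambled = "while the dog slept a the cat sat on rug"
print(match_profile(readable, reference))
print(match_profile(scrambled, reference))

# Too few variations: synonyms earn no credit at all.
print(match_profile("they wander to the sofa", "they stroll to the couch", max_n=2))
```

The readable and the scrambled outputs have identical match counts and the same length, so any score computed from those counts cannot tell them apart. In the second case, "wander" and "sofa" are simply counted as misses against "stroll" and "couch".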

These and other problems are described in this article and this critical academic review. The core problem is that word-counting scores like BLEU and its derivatives - the linchpin of the many machine-translation competitive comparisons - don't even recognize well-formed language, much less real translated meaning. Here is a more recent post that I highly recommend, as it very clearly explains other metrics, and shows why it also still makes sense to use BLEU in spite of its many problems.

For post-editing work assessments, there is a growing preference for Edit Distance scores, which more accurately reflect the effort involved, even though they too are far from perfect.
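As a rough illustration of the idea, the sketch below computes a word-level edit distance between raw MT output and its post-edited version. It is a plain Levenshtein distance over whitespace-separated words, which is an assumption made here for simplicity. Production tools may work on characters, allow block shifts, or normalize differently, and the function names and sentences are invented for the example.

```python
def word_edit_distance(mt_output, post_edited):
    """Minimum number of word insertions, deletions, and substitutions."""
    source, target = mt_output.split(), post_edited.split()
    # previous[j] holds the distance between the words processed so far
    # in source and the first j words of target.
    previous = list(range(len(target) + 1))
    for i, src_word in enumerate(source, start=1):
        current = [i]
        for j, tgt_word in enumerate(target, start=1):
            substitution = previous[j - 1] + (src_word != tgt_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def edit_rate(mt_output, post_edited):
    """Edits per word of the post-edited text (0.0 means nothing was changed)."""
    return word_edit_distance(mt_output, post_edited) / max(len(post_edited.split()), 1)


raw = "the contract must signed by both party"
edited = "the contract must be signed by both parties"
print(word_edit_distance(raw, edited), edit_rate(raw, edited))
```

Here the post-editor inserted one word and replaced another, so two edits were needed for an eight-word sentence. Unlike BLEU, the comparison is against what the editor actually produced from the MT output, not against an independent reference translation.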

The problems are further exacerbated with Neural MT technology, which can often generate excellent translations that are quite different from the reference and thus score poorly. Thus, many have found that lower (BLEU) scoring NMT systems are clearly preferred over higher scoring SMT systems when human evaluations are done. There are some new metrics (ChrF, SacreBLEU, Rouge) attempting to replace BLEU, but none have gathered any significant momentum yet, and the best way to evaluate NMT system output today is still well-structured human assessments.

What is BLEU useful for?

Modern MT systems are built by “training” a computer with examples of human translations. As more human translation data is added, systems should generally get better in quality. Often, new data can be added with beneficial results, but sometimes new data can cause a negative effect especially if it is noisy or otherwise “dirty”. Thus, to measure if progress is being made in the development process, the system developers need to be able to measure the quality impact rapidly and frequently to make sure they are improving the system and are in fact making progress.

BLEU gives developers a way “to monitor the effect of daily changes to their systems in order to weed out bad ideas from good ideas.” When used to evaluate the relative merit of different system-building strategies, BLEU can be quite effective, as it provides very quick feedback. This enables MT developers to quickly refine the translation systems they are building and to continue improving quality on a long-term basis.

What is BLEU not useful for?

BLEU scores are always very directly related to a specific “test set” and a specific language pair. Thus, BLEU should not be used as an absolute measure of translation quality because the BLEU score can vary even for one language depending on the test and subject domain. In most cases comparing BLEU scores across different languages is meaningless unless very strict protocols have been followed. Because of this, it is always recommended to use human translators to verify the accuracy of the metrics after systems have been built. Also, most MT industry leaders will always vet the BLEU score readings with human assessments before production use.

In competitive comparisons, it is important to carry out the comparison tests in an unbiased, scientific manner to get a true view of where you stand against competitive alternatives. The “test set” should be unknown (“blind”) to all the systems that are involved in the measurement. This is something that is often violated in many widely used comparisons today. If a system is trained with the sentences in the “test set” it will obviously do well on the test but probably not as well on data that it has not seen before. Many recent comparisons score MT systems on News Domain related test sets that may also be used in training by some MT developers. A good score on news domain may not be especially useful for an enterprise use case that is heavily focused on IT, pharma, travel or any domain other than news.
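A basic sanity check for a “blind” test set is to look for test sentences that also occur in the training data. The sketch below is a minimal version of such a check, with hypothetical function names and sample data. It only catches exact matches after lowercasing and whitespace cleanup, so near-duplicates would still slip through.

```python
def normalize(sentence):
    """Lowercase and collapse whitespace so trivial formatting differences still match."""
    return " ".join(sentence.lower().split())


def find_leaked_segments(test_sentences, training_sentences):
    """Return test sentences that also appear (after normalization) in the training data."""
    seen_in_training = {normalize(s) for s in training_sentences}
    return [s for s in test_sentences if normalize(s) in seen_in_training]


training = [
    "The patient should take one tablet daily.",
    "Click Save to apply the changes.",
]
test = [
    "click  save to apply the changes.",
    "Restart the router before updating the firmware.",
]
leaked = find_leaked_segments(test, training)
print(f"{len(leaked)} of {len(test)} test segments were seen in training: {leaked}")
```

Any non-trivial overlap is a sign that the resulting score will flatter the system that trained on that data. A buyer usually cannot run this check on a vendor's training data, which is one more reason to prefer a test set drawn from their own unpublished content.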

However, in spite of all the limitations identified above, BLEU continues to be a basic metric used by most, if not all, MT researchers today. Most expert developers now also regularly use human evaluation on smaller sets of data to ensure that their BLEU scores are indeed true and meaningful. The MT community has found that supposedly improved metrics like METEOR and LEPOR have not really gained any momentum. BLEU's flaws and issues are more clearly understood, which makes it more reliable in practice, especially if used together with supporting human assessments. Also, many buyers today realize that MT system performance on their specific subject domains and translatable content for different use cases matters much more than how generic systems might perform on news stories.

In upcoming posts in this series, we will continue to explore the issue of MT quality from a broad enterprise needs perspective. More informed practices will result in better outcomes and significantly improved MT deployments that leverage the core business mission to solve high-volume multilingual challenges more effectively.

A cleaner, possibly more polished, and shorter studio version of this post is available here.

6 comments:

Konstantin Savenkov, April 18, 2019 at 7:29 AM
BLEU is a corpus level metric, which is also very important. It lacks detail and does not provide insights. Sentence-level metrics are super-useful for buyer evaluations: when you see how sentence-level scores are distributed across your corpus, you may use it as a magnifying glass for human linguistic / subject matter evaluation. And in a matter of hours (not days or weeks) it's clear what should be attributed to alternative translations, what is due to poor reference data and what is MT fault.
Tom Knorr, April 18, 2019 at 7:34 AM
The problem with machine translation quality is human translation quality. In most cases the translation is an interpretation, because very few human translators have the opportunity to semantically research what was actually meant by the source. This is assuming that the writer of the source text actually formulated what's on his or her brain adequately enough to correctly represent what he/she was thinking.
Of course in human communication, even across language barriers, there is most likely the possibility to ask questions and refine the understanding. In written text, this is a bit more tricky. I had recently an argument with a scholar on why Wittgenstein's 'Wortspiel' is translated into 'word game' when reading the German text clearly fires up the neurons for a 'wordplay' in the sense of a theater play, as in an act. Games are a subcategory of plays but not the precise translation when it comes to describing the interaction and role-playing of words. Too bad we cannot ask Wittgenstein anymore about what he thinks about this. Somewhere this got translated as 'games' and is an active, established sense in translated philosophy.
I run into translations frequently, being in the business of developing quality machine translation. I am also a native level speaker of 2 languages having lived and studied in two cultures for equal time.
Very often a translation seems to be good enough to get the point across -- to a human. When looking closer, e.g. having a CAT tool available that suggests possible senses or knowing both cultures very well we see that details that are present in the typical context of one language are not carried over to the other language, typically because a more general term was used in the translation. Examples can be found in Europarl translations of EU parliament transcripts, as well as in many parallel corpora. These are the sources of training data for machine translation engines. Even the 'standard' for most machine learning work 'Wordnet' is horribly imprecise and of course, does not provide the number of translations required to train a neural engine.
Allan Hall, April 18, 2019 at 8:08 AM
Any automated scoring system is welcome in the quest to optimise MT output, however this needs to be added instantaneously to improve the readability of the translation. The key is to use the output as part of a TMS ecosystem and route the segment to either the final output or to a post edit resource. Not in hours or in minutes but milliseconds. This is the real potential!
Note I do not use the word quality in terms of measuring MT.
Tom Knorr, April 18, 2019 at 8:13 AM
So, what does a quality measurement of a machine translation actually measure? How well a machine can mimic questionable translations? Developing a good machine translation engine does require more than counting word frequencies and predicting the likelihood of matching words. Don't get me wrong the engines work and they will eventually outdo any human translator -- if they had quality material to learn from. But here is the crux of the problem, in human-to-human communication we don't actually need high quality and precision, we can always ask if we don't understand what was said. Introducing a machine in the communication makes a 2-way communication a 4-way communication source-> machine-> target (-> machine -> source: if questions are asked). That decreases the confidence of correctness of the communication from 0.5 to 0.33. In order to avoid that it is imperative that the machine translation engine translates very precisely what the source means to say, almost requiring negotiation between the machine and the source. This cannot be done with the current garbage-in garbage-out approach.

Don't get me wrong here, the current translation engines are good enough to order a couple of beers in a foreign language, and if you translated your most successful pickup line at the bar and the girl slaps you in the face, you can always blame it on the MT-engine. My son recently encouraged me to finish my machine translation tool. He concluded that Google translate sucks because he got an F in Spanish class. Put that in the context of coming from a generation that DEPENDS on computers to live.
Lucia Guerroro Romeo, April 19, 2019 at 10:58 AM

Thanks for this insightful post, Kirti Vashee; MT evaluation is a challenging task indeed. A combination of automated metrics and human evaluation seems to be the wisest (but not the quickest or cheapest) option. As to BLEU, I think the following sentence is key: 'There can be as many correct translations as there are translators and, therefore, using one human reference to measure the quality of an automated translation solution presents issue'.
Tom Hoar, April 19, 2019 at 10:59 AM
It is incorrect and misleading to say that “BLEU is a quality metric score.” It is not. Full stop. BLEU is a likeness score between a desired reference and a test.
