Tuesday, September 6, 2016

Human Evaluation of Machine Translation Output

This is another post from Juan Rowda, this time co-written with his colleague Olga Pospelova, that provides useful information on this subject. Human evaluation of MT output is key to understanding several fundamental questions about MT technology including:
  1. Is the MT engine improving?
  2. Do the automated metrics (which MT systems developers use to guide development strategy) correlate with independent human judgments of the same MT output?
  3. How difficult or easy is the PEMT task?
  4. What is a fair and reasonable PEMT rate?
The difficulty with human evaluation of MT output is closely related to the difficulty of any discussion of translation quality in the industry at large. It is challenging to come up with an approach that is consistent over time and across different people, and being scientifically objective is a particular challenge. So, like much else in MT, estimates have to be made, and some rigor needs to be applied to ensure consistency over time and across evaluators. I will add some additional thoughts and references after Juan's post below, to provide a view of some new ways being explored to help humans evaluate MT output more objectively. 
Machine translation (MT) output evaluation is essential in machine translation development. This is key to determining the effectiveness of the existing MT system, estimating the level of required postediting, negotiating the price, and setting reasonable expectations. As we discussed in our article on quality estimation, machine translation output can be evaluated automatically, using methods like BLEU and NIST, or by human judges. The automatic metrics use one or more human reference translations, which are considered the gold standard of translation quality. The difficulty lies in the fact that there may be many alternative correct translations for a single source segment.
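To make the idea behind automatic metrics concrete, here is a minimal Python sketch of the clipped n-gram precision at the heart of BLEU. This is illustrative only; real implementations such as sacreBLEU aggregate statistics over a whole corpus and apply smoothing, and the function names here are my own.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=4):
    """Toy sentence-level BLEU: geometric mean of clipped n-gram
    precisions, times a brevity penalty."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        if not cand_counts:
            return 0.0  # candidate too short to have any n-grams
        # Clip each candidate n-gram count by its max count in any reference
        max_ref = Counter()
        for ref in refs:
            for ng, c in Counter(ngrams(ref, n)).items():
                max_ref[ng] = max(max_ref[ng], c)
        clipped = sum(min(c, max_ref[ng]) for ng, c in cand_counts.items())
        if clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / sum(cand_counts.values())))
    # Brevity penalty against the closest reference length
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) >= ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)
```

A candidate identical to a reference scores 1.0, and any shared n-grams with a reference push the score above 0; the multiple-reference handling above is exactly why BLEU can partially accommodate the fact that several translations may be correct.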

Human evaluation, however, also has a number of disadvantages, primarily because it is a costly and time-consuming process. Human judgment is also subjective in nature, so it is difficult to achieve a high level of intra-rater agreement (consistency of the same human judge over time) and inter-rater agreement (consistency across multiple judges). In addition, there are no standardized metrics and approaches to human evaluation. 
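Inter-rater agreement is commonly quantified with a chance-corrected statistic such as Cohen's kappa. Here is a minimal sketch for two judges rating the same segments; the function name and data layout are my own.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two judges over the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    # Chance agreement: probability both judges pick the same label at random
    expected = sum(freq_a[label] / n * freq_b[label] / n for label in freq_a)
    if expected == 1.0:
        return 1.0  # degenerate case: both judges always use the same label
    return (observed - expected) / (1 - expected)
```

A kappa of 1.0 means perfect agreement, 0 means agreement no better than chance; negative values indicate systematic disagreement.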

Let us explore the most commonly used types of human evaluation:


Rating :-

Judges rate translations based on a predetermined scale. For example, a scale from 1 to 5 can be used, where 1 is the lowest and 5 is the highest score. One of the challenges of this approach is establishing a clear description of each value in the scale and the exact differences between the levels of quality. Even if human judges have explicit evaluation guidelines, they still find it difficult to assign numerical values to the quality of the translation (Koehn & Monz, 2006).

The two main dimensions or metrics used in this type of evaluation are adequacy and fluency.

Adequacy , according to the Linguistic Data Consortium, is defined as "how much of the meaning expressed in the gold-standard translation or source is also expressed in the target translation." 

The annotators must be bilingual in the source and target languages in order to judge whether the information is preserved across the translation.

A typical scale used to measure adequacy is based on the question "How much meaning is preserved?"
5: all meaning
4: most meaning
3: some meaning
2: little meaning
1: none 

Fluency refers to the target only, without taking the source into account; the main evaluation criteria are grammar, spelling, choice of words, and style.

A typical scale used to measure fluency is based on the question "Is the language in the output fluent?"
5: flawless
4: good
3: non-native
2: disfluent
1: incomprehensible 
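Put together, the two scales yield a simple per-system report. Here is a sketch of how such judgments might be aggregated; the engine and judge names are made up for illustration.

```python
from statistics import mean

# Hypothetical judgments: (system, judge, adequacy, fluency) on the 1-5 scales above
judgments = [
    ("engine_a", "judge1", 4, 5),
    ("engine_a", "judge2", 3, 4),
    ("engine_b", "judge1", 5, 4),
    ("engine_b", "judge2", 4, 4),
]

def average_scores(judgments):
    """Mean adequacy and fluency per MT system, across all judges."""
    by_system = {}
    for system, _judge, adequacy, fluency in judgments:
        by_system.setdefault(system, []).append((adequacy, fluency))
    return {
        system: {
            "adequacy": mean(a for a, _ in scores),
            "fluency": mean(f for _, f in scores),
        }
        for system, scores in by_system.items()
    }
```

In practice you would also want to report how much the judges disagree, not just the means, which is where agreement statistics like kappa come back in.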


Ranking :-

Judges are presented with two or more translations (usually from different MT systems) and are required to choose the best option. This task can be confusing when the ranked segments are nearly identical or contain errors that are difficult to compare. The judges must decide which errors have a greater impact on the quality of the translation (Denkowski & Lavie, 2010). On the other hand, it is often easier for human judges to rank systems than to assign absolute scores (Vilar et al., 2007), because it is difficult to quantify the quality of a translation. 
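A common, simple way to aggregate such pairwise preferences is to compute each system's win rate. Here is a sketch, with ties counted as half a win (an arbitrary but common convention); the data layout is my own.

```python
from collections import defaultdict

def win_rates(pairwise_judgments):
    """Fraction of pairwise comparisons each system wins.
    Each judgment is (system_x, system_y, winner); "tie" splits the win."""
    wins = defaultdict(float)
    comparisons = defaultdict(int)
    for sys_x, sys_y, winner in pairwise_judgments:
        comparisons[sys_x] += 1
        comparisons[sys_y] += 1
        if winner == "tie":
            wins[sys_x] += 0.5
            wins[sys_y] += 0.5
        else:
            wins[winner] += 1.0
    return {system: wins[system] / comparisons[system] for system in comparisons}
```

With enough judgments per pair, the resulting ranking tends to be more stable than absolute 1-5 scores, which matches the observation above that humans find ranking easier than scoring.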

Error Analysis :-

Human judges identify and classify errors in MT output. Classification of errors might depend on the specific language and content type. Some examples of error classes are "missing words", "incorrect word order", "added words", "wrong agreement", "wrong part of speech", and so on. It is useful to have reference translations in order to classify errors; however, as mentioned above, there may be several correct ways to translate the same source segment. Accordingly, reference translations should be used with care.

When evaluating the quality of eBay MT systems, we use all of the aforementioned methods. In addition, our metrics capture micro-level details about areas specific to eBay content. For example, one evaluation criterion is whether brand names and product names (the main noun or noun phrase identifying an item) were translated correctly. This information helps identify the problem areas of MT and focus enhancement efforts on those particular areas.

Some types of human evaluation, such as error analysis, can only be conducted by professional linguists, while other types of judgment can be performed by annotators who are not linguistically trained.

Is there a way to cut the cost of human evaluation? Yes, but unfortunately, low-budget crowd-sourcing evaluations tend to produce unreliable results. How then, can we save money without compromising the validity of our findings?
  • Start with a pilot test - a process of trying out your evaluation on a small data set. This can reveal critical flaws in your metrics, such as ambiguous questions or instructions.
  • Monitor response patterns to remove judges whose answers are outside the expected range.
  • Use dynamic judgments - a feature that allows fewer judgments on the segments where annotators agree, and more judgments on segments with a high inter-rater disagreement.
  • Use professional judgments that are randomly inserted throughout your evaluation job. Pre-labeled professional judgments will allow for the removal of judges with poor performance.
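The last bullet can be sketched in a few lines: score each judge against pre-labeled gold items and keep only those above an accuracy threshold. The 0.7 cutoff and the data layout below are illustrative assumptions.

```python
def reliable_judges(answers, gold, min_accuracy=0.7):
    """Keep judges whose answers on pre-labeled gold items meet a minimum
    accuracy threshold (0.7 here is an arbitrary choice).

    answers: {judge: {item_id: label}}, gold: {item_id: expected label}
    """
    kept = []
    for judge, labels in answers.items():
        scored = [item for item in gold if item in labels]
        if not scored:
            continue  # judge saw no gold items; reliability cannot be assessed
        accuracy = sum(labels[item] == gold[item] for item in scored) / len(scored)
        if accuracy >= min_accuracy:
            kept.append(judge)
    return kept
```

Judgments from the removed judges would then be discarded or re-collected before any scores are aggregated.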
Human evaluation of machine translation quality is still very important, even though there is no clear consensus on the best method. It is a key element in the development of machine translation systems, as automatic metrics are validated through correlation with human judgment. 

Juan Rowda
Staff MT Language Specialist, eBay
Juan is a certified localization professional who has worked in the localization industry since 2003. He joined eBay in 2014. Before that, he worked as a translator/editor for several years, managed and trained a team of 10+ translators specialized in IT, and also worked as a localization engineer for some time. He first started working with MT in 2006. Juan helped to localize quite a few major video games as well. 
He was also a professional CAT tool trainer and taught courses on localization.
Juan holds a BA in technical, scientific, legal, and literary translation. 

Olga Pospelova
Localization Program Manager, eBay

Olga Pospelova is a Localization Program Manager at eBay, where she leads linguistic support for the Human Language Technology Initiative. The broad spectrum of services includes human evaluation of MT output, the creation of testing and training data for MT systems, semantic annotation, and more. Prior to joining eBay, Olga worked as an Assistant Professor of Russian at the Defense Language Institute Foreign Language Center. Olga holds an MA in Linguistics from San Jose State University.


So, in this quest for evaluation objectivity, there are a number of recent initiatives that are useful. For those who are considering the use of error analysis, TAUS has attempted to provide a comprehensive error typology together with some tools. The identified errors can then be used to build a scoring system that weights error types differently, which helps develop a more objective basis for any discussion of MT quality. 
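An error typology turns into a score once each category gets a severity weight. Here is a sketch of such a weighted scoring scheme, normalized to error points per 1,000 words; the categories and weights below are purely illustrative assumptions, not the actual TAUS values.

```python
# Illustrative severity weights (NOT the actual TAUS typology weights)
ERROR_WEIGHTS = {
    "missing_word": 3.0,
    "word_order": 2.0,
    "wrong_agreement": 1.0,
    "capitalization": 0.5,
}

def weighted_error_score(errors, word_count, per=1000):
    """Weighted error points per `per` words; lower is better.
    `errors` maps an error category to its count in the evaluated sample."""
    penalty = sum(ERROR_WEIGHTS.get(category, 1.0) * n
                  for category, n in errors.items())
    return penalty / word_count * per
```

Normalizing by sample size is what makes scores comparable across engines and test sets of different lengths, and a pass/fail threshold on this number is one way to turn error analysis into an objective acceptance criterion.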

This presentation by Alon Lavie provides more details on the various approaches being tried, and includes some information on edit-distance measurements, which can only be made after a segment is post-edited. 
This study is interesting because it provides evidence that monolingual editors can perform quite efficiently and consistently, provided they are shown only the target data, i.e., without the source. In fact, they do this more efficiently and consistently than bilinguals who are shown both source and target. I assume, though, that an MT system has to reach a certain level of quality before this holds true.

Eye-tracking studies are another way to get objective data, and initial studies provide some useful information for MT practitioners. The premise is very simple: lower-quality MT output slows down an editor, who has to look much more carefully at the output to correct it. The studies also show that some errors cause more of a drag than others. In both charts shown below, the shorter the bar, the better. Three MT systems are compared here, and it is clear that the Comp MT system has a lower frequency of most error types.
The results from the eye-fixation times are consistent and correlated with automatic metrics like BLEU. In the chart below, we see that word-order errors cause a longer gaze time than other errors. This is useful information for productivity analysts, who can use it to determine problem areas or the suitability of MT output for comprehension or post-editing.
All these efforts are slowly inching us toward answers to the basic questions, and perhaps some readers can shed light on new approaches to answering them. Adaptive MT and the Fairtrade initiative are also different ways to address the same fundamental issues. My vote so far is for Adaptive MT as the most promising approach for professional business translation, even though it is not strictly focused on static quality measurement.


  1. Thank you both, Juan and Kirti, for describing current approaches to human evaluation of MT output. I agree that, in order to obtain objective information about quality, we need to combine both automatic and human evaluation. Based on my experience, I would like to add that having a chat with post-editors, where they can tell us what they think are the most cumbersome or dangerous errors (or highlight the areas where the system is at its best, too), can give us additional valuable information about a system's strengths and weaknesses. Human evaluation methods are closed (they work like a survey, as opposed to an interview, which is open), so some interesting aspects might not show up. For instance, even if your error analysis highlights word order as the most recurrent mistake, in a phone call you can find out that post-editors are more worried about missing words because, say, the text has long sentences and a missing word can easily go unnoticed, causing severe mistranslations; or maybe they are annoyed by capitalization errors, because fixing them is more boring and time-consuming than fixing mistakes that occur more often. I realize that, when many languages and large teams are involved, it's not realistic to arrange a phone call with everyone, but you can have one with at least part of the team, or alternatively add some sort of open 'Comments' box to your human evaluation system in which post-editors can highlight the strengths and weaknesses of the MT output. This approach has helped us prioritize when deciding where to start when fine-tuning our MT systems.

    1. Lucia,
      I agree that direct post-editor feedback is invaluable, and your suggestion has the sheer power of common sense. In all our efforts to do this better, many of us often overlook the most obvious things. Asking post-editors to rank their most urgent issues should definitely be among the most important inputs for prioritizing adjustments.

    2. Agreed, Lucia. As Kirti said, post-editors can add a lot of value with their feedback. What evaluations ultimately provide is a number, a measurement on a scale; that is a starting point for answering the question "is my system good or bad?" The post-editor feedback adds a whole lot of information that a score won't provide.

  2. Rick Woyde: BLEU metric scores do not have any real correlation with translation quality. It's simply a tool to sell poor MT output. Human reviewers must be trained to look for error patterns. Unlike human translators, who make random mistakes, MT systems make the same mistakes over and over again. Identifying and correcting these patterns leads to vast quality improvements. Too often the review process is focused on creating the needed deliverable instead of on MT improvement.

    1. BLEU is most useful for developers who are building the MT engine, as it can guide them on data choices and the impact of adding and removing data subsets. If you understand them, the scores do have a very direct relationship to translation quality, but they are not absolute measures. It is very easy to measure them incorrectly, and most DIY people do it badly. LSPs should learn about these scores before making any interpretations or drawing any conclusions based on a BLEU score.

    2. Thanks for your comment, Rick. I agree and disagree with your comment at the same time. BLEU is not perfect. I never liked comparing two or more translations, because in language 1 + 1 is rarely 2; there are so many possible good translations. But at the same time, BLEU can take more than one human reference translation. It is only a tool, and there are other scores out there that are currently used, like TER, etc. The people training the engines need to have at least a reference value, and that's what BLEU provides. I couldn't agree more with you regarding patterns and improving MT output. That's exactly what we do at eBay. We don't have any deliverables; we just focus on improving the quality of the output. I have published some other articles where I mentioned the importance of patterns and how fixing individual, minor issues is not efficient in MT.

    3. John Moran

      Rick - not all MT systems make the same mistakes over and over again. Go to Lilt to see one that doesn't.

    4. Rick, as Kirti explained, BLEU is super useful when you train an engine. Engines are trained by trial and error: you take one dataset, process it in different ways or filter out subsets, and then use all variants of that initial dataset to train "variant engines". With BLEU you can then compare how all variants perform on the same test set. BLEU will tell you variant X is better than variant Y and worse than variant Z. It won't tell you whether Z is good or bad. This is true even when your BLEU test set is representative of the type of sentences you want your engine to translate. You should not use BLEU to figure out how good a translation is. If that is what you want, you need translators (plural!) to evaluate the MT output. Or you can do a post-calculation on a translation job. That is how I do it.

    5. Gert, I'm quite aware of how BLEU and MT systems work, but as I mentioned, a few years back I had an MT vendor try to get me to sign off on a "customized" MT engine using BLEU scores. The customized engine performed worse than Google Translate but had good BLEU scores, making the BLEU scores irrelevant.

    6. John Moran, Lilt is not the only MT system using adaptive MT; so does Pairaphrase. Pairaphrase not only updates segments in real time as you translate within a doc, it will do so across a batch of docs too.

    7. Kirti Vashee, I've seen MT providers produce "acceptable" BLEU scores with translation quality that was worse than Google Translate. I agree there's not enough education on what a BLEU score actually means. It's simply not a great indicator of the quality you'll receive from an MT system and can be misleading as a result.

    8. @Rick The BLEU scores are only as good as the Test Set -- If they are not consistent with human evaluations or if they say that the engine is better than Google but your humans say not so then Change the test set so that you measure things that make sense to all concerned. If you understand this than BLEU scores will be meaningful. Learn what to measure and how to measure and understand what a vendor says when they give you a BLEU score. Most LSPs do not want to invest the time to develop a good BLEU Test set so they will get random test set results.