eMpTy Pages: Human Evaluation of Machine Translation Output

Tuesday, September 6, 2016

Human Evaluation of Machine Translation Output

This is another post from Juan Rowda, this time co-written with his colleague Olga Pospelova, that provides useful information on this subject. Human evaluation of MT output is key to understanding several fundamental questions about MT technology including:

Is the MT engine improving?
Do the automated metrics (which MT systems developers use to guide development strategy) correlate with independent human judgments of the same MT output?
How difficult/easy is the PEMT output task?
What is the fair and reasonable PEMT rate?

The difficulty with human evaluation of MT output is closely related to the difficulty associated with any discussion of translation quality in the industry in general. It is challenging to come up with an approach that is consistent over time and across different people, and being scientifically objective is a particular challenge. So, like much else in MT, estimates have to be made, and some rigor needs to be applied to ensure consistency over time and across evaluators. I will add some additional thoughts and references after Juan's post below, to provide a view on some new ways being explored to approach this issue of how humans could evaluate MT output in an objective way.

--------------

Machine translation (MT) output evaluation is essential in machine translation development. This is key to determining the effectiveness of the existing MT system, estimating the level of required postediting, negotiating the price, and setting reasonable expectations. As we discussed in our article on quality estimation, machine translation output can be evaluated automatically, using methods like BLEU and NIST, or by human judges. The automatic metrics use one or more human reference translations, which are considered the gold standard of translation quality. The difficulty lies in the fact that there may be many alternative correct translations for a single source segment.

Human evaluation, however, also has a number of disadvantages, primarily that it is a costly and time-consuming process. Human judgment is also subjective in nature, so it is difficult to achieve a high level of intra-rater agreement (consistency of the same human judge) and inter-rater agreement (consistency across multiple judges). In addition, there are no standardized metrics and approaches to human evaluation.
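One common way to put a number on inter-rater agreement is Cohen's kappa, which corrects the raw agreement rate for the agreement two judges would reach by chance. The sketch below is only an illustration: the function name and the sample scores are made up for this example, and nothing here is claimed to be the method used at eBay. The same function can be applied to two passes by the same judge to estimate intra-rater agreement.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two judges who rated the same segments."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both judges must rate the same non-empty set of segments")
    n = len(ratings_a)
    # Observed agreement: share of segments where both judges gave the same label.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each judge's own label distribution.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used one identical label for every segment
    return (observed - expected) / (1 - expected)


# Hypothetical 1-5 scores from two judges on the same six segments.
judge_1 = [5, 4, 4, 2, 1, 3]
judge_2 = [5, 4, 3, 2, 1, 4]
print(round(cohen_kappa(judge_1, judge_2), 3))
```

Note that plain kappa treats a 4-versus-5 disagreement the same as a 1-versus-5 disagreement; for ordinal scales a weighted variant may be a better fit.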

Let us explore the most commonly used types of human evaluation:

Rating:-

Judges rate translations based on a predetermined scale. For example, a scale from 1 to 5 can be used, where 1 is the lowest and 5 is the highest score. One of the challenges of this approach is establishing a clear description of each value in the scale and the exact differences between the levels of quality. Even if human judges have explicit evaluation guidelines, they still find it difficult to assign numerical values to the quality of the translation (Koehn & Monz, 2006).

The two main dimensions or metrics used in this type of evaluation are adequacy and fluency.

Adequacy, according to the Linguistic Data Consortium, is defined as "how much of the meaning expressed in the gold-standard translation or source is also expressed in the target translation."

The annotators must be bilingual in both the source and target language in order to judge whether the information is preserved across translation.

A typical scale used to measure adequacy is based on the question "How much meaning is preserved?"
5: all meaning
4: most meaning
3: some meaning
2: little meaning
1: none

Fluency refers to the target only, without taking the source into account; the main evaluation criteria are grammar, spelling, choice of words, and style.

A typical scale used to measure fluency is based on the question "Is the language in the output fluent?"
5: flawless
4: good
3: non-native
2: disfluent
1: incomprehensible
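Once several judges have scored the same segments on the two scales above, the scores are usually averaged per segment (and then per system). Here is a minimal sketch of that step, assuming a simple tuple layout for the judgments; the field order, segment IDs, and judge IDs are invented for the illustration.

```python
from statistics import mean

# Hypothetical judgments: (segment_id, judge_id, adequacy, fluency), each on a 1-5 scale.
judgments = [
    ("seg1", "j1", 5, 4),
    ("seg1", "j2", 4, 4),
    ("seg2", "j1", 2, 3),
    ("seg2", "j2", 3, 3),
]


def average_scores(judgments):
    """Return {segment_id: (mean adequacy, mean fluency)} across judges."""
    by_segment = {}
    for segment_id, _judge, adequacy, fluency in judgments:
        for score in (adequacy, fluency):
            if not 1 <= score <= 5:
                raise ValueError(f"score {score} is outside the 1-5 scale")
        by_segment.setdefault(segment_id, []).append((adequacy, fluency))
    return {
        segment_id: (mean(a for a, _ in scores), mean(f for _, f in scores))
        for segment_id, scores in by_segment.items()
    }


for segment_id, (adequacy, fluency) in average_scores(judgments).items():
    print(segment_id, adequacy, fluency)
```

Keeping adequacy and fluency separate, rather than folding them into one number, preserves the distinction between "the meaning is there" and "the text reads well".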

Ranking:-

Judges are presented with two or more translations (usually from different MT systems) and are required to choose the best option. This task can be confusing when the ranked segments are nearly identical or contain difficult-to-compare errors. The judges must decide which errors have a greater impact on the quality of the translation (Denkowski & Lavie, 2010). On the other hand, it is often easier for human judges to rank systems than to assign absolute scores (Vilar et al., 2007). This is because it is difficult to quantify the quality of the translation.
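One simple way to turn per-segment rankings into a system-level comparison is to count pairwise wins and losses. The sketch below assumes each judgment is a mapping from system name to rank (1 is best) and ignores ties; both the data layout and the tie handling are assumptions made for this illustration, and other aggregation schemes exist.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_win_rates(rankings):
    """rankings: list of {system: rank} dicts, one per judged segment (1 = best).

    Returns {system: wins / (wins + losses)} over all pairwise comparisons,
    skipping pairs that were ranked equally.
    """
    wins = defaultdict(int)
    losses = defaultdict(int)
    for ranking in rankings:
        for sys_a, sys_b in combinations(sorted(ranking), 2):
            if ranking[sys_a] == ranking[sys_b]:
                continue  # tie: no information about which system is better
            if ranking[sys_a] < ranking[sys_b]:
                winner, loser = sys_a, sys_b
            else:
                winner, loser = sys_b, sys_a
            wins[winner] += 1
            losses[loser] += 1
    systems = set(wins) | set(losses)
    return {s: wins[s] / (wins[s] + losses[s]) for s in systems}


# Hypothetical rankings of three systems on three segments.
rankings = [
    {"A": 1, "B": 2, "C": 3},
    {"A": 2, "B": 1, "C": 3},
    {"A": 1, "B": 1, "C": 2},
]
print(pairwise_win_rates(rankings))
```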

Error Analysis:-

Human judges identify and classify errors in MT output. Classification of errors might depend on the specific language and content type. Some examples of error classes are "missing words", "incorrect word order", "added words", "wrong agreement", "wrong part of speech", and so on. It is useful to have reference translations in order to classify errors; however, as mentioned above, there may be several correct ways to translate the same source segment. Accordingly, reference translations should be used with care.
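After the judges have labeled the errors, the usual next step is to tally them so the most frequent classes stand out. A minimal sketch, assuming each annotation is a (segment, error class) pair; the annotations are invented, and the class names are simply the examples mentioned above.

```python
from collections import Counter

# Hypothetical annotations: (segment_id, error_class).
annotations = [
    ("seg1", "missing words"),
    ("seg1", "incorrect word order"),
    ("seg2", "missing words"),
    ("seg3", "wrong agreement"),
    ("seg3", "missing words"),
]


def error_profile(annotations):
    """Return (error_class, count, share of all errors), most frequent first."""
    counts = Counter(error_class for _segment, error_class in annotations)
    total = sum(counts.values())
    return [(cls, n, n / total) for cls, n in counts.most_common()]


for error_class, count, share in error_profile(annotations):
    print(f"{error_class}: {count} ({share:.0%})")
```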

When evaluating the quality of eBay MT systems, we use all the aforementioned methods. However, our metrics can vary in how much micro-level detail they provide about areas specific to eBay content. For example, one of the evaluation criteria is whether brand names and product names (the main noun or noun phrase identifying an item) were translated correctly. This information can help in identifying the problem areas of MT and focusing on the enhancement of those particular areas.

Some types of human evaluation, such as error analysis, can only be conducted by professional linguists, while other types of judgment can be performed by annotators who are not linguistically trained.

Is there a way to cut the cost of human evaluation? Yes, but unfortunately, low-budget crowd-sourcing evaluations tend to produce unreliable results. How, then, can we save money without compromising the validity of our findings?

Start with a pilot test - a process of trying out your evaluation on a small data set. This can reveal critical flaws in your metrics, such as ambiguous questions or instructions.
Monitor response patterns to remove judges whose answers are outside the expected range.
Use dynamic judgments - a feature that allows fewer judgments on the segments where annotators agree, and more judgments on segments with a high inter-rater disagreement.
Use professional judgments that are randomly inserted throughout your evaluation job. Pre-labeled professional judgments will allow for the removal of judges with poor performance.
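Two of the tips above, pre-labeled professional judgments and dynamic judgments, can be sketched in a few lines. Everything specific here is an assumption for illustration: the one-point tolerance, the 0.7 accuracy threshold, the spread and budget limits, and the data layout are hypothetical and would need tuning for a real evaluation job.

```python
def judge_accuracy(responses, gold):
    """responses: {judge: {segment: score}}; gold: {segment: professional score}.

    Returns each judge's share of gold segments scored within 1 point
    of the professional judgment.
    """
    accuracy = {}
    for judge, answers in responses.items():
        scored = [seg for seg in gold if seg in answers]
        if not scored:
            continue  # judge saw no gold segments, so cannot be assessed
        hits = sum(abs(answers[seg] - gold[seg]) <= 1 for seg in scored)
        accuracy[judge] = hits / len(scored)
    return accuracy


def reliable_judges(responses, gold, min_accuracy=0.7):
    """Judges whose accuracy on the gold segments meets the threshold."""
    return {j for j, acc in judge_accuracy(responses, gold).items() if acc >= min_accuracy}


def needs_more_judgments(scores, max_spread=1, max_judgments=5):
    """Dynamic judgments: ask for another score while judges disagree and budget remains."""
    if len(scores) < 2:
        return True  # a single score says nothing about agreement
    return (max(scores) - min(scores)) > max_spread and len(scores) < max_judgments


# Hypothetical job: g1 and g2 are pre-labeled gold segments, s1 is a regular segment.
gold = {"g1": 5, "g2": 3}
responses = {
    "j1": {"g1": 5, "g2": 2, "s1": 4},
    "j2": {"g1": 1, "g2": 5, "s1": 3},
}
print(reliable_judges(responses, gold))
print(needs_more_judgments([1, 5]))
```

The point of the dynamic rule is that the budget goes to the segments where judges actually disagree, instead of collecting the same number of judgments everywhere.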

Human evaluation of machine translation quality is still very important, even though there is no clear consensus on the best method. It is a key element in the development of machine translation systems, as automatic metrics are validated through correlation with human judgment.
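To make the idea of validating a metric through correlation concrete, here is a minimal sketch that computes the Pearson correlation between an automatic metric and averaged human scores. The numbers are invented for the illustration, and Pearson is only one option; rank correlations such as Spearman's are also commonly used.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return covariance / sqrt(var_x * var_y)


# Hypothetical data: one automatic score and one mean human adequacy score per system.
metric_scores = [21.5, 27.0, 30.2, 33.8]
human_scores = [2.9, 3.4, 3.6, 4.1]
print(round(pearson(metric_scores, human_scores), 3))
```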

Juan Rowda
Staff MT Language Specialist, eBay
https://www.linkedin.com/in/juan-martín-fernández-rowda-b238915

Juan is a certified localization professional who has worked in the localization industry since 2003. He joined eBay in 2014. Before that, he worked as a translator/editor for several years, managed and trained a team of more than 10 translators specialized in IT, and also worked as a localization engineer for some time. He first started working with MT in 2006. Juan has also helped to localize quite a few major video games.

He was also a professional CAT tool trainer and taught courses on localization.
Juan holds a BA in technical, scientific, legal, and literary translation.

Olga Pospelova
Localization Program Manager, eBay

Olga Pospelova is a Localization Program Manager at eBay, where she has been leading linguistic support for the Human Language Technology initiative. The broad spectrum of services includes human evaluation of MT output, creating testing and training data for MT systems, semantic annotation, etc. Prior to joining eBay, Olga worked as an Assistant Professor of Russian at the Defense Language Institute Foreign Language Center. Olga holds an MA in Linguistics from San Jose State University.

------------------------------------------

So in this quest for evaluation objectivity, there are a number of recent initiatives that are useful. For those who are considering the use of Error Analysis, TAUS has attempted to provide a comprehensive error typology scheme together with some tools. The error types can then be used to build a scoring system that weights errors differently, which helps to develop a more objective approach to discussions of MT quality.
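A weighted scoring system of this kind can be as simple as assigning penalty points per severity level and normalizing by text length. The sketch below is purely illustrative: the severity levels and weights are hypothetical and are not taken from TAUS or any other published typology.

```python
# Hypothetical severity weights, chosen only for illustration.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}


def weighted_error_score(errors, word_count):
    """errors: list of (error_class, severity) pairs found in a sample.

    Returns penalty points per 100 words; lower is better.
    """
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    penalty = sum(SEVERITY_WEIGHTS[severity] for _error_class, severity in errors)
    return 100 * penalty / word_count


# Hypothetical annotated sample of 200 words.
errors = [
    ("missing words", "major"),
    ("wrong agreement", "minor"),
    ("added words", "minor"),
]
print(weighted_error_score(errors, word_count=200))
```

Normalizing by length makes samples of different sizes comparable, and the weights make explicit which errors the evaluators consider most damaging.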

This presentation by Alon Lavie provides more details on the various approaches that are being tried and includes some information on the Edit Distance measurements which can only be done after a segment is post-edited.
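For readers unfamiliar with edit distance, here is a minimal sketch of a word-level Levenshtein distance between raw MT output and its post-edited version, normalized by the length of the post-edited text. It is a simplification made for this post: it counts only insertions, deletions, and substitutions, whereas metrics such as TER also allow block shifts. The function names and whitespace tokenization are assumptions.

```python
def word_edit_distance(mt_output, post_edited):
    """Minimum number of word insertions, deletions, and substitutions
    needed to turn the MT output into the post-edited text."""
    hyp = mt_output.split()
    ref = post_edited.split()
    # previous[j]: edits needed to turn the hyp words seen so far into the first j ref words.
    previous = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        current = [i]
        for j, ref_word in enumerate(ref, start=1):
            substitution = previous[j - 1] + (hyp_word != ref_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def normalized_edit_distance(mt_output, post_edited):
    """Edit distance divided by the number of words in the post-edited text."""
    ref_length = len(post_edited.split())
    if ref_length == 0:
        raise ValueError("post-edited text must not be empty")
    return word_edit_distance(mt_output, post_edited) / ref_length


print(normalized_edit_distance("the red car new", "the new red car"))
```

A lower value means the post-editor changed less, which is why this measurement is only available after post-editing has been done.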

This study is interesting since it provides evidence that monolingual editors can perform quite efficiently and consistently, provided they are shown only the target text, i.e., without the source. In fact, they do this more efficiently and consistently than bilinguals who are shown both source and target. I also assume that an MT system has to reach a certain level of quality before this is true.

Eye-tracking studies are another way to get objective data and initial studies provide some useful information for MT practitioners. The premise is very simple: Lower quality MT output slows down an editor who has to look much more carefully at the output to correct it. It also shows that some errors cause more of a drag than others. In both charts shown below the shorter the bar the better. Three MT systems are compared here and it is clear that the Comp MT system has a lower frequency of most error types.

The results from the eye fixation times are consistent and correlated with automatic metrics like BLEU. In the chart below we see that Word Order errors cause a longer gaze time than other errors. This is useful information for productivity analysts who can use it to determine problem areas or determine the suitability of MT output for comprehension or post-editing.

All these efforts to address the basic questions are inching us forward slowly, and perhaps some readers can shed light on new approaches to answering them. Adaptive MT and the Fairtrade initiative are also different ways to address the same fundamental issues. My vote so far is for Adaptive MT as the most promising approach for professional business translation, even though it is not strictly focused on static quality measurement.

12 comments:

Lucía Guerrero, September 12, 2016 at 1:47 AM
Thank you both Juan and Kirti for describing current approaches to human evaluation of MT output. I agree that, in order to obtain objective information about quality, we need to combine both automatic and human evaluation. Based on my experience, I would like to add that having a chat with posteditors, where they can tell us what they think are the most cumbersome or dangerous errors –or highlight those areas where the system is best at, too- can give us additional valuable information about a system’s strengths and weaknesses. Human evaluation methods are closed –they work as a survey as opposed to an interview, which is open-, and some interesting aspects might not be showing up. So, for instance, even if your error analysis evaluation system highlights word order as the most recurrent mistake, in a phone call you can find out that posteditors are more worried about missing words, because, say, that text has long sentences and a missing word can easily go unnoticed, causing severe mistranslations; or maybe they are annoyed at capitalization errors, because fixing them is more boring and time-consuming than fixing mistakes that occur more often. I realize that, when many languages and large teams are involved, it's not realistic to arrange a phone call with everyone but you can have it at least with a part of the team, or alternatively add some sort of an open ‘Comments’ box to your human evaluation system in which posteditors can highlight the strengths and weaknesses of the MT output. This approach has helped us prioritize when deciding where to start with when fine-tuning our MT systems.
Rick Woyde, September 14, 2016 at 3:19 PM
BLEU metric scores do not have any real correlation with translation quality. It's simply a tool to sell poor MT output. Human reviewers must be trained to look for error patterns. Unlike human translators, who make random mistakes, MT systems make the same mistakes over and over again. Identifying and correcting these patterns leads to vast quality improvements. Too often the review process is focused on creating the needed deliverable instead of MT improvement.