Tuesday, August 16, 2016

MT Output Quality Estimation - A Linguist's Perspective

The issue of rapidly understanding MT output quality as precisely as possible BEFORE post-editing begins is an important one. Juan Rowda from eBay provides an interesting way to make this assessment based on linguistic criteria, along with a model that can be further refined and defined by motivated users. Automated metrics like BLEU used by MT system developers provide very little of value for this PEMT effort assessment. This approach is an example of the kinds of insightful tools that can only be developed by linguists who are engaged with a long-term MT project and want to solve problems that can add real value to the MT development and PEMT process. I think it is interesting that somebody outside of the "translation industry" came up with this kind of practical innovation that can facilitate and greatly enhance efforts in a project involving translation of several hundred million new words on a regular basis.

This article is based on a quality estimation method I developed and originally formally presented at AMTA in 2015. The premise of the method is a different approach to machine translation quality estimation (MTQE) created entirely from a linguist’s perspective.

What is MTQE?
Quality Estimation is a method used to automatically provide a quality indication for machine translation output without depending on human reference translations. In simpler terms, it’s a way to find out how good or bad the translations produced by an MT system are, without human intervention.
A good point to make before we go into more detail on QE is the difference between evaluation and estimation. There are two main ways in which you can evaluate the quality of MT output: human evaluation (a person will check the translation and provide feedback) and automatic evaluation (there are different methods that can provide a score on the translation quality without human intervention). 
Traditionally, to automatically evaluate the quality of any given MT output, at least one reference translation created by a human translator is required. The differences and similarities between the MT output and the reference translation can then be turned into a score to determine the quality of said output. This is the approach followed by certain methods like BLEU or NIST.

The main differentiator of quality estimation is that it does not require a human reference translation.

QE is a prediction of the MT output quality based on certain features and attributes. These features can be, for example, the number of prepositional or noun phrases in the source and target (and their difference), the number of named entities (names of places, people, companies, etc.), and many more attributes. With these features, using techniques like machine learning, a QE model can be created to obtain a score that represents the estimation of the translation quality. 

At eBay, we use MT to translate search queries, item titles and item descriptions. To train our MT systems, we work with vendors that help us post-edit content. Due to the challenging nature of our content (user-generated, diversity of categories, millions of listings, etc.), a quick method to estimate the level of effort post-editing will require definitely adds value to our process. QE can help you obtain important information on this level of difficulty in an automated manner. For example, one can estimate how many segments have a very low-quality translation and could be just discarded instead of post-edited.

What’s the purpose of MTQE?
MTQE can be used for several purposes. The main one is to estimate post-editing effort, i.e., how hard it will be to post-edit a text and how long it might take. QE estimates the quality of translations at the segment and file level. Segment-level QE scores can help you target post-editing efforts by focusing only on segments that make sense to post-edit, and by identifying segments with such low-quality translations that they should be discarded instead of post-edited. You can also estimate overall post-editing effort/time, as it would be safe to assume that segments with a poor quality score take more time to post-edit. It is also possible to compare MT systems based on QE scores and see which engine might perform better. This is especially helpful if you are trying to decide which engine to use, or to determine whether a new version of an engine is actually working better than its predecessor. QE can provide all of this information in an automated manner. It can also answer a very common question: Can I use MT for this translation project?

With Quality Estimation you can:
  • estimate the quality of a translation at the segment/file level,
  • target post-editing (choose sections, segments, or files to post-edit),
  • discard bad content that makes no sense to post-edit,
  • estimate post-editing effort/time,
  • compare MT systems to identify the best performing system for a given content,
  • monitor a system’s quality progress over time, and more.

Why a Linguist’s Approach?
Standard approaches to QE involve complex formulas and concepts most linguists are not familiar with, like Naive Bayes, Gaussian processes, neural networks, decision trees, etc., since so far QE has been mostly dealt with by computational linguists. It is also true that traditional QE models are technically hard to create and implement.

For this reason, I decided to try a different approach, one developed entirely from a linguist’s perspective. This implies that this method may have certain advantages and disadvantages compared to other approaches, but coming from a linguistic background, my aim was to create a process and methodology that translators and linguists in the traditional localization industry could actually use.
The research described in Linguistic Indicators for Quality Estimation of Machine Translations shows how linguistic and shallow features in the source text, the MT output and the target text can help estimate the quality of the content. Drawing on that research, I developed a linguistic approach to rapid quality estimation.

In a nutshell, finding potential issues in the following three dimensions of the content can help us get an idea of the MT output quality. These three dimensions are:
  • complexity (source text: how complex it is, how difficult it will be for MT to translate),
  • adequacy (the translation itself, how accurate it is), and
  • fluency (target text only).

The next step was then trying to identify specific features in these three dimensions, in my content, that would provide an accurate estimation of the output quality. After some trial and error, I decided to use the following set of features:
  • Length: is a certain maximum length exceeded? Is there a significant difference between source and target? The idea here is that the longer a sentence is, the harder it may be for the MT system to get it right.
  • Polysemy: words that can have multiple meanings (and therefore, multiple translations). With millions of listings across several broad categories, this is a big issue for eBay content. For example, if you search for lime on eBay, you will get results from Clothing categories (lime color), from Home & Garden (lime seeds), from Health & Beauty (there’s a men’s fragrance called Lime), from Recorded Music (there’s a band called Lime), etc. The key here is that, if a polysemous word is in the source, this is an indication of a potential issue. Another key: if a given translation for a source term is near certain words, that is a potential error too. Let me make that clearer: “clutch” can be translated a) as that pedal in your car or b) as a small handbag; if translation “a” occurs in your target next to words like bag, leather, purse, or Hermes, that’s most likely a problem.
Here’s an elaboration and further discussion on polysemy if you want to learn more.
  • Terminology: basically checking that some terms are correctly translated. For eBay content, things like brands, typical e-commerce acronyms, and company terminology are critical. Some brand names may be tricky to deal with, as some have common names, like Coach or Apple, as opposed to exclusively proper names like Adidas or Nike.
  • Patterns: any set of words or characters that can be identified as an error. Patterns can be duplicate words, tripled letters, missing punctuation signs, formal/informal style indicators, words that shouldn’t occur in the same sentence, and more. The use of regular expressions gives you a great deal of flexibility to look for these error patterns. For example, in Spanish, sentences don’t typically end in prepositions, so it’s not hard to create a regular expression that finds ES prepositions at the end of a sentence: (prep1|prep2|prep3|etc)\.$
  • Blacklists: terms that shouldn’t occur in the target language. A typical example of these would be offensive words. In the case of languages like Spanish, this is useful to detect regionalisms.
  • Numbers: numbers occurring in the source should also appear in the target.
  • Spelling: Common misspellings.
  • Grammar: potential grammar errors, unlikely word combinations, like a preposition followed by a conjugated verb.
After some initial trial and error runs, I discarded ideas like named entity recognition and part-of-speech tagging. I couldn’t get any reliable information that would help with the estimation, but this doesn’t mean these two can be completely discarded as features. They would, of course, introduce a higher level of complexity to the method but could yield positive results. This list of determinants is not final and can evolve. 
All these features, with all of their checks, make up your QE model.
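
A few of these checks can be sketched in Python to show how mechanical they are. This is a minimal illustration under stated assumptions, not the Checkmate profile used at eBay: the preposition list is a small made-up subset, the Spanish words in the polysemy example are chosen only for illustration, and all function names are hypothetical.

```python
import re

# Illustrative subset of Spanish prepositions (an assumption, not a complete list).
ES_PREPOSITIONS = ["a", "con", "de", "en", "para", "por", "sin", "sobre"]

# Pattern check: a sentence that ends in a preposition followed by a period.
TRAILING_PREP = re.compile(
    r"\b(" + "|".join(ES_PREPOSITIONS) + r")\.$", re.IGNORECASE
)

# Pattern check: the same word twice in a row ("para para").
DUPLICATE_WORD = re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)

# Numbers check: digits, optionally with . or , separators inside.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def ends_in_preposition(target: str) -> bool:
    return bool(TRAILING_PREP.search(target.strip()))


def has_duplicate_word(target: str) -> bool:
    return bool(DUPLICATE_WORD.search(target))


def missing_numbers(source: str, target: str) -> list:
    """Numbers found in the source but not in the target.

    Compares the literal strings, so locale differences such as
    1.5 versus 1,5 would be flagged and need separate handling.
    """
    target_numbers = set(NUMBER.findall(target))
    return [n for n in NUMBER.findall(source) if n not in target_numbers]


def polysemy_conflict(target: str, risky_translation: str, context_words: set) -> bool:
    """True if a risky translation appears together with words from the wrong domain."""
    words = set(re.findall(r"\w+", target.lower()))
    return risky_translation in words and bool(words & context_words)
```

For instance, `polysemy_conflict("Embrague de cuero Hermes", "embrague", {"bolso", "cuero", "hermes"})` flags the car-pedal translation of "clutch" appearing next to fashion vocabulary. Each positive result would count as one potential error for the segment.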

How do you use the model?
The idea is simple; let me break it down for you:
  • The goal is to get a score for each segment that can be used as an indication of the quality level.
  • The presence of any of the above-mentioned features indicates a potential error.
  • Each error can be assigned a number of points, a certain weight. (During my tests I assigned one point to each type of error, but this can be customized for different purposes.)
  • The number of errors is divided by the number of words to obtain a score.
  • The ideal score, no potential errors detected, would be 0.
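
The scoring rule above can be written in a few lines. This is a minimal sketch: the function name `qe_score`, the feature labels, and the `weights` parameter are illustrative assumptions, with every error type defaulting to one point as in the tests mentioned above.

```python
def qe_score(error_counts, word_count, weights=None):
    """Weighted potential errors divided by the number of words.

    error_counts: dict mapping a feature name to the number of flags raised.
    weights: optional dict mapping a feature name to its points (default 1.0).
    A score of 0 means no potential errors were detected.
    """
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    weights = weights or {}
    points = sum(
        count * weights.get(feature, 1.0)
        for feature, count in error_counts.items()
    )
    return points / word_count
```

For example, a 12-word segment with two spelling flags and one number mismatch scores 3 / 12 = 0.25. Passing `weights={"numbers": 3}` would raise that to 5 / 12, which is how the score can be tuned for content where number errors matter more.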

Quality estimation must be automatic – it makes no sense to check manually for each of these features. A very easy and inexpensive way to find potential issues is using Checkmate, which also integrates LanguageTool, a spelling and grammar checker. Both are open source.

There is a way to account for each of the linguistic features mentioned in Checkmate: terminology and blacklists can be set up in the Terminology tab, spelling and grammar in the LanguageTool tab, patterns can be created in the Patterns tab, etc. The set of checks you create can be saved as a profile and be reused. You just need to create a profile once, and you can update it when necessary.

Checkmate will verify one or more files at the same time, and display a report of all potential issues found. By knowing how many errors were detected in a file, you can get a score at the document level. 

Getting scores at the segment level involves an extra step. What we need at this point is to add up all the potential errors found for each segment (every translation unit is assigned an ID by Checkmate, and that makes the task easier), count the number of words in each segment, and divide those values to get scores. All the necessary data can be taken from Checkmate’s report, which is available in several formats.
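
The segment-level step can also be sketched outside a spreadsheet. The example below assumes the report has already been exported and parsed into `(segment_id, issue_type)` pairs and that the segment texts are available by ID; this input shape is a simplification for illustration and not Checkmate's actual report format. Whether words are counted in the source or the target is also a choice left to the user.

```python
from collections import Counter


def segment_scores(issues, segments):
    """Compute one score per segment.

    issues: iterable of (segment_id, issue_type) pairs, one per potential error.
    segments: dict mapping segment_id to the segment text used for word counts.
    Returns a dict mapping segment_id to errors / words.
    """
    errors_per_segment = Counter(seg_id for seg_id, _ in issues)
    scores = {}
    for seg_id, text in segments.items():
        words = len(text.split())
        if words:  # skip empty segments to avoid dividing by zero
            scores[seg_id] = errors_per_segment[seg_id] / words
    return scores


def worst_first(scores):
    """Segment IDs ordered from highest (worst) to lowest (best) score."""
    return sorted(scores, key=scores.get, reverse=True)
```

Sorting with `worst_first` gives a quick list of segments to inspect, or to discard rather than post-edit.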
To be able to carry out this step of the process with minimal effort, I created an Excel template and put together a VBA macro that, after copying and pasting the contents of the Checkmate report, gets the job done for you. The results should be similar to this, with highest and lowest scores in red and green:
This is the VBA code I used, commented and broken down into smaller bits (VBA experts, please note that I’m not an expert).

Testing
Several tests were run to check the effectiveness of this approach. We took content samples of roughly the same size with different levels of quality, from perfect (good quality human translation) to very poor (MT output with extra errors injected). Each sample was post-edited by two post-editors, recording the time required for post-editing each sample. Post-editors didn’t know that the samples had different levels of quality. At the same time, we obtained the QE score of each sample. 
First, we started with short Spanish samples of around 300 words. One of the samples was the gold standard, one of the samples was raw MT output with errors injected, and the rest of the samples were in-between. Then we repeated the same steps with bigger samples of around 1,000 words. A third test was done using bigger files (around 50,000 words) in the following different stages of our training process:
  • MT output (raw MT)
  • review 1 (post-edited, reviewed, and sent back for further post-editing after not meeting quality standards)
  • review 2 (final post-edited and reviewed file, meeting predetermined quality standards).

Some of these tests were then extended to include Russian, Brazilian Portuguese, and Chinese. Only one post-editor worked on each of these three languages. 

Analyzing Results
Results showed that post-editing time and the QE scores were aligned and strongly correlated. This list shows two sets of samples (~300 words and ~1,000 words), their sizes, the number of potential issues found, and the score. The last two columns show the time taken by each post-editor. Samples in green are the gold standards for each set; in red, the worst quality sample in each set.

As you can see, QE scores and post-editing times are overall aligned. A score of 0 indicates no potential issues were detected (which does not necessarily mean that the file has no errors at all; it just means none were found). These initial tests were run at the document level.
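
One way to put a number on this alignment is a correlation coefficient between the QE scores and the recorded post-editing times. The article does not say how the correlation was measured, so the use of Pearson's r here is an assumption, and no values from the tests are reproduced; the function is only a sketch for readers who want to run the same comparison on their own samples.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers.

    xs could be the QE score of each sample and ys the post-editing
    time for the same samples, in the same order.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return covariance / sqrt(spread_x * spread_y)
```

A value close to 1 would mean that samples with higher (worse) scores consistently took longer to post-edit; with only a handful of samples, the figure should be read as an indication rather than proof.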

This is a different representation of the results for the second set of samples (~1,000 words). Red bars represent the time taken by post-editor #1 to post-edit each sample; green bars are for post-editor #2. The blue line represents the QE score obtained for each sample. 

With the help of colleagues, similar tests were run for 3 additional languages (BPT, RU, and ZH) with similar results. The only language with inconsistent results was Chinese. We later discovered that Checkmate had some issues with double-byte characters. Also, the set of features we had for Chinese was rather small compared to other languages. 

One thing that becomes obvious from these results is that a strong QE profile (i.e., the number of checks included for each feature, how good they are at catching issues) has a key role in producing accurate results. As you can see above, the RU profile caught more potential errors than the BPT one. 

In a third test, I estimated the quality of the same file after 3 stages of our training process (as described above in the Testing section). After each step, the score improved. The presence of many potential errors in a file that was post-edited two times helped fine-tune some features in the model. This also reinforced the idea that a one-size-fits-all model is not realistic. Models can and should be adapted to your type of content. Let’s take eBay titles as an example: they have no syntactic structure (they are just a collection of words), so perhaps they don’t need any grammar checks. Titles usually contain brand names, part numbers and model names, so perhaps spelling checks will not provide meaningful information. 

During this test, I also checked changes in edit distance. As the score improved, the edit distance grew closer to the average edit distance for this type of content at that point in time, which was 72. By looking at the score and the edit distance, I could infer that there’s room to improve the quality of this particular file. Some analysis at the segment level can help confirm these conclusions. Checking segments with the best and worst scores helps determine how reliable your results are. 
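
For readers who want to track edit distance alongside the score, a word-level Levenshtein distance is one common definition. The article does not specify which edit distance measure or scale was used, so the sketch below is an assumption and its output is not comparable to the figure quoted above.

```python
def word_edit_distance(mt_output, post_edited):
    """Minimum number of word insertions, deletions, and substitutions
    needed to turn the MT output into the post-edited text."""
    a, b = mt_output.split(), post_edited.split()
    previous = list(range(len(b) + 1))  # distance from an empty prefix of a
    for i, word_a in enumerate(a, start=1):
        current = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete word_a
                    current[j - 1] + 1,      # insert word_b
                    previous[j - 1] + cost,  # keep or substitute
                )
            )
        previous = current
    return previous[-1]
```

Dividing the result by the number of words in the post-edited text gives a rough per-segment measure of how much was changed, which can then be compared against the QE score for the same segment.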

Challenges of using this model
A high number of false positives may occur based on the nature of the content. For example, some English brand names may be considered spelling errors by certain spellcheckers in the target language. LanguageTool uses an ignore list to avoid incorrectly flagging any terms you add to it. Overall, it’s virtually impossible to avoid false positives in quality checks in any language. Efforts should be made to minimize them as much as possible.
Another challenge is trying to match a score with a post-editing effort measurement - it’s not easy to come up with a metric that accurately predicts the number of words per second that can be post-edited given a certain score. I’m sure that it is not impossible, but a lot of data is required for precise metrics.
The model is flexible enough to allow you to assign a certain weight to each feature. This can be challenging at first, but it is, in my opinion, a “good problem”. This allows users to adapt the score to their specific needs.

What motivated the development of this method was mainly the idea of providing translators, linguists, post-editors, translation companies, and people working with MT in general, a means to use quality estimation.
Regarding next steps for this method, I see some clear ones. It would be really interesting to be able to match QE scores with post-editing time. It doesn’t seem impossible and it’s probably a matter of collecting enough data. Another interesting idea would be integrating QE in a CAT tool or any other post-editing environment, and having segment-level QE scores displayed to post-editors. Comparing post-editing time and score in one such tool could also help fine-tune the QE model and more accurately predict post-editing effort.
I see this just as a starting point. Personally, I like the idea of people taking this model, personalizing it and improving it, and of course sharing their results. There is definitely room for improvement, and I’m sure that new features can be added to make results even better. Perhaps in the future new QE models can combine statistical data and language data while keeping the process simple enough.

Juan Rowda
Staff MT Language Specialist, eBay
Juan is a certified localization professional who has worked in the localization industry since 2003. He joined eBay in 2014. Before that, he worked as a translator/editor for several years, managed and trained a team of 10+ translators specialized in IT, and also worked as a localization engineer for some time. He first started working with MT in 2006. Juan helped to localize quite a few major video games, as well.
He was also a professional CAT tool trainer and taught courses on localization.
Juan holds a BA in technical, scientific, legal, and literary translation. 


  1. Interesting approach. However, your flexibility for "customisation" and "personalisation" could turn out to be a double-edged sword. In large organisations like yours it can be used as intended - to make an honest assessment of the post-editing process and refine it over time. However, in another scenario in which agencies are desperately looking for translators to post-edit raw MT output, your flexibility could be used by cynical agencies to fiddle the statistics and use ostensibly "official" ratings to persuade inexperienced translators to accept hopeless post-editing tasks.
    So although your project does show laudable intentions and interesting ideas, it is not yet enough to address the misgivings in the translator and post-editor community.

    1. Thanks for your interest, Victor. My goal was to share an idea, a method, for something that I believe is lacking in the "industry". To be honest, I did not consider the potential implications you mention. How this idea can be used is outside of my control, as I guess the same happens with many other ideas once they reach the public, users, etc. But I'm hoping there will be also many positive things coming out of its use. Since the customization is done by linguists (i.e., someone that understands the language really well needs to create these features), post-editors should be able to quickly realize that the checks are not appropriate or that the system is rigged. I think cynical agencies, as you call them, can benefit more from getting accurate scores and metrics (and thus being able to budget properly, accept/reject projects, pay resources fairly, etc) than from creating a bad reputation for themselves.
      It's an interesting idea; thank you for taking the time to share it.

    2. Victor,

      Thank you for your comment.

      All tools have the potential to be misused by unscrupulous users. But I think the approach suggested here empowers translators/editors in the following ways:

      1) For those who may choose to work on PEMT jobs by providing them a way to rapidly assess the MT output quality in a linguistically informed way,
      2) To set up personal effort/work benchmarks (based on these scores) to understand the level of difficulty/effort involved in performing a new PEMT task. This would require keeping track of actual work performed versus the scores these old jobs had.
      3) To identify projects worth doing versus ones to avoid, independently, and without input from the agency.

      There is an effort involved in setting up your measurement and scoring system but then all you need to do is run new potential jobs through there and decide if it is worth doing or not.

      If properly done I think this gives a post-editor a reliable way to identify a "bullshit" project based solely on his linguistic knowledge and personal assessment of work previously done and measured.