Tuesday, September 12, 2017

LSP Perspective: Applying the Human Touch to MT, Qualitative Feedback in MT evaluation

In all the discussion on MT that we hear, we do not often hear much about the post-editors and what could be done to enhance and improve the often negatively viewed  PEMT task.  Lucía Guerrero provides useful insights on her direct experience in improving the work experience for post-editors. Interestingly over the years I have noted that, strategies to improve the post-editor experience can often make mediocre MT engines viable, and  failure to do so can make good engines fail in fulfilling the business promise. I cannot really say much beyond what Lucía says here other than restate what she is saying in slightly different words. The keys to success seem to be:
  1. Build Trust by establishing transparent and fair compensation and forthright work related communication
  2. Develop ways to involve post-editors in the MT engine refinement and improvement process involvement
  3. Demonstrate that  the Feedback Cycle does in fact improve the work experience on an ongoing basis

Post-editing has become the most common practice when using MT. According to Common Sense Advisory (2016), more than 80% of LSPs offer Machine Translation Post-Editing (MTPE) services, and one of the main conclusions from a study presented by Memsource at the 2017 Conference of the European Association for Machine Translation (EAMT) states that less than 10% of the MT done in Memsource Cloud was left unedited. While it is true that a lot of user-generated content is machine-translated without post-editing (we see it every day at eBay, Amazon, Airbnb, to mention just a few), whether it is RBMT, SMT, or NMT, post-editors are still needed to improve the raw MT output.

Quantitative Evaluation Methods: Only Half the Picture

While these data show that they are key, linguists are often excluded from the MT process, and only required to participate in the post-editing task, with no interaction “in process.” Human evaluation is still seen as “expensive, time consuming and prone to subjectivity.” Error annotation takes a lot of time, compared to automated metrics such as BLEU or WER, which are certainly cheaper and faster. These tools provide quantitative data usually obtained by automatically comparing the raw MT to a reference translation, but the post-editor’s evaluation is hardly ever taken into account. Shouldn’t that be important if the post-editor’s role is here to stay?

While machines are better than we are at spotting differences, humans are better at assessing linguistic phenomena, categorizing them and giving detailed analysis.

Our approach at CPSL is to involve post-editors in three stages of the MT process:
  • For testing an MT engine in a new domain or language combination
  • For regular evaluation of an existing MT engine
  • For creating/updating post-editing guidelines
Some companies use the Likert scale for collecting human evaluation. This method involves asking people – normally the end-users, rather than linguists – to assess raw MT segments one by one, based on criteria such as adequacy (how effectively has the source text message been transferred to the translation?) or fluency (does the segment sound natural to a native speaker of the target language?).

For our evaluation purposes, we find it more useful to ask the post-editor to fill in a form with their feedback, correlating information such as source segment, raw MT and post-edited segment, type and severity of errors encountered, and personal comments.

Turning Bad Experiences Into Rewarding Jobs

One of the main issues I often have to face when I manage an MT-based project is the reluctance of some translators to work with machine-translated files due to bad previous post-editing experiences. I have heard many stories about post-editors being paid based on an edit distance that was calculated from a test that was not even close to reality, or post-editors never being asked for their evaluation of the raw MT output. They were only asked for the post-edited files, and sometimes, the time spent, but just for billing purposes. One of our usual translators even told me that he received machine-translated files that were worse than Google Translates results (NMT had not yet been implemented). All these stories have in common the fact that post-editors are seldom involved in the system improvement and evaluation process. This can turn post-editing into an alienating job that nobody wants to do a second time.

To avoid such situations, we decided to create our own feedback form for assessing and categorizing error severity and prioritizing the errors. For example, errors such as incorrect capitalization of months and days in Spanish, word order problems in questions in English, punctuation issues in French, and other similar errors were given the highest priority by our post-editors and our MT provider was asked to fix them immediately. The complexity of the evaluation document can vary according to need. It can be as detailed as the Dynamic Quality Framework (DQF) template or be a simple list of the main errors with an example.

Post Editor Feedback Form

However, more than asking for severity and repetitiveness, what I really want to know is what I call ‘annoyance level,’ i.e. what made the post-editing job too boring, tedious or time-consuming – in short, a task that could lead the post-editor to decline a similar job in the future. These are variables that quantitative metrics cannot provide. Automated metrics cannot provide any insight on how to prioritize error fixing, either by error severity level or by ‘annoyance level.’ Important errors can go unnoticed in a long list of issues, and thus never be fixed.

I have managed several MT-based projects where the edit distance was acceptable (< 30%) and the post-editors’ overall experience, to my surprise was still unpleasant. In such cases, the post-editor came back to me saying that certain types of errors were so unacceptable for them that they didn’t want to post-edit again. Sometimes this opinion was related to severity and other times to perception, i.e. errors a human would never make. In these cases, the feedback form helped detect the errors and turned a previously bad experience into an acceptable job.

It is worth noting that one cannot rely on one single post- editor's feedback. The acceptance threshold can vary quite a lot from one person to another, and post-editing skills are also different. Thus, the most reasonable approach is to collect feedback from several post-editors, compare their comments and use them as a complement to the automatic metrics.

We must definitely make an effort to include the post-editors’ comments as a variable when evaluating MT quality, to prioritize certain errors when optimizing the engines. If we have a team of translators whom we trust, then we should also trust them when they comment on the raw MT results. Personally, I always try my best to send machine-translated files that are in good shape so that the post-editing experience is acceptable. In this way, I can keep my preferred translators (recycled as post-editors) happy and on board, willing to accept more jobs in the future. This can make a significant difference not only in their experience but also in the quality of the final project.

5 Tips for Successfully Integrating Qualitative Feedback into your MT Evaluation Workflow

  1. Devise a tool and a workflow for collecting feedback from the post-editors.
It doesn’t have to be a sophisticated tool and the post-editors shouldn’t have to fill in huge Excel files with all changes and comments. It’s enough to collect the most awkward errors; those they wouldn’t want to fix over and over again. However, if you don’t have the time to read and process all this information, a short informal conversation on the phone from time to time can also be of help and give you valuable feedback about how the system is working.

  1. Agree to fair compensation
Much has been said about this. My advice would be to rely on the automatic metrics but to include the post- editor's feedback in your decision. Therefore, I usually offer hourly rates when language combinations are new and the effort is harder, and per word rates when the MT systems are established and have stable edit distances. When using hourly rates, you can ask your team to use time-tracking apps in their CAT tools or ask them to report the real hours spent. To avoid last-minute surprises, for full PE it is advisable to indicate a maximum number of hours based on the expected PE speed, and ask them to inform you of any deviation, whereas for light post-editing you may want to indicate a minimum amount of hours to make sure the linguists are not leaving anything unchecked.
  1. Never promise the moon
If you are running a test, tell your team. Be honest about the expected quality and always explain the reason why you are using MT (cost, deadline…).
  1. Don’t force anyone to become a post-editor
I have seen very good translators becoming terrible post-editors; either they change too many things or too few, or simply cannot accept that they are reviewing a translation done by a machine. I have also seen bad translators become very good post-editors. Sometimes a quick chat on the phone can be enough to check if they are reluctant to use MT per se, or if the system really needs further improvement before the next round.
  1. Listen, listen, listen
We PMs tend to provide the translators with a lot of instructions and reference material and make heavy use of email. Sometimes, however, it’s worth it to arrange short calls and listen to the post-editors’ opinion of the raw MT. For long-term projects or stable MT-based language combinations, it is also advisable to arrange regular group calls with the post-editors, either by language or by domain.

And… What About NMT Evaluation?

According to several studies on NMT, it seems that the errors produced by these systems are harder to detect than those produced by RBMT and SMT, because they occur at the semantic level (i.e. meaning). NMT takes context into account and the resulting text flows naturally; we no longer see the syntactically awkward sentences we are used to with SMT. But the usual errors are mistranslations, and mistranslations can only be detected by post-editors, i.e. by people. In most NMT tests done so far, BLEU scores were low, while human evaluators considered that the raw MT output was acceptable, which means that with NMT we cannot trust BLEU alone. Both source and target text have to be read and assessed in order to decide if the raw MT is acceptable or not; human evaluators have to be involved. With NMT, human assessment is clearly even more important, so while the translation industry works on a valid approach for evaluating NMT, it seems that qualitative information will be required to properly assess the results of such systems.


 Lucía Guerrero is a senior Translation and Localization Project Manager at CPSL, in the translation industry since 1998. In the past, she has managed localization projects for Apple Computer and translated children’s and art books. At CPSL she is specialized in international and national institutions, machine translation and post-editing.

About CPSL:  ------------------------------------------------------------------

CPSL (Celer Pawlowsky S.L.) is one of the longest-established language services providers in the translation and localization industry, having served clients for over 50 years in a range of industries including: life sciences, energy, machinery and tools, automotive and transport, software, telecommunications, financial, legal, electronics, education and government. The company offers a full suite of language services – translation and localization, interpreting and multimedia related services such as voice-over, transcription and subtitling.

CPSL is among a select number of language service suppliers that are triple quality certified, including ISO 9001, ISO 17100 and ISO 13485 for medical devices. Based in Barcelona (Spain) and with production centers in Madrid, Ludwigsburg (Germany) Boston (USA) and a sales office in the United Kingdom,the company offers language integrated solutions on both sides of the Atlantic Ocean, 24/7, 365 days a year.

CPSL has been the driving force behind the new ISO 18587 (2017) standard, which sets out requirements for the process of post-editing machine translation (MT) output. Livia Florensa, CEO at CPSL, is the architect of the standard, which has just been published. As Project Leader for this mission, she played a key role, being responsible for the proposal and for drafting and coordinating the development of the standard. ISO 18587 (2017) regulates the post-editing of content processed by machine translation systems, and also establishes the competences and qualifications that post-editors must have. The standard is intended for use by post-editors, translation service providers and their clients.

Find out more at:

The following is Storify Twitter coverage by Lucia from the EAMT conference earlier this year.


  1. Readers who are trying to get the most from MT+PE might be interested in the presentation I gave in June in Barcelona at LocWorld. It can be browsed on SlideShare at The script is available on request at

  2. Luigi, I was there and I found your presentation very inspiring, so I definetely recommend your slides. I think we are both on the same line as regards to combining quantitative and qualitative methods when evaluating MT to grasp the whole picture.