Tuesday, October 8, 2019

Post-editese is real

Ever since machine translation was introduced into the professional translation industry, there have been questions about its impact on the final delivered translation product. For much of the history of MT, many translators claimed that while production work using a post-edited MT (PEMT) process was faster, the final product was not as good. The research suggests this has been true from a strictly linguistic perspective, but many of us also know that PEMT worked quite successfully with technical content, especially with respect to terminology and consistency, even in the days of SMT and RBMT.

As NMT systems proliferate, we are at a turning point, and I suspect we will see many more NMT systems that provide useful output and clearly enhance translator productivity, especially systems built by experts. NMT is also likely to influence output quality, and the difference between MT and human output is likely to become less prominent. This is what developers mean when they claim to have achieved human parity: if competent human translators cannot tell whether the segments they review came from MT or not, we can make a limited claim of having achieved human parity. This does not mean the claim will hold for every new sentence submitted to the system.

We should also understand that MT provides the greatest value in use scenarios where you have large volumes of content (millions rather than thousands of words), short turnaround times, and limited budgets. Increasingly, MT is used in scenarios where little or no post-editing is done, and by many informed estimates we are already at a run rate of a trillion words a day going through MT engines. While post-editese may be an important consideration in localization use scenarios, localization is likely no more than 2% of all MT usage.

Enterprise MT use is rapidly moving into a phase where MT is an enterprise-level IT resource. The modern global enterprise needs to allow millions of words to be translated on demand in a secure and private way, with MT integrated deeply into critical communication, collaboration, and content creation and management software.

The research presented below by Antonio Toral documents the impact of post-editing on the final output across multiple language combinations and MT systems.


This is a summary of the paper “Post-editese: an Exacerbated Translationese” by Antonio Toral, which was presented at MT Summit 2019, where it won the best paper award.


Post-editing (PE) is widely used in the translation industry, mainly because it leads to higher productivity than unaided human translation (HT). But what about the resulting translation? Are PE translations as good as HT? Several research studies have looked at this in the past decade and there seems to be consensus: PE is as good as HT or even better (Koponen, 2016).

Most of these studies measure the quality of translations by counting the number of errors therein. Taking into account that there is more to quality than just the number of mistakes, we ask a different question instead: are there differences between translations produced with PE vs HT? In other words, do the final outputs created via PE and HT have different traits?

Previous studies have unveiled the existence of translationese, i.e. the fact that HTs and original texts exhibit different characteristics. These characteristics can be grouped under the so-called translation universals (Baker, 1993) and fundamental laws of translation (Toury, 2012), namely simplification, normalization, explicitation and interference. Along this line of thinking, we aim to unveil the existence of post-editese (i.e. the fact that PEs and HTs exhibit different characteristics) by comparing PEs and HTs using a set of computational analyses that align with the aforementioned translation universals and laws of translation.


We use three datasets in our experiments: Taraxü (Avramidis et al., 2014), IWSLT (Cettolo et al., 2015; Mauro et al., 2016) and Microsoft “Human Parity” (Hassan et al., 2018). These datasets cover five different translation directions and allow us to assess the effect of machine translation (MT) systems from 2011, 2015-16 and 2018 on the resulting PEs.


Lexical Variety

We assess the lexical variety of a translation (HT, PE or MT) by calculating its type-token ratio (TTR):

TTR = number of types (unique words) / number of tokens (total words)

In other words, given two equally long translations (same number of words), the one with the bigger vocabulary (higher number of unique words) has a higher TTR and is therefore considered lexically richer, i.e. higher in lexical variety.
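The idea can be sketched in a few lines of Python. This is a minimal illustration, assuming plain whitespace tokenization with lowercasing; the paper's exact tokenizer may differ:

```python
def type_token_ratio(text):
    """Type-token ratio: unique words (types) / total words (tokens).

    A minimal sketch; tokenization here is whitespace splitting
    with lowercasing, which a real implementation would refine.
    """
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# "the" appears twice, so 5 types over 6 tokens
print(type_token_ratio("the cat sat on the mat"))
```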

The following figure shows the results for the Microsoft dataset for the Chinese-to-English direction (zh–en; the results for the other datasets follow similar trends and can be found in the paper). HT has the highest lexical variety, followed by PE, while the lowest values are obtained by the MT systems. A possible interpretation is as follows: (i) lexical variety is low in MT because these systems prefer the translation solutions that are frequent in their training data, and (ii) a post-editor adds lexical variety to some degree (the difference in the figure between MT and PE), but because the MT output primes them (Green et al., 2013), the resulting PE translation does not reach the lexical variety of HT.

Lexical Density

The lexical density of a text indicates its amount of information and is calculated as follows:

lexical density = number of content words / total number of words

where content words correspond to adverbs, adjectives, nouns, and verbs. Hence, given two equally long translations, the one with the higher number of content words is considered to have higher lexical density, in other words, to contain more information.
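A sketch of the computation, assuming (for illustration only) a Universal-POS-style tagset where ADV, ADJ, NOUN, PROPN and VERB count as content words; the paper's tagger and tagset may differ:

```python
# Hypothetical content-word tag set in Universal-POS style;
# the actual tagger used in the paper may use a different inventory.
CONTENT_TAGS = {"ADV", "ADJ", "NOUN", "PROPN", "VERB"}

def lexical_density(tagged_tokens):
    """Lexical density: content words / all words.

    `tagged_tokens` is a list of (word, pos_tag) pairs, i.e. the
    output of any part-of-speech tagger.
    """
    if not tagged_tokens:
        return 0.0
    content = sum(1 for _, tag in tagged_tokens if tag in CONTENT_TAGS)
    return content / len(tagged_tokens)

# Toy example: 3 content words ("quick", "fox", "jumps") out of 4 tokens
sample = [("the", "DET"), ("quick", "ADJ"), ("fox", "NOUN"), ("jumps", "VERB")]
print(lexical_density(sample))  # 0.75
```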

The following figure shows the results for the three translation directions in the Taraxü dataset: English-to-German, German-to-English and Spanish-to-German. The lexical density in HT is higher than in both PE and MT and there is no systematic difference between the latter two.

Length Ratio

Given a source text (ST) and a target text (TT), where the TT is a translation of the ST (HT, PE or MT), we compute a measure of how different in length the TT is with respect to the ST:

length ratio = |length(TT) − length(ST)| / length(ST)

This means that the bigger the difference in length between the ST and the TT (be it because the TT is shorter or longer than the ST), the higher the length ratio.

The following figure shows the results for the Taraxü dataset. The trend is similar to the one in lexical variety; that is, HT obtains the highest result, MT the lowest, and PE lies somewhere in between. We interpret this as follows: (i) MT results in a translation of similar length to that of the ST due to how the underlying MT technology works, and PE is primed by the MT output, while (ii) a translator working from scratch may translate more freely in terms of length.

Part-of-speech Sequences

Finally, we assess the interference of the source language on a translation (HT, PE or MT) by measuring how close the sequence of part-of-speech tags in the translation is to the typical part-of-speech sequences of the source language and to the typical part-of-speech sequences of the target language. If the sequences of a translation are similar to the typical sequences of the source language, that indicates interference from the source language in the translation.

The following figure shows the results for the IWSLT dataset. The metric used is perplexity difference; the higher it is, the lower the interference (full details on the metric can be found in the paper). Again, we find a trend similar to some of the previous analyses: HT gets the highest results, MT the lowest, and PE falls somewhere in between. The interpretation is again similar: MT outputs exhibit a large amount of interference from the source language; a post-editor removes some of that interference, but the resulting translation still has more interference than an unaided translation.
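The perplexity-difference idea can be sketched as follows, assuming (purely for illustration) a toy add-one-smoothed bigram model over PoS tags; the paper's actual language models, tagsets, and training data differ:

```python
import math
from collections import Counter

def train_bigram_lm(tag_sequences):
    """Train an add-one-smoothed bigram model over PoS tag sequences.

    Toy sketch: the paper trains PoS language models on original
    source-language and target-language text.
    """
    bigrams, unigrams, vocab = Counter(), Counter(), set()
    for seq in tag_sequences:
        padded = ["<s>"] + seq
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            unigrams[prev] += 1
    return bigrams, unigrams, len(vocab)

def perplexity(model, seq):
    """Perplexity of one tag sequence under a trained bigram model."""
    bigrams, unigrams, v = model
    padded = ["<s>"] + seq
    log_prob = 0.0
    for prev, cur in zip(padded, padded[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + v)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(seq))

# Perplexity difference: PPL under the source-language tag model minus
# PPL under the target-language tag model. A higher value means the
# translation's tag sequences look less like the source language,
# i.e. less interference.
src_lm = train_bigram_lm([["NOUN", "VERB", "NOUN"]])
tgt_lm = train_bigram_lm([["DET", "NOUN", "VERB"]])
translation_tags = ["DET", "NOUN", "VERB"]
diff = perplexity(src_lm, translation_tags) - perplexity(tgt_lm, translation_tags)
print(diff > 0)  # True: the sequence fits the target-language model better
```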


The findings from our analyses can be summarised as follows in terms of HT vs PE:
  • PEs have lower lexical variety and lower lexical density than HTs. We link these to the simplification principle of translationese. Thus, these results indicate that post-editese is lexically simpler than translationese.
  • Sentence length in PEs is more similar to the sentence length of the source texts than sentence length in HTs. We link this finding to interference and normalization: (i) PEs have interference from the source text in terms of length, which leads to translations that follow the typical sentence length of the source language; (ii) this results in a target text whose length tends to become normalized.
  • Part-of-speech (PoS) sequences in PEs are more similar to the typical PoS sequences of the source language than PoS sequences in HTs. We link this to the interference principle: the sequences of grammatical units in PEs preserve to some extent the sequences that are typical of the source language.

In terms of the role of MT: we considered not only HTs and PEs but also the MT outputs from the MT systems that were the starting point for producing the PEs. This is to corroborate a claim in the literature (Green et al., 2013), namely that in PE the translator is primed by the MT output. We therefore expected the trends found in PEs to appear in the MT outputs as well, and this was indeed the case in all four analyses. In some experiments, the results of PE fell somewhere in between those of HT and MT. Our interpretation is that a post-editor improves the initial MT output but, being primed by that output, cannot attain the level of HT, so the footprint of the MT system remains in the resulting PE.


As noted in the introduction, we know that PE is faster than HT. The question I wanted to address, then, was: can PE not only be faster but also reach the level of HT quality-wise? In this study, this is examined from the point of view of translation universals, and the answer is clear: no. However, I'd like to point out three additional elements:
  1. The text types in the 3 datasets I have used are news and subtitles, both of which are open-domain and could be considered to a certain extent "creative". I wonder what happens with technical texts, given their relevance for industry, and I plan to look at that in the future.
  2. As mentioned in the introduction, previous studies have compared HT vs PE in terms of the number of errors in the resulting translation. In all the studies I've encountered, PE is at the level of HT or even better. Thus, for technical texts where terminology and consistency are important, PE is probably better than HT. I thus find the choice between PE and HT to be a trade-off between consistency on the one hand and translation universals (simplification, normalization and interference) on the other.
  3. PE falls behind HT in terms of translation universals because MT falls behind HT in those terms. However, this may not be the case anymore in the future. For example, the paper shows that PE-NMT has less interference than PE-SMT, thanks to the better reordering in the former.

Antonio Toral is an Assistant Professor at the Computational Linguistics group, Center for Language and Cognition, Faculty of Arts, University of Groningen (The Netherlands). His research is in the area of Machine Translation. His main topics include resource acquisition, domain adaptation, diagnostic evaluation and hybrid approaches.

Related Work

Other work has previously looked at HT vs PE beyond the number of errors. The most related papers to this paper are Bangalore et al. (2015), Carl and Schaeffer (2017), Czulo and Nitzke (2016), Daems et al. (2017) and Farrell (2018).


Avramidis, Eleftherios, Aljoscha Burchardt, Sabine Hunsicker, Maja Popović, Cindy Tscherwinka, David Vilar, and Hans Uszkoreit. 2014. The Taraxü corpus of human-annotated machine translations. In LREC, pages 2679–2682.

Baker, Mona. 1993. Corpus linguistics and translation studies: Implications and applications. Text and technology: In honour of John Sinclair, pages 233–250.

Bangalore, Srinivas, Bergljot Behrens, Michael Carl, Maheshwar Gankhot, Arndt Heilmann, Jean Nitzke, Moritz Schaeffer, and Annegret Sturm. 2015. The role of syntactic variation in translation and post-editing. Translation Spaces, 4(1):119–144.

Carl, Michael and Moritz Jonas Schaeffer. 2017. Why translation is difficult: A corpus-based study of non-literality in post-editing and from-scratch translation. Hermes, 56:43–57.

Cettolo, Mauro, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, Roldano Cattoni, and Marcello Federico. 2015. The IWSLT 2015 evaluation campaign. In IWSLT 2015, International Workshop on Spoken Language Translation.

Green, Spence, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In CHI 2013, pages 439–448.

Hassan, Hany, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, Will Lewis, Mu Li, Shujie Liu, Tie-Yan Liu, Renqian Luo, Arul Menezes, Tao Qin, Frank Seide, Xu Tan, Fei Tian, Lijun Wu, Shuangzhi Wu, Yingce Xia, Dongdong Zhang, Zhirui Zhang, and Ming Zhou. 2018. Achieving Human Parity on Automatic Chinese to English News Translation.

Koponen, Maarit. 2016. Is machine translation post-editing worth the effort? A survey of research into post-editing and effort. Journal of Specialised Translation, 25(25):131–148.

Mauro, Cettolo, Niehues Jan, Stüker Sebastian, Bentivogli Luisa, Cattoni Roldano, and Federico Marcello. 2016. The IWSLT 2016 evaluation campaign. In International Workshop on Spoken Language Translation.

Toury, Gideon. 2012. Descriptive translation studies and beyond: Revised edition, volume 100. John Benjamins Publishing.


  1. An interesting additional field of study would be the "Wrong end of the stick" syndrome. In other words, if the MT misinterprets the source text and actually states the opposite (e.g. by missing or misinterpreting a semantic negative), but does so in superficially decent wording, in what proportion of PEMT cases is this error actually caught? Of course the same can also happen in human translation, so your analysis method would still be relevant. The difficult part would be identifying such mistranslations - I doubt whether MT of any type will ever be able to invent a formula for this.

  2. Good point, Victor! This is particularly an issue with NMT, because sometimes it produces very fluent translations whose meaning, however, is different from that of the source. Hence post-editors need to be alert to identify such errors. In other words: do not trust NMT output just because it is perfectly fluent.

  3. Kirti, "can PE not only be faster but also be at the level of HT quality-wise?" I say no. Not as long as two things prevail.

    (1) When a big-data corpus is the source of the MT, only the exceedingly predominant patterns survive and pass to the raw MT output.

    (2) Although MT itself has improved, PE "best" practices have been stagnant for over 50 years. They survive, virtually unchanged, from the early 1960s when developed for the ALPAC study.

    Of course if you change one or both practices, you no longer have today's MT nor today's PE. You have something new. So, for today's PE and today's MT, the answer still stays at "no." The possibilities are in tomorrow's PE and MT best practices.

  4. @tahoar - I'm intrigued about how you envisage new "best" practices for post-editing MT. As far as the "hardware" for the task is concerned, the combination of human eyes/brain/language competence has been at the root of post-translation editing for far longer than 50 years. The cognition of a human brain has not been duplicated yet, and I firmly believe that this will never be fully achieved. So apart from the efforts of the "quality-doesn't-matter" MT aficionados, I see no realistic prospects of a fundamental revolution in any "best practices" for PEMT.

    1. Thanks, Victor. I'll speak my opinions because there's zero academic research in this area, and that's the crux of the problem. In 70 years of MT research and 50 years of PE research, there has never been a double-blind study that evaluates how translators interact with MT.

      Never once has there been a double-blind study that measured the cognition work required to edit MT and/or the cognitive stress resulting from the work.

      Yes, there have been studies... But none have used trusted double-blind methodologies to maximize the accuracy and minimize the effects of bias. Antonio Toral's pioneering work reported here clearly demonstrates biases exist when translators work with raw MT, "in PE the translator is primed by the MT output." I would also call this a placebo effect. Double-blind methodologies are the research tool that best minimizes these biases.

      You're likely aware of double-blind studies in medical clinical trials, marketing research, and political opinion surveys. Double-blind is a fundamental, vitally important methodology to reduce the effects of bias... the participants' biases AND the test administrators' biases... hence DOUBLE-blind. Ask yourself: why have double-blind methodologies never been used within the language industry or academia to measure PE activities? My answer? No, there's no conspiracy theory here. I believe it's because MT has always been so bad that even a cursory review by anyone skilled in translation would immediately pierce the veil of "blind." Hence, why bother?

      Today, however, MT systems and their outputs have improved. Google GNMT routinely generates Edit Distance Zero (exact-match) segments at an average rate of 5%. Customized SMT systems routinely generate Edit Distance Zero segments at a rate of 30%. As MT systems are improving, how do we train AND COMPENSATE translators to work with the "good" translations? As I said above, today's PE best practices fail with this use case.
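(For readers unfamiliar with the term: an "Edit Distance Zero" segment is one the translator accepted without a single change. A minimal word-level sketch of the underlying measure, for illustration only; production tools typically work at the character level and with more edit operations:)

```python
def levenshtein(a, b):
    """Word-level edit distance (insertions, deletions, substitutions)
    via dynamic programming. A distance of 0 means the post-edited
    segment is identical to the MT suggestion (an exact match).
    """
    a, b = a.split(), b.split()
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cost = 0 if wa == wb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]

mt = "the contract ends in May"
post_edited = "the contract ends in May"
print(levenshtein(mt, post_edited))  # 0 -> an "Edit Distance Zero" segment
```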

    2. Last month I was recruited by a university in eastern Europe to help them design and build a translation package for a pioneering academic study. This study will be the first ever double-blind study that measures how translators work with various kinds of suggestions... HT suggestions from a TM... raw MT suggestions from two different MT engines... no suggestions from failed fuzzy match... MT and TM suggestions that have been judged by an academic panel as preferred translations (i.e. should be edit distance zero results)... MT and TM segments that have been deemed validated by being published in periodicals... TM with low fuzzy and exact match source scores... MT with low BLEU scores and exact target matches... the specifications are extensive.

      How do we achieve double-blind? The academic team defined the specifications for the test set's segments. They provided a corpus (that served as a TM). They tasked me to extract pairs that met their specifications and then reduce the number to 2400 by random selection. Then, I piped the 2400 sources through two MT systems... a popular online MT system and a custom MT system. We filtered those results that matched their specification and reduced the number by random selection to meet their size needs. There's no professor bias defining "good" translations other than what was deemed the specification in the beginning. The rest is random selection, much like a professional translator's luck of the draw when they receive a package from a client.

      Through all that, all segments were assigned an index number. I built an XLIFF package with the final random selections. The XLIFF ID property is the segment's index number. The XLIFF package is devoid of all TM fuzzy information and there are no MT flags. The translator is blind to where the suggestion comes from. They must rely on nothing but their cognition, not on the CAT tool's reports, when reviewing the suggestions.

      The test administrators don't know the metadata about the segments. I maintain a spreadsheet that maps each segment's ID index to the specification metadata. The researcher doesn't have access to that spreadsheet with the information until after all participants have completed the study. That will be late this year.

      I'm aware of some of the survey process but I won't try to explain it here. The researcher will publish those details in his thesis (dissertation?). I'm excited to be a part of this pioneering research. We're venturing into the world of learning what we don't know about what we don't know.

      I believe we need to understand the extent and effects of the biases on all sides... translator biases, agency biases, LSP's biases, all biases... before we can prescribe any meaningful and effective steps to improve PE. This pioneering survey is a great first step to discovering these biases. I sincerely hope that more language industry and academia research will employ double-blind methodologies.

  5. Really, really interesting article, Antonio. Thanks for sharing!