eMpTy Pages: Translation Quality -- WannaCry?

Thursday, June 8, 2017

Translation Quality -- WannaCry?

For as long as I have been engaged with the professional translation industry, I have seen that there exists great confusion and ambiguity around the concept of "translation quality". This is a services industry where nobody has been able to coherently define "quality" in a way that makes sense to a new buyer and potential customer of translation services. It also is, unfortunately, the basis of a lot of the differentiation claims made by translation agencies in competitive situations. Thus, is it surprising that many buyers of translation services are mystified and confused about what this really means?

To this day it is my sense that the best objective measures of "translation quality", imperfect and flawed though they may be, come from the machine translation community. The computational linguistics community has very clear definitions of adequacy and fluency that can be reduced to a number, and that have the perfect order that mathematics provides.

The translation industry, however, is reduced to confusing discussions where, ironically, the words and terms used in the descriptions are ambiguous and open to multiple interpretations. It is really hard to just say, "We produce translations that are accurate, fluent and natural," since we have seen that these words mean different things to different people. To add to the confusion, translation output quality discussions are often conflated with translation process related issues. I maintain that the most articulate and generally useful discussion on this issue comes from the MT and NLP communities.

I feel compelled to provide something on this subject below that might be useful to a few, but I acknowledge that this remains an unresolved issue that undermines the perceived value of the primary product that this industry produces.

Here are the basic criteria that a Translation Service Provider offering a quality service should fulfill:

a) Translation

Correct transfer of information from the source text to the target text.
Appropriate choice of terminology, vocabulary, idiom, and register in the target language.
Appropriate use of grammar, spelling, punctuation, and syntax, as well as the accurate transfer of dates, names, figures, etc. in the target language.
Appropriate style for the purpose of the text.

b) Work process

Certification in accordance with national and/or international quality standards.

Gábor Ugray provides an interesting perspective on "Translation Quality" below and again raises some fundamental questions about the value of newfangled quality assessment tools, when we have yet to clarify why we do what we do. He also provides very thoughtful guidance on the way forward and suggests some things that IMO might actually improve the quality of the translation product.

Quality definitions based on error counts etc. are possibly useful to the dying bulk market, as Gábor points out. As he says, "real quality" comes from clarifying intent, understanding the target audience, long-term communication and writing experience, and from new in situ and in-process tools that enhance the translator's work and the knowledge gained through doing it. Humans learn and improve by watching carefully when they make mistakes (how, why, where), not by keeping really accurate counts of errors made.

We desperately need new tools that go beyond the TM and MT paradigm as we know it today, and really understand what might be useful and valuable to a translator or an evolving translation process. Fortunately, Gábor is in a place where he might get some to listen to these new ideas, and even try new implementations that actually produce higher quality.

The emphasis and callouts in his post below are almost all mine.

================

An idiosyncratic mix of human and machine translation might be the key to tracing down the notorious ransomware, WannaCry. What does the incident tell us about the translating profession’s prospects? A post on – translation quality.

Quality matters, and it doesn’t

Flashpoint’s stunning linguistic analysis[1] of the WannaCry malware was easily the most intriguing piece of news I read last week (and we do live in interesting times). This one detail by itself blows my mind: WannaCry’s ransom notice was dutifully localized into no less [2] than 28 languages. When even the rogues are with us on the #L10n bandwagon, what other proof do you need that we live in a globalized age?

But it gets more exciting. A close look at those texts reveals that only the two Chinese versions and the English text were authored by a human; the other 25 are all machine translations. A typo in the Chinese suggests that a Pinyin input method was used. Substituting 帮组 bāngzǔ for 帮助 bāngzhù is indicative of a Chinese speaker hailing from a southern topolect. Other vocabulary choices support the same theory. The English, in turn, “appears to be written by someone with a strong command of English, [but] a glaring grammatical error in the note suggests the speaker is non-native or perhaps poorly educated.” According to Language Log[3], the error is “But you have not so enough time.”
I find all this revealing for two reasons. One, language matters. With a bit of luck (for us, not the hackers), a typo and an ungrammatical sentence may ultimately deliver a life sentence for the shareholders of this particular venture. Two, language matters only so much. In these criminals’ cost-benefit analysis, free MT was exactly the amount of investment those 25 languages deserved.

This is the entire translating profession’s current existential narrative in a nutshell. One, translation is a high-value and high-stakes affair that decides lawsuits; it’s the difference between lost business and market success. Two, translation is a commodity, and bulk-market translators will be replaced by MT real soon. Intriguingly, the WannaCry story seems to support both of these contradictory statements.

Did the industry sidestep the real question?

I remember how 5 to 10 years ago panel discussions about translation quality were the most amusing parts of conferences. Quality was a hot topic and hotly debated. My subjective takeaway from those discussions was that (a) everyone feels strongly about quality, and (b) there’s no consensus on what quality is. It was the combination of these two circumstances that gave rise to memorable, and often intense, debates.

Fast-forward to 2017, and the industry seems to have moved on from this debate, perhaps admitting through its silence that there’s no clear answer.

Or is there? The heated debates may be over, but quality assessment software seems to be all the rage. There’s TAUS’s DQF initiative[4]. Its four cornerstones are (1) content profiling and knowledge base; (2) tools; (3) a quality dashboard; (4) an API. CSA’s Arle Lommel just wrote [5] about three new QA tools on the block: ContentQuo, LexiQA, and TQAuditor. Trados Studio has TQA, and memoQ has LQA, both built-in modules for quality assessment.

I have a bad feeling about this. Could it be that the industry simply forgot that it never really answered the two key questions, What is quality? and How do you achieve it? Are we diving headlong into building tools that record, measure, aggregate, compile into scorecards and visualize in dashboards, without knowing exactly what and why?

A personal affair with translation quality

I recently released a pet project, a collaborative website for a German-speaking audience. It has a mix of content that’s partly software UI, partly long-form, highly domain-specific text. I authored all of it in English and produced a rough German translation that a professional translator friend reviewed meticulously. We went over dozens of choices ranging from formal versus informal address to just the right degree of vagueness where vagueness is needed, versus compulsive correctness where that is called for.

How would my rough translation have fared in a formal evaluation? I can see the right kind of red flags raised for my typos and lapses of grammar, for sure. But I cannot for the life of me imagine how the two-way intellectual exchange that made up the bulk of our work can be quantified. It’s not a question of correct vs. incorrect. The effort was all about clarifying intent, understanding the target audience, and making micro-decisions at every step of the way in order to achieve my goals through the medium of language.

Lessons from software development

The quality evaluation of translations has a close equivalent in software development.

CAT tools have automatic QA that spots typos, incorrect numbers, deviations from terminology, wrong punctuation and the like. Software development tools have on-the-fly syntax checkers, compiler errors, code style checkers, and static code analyzers. If that’s gobbledygook for you: they are tools that spot what’s obviously wrong, in the same mechanical fashion that QA checkers in CAT tools spot trivial mistakes.
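To make "mechanical" concrete, here is a minimal sketch in Python of two such checks: a number comparison and a glossary lookup. It is an illustration under stated assumptions, not how any particular CAT tool implements its QA. The function names, the regular expression, and the sample glossary are all hypothetical.

```python
import re
from collections import Counter

# Assumption: a "number" is a run of digits, optionally with . or , groups.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def number_mismatches(source, target):
    """Return the numbers that occur more often in source than in target."""
    src = Counter(NUMBER.findall(source))
    tgt = Counter(NUMBER.findall(target))
    # Counter subtraction keeps only the positive differences.
    return sorted((src - tgt).elements())


def missing_terms(source, target, glossary):
    """Return expected target terms that are absent from the translation.

    glossary maps a source-language term to its required target term.
    Matching is a naive case-insensitive substring test.
    """
    src_lower, tgt_lower = source.lower(), target.lower()
    return [
        tgt_term
        for src_term, tgt_term in glossary.items()
        if src_term.lower() in src_lower and tgt_term.lower() not in tgt_lower
    ]


if __name__ == "__main__":
    print(number_mismatches(
        "Pay 300 dollars within 3 days",
        "Zahlen Sie 300 Dollar innerhalb von 7 Tagen",
    ))
    print(missing_terms(
        "Open the translation memory",
        "Öffnen Sie das Übersetzungsarchiv",
        {"translation memory": "Translation Memory"},
    ))
```

A check this naive knows nothing about locale conventions: it would flag "1,000" rendered as "1.000" as an error even where that is the correct target-language format. That is exactly the sense in which these tools spot only what is mechanically detectable.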

With the latest surge of quality tools, CAT tools now have quality metrics based on input from human evaluators. Software developers have testers, bug tracking systems and code reviews that do the same.

But that’s where the similarities end. Let me let you in on a secret. No company anywhere evaluates or incentivizes developers through scorecards that show how many bugs each developer produced.

Some did try, 20+ years ago. They promptly changed their mind or went out of business.[6]

Ugly crashes notwithstanding, the software industry as a whole has made incredible progress. It is now able to produce more and better applications than ever before. Just compare the experience of Gmail or your iPhone to, well, anything you had on your PC in the early 2000s.

The secret lies in better tooling, empowering people, and in methodologies that create tight feedback loops.

Tooling, empowerment, feedback

In software, better tooling means development environments that understand your code incredibly well, give you automatic suggestions, allow you to quickly make changes that affect hundreds of files, and to instantly test those changes in a simulated environment.

No matter how you define quality, in intellectual work, it improves if people improve. People, in turn, improve through making mistakes and learning from them. That is why empowerment is key. In a command-and-control culture, there’s no room for initiative; no room for mistakes; and consequently, no room for improvement.

But learning only happens through meaningful feedback. That is a key ingredient of methodologies like agile. The aim is to work in short iterations; roll out results; observe the outcome; adjust course. Rinse and repeat.

Takeaways for the translation industry

How do these lessons translate (no pun intended) to the translation industry, and how can technology be a part of that?

The split. It’s a bit of an elephant in the room that the so-called bulk translation market is struggling. Kevin Hendzel wrote about this in very dramatic terms in a recent post[7]. There is definitely a large amount of content where clients are bound to decide, after a short cost-benefit analysis, that MT makes the most sense. Depending on the circumstances it may be generic MT or the more expensive specialized flavor, but it will definitely not be human translators. Remember, even the WannaCry hackers made that choice for 25 languages.

But there is, and will always be, a massive and expanding market for high-quality human translation. Even from a purely technological angle, it’s easy to see why MT systems don’t translate from scratch. They extrapolate from existing human translations, and those need to come from somewhere.

My bad feeling. I am concerned that the recent quality assessment tools make the mistake of addressing the fading bulk market. If that’s the case, the mistake is obvious: no investment will yield a return if the underlying market disappears.

Source: TAUS Quality Dashboard [link]

Why do I think that is the case? Because the market that will remain is the high-quality, high-value market, and I don’t see how the sort of charts shown in the image above will make anyone a better translator.

Let’s return to the problems with my own rough translation. There are the trivial errors of grammar, spelling and the like. Those are basically all caught by a good automatic QA checker, and if I want to avoid them, my best bet is a German writing course and a bit of thoroughness. That would take me to an acceptable bulk translator level.

As for the more subtle issues – well, there is only one proven way to improve there. That way involves translating thousands of words every week, for 5 to 10 years on end, and having intense human-to-human discussions about those translations. With that kind of close reading and collaboration, progress doesn’t come down to picking error types from a pre-defined list.

Feedback loops. Reviewer-to-translator feedback would be the equivalent of code reviews in software development, and frankly, that is only part of the picture. That process takes you closer to software that is beautifully crafted on the inside, but it doesn’t take you closer to software that solves the right problems in the right way for its end users. To achieve that, you need user studies, frequent releases and a stable process that channels user feedback into product design and development.

Imagine a scenario where a translation’s end users can send feedback, which is delivered directly to the person who created that translation. I’ll let you in on one more secret: this is already happening. For instance, companies that localize MMO (massively multiplayer online) games receive such feedback in the form of bug reports. They assign those straight to translators, who react to them in a real-time collaborative translation environment like memoQ server. Changes are rolled out on a daily basis, creating a really tight and truly agile feedback loop.

Technology that empowers and facilitates. For me, the scenario I just described is also about empowering people. If, as a translator, you receive direct feedback from a real human, say a gamer who is your translation’s recipient, you can see the purpose of your work and feel ownership. It’s the agile equivalent of naming the translator of a work of literature.

If we put metrics before competence, I see a world where the average competence of translators stagnates. Instead of an upward quality trend throughout the ecosystem, all you have is a fluctuation, where freelancers are data points that show up on this client’s quality dashboard today, and a different client’s tomorrow, moving in endless circles.

I disagree with Kevin Hendzel on one point: technology definitely is an important factor that will continue to shape the industry. But it can only contribute to the high-value segment if it sees its role in empowerment, in connecting people (from translators to end users), in facilitating communication, and in establishing tight and actionable feedback loops. The only measure of translation quality that everyone agrees on, after all, is fitness for purpose.

References

[1] Attribution of the WannaCry ransomware to Chinese speakers. Jon Condra, John Costello, Sherman Chu
https://www.flashpoint-intel.com/blog/linguistic-analysis-wannacry-ransomware/
[2] Fewer, for the pedants.
[3] Linguistic Analysis of WannaCry Ransomware Messages Suggests Chinese-Speaking Authors. Victor Mair
http://languagelog.ldc.upenn.edu/nll/?p=32886
[4] DQF: Quality benchmark for our industry. TAUS
https://www.taus.net/evaluate/dqf-background
[5] Translation Quality Tools Heat Up: Three New Entrants Hope to Disrupt the Industry. Arle Lommel, Common Sense Advisory blog.
http://www.commonsenseadvisory.com/Default.aspx?Contenttype=ArticleDetAD&tabID=63&Aid=39177&moduleId=390
[6] Incentive Pay Considered Harmful. Joel On Software, April 3, 2000
https://www.joelonsoftware.com/2000/04/03/incentive-pay-considered-harmful/
[7] Creative Destruction Engulfs the Translation Industry: Move Upmarket Now or Risk Becoming Obsolete. Kevin Hendzel, Word Prisms blog.
http://www.kevinhendzel.com/creative-destruction-engulfs-translation-industry-move-upmarket-now-risk-becoming-obsolete/

Gábor Ugray is co-founder of Kilgray, creators of the memoQ collaborative translation environment and TMS. He is now Kilgray’s Head of Innovation, and when he’s not busy building MVPs, he blogs at jealousmarkup.xyz and tweets as @twilliability.

21 comments:

Kirti Vashee, June 9, 2017 at 12:06 PM
Luigi,

Thanks for the comments.

Much of the discussion on translation quality is frustrated by the vagueness of the concept in general in the translation industry, so perhaps I do overstate the precision of the definitions in the CL community. However, compared to the super vague discussion in the translation industry, this:
https://www.taus.net/academy/best-practices/evaluate-best-practices/adequacy-fluency-guidelines

and this which attempts to get A & F scores without a reference set:
http://www.aclweb.org/anthology/P/P11/P11-2027.pdf are dramatically clearer than anything in the "industry" on what adequacy and accuracy mean, and how they could be measured.

I did actually provide a lamish attempt at a definition, just under my comment: "I feel compelled to provide something on this subject". I agree it is pretty bad.

And BLEU with multiple references (4 if possible) is actually quite useful if this is properly done, which is often the rub.

This is a messy issue to discuss since the discussion generates so little of value, but that may be exactly why we need to keep trying until we find a way to do this in a way that is actually useful and valuable to many. Remember that MT is a path that is littered with failure -- repeated and frequent failure. Yet we continue to try.

Vassilis Korkas, June 9, 2017 at 2:13 PM
Hello Luigi,

Thank you for your opening comments that help take the discussion forward. I think Gábor makes some very interesting points, from whichever angle in the industry we happen to be looking at this topic.

I'd like to make an observation on your comment about QA tools. The many false positives that QA tools traditionally generate can be decreased substantially enough to make a difference in the QA process. The "secret recipe" is to employ locale-specific checks that would apply to every target locale. The challenge with this method is that it takes a lot of time to develop and tailor and refine these algorithms in order to have better results. However, there can be better results and I can vouch for that through my personal experience at lexiQA.

I should also note that we are not using Levenshtein distance metrics to achieve that. Mainstream QA tools use that measurement as the focus of their checks is consistency both at the word/unit level and the segment level. However, consistency checks alone will obviously produce a very high FP rate. This is a topic that I also happen to discuss extensively in a recent article series about linguistic QA (https://www.linkedin.com/pulse/future-linguistic-quality-assurance-localization-vassilis-korkas).
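For readers unfamiliar with the term, the following is a minimal Python sketch of a Levenshtein-based consistency check of the kind described above as typical of mainstream QA tools. It is an illustration only, not lexiQA's approach and not the implementation of any specific tool. The function names and the `max_distance` threshold are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from a
                curr[j - 1] + 1,     # insert a character into a
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


def inconsistent_pairs(segments, max_distance=2):
    """Flag index pairs whose sources nearly match but whose targets differ.

    segments is a list of (source, target) tuples.
    max_distance is an assumed, hand-picked threshold.
    """
    flagged = []
    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            (src_a, tgt_a), (src_b, tgt_b) = segments[i], segments[j]
            if levenshtein(src_a, src_b) <= max_distance and tgt_a != tgt_b:
                flagged.append((i, j))
    return flagged


if __name__ == "__main__":
    segments = [
        ("Save file", "Datei speichern"),
        ("Save file.", "Datei sichern."),
        ("Open file", "Datei öffnen"),
    ]
    print(inconsistent_pairs(segments))
```

The first two segments are flagged because their sources differ by one character while their targets differ. Whether that is a real inconsistency or a legitimate variation is something the distance alone cannot decide, which is how consistency checks of this kind end up with many false positives.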

Kirti Vashee, June 9, 2017 at 3:28 PM
Hi Vassilis,

Would you like to publish your post in eMpTy Pages as well -- we can link the two posts as a kind of thread of thinking on a subject that has vexed us all for a long time?

Let me know.

Thanks

Kirti

Vassilis Korkas, June 9, 2017 at 3:40 PM
Hello Kirti,

By all means, I'd be happy for the article to be reposted here. The more people we can engage in this conversation the better.

All best,
Vassilis

Kirti Vashee, June 9, 2017 at 4:03 PM
Please contact me via the details provided in the About section.

Victor Dewsbery, June 12, 2017 at 3:30 AM
Hmmm, two contrasting articles in a single blog post. I enjoyed Gábor's article, but I am mystified by Kirti's introduction, and especially the regret expressed about the lack of an agreed definition of quality.

Can we agree on what is good music? On good taste in fashion? On the best political convictions, economic policies, product design or other areas? Why should we be surprised about our lack of agreement on what translation quality means?

I am even more mystified by Luigi's insistence on the primacy of measurable factors (e.g. in his suggestion that style and register are not measurable and therefore non-existent), about his assumption that we all regard translation as an "industry", and even more so about his outright rejection of the premium market (or segment, or whatever he may call it) simply because he has not experienced it.

Translation is not a single entity. There are many different use cases, skill levels, production scenarios, career patterns and requirements, and in practice the translation sector is broken down into many different markets. We cannot DEFINE what quality is in a once-and-for-all statement (and still less in an authoritative formula). We can DESCRIBE various types of translation project in terms of the use case needs, the possible tools, the requirements for the standard of the output etc. But to do this, we need to use WORDS. And some people in the translation scene are not comfortable using words - they can only make sense of the discussion by using NUMBERS.

And that, in itself, is an interesting pointer to the vested interests in the debate.

Victor Dewsbery, June 13, 2017 at 1:51 AM
You write: "as long as you provide no objective criteria, style and register remain so subjective to be non-significant for any measurement whatsoever. And this is exactly what has been happening in translation for centuries".
In other words, you regard "measurement" as the prime standard by which everything else must be judged, the "holy grail" of translation. That may be relevant to some types of translation project, especially if you are working on the refinement of MT solutions.
But in the type of work I deal with, a subjective evaluation by a competent expert is the best way to judge the quality of a translation, and any numerical measurement is usually irrelevant.
As you so kindly point out, this approach is based on many centuries of experience, and I do not believe that all of history should be dispatched to the rubbish heap lock-stock-and-barrel.

Kirti Vashee, June 13, 2017 at 9:25 AM
Victor I don't think it is necessary to always have numerical measures except where it is useful. But I think every NEW client will want a way to define this so that they can use it, know what it is likely to be before they buy it and have more than a general idea that the product is ready. The more clarity on the definition of quality upfront, the easier the conversation is after the work is delivered. Clear Quality definition helps create expectation equivalency.

Victor Dewsbery, June 14, 2017 at 5:20 AM
Luigi, thanks for this creative portrayal.

Arnold, June 21, 2017 at 1:06 AM
This is an interesting discussion! May I add a new notion? In this article on tcworld (http://bit.ly/2szM0Rm), there's the idea to combine the "human element" with "numbers" to get a grasp on translation quality - based on pre-defined expectations.

Victor Dewsbery, June 22, 2017 at 3:45 AM
Hi Arnold, I read your link, and while the appeal to treat the review process as a constructive part of the process seems logical in theory, in practice I feel that the shelf life of appeals to try harder, get things right, give the translator instant feedback, use controlled language in source texts etc. is probably not much longer than last year's new year's resolutions. And I didn't notice any significant use of numbers in the article either (apart from a pretty graph with no obvious application), so I am not sure that the MT advocates will be more impressed than I am.

Victor Dewsbery, June 26, 2017 at 3:48 AM
Hi Arnold, I'm not surprised that TAUS comes into this with its confession of faith, i.e. that "quality is measurement", or that an appeal is made to "big data". If you have been modestly aware of my blog, you may already know that I am not exactly a TAUS fanboy.
The problem is that we live in different universes. While TAUS and consorts dine on "big data" and crunch quality measurement numbers for after-dinner recreation, I continue to translate complex texts which regularly contain terminology that even Google has hardly ever heard of (not to mention the syntactical contortions that are often involved), and I have clients who are grateful when I point out logical inconsistencies in the source text.
I have no doubt that the mass-produced translation "industry" exists, and that MT and semi-automated quality evaluation are used in that "industry". But I tend to freak out when MT enthusiasts scold translators like me for not getting on the numbers bandwagon. And I then laugh when MT apologists suggest that translators should engage with the MT community to improve the situation.
If chickens were to spend their time negotiating hunting rules with the fox, where would you find a boiled egg in ten years' time?

Gertrude, June 28, 2017 at 2:52 AM
This is a very nice article