Monday, March 29, 2021

The Quest for Human Parity Machine Translation

The Challenge of Defining Translation Quality 

The subject of "translation quality" has always been a challenging communication issue for the translation industry. It is particularly difficult to explain this concept in a straightforward way to an industry outsider or a customer whose primary focus is building business momentum in international markets, and who is not familiar with localization-industry translation-quality-speak. Nowadays these customers tend to focus on creating and managing the dynamic, ever-changing content that enhances a global customer's digital journey, rather than the static content that is the more typical focus of localization managers. Thus, the conventional way in which translation quality is discussed by LSPs is not very useful to them. Since every LSP claims to deliver the "best quality" or "high quality" translations, it is difficult for these buyers to tell one service provider from another on this service aspect. The quality claims between vendors thus essentially cancel out. 

These customers also differ in other ways. They need large volumes of content to be translated rapidly at the lowest cost possible, yet at a quality level that is useful in digital interactions with the enterprise. For millions of digital interactions with enterprise content, the linguistic perfection of translations is not a meaningful and achievable goal given the volume, short shelf-life, and instant turnaround expectations a digital customer will have.
As industry observer and critic Luigi Muzii describes it:
"Because of the intricacies related to the intrinsic nature of languages, objective measurement of translation quality has always been a much researched and debated topic that has borne very little fruit. The notion of understood quality level remains unsolved, together with any kind of generally accepted and clearly understood quality assessment and measurement."
The industry response to this need for a better definition of translation quality is deeply colored by the localization mindset and thus we see the emergence of approaches like the Dynamic Quality Framework (DQF). Many critics consider it too cumbersome and detailed to implement in translating modern fast-flowing content streams needed for superior digital experience. While DQF can be useful in some limited localization use-case scenarios, it will surely confound and frustrate the enterprise managers who are more focused on digital transformation imperatives.  The ability to rapidly handle and translate large volumes of DX-relevant content cost-effectively is increasingly a higher priority and needs a new and different view on monitoring quality. The quality of the translation does matter in delivering superior DX but has a lower priority than speed, cost, and digital agility.

While machines do most of the translation done on the planet today, this does not mean that there is not a role for higher value-added human translation (increasingly supported by CAT tools). If the content is a critical and high-impact communication, most of us understand that human oversight is critical for success in the business mission. And if translation involves finesse, nuance, and high art, it is probably best to leave the "translating" computers completely out of the picture. 

However, in this age of digitally-driven business transformation and momentum, competent MT solutions are essential to the enterprise mission. Increasingly, more and more content is translated and presented to target customers without EVER going through any post-editing modification. The business value of the translation is often defined by its utility to the consumer in a digital journey, basic understandability, availability-on-demand, and the overall CX impact, rather than linguistic perfection. Generally, usable accuracy and timely delivery matter more than perfect grammar and fluency. The phrase "good enough" is used both disparagingly and as a positive attribute for translation output that is useful to a customer even in a less-than-"perfect" state.

So we have a situation today where the term translation quality is often meaningless even in "human translation" because it cannot be described to an inexperienced buyer of translation services (or regular human beings) in a clear, objective, and consistently measurable way. Comparing different human translation works of the same source material is often an exercise in frustration or subjective preference at best. Every sentence can have multiple correct, accurate translations, so how do we determine what is the best translation?  Since every LSP in the industry claims to provide the "best quality", such a claim is useless to a buyer who does not wish to wade through discussions on error counts, error categories, and error monitoring dashboards that are sometimes used to illustrate translation quality.

Defining Machine Translation Output Quality

The MT development community has also had difficulty establishing a meaningful and widely useful comparative measurement for translation quality. Fortunately, it had assistance from the National Institute of Standards & Technology (NIST), which developed a methodology to compare the translation quality of multiple competing MT systems under carefully controlled evaluation protocols. NIST used a variant of BLEU scores and other measures of precision, recall, adequacy, and fluency to compare different MT systems rapidly in a standardized and transparent manner. These efforts probably helped to establish BLEU as a preferred scoring methodology for rating both evolving and competing MT systems. 

The competitive evaluation approach works when multiple systems are compared under carefully monitored test protocols, but becomes less useful when an individual developer announces "huge improvements" in BLEU scores, as it is easy to make extravagant claims of improvement that are not easily validated. Some independent evaluations widely used today provide comparisons where several systems may actually have trained on the test sets - the equivalent of giving a student the exam answers before a formal test. This makes some publicly available comparisons done by independent parties questionable and misleading. Other reference-test-set-based measurements like hLepor, Meteor, chrF, Rouge, and others are plagued by similar problems. These automated measurements are all useful, but they are unreliable indicators of absolute quality.
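To make the mechanics concrete, here is a toy sketch of the core BLEU idea - clipped n-gram precision combined with a brevity penalty. This is an illustration only, not a replacement for standardized tools like sacreBLEU; the add-one smoothing is a simplification assumed here so that a single missing n-gram order does not zero the whole score.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def toy_bleu(hypothesis, reference, max_n=4):
    """Toy sentence-level BLEU: clipped n-gram precision plus brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # clip each hypothesis n-gram count by its count in the reference
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        # add-one smoothing (a simplification) so one empty order doesn't zero the score
        log_precision += math.log((clipped + 1) / (total + 1)) / max_n
    # brevity penalty discourages gaming the metric with very short output
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
    return bp * math.exp(log_precision)
```

An identical hypothesis and reference score 1.0, and any divergence lowers the score - which is exactly why BLEU rewards similarity to one particular reference translation rather than correctness in general, and why a system trained on the test set can post inflated numbers.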

Best practices today suggest that a combination of multiple automated measures needs to be used together with human assessments of MT output to really get a handle on the relative quality of different MT systems. Again, this quickly gets messy as soon as we start asking annoying questions like:
  • What are we testing on?
  • Are we sure that these MT systems have not trained on the test data? 
  • What kind of translators are evaluating the different sets of MT output?  
  • How do these evaluators determine what is better and worse when comparing different correct translations?
  • How many sentences are needed to make a meaningful assessment and draw accurate conclusions when comparing multiple MT systems' performance on the same source material?
So we see that conducting an accurate evaluation is difficult and messy, and it is easy to draw wrong conclusions from easy-to-make errors in the evaluation process.

However, in the last few years, several MT developers have claimed to produce MT systems that have achieved human parity. This has been especially true with the advent of neural machine translation. These claims are useful for creating a publicity buzz among ignorant journalists and fear amongst some translators, but usually disappoint anybody who looks more closely.

I have been especially vocal in challenging the first of these broad human parity claims as seen here: The Google Neural Machine Translation Marketing Deception. The challenge is very specific and related to some specific choices in the research approach and how the supporting data was presented.  A few years later Microsoft claimed they reached human parity on a much narrower focus with their Chinese to English News system but also said: 
Achieving human parity for machine translation is an important milestone of machine translation research. However, the idea of computers achieving human quality level is generally considered unattainable and triggers negative reactions from the research community and end-users alike. This is understandable, as previous similar announcements have turned out to be overly optimistic. 
The goal of achieving human parity has become a way of saying that MT systems have gotten significantly better, as this Microsoft communication shows. I was also involved with the SDL claim of having "cracked Russian", yet another broad claim that human parity has been reached 😧. 

Many, who are less skeptical than I am, will assume that an MT engine claiming to have achieved human parity can produce translations of equal quality to those produced by a human translator. This can indeed be true on a small subset of carefully selected test material, but alas we find that it is not usually true for much of what we submit with high expectations to these allegedly human-parity MT engines. This is the unfortunate history of MT: over-promising and under-delivering. MT promises are so often empty promises 😏. 

While many in the translation and research communities feel a certain amount of outrage over these exaggerated claims (based on MT output they see in the results of their own independent tests) it is useful to understand what supporting documentation is used to make these claims. 

We should understand that at least among some MT experts there is no deliberate intent to deceive, and it is possible to do these evaluations with enough rigor and competence to make a reasonable claim of breakthrough progress, even if it falls short of the blessed state of human parity. 

There are basically two definitions of human parity generally used to make this claim.
Definition 1. If a bilingual human judges the quality of a candidate translation produced by a human to be equivalent to one produced by a machine, then the machine has achieved human parity.

Definition 2. If there is no statistically significant difference between human quality scores for a test set of candidate translations from a machine translation system and the scores for the corresponding human translations, then the machine has achieved human parity.
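Definition 2 can be made concrete with a small sketch. Assuming we have per-segment human quality scores for the MT output and for the reference human translations of the same source sentences (the scoring scale and the paired-bootstrap approach here are assumptions for illustration, not a prescribed protocol), the test asks whether the mean score difference is distinguishable from zero:

```python
import random
import statistics

def parity_by_definition_2(mt_scores, human_scores, n_boot=10000, alpha=0.05, seed=0):
    """Paired bootstrap test for Definition 2: scores are per-segment human
    quality judgments of MT output and human translations of the same sources.
    Returns True ("parity") if the confidence interval for the mean difference
    (human - MT) contains zero."""
    rng = random.Random(seed)
    diffs = [h - m for h, m in zip(human_scores, mt_scores)]
    boot_means = []
    for _ in range(n_boot):
        # resample the per-segment differences with replacement
        sample = [rng.choice(diffs) for _ in diffs]
        boot_means.append(statistics.mean(sample))
    boot_means.sort()
    lo = boot_means[int((alpha / 2) * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return lo <= 0.0 <= hi
```

Note how much hinges on the scores themselves: if the raters are inconsistent or unqualified, "no statistically significant difference" may only mean the measurement was too noisy to detect one.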
Again the devil is in the details, as the data and the people used in making the determination can vary quite dramatically. There are (50?) shades of grey rather than black and white facts in most cases.  The most challenging issue is that human judges and evaluators are at the heart of the assessment process. These evaluators can vary in competence and expertise and can range from bilingual subject matter experts and professionals to low-cost crowdsourced workers who earn pennies per evaluation. The other big problem is the messy, inconsistent, irrelevant, biased data underlying the assessments.

Ensuring objective, consistent human evaluation is necessary but difficult to do on the required continuous and ongoing basis. If the underlying data used in an evaluation are fuzzy and unclear, we actually move toward obfuscation and confusion rather than clarity. This can be the scientific equivalent of fake news. MT engines evolve over time, and the better the feedback, the faster the evolution, if developers know how to use this feedback to drive continuous improvements.  

Again, as Luigi Muzii states:
The problem with human evaluation is bias. The red-pen syndrome.

Indeed, human evaluation of machine translation is known for being expensive, time-consuming, and often biased, and yet it is supposed to overcome the drawbacks introduced by the limited accuracy and approximation of automatic evaluation. ... Indeed, translation quality metrics have become more and more complex and overly detailed, and always seem to be based on the error-catching [and categorization] approach that has proved costly and unreliable thus far.  


Useful Issues to Understand 

While the parity claims can be roughly true for a small sample of evaluated sentences, it is difficult to extrapolate parity to a broader range of content because it is simply not possible to do machine translation output evaluation on an MT scale (millions of sentences). Some of the same questions that obfuscate quality discussions with human translation services also apply to MT. If we cannot define what a "good translation" is for a human, how is it possible to do this for a mindless, common-sense-free machine, where instruction and direction need to be explicit and clear?  

Here are some validation and claim verification questions that can help an observer to understand the extent to which parity has been reached or also expose deceptive marketing spin that may motivate the claims.

What was the test data used in the assessments? 
MT systems are often tested and scored on news domain data which is most plentiful. This may not correlate well with system performance on the typical content in the global enterprise content domain. A broad range of different types of content needs to be included to make claims as extravagant as having reached human parity. 

What is the quality of the reference test set?
In some cases, researchers found that the test sets had been translated, and then back-translated with MTPE into the original source language. This could mean the content of the test sets would be simplified from a linguistic perspective, and thus easier to machine translate. Ideally, only expert human-created test sets should be used and should contain original source material and should not be translated data from another language.

Who produced the reference human translations being used and compared?
The reference translations against which all judgments will be made should be "good" translations. Easily said but not so easily done. If competent humans are creating the source test set sentences, the test process will be expensive. Thus, it is often more financially expedient to use MT or cheap translators to produce the test material.  This can cause a positive bias for widely used MT systems like Google Translate. 

How much data was used in the test to make the claim? 
Often human assessments are done with as little as 50 sentences, and automated scoring is rarely done with more than 2,000 sentences. Thus, drawing conclusions on how any MT system will handle the next million sentences it will process is risky, and likely to be overly optimistic. For example, when an MT developer says that over 90% of the system’s output has been labeled as a human translation by professional translators, they may be looking at a sample of only 100 or so sentences. To then claim that human parity has been reached is perhaps over-reaching.  
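The sample-size problem can be quantified with a simple sketch. If we treat "judged indistinguishable from human translation" as a binomial outcome, a normal-approximation confidence interval shows just how imprecise a 90% result from 100 sentences really is (the specific numbers below are illustrative):

```python
import math

def proportion_ci(p_hat, n, z=1.96):
    """Normal-approximation 95% confidence interval for an observed proportion
    p_hat measured over n independent judgments."""
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# "90% judged human-like" from 100 sentences vs. from 100,000 sentences
small = proportion_ci(0.9, 100)      # roughly (0.84, 0.96)
large = proportion_ci(0.9, 100_000)  # roughly (0.898, 0.902)
```

At n=100 the true rate could plausibly be anywhere from about 84% to 96%, and that says nothing about how the sample was selected; only at a much larger scale does the number itself become tight enough to support a strong claim.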

Who is making the judgments and what are their credentials?
It is usually cost-prohibitive to use expert professional translators to make the judgments and thus evaluators are often acquired on crowdsourcing platforms where evaluator and translator competence is not easily ascertained. 

It can be seen that doing an evaluation properly would be a significant and expensive task, and MT developers have to do this continuously while building a system. The process needs to be efficient, fast, and consistent. It is often only possible to do such careful tests on the most mission-critical projects, and it is not realistic to follow all these rigorous protocols for typical low-ROI enterprise projects. This is why BLEU and other "imperfect" automated quality scores are so widely used: they provide developers with continuous feedback in a fast and cost-efficient manner if they are done with care and rigor. Recently there has been much discussion about testing on documents to assess understanding of context rather than just sentences. This will add complexity, cost, and difficulty to an already difficult evaluation process, and IMO will yield very small incremental benefits in evaluative and predictive accuracy. There is a need to balance improved process recommendations against cost and the benefit of improved predictability. 

The Academic Response

Recently, several academic researchers provided feedback on their examination of these MT-at-human-parity claims. The study, “A Set of Recommendations for Assessing Human–Machine Parity in Language Translation,” is worth a look to see the many ways in which evaluations can go wrong. It showed that human evaluation of MT quality depends on three factors: “the choice of raters, the availability of linguistic context, and the creation of reference translations.”

Some findings from this report in summary:

“Professional translators showed a significant preference for human translation, while non-expert [crowdsourced] raters did not”.

“Human evaluation methods which are currently considered best practice fail to reveal errors in the output of strong NMT systems”

The authors recommend the following design changes to MT developers in their evaluation process:
  • Appoint professional translators as raters
  • Evaluate documents, not sentences
  • Evaluate fluency on top of adequacy
  • Do not heavily edit reference translations for fluency
  • Use original source texts
Most developers would say that implementing all these recommendations would make the evaluation process prohibitively expensive and slow. The researchers do agree, and welcome further studies into “alternative evaluation protocols that can demonstrate their validity at a lower cost.” Process changes need to be practical and reasonably possible, and there is a need to balance the benefits of an improved process against its cost and the gain in predictability.  

What Would Human Parity MT Look Like?

MT developers should refrain from making claims of achieving human parity until there is clear evidence that this is happening at scale. Most current claims of achieving parity are based on laughably small samples of 100 or 200 sentences. I think it would be useful to the user community at large for MT developers to refrain from making these claims until they can show all of the following:
  • Produce translations that are accurate, fluent, and truly look like they were translated by a competent human for 90% or more of a large sample (>100,000 or even 1M sentences)
  • Catch obvious errors in the source and possibly even correct these before attempting to translate 
  • Handle variations in the source with consistency and dexterity
  • Have at least some nominal amount of contextual referential capability
Note that these are things we would expect without question from an average translator. So why not from the super-duper AI machine? 

Until we reach the point where all of the above is true, it would be useful to CLEARLY state the boundary limits of the claim with key parameters underlying the claim. Such as:
  • How large the test set was (e.g. 90% of 50 sentences where parity was achieved) 
  • Descriptions of what kind of source material was tested
  • How varied the test material was: sentences, paragraphs, phrases, etc...
  • Who judged, scored, and compared the translations
For example, an MT developer might state a parity claim as follows:
We found that a sample of 45/50 original human sourced sentences translated by the new MT system were judged by a team of three crowdsourced translator/raters as indistinguishable from the translations produced by two professional human translators.  Based on this data, we claim the system has achieved "limited human parity".

Until this minimum set of capabilities is shown at MT scale (>100,000 or even 1M sentences), we should tell MT developers to STFU and give us the claim parameters in a simple, clear, summarized way, so that we can weigh the reality of the data versus the claim for ourselves.  

I am also skeptical that we will achieve human parity by 2029 as some "singularity" enthusiasts have been saying for over a decade. 
"There is not the slightest reason to believe in a coming singularity. Sheer processing power [and big data] is not pixie dust that magically solves all your problems." 
Steven Pinker 
Elsewhere, Pinker also says:
"… I’m skeptical, though, about science-fiction scenarios played out in the virtual reality of our imaginations. The imagined futures of the past have all been confounded by boring details: exponential costs, unforeseen technical complications, and insuperable moral and political roadblocks. It remains to be seen how far artificial intelligence and robotics will penetrate into the workforce. (Driving a car is technologically far easier than unloading a dishwasher, running an errand, or changing a baby.) Given the tradeoffs and impediments in every other area of technological development, the best guess is: much farther than it has so far, but not nearly so far as to render humans obsolete."

Recently some in the Singularity community have admitted that "language is hard" as you can see in this attempt to explain why AI has not mastered translation yet.

Michael Housman, a faculty member of Singularity University, explained that the ideal scenario for machine learning and artificial intelligence is something with fixed rules and a clear-cut measure of success or failure. He named chess as an obvious example and noted machines were able to beat the best human Go player. This happened faster than anyone anticipated because of the game’s very clear rules and limited set of moves.

Housman elaborated, “Language is almost the opposite of that. There aren’t as clearly-cut and defined rules. The conversation can go in an infinite number of different directions. And then of course, you need labeled data. You need to tell the machine to do it right or wrong.”

Housman noted that it’s inherently difficult to assign these informative labels. “Two translators won’t even agree on whether it was translated properly or not,” he said. “Language is kind of the wild west, in terms of data.”

Perhaps, we need to admit that human parity MT at scale is not a meaningful or achievable goal. If it is not possible to have a super-competent human translator capable of translating anything and everything with equal ease, why do we presume a machine could?

Perhaps what we really need is an MT platform that can rapidly evolve in quality with specialized human feedback. Post-editing (MTPE) today is generally NOT a positive experience for most translators. But human interaction with the machine can be a significantly better and positive experience.  Developing interactive and highly responsive MT systems that can assist, learn, and improve the humdrum elements of translation tasks instantaneously might be a better research focus. This may be a more worthwhile goal than having a God-like machine that can translate anything and everything at human parity. 

Even in the AI-will-solve-all community, we know that "language is hard" so maybe we need more focus on improving the man-machine interface, and the quality of the interaction and find more sophisticated collaborative models. Rapid evolution, intuitive and collaborative interaction, and instant learning seem like a more promising vision to me than crawling all the data on the web and throwing machine learning pixie dust at your ten trillion word TM training data. 

Getting to a point where the large majority of translators ALWAYS WANT TO USE MT because it simply makes the work easier, more pleasant, and more efficient is perhaps a better focus for the future.  I would bet also that this different vision will be a more likely path to better MT systems that consistently produce better output over millions of sentences.   

Thursday, March 25, 2021

The Impact of MT on the Freelance Translator

The Podcast

This is an interview that I did with Paul Urwin of Proz; the links below will take you to the podcast.

The conversation covers possible strategies that freelance translators can adopt to deal with PEMT and provides some guidance (hopefully) on potential new skills that professionals can develop. 

It also provides context on how valuable the translator is even with continuously improving MT and points to a growing awareness that translators are a resource whose value can only grow in importance given the never-ending momentum on content that needs to be translated.


This is Part 1.

Paul talks with machine translation expert Kirti Vashee about interactive-adaptive MT, linguistic assets, freelance positioning, how to add value in explosive content situations, e-commerce translation and the Starship Enterprise.

This is Part 2.

Paul continues the fascinating discussion with Kirti Vashee on machine translation. In this episode, they talk about how much better MT can get, which languages it works well for, data, content, pivot languages and machine interpreting.

The Future is NOT just MT

Tuesday, February 16, 2021

Building Equity In The Translation Workflow With Blockchain

This is a guest post by Bob Kuhns on the subject of blockchain use in the translation industry. He presents a very "simple" model showing how a blockchain could enable an ongoing, robust, and trusted Buyer-to-Translator business connection that could quite possibly reduce the role of LSP middlemen, whose primary value-add in business translation work today is project management and coordination. Though this is a valuable service, it often significantly increases the cost of translation, and it sometimes creates discord, disgruntlement, and enmity amongst the freelance translation-service suppliers who ultimately do the work. 

A blockchain solution is not just about technology, it’s about solving business problems that have been insolvable before due to the inability of the ecosystem to share information in a transparent, immutable, and trusted manner.  

LSPs continue to struggle in their communications on translation quality and the value of ongoing project services, so a blockchain that did in fact deliver direct-to-buyer translation services that are trusted, reliable, and predictable would indeed be a great step forward for the business translation industry. This is also exactly why some in the LSP sector would not want blockchain to succeed. However, there is also a role for a more enlightened LSP in a functioning translation blockchain, one committed to transparency, equitable sharing of business value and benefits, and ultimate customer success in all their globalization initiatives.    

Blockchains show potential to address key concerns of our digitally-driven lives, such as a lack of transparency, accountability, verifiable identity, and control of data. Blockchain has the potential to enable defined quality to be delivered at a defined price in a defined timeframe with minimal administration overhead. The extent to which it is able to do this in a clean and trusted way will likely drive its adoption. Blockchains are poised to catalyze new business models by cutting the costs of verifying the truth.  

Explanations of blockchains tend to get complicated quickly. However, in very basic language, blockchains help us certify that something is true, without someone in the middle doing checks and balances.  But recent translation market activity perhaps points to some of the vital building blocks to making blockchain more real in the translation industry.     
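In very basic terms, the "certify without a middleman" property comes from chaining cryptographic hashes: each record's hash covers the previous record's hash, so no earlier entry can be altered without breaking every later link. Here is a toy single-machine sketch - the translation-job fields are hypothetical, and a real blockchain adds distribution and consensus on top of this idea:

```python
import hashlib
import json

def add_block(chain, payload):
    """Append a block whose hash covers both the payload and the previous
    block's hash, so later tampering with earlier blocks is detectable."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    record = {"payload": payload, "prev_hash": prev_hash}
    record["hash"] = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode() + prev_hash.encode()
    ).hexdigest()
    chain.append(record)
    return chain

def verify(chain):
    """Recompute every hash; any altered payload breaks the chain."""
    prev_hash = "0" * 64
    for block in chain:
        expected = hashlib.sha256(
            json.dumps(block["payload"], sort_keys=True).encode() + prev_hash.encode()
        ).hexdigest()
        if block["hash"] != expected:
            return False
        prev_hash = block["hash"]
    return True
```

Because changing any earlier payload invalidates every block after it, participants can trust the recorded history of a job - who translated what, when, at what agreed price - without a central party doing the checks and balances.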

What might some of the elements to make a functional blockchain in the translation industry possible be? 

  • A robust TMS platform that could enable ANY buyer (not just localization buyers) to engage ANY translator to perform a necessary translation task.
  • Assistive translation technology that can be easily connected into a blockchain workflow (MT, TM, NLP Tools)
  • The ability to create self-sovereign data that would allow more equitable sharing of data. In the vision of many pioneers in the blockchain space, the “ownership” of data would switch from the organization that gathers it to the individual or organization that contributed it.
  • An Independent Translator rating, certification, ranking, identification database to enable competent resources to be identified and selected. 
  • Smart Contracts
  • And....

For those who think blockchain is still a distant dream, there is evidence that it is making meaningful headway in improving efficiency, accuracy, and transparency in some areas that have historically been project-management nightmares. The members of Tradelens, a blockchain joint venture between IBM and the shipper Maersk, control more than 60% of the world’s containership capacity. Seriously, do we really believe that a translation task has more variance and unplanned changes than these goods and trade flows? Watch the short movie clip at the link above to see it at work. Having trusted food quality from all over the world is perhaps an even more challenging scenario. Food Trust is a blockchain ecosystem that covers more than 100 organizations, including Carrefour and the top four grocery retailers in the U.S.     

It is quite likely that the old guard (executives, managers, and localization teams) will not be at the forefront of translation blockchain if it ever does become a reality. Mostly because they are "just too old, too tired, and too blind " as the movie says. Change is most often driven by the young who see the new potential and have the motivation in solving old enduring problems. 

"Disruption could also be spurred by an even younger generation. New York Times writer David Brooks traveled to college campuses to understand how students see the world. In a story, he wrote after the experience, starkly titled “A Generation Emerging from the Wreckage,” Brooks describes a cohort with diminished expectations. Their lived experience includes the Iraq war, the financial crisis, police brutality, political fragmentation, and the advent of fake news as a social force. In short, an entire series of important moments in which “big institutions failed to provide basic security, competence, and accountability.” To this cohort, in particular, blockchains’ promise of decentralization, with its built-in ability to ensure trust, is tantalizing. To circumvent and disintermediate institutions that have failed them is a ray of hope—as is establishing trust, accountability, and veracity through technology, or even the potential to forge new connections across fragmented societies.

This latent demand is well aligned with the promise of blockchains. While it’s a long road to maturity, these social forces provide a receptive environment, primed and ready for the moment entrepreneurs strike the right formula. "             

Alison McCauley, author of Unblocked 


News recently has shed light on unfair work arrangements for freelancers or “gig workers” with Uber being a prominent example, and now more industries have begun to examine their relationships with freelancers. Though on a modest scale, this self-examination has come to the translation industry as well. The proposed translation model is grounded on the idea that blockchain can bring equity to translators while streamlining the translation workflow.

The Realization For Change

Even before the pandemic, the translation industry was changing and self-reflecting. NMT took center stage and the less-than-equitable working relationships for translators gained notice [1]. In “The State of the Linguist Supply Chain,” Common Sense Advisory examined the translation supply chain from the perspective of over 7,000 linguists, 75% of whom are freelancers representing 178 countries and 155 language pairs [2]. This data-rich survey identified many discrepancies between translators/linguists and their clients - the Buyers of translations and LSPs.

Several illustrative takeaways are:

  • Over half (54%) of the respondents could not live solely on their translation income.
  • Linguists are attracted to the translation profession because of flexible hours (91%) and the diversity of projects (75%). Only 33% rated their pay as being a plus.
  • The frustrations of linguists include fluctuating income (65%), irregularity of work (57%), and lack of respect (25%).
  • There is a preference for working directly for clients (65%) because translators earn more (80%), have more job flexibility (56%), and receive quicker payments: 76% of clients pay in less than 30 days, compared with only 32% of LSPs.
  • The largest benefit of working for LSPs is more work (71%).
  • Translators perceived that cost (40%) and speed (33%) matter more to their clients than quality.

Some of the largest challenges facing translators are finding clients (55%), negotiating prices (50%), and dealing with tight deadlines (35%). Just before the pandemic, linguists felt the market shifting to lower prices (64%) and faster turnaround times (56%).

Though not included here, linguists’ comments found in the report provide a more human context of their jobs than the raw percentages.

The Standard Model and Its Beneficiaries

The current translation workflow is one where translation Buyers hire LSPs to provide translators and manage the translation process including day-to-day project management, translation reviews, source-target file transfers, and handling of invoicing/payments between Buyers and translators. Even with translation management systems (TMSs), the tasks of LSPs are still mostly manual with continual updates to project status and translation reviews. In short, LSPs orchestrate the translation process and relieve Buyers from needing dedicated localization departments.

Who Benefits From the Standard Model

The primary beneficiaries of the Standard Model are the Buyers, who can have their content translated without investing heavily in a localization team, and the LSPs. While LSPs do provide a much-needed project management function, they control the purse strings and, like other businesses, they exist to maximize their share of the purse.

Weaknesses of the Standard Model

Inadequacies range from workflow issues to fairness.

Project management overhead is a glaring inefficiency. Despite the use of TMSs, there is simply too much human administration and intervention throughout a project.

Translation delays can result from time differences between LSPs and their geographically dispersed pool of translators, especially when issues cannot be resolved promptly.

Security is a major problem for the Standard Model. LSPs do not know who is actually doing the translating. Online MT engines have been used to translate texts, risking exposure of proprietary material. [3]

The inequity of the Standard Model, where the Buyer wants to minimize translation costs and the LSPs want to maximize their profits, leaves translators, especially freelancers, at the bottom of the food chain.

A Blockchain-based Translation Workflow

Breaking with the Standard Model, the proposed translation workflow reduces human administration and improves translation workflows with blockchain as the backbone [4].

Blockchain, smart contracts, and oracles are the key pieces of the proposed workflow. A blockchain is a decentralized ledger of immutable, encrypted records (blocks) securely denoting asset transfers such as source/target files. Since each block on a blockchain contains the identifiers of the provider and the recipient of an asset, the provenance of an asset is traceable.
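The hash-chained ledger structure described above can be sketched in a few lines of Python. This is only an illustration of how provenance becomes traceable; the field names, participants, and file names are hypothetical, not part of any real ledger format:

```python
import hashlib
import json

def make_block(prev_hash, sender, recipient, asset_id):
    """Record one asset transfer (e.g. a source or target file) as a block."""
    record = {
        "prev_hash": prev_hash,   # links this block to the one before it
        "sender": sender,
        "recipient": recipient,
        "asset_id": asset_id,
    }
    digest = hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()
    return {**record, "hash": digest}

def verify_chain(chain):
    """Provenance check: each block must reference the previous block's hash."""
    return all(b["prev_hash"] == a["hash"] for a, b in zip(chain, chain[1:]))

genesis = make_block("0" * 64, "buyer", "mt-engine", "source.xliff")
handoff = make_block(genesis["hash"], "mt-engine", "translator", "draft.xliff")
assert verify_chain([genesis, handoff])
```

Because each block embeds the hash of its predecessor, altering any earlier transfer record breaks the chain, which is what makes the ledger tamper-evident.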

A smart contract is a computer protocol that is intended to enforce the execution of a contract without a third party. Smart contracts could facilitate direct source-target file transfers between Buyers and translators and quicker payment for translators when projects are completed. These transactions are trackable and irreversible.
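The escrow-like behavior of such a contract can be imagined as a small state machine. The Python class below is only a toy model of the idea (real smart contracts run on a blockchain platform and are typically written in a language such as Solidity); all names and the fee amount are hypothetical:

```python
class TranslationContract:
    """Toy escrow contract: payment is released to the translator
    only after delivery, with no third party mediating the payout."""

    def __init__(self, buyer, translator, fee):
        self.buyer = buyer
        self.translator = translator
        self.fee = fee
        self.delivered = False
        self.paid = False

    def deliver(self, target_file):
        # The translator hands off the target file directly to the contract.
        self.target_file = target_file
        self.delivered = True

    def accept(self):
        # Executes once its condition is met and cannot be reversed or repeated.
        if self.delivered and not self.paid:
            self.paid = True
            return f"{self.fee} paid to {self.translator}"
        raise RuntimeError("nothing to pay out")

contract = TranslationContract("buyer", "translator", 500)
contract.deliver("target.xliff")
print(contract.accept())  # -> 500 paid to translator
```

The point of the sketch is the control flow: once the delivery condition holds, payment follows automatically, which is how smart contracts could shorten the payment delays translators report with LSPs.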

For a wide set of applications, a blockchain (more precisely, a smart contract) requires real-world information. Fulfilling that need, a blockchain oracle is an entity that provides network-external data through an external transaction. Linguists and MT engines are examples of oracles that receive data (source files) from, and send data (target files) to, a translation blockchain workflow.

Figure 1: Blockchain Translation Schematic

A skip through the workflow

While this skip through the blockchain translation schematic (Figure 1) glosses over many details of the translation process, it shows where the human workflow administration performed by LSPs is replaced by smart contracts, with a blockchain recording the handoffs of files and payments.

  1. As with the Standard Model, a Buyer assembles source documents and project requirements including target languages, linguistic assets (terminologies and TMs), budgets, and schedules.
  2. The source content, linguistic assets, and requirements are recorded on a blockchain.
  3. Smart contracts execute throughout the workflow, directing texts to TMs, then to MT engines, or directly to MT engines or translators. Each file transfer is recorded on the blockchain.
  4. Based on MT review acceptability, smart contracts initiate transfers of translations to Translators/MTPEs for review. The blockchain records the translation transfers.
  5. Once a reviewer approves the translations, a smart contract executes, sending the translated material to the Buyer. The blockchain is updated.
  6. The Buyer’s acceptance of the completed translations invokes a smart contract that sends payment to the Translators/MTPEs. The blockchain records the details of payments.
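The routing step of this workflow, where texts go first to TMs and then to MT engines or translators while every handoff is recorded, might be sketched as follows. The TM dictionary, the stand-in MT function, and the in-memory ledger list are all simplifications invented for illustration:

```python
def route_segment(segment, tm, mt_translate, ledger):
    """Route one text segment: reuse a TM match if one exists,
    otherwise fall back to MT, and record the handoff on the ledger."""
    if segment in tm:
        target, route = tm[segment], "TM"
    else:
        target, route = mt_translate(segment), "MT"
    ledger.append({"segment": segment, "route": route, "target": target})
    return target

tm = {"Hello": "Bonjour"}           # toy translation memory
ledger = []                          # stand-in for blockchain records
route_segment("Hello", tm, str.upper, ledger)    # str.upper fakes an MT engine
route_segment("Goodbye", tm, str.upper, ledger)
assert [e["route"] for e in ledger] == ["TM", "MT"]
```

In the proposed model, each appended ledger entry would instead be a block on the chain, which is what gives the Buyer the real-time project visibility described below.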

Who Benefits from the Blockchain Model?

The Buyers and Linguists are the primary beneficiaries. That is not to say that there is no role for LSPs in the translation industry. Until there is widespread adoption of a blockchain or some other nearly fully-automated model of translation, LSPs will co-exist with automation. Also, Buyers may turn to LSPs for projects when they lack available project managers.


Buyers want quality translations with tight deadlines and a limited budget.

The streamlining of the translation workflow with blockchain replacing much of the managerial overhead will reduce costs. The savings could be used for other localization projects and ideally for fairly compensating the translators.

Because every file transfer or handoff is recorded on the blockchain, the Buyer is fully aware of the state of the project at any time. They now have access to real-time project management.

The Buyer’s material is secure. Inherent features of blockchains in recording asset transfers are transparency, traceability, and security. With the appropriate service agreements with any of the oracles involved, security weak points can be tightened.

Increased automation leads to faster translations. Since files are transferred and tracked upon execution of smart contracts, time differences are eliminated and translations can be produced 24/7/365.


As noted, linguists prefer to work directly for clients with better pay, less time pressure, and recognition.

In the proposed workflow, the work of translators can be tracked. Those producing quality translations would gain recognition and could be compensated in a fair, transparent, and consistent way. There is another side effect of traceability as well. Consistent errors either by humans or an MT engine could be identified and corrections made to improve quality on future projects.

Without the intermediate LSPs, linguists will be working and communicating directly with the Buyer. Yes, the workflow will be automated, but there would be no obstacles to human communication between the Buyer and the translators.

Elimination of time-difference delays and the human management level could allow for more time for actual translation and should lessen the time pressure felt by translators today.

Obstacles to the Blockchain Workflow

While there is much to be gained from the blockchain workflow, there are three broad hurdles for its success. One is technical and the others are due to industry resistance.

  • The technical obstacles are huge. Despite all the hype and predictions, blockchain technology is in its infancy and its future remains uncertain. Nevertheless, blockchain technology is viable and operates at scale, with Bitcoin as the most visible example. So the utility of blockchain cannot be dismissed a priori.
  • Industries do not usually embrace change, especially when it changes their business models. The Standard Model represents business as usual and it has taken time and effort to put the infrastructure into place. LSPs, who play a valuable role in today’s translation process, would be most resistant to change. However, the overhead of LSPs drives up translation costs, perhaps at the expense of the translators at the far end of the supply chain.
  • Another major barrier is the human one. With the diminished roles of LSPs, localization managers would need to adapt to a new work environment, undoubtedly a stressful situation. Translators, especially those who have well-established relationships with LSPs, would lose a conduit for work and would also have to adapt. However, there could be monetary rewards as their work and expertise are recognized via blockchain.

Change is not easy!

A Few Final Remarks

Blockchain can provide the backbone for a supply chain that brings equity, improved work arrangements, and recognition to translators. At the same time, blockchain with smart contracts streamlines the workflow by automating much of the current managerial tasks. Granted, the blockchain model is a heavy lift and faces opposition from stakeholders that control much of the workflow today. In any case, whether technical, legal, or social pressures bring about change to the supply chain, solutions for a more equitable translation environment are being discussed and concrete solutions are being proposed. Change seems inevitable.

[1] See the TAUS webinar “Blockchain: When the Token Economy Meets the Translation Industry.” For fair pay, see the TAUS blog post “Fair Pay for the translators and data-keepers!”

[2] Pielmeier, Hélène, and Paul O’Mara, “The State of the Linguist Supply Chain,” CSA Research, January 2020.

[3] See: and

[4] For other blockchain architectures, see Exfluency, and Kuhns, Bob, “The Pros and Cons of Blockchains and L10N Workflows,” TAUS White Paper, March 2019.

Bob Kuhns is an independent consultant specializing in language technologies. His clients have included the Knowledge Technology Group in the Sun Microsystems Labs and Sun’s Globalization group. In the Labs, Bob was part of a team developing a conceptual indexing system and for the Globalization group, he was the project manager and lead translation technology designer for a controlled language checker, a terminology management system, and a hybrid MT system. He was also responsible for developing translation metrics and leading a competitive MT evaluation. Bob has also conducted research and published reports with Common Sense Advisory, TAUS, and MediaLocate on a variety of topics including managed authoring, advanced leveraging, MT, blockchain, and L10n workflows, and global social media.

Bob’s email is: