
Friday, April 26, 2019

Understanding MT Quality - What Really Matters?

This is the second post in our series on machine translation quality. Again, this is a rawer, less polished variant of a version published on the SDL site. The first post focused on BLEU scores, which are often used improperly to draw inferences about MT quality even though they are clearly not the best metric for that purpose.

The reality of many of these comparisons today is that scores based on publicly available (i.e. not blind) news domain test sets are being used by companies and LSPs to select MT systems for translating IT, customer support, pharma, and financial services content. Clearly, this can only result in sub-optimal choices.

The use of machine translation (MT) in the translation industry has historically been heavily focused on localization use cases, with the primary intention to improve efficiency, that is, speed up turnaround and reduce unit word cost. Indeed, machine translation post-editing (MTPE) has been instrumental in helping localization workflows achieve higher levels of productivity.




Many users in the localization industry select their MT technology based on two primary criteria:
  1. Lowest cost
  2. “Best quality” assessments based on metrics like BLEU, LEPOR or TER, usually done by a third party
The most common way to assess the quality of MT system output is to use a string-matching algorithm score like BLEU. As we pointed out previously, equating a string-match score with the potential future translation quality of an MT system in a new domain is unwise, and quite likely to lead to disappointing results. BLEU and other string-matching scores offer the most value to research teams building and testing MT systems. When we further consider that scores based on old news domain content are being used to select systems for customer support content in IT and software subject domains, the practice seems doubly foolish.

One problem with using news domain content is that it tends to lack tone and emotion. News stories discuss terrorism and new commercial ventures in almost exactly the same tone. As Pete Smith points out in the webinar linked below, tone really matters in business communication and in customer service and support scenarios. Enterprises that can identify dissatisfied customers and address the issues that cause dissatisfaction are likely to be more successful. Customer experience (CX) is about tone and emotion in addition to the basic literal translation.

Many users consider only the results of comparative evaluations – often performed by means of questionable protocols and processes, using test data that is invisible or not properly defined – to select which MT systems to adopt. Most frequently, such analyses produce a score table like the one shown below, which might lead users to believe they are using the “best-of-breed” MT solution since they selected the “top” vendor for each language pair.

English to French        English to Chinese        English to Dutch
Vendor A – 46.5          Vendor C – 36.9           Vendor B – 39.5
Vendor B – 45.2          Vendor A – 34.5           Vendor C – 37.7
Vendor C – 43.5          Vendor B – 32.7           Vendor A – 35.5

While this approach looks logical at one level, it often introduces errors and undermines efficiency because of administrative inconsistencies between the different MT systems. Also, the suitability of the MT output for post-editing may be a key requirement for localization use cases, but it may be much less important in other enterprise use cases.




Assessing business value and impact


The first post in this blog series exposes many of the fallacies of automated metrics that use string-matching algorithms (like BLEU and LEPOR), which are not reliable quality assessment techniques, as they only reflect the calculated precision and recall characteristics of text matches in a single test set, on material that is usually unrelated to the enterprise domain of interest.

The issues discussed challenge the notion that single-point scores can really tell you enough about long-term MT quality implications. This is especially true as we move away from the localization use case. In the use cases described below, speed, overall agility and responsiveness, and integration into customer-experience data flows matter much more, and the translation quality variance measured by BLEU and LEPOR may have little to no impact on what really matters.



The enterprise value-equation is much more complex and goes far beyond linguistic quality and Natural Language Processing (NLP) scores. To truly reflect the business value and impact, evaluation of MT technology must factor in non-linguistic attributes including:
  • Adaptability to business use cases
  • Manageability
  • Integration into enterprise infrastructure
  • Deployment flexibility   
To effectively link MT output to business value implications, we need to understand that although linguistic precision is an important factor, it often has a lower priority in high-value business use cases. This view will hopefully take hold as the purpose and use of MT is better understood in the context of a larger business impact scenario, beyond localization.

But what would more dynamic and informed approaches look like? MT evaluation certainly cannot be static, since systems must evolve as requirements change. Ideally, we would replace the single-point score with a richer framework that still yields a simple measure telling us everything we need to know about an MT system; today, unfortunately, this is not yet feasible.




A more meaningful evaluation framework


While single-point scores do provide a rough, quick-and-dirty sense of an MT system’s performance, it is more useful to focus testing efforts on specific enterprise use case requirements. This applies to automated metrics as well: scores based on news domain tests should be viewed with care, since they are not likely to be representative of performance on specialized enterprise content.

When rating different MT systems, it is essential to score key requirements for enterprise use, including:

  • Adaptability: Range of options and controls available to tune the MT system performance for very specific use cases. For example, optimization techniques applied to eCommerce catalog content should be very different from those applied to technical support chatbot content or multilingual corporate email systems.
  • Data privacy and security: If an MT system will be used to translate confidential emails or business strategy and tactics documents, the privacy and security requirements will differ greatly from those of a system that only handles product documentation. Some systems will harvest data for machine learning purposes, and it is important to understand this upfront.
  • Deployment flexibility: Some MT systems need to be deployed on-premises to meet legal requirements, such as is the case in litigation scenarios or when handling high-security data. 
  • Expert services: Having highly qualified experts to assist in the MT system tuning and customization can be critical for certain customers to develop ideal systems. 
  • IT integration: Increasingly, MT systems are embedded in larger business workflows to enable greater multilingual capabilities, for example, in communication and collaboration software infrastructures like email, chat and CMS systems.
  • Overall flexibility: Together, all these elements provide flexibility to tune the MT technology to specific use cases and develop successful solutions.

Ultimately, the most meaningful measures of MT success are directly linked to business outcomes and use cases. The definition of success varies by the use case, but most often, linguistic accuracy as an expression of translation quality is secondary to other measures of success. 


The integrity of the overall solution likely has much more impact than the MT output quality in the traditional sense: not surprisingly, MT output quality could vary by as much as 10-20% on either side of the current BLEU score without impacting the true business outcome. Linguistic quality matters but is not the ultimate driver of successful business outcomes. In fact, there are reports of improvements in output quality in an eCommerce use case that actually reduced the conversion rates on the post-edited sections, as this post-edited content was viewed as being potentially advertising-driven and thus less authentic and trustworthy.








True expressions of successful business outcomes for different use cases


Global enterprise communication and collaboration
  • Increased volume in cross-language internal communication and knowledge sharing with safeguarded security and privacy
  • Better monitoring and understanding of global customers 
  • Rapid resolution of global customer problems, measured by volume and degree of engagement
  • More active customer and partner communications and information sharing
Customer service and support
  • Higher volume of successful self-service across the globe
  • Easy and quick access to multilingual support content 
  • Increased customer satisfaction across the globe
  • The ability of monolingual live agents to service global customers regardless of the originating customer’s language 
eCommerce
  • Measurably increased traffic drawn by new language content
  • Successful conversions in all markets
  • Transactions driven by newly translated content
  • The stickiness of new visitors in new language geographies
Social media analysis
  • Ability to identify key brand impressions 
  • Easy identification of key themes and issues
  • A clear understanding of key positive and negative reactions
Localization
  • Faster turnaround for all MT-based projects
  • Lower production cost as a reflection of lower cost per word
  • Better MTPE experience based on post-editor ratings
  • Adaptability and continuous improvement of the MT system

A presentation and webinar that go into much more detail on this subject are available on BrightTALK.


In upcoming posts in this series, we will continue to explore the issue of MT quality assessment from a broad enterprise needs perspective. More informed practices will result in better outcomes and significantly improved MT deployments that leverage the core business mission to solve high-volume multilingual challenges more effectively.

Wednesday, April 17, 2019

Understanding MT Quality: BLEU Scores

This is the first in a series of posts discussing various aspects of MT quality from the context of enterprise use and value, where linguistic quality is important, but not the only determinant of suitability in a structured MT technology evaluation process. A cleaner, more polished, and shorter studio version of this post is available here. You can consider this post a first draft, or the live stage performance (stream of consciousness) version.

What is BLEU (Bilingual Evaluation Understudy)?

As the use of enterprise machine translation expands, it becomes increasingly important for users and practitioners to understand MT quality issues in a relevant, meaningful, and accurate way.
The BLEU score is a string-matching algorithm that provides basic output quality metrics for MT researchers and developers. In this first post, we will look more closely at the BLEU score, which has probably been the most widely used MT quality assessment metric among MT researchers and developers over the last 15 years. While it is widely understood that the BLEU metric has many flaws, it continues to be a primary metric used to measure MT system output even today, in the heady days of Neural MT.
Firstly, we should understand that a fundamental problem with BLEU is that it DOES NOT EVEN TRY to measure “translation quality”, but rather focuses on STRING SIMILARITY (usually to a single human reference). What has happened over the years is that people choose to interpret this as a measure of the overall quality of an MT system. BLEU scores only reflect how a system performs on the specific set of test sentences used in the test. As there can be many correct translations, and most BLEU tests rely on test sets with only one correct translation reference, it means that it is often possible to score perfectly good translations poorly.
The scores do not reflect the potential performance of the system on other material that differs from the specific test material, and all inferences on what the score means should be made with great care, after taking a close look at the existing set of test sentences. It is very easy to use and interpret BLEU incorrectly and the localization industry abounds with examples of incorrect, erroneous, and even deceptive use.

Very simply stated, BLEU is a “quality metric” score for an MT system that is attempting to measure the correspondence between a machine translation output and that of a human with the understanding that "the closer a machine translation is to a professional human translation, the better it is" – this is the central idea behind BLEU. Scores are calculated for individual MT translated segments—generally sentences—by comparing them with a set of good quality human reference translations. Most would consider BLEU scores more accurate at a corpus level rather than at a sentence level.
BLEU gained popularity because it was one of the first MT quality metrics to report a high correlation with human judgments of quality, a notion that has been challenged often; but after 15 years of attempts to displace it from prominence, the allegedly “improved” derivatives (METEOR, LEPOR) have yet to really unseat its dominance. BLEU, together with human assessment, remains the preferred evaluation approach today.

A Closer, More Critical Examination of BLEU


BLEU is actually nothing more than a method to measure the similarity between two text strings. To infer that this metric, which has no linguistic consideration or intelligence whatsoever, can predict not only past “translation quality” performance, but also future performance is indeed quite a stretch.
Measuring translation quality is much more difficult because there is no absolute way to measure how “correct” a translation is. MT is a particularly difficult AI challenge because computers prefer binary outcomes, and translation rarely, if ever, has a single correct outcome. Many “correct” answers are possible, and there can be as many “correct” answers as there are translators. The most common way to measure quality is to compare the output strings of automated translation to a human translation text string of the same sentence. The fact that one human translator will translate a sentence in a significantly different way than another human translator leads to problems when using these human references to measure “the quality” of an automated translation solution.
The BLEU metric scores a translation on a scale of 0 to 1. The metric attempts to measure adequacy and fluency in a similar way to how a human would, e.g. does the output convey the same meaning as the input sentence, and is the output good and fluent target language? The closer to 1, the more overlap there is with a human reference translation and thus the better the system is. In a nutshell, the BLEU metric measures how many words overlap, giving higher scores to sequential matches. For example, a string of four words in the translation that matches the human reference translation (in the same order) will have a positive impact on the BLEU score and is weighted more heavily (and scored higher) than a one- or two-word match. It is very unlikely that you would ever score 1, as that would mean the compared output is exactly the same as the reference output. However, it is also possible that an accurate translation would receive a low score because it uses different words than the reference. The potential for this problem can be seen in the following example: if we select only one of several equally correct translations for our reference set, all the other correct translations will score lower!
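To make these mechanics concrete, here is a minimal, illustrative Python sketch of the core BLEU calculation (clipped n-gram precision combined with a brevity penalty). It is a simplified, single-sentence version written for this post, not the official scoring implementation; the function names and example sentences are invented for illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def toy_bleu(candidate, references, max_n=4):
    """Toy single-sentence BLEU: clipped n-gram precision plus a brevity penalty."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        # Clipping: a candidate n-gram only gets credit up to the maximum
        # number of times it appears in any single reference.
        max_ref_counts = Counter()
        for ref in references:
            for gram, count in Counter(ngrams(ref, n)).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        matched = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if matched == 0 or total == 0:
            return 0.0  # no smoothing in this toy version
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: penalize candidates shorter than the closest reference.
    closest_ref_len = min((len(r) for r in references),
                          key=lambda rl: (abs(rl - len(candidate)), rl))
    brevity_penalty = 1.0 if len(candidate) >= closest_ref_len \
        else math.exp(1 - closest_ref_len / len(candidate))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)

reference = "he sat down on the couch".split()
print(toy_bleu("he sat down on the couch".split(), [reference]))  # 1.0: identical to the reference
print(toy_bleu("he sat down on the sofa".split(), [reference]))   # ~0.76: penalized for "sofa" vs "couch"
```

Note how substituting a single synonym (“sofa” for “couch”) already pulls the score well below 1.0, even though a human reviewer would see no difference in quality.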

How does BLEU work?

To conduct a BLEU measurement the following data is necessary:
  1. One or more human reference translations. (This should be data that has NOT been used in building the system (training data) and ideally should be unknown to the MT system developer. It is generally recommended that 1,000 or more sentences be used to get a meaningful measurement.) If you use too small a sample set, you can sway the score significantly with just a few sentences that match or do not match well.
  2. Automated translation output of the exact same source data set.
  3. A measurement utility that performs the comparison and score calculation for you.
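As a concrete illustration of item 3, the snippet below sketches how a corpus-level score might be computed with the open-source sacreBLEU toolkit. The sentences and variable names are invented for the example; a real evaluation would read the MT output and reference files for a full held-out test set of 1,000 or more segments.

```python
import sacrebleu  # open-source scoring utility: pip install sacrebleu

# Hypothetical held-out test set: MT outputs and one human reference per segment.
mt_outputs = [
    "The server must be restarted after the update.",
    "Click the button to open the settings menu.",
]
human_refs = [
    "The server needs to be restarted after the update.",
    "Click the button to open the settings menu.",
]

# corpus_bleu takes the system outputs plus a list of reference streams
# (one full stream per available reference translation of the test set).
result = sacrebleu.corpus_bleu(mt_outputs, [human_refs])
print(f"Corpus BLEU: {result.score:.1f}")  # reported on a 0-100 scale
```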

  • Studies have shown that there is a reasonably high correlation between BLEU and human judgments of quality when properly used.
  • BLEU scores are often stated on a scale of 0 to 100 to simplify communication, but should not be confused with a percentage of accuracy.
  • Even two competent human translations of the exact same material may only score in the 0.6 or 0.7 range as they likely use different vocabulary and phrasing.
  • We should be wary of very high BLEU scores (in excess of 0.7) as it is likely we are measuring improperly or overfitting.

A sentence translated by MT may have 75% of the words overlap with one translator’s translation, and only 55% with another translator’s translation; even though both human reference translations are technically correct, the one with the 75% overlap with machine translation will provide a higher “quality” score for the automated translation. This is somewhat arbitrary. Random string matching scores should not be equated to overall translation quality. Therefore, although humans are the true test of correctness, they do not provide an objective and consistent measurement for any meaningful notion of quality.
As would be expected, using multiple human references will generally result in higher scores, as the MT output has more human variations to match against. The NIST (National Institute of Standards & Technology) used BLEU as an approximate indicator of quality in its annual MT competitions with four human reference sets to ensure that some variance in human translation is captured, and thus allow more accurate assessments of the MT solutions being evaluated. The NIST evaluation also defined the development, test, and evaluation process much more carefully and competently, and thus comparing MT systems under its rigor and purview was meaningful. This has not been true for many of the comparisons done since, and many recent comparisons are deeply flawed.

 

Why are automated MT quality assessment metrics needed?

Automated quality measurement metrics have always been important to the developers and researchers of data-driven MT technology, because of the iterative nature of MT system development and the need for frequent assessments during the development of the system. They can provide rapid feedback on the effectiveness of continuously evolving research and development strategies.
Recently, we see that BLEU and some of its close derivatives (METEOR, NIST, LEPOR, and F-Measure) are also often used to compare the quality of different MT systems in enterprise settings. This can be problematic, as a “single point quality score” based on publicly sourced news domain sentences is simply not representative of the dynamically changing, customized, and modified potential of an active and evolving enterprise MT system. Also, such a score does not incorporate the importance of overall business requirements in an enterprise use scenario, where other workflow, integration, and process-related factors may actually be much more important than small differences in scores. Useful MT quality in the enterprise context will vary greatly, depending on the needs of the specific use case.
Most of us would agree that competent human evaluation is the best way to understand the output quality implications of different MT systems. However, human evaluation is generally slower, less objective, and likely to be more expensive and thus not viable in many production use scenarios when many comparisons need to be made on a constant and ongoing basis. Thus, automated metrics like BLEU provide a quick and often dirty quality assessment that can be useful to those who actually understand its basic mechanics. However, they should also understand its basic flaws and limitations and thus avoid coming to over-reaching or erroneous conclusions based on these scores.

There are two very different ways that such scores may be used:

  • R&D Mode: In comparing different versions of an evolving system during the development of the production MT system, and,
  • Buyer Mode: In comparing different MT systems from different vendors and deciding which one is the “best” one.

The MT System Research & Development Need: Data-driven MT systems could probably not be built without using some kind of automated measurement metric to measure ongoing progress. MT system builders are constantly trying new data management techniques, algorithms, and data combinations to improve systems, and thus need quick and frequent feedback on whether a particular strategy is working or not. It is necessary to use some form of standardized, objective and relatively rapid means of assessing quality as part of the system development process for this technology. If this evaluation is done properly, the tests can also be useful over a longer period to understand how a system evolves over many years.

The MT Buyer Need: As there are many MT technology options available today, BLEU and its derivatives are sometimes used to select what MT vendor and system to use. The use of BLEU in this context is much more problematic and prone to drawing erroneous conclusions as often comparisons are being made between apples and oranges. The most common error in interpreting BLEU is the lack of awareness and understanding that there is a positive bias towards one MT system because it has already seen and trained on the test data, or has been used to develop the test data set.

Problems with BLEU


While BLEU is very useful to those who build and refine MT systems, its value as a way to compare totally different MT systems is much more limited; such comparisons need to be done very carefully, if at all, as the metric is easily and often manipulated to create the illusion of superiority.
“CSA Research and leading MT experts have pointed out for over a decade that these metrics [BLEU] are artificial and irrelevant for production environments. One of the biggest reasons is that the scores are relative to particular references. Changes that improve performance against one human translation might degrade it with respect to another… Approaches that emphasize usability and user acceptance take more effort than automatic scores but point the way toward a useful and practical discussion of MT quality.”
There are several criticisms of BLEU that should also be understood if you are to use the metric effectively. BLEU only measures direct word-by-word similarity and looks to match and measure the extent to which word clusters in two sentences or documents are identical. Accurate translations that use different words may score poorly since there is no match in the human reference. 
There is no understanding of paraphrases and synonyms, so scores can be somewhat misleading in terms of overall accuracy. You have to use the exact same words as the human reference translation to get credit, e.g.
"Wander" doesn't get partial credit for "stroll," nor "sofa" for "couch."
Also, nonsensical language that contains the right phrases in the wrong order can score high, e.g.
"Appeared calm when he was taken to the American plane, which will to Miami, Florida" would get the very same score as: "was being led to the calm as he was would take carry him seemed quite when taken".
A more recent criticism identifies the following problems:
  • It is an intrinsically meaningless score
  • It admits too many variations – meaningless and syntactically incorrect variations can score the same as good variations
  • It admits too few variations – it treats synonyms as incorrect
  • More reference translations do not necessarily help
These and other problems are described in this article and this critical academic review. The core problem is that word-counting scores like BLEU and its derivatives - the linchpin of the many machine-translation competitive comparisons - don't even recognize well-formed language, much less real translated meaning. Here is a more recent post that I highly recommend, as it very clearly explains other metrics, and shows why it also still makes sense to use BLEU in spite of its many problems.
For post-editing work assessments, there is a growing preference for edit distance scores to more accurately reflect the effort involved, even though these too are far from perfect.
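As a rough illustration of the idea (not any particular vendor's metric), a bare-bones edit distance score can be computed as the word-level Levenshtein distance between the raw MT output and its post-edited version, normalized by the post-edited length. This is a simplified sketch in the spirit of TER, without the phrase-shift operations that real implementations add; the example segments are invented.

```python
def word_edit_distance(a, b):
    """Word-level Levenshtein distance: minimum insertions, deletions and
    substitutions needed to turn token list `a` into token list `b`."""
    previous = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        current = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            current.append(min(previous[j] + 1,          # delete from `a`
                               current[j - 1] + 1,       # insert into `a`
                               previous[j - 1] + cost))  # substitute (or keep)
        previous = current
    return previous[-1]

# Invented example: raw MT output vs. its post-edited version.
mt_output   = "the update must installed before restart the server".split()
post_edited = "the update must be installed before restarting the server".split()

edits = word_edit_distance(mt_output, post_edited)
edit_rate = edits / len(post_edited)   # rough per-word post-editing effort
print(edits, round(edit_rate, 2))      # 2 edits here, roughly 0.22 per post-edited word
```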
The problems are further exacerbated with Neural MT technology, which can often generate excellent translations that are quite different from the reference and thus score poorly. Indeed, many have found that lower-scoring (BLEU) NMT systems are clearly preferred over higher-scoring SMT systems when human evaluations are done. There are some new metrics (ChrF, SacreBLEU, Rouge) attempting to replace BLEU, but none have gathered any significant momentum yet, and the best way to evaluate NMT system output today is still well-structured human assessment.

What is BLEU useful for?

Modern MT systems are built by “training” a computer with examples of human translations. As more human translation data is added, systems should generally get better in quality. Often, new data can be added with beneficial results, but sometimes new data can have a negative effect, especially if it is noisy or otherwise “dirty”. Thus, system developers need to be able to measure the quality impact of such changes rapidly and frequently, to make sure they are in fact improving the system.
BLEU allows developers a way “to monitor the effect of daily changes to their systems in order to weed out bad ideas from good ideas.” When used to evaluate the relative merit of different system-building strategies, BLEU can be quite effective, as it provides very quick feedback that enables MT developers to rapidly refine and improve the translation systems they are building and continue to improve quality on a long-term basis.

What is BLEU not useful for?


BLEU scores are always very directly related to a specific “test set” and a specific language pair. Thus, BLEU should not be used as an absolute measure of translation quality because the BLEU score can vary even for one language depending on the test and subject domain. In most cases comparing BLEU scores across different languages is meaningless unless very strict protocols have been followed. Because of this, it is always recommended to use human translators to verify the accuracy of the metrics after systems have been built. Also, most MT industry leaders will always vet the BLEU score readings with human assessments before production use.
In competitive comparisons, it is important to carry out the comparison tests in an unbiased, scientific manner to get a true view of where you stand against competitive alternatives. The “test set” should be unknown (“blind”) to all the systems that are involved in the measurement. This is something that is often violated in many widely used comparisons today. If a system is trained with the sentences in the “test set” it will obviously do well on the test but probably not as well on data that it has not seen before. Many recent comparisons score MT systems on News Domain related test sets that may also be used in training by some MT developers. A good score on news domain may not be especially useful for an enterprise use case that is heavily focused on IT, pharma, travel or any domain other than news.
However, in spite of all the limitations identified above, BLEU continues to be a basic metric used by most, if not all, MT researchers today, though most expert developers now regularly use human evaluation on smaller sets of data to ensure that they indeed have a true and meaningful BLEU. The MT community has found that supposedly improved metrics like METEOR and LEPOR have not really gained any momentum; BLEU's flaws and issues are more clearly understood, which makes it more reliable, especially when used together with supporting human assessments. Also, many buyers today realize that MT system performance on their specific subject domains and translatable content for different use cases matters much more than how generic systems might perform on news stories.


In upcoming posts in this series, we will continue to explore the issue of MT quality from a broad enterprise needs perspective. More informed practices will result in better outcomes and significantly improved MT deployments that leverage the core business mission to solve high-volume multilingual challenges more effectively.

Friday, February 1, 2019

Understanding the Realities of Language Data

This is a guest post by Luigi Muzii that focuses mostly on the various questions that surround Language Data, which by most “big data” definitions and volumes is really not what most in the community would consider big data. As the world hurtles into the brave new world being created by a growing volume of machine learning and AI applications, the question of getting the data right is often brushed aside. Most think the data is a solved problem or presume that data is easily available. However, those of us who have been working on MT seriously over the last decade understand this is far from a solved problem. Machines learn from data, and smart engineers can find ways to leverage the patterns in data in innumerable ways. Properly used, it can make knowledge work easier or more efficient, e.g. machine translation, recommendation, and personalization.

The value and quality of this pattern learning can only be as good as the data used, and however exciting all this technology seems, we need to understand that our comprehension of how the brain (much less the mind) works is still only in its infancy. “Machine learning” is a geeky way of saying “finding complicated patterns in data”. The comparative learning capacity and neuroplasticity of any 2-year-old child will pretty much put most of these “amazing and astounding” new AI technologies to shame. Computers can process huge amounts of data in seconds, and sometimes they can do this in VERY useful ways, and most of what we call AI today is rarely, if ever, much more than this. If the data is not good, the patterns will be suspect and often wrong. And given what we know about how easy it is to screw data up, this will continue to be a challenge. In fact, DARPA and others are now discussing strategies for detecting and removing “poisoned data” on an increasingly regular basis. Much of the fear about rogue AI is based on this kind of adversarial machine learning, which can lead trusted production systems astray to make biased and even dangerous errors.


Despite the hype, we should understand that deep learning algorithms are increasingly going to be viewed as commodities.

The data is your teacher. It's the data where the real value is.


Data is valuable when it is properly collected, understood, organized and categorized. Having rich metadata and taxonomy is especially valuable with linguistic data. The SMT experience has shown that much of the language industry's TM data was not very useful in building SMT engines without significant efforts in data cleaning. This is even more true with Neural MT. As Luigi points out, most of the data used by the MT engines that process 99% of the language translation being done on the planet today has had very little to do with the language industry. In fact, it is also my experience that many large-scale MT projects for enterprise use cases involve a data acquisition and data creation phase that produces the right kind of data to drive successful outcomes. While data from localization projects can be very useful at times, it is most often better to create and develop training data that is optimized for the business purpose. Thus, a man-machine collaboration around creating and curating the right data is often at the heart of the most successful MT initiatives.

Luigi has already written about metadata and you can find the older articles here and here.

This is also true for the content that drives digital experiences in the modern digital marketplace. We are now beginning to understand that content is often the best salesperson in a digital marketplace, and good content drives and can enhance the digital buyer and customer journey. And here, too, data quality and organization matter; in fact, they are a key to success in the dreams of digital transformation. Content needs to be intelligent.

As Ann Rockley said years ago:

Intelligent content is content that’s structurally rich and semantically categorized and therefore automatically discoverable, reusable, reconfigurable, and adaptable.

Here is a simple and succinct description of what intelligent content is. For a more detailed look at what this is and what is needed to make it happen through the translation and globalization process take a look at the SDL ebook on the Global Content Operating Model.



===============================



Let’s face it: translation is widely considered a trivial job. Investigating the rationale of this view is pointless, so let’s take it as a matter of fact and focus on the ensuing behaviors, starting with the constantly increasing unwillingness to pay a fee—however honest and adequate—for a professional performance. Also, the attitude of many industry players towards their customers does nothing to correct this view of translation as a cheap commodity.

Unfortunately, with the rise of the Internet, the idea has progressively taken hold that goods—especially virtual ones—and indeed services should be ever cheaper and better. The (in)famous zero marginal cost theory has heavily contributed to the vision of an upcoming era of nearly free goods and services, “precipitating the meteoric rise of a global Collaborative Commons and the eclipse of capitalism.” But who will pay for this? Are governments supposed to subsidize all infrastructural costs to let marginal-cost pricing prevail? Well, no. But keep reading anyway.

So where is the Language Data?


Does any of this have anything to do with data? Data has entered the discussion because of big data. The huge amounts of data manipulated by the large tech corporations have led to the assumption that translation buyers, and possibly industry players too, could do the same with language data. This has also led to expectations about the power of data that may be exaggerated or detached from any principle of reality.


Indeed, the fragmentation of the industry has produced a series of problems that have not yet been resolved. The main one is that there is no single player big and vertical enough to have such a large and up-to-date amount of data—especially language data—as to be even remotely considered big data or used in any comparable way.

Also, more than 99.99 percent of translation today is performed by machine translation, and the vast majority of the training data behind the major online engines comes from sources other than the traditional industry ones. Accordingly, the data that industry players can offer, make available, and even use for their own business purposes is comparatively small and poor. In fact, the verticality of this data and the breadth of its scope are totally insufficient to enable any player, including or maybe especially the largest ones, to impact the industry. A certain ‘data effect’ exists only because online machine translation engines are trained on huge amounts of textual data available on the Internet, quite independently of the translation industry.

For these reasons, a marketplace for language data might be completely useless, if not pointless. It might be viable, but the data available could hardly be the data needed.

For example, TAUS Matching Data is an elegant exercise, but its practicality and usefulness are yet to be proved. It is based on DatAptor, a research project pursued by the Institute for Logic, Language and Computation at the University of Amsterdam under the supervision of Professor Khalil Sima’an. DatAptor “aims at providing automatic methods for inducing dedicated translation aids from large translation data” by selecting datasets from existing repositories. Beyond the usual integrity, cleanliness, reliability, relevance, and prevalence issues, the traditional and unsolved issue of information asymmetry persists: a deep linguistic competence and subject-field expertise, as well as a fair proficiency in data management, are needed to be sure that a dataset is relevant, reliable and up-to-date. And while the first two might possibly be found in a user querying the database, they are harder to find in the organization collecting and supposedly validating the original data.

Also, several translation data repository platforms are available today, generally built by harvesting data through web crawling. The data used by the highest-resourced online machine translation engines comes from millions of websites or from the digitization of book libraries.

Interestingly, open-source or publicly funded projects like bicleaner, TMop, Zipporah, ParaCrawl or Okapi CheckMate are emerging to harvest, classify, systematize, and clean language data.
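To give a flavor of what such cleaning tools do, here is a hypothetical sketch of the kind of rule-based filters typically applied to harvested segment pairs before any trained classifier is run. The thresholds and example pairs are invented for illustration and are not taken from bicleaner or any other specific tool.

```python
import re

def keep_pair(source: str, target: str,
              max_len_ratio: float = 2.5, max_tokens: int = 120) -> bool:
    """Rough rule-based filter for a harvested (source, target) segment pair.
    Thresholds are illustrative, not taken from any specific tool."""
    src, tgt = source.split(), target.split()
    if not src or not tgt:
        return False  # one side is empty
    if len(src) > max_tokens or len(tgt) > max_tokens:
        return False  # overly long segments are usually noise
    if max(len(src), len(tgt)) / min(len(src), len(tgt)) > max_len_ratio:
        return False  # lengths too different to be a real translation pair
    if source.strip().lower() == target.strip().lower():
        return False  # the "translation" is just a copy of the source
    if re.search(r"<[^>]+>", source) or re.search(r"<[^>]+>", target):
        return False  # leftover markup from web crawling
    return True

pairs = [
    ("Restart the server after the update.", "Redémarrez le serveur après la mise à jour."),
    ("Restart the server after the update.", "Restart the server after the update."),
    ("Click OK.", "Cliquez sur OK pour confirmer votre choix et continuer vers l'étape suivante."),
]
clean = [pair for pair in pairs if keep_pair(*pair)]
print(f"kept {len(clean)} of {len(pairs)} pairs")
```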

The initiative of a self-organized group of ‘seasoned globalization professionals’ from some major translation buyers may be seen as part of this trend. This group has produced a list of best practices for translation memory management. Indeed, this effort proves that models and protocols are necessary for standardization, not applications.

TMs are not dead and are not going to die as long as CAT tools and TMSs remain the primary means in the hands of translation professionals and businesses to produce language data.

At this point, two questions arise: What about the chance of having different datasets from the same original repository available on the same marketplace? And what about synthetic data? So, the challenge of selecting and using the right data sources remains unsolved.

Finally, the coopetition paradox also applies to a hypothetical language data marketplace. Although many translation industry players may interact and even cooperate on a regular basis, most of them are unwilling to develop anything that would benefit the entire industry and keep struggling to achieve a competitive advantage.

Is Blockchain really the solution?



For all these reasons, blockchain is not the solution for a weak-willed, overambitious data marketplace.

As McKinsey’s partners Matt Higginson, Marie-Claude Nadeau, and Kausik Rajgopal wrote in a recent article, “Blockchain has yet to become the game-changer some expected. A key to finding the value is to apply the technology only when it is the simplest solution available.” In fact, despite the amount of money and time spent, little of substance has been achieved.

Leaving aside the far-from-trivial problems of the immaturity, instability, expense, and complexity—if not obscurity—of the technology and the ensuing uncertainty, blockchain may perhaps be used successfully in the future to secure agreements and their execution (possibly through smart contracts), though hardly for anything else in the translation business. Competing technologies are also emerging as less clunky alternatives. Therefore, it does not seem advisable to put your money into a belated and misfocused project based on a bulky, underachieving technology as a platform for exchanging data that will still be exposed to ownership, integrity, and reliability issues.


The importance of Metadata


Metadata is totally different: It can be extremely interesting even for a translation data marketplace.
The fact that big data is essentially metadata has possibly not been discussed enough. The information of interest to data-manipulating companies does not come from the actual content posted, but from the associated data vaguely describing user behaviors, preferences, reactions, trends, etc. Only in a few cases are text strings, voice data, and images mined, analyzed, and re-processed. Even then, the outcome of this analysis is stored as descriptive data, i.e. metadata. The same applies to IoT data. Also, data is only as good as the use one is capable of making of it. In Barcelona, for example, within the scope of the Decode project, Mayor Ada Colau is trying to use data on the movements of citizens generated by apps like Citymapper to inform the design of a better public transport system.

In translation, metadata might prove useful for quality assessment, process analysis and re-engineering, and market research, but it receives even less consideration than language data and is even more neglected here than elsewhere.

Language data is obsessively demanded but ill-curated. As usual, the reason is money: language data is supposed to be immediately profitable, either by leveraging it through discount policies or by using it to train machine translation engines. In both cases, it is seen as a handy means to sustain the pressure on prices and reduce compensation to linguists. Unfortunately, the quality of language data is generally very poor, because curating it is costly.

Italians use the expression “fare le nozze coi fichi secchi” (make a wedding with dry figs) for an attempt to accomplish something without spending what is necessary, while the Spanish say “bueno y barato no caben en un zapato” (good and cheap don’t fit in a shoe). Both expressions recall the popular adage “There ain’t no such thing as a free lunch.”

This idea is common to virtually every culture, and yet translation industry players still have to learn it, and possibly not forget it.

We often read and hear that there’s a shortage of good talent in the market. On the other hand, many insist that there is plenty, and that the only problem of this industry is its ‘bulk market’—whatever that means, and regardless of how reliable those who claim this actually are, boast to be, or are wrongly presumed to be. Of course, if you go to Translators Café, ProZ, Facebook or even LinkedIn to find matching teams, you quite possibly have a problem knowing what talent is and which talents are needed today.

Let’s face it: the main reason high-profile professionals (including linguists) are unwilling to work in the translation industry is remuneration. And this is also the main reason the translation industry and the translation profession are considered, respectively, a lesser industry and a trivial job: an endless downward spiral.

Bad resources have been driving out good ones for a long time now. And if this applies to the linguists who should be the ribs, nerves, and muscles of the industry, imagine what may happen with sci-tech specialists.

In 1933, in an interview for the June 18 issue of The Los Angeles Times, Henry Ford offered this advice to fellow business people, “Make the best quality of goods possible at the lowest cost possible, paying the highest wages possible.” Similarly, to summarize the difference between Apple’s long-prided quest for premium prices and Amazon’s low-price-low-margin strategy, on the assumption it would make money elsewhere, Jeff Bezos declared in 2012, “Your margin is my opportunity.”

Can you tell the difference between Ford, Amazon, and any ‘big’ translation industry player? Yes, you can.




Luigi Muzii has been in the "translation business" since 1982 and has been a business consultant since 2002 in the translation and localization industry through his firm. He focuses on helping customers choose and implement best-suited technologies and redesign their business processes for the greatest effectiveness of translation and localization related work.

This link provides access to his other blog posts.