
Friday, April 26, 2019

Understanding MT Quality - What Really Matters?

This is the second post in our series on machine translation quality.

The reality of many of these comparisons today is that scores based on publicly available (i.e., not blind) news domain tests are being used by many companies and LSPs to select MT systems that translate IT, customer support, pharma, and financial services content. Clearly, this can only result in sub-optimal choices.

The use of machine translation (MT) in the translation industry has historically been heavily focused on localization use cases, with the primary intention of improving efficiency, that is, speeding up turnaround and reducing unit word cost. Indeed, machine translation post-editing (MTPE) has been instrumental in helping localization workflows achieve higher levels of productivity.




Many users in the localization industry select their MT technology based on two primary criteria:
  1. Lowest cost
  2. “Best quality” assessments based on metrics like BLEU, LEPOR, or TER, usually performed by a third party
The most common way to assess the quality of MT system output is with a string-matching score like BLEU. As we pointed out previously, equating a string-match score with the potential future translation quality of an MT system in a new domain is unwise and quite likely to produce disappointing results. BLEU and other string-matching scores offer the most value to research teams building and testing MT systems. When we further consider that scores based on old news domain content are being used to select systems for customer support content in IT and software subject domains, the practice seems doubly foolish.
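
To make the point concrete, here is a minimal sketch of how literal this matching is, assuming the open-source sacrebleu package; the sentences are invented for illustration:

```python
# Minimal sketch using sacrebleu (pip install sacrebleu).
# The example sentences are invented for illustration.
import sacrebleu

reference = ["The engineer restarted the server after the update failed."]

# Two hypothetical MT outputs: an exact match and an adequate paraphrase.
exact_match = ["The engineer restarted the server after the update failed."]
paraphrase  = ["After the failed update, the engineer rebooted the machine."]

for label, hypothesis in [("exact match", exact_match), ("paraphrase", paraphrase)]:
    result = sacrebleu.corpus_bleu(hypothesis, [reference])
    print(f"{label}: BLEU = {result.score:.1f}")

# The paraphrase scores dramatically lower than the exact match, even though
# both are acceptable translations: BLEU measures n-gram overlap with the
# reference strings, not meaning.
```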

One problem with using news domain content is that it tends to lack tone and emotion. News stories discuss terrorism and new commercial ventures in almost exactly the same tone. As Pete Smith points out in the webinar linked below, tone really matters in business communication and in customer service and support scenarios. Enterprises that can identify dissatisfied customers and address the issues causing that dissatisfaction are likely to be more successful. Customer experience (CX) is all about tone and emotion, in addition to the basic literal translation.

Many users consider only the results of comparative evaluations – often performed by means of questionable protocols and processes, using test data that is invisible or not properly defined – to select which MT systems to adopt. Most frequently, such analyses produce a score table like the one shown below, which might lead users to believe they are using the “best-of-breed” MT solution since they selected the “top” vendor (the highest score in each column).

English to French     English to Chinese     English to Dutch
Vendor A – 46.5       Vendor C – 36.9        Vendor B – 39.5
Vendor B – 45.2       Vendor A – 34.5        Vendor C – 37.7
Vendor C – 43.5       Vendor B – 32.7        Vendor A – 35.5

While this approach looks logical at one level, it often introduces errors and undermines efficiency, because such evaluations are rarely administered consistently across the different MT systems. Also, the suitability of MT output for post-editing may be a key requirement in localization use cases, but it may be much less important in other enterprise use cases.




Assessing business value and impact


The first post in this blog series exposes many of the fallacies of automated metrics that use string-matching algorithms (like BLEU and LEPOR). These are not reliable quality assessment techniques, as they only reflect the calculated precision and recall of text matches in a single test set, usually on material unrelated to the enterprise domain of interest.
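
For readers who want to see what "precision of text matches" means mechanically, here is a toy sketch of the counting these metrics are built on. It is a simplified, clipped n-gram precision, not the full BLEU formula:

```python
# Toy, simplified clipped n-gram precision -- the basic building block of
# string-matching metrics like BLEU. Not the full BLEU formula.
from collections import Counter

def ngram_precision(hypothesis: str, reference: str, n: int = 2) -> float:
    hyp_tokens = hypothesis.split()
    ref_tokens = reference.split()
    hyp_ngrams = Counter(tuple(hyp_tokens[i:i + n]) for i in range(len(hyp_tokens) - n + 1))
    ref_ngrams = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    # Credit each hypothesis n-gram only as often as it appears in the reference.
    matches = sum(min(count, ref_ngrams[gram]) for gram, count in hyp_ngrams.items())
    total = sum(hyp_ngrams.values())
    return matches / total if total else 0.0

reference = "the cat sat on the mat"
print(ngram_precision("the cat sat on the mat", reference))      # 1.0: exact match
print(ngram_precision("a cat was sitting on a mat", reference))  # 0.0, same meaning
```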

The issues discussed challenge the notion that a single-point score can really tell you enough about long-term MT quality implications. This is especially true as we move away from the localization use case. In the use cases listed later in this post, speed, overall agility and responsiveness, and integration into customer-experience data flows matter much more; the translation quality variance that BLEU or LEPOR measures may have little to no impact on what really matters.



The enterprise value equation is much more complex and goes far beyond linguistic quality and Natural Language Processing (NLP) scores. To truly reflect business value and impact, evaluation of MT technology must factor in non-linguistic attributes, including:
  • Adaptability to business use cases
  • Manageability
  • Integration into enterprise infrastructure
  • Deployment flexibility   
To effectively link MT output to business value implications, we need to understand that although linguistic precision is an important factor, it often has a lower priority in high-value business use cases. This view will hopefully take hold as the purpose and use of MT is better understood in the context of a larger business impact scenario, beyond localization.

But what would more dynamic and informed approaches look like? MT evaluation certainly cannot be static, since systems must evolve as requirements change. Ideally, we would replace the single-point score with a framework that still yields one easy measure telling us everything we need to know about an MT system; today, that is unfortunately not yet feasible.


A more meaningful evaluation framework


While single-point scores do provide a quick, rough sense of an MT system’s performance, it is more useful to focus testing efforts on the requirements of the specific enterprise use case. The same applies to automated metrics: scores based on news domain tests should be viewed with care, since they are unlikely to be representative of performance on specialized enterprise content.
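
In practice, this means building a test set from your own content. The sketch below, which again assumes sacrebleu and uses invented file names, scores the same hypothetical engine on a public news test set and on an in-domain support-content test set:

```python
# Hedged sketch: compare one engine's BLEU on news vs. in-domain content.
# All file names are hypothetical placeholders.
import sacrebleu

def corpus_score(hyp_path: str, ref_path: str) -> float:
    with open(hyp_path, encoding="utf-8") as f:
        hypotheses = f.read().splitlines()
    with open(ref_path, encoding="utf-8") as f:
        references = f.read().splitlines()
    return sacrebleu.corpus_bleu(hypotheses, [references]).score

print("news test set    :", corpus_score("engine_news.hyp", "news.ref"))
print("support test set :", corpus_score("engine_support.hyp", "support.ref"))
# A strong news score says little about the support score; evaluate on the
# content you actually need to translate.
```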

When rating different MT systems, it is essential to score key requirements for enterprise use, including the following (a simple sketch of combining such ratings appears after the list):

  • Adaptability: Range of options and controls available to tune the MT system performance for very specific use cases. For example, optimization techniques applied to eCommerce catalog content should be very different from those applied to technical support chatbot content or multilingual corporate email systems.
  • Data privacy and security: If an MT system will be used to translate confidential emails or business strategy and tactics documents, the evaluation requirements will differ greatly from those of a system focused only on product documentation. Some systems harvest data for machine learning purposes, and it is important to understand this upfront.
  • Deployment flexibility: Some MT systems need to be deployed on-premises to meet legal requirements, as is the case in litigation scenarios or when handling high-security data.
  • Expert services: Having highly qualified experts to assist in the MT system tuning and customization can be critical for certain customers to develop ideal systems. 
  • IT integration: Increasingly, MT systems are embedded in larger business workflows to enable greater multilingual capabilities, for example, in communication and collaboration software infrastructures like email, chat and CMS systems.
  • Overall flexibility: Together, all these elements provide flexibility to tune the MT technology to specific use cases and develop successful solutions.
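
As one illustration of how such criteria could be combined, here is a hypothetical weighted scorecard. The weights, vendors, and ratings are all invented; each team would set its own to reflect its use case:

```python
# Hypothetical weighted scorecard for comparing MT systems on the
# non-linguistic criteria above. All weights and ratings are invented.
CRITERIA_WEIGHTS = {
    "adaptability": 0.30,
    "data_privacy": 0.25,
    "deployment_flexibility": 0.15,
    "expert_services": 0.10,
    "it_integration": 0.20,
}

vendor_ratings = {  # ratings on a 1-5 scale, per criterion
    "Vendor A": {"adaptability": 4, "data_privacy": 3, "deployment_flexibility": 5,
                 "expert_services": 4, "it_integration": 3},
    "Vendor B": {"adaptability": 2, "data_privacy": 5, "deployment_flexibility": 3,
                 "expert_services": 3, "it_integration": 5},
}

for vendor, ratings in vendor_ratings.items():
    score = sum(CRITERIA_WEIGHTS[criterion] * r for criterion, r in ratings.items())
    print(f"{vendor}: weighted score = {score:.2f} / 5")
```

The point of such a scorecard is not the arithmetic but the discipline: it forces an explicit statement of what matters for the use case before any vendor comparison begins.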

Ultimately, the most meaningful measures of MT success are directly linked to business outcomes and use cases. The definition of success varies by the use case, but most often, linguistic accuracy as an expression of translation quality is secondary to other measures of success. 


The integrity of the overall solution likely has much more impact than MT output quality in the traditional sense: MT output quality could vary by as much as 10-20% on either side of the current BLEU score without affecting the true business outcome. Linguistic quality matters, but it is not the ultimate driver of successful business outcomes. In fact, there are reports from an eCommerce use case where improvements in output quality actually reduced conversion rates on the post-edited sections, because the post-edited content was perceived as advertising-driven and thus less authentic and trustworthy.



True expressions of successful business outcomes for different use cases


Global enterprise communication and collaboration
  • Increased volume in cross-language internal communication and knowledge sharing with safeguarded security and privacy
  • Better monitoring and understanding of global customers 
  • Rapid resolution of global customer problems, measured by volume and degree of engagement
  • More active customer and partner communications and information sharing
Customer service and support
  • Higher volume of successful self-service across the globe
  • Easy and quick access to multilingual support content 
  • Increased customer satisfaction across the globe
  • The ability of monolingual live agents to service global customers regardless of the originating customer’s language 
eCommerce
  • Measurably increased traffic drawn by new language content
  • Successful conversions in all markets
  • Transactions driven by newly translated content
  • The stickiness of new visitors in new language geographies
Social media analysis
  • Ability to identify key brand impressions 
  • Easy identification of key themes and issues
  • A clear understanding of key positive and negative reactions
Localization
  • Faster turnaround for all MT-based projects
  • Lower production cost as a reflection of lower cost per word
  • Better MTPE experience based on post-editor ratings
  • Adaptability and continuous improvement of the MT system

A presentation and webinar that go into much more detail on this subject are available on BrightTALK.


In upcoming posts in this series, we will continue to explore MT quality assessment from a broad enterprise needs perspective. More informed practices will result in better outcomes and significantly improved MT deployments that serve the core business mission: solving high-volume multilingual challenges more effectively.

Again, this is a rawer, slightly less polished variant of a version published on the SDL site. The first post focused on BLEU scores, which are often improperly used to draw inferences about MT quality, even though BLEU is clearly not the best metric for that purpose.
