
Thursday, November 7, 2019

The Global Data Explosion in the Legal Industry

This was originally published on SDL.COM.


As we survey the forces impacting the legal industry today, we see several ongoing trends that are demanding ever more attention from both inside and outside counsel. These forces are:
  • The Digital Data Momentum
  • Increasing Concern for Data Security
  • The Growing Importance of Information Governance
  • Increasing Globalization 

 

The Digital Data Momentum


Several studies by IDC, EMC, and academics have predicted for years that we are facing an ever-growing data deluge and content explosion. The prediction that the digital universe will reach 44 zettabytes by 2020 means little to most of us. But consider that every single day roughly 500 million tweets, ~300 billion emails, and 65 billion WhatsApp messages are sent, and 3.5 billion Google searches are made; put that way, many more of us can grasp the astounding scale of the modern digital world. While only a small fraction of this data will flow into the purview of the legal profession, the impact is significant, and most legal teams will admit this increase in content is a major challenge today.



The enterprise is also affected by this content explosion, and a recent eDiscovery Business Confidence survey identified increasing data volumes as THE primary concern for the near future. In eDiscovery settings, this also complicates the information triage process, since we are seeing not only significant increases in volume but also a greater variety of data types. The modern legal purview can include mobile, voice, and image data from various sources, in addition to the data flowing through various enterprise IT systems.

Gartner Corporate Litigation

 

Increasing Concern for Data Security

 

While data security was not a top concern in the past, it is increasingly being seen as a key one. At recent Davos conferences, cybersecurity and data privacy breakdowns were identified as among the biggest threats to businesses, economies, and societies around the world. According to the World Economic Forum (WEF), attacks against businesses have almost doubled in five years, and the costs are rising too. “The world depends on digital infrastructure and people depend on their digital devices and what we’ve found is that these digital devices are under attack every single day,” said Brad Smith, president and chief legal officer of Microsoft. He added that attacks by organized criminal enterprises are becoming “more prolific and more sophisticated”, often “operating in jurisdictions that are more difficult to reach through the rule of law but use the internet to seek out victims literally everywhere.”

The rise of artificial intelligence and machine learning also means that global enterprises are keen to acquire and harvest data wherever and whenever they can. Businesses are looking to gather as much information as possible about customers, interactions, and brand opinions, and to extract insights that might give them an edge over the competition. Data-guzzling machine learning processes promise to amplify businesses’ ability to predict, personalize, and produce. However, some of the world’s largest consumer-facing companies have fallen victim to data breaches affecting hundreds of millions of customers. By all measures, the disruptive, data-centric forces of the so-called fourth industrial revolution appear to be outpacing the world’s ability to control them.

Legal professionals will need to play a larger role in managing these new risks, which can be devastating and cost millions in reparations and other negative consequences. Increasingly, these threats originate in foreign countries, sometimes even with support from foreign governments.

 Internal Investigations

 

The Growing Importance of Information Governance

 

The modern global enterprise has a very different risk tolerance profile from that of similar companies even 10 years ago. The “datafication” of the modern enterprise creates special challenges for both inside and outside counsel. Recent surveys by Gartner suggest that legal leaders have to start investing in digital skills and capabilities, reflecting the evolving role of the legal department as a strategic business partner.

“How legal departments build capabilities to govern risk within digital initiatives matters more than the legal advice they provide,” says Christina Hertzler, Practice Vice President, Gartner.

To be digitally ready, legal departments must shift their approach to manage specific changes created by digitalization — more stakeholders, more speed and iteration, and the increased technical and collaborative nature of digital work, as well as handling new information-related risks.

As organizations change the way they operate, generate revenue and create value for their customers, new compliance risks are emerging — presenting a challenge to compliance, which must identify, assess and mitigate risks like those tied to fundamentally new technologies (e.g., artificial intelligence) and processes.

Information Governance

There is a growing list of US companies already subjected to GDPR-related EU regulatory actions, including Amazon, Apple, Facebook, Google, Netflix, Spotify, and Twitter. Indeed, the French Data Protection Authority, CNIL, recently levied a record fine of approximately $57 million on Google for “lack of transparency, inadequate information and lack of valid consent regarding ads personalization.” The risks to US companies include having to prove the measures taken to protect, process, and transfer personal data from the EU to the US in connection with regulatory investigations or litigation. A report published in late February by DLA Piper cited data from the first eight months of GDPR enforcement, during which 91 fines were imposed. "We expect that 2019 will see more fines for tens and potentially even hundreds of millions of euros, as regulators deal with the backlog of GDPR data breach notifications," the report said. Taking meaningful steps now toward GDPR compliance is the best way for US companies doing business of any kind involving EU personal data—including those with no physical presence in the EU—to prepare for and mitigate their risk.

The penalties of non-compliance with regulatory policies continue to mount.  Google was fined $170 million and asked to make changes to protect children’s privacy on YouTube, as regulators said the video site had knowingly and illegally harvested personal information from children and used it to profit by targeting them with ads. We can only expect that data privacy and compliance regulations will be taken more seriously in the future and that legal teams will play an expanding role in ensuring this.

Facebook agreed to pay a record-breaking $5 billion fine as part of a settlement with the Federal Trade Commission, by far the largest penalty ever imposed on a company for violating consumers' privacy rights. Facebook also agreed to adopt new protections for the data users share on the social network and to measures that limit the power of CEO Mark Zuckerberg. Under the settlement, which concludes a year-long investigation prompted by the 2018 Cambridge Analytica scandal, the social networking giant must expand its privacy protections across Facebook itself, as well as on Instagram and WhatsApp. It must also adopt a corporate system of checks and balances to remain compliant, according to the FTC order. Facebook must also maintain a data security program, which includes protections of information such as users' phone numbers. The issue of data privacy and compliance will continue to build momentum as more people understand the extent of the data harvesting that is going on.

Taking meaningful steps now toward robust information governance and compliance for all kinds of privileged and confidential data will be necessary for the modern digital-centric enterprise, and the modern legal department will need to be an active partner in helping the enterprise prepare for and mitigate its risk.


Compliance and Regulation Processes

 

Increasing Globalization = More Multilingual Data

 

While the forces we have just described continue to build momentum, driven by increasing digitalization and the resultant ever-expanding content flows, we also face an additional layer of complexity: language. The modern enterprise goes global much more rapidly and naturally than before, and thus the modern legal department and outside counsel need to be able to process content and information flows in multiple languages on a regular basis. The variety and volumes of multilingual content that legal professionals need to process and monitor can include any and all of the following:
  • International contract negotiations and disputes
  • Patent-infringement litigation
  • Human Resource communications in global enterprises
  • Customer communications
  • GDPR Compliance related monitoring and analysis 
  • Cross-border regulatory compliance monitoring
  • FCPA compliance monitoring 
  • Anti-trust related matters
The volumes of multilingual content can vary greatly, from very large volumes that might involve tens of thousands of documents in litigation related eDiscovery, to specialized monitoring of customer communications to ensure regulatory compliance, to smaller volumes of sensitive communications with global employees.

Multilingual issues are especially present in cross-border partnerships and business dealings which are now increasingly common across many industries.
The AlixPartners Global Anticorruption Survey polled corporate counsel, legal, and compliance officers at companies based in the US, Europe, and Asia in more than 20 major industries. The perceived corruption risks are elevated in Latin America and China, and Russia, Africa, and the Middle East have emerged as regions of increasing concern. The survey found that 90% and 94% of companies with operations in Latin America and China, respectively, reported their industries are exposed to corruption risk. Of the 66% of respondents who said there are regions where it is impossible to avoid corrupt business practices, 31% said Russia is one such place and 27% cited Africa.

The sheer volume of information companies must collect, translate, and analyze is the biggest obstacle to tackling corruption, according to 75% of survey respondents. 

These concerns surrounding the management of data are expected to increase with increasing data privacy regulation such as the EU’s General Data Protection Regulation.

 Data Growth

 

SDL: End-to-end translation solutions for the legal industry 


Thus, we see today that language translation production capabilities have become imperative for the modern global enterprise, and that the need for translation can range from rapid translation of millions of documents in an eDiscovery scenario to very careful and specialized translation of critical contract and court-ready documentation. Given the volume, variety, and velocity of the information that needs translation, legal professionals must consider a combination of technology and human services. Ideally, these varied translation challenges would be handled by technologically informed professionals who can adapt language technology and human expertise to the challenge at hand.

Language Translation

SDL’s translation capabilities range from handling large eDiscovery litigation-related projects, using MT enhanced with expertly developed client-specific glossaries to improve the ability to identify relevant documents no matter the language, to careful, specialized translation of contract and court-ready documentation. The company provides around-the-clock, around-the-world service using state-of-the-art linguistic AI tools to ensure greater accuracy and security with reduced costs and turnaround times. The company has a pool of certified and specialized translators across a number of jurisdictions and languages worldwide who have expertise and competence across a wide range of legal documents. The company is already working with 19 of the top 20 law firms in the world. The translation supply chain is often the hidden weak spot in an organization's data compliance. SDL’s secure translation supply chain gives you fully auditable data custody of your translation processes and can be cascaded down through your outside counsel and consultants to create a replicable process across all of your legal service partners.

SDL’s secure translation supply chain solution provides an enterprise-class, vendor-agnostic, secure translation platform that allows you to combine regulatory compliance and translation best practice. Powered by SDL’s leading linguistic technologies, your organization benefits from consistently applied terminology, your teams have full visibility of spend and leverage, and you can easily flex your approved supplier pool to meet your needs. Securing the translation supply chain needn’t come at the cost of trusted suppliers or existing relationships, nor need it impact time to market.

Multilingual Data Triage

To find out how SDL can support your multilingual data processing and translation strategy, please visit the legal pages at SDL.com, which provide more insight into what SDL can do for you.

Tuesday, October 8, 2019

Post-editese is real

Ever since machine translation was introduced into the professional translation industry, there have been questions about its impact on the final delivered translation product. For much of the history of MT, many translators claimed that while translation production work using a post-edited MT (PEMT) process was faster, the final product was not as good. The research suggests that this has been true from a strictly linguistic perspective, but many of us also know that PEMT worked quite successfully with technical content, especially for terminology and consistency, even in the days of SMT and RBMT.

As NMT systems proliferate, we are at a turning point, and I suspect that we will see many more NMT systems whose output clearly enhances translator productivity, especially systems built by experts. NMT is also quite likely to influence output quality, and the difference between post-edited and unaided translation is likely to become less prominent. This is what developers mean when they claim to have achieved human parity: if competent human translators cannot tell whether the segments they review came from MT or not, a limited claim of human parity can be made. This does not mean that it will be true for every new sentence submitted to the system.

We should also understand that MT provides the greatest value in use scenarios where there are large volumes of content (millions rather than thousands of words), short turnaround times, and limited budgets. Increasingly, MT is used in scenarios where little or no post-editing is done, and by many informed estimates, we are already at a run rate of a trillion words a day going through MT engines. While post-editese may be an important consideration in localization use scenarios, these are likely no more than 2% of all MT usage.

Enterprise MT use is rapidly moving into a phase where it is an enterprise-level IT resource. The modern global enterprise needs to enable millions of words to be translated on demand in a secure and private way, and MT needs to be integrated deeply into critical communication, collaboration, and content creation and management software.

The research presented by Antonio Toral below documents the impact of post-editing on the final output across multiple different language combinations and MT systems. 



==============

This is a summary of the paper “Post-editese: an Exacerbated Translationese” by Antonio Toral, which was presented at MT Summit 2019, where it won the best paper award.


Introduction


Post-editing (PE) is widely used in the translation industry, mainly because it leads to higher productivity than unaided human translation (HT). But, what about the resulting translation? Are PE translations as good as HT? Several research studies have looked at this in the past decade and there seems to be consensus: PE is as good as HT or even better (Koponen, 2016).

Most of these studies measure the quality of translations by counting the number of errors therein. Taking into account that there is more to quality than just the number of mistakes, we ask ourselves the following question instead: are there differences between translations produced with PE vs HT? In other words, do the final outputs created via PE and HT have different traits?

Previous studies have unveiled the existence of translationese, i.e. the fact that HTs and original texts exhibit different characteristics. These characteristics can be grouped under the so-called translation universals (Baker, 1993) and fundamental laws of translation (Toury, 2012), namely simplification, normalization, explicitation and interference. Along this line of thinking, we aim to unveil the existence of post-editese (i.e. the fact that PEs and HTs exhibit different characteristics) by comparing PEs and HTs using a set of computational analyses that align with the aforementioned translation universals and laws of translation.

Data

We use three datasets in our experiments: Taraxü (Avramidis et al., 2014), IWSLT (Cettolo et al., 2015; Mauro et al., 2016) and Microsoft “Human Parity” (Hassan et al., 2018). These datasets cover five different translation directions and allow us to assess the effect of machine translation (MT) systems from 2011, 2015-16 and 2018 on the resulting PEs.

Analyses

Lexical Variety

We assess the lexical variety of a translation (HT, PE or MT) by calculating its type-token ratio (TTR):

TTR = number of types (unique words) / number of tokens (total words)

In other words, given two equally long translations (in number of words), the one with the bigger vocabulary (more unique words) has a higher TTR and is therefore considered lexically richer, or higher in lexical variety.
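As a concrete illustration, a minimal Python sketch of the TTR computation might look like this (whitespace tokenization and lowercasing are assumptions here; the paper's exact preprocessing may differ):

```python
# Minimal sketch: type-token ratio (TTR) of a translation, assuming
# whitespace tokenization and lowercasing (the paper's exact
# preprocessing may differ).
def type_token_ratio(text: str) -> float:
    tokens = text.lower().split()   # tokens: all running words
    types = set(tokens)             # types: unique words
    return len(types) / len(tokens) if tokens else 0.0

ht = "the judge ruled swiftly and the counsel objected immediately"
mt = "the judge ruled quickly and the judge ruled quickly again"
print(type_token_ratio(ht))  # ~0.89 -> richer vocabulary
print(type_token_ratio(mt))  # 0.60  -> more repetition
```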

The following figure shows the results for the Microsoft dataset for the direction Chinese-to-English (zh–en; the results for the other datasets follow similar trends and can be found in the paper). HT has the highest lexical variety, followed by PE, while the lowest value is obtained by the MT systems. A possible interpretation is as follows: (i) lexical variety is low in MT because these systems prefer the translation solutions that are most frequent in their training data, and (ii) a post-editor will add lexical variety to some degree (the difference in the figure between MT and PE), but because the MT output primes him/her (Green et al., 2013), the resulting PE translation will not achieve the lexical variety of HT.


Lexical Density

The lexical density of a text indicates its amount of information and is calculated as follows:

lexical density = number of content words / total number of words

where content words correspond to adverbs, adjectives, nouns, and verbs. Hence, given two equally long translations, the one with the higher number of content words would be considered to have higher lexical density, in other words, to contain more information.
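A small sketch of this computation, assuming spaCy's part-of-speech tagger and its en_core_web_sm model (not necessarily the tooling used in the paper):

```python
# Minimal sketch: lexical density = content words / total words, where
# content words are adverbs, adjectives, nouns, and verbs. spaCy and
# the en_core_web_sm model are assumptions (install the model with:
# python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")
CONTENT_POS = {"ADV", "ADJ", "NOUN", "VERB"}  # content-word PoS tags

def lexical_density(text: str) -> float:
    doc = nlp(text)
    words = [t for t in doc if not t.is_punct and not t.is_space]
    content = [t for t in words if t.pos_ in CONTENT_POS]
    return len(content) / len(words) if words else 0.0

print(lexical_density("The court imposed a record fine on the company."))
```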

The following figure shows the results for the three translation directions in the Taraxü dataset: English-to-German, German-to-English and Spanish-to-German. The lexical density in HT is higher than in both PE and MT and there is no systematic difference between the latter two.

Length Ratio

Given a source text (ST) and a target text (TT), where the TT is a translation of the ST (HT, PE or MT), we compute a measure of how different in length the TT is with respect to the ST:

length ratio = |length(ST) − length(TT)| / length(ST)

This means that the bigger the difference in length between the ST and the TT (be it because the TT is shorter or longer than the ST), the higher the length ratio.
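A minimal sketch, assuming lengths are counted in whitespace-separated words:

```python
# Minimal sketch: length ratio between a source text (ST) and a
# target text (TT), counting lengths in whitespace-separated words.
def length_ratio(st: str, tt: str) -> float:
    st_len, tt_len = len(st.split()), len(tt.split())
    return abs(st_len - tt_len) / st_len if st_len else 0.0

# A TT one word longer than a five-word ST yields 1/5 = 0.2.
print(length_ratio("Der Vertrag wurde gestern unterzeichnet",
                   "The contract was signed only yesterday"))
```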

The following figure shows the results for the Taraxü dataset. The trend is similar to the one in lexical variety; that is, HT obtains the highest result, MT the lowest, and PE lies somewhere in between. We interpret this as follows: (i) MT produces a translation of similar length to the ST due to how the underlying MT technology works, and PE is primed by the MT output, while (ii) a translator working from scratch may translate more freely in terms of length.

Part-of-speech Sequences

Finally, we assess the interference of the source language on a translation (HT, PE and MT) by measuring how close the sequence of part-of-speech tags in the translation is to the typical part-of-speech sequences of the source language and to those of the target language. If the sequences in a translation are similar to the typical sequences of the source language, that indicates interference from the source language in the translation.
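The underlying idea can be sketched as follows: train simple n-gram language models over PoS-tag sequences drawn from source-language and target-language corpora, then score a translation's PoS sequence under each. The nltk-based toy example below is an illustrative assumption; the paper's actual models and perplexity-difference metric differ in detail:

```python
# Illustrative sketch: train trigram language models over PoS-tag
# sequences of a source-language corpus and a target-language corpus,
# then score a translation's PoS sequence under each. The nltk Laplace
# model and the toy tag sequences are assumptions; the paper's actual
# models and perplexity-difference metric differ in detail.
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import ngrams

ORDER = 3

def train_pos_lm(tag_sequences):
    train, vocab = padded_everygram_pipeline(ORDER, tag_sequences)
    lm = Laplace(ORDER)
    lm.fit(train, vocab)
    return lm

# Toy PoS-tagged corpora: one tag sequence per sentence.
src_corpus = [["DET", "NOUN", "VERB", "ADJ"], ["PRON", "VERB", "DET", "NOUN"]]
tgt_corpus = [["DET", "ADJ", "NOUN", "VERB"], ["PRON", "VERB", "ADV"]]
src_lm, tgt_lm = train_pos_lm(src_corpus), train_pos_lm(tgt_corpus)

translation = ["DET", "NOUN", "VERB", "ADJ"]
test = list(ngrams(pad_both_ends(translation, n=ORDER), ORDER))

# A translation whose PoS sequence is less surprising to the
# source-language model (lower perplexity) shows more interference.
print(src_lm.perplexity(test), tgt_lm.perplexity(test))
```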

The following figure shows the results for the IWSLT dataset. The metric used is perplexity difference; the higher it is, the lower the interference (full details on the metric can be found in the paper). Again, we find a trend similar to some of the previous analyses: HT gets the highest results, MT the lowest, and PE lies somewhere in between. The interpretation is again similar: MT outputs exhibit a large amount of interference from the source language; a post-editor removes some of that interference, but the resulting translation still has more interference than an unaided translation.


Findings

The findings from our analyses can be summarised as follows in terms of HT vs PE:
  • PEs have lower lexical variety and lower lexical density than HTs. We link these to the simplification principle of translationese. Thus, these results indicate that post-editese is lexically simpler than translationese.
  • Sentence length in PEs is more similar to the sentence length of the source texts than sentence length in HTs is. We link this finding to interference and normalization: (i) PEs have interference from the source text in terms of length, which leads to translations that follow the typical sentence length of the source language; (ii) this results in a target text whose length tends to become normalized.
  • Part-of-speech (PoS) sequences in PEs are more similar to the typical PoS sequences of the source language than PoS sequences in HTs. We link this to the interference principle: the sequences of grammatical units in PEs preserve to some extent the sequences that are typical of the source language.

In terms of the role of MT: we have considered not only HTs and PEs but also the MT outputs from the MT systems that were the starting point for producing the PEs. This was done to corroborate a claim in the literature (Green et al., 2013), namely that in PE the translator is primed by the MT output. We expected, then, to find trends in the MT outputs similar to those found in PEs, and this was indeed the case in all four analyses. In some experiments, the results of PE were somewhere in between those of HT and MT. Our interpretation is that a post-editor improves the initial MT output but, due to being primed by the MT output, the result cannot attain the level of HT, and the footprint of the MT system remains in the resulting PE.

Discussion

As said in the introduction, we know that PE is faster than HT. The question I wanted to address was then: can PE not only be faster but also be at the level of HT quality-wise? In this study, this is looked at from the point of view of translation universals and the answer is clear: no. However, I'd like to point out three additional elements:
  1. The text types in the 3 datasets that I have used are news and subtitles; both are open-domain and could be considered, to a certain extent, "creative". I wonder what happens with technical texts, given their relevance for industry, and I plan to look at that in the future.
  2. As mentioned in the introduction, previous studies have compared HT vs PE in terms of the number of errors in the resulting translation. In all the studies I've encountered, PE is at the level of HT or even better. Thus, for technical texts where terminology and consistency are important, PE is probably better than HT. I thus find the choice between PE and HT to be a trade-off between consistency on the one hand and translation universals (simplification, normalization and interference) on the other.
  3. PE falls behind HT in terms of translation universals because MT falls behind HT in those terms. However, this may no longer be the case in the future. For example, the paper shows that PE-NMT has less interference than PE-SMT, thanks to the better reordering in the former.




Antonio Toral is an Assistant Professor at the Computational Linguistics group, Center for Language and Cognition, Faculty of Arts, University of Groningen (The Netherlands). His research is in the area of Machine Translation. His main topics include resource acquisition, domain adaptation, diagnostic evaluation and hybrid approaches.


Related Work

Other work has previously looked at HT vs PE beyond the number of errors. The papers most closely related to this one are Bangalore et al. (2015), Carl and Schaeffer (2017), Czulo and Nitzke (2016), Daems et al. (2017) and Farrell (2018).

Bibliography


Avramidis, Eleftherios, Aljoscha Burchardt, Sabine Hunsicker, Maja Popović, Cindy Tscherwinka, David Vilar, and Hans Uszkoreit. 2014. The Taraxü corpus of human-annotated machine translations. In LREC, pages 2679–2682.

Baker, Mona. 1993. Corpus linguistics and translation studies: Implications and applications. In Text and Technology: In Honour of John Sinclair, pages 233–250.

Bangalore, Srinivas, Bergljot Behrens, Michael Carl, Maheshwar Gankhot, Arndt Heilmann, Jean Nitzke, Moritz Schaeffer, and Annegret Sturm. 2015. The role of syntactic variation in translation and post-editing. Translation Spaces, 4(1):119–144.

Carl, Michael and Moritz Jonas Schaeffer. 2017. Why translation is difficult: A corpus-based study of non-literality in post-editing and from-scratch translation. Hermes, 56:43–57.

Cettolo, Mauro, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, Roldano Cattoni, and Marcello Federico. 2015. The IWSLT 2015 evaluation campaign. In IWSLT 2015, International Workshop on Spoken Language Translation.

Green, Spence, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In CHI 2013, pages 439–448.

Hassan, Hany, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, Will Lewis, Mu Li, Shujie Liu, Tie-Yan Liu, Renqian Luo, Arul Menezes, Tao Qin, Frank Seide, Xu Tan, Fei Tian, Lijun Wu, Shuangzhi Wu, Yingce Xia, Dongdong Zhang, Zhirui Zhang, and Ming Zhou. 2018. Achieving Human Parity on Automatic Chinese to English News Translation. https://arxiv.org/abs/1803.05567.

Koponen, Maarit. 2016. Is machine translation post-editing worth the effort? A survey of research into post-editing and effort. Journal of Specialised Translation, 25(25):131–148.

Cettolo, Mauro, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, Roldano Cattoni, and Marcello Federico. 2016. The IWSLT 2016 evaluation campaign. In International Workshop on Spoken Language Translation.

Toury, Gideon. 2012. Descriptive translation studies and beyond: Revised edition, volume 100. John Benjamins Publishing.

Wednesday, September 25, 2019

In a Funk about BLEU

This is a more fleshed-out version of a blog post by Pete Smith and Henry Anderson of the University of Texas at Arlington already published on SDL.com. They describe initial results from a research project they are conducting on MT system quality measurement and related issues. 

MT quality measurement, like human translation quality measurement, has been a difficult and challenging subject for both the translation industry and for many MT researchers and system developers, as the most commonly used metric, BLEU, is now quite widely understood to be of especially limited value with NMT systems.

Most of the other text-matching NLP scoring measures are just as suspect, and practitioners are reluctant to adopt them, as they are either difficult to implement or their interpretation pitfalls and nuances are not well understood. They all generate a numeric score based on various calculations of precision and recall that need to be interpreted with great care. Most experts will say that the only reliable measures are those done by competent humans, and increasingly best practices suggest that a trust-but-verify approach is better. There are many variations of superficially accurate measures available today, but on closer examination, they all lack critical elements that would make them entirely reliable and foolproof.

So, as much as BLEU scores suck, we continue to use them since some, or perhaps even many, of us understand them. Unfortunately, many still don't have a real clue, especially in the translation industry.

I wonder sometimes if all this angst about MT quality measurement is much ado about nothing. We do, in fact, need rough indicators of MT quality to make judgments of suitability in business use cases, but taking these scores as final indicators of true quality is problematic. It is likely that the top 5 or even top 10 systems are essentially equivalent in terms of MT quality impact on the business purpose. The real difference in business impact comes from other drivers: competence, experience, process efficiency, and quality of implementation.

I would argue that even for localization use cases, the overall process design and other factors matter more than the MT output quality.

As we have said before, technology has value when it produces favorable business outcomes, even if these outcomes can be somewhat challenging to measure with a precise and meaningful grade. MT is a technology that is seldom perfect, but even in its imperfection it can provide great value to an enterprise with a global presence. MT systems with better BLEU or LEPOR scores do not necessarily produce better business outcomes. I would argue that an enterprise could use pretty much any "serious" MT system without any impact on the final business outcome.

This is most clear with eCommerce and global customer service and support use cases, where the use of MT can very rapidly yield a significant ROI. 

"eBay’s push for machine translation has helped the company increase Latin American exports by nearly 20%, according to researchers from the Massachusetts Institute of Technology, and illustrates the potential for increased commercial activity as translation technologies gain wider adoption in business."
MT deployment use case presentations shared by practitioners who have used MT to translate large volumes of knowledgebase support content show that what matters is whether the content helps customers across the globe get to answers that solve problems faster. Translation quality matters but only if it helps understandability. In the digital world, speed is crucial and often more important.

Some 100,000 buyers exchange a total of 2 billion translated text messages every week on the Alibaba.com global-trade platform. The velocity and volume of communication enabled by MT make possible new levels of global commerce and trade. How many of these messages do you think are perfect translations?

A monolingual live support agent who can service thousands of global customers a week, because he or she can quickly understand the question and send relevant and useful support content back to a customer using MT, is another example. The ability to do this at volume matters more than perfect linguistic quality.

So the selection of the right MT technology or solution will come down to much more enterprise-relevant issues like:

  • Data Security & Privacy 
  • Adaptability to enterprise unique terminology and use cases
  • Scalability - from billions of words to thousands per hour 
  • Deployment Flexibility - On-premise, cloud or combinations of both
  • Integration with key IT infrastructure and platforms
  • Availability of expert consulting services for specialization 
  • Vendor focus on the state of the art (SOTA)
  • MT system manageability
  • Cost 
  • Vendor reputation, profile and enterprise account management capabilities

Pete Smith will be presenting more details of his research study at SDL Connect next month.


===============


There is little debate: the machine translation research and practitioner communities are in a funk about BLEU. From recent webinars to professional interviews and scholarly publications, BLEU is being called on the carpet for its technical shortcomings in the face of a rapidly-developing field, as well as the lack of insight it provides to different consumers such as purchasers of MT services or systems.

BLEU itself is used widely, especially in the MT research community, as an outcome measure for evaluating MT. Yet even in that setting, there is considerable rethinking and re-evaluation of the metric, and BLEU has been an active topic of critical discussion and research for some years, including the challenges faced by evaluating automated translation across the language typology spectrum and especially in cases of morphologically rich languages. And the issue is not limited, of course, to machine translation—the metric is also a topic in NLP and natural language generation discussions generally.

BLEU’s strengths and shortcomings are well known. At its core, BLEU is a string-matching algorithm for use in evaluating MT output and is not, per se, a measure of translation quality. That said, there is no doubt that automated or calculated metrics are of great value as total global MT output approaches levels of one trillion words per day.

And few would argue that, in producing and evaluating MT or translation in general, context matters. A general-purpose, public-facing MT engine designed for broad coverage among users and use cases is just that—general-purpose, and likely more challenged by perennial source language challenges such as specific domain style/terminology, informal language usage, regional language variations, and other issues.

It is no secret that many MT products are trained (at least initially) on publicly available research data and that there are, overall, real thematic biases in those datasets. News, current events, governmental and parliamentary data sets are available across a wide array of language pairs, as well as smaller amounts of data from domains such as legal, entertainment, and lecture source materials such as TED Talks. Increasingly, datasets are available in the IT and technical domains, but there are few public bilingual datasets available that are suitable for major business applications of MT technology such as e-commerce, communication, and collaboration, or customer service.

Researchers and applied practitioners have all benefited from these publicly-available resources. But the case for clarity is perhaps most evident in the MT practitioner community.

For example, enterprise customers hoping to purchase machine translation services face a dilemma: how might the enterprise evaluate an MT product or service for their particular domain, and with more nuance and depth than simply relying on marketing materials boasting scores or gains in BLEU or LEPOR? How might you evaluate major vendors of MT services specific to your use case and needs?

And as a complicating factor, we know an increasing amount about the “whys” and “hows” of fine-tuning general-purpose engines to better perform in enterprise cases such as e-commerce product listings, technical support knowledgebase content, social media analysis, and user feedback/reviews. In particular, raw “utterances” from customers and customer support personnel in these settings are authentic language, with all of its “messiness.”

The UTA research group has recently been exploring MT engine performance on customer support content, building a specialized test set compiled from source corpora including email and customer communications, communications via social media, and online customer support. In particular, we explored the utilization of automation and standard NLP-style pre-processing to rapidly construct a representative translation test set for the focused use case.

At the start, an initial set of approximately 3 million English sentence strings related to enterprise communication and collaboration was selected. Source corpora represented tasks such as email communication, customer communications, communications via social media, and online customer support.

Candidate sentence strings from these larger corpora were narrowed via a sentence clustering technique, training a FastText model on the input documents to capture both the semantic and non-semantic (linguistic) properties of the corpora. To give some sense of the linguistic features considered in string selection, corpus elements were parsed using the spaCy natural language processing library's largest English model to extract features such as: the number of "stop words"; the number of tokens that were punctuation, numbers, e-mail addresses, URLs, alpha-only, and out-of-vocabulary; the number of unique lemmas and orthographic forms; the number of named entities; the number of times each entity type, part-of-speech tag, and dependency relation appeared in the text; and the total number of tokens. Dimensionality reduction and clustering were then applied, resulting in 1,050 English-language strings for the basic bespoke test set.
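A hypothetical sketch of such a pipeline is shown below. gensim's FastText, scikit-learn's PCA and KMeans, the en_core_web_lg model, and the toy strings are all assumptions about the toolchain, not a description of the group's actual code:

```python
# Hypothetical sketch of the selection pipeline: FastText embeddings of
# candidate strings plus spaCy-derived linguistic features, followed by
# dimensionality reduction and clustering.
import numpy as np
import spacy
from gensim.models import FastText
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

nlp = spacy.load("en_core_web_lg")
candidates = ["My order never arrived, now what?",
              "Please reset my password.",
              "Thanks, that fixed it!"]

tokenized = [s.lower().split() for s in candidates]
ft = FastText(sentences=tokenized, vector_size=50, min_count=1, epochs=10)

def features(text: str) -> np.ndarray:
    doc = nlp(text)
    # Semantic part: mean FastText vector over the string's tokens.
    sem = np.mean([ft.wv[t] for t in text.lower().split()], axis=0)
    # Linguistic part: a few of the string-level counts described above.
    ling = np.array([sum(t.is_stop for t in doc),
                     sum(t.is_punct for t in doc),
                     sum(t.like_num for t in doc),
                     sum(t.like_url for t in doc),
                     len({t.lemma_ for t in doc}),
                     len(doc.ents),
                     len(doc)], dtype=float)
    return np.concatenate([sem, ling])

X = PCA(n_components=2).fit_transform(np.vstack([features(s) for s in candidates]))
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
# Representative strings (e.g., nearest to each centroid) would then
# be drawn from each cluster to build the test set.
```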

The strings from the constructed set were translated into seven languages (French, German, Hindi, Korean, Portuguese, Russian, Spanish) by professional translators to serve as references. The source sentences from the test set were then submitted as translation prompts in seven language pairs (English-French, English-German, English-Hindi, English-Korean, English-Portuguese, English-Russian, English-Spanish) to four major, publicly available MT engines via API or web interface. At both the corpus and the individual string level, BLEU, METEOR, and TER scores were generated for each major engine and language pair (not all of the seven languages were represented in all engine products).
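For illustration, a minimal sketch of how such scores might be computed against the professional reference translations, using sacrebleu for BLEU and TER and nltk for METEOR (the tooling and toy strings are assumptions, not the study's actual setup):

```python
# Illustrative sketch: corpus- and segment-level scores against the
# professional reference translations, using sacrebleu (BLEU, TER) and
# nltk (METEOR; requires nltk.download("wordnet") and is English-centric
# in its synonym matching). Tooling and toy strings are assumptions.
from sacrebleu.metrics import BLEU, TER
from nltk.translate.meteor_score import meteor_score

hyps = ["Das Paket ist nie angekommen.",
        "Bitte setzen Sie mein Passwort zurück."]   # MT engine output
refs = ["Das Paket kam nie an.",
        "Bitte setzen Sie mein Passwort zurück."]   # human references

bleu, ter = BLEU(), TER()
print(bleu.corpus_score(hyps, [refs]))  # corpus-level BLEU
print(ter.corpus_score(hyps, [refs]))   # corpus-level TER

# Segment-level scores for per-string analysis.
seg_bleu = BLEU(effective_order=True)
for h, r in zip(hyps, refs):
    print(seg_bleu.sentence_score(h, [r]).score,
          meteor_score([r.split()], h.split()))
```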

Our overall question was: does BLEU (or any of the other automated scores) support, say, the choice of engine A over engine B for enterprise purchase when the use case is centered on customer-facing and customer-generated communications? 

To be sure, the output scores presented a muddled picture. Composite scores of the general-purpose engines clustered within approximately 5-8 BLEU points of each other in most languages. And although we used a domain-specific test set, little in the results would have provided the enterprise-level customer with a clear path forward. As Kirti Vashee has pointed out recently, in responding effectively to the realities of the digital world, “5 BLEU points this way or that is negligible in most high-value business use cases.”

What are some of the challenges of authentic customer language? Two known challenges for MT are the formality or informality of language utterances and emotive content. The double punch of informal and emotion-laden customer utterances poses a particularly challenging case.

As we reviewed in a recent webinar, customer-generated strings in support conversations or online interactions present a translator with a variety of expressions of emotion, tone, humor, sarcasm, all embedded within a more informal and Internet-influenced style of language. Some examples included:

    Support…I f***ing hate you all. [Not redacted in the original.]
    Those late in the day deliveries "go missing" a lot.
    Nope didnt turn up just as expected now what dude?
    I feel you man, have a good rest of your day!
    Seriously, this is not OK.
    A bunch of robots who repeat the same thing over & over.
    #howdoyoustayinbusiness

Here one can quickly see how an engine trained primarily on formal, governmental, or newspaper sources would be quickly challenged. But in early results, our attempts to unpack how MT performs on emotive content (i.e., not news, legal, or technical content) have provided little insight to date. Early findings suggest surprisingly little interaction between standard ratings of sentiment and emotion run on the test set's individual strings (VADER positive, negative, neutral, and composite scores, and IBM tone analysis) and variance in downstream BLEU scores.

Interestingly, as an aside in our early work, raw BLEU scores across languages for the entire test set generally correlated comparatively highly with METEOR scores. Although some correlation is expected, the strength of the relationship was surprising in an NMT context: as high as r=.9 across 1,000+ strings in a given language pair. If, as the argument goes, NMT brings strengths in fluency, which includes elements METEOR scoring is by design more sensitive to (such as synonyms or paraphrasing), one might expect that correlation to be weaker. More broadly, these and other questions around automatic evaluation have a long history of consideration by the MT and WMT communities.
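As an illustration of both analyses, the sketch below scores a string with VADER and computes a Pearson correlation between per-string BLEU and METEOR scores. The vaderSentiment and scipy libraries, and every number shown, are assumptions included purely for demonstration:

```python
# Illustrative sketch: VADER sentiment ratings per string, and a
# Pearson correlation between per-string BLEU and METEOR scores.
# All numbers below are made up for demonstration.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from scipy.stats import pearsonr

analyzer = SentimentIntensityAnalyzer()
print(analyzer.polarity_scores(
    "A bunch of robots who repeat the same thing over & over."))
# -> {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}

# Hypothetical per-string scores for one language pair:
bleu_scores = [12.1, 34.5, 28.0, 51.3, 9.7]
meteor_scores = [0.18, 0.42, 0.35, 0.60, 0.15]
r, p = pearsonr(bleu_scores, meteor_scores)
print(f"r = {r:.2f}, p = {p:.3f}")  # the study observed r as high as ~.9
```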

One clearly emerging practice in the field is to combine an automated metric such as BLEU with human evaluation on a smaller data set, to confirm that the automated metrics are useful and provide critical insight, especially if the evaluation is used to compare MT systems. Kirti Vashee, Alon Lavie, and Daniel Marcu have all written on this topic recently.

Thus, a more nuanced understanding of the value of BLEU may be developing: automated scores are seen as most useful during MT research and system development, where they are by far the most widely cited standard. The recent Machine Translation Summit XVII in Dublin, for example, had almost 500 mentions of or references to BLEU in the research proceedings alone.

But this measure may be less accurate or insightful when broadly comparing different MT systems in the practitioner world, and perhaps more insightful again to both researcher and practitioner when paired with human or other ratings. As one early MT researcher has noted, “BLEU is easy to criticize, but hard to get away from!”

Discussions at the recent TAUS Global Content Conference 2019 further developed the ideas of MT engine specialization in the context of the modern enterprise content workflow. Presenters such as SDL and others offered future visions of content development, personalization, and use in a multilingual world. These future workflows may contain hundreds or thousands of specialized, specially trained, and uniquely maintained automated translation engines and other linguistic algorithms, as content is created, managed, evaluated, and disseminated globally.

There is little doubt that the automated evaluation of translation will continue to play a key role in this emerging vision. However, a better understanding of the field’s de facto metrics and the broader MT evaluation process in this context is clearly imperative.

And what of use cases that continue to emerge, such as the possibility of intelligent or MT content in the educational space? The UTA research group is also exploring MT applications specific to education and higher education. For example, millions of users daily make use of learning materials such as MOOCs—educational content that attracts users across borders, languages, and cultures. A significant portion of international learners come to, and potentially struggle with, English-language content in edX or other MOOC courses—and thousands of MOOC offerings exist in the world's languages, untranslated for English speakers. What role might machine translation potentially play in this educational endeavor?




Dr. Pete Smith, Chief Analytics Officer and Professor
Mr. Henry Anderson, Data Scientist
Localization and Translation Program
Department of Modern Languages and Office of University Analytics
The University of Texas at Arlington