Tuesday, December 31, 2019

Most Popular Blog Posts of 2019

I did not write as much as I had hoped to in 2019, but I hope to correct this in the coming year. I notice that the top two posts of the past year were written by guest writers, and I invite others who may be so moved to come forward and add to the content being produced on this blog.

These rankings are based on the statistics given to me by the hosting platform, and in general they look reasonable and likely. In these days of fake news and fake images, one does need to be wary. Other reports I have run yielded drastically different rankings that seemed somewhat suspect to me, so I am going with the listing presented in this post.

The most popular post of 2019 was from a frequent guest writer on eMpTy Pages: Luigi Muzii, who has also written extensively about post-editing best practices elsewhere.

1. Understanding the Realities of Language Data

Despite the hype, we should understand that deep learning algorithms are increasingly going to be viewed as commodities.

The data is your teacher. It's the data where the real value is. I predict that this will become increasingly clear over the coming year.

Data is valuable when it is properly collected, understood, organized, and categorized. Rich metadata and taxonomy are especially valuable with linguistic data. Luigi has already written about metadata previously, and you can find the older articles here and here. We should also understand that translation memory often lacks the quality and attributes that make it useful for training NMT systems. This is especially true when large volumes of disparate TM are aggregated together, contrary to what many in the industry believe. It is often more beneficial to create new, more relevant TM, based on real and current business needs, that better fits the source material that needs to be translated.

A series of posts focused on BLEU scores and MT output quality assessment were the next most popular. Hopefully, my efforts to steer the serious user/buyer to look at business impact beyond these kinds of scores have succeeded. Informed buyers now understand that significant score differences may have minimal business impact, and thus these scores should not be overemphasized when selecting a suitable or optimal MT solution.

2.  Understanding MT Quality: BLEU Scores

As there are many MT technology options available today, BLEU and its derivatives are sometimes used to select an MT vendor and system. Using BLEU in this context is much more problematic and prone to erroneous conclusions, since the comparisons are often between apples and oranges. The most common error in interpreting BLEU is failing to recognize the positive bias toward any MT system that has already seen and trained on the test data, or that was used to develop the test data set.

What is BLEU useful for?

Modern MT systems are built by “training” a computer with examples of human translations. As more human translation data is added, systems should generally get better in quality. Often, new data can be added with beneficial results, but sometimes new data can have a negative effect, especially if it is noisy or otherwise “dirty”. Thus, system developers need to be able to measure the quality impact of changes rapidly and frequently to make sure they are actually improving the system.

BLEU gives developers a means “to monitor the effect of daily changes to their systems in order to weed out bad ideas from good ideas.” When used to evaluate the relative merit of different system-building strategies, BLEU can be quite effective: it provides very quick feedback, which enables MT developers to refine and improve their translation systems rapidly and keep improving quality on a long-term basis.
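To make the metric concrete, BLEU's core computation (clipped n-gram precision combined with a brevity penalty) can be sketched in a few lines of Python. This is a simplified, sentence-level illustration only; real evaluations use corpus-level BLEU from a standard implementation such as sacreBLEU:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams in a token list, with their counts
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped
    n-gram precisions (n = 1..max_n) times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference,
        # so repeating a correct word cannot inflate the score
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        if clipped == 0:
            return 0.0  # any zero precision drives the geometric mean to zero
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty discourages very short candidates
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

# Identical output scores 1.0; completely unrelated output scores 0.0
print(round(bleu("the cat sat on the mat", "the cat sat on the mat"), 2))  # → 1.0
```

Even this toy version shows why BLEU is useful for daily system monitoring (it is fast and deterministic) and why it says little about business value: it only counts surface n-gram overlap with a reference.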

The enterprise value-equation is much more complex and goes far beyond linguistic quality and Natural Language Processing (NLP) scores. To truly reflect the business value and impact, evaluation of MT technology must factor in non-linguistic attributes including:
  • Adaptability to business use cases
  • Manageability
  • Integration into enterprise infrastructure
  • Deployment flexibility   
To effectively link MT output to business value implications, we need to understand that although linguistic precision is an important factor, it often has a lower priority in high-value business use cases. This view will hopefully take hold as the purpose and use of MT is better understood in the context of a larger business impact scenario, beyond localization.

Ultimately, the most meaningful measures of MT success are directly linked to business outcomes and use cases. The definition of success varies by the use case, but most often, linguistic accuracy as an expression of translation quality is secondary to other measures of success. 

The integrity of the overall solution likely has much more impact than the MT output quality in the traditional sense: not surprisingly, MT output quality could vary by as much as 10-20% on either side of the current BLEU score without impacting the true business outcome. Linguistic quality matters but is not the ultimate driver of successful business outcomes. In fact, there are reports of improvements in output quality in an eCommerce use case that actually reduced the conversion rates on the post-edited sections, as this post-edited content was viewed as being potentially advertising-driven and thus less authentic and trustworthy.

There is also a post by Dr. Pete Smith that is worth a look: In a Funk about BLEU

Your personal data security really does matter
Don't give it away

The fourth most popular post of 2019 was by guest writer Robert Etches with his vision for Blockchain. 

4.  A Vision for Blockchain in the Translation Industry

Cryptocurrency has had a very bad year, but the underlying technology is still regarded as a critical building block for many new initiatives. It is important to be realistic without denying the promise as we have seen the infamous CEOs do. Change can take time and sometimes it needs much more infrastructure than we initially imagine. McKinsey (smart people who also have an Enron and mortgage securitization promoter legacy) have also just published an opinion on this undelivered potential, which can be summarized as:
 "Conceptually, blockchain has the potential to revolutionize business processes in industries from banking and insurance to shipping and healthcare. Still, the technology has not yet seen a significant application at scale, and it faces structural challenges, including resolving the innovator’s dilemma. Some industries are already downgrading their expectations (vendors have a role to play there), and we expect further “doses of realism” as experimentation continues." 
While I do indeed have serious doubts about the deployment of blockchain in the translation industry anytime soon, I do feel that if it happens, it will be driven by dreamers rather than by process-crippled NIH (not-invented-here) pragmatists like Lou Gerstner and Rory. These men missed the obvious because they were so sure they knew all there was to know, and because they were stuck in the old way of doing things. While much about blockchain is messy and convoluted, these are early days yet, and the best is yet to come.

Finally, much to my amazement, a post that I wrote in March 2012 was the fifth most-read post of 2019, even though seven years have passed. This proves Luigi's point (I paraphrase here) that the more things change in the world at large, the more they stay the same in the translation industry.

The issue of equitable compensation for post-editors is an important one, and it is important to understand the issues related to post-editing, which many translators find to be a source of great pain and inequity. MT can often fail or backfire if the human factors underlying the work are not properly considered and addressed.

From my vantage point, it is clear that those who understand these various issues and take steps to address them are most likely to find the greatest success with MT deployments. These practitioners will perhaps pave the way for others in the industry and “show you how to do it right” as Frank Zappa says. Many of the problems with PEMT are related to ignorance about critical elements, “lazy” strategies and lack of clarity on what really matters, or just simply using MT where it does not make sense. These factors result in the many examples of poor PEMT implementations that antagonize translators. 

My role at SDL was also somewhat inevitable since as long as 7 years ago I was saying:
I suspect that the most compelling evidence of the value and possibilities of PEMT will come from LSPs who have teams of in-house editors/translators who are on fixed salaries and are thus less concerned about the word vs. hourly compensation issues. For these companies, it will only be necessary to prove, first, that MT is producing high enough quality to raise productivity, and then to ensure that everybody is working as efficiently as possible (i.e., not "over-correcting"). I would bet that these initiatives will outperform any in-house corporate MT initiative in quality and efficiency.
It is also clear that as more big data becomes translation-worthy, the need for technologically informed linguistic steering will become more imperative and valuable. SDL is uniquely positioned to do this better than almost anybody else I can think of. I look forward to helping make this a reality at SDL in 2020.

The SDL blog also had a strong preference for MT-related themes and if you are curious you can check this out: REVEALED: The Most Popular SDL Blogs of 2019

Wishing you all a Happy, Prosperous, 
and Healthy, New Year and Decade

Friday, December 27, 2019

The Issue of Data Security and Machine Translation

As the world becomes more digital and mission-critical data flows continue to expand, it is increasingly important for global enterprises to adapt to rapid globalization and the digital-first world we live in. As organizations change the way they operate, generate revenue, and create value for their customers, new compliance risks are emerging, presenting a challenge to compliance teams, which must proactively monitor, identify, assess, and mitigate risks like those tied to fundamentally new technologies and processes. Digital transformation is driven and enabled by data, and thus data security and governance also rise in importance and organizational impact. At the WEF forum in Davos, CEOs identified cybersecurity and data privacy as two of the most pressing issues of the day, and even regard breakdowns in these areas as a general threat to enterprise, society, and government.
While C-level executives understand the need for cybersecurity as their organizations undergo digital transformation, they aren't prioritizing it enough, according to a recent Deloitte report based on a survey of 500 executives. The report, “The Future of Cyber Survey 2019,” reveals a disconnect between organizational aspirations for a “digital everywhere” future and actual cyber posture. Those surveyed view digital transformation as one of the most challenging aspects of cyber risk management, yet indicated that less than 10% of cyber budgets are allocated to these digital transformation efforts. The report goes on to say that this larger cyber awareness is at the center of digital transformation; to be successful in this new era, organizations should embrace a “cyber everywhere” reality.

Cybersecurity breakdowns and data breach statistics

Are these growing concerns about cybersecurity justified? It certainly seems so when we consider these facts:
  • A global survey in 2018 by CyberEdge across 17 countries and 20 industries found that 78% of respondents had experienced a network breach.
  • The ISACA survey of cybersecurity professionals points out that it is increasingly difficult to recruit and retain technically adept cybersecurity professionals. It also found that 50% of cyber pros believe most organizations underreport cybercrime even when required to report it, and 60% expect at least one attack within the next year.
  • Radware estimates that an average cyber-attack in 2018 cost an enterprise around $1.67M. The costs can be significantly higher, e.g. a breach at Maersk is estimated to have cost around $250 - $300 million because of brand damage, loss of productivity, loss of profitability, falling stock prices, and other negative business impacts in the wake of the breach.
  • Risk-Based Security reports that there were over 6,500 data breaches and that more than 5 billion records were exposed in 2018. The situation is no better in 2019: over 4 billion records were exposed in the first six months alone.
  • An IBM Security study revealed the financial impact of data breaches on organizations. According to this study, the cost of a data breach has risen 12% over the past five years and now averages $3.92 million. The average cost of a data breach in the U.S. is $8.19 million, more than double the worldwide average.
As would be expected, with hacking as the top breach type, attacks originating outside the organization were also the most common threat source. However, misconfigured services, data handling mistakes, and other inadvertent exposure by authorized persons exposed far more records than malicious actors were able to steal.

Data security and cybersecurity in the legal profession

Third-party professional services firms are often a target for malicious attacks because the possibility of acquiring high-value information is high. Records show that law firms' relationships with third-party vendors are a frequent point of exposure to cyber breaches and accidental leaks. One investigation obtained a list of more than 100 law firms that had reported data breaches, and estimates suggest that even more are falling victim to this problem but simply don't report it, to avoid scaring clients and to minimize potential reputational damage.

Austin Berglas, former head of the FBI’s cyber branch in New York and now global head of professional services at cybersecurity company BlueVoyant, said law firms are a top target among hackers because of the extensive high-value client information they possess. Hackers understand that law firms are a “one-stop-shop” for sensitive and proprietary corporate information, merger & acquisitions related data, and emerging intellectual property information.

As custodians of highly sensitive information, law firms are inviting targets for hackers.

The American Bar Association reported in 2018 that 23% of firms had reported a breach at some point, up from 14% in 2016. Six percent of those breaches resulted in the exposure of sensitive client data. Legal documents have to pass through many hands as a matter of course: reams of sensitive information pass through the hands of lawyers and paralegals, and then through review and signature by clients, clerks, opposing counsel, and judges. When the documents finally reach the location where records are stored, they are often inadvertently exposed to others, even firm outsiders, who shouldn't have access to them at all.

Logicforce's legal industry score for cybersecurity health among law firms has increased from 54% in 2018 to 60% in 2019, but this is still lower than in many other sectors. Increasingly, clients are also asking for audits to ensure that security practices are current and robust. A recent ABA Formal Opinion states: “Indeed, the data security threat is so high that law enforcement officials regularly divide business entities into two categories: those that have been hacked and those that will be.”

Lawyers are failing on cybersecurity, according to the American Bar Association Legal Technology Resource Center’s ABA TechReport 2019. “The lack of effort on security has become a major cause for concern in the profession.”

“A lot of firms have been hacked, and like most entities that are hacked, they don’t know that for some period of time. Sometimes, it may not be discovered for a minute or months and even years,” said Vincent I. Polley, a lawyer and co-author of a recent book on cybersecurity for the ABA.

As the volume of multilingual content explodes, a new risk emerges: public, “free” machine translation provided by large internet services firms who systematically harvest and store the data that passes through these “free” services.  With the significantly higher volumes of cross-border partnerships, globalization in general, and growth in international business, employee use of public MT has become a new source of confidential data leakage.

Public machine translation use and data security

In the modern era, it is estimated that on any given day, several trillion words are run through the many public machine translation options available across the internet today. This huge volume of translation is done largely by the average web consumer, but there is increasing evidence that a growing portion of this usage is emanating from the enterprise when urgent global customer, collaboration, and communication needs are involved. This happens because publicly available tools are essentially frictionless and require little “buy-in” from a user who doesn’t understand the data leakage implications.  The rapid rate of increase in globalization has resulted in a substantial and ever-growing volume of multilingual information that needs to be translated instantly as a matter of ongoing business practice. This is a significant risk for the global enterprise or law firm as this short video points out. Content transmitted for translation by users is clearly subject to terms of use agreements that entitle the MT provider to store, modify, reproduce, distribute, and create derivative works. At the very least this content is fodder for machine learning algorithms that could also potentially be hacked or expose data inadvertently.

Consider the following:
  • At the SDL Connect 2019 conference recently, a speaker from a major US semiconductor company described the use of public MT at his company. When this activity was carefully monitored by IT management, they found that as much as 3 to 5 GB of enterprise content was being cut and pasted into public MT portals for translation on a daily basis. Further analysis of the content revealed that the material submitted for translation included future product plans, customer problem-related communications, sensitive HR issues, and other confidential business process content.
  • In September 2017, the Norwegian news agency NRK reported on data they found that had been freely translated on a site called Translate.com, including “notices of dismissal, plans of workforce reductions and outsourcing, passwords, code information, and contracts”. This was yet another site that offered free translation but reserved the right to examine the data submitted “to improve the service.” Subsequent searches by Slator uncovered other highly sensitive data, both personal and corporate.
  • A recent report from the Australian Strategic Policy Institute (ASPI) makes some claims about how China uses state-owned companies that provide machine translation services to collect data on users outside China. The author, Samantha Hoffman, argues that the most valuable tools in China's data-collection campaign are technologies that users engage with for their own benefit, machine translation services being a prime example. This is done through a company called GTCOM, which Hoffman said describes itself as a “cross-language big data” business and offers hardware and software translation tools that collect data, lots of it. She estimated that GTCOM, which works with both corporate and government clients, handles the equivalent of up to five trillion words of plain text per day, across 65 languages and in over 200 countries. GTCOM is a subsidiary of a Chinese state-owned enterprise that the Central Propaganda Department directly supervises, and thus data collection is presumed to be an active and ongoing process.
After taking a close look at the enterprise market needs and the current realities of machine translation use we can summarize the situation as follows:
  • There is a growing need for always-available, and secure enterprise MT solutions to support the digitally-driven globalization that we see happening in so many industries today. In the absence of having such a secure solution available, we can expect that there will be substantial amounts of “rogue use” of public MT portals with resultant confidential data leakage risks.
  • The risks of using public MT portals are now beginning to be understood. The risk is not just inadvertent data leakage; it is also closely tied to the data security and privacy risks of submitting confidential content to the data-grabbing machine learning infrastructure that underlies these “free” MT portals. There is a growing list of US companies already subjected to GDPR-related EU regulatory actions, including Amazon, Apple, Facebook, Google, Netflix, Spotify, and Twitter. Experts have stated that Chinese companies are likely to be the next wave of regulatory enforcement, and the violators' list is expected to grow.
  • The executive focus on digital transformation is likely to drive more attention to the concurrent cybersecurity implications of hyper-digitalization. Information Governance is likely to become much more of a mission-critical function as the digital footprint of the modern enterprise grows and becomes much more strategic.

The legal market requirement: an end-to-end solution

Thus, language-translation-at-scale capabilities have become imperative for the modern global enterprise. Translation needs can range from the rapid translation of millions of documents in an eDiscovery or compliance scenario, to the very careful and specialized translation of critical contracts and court-ready documentation, to an associate collaborating with colleagues in a foreign office. Daily communications in global matters are increasingly multilingual. Given the volume, variety, and velocity of the information that needs translation, legal professionals must consider translation solutions that involve both technology and human services. The requirements vary greatly and can require different combinations of man-machine collaboration, including some or all of these translation production models:
  • MT-Only for very high volumes like in eDiscovery, and daily communications
  • MT + Human Terminology Optimization
  • MT + Post-Editing
  • Specialized Expert Human Translation

Machine Translation: designed for the Enterprise

MT for the enterprise will need all of the following (solutions are available from several MT vendors in the market, and the author provides consulting services to select and develop optimal solutions):
  • Guaranteed data security & privacy
  • Flexible deployment options that include on-premise, cloud or a combination of both as dictated by usage needs
  • Broad range of adaptation and customization capabilities so that MT systems can be optimized for each individual client
  • Integration with primary enterprise IT infrastructure and software e.g. Office, Translation Management Systems, Relativity, and other eDiscovery platforms
  • A REST API that allows connectivity to any proprietary systems you may employ
  • Broad range of expert consulting services both on the MT technology aspects and the linguistic issues
  • Tight integration with professional human translation services to handle end-to-end translation requirements
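As an illustration of the REST connectivity point, the sketch below assembles a translation request for a hypothetical enterprise MT endpoint. The URL, field names (`sourceLanguage`, `texts`, etc.), and bearer-token scheme are all assumptions for illustration; a real vendor's API reference will define its own schema:

```python
import json
import urllib.request

# Hypothetical endpoint -- consult your MT vendor's actual API documentation;
# every enterprise MT product defines its own URL scheme and payload format.
MT_ENDPOINT = "https://mt.example.com/api/v1/translate"

def build_translation_request(texts, source_lang, target_lang, api_key):
    """Assemble an HTTP request for a (hypothetical) enterprise MT REST API."""
    payload = {
        "sourceLanguage": source_lang,
        "targetLanguage": target_lang,
        "texts": texts,
    }
    return urllib.request.Request(
        MT_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # API keys stay inside the enterprise perimeter, unlike public MT portals
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

req = build_translation_request(["Notice of dismissal"], "en", "de", "SECRET-KEY")
# urllib.request.urlopen(req) would send it; omitted here since the endpoint is fictional
print(req.get_method(), req.full_url)  # → POST https://mt.example.com/api/v1/translate
```

The design point is that a proprietary system talks to an MT service the enterprise controls, rather than pasting confidential text into a public web portal.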

This is a post that was originally published on SDL.COM in a modified form with more detail on SDL MT technology. 

Saturday, December 21, 2019

Efficient and Effective Multilingual eDiscovery Practices Using MT

As outlined in a previous post, the global data explosion is creating new challenges for the legal industry that require balancing the use of emerging technologies and human resources in optimal ways to handle the data deluge effectively.

The continuing momentum of digital communication and the much more rapid pace of globalization today often create specialized legal challenges. The rapid increase in global business interactions, and the varying regulatory laws, business practices, and cultural customs of international partners and competitors, are confounding and often frustrating to participants. The combined impact of these trends is driving the volume of cross-border litigation up, and necessitates that corporate general counsel in global enterprises, and large law firms, find the means to manage the unique requirements of legal eDiscovery in these scenarios.

A recent Norton Rose Fulbright survey of litigation trends highlights the need for technology to enhance efficiency in legal departments and also points out the growth of cybersecurity and data protection disputes across all industries. Additionally, the survey states that international business operations increasingly lead to cross-border discovery and related data protection issues. Within the life sciences and healthcare, and technology and innovation sectors, the most concerning area is IP/patent disputes, which are relatively costly in comparison to other legal matters; technology and life sciences companies in particular face large exposure in this area.

By understanding the unique discovery requirements of different regions, instilling transparency and consistency throughout the discovery team and process, and taking advantage of powerful technology and workflow tools, companies can be better equipped to meet the discovery demands of litigation and regulatory investigations. The multilingual impact of this data deluge is just now being understood. As we move to a global reality where the largest companies and markets are increasingly not in English-speaking regions, the ability to handle huge volumes of flowing multilingual data becomes a way to build competitive advantage and avoid becoming commercially irrelevant.

What is eDiscovery?

Electronic discovery (sometimes known as e-discovery, eDiscovery, or e-Discovery) is the electronic aspect of identifying, collecting and producing electronically stored information (ESI) in response to a request for production in a lawsuit or an internal corporate investigation. ESI includes, but is not limited to, emails, documents, presentations, databases, voicemail, audio and video files, social media content, and websites.

The processes and technologies around eDiscovery are often complex because of the sheer volume/variety of electronic data produced and stored. Additionally, unlike hard-copy evidence, electronic documents are more dynamic and often contain metadata such as time-date stamps, author and recipient information, and file properties. Preserving the original content and metadata for electronically stored information is required to eliminate claims of spoliation or tampering with evidence later in a litigation scenario.

eDiscovery is typically a culling process of moving from unstructured to structured data: from raw, unstructured data to matter-specific relevance, and finally to the highest-value and most directly relevant information.

Thus, while there are three primary activities typically in eDiscovery, namely, collection, processing, and review, it is clear to practitioners and analysts that the review-related activity is the bulk of the cost of the overall eDiscovery process.

One analyst estimates that review-related software and services constituted approximately 70% of worldwide eDiscovery software and services spending in 2018. While the percentage spent on review is expected to decrease to around 65% of overall eDiscovery spending through 2023, the dollar spend for eDiscovery review is estimated to grow to $12.15B by 2023.

A respected RAND Institute study is even more explicit about the costs and shows very clearly that managing your data volume is critical to managing your costs. The RAND Institute for Civil Justice estimates that the per-gigabyte costs break down to $125 to $6,700 for collection, $600 to $6,000 for processing, and, in the most expensive stage, $1,800 to $210,000 for review. The costs for multilingual review are very likely even higher, and by some estimates could be as much as 3X higher.

"The RAND Institute for Civil Justice has estimated that each gigabyte of data reviewed costs a company approximately $18,000."

This means that a conscientious, defensible, proactive approach to information governance can lead to tremendous savings. Every gigabyte of outdated unnecessary ESI that you delete in following a uniform data destruction policy saves you, on average, $18,000 per case.
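The cost arithmetic above is simple enough to sketch directly (the per-stage ranges are the RAND figures quoted; the 3X multilingual multiplier is the rough estimate mentioned earlier):

```python
# Per-gigabyte eDiscovery cost ranges (USD), per the RAND estimates above
COSTS = {
    "collection": (125, 6_700),
    "processing": (600, 6_000),
    "review":     (1_800, 210_000),   # review dominates the total
}
AVG_REVIEW_COST_PER_GB = 18_000       # RAND's average per-gigabyte figure
MULTILINGUAL_MULTIPLIER = 3           # rough estimate for multilingual review

low = sum(lo for lo, hi in COSTS.values())
high = sum(hi for lo, hi in COSTS.values())
print(f"Per-GB cost range: ${low:,} - ${high:,}")      # → $2,525 - $222,700

# Savings from culling outdated ESI under a uniform data destruction policy
deleted_gb = 10
print(f"Avg saving for {deleted_gb} GB culled: ${deleted_gb * AVG_REVIEW_COST_PER_GB:,}")

# Multilingual review at the estimated 3X multiplier
print(f"Multilingual review, avg per GB: ${AVG_REVIEW_COST_PER_GB * MULTILINGUAL_MULTIPLIER:,}")
```

Even at the low end of the ranges, every gigabyte culled before review translates into real savings, which is the economic case for information governance made above.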

What is document review?

Also known simply as review, document review is the stage of the EDRM (Electronic Discovery Reference Model) in which organizations examine documents connected to a litigation matter to determine whether they are relevant, responsive, or privileged. Having robust information governance policies in place makes the overall process both more effective and more efficient. Due to outsourcing and the high cost of using lawyers, document review is the most expensive stage of eDiscovery, generally responsible for 70% or more of the total cost.

The cost per hour for document review attorneys makes the review phase one of the most expensive steps in the overall process, something only further exacerbated when the attorneys must be bilingual at a high level of proficiency.

To control those extravagant costs, litigants strive to narrow the field of documents that they must review. The processing stage of eDiscovery is intended in large part to eliminate redundant information and to organize the remaining data for efficient, cost-effective document review. Technology that assists in the culling and close examination process is essential, and we see that eDiscovery platforms that assist professional services, law firms, and information technology organizations to find, store, review and create legal documents are increasingly pervasive.

Document review can be used in more than just legal eDiscovery for litigation. It may also be used in regulatory investigations, internal investigations, and due diligence assessments for mergers and acquisitions and other information governance-related activities. Wherever it is employed, it serves the same purpose of designating information for production and requires a similar approach.

The Multilingual eDiscovery process

It is possible to identify the critical steps involved in a typical multilingual eDiscovery use case, where the key objective is to extract the most relevant information from a large volume of submitted material. The multilingual character of much of the data that needs to be reviewed today adds a significant layer of complexity and additional cost to the process.

The typical process involves the following key steps:
  • Text Extraction: It is often necessary to extract multilingual text from scanned documents to ensure that all relevant documents are identified and sent to review. OCR technology and native file processing enable an enterprise to do this at scale. Sometimes it is also necessary to extract text from audio.
  • Automated Language Identification Processing:  Linguistic AI technology capabilities make automatic detection of languages and data sets within any content an efficient and highly automated process.
  • Multilingual Search Term Optimization: Linguists work together with MT experts to generate critical search terms and terminology to ensure that multilingual data goes through optimal discovery-related processing. This ensures that high-volume automatic translations get critical terminology correct, and also enables the most relevant foreign language data to be discovered and presented for timely review. The multilingual search term consultant’s understanding of linguistic and cultural nuances can mean the difference between capturing critical information and missing it completely. Competent linguists ensure that grammatical, linguistic and cultural issues are taken into consideration during search term list development.
  • Secure, Private, State-of-the-Art Machine Translation: Firms should adopt secure, private, scalable, enterprise-ready MT technology that can be deployed on-premises or in a private cloud. Integration with Relativity (and other eDiscovery platforms) makes it easy for companies to handle anything related to large corporate legal matters, from analyzing and translating millions of documents to preparing critical contracts and court-presentable documents.
  • Specialized Human Translation Services: Many firms provide around-the-clock, around-the-world service using state-of-the-art linguistic AI tools to ensure greater accuracy and security, and reduced costs and turnaround times. Leading providers maintain a pool of certified, specialized translators across multiple jurisdictions and languages worldwide with expertise across a wide range of legal documents; some are already working with 19 of the top 20 law firms in the world. The translation supply chain is often the hidden weak spot in an organization's data compliance. Several firms provide a secure translation supply chain that gives you fully auditable data custody of your translation processes, which can be cascaded down through your outside counsel and consultants to create a replicable process across all of your legal service partners.

This is a post that was originally published on SDL.COM with more detail on SDL products  

Thursday, November 7, 2019

The Global Data Explosion in the Legal Industry

As we consider and look at the various forces impacting the legal industry today, we see several ongoing trends which are increasingly demanding more attention from both inside and outside counsel. These forces are:
  • The Digital Data Momentum
  • Increasing Concern for Data Security
  • The Growing Importance of Information Governance
  • Increasing Globalization 


The Digital Data Momentum

Several studies by IDC, EMC and academics have predicted for years that we are facing an ever-growing data deluge and content explosion. The prediction that the digital universe will be 44 zettabytes by 2020 means little to most of us. But if you state that 500 million tweets, ~300 billion emails, and 65 billion WhatsApp messages are sent, and 3.5 billion Google searches are made, every single day, many more of us would understand the astounding scale of the modern digital world. While only a small fraction of this data will flow into the purview of the legal profession, the impact is significant and most legal teams will admit this increase in content is a major challenge today.

The enterprise is also affected by this content explosion, and a recent eDiscovery Business Confidence survey identified increasing data volumes as THE primary concern for the coming future. In eDiscovery settings, this also means that the information triage process is complicated since we are seeing not only significant increases in volume, but we are also seeing a greater variety of data types. The modern legal purview can include mobile data, voice and image data from various sources in addition to the data flowing in various enterprise IT systems. 

Increasing Concern for Data Security


While data security received relatively little attention in the past, it is increasingly seen as a key concern. At recent Davos conferences, cybersecurity and data privacy breakdowns were identified as among the biggest threats to businesses, economies, and societies around the world. According to the World Economic Forum (WEF), attacks against businesses have almost doubled in five years and the costs are rising too. “The world depends on digital infrastructure and people depend on their digital devices and what we’ve found is that these digital devices are under attack every single day,” said Brad Smith, president and chief legal officer of Microsoft. He added that attacks by organized criminal enterprises are becoming “more prolific and more sophisticated”, often “operating in jurisdictions that are more difficult to reach through the rule of law but use the internet to seek out victims literally everywhere.”

This rise of artificial intelligence and machine learning also means that global enterprises are interested in acquiring and harvesting data, wherever and whenever they can. Businesses are looking to acquire as much information as possible, about customers, interactions, brand opinions, and extracting insights that might give them an edge over the competition. Data-guzzling machine learning processes promise to amplify businesses’ ability to predict, personalize, and produce. However, some of the world’s largest consumer-facing companies have fallen victim to data breaches affecting hundreds of millions of customers. By all measures, the disruptive, data-centric forces of the so-called fourth industrial revolution appear to be outpacing the world’s ability to control them.

Legal professionals will need to play a larger role in managing these new risks, which can be devastating and cost millions in reparations and negative consequences. Increasingly, these threats originate in foreign countries, sometimes even with support from foreign governments.

Internal Investigations


The Growing Importance of Information Governance


The modern global enterprise has a very different risk tolerance profile from that of similar companies even 10 years ago. The “datafication” of the modern enterprise creates special challenges for both inside and outside counsel. Recent surveys by Gartner suggest that legal leaders have to start investing in digital skills and capabilities, reflecting the evolving role of the legal department as a strategic business partner.

“How legal departments build capabilities to govern risk within digital initiatives matters more than the legal advice they provide,” says Christina Hertzler, Practice Vice President at Gartner.

To be digitally ready, legal departments must shift their approach to manage specific changes created by digitalization — more stakeholders, more speed and iteration, and the increased technical and collaborative nature of digital work, as well as handling new information-related risks.

As organizations change the way they operate, generate revenue and create value for their customers, new compliance risks are emerging — presenting a challenge to compliance, which must identify, assess and mitigate risks like those tied to fundamentally new technologies (e.g., artificial intelligence) and processes.

Information Governance

There is a growing list of US companies already subjected to GDPR-related EU regulatory actions, including Amazon, Apple, Facebook, Google, Netflix, Spotify, and Twitter. Indeed, the French Data Protection Authority, CNIL, recently levied upon Google a record fine of approximately $57 million for “lack of transparency, inadequate information and lack of valid consent regarding ads personalization.” The risks to US companies include providing proof of measures taken to protect, process, and transfer personal data from the EU to the US in connection with regulatory investigations or litigation. A report published in late February by DLA Piper cited data from the first eight months of GDPR enforcement, during which 91 fines were imposed. "We expect that 2019 will see more fines for tens and potentially even hundreds of millions of euros, as regulators deal with the backlog of GDPR data breach notifications," the report said. Taking meaningful steps now toward GDPR compliance is the best way for US companies doing business of any kind involving EU personal data—including those with no physical presence in the EU—to prepare for and mitigate their risk.

The penalties of non-compliance with regulatory policies continue to mount.  Google was fined $170 million and asked to make changes to protect children’s privacy on YouTube, as regulators said the video site had knowingly and illegally harvested personal information from children and used it to profit by targeting them with ads. We can only expect that data privacy and compliance regulations will be taken more seriously in the future and that legal teams will play an expanding role in ensuring this.

Facebook agreed to pay a record-breaking $5 billion fine as part of a settlement with the Federal Trade Commission, by far the largest penalty ever imposed on a company for violating consumers' privacy rights. Facebook also agreed to adopt new protections for the data users share on the social network and to measures that limit the power of CEO Mark Zuckerberg. Under the settlement, which concludes a year-long investigation prompted by the 2018 Cambridge Analytica scandal, the social networking giant must expand its privacy protections across Facebook itself, as well as on Instagram and WhatsApp. It must also adopt a corporate system of checks and balances to remain compliant, according to the FTC order. Facebook must also maintain a data security program, which includes protections of information such as users' phone numbers. The issue of data privacy and compliance will continue to build momentum as more people understand the extent of the data harvesting that is going on.

Taking meaningful steps now toward robust information governance and compliance for all kinds of privileged and confidential data will be necessary for the modern digital-centric enterprise, and the modern legal department will need to be able to be an active partner and help the enterprise prepare for and mitigate their risk.

Compliance and Regulation Processes


Increasing Globalization = More Multilingual Data


While these forces we have just described continue to build momentum, driven by increasing digitalization and the resultant ever-expanding content flows, we also have an additional layer of complexity: language. The modern enterprise is now much more rapidly and naturally global, and thus the modern legal department and outside counsel need to be able to process content and information flows in multiple languages on a regular basis. The variety and volumes of multilingual content that legal professionals need to process and monitor can include any and all of the following:
  • International contract negotiations and disputes
  • Patent-infringement litigation
  • Human Resource communications in global enterprises
  • Customer communications
  • GDPR Compliance related monitoring and analysis 
  • Cross-border regulatory compliance monitoring
  • FCPA compliance monitoring 
  • Anti-trust related matters
The volumes of multilingual content can vary greatly, from very large volumes that might involve tens of thousands of documents in litigation related eDiscovery, to specialized monitoring of customer communications to ensure regulatory compliance, to smaller volumes of sensitive communications with global employees.

Multilingual issues are especially present in cross-border partnerships and business dealings which are now increasingly common across many industries.
The AlixPartners Global Anticorruption Survey polled corporate counsel, legal, and compliance officers at companies based in the US, Europe, and Asia in more than 20 major industries. The perceived corruption risks are elevated in Latin America and China, and Russia, Africa, and the Middle East have emerged as regions of increasing concern. The survey found that 90% and 94% of companies with operations in Latin America and China, respectively, reported their industries are exposed to corruption risk. Of the 66% of respondents who said there are regions where it is impossible to avoid corrupt business practices, 31% said Russia is one such place and 27% cited Africa.

The sheer volume of information companies must collect, translate, and analyze is the biggest obstacle to tackling corruption, according to 75% of survey respondents. 

These concerns surrounding the management of data are expected to increase with increasing data privacy regulation such as the EU’s General Data Protection Regulation.

Data Growth


End-to-end translation solutions for the legal industry 

Thus, we see today that language translation production capabilities have become imperative for the modern global enterprise and that the need for translation can range from rapid translation of millions of documents in an eDiscovery scenario to very careful and specialized translation of critical contract and court-ready documentation. Given the volume, variety, and velocity of the information that needs translation, legal professionals must consider a combination of technology and human services. Ideally, these varying translation challenges would be solved by technologically informed professionals who can adapt language technology and human expertise to the challenge at hand.

Language Translation

Several MT and language service vendors provide an enterprise-class, vendor-agnostic, secure translation platform that allows you to combine regulatory compliance with translation best practice. Securing the translation supply chain needn’t come at the cost of trusted suppliers or existing relationships, or impact time to market.

Multilingual Data Triage

This blog was originally published on SDL.COM with more SDL product information.

Tuesday, October 8, 2019

Post-editese is real

Ever since machine translation was introduced into the professional translation industry, there have been questions about its impact on the final delivered translation product. For much of the history of MT, many translators claimed that while translation production work using a post-edited MT (PEMT) process was faster, the final product was not as good. The research suggests that this has been true from a strictly linguistic perspective, but many of us also know that PEMT worked quite successfully with technical content, especially with terminology and consistency, even in the days of SMT and RBMT.

As NMT systems proliferate, we are at a turning point, and I suspect that we will see many more NMT systems that provide output that clearly enhances translator productivity, especially output from systems built by experts. NMT quality is also likely to improve to the point that the difference between raw MT and human output becomes less prominent. This is what is meant by developers who claim to have achieved human parity: if competent human translators cannot tell whether the segments they review came from MT or not, we can make a limited claim of having achieved human parity. This does not mean that this will be true for every new sentence submitted to the system.

We should also understand that MT provides the greatest value in use scenarios where you have large volumes of content (millions rather than thousands of words), short turnaround times, and limited budgets. Increasingly, MT is used in scenarios where little or no post-editing is done, and by many informed estimates, we are already at a run rate of a trillion words a day going through MT engines. While post-editese may be an important consideration in localization use scenarios, this is likely no more than 2% of all MT usage.

Enterprise MT use is rapidly moving into a phase where it is an enterprise-level IT resource. The modern global enterprise needs to enable and allow millions of words to be translated on demand in a secure and private way and needs to be integrated deeply into critical communication, collaboration, and content creation and management software.

The research presented by Antonio Toral below documents the impact of post-editing on the final output across multiple different language combinations and MT systems. 


This is a summary of the paper “Post-editese: an Exacerbated Translationese” by Antonio Toral, which was presented at MT Summit 2019, where it won the best paper award.


Post-editing (PE) is widely used in the translation industry, mainly because it leads to higher productivity than unaided human translation (HT). But what about the resulting translation? Are PE translations as good as HT? Several research studies have looked at this in the past decade and there seems to be a consensus: PE is as good as HT or even better (Koponen, 2016).

Most of these studies measure the quality of translations by counting the number of errors therein. Taking into account that there is more to quality than just the number of mistakes, we ask ourselves the following question instead: are there differences between translations produced with PE vs HT? In other words, do the final outputs created via PE and HT have different traits?

Previous studies have unveiled the existence of translationese, i.e. the fact that HTs and original texts exhibit different characteristics. These characteristics can be grouped under the so-called translation universals (Baker, 1993) and fundamental laws of translation (Toury, 2012), namely simplification, normalization, explicitation and interference. Along this line of thinking, we aim to unveil the existence of post-editese (i.e. the fact that PEs and HTs exhibit different characteristics) by comparing PEs and HTs using a set of computational analyses that align with the aforementioned translation universals and laws of translation.


We use three datasets in our experiments: Taraxü (Avramidis et al., 2014), IWSLT (Cettolo et al., 2015; Mauro et al., 2016) and Microsoft “Human Parity” (Hassan et al., 2018). These datasets cover five different translation directions and allow us to assess the effect of machine translation (MT) systems from 2011, 2015-16 and 2018 on the resulting PEs.


Lexical Variety

We assess the lexical variety of a translation (HT, PE or MT) by calculating its type-token ratio (TTR):

TTR = number of types (unique words) / number of tokens (total words)

In other words, given two translations of equal length (number of words), the one with the larger vocabulary (higher number of unique words) will have a higher TTR, and is therefore considered lexically richer, i.e. higher in lexical variety.
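As a rough illustration, the type-token ratio can be computed in a few lines of Python. The tokenization here is naive lowercased whitespace splitting, an assumption for simplicity; the paper's actual preprocessing may differ.

```python
def type_token_ratio(text: str) -> float:
    """Type-token ratio: unique words (types) divided by total words (tokens)."""
    tokens = text.lower().split()  # naive whitespace tokenization
    return len(set(tokens)) / len(tokens)

# "the" repeats, so there are 5 types among 6 tokens.
print(type_token_ratio("the cat sat on the mat"))  # 0.8333...
```

Note that TTR is sensitive to text length, which is why comparisons are only meaningful between translations of similar size.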

The following figure shows the results for the Microsoft dataset for the direction Chinese-to-English (zh–en, the results for the other datasets follow similar trends and can be found in the paper). HT has the highest lexical variety, followed by PE, while the lowest value is obtained by the MT systems. A possible interpretation is as follows: (i) lexical variety is low in MT because these systems prefer the translation solutions that are frequent in the training data used to train such systems and (ii) a post-editor will add lexical variety to some degree (difference in the figure between MT and PE), but because MT primes him/her (Green et al., 2013), the resulting PE translation will not achieve the lexical variety of HT.

Lexical Density

The lexical density of a text indicates its amount of information and is calculated as follows:

lexical density = number of content words / total number of words

where content words correspond to adverbs, adjectives, nouns, and verbs. Hence, given two translations of equal length, the one with the higher number of content words would be considered to have the higher lexical density, in other words, to contain more information.
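A minimal sketch of this computation, assuming the tokens have already been PoS-tagged (the tag names below are illustrative, not necessarily the tag set used in the paper):

```python
# Content words are adverbs, adjectives, nouns, and verbs.
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}

def lexical_density(tagged_tokens):
    """Lexical density: content words divided by total words.

    `tagged_tokens` is a list of (word, pos_tag) pairs.
    """
    content = sum(1 for _, tag in tagged_tokens if tag in CONTENT_TAGS)
    return content / len(tagged_tokens)

sentence = [("the", "DET"), ("quick", "ADJ"), ("fox", "NOUN"),
            ("jumps", "VERB"), ("over", "ADP"), ("the", "DET"),
            ("lazy", "ADJ"), ("dog", "NOUN")]
print(lexical_density(sentence))  # 5 content words / 8 tokens = 0.625
```

In practice the tagging step would be done with a real PoS tagger rather than a hand-built list.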

The following figure shows the results for the three translation directions in the Taraxü dataset: English-to-German, German-to-English and Spanish-to-German. The lexical density in HT is higher than in both PE and MT and there is no systematic difference between the latter two.

Length Ratio

Given a source text (ST) and a target text (TT), where TT is a translation of ST (HT, PE or MT), we compute a measure of how different in length the TT is with respect to the ST, in essence the normalized absolute difference in length:

length ratio = |length(ST) − length(TT)| / length(ST)

This means that the bigger the difference in length between the ST and the TT (be it because TT is shorter or longer than the ST), the higher the length ratio.
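A simple sketch of this idea, taking the length ratio as the normalized absolute difference in word counts; the paper's exact formulation may differ, so treat this as illustrative:

```python
def length_ratio(source: str, target: str) -> float:
    """Normalized absolute difference in length (word count) between ST and TT."""
    n_src = len(source.split())
    n_tgt = len(target.split())
    return abs(n_src - n_tgt) / n_src

# The target drops one of four source words: |4 - 3| / 4 = 0.25
print(length_ratio("one two three four", "uno due tre"))  # 0.25
```

A translation that matches the source length word-for-word scores 0, and the score grows whether the translation is shorter or longer than the source.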

The following figure shows the results for the Taraxü dataset. The trend is similar to the one in lexical variety; this is, HT obtains the highest result, MT the lowest and PE lies somewhere in between. We interpret this as follows: (i) MT results in a translation of similar length to that of the ST due to how the underlying MT technology works and PE is primed by the MT output while (ii) a translator working from scratch may translate more freely in terms of length.

Part-of-speech Sequences

Finally, we assess the interference of the source language on a translation (HT, PE and MT) by measuring how close the sequence of part-of-speech tags in the translation is to the typical part-of-speech sequences of the source language and to the typical part-of-speech sequences of the target language. If the sequences of a translation are similar to the typical sequences of the source language, that indicates interference from the source language in the translation.
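The paper's metric is more elaborate, but the core idea of scoring a translation's PoS sequence against language models of typical source- and target-language tag sequences can be sketched with a toy add-one-smoothed bigram model. The tag names, the smoothing, and the simple perplexity-difference score below are illustrative assumptions, not the paper's exact setup.

```python
import math
from collections import Counter

def train_bigram_lm(tag_sequences):
    """Add-one-smoothed bigram model over PoS tags; returns P(tag | prev)."""
    bigrams, unigrams, vocab = Counter(), Counter(), set()
    for seq in tag_sequences:
        padded = ["<s>"] + seq
        vocab.update(padded)
        unigrams.update(padded[:-1])          # contexts
        bigrams.update(zip(padded[:-1], padded[1:]))
    v = len(vocab)
    return lambda prev, tag: (bigrams[(prev, tag)] + 1) / (unigrams[prev] + v)

def perplexity(prob, seq):
    """Perplexity of a PoS tag sequence under a bigram model."""
    padded = ["<s>"] + seq
    log_p = sum(math.log(prob(p, t)) for p, t in zip(padded[:-1], padded[1:]))
    return math.exp(-log_p / len(seq))

# Train one model on typical source-language tag sequences and one on
# target-language sequences; a translation whose PoS order follows the
# source language is more perplexing under the target model.
src_lm = train_bigram_lm([["NOUN", "VERB", "DET", "NOUN"]] * 3)
tgt_lm = train_bigram_lm([["DET", "NOUN", "VERB", "NOUN"]] * 3)
translation_tags = ["NOUN", "VERB", "DET", "NOUN"]  # source-like word order
interference = perplexity(tgt_lm, translation_tags) - perplexity(src_lm, translation_tags)
print(interference > 0)  # True: the sequence looks source-like
```

In this toy setup a larger gap between the two perplexities signals stronger interference; the paper reports a perplexity difference with full details of the actual computation.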

The following figure shows the results for the IWSLT dataset. The metric used is perplexity difference; the higher it is the lower the interference (full details on the metric can be found in the paper). Again, we find a similar trend as in some of the previous analyses: HT gets the highest results, MT the lowest and PE somewhere in between. The interpretation is again similar: MT outputs exhibit a large amount of interference from the source language, a post-editor gets rid of some of that interference but the resulting translation still has more interference than an unaided translation.


The findings from our analyses can be summarised as follows in terms of HT vs PE:
  • PEs have lower lexical variety and lower lexical density than HTs. We link these to the simplification principle of translationese. Thus, these results indicate that post-editese is lexically simpler than translationese.
  • Sentence length in PEs is more similar to the sentence length of the source texts than sentence length in HTs. We link this finding to interference and normalization: (i) PEs have interference from the source text in terms of length, which leads to translations that follow the typical sentence length of the source language; (ii) this results in a target text whose length tends to become normalized.
  • Part-of-speech (PoS) sequences in PEs are more similar to the typical PoS sequences of the source language than PoS sequences in HTs. We link this to the interference principle: the sequences of grammatical units in PEs preserve to some extent the sequences that are typical of the source language.

In terms of the role of MT: we have not considered only HTs and PEs but also MT outputs, from the MT systems that were the starting point to produce the PEs. This was done to corroborate a claim in the literature (Green et al., 2013), namely that in PE the translator is primed by the MT output. We expected to find trends similar to those found in PEs also in MT outputs, and this was indeed the case in all four analyses. In some experiments, the results of PE were somewhere in between those of HT and MT. Our interpretation is that a post-editor improves the initial MT output, but due to being primed by the MT output, the result cannot attain the level of HT, and the footprint of the MT system remains in the resulting PE.


As said in the introduction, we know that PE is faster than HT. The question I wanted to address was then: can PE not only be faster but also be at the level of HT quality-wise? In this study, this is looked at from the point of view of translation universals and the answer is clear: no. However, I'd like to point out three additional elements:
  1. The text types in the 3 datasets that I have used are news and subtitles, both are open-domain and could be considered to a certain extent "creative". I wonder what happens with technical texts, given their relevance for industry, and I plan to look at that in the future.
  2. As mentioned in the introduction, previous studies have compared HT vs PE in terms of the number of errors in the resulting translation. In all the studies I've encountered PE is at the level of HT or even better. Thus, for technical texts where terminology and consistency are important, PE is probably better than HT. I find thus the choice between PE and HT to be a trade-off between consistency on one hand and translation universals (simplification, normalization and interference) on the other.
  3. PE falls behind HT in terms of translation universals because MT falls behind HT in those terms. However, this may not be the case anymore in the future. For example, the paper shows that PE-NMT has less interference than PE-SMT, thanks to the better reordering in the former.

Antonio Toral is an Assistant Professor at the Computational Linguistics group, Center for Language and Cognition, Faculty of Arts, University of Groningen (The Netherlands). His research is in the area of Machine Translation. His main topics include resource acquisition, domain adaptation, diagnostic evaluation and hybrid approaches.

Related Work

Other work has previously looked at HT vs PE beyond the number of errors. The most related papers to this paper are Bangalore et al. (2015), Carl and Schaeffer (2017), Czulo and Nitzke (2016), Daems et al. (2017) and Farrell (2018).


Avramidis, Eleftherios, Aljoscha Burchardt, Sabine Hunsicker, Maja Popovic, Cindy Tscherwinka, David Vilar, and Hans Uszkoreit. 2014. The taraxü corpus of human-annotated machine translations. In LREC, pages 2679–2682.

Baker, Mona. 1993. Corpus linguistics and translation studies: Implications and applications. Text and technology: In honor of John Sinclair, 233:250.

Bangalore, Srinivas, Bergljot Behrens, Michael Carl, Maheshwar Gankhot, Arndt Heilmann, Jean Nitzke, Moritz Schaeffer, and Annegret Sturm. 2015. The role of syntactic variation in translation and post-editing. Translation Spaces, 4(1):119–144.

Carl, Michael and Moritz Jonas Schaeffer. 2017. Why translation is difficult: A corpus-based study of non-literality in post-editing and from-scratch translation. Hermes, 56:43–57.

Cettolo, Mauro, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, Roldano Cattoni, and Marcello Federico. 2015. The iwslt 2015 evaluation campaign. In IWSLT 2015, International Workshop on Spoken Language Translation.

Green, Spence, Jeffrey Heer, and Christopher D Manning. 2013. The efficacy of human post-editing for language translation. Chi 2013, pages 439–448.

Hassan, Hany, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, Will Lewis, Mu Li, Shujie Liu, Tie-Yan Liu, Renqian Luo, Arul Menezes, Tao Qin, Frank Seide, Xu Tan, Fei Tian, Lijun Wu, Zhuangzi Wu, Yingce Xia, Dongdong Zhang, Zhirui Zhang, and Ming Zhou. 2018. Achieving Human Parity on Automatic Chinese to English News Translation.

Koponen, Maarit. 2016. Is machine translation post-editing worth the effort? A survey of research into post-editing and effort. Journal of Specialised Translation, 25(25):131–148.

Mauro, Cettolo, Niehues Jan, Stüker Sebastian, Bentivogli Luisa, Cattoni Roldano, and Federico Marcello. 2016. The iwslt 2016 evaluation campaign. In International Workshop on Spoken Language Translation.

Toury, Gideon. 2012. Descriptive translation studies and beyond: Revised edition, volume 100. John Benjamins Publishing.