Monday, October 16, 2017

The Use of Machine Translation in eDiscovery

There are some kinds of translation applications where MT just makes sense, and it would be foolish to even attempt these kinds of projects without decent MT technology as a foundation. Usually, this is because these applications have some combination of the following factors:
  • Very large volume of source content that simply could NOT be translated without MT in any useful time frame
  • Rapid turnaround requirement (days, hours or minutes) for the content to have any value to the translation consumers
  • A user tolerance for lower quality translations at least in early stages of information review
  • A need for information and document triage when dealing with large document collections, to help identify the highest-priority content in a large mass of undifferentiated content. This process also helps to identify the most important and relevant documents to send to higher-quality human translation.
  • Prohibitive translation costs (usually related to volume)
One can find this combination of requirements in several customer-communication-oriented applications, such as technical support knowledge bases, eCommerce product listings, customer service, and CX reviews for all kinds of products and service experiences. However, in an increasingly digital world, we also see the need to process large volumes of business content to identify what is most relevant and valuable for ongoing business mission needs. One such business information triage application is eDiscovery. In my time working with MT, I have seen that this is an ongoing need that will continue to build momentum as we become digitally focused workers.

SYSTRAN has been a leader amongst MT solution providers in the eDiscovery segment, with a long track record of success and, from my vantage point, a greater sensitivity to the customer needs of this segment than most others. Recently, they gave me unhindered access to a few of their eDiscovery customers, who provided insight into what really matters in terms of MT from the user perspective. This post describes some key requirements from an active user’s perspective, especially that of Alvarez & Marsal in London. In particular, their willingness to share their insights enabled me to validate the observations I make in this post. I have also previously published a guest post from iQwest that described the use of MT in eDiscovery applications from a service provider perspective.



What is eDiscovery?


Electronic discovery (sometimes known as e-discovery, eDiscovery, or e-Discovery) is the electronic aspect of identifying, collecting and producing electronically stored information (ESI) in response to a request for production in a lawsuit or internal corporate investigation. ESI includes, but is not limited to, emails, documents, presentations, databases, voicemail, audio and video files, social media content, and websites.

The processes and technologies around eDiscovery are often complex because of the sheer volume/variety of electronic data produced and stored. Additionally, unlike hard-copy evidence, electronic documents are more dynamic and often contain metadata such as time-date stamps, author and recipient information, and file properties. Preserving the original content and metadata for electronically stored information is required in order to eliminate claims of spoliation or tampering with evidence later in a litigation scenario.

What typically happens with an initially large mass of documents in an eDiscovery scenario is that some combination of the following activities is run to help organize and identify the most important material (I am not sure the mass is quite a corpus, since it is usually much too unstructured to call it that). Practitioners use phrases like “analytics phase”, “predictive analytics”, “predictive coding”, or “analysis phase” to describe the process they apply to winnow the document mass into a relevant set of high-value documents. It usually includes:

Classification: Users gather a representative set of documents from the existing document mass that reflects the key interests and subject matter to be analyzed.
Clustering: Starting from the documents selected in the classification stage, they find similar documents that match the cluster definitions derived from those representative documents.
Summarization: This organization assists the user in selecting key sections of these documents as keywords, phrases, and summaries for use in litigation or corporate governance applications.
N-Grams: N-grams are, at their most basic, sequences of multiple words that occur together in any context. They can help identify a set of documents that have higher relevance and value in specific investigations and reviews, and can be useful in the winnowing process or in understanding the linguistic profile of the mass of documents.
The EDRM model overviews the typical process journey to increased relevance
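To make the n-gram idea above concrete, here is a minimal sketch of counting word n-grams across a small document set. The function names (`ngrams`, `top_ngrams`), the simple tokenizer, and the sample documents are all assumptions made for illustration; real eDiscovery platforms use far more sophisticated analytics.

```python
from collections import Counter
import re


def ngrams(text, n):
    """Return the list of word n-grams (as tuples) in a piece of text."""
    # Lowercase, then treat runs of letters, digits, and apostrophes as tokens.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(documents, n=2, k=3):
    """Count n-grams across all documents and return the k most frequent."""
    counts = Counter()
    for doc in documents:
        counts.update(ngrams(doc, n))
    return counts.most_common(k)


# Hypothetical sample documents, invented for this example.
docs = [
    "The brake assembly failed during the safety test.",
    "Please review the brake assembly report before Friday.",
    "Lunch on Friday?",
]

for gram, count in top_ngrams(docs, n=2, k=2):
    print(" ".join(gram), count)
```

Word pairs that recur across documents (here, the ones around "brake assembly") hint at which documents share subject matter, which is the kind of signal that can feed the winnowing process.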

Thus, after organization, collation, and identification, documents are sent to a translation process, which will often require MT because of the sheer volume. MT allows the right documents to be identified for further refinement (with human translation) or analysis and review. This identification of a smaller set of more important documents from a large set is the essence of the triage process.
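The triage step can be pictured with a small sketch: score each machine-translated document against a list of case keywords, then pass only the highest-scoring ones on for human translation. Plain keyword counting stands in here for the predictive analytics a real review platform would apply, and the names `relevance_score` and `triage`, the document ids, and the sample text are hypothetical.

```python
import re


def relevance_score(text, keywords):
    """Count how many tokens in the text are case keywords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return sum(1 for token in tokens if token in keywords)


def triage(translated_docs, keywords, top_k):
    """Return the ids of the top_k documents ranked by keyword hits.

    translated_docs maps a document id to its machine-translated text.
    Documents with no hits are dropped, even if fewer than top_k remain.
    """
    keywords = {k.lower() for k in keywords}
    scored = [(relevance_score(text, keywords), doc_id)
              for doc_id, text in translated_docs.items()]
    # Highest score first; ties are broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored[:top_k] if score > 0]


# Hypothetical MT output for three documents.
mt_output = {
    "doc-001": "The patent covers the valve design and the valve housing.",
    "doc-002": "See you at the football match on Saturday.",
    "doc-003": "Our lawyer says the patent claim is weak.",
}

print(triage(mt_output, {"patent", "valve", "claim"}, top_k=2))
```

Even rough MT output is usually enough for this kind of ranking, which is why lower translation quality is tolerable early in the review.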

“Our projects are varied and are not all focused around litigation. For example we often perform regulatory exercises and investigations. In these situations, it is often not known at the onset what is required; therefore, the culling of data is based more upon an investigative nous [investigative mindset] and the utilization of analytics features such as document categorization or clustering. In this instance, samples of various documents, related to different investigatory routes, are sent for translation to [MT to] help our teams develop an understanding of the data. The ability to provide our investigators with the option to translate documents on the fly is also a massive benefit in these types of matters.” Alvarez & Marsal, UK

In terms of languages that matter in eDiscovery, the sense I get from my investigation is that the mix is quite diverse, but a lot of the work involves going from a variety of source languages into English (or German). Some say that CJK and FIGS matter most in an increasingly global world, but the needs are always case-specific, so they can be as far-ranging as Greek, Norwegian, and Swedish. In terms of subject domains, we see that in litigation scenarios product liability and patent infringement tend to dominate, but these categories can cover a wide range of domains, from consumer electronics, IT, automotive, and pharmaceuticals/medical equipment to financial and extractive industries.

While many equate eDiscovery projects only with litigation-related content, the market beyond litigation seems to be growing just as rapidly. In an increasingly digital world, understanding electronic data flows within a global enterprise for information governance can be useful for many different reasons, as A & M again point out:
“Alvarez & Marsal get instructed on a very wide range of matters, including contentious projects around internal investigations, dispute resolution, insolvency, and compliance programs. However, not all of them are contentious in nature – for example, performance improvement and valuations. A common thread is that they are document ‘heavy’ and therefore require our skill sets to effectively conduct them. The use of the technology differs in each scenario. As a result, understanding the client requirements and the capabilities of the technology allows us to devise suitable workflows for handling the documents. However, where foreign languages are involved we use Systran translation technologies to the same effect.”
eDiscovery is basically a data culling and relevance ranking process

What Matters in an MT Solution for eDiscovery?

  • Rapid and Straightforward Accessibility: Attorneys and corporate governance and compliance professionals who work within an eDiscovery platform environment need to be able to operate MT with ease. Most typically, this will be from directly within the document analysis and organization platform that is the key application for many of these professionals. In very large cases, documents may be sent in bulk to MT, but even then the ability to manage and review relevant documents from within the review platform is a key requirement.
  • Language Identification: One of the first steps in the classification and organization of documents is to group them by source language, which makes this a critical step in the process. The ease and efficiency of language identification is very important for many users, as it is the first level of triage. Also, some languages may need different processing flows if MT is not available and non-automated procedures need to be incorporated. The ability to automatically identify the source language on the fly for a large variety of languages is also a key requirement, as reviewers follow relevance threads and need ad-hoc translations of documents related to the subject matter of the investigation. Reviewers will often submit a batch of documents that may be in different languages, so an MT solution that can automatically identify and translate is an advantage: it allows batches of files to be uploaded without concern about what language they are in.
  • Integration with the eDiscovery Platform: This needs to be much deeper than being able to pass source and target text files back and forth. Relativity is a particularly important document review platform in eDiscovery, especially in litigation scenarios. It has also been used extensively as the review platform of choice by many who care about processing multilingual content. One reason that SYSTRAN dominates in the eDiscovery segment is that it has a native Relativity connector. This is a “deep integration” that is built to fit seamlessly into the software interface already familiar to Relativity users, is built with Relativity best practices in mind, and has been validated by Relativity and their existing customers to provide value in real-world multilingual discovery cases. The deep integration with this platform not only allows single language identification and translation but also allows for multiple language identifications and translation within a single document, which is especially important for email threads. I have noticed over many years in the MT business that integration with a document review platform is a particularly important requirement, and while Relativity is not the only eDiscovery platform available, it is probably the most important one. Here is a Gartner Magic Quadrant for eDiscovery software where you can see that kCura (Relativity) is a leader.
  • Ability to Process Primary Document Formats: This would at a minimum be emails, Office documents, text files, PDFs, web content, and increasingly social media content from Twitter and Facebook, as well as audio and video content. More and more, we see that emails are the most common document format that is processed in a review platform. Often an email thread could be in two or more languages and thus the market need for MT solutions that can handle multiple languages within the same document has become much more urgent and even a mandatory requirement.
  • Security and Data Privacy: For some matters, users care that systems can be installed on-premise and that no data is transported outside a secure firewall. There are often data custody restrictions linked to projects which also greatly constrain what MT solutions can be used.
  • Scalability - Ability to process Very Large Data Sets in addition to Ad-Hoc needs: Some cases may involve terabytes or even petabytes of data. In such cases, MT efficiency can be a significant factor and drive MT system selection. On these very large, petabyte-sized projects, RBMT solutions have a clear advantage (in terms of performance and raw processing efficiency), and this perhaps also explains why SYSTRAN has been a long-term and dominant player in this market segment. They can provide a range of MT solutions that can meet different user requirements. The degree of automation should be such that 10,000 documents can be submitted with the same ease as 10 documents.
  • Easily Customizable: Customization of MT systems can vary in complexity and time investment requirements. It can be done rapidly with dictionaries and glossaries, and some vendors provide pre-built, domain-focused baseline MT engines, e.g. automotive, financial, chemical, IT, legal. For very long-running and high-value cases or subject matter, the need may arise for translation-memory-based customization, but the most common scenario in eDiscovery seems to be rapid customization. The availability of a range of domain glossaries and domain-focused engines makes higher-quality MT output possible with minimal effort. There seems to be a market need for a simple, web-based point-and-click interface for adding dictionary terms or translation memories (TMs) that can include integrated testing and deployment features, and also for out-of-the-box domain-specific MT for a variety of domains as described above. Also, in a typical flow, limited customization is done at the bulk level, but once a document set is culled, it makes sense to customize the MT system further to improve MT output quality. MT output quality is an important determinant of selection, as we see from the user comment below. An effective customization process also helps to extract the most relevant set of documents for human translation efforts.
  • Special Features: There are several things that MT vendors can do to help users get better output results, and some vendors provide ways to perform rapid customization with glossaries that are driven by n-gram analysis, use monolingual data to improve fluency and quickly incorporate available TM to tune the engine on the subject matter of interest. Other capabilities that also exist in MT solutions include:
    • Some systems allow for anonymization and/or pseudonymization of review data to facilitate cross-border data transfers and reviews. This allows data sharing between work groups, while still complying with international data privacy laws and legal chain of custody requirements.
    • For advanced and more technical users there are also some vendors who provide toolkits to do corpus analysis and modification. This would allow users to add linguistically informed routines to enhance the data above and beyond what the eDiscovery platform can do.
    • Audio & Video. The need to be able to handle digital “documents” now increasingly includes voicemails, conference call recordings and video.
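The language identification requirement in the list above, including email threads that mix languages, can be illustrated with a toy sketch that guesses the language of each message and groups messages into per-language queues for MT. The tiny stopword lists and the names `guess_language` and `route_segments` are assumptions made purely for illustration; a production system would use a trained language identifier, and nothing here reflects how SYSTRAN or Relativity implement it.

```python
import re
from collections import defaultdict

# Tiny stopword lists, assumed here only for illustration.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "we", "please"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "wir", "bitte"},
    "fr": {"le", "la", "les", "et", "est", "nous", "vous", "merci"},
}


def guess_language(text):
    """Return the code of the language whose stopwords appear most often.

    Returns "und" (the ISO 639 code for "undetermined") when no stopword
    from any list is found.
    """
    # Match runs of Unicode letters only (no digits or underscores).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    scores = {lang: sum(token in words for token in tokens)
              for lang, words in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "und"


def route_segments(segments):
    """Group text segments (e.g. messages in an email thread) by language."""
    queues = defaultdict(list)
    for segment in segments:
        queues[guess_language(segment)].append(segment)
    return dict(queues)


# Hypothetical email thread with one message per language.
thread = [
    "Please send the report to the client.",
    "Wir haben das Dokument nicht erhalten.",
    "Merci, nous avons les fichiers.",
    "12345",
]

for lang, messages in route_segments(thread).items():
    print(lang, len(messages))
```

Routing at the message level rather than the file level is what lets a mixed-language thread be translated correctly; segments that cannot be identified land in their own queue for manual handling.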

While I am not suggesting that SYSTRAN is the only MT vendor who could service the MT needs of the eDiscovery market, I am saying that they have solved several very specific problems that really matter to an eDiscovery user, and thus are likely to be a preferred vendor in many cases related to multilingual eDiscovery, in the same way that Relativity is for eDiscovery applications in general. In support of this, Alvarez & Marsal comment:
“A key reason for using SYSTRAN was the depth of integration with Relativity, which means our clients see it as one connected, flexible and effective solution – providing them with reassurance and comfort in only having to use one tool [Relativity]. In addition, the speed and accuracy of the translations were impressive when benchmarked against other providers, as well as the simplicity of accurately translating documents with a few mouse clicks.”
The outlook suggests that eDiscovery will only gain momentum as corporate governance begins to monitor social media, and as email is increasingly understood to be a source of information governance and compliance problems. Emerging regulations, especially in Europe, suggest the need will be even greater in the EU. Several eDiscovery service providers I talk to have suggested that multilingual documents are now increasingly common and that this trend will only gain momentum in future. A closing comment from A & M:
“The need for accurate and efficient translations is definitely growing within the eDiscovery market… We are consulting more and more with clients whose data contains a mix of various languages and we do not see this need slowing down in the near future.”

Thursday, September 28, 2017

Enabling Authenticity in Global Branding

This is a guest post by Aaron Schliem, who writes on fundamental globalization questions. Most of the agencies in the translation industry are involved in brokering human translation services, a business that is increasingly under price pressure because most agencies add very little value to the production process beyond brokered project management, and because MT is getting "good enough" to solve many enterprise needs to communicate multilingually. However, value is added by humans who understand the bigger picture, and tune business content creation processes to improve the overall customer experience regardless of locale and language.

As all major enterprises today become more global, both in their internal workforce composition, and their primary market outlook, new, more culturally informed approaches are needed. The problem is not just at agencies, as Aaron says:
Global business presently operates within an overly simplistic paradigm that assumes translation of product and marketing content is sufficient to be successful in the global economy. We believe that the localization industry has become consumed with what amounts to “computer-enabled translation,” content to simply move content along a conveyor belt to deliver words to market. Both agencies and the corporate buyers of their services typically fail to focus on the human side of globalization.
I believe this kind of broader and more global human focus will be part of the makeup of the best agencies in future, and is already part of the culture and DNA at truly global enterprises.

As an aside, but somewhat related to changing trends, there is a lot more content out there that really matters: content that drives global revenue and is needed to take part in the many different customer-related conversations that are key to international business success. I saw this in a recent interview with the Moravia CEO, Tomas Kratochvil: “We have 80% of content already going through machine translation. The number of words we are able to process for our customers is much higher than it used to be in the past,” he points out. “We’ve changed the way we do business from close to zero machine-translated words to 70-80% of words which are machine-translated.” Slator characterizes the recent past of Moravia as a "quiet rise and strategic shift" and apparently they are a 160M company today, probably 2X to 3X what they were 5 years ago.

SDL is at about $365M in revenue (with a weak Brexit Pound) and is the most engaged with MT, with 99.5% of the words they process coming from MT (100M HT and 20B MT per month), which also means they are really engaged in a much broader range of customer conversations that define the global customer experience. Both of these companies have grown significantly over 5 years.
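As a quick sanity check on the SDL figures, the short calculation below derives the MT share from the two monthly volumes quoted above. It assumes that "100M HT" and "20B MT" both refer to words per month, which is how the sentence reads but is not stated explicitly.

```python
# Monthly volumes as quoted in the text (assumed to be word counts).
human_translated = 100_000_000        # 100M words of human translation (HT)
machine_translated = 20_000_000_000   # 20B words of machine translation (MT)

mt_share = machine_translated / (human_translated + machine_translated)
print(f"MT share of all words: {mt_share:.1%}")  # prints "MT share of all words: 99.5%"
```

Put another way, these volumes imply roughly 200 machine-translated words for every human-translated one.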

Now contrast this with Lionbridge, which has hovered around $500M for 5-10 years. Rory Cowan of Lionbridge recently said that "Machine translation has been the classic dark horse, of course, waiting for its hour of glory." (Wake up, dude!) The cost of missing the boat in competitive businesses is lost market share, e.g. LIOX. I am going to bet that the 5-year success metrics for both Moravia and SDL were significantly better than what we have seen with Lionbridge. MT is only a small part of this of course, but it was strategic many years ago for those who were clued in. I am going to bet that using it well will become an even more important element over the next five years.


-----=======-----

Let me describe for you what global branding often looks like in US companies. The self-assured marketing team typically develops a brand strategy that is unconsciously steeped in the culture that prevails at headquarters. The well-meaning team makes decisions and assumptions about what people value, how they behave, what their history is, what they may find compelling, all the while not realizing that they are silently defining “people” as “Americans.” Once the team at headquarters feels the brand is properly characterized, guidelines, story lines, talking points and imagery are developed to facilitate communication of the brand identity and strategy to staff and consumers. Downstream content and assets are translated into other languages with the intention of bringing teams from diverse geographies into the fold. After translation, headquarters might reach out to local market colleagues and ask them to review the translations. But not even that level of engagement with international colleagues is a sure thing. The bottom line is that, as with most areas of business, branding teams approach globalization as an after-thought that can be addressed through translation once the company vision is set – from headquarters.

At its core this approach makes a rather righteous assumption about the global dominance of US business and consumer culture. Everyone loves America, after all, right? It’s easy to think that people all over the world are so used to “buying” American culture that no additional efforts to connect need to be undertaken. This is reinforced by the fact that sometimes it is, in fact, the very Americanness of the brand that people are buying. According to the Wall Street Journal, Cadillac sales are up 23% globally this year (through July 2017), with year-to-date sales in China jumping 69% relative to the same 7-month period in 2016. Why? A Shanghainese Cadillac owner indicated that his car sets him apart and “represents American heritage.” You see the ubiquity of American culture among the cosmopolitan elite in Europe in daily conversation where, whether you are speaking Spanish, French or German, the word “cool” has become cool.

Given the dominance of the United States in the global economy since WWII, complacency is an easy trap to fall into. It’s easy to think that if you are selling internationally and increasing your sales year-over-year, the current approach must be working just fine. But this sort of thinking dramatically oversimplifies the issue. Just because a company is growing internationally doesn’t mean it could not be growing much more quickly internationally or that the growth might not be longer-lasting if more nuanced global strategic thinking were applied. Implicit in this over-simplification is the idea, made famous in the film Field of Dreams, that if we just “build it” then naturally “they will come.” However, this reeks of American over-confidence.

In their defense, I don’t think that branding professionals are consciously putting on airs. The lack of vision results from lack of experience in global settings. Living in a large and diverse country like the United States, it’s easy to live in a bubble. Even when Americans put on our explorer hats and plan vacation adventures designed to broaden our horizons, we tend to focus on relatively less-expensive domestic travel. Why visit Orleans when you have New Orleans in your backyard? OK, maybe a bad example – New Orleans is pretty objectively awesome regardless of how you look at it – but you see my point. And in a country where work is king, paid time off is scarce, and only 36% of citizens have valid passports, one can understand how Americans end up having fewer international experiences. Americans live in a massive cultural silo so it is logical that our approach to branding would have similar dimensions, or lack thereof.


Nonetheless, developing a global brand strategy while living in self-imposed cultural isolation is problematic. When the approach to global branding becomes mono-cultural, it is more difficult to apply in new markets. In not leveraging ground-floor observations from a variety of cultures to assemble an inclusive and authentic brand vision, the approach reinforces a widespread skepticism and resentment of American cultural arrogance. It plays into the common view that American companies are brute force neoimperialists, self-aggrandizing egomaniacs, or “my way or the highway” managers who couldn’t care less how the rest of the world functions. Developing a cohesive global brand is difficult in the best of circumstances. This sort of monolithic thinking does little to set the company up for success. On the contrary, it alienates those who are best positioned to help the company to be successful.

Why does it matter?

Modern business is global by definition. Being global is not an optional marketing strategy that can be employed down the road once domestic growth has tapped out. According to the UN Conference on Trade and Development’s 2017 World Investment Report, over the past 25 years the top 100 multinational enterprises (MNEs) have seen the majority of their sales, assets and employees shift to foreign markets. The shift is even more dramatic when you look only at companies from the digital economy, which has traditionally formed the backbone of the globalization industry. Why does this matter? Well, first off, no global business is going to be successful unless its employees understand and value the brand. If most of your employees do not reside within the headquarters culture, extra effort is required to ensure that the work of foreign employees will support and strengthen the brand. And if most of your customers are foreign, your brand strategy needs to be broad enough to include a range of cultural realities.

 

Tips for building globally-enabled brands

 

  1. Explore your own cultural biases: The number one thing you can do to improve your global brand is to start by examining your own cultural baggage. To connect with people and ideas that are outside our cultural frame of reference, we first need to break free from the assumptions we make, day in and day out, without even realizing we are doing so. A great way to break biases is to engage in cross-cultural communications training (hint, hint – yes, Idiosynch offers this). Even something as simple as attempting to learn a new language forces us to move outside of our rote thinking. If taking a class is not feasible, we can extend culturally by simple acts in everyday life. Take a moment to talk with someone from a different culture, whether they are your doctor, your grocery bagger or the barista at your local café. Exploring the world begins with exploring people and if we are willing to open our eyes, there are a surprising number of opportunities to grow our cultural knowledge in our local communities. Often it is not hard to see our own biases once we have a little perspective. Getting perspective, however, requires conscious effort and a commitment to honest and humble self-examination.
  2. Open your ears to global voices: You would be surprised how much global experience exists on even small domestic teams. Because we are so used to thinking and acting from within our bubbles we rarely ask ourselves and our colleagues about the experiences that have forced us to question our cultural assumptions. One need not have travelled the world to have had such experiences. We have them every day with our neighbors and friends. Sharing our diverse experiences primes our ability to think outside of our own bubble. If your organization is already global, a powerful approach is to engage with local champions from a variety of disciplines in the company. By including diverse voices, we can learn to understand and codify brand identity in ways that drill down to the core message, retaining a common cultural denominator that does not preclude any single culture from participating.
  3. Universal but flexible: Brands cannot be infinitely malleable. After all, each company needs a level of brand consistency across all markets. That said, the Apples of the world, with their one-size-fits-all approach, are rare and exceedingly difficult to create. With global branding we want to develop a core universal brand identity, a basic set of values and emotional connections we hope to make in all markets, while retaining enough flexibility to adapt to local cultural realities. Let’s think about brand in terms of storytelling. The global brand can be quite specific about how it defines the “moral of the story,” or, simply put, the brand message. The moral of the story is what matters and needs to be conveyed in every market. By communicating the moral without prescribing the characters, plot, setting, etc., we can empower local teams to tell a wider range of stories, tailored to local cultural realities.
  4. Customization guidelines: A strong global branding strategy will require adaptation by region or market. But allowing local teams to adapt brand identity willy-nilly is a surefire way to lose control. Through communication and coordination with local-market teams and by making use of cultural consultants to provide heuristic analysis, a branding team can develop market-specific guidelines (linguistic, visual, value mapping, etc.) that allow local market teams to bring their superior cultural knowledge to bear while ensuring an adequate level of control and transparency for headquarters.
  5. Authenticity and values: Brands are striving for authenticity and emotional connections with consumers. But to be authentic, you need to know what authenticity looks like in other cultures. To build a brand identity without viewing it through diverse lenses of cultural authenticity is naïve at best and potentially devastating. Not knowing what cultural myths one is conjuring, what history one is evoking, what references to art or pop culture one is unwittingly making, is the surest way to alienate consumers. For not only does the brand miss an opportunity to connect; it is damaging its chances of ever connecting. Americans are less accustomed to this dynamic because many of the most powerful brands are in fact American. But ill-conceived branding is common in other markets. Both the American perpetrators and the local recipients tend to laugh off these branding faux pas, but at the end of the day, the brand that fails to demonstrate authenticity in culturally specific ways is bound to be perceived as out of touch or worse, disrespectful.
  6. Brand is culture: Culture, ideology, history and politics inform and affect the way we build brands, how we perceive brand meaning and what values are ultimately targeted. With the emergence of earned content (user-generated content) as a key conduit for demonstrating authenticity, the dance between culture and brand is now happening in real time and it can be difficult to tell who is leading and who is following. Brand is understood and experienced by consumers through their cultural lens. But culture itself can be observed and understood through brand as consumers assert their voice and equal role in the dance. We are emerging into an era where brand strength can arguably be judged by the degree to which it is effective in pushing the culture needle, playing a role in how culture itself evolves.
  7. Feedback loops and brand integrity: Once an authentic, adaptable brand globalization strategy has been developed, with all the incumbent research and guideline development, you might think the journey is complete, at least until the brand wants to take a new direction. But brand has a funny way of taking on a life of its own. You will need a strategy for monitoring the brand in every market. It is necessary to build expectations and processes that allow the brand team at headquarters to understand how the brand is being expressed in foreign cultures. It is here where many global brands fail. It’s tempting to do the hard work initially and then simply abandon the local manifestations of your brand to the winds of change (or if you are unlucky, to the storms of controversy). In developing a long-term brand audit plan, headquarters can ensure cohesion and consistency across markets and over time, while also mining the experiences in each market for nuggets of wisdom that may work in other markets as well.






Aaron Schliem is the Principal Consultant & Founder of IdioSynch



Idiosynch is an advisory firm that shows companies of all sizes how to harness the power of cultural authenticity in their workplace culture and global branding. Aaron acts as a virtual Chief Globalization Officer to evaluate identity and systems and to deliver human-centered strategies for smarter global growth.


A 20-year veteran of the language services industry, Aaron is a serial entrepreneur who launched his first company, Horizon Learning, a Chilean adult second language acquisition firm, in 1996. More recently Aaron helped found and lead Glyph Language Services, where as CEO he positioned the firm as a leader in global communications, cross-cultural learning, and adaptive localization of creative media.


A regular speaker at international conferences and workshops, Aaron is a thought leader in fields that range from executive compensation to mobile apps and games. Aaron’s writings have been published in Multilingual Magazine, The Content Wrangler and The Savvy Client's Guide to Translation Agencies. You can read more about Aaron’s approach and philosophy on his blog at www.idiosynch.com

Tuesday, September 19, 2017

About Clever CATs and TeMpTing Free MT Offers

This is a guest post by Christine Bruckner that looks into the data and information security issue with free online MT services from a translator's perspective. She was kind enough to do a summary translation of a longer article she recently wrote in German, referenced below. This subject is closely related to my previous post and shows that this issue is gaining in visibility and prominence.

Jost Zetzsche has also written about this MT security and data privacy issue, and some of you may have noticed our Twitter banter on the Google Translate security policy. Jost feels that this text from a FAQ suggests that use of the Translator API overrides the "Your Content in our Services" policy, even though the legal language in the Terms of Service very clearly states the following: "When you upload, submit, store, send or receive content to or through our Services, you give Google (and those we work with) a worldwide license to use, host, store, reproduce, modify, create derivative works (such as those resulting from translations, adaptations or other changes we make so that your content works better with our Services), communicate, publish, publicly perform, publicly display and distribute such content. The rights you grant in this license are for the limited purpose of operating, promoting, and improving our Services, and to develop new ones." I guess it is a matter of interpretation, and perhaps of my lack of trust, given how Google sometimes presents facts. Knowing what I do about how an API works, I see that the GT GUI (which is just a front-end software interface like any other) connects via an API to the Translate Service; thus, I maintain that the aforementioned TOS is very much in effect whenever your data touches the Translate Service. Good luck to anyone who wants Google to confirm or deny this, because they tend to NOT RESPOND. Don DePalma's quote from his 2014 article, shown below, also seems to support the view that the TOS is in effect. It is up to you to decide for yourself; my sense is that they tell you very clearly what they have the right to do, so be wary if you don't like this policy.

It is good that this issue is getting attention, so that all MT use in professional settings can be more informed on the data security issue. I thank Christine for also clarifying the MyMemory privacy policy below and for alerting us to the more stringent security requirements in the EU. We should not fault these MT service providers for using your data if you use their services for free, as the data can sometimes be useful for improving the MT capabilities. "There is no such thing as a free lunch," as they say in America. Given the sheer volume of MT use on any given day, I doubt it is possible to analyze the translated text in any way other than through some machine learning process. The risks are always higher when you have a case of incompetence like translate.com, or when the MT provider is under-resourced and unable to implement proper safeguards. Some providers do give you an option to buy privacy, and that too is a fair and reasonable policy, as the options to go on-premise and private cloud also come at a cost.


------------------------


Most of the CAT (computer-aided translation) tools used in the professional translator’s workplace offer integrations with online machine translation (MT) solutions. Better MT quality and self-learning capabilities thanks to neural and adaptive MT technologies make the classic TM (translation memory) and innovative MT combination (also called “augmented translation”) more attractive for professional translators, too.

Much has been written and said about MT post-editing, quality, pricing, process impacts, etc., but it seems that so far little awareness has been raised about potential and actual information security leaks when professional translators plug free online MT into their translation environments.

For an article written in August 2017 for the 04/17 edition of MDÜ (the journal of Germany’s Association of Interpreters and Translators, BDÜ), I took a closer look at the most popular MT plugins that are available for common CAT tools and at the information security aspects of such MT offers. A free reading sample of my German article is available at http://www.bdue-fachverlag.de/download/mdue/1870.


Recently, Slator called MDÜ’s spotlight on information security “timely”, as one day after publication of this MDÜ edition, news spread about the massive data privacy breach in Norway due to use of free online MT (see https://slator.com/technology/translate-com-exposes-highly-sensitive-information-massive-privacy-breach). In my opinion and experience, such problems have also existed in the past, but have gone largely unnoticed by a wider audience. But now, with several MT solution providers heavily advertising their secure MT solutions, the marketing departments of MT solution providers and the media seem to be much more attentive to such issues. (The elevated concern for data security may also have been exacerbated by incidents like the Equifax and Russian hacker stories that fill our news sources today.)

MT Integration within CAT tools

In my MDÜ article, I focused on the four CAT tools that, according to Slator research from April 2017, are the most popular among German technical translators: SDL Trados Studio, Across, memoQ and STAR Transit.

The plug-ins available for these CAT tools offer access to MT technology solutions in two modes:
  • batch mode / via pre-translation
  • interactive lookup and use during translation


| CAT Tool | Plugins for Free Online MT | Other MT Plugins for Paid MT Solutions |
| --- | --- | --- |
| SDL Trados Studio 2017 | SDL Language Cloud, Google Cloud Translation, MyMemory, Microsoft Translator (via Enhanced MT Plugin), iTranslate4.eu; additional free MT plugins available via SDL AppStore | Systran, Omniscien, KantanMT, CrossLang Gateway MT, Promt, LucyLT (now OctaveMT) and other providers (available via SDL AppStore or directly from the MT vendor) |
| Across 6.3 | Google MT | Moses, Omniscien, Reverso, SmartMATE, LucyLT (now OctaveMT), Systran; additional connectors can be ordered from Across |
| MemoQ 8.1.5 | MyMemory, Google MT, Bing/Microsoft Translator, iTranslate4.eu | KantanMT, Omniscien, CrossLang Gateway MT, Iconic IPTranslator MT, Let’sMT!, PangeaMT, Systran, Slate Desktop, tauyou, Tilde MT and other providers |
| Transit NXT SP 9 | Google MT, Microsoft Translator, MyMemory, iTranslate4.eu (all of them only in interactive mode) | Systran, SmartMATE, Omniscien, STAR-MT (also for pre-translation; only available in Transit Freelance Pro and Professional versions; needs to be activated via license number) |




These modes of integrating MT with TM are by no means recent advances; such combinations were already available in the 1990s, for example in the Trados Translator's Workbench (see article by Matthias Heyn on pp. 111-123 of the MT archive).

The interaction between CAT tool and MT solution takes place on the segment (i.e. usually sentence) level. The MT suggestions are returned to the CAT tool both on the segment level, and often also on the sub-segment level: individual words or phrases from the MT system are presented interactively via predictive typing to the translator, or via MT enhanced (repaired) fuzzy matches in pre-translation and / or interactive mode.
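
To make the two integration modes concrete, here is a minimal Python sketch. Everything in it is hypothetical: `fake_mt` stands in for an online MT service and simply uppercases the text, and `split_segments` is a naive sentence splitter. No real CAT tool or MT API is shown. The point is only that in both modes the source text leaves the translator's machine one segment at a time.

```python
import re
from typing import Callable, Dict, List, Tuple


def split_segments(text: str) -> List[str]:
    """Naive sentence segmentation (real CAT tools use configurable rules)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def fake_mt(segment: str) -> str:
    """Hypothetical stand-in for an online MT call; it just uppercases."""
    return segment.upper()


def pre_translate(
    text: str, mt: Callable[[str], str], tm: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Batch mode: fill every segment up front, TM first, MT as fallback."""
    result = []
    for seg in split_segments(text):
        if seg in tm:
            result.append((seg, tm[seg], "TM"))
        else:
            result.append((seg, mt(seg), "MT"))  # this segment is sent out
    return result


def interactive_lookup(
    segment: str, mt: Callable[[str], str], tm: Dict[str, str]
) -> str:
    """Interactive mode: one segment is looked up when the translator opens it."""
    return tm.get(segment) or mt(segment)
```

In this sketch, a segment with an exact TM match never reaches the MT function, while every other segment does, whichever mode is used.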

I have selected the following online MT services for my further investigation:
  • Google Cloud Translation (integrated in all four CAT tools)
  • Microsoft Translator (available in three of the CAT tools)
  • MyMemory (available in three CAT tools)
  • SDL Language Cloud AdaptiveMT (available since SDL Trados Studio version 2017 for some language directions; free access for owners of a Studio Freelance or Professional license)
I have only taken into consideration the respective free (non-paid) MT service offers. This means that the MT service is generally restricted in translation volume and, in some solutions, also restricted in features (e.g. only SMT and no NMT). Except for the MyMemory MT service, all of these free MT services require user registration.

Figure 1: MT plug-ins and MT pre-translation in SDL Trados Studio 2017 SR-1 via Google Cloud Translation API with NMT option

Information Security Aspects

When talking about sending/uploading data – or in the context of professional translation, often complete text – on the internet, there are two main aspects that need to be considered: data privacy and information security.

In my BDÜ article, I focused on information security aspects (as this was the topic of this MDÜ edition; for data privacy aspects see Addendum 1 below), and its three basic components: confidentiality, availability, and integrity.

Availability of free MT online services is usually not crucial for professional translators: they will still be able to work if the online MT system fails. And in the free MT offering range, none of the service providers guarantee availability anyway.

Integrity is also a minor concern for professional translators as they do not expect that the machine-translated data will be "complete" and "unchanged" – neither in terms of content nor structure (formatting, tags, etc. are often lost during MT processing).

The crucial topics are related to the question of data confidentiality.

In an article published in 2014, Don DePalma of Common Sense Advisory mentioned two problem areas associated with using online MT in general:
  • a) The “wrong” people can see information in transit.
  • b) MT sites can use your data in ways you did not intend.

My research regarding aspect a) for the TeMpTing MT plugins showed:
The APIs of some cloud MT providers such as Google or Microsoft offer encrypted data transfer options (e.g. via the SSL protocol), but in most CAT tool integrations this option is either not used, not available in the free version, or not recognizable by a normal user. Only the CAT tools MemoQ and Across provide an option to configure the user's own computer or server as the referrer in the API key.
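
One precaution a technically minded user can take is to refuse to send a segment unless the configured MT endpoint uses an encrypted scheme. The following is a minimal sketch using only Python's standard library. The function names and the example URL are hypothetical, and the check covers only the URL scheme, not the certificate and not what the provider does with the data afterwards.

```python
from urllib.parse import urlparse


def uses_encrypted_transport(endpoint_url: str) -> bool:
    """Return True only if the endpoint URL uses the https scheme."""
    return urlparse(endpoint_url).scheme.lower() == "https"


def guard_endpoint(endpoint_url: str) -> str:
    """Raise instead of silently sending text over an unencrypted connection."""
    if not uses_encrypted_transport(endpoint_url):
        raise ValueError(f"Refusing unencrypted MT endpoint: {endpoint_url}")
    return endpoint_url


# Hypothetical usage: validate the endpoint once, before any segment is sent.
endpoint = guard_endpoint("https://mt.example.com/translate")
```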

And concerning aspect b): With the help of AI methods, Internet data collectors are able to re-construct the whole text even when the free TeMpTing MT plug-ins are used only in interactive translation mode, whereby the individual segments are sent to the online MT provider at irregular intervals.
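
Why interactive mode offers little protection can be illustrated even without AI methods. The toy sketch below assumes, purely for illustration, that a provider logs each request with a client identifier (such as an API key) and an arrival time, and that the translator works through the document roughly in order. The log format and names are invented. Under those assumptions, grouping by client and sorting by time already yields a readable approximation of the text.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical log record: (client_id, arrival_time_in_seconds, segment)
LogEntry = Tuple[str, float, str]


def reassemble(log: Iterable[LogEntry]) -> Dict[str, str]:
    """Group logged segments by client and join them in order of arrival."""
    by_client = defaultdict(list)
    for client_id, arrival_time, segment in log:
        by_client[client_id].append((arrival_time, segment))
    return {
        client: " ".join(seg for _, seg in sorted(entries))
        for client, entries in by_client.items()
    }


log = [
    ("key-1", 30.0, "It expires in May."),
    ("key-2", 5.0, "Hello."),
    ("key-1", 2.0, "The contract is secret."),
]
print(reassemble(log)["key-1"])
```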

What Don DePalma found in 2014 still seems to be true: "While content ownership remains with the creator, free MT providers claim usage rights under their terms and conditions. For example, Google notes that it “does not claim any ownership in the content that you submit or in the translations of that content returned by the API.” However, as you follow the policy links, you learn that “When you upload or otherwise submit content to our Services, you give Google (and those we work with) a worldwide license to use, host, store, reproduce, modify, create derivative works (such as those resulting from translations, adaptations or other changes we make so that your content works better with our Services), communicate, publish, publicly perform, publicly display and distribute such content.”"

I started to look for links to the Terms of Service of the free online MT offers in the CAT tool interfaces and documentation, but I have found only a few “road signs” with warnings and links.


Figure 2: MT connection screen in Across Freelance Edition

When I dug deeper and went into the “jungle” of the Terms of Service on the websites of the MT service providers, I found some quite worrisome facts: In their long and dispersed terms of use and service, providers like Google, Microsoft, and SDL try to assure the user that they really care about your data as much as you do and that your data belongs to you. But none of these three providers offer MT as a regional service, which means that the servers are not guaranteed to be located in the EU (even SDL’s servers are located in the US) – and the stricter European terms of service of these vendors and/or European data protection regulations might not apply.

SDL explains in the SDL Language Cloud FAQs that "With SDL Language Cloud Machine Translation you can rest assured that your content is safe. SDL guarantees that your data is not saved or used outside of the scope or timeframe that is necessary to provide you with the service”. However, when you read the Internet Security section of SDL’s Terms and Conditions for Language Cloud Translation Services, they also warn: "Because the Internet is an inherently open and insecure means of communication, any Data or information a user transmits over the Internet may be susceptible to interception and alteration. SDL makes no guarantee regarding, and assumes no liability for, the security and integrity of any Data or information you transmit over the Internet, including any Data or information transmitted via any server designated as "secure". You should not have an expectation of privacy in any content, including accounts of files transmitted through the internet.”

And according to an SDL statement at the European Trados User Group Conference in June 2017, the SDL Servers for Language Cloud Machine Translation are located in the US, and none in Europe.


For further details on the Terms of Service of Microsoft and Google, see also Kirti Vashee's most recent blog on Data Security Risks with Generic and Free Machine Translation.
 
Translated srl, the Italian service provider behind MyMemory, even boldly claims in its Service Terms and Conditions of Use: "We collect any segment submitted and store it on a long term basis, whether it’s public or private. […] The contributions to the archive, whether they are "Public Data" or "Private Data", are collected, processed and used by Translated to create statistics, set up new services and improve existing ones."


Clever CAT Recommendations for Translators

So should translators and companies employing translators keep their hands off any MT plugin in CAT tools? This is, of course, the right course for confidential or otherwise classified texts, and whenever use of online MT services for processing texts is explicitly forbidden by the client or company.

If the classification status of the texts is not clear, translators are advised to apply common sense and beware of the dog behind the MT-augmented CATs by taking a closer look at the Terms of Use / Service of online MT services.

While companies and (large) LSPs can buy or build their own secure MT solution (on-premises or in a secure private cloud environment), individual translators could – and should – also benefit from and keep up-to-date with advances in MT technology by:
  • using free (unsecured) online MT for freely available test texts or translation jobs where they have the consent of the author or client
  • using offline MT solutions, which can be acquired for a few hundred euros/dollars. In addition to being free from information security issues, they also provide more customization options like terminology import.
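
The recommendations above can be read as a simple decision rule. The sketch below encodes it in Python. The `Job` fields and the returned labels are hypothetical names chosen for illustration, and the rule is only as reliable as the classification information the translator actually has.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    confidential: Optional[bool]  # None means the classification is unclear
    online_mt_forbidden: bool     # explicitly forbidden by client or company
    client_consent: bool          # author or client agreed to online MT
    publicly_available: bool      # e.g. freely available test texts


def mt_option(job: Job) -> str:
    """Map a translation job to the most permissive MT option it allows."""
    if job.confidential or job.online_mt_forbidden:
        return "offline or secure MT only"
    if job.publicly_available or job.client_consent:
        return "free online MT acceptable"
    # Unclear status and no consent: read the provider's terms before deciding.
    return "check Terms of Service first"
```

For example, a job marked confidential always maps to the offline or secure option, regardless of any consent flag.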

Addendum 1: Data Privacy Aspects when Using MT


The recommendation to use offline or secure cloud-based MT solutions also holds true for the data protection aspect of online MT use in general. The authors of a recent, very informative article on “Data protection in Machine Translation under the GDPR” (GDPR = General Data Protection Regulation) confirm: “Offline MT […] does not pose any special problems with regards to personal data processing.”[1]

With regards to online MT services and protection of personal data, they recommend: “The user should generally avoid online MT services where he wishes to have information translated that concerns a third party (or is not sure whether it does or not).” And in a footnote, they even conclude: “This means users may be advised to use online MT services only for translating text from their own language into another language, and not vice versa (where they cannot be sure of the content)”[2].

The General Data Protection Regulation (GDPR) will apply from 25 May 2018, and although it is EU legislation, “[…] the GDPR expressly applies to the processing by controllers outside the EU, as long as the controller offers services to EU citizens (art. 3(2) of the GDPR).”[3]

[1] Kamocki, Dr. Tauch (2017): Data protection in Machine Translation under the GDPR, in: Porsiel, Jörg (ed.): Machine Translation - What Language Professionals Need to Know, 2017, p. 71 ff.
[2] ibid. p. 81
[3] ibid. p. 69

 

Addendum 2: DeepL – a TeMpTing New MT Player?

At the end of August 2017, a new NMT player with headquarters in Germany and servers in Iceland (which, by the way, is not an EU member state) entered the scene: DeepL.

Their free MT offer is currently only accessible via their website. However, Jost Zetzsche mentions in his 278th Tool Box Journal, published on Sept 9, 2017, that DeepL will develop an API and this “is particularly important if DeepL is to be used by professional translators who would want to use it not on a web page but integrated into a translation environment tool”. Jost writes that, according to his conversation with DeepL's CTO, there seems to be some hope that “DeepL would commit itself to not using the data that is being translated for training purposes (which it does right now)” in exchange for payment for the use of such an API.

Post Script



Regarding EU aspects and Jost’s comments: Google actually has Terms of Service for Germany (and possibly for other EU countries) that do not include the passage regarding “you give Google (and those we work with) a worldwide license to use, host, store, reproduce, modify, create derivative works”.

The German TOS are more recent than the international ones; they have probably been updated in preparation for the EU General Data Protection Regulation. But given that Google MT is not a regional service, it remains unclear which TOS apply.

I have explained this in more detail in my German article, but left it out of my “international” article as I considered it too Germany/EU-specific.



=================




Christine Bruckner has more than 20 years of professional experience as a freelance translator and in CAT/term/MT administration, support, training & consulting in the German government/military, corporate and LSP area. Since 2014, she has led the Technical Services team at a German LSP.

Christine holds university degrees in translation and in computational linguistics. She was an early adopter of TM technology in the early 1990s and has introduced and administered several MT solutions and translation memory and management systems at different employers. She is a member of EAMT (www.eamt.org) and BDÜ (www.bdue.de), and enjoys reading MT research and testing TM+MT combinations. She tries to get the best out of the TM+MT coupling in her occasional translations, mostly in the human rights area.

Web site: http://www.cattmatters.de/English/