Wednesday, September 13, 2017

Data Security Risks with Generic and Free Machine Translation

Together with all the news of catastrophic hurricane activity in the US that we have been bombarded with recently, we are also seeing stories of serious data security breaches of privileged information, even in the world of business translation. As always, the security and privacy of data can only be as good as the security practices and the sophistication of the technological implementations behind it, so the knee-jerk response of pointing to "bad" MT technology per se, and assigning blame to MT use practices, is not quite fair. It is indeed possible to make MT services available for public and/or corporate community use safely and securely if you know what you are doing and are careful. It may seem obvious to some, but effective use of any technology requires both competence and skill, and we see too many cases of inept implementations that result in sub-optimal outcomes.
Generally, even professional translation work done entirely by humans requires source content to be distributed to translators and editors across the web so they can perform their specific tasks. Sensitive data or confidential information requiring translation can leak in two ways. First, information can be stolen "in transit" when it is transferred or accessed over unsecured public Wi-Fi hot spots, or when it is stored on unsecured cloud servers. Such risks have already been widely publicized, and it is clear that weak processes and lax oversight are responsible for most of these data leakage cases.

Less considered, however, is what online machine translation providers do with the data users input. This risk was publicized by Slator last week, when employees of Norwegian state-run oil giant Statoil had “discovered text that had been typed in on [] could be found by anyone conducting a [Google] search.”

Slator reported that: “Anyone doing the same simple two-step Google search will concur. A few searches by Slator uncovered an astonishing variety of sensitive information that is freely accessible, ranging from a physician’s email exchange with a global pharmaceutical company on tax matters, late payment notices, a staff performance report of a global investment bank, and termination letters. In all instances, full names, emails, phone numbers, and other highly sensitive data were revealed.”

In this case, the injured parties apparently have little or no recourse, as the "Terms of Use" policy of the MT supplier clearly states that privacy is not guaranteed: "cannot and do not guarantee that any information provided to us by you will not become public under any circumstances. You should appreciate that all information submitted on the website might potentially be publicly accessible."
Several others in the translation industry have pointed out other examples of the risks and have named other risky MT and shared data players involved with translation data.

Translation technology blogger Joseph Wojowski wrote in some detail on the Google and Microsoft terms of use agreements in a post a few years ago. The information he presents is still quite current. From my vantage point, these two MT services are the most secure and reliable "free" translation services available on the web today, and a significant step above offerings like and many others. However, if you are really concerned about privacy, these too carry some risk, as the following analysis points out.

His opening statement is provocative and true at least to some extent:
“An issue that seems to have been brought up once in the industry and never addressed again are the data collection methods used by Microsoft, Google, Yahoo!, Skype, and Apple as well as the revelations of PRISM data collection from those same companies, thanks to Edward Snowden. More and more, it appears that the [translation] industry is moving closer and closer to full Machine Translation Integration and Usage, and with interesting, if alarming, findings being reported on Machine Translation’s usage when integrated into Translation Environments, the fact remains that Google Translate, Microsoft Bing Translator, and other publicly-available machine translation interfaces and APIs store every single word, phrase, segment, and sentence that is sent to them.”

The Google Terms of Service

Both Google and Microsoft state very clearly that any (or at least some) data used on their translation servers is available for further processing and re-use, generally by machine learning technologies. (I would be surprised if any single individual actually sits and watches this MT user data stream, even though it may be technically possible to do so.) Their terms of use are considerably better than the one at who might as well have reduced it to: "User Beware: use at your own risk, and we are not liable for anything that can go wrong in any way whatsoever." Many people around the world use Google Translate daily, but very few of them are aware of the Google Terms of Service. Here is the specific legalese from the Google Translate Terms of Use Agreement, which I include because it is good to see it as specifically as possible to properly understand the potential risk.
When you upload, submit, store, send or receive content to or through our Services, you give Google (and those we work with) a worldwide license to use, host, store, reproduce, modify, create derivative works (such as those resulting from translations, adaptations or other changes we make so that your content works better with our Services), communicate, publish, publicly perform, publicly display and distribute such content. The rights you grant in this license are for the limited purpose of operating, promoting, and improving our Services, and to develop new ones. This license continues even if you stop using our Services. (Google Terms of Service – April 14th 2014, accessed on September 11th 2017.)
Here are some other highlights from the Google TOS, which basically, IMO, mean: if something goes wrong, tough shit; and if you can somehow prove it is our fault, we only owe you what you paid us, unless you can somehow prove the damage was reasonably foreseeable. The terms are even less favorable if you use the MT service for "Business Use":



The Microsoft Terms of Service

Microsoft is a little better, and they are more forthcoming about their use of your data in general, but you can judge for yourself. Heavy users even have a way to bypass the possibility of their data being used or analyzed at all with a paid, volume subscription. Heavy use is defined as 250 million characters per month or more, which by my calculations is anywhere from 30 million to 50 million words per month. Here are some key selections from the Microsoft Translator Terms of Use statement.
"Microsoft Translator does not use the text or speech audio you submit for translation for any purpose other than to provide and improve the quality of Microsoft’s translation and speech recognition services. For instance, we do not use the text or speech audio you submit for translation to identify specific individuals or for advertising. The text we use to improve Translator is limited to a sample of not more than 10% of randomly selected, non-consecutive sentences from the text you submit, and we mask or delete numeric strings of characters and email addresses that may be present in the samples of text. The portions of text that we do not use to improve Translator are deleted within 48 hours after they are no longer required to provide your translation. If Translator is embedded within another service or product, we may group together all text samples that come from that service or product, but we do not store them with any identifiers associated with specific users. We may keep all speech audio indefinitely for product improvement purposes. We do not share the text or speech audio samples with third parties without your consent or as otherwise described, below.

We may share or disclose personal information with other Microsoft controlled subsidiaries and affiliates, and with suppliers or agents working on our behalf to assist with management and improvement to the Translator service.

In addition, we may access, disclose and preserve information when we have a good faith belief that doing so is necessary to:
  1. comply with applicable law or respond to valid legal process from competent authorities, including from law enforcement or other government agencies; (Like PRISM for the NSA)
  2. protect our customers, for example to prevent spam or attempts to defraud users of the services, or to help prevent the loss of life or serious injury of anyone;"
And for those LSPs and Enterprises who customize (train) the MSFT Translator Baseline engines with their own TM data the following terms additionally apply:
"The Microsoft Translator Hub (the “Hub”) is an optional feature that allows you to create a personalized translation system with your preferred terminology and style by submitting your own documents to train on, or using community translations. The Hub retains and uses submitted documents in full in order to provide your personalized translation system and to improve the Translator service. After you remove a document from your Hub account we may continue to use it for improving the Translator service."
Again, if you are a 50-million-words-per-month kind of user, you can opt out of having your data used for anything else.
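As a quick back-of-the-envelope check on that opt-out threshold, here is the arithmetic behind my 30-to-50-million-word estimate (the 5-to-8-characters-per-word range is my assumption, a common rule of thumb for English text including spaces):

```python
# Convert Microsoft's 250M character/month opt-out threshold into words,
# assuming an average of 5-8 characters per word (including spaces).
CHARS_PER_MONTH = 250_000_000

low_estimate = CHARS_PER_MONTH // 8   # longer average words -> fewer words/month
high_estimate = CHARS_PER_MONTH // 5  # shorter average words -> more words/month

print(f"{low_estimate:,} to {high_estimate:,} words per month")
# 31,250,000 to 50,000,000 words per month
```

In other words, only users pushing tens of millions of words a month through the service qualify for the paid opt-out.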

After his review of these agreements, blogger Joseph Wojowski concludes that translators need to be wary, even though he notes that there are real and meaningful productivity benefits for translators, in some cases, from using MT.

“In the end, I still come to the same conclusion, we need to be more cognizant of what we send through free, public, and semi-public Machine Translation engines and educate ourselves on the risks associated with their use and the safer, more secure solutions available when working with confidential or restricted-access information.”

Invisible Access via Integration

If your source data already exists on the web, some may say, what is the big deal anyway? Some MT use cases I am aware of that focus on translating technical support knowledge bases or eCommerce product listings may not care about these re-use terms. But the larger risk is that once translation infrastructure is connected to MT via an API, users may inadvertently start sending less suitable documents out for MT without understanding the risks and potential data exposure. For a random, unsophisticated user in a translation management system (TMS), it is quite possible to inadvertently send out and translate an upcoming earnings announcement, internal memos to staff about emerging product designs, or other restricted data to an MT server that is governed by these terms of use. In global enterprises, there is an ongoing need to translate many types of truly confidential information.
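One way to reduce this integration risk is a policy check sitting between the TMS and the external MT API. The sketch below is purely illustrative: the `RESTRICTED_PATTERNS` list and the `call_mt_api` callable are my hypothetical placeholders, not any vendor's actual rules or interface.

```python
import re

# Hypothetical keyword rules; a real deployment would use the
# organization's own data-classification policy, not a short regex list.
RESTRICTED_PATTERNS = [
    r"\bconfidential\b",
    r"\bearnings\b",
    r"\binternal use only\b",
]

def is_restricted(text: str) -> bool:
    """Return True if the text matches any restricted-content pattern."""
    lowered = text.lower()
    return any(re.search(pattern, lowered) for pattern in RESTRICTED_PATTERNS)

def translate_via_gateway(text: str, call_mt_api) -> str:
    """Send text to an external MT API only if it passes the policy check."""
    if is_restricted(text):
        raise PermissionError(
            "Blocked: restricted content must go to the internal MT server"
        )
    return call_mt_api(text)
```

A check like this would not catch everything, of course, but it at least forces an explicit decision before an upcoming earnings announcement leaves the firewall.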

Memsource recently presented research on MT usage from within their TMS environment across their whole user base, and showed that about 40 million segments per month are being translated via Microsoft Translator and Google through their API. Given that this volume barely meets the opt-out limits, we have to presume all the data is reused and analyzed. A previous post in the ATA Chronicle by Jost Zetzsche (page 26 of the December 2014 issue) showed that almost 14,000 translators were using the same "free" MT services in Memsource. If you add Trados and other TM and TMS systems that have integrated API access to these public MT systems, I am sure the volume of MT use is significant. Thus, if you care about privacy and security, the first thing you might need to do is address these MT API integrations that are cloaked within widely used TM and TMS products. While there are many cases where it might not matter, it would be good for users to understand the risks when it does.
Human error, often inadvertent, is a leading cause of data leakage
Common Sense Advisory's Don DePalma writes that "employees and your suppliers are unconsciously conspiring to broadcast your confidential information, trade secrets, and intellectual property (IP) to the world." CSA also reports that in a recent survey of enterprise localization managers, 64% said their fellow employees use free MT frequently or very frequently, and 62% told Common Sense Advisory that they are concerned or very concerned about "sensitive content" (e-mails, text messages, project proposals, legal contracts, merger and acquisition documents) being translated. CSA points out two risks:
  1. Information seen by hackers or geeks in transit across non-secure web connections
  2. Information re-used by the MT provider itself; see the Google TOS section described above for how and what Google can do with your content, even when you are no longer using the services.
The problem is compounded because, while it may be possible to enforce usage policies within the firewall, suppliers and partners may lack the sophistication to do the same, especially in an ever-expanding global market. Many LSPs and their translators now use MT through the API integration interfaces mentioned above. CSA lists these issues as follows:
  • Service providers may not tell clients that they use MT.
  • Most buyers haven’t caught up yet with data leakage.
  • Subcontractors might not follow the agreed-upon rules.
  • No matter what anyone says, linguists can and will use MT when it is convenient regardless of stated policies.
Within the localization groups, there may be some ways to control this. As CSA again points out by:
  • Locking down content workflows (e.g. turn off MT access within TMS systems)
  • Finding MT providers that will comply with your data security provisions
However, the real risk is in the larger enterprise, outside the localization department, where the acronym TMS is unknown. It may be possible, to some extent, to anonymize all translation requests through specialized software, to block all free translation requests, or to force them through special gateways that rinse the data before it goes out beyond the firewall. While these anonymization tools might be useful, they are still primitive, and much of the risk can be mitigated by establishing a corporate-controlled MT capability that provides universal access to all employees and remains behind the firewall.
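To make the "rinsing" idea concrete, here is a minimal sketch of such a pre-send filter, masking email addresses and long digit runs before a request leaves the firewall. The patterns are illustrative only; production anonymization tools cover many more identifier types (names, addresses, account formats, and so on), and the `rinse` function name is my own.

```python
import re

def rinse(text: str) -> str:
    """Mask obvious identifiers before text is sent to an external MT service."""
    # Mask email addresses.
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "[EMAIL]", text)
    # Mask runs of 4+ digits (phone numbers, account numbers, internal IDs).
    text = re.sub(r"\d{4,}", "[NUMBER]", text)
    return text

print(rinse("Contact john.doe@example.com or call 5551234567."))
# Contact [EMAIL] or call [NUMBER].
```

Note that this is essentially the same masking Microsoft describes applying to its own 10% training sample; doing it yourself, before the data leaves your network, is the more conservative posture.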

In addition to the secure corporate MT service described above, I think we will also see much more use of MT in e-discovery applications, both in litigation-related work and in broader corporate governance and compliance applications. Here is another opinion on the risks of using generic MT services in the corporate litigation scenario.

Considering Secure MT Deployment Options

Many global organizations are now beginning to realize the information leakage risk presented by unrestricted use of, and access to, free MT. While one way to address this leakage risk is to build your own MT systems, it has also become clear to many that most DIY (Do It Yourself) systems tend to be inferior in output quality to these free generic systems. When users are aware of the quality advantage of free MT, they will often double-check on these "better" systems, thus defeating the purpose of private and controlled access on DIY systems. Controlled, secure MT solutions, optimized for the corporate subject domain and supplied by vendors with certified competence in this work, seem to me the most logical and cost-effective way to solve this data leakage problem. "On-premise" systems make sense for those who have IT staff available, and able, to do the ongoing management and to protect and manage MT servers at both the customer and the vendor end of the equation. Many large enterprises have this kind of internal IT competence, but very few LSPs do.

It is my opinion that the vendors that offer both on-premise deployments and scalable private clouds are amongst the best options available in the market today. Some say that a private cloud option provides both professional IT management and verifiable data security, and is better for those with less qualified IT staff. Most MT vendors tend to provide cloud-based solutions today, and for adaptive MT this may be the only option. There are few MT vendors that can do both cloud-based and on-premise deployments, and even fewer that can do both competently. MT vendors who provide non-cloud solutions only infrequently are less likely to provide reliable and stable offerings. Setting up a corporate MT server that may have hundreds or thousands of users is a non-trivial affair. Like most things in life, it takes repeated practice and broad experience across multiple different user scenarios to do both on-premise and cloud solutions well. Thus, one would expect that vendors with a large and broad installed base of on-premise and private cloud installations (e.g. more than 10 varying types of customers) are preferable to those who do it as an exception and have an installed base of fewer than 10 customer sites. There are two companies whose names start with S that I think meet these requirements best in terms of broad experience and widely demonstrated technical competence. As we head into a world where neural MT is more pervasive, I think it is likely that private clouds will assume more importance and become a preferred option to having your own IT staff manage GPU, TPU, or FPGA arrays and servers on site. However, it is still wise to ask your MT vendor to provide complete details on the data security provisions in their cloud offering.

What seems more and more certain is that MT provides great value in keeping global enterprises actively sharing and communicating, and that better, more secure MT solutions have a bright future. What incidents like this latest fiasco show is that broadly available MT services are valuable enough for any globally focused enterprise to explore seriously and carefully, rather than leave it to naïve users to find their own way to risky "free" solutions that undermine corporate privacy and expose high-value confidential data to anyone who knows how to use a search engine or has basic hacking skills.


  1. One thought on deploying cloud-based vendors in the language sector (MT and otherwise): even for an LSP with competent IT staff, and even if the MT vendor offers contract-level assurances of security, my experience is that these vendors tend to make confirming security extremely difficult, at least if you are not an enterprise client (big LSP or big buyer). This is a major challenge for LSPs who seek to remain compliant with their client-side security contracts, and it puts everyone in the position of having to trust, rather blindly, that the cloud vendors are doing their due diligence.

  2. Thanks Aaron for bringing this up. The trust-but-verify principle is a very good one to keep in mind with any cloud service. If security cannot be verified with actual certifications or an audit by a competent professional, you may be at risk. We all assume that the cloud service takes care of these things, but they may not do so to required or verifiable levels. At a minimum, ISO 27018 certification may be required. MSFT Azure provides a thorough overview of the many issues to consider here: I have seen that MT vendors often use cloud services that are not likely to meet the most rigorous security requirements that may be needed.

  3. Data stored on some third party servers can never be secure. Binfer on the other hand bypasses cloud storage servers making it safe to send secure data. The link is

  4. Good point that needs to be reiterated periodically. A rare example of corporate data governance not keeping pace with that of the public sector rather than vice versa.

  5. This is an excerpt from an article in Finextra, a British "independent newswire for the worldwide financial technology community", linking the Equifax and the data leakage (improperly referred to as a privacy breach).

    SDL's VP Katie Rigby-Brown, who signed the piece (not exactly an independent source), describes free online MT engines at large as "insecure".

    In an article for issue 3 of the now defunct TAUS Review ( I wrote that the hostility against machine translation comes from the anthropomorphization of computers. In other words, it is easier for people to blame computers by attributing their own mistakes to them.

    We should all learn from experience. A fool does not learn from his own mistakes, a smart man learns from his own, and a wise man learns from the mistakes of others.

    The Equifax and the stories tell an uncomfortable truth: sentient machines would not commit errors, if only because they would strictly comply with rules and procedures, however fallacious those human-made rules might be.

  6. SDL offers both client-managed (on-premises or private cloud) and SDL-managed (public cloud, multi-tenant) deployments. Both are secure offerings, as SDL does not store or use the data being translated.

  7. Excellent post. Clients often don't ask enough questions about how their vendors are able to provide such quick turnaround times or low costs on their translation projects. They also fail to appreciate the risks associated with "too good to be true" translation solutions. Thanks for sharing this information.