Showing posts with label data security. Show all posts

Tuesday, October 27, 2020

Anonymization Regulations and Data Privacy with MT

 This is a guest post from Pangeanic that focuses on very specific data privacy issues and highlights some of the concerns that any enterprise must address when using MT technology on a large scale across large volumes of customer data.

I recently wrote about the robust cloud data security that Microsoft MT offers in contrast to all the other major Public MT services. Data privacy and security continue to grow into a touchstone issue for enterprise MT vendors and legislation like GDPR makes it an increasingly critical issue for any internet service that gathers customer data.

Data anonymization is a type of information sanitization whose intent is privacy protection. It is the process of removing personally identifiable information from data sets so that the people whom the data describe remain anonymous.

Data anonymization has been defined as a "process by which personal data is irreversibly altered in such a way that a data subject can no longer be identified directly or indirectly, either by the data controller alone or in collaboration with any other party." [1] Data anonymization may enable the transfer of information across a boundary, such as between two departments within an agency or between two agencies while reducing the risk of unintended disclosure, and in certain environments in a manner that enables evaluation and analytics post-anonymization. 

This is clumsy to describe, and even harder to do, but is likely to be a key requirement when dealing with customer data that spans the globe. Thus, I thought it was worth a closer look.



*** ===== ***


Anonymization Regulations, Privacy Acts and Confidentiality Agreements 

How do they differ and what do they protect us from?



 

One of the possible definitions of privacy is the right that all people have to control information about themselves, and particularly who can access personal information, under what conditions and with what guarantees. In many cases, privacy is a concept that is intertwined with security. However, security is a much broader concept that encompasses different mechanisms. 

Security provides us with tools to help protect privacy. One of the most widely used security techniques to protect information is data encryption. Encryption allows us to protect our information from unauthorized access. So, if by encrypting I am protecting my data and access to it, isn't that enough?  

Encryption is not enough for anonymization because, in many cases, the information in the metadata is unprotected. For example, the content of an email can be encrypted, which gives us a false sense of protection: when we send the message, there is a destination address, and if the email is addressed, say, to a political party, that fact alone reveals sensitive information despite the content of the message being protected.
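A toy sketch of this metadata problem, assuming nothing beyond the Python standard library (the repeating-key XOR cipher stands in for real email encryption such as S/MIME or PGP): the body is unreadable, but the envelope fields remain in the clear.

```python
import base64
import os

def xor_encrypt(data: bytes, key: bytes) -> bytes:
    # Repeating-key XOR, for demonstration only; not real cryptography.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

key = os.urandom(16)
message = {
    # Envelope metadata must stay readable for delivery, so it leaks:
    "to": "contact@some-political-party.example",
    "from": "alice@example.com",
    # Only the body is protected:
    "body": base64.b64encode(xor_encrypt(b"my voting plans", key)).decode(),
}
```

Anyone observing the message sees who is writing to the political party, even though the body itself cannot be read without the key.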

On the other hand, there are many scenarios in which we cannot encrypt the information. For example, if we want to outsource the processing of a database or release it for third parties to carry out analyses or studies for statistical purposes. In these types of scenarios we often encounter the problem that the database contains a large amount of personal or sensitive information, and even if we remove personal identifiers (e.g., name or passport number), it may not be sufficient to protect the privacy of individuals. 
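The point above can be made concrete with a toy example: a database release stripped of names and ID numbers can still be linked back to individuals by joining its remaining quasi-identifiers (ZIP code, birth date, sex) against a public register such as a voter roll. All records below are invented.

```python
# "De-identified" release: direct identifiers removed, quasi-identifiers kept.
deidentified_release = [
    {"zip": "02138", "dob": "1945-07-31", "sex": "F", "diagnosis": "flu"},
    {"zip": "10001", "dob": "1980-01-02", "sex": "M", "diagnosis": "asthma"},
]

# Publicly available register with names attached.
public_register = [
    {"name": "J. Doe", "zip": "02138", "dob": "1945-07-31", "sex": "F"},
]

def reidentify(release, register, keys=("zip", "dob", "sex")):
    """Link records whose quasi-identifiers match exactly."""
    return [
        (person["name"], record["diagnosis"])
        for record in release
        for person in register
        if all(record[k] == person[k] for k in keys)
    ]
```

Here `reidentify(deidentified_release, public_register)` recovers J. Doe's diagnosis despite the name never appearing in the release, which is why removing personal identifiers alone may not be sufficient.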

Anonymization: protecting our privacy

Anonymization (also known as “data masking”) is a set of techniques that protect the privacy of documents or information by modifying the data: anonymization with gaps (deletion), anonymization with placeholders (substitution), or pseudonymization of the data.
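A minimal sketch of these three strategies, using only Python's standard library. The email regex and the salt are illustrative stand-ins: a production anonymizer would detect entities with NER models rather than a single pattern.

```python
import hashlib
import re

# Toy pattern for one identifier type (email addresses).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def mask_with_gaps(text: str) -> str:
    """Deletion: remove the identifier entirely."""
    return EMAIL.sub("", text)

def mask_with_placeholder(text: str, label: str = "[EMAIL]") -> str:
    """Substitution: replace the identifier with a generic placeholder."""
    return EMAIL.sub(label, text)

def pseudonymize(text: str, salt: str = "demo-salt") -> str:
    """Pseudonymization: map each identifier to a stable token, so the
    same person gets the same token and analytics remain possible."""
    def token(match: re.Match) -> str:
        digest = hashlib.sha256((salt + match.group()).encode()).hexdigest()
        return f"EMAIL_{digest[:8]}"
    return EMAIL.sub(token, text)
```

For example, `mask_with_placeholder("Write to jane@acme.com")` yields `"Write to [EMAIL]"`, while `pseudonymize` produces the same opaque token every time the same address appears.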

In general, anonymization aims to alter the data in such a way that, even if it is subsequently processed by a third party, the identity or sensitive attributes of the persons whose data is being processed cannot be revealed.

Privacy management is regulated along similar lines across legal jurisdictions around the world. In Europe, the governing law is the General Data Protection Regulation (GDPR), which was approved in 2016 and came into force in 2018. In the US, the California Consumer Privacy Act (CCPA) was approved in 2018 and applies to businesses that:

  • have annual gross revenues in excess of $25 million;
  • buy, receive, or sell the personal information of 50,000 or more consumers or households; or
  • earn more than half of their annual revenue from selling consumers' personal information.

Most other states are expected to follow the spirit of California’s CCPA before long. This will affect the way organizations collect, hold, release, buy, and sell personal data.

In Japan, the reformed privacy law, known as the Act on the Protection of Personal Information (APPI), came into full force on May 30, 2017. The main difference from the European GDPR lies in the clauses defining personally identifiable information: the GDPR states broadly that “personal data means any information relating to an identified or identifiable natural person,” whereas the APPI itemizes the specific categories.

In general, all privacy laws want to provide citizens with the right to:  

  1. Know what personal data is being collected about them.
  2. Know whether their personal data is sold or disclosed and to whom.
  3. Say no to the sale of personal data.
  4. Access their personal data.
  5. Request a business to delete any personal information about a consumer collected from that consumer.[9]
  6. Not be discriminated against for exercising their privacy rights.

The new regulations seek to govern the processing of our personal data. Each one establishes that data must be subject to adequate safeguards and that the collection and retention of personal data must be minimized.

 

What is PangeaMT doing about Anonymization?

PangeaMT is Pangeanic’s R&D arm. We lead the MAPA Project – the first multilingual anonymization effort to make deep use of multilingual transformer encoders to identify actors and personal identifiers such as names and surnames, addresses, and job titles and functions, organized in a deep taxonomy.



Together with our partners (Centre National de la Recherche Scientifique in Paris, Vicomtech, etc.) we are developing the first truly multilingual anonymization software. The project will release a fully customizable, open-source solution that Public Administrations can adopt to start their journey in de-identification and anonymization. Corporations will also be able to benefit from MAPA, as the commercial version will be released on 01.01.2021.












 

Monday, June 29, 2020

Understanding Data Security with Microsoft Translator

In this time of the pandemic, many experts have pointed out that enterprises with a broad and comprehensive digital presence are more likely to survive and thrive in these challenging times. The pandemic has showcased the value of digital operating models and is likely to force many companies to speed up their digital innovation and transformation initiatives. The digital transformation challenge for a global enterprise is even greater, as the need to share content and expand the enterprise's digital presence is massively multilingual, putting it beyond the reach of most localization departments, whose focus is much narrower and more limited.

Thus, today we are seeing that truly global enterprises and agencies have a growing need to make large volumes of flowing content multilingual, to enable communication, problem resolution, collaboration, and knowledge sharing, both within and beyond the organization. Most often this needs to happen as close to real-time as possible. The greater the enterprise commitment to digital transformation, the greater the need and urgency. Sophisticated, state-of-the-art machine translation enables multilingual communication and content sharing at scale across many languages in real-time, and is thus becoming an increasingly important core component of enterprise information technology infrastructure. Enterprise-tailored MT is now a must-have for the digitally agile global enterprise.

However, MT is extremely complex and is best handled by focused, well-funded, and committed experts who build unique competence over many years of experience. Many in the business translation world dabble with open-source tools and build mostly sub-optimal systems that do not reach the capabilities of generic public systems, creating friction and resistance from translators who are well aware of this shortcoming. MT system development remains a challenge for even the biggest and brightest, and is thus, in my opinion, best left to committed experts.

Given the confidential, privileged and mission-critical nature of the content that is increasingly passing through MT systems today, the issue of data security and privacy is becoming a major factor in the selection of MT systems by enterprises concerned with being digitally agile, but who also wish to ensure that their confidential data is not used by MT technology providers to refine, train, and further improve their MT technology. 

While some believe that the only way to accomplish true security is by building your own on-premise MT systems, this task, as I have often said, is best left to large companies with well-funded and long-term committed experts. Do-it-yourself (DIY) open-source options make little sense if you don't really know, understand, and follow what you are doing with technology this complex.

It is my feeling that MT is a technology that truly belongs in the cloud for enterprise use, and it usually also makes more sense on mobile devices for consumer use. While in some rare cases on-premise MT systems do make sense for truly massive-scale users like national security agencies (CIA, NSA), which can appropriate the resources to do it competently, for most commercial enterprises MT provides the greatest ROI when it is delivered and implemented in the cloud by an expert and focused team that does not have to re-invent the wheel. Customization on a robust and reliable expert MT foundation appears to be the optimal approach. MT is also a technology that is constantly evolving as new tools, algorithms, data, and processes come to light to enable ongoing incremental improvements, and this too suggests that MT is better suited to cloud deployment. Neural MT requires relatively large computing resources, deep expertise, and significant data resources and management capabilities to be viable. All these factors point to MT being best deployed in the cloud, as it essentially remains a work-in-progress, but I am aware that many still disagree, and the cloud versus on-premise question is one where it is best to agree to disagree.



I recently sat down with Chris Wendt, Group Program Manager and others in his team responsible for Microsoft Translator services, including Bing Translator and Skype Translator. They also connect Microsoft’s research activities with its practical use in services and applications. My intent in our conversation was to specifically investigate, better understand, and clarify the MT data security issues and the many adaptation capabilities that they offer to enterprise customers, as I am aware that the actual facts are often misrepresented, misunderstood, or unclear to many potential users and customers.

Microsoft is a pioneer in the use of MT to serve the technical support information needs of a global customer base, and was the first to make massive support knowledge bases available in machine-translated local languages for its largest international markets. It was also a very early user of Statistical MT (SMT) at scale (tens of millions of words translated for millions of users) and was building actively used systems around the same time that Language Weaver was commercializing SMT. Many of us are aware that the Microsoft Translator services are used both by large enterprises and by many LSP agencies in the language services industry because of their relative ease of use, straightforward adaptation capabilities, and relatively low cost. Among the public MT portals, Microsoft is second only to Google in MT traffic, and its consumer solutions on the web and on mobile are probably used by millions of people across the world every day.


It is important to differentiate between Microsoft’s consumer products and their commercial products when considering the data security policies that are in place when using their machine translation capabilities, as they are quite different.


Consumer Products: 

The consumer products are Bing, the Edge browser, and the Microsoft Translator phone app. These products run under the consumer terms of use, which allow Microsoft to use the processed data for quality-improvement purposes. Microsoft keeps only a very small portion of the data, as non-consecutive sentences and without any information about the customer who submitted the translation. Little can be learned from the translation request itself; the value only comes when the data is annotated and then used as test or training data. Annotation is expensive, so at most a few thousand sentences are used per language each year.

Some people read the consumer terms of use and assume the same applies to commercial enterprise products.

That is not the case.



Enterprise Products:

The Translator API is provided via an Azure subscription, which runs under the Azure terms of use. The Azure terms of use do not allow Microsoft to see any of the data being processed. Azure services generally run as a GDPR processor, and Translator ensures compliance by not ever writing translated content to persistent storage.

The typical process flow for a submitted translation is as follows:

Decrypt > translate > encrypt > send back > forget.

The Translator API allows only encrypted access, to ensure data is safe in transit. When using the global endpoint, the request is processed in the nearest available data center. The customer can also control the specific processing location by choosing one of the ten geography-specific endpoints.
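As an illustration, a request to the Translator Text API v3.0 might be assembled as below. The key and region values are placeholders, and the regional hostname shown is an assumption; consult the service documentation for the actual endpoint list for your geography.

```python
import json
import uuid

def build_translate_request(
    text: str,
    to_lang: str,
    # Assumed geography-specific endpoint; the global endpoint is
    # api.cognitive.microsofttranslator.com.
    endpoint: str = "https://api-eur.cognitive.microsofttranslator.com",
    key: str = "YOUR_SUBSCRIPTION_KEY",  # placeholder
    region: str = "westeurope",          # placeholder
):
    """Assemble the URL, headers, and JSON body for a v3.0 translate call.

    Choosing a geography-specific endpoint rather than the global one
    pins processing to that geography, as described above."""
    url = f"{endpoint}/translate?api-version=3.0&to={to_lang}"
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Ocp-Apim-Subscription-Region": region,
        "Content-Type": "application/json",
        "X-ClientTraceId": str(uuid.uuid4()),
    }
    body = json.dumps([{"Text": text}])
    return url, headers, body
```

The assembled request would then be sent over HTTPS (for example with `requests.post(url, headers=headers, data=body)`), so the content is encrypted in transit end to end.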

Microsoft Translator is certified for compliance as a GDPR processor and with confidentiality rules. It is also compliant with all of the following:

CSA STAR: The Cloud Security Alliance (CSA) defines best practices to help ensure a more secure cloud computing environment and to help potential cloud customers make informed decisions when transitioning their IT operations to the cloud. The CSA published a suite of tools to assess cloud IT operations: the CSA Governance, Risk Management, and Compliance (GRC) Stack. It was designed to help cloud customers assess how cloud service providers follow industry best practices and standards and comply with regulations. Translator has received CSA STAR Attestation.

FedRAMP: The US Federal Risk and Authorization Management Program (FedRAMP) attests that Microsoft Translator adheres to the security requirements needed for use by US government agencies. The US Office of Management and Budget requires all executive federal agencies to use FedRAMP to validate the security of cloud services. Translator is rated as FedRAMP High in both the Azure public cloud and the dedicated Azure Government cloud. 

GDPR: The General Data Protection Regulation (GDPR) is a European Union regulation regarding data protection and privacy for individuals within the European Union and the European Economic Area. Translator is GDPR compliant as a data processor.

HIPAA: The Translator service complies with the US Health Insurance Portability and Accountability Act (HIPAA) and the Health Information Technology for Economic and Clinical Health (HITECH) Act, which govern how cloud services can handle personal health information. This ensures that health services can provide translations to clients knowing that personal data is kept private. Translator is included in Microsoft’s HIPAA Business Associate Agreement (BAA). Health care organizations can enter into the BAA with Microsoft to detail each party’s role in regard to security and privacy provisions under HIPAA and HITECH.

HITRUST: The Health Information Trust Alliance (HITRUST) created and maintains the Common Security Framework (CSF), a certifiable framework to help healthcare organizations and their providers demonstrate their security and compliance in a consistent and streamlined manner. Translator is HITRUST CSF certified.

PCI: The Payment Card Industry Data Security Standard (PCI DSS) is the global certification standard for organizations that store, process, or transmit credit card data. Translator is certified as compliant under PCI DSS version 3.2 at Service Provider Level 1.

SOC: The American Institute of Certified Public Accountants (AICPA) developed the Service Organization Controls (SOC) framework, a standard for controls that safeguard the confidentiality and privacy of information stored and processed in the cloud, primarily in regard to financial statements. Translator is SOC type 1, 2, and 3 compliant. 

US Department of Defense (DoD) Provisional Authorization: US DoD Provisional Authorization enables US federal government customers to deploy highly sensitive data on in-scope Microsoft government cloud services. Translator is rated at Impact Level 4 (IL4) in the government cloud. Impact Level 4 covers Controlled Unclassified Information and other mission-critical data. It may include data designated as For Official Use Only, Law Enforcement Sensitive, or Sensitive Security Information.

ISO: Translator is ISO certified with five certifications applicable to the service. The International Organization for Standardization (ISO) is an independent nongovernmental organization and the world’s largest developer of voluntary international standards. Translator’s ISO certifications demonstrate its commitment to providing a consistent and secure service. Translator’s ISO certifications are:

    • ISO 27001: Information Security Management Standards
    • ISO 9001:2015: Quality Management Systems Standards
    • ISO 27018:2014: Code of Practice for Protecting Personal Data in the Cloud
    • ISO 20000-1:2011: Information Technology Service Management
    • ISO 27017:2015: Code of Practice for Information Security Controls
 

The Translator service is subject to annual audits on all of its certifications to ensure the service continues to be compliant.

These standards require Microsoft to have every change to the live site reviewed by two employees, to enforce minimal access to the runtime environment, and to have processes in place to protect against external attacks on data center hardware and software. The standards that Microsoft Translator is certified for, or compliant with, include specific ones for the financial industry and for health care providers.

Unlike content submitted for translation, the documents a customer uses to train a custom system are stored on a Microsoft server. Microsoft does not see the data and cannot use it for any purpose other than building the custom system. The customer can delete the custom system as well as the training data at any time, and no residue of the training data remains on any Microsoft system after deletion or after account expiration.


Translation in Microsoft’s other commercial products, such as Office, Dynamics, Teams, Yammer, and SharePoint, follows the same data security rules described above.

Chris also mentioned that, "German customers have been very hesitant to recognize that trustworthy translation in the cloud is possible, for most of the time I have been working on Translator, and I am glad to see that even the Germans are now warming up to the concept of using translation in the cloud." He pointed me to a VW case study where I found the following quote, as well as a rationale for the benefits of a cloud-centric translation service to a global enterprise that seeks to enable and enhance multilingual communication, collaboration, and knowledge sharing. A deciding factor for the team responsible at VW [in selecting Microsoft] was that none of the data – translation memories, documents to be translated, and trained models – was to leave the European Union (EU), for data protection reasons.

“Ultimately, we expect the Azure environment to provide the same data security as our internal translation portal has offered thus far,”

Tibor Farkas, Head of IT Cloud at Volkswagen


Chris closed with a compelling statement, pointing to the biggest data security problem in business translation: incompetent implementation. Cloud services, properly implemented, can be as secure as any connected on-premise solution; in my opinion, the greatest risk is often introduced by untrustworthy or careless translators who interact with MT systems, or by incompetent IT staff maintaining an MT portal, as the Translate.com fiasco showed.

"Your readers may want to consider whether their own computing facilities are equally well secured against data grabbing and whether their language service provider is equally well audited and secured. It matters which cloud service you are using, and how the cloud service protects your data."


While I have not focused much on the speech-to-text issue in this post, we should understand that Microsoft also offers SOTA (state-of-the-art) speech-to-text capabilities and that the Skype and Phone app experience also gives them a leg up on speech-related applications that go across languages.

I also gathered some interesting information on the Microsoft Translator customization and adaptation capabilities and experience. I will write a separate post on that subject once I gather a little more information on the matter.

Friday, December 27, 2019

The Issue of Data Security and Machine Translation


As the world becomes more digital and the volume of mission-critical data flows continues to expand, it is becoming increasingly important for global enterprises to adapt to rapid globalization and the increasingly digital-first world we live in. As organizations change the way they operate, generate revenue, and create value for their customers, new compliance risks are emerging — presenting a challenge to compliance functions, which must proactively monitor, identify, assess, and mitigate risks like those tied to fundamentally new technologies and processes. Digital transformation is driven and enabled by data, and thus the value of data security and governance also rises in importance and organizational impact. At the WEF forum in Davos, CEOs identified cybersecurity and data privacy as two of the most pressing issues of the day, and even regard breakdowns in these areas as a general threat to enterprise, society, and government.
While C-level executives understand the need for cybersecurity as their organizations undergo digital transformation, they aren’t prioritizing it enough, according to a recent Deloitte report based on a survey of 500 executives. The report, “The Future of Cyber Survey 2019,” reveals that there is a disconnect between organizational aspirations for a “digital everywhere” future, and their actual cyber posture. Those surveyed view digital transformation as one of the most challenging aspects of cyber risk management, and yet indicated that less than 10% of cyber budgets are allocated to these digital transformation efforts. The report goes on to say that this larger cyber awareness is at the center of digital transformation. Understanding that is as transformative as cyber itself—and to be successful in this new era, organizations should embrace a “cyber everywhere” reality.


Cybersecurity breakdowns and data breach statistics


Are these growing concerns about cybersecurity justified? It certainly seems so when we consider these facts:
  • A global survey in 2018 by CyberEdge across 17 countries and 20 industries found that 78% of respondents had experienced a network breach.
  • The ISACA survey of cybersecurity professionals points out that it is increasingly difficult to recruit and retain technically adept cybersecurity professionals. It also found that 50% of these professionals believe most organizations underreport cybercrime even when required to report it, and 60% expected at least one attack within the next year.
  • Radware estimates that an average cyber-attack in 2018 cost an enterprise around $1.67 million. The costs can be significantly higher; a breach at Maersk, for example, is estimated to have cost around $250 to $300 million because of the brand damage, loss of productivity, loss of profitability, falling stock prices, and other negative business impacts in its wake.
  • Risk-Based Security reports that there were over 6500 data breaches and that more than 5 billion records were exposed in 2018. The situation is not better in 2019, and over 4 billion records were exposed in the first six months of 2019.
  • An IBM Security study revealed the financial impact of data breaches on organizations. According to this study, the cost of a data breach has risen 12% over the past five years and now averages $3.92 million. The average cost of a data breach in the U.S. is $8.19 million, more than double the worldwide average.
As would be expected, with hacking as the top breach type, attacks originating outside the organization were also the most common threat source. However, misconfigured services, data-handling mistakes, and other inadvertent exposure by authorized persons exposed far more records than malicious actors were able to steal.




 Data security and cybersecurity in the legal profession


Third-party professional services firms are a frequent target for malicious attacks because the likelihood of acquiring high-value information is high. Records show that law firms' relationships with third-party vendors are a frequent point of exposure to cyber breaches and accidental leaks. Law.com obtained a list of more than 100 law firms that had reported data breaches, and estimates that even more are falling victim but simply don’t report it, to avoid scaring clients and to minimize potential reputational damage.

Austin Berglas, former head of the FBI’s cyber branch in New York and now global head of professional services at cybersecurity company BlueVoyant, said law firms are a top target among hackers because of the extensive high-value client information they possess. Hackers understand that law firms are a “one-stop-shop” for sensitive and proprietary corporate information, merger & acquisitions related data, and emerging intellectual property information.

As custodians of highly sensitive information, law firms are inviting targets for hackers.

The American Bar Association reported in 2018 that 23% of firms had experienced a breach at some point, up from 14% in 2016. Six percent of those breaches resulted in the exposure of sensitive client data. Legal documents pass through many hands as a matter of course: reams of sensitive information move through lawyers and paralegals, and documents are then reviewed and signed by clients, clerks, opposing counsel, and judges. When they finally reach the location where records are stored, they are often inadvertently exposed to others—even firm outsiders—who shouldn’t have access to them at all.



Logicforce's legal industry score for cybersecurity health among law firms increased from 54% in 2018 to 60% in 2019, but this is still lower than in many other sectors. Clients are also increasingly asking for audits to ensure that security practices are current and robust. A recent ABA Formal Opinion states: “Indeed, the data security threat is so high that law enforcement officials regularly divide business entities into two categories: those that have been hacked and those that will be.”

Lawyers are failing on cybersecurity, according to the American Bar Association Legal Technology Resource Center’s ABA TechReport 2019. “The lack of effort on security has become a major cause for concern in the profession.”

“A lot of firms have been hacked, and like most entities that are hacked, they don’t know that for some period of time. Sometimes, it may not be discovered for months and even years.” – Vincent I. Polley, a lawyer and co-author of a recent book on cybersecurity for the ABA.

As the volume of multilingual content explodes, a new risk emerges: public, “free” machine translation provided by large internet services firms who systematically harvest and store the data that passes through these “free” services.  With the significantly higher volumes of cross-border partnerships, globalization in general, and growth in international business, employee use of public MT has become a new source of confidential data leakage.

Public machine translation use and data security


In the modern era, it is estimated that on any given day, several trillion words are run through the many public machine translation options available across the internet today. This huge volume of translation is done largely by the average web consumer, but there is increasing evidence that a growing portion of this usage is emanating from the enterprise when urgent global customer, collaboration, and communication needs are involved. This happens because publicly available tools are essentially frictionless and require little “buy-in” from a user who doesn’t understand the data leakage implications.  The rapid rate of increase in globalization has resulted in a substantial and ever-growing volume of multilingual information that needs to be translated instantly as a matter of ongoing business practice. This is a significant risk for the global enterprise or law firm as this short video points out. Content transmitted for translation by users is clearly subject to terms of use agreements that entitle the MT provider to store, modify, reproduce, distribute, and create derivative works. At the very least this content is fodder for machine learning algorithms that could also potentially be hacked or expose data inadvertently.


Consider the following:
  • At the SDL Connect 2019 conference recently, a speaker from a major US semiconductor company described the use of public MT at his company. When this activity was carefully monitored by IT management, they found that as much as 3 to 5 GB of enterprise content was being cut and pasted into public MT portals for translation on a daily basis. Further analysis of the content revealed that the material submitted for translation included future product plans, customer problem-related communications, sensitive HR issues, and other confidential business process content.
  • In September 2017, the Norwegian news agency NRK reported finding data that had been translated for free on a site called Translate.com, including “notices of dismissal, plans of workforce reductions and outsourcing, passwords, code information, and contracts”. This was yet another site that offered free translation but reserved the right to examine the submitted data “to improve the service.” Subsequent searches by Slator uncovered other highly sensitive personal and corporate content.
  • A recent report from the Australian Strategic Policy Institute (ASPI) makes some claims about how China uses state-owned companies that provide machine translation services to collect data on users outside China. The author, Samantha Hoffman, argues that the most valuable tools in China’s data-collection campaign are technologies that users engage with for their own benefit, machine translation services being a prime example. This is done through a company called GTCOM which, Hoffman said, describes itself as a “cross-language big data” business and offers hardware and software translation tools that collect data — lots of data. She estimated that GTCOM, which works with both corporate and government clients, handles the equivalent of up to five trillion words of plain text per day, across 65 languages and in over 200 countries. GTCOM is a subsidiary of a Chinese state-owned enterprise directly supervised by the Central Propaganda Department, and data collection is thus presumed to be an active, ongoing process.
After taking a close look at the enterprise market needs and the current realities of machine translation use we can summarize the situation as follows:
  • There is a growing need for always-available, secure enterprise MT solutions to support the digitally-driven globalization that we see happening in so many industries today. In the absence of such a secure solution, we can expect substantial amounts of “rogue use” of public MT portals, with the attendant risk of confidential data leakage.
  • The risks of using public MT portals are now beginning to be understood. The risk is not just related to inadvertent data leakage but is also closely tied to the various data security and privacy risks of submitting confidential content into the data-grabbing, machine learning infrastructure that underlies these “free” MT portals. There is a growing list of US companies already subjected to GDPR-related EU regulatory actions, including Amazon, Apple, Facebook, Google, Netflix, Spotify, and Twitter. Experts have stated that Chinese companies are likely to be the next wave of regulatory enforcement, and the violators' list is expected to grow. 
  • The executive focus on digital transformation is likely to drive more attention to the concurrent cybersecurity implications of hyper-digitalization. Information Governance is likely to become much more of a mission-critical function as the digital footprint of the modern enterprise grows and becomes much more strategic.


 The legal market requirement: an end-to-end solution


Thus, language-translation-at-scale capabilities have become imperative for the modern global enterprise. The needs can range from the rapid translation of millions of documents in an eDiscovery or compliance scenario, to the careful, specialized translation of critical contracts and court-ready documentation, to an associate collaborating with colleagues at a foreign outpost. Daily communications in global matters are increasingly multilingual. Given the volume, variety, and velocity of the information that needs translation, legal professionals must consider translation solutions that involve both technology and human services. The requirements can vary greatly and may call for different combinations of human-machine collaboration, including some or all of these translation production models:
  • MT-Only for very high volumes like in eDiscovery, and daily communications
  • MT + Human Terminology Optimization
  • MT + Post-Editing
  • Specialized Expert Human Translation



Machine Translation: designed for the Enterprise


MT for the enterprise will need all of the following (solutions are available from several MT vendors in the market; the author provides consulting services to select and develop optimal solutions):
  • Guaranteed data security & privacy
  • Flexible deployment options that include on-premise, cloud or a combination of both as dictated by usage needs
  • Broad range of adaptation and customization capabilities so that MT systems can be optimized for each individual client
  • Integration with primary enterprise IT infrastructure and software e.g. Office, Translation Management Systems, Relativity, and other eDiscovery platforms
  • REST API that allows connectivity to any proprietary systems that you may employ
  • Broad range of expert consulting services both on the MT technology aspects and the linguistic issues
  • Tight integration with professional human translation services to handle end-to-end translation requirements
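To make the REST API integration point above concrete, here is a minimal sketch of what a connection to an enterprise MT service could look like. The endpoint URL, parameter names, and payload format here are hypothetical — every vendor defines its own API — but the overall shape, a JSON payload posted with an API key, is typical:

```python
import json
import urllib.request

# Hypothetical endpoint -- substitute your vendor's actual URL and schema.
MT_ENDPOINT = "https://mt.example.com/api/v1/translate"

def build_request(text, source_lang, target_lang, api_key):
    """Construct an HTTP request for a generic MT REST API (illustrative only)."""
    payload = json.dumps({
        "text": text,
        "source": source_lang,
        "target": target_lang,
    }).encode("utf-8")
    return urllib.request.Request(
        MT_ENDPOINT,
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

req = build_request("Confidential contract text", "en", "de", "YOUR_API_KEY")
# In production the request would be sent with urllib.request.urlopen(req)
# and the translated text parsed from the JSON response body.
```

An on-premise deployment would simply point the endpoint at an internal host, so that content never leaves the corporate network — which is precisely the data-security argument this section makes.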


This is a post that was originally published on SDL.COM in a modified form with more detail on SDL MT technology. 

Tuesday, April 10, 2018

The Data Security Issues Around Public MT - A Translator Perspective

This is a guest post by Mats Linder on the data privacy and security issues around the use of public MT services in professional translator use scenarios.

As I put this post together, I can hear Mark Zuckerberg giving his testimony on Capitol Hill, answering shockingly ignorant questions from legislators who don't really have a clue. This is not so different from the naive and somewhat ignorant comments I see in translation industry blogs on the data privacy issues around MT. The looming deadlines of the GDPR legislation have raised the volume of discussion on the privacy issue, but unfortunately not its clarity. GDPR will now result in some companies being fined, and since it is now possible to calculate what it costs not to get data protection right, many companies are being much more careful, at least in Europe. But as the Guardian said: "If it’s rigorously enforced (which could be a big “if” unless data protection authorities are properly resourced) it could blow a massive hole in the covert ad-tracking racket – and oblige us to find less abusive and dysfunctional business models to support our online addiction."

As the Guardian wrote recently:

"This is what security guru Bruce Schneier meant when he observed that “surveillance is the business model of the internet”.  The fundamental truth highlighted by Schneier’s aphorism is that the vast majority of internet users have entered into a Faustian bargain in which they exchange control of their personal data in return for “free” services (such as social networking, [MT], and search) and/or easy access to the websites of online publications, YouTube and the like.

Big though Facebook is, however, it’s only the tip of the web iceberg. And it’s there that change will have to come if the data vampires are to be vanquished. "

In our current online world, only the paranoid thrive.


 Richard Stallman, President of the Free Software Foundation, had this to say:

"To restore privacy, we must stop surveillance before it even asks for consent.

Finally, don’t forget the software on your own computer. If it is the non-free software of Apple, Google or Microsoft, it spies on you regularly. That’s because it is controlled by a company that won’t hesitate to spy on you. Companies tend to lose their scruples when that is profitable. By contrast, free (libre) software is controlled by its users. That user community keeps the software honest."


 Apparently, there is a special term for this kind of data acquisition and monitoring effort: Shoshana Zuboff calls it surveillance capitalism.


Also here is Valeria Maltoni on this issue:

"Breaches expose information the other way. They shine a light on the depth and breadth of data gathering practices — and on the business models that rely on them. Awareness changes the perception of knowledge and its use. Anyone not living under a rock now is aware that we likely don't know all the technical implications, but we know enough to start making different decisions on how we browse and communicate online.

Business models are the most problematic, because they create dependency on data and an incentive to collect as much as possible. Beyond advertising, lack of transparency on third party sharing and usage merit further scrutiny. Perhaps the time has come to evolve business practices — how platforms and people interact — and standards — based on laws and regulations."

 So when I read how Google says, in a FAQ no less, that they really, with all their little heart, promise not to use your data, or when Microsoft tells me they have a "No-Trace Policy" for your MT data, I am more than a little skeptical. Especially when, just last week, I got an email from Microsoft about an update to the terms of the Service Agreement containing some big updates in clause 2 related to what they can do with "Your Content".

While some may feel that it is possible to trust these companies I remain unconvinced and suggest that you consider the following:
  • What are the Terms of Service governing your use of the MT service? (Not the FAQ or some random policy page.) The only legally enforceable contract an MT user has is what is stated in the TOS, and I would not be surprised if there are several loopholes in there as well.
  • Once a large web services company sets a data-harvesting, ad-supported infrastructure in motion, it is not easily turned off. And while there may be more privacy in the EU, I have already seen that Google makes it very clear that it is using my data every time I use the Google Translate service. So my advice is caveat emptor, if it really matters that your data privacy remains intact. But if you send your translation content back and forth via email, it does not make any difference anyway. Does it?



Mats has made a valiant attempt to wade through the vague and ambiguous legalese that surrounds the use of these mostly ad-supported MT services in his post below.




=======


How (un)safe is machine translation?
Some time ago there were a couple of posts on this site discussing data security risks with machine translation (MT), notably by Kirti Vashee and by Christine Bruckner. Since they covered a lot of ground and might have created some confusion as to what security options are offered, I believe it may be useful to take a closer look with a narrower perspective, mainly from the professional translator’s point of view. And although the starting point is the plugin applications for SDL Trados Studio, I know that most of these plugins are also available for other CAT tools.

About half a year ago, there was an uproar about Statoil’s discovery that some confidential material had become publicly available due to the fact that it had been translated with the help of a site called translate.com (not to be confused with translated.net, the site of the popular MT provider MyMemory). The story was reported in several places; this report gives good coverage.

Does this mean that all, or at least some, machine translation runs the risk of compromising the material being translated? Not necessarily – what happened to Statoil was the result of trying to get something for nothing; i.e. a free translation. The same thing happens when you use the free services of Google Translate and Microsoft’s Bing. Frequently quoted terms of use for those services state, for instance, that “you give Google a worldwide license to use, host, store, reproduce - - - such content”, and (for Bing): “When you share Your Content with other people, you understand that they may be able to, on a worldwide basis, use, save, record, reproduce - - - Your Content without compensating you”. This should indeed be off-putting to professional translators but should not be cited to scare them from using services for which those terms are not applicable.

The principle is this: If you use a free service, you can be almost certain that your text will be used to “improve the translation services provided”; i.e. parts of it may be shown to other users of the same service if they happen to feed the service with similar source segments. However, the terms of use of Google’s and Microsoft’s paid services – Google Cloud Translate API and Microsoft Text Translator API – are totally different from the free services. Not only can you select not to send back your finalized translations (i.e. update the provider’s data with your own translations); it is in fact not possible – at least not if you use Trados Studio – to do so.

Google and Microsoft are the big providers of MT services, but there are a number of others as well (MyMemory, DeepL, Lilt, Kantan, Systran, SDL Language Cloud…). In essence, the same principle applies to most of them. So let us have a closer look at how the paid services differ from the free.


Google’s and Microsoft’s paid services


Google states, as a reply to the question Will Google share the text I translate with others: “We will not make the content of the text that you translate available to the public, or share it with anyone else, except as necessary to provide the Translation API service. For example, sometimes we may need to use a third-party vendor to help us provide some aspect of our services, such as storage or transmission of data. We won’t share the text that you translate with any other parties, or make it public, for any other purpose.”

And here is the reply to the question after that, Will the text I send for translation, the translation itself, or other information about translation requests be stored on Google servers? If so, how long and where is the information kept?: “When you send Google text for translation, we must store that text for a short period of time in order to perform the translation and return the results to you. The stored text is typically deleted in a few hours, although occasionally we will retain it for longer while we perform debugging and other testing. Google also temporarily logs some metadata about translation requests (such as the time the request was received and the size of the request) to improve our service and combat abuse. For security and reliability, we distribute data storage across many machines in different locations.”

For Microsoft Text Translator API the information is more straightforward, on their “API and Hub: Confidentiality” page: “Microsoft does not share the data you submit for translation with anybody.” And on the "No-Trace" page: “Customer data submitted for translation through the Microsoft Translator Text API and the text translation features in Microsoft Office products are not written to persistent storage. There will be no record of the submitted text, or portion thereof, in any Microsoft data center. The text will not be used for training purposes either. – Note: Known previously as the “no trace option”, all traffic using the Microsoft Translator Text API (free or paid tiers) through any Azure subscription is now “no trace” by design. The previous requirement to have a minimum of 250 million characters per month to enable No-Trace is no longer applicable. In addition, the ability for Microsoft technical support to investigate any Translator Text API issues under your subscription is eliminated.”


Other major players


As for DeepL, there is the same difference between free and paid services. For the former, it is stated – on their "Privacy Policy DeepL" page, under Texts and translations – DeepL Translator (free) – that “If you use our translation service, you transfer all texts you would like to transfer to our servers. This is required for us to perform the translation and to provide you with our service. We store your texts and the translation for a limited period of time in order to train and improve our translation algorithm. If you make corrections to our suggested translations, these corrections will also be transferred to our server in order to check the correction for accuracy and, if necessary, to update the translated text in accordance with your changes. We also store your corrections for a limited period of time in order to train and improve our translation algorithm.”

To the paid service, the following applies (stated on the same page but under Texts and translations – DeepL Pro): “When using DeepL Pro, the texts you submit and their translations are never stored, and are used only insofar as it is necessary to create the translation. When using DeepL Pro, we don't use your texts to improve the quality of our services.” And interestingly enough, DeepL seems to consider their services to fulfill the requirements stipulated – currently as well as in the coming legislation – by the EU Commission (see below).

Lilt is a bit different in that it is free of charge, yet applies strict Data Security principles: “Your work is under your control. Translation suggestions are generated by Lilt using a combination of our parallel text and your personal translation resources. When you upload a translation memory or translate a document, those translations are only associated with your account. Translation memories can be shared across your projects, but they are not shared with other users or third parties.”

MyMemory – a very popular service which is in fact also free of charge, even though it uses the paid services of Google, Microsoft, and DeepL (but you cannot select the order in which those are used, nor can you opt out of using them altogether) – also uses its own translation archives and offers the use of the translator’s private TMs. Your own TM material cannot be accessed by any other user, and as for MyMemory’s own archive, this is what they say, under Service Terms and Conditions of Use:

“We will not share, sell or transfer ’Personal Data’ to third parties without users' express consent. We will not use ’Private Contributions’ to provide translation memory matches to other MyMemory's users and we will not publish these contributions on MyMemory’s public archives. The contributions to the archive, whether they are ’Public Data’ or ’Private Data’, are collected, processed and used by Translated to create statistics, set up new services and improve existing ones.” One question here is of course what is implied by “improve” existing services. But MyMemory tells me that it means training their machine translation models, and that source segments are never used for this.

And this is what the SDL Language Cloud privacy policy says: “SDL will take reasonable efforts to safeguard your information from unauthorized access. – Source material will not be disclosed to third parties. Your term dictionaries are for your personal use only and are not shared with other users using SDL Language Cloud. – SDL may provide access to your information if SDL plc believes in good faith that disclosure is reasonably necessary to (1) comply with any applicable law, regulation or legal process, (2) detect or prevent fraud, and (3) address security or technical issues.”

Is this the whole truth?


Most of these terms of services are unambiguous, even Microsoft’s. But Google’s leaves room for interpretation – sometimes they “may need to use a third-party vendor to help us provide some aspect of [their] services”, and occasionally they “will retain [the text] for longer while [they] perform debugging and other testing”. The statement from MyMemory about improving existing services also raises questions, but I am told that this means training their machine translation models, and that source segments are never used for this. However, since MyMemory also utilizes Google Cloud Translate API (and you don’t know when), you need to take the same care with both MyMemory and Google.

There is also the problem with companies such as Google and Microsoft that you cannot get them to reply to questions if you want clarifications. And it is very difficult to verify the security provided, so that the “trust but verify” principle is all but impossible to implement (and not only with Google and Microsoft).

Note, however, that there are plugins for at least the major CAT tools that offer possibilities to anonymize (mask) data in the source text that you send to the Google and Microsoft paid services, which provides further security. This is also to some extent built into the MyMemory service.
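To illustrate what such anonymization (masking) amounts to — this is a simple sketch, not any actual plugin's implementation, and the patterns and token format are my own invention — identifiable strings are swapped for placeholder tokens before the segment leaves your machine, then swapped back after the translation returns:

```python
import re

# Patterns for two common PII types; real plugins use far richer rule sets.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s()-]{7,}\d"),
}

def mask(text):
    """Replace PII with numbered placeholders; return masked text and a lookup table."""
    replacements = {}
    counter = 0
    def substitute(kind):
        def repl(match):
            nonlocal counter
            token = f"__{kind}_{counter}__"
            replacements[token] = match.group(0)
            counter += 1
            return token
        return repl
    for kind, pattern in PII_PATTERNS.items():
        text = pattern.sub(substitute(kind), text)
    return text, replacements

def unmask(text, replacements):
    """Restore the original PII after the masked text comes back translated."""
    for token, original in replacements.items():
        text = text.replace(token, original)
    return text

masked, table = mask("Contact jane.doe@example.com or +1 555 123 4567.")
# The MT service only ever sees: "Contact __EMAIL_0__ or __PHONE_1__."
print(masked)
print(unmask(masked, table))
```

A real implementation also has to handle overlapping matches, make sure the MT engine passes the tokens through untranslated, and cover many more PII categories, which is why the commercial plugins are considerably more elaborate.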
But even if you never send back your translated target segments, what about the source data that you feed into the paid services? Are they deleted, or are they stored so that another user might hit upon them even if they are not connected to translated (target) text?

Yes and no. They are generally stored, but – also generally – in server logs, inaccessible to users and only kept for analysis purposes, mainly statistical. Cf. the statement from MyMemory.

My conclusion, therefore, is that as long as you do not return your own translations to the MT provider, and you use a paid service (or Lilt), and you anonymize any sensitive data, you should be safe. Of course, your client may forbid you to use such services anyway. If so, you can still use MT but offline; see below.


What about the European Union?


Then there is the particular case of translating for the European Union, and furthermore, the provisions in the General Data Protection Regulation (GDPR), to enter into force on 25 May 2018. As for EU translations, the European Commission uses the following clause in their Tender specifications:

”Contractors intending to use web-based tools or any other web-based service (e.g. cloud computing) to execute the /framework contract/ must ensure full compliance with the terms of this call for tenders when using such services. In particular, the provisions on confidentiality must be respected throughout any web-based process and the Union's intellectual and industrial property rights must be safeguarded at all times.” The Commission considers the scope of this clause to be very broad, covering also the use of web-based translation tools.

A consequence of this is that translators are instructed not to use “open translation services” (a term that begs for a definition, does it not?) because of the risk of losing control over the contents. Instead, the Commission has its own MT system, e-Translation. On the other hand, it seems possible that DG Translation is not quite up to date concerning the current terms of service – quoted above – of Google Cloud Translate API and Microsoft Text Translation API, and if so, there is a slight possibility that they might change their policy with regard to those services. But for now, the rule is that before a contractor uses web-based tools for an EU translation assignment, authorisation must be obtained (and so far, no such requests have been made).

As for the GDPR, it concerns mainly the protection of personal data, which may be a lesser problem generally for translators. In the words of Kamocki & Stauch on p. 72 of Machine Translation, “The user should generally avoid online MT services where he wishes to have information translated that concerns a third party (or is not sure whether it does or not)”.

Offline services and beyond


There are a number of MT programs intended for use offline (as plugins in CAT tools), which of course provides the best possible security (apart from the fact that transfer back and forth via email always constitutes a theoretical risk, which some clients try to eliminate by using specialized transfer sites). The drawback – apart from being limited to your own TMs – is that they tend to be pretty expensive to purchase.

The ones that I have found (based on investigations of plugins for SDL Trados Studio) are, primarily, Slate Desktop translation provider, Transistent API Connector, and Tayou Machine Translation Plugin. I should add that so far in this article I have only looked at MT providers based on statistical machine translation or its further development, neural machine translation. But it seems that one offline contender which for some language combinations (involving English) also offers pretty good “services” is the rule-based PROMT Master 18.

However, in conclusion I would say that if we take the privacy statements from the MT providers at face value – and I do believe we can, even when we cannot verify them – then for most purposes the paid translation services mentioned above should be safe to use, particularly if you take care not to pass back your own translations. But still, I think both translators and their clients would do well to study the risks described and the advice given by Don DePalma in this article. Its topic is free MT, but any translation service provider who wants to be honest in the relationship with clients, while taking advantage of even paid MT, would do well to study it.



Mats Dannewitz Linder has been a freelance translator, writer and editor for the last 40 years alongside other occupations, IT standardization among others. He has degrees in computer science and languages and is currently studying national economics and political science. He is the author of the acclaimed Trados Studio Manual and for the last few years has been studying machine translation from the translator’s point of view, an endeavour which has resulted in several articles for the Swedish Association of Translators as well as an overview of Trados Studio apps/plugins for machine translation. He is self-employed at Nattskift Konsult.