This is a guest post from Pangeanic that focuses on very specific data privacy issues and highlights some of the concerns that any enterprise must address when using MT technology on a large scale across large volumes of customer data.

I recently wrote about the robust cloud data security that Microsoft MT offers in contrast to all the other major Public MT services. Data privacy and security continue to grow into a touchstone issue for enterprise MT vendors and legislation like GDPR makes it an increasingly critical issue for any internet service that gathers customer data.

Data anonymization is a type of information sanitization whose intent is privacy protection. It is the process of removing personally identifiable information from data sets so that the people whom the data describe remain anonymous.

Data anonymization has been defined as a "process by which personal data is irreversibly altered in such a way that a data subject can no longer be identified directly or indirectly, either by the data controller alone or in collaboration with any other party." ^[1] Data anonymization may enable the transfer of information across a boundary, such as between two departments within an agency or between two agencies while reducing the risk of unintended disclosure, and in certain environments in a manner that enables evaluation and analytics post-anonymization.

This is clumsy to describe, and even harder to do, but is likely to be a key requirement when dealing with customer data that spans the globe. Thus, I thought it was worth a closer look.

*** ===== ***

Anonymization Regulations, Privacy Acts and Confidentiality Agreements

How do they differ and what do they protect us from?

One of the possible definitions of privacy is the right that all people have to control information about themselves, and particularly who can access personal information, under what conditions and with what guarantees. In many cases, privacy is a concept that is intertwined with security. However, security is a much broader concept that encompasses different mechanisms.

Security provides us with tools to help protect privacy. One of the most widely used security techniques to protect information is data encryption. Encryption allows us to protect our information from unauthorized access. So, if by encrypting I am protecting my data and access to it, isn't that enough?

Encryption is not enough for Anonymization because…

in many cases, the information in the metadata is unprotected. For example, the content of an email can be encrypted, which gives us a [false] sense of protection. When we send the message, however, the destination address remains visible. If the email is addressed, for example, to a political party, that fact alone reveals sensitive information even though the content of the message is protected.

On the other hand, there are many scenarios in which we cannot encrypt the information, for example when we want to outsource the processing of a database or release it so that third parties can carry out analyses or statistical studies. In these scenarios the database often contains a large amount of personal or sensitive information, and even removing personal identifiers (e.g., name or passport number) may not be sufficient to protect the privacy of individuals.
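A minimal sketch of why removing direct identifiers may not be enough: the remaining attributes can still single a person out when combined. The records, field names, and the choice of "quasi-identifier" columns below are hypothetical and only illustrate the idea.

```python
from collections import Counter

# Hypothetical records: names and passport numbers have already been removed.
records = [
    {"zip": "46001", "birth_year": 1980, "gender": "F", "diagnosis": "asthma"},
    {"zip": "46001", "birth_year": 1980, "gender": "F", "diagnosis": "diabetes"},
    {"zip": "46002", "birth_year": 1975, "gender": "M", "diagnosis": "flu"},
    {"zip": "46003", "birth_year": 1990, "gender": "F", "diagnosis": "migraine"},
]

# Assumed columns that an outsider could plausibly know about a person.
QUASI_IDENTIFIERS = ("zip", "birth_year", "gender")


def unique_on_quasi_identifiers(rows, keys=QUASI_IDENTIFIERS):
    """Return the rows whose combination of quasi-identifiers appears only once."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return [row for row in rows if counts[tuple(row[k] for k in keys)] == 1]


at_risk = unique_on_quasi_identifiers(records)
print(len(at_risk), "of", len(records), "records are unique on", QUASI_IDENTIFIERS)
```

Anyone who knows the postcode, birth year, and gender of a person in one of the unique rows could link that row, and its sensitive attribute, back to the individual even though no name appears in the data.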

Anonymization: protecting our privacy

Anonymization (also known as “data masking”) is a set of techniques that allows the user to protect the privacy of documents or information by modifying the data. This can mean anonymization with gaps (deletion), anonymization with placeholders (substitution), or pseudonymization of the data.

In general, anonymization aims to alter the data in such a way that, even if it is subsequently processed by a third party, the identity or sensitive attributes of the persons whose data is being processed cannot be revealed.
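The three approaches mentioned above (deletion, substitution, and pseudonymization) can be sketched roughly as follows. This is an illustration under simplifying assumptions, not how any particular product works: it only looks for e-mail addresses with a basic regular expression, and the function names, placeholder text, and key are made up for the example.

```python
import hashlib
import hmac
import re

# Illustrative pattern only: matches simple e-mail addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def anonymize_with_gaps(text):
    """Deletion: remove every detected identifier, leaving a gap."""
    return EMAIL_RE.sub("", text)


def anonymize_with_placeholders(text, placeholder="[EMAIL]"):
    """Substitution: replace every detected identifier with a generic label."""
    return EMAIL_RE.sub(placeholder, text)


def pseudonymize(text, secret_key):
    """Pseudonymization: replace each identifier with a stable keyed token.

    The same address always maps to the same token, so records can still
    be linked to each other without exposing the address itself.
    """
    def _token(match):
        digest = hmac.new(secret_key, match.group(0).encode("utf-8"),
                          hashlib.sha256).hexdigest()
        return "EMAIL_" + digest[:8]

    return EMAIL_RE.sub(_token, text)


sample = "Write to ana@example.com and to luis@example.org."
demo_key = b"demo-key"  # hypothetical key; a real one must be kept secret

print(anonymize_with_gaps(sample))
print(anonymize_with_placeholders(sample))
print(pseudonymize(sample, demo_key))
```

Deletion and substitution discard the original value, whereas pseudonymization keeps a consistent stand-in for it. Whoever holds the key can recompute the token for a known address, so pseudonymized data generally still needs protection.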

Privacy management is regulated similarly across legal jurisdictions around the world. In Europe, the governing law is the GDPR (General Data Protection Regulation), which was approved in 2016 and came into effect in 2018. In the US, the California Consumer Privacy Act (CCPA) was signed into law in June 2018 and applies to businesses that

have annual gross revenues in excess of $25 million;
buy, receive, or sell the personal information of 50,000 or more consumers or households; or
earn more than half of their annual revenue from selling consumers' personal information.

Most other States are expected to follow the spirit of California’s CCPA soon. This will affect the way organizations collect, hold, release, buy, and sell personal data.

In Japan, the reformed privacy law, known as the Act on the Protection of Personal Information (APPI), came into full force on May 30, 2017. The main difference from the European GDPR lies in how personally identifiable information is defined: the GDPR uses a broad definition (“Personal data means any information relating to an identified or identifiable natural person”), whereas the APPI itemizes what counts as personal information.

In general, all privacy laws want to provide citizens with the right to:

Know what personal data is being collected about them.
Know whether their personal data is sold or disclosed and to whom.
Say no to the sale of personal data.
Access their personal data.
Request that a business delete any personal information it has collected from them.[9]
Not be discriminated against for exercising their privacy rights.

The new regulations seek to govern the processing of our personal data. Each of them establishes that data must be subject to adequate safeguards, including minimizing the personal data that is processed.

What is PangeaMT doing about Anonymization?

PangeaMT is Pangeanic’s R&D arm. We lead the MAPA Project – the first multilingual anonymization effort making deep use of bilingual encoders for transformers in order to identify actors, personal identifiers such as names and surnames, addresses, job titles and functions, and a deep taxonomy.

Together with our partners (Centre National de la Recherche Scientifique in Paris, Vicomtech, etc.) we are developing the first truly multilingual anonymization software. The project will release a fully customizable, open-source solution that Public Administrations can adopt to start their journey in de-identification and anonymization. Corporations will also be able to benefit from MAPA, as the commercial version will be released on 01.01.2021.