Comments about translation technology, new collaboration models, and inspiration
Tuesday, October 27, 2020
Anonymization Regulations and Data Privacy with MT
This is a guest post from Pangeanic that focuses on very specific data privacy issues and highlights some of the concerns that any enterprise must address when using MT technology on a large scale across large volumes of customer data.
I recently wrote about the robust cloud data security that Microsoft MT offers in contrast to all the other major Public MT services. Data privacy and security continue to grow into a touchstone issue for enterprise MT vendors and legislation like GDPR makes it an increasingly critical issue for any internet service that gathers customer data.
Data anonymization has been defined as a "process by which personal data is irreversibly altered in such a way that a data subject can no longer be identified directly or indirectly, either by the data controller alone or in collaboration with any other party."  Data anonymization may enable the transfer of information across a boundary, such as between two departments within an agency or between two agencies while reducing the risk of unintended disclosure, and in certain environments in a manner that enables evaluation and analytics post-anonymization.
This is clumsy to describe, and even harder to do, but is likely to be a key requirement when dealing with customer data that spans the globe. Thus, I thought it was worth a closer look.
*** ===== ***
Anonymization Regulations, Privacy Acts and Confidentiality Agreements
How do they differ and what do they protect us from?
One of the possible definitions of privacy is the right that all people have to control information about themselves, and particularly who can access personal information, under what conditions and with what guarantees. In many cases, privacy is a concept that is intertwined with security. However, security is a much broader concept that encompasses different mechanisms.
Security provides us with tools to help protect privacy. One of the most widely used security techniques to protect information is data encryption. Encryption allows us to protect our information from unauthorized access. So, if by encrypting I am protecting my data and access to it, isn't that enough?
Encryption is not enough because, in many cases, the information in the metadata is unprotected. For example, the content of an email can be encrypted, which gives us a false sense of protection. When we send the message, it still carries a destination address. If the email is addressed, for example, to a political party, that fact alone reveals sensitive information even though the content of the message is protected.
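The point about unprotected metadata can be sketched with Python's standard `email` package. This is only an illustration: the addresses are hypothetical, and the "encrypted" body is a stand-in made of opaque bytes rather than the output of a real encryption tool.

```python
import base64
from email.message import EmailMessage

# Stand-in for real ciphertext (assumption): opaque bytes, not actual encryption.
opaque_body = base64.b64encode(bytes(range(32))).decode("ascii")

msg = EmailMessage()
msg["From"] = "citizen@example.org"        # hypothetical sender
msg["To"] = "membership@party.example"     # hypothetical recipient
msg["Subject"] = "Hello"
msg.set_content(opaque_body)

# The body is unreadable, but the routing headers travel in the clear.
visible = {name: str(value) for name, value in msg.items() if name in ("From", "To")}
print(visible)
```

Even without reading the body, anyone who handles the message learns who wrote to whom.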
On the other hand, there are many scenarios in which we cannot encrypt the information, for example when we want to outsource the processing of a database or release it for third parties to carry out analyses or studies for statistical purposes. In these scenarios, the database often contains a large amount of personal or sensitive information, and even removing personal identifiers (e.g., name or passport number) may not be sufficient to protect the privacy of individuals.
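A small sketch can show why removing direct identifiers may not be enough. The records and field names below are invented for illustration. The check simply counts how many rows share the same combination of remaining attributes (the idea behind what the literature calls k-anonymity, a term not used in this post).

```python
from collections import Counter

# Invented records: name and passport number have already been removed.
records = [
    {"zip": "46001", "birth_year": 1980, "sex": "F", "diagnosis": "asthma"},
    {"zip": "46001", "birth_year": 1980, "sex": "F", "diagnosis": "diabetes"},
    {"zip": "46002", "birth_year": 1975, "sex": "M", "diagnosis": "flu"},
]

# Attributes that are not identifiers on their own but can be combined (assumption).
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")

def unique_rows(rows, keys=QUASI_IDENTIFIERS):
    """Return the rows whose combination of `keys` appears only once."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return [row for row in rows if counts[tuple(row[k] for k in keys)] == 1]

exposed = unique_rows(records)
print(len(exposed), "record(s) can be singled out without any name attached")
```

Anyone who knows a person's postcode, birth year and sex from another source could link the unique row back to that person.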
Anonymization: protecting our
Anonymization (also known as “data masking”) is a set of techniques that allows the user to protect the privacy of documents or information by modifying the data. This can mean anonymization with gaps (deletion), anonymization with placeholders (substitution), or pseudonymization of the data.
In general, anonymization aims to alter the data in such a way that, even if it is subsequently processed by a third party, the identity or sensitive attributes of the persons whose data is being processed cannot be revealed.
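The three approaches mentioned above (deletion, substitution and pseudonymization) can be sketched in a few lines. This is a minimal illustration, not how any particular product works: it assumes e-mail addresses are the only sensitive items, finds them with a simple regular expression where a real system would use a trained entity recognizer, and uses a hypothetical demo key.

```python
import hashlib
import hmac
import re

# Simplified pattern (assumption): real tools detect many entity types.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
MODES = ("deletion", "substitution", "pseudonymization")

def anonymize(text, mode, key=b"demo-key"):
    """Mask every e-mail address in `text` using one of three strategies."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")

    def mask(match):
        if mode == "deletion":
            return ""                      # leave a gap
        if mode == "substitution":
            return "[EMAIL]"               # generic placeholder
        # Pseudonymization: the same input always maps to the same token.
        digest = hmac.new(key, match.group().encode("utf-8"), hashlib.sha256)
        return "person_" + digest.hexdigest()[:8]

    return EMAIL.sub(mask, text)

sample = "Write to ana@example.org and copy luis@example.org today"
for mode in MODES:
    print(mode, "->", anonymize(sample, mode))
```

Deletion and substitution discard the link between mentions, while the pseudonymized tokens stay consistent, so records about the same person can still be analyzed together.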
Privacy management is regulated similarly across legal jurisdictions around the world. In Europe, the relevant law is the GDPR (General Data Protection Regulation), which was approved in 2016 and implemented in 2018. In the US, the California Consumer Privacy Act (CCPA) was approved in June 2018 and is applicable to businesses that:
have annual gross revenues in excess of $25 million;
buy, receive, or sell the personal information of 50,000 or more consumers; or
earn more than half of their annual revenue from selling consumers' personal information.
Most other states are expected to follow the spirit of California’s CCPA soon.
This will affect the way organizations collect, hold, release, buy, and sell personal data.
In Japan, the reformed privacy law came into full force on May 30, 2017, and it is known as the Japanese Act on Protection of Personal Information (APPI). The main difference from the European GDPR lies in the clauses that define personally identifiable information: the GDPR uses a broad definition (“Personal data means any information relating to an identified or identifiable natural person”), whereas the APPI itemizes what counts as personal information.
In general, all privacy laws want to provide citizens with the right to:
Know what personal data is being collected about them.
Know whether their personal data is sold or disclosed and to whom.
Say no to the sale of personal data.
Access their personal data.
Request that a business delete any personal information it has collected from them.
Not be discriminated against for exercising their privacy rights.
The new regulations seek to regulate the processing of our personal data. Each of them establishes that data must be subject to adequate guarantees and that the use of personal data must be minimized.
What is PangeaMT doing about
PangeaMT is Pangeanic’s R&D arm. We lead the MAPA Project, the first multilingual anonymization effort making deep use of bilingual encoders for transformers in order to identify actors, personal identifiers such as names and surnames, addresses, job titles and functions, and a deep taxonomy.
Together with our partners (Centre National de la Recherche Scientifique in Paris, Vicomtech, etc.) we are developing the first truly multilingual anonymization software. The project will release a fully customizable, open-source solution that can be adopted by Public Administrations to start their journey in de-identification and anonymization. Corporations will also be able to benefit from MAPA, as the commercial version will be released on 01.01.2021.