Monday, May 18, 2020

Data Preparation Best Practices for Neural MT

In any machine learning task, the quality and volume of the available training data are critical determinants of the system that is developed. Data matters for both Statistical MT and Neural MT: both are data-driven, and both produce output that is deeply influenced by the data used to train them. Some believe that Statistical MT systems have a higher tolerance for noisy data, and so assume that more data is better even if it is "noisy." In my experience, however, all data-driven MT systems are better when you have quality data. Research shows that Neural MT is more sensitive to noise than Statistical MT. Still, as SMT has been around for 15+ years now, many of the SMT data preparation practices used historically have been carried over to NMT model building today.

This problem has raised interest in the field of parallel data filtering to identify and correct the most problematic issues for NMT, e.g., segments where source and target are the same, and misaligned sentences. This presentation by eBay provides an overview of the importance of parallel data filtering and its best practices. It adds to the useful points made by Doctor-sahib in this post. Data cleaning and preparation have always been necessary for developing superior MT engines, and most of us agree that it is even more critical now with neural network-based models.
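
As a rough illustration of the two filtering targets named above, here is a minimal Python sketch. It is not the eBay method or any particular tool: `filter_parallel` is a hypothetical helper, the character-length ratio is a swapped-in heuristic for spotting likely misalignments, and the `max_len_ratio` default of 3.0 is an arbitrary assumption that would need tuning per language pair.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (source segment, target segment)


def filter_parallel(pairs: Iterable[Pair], max_len_ratio: float = 3.0) -> List[Pair]:
    """Keep pairs that pass two simple checks (thresholds are illustrative)."""
    kept = []
    for source, target in pairs:
        src, tgt = source.strip(), target.strip()
        if not src or not tgt:
            continue  # one side is empty
        if src.casefold() == tgt.casefold():
            continue  # source copied into target, i.e. untranslated
        # Assumed heuristic: wildly different lengths suggest misalignment.
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_len_ratio:
            continue
        kept.append((src, tgt))
    return kept


corpus = [
    ("Experts believe.", "Experts believe."),          # untranslated copy
    ("Experts believe.", "જાણકારોનું માનવું છે."),       # plausible pair
    ("Jet now has 1,300 pilots.", "હા."),              # made-up misaligned pair
]
clean = filter_parallel(corpus)
```

Only the second pair survives. Note that punctuation and short segments are left untouched by this kind of filter.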

This guest post is by Raymond Doctor, an old and wise acquaintance of mine who has spent over a decade at the Centre for Development of Advanced Computing (C-DAC) in Pune, India. He is a pioneer in digital Indic language work and was involved in several Indic language NLP-based initiatives, conducting research on Indic language Parsers, Segmentation, Tokenization, Stemming, Lemmatization, NER, Chunking, Machine Translation, and Opinion Mining.

He and I also share two Indian languages (Hindi and Gujarati). Over the years, he has shown me many examples of output from MT systems he has developed in his research that were the best I had seen for these two languages going into and out of English. The success of his MT experiments is yet more proof that the best MT systems come from those who have a deep understanding of both the underlying linguistics and the MT system development methodology.

Overview of the SMT data alignment processes

"True inaccuracy and errors in data are at least relatively straightforward to address, because they are generally all logical in nature. Bias, on the other hand, involves changing how humans look at data, and we all know how hard it is to change human behavior."

- Michiko Wolcott

Some other wisdom about data from Michiko:

Truth #1: Data are stupid and lazy.

Data are not intelligent. Even artificial intelligence must be taught before it learns to learn on its own (even that is debatable). Data have no ability on their own. It is often said that insights must be teased out of data.

Truth #2: Data are rarely an objective representation of reality (on their own).

I want to clarify this statement: it does not say that data is rarely accurate or error-free. Accuracy and correctness are dimensions of quality of what is in the data themselves.

The text below is written by the guest author.


Over the years, I have been studying the various recommendations given to prepare training data before submitting it to an NMT learning engine. I feel these recommended practices mainly emerged as best practices at the time of SMT, and have been carried over to NMT with less beneficial results.

I have identified six major pitfalls that data analysts fall into when preparing training data for NMT models. These data cleaning and preparation practices originated as best practices with SMT, where they were of benefit. Many of them are still being followed today, and it is my opinion that avoiding them is likely to result in better outcomes.

While I have listed only a few practices that I feel should be avoided, many other SMT-based data preparation practices also make it likely that the training data will produce a sub-optimal NMT system. The factors listed below are the most common ones, and following them has resulted in lower output quality than would be possible by ignoring them. In my research with Indic language MT systems, I disregarded the advice given regarding punctuation, deduping, and removing truncations and MWEs, and found that the quality of NMT output improved considerably.

As far as possible, examples have been provided from a Gujarati <> English NMT system I have developed. But the same can apply to any other parallel corpus.


Quite a few sites tell you to remove punctuation before submitting the data for learning. It is my observation that this is not an optimal practice.

Punctuation marks are markers that allow for understanding the meaning. In a majority of languages, word order does not necessarily show interrogation:

Tu viens? =You are coming?

Removing the interrogation marker creates confusion and dupes [see my remark below].

See what happens when a comma is removed:

Anne Marie, va manger mon enfant=Anne Marie. Come have your lunch

Anne Marie va manger mon enfant=Anne Marie is going to eat my child


The mayor says, the commissioner is a fool.

The mayor, says the commissioner, is a fool.

I feel that in preparing a corpus the punctuation markers should be retained.
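
To make the point about confusion and dupes concrete, here is a minimal Python sketch. It is an illustration only, not the pipeline used for the experiments described in this post: `strip_punctuation` is a hypothetical helper, and the segment `Tu viens.` is an added assumption to pair with the question form shown earlier.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Remove every Unicode punctuation character (the practice argued against)."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


segments = [
    "Tu viens?",
    "Tu viens.",  # assumed statement form, added for illustration
    "Anne Marie, va manger mon enfant",
    "Anne Marie va manger mon enfant",
]
stripped = [strip_punctuation(s) for s in segments]

print(len(set(segments)), len(set(stripped)))  # 4 2
```

Four distinct source segments go in and only two distinct strings come out, so segments with different meanings, and different correct translations, become indistinguishable to the learner.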


Quite a few sites advise you to remove short sentences. Doing this, in my opinion, is a serious error. Short sentences are crucial for translating headlines, one of the stumbling blocks of NMT. Some have no verbs and are pure nominal structures.

Curfew declared: Noun + Verb

Sweep of Covid19 over the continent: Nominal Phrase


Google does not handle nominal structures well, and here is an example:

Sweep of Covid over India= ભારત ઉપર કોવિડનો સ્વીપ

I have found that retaining such structures strengthens and improves the quality of NMT output.


Multiword expressions (MWEs) are expressions that are made up of at least two words, and which can be syntactically and/or semantically idiosyncratic in nature. Moreover, they act as a single unit at some level of linguistic analysis.

Like short sentences, MWEs are often ignored and removed from the training corpus. These MWEs are very often fixed patterns found in a given language. These can be short expressions, titles, or phrasal constructs, just to name a few of the possibilities. MWEs cannot be literally translated and need to be glossed accurately. My experience has been that the higher the volume of MWEs provided, the better the quality of learning. A few MWEs in Gujarati are provided below:

agreement in absence =અભાવાન્વય

agreement in presence =ભવાન્વય

agriculture parity =કૃષિમૂલ્ય સમાનતા

aid and advice =સહાય અને સલાહ

aider and abettor =સહાયક અને મદદગાર

aim fire =નિશાન લગાવી ગોળી ચલાવવી


A large number of sites providing recommendations on NMT training data preparation tell you to remove duplicates, both in the Source and Target texts. In popular parlance, this is termed deduping. The argument is that deduping the corpus makes for greater accuracy. However, it is common for one English sentence to map to two or more strings in the target language. This variation can arise from synonyms used in the target language, or from the flexible word order that is especially common in Indic languages. Deduping such data weakens the quality of MT output. The only case where deduping needs to be done is when we have two identical strings, both in the Source and Target language. Higher quality NMT engines incorporate these slight variations on a single segment to enable the MT engines to produce multiple variants.

Change of verbal expression and word order:

How are the trade talks between China and the US moving forward now. =ચીન તથા અમેરિકા વચ્ચે વેપાર વ્યવહાર વિષયક વાતચીત હવે કેવી આગળ વધે છે.

How are the trade talks between China and the US moving forward now. =ચીન તથા અમેરિકા વચ્ચે હવે વેપાર વિષયક વાતચીત કેવી આગળ વધે છે.


Experts believe. =એક્સપર્ટ્સ માને છે.

Experts believe. =જાણકારોનું માનવું છે.

Experts believe. =નિષ્ણાતોનું માનવું છે.

Deduping the data in such cases results in reducing the quality of output.

The only case where deduping needs to be done is where we have two identical strings, both in the Source and Target language; in other words, an exact duplicate. High-end NMT engines do not practice deduping, since this deprives the MT system of the ability to provide variants, which can be seen by clicking on all or part of the gloss.
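
The exact-duplicate rule can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a description of any particular engine's tooling: `dedupe_exact_pairs` is a hypothetical helper, and trimming surrounding whitespace before comparing is my own assumption.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (source segment, target segment)


def dedupe_exact_pairs(pairs: Iterable[Pair]) -> List[Pair]:
    """Drop a pair only when BOTH source and target repeat an earlier pair."""
    seen = set()
    kept = []
    for source, target in pairs:
        key = (source.strip(), target.strip())  # assumption: ignore edge whitespace
        if key in seen:
            continue  # exact duplicate on both sides
        seen.add(key)
        kept.append(key)
    return kept


corpus = [
    ("Experts believe.", "એક્સપર્ટ્સ માને છે."),
    ("Experts believe.", "જાણકારોનું માનવું છે."),
    ("Experts believe.", "નિષ્ણાતોનું માનવું છે."),
    ("Experts believe.", "જાણકારોનું માનવું છે."),  # exact repeat of the second pair
]
deduped = dedupe_exact_pairs(corpus)
print(len(deduped))  # 3
```

The three target variants of the same source sentence are all kept; only the fourth pair, which repeats both sides, is removed. Deduping on the source side alone would have left a single variant.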


The inability to handle verbal patterns is the Achilles heel of a majority of NMT engines, including Google, insofar as English to Indic languages are concerned. This area is ignored because it is felt that the corpus will cover all verbal patterns in both the source and target language. Even the best of corpora do not.

Providing a set of sentences with the Verbal Pattern of both the source and target languages goes a long way.

Gujarati admits around 40+ verbal patterns and NMT fails on quite a few:

They ought to have been listening to the PM's speech =તેઓએ વડા પ્રધાનનું ભાષણ સાંભળ્યું હોવું જોઈએ

Shown below is a sample of Gujarati verbal patterns with "to eat" as a paradigm:

You are eating =તમે ખાઓ છો
You are not eating =તમે ખાતા નથી
You ate =તમે ખાધું
You can eat =તમે ખાઈ શકો છો
You cannot eat =તમે નહીં ખાઈ શકો
You could not eat =તમે ખાઈ શક્યા નહીં
You did not eat =તમે ખાધું નહીં
You do not eat =તમે ખાતા નથી
You eat =તમે ખાધું
You had been eating =તમે ખાતા હતા
You had eaten =તમે ખાધું હતું
You have eaten =તમે ખાધું છે
You may be eating =તમે ખાતા હોઈ શકો છો
You may eat =તમે ખાઈ શકો છો
You might eat =તમે કદાચ ખાશો
You might not eat =તમે કદાચ ખાશો નહીં
You must eat =તમારે ખાવું જ જોઇએ
You must not eat =તમારે ખાવું ન જોઈએ
You ought not to eat =તમારે ખાવું ન જોઈએ
You ought to eat =તમારે ખાવું જોઈએ
You shall eat =તમે ખાશો

Similarly, consider how a habitual marker is glossed into French by a high-quality NMT system.


This construct is very common in Indic languages and often leads to mistranslation.

Thus, Gujarati uses જવું કરવું as an adjunct to the main verb. The combination of the pole and a vector verb such as જવું creates a new meaning.

મરી જવું is not translated as "die go", but simply as "die".

Gujarati admits around 15-20 such verbs, as do Hindi and other Indic languages, and once again, a corpus needs to be fed this type of data in the shape of sentences to produce better output.

In the case of English, it is the prepositional phrases that often create issues:

Pick up, pick someone up, pick up the tab


We noticed that when training data that ignores some of the frequent data preparation recommendations is sent in for training, the quality of MT output markedly improves. However, there is a caveat. If the training data is smaller than 100,000 segments, following or not following the above recommendations makes little or no difference. Superior NMT systems require a sizeable corpus, and generally, we see that at least a million+ segments are needed.

A small set of sentences from various domains is provided below as proof of the quality of output using these techniques:

Now sieve this mixture.=હવે આ મિશ્રણને ગરણીથી ગાળી લો.

It is violence and violence is sin.=હિંસા કહેવાય અને હિંસા પાપ છે.

The youth were frustrated and angry.=યુવાનો નિરાશ અને ક્રોધિત હતા.

Give a double advantage.=ચાંલ્લો કરીને ખીર પણ ખવડાવી.

The similarity between Modi and Mamata=મોદી અને મમતા વચ્ચેનું સામ્ય

I'm a big fan of Bumrah.=હું બુમરાહનો મોટો પ્રશંસક છું.

38 people were killed.=તેમાં 38 લોકોના મોત થયા હતા.

The stranger came and asked.=અજાણ્યા યુવકે આવીને પૂછ્યું.

Jet now has 1,300 pilots.=હવે જેટની પાસે 1,300 પાયલટ છે.


Raymond Doctor has spent over a decade at the Centre for Development of Advanced Computing (C-DAC) in Pune, India. He is a pioneer in digital Indic language work and was involved in several Indic language NLP-based initiatives, conducting research to further Indic language Parsers, Segmentation, Tokenization, Stemming, Lemmatization, NER, Chunking, Machine Translation, and Opinion Mining.


  1. Very correct and practical, આભાર!

    We just added more languages like Gujarati and will add more data for them soon, I'm very interested in supporting such open research work for more language pairs.

    The best way we can help is with filtering back-translations, crowdsourced subtitles and similar large-scale mixed-quality parallel corpora.

  2. "...that Neural MT is more sensitive to noise than Statistical MT." All corpus-based systems degrade with a noisy corpus. Early NMT research noted that SMT corpora needed more diligent processing to remove the noise. Yet none of the research mentioned regression tests using the less noisy corpus with SMT and measuring any change in the SMT results.

    In my experience, a cleaner and less noisy SMT training corpus yields better MT than a dirty, noisy SMT training corpus. The same is true when comparing NMT to NMT. Can anyone point me to research that says a cleaner, less noisy NMT corpus gives better results than regressing that cleaner corpus to an SMT system?

  3. Most of the time when approaching a new NLP problem, one of the first steps is punctuation removal. Here you can lose valuable data which is important in the overall language model.

    Another area of opportunity that gets lost when ignoring punctuation is the grammar correction market, best exemplified by Grammarly. As far as I know, there is no software available that offers punctuation improvement suggestions. I think there is space for Grammarly-like products in other languages that offer nuanced functionalities regarding punctuation.

    A final note: dictation and voice assistants. I use dictation a lot. It works and it is awesome. But my final step is "just" to add punctuation.

    Everyone wants to tackle words, but who is tackling punctuation? Not many.