Friday, May 22, 2020

Computers that "understand" language: BERT Explained

As the momentum of machine learning research continues, we keep hearing about breakthroughs, and perhaps the most significant natural language processing (NLP) advance of the last 12 months was something called BERT. The translation industry deals with words, so any technology that improves our ability to transform words from one language to another is considered important and impactful. There is therefore a tendency within the industry to always place new inventions in a translation context. This does not always make sense, and this post attempts to explain BERT in enough detail to help readers better understand its relevance and use within the confines of the translation industry.

What is BERT?

Last year, Google released a neural network-based technique for natural language processing (NLP) pre-training called Bidirectional Encoder Representations from Transformers (BERT). While that sounds like quite a mouthful, BERT is a significant, and possibly even revolutionary step forward for NLP in general. There has been much excitement in the NLP research community about BERT because it enables substantial improvements in a broad range of different NLP tasks. Simply put, BERT brings considerable advances to many tasks related to natural language understanding (NLU).

The following points summarize the salient characteristics of the innovation driven by BERT:

  • BERT is pre-trained on a large corpus of unlabeled text, which enhances and improves subsequent NLP tasks. This pre-training step is half the magic behind BERT's success: as the model trains on a large text corpus, it starts to pick up a deeper, more intimate understanding of how the language works. This knowledge is a Swiss Army knife that is useful for almost any NLP task. For example, a BERT model can be fine-tuned for a small-data NLP task like question answering or sentiment analysis, resulting in substantial accuracy improvements compared to training on the smaller dataset from scratch. BERT allows researchers to get state-of-the-art results even when very little task-specific training data is available.
  • Words are problematic because plenty of them are ambiguous, polysemous, and synonymous. BERT is designed to help solve ambiguous sentences and phrases that are made up of lots and lots of words with multiple meanings.
  • BERT will help with things like:
    • Named entity determination.
    • Coreference resolution.
    • Question answering.
    • Word sense disambiguation.
    • Automatic summarization.
    • Polysemy resolution.
  • BERT is a single model and architecture that brings improvements in many different tasks that previously would have required the use of multiple different models and architectures.
  • BERT also provides a much better contextual sense, and thus increases the probability of understanding the intent in search. Google called this update "the biggest leap forward in the past five years, and one of the biggest leaps forward in the history of Search."

While BERT is a significant improvement in how computers "understand" human language, it is still far from understanding language and context the way humans do. We should, however, expect BERT to have a significant impact on many understanding-focused NLP initiatives. The General Language Understanding Evaluation (GLUE) benchmark is a collection of datasets used for training, evaluating, and analyzing NLP models relative to one another. The datasets are designed to test a model's language understanding and are useful for evaluating models like BERT. As the GLUE leaderboard shows, BERT-based models can now outperform human baselines even on comprehension tasks once thought to be beyond computers.

To better understand this, I recently sat down with SDL NLP experts, Dragos Munteanu and Steve DeNeefe. I asked them questions to help us all better understand BERT and its possible impact on other areas of language technology.

  1. Can you describe in layman's terms what BERT is and why there is so much excitement about it?

BERT combines three factors in a powerful way. First, it is a very large, attention-based neural network architecture known as a "Transformer encoder" (Transformer networks are the basis of our NMT 2.0 language pairs). Second, the network is trained with a "fill in the blank" method: words are removed at random from a paragraph, and the system tries to predict them. Third, it is trained on massive amounts of monolingual text, usually English. There are also variants trained on French, German, and Chinese, and even one trained on 104 different languages (though not with parallel data).

The result is a powerful representation of language in context, which can be "fine-tuned" (quickly adapted) to perform many challenging tasks previously considered hard for computers, i.e., tasks requiring world knowledge or common sense.
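The "fill in the blank" objective (masked language modeling) is easy to illustrate. Below is a minimal, self-contained sketch of how such training examples are created; the whitespace tokenization and masking rate are simplifying assumptions, not BERT's exact recipe (BERT masks about 15% of subword tokens and applies some further replacement tricks):

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Build a 'fill in the blank' training example: randomly replace
    some tokens with [MASK] and record the originals as targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append(mask_token)
            targets[i] = tok  # the network learns to predict this token
        else:
            masked.append(tok)
    return masked, targets

sentence = "the cat sat on the mat because it was tired".split()
masked, targets = mask_tokens(sentence, mask_rate=0.3)
# 'masked' is the network input; 'targets' are the hidden answers
```

During pre-training, the network sees only `masked` and is penalized for failing to recover the entries in `targets`; that pressure is what forces it to learn contextual representations.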

  2. Some feel that BERT is good for anything related to NLP, but what are the specific NLP problems that BERT has solved best?

BERT and other similar models (RoBERTa, OpenAI GPT, XLNet) are state of the art on many NLP tasks that require classification, sequence labeling, or similar digestion of text, e.g., named entity recognition, question answering, and sentiment analysis. BERT is an encoder: it digests data but does not, by itself, produce data. Many NLP tasks also include a generation component (e.g., abstractive summarization, translation), and these require an additional network and more training.

  3. I am aware that several BERT-inspired initiatives are beating human baselines on the GLUE (General Language Understanding Evaluation) leaderboard. Does this mean that computers understand our language?

Neural networks learn a mapping from inputs to outputs by finding patterns in their training data. Deep networks have millions of parameters and thus can learn (or "fit") quite intricate patterns. This is really what enables BERT to perform so well on these tasks. In my opinion, this is not equivalent to an understanding of the language. There are several papers and opinion pieces that analyze these models' behavior on a deeper level, specifically trying to gauge how they handle more complex linguistic situations, and they also conclude that we are still far from real understanding. Speaking from the SDL experience training and evaluating BERT-based models for tasks such as sentiment analysis or question answering, we recognize these models perform impressively. However, they still make many mistakes that most people would never make.

  4. If summarization is a key strength of BERT, do you see it making its way into the Content Assistant capabilities that SDL has as the CA scales up to solve larger enterprise problems related to language understanding?

Summarization is one of the critical capabilities in CA, and one of the most helpful in enabling content understanding, which is our overarching goal. We currently do extractive summarization, where we select the most relevant segments from the document. In recognition of the fact that different people care about various aspects of the content, we implemented an adaptive extractive summarization capability. Our users can select critical phrases of particular interest to them, and the summary will change to choose segments that are more related to those phrases. 

Another approach is abstractive summarization, where the algorithm generates new language that did not exist in the document being summarized. Given their powerful representations, transformer networks like BERT are well positioned to generate text that is both fluent and relevant to the document's meaning. In our experiments so far, however, we have not seen compelling evidence that abstractive summaries facilitate better content understanding.

  5. I have read that there are some benefits to combining BERT-based language models with NMT systems, but that the process seems rather complicated and expensive in resource terms. Do you see BERT-influenced capabilities coming to NMT in the near or distant future?

Pretrained BERT models can be used in several different ways to improve the performance of NMT systems, although there are various technical difficulties. BERT models are built on the same transformer architecture used in the current state-of-the-art NMT systems. Thus, in principle, a BERT model can be used either to replace the encoder part of the NMT system or to help initialize the NMT model's parameters. One of the problems is that, as you mention, using a BERT model increases the computational complexity. And the gains for MT are, so far, not that impressive; nowhere near the kind of benefits that BERT brings on the GLUE-style tasks. However, we continue to look into various ways of exploiting the linguistic knowledge encoded in BERT models to make our MT systems better and more robust.

  6. Others say that BERT could help in developing NMT systems for low-resource languages. Is the ability to transfer learning from monolingual data likely to bring us more low-resource language combinations in the near future?

One of the aspects of BERT that is most relevant here is the idea of training representations that can be quickly fine-tuned for different tasks. If you combine this with the concept of multi-lingual models, you can imagine an architecture/training procedure that learns and builds models relevant to several languages (maybe from the same family), which then can be fine-tuned for any language in the family with small amounts of parallel data.

  7. What are some of the specific capabilities, resources, and competencies that SDL has that will enable the company to adopt this kind of breakthrough technology faster?

Our group has expertise in all the areas involved in developing, optimizing, and deploying NLP technologies for commercial use cases. We have world-class researchers who are recognized in their respective fields and regularly publish papers in peer-reviewed conferences.

We have already built a variety of NLP capabilities using state-of-the-art technology: summarization, named entity recognition, question generation, question answering, sentiment analysis, sentence auto-completion. They are in various stages in the research-to-production spectrum, and we continue to develop new ones. 

We also have expertise in the deployment of large and complex deep learning models. Our technology is designed to make optimal use of either CPUs or GPUs, in either 32-bit or 16-bit mode, and we understand how to make quality-speed trade-offs to fulfill the various use cases of our customers.

Last but not least, we foster close collaboration between research, engineering, and product management. Developing NLP capabilities that bring concrete commercial value is as much an art as it is a science, and success can only be attained through the breadth of expertise and deep collaboration.

Monday, May 18, 2020

Data Preparation Best Practices for Neural MT

In any machine learning task, the quality and volume of available training data are critical determinants of the system that is developed. This is true for both Statistical MT and Neural MT: both are data-driven and produce output that is deeply influenced by the data used to train them. Some believe that Statistical MT systems have a higher tolerance for noisy data, and thus assume that more volume is better even if the data is "noisy"; in my experience, however, all data-driven MT systems are better when you have quality data. Research shows that Neural MT is more sensitive to noise than Statistical MT. Still, as SMT has been around for 15+ years now, many of the SMT data preparation practices in historical use are carried over, unexamined, into NMT model building today.

This problem has raised interest in the field of parallel data filtering, which aims to identify and correct the issues that are most problematic for NMT, e.g., segments where source and target are identical, and misaligned sentences. This presentation by eBay provides an overview of the importance of parallel data filtering and its best practices, and it adds to the useful points made by Doctor-sahib in this post. Data cleaning and preparation have always been necessary for developing superior MT engines, and most of us agree that they are even more critical now with neural network-based models.
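As a concrete illustration, the simplest of these filters can be sketched in a few lines of Python. The thresholds below are illustrative assumptions on my part, not figures from the eBay presentation:

```python
def keep_pair(src, tgt, max_ratio=3.0):
    """Return True if a (source, target) segment pair passes basic
    hygiene filters: non-empty, not an untranslated copy, and with a
    plausible length ratio (misaligned pairs often differ wildly)."""
    if not src.strip() or not tgt.strip():
        return False                     # empty segment
    if src.strip() == tgt.strip():
        return False                     # source and target identical
    ls, lt = len(src.split()), len(tgt.split())
    return max(ls, lt) / min(ls, lt) <= max_ratio  # misalignment check
```

Production pipelines add many more signals (language identification, alignment scores from a translation model, character-set checks), but even these crude rules remove a surprising amount of harmful data.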

This guest post is by Raymond Doctor, who is an old and wise acquaintance of mine who has spent over a decade at the Centre for Development of Advanced Computing (C-DAC) in Pune, India. He is a pioneer in digital Indic language work and was involved in several Indic language NLP based initiatives conducting research on Indic language Parsers, Segmentation, Tokenization, Stemming, Lemmatization, NER, Chunking, Machine Translation, and Opinion Mining.


He and I also share two Indian languages in common (Hindi and Gujarati). Over the years, he has shown me many examples of output from MT systems he has developed in his research that were the best I had seen for these two languages going into and out of English. The success of his MT experiments is yet more proof that the best MT systems come from those who have a deep understanding of both the underlying linguistics, as well as the MT system development methodology. 

Overview of the SMT data alignment processes

"True inaccuracy and errors in data are at least relatively straightforward to address, because they are generally all logical in nature. Bias, on the other hand, involves changing how humans look at data, and we all know how hard it is to change human behavior."

- Michiko Wolcott

Some other wisdom about data from Michiko:

Truth #1: Data are stupid and lazy.

Data are not intelligent. Even artificial intelligence must be taught before it learns to learn on its own (even that is debatable). Data have no ability on their own. It is often said that insights must be teased out of data.

Truth #2: Data are rarely an objective representation of reality (on their own).

I want to clarify this statement: it does not say that data is rarely accurate or error-free. Accuracy and correctness are dimensions of quality of what is in the data themselves.

The text below is written by the guest author.


Over the years, I have studied the various recommendations for preparing training data before submitting it to an NMT training engine. These recommendations mostly emerged as best practices in the SMT era and have been carried over to NMT, where they are less beneficial.

I have identified six major pitfalls that data analysts fall into when preparing training data for NMT models. These data cleaning and preparation practices originated as best practices with SMT, where they were of benefit; many are still followed today, and in my opinion they should now be avoided, because avoiding them is likely to produce better outcomes.

While I have listed only the practices I feel are most important to avoid, many other SMT-based data preparation practices also tend to produce a sub-optimal NMT system. The factors listed below, however, are the most common practices that result in lower output quality than could be achieved by ignoring them. In my research with Indic-language MT systems, I disregarded the standard advice on punctuation, deduping, removing truncations, and MWEs, and found that the quality of NMT output improved considerably.

As far as possible, examples have been provided from a Gujarati <> English NMT system I have developed. But the same can apply to any other parallel corpus.


Pitfall 1: Removing punctuation

Quite a few sites tell you to remove punctuation before submitting data for training. In my observation, this is not optimal practice.

Punctuation marks are cues to meaning. In a majority of languages, word order does not necessarily signal interrogation:

Tu viens? =You are coming?

Removing the interrogation marker creates confusion and duplicates [see my remark on deduping below].

See what happens when a comma is removed:

Anne Marie, va manger mon enfant=Anne Marie, come have your lunch

Anne Marie va manger mon enfant=Anne Marie is going to eat my child


The mayor says, the commissioner is a fool.

The mayor, says the commissioner is a fool.

I feel that in preparing a corpus, the punctuation markers should be retained.
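The point is easy to demonstrate mechanically: once punctuation is stripped, the two French sentences above become the same training segment even though their meanings differ completely. A small sketch:

```python
import string

def strip_punct(segment):
    """Mimic the aggressive 'remove punctuation' cleaning step."""
    return segment.translate(str.maketrans("", "", string.punctuation)).strip()

a = "Anne Marie, va manger mon enfant"   # Anne Marie, come have your lunch
b = "Anne Marie va manger mon enfant"    # Anne Marie is going to eat my child
# After cleaning, two very different sentences collapse into one:
identical = strip_punct(a) == strip_punct(b)
```

A corpus cleaned this way hands the NMT system contradictory training pairs: one surface form mapped to two unrelated target meanings.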


Pitfall 2: Removing short sentences

Quite a few sites advise you to remove short sentences. Doing this, in my opinion, is a serious error. Short sentences are crucial for translating headlines, one of the stumbling blocks of NMT. Some have no verbs and are purely nominal structures.

Curfew declared: Noun + Verb

Sweep of Covid19 over the continent: Nominal Phrase


Google does not handle nominal structures well, and here is an example:

Sweep of Covid over India= ભારત ઉપર કોવિડનો સ્વીપ

I have found that retaining such structures strengthens and improves the quality of NMT output.


Pitfall 3: Removing multiword expressions (MWEs)

Multiword expressions (MWEs) are expressions made up of at least two words that can be syntactically and/or semantically idiosyncratic; moreover, they act as a single unit at some level of linguistic analysis.

Like short sentences, MWEs are often ignored and removed from the training corpus. MWEs are very often fixed patterns in a given language: short expressions, titles, or phrasal constructs, to name a few possibilities. They cannot be translated literally and need to be glossed accurately. My experience has been that the higher the volume of MWEs provided, the better the quality of learning. A few Gujarati MWEs are shown below:

agreement in absence =અભાવાન્વય

agreement in presence =ભવાન્વય

agriculture parity =કૃષિમૂલ્ય સમાનતા

aid and advice =સહાય અને સલાહ

aider and abettor =સહાયક અને મદદગાર

aim fire =નિશાન લગાવી ગોળી ચલાવવી
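One hedged sketch of how such a glossary can be put to work during data preparation: treat each known MWE as a single unit (longest match first) so it is not broken apart by downstream tokenization. The placeholder scheme below is a hypothetical illustration, not a prescription:

```python
# A tiny excerpt of the glossary above (hypothetical data structure)
MWE_GLOSSARY = {
    "aid and advice": "સહાય અને સલાહ",
    "agriculture parity": "કૃષિમૂલ્ય સમાનતા",
}

def protect_mwes(sentence, glossary):
    """Replace known MWEs with single placeholder tokens, longest
    match first, so downstream tools treat each MWE as one unit."""
    mapping = {}
    for i, mwe in enumerate(sorted(glossary, key=len, reverse=True)):
        token = f"__MWE{i}__"
        if mwe in sentence:
            sentence = sentence.replace(mwe, token)
            mapping[token] = mwe
    return sentence, mapping

protected, mapping = protect_mwes("they offered aid and advice", MWE_GLOSSARY)
```

After translation, the placeholders can be swapped for the glossed target-language expressions, guaranteeing the MWE is rendered as a unit rather than word by word.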


Pitfall 4: Deduping

A large number of sites providing recommendations on NMT training data preparation tell you to remove duplicates in both the source and target texts, an operation popularly termed deduping. The argument is that deduping the corpus makes for greater accuracy. However, it is common for one English sentence to map to two or more strings in the target language. This variation can arise from synonyms in the target language, or from the flexible word order that is especially common in Indic languages. Deduping such data weakens the quality of MT output. The only case where deduping should be done is when two pairs are identical in both the source and the target language. Higher-quality NMT engines keep these slight variations on a single segment so that they can produce multiple variants.

Change of verbal expression and word order:

How are the trade talks between China and the US moving forward now. =ચીન તથા અમેરિકા વચ્ચે વેપાર વ્યવહાર વિષયક વાતચીત હવે કેવી આગળ વધે છે.

How are the trade talks between China and the US moving forward now. =ચીન તથા અમેરિકા વચ્ચે હવે વેપાર વિષયક વાતચીત કેવી આગળ વધે છે.


Experts believe. =એક્સપર્ટ્સ માને છે.

Experts believe. =જાણકારોનું માનવું છે.

Experts believe. =નિષ્ણાતોનું માનવું છે.

Deduping the data in such cases results in reducing the quality of output.

The only case where deduping is needed is when we have two identical strings in both the source and the target language; in other words, an exact duplicate. High-end NMT engines do not dedupe beyond this, since doing so deprives the MT system of the ability to offer variants, which can be seen by clicking on all or part of the gloss.
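In code, "dedupe only exact pairs" is a one-set affair. This minimal sketch keeps legitimate target-side variants while dropping true duplicates:

```python
def dedupe_pairs(pairs):
    """Drop only exact (source, target) duplicates; a source that recurs
    with a *different* target is a legitimate variant and is kept."""
    seen, kept = set(), []
    for src, tgt in pairs:
        if (src, tgt) not in seen:
            seen.add((src, tgt))
            kept.append((src, tgt))
    return kept

pairs = [
    ("Experts believe.", "જાણકારોનું માનવું છે."),
    ("Experts believe.", "નિષ્ણાતોનું માનવું છે."),  # variant target: kept
    ("Experts believe.", "જાણકારોનું માનવું છે."),  # exact duplicate: dropped
]
cleaned = dedupe_pairs(pairs)
```

Source-only deduping, by contrast, would keep just one of the first two pairs and erase a perfectly valid translation variant.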


Pitfall 5: Ignoring verbal patterns

The inability to handle verbal patterns is the Achilles' heel of most NMT engines, including Google's, insofar as English to Indic languages is concerned. This area is neglected because it is assumed that the corpus will cover all verbal patterns in both the source and target language. Even the best of corpora does not.

Providing a set of sentences covering the verbal patterns of both the source and target languages goes a long way.

Gujarati admits 40+ verbal patterns, and NMT fails on quite a few:

They ought to have been listening to the PM's speech =તેઓએ વડા પ્રધાનનું ભાષણ સાંભળ્યું હોવું જોઈએ

Shown below is a sample of Gujarati verbal patterns with "to eat" as the paradigm:

You are eating =તમે ખાઓ છો
You are not eating =તમે ખાતા નથી
You ate =તમે ખાધું
You can eat =તમે ખાઈ શકો છો
You cannot eat =તમે નહીં ખાઈ શકો
You could not eat =તમે ખાઈ શક્યા નહીં
You did not eat =તમે ખાધું નહીં
You do not eat =તમે ખાતા નથી
You eat =તમે ખાધું
You had been eating =તમે ખાતા હતા
You had eaten =તમે ખાધું હતું
You have eaten =તમે ખાધું છે
You may be eating =તમે ખાતા હોઈ શકો છો
You may eat =તમે ખાઈ શકો છો
You might eat =તમે કદાચ ખાશો
You might not eat =તમે કદાચ ખાશો નહીં
You must eat =તમારે ખાવું જ જોઇએ
You must not eat =તમારે ખાવું ન જોઈએ
You ought not to eat =તમારે ખાવું ન જોઈએ
You ought to eat =તમારે ખાવું જોઈએ
You shall eat =તમે ખાશો

A similar issue arises with habitual markers, which even a high-quality NMT system can mistranslate when glossing into French.


Pitfall 6: Ignoring compound (vector) verbs

This construct is very common in Indic languages and often leads to mistranslation.

Gujarati, for instance, uses જવું or કરવું as an adjunct to the main verb. The combination of the pole (main verb) and a vector verb such as જવું creates a new meaning.

મરી જવું is not translated as "die go", but simply as "die".

Gujarati admits around 15-20 such verbs, as do Hindi and other Indic languages, and once again, the corpus needs to be fed this type of data in the form of sentences to produce better output.

In the case of English, it is phrasal and prepositional verbs that often create similar issues:

Pick up, pick someone up, pick up the tab


We noticed that when training data prepared while ignoring some of these frequent data preparation recommendations is sent in for training, the quality of MT output markedly improves. There is a caveat, however: if the training data is below a threshold of around 100,000 segments, following or not following the above recommendations makes little or no difference. Superior NMT systems require a sizeable corpus; generally, we see that at least a million+ segments are needed.

A small set of sentences from various domains is provided below as evidence of the quality of output achievable with these techniques:

Now sieve this mixture.=હવે આ મિશ્રણને ગરણીથી ગાળી લો.

It is violence and violence is sin.=હિંસા કહેવાય અને હિંસા પાપ છે.

The youth were frustrated and angry.=યુવાનો નિરાશ અને ક્રોધિત હતા.

Give a double advantage.=ચાંલ્લો કરીને ખીર પણ ખવડાવી.

The similarity between Modi and Mamata=મોદી અને મમતા વચ્ચેનું સામ્ય

I'm a big fan of Bumrah.=હું બુમરાહનો મોટો પ્રશંસક છું.

38 people were killed.=તેમાં 38 લોકોના મોત થયા હતા.

The stranger came and asked.=અજાણ્યા યુવકે આવીને પૂછ્યું.

Jet now has 1,300 pilots.=હવે જેટની પાસે 1,300 પાયલટ છે.


Raymond Doctor has spent over a decade at the Centre for Development of Advanced Computing (C-DAC) in Pune, India. He is a pioneer in digital Indic language work and was involved in several Indic language NLP based initiatives, conducting research furthering Indic language Parsers, Segmentation, Tokenization, Stemming, Lemmatization, NER, Chunking, Machine Translation, and Opinion Mining.

Friday, May 1, 2020

Evaluating Machine Translation Systems

This post is the first in a series of upcoming posts focusing on the issue of quality evaluation of multiple MT systems. MT system selection has become a more important issue in recent times as users and buyers realize that multiple MT systems could potentially be viable for their needs, and they would like to develop better, more informed selection procedures.

I have also just ended my tenure at SDL, and this departure will also allow my commentary and opinion in this blog to be more independent and objective, from this point onwards. I look forward to looking more closely at all the most innovative MT solutions in the market today and providing more coverage on them.  

As NMT technology matures, it has become increasingly apparent to many buyers that traditional metrics like BLEU, used to compare and rank different MT systems and vendors, are now often inadequate for this purpose, even though these metrics are still useful to engineers focused on building a single MT system. It is now much more widely understood that best practice combines human evaluations with automated metrics, and this combined scoring approach is a more useful input for comparative evaluations of MT systems. To the best of my knowledge, very few in the professional translation world do this well; it is very much an evolving practice, and the learning is happening now. Thus, I invite any readers willing to share their insights into conducting consistent and accurate human evaluations to contact me about doing so here.

Most of the focus in the localization world's use of MT remains on MTPE efficiencies (edit distance, translator productivity), often without consideration of how the volume and useable quality might change and impact the overall process and strategy. While this focus has value, it misses the broader potential of MT and "leaves money on the table" as they say.

We should understand the questions that are most frequently asked:
  • What MT system would work best for our business purposes?
  • Is there really enough of a difference between systems to use anything but the lowest cost vendor?
  • Is there a better way to select MT systems than just looking at generic BLEU scores?
I have covered these questions to some extent in prior posts and I would recommend this post and this post to get some background on the challenges in understanding the MT quality big picture.

The COVID-19 pandemic is encouraging MT-use in a positive way. Many more brands now realize that speed, digital agility, and a greater digital presence matter in keeping customers and brands engaged. As NMT continues to improve, much of the "bulk translation market" will move to a production model where most of the work will be done by MT.  Translators who are specialists and true subject matter experts are unlikely to be affected by the technology in a negative way, but NMT is poised to penetrate standard/bulk localization work much more deeply, driving costs down as it does so.

This is a guest post and an unedited independent opinion from an LSP (Language Service Provider) and it is useful in providing us an example of the most common translation industry perspective on the subject of multiple MT system evaluations. It is interesting to note that the NMT advances over SMT are still not quite understood by some, even though the bulk of the research efforts and most new deployments have shifted to NMT. 

Most LSPs continue to stress that human translation is "better" than MT, which most of us on the technology side would not argue against, but this view loses something when we see that the real need today is to "translate" millions of words a day. It also glosses over the fact that not all translation tasks are the same. Even in 2020, most LSPs continue to overlook that MT solves new kinds of translation problems involving speed and volume, and that new skills are needed to really leverage MT in these new directions. There is also a tendency to position the choice as a binary MT vs. human translation, even though much of the evidence points to new man + machine models that provide an improved production approach. The translation needs of the future are quite different from those of the past, and I hope that more service providers in the industry start to recognize this.

I also think it is unwise for LSPs to start building their own MT systems, especially with NMT. The complexity, cost and expertise required are prohibitive for most. MT systems development should be left to real experts who do this on a regular and continuing basis. The potential for LSPs adding value is in other areas, and I hope to cover this in the coming posts.

Source: MasterWord


It’s not a secret that machine translation (MT) has taken the world by storm. Almost everyone has now had some experience with MT, most often through a translation app such as the popular Google Translate. But MT comes in a variety of formats and is heavily utilized by businesses and institutions all over the world.

With that in mind, which MT system is best? Since MT comes in many colors, figuratively speaking, which one should you rely on if you decide to build your own MT system? We’ll also talk more about translation quality and whether or not MT is suitable for specialized translations such as medical translation, a critical field now for any active translation company in light of the coronavirus pandemic that has brought the whole world to a standstill.

What is Machine Translation?

Machine Translation, or MT, is software capable of translating text in a source language into text in a target language. Over the years, there have been multiple variations of MT, but there are three definitive types: Rules-Based Machine Translation (RBMT), Statistical Machine Translation (SMT), and Neural Machine Translation (NMT). Here’s a quick rundown of their characteristics, including their pros and cons:

  1. RBMT

Rules-Based Machine Translation is one of the earliest forms of MT. Its algorithm is language-based, meaning that for it to translate one language into another, it must rely on input data in the form of a lexicon, grammar rules, and other linguistic fundamentals. The problem with RBMT systems is scaling them efficiently, as they become more complicated with each language rule added. RBMT is also never ideal for obscure languages with minuscule data. With the development of more advanced MT systems over the years, RBMT has largely been superseded by its successors, described next.

  2. SMT

Statistical Machine Translation, unlike RBMT, translates languages using statistical algorithms. SMT is fed data in the form of bilingual text corpora and is programmed to identify patterns in the data and form its translations from them. A pattern, in this context, is how consistently a certain word or phrase appears in a certain context. This probability-based learning model allows SMT systems to render relatively appropriate translations compared to RBMT. It’s pretty much a matter of ‘if this is how it was done, then this is how it should be done’.

Like RBMT, SMT must be fed plenty of data, but MT developers, including translation app developers, prefer SMT for its ease of setup thanks to the numerous open-source SMT systems available, its cost-effectiveness thanks to free, quality parallel text corpora available online, its higher translation accuracy than RBMT, and its ease of scaling as the system grows.

But just like RBMT, SMT can’t function well if it’s fed insufficient or poorly structured parallel text corpora. That being said, it’s not ideal for translating obscure languages.

  3. NMT

Neural Machine Translation is the latest development in MT. Think of it as an upgraded version of SMT whose abilities are supplemented with artificial intelligence (AI), specifically deep learning. Not only is it capable of combing through data faster, it can also produce better outputs through constant trial and error. SMT learns in a similar way, but the difference, albeit a definitive one, is that NMT does it much faster and more accurately. Google Translate switched from its old SMT system to NMT in 2016.

Its deep learning capability is such a game-changer that it can accomplish what RBMT and SMT could not: translating obscure regional languages. That's why Google Translate can cover over 100 languages, including Somali and Gaelic. Its outputs for such languages are questionable, to say the least, as it needs time to learn a language with little reliable data lying around for it to use. Even so, the development of NMT goes to show how far MT has evolved over the years.

What Makes A Good Machine Translation (MT) System?

There have been many MT systems over the years and many still in development. The ones that happened to survive the test of time are select variants of RBMT and most variants of SMT. NMT has quickly gained popularity and will slowly replace SMT as the years go by. What’s generally expected out of a good custom-built MT system is reliability and quality of outputs, pretty much like any other product or service out there.

If you’re looking for a reliable metric, then BLEU (Bilingual Evaluation Understudy) is one of the most widely used MT evaluation metrics. BLEU scores MT output between 0 (the worst) and 1 (the best) by measuring how closely the translated text overlaps with a human reference translation. The more human-like and natural-sounding the translation is, the better the score.
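To make the scoring idea concrete, here is a minimal sketch of the unigram core of BLEU: clipped word overlap multiplied by a brevity penalty. This is a simplification written for illustration; real BLEU combines 1- to 4-gram precisions with a geometric mean, so for actual evaluation you'd reach for an established implementation such as sacreBLEU.

```python
import math
from collections import Counter

def bleu1(candidate, reference):
    """Simplified, unigram-only BLEU: modified precision x brevity penalty.
    (Real BLEU combines 1- to 4-gram precisions with a geometric mean.)"""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0
    cand_counts = Counter(cand)
    ref_counts = Counter(ref)
    # Clip each candidate word's count by its count in the reference,
    # so repeating a common word cannot inflate the score.
    overlap = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    precision = overlap / len(cand)
    # The brevity penalty discourages translations shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

print(bleu1("the cat sat on the mat", "the cat sat on the mat"))   # 1.0
print(bleu1("the the the the the the", "the cat sat on the mat"))  # clipping keeps this low
```

The clipping step is the key design choice: without it, a system could output "the the the the" and claim high precision, since "the" appears in almost every reference.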

That being said, every MT developer builds their system according to not only the developer’s but also a client’s specifications and linguistic needs, so no two systems are alike. Still, some MT platforms are widely used by multiple clients thanks to their ease of use and their flexibility in adapting to each client’s needs. But even with the variety of MT systems developed over the years, one thing remains the same: MT systems have to learn from a lot of quality data and must be given the time to learn.

They say that machines are inherently dumb and that they’re only as good as the job or data they are given. For MT, that notion still rings true to this day and will most likely keep ringing for decades to come. However, quality data isn’t the only thing that makes a good MT system.

There are platforms in which MT is integrated with other processes so that it can render quality, or at the very least passable, translations. Indeed, MT is a process unto its own, but its outputs, even with deep learning capabilities, are still not up to par with those of a professional translator. MT has to be integrated with other processes, namely computer-assisted translation (CAT) tools.

There are many CAT tools, but two of the most essential are a glossary tool and a translation memory. A glossary is simply a database of terminology and approved translations. It’s a very simple feature but a very important one, as it saves the translator a lot of time: they don’t need to constantly check back and forth which translation is the right choice for the source text at hand.

A translation memory is like a glossary, but it stores phrases and sentences. It also saves the translator valuable time, as many translation jobs recycle the same language: user manuals, marketing collateral, and so on. A translation memory also helps by providing consistent language within a given domain and language pair.
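Here is a minimal sketch of how a translation memory lookup might work, using fuzzy string matching. The `memory` dictionary, the `tm_lookup` helper, and the 0.75 threshold are all illustrative assumptions; commercial CAT tools use more refined segment-matching algorithms.

```python
import difflib

# Tiny translation memory: previously approved source -> target pairs (illustrative).
memory = {
    "Press the power button to turn on the device.":
        "Pulse el botón de encendido para encender el dispositivo.",
    "Do not expose the device to water.":
        "No exponga el dispositivo al agua.",
}

def tm_lookup(sentence, threshold=0.75):
    """Return the stored translation of the most similar past sentence,
    with a similarity score, or None if nothing is close enough."""
    best, best_score = None, 0.0
    for source, target in memory.items():
        score = difflib.SequenceMatcher(None, sentence, source).ratio()
        if score > best_score:
            best, best_score = (source, target), score
    if best_score >= threshold:
        return best[1], round(best_score, 2)
    return None

# A near-identical sentence gets a "fuzzy match" from the memory;
# unrelated text falls below the threshold and returns None.
print(tm_lookup("Press the power button to turn off the device."))
print(tm_lookup("Completely unrelated text."))  # None
```

In real CAT tools the translator sees the fuzzy-match percentage alongside the suggested translation and decides whether to accept, edit, or discard it.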

I Now Pronounce You Man and Machine

However, even with all the bells and whistles developers can equip an MT system with, is MT alone enough? Can MT alone produce the accurate, quality translations demanded by the clients of language services today? MT is part of the solution but doesn’t comprise the complete picture. It sounds counterintuitive, but MT is best paired with a professional translator as a means of optimizing the translation process.

This unlikely union broke the predictions of many who saw MT giving professional translators a run for their money and driving translation companies out of business. Professional translators work with CAT tools because they help them churn out more words than ever before and be more consistent. Why the need for speed? Domo’s latest report states that “2.5 quintillion bytes of data are created every single day”: that’s a lot of data, and most of it is not in English, which fuels the rising demand for translation services.

Also, by having a translator work together with an MT system, the translator is doing the MT system a favor as well by constantly feeding back revisions for the MT to learn from and render better outputs and suggestions. All in all, it’s a highly productive and beneficial two-way street between a translator and an MT system.

Of course, this ‘relationship’ will all be moot if the MT system wasn’t developed to a satisfactory standard. That being said, developers have to take into account both translation clients and the translators themselves.

They have to ensure that the MT system not only produces quality translations for clients but can also adapt to the needs of the translators using it. Being convenient to use and having a friendly UX design is one thing; being able to incorporate a translator’s input and accurately replicate it in similar contexts is another.

What Do Professional Translation Services Have Over MT?

Specifically, what can a translation company that hires professional translators do better than artificial intelligence (AI)? Apart from translation quality and consistency, a professional translator has one advantage: they’re human. It may sound cliché, but a human can understand nuances that no MT or AI can replicate; machines are light years away from doing so.

Unable to Understand Emotional, Cultural, and Social Nuances

As of now, no MT system is capable of accurately understanding jokes, slang, creative expressions, and so on. The abilities of MT shine brightly with formulaic sentences and predictable language conventions, but when confronted with the linguistic habits that are natural in everyday conversations, MT falls apart. This problem is even more pronounced at a global scale, since every culture and society has its own way of speaking, all the way down to highly distinct street lingo.

Unable to Process Linguistic Nuances

Parent languages are divided into regional vernaculars and dialects. When an MT system translates English to Spanish, the output is actually just generic Spanish with no local ‘flavoring’. If you’re aiming for translations that ring true to how people in Spain or Mexico actually speak, then a professional translator with native-speaking ability is who you need. No MT system today is able to comprehend, let alone reliably translate, these linguistic nuances.

Unable to Keep Up With Linguistic Trends

Languages change every day, with new words constantly being added to and removed from the lexicons of world languages. Humor, slang, and creative expressions are a testament to that notion. Social media has given rise to new creative expressions in ways human society has never experienced before, with meme culture one of the most notable examples. Even if NMT were somehow capable of keeping up, it would still need time for the data to accumulate before it could start translating. By that time, new slang would have already popped up.

Unable to Render Specialized and Highly Contextual Translations

What we mean by specialized here is text with highly nuanced terminology, such as literary works, as well as texts belonging to critical fields such as the legal, scientific, and medical sectors. Authors inherently embed their works with highly nuanced expressions and linguistic ‘anomalies’, so much so that there is no identifiable pattern any MT system can work with, since each author has their own voice.

The legal and medical sectors likewise have their own language conventions that, although they seem formulaic on the surface, involve inherently specialized terminology; the risk factor in these fields means no margin of error can be afforded to MT. There are MT systems used in these sectors, but they are always paired with a professional legal translator or professional medical translator.

Developing Your Own MT System

Even with the quality issues and other imperfections associated with MT, the demand for machine translation services keeps growing. According to a report published in MarketWatch, “The Global Machine Translation Market was valued at USD 550.46 million in 2019 and is expected to reach USD 1042.46 million by 2025, at a CAGR of 11.23% over the forecast period 2020 - 2025.”

However, many companies are looking to develop their own MT system instead of ‘borrowing’ one from an external provider, and for good reason. If a translation company renders plenty of niche translations in a given year, then building its own MT system is the most cost-effective investment, as there will be no need to pay licensing fees to external MT providers.

Many industries have their own language conventions and jargon, mostly in regard to internal communication. For example, legalese is perfectly comprehensible to lawyers but downright alien-sounding to those with little legal knowledge. Even individual businesses and organizations have language conventions that veer off from the industry norm. In that case, they would have to build their very own MT systems, especially if they’re focusing on specific target foreign markets and audiences.

So out of the 3 listed earlier, which one should you choose? It’s most likely SMT, given its popularity and how much support it gets. Some have gone for a hybrid MT that combines SMT and RBMT, but that’s probably too intimidating for first-timers. If you want to make the big leap right from the start, then, by all means, go NMT if it meets your company’s objectives.

Mind you, investing in and training any MT system comes at a price and will take time. It’ll take time for glossaries and translation memories to develop, provided the data used to feed the system is up to standard. For a translation company, that usually isn’t a problem, since its own document archives can be used in tandem with open-source parallel text corpora.

Can You Choose MT Over a Translation Company?

Back then, instant language translation belonged to the category of futuristic science-fiction gadgets. In fact, it still does today, although we’ve heightened our standards. What we dream of now is instant voice interpretation: being able to conduct a seamless multilingual conversation with anyone, without the awkward pauses. But let’s get back to reality. It’s hard not to be impressed with the abilities of MT today, since we can easily witness them from our smartphones.

Even so, there are plenty of flaws associated with MT, as discussed earlier, that are hindering it from achieving serious widespread adoption. Be that as it may, MT as it is now nevertheless has its own perks. Although one shouldn’t rely too heavily on MT beyond certain thresholds, that doesn’t mean you shouldn’t use it at all in specific situations. Here are some reasons why.


Free and Low-Cost

There are plenty of translation apps out there, such as Google Translate, as you might already know. Most are free, with premium subscriptions available to unlock more features. There are plenty of free translation plugins for website developers as well. Keep in mind that we’re talking about generic translators here, not the specialized MT systems from external providers that charge licensing fees.

Speed and Convenience

In specific situations, some people are just looking to have a translation the very moment they want it. Whether you’re a language student or a traveling businessperson, MT is your answer. It’s free, and you get results the moment you click the translate button. Even if it’s not 100% accurate, it at least gives you the implied meaning behind the text.

For Generic, Repetitive, and Well-Resourced Languages

*Consider this pointer at your own risk*. One can certainly find MT useful for non-contextual, predictable text such as simple, formulaic phrases. What you decide to do with the output is on you, whether you use it only as a reference or actually employ it in a professional setting. That being said, the best translations you can get are between well-resourced languages such as Spanish, German, French, etc. If you tried translating even a simple phrase from English to Chinese, you’d be unlikely to get a similarly accurate translation, since English and Chinese have vastly different language rules and an unrelated linguistic history.

A Note on Translation Quality in the Context of the Coronavirus Pandemic

Despite the vast improvements to MT, quality is still a significant issue, and as you’re aware, human translators are there to guarantee it. In no situation is quality more necessary than in global communication during a crisis, as made evident by the current coronavirus pandemic, specifically in the form of medical translation. Medical translation is a highly specialized, and critical, niche in translation, wherein the slightest mistranslation could lead to unfortunate and even fatal consequences.

Medical translation must be provided by specialized medical translators who have complete mastery over their language pair (e.g., English to Spanish, Spanish to English) and extensive familiarity with medical terminology, medical practices, and codes of ethics. They must undergo additional lengthy training before they can be classified as certified medical translators. That being said, are MT systems out of the picture?

There are MT systems that translate medical documents and medical research, but they must be under the constant supervision of a certified medical translator. Connecting this to today’s crisis, there has rarely been a time in recent history when the speedy translation of medical research has been more important. Medical scientists all over the world are working together to understand the virus behind COVID-19 so they can come up with viable treatments and, eventually, a vaccine. With that in mind, medical translation is the bridge making this level of coordination between medical scientists around the world possible.

Final Takeaway

Will there be a future where MT is so advanced and almost human-like that professional translators become an endangered species? Judging by the pace of MT’s development in such a short period, it would not be unreasonable to believe in a future like that. However, that view doesn’t pay enough attention to what is demanded of translation in the first place.

It’s apparent now that MT is good at serving translation speed and optimization needs, but as for quality, much of it still belongs in the hands, or should I say the mind, of a professional translator. That union will likely last for the next few decades. But let’s not hold ourselves to that prediction: perhaps a game-changing MT feature is just a few years away, or, if our prediction holds true, decades away. Still, that’s assuming our standards for translation, particularly on quality and human-ness, haven’t changed.

Author Bio:

Laurence Ian Sumando is a freelance writer penning pieces on business, marketing, languages, and culture.