Tuesday, January 31, 2017

The Driving Forces Behind MT Technology

This is a modified version of a post that was originally published on Caroline Alberoni's blog.


Machine translation (MT) today is as pervasive and ubiquitous as mobile phone technology. While some translators still feel threatened by the technology or feel the need to disparage it for its less-than-perfect translations, it is useful to understand why it is so widely used. At their annual developer conference in April 2016, Google announced that they are translating over 140 billion words a day across 100 languages. Baidu Translate covers 27 languages, a number that is growing, and processes around 100 million requests every day. Most of this use is from casual internet users who may be interested in translating a news story or some simple phrases. However, there is a growing impact on the professional translation business as well.

If you add the translation volume of Microsoft, Baidu, Yandex and other MT providers, we can certainly expect that more than 500 billion words a day are translated by computers on a daily basis today. This is probably more than 95%, and perhaps as much as 99%, of all the translation done on the planet each day.

As Peter Brantley at Berkeley states in a personal blog:
"Mass machine translation is not a translation of a work, per se, but it is rather, a liberation of the constraints of language in the discovery of knowledge."

The need for translation of business content and other kinds of high-value information on the internet continues to grow, but the increasing use of MT also causes changes that affect translators and agencies alike. The most interesting translation work is increasingly moving beyond the focus of traditional translation work and is likely to do so even more in the future. Thus, the most lucrative and interesting new business translation opportunities, such as those at eBay, may require very different kinds of skills but would still draw on traditional translation and linguistic competence. Translators and linguists today are often required to be “word corpus analysts” and are increasingly involved in projects to steer MT technology to produce better results.
The professional use of MT is increasingly valid for all of the following:
  • Highly repetitive content where productivity gains with MT can dramatically exceed what is possible with just using TM alone
  • Content that would just not get translated otherwise
  • Content that cannot afford human translation
  • High-value content that is changing every hour and every day
  • Knowledge content that facilitates and enhances the global spread of critical knowledge
  • Content that is created to enhance and accelerate communication with global customers who prefer a self-service model
  • Content that does not need to be perfect but just approximately understandable
The forces that drive the increasing use of MT in the world are largely beyond the control of the professional “translation industry” and continue to build unabated. They can be briefly listed as follows:
  • The Explosion of Content Creation: The sheer volume of content that global enterprises, entertainment agencies, educational establishments, governmental agencies and any international commercial venture need to translate continues to grow by the minute. The amount of digital information increases tenfold every five years! In fact, it can even be said that we live in an age where more information is being created annually than has existed in the 500 years prior.
  • The Changing Content Value Equation: While historically corporate marketing communications had a great degree of control, today most consumers distrust this kind of messaging and would rather trust the shared opinions of fellow consumers. Business content increasingly has a very short shelf-life, and thus traditional (slow and expensive) TEP (translate-edit-proof) approaches are increasingly questioned for information that may have little or no value after six months. In fact, the fastest growing type of content is user-generated content (UGC), which is found in blogs, Facebook, YouTube, Twitter and community forums. It is estimated by IDC that 70% of the content on the web is UGC and much of that is very pertinent and useful to enterprises. This content is now influencing consumer behavior all over the world and is often referred to as word-of-mouth marketing (WOMM). Consumer reviews are often more trusted than “corporate marketing-speak” and even “expert” reviews, which are often funded by the same corporations. We have all experienced Amazon, travel sites, C-Net and other user rating sites. It is useful for both global consumers and global enterprises to make this content multilingual. Given the volume of this type of content and the sheer rate at which it is created, MT has to be part of the translation solution.

How much data is generated every minute? 

A case in point: The world’s largest travel review platform, TripAdvisor, receives 315+ million monthly unique visitors on its website, many of whom contribute reviews. The combined weight of these reviews is considerable, and it influences consumer decision making on final purchase selections to a very great extent. Having translations available in multiple languages online to support a purchase decision greatly enhances the possibility of a global consumer executing a transaction on the site.

Another very descriptive example by Juan Rowda at eBay:

"There are currently more than 800 million listings on eBay (over 1 Billion as of this writing). Considering that each listing has around 300 words, how long do you think it would take any given number of linguists to translate these listings? Did I mention that some of the listings may only be online for a day or a week and that the inventory changes continuously?

So, don’t even pull out your calculator. The answer is simple – human translation is not viable. However, if you really want to know, we estimate it would take 1,000 translators 5 years to translate only the 60 million listings eligible for Russia! For (these) listings, machine translation is clearly a much better fit in this scenario."
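As a rough check on the scale in the quote above, the figures it gives (60 million listings, around 300 words each, 1,000 translators, 5 years) can be turned into an implied workload per translator. This is only a sketch of the arithmetic, not eBay's own calculation: the 250 working days per year is an assumption that the quote does not state, and the variable names are purely illustrative.

```python
# Figures taken from the quote above.
LISTINGS = 60_000_000        # listings eligible for Russia
WORDS_PER_LISTING = 300      # "around 300 words" per listing
TRANSLATORS = 1_000
YEARS = 5

# Assumption (not in the quote): working days per translator per year.
WORKING_DAYS_PER_YEAR = 250

# Total volume of text to translate.
total_words = LISTINGS * WORDS_PER_LISTING

# Total effort available, in translator-years.
translator_years = TRANSLATORS * YEARS

# Implied output each translator would need to sustain.
words_per_translator_year = total_words / translator_years
words_per_translator_day = words_per_translator_year / WORKING_DAYS_PER_YEAR

print(f"Total words: {total_words:,}")
print(f"Words per translator per year: {words_per_translator_year:,.0f}")
print(f"Words per translator per working day: {words_per_translator_day:,.0f}")
```

Under these assumptions the Russian listings alone amount to 18 billion words, and the estimate implies 3.6 million words per translator per year. Changing the assumed number of working days changes only the per-day figure, not the overall conclusion about scale.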
  • Short Product Life and Development Cycles: The product life cycles in electronics, fashion, and many other consumer products get shorter all the time, so rapid, “good enough” product descriptions are increasingly considered sufficient for business requirements. The historical translation quality assurance cycles practiced in the 80’s and 90’s are not viable today as they simply could not keep pace with the rate of new product introduction.
  • Continuously Increasing Volume & Managed Cost Pressures: Enterprises are under continuous pressure to translate more content with the same budgets, and thus they seek out translation agencies who understand how to do this with rapid turnaround. Competent use of MT is a critical element of redefining the cost-time-volume equation for translating ever-growing volumes of relevant business content, especially given the extremely transient nature of a lot of this information.
  • Changing Internet User Base: As more of the developing world comes online it becomes imperative for these new users to have MT to be able to get some basic understanding of existing web content, especially knowledge content. The need is clear not only to global eCommerce sites but also to many local government agencies around the world who need to provide basic health and justice information and services to a growing population of immigrants who may not speak the dominant local language.
  • Widespread Acceptance of Free Generic Machine Translation: The universal availability and widespread use of “free MT” on the internet have raised acceptance of MT in executive management circles too. This also drives the momentum for large new types of projects that would never have been considered in the TEP translation world. The fact that 500+ billion words a day are being translated by MT is a clear indication that it delivers some value to hundreds of millions of internet users. As MT quality continues to improve, albeit slowly, it puts further downward pressure on the price of translation work. It can also be said that for many languages MT has become an aid for translators, as it can function as a dictionary, terminology or phrase lookup system.
Thus it is safe to presume that MT is going to be a fact of life for many professional translators in the 21st century. What new skills, then, would a translator need in order to be considered a valued partner in a world where MT deployment and “opportunities” will continue to abound?

MT today has already proven itself in professional use scenarios with many Romance languages, but we are still at a transition point in the use of MT in many other language combinations. The MT experience can thus often be less than satisfying for translators in those other languages, especially when working with translation agencies who are not technically competent with MT.

The New Skills in Demand

At a high level, the skills that matter in the professional use of MT, and that we can expect to grow in value to global enterprises and agencies involved in large MT projects, are as follows.
  • Understand the different kinds of MT systems that you would interface with. Translators that understand the different kinds of MT are likely to be much more marketable.
  • Understand the specific output quality of the MT engines that you are working with. Provide articulate linguistic feedback on MT output. Being able to provide articulate feedback on error patterns is perhaps one of the most sought-after skills in professional MT deployment today. This ability to assess the quality of MT output is also beneficial to a freelancer who is trying to decide whether or not to work on a post-editing MT (PEMT) project.
  • Develop skills with new kinds of tools that are valuable in dealing with corpus-level tasks and manipulations. MT projects are much more likely to involve very large volumes of data, so data preparation and global pattern modification skills become much more useful and valuable.
  • Develop skills in providing pattern-level feedback and in rapid error pattern identification and correction. Being able to devise rapidly implementable test and evaluation routines that are useful and effective is an urgent market requirement. This paper summarizes the specific linguistic issues with Brazilian Portuguese that provide an idea of what this actually means.
  • Develop a corpus view that involves linguistic steering rather than segment level corrections. This is a fundamental change of mental perspective that is a mandatory requirement for successful professional involvement with MT. Understanding the competence of the translation agencies that you engage with is also a key requirement as it is VERY easy to mismanage an MT project and most translation agencies that attempt to build MT engines on their own are quite likely to be incompetent.

What can a translator do?

  1. Learn and educate yourself on the variants of MT.
  2. Experiment with major public engines from Google, Systran, and Bing and with specialist tools like Lilt, SDL Adaptive MT and SmartCAT that allow easy interaction with MT.
  3. Understand how to rapidly assess MT output quality BEFORE you engage in any MT project.
  4. Don’t work with incompetent translation agencies who know little or nothing about MT but only seek to reduce rates with crappy do-it-yourself engines.
  5. Experiment with corpus management tools.

While it is quite possible that MT will never be quite good enough to be used for the translation of literary work and poetry, where linguistic finesse and deep semantic insight are essential, it is clear in 2017 that MT has a definite role in making much more information multilingual in the global enterprise and in any international communication. MT technology has evolved over the years and is now beginning to use a new development methodology based on neural networks that are loosely modeled on those in the human brain. Early results of this Neural Machine Translation are clearly better than the current technology. We are in a period of inflated expectations of what is possible, but there is reason for optimism, and I think we should expect that MT will become even more universal and widely used in the years to come.


  1. >we can certainly expect that more than 500 billion words a day are translated by computers on a daily basis today

    500 billion words a day for, say, 3.5 billion Internet users, seems to be under 150 words a day per user. Doesn't sound very impressive. Am I missing something?


    1. Yes indeed you are missing some contextual facts.

      The most detailed data on this MT usage is provided by Google, and as recently as April 2016 they told us that they translate 143B words a day across 100+ languages. They provided a pretty complete description of the usage in this blog post:

      While many people will translate whole web pages or documents – the large majority (including translators) only translate words or phrases. E.g. “In addition to common phrases like “I love you,” we also see people looking for translations related to current events and trends. For instance, last year we saw a big spike in translations for the word "selfie,” and this past week, translations for "purple rain" spiked by more than 25,000 percent.”

      Also “The most common translations are between English and Spanish, Arabic, Russian, Portuguese and Indonesian,” the graphic shows by balloon size the most active languages.

      “Ninety-two percent of our translations come from outside of the United States, with Brazil topping the list.”

      The next largest public MT service is Microsoft who have a more active business use profile from all that I can gather. But in my estimates they have a lower total word volume than Google.

      After Microsoft is Baidu who has 100M requests a day (vs 500M for Google) – I expect that the overall profile is the same.

      Then I add Yandex, Naver, Yahoo, SYSTRAN and the smaller MT vendors who may also have a public portal to these giant use sites, e.g. SDL does 20B words a month but all in a professional business use setting. In my estimation the total is about 500B words/day, but I could quite possibly be too low by as much as 100B or more.

      And maybe only a third or less of the total online population actually translates – it does not make sense that everybody online is translating. So the average is higher but will be very bi-modal (i.e. a large number of users translate 10 words or less, and a large number translate 300 words or more).
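The per-user averages discussed in this exchange can be checked with a few lines of Python. This is a minimal sketch that uses only the figures mentioned above (about 500B words a day, about 3.5 billion internet users, and the guess that perhaps a third of them actually translate). The function name is illustrative, and the one-third share is an estimate rather than measured data.

```python
def words_per_user(total_words_per_day: float, users: float) -> float:
    """Average number of machine-translated words per user per day."""
    return total_words_per_day / users


TOTAL_WORDS_PER_DAY = 500e9   # estimated total MT volume per day
INTERNET_USERS = 3.5e9        # figure used in the question above
ACTIVE_SHARE = 1 / 3          # rough guess: share of users who translate at all

# Average over everyone online (the question's calculation).
avg_all_users = words_per_user(TOTAL_WORDS_PER_DAY, INTERNET_USERS)

# Average over only those users who actually translate.
avg_active_users = words_per_user(
    TOTAL_WORDS_PER_DAY, INTERNET_USERS * ACTIVE_SHARE
)

print(f"Average over all users: {avg_all_users:.0f} words/day")
print(f"Average over active users: {avg_active_users:.0f} words/day")
```

This gives roughly 143 words a day across all users and roughly 429 words a day across active users. An average says nothing about the shape of the distribution, which is why the bi-modal pattern matters more than either number.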

  2. Apples and oranges?

    I don't doubt the data per se. What intrigues me is the relevance and validity of MT vs. HT claims. Is what is being reported a function of MT progress, or simply a matter of accessibility/user-friendly interface?

    Let's look 50 (or 500) years back… These flavors of "translation" have been around pretty much forever.

    1. Non-human translation: dictionary/phrasebook lookups.
    2. Self-translation: trying to produce something intelligible using 1.
    3. Ad-hoc (shop owner's kid/relative/passerby).
    4. "Professional"/for-hire: someone whose main duty is to translate/interpret for a fee.

    (Note that 2, 3 and 4 may have involved 1, but back then we had no data to go by).

    Is MT really a game-changer (if we look at a bigger picture)?