Friday, January 22, 2010

Why Machine Translation Matters

"I have not failed. I've just found 10,000 ways that won't work." - Thomas Edison

After more than fifty years of eMpTy promises and repeated failures, amazingly, interest in machine translation continues to grow. It is still something that almost everybody hopes will work someday. We just won’t give up. Why? How can an industry that fails to deliver for 50 years still be around? Clearly, MT is a difficult problem, but I think the main reason that we persist is that there is a huge thirst for information, data and knowledge that exists across language barriers. The growing volume of valuable information on the internet only makes this thirst more urgent.

Is automated translation finally ready to deliver on its promise? What are the issues with this technology and what will it take to make it work? I would like to provide my perspective on why it matters and why it is important that we continue in our quest to make it work better.

In the professional translation world there is much skepticism about MT, and we see MT regularly being trashed in translator and LSP blogs, forums and conversations at conferences. Many dismiss it entirely as a foolish and pointless quest, based on what they see on Google and other free online translation portals. Very few understand or have ever seen the potential that carefully tuned and customized MT systems suggest. There are a few who have begun to understand that MT is an imperative that will not go away, and they step tentatively forward. I am happy to see some wholeheartedly embrace it and try to learn how to use it skillfully to develop long-term competitive advantage.

For some professionals there is a debate about whether Rule-based MT (RbMT) or Statistical Machine Translation (SMT) is better, and of late it has become very fashionable to claim that the "right" approach is hybrid. Industry giants (Google, Microsoft, IBM) are all very focused on SMT with increasingly greater linguistic variations, and there is a healthy open source movement underlying this (SMT) technology that is spawning innovative new companies. My company, Asia Online, is, I think, one of the bright lights on the horizon.

My personal interest in MT is driven by a conviction that it can truly be an instrument to bring positive change in the world. It is possible that using MT to ease access to critical knowledge could revolutionize and rapidly accelerate the development of many of the world’s poorest communities. I don’t think it is an exaggeration to say that “good” MT could help to improve the lives of millions in the coming years. And thus I feel that improving the quality of MT is a problem worthy of the attention of the best minds on the planet. I also think that getting the professional industry engaged with the technology is key to rapidly driving the quality of MT systems higher, and perhaps to reaching a tipping point where it enables all kinds of valuable information to rapidly become multilingual. My sense is that MT needs to earn the respect of professionals to really build a quality momentum and make the breakthroughs that so many of us yearn for.

The Increasing Velocity of Information Creation
We live in a world where knowledge is power and access to information, many say, has become a human right. In 2006, the amount of digital information created, captured, and replicated was 1,288 x 10^18 bits. In computer parlance, that's 161 exabytes or 161 billion gigabytes …

This is about 3 million times the information in all the books ever written!
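
For readers who want to check the arithmetic behind the 2006 figure, here is a minimal Python sketch of the unit conversion. It assumes decimal (SI) units, with 1 exabyte = 10^18 bytes and 1 gigabyte = 10^9 bytes; the constant and function names are my own and purely illustrative.

```python
# Assumption: decimal (SI) units, as is usual in storage-industry reports.
BITS_PER_BYTE = 8
BYTES_PER_EXABYTE = 10**18
BYTES_PER_GIGABYTE = 10**9


def bits_to_bytes(bits: int) -> int:
    """Convert a bit count to whole bytes."""
    return bits // BITS_PER_BYTE


# 2006 figure quoted in the post: 1,288 x 10^18 bits.
bits_2006 = 1288 * 10**18
bytes_2006 = bits_to_bytes(bits_2006)

exabytes_2006 = bytes_2006 // BYTES_PER_EXABYTE
gigabytes_2006 = bytes_2006 // BYTES_PER_GIGABYTE

print(exabytes_2006)   # 161
print(gigabytes_2006)  # 161000000000, i.e. 161 billion gigabytes
```

Integer arithmetic is used throughout so that the result is exact rather than subject to floating-point rounding.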

Between 2006 and 2010, the information added annually to the digital universe will increase more than sixfold, from 161 exabytes to 988 exabytes. In 2007 it was already 281 exabytes. It is likely that the bulk of this new information will originate in just a few key languages of the digitally privileged, knowledge-driven economies. So are we heading into a global digital divide in the not so distant future? The famous Berkeley study on How Much Information testifies to this huge momentum. A recent update to the study suggests that US households consumed approximately 3.6 zettabytes of information in 2008. Access to information is closely linked to prosperity and economic well-being as shown below.

Peter Brantley at Berkeley in a personal blog quotes Zuckerman's wonderful essay:
“For the Internet to fulfill its most ambitious promises, we need to recognize translation as one of the core challenges to an open, shared and collectively governed internet. Many of us share a vision of the Internet as a place where the good ideas of any person, in any country, can influence thought and opinion around the world. This vision can only be realized if we accept the challenge of a polyglot internet and build tools and systems to bridge and translate between the hundreds of languages represented online.”
Brantley goes on to say:
"Mass machine translation is not a translation of a work, per se, but it is rather, a liberation of the constraints of language in the discovery of knowledge."
Today, the world faces a new kind of poverty. While we in the West face a glut of information, much of the world faces information poverty. The cost of this can be high. “80% of the premature deaths in the developing world are due to lack of information” according to the University of Limerick President Prof. Don Barry. Much of the world’s knowledge is created and remains in a handful of languages, inaccessible to most who don’t speak these languages. Asia Online conducted a survey of local content available in SE Asian languages, and found that China and Japan each had 120X more content, and English speakers have perhaps 600X more content available to them than the billion people in the SEA region. Access to knowledge is one of the keys to economic prosperity. Automated translation is one of those technologies that offers a way to reduce the digital divide and raise living standards across the world. As imperfect as it is, this technology may even be the key to real people-to-people contact across the globe.

The seminal essay The Polyglot Internet by Ethan Zuckerman has got to be the most eloquent justification for why translation technology and collaborative processes must and will improve.  It has become the inspiration and manifesto for the Open Translation Tools Summit.
While there is profound need to continue improving machine translation, we also need to focus on enabling and empowering human translators. Professional translation continues to be the gold standard for the translation of critical documents. But these methods are too expensive to be used by web surfers simply interested in understanding what peers in China or Colombia are discussing and participating in these discussions.
The polyglot internet demands that we explore the possibility and power of distributed human translation.
We are at the very early stages of the emergence of a new model for translation of online content – “peer production” models of translation.

Visionaries like Vint Cerf also point this out in a recent interview. Ray Kurzweil has spoken on the transformational potential that this technology could have on the world. Bill Gates has commented many times on the potential of MT to help unlock knowledge, both for emerging countries and for those who do not speak English. The Asia Online project is focused on breaking the language barriers for knowledge content using a combination of automated translation and crowdsourcing. Much of the English Wikipedia is intended to be translated into several content-starved Asian languages using hybrid SMT and crowdsourcing. Meedan is yet another example of how SMT and a community can work together to translate interesting content quickly at high quality levels to share information. There are many more.

While stories of MT mishaps and mistranslations abound (we all know how easy it is to make MT look bad), it is becoming increasingly apparent to many that it is important to learn how to use and extend the capabilities of this technology successfully. While MT is unlikely to replace human beings in any application where quality is really important, there are a growing number of cases that show that MT is suitable for:

· Highly repetitive content where productivity gains with MT can dramatically exceed what is possible using TM alone
· Content that would just not get translated otherwise
· Content that cannot afford human translation
· High value content that is changing every hour and every day
· Knowledge content that facilitates and enhances the global spread of critical knowledge
· Content that is created to enhance and accelerate communication with global customers who prefer a self-service model
· Content that does not need to be perfect but just approximately understandable

The forces that drive the interest in this technology continue to build momentum. Disruption is coming and much of the momentum is from outside the professional industry. I believe there is an opportunity for the professional translation industry to lead, and to develop and demonstrate best practice models that others will follow and emulate. Some may even learn to build competitive advantage from their use and superior understanding of how to leverage MT in professional projects.

I invite those interested in a productive professional dialogue to join the Automated Language Translation group on LinkedIn and explore how to use this technology to professional advantage. I think we will continue to see more companies learn to use MT technology, and I look forward to changing the archaic translation model that rules today for large and massive scale translation projects.

So here you finally have a real MT-focused blog entry.


  1. I'm one of these skeptics, Kirti. However, I'll have to agree with most of your bullet list. There is certainly a place for 'good' MT in the translation ecosystem. I remember years ago hearing from a database geek how he was involved with saving lives in Vietnam by teaching hospitals how to develop information systems to track patient visits or localize problems associated with bad water sources. He stated that lack of information on IT procedures was a major barrier there. If MT can put a monolingual programmer in Vietnam on the path to a solution sooner by opening up the Microsoft Knowledge Base or help some village elder figure out how to sterilize water, hooray for all sides.

    However, what I have seen more of is an attempt to use MT in places for which it is not suited: critical support information, safety-related documentation for the Western world where liability is a real issue, etc. Even bright minds like Lou Cremers, who do excellent work in the field, talk of the need for controlled language use, but the reality I see is that even specialist documentation consultancies for whom I translate do at best a mediocre job of "controlling" their use of language. And as anyone who has edited a lot of "near misses" (close fuzzy matches, plausible MT results that just happen to be wrong) can honestly tell you, there is less time saved with MT + post-editing than is assumed, and under workhouse conditions serious errors are inevitable.

    I personally would be thrilled if MT could handle 90% of my workload for free. But I fear for the safety and liability of those who use the results that would be delivered today. The unfortunate thing is that the general public does not understand the issues and runs enormous risks by placing faith in a false god.

  2. Kevin

    I agree that MT does not belong everywhere, especially to translate instructions for equipment that can cause bodily injury if misused.

    I actually see that wherever quality and accuracy are really critical, it is necessary for humans to be involved. The higher the quality required the greater the need for human validation.

    This also includes things like contracts, key marketing messages or wherever linguistic finesse is required.

    I think that like TM, it will become more of a regular feature and will be able to boost productivity IF properly used.

    It is important not to mistake the general purpose MT one sees on Google and Systran for the only possibilities with the technology.

    It will be important to understand where it is useful and what needs to be done to make it work better (customization and tuning) and raise the quality to a point that it is actually useful to translators.


  3. Hi guys,

    Sorry, but I do not agree that MT cannot be used for contracts or for marketing. I've done it. Never lost any corporate or mass-market customers from it.

    topic: Poll: Do you use Machine Translation tools for professional purposes?
    post title: yes indeed have used for such tasks

    And the overall time of dictionary building and/or MT postediting was much less than the standard translation speed statistics which are 184-420 words per hour (1472-3360 words per day) without any CAT tools, stated in survey:
    Translation speed versus content management. In Multilingual Computing and Technology, Number 62, March 2004.