
Saturday, February 27, 2010

Rule-based MT vs. Statistical MT: Does it Matter?

One of the current debates in the MT community is RbMT vs. SMT. While I do have a clear bias that favors SMT, I have tried to be fair and have written on this subject many times. I agree that it cannot be said that one approach is definitely ALWAYS better than the other; there are many successful uses of both. In fact, at this point in time there may be more examples of RbMT successes, simply because it has been around longer.

However, there is clear evidence that SMT continues to gain momentum and is increasingly the preferred approach. RbMT has been around for 50 years, and the RbMT engines we see today are in many cases the result of decades of investment and research. SMT, by contrast, is barely five years old in terms of commercial availability; Kevin Knight began his research at USC in 2000, and the technology is only just beginning to establish itself in the market.
[Image: RbMT vs SMT]
The people best suited to answer the question of which approach is better are those who have explored both the RbMT and SMT paradigms deeply, to solve the same problem. Unfortunately, there are very few such people around. The only ones I know of for certain are the Google Translate and Microsoft Live Translate teams, and both have voted in favor of SMT.
Today, RbMT still makes sense when you have very little data, or when you already have a good foundation rules engine in place that has been tested and is a good starting point for customization. Some say RbMT also performs better on language pairs with very large structural and morphological differences; combinations like English <> Japanese, Russian, Hungarian and Korean still often seem to do better with RbMT. It is also claimed by some that RbMT systems are more stable and reliable than SMT systems. I think this is probably true of SMT systems built from web-scraped or dirty data, but the story with clean data is quite different: SMT systems built with clean data are stable, reliable and much more responsive to small amounts of corrective feedback.

What most people still overlook is that the free online engines are not a good representation of the best output possible with MT today. The best systems come after focused customization efforts, and the best examples of both RbMT and SMT are carefully customized, in-domain systems built for very specific enterprise needs rather than for general web user translation.

It has also become very fashionable to use the word “hybrid” of late. For many this means using both RbMT and SMT at the same time, but that is more easily said than done. From my viewpoint, characterizing the new Systran system as a hybrid engine is misleading: it is an RbMT engine that applies a statistical post-process to the RbMT output to improve fluency. Fluency has always been a problem for RbMT, and this post-process is an attempt to improve the quality of the raw RbMT output, so it is not a true hybrid from my point of view. In the same way, linguistics is being added to SMT engines in different ways to handle issues like word order and dramatically different morphology, which have been problems for purely data-based SMT approaches. I think most of us agree that statistics, data and linguistics (rules and concepts) are all necessary to get better results, but there are no true hybrids out there today.
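To make the idea of a statistical post-process more concrete, here is a minimal sketch of the general technique, not Systran's actual implementation: a hypothetical rule_based_translate() produces several candidate renderings, and a small bigram language model built from target-language text picks the most fluent one.

```python
# Minimal sketch: statistical post-processing of RbMT output.
# rule_based_translate() is a hypothetical stand-in for an RbMT engine
# that can emit several candidate renderings of one source sentence.
import math
from collections import Counter


def train_bigram_lm(tokenized_sentences):
    """Count unigrams and bigrams from tokenized target-language sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in tokenized_sentences:
        tokens = ["<s>"] + tokens + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def fluency_score(tokens, unigrams, bigrams):
    """Add-one smoothed bigram log-probability: higher means more fluent."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + tokens + ["</s>"]
    score = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        score += math.log((bigrams[(prev, word)] + 1.0) /
                          (unigrams[prev] + vocab_size))
    return score


def statistical_post_process(source, rule_based_translate, unigrams, bigrams):
    """Return the RbMT candidate that the language model finds most fluent."""
    candidates = rule_based_translate(source)  # list of token lists
    return max(candidates,
               key=lambda cand: fluency_score(cand, unigrams, bigrams))
```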
[Image: RbMT vs SMT comparison table]
I would also like to present my case for the emerging dominance of SMT with some data that I think most of us can agree is factual, and not just a matter of my opinion.
Fact 1: Google used the Systran RbMT system as its translation engine for many years before switching to SMT. The Google engines are general-purpose baseline systems (i.e. not domain-focused). Most people will agree that Google compares favorably with Babelfish, which is an RbMT engine. I am told they switched because they saw a better quality future and continuing evolution with SMT, which CONTINUES TO IMPROVE as more data becomes available and corrective feedback is provided. Most people agree that the Google engines have continued to improve since the switch to SMT.
Fact 2: Most of the widely used RbMT systems have been developed over many years (decades in some cases), while none of the SMT systems are much over 5 years old; they are still in their infancy in 2010.
Fact 3: Microsoft also switched from a Systran RbMT engine to an SMT approach for all of its public translation engines in the MSN Live portal, I presume for reasons similar to Google's. Microsoft also uses a largely SMT-based approach to translate the millions of words in its knowledge bases into 9 languages, which is perhaps the most widely used corporate MT application in the world today. The Microsoft quality also continues to improve.
Fact 4: Worldlingo switched from an RbMT foundation to SMT to get broader language coverage and to attempt to reverse a loss of traffic (mostly to Google).
Fact 5: SMT providers have been able to easily outstrip RbMT providers in terms of language coverage, and we are only at the beginning of this trend. Google had a base of 25 languages while it was RbMT-based, but now has over 45 languages that can each be translated into any of the others, which apparently yields over 1,000 language combinations with their SMT engines.
Fact 6: The Moses Open Source SMT training system has been downloaded over 4,000 times in the last year. TAUS considers it “the most accessed MT system in the world today.” Many new initiatives are coming forth from this exploration of SMT by the open source community and we have not yet really seen the impact of this in the marketplace.

Google and Microsoft have placed their bets. Even IBM, which still has a legacy RbMT offering, has their Arabic and Chinese speech systems linked to an SMT engine that they have developed. So now, we have three of the largest IT companies in the world focused on SMT-based approaches. 

However, this is perhaps only relevant for the free public online engines. Many of us know that customized, in-domain systems are different and, for enterprise use, are the kind of systems that matter most. How easy is it to customize an SMT vs. an RbMT engine?
[Image: Which is better?]
Fact 7: Callison-Burch, Koehn et al. have published a paper (funded by Euromatrix) comparing engines for 6 European languages, both as baselines and after domain tuning with TM data for SMT and dictionaries for RbMT. They found that Czech, French, Spanish and German to English all had better in-domain results with SMT; only the English>German domain-focused systems had better results with RbMT. They did find, however, that RbMT often had better baselines than their own SMT systems, since the researchers do not have the data resources of Google or Microsoft, whose baseline systems are much better.
Fact 8: Asia Online has been involved with patent-domain systems in Chinese and Japanese. We have produced higher quality translations than RbMT systems that had been carefully developed with almost a decade of dictionary and rules tuning. The SMT systems were built over 3-6 months and will continue to improve. It should be noted that in both cases Asia Online uses linguistic rules in addition to raw data-based SMT engine development.
Fact 9: The intellectual investment from the computational linguistics and NLP community is heavily biased towards SMT, perhaps by as much as a factor of 10X. This can be verified by looking at the focus of the major MT conferences of the recent past and of 2010. I suspect this will mean continued advances in the quality of SMT-based approaches.

Some of my personal bias and general opinion on this issue:
-- If you have a lot of matching bilingual phrase pairs (100K+) you should try SMT, and in most cases you will get better results than with RbMT, especially if you spend some time providing corrective feedback in an environment like Asia Online. I think man-machine collaboration is much more easily engineered in SMT frameworks: corrective feedback can be immediately useful and can lift engine quality very quickly. (A sketch of turning translation memory into SMT training data follows this list.)
-- SMT systems will continue to improve as long as you have clean data foundations, continue to provide corrective feedback, and retrain these systems periodically after “teaching” them what they are getting wrong.
-- SMT will win the English to German quality game in the next 3 years or sooner.
-- SMT will become the preferred approach for most of the new high value markets like Brazilian Portuguese, Chinese, Indic Languages, Indonesian, Thai, Malaysian and major African markets.
-- SMT will continue to improve significantly in the future because open source, academic research, growing data on the web and crowdsourced feedback are all at play with this technology.
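As a sketch of the first point above, here is one way to turn a translation memory export into the parallel text files that SMT training tools expect. This is a minimal illustration, and the TMX file name and language codes are assumptions; inline markup within segments is ignored.

```python
# Minimal sketch: extract aligned segments from a TMX translation memory
# into two plain-text files (one segment per line) for SMT training.
# File names and language codes here are illustrative assumptions.
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def extract_parallel_corpus(tmx_path, src_lang, tgt_lang, src_out, tgt_out):
    """Write source/target segment pairs from a TMX file to parallel files."""
    tree = ET.parse(tmx_path)
    pairs = 0
    with open(src_out, "w", encoding="utf-8") as fs, \
         open(tgt_out, "w", encoding="utf-8") as ft:
        for tu in tree.iter("tu"):
            segs = {}
            for tuv in tu.iter("tuv"):
                lang = tuv.get(XML_LANG) or tuv.get("lang", "")
                seg = tuv.find("seg")
                # seg.text ignores any inline markup inside the segment
                if seg is not None and seg.text:
                    segs[lang.lower()[:2]] = " ".join(seg.text.split())
            if src_lang in segs and tgt_lang in segs:
                fs.write(segs[src_lang] + "\n")
                ft.write(segs[tgt_lang] + "\n")
                pairs += 1
    return pairs


if __name__ == "__main__":
    n = extract_parallel_corpus("memory.tmx", "en", "de",
                                "corpus.en", "corpus.de")
    print(f"Extracted {n} segment pairs")
```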

SMT systems will improve as more data becomes available, as bad data is removed, and as the pre- and post-processing technologies around these systems improve. I also suspect that future systems will be some variation of SMT + linguistics (which includes rules) rather than data-only approaches. I also see that humans will be essential to driving the technology forward, and that some in the professional industry will be at the helm, as they do in fact understand how to manage large-scale translation projects better than most.

I have also covered this in some detail in a white paper that can be found on the L10NCafe or on my LinkedIn profile, and there is much discussion of this subject in the Automated Language Translation group on LinkedIn, where you can also read the views of others with differing opinions. I recommend the entries from Jordi Carrera in particular, as he is an eloquent and articulate voice for RbMT technology. One of the best MT systems I know of is an RbMT system at PAHO that combines source analysis and cleanup with integrated, largely automated post-editing. The overall process flow is what makes it great, not the fact that it is based on RbMT.
So does it matter which approach you use? If you have a satisfactory, working RbMT engine, there is probably no reason to change. I would suggest that SMT makes more sense for most long-term initiatives where you want to see the system continually improve. Remember, in the end the real objective is to get high volumes of content translated faster and as accurately as possible, and both approaches can work with the right expertise, even though I do prefer SMT and believe that it will dominate in the future.

Tuesday, February 23, 2010

Translation As a Force of Change

One of the things that I have always found interesting about the world of translation is that, apart from facilitating global commerce, it also has the potential to break down walls between cultures and to improve people's lives as information starts to flow more freely across languages. Poverty and lack of information are often very closely correlated. This is really powerful, but for the most part the professionals focus on documentation and content that is necessary but not considered especially high value. So who does the world-changing stuff?

I tend to think that translation, collaboration and automation are closely related and that great things are possible as these key elements line up.

I wanted to point out some examples of this power already at work. I noticed several articles yesterday on Meedan, an online meeting place for English and Arabic speakers to share viewpoints. It is a non-profit service that hopes to foster greater understanding and tolerance by translating content from the Arabic media into English and vice versa. The site uses machine translation and a community that helps clean up the MT output, and it already makes 3 million words of translation memory available to enable continuing leverage and encourage new English <> Arabic translation efforts. I met George Weyman last year and am very happy to see this initiative grow in strength. Apart from being a peacemaker and bridge-builder, George is also a fine tin flute player, as this video (starting at 2:28) of an impromptu music jam in an Irish bar shows; I joined them by drumming on the table. In time, I would not be surprised to find that Meedan becomes a model for building dialog elsewhere in the world.

This meeting happened at the AGIS conference, which was focused on building a community and collaboration platform to launch initiatives against information poverty and bring translation assistance to humanitarian causes. Like the Open Translation Tools conference I wrote about earlier, these are fledgling movements that are growing in strength. I would not be surprised to see initiatives like The Rosetta Foundation become a source of more compelling innovation in translation than companies like SDL and other professional industry “leaders”. Collaboration, automation, MT, community management and open source were the focus at AGIS, in contrast to the same localization themes we see repeated endlessly at the larger industry conferences. I would bet that revolution is more likely to come from the hungry, motivated, “world-changing” mindsets I saw at AGIS than from the professionals, reeling under cost cutting from buyers, that we usually see at the major localization conferences. My sense is that people who feel awe can make shit happen.

Recently we also saw the power of collaboration and focused community efforts in Haiti. The following are just a few examples:
Language Lifelines: describes a variety of language industry initiatives to support relief efforts.
GALA set up a site to coordinate language-related efforts, and Jeff Allen resurrected data that he had worked on at CMU to help Microsoft, Google and others develop MT solutions that might prove useful to the reconstruction effort.

I was also drawn into a vision that Dion Wiggins, CEO of Asia Online, has of translating mostly educational open source content into several South East Asian languages to address the information poverty in the region. Again, the foundation of the effort is an automated translation platform together with community collaboration and high value content. While this project still has a long way to go, the initial efforts are proof that the concept can work. There is a growing belief that access to information and knowledge not only improves the lives of those who have access, but also creates commercial opportunity as more people come online.

We are also seeing that community members (the crowd) can step up to engage in translation projects, sometimes on a very large scale. While Facebook gets a lot of press, I think it is the least interesting of these initiatives as it only focuses on L10N content, which probably was best done by professionals anyway. They did prove, however, that using crowds is a good way to rapidly expand language coverage and your global customer base. If they actually extend this to user content, I think Facebook could become a major force in translation. And again, in this case a management and collaboration infrastructure platform was necessary to enable and manage crowd contributions. I cannot see them extending the translation effort to the real user content without engaging machine translation in the process and flow. Many IT companies, including Adobe, EMC and Intel, have also started to explore crowdsourcing and will expand language coverage this way. The professional translation industry should take note that this makes sense for companies to do because “long-tail” languages are not easily done cost-effectively through standard channels.
 
While many in the professional industry comment disparagingly about quality in crowdsourced translation, there is evidence that it can work quite well. The three best examples I know of are the TED Open Translation Project, which has now translated almost 5,000 talks into 70 languages using a pool of over 2,000 volunteer translators; the Yeeyan project in China; and Global Voices.

The Yeeyan project takes interesting content in English and translates it into Chinese just to share interesting, compelling material. The community involves 8,000 volunteer translators, who’ve created 40,000 translations and collaborate with the Guardian, Time, NY Times and others. This effort got them into some trouble with Chinese censorship regulations but it has already evolved into a platform that employs “translators” and is self funding. 

Global Voices is translated into more than 15 languages by volunteer translators, who have formed the LinguaAdvocacy website and network to help people speak out online in places where their voices are censored. This is a truly virtual organization that allows us to hear real voices from around the world. Check out the recently translated articles. There are many more initiatives that I give a shout-out to in my Twitter stream.

I believe the professional industry is at a point where it needs to understand collaboration, crowdsourcing, automation, MT and open source. This is both an opportunity and a threat, as those who resist these new forces will likely be marginalized. Microsoft changed the world when they introduced PCs and a much more open IT model while IBM defended mainframes and became much less relevant. At the time, the management at IBM was not able to take a nerdy college dropout named Bill seriously, maybe because he delivered his software on a single floppy, or maybe because he did not wear a tie. Microsoft in turn was caught completely off guard when Google introduced their much more open, free and cloud-based model, and became less relevant. This cycle will likely continue as innovation drives change, and I predict that Google too will become less dominant in the not-so-distant future because they have lost the original spirit.

The Economist is also regularly translated into Chinese by a group that calls themselves the Eco Team. The founder had this to say:
"Like the forum name says, producing a Chinese version of The Economist is our goal. But we're still young and immature; very amateur, not professional. So what? Because we are young, we have the fervor, the enthusiasm, the passion. Because we are amateurs, we'll double our efforts to do our best. As long as we wish, we can be successful and do a good job!"

Ethan Zuckerman summarizes the implications of this very nicely. Change is coming to the world of translation, with or without the support and guidance of the professionals.

We are in an age where information is a primary driver of wealth creation. While the initial wave has been focused around English and European languages, this will increasingly shift to languages like Chinese. Social Networks in China are already proving that they can be innovators and leaders in the new digital economy. The value of information and thus of translation will continue to increase, and the understanding that knowledge can bring prosperity will hopefully gain momentum all around the world. I hope that some of us will help make this happen.

Wednesday, February 17, 2010

The Global Customer Support Translation Opportunity

Recently I have written about why MT is important for LSPs. MT is a key enabling technology for making large volumes of dynamic, high-value business content multilingual. I have also pointed out the significant business value of making customer support content in particular more multilingual. I would like to go into more detail on the specific challenges one is likely to face in translating knowledge base and community content, and how they could be addressed.

In most cases support content is likely to be 20X to 1000X the volume of even a large documentation project, so using MT technology will be a core requirement. It is also important that stakeholders understand that human quality is not achievable across the whole body of content, and that it is important to define a “good enough” quality level early in the process.


Understanding the Corpus

The first step in developing a translation strategy for “massive” content is to profile the source corpus and understand its volatility, language style, terminology, high-frequency linguistic patterns and content creation process, and to assess the existing linguistic resources available to build an MT engine. It is usually wise to do the following:
-- Gather existing translation memory (TM) and glossaries for training corpus
-- Identify sections that must be human translated (e.g. security, payment processing terms and conditions, legal content)
-- Analyze the source corpus, identify high-frequency phrase patterns and ensure that they are translated and validated by human translators (see the sketch after this list)
-- Identify the most frequently used knowledge base and community content and ensure that these are translated by humans and used as training corpus.
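As an illustration of the phrase-pattern analysis above, here is a minimal sketch that counts high-frequency word n-grams in a source corpus so the most common patterns can be routed to human translators for validation. The file name, n-gram sizes and cut-off are assumptions for the example.

```python
# Minimal sketch: find high-frequency phrase patterns (word n-grams)
# in a source-language corpus file (one segment per line).
# The file name, n-gram sizes and cut-off are illustrative assumptions.
from collections import Counter


def high_frequency_phrases(path, min_n=2, max_n=4, min_count=50):
    """Return (phrase, count) pairs that occur at least min_count times."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.lower().split()
            for n in range(min_n, max_n + 1):
                for i in range(len(tokens) - n + 1):
                    counts[" ".join(tokens[i:i + n])] += 1
    return [(p, c) for p, c in counts.most_common() if c >= min_count]


if __name__ == "__main__":
    # Print the 25 most frequent phrase patterns for human review.
    for phrase, count in high_frequency_phrases("kb_source.en")[:25]:
        print(f"{count:6d}  {phrase}")
```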
Once this is done, an MT engine can be built and evaluated. While it is important to do linguistic evaluation, it is perhaps even more important to show samples of the MT output to real customers and determine whether the output is actually useful to them.
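A minimal sketch of that sampling step, assuming the MT output is available as a simple tab-separated file of source and machine-translated segments; the file names and sample size are illustrative.

```python
# Minimal sketch: draw a reproducible random sample of MT output for
# human or customer review. Assumes a tab-separated file of
# source<TAB>mt_output segments; file names and sample size are assumptions.
import csv
import random


def sample_for_review(tsv_path, out_path, sample_size=200, seed=7):
    """Write a random sample of segments to a review file with a rating column."""
    with open(tsv_path, encoding="utf-8") as f:
        rows = [r for r in csv.reader(f, delimiter="\t") if len(r) >= 2]
    random.seed(seed)
    sample = random.sample(rows, min(sample_size, len(rows)))
    with open(out_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["source", "mt_output", "useful (y/n)"])
        writer.writerows([s, t, ""] for s, t, *_ in sample)


if __name__ == "__main__":
    sample_for_review("kb_mt_output.tsv", "review_sample.tsv")
```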
[Image: Knowledge base development process]
It is generally recommended that new knowledge base content be run through the initial engine, with the MT output analyzed and corrected by human post-editors and linguists until a target quality level is achieved. This process may involve several iterations to continually improve the quality, and the whole knowledge base can be periodically retranslated as big improvements in MT engine quality are achieved. It is important to understand that this is an ongoing, continuously evolving process and that overall quality will be strongly related to the amount of human corrective feedback that is provided.
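Here is a minimal sketch of that feedback loop. The translate_segments(), collect_post_edits() and retrain_engine() functions are hypothetical placeholders, not any particular vendor's API, and the retraining threshold is likewise an assumption.

```python
# Minimal sketch of an ongoing post-editing and retraining cycle.
# translate_segments(), collect_post_edits() and retrain_engine() are
# hypothetical placeholders, not any vendor's actual API.

RETRAIN_THRESHOLD = 10_000  # retrain once this many corrected pairs accumulate


def kb_translation_cycle(engine, new_kb_segments, corrective_corpus,
                         translate_segments, collect_post_edits,
                         retrain_engine):
    """Run one cycle: translate, post-edit, bank corrections, maybe retrain."""
    # 1. Translate the new knowledge base content with the current engine.
    raw_output = translate_segments(engine, new_kb_segments)

    # 2. Human post-editors correct the raw MT until it meets the agreed
    #    "good enough" quality level.
    corrected_pairs = collect_post_edits(new_kb_segments, raw_output)

    # 3. Bank the corrections; they become new training data.
    corrective_corpus.extend(corrected_pairs)

    # 4. Retrain periodically once enough corrective feedback has accumulated;
    #    the whole knowledge base can then be retranslated with the new engine.
    if len(corrective_corpus) >= RETRAIN_THRESHOLD:
        engine = retrain_engine(engine, corrective_corpus)
        corrective_corpus.clear()

    return engine
```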
[Image: Self-service knowledge base]
It is worth restating that there are significant benefits to doing this as the customer support environment evolves with the general momentum behind collaboration and community networks. The ROI in terms of call deflection savings and improved customer satisfaction is well documented and significant. But perhaps the greatest benefit is the expanded visibility for the global customer who cannot really use the English content in its original form.

Microsoft has clearly demonstrated the value of making their huge knowledge base multilingual. At a recent TAUS conference they reported that hundreds of millions of support queries are handled by raw MT and interestingly, surveys indicate that the successful resolution and customer satisfaction in many of these languages is actually higher than it is for English! Others are starting to follow suit and Intel and Cisco have also done similar things on a smaller scale. The CSI presentation by Greg Oxton at a recent TAUS meeting states it very simply:


Content is King -- Language is Critical

I saw recently that analysts in the content management community have identified the growing demand for multilingual content as one of the strongest trends of 2009 and see it growing further in 2010. The Gilbane Group has a big emphasis on content globalization in their upcoming conference this summer. I was involved with a webinar yesterday with Moravia that focused on the customer support content globalization issue. A replay of the webinar is available here.

The time is now to focus on and learn how to undertake content globalization projects that start at ten million words and can run into hundreds of millions of words. This is the future of professional translation, and I think that effective man-machine collaboration will be a key to success.