Friday, March 26, 2010

Dispelling some Popular Misconceptions about Statistical MT

Recently, some RbMT enthusiasts have been speaking up, both in the LinkedIn MT group and in a March 2010 article in Multilingual magazine, about what they consider to be SMT deficiencies.

I covered the SMT vs RbMT debate in a previous blog entry, but I thought it would be useful to address more specifically some of the key issues that get brought up. It is clear that Google and Microsoft, who both had years of experience with Systran RbMT, have decided that an SMT-based approach is preferable, probably because of better quality and more control. I even decided to put a Microsoft MT widget on this blog since it allows readers to both translate and correct the translations. How cool is that? This dynamic feedback loop will become a hallmark of all SMT in the future.

I have noticed that much of the criticism comes from people who have had little or no direct experience with SMT (beyond Google Translate) and have not had experience customizing a commercial SMT system. Or perhaps their information is somewhat dated and based on old research data; e.g. much is made of "the blue cat" being translated as "le bleu chat" rather than as "le chat bleu". My attempts to recreate this “problem” on 4 different SMT engines showed that all the SMT engines in the market today handle this without issue. Go ahead, try it.
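As a rough illustration of why word order like "le chat bleu" need not be a problem for a data-driven system, here is a minimal sketch of a bigram language model with add-one smoothing, trained on a tiny made-up French corpus. It is not how any particular engine works; the corpus, the function names and the smoothing choice are all assumptions made for the example. The point is only that target-language text alone can make the natural ordering score higher than the unnatural one.

```python
import math
from collections import Counter

# Tiny, made-up monolingual target-language corpus (an assumption for this sketch).
CORPUS = [
    "le chat bleu dort",
    "le chat noir mange",
    "un chat bleu",
]

def train_bigram_counts(sentences):
    """Count history tokens and bigrams, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])              # tokens that act as a history
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return unigrams, bigrams

def log_score(sentence, unigrams, bigrams, vocab_size):
    """Log-probability of a sentence under an add-one smoothed bigram model."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        prob = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        total += math.log(prob)
    return total

UNIGRAMS, BIGRAMS = train_bigram_counts(CORPUS)
# Vocabulary of tokens that can be predicted: all words plus the end marker.
VOCAB_SIZE = len({tok for s in CORPUS for tok in s.split()} | {"</s>"})

good = log_score("le chat bleu", UNIGRAMS, BIGRAMS, VOCAB_SIZE)
bad = log_score("le bleu chat", UNIGRAMS, BIGRAMS, VOCAB_SIZE)
print("le chat bleu:", good)
print("le bleu chat:", bad)
```

With this toy corpus the first ordering scores higher, simply because "chat bleu" was seen in the data and "bleu chat" was not.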

Some of the key criticisms are as follows:

-- SMT is unpredictable:
This is the most common criticism from RbMT “experts”. I think this may be true to some extent if your training data is noisy or not properly cleaned and normalized. An experiment that I was involved with showed exactly this, but only with dirty data (Garbage In Garbage Out). Since many of the early SMT systems were built by customizing baselines built from web-scraped data, there is risk and evidence of unpredictable behavior. At Asia Online we have seen that this unpredictability essentially disappears as the data is cleaned and normalized.
[Figure: Clean data reduces unpredictability]
Another related criticism I have often seen repeated is that SMT systems are inconsistent in terminology use. SMT systems do, by definition, tend to choose phrase patterns that have higher statistical density, but this is easily corrected by using glossaries (much simpler than dictionaries), which provide an override function and encourage the engine to use very specific user-specified terminology.
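The glossary override idea can be sketched in a few lines. This is only a toy illustration, not the mechanism of any specific engine: the phrase table, its probabilities, the glossary entry and the function name are all hypothetical. Without a glossary the most probable candidate wins; with a glossary entry, the user-specified term is used regardless of the statistics.

```python
# Hypothetical phrase table: source phrase -> list of (target phrase, probability).
PHRASE_TABLE = {
    "computer": [("ordinateur", 0.7), ("calculateur", 0.3)],
    "engine": [("moteur", 0.9), ("engin", 0.1)],
}

# Hypothetical user glossary: forces one term for a given source phrase.
GLOSSARY = {"computer": "calculateur"}

def choose_translation(source_phrase, phrase_table, glossary):
    """Pick a target phrase, letting the glossary override the statistics."""
    if source_phrase in glossary:
        return glossary[source_phrase]       # user-specified term always wins
    candidates = phrase_table.get(source_phrase)
    if not candidates:
        return source_phrase                 # unknown phrase: pass through unchanged
    # Otherwise fall back to the statistically most likely candidate.
    return max(candidates, key=lambda pair: pair[1])[0]

print(choose_translation("computer", PHRASE_TABLE, GLOSSARY))
print(choose_translation("engine", PHRASE_TABLE, GLOSSARY))
```

The same lookup with an empty glossary would return the highest-probability candidate for every phrase, which is where inconsistent terminology can creep in when the training data mixes domains.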

-- SMT is harder to set up and customize:
The bulk of the effort in building custom SMT systems is focused on data gathering, data cleaning and data preparation for “training”. The data should include both bilingual parallel text and monolingual target language text. Once the data is cleaned, SMT engines can usually be built in days. In the future we can expect that this could be reduced to hours for most systems.
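To give a feel for what data cleaning can involve, here is a minimal sketch of a cleaning pass over bilingual segment pairs. The specific rules (Unicode and whitespace normalization, dropping empty segments, dropping pairs with an extreme length ratio, removing exact duplicates) and the threshold are assumptions chosen for illustration; real pipelines typically apply many more checks.

```python
import unicodedata

def normalize(text):
    """Normalize Unicode form and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())

def clean_parallel(pairs, max_ratio=3.0):
    """Return cleaned (source, target) pairs, in their original order.

    max_ratio is an illustrative threshold on the word-count ratio between
    the two sides; pairs beyond it are treated as likely misalignments.
    """
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        src, tgt = normalize(src), normalize(tgt)
        if not src or not tgt:
            continue                          # drop pairs with an empty side
        src_len, tgt_len = len(src.split()), len(tgt.split())
        if max(src_len, tgt_len) / min(src_len, tgt_len) > max_ratio:
            continue                          # drop suspicious length mismatches
        if (src, tgt) in seen:
            continue                          # drop exact duplicates
        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned

raw_pairs = [
    ("the  blue cat ", "le chat bleu"),
    ("the blue cat", "le chat bleu"),                 # duplicate after normalization
    ("", "vide"),                                     # empty source side
    ("hello", "bonjour a tous mes amis ici"),         # extreme length ratio
]
print(clean_parallel(raw_pairs))
```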

TAUS has reported that Moses has been downloaded 4,000 times in the last 12 months. This suggests that many more people will be working with SMT in the near future. Historically, the single commercial SMT vendor (US government focused) has been prohibitively expensive and provided very little user control. This is changing as more people jump into open systems SMT, and I think we will see this become a non-issue. What many fail to understand is that RbMT systems let you control quality by adding dictionaries, and that this is really the only control variable one has. There is generally no ability to modify or add rules without huge expense and vendor support.

We already see companies like Autodesk and Sun (Oracle) and even some LSPs building their own SMT systems, probably because it is not that difficult. At Asia Online we have clean foundation data for 500+ language combinations that allows our users to quickly build custom systems by combining their data with ours. We will continue to build our clean data resources to enable more people to easily build custom SMT systems.
[Figure: Data path to quality]

-- SMT requires large volumes of reliable and clean data:
It is true that SMT does need data to produce good results. But this includes both bilingual TMs and monolingual target language text, which can be easily harvested from the web. For many European languages it is possible to get significantly better results than Google with as little as 100,000 segments in a focused domain. Of course you get better results with more data. Initiatives like PANACEA, LDC, OPUS and Meedan, and entities like the UN and the EC, are making an increasing amount of training data available to the open source community, and we will see this issue also get easier and easier to solve. The TDA may also be helpful if you have money to spare. There are already several SMT-based start-ups in Spain whose founders have considerable experience in RbMT but chose an SMT-based approach for their future because they see that higher quality results are possible with less effort. This shift in focus by competent RbMT practitioners is very telling and, I think, suggests more will follow.

-- RbMT is a better foundation for hybrid systems:
Hybrid systems are now increasingly seen as the wave of the future. We see that the original data-only mindset amongst SMT developers has changed, and increasingly they incorporate more linguistics or syntax and grammar rules into their methodology. RbMT developers, after as much as 50 years of development in some cases, realize that statistical methods can improve long-standing structural problems like the fluency of their output. I think the openness of the SMT community will outdistance initiatives with RbMT foundations, especially since it is so difficult to go into the rules engines and make changes. Many of the advances over the next few years will involve pre- and post-processing around one of these approaches. There will probably be another 5,000 people downloading and playing with Moses this year. My instincts tell me that this broad collective effort will generate knowledge that moves faster to higher quality than the much lower investment on the RbMT side.
[Figure: AO hybrid]
While there are some language combinations where RbMT systems outperform SMT-based systems, I think this too will change. RbMT can still make sense if you have no data, but then why not just use the free online MT? Recent anecdotal surveys by the mainstream press are only the beginning of the coming SMT tidal wave. Most of us can remember how much worse the free online MT experience was a few years ago, when Google and Microsoft were still RbMT based. I now notice continuous improvements in these data-driven systems. As the internet naturally generates more data, they will continue to improve. Ultimately this competition keeps everybody honest and hopefully on their toes. Whichever approach produces better quality is likely to gain increasing momentum and acceptance. Apart from the growing ease of obtaining data and SMT's ties to open source, the major driving force, I think, is user control. SMT simply gives one more control across the board; even the learning algorithms are modifiable. My bet is on the SMT + Linguistics + Active Human Feedback loop as the clear quality leader in the very near future.

This debate is important because the types of skills that active users need to develop for each approach are quite different. For SMT: Data Cleaning, Data Analysis, Linguistic Steering and Linguistic Pattern Analysis, and even Linguistic Rules Development for some languages. For RbMT: Dictionary Building and, recently, Statistical Post-Editing. But it is clear that rapid, efficient post-editing is important in either case.

Please chime in and share your opinion or experience on this issue.
