MT has been in the news a lot of late, and professionals are probably getting tired of this new hype wave. Major stories in The New York Times and the Los Angeles Times have been circulating endlessly; please don’t send them to me, I have seen them.
There is also another initiative, by Gabble On, which asks volunteers to evaluate Google Translate, Microsoft Bing, and Yahoo Babel Fish translations. And bloggers like John Yunker and many others have posted preliminary answers to that perennial question, “Which Engine Translates Best?”, on their blogs.
This certainly shows that inquiring minds want to know and that this is a question that will not go away. It is probably useful to have a general sense from this kind of news, but does such coverage really leave you any wiser or better informed?
Without looking at a single article or any of the results, I can tell you that the results are quite predictable, based on my very basic knowledge of statistics. Google is likely to be seen as the best simply because they have greater coverage of what the engines will be tested on and have probably crawled more bilingual parallel data than everyone else combined. I think the NYT comparison clearly suggests this. But does this actually mean that they have the best quality?
I thought it would be useful to share “more informed” opinions on what these types of tests really mean. Much of what I gathered can be found scattered around the ALT Group on LinkedIn, so as usual I am just organizing and repurposing.
My personal sense is that this is a pretty meaningless exercise unless one has some upfront clarity on why one is doing it. It depends on what you measure, how you measure, for what objective, and when you measure. On any given day, any one of these engines could be the best for what you specifically want to translate. Measuring random snippet translations on baseline capabilities will only provide the crudest measure, one that may or may not be useful to a casual internet user but is completely useless for understanding the possibilities that exist for professional enterprise use, where you hopefully have a much more directed purpose. In the professional context, knowledge about customization strategies and key control parameters is much more important. The more important question for the professional is: can I make it do what I want relatively well and relatively easily?
The following are some selected comments from the LinkedIn MT group that provide an interesting and more informed (I think so anyway) professional perspective on this news.
Maghi King said: “The only really good test would have to take into account the particular user's needs and the environment in which the MT is going to be used - one size does not really fit all.”
Tex Texin said: “Identifying the best MT by voting will only determine which company encouraged the largest number of its employees to vote.”
Craig Myers said: “MT processes must be "trained" to provide desired outputs through creating a solid feedback loop to optimize accurate outcomes over time. Benchmarking one TM system against another is a fairly ridiculous endeavor unless you accept a very limited range of languages, content, and metrics upon which to base the competition upon - but then a limited scope negates any "real world" conclusions that might be drawn about languages and/or content areas outside of those upon which the competition is based.”
Alon Lavie, AMTA President and Associate Research Professor at Carnegie Mellon University, has, I think, some of the most useful and informed things to say (follow the link to read the full thread):
“The side by side comparison in the NY Times article is NOT an evaluation of MT. These are anecdotal examples. You could legitimately claim that the examples are not representative (of anything) and that casual users may draw unwarranted conclusions from them. I too think that they were poorly chosen. But any serious translation professional should know better. I can't imagine anyone considering using MT professionally drawing any kind of definite conclusions from these particular examples.
The specific choice of examples is not only biased, but also very naive. Take the first snippet from "The Little Prince". Those of us working with SMT should quickly suspect that the human translation of the book is very likely part of Google's training data. Why? The Google translation is simply way too close to the human reference translation. Translators - imagine that the sentences in this passage were in your TM... and Google fundamentally just retrieved the human translations.”
Ethan Shen of Gabble On “is hoping to be able to detect predictive patterns in the data that he could use to predict future engine performance. But he has no control over the input data (participants choose to translate anything they want), and he's collecting just about no real extrinsic information about the data. So beyond very basic things such as language-pair and length of source, he's unlikely to find any characteristics that are predictive of any future performance with any certainty whatsoever.
“What can be done (but Ethan is not doing) is to use intrinsic properties of the MT translations themselves (for example, word and sequence agreement between the MT translations) to identify the better translation. In MT research, that's called "hypothesis selection". My students and I work extensively on a more ambitious problem than that - we do MT system combination, where we attempt to create a new and improved translation by combining pieces from the various original MT translations. Rather than select which translation is best, we leverage all of them. We have had some significant success with this. At the NIST 2009 evaluation, we (and others working on this) were able to get improvements of about six BLEU points beyond the best MT system for Arabic-to-English. That was about a 10% relative improvement. That was a particularly effective setting. Strong but diverse MT engines that each produce good but different translations are the best input to system combination.”
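To make the hypothesis-selection idea concrete, here is a minimal sketch of the general approach Lavie describes: score each engine's output by its intrinsic agreement with the other engines' outputs, and pick the one that agrees most. This is an illustrative toy, not the actual CMU system; the unigram-overlap scoring and the example sentences are my own simplification (real systems use richer n-gram and sequence features).

```python
from collections import Counter

def agreement(a: str, b: str) -> float:
    """Unigram-overlap F1 between two translations: a crude stand-in
    for the word/sequence agreement measures used in real systems."""
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        return 0.0
    overlap = sum((Counter(ta) & Counter(tb)).values())
    p, r = overlap / len(ta), overlap / len(tb)
    return 2 * p * r / (p + r) if p + r else 0.0

def select_hypothesis(candidates: list[str]) -> str:
    """Pick the candidate that agrees most, on average, with the rest.
    Intuition: independent engines tend to agree on correct phrasing
    and diverge on their individual errors."""
    def score(c: str) -> float:
        others = [o for o in candidates if o is not c]
        return sum(agreement(c, o) for o in others) / len(others)
    return max(candidates, key=score)

# Hypothetical outputs from three different MT engines for one sentence:
outputs = [
    "the little prince lives on a small planet",  # engine A
    "the small prince lives on a small planet",   # engine B
    "little prince inhabits a tiny world",        # engine C
]
print(select_hypothesis(outputs))  # engine A: closest to the consensus
```

System combination, as Lavie notes, goes a step further: instead of choosing one whole output, it stitches together agreeing fragments from all of them to build a new translation.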
So while these kinds of anecdotal surveys are interesting and can get MT some news buzz, be wary of using them as any real indication of quality. They will also clearly establish that humans/professionals are needed to get real quality. The professional translation industry has hopefully learned that the “translation quality” question needs to be approached with care, or you end up with conflation at best and a lot of mostly irrelevant data.
My best MT engine would be the system that does the best job on content I am interested in on that day, so I will try at least two. The best for professional use has to be the system that gives users steering control and the ability to tune an engine to their very specific business needs as easily (and cost-effectively) as possible, and that helps enterprises build long-term leverage in communicating with global customers.