
Monday, March 15, 2010

The Ongoing Quest for “Best” MT Translation Quality

MT has been in the news a lot of late and professionals are probably getting tired of this new hype wave. Major stories in The New York Times and the Los Angeles Times have been circulating endlessly – please don’t send them to me, I have seen them. 

There is also another initiative by Gabble On which asks volunteers to evaluate Google Translate, Microsoft Bing, and Yahoo Babel Fish translations. And bloggers like John Yunker and many others have posted the preliminary results to that perennial question “Which Engine Translates Best?” on their blogs.

This certainly shows that inquiring minds want to know and that this is a question that will not go away. It is probably useful to have a general sense from this kind of news but does this kind of coverage really leave you any wiser and more informed?

Without looking at a single article or any of the results, I can tell you that the results are quite predictable, based on my very basic knowledge of statistics. Google is likely to be seen as the best simply because they have greater coverage of what the engines will be tested on and have probably crawled more bilingual parallel data than everybody else added together. I think the NYT comparison clearly suggests this. But does this actually mean that they have the best quality?

I thought it would be useful to share “more informed” opinions on what these types of tests really mean. Much of what I gathered can be found scattered around the ALT Group in LinkedIn so as usual I am just organizing and repurposing.

My personal sense is that this is a pretty meaningless exercise unless one has some upfront clarity on why you are doing this. It depends on what you measure, how you measure, for what objective, and when you measure. On any given day, any one of these engines could be the best for what you specifically want to translate. Measuring random snippet translations on baseline capabilities provides only the crudest measure, one that may or may not be useful to a casual internet user but is completely useless for understanding the possibilities that exist for professional enterprise use, where you hopefully have a much more directed purpose. In the professional context, knowledge about customization strategies and key control parameters is much more important. The more important question for the professional is: can I make it do what I want relatively well and relatively easily?
[Image: Free vs. Custom MT engines]

The following are some selected comments from the LinkedIn MT group that provide an interesting and, I think, more informed professional perspective on this news.
Maghi King said: “The only really good test would have to take into account the particular user's needs and the environment in which the MT is going to be used - one size does not really fit all.”
Tex Texin said: “Identifying the best MT by voting will only determine which company encouraged the largest number of its employees to vote.”
Craig Myers said: “MT processes must be "trained" to provide desired outputs through creating a solid feedback loop to optimize accurate outcomes over time. Benchmarking one MT system against another is a fairly ridiculous endeavor unless you accept a very limited range of languages, content, and metrics upon which to base the competition - but then a limited scope negates any "real world" conclusions that might be drawn about languages and/or content areas outside of those upon which the competition is based.”
Alon Lavie, AMTA President & Associate Research Professor at Carnegie Mellon University, has, I think, some of the most useful and informed things to say (follow the link to read the full thread):
“The side by side comparison in the NY Times article is NOT an evaluation of MT. These are anecdotal examples. You could legitimately claim that the examples are not representative (of anything) and that casual users may draw unwarranted conclusions from them. I too think that they were poorly chosen. But any serious translation professional should know better. I can't imagine anyone considering using MT professionally drawing any kind of definite conclusions from these particular examples.

The specific choice of examples is not only biased, but also very naive. Take the first snippet from "The Little Prince". Those of us working with SMT should quickly suspect that the human translation of the book is very likely part of Google's training data. Why? The Google translation is simply way too close to the human reference translation. Translators - imagine that the sentences in this passage were in your TM... and Google fundamentally just retrieved the human translations.”
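Lavie's suspicion about “The Little Prince” can be checked mechanically: if an engine's output is nearly identical to a published human translation, simple retrieval of training data is a plausible explanation. Below is a minimal sketch of such a check using plain n-gram overlap rather than any particular MT toolkit; the 0.9 cutoff and the sample strings are purely illustrative assumptions.

```python
def ngrams(tokens, n):
    # All n-grams (as tuples) contained in a list of tokens.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def overlap_score(candidate, reference, max_n=4):
    # Average n-gram precision of a candidate against one reference.
    # Values near 1.0 mean the candidate is close to a verbatim copy.
    cand = candidate.lower().split()
    ref = reference.lower().split()
    per_n = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand, n)
        if not cand_ngrams:
            break
        ref_ngrams = set(ngrams(ref, n))
        per_n.append(sum(g in ref_ngrams for g in cand_ngrams) / len(cand_ngrams))
    return sum(per_n) / len(per_n) if per_n else 0.0

# Placeholder strings stand in for the engine's output and the published translation.
mt_output = "if you please draw me a sheep"
human_reference = "if you please draw me a sheep"

if overlap_score(mt_output, human_reference) > 0.9:  # 0.9 is an arbitrary illustrative cutoff
    print("Output is suspiciously close to the human reference; "
          "the book's translation may simply have been in the training data.")
```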

Ethan Shen of Gabble On “is hoping to be able to detect predictive patterns in the data that he could use to predict future engine performance. But he has no control over the input data (participants choose to translate anything they want), and he's collecting just about no real extrinsic information about the data. So beyond very basic things such as language-pair and length of source, he's unlikely to find any characteristics that are predictive of any future performance with any certainty whatsoever. 

What can be done (but Ethan is not doing) is to use intrinsic properties of the MT translations themselves (for example, word and sequence agreement between the MT translations) to identify the better translation. In MT research, that's called "hypothesis selection". My students and I work extensively on a more ambitious problem than that - we do MT system combination, where we attempt to create a new and improved translation by combining pieces from the various original MT translations. Rather than select which translation is best, we leverage all of them. We have had some significant success with this. At the NIST 2009 evaluation, we (and others working on this) were able to get improvements of about six BLEU points beyond the best MT system for Arabic-to-English. That was about a 10% relative improvement. That was a particularly effective setting. Strong but diverse MT engines that each produce good but different translations are the best input to system combination.”
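The “hypothesis selection” idea Lavie describes, picking a winner based on word and sequence agreement between the systems, can be illustrated with a deliberately crude sketch. This is not the CMU system-combination work, just a toy version of agreement-based selection; the engine names and sentences are invented for illustration.

```python
def agreement(a, b):
    # Crude symmetric word-overlap (Jaccard) between two translations.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def select_hypothesis(outputs):
    # Pick the translation that agrees most, on average, with all the others.
    def avg_agreement(name):
        others = [text for other, text in outputs.items() if other != name]
        return sum(agreement(outputs[name], text) for text in others) / len(others)
    return max(outputs, key=avg_agreement)

# Hypothetical outputs from three engines for the same source sentence.
outputs = {
    "engine_a": "the little prince asked me to draw a sheep",
    "engine_b": "the small prince asked that i draw him a sheep",
    "engine_c": "the little prince asked me to draw him a sheep",
}
print(select_hypothesis(outputs))  # the output most consistent with its peers
```

Real hypothesis selection, and even more so system combination, uses far richer features than word overlap, but the underlying intuition is the same: strong but diverse engines give the consensus something to work with.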
So while these kinds of anecdotal surveys are interesting and can get MT some news buzz, be wary of using them as any real indication of quality. They will also clearly establish that humans/professionals are needed to get real quality. The professional translation industry has hopefully learned that the “translation quality” question needs to be approached with care, or you end up, at best, with conflation and a lot of mostly irrelevant data.

My best MT engine would be the system that does the best job on the content I am interested in on that day, so I will try at least two. The best for professional use has to be the system that gives users steering control and the ability to tune an engine to their very specific business needs as easily (and cost-effectively) as possible, and that helps enterprises build long-term leverage in communicating with global customers.

2 comments:

  1. Some comments from Neil Coffey in the ProZ discussion of this same blog material.



    Kirti Vashee wrote:
    My personal sense is that this is a pretty meaningless exercise unless one has some upfront clarity on why you are doing this.


    Well, actually they do present a list of hypotheses on the site. But in a sense, that's a flaw-- when you conduct an experiment and don't want your subjects to bias your results, you don't usually tell your subjects in advance what your hypothesis is...

    Of course, every experiment has flaws, and you have to weigh up the difficulty of removing these flaws vs practical constraints. Some other problems in this case are:

    - they say they want 10,000 votes -- but votes of *what*? is this per language pair? what methodology have they used to estimate that this will be enough to get statistically significant results in the language pair with the likely lowest number of votes?
    - how will they assess and compensate for natural biases in the type of people taking part in the experiment? (e.g. the site is in English and located in the US, so more people are likely to find the site in a US search engine configured for English; an MT system trained/designed more for US English will then be inherently likely to fare better)
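
    The 10,000-vote question can be made a bit more concrete with a standard margin-of-error calculation for a proportion. A rough back-of-envelope sketch; the assumption that the votes are spread over roughly 20 language pairs is invented purely for illustration:

    ```python
    import math

    def margin_of_error(n, p=0.5, z=1.96):
        # 95% margin of error for a proportion estimated from n votes,
        # at the worst case p = 0.5.
        return z * math.sqrt(p * (1 - p) / n)

    # If the 10,000 votes were spread over, say, 20 language pairs (an assumed
    # split, not a figure from the site), each pair would get roughly 500 votes.
    for votes_per_pair in (10_000, 500, 100):
        print(f"{votes_per_pair:>6} votes: roughly +/- {margin_of_error(votes_per_pair):.1%}")
    # With 500 votes the margin is about +/-4.4 points, so only fairly large
    # preference gaps between engines would be statistically meaningful per pair.
    ```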


    Kirti Vashee wrote:
    This is another criticism by Alon Lavie who is a professor of computational statistics at CMU:
    Ethan Shen of Gabble On “is hoping to be able to detect predictive patterns in the data that he could use to predict future engine performance. But he has no control over the input data (participants choose to translate anything they want)


    Though they do have control over how they *filter* the data they get-- e.g. they can say "we'll only include input between X and Y words in length"-- and this can be a viable approach if done properly. But they obviously need to be careful not to bias their results by "peeking" (e.g. they have to make decisions about how to filter using a sample of the data that is then removed from the data actually analysed, and the decision of which sentences are used for experimental design and which are actually analysed should be random).
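
    The “no peeking” point is essentially the familiar held-out-sample discipline: choose your filtering rules on a random design sample and then drop that sample from the analysis. A minimal sketch of that separation; the 10% design fraction and the placeholder data are assumptions for illustration only:

    ```python
    import random

    def design_analysis_split(sentences, design_fraction=0.1, seed=42):
        # Randomly set aside a "design" sample for choosing filtering rules;
        # everything else is kept untouched for the actual analysis.
        shuffled = sentences[:]
        random.Random(seed).shuffle(shuffled)
        cut = int(len(shuffled) * design_fraction)
        return shuffled[:cut], shuffled[cut:]

    # Placeholder data; in practice these would be the submitted source sentences.
    collected = [f"sentence {i}" for i in range(1000)]
    design_set, analysis_set = design_analysis_split(collected)
    # Length limits, language filters, etc. are tuned on design_set only, then
    # applied unchanged to analysis_set, so no decision "peeks" at the data
    # that actually gets analysed.
    ```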



    and he's collecting just about no real extrinsic information about the data


    Yes, that's a potential problem, though arguably one that can be overcome by collecting a large amount of data. (OTOH, I'm not sure that 10,000 sentences is large enough.)



    So beyond very basic things such as language-pair and length of source, he's unlikely to find any characteristics that are predictive of any future performance with any certainty whatsoever.


    Arguably true, but if you look at their actual list of hypotheses, they probably *are* collecting enough in principle for those specific hypotheses. (Whether the testing of those hypotheses tell us much about future performance of MT, I'm not sure...)



    What can be done (but Ethan is not doing) is to use intrinsic properties of the MT translations themselves (for example, word and sequence agreement between the MT translations) to identify the better translation.


    I think this definitely has some advantages in terms of experimental design (your measurements are more "objective"; you can effectively run any text through the system "instantaneously", so you can run arbitrary numbers of sentences/sentences from well-defined sources). What I'd be interested to know is how you then remove the problem of circularity from your results-- in other words, if your experiment shows that Google Translate comes out top, how do you know that this result isn't biased by Google Translate using similar measures in their (essentially unpublished) training process to the ones that you're using in your evaluation?

  2. At the end of the day no MT system that eschews understanding the meaning of the source will win the day. Statistics is educated guessing. Statistics has a place in MT, but it needs to be combined with rule-based and interlingua approaches to create a hybrid system. One to many does not create a multilingual environment. Many to many does.
