Comments on eMpTy Pages: The Ongoing Quest for "Best" MT Translation Quality (Kirti Vashee)

Comment from Unknown (2010-06-17):

At the end of the day, no MT system that eschews understanding the meaning of the source will win the day. Statistics is educated guessing. Statistics has a place in MT, but it needs to be combined with rule-based and interlingual approaches to create a hybrid system. One-to-many does not create a multilingual environment; many-to-many does.

Comment from Kirti Vashee (2010-04-20):

Some comments from Neil Coffey in the ProZ discussion of this same blog material.

Kirti Vashee wrote:
My personal sense is that this is a pretty meaningless exercise unless one has some upfront clarity on why you are doing this.

Well, actually they do present a list of hypotheses on the site. But in a sense, that's a flaw: when you conduct an experiment and don't want your subjects to bias your results, you don't usually tell your subjects in advance what your hypothesis is.

Of course, every experiment has flaws, and you have to weigh the difficulty of removing those flaws against practical constraints. Some other problems in this case are:

- They say they want 10,000 votes, but votes on *what*? Is this per language pair?
- What methodology have they used to estimate that this number of votes will be enough to get statistically significant results in the language pair with the likely lowest number of votes?
- How will they assess and compensate for natural biases in the type of people taking part in the experiment? (E.g. the site is in English and hosted in the US, so people are more likely to find it via a US search engine configured for English; an MT system trained or tuned more for US English will then be inherently likely to fare better.)

Kirti Vashee wrote:
This is another criticism, by Alon Lavie, a professor of computational statistics at CMU: Ethan Shen of Gabble On "is hoping to be able to detect predictive patterns in the data that he could use to predict future engine performance. But he has no control over the input data (participants choose to translate anything they want)."

Though they do have control over how they *filter* the data they get (e.g. they can say "we'll only include input between X and Y words in length"), and this can be a viable approach if done properly. But they obviously need to be careful not to bias their results by "peeking": they have to make decisions about how to filter using a sample of the data that is then removed from the data actually analysed, and the choice of which sentences are used for experimental design and which are actually analysed should be random.

"... and he's collecting just about no real extrinsic information about the data."

Yes, that's a potential problem, though arguably one that can be overcome by collecting a large amount of data.
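The worry about whether 10,000 votes can yield statistically significant results per language pair can be made concrete with a standard back-of-envelope calculation. This is a sketch using the normal approximation for a one-sample proportion test against a 50/50 null; the function name and the example preference rates (55%, 60%) are illustrative assumptions, not figures from the Gabble On experiment.

```python
# Rough sample-size check: how many votes does ONE pairwise engine
# comparison need before a modest preference (e.g. 55% vs. 45%) is
# statistically detectable? Normal-approximation formula for a
# one-sample proportion test against the 0.5 null hypothesis.
from math import ceil, sqrt
from statistics import NormalDist

def votes_needed(p=0.55, alpha=0.05, power=0.80):
    """Votes needed to detect a true preference rate p over 50/50."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_b = NormalDist().inv_cdf(power)           # desired statistical power
    num = z_a * 0.5 + z_b * sqrt(p * (1 - p))
    return ceil((num / (p - 0.5)) ** 2)

print(votes_needed())        # ~783 votes for a 55/45 split
print(votes_needed(p=0.60))  # ~194: a larger gap needs far fewer votes
```

On these assumptions, a single engine pair with a mild quality difference already consumes several hundred votes, so 10,000 votes spread across many language pairs and engine pairs is indeed thin.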
(OTOH, I'm not sure that 10,000 sentences is large enough.)

"So beyond very basic things such as language pair and length of source, he's unlikely to find any characteristics that are predictive of any future performance with any certainty whatsoever."

Arguably true, but if you look at their actual list of hypotheses, they probably *are* collecting enough in principle for those specific hypotheses. (Whether testing those hypotheses tells us much about the future performance of MT, I'm not sure.)

"What can be done (but Ethan is not doing) is to use intrinsic properties of the MT translations themselves (for example, word and sequence agreement between the MT translations) to identify the better translation."

I think this definitely has some advantages in terms of experimental design: your measurements are more "objective", and you can effectively run any text through the system "instantaneously", so you can run arbitrary numbers of sentences from well-defined sources. What I'd be interested to know is how you then remove the problem of circularity from your results. In other words, if your experiment shows that Google Translate comes out top, how do you know that this result isn't biased by Google Translate using measures in their (essentially unpublished) training process similar to the ones you're using in your evaluation?

Kirti Vashee
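The "word and sequence agreement" selection Lavie describes can be sketched as follows: score each engine's output by its average n-gram overlap with the other engines' outputs, then pick the translation the systems most agree on. The engine names and sample sentences below are invented for illustration, and the clipped-bigram overlap is just one simple choice of agreement measure.

```python
# Sketch of intrinsic-agreement selection among MT outputs:
# the candidate that shares the most n-grams with its rivals wins.
from collections import Counter

def ngrams(tokens, n=2):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def overlap(a, b, n=2):
    """Clipped n-gram overlap between two token lists, in [0, 1]."""
    ca, cb = ngrams(a, n), ngrams(b, n)
    shared = sum(min(ca[g], cb[g]) for g in ca)
    return shared / max(sum(ca.values()), 1)

def pick_by_agreement(outputs):
    """Return the engine whose output agrees most, on average, with the rest."""
    toks = {name: text.lower().split() for name, text in outputs.items()}
    def score(name):
        others = [t for other, t in toks.items() if other != name]
        return sum(overlap(toks[name], t) for t in others) / len(others)
    return max(outputs, key=score)

outputs = {  # hypothetical outputs from three engines
    "engine_a": "the contract must be signed before Friday",
    "engine_b": "the contract must be signed by Friday",
    "engine_c": "the contract must get signed by Friday",
}
print(pick_by_agreement(outputs))  # engine_b: closest to both rivals
```

Coffey's circularity concern applies directly to a selector like this: if one engine was itself tuned toward consensus-style metrics, an agreement-based evaluation will systematically favor it.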