Comments on eMpTy Pages: Understanding Machine Translation Quality: A Review

2021-10-22T11:19:17.307-07:00

Very useful recap, many thanks!

2021-10-22T11:18:02.404-07:00

Systematic and repeatable human assessment of MT output is a critical driver of progress with MT, and tools like yours will bring much-needed structure and experimental repeatability to this process. I look forward to learning more about your product.

2021-10-22T11:17:03.106-07:00

Thanks for properly positioning human insight into MT quality, Kirti! From ContentQuo’s work with some of the best enterprise and LSP Machine Translation teams in the world, we see how relentlessly they invest in continuous, multi-faceted, carefully instrumented processes for human assessment of MT output, so that they stay ahead of the game when tracking the evolution of their MT engine mix. Even for fast-feedback scenarios like eCommerce, where raw MT quality converts almost directly into cash, human evaluation seems essential to understand and preserve the quality drivers of high conversion. Not to mention the majority of scenarios, where fast, meaningful business ROI is nigh impossible to obtain!

2021-10-22T11:11:04.462-07:00

Interpreting what a score means is difficult, and it is even harder to understand what small improvements in a score may actually mean.

Having a "representative" reference sample is an issue with all these metrics. Quite possibly BLEU remains popular because we understand what a score might mean and enough of us have learned to do this "properly" after doing it a thousand times.

This is not true for hLEPOR, COMET, or chrF: few have the experience to interpret them properly. It may take years for widespread understanding to develop, so in the meantime BLEU is still better.

2021-10-22T11:09:55.442-07:00

Kirti, IMVHO, the problems with BLEU are the scale, which does not allow for understandable increments, and the reference sample, but I can see that, despite the efforts, the other metrics are not that much better.
In my previous comment, I forgot to mention a point I have long tried to draw researchers' attention to: assessing the source. No quality rating can really be understood unless it takes the quality of the source content into consideration. No machine can perform adequately with poor content, while an experienced human translator (HT) could: the difference is all in the ability to guess. ;-)

2021-10-22T11:08:05.663-07:00

Amazingly, BLEU is still the most commonly used measure, even at MT Summit in August 2021. Hardly obsolete. I think the reason is that it is well understood and is not really that much worse than hLEPOR, COMET, and others. There is very little to be gained for all the hassle and effort incurred in using the others.

The other things you mention are very interesting and probably deserve a closer examination, which I hope to undertake in the future.

2021-10-22T11:06:19.659-07:00

Well, Kirti, as you know, price is a fundamental parameter, so a price/performance ratio is crucial. So is the response to trimming the so-called learning slope. I may also add the resilience to variance in training data and, last but not least, the predictability of results, although I'm not sure how this last parameter can be measured.
Anyway, I'm fairly surprised that still no thought is given to subjectivity in assembling the assessment data, especially with BLEU, which I now consider fairly obsolete and inadequate.
All that said, I'm quite amused by the amount of attention given to human assessment, but given the churn rate of pointless discussion on PEMT and quality at large in this industry, I must say I'm not surprised.

2021-10-22T11:05:03.782-07:00

Thank you, Luigi, for your comment.

What other variables should be added to get a more complete equation?

2021-10-22T11:03:16.181-07:00

Unfortunate as it is in championing a commercial platform that is by no means better than its competitors, this is at least a very honest piece on MT "quality" assessment. Many variables are still out of the equation, though.

2021-10-22T11:00:07.828-07:00

Such a good article. Thank you!