One of the big differences between RbMT and SMT is the role of automated measures in the system development process. Traditionally, RbMT systems have had little or no use for these, but an SMT system could probably not be built without some kind of automated measurement. SMT developers are constantly trying new techniques and data combinations to improve their systems, and they need quick and frequent feedback on whether a particular strategy is working. Some form of standardized, objective and relatively rapid means of assessing quality is therefore a necessary part of the SMT system development process.
I will give an overview of BLEU in this entry and continue the discussion in future blog entries.
The question of quality is a difficult one to answer, because there is no widely accepted, entirely objective way to measure the quality or accuracy of automated translation software, or of any translation for that matter. The localization industry has struggled for years to establish some kind of objective measure for human translation quality and has yet to really succeed. Competent and objective humans are usually the surest measure of quality, but as we all know, objectivity and real structural rigor are hard to define. LISA claims that 20% (1 in 5) use the LISA QA Model 3.1, but I have rarely seen examples of it in actual use.
The most widely used measure in SMT today is BLEU. The oddly named BLEU (BiLingual Evaluation Understudy) is an approach developed at IBM that is actively used by developers in the SMT community, even though everybody is always complaining about how flawed it is. There is a great discussion on this in the Automated Language Translation group on LinkedIn.
What is a BLEU score?
Measuring translation quality is difficult because there is no absolute way to measure how “correct” a translation is. Many “correct” answers are possible; indeed, there can be as many “correct” answers as there are translators. The most common way to measure quality (in SMT) is to compare the output of automated translation to a human translation of the same document. The problem is that one human translator will translate the document quite differently from another. This leads to problems when using these human references to measure the quality of an automated translation solution. A document translated by machine may have 60% of its words overlap with one translator’s version but only 40% with another’s; even though both human reference translations can be technically correct, the one with the 60% overlap yields a higher “quality” score for the same automated translation. So although humans are the true test of correctness, they do not provide an entirely objective and consistent measurement of quality.
The BLEU metric scores a translation on a scale of 0 to 1. The closer to 1, the more overlap there is with a human reference translation and thus the better the system is. In a nutshell, the BLEU metric measures how many words overlap, giving higher scores to sequential matches. For example, a string of four words in the translation that matches the human reference translation (in the same order) has a positive impact on the BLEU score and is weighted more heavily than a one- or two-word match. It is very unlikely that you would ever score 1, as that would mean the compared output is exactly identical to the reference output.
-- The scoring algorithm clips word counts, so you get no extra credit for unnecessarily repeating high-frequency words like “the”; a separate brevity penalty punishes translations that are shorter than the reference.
-- Studies have shown that there is a high correlation between BLEU and human judgments of quality when properly used.
-- BLEU scores are often stated on a scale of 0 to 100 to simplify communication, but these should not be confused with percentages of accuracy.
-- Even two competent human translations of the exact same material may only score in the 0.6 to 0.7 range if they use different vocabulary and phrasing.
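The mechanics described above can be sketched in a few lines of Python. This is only a toy, single-reference illustration (real tools add smoothing and corpus-level aggregation, and the function names here are my own invention, not any standard API):

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    """All contiguous n-grams in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Toy single-reference BLEU: geometric mean of clipped n-gram
    precisions for n = 1..max_n, multiplied by a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each n-gram's count at its count in the reference, so
        # repeating a high-frequency word like "the" earns no extra credit.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        precisions.append(matched / max(sum(cand_counts.values()), 1))
    if min(precisions) == 0:  # no overlap at some n-gram length
        return 0.0
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else exp(1 - len(ref) / len(cand))
    return bp * exp(sum(log(p) for p in precisions) / max_n)
```

An identical candidate scores 1.0; a candidate whose n-grams all match but which is much shorter than the reference is pulled down by the brevity penalty, and longer matching sequences raise the score more than scattered single-word matches.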
To conduct a BLEU measurement the following data is necessary:
1. One or more human reference translations. (In the case of SMT, this should be data that has NOT been used in building the system as training data, and ideally it should be unknown to the SMT system developer. It is generally recommended that 1,000 or more sentences be used to get a meaningful measurement.) If you use too small a sample set, a few sentences that match or fail to match well can sway the score significantly.
2. Automated translation output of the exact same source data set.
3. A measurement utility like Language Studio Lite™ that performs the comparison and calculation for you.
As would be expected, using multiple human reference translations will always result in higher scores, since the SMT output has more human variations to match against. NIST (the National Institute of Standards & Technology) uses BLEU as an approximate measure of quality in its annual MT evaluations, with four human reference sets, to ensure that some of the variance in human translation is captured and thus allow more accurate quality evaluations of the MT solutions under test. So when companies claim they have the “best” MT system, all they are really saying is that they got the highest BLEU score on a single reference-set comparison. The same system could do quite poorly on a different test set, so this information should be used with some care. The MT community has also recently started evaluating these measures to see which correspond most closely to human judgments.
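The effect of adding reference sets can be seen in a small sketch of the clipping rule BLEU uses with multiple references: each candidate word is credited up to the maximum number of times it occurs in any single reference. The sentences, names, and unigram-only scope here are my own illustration:

```python
from collections import Counter

def clipped_unigram_matches(candidate, references):
    """Count candidate words that overlap the references, clipping each
    word at its maximum count in any single reference."""
    cand = Counter(candidate.split())
    refs = [Counter(r.split()) for r in references]
    return sum(min(c, max(r[w] for r in refs)) for w, c in cand.items())

mt_output = "the committee approved the plan yesterday"
ref_a = "the panel approved the proposal yesterday"
ref_b = "the committee passed the plan on thursday"

one_ref = clipped_unigram_matches(mt_output, [ref_a])          # 4 of 6 words match
two_refs = clipped_unigram_matches(mt_output, [ref_a, ref_b])  # 6 of 6 words match
```

With one reference, only four of the six output words count as matches; adding a second reference lifts this to all six. This is why a score computed against four references is not directly comparable with one computed against a single reference.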
What is BLEU useful for?
SMT systems are built by “training” a computer with examples of human translations. As more human translation data is added, systems should generally get better in quality. Asia Online provides a development environment that allows users to make many adjustments while developing an SMT translation system. Often new data can be added with beneficial results, but sometimes it can have a negative effect, especially if it is dirty. Thus, to know whether progress is being made, system developers need to be able to measure quality rapidly and frequently to make sure they really are improving the system.
During the development process, an automated test is necessary to quickly see the impact of a development strategy. BLEU gives developers a way “to monitor the effect of daily changes to their systems in order to weed out bad ideas from good ideas.” When used to evaluate the relative merit of different system-building strategies, BLEU can be quite effective: it provides very quick feedback, which lets SMT developers rapidly refine the systems they are building and keep improving quality over the long term.
Asia Online provides a periodically updated table showing the BLEU scores of 506 different language combinations. The table is shown below; the first column is the Source Language code and the first row is the Target Language code. This is useful since, for the most part, the same amount, quality and type of core data has been used to build all the SMT systems shown in the table, and the test sets used to measure quality are basically comparable. As you can see, the darker green combinations produce the best systems (given the same amount of data). The table also shows that English-to-Romance and Romance-to-Romance language combinations produce the best-quality systems, other things being equal, and that Finnish and Hungarian are more difficult to deal with in general.
What is BLEU not useful for?
BLEU scores are always tied to a specific “test set” and a specific language pair. Thus, BLEU should not be used as an absolute measure of translation quality, because the score can vary even within one language depending on the test set and subject domain. In most cases, comparing BLEU scores across different languages is meaningless unless very strict protocols have been followed.
Because of this, it is always recommended to use human translators to measure fluency and verify the accuracy of systems after they have been built. Most industry leaders will also vet BLEU score readings with human assessments before production use.
In competitive comparisons, it is important to carry out the tests in an unbiased, scientific manner to get a true view of where you stand against the alternatives. It is therefore important to use the exact same test set AND the same BLEU measurement tool, and the test set should be unknown to all the systems involved in the measurement. As the basic calculations used in determining the final BLEU score can vary, using the same tool across all the systems being compared is essential.
BLEU score comparisons between two systems presented by some companies can be misleading because:
- companies may use different test sets and one may be simpler than the other
- different BLEU measurement tools are used
- if more human references are used to calculate the BLEU score, the scores will be higher (i.e., scoring one system with 4 human reference translations will increase the number of overlapping words versus a score calculated with 1 human reference translation)
Because of this, I would recommend that you:
- use blind test sets (that is, previously unseen by the system developers) to generate the BLEU scores
- use the same BLEU measurement tool
- adjust and normalize the scores so that a translation scored against 4 human reference translations is not compared directly with one scored against a single human reference translation.
If you are looking at BLEU scores that compare two different translation systems, you should always understand how the results were generated. Comparing systems that were tested on different test sets is largely meaningless and can lead to erroneous conclusions.
What are best-practices in using BLEU?
-- BLEU is best used as a way to evaluate development strategies and most useful to developers engaged in the SMT system building process.
-- Take care to develop a comprehensive and “blind” test set (500 to 1,000+ sentences) that covers your domain of interest.
-- Remember that a system developed to translate software knowledge base material is unlikely to do well on a test set with sentences that are common in general political news. So keep your test set focused on your business purpose.
-- Use BLEU measurements frequently when adding new data to your system, to understand whether the new data is beneficial.
-- When measuring competitive systems ensure that you are using:
-- The same test set
-- The same measurement tool
Remember that BLEU is not useful as an absolute measure of quality, as it only measures matching word clusters between two similar documents.
BLEU is being used more and more as Moses becomes more popular, and while it clearly has many flaws, it remains useful if used with care. I will go into the problems with BLEU and into new measurement approaches in my next entry.