Friday, March 26, 2010

Dispelling some Popular Misconceptions about Statistical MT

Recently some RbMT enthusiasts have been speaking up about what they consider to be SMT deficiencies, both in the LinkedIn MT group and in a March 2010 article in Multilingual magazine.

I covered the SMT vs RbMT debate in a previous blog entry, but I thought it would be useful to address more specifically some of the key issues that get brought up. It is clear that Google and Microsoft, who had years of experience with Systran RbMT, have decided that an SMT-based approach is preferable, probably because of better quality and more control. I even decided to put a Microsoft MT widget on this blog since it allows readers to both translate and correct the translations. How cool is that? This dynamic feedback loop will become a hallmark of all SMT in future.

I have noticed that much of the criticism comes from people who have had little or no direct experience with SMT (beyond Google Translate) and have not had experience customizing a commercial SMT system. Or perhaps their information is somewhat dated and based on old research data; e.g. much is made of "the blue cat" being translated as "le bleu chat" rather than as "le chat bleu". My attempts to recreate this “problem” on 4 different SMT engines showed that the SMT engines in the market today handle this without issue. Go ahead, try it.

Some of the key criticisms are as follows:

-- SMT Is unpredictable:
This is the most common criticism from RbMT “experts”. I think this may be true to some extent if your training data is noisy or not properly cleaned and normalized. An experiment that I was involved with showed exactly this, but only with dirty data (Garbage In Garbage Out). Since many of the early SMT systems were built by customizing baselines built from web-scraped data, there is risk and evidence of unpredictable behavior. At Asia Online we have seen that this unpredictability essentially disappears as the data is cleaned and normalized.
[Figure: Clean data reduces unpredictability]
Another related criticism I have often seen repeated is that SMT systems are inconsistent in terminology use. SMT systems do, by definition, tend to choose phrase patterns that have higher statistical density, but this is easily corrected by using glossaries (much simpler than dictionaries), which provide an override function and encourage the engine to use very specific user-specified terminology.

-- SMT is harder to set up and customize:
The bulk of the effort in building custom SMT systems is focused on data cleaning, data gathering and data preparation for “training”. This should include both bilingual parallel text as well as monolingual target language text. Once data is cleaned, SMT engines can usually be built in days. In future we can expect that this could be reduced to hours for most systems.

TAUS has reported that Moses has been downloaded 4,000 times in the last 12 months. This suggests that many more will be working with SMT in the near future. Historically, the single commercial SMT vendor (US government focused) has been prohibitively expensive and provided very little user control. This is changing as more people jump into open systems SMT, and I think we will see this become a non-issue. What many fail to understand is that RbMT systems give you the ability to control quality by adding dictionaries, and that this is really the only control variable one has. There is generally no ability to modify or add rules without huge expense and vendor support.

We already see companies like Autodesk and Sun (Oracle) and even some LSPs building their own SMT systems, probably, because it is not that difficult. At Asia Online we have clean foundation data for 500+ language combinations that allow our users to quickly build custom systems by combining their data with ours. We will continue to build our clean data resources to enable more people to easily build custom SMT systems.
[Figure: Data Path to Quality]

-- SMT requires large volumes of reliable and clean data:
It is true that SMT does need data to produce good results. But this includes both bilingual TMs and monolingual target language text, which can be easily harvested from the web. For many European languages it is possible to get significantly better results than Google with as little as 100,000 segments in a focused domain. Of course you get better results with more data. Initiatives like PANACEA, LDC, OPUS and Meedan, and entities like the UN and the EC, are making an increasing amount of training data available to the open source community, and we will see this issue also get easier and easier to solve. The TDA may also be helpful if you have money to spare. There are already several SMT-based start-ups in Spain who have considerable experience in RbMT but choose an SMT-based approach for their future because they see that higher quality results are possible with less effort. This shift in focus by competent RbMT practitioners is very telling and I think suggests more will follow.

-- RbMT is a better foundation for hybrid systems:
Hybrid systems are now increasingly seen as the wave of the future. We see that the original data-only mindset amongst SMT developers has changed and increasingly they incorporate more linguistics or syntax and grammar rules into their methodology. RbMT developers, after 50 years of development (in some cases), realize that statistical methods can improve long term structural problems like the fluency of their output. I think the openness of the SMT community will outdistance initiatives with RbMT foundations, especially since it is so difficult to go into the rules engines and make changes. Many of the advances over the next few years will involve pre- and post-processing around one of these approaches. There will probably be another 5,000 people who download and play with Moses this year. This broad collective effort will generate knowledge that my instincts tell me will move faster to higher quality than the much lower investment on the RbMT side.
[Figure: AO Hybrid]
While there are some language combinations where RbMT systems outperform SMT-based systems, I think this too will change. RbMT can still make sense if you have no data, but then why not just use the free online MT? Recent anecdotal surveys by the mainstream press are only the beginning of the coming SMT tidal wave. Most of us can remember how much worse the free online MT experience was a few years ago, when Google and Microsoft were still RbMT based. I now notice continuous improvements in these data-driven systems. As the internet naturally generates more data they will continue to improve. Ultimately this competition keeps everybody honest and hopefully on their toes. Whichever approach produces better quality is likely to gain increasing momentum and acceptance. Apart from the fact that it is getting easier to get data, and SMT's ties to open source, the major driving force I think is user control. SMT simply gives one more control across the board; even the learning algorithms are modifiable. My bet is on the SMT + Linguistics + Active Human Feedback loop as the clear quality leader in the very near future.

This debate is important as the types of skills that active users need to develop for each approach are quite different. For SMT: Data Cleaning, Data Analysis, Linguistic Steering, Linguistic Pattern Analysis and even Linguistic Rules Development for some languages. For RbMT: Dictionary Building and recently Statistical Post-Editing. But it is clear that rapid, efficient post-editing is important in either case.

Please chime in and share your opinion or experience on this issue.

Monday, March 22, 2010

Why Machine Translation Matters -- Part II

I just spent the last few days at the ATA-TCD Conference in Scottsdale AZ. You can read highlights from the Twitter stream by searching on #TCD11. While it is always nice to be in the sun for a few days, it is encouraging to see people in the industry focused on change along key dimensions like standards, technology and automation as well as the impact of social media on business strategy. I enjoyed several thought provoking presentations and the discussions I had with many attendees during the conference.

One of the sessions I did was with Alon Lavie and Mike Dillinger (who both represent AMTA leadership). It gave LSPs a very useful overview for getting a better, more realistic sense of MT and provided a basic primer on the subject. I thought that since my original blog post on this subject is my most popular post, it might be useful to further develop this theme.

My original post focused on how MT could help address information poverty. Here are some of my new comments on why MT matters from the presentation at TCD11. The issue of growth in the sheer volume of information is increasingly clear to most, but it is worth restating with some specific projections from IDC and EMC, who monitor this very closely. The following chart shows projections just on enterprise content volume.
[Chart: Enterprise Data Growth]

In actual fact, the fastest growth is in user generated content (UGC), e.g. blogs, FB, Youtube, Flickr and community forums. It is estimated that 70% of the content on the web is UGC and much of that is very pertinent and useful to enterprises. This content is now influencing consumer behavior all over the world and is often referred to as word-of-mouth-marketing (WOMM). Consumer reviews are often more trusted than corporate marketing-speak and even “expert” reviews. We all have experienced Amazon, travel sites, C-Net and other user rating sites. It is useful for both global consumers and global enterprises to make this multilingual. Given the speed at which this information emerges, MT has to be part of the translation solution, though involving humans in the process will produce better quality.
[Chart: UGC Importance]

So if this is going on, it also means that the primary focus of the professional industry needs to change from the static content of yesteryear to the more dynamic and much higher volume user generated content of today. This is often where product opinions are formed and this is also where customer loyalty or disloyalty can form, as the customer support experience shows. This is what I call high value content. The following chart shows that MT will play a critical role in making this content more visible because it is high value and because of the sheer volume.
[Chart: Shift to Dynamic]

I also found another powerful argument for any multicultural society like the US and UK in this paper by Julia Alanen. She points out that language barriers keep 25 million non-English speakers deprived of critical government services (in the US) and that this also affects the rest of the population. While she focuses on the need for translators and interpreters, the content explosion is hitting this sector too. She points out:
Deprivation of plenary language access undermines human dignity, exacerbates many immigrants’ innate vulnerabilities, and harms society at large by impeding the efficacy of the healthcare and justice systems.
Getting back to the TCD conference, I was glad to see that several people (LSP leaders) asked me how they could learn more about MT and get more engaged with the technology. AMTA is proactively reaching out and trying to connect to the ATA by timing their conference to expand collaboration with the ATA. This is heartening to see and quite a contrast to the negativity and the dueling conferences we see in other parts of the localization industry.

I also saw a quote from June Cohen, Executive Producer of TED Media at SXSW that I think is pretty wonderful (even though it may be naive and idealistic) when she was asked "What technology would you like invented? Or uninvented?"
"Instantaneous, accurate translation online. Nothing would do more to promote peace on this planet." 

Change is coming, and what are initially seen as threats can often be opportunities when one changes one’s own viewpoint. So here’s to change that creates more opportunity. Cheers.

Monday, March 15, 2010

The Ongoing Quest for “Best” MT Translation Quality

MT has been in the news a lot of late and professionals are probably getting tired of this new hype wave. Major stories in The New York Times and the Los Angeles Times have been circulating endlessly – please don’t send them to me, I have seen them. 

There is also another initiative by Gabble On which asks volunteers to evaluate Google Translate, Microsoft Bing, and Yahoo Babel Fish translations. And bloggers like John Yunker and many others have posted the preliminary results to that perennial question “Which Engine Translates Best?” on their blogs.

This certainly shows that inquiring minds want to know and that this is a question that will not go away. It is probably useful to have a general sense from this kind of news but does this kind of coverage really leave you any wiser and more informed?

Without looking at a single article or any of the results, I can tell you that the results are quite predictable, based on my very basic knowledge of statistics. Google is likely to be seen as the best simply because they have greater coverage of what the engines will be tested on and have probably crawled more bilingual parallel data than everybody else added together. I think the NYT comparison clearly suggests this. But does this actually mean that they have the best quality?

I thought it would be useful to share “more informed” opinions on what these types of tests really mean. Much of what I gathered can be found scattered around the ALT Group in LinkedIn so as usual I am just organizing and repurposing.

My personal sense is that this is a pretty meaningless exercise unless one has some upfront clarity on why you are doing this. It depends on what you measure, how you measure, for what objective and when you measure. On any given day, any one of these engines could be the best for what you specifically want to translate. Measuring random snippet translations on baseline capabilities will only provide the crudest measure, one that may or may not be useful to a casual internet user but is completely useless for understanding the possibilities that exist for professional enterprise use, where you hopefully have a much more directed purpose. In the professional context, knowledge about customization strategies and key control parameters is much more important. The more important question for the professional is: Can I make it do what I want relatively well and relatively easily?

The following are some selected comments from the LinkedIn MT group that provide an interesting and more informed (I think so anyway) professional perspective on this news.
Maghi King said: “The only really good test would have to take into account the particular user's needs and the environment in which the MT is going to be used - one size does not really fit all.”
Tex Texin said: “Identifying the best MT by voting will only determine which company encouraged the largest number of its employees to vote.”
Craig Myers said: “MT processes must be "trained" to provide desired outputs through creating a solid feedback loop to optimize accurate outcomes over time. Benchmarking one TM system against another is a fairly ridiculous endeavor unless you accept a very limited range of languages, content, and metrics upon which to base the competition upon - but then a limited scope negates any "real world" conclusions that might be drawn about languages and/or content areas outside of those upon which the competition is based.”
Alon Lavie, AMTA President & Associate Research Professor at Carnegie Mellon University, has, I think, some of the most useful and informed things to say (follow the link to read the full thread):
“The side by side comparison in the NY Times article is NOT an evaluation of MT. These are anecdotal examples. You could legitimately claim that the examples are not representative (of anything) and that casual users may draw unwarranted conclusions from them. I too think that they were poorly chosen. But any serious translation professional should know better. I can't imagine anyone considering using MT professionally drawing any kind of definite conclusions from these particular examples.

The specific choice of examples is not only biased, but also very naive. Take the first snippet from "The Little Prince". Those of us working with SMT should quickly suspect that the human translation of the book is very likely part of Google's training data. Why? The Google translation is simply way too close to the human reference translation. Translators - imagine that the sentences in this passage were in your TM... and Google fundamentally just retrieved the human translations.”

Ethan Shen of Gabble On “is hoping to be able to detect predictive patterns in the data that he could use to predict future engine performance. But he has no control over the input data (participants choose to translate anything they want), and he's collecting just about no real extrinsic information about the data. So beyond very basic things such as language-pair and length of source, he's unlikely to find any characteristics that are predictive of any future performance with any certainty whatsoever. 

What can be done (but Ethan is not doing) is to use intrinsic properties of the MT translations themselves (for example, word and sequence agreement between the MT translations) to identify the better translation. In MT research, that's called "hypothesis selection". My students and I work extensively on a more ambitious problem than that - we do MT system combination, where we attempt to create a new and improved translation by combining pieces from the various original MT translations. Rather than select which translation is best, we leverage all of them. We have had some significant success with this. At the NIST 2009 evaluation, we (and others working on this) were able to get improvements of about six BLEU points beyond the best MT system for Arabic-to-English. That was about a 10% relative improvement. That was a particularly effective setting. Strong but diverse MT engines that each produce good but different translations are the best input to system combination.”
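The "hypothesis selection" idea in the quote above can be illustrated with a rough sketch. This is not the method Alon Lavie's group uses; it is a simple consensus heuristic that I am assuming for illustration, where the output that shares the most words with the other engines' outputs is chosen. The function names and the example sentences are hypothetical.

```python
def word_agreement(a, b):
    """Jaccard overlap between the word sets of two translations."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def select_hypothesis(hypotheses):
    """Return the hypothesis that agrees most, on average, with all the others."""
    if len(hypotheses) == 1:
        return hypotheses[0]

    def average_agreement(index):
        others = [h for i, h in enumerate(hypotheses) if i != index]
        return sum(word_agreement(hypotheses[index], o) for o in others) / len(others)

    best_index = max(range(len(hypotheses)), key=average_agreement)
    return hypotheses[best_index]


# Made-up outputs standing in for three different MT engines.
outputs = [
    "the minister arrived in the capital yesterday",
    "the minister arrived at the capital yesterday",
    "the minister has come in capital yesterday",
]
chosen = select_hypothesis(outputs)
print(chosen)
```

Note that no human reference is needed here, which is the point of using intrinsic properties of the MT output. A real system would also look at word sequences, not just shared words.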
So while these kinds of anecdotal surveys are interesting and can get MT some news buzz, be wary of using them as any real indication of quality. They will also clearly establish that humans/professionals are needed to get real quality. The professional translation industry has hopefully learned that the “translation quality” question needs to be approached with care or you end up with conflation at best and a lot of mostly irrelevant data.

My best MT engine would be the system that does the best job on content I am interested in on that day. So I will try 2 at least. The best for professional use has to be the system that gives users steering control, and the ability to tune an engine to their very specific business needs as easily (and cost-effectively) as possible and helps enterprises build long term leverage in communicating with global customers.

Thursday, March 11, 2010

Problems with BLEU, and New Translation Quality Measurement Approaches

In the previous entry I described the basic concept of MT translation quality measurement using BLEU scores. However, there is much criticism of BLEU for many reasons which I will describe briefly.

These criticisms should be understood if you are to use the metric effectively. BLEU only measures direct word-by-word similarity, and looks to match and measure the extent to which word clusters in two documents are identical. Accurate translations that use different words may score poorly since there is no match in the human reference.

There is no understanding of paraphrases and synonyms, so scores can be somewhat misleading in terms of overall accuracy. You have to get the exact same words as the human reference translation to get credit, e.g.
"Wander" doesn't get partial credit for "stroll," nor "sofa" for "couch."
Also, nonsensical language that contains the right phrases in the wrong order can score high, e.g.
"Appeared calm when he was taken to the American plane, which will to Miami, Florida" would get the very same score as: "was being led to the calm as he was would take carry him seemed quite when taken". These and other problems are described in this article.
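Both weaknesses are easy to see with a toy word-matching score. The sketch below is not full BLEU: it only counts single-word matches (BLEU also counts longer word sequences, which penalizes scrambling to some degree), and the function name and example sentences are made up for illustration.

```python
from collections import Counter


def unigram_precision(candidate, reference):
    """Fraction of candidate words that also appear in the reference (clipped counts)."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    return matched / total


reference = "we sat on the couch after a stroll"
paraphrase = "we sat on the sofa after a wander"  # accurate, but different words
scrambled = "stroll a after couch the on sat we"  # right words, wrong order

paraphrase_score = unigram_precision(paraphrase, reference)
scrambled_score = unigram_precision(scrambled, reference)
print(paraphrase_score, scrambled_score)
```

The perfectly good paraphrase loses credit for "sofa" and "wander", while the unreadable scramble matches every word.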

This paper from the University of Edinburgh shows that BLEU may not correlate with human judgment to the degree previously believed, and that sometimes higher scores do not mean better translation quality.

A more recent criticism identifies the following problems:
-- It is an intrinsically meaningless score
-- It admits too many variations – meaningless and syntactically incorrect variations can score the same as good variations
-- It admits too few variations – it treats synonyms as incorrect
-- More reference translations do not necessarily help
-- Poor correlation with human judgments
They also point out several problems with BLEU in the English-Hindi language combination. Thus it is clear that while BLEU can be useful in some circumstances it is increasingly brought into question.

The research community has recently focused on developing metrics that overcome at least some of these shortcomings and there are a few new measures that are promising. The objective is to develop automated measures that track very closely to human judgments and thus provide a useful proxy during the development of any MT system.  The most promising of these include:

METEOR: Alon Lavie who is the inventor of this metric says:
“I think it would be worth pointing out that while BLEU has been most suitable for "parameter tuning" of SMT systems, METEOR is particularly better suited for assessing MT translation quality of individual sentences, collections of sentences, and documents. This may be quite important to LSPs and to users of MT technology.” METEOR also can be trained to correlate closely to particular types of human quality judgments and customized for specific project needs.

TERp: takes as input a set of reference translations, and a set of machine translation output for that same data. It aligns the MT output to the reference translations, and measures the number of 'edits' needed to transform the MT output into the reference translation.
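The edit-counting idea can be sketched as follows. This is only a plain word-level edit distance (insertions, deletions and substitutions) divided by the reference length. As far as I know, TERp itself goes further by also allowing phrase shifts and stem, synonym and paraphrase matches with tunable costs, none of which is attempted here. The function names are hypothetical.

```python
def word_edit_distance(hypothesis, reference):
    """Minimum number of word insertions, deletions and substitutions."""
    hyp, ref = hypothesis.split(), reference.split()
    # previous[j] is the distance between the hypothesis words seen so far and ref[:j].
    previous = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        current = [i]
        for j, ref_word in enumerate(ref, start=1):
            substitution = previous[j - 1] + (hyp_word != ref_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def edit_rate(hypothesis, reference):
    """Edits per reference word; 0.0 means the output already matches the reference."""
    ref_length = len(reference.split())
    if ref_length == 0:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(hypothesis, reference) / ref_length


print(edit_rate("a cat sat on mat", "the cat sat on the mat"))
```

Unlike BLEU, lower is better here, and the number has a fairly direct reading as the share of the reference that an editor would have to touch.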

There are some who suggest that it is best to use a combination of these measures to get the best sense and maximize accuracy. The NIST and Euromatrix project have studied the correlation of various metrics to human judgments for those who want the gory details.


The basic objective of any automated quality measurement metric is to provide, very quickly and efficiently during the engine development process, the same judgment that a competent, consistent and objective human would provide, and thus guide the development of MT engines in an accurate and useful way.

In the Automated Language Translation group in LinkedIn there is a very interesting discussion on these metrics. If you read the thread all the way through you will see there is much confusion on what is meant by “translation quality”. The dialog between developers and translation professionals has not been productive because of this confusion. Process and linguistic quality are often equated and thus confusion ensues.
This discussion is often difficult because of conflation, i.e. very different concepts being equated and assumed to be the same. I think we have at least 3 different concepts that are being referenced and confused as being the same concept,  in many discussions on “quality”.

1. End to End Process Standards: ISO 9001, EN15038, Microsoft QA and LISA QA 3.1. They have a strong focus on administrative, documentation, review and revision processes, not just the quality assessment of the final translation.
2. Automated SMT System Output Metrics (TQM): BLEU, METEOR, TERp, F-Measure, Rouge and several others that only focus on rapidly scoring MT output by assessing precision and recall and referencing one or more human translations of the exact same source material to develop this score.
3. Human Evaluation of Translations: Error categorization and subjective human quality assessment, usually at a sentence level. SAE J2450, the LISA Quality Metric and perhaps the Butler Hill TQ Metric (that Microsoft uses extensively and TAUS advocates) are examples of this. The screenshot below shows the Asia Online human evaluation tool.
[Screenshot: Human Quality Assessment]
There is also ASTM F2575, a new translation quality assurance standard published by ASTM International (an ANSI-accredited standards body), which defines quality as: the degree to which the characteristics of a translation fulfill the requirements of the agreed upon specifications.
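To make the "precision and recall" behind the automated metrics in item 2 a little more concrete, here is a minimal word-level sketch. It is an illustration under my own assumptions (whitespace tokenization, a single reference, a balanced F-measure), not the definition used by any particular tool, and the function name is hypothetical.

```python
from collections import Counter


def precision_recall_f1(candidate, reference):
    """Word-level precision, recall and balanced F-measure against one reference."""
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    if not cand_counts or not ref_counts:
        return 0.0, 0.0, 0.0
    # Counter intersection keeps the smaller count of each shared word.
    overlap = sum((cand_counts & ref_counts).values())
    precision = overlap / sum(cand_counts.values())  # how much of the output is right
    recall = overlap / sum(ref_counts.values())      # how much of the reference is covered
    if overlap == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


p, r, f = precision_recall_f1("the cat sat", "the cat sat on the mat")
print(p, r, f)
```

In this example every output word is correct (high precision) but half of the reference is missing (low recall), which is why metrics that look at only one of the two can be misleading.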

For the professional translation industry the most important question is: Is the quality good enough to give to a human post-editor? There is an interesting discussion on “Weighting Machine Translation Quality with Cost of Correction” by Tex Texin and commenters. There is work being done to see if METEOR and TERp can be adapted to provide more meaningful input to professionals on the degree of post-editing effort likely for different engines. It will get easier and the process will be more informed.

For most MT projects it is recommended that "Test Sets" be developed with great care to reflect the reality of the system's production use as a first step. There should also always be a human quality assessment step. This is usually done after BLEU/METEOR have been used to develop the basic engine. This human error analysis is used to develop corrective strategies and assess the scope of the post-editing task. I think this step is critical BEFORE you start post-editing as it will provide information on how easy or difficult the effort is likely to be. Successful post-editing has to start with a quality assessment of the general MT engine quality. We are still in the early days of this and the best examples I have seen of this kind of quality assessment are at Microsoft and Asia Online.

So ideally an iterative process chain would be something like: 
Clean Training Data > Build Engine with BLEU/METEOR as TQM > Human Quality and Error Evaluation > Create New Correction Data > Improve & Refine the MT Engine > Release Production Engine > Post-Edit > Feed Corrected Data back to Engine, and finally make an LQA measurement on delivered output to understand the quality in L10N terms.
There is much evidence that an a priori step to clean up/simplify the source content would generally improve overall quality and efficiency and help one get to higher quality sooner.

It is my opinion that we are just now learning how to do this well and processes and tools are still quite primitive, though some in the RbMT community claim they have been doing it (badly?) for years.  As the dialog between the developers and localization professionals gets clearer, I think we will see tools emerge that provide the critical information LSPs need, to understand what the production value of any MT engine is, before they begin the post-editing project. I am optimistic and believe that we will see many interesting advances in the near future.

Tuesday, March 9, 2010

The Need for Automated Translation Quality Measurement in SMT: BLEU

One of the big differences between RbMT and SMT is the role of automated measures in the system development process. Traditionally RbMT systems have had little or no use for this, but an SMT system could probably not be built without using some kind of automated measurement. SMT developers are constantly trying new techniques and data combinations to improve systems, and need quick and frequent feedback on whether a particular strategy is working or not. It is necessary to use some form of standardized, objective and relatively rapid means of assessing quality as part of the system development process.

I will give an overview of BLEU in this entry and continue this discussion in future blog entries.

The question of quality is a difficult one to answer, because there is no widely accepted and entirely objective way to measure the quality/accuracy of automated translation software, or of any translation for that matter. The localization industry has struggled for years to establish some kind of objective measure for human translation quality and has yet to really succeed on this. Competent and objective humans are usually the surest measure of quality, but as we all know, objectivity and real structural rigor are hard to define. LISA claims that 20% or 1 in 5 use the LISA QA Model 3.1 but I have rarely seen examples of this in actual use.

The most widely used measure in SMT today is BLEU. The oddly named BLEU (BiLingual Evaluation Understudy) is an approach developed by IBM that is very actively used by developers in the SMT community, even though everybody is always complaining about how flawed it is. There is a great discussion on this in the Automated Language Translation group in LinkedIn.

What is a BLEU score?
Measuring translation quality is difficult because there is no absolute way to measure how “correct” a translation is. Many “correct” answers are possible, and there can be as many “correct” answers as there are translators. The most common way to measure quality (in SMT) is to compare the output of automated translation to a human translation of the same document. The problem is that one human translator will translate the document significantly differently than another human translator. This leads to problems when using these human references to measure the quality of an automated translation solution. A document translated by an automated software solution may have 60% of the words overlap with one translator’s translation, and only 40% with the other translator’s translation; even though both human reference translations can be technically correct, the one with the 60% overlap with machine translation provides a higher “quality” score for the automated translation than the other translator’s translation did. Therefore, although humans are the true test of correctness, they do not provide an entirely objective and consistent measurement for quality.

The BLEU metric scores a translation on a scale of 0 to 1. The closer to 1, the more overlap there is with a human reference translation and thus the better the system is. In a nutshell, the BLEU metric measures how many words overlap, giving higher scores to sequential words. For example, a string of four words in the translation that match the human reference translation (in the same order) will have a positive impact on the BLEU score and is weighted more heavily (and scored higher) than a one or two word match. It is very unlikely that you would ever score 1 as that would mean that the compared output is exactly the same as the reference output.
-- The scoring algorithm gives no extra credit for unnecessarily repeating high frequency words like “the” (matches are capped at the number of times a word appears in the reference), and it punishes output that is much shorter than the reference (the brevity penalty).
-- Studies have shown that there is a high correlation between BLEU and human judgments of quality when properly used.
-- BLEU scores are often stated on a scale of 0 to 100 to simplify communication but should not be confused with percentage of accuracy.
-- Even two competent human translations of the exact same material may only score in the 0.6 to 0.7 range if they use different vocabulary and phrasing.
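For readers who want to see the mechanics, here is a minimal sketch of a sentence-level BLEU calculation on the 0 to 1 scale. It is a simplification under stated assumptions: a single reference, whitespace tokenization, equal weights for 1- to 4-word sequences and no smoothing. Real evaluations score a whole test set at once and normally use an established scoring tool. The function names are my own.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU on a 0-to-1 scale (no smoothing)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clip each n-gram's credit to its count in the reference, so repeating
        # a frequent word such as "the" earns nothing extra.
        matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if matched == 0 or total == 0:
            return 0.0  # without smoothing, one empty n-gram level zeroes the score
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: output shorter than the reference is scaled down.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, times the brevity penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat"
print(sentence_bleu("the cat sat on the mat", reference))
print(sentence_bleu("the cat sat on the mat today", reference))
print(sentence_bleu("the the the the the the", reference))
```

An exact copy of the reference scores 1, one extra word already pulls the score down noticeably, and a string of repeated words scores 0 because none of its word pairs appear in the reference.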

To conduct a BLEU measurement the following data is necessary:
1. One or more human reference translations. (In the case of SMT, this should be data, that has NOT been used in building the system as training data and ideally should be unknown to the SMT system developer. It is generally recommended that 1,000 or more sentences be used to get a meaningful measurement.) If you use too small a sample set you can sway the score significantly with just a few sentences that match or do not match well.
2. Automated translation output of the exact same source data set.
3. A measurement utility like Language Studio Lite™ that performs the comparison and calculation for you.

As would be expected, using multiple human reference translations will generally result in higher scores, as the SMT output has more human variations to match against. NIST (the National Institute of Standards and Technology) uses BLEU as an approximate measure of quality in its annual MT competitions, with four human reference sets to ensure that some of the variance in human translation is captured and thus allow more accurate quality evaluations of the MT solutions being evaluated. Thus, when companies claim they have the “best” MT system, all they are really saying is that they got the highest BLEU score on a single reference set comparison. The same system could do quite poorly with a different Test Set, so this information should be used with some care. The MT community has also recently started evaluating these measures to see which correspond most closely to human judgments.
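
The effect of adding references can be seen in a small sketch. This is a simplified, unigram-only version of the clipped matching that BLEU uses, with invented sentences; the helper name `clipped_unigram_precision` is hypothetical and the full metric also looks at longer word sequences and output length.

```python
from collections import Counter


def clipped_unigram_precision(candidate, references):
    """Share of candidate words matched in the references.

    A word is credited at most as often as it appears in the single
    reference that contains it most often (BLEU-style clipping).
    """
    cand = Counter(candidate.split())
    best = Counter()
    for ref in references:
        best |= Counter(ref.split())  # element-wise maximum of counts
    matched = sum((cand & best).values())
    return matched / sum(cand.values())


# Invented example: the same MT output scored against one or two references.
mt = "please restart the computer after the update is installed"
ref_1 = "restart your machine after installing the update"
ref_2 = "please reboot the computer once the update has been installed"

single_1 = clipped_unigram_precision(mt, [ref_1])       # 4 of 9 words matched
single_2 = clipped_unigram_precision(mt, [ref_2])       # 6 of 9 words matched
both = clipped_unigram_precision(mt, [ref_1, ref_2])    # 8 of 9 words matched
print(single_1, single_2, both)
```

The MT output is identical in all three cases, so a score obtained with several references says nothing about how the same system compares with one that was scored against a single reference.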

What is BLEU useful for?
SMT systems are built by “training” a computer with examples of human translations. As more human translation data is added, systems should generally get better in quality. Asia Online provides a development environment that allows users to make many adjustments while developing an SMT translation system. Often, new data can be added with beneficial results, but sometimes new data has a negative effect, especially if it is dirty. Thus, system developers need to be able to measure quality rapidly and frequently to make sure they are in fact improving the system.

During the development process, an automatic test is necessary to quickly see the impact of a development strategy. BLEU gives developers a way “to monitor the effect of daily changes to their systems in order to weed out bad ideas from good ideas.” When used to evaluate the relative merit of different system building strategies, BLEU can be quite effective because it provides very quick feedback. This enables SMT developers to quickly refine the translation systems they are building and to continue improving quality on a long term basis.

Asia Online provides a table that is periodically updated showing the BLEU scores of 506 different language combinations. The table is shown below, where the first column is the Source Language code and the first row is the Target Language code. This is useful, since for the most part the same amount/quality/type of core data has been used to build all the SMT systems shown in the table and the test sets used to measure the quality are basically comparable. As you can see, the darker green combinations produce the best systems (given the same amount of data). The table also shows that English to Romance Language and Romance to Romance Language combinations produce the best quality systems, other things being equal. It also shows that Finnish and Hungarian are more difficult to deal with in general.

What is BLEU not useful for?
BLEU scores are always very directly related to a specific “test set” and a specific language pair. Thus, BLEU should not be used as an absolute measure of translation quality because the BLEU score can vary even for one language depending on the test and subject domain. In most cases comparing BLEU scores across different languages is meaningless unless very strict protocols have been followed.

Because of this, it is always recommended to use human translators to measure fluency and verify the accuracy of the systems after systems have been built. Also, most industry leaders will always vet the BLEU score readings with human assessments before production use.

In competitive comparisons it is important to carry out the comparison tests in an unbiased, scientific manner to get a true view of where you stand against competitive alternatives. Thus it is important to use the exact same test set AND the same BLEU measurement tool. The Test Set should be unknown to all the systems that are involved in the measurement. As the basic calculations used in determining the final BLEU can also vary, it is important to use the same tool when measuring several different systems.
BLEU score comparisons between two systems presented by some companies can be misleading because:
  1. companies may use different test sets and one may be simpler than the other
  2. different BLEU measurement tools are used
  3. if more human references are used to calculate the BLEU score, the scores will be higher (i.e., scoring one system with 4 human reference translations will increase the number of overlapping words versus a score calculated with 1 human reference translation)
Because of this, I would recommend that you:
  1. use blind (that is, previously unseen by the system developers) test sets to generate the BLEU scores
  2. use the same BLEU measurement tool
  3. adjust and normalize the scores so that a translation scored using 4 human reference translations is not compared to a translation scored with only one human reference translation.
If you are looking at BLEU scores that compare two different translation systems, you should always understand how the results were generated. Comparing systems that were tested on different test sets will be somewhat meaningless and could lead to very misleading and erroneous conclusions.

What are best-practices in using BLEU?
-- BLEU is best used as a way to evaluate development strategies and most useful to developers engaged in the SMT system building process. 
-- Take care to develop a comprehensive and “blind” set of test data (500 to 1,000+ sentences) that covers the domain of interest, and use it to measure your systems.
-- Remember that a system developed to translate software knowledge base material is unlikely to do well on a test set with sentences that are common in general political news. So keep your test set focused on your business purpose. 
-- Use BLEU measurements frequently when adding new data to your system to understand if it is beneficial or not.
-- When measuring competitive systems ensure that you are using: 
    -- The same test set 
    -- The same measurement tool 

Remember that BLEU is not useful as an absolute measure for quality as it only focuses on matching word clusters in two similar documents.

BLEU is used more and more often as Moses becomes more popular. And while it is clear there are many flaws it is still useful if it is used with care. I will go into the problems with BLEU and new measurement approaches in my next entry.

Wednesday, March 3, 2010

Interesting Video Content on Translation Technology

Today the NY Times had an interesting story about the changing web. It is growing massively in volume, from petabytes to zettabytes to yottabytes in fact, according to Cisco, which also estimates that video will account for 90 percent of all Internet traffic by 2013. This growing use of video content will likely drive change in many business functions as well, e.g. customer support, training, marketing and even basic product communications.

So what does this huge growth in video content mean for those of us in the professional translation industry?
-- The business value of user documentation and “corporate-speak filled websites” will increasingly diminish.
-- Leading edge corporations will increasingly move to using video to help sell, support and inform their global customers about their products and services.
-- This is both an opportunity and a threat to the professional industry, and those who learn how to do video well will likely emerge as leaders.
-- I think that this will only intensify the demand for increased use of automation = MT, but I think that there will also be an increased need for more effective collaborative networks. 

So while I can’t really say a whole lot more about how video will impact the translation industry, I thought it would be valuable to gather some of the best video content I know of on MT or translation automation on the web and share the links. I have tried to keep it to the best (sound/video) sources.

I do not know of many (any) sources for any RbMT related presentations so please let me know or mention them in the comments if you know of any. (The one that I did have from TAUS has been removed from the site.)

Philipp Koehn of University of Edinburgh & Asia Online: Excellent overview on the basics of Statistical Machine Translation and how SMT works, and also covers the automatic evaluation methodology. This is a great primer for somebody who wants to understand the basics of SMT as described by a world expert on the subject.

Franz Och of Google on the state of Google SMT:
Some highlights include:
o Portuguese is their best single system in terms of BLEU scores
o Huge data: over 1 B words on many LPs and even Finnish improves as you get to 1B words
o Google will also make LMs for 10 Euro LPs, with 100B words/LP in 5-grams, available soon via LDC
o GOOG has much more demand on English to X than the other way around
o GOOG is building a Yiddish system; using linguistic bridges to German / Polish / Hebrew has made it possible to develop an SMT system
o GOOG is working on long-distance reordering /dependencies using syntax & two-pass before translation
o Anaphora / topic identification / non-local word disambiguation is also under development
o Handling target language morphology is a major challenge
o They are trying to develop reliability and confidence measures
o Training attempts to focus only on BLEU maximization – they want to find a way to do machine learning that can optimize on many dimensions
o GOOG wants to develop interactive MT and translation as a dialogue - editing
o GOOG wants to get to 100 LPs in the next few years

Dion Wiggins Presentation from the December 2009 Thailand Conference on why we continue to care about MT and about emerging models for man-machine collaboration.

Jost Zetsche on the “Reconvergence of MT and TM”: he explains how machine translation has rapidly evolved from a separate, quite isolated technology into a new concept that is very much integrated in other translation tools and systems used by human translators. Thus, while it cannot be said that MT works well in every language, there is a growing body of translators who have actually experienced productivity benefits.

Asia Online Presentation from webinar on Why LSPs should be interested in MT. (Requires registration) 

Asia Online & Moravia Presentation from webinar on MT use in Customer Support applications. There is a great summary by Jenia Lazlo if you want to get the gist of it without watching. 

Another presentation on The State of the Art of SMT from Ronald Kuhn.

The Bay Area MT User Group has also put up some presentations on the experience that Autodesk & Sun have had with the Moses Open Source SMT technology. I do believe that you need to become a member first.

Please let me know if you know of other video material that could be included in this list.