
Wednesday, January 25, 2012

A Short Guide to Measuring and Comparing Machine Translation Engines

This article from the Asia Online November 2011 Newsletter, authored by Dion Wiggins, CEO of Asia Online, provides useful advice for meaningful comparisons of MT engines. So the next time somebody promises you a BLEU of 60, be skeptical, and make sure you get the proper context and assurances that it was properly done.


“What is your BLEU score?” This is the single most irrelevant question relating to translation quality, yet one of the most frequently asked. BLEU scores and other translation quality metrics greatly depend on many factors that must be understood in order for a score to be meaningful. A BLEU score of 20 in some cases can be better than a BLEU score of 50 or vice versa. Without understanding how a test set was measured and other details such as language pair and domain complexity, a BLEU score without context is not much more than a meaningless number. (For a primer on BLEU look here.)


BLEU scores and other translation quality metrics will vary based upon:
  • The test set being measured: Different test sets will give very different scores. A test set that is out of domain will usually score lower than a test set that is in the domain of the translation engine being tested. The quality of the segments in the test set should be gold standard (i.e. validated as correct by humans). Lower quality test set data will give a less meaningful score.
  • How many human reference translations were used: If there is more than one human reference translation, the resulting BLEU score will be higher as there are more opportunities for the machine translation to match part of the reference.
  • The complexity of the language pair: Spanish is a simpler language in terms of grammar and structure than Finnish or Chinese relative to English. Typically if the source or target language is relatively more complex,  the BLEU score will be lower.
  • The complexity of the domain: A patent has far more complex text and structure than a children’s story book. Very different metric scores will be calculated based on the complexity of the domain. It is not practical to compare two different test sets and conclude that one translation engine is better than the other.
  • The capitalization of the segments being measured: When comparing metrics, the most common form of measurement is Case Insensitive. However when publishing, Case Sensitive is also important and may also be measured.
  • The measurement software: There are many measurement tools for translation quality. Each may vary slightly with respect to how a score is calculated, or the settings for the measure tools may not be set the same. The same measurement software should be used for all measurements. Asia Online provides Language Studio™ Pro free of charge and this software measures the scores, for a given test set, for a variety of quality metrics.
It is clear from the above list of variables that a BLEU score number by itself has no real meaning.
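To make two of these variables concrete (the number of references and capitalization), here is a minimal sketch of a simplified, single-segment BLEU in Python. It is not the scoring code of Language Studio™ Pro or of any other tool: the function names, the whitespace tokenization and the example sentences are all assumptions made for illustration, and real tools score a whole test set at once and differ in details.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4, lowercase=False):
    """Simplified single-segment BLEU on whitespace tokens, scaled 0-100.

    Illustrative only: no smoothing, one segment, naive tokenization.
    """
    if lowercase:  # case-insensitive scoring
        candidate = candidate.lower()
        references = [r.lower() for r in references]
    cand = candidate.split()
    refs = [r.split() for r in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        # Each n-gram may be credited up to its highest count in any reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        matched = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if matched == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(matched / total))
    # Brevity penalty against the reference closest in length to the candidate.
    closest = min((len(r) for r in refs), key=lambda r: (abs(r - len(cand)), r))
    bp = 1.0 if len(cand) > closest else math.exp(1 - closest / len(cand))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)


mt_output = "click the save button to store your changes on the server"
ref_a = "click the save button to keep your changes on the server"
ref_b = "press save to store your changes on the server"

for label, refs in [("ref A", [ref_a]), ("ref B", [ref_b]), ("both", [ref_a, ref_b])]:
    print(label, round(bleu(mt_output, refs), 2))
```

In this made-up example the same machine output gets three different scores depending only on which references are used, and the combined score is the highest because the output can match pieces of either reference. Passing lowercase=True gives the case-insensitive variant.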
How BLEU scores and other translation metrics are measured
With BLEU scores, a higher score indicates higher quality. A BLEU score is not a linear metric. A 2 BLEU point increase from 20 to 22 will be considerably more noticeable than the same increase from 50 to 52. F-Measure and METEOR also work in this manner where a higher score is also better. For Translation Error Rate (TER), a lower score is a better score. Language Studio™ Pro supports all of these metrics and can be downloaded for free.
Basic Test Set Criteria Checklist
The criteria specified by this checklist are absolute. Not complying with any of the checklist items will result in a score that is unreliable and less meaningful.
  • Test Set Data should be very high quality: If the test set data are of low quality, then the measurement delivered will not be reliable.
  • Test set should be in domain: The test set should represent the type of information that you are going to translate. The domain, writing style and vocabulary should be representative of what you intend to translate. Testing on out-of-domain text will not result in a useful metric.
  • Test Set Data must not be included in the training Data: If you are creating an SMT engine, then you must make sure that the data you are testing with or very similar data are not in the data that the engine was trained with. If the test data are in the training data the scores will be artificially high and will not represent the level of quality that will be output when other "blind" data are translated.
  • Test Set Data should be data that can be translated: Test set segments should have a minimal amount of dates, times, numbers and names. While these are a valid part of a segment, they are not parts of the segment that are translated; they are usually transformed or mapped. The focus for a test set should be on words that are to be translated.
  • Test Set Data should have segments that are between 8 and 15 words in length: Short segments will artificially raise the quality scores as most metrics do not take into account segment length. Short segments are more likely to get a perfect match of the entire phrase, which is not a translation and is more like a 100% match with a translation memory. The longer the segment, the more opportunity there is for variations on what is being translated. This will result in artificially lower scores, even if the translation is good. A small number of segments shorter than 8 words or longer than 15 words is acceptable, but these should be limited.
  • Test set should be at least 1,000 segments: While it is possible to get a metric from shorter test sets, a reasonable statistical representation of the metric can only be created when there are sufficient segments to build statistics from. When there are only a low number of segments, small anomalies in one or two segments can raise or reduce the test set score artificially. Be skeptical of scores from test sets that only contain a few hundred sentences.
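Three of the checklist items (test set size, segment length, and no overlap with training data) can be checked mechanically. The sketch below is one possible way to do that in Python. The function names and the 10% tolerance for out-of-range segments are assumptions made for illustration, since the checklist only says such segments "should be limited". It also catches only verbatim overlap (ignoring case and spacing), not the "very similar data" the checklist warns about.

```python
def normalize(segment):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(segment.lower().split())


def check_test_set(test_segments, training_segments, min_segments=1000,
                   min_words=8, max_words=15, max_out_of_range=0.10):
    """Report basic test set problems (thresholds other than 1000/8/15 are assumed)."""
    training = {normalize(s) for s in training_segments}
    # Test segments that also occur in the training data are not "blind".
    leaked = [s for s in test_segments if normalize(s) in training]
    out_of_range = [s for s in test_segments
                    if not min_words <= len(s.split()) <= max_words]
    share = len(out_of_range) / len(test_segments) if test_segments else 0.0
    return {
        "enough_segments": len(test_segments) >= min_segments,
        "leaked_segments": leaked,
        "out_of_range_share": share,
        "length_ok": share <= max_out_of_range,
    }


# Tiny made-up example: far too few segments, one leak, one short segment.
report = check_test_set(
    ["Restart the server after the update has been fully installed",
     "Click OK"],
    ["restart the server after the update has been fully installed"],
)
print(report)
```

Data quality, domain fit and translatability still need a human check; a script like this only rules out the mechanical failures.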
Comparing Translation Engines - Initial Assessment Checklist
Language Studio™ can be used for calculating BLEU, TER, F-Measure and METEOR scores.
  • All conditions of the Basic Test Set Criteria must be met: If any condition is not met, then the results of the test could be flawed and not meaningful or reliable.
  • Test set must be consistent: The exact same test set must be used for comparison across all translation engines. Do not use different test sets for different engines.
  • Test sets should be “blind”: If the MT engine has seen the test set before or included the test set data in the training data, then the quality of the output will be artificially high and not represent the true quality of the system.
  • Tests must be carried out transparently: Where possible, submit the data yourself to the MT engine and get it back immediately. Do not rely on a third party to submit the data. If there are no tools or APIs for test set submission, the test set should be returned within 10 minutes of being submitted to the vendor via email. This removes any possibility of the MT vendor tampering with the output or fine tuning the engine based on the output.
  • Word Segmentation and Tokenization must be consistent: If Word Segmentation is required (i.e. for languages such as Chinese, Japanese and Thai) then the same word segmentation tool should be used on the reference translations and all the machine translation outputs. The same tokenization should also be used. Language Studio™ Pro provides a simple means to ensure all tokenization is consistent with its embedded tokenization technology.
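The tokenization point is easy to underestimate, so here is a small Python sketch of what goes wrong when the reference and the machine output are tokenized differently. The tokenizer and the unigram-precision helper are hypothetical stand-ins written for this illustration; they are not the tokenization used by Language Studio™ Pro or by any particular scoring tool.

```python
import re
from collections import Counter


def tokenize(text):
    """Split into words and separate punctuation marks (illustrative tokenizer)."""
    return re.findall(r"\w+|[^\w\s]", text)


def unigram_precision(hyp_tokens, ref_tokens):
    """Share of hypothesis tokens that also appear in the reference (clipped)."""
    ref_counts = Counter(ref_tokens)
    matched = sum(min(c, ref_counts[t]) for t, c in Counter(hyp_tokens).items())
    return matched / len(hyp_tokens) if hyp_tokens else 0.0


reference = "Save your changes, then restart."
mt_output = "Save your changes, then restart."  # identical to the reference

# Same tokenizer on both sides: every token matches.
consistent = unigram_precision(tokenize(mt_output), tokenize(reference))

# Mixed tokenization: "changes," and "restart." no longer match anything.
mixed = unigram_precision(mt_output.split(), tokenize(reference))

print(consistent, mixed)
```

Here a perfect translation loses two of its five tokens purely because of a tokenization mismatch, which is why the same tool must be applied to the references and to every engine's output.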
Ability to Improve is More Important than Initial Translation Engine Quality
The initial scores of a machine translation engine, while indicative of initial quality, should be viewed as a starting point for rapid improvement which is measured by the test set and BLEU scores. Depending on the volume and quality of data provided to the SMT vendor for training, the quality may be lower or higher. Most often, more important than the initial quality is how quickly the translation engine quality improves.

Frequently a new translation engine will have gaps in vocabulary and grammatical coverage. Other machine translation vendors’ engines do not improve at all or merely improve very little unless huge volumes of data are added to the initial training data. Most vendors recommend retraining once you have gathered a volume of additional data that is at least 20% of the size of the initial training data that the engine was trained on. Even when this volume of data is added, only a small improvement is achieved. As a result, very few translation engines evolve in quality much further than their initial quality.

In stark contrast, Language Studio™ translation engines are created with millions of sentences of data that Asia Online has prepared in addition to the data that the customer provides. The translation engines improve rapidly with a very small amount of feedback. It is not uncommon to get a 1-2 BLEU score improvement with as little as a few thousand post-edited sentences. Language Studio has a unique 4 step approach that leverages the benefits of Clean Data SMT and manufactures additional learning data by directly analyzing the edits made to the machine translated output.
Consequently, only a small amount of post-edited feedback can improve Language Studio™ translation engine quality quite considerably, and it can do so at speeds much faster and with far less effort than with other machine translation vendors. Asia Online provides complimentary Incremental Improvement Trainings to encourage rapid translation engine quality improvement with every full customization and also offers additional complimentary Incremental Improvement Trainings when word packages are purchased, greatly reducing Total Cost of Ownership (TCO). 
 
An investment in quality at the development stages of a translation engine impacts and reduces the cost of post editing directly, while increasing post editing productivity. While the development of some rules, normalization, glossary and non-translatable term work will assist in the rate of improvement, the fastest and most efficient way to improve Language Studio™ engines is to post edit the translations and feed them back into Language Studio™ for processing. The edits will be analyzed and new training data will be generated, directly addressing the primary cause of most errors. In other words, just post editing as part of a normal project will result in an immediate improvement. Little or no other extra effort is needed. By leveraging the standard post editing process, the effort and cost of improvement as well as the volume of data required in order to improve is greatly reduced. 

Depending on the initial training data provided by the client, a small number of Incremental Improvement Trainings are usually sufficient for most Language Studio™ translation engines to improve to a quality level approaching near-human quality. 

Other machine translation vendors are now also claiming to build systems based on Clean Data SMT. Closer investigation reveals that their definition of “cleaning” is not the same as Asia Online’s. Removing formatting tags is not cleaning data. Language Studio™ analyzes translation memories and other training data and ensures that only the highest quality in-domain data from trusted sources is included in the creation of your custom engine. The result is that improvements are rapid. Even with just a few thousand segments edited, the improvements are notable. When combined with Language Studio™ hybrid rules and an SMT approach to machine translation, the quality of the translation output can increase by as much as 10, 20 or even 30 BLEU points between versions.
Comparing Translation Engines – Translation Quality Improvement Assessment
  • Comparing Versions: When comparing improvements between versions of a translation engine from a single vendor, it is possible to work with just one test set, but the vendor must ensure that the test set remains “blind” and that the scores are not biased towards the test set. Only then can a meaningful representation of quality improvement be achieved.
  • Comparing Machine Translation Vendors: When comparing translation engine output from different vendors, a second “blind” test set is often needed to measure improvement. While you can use the first test set, it is often difficult to ensure that the vendor did not adapt its system to better suit and be biased towards the test set, thereby delivering an artificially high score. It is also possible for the test set data to be added to the engine’s training data, which will also bias the score.
As a general rule, if you cannot be 100% certain that the vendor has not included the first test set data or adapted the engine to suit the test set, then a second “blind” test set is required. When a second test set is used, a measurement should be taken from the original translation engine and compared to the improved translation engine to give a meaningful result that can be trusted and relied upon.
Bringing It All Together
The table below shows a real world example of a version 1 translation engine from Asia Online and an improved version after feedback. Additional rules were added to the translation to meet specific client requirements, which resulted in considerable improvement in translation quality. This is part of Asia Online’s standard customization process. Language Studio™ puts a very high level of control in the customer’s hands where rules, runtime glossaries, non-translatable terms and other customization features ensure the quality of the output is as close to human quality and requires the least amount of editing possible. 


BLEU Score Comparisons*

                      Asia Online  Asia Online  Asia Online
                      V1 SMT       V2 SMT       V2 SMT + Rules  Google  Bing   Systran
Case Sensitive
Reference 1           36.05        45.96        56.59           30.58   29.64  21.01
Reference 2           35.80        39.31        48.85           32.05   29.94  22.56
Reference 3           38.65        52.31        65.03           35.51   33.17  24.68
Combined References   50.45        66.52        80.48           44.58   41.65  30.26
Case Insensitive
Reference 1           41.30        52.65        59.25           32.18   31.49  22.49
Reference 2           41.01        45.32        51.24           33.67   31.64  23.88
Reference 3           43.99        58.97        67.49           37.15   35.01  25.92
Combined References   56.83        74.35        82.89           46.26   43.68  31.68

*Language Pair: English into French. Domain: Information Technology.

It can be seen clearly from the scores above that when all three human reference translations are combined the BLEU score is significantly higher and that the BLEU scores vary considerably between each of the human reference translations. The impact of the improvement and the application of client-specific rules can also be seen, raising the case sensitive BLEU score from 50.45 to 80.48 (an increase of 30.03 in just one improvement iteration). One interesting side effect of having multiple human references is that it is often possible to judge the quality of the human references as well. In the example above, the machine translation output is much closer to human reference 3, indicating a higher quality reference. The client later confirmed that the editor who prepared the reference was a senior editor and more skilled than the other 2 editors who prepared human references 1 and 2.
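The arithmetic behind these observations can be reproduced directly from the table. The short Python sketch below uses only the case sensitive figures published above; the variable and function names are just illustrative.

```python
# Case sensitive, combined-reference BLEU scores copied from the table.
combined_case_sensitive = {
    "Asia Online V1 SMT": 50.45,
    "Asia Online V2 SMT": 66.52,
    "Asia Online V2 SMT + Rules": 80.48,
    "Google": 44.58,
    "Bing": 41.65,
    "Systran": 30.26,
}

# Case sensitive scores of the V2 SMT + Rules engine against each single reference.
v2_rules_by_reference = {
    "Reference 1": 56.59,
    "Reference 2": 48.85,
    "Reference 3": 65.03,
}


def gain(scores, before, after):
    """BLEU point difference between two columns, rounded to two decimals."""
    return round(scores[after] - scores[before], 2)


improvement = gain(combined_case_sensitive,
                   "Asia Online V1 SMT", "Asia Online V2 SMT + Rules")
closest_reference = max(v2_rules_by_reference, key=v2_rules_by_reference.get)
multi_reference_lift = round(
    combined_case_sensitive["Asia Online V2 SMT + Rules"]
    - max(v2_rules_by_reference.values()), 2)

print(improvement)           # 30.03
print(closest_reference)     # Reference 3
print(multi_reference_lift)  # 15.45
```

The last figure shows how much of the headline score comes from the number of references alone: for the same engine and the same output, combining three references adds more than 15 BLEU points over the best single reference.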


A BLEU score, as with other translation metrics, is just a meaningless number unless it is established in a controlled environment. Asking “What is your BLEU score?” could result in any one of the above scores being given. When controls are applied, translation metrics can be used both to measure improvements in a translation engine and compare translation engines from different vendors. However, while automated metrics are useful, the ultimate measurement is still a human assessment. Language Studio™ Pro also provides tools to assist in delivering balanced, repeatable and meaningful metrics for human quality assessment.

Thursday, December 29, 2011

Review: Most Popular Blog Posts from 2011

Blogs are about sharing with authenticity. A good blog can help you really connect deeply with your audience in a meaningful way because the content is not only relevant but insightful and personal. I think most enterprises miss that point. When you do it right, your customers will walk away not only having learned something new but will also feel much more connected to your brand. -- David Armano, EVP, Global Innovation & Integration at Edelman Digital
Don’t say anything online that you wouldn’t want plastered on a billboard with your face on it. -- Erin Bury

One of the things that I enjoy about blogging is the feedback that one gets, and the continuing and evolving discussion that sometimes comes forth from these posts. I find it helps to clarify my thinking on what really matters, and the critical feedback one gets on assumptions that might otherwise go unquestioned is very useful in evolving my own thinking on these issues. The feedback and the rankings help me, and others too, I think, to understand what strikes a chord in the reader community, and can also sometimes help to guide further evolutionary thinking on the subjects at hand. This is a ranking of the most popular (Unique Visitors and Page Views) posts of the year based on the data provided by Google Analytics.
  1. Analysis of the Shutdown Announcements of the Google Translate API and the subsequent posts on what this may mean for the translation industry were by far the most popular posts of the year. The original post authored by Dion Wiggins was also referenced by the Atlantic and  other mainstream media and still continues to be an influential view on the announcement today, probably much more so than any other publicly offered opinion in the professional translation industry.
  2. The Continuing Saga & Evolution of Machine Translation Coverage of the IMTT 7th Conference in Cordoba, which triggered active debates and discussions on MT, automation and translator compensation in several forums and clearly struck a chord for many.
  3. The Future of Translation Memory (TM) is a posting that continues to receive high new visit rates long after it was originally published.
  4. The Building Momentum for Post-Edited Machine Translation (PEMT) A number of case studies on the increasing use of post-edited MT to meet business timeliness and production cost requirements.
  5. Has Google Translate Reached the Limits of its Ongoing Improvement? More evidence that more data is not always better, especially for MT, but even for Search, and the many reasons to consider data quality, yet again.
  6. The Growing Interest & Concern About the Future of Professional Translation About reactions to the changes underway in translation
  7. Standards: the Importance of Measurement A guest post by Valeria Cannavina on how standards can drive quality improvements
  8. The Moses Madness and Dead Flowers A post that questions some of the assumptions made by “instant Moses” advocates and challenges the long-term value of these experiments. Strong opinions voiced in the comments.
  9. Translation Crowdsourcing An exploration of the driving forces underlying successful translation crowdsourcing efforts.
  10. An Exploration of Post-Editing MT – Part I Discussion on the nature and compensation of post-editing MT work.
Please Repeat: Influence is NOT Popularity --  Brian Solis

While reader traffic is one way to measure the impact of articles, there are also other ways that capture the relative influence of individual posts. PostRank is one such measure that I think monitors how others reference the posts, and monitors where and when content generates meaningful interactions across the web. They provide a truer picture of the relative influence and impact of individual blog posts, and thus I include the latest PostRank snapshot here. (You can link to the posts through the table on the right of this blog text).  This table shows that some articles that may not have had high direct readership may actually be much more useful to readers and it is interesting to see how different the two lists are though it is clear that the analysis of the Google Translate API shutdown/pay-wall was a major hit no matter how you look at it. 

[Image: PostRank snapshot]

It is also interesting to note that some older posts continue to strike a chord with readers and remain active in terms of visibility because the themes are longer lived and also perhaps because they ring true. The original post on standards and some of the posts discussing disintermediation were also posts that generate continuing interest and continue to show up in both the Google Analytics and PostRank ratings.

I have noticed that we are getting more clarity on post-editing MT work in many different ways including new models for more equitable compensation. I am hoping to highlight best practices in this area in the coming year as I believe it will be critical to ongoing adoption and success with MT technology. I also think there will be much more to share on best practices of post-editing MT and I expect that we may find that it is not quite the dreaded beast it has often been portrayed to be.

Social Media is not just a set of new channels for marketing messages. It’s an opportunity for organizations to align with the marketplace and start delivering on behalf of customers  -- Valeria Maltoni, conversationagent.com

I would also like to invite some of you to contribute to the discussion in this blog (guest posts) and assure you that I believe in open discourse and think it is useful for many different viewpoints to be aired to get closer to the “truth”. So please don’t hesitate to send me contributions that you think might be interesting to the audience that has been following this blog. I thank you for your support and I hope that the content here will continue to earn your interest and comments to extend the discussion beyond my thoughts on key issues.

For those who are not aware, there are some very interesting videos from presentations at TAUS that I reported on in the 4th ranked posting above on PEMT momentum.   


Videos of presentations and panels at the recent TAUS User Conference in Santa Clara are now available on YouTube for everyone. The links below will take you to playlists on specific themes: 


I wish you all a wonderful holiday season and look forward to sharing observations in the coming year, a year that many say will be a turning point across many dimensions.

Friday, December 2, 2011

The Moses Madness and Dead Flowers

Machine translation technology has an unfortunate history of overpromising and under delivering. At least 50 years of doing this and sometimes it seems that the torture will never stop. MT enthusiasts continue to make promises that often greatly exceed the realistic possibilities. Recently, in various conversations, I have seen that the level of unwarranted exuberance around the possibilities with the Moses Open Source SMT technology is rising to peak levels. This is especially true in the LSP community. While most technologies go through a single hype cycle, MT seems destined to go through several of these cycles with each new approach and the latest of these is what I call Moses Madness. It has become fashionable of late to build instant DIY MT engines, using tools that help you with the mechanics of running the software that is “Moses”.  While some of these tools greatly simplify the mechanical process of running the Moses software, they do not give you any insight into what is really going on inside the magic box or any clues to what you are doing at all.  Moses is a wonderful technology and it enables all kinds of experimentation that furthers the art and science of data-driven MT, but it does require some knowledge and understanding for real success. It is possible to get a quick and dirty MT engine together using some of these tools, but for long-term strategic translation production leverage, I am not so sure. Thus it is my sense that we are at the peak of the hype cycle for DIY Moses.


I would like to present a somewhat contrarian viewpoint to much of what you will hear at TAUS - “Let a thousand MT systems bloom”,  and other online forums on getting started with instant MT approaches.  IMO Moses and especially instant Moses is clearly not the final answer. While Moses is a starting point for real development, it should not be mistaken as the final destination. I think there are a number of reasons that you should pause before you jump in, and at least build up some knowledge before taking the dive. I have attempted to enumerate some of these reasons, but I am sure some will disagree. Anyway, I hope an open discussion will be valuable in reaching a more sustainable and accurate view of the reality and so here goes, even though perhaps I am rushing in where angels fear to tread.  And of course my opinion on this matter is not impartial, given my involvement with Asia Online.


The Sheer Complexity
As you can see from the official description, Moses is an open source project that makes its home in the academic research community. This link describes some of the conferences where people with some expertise and understanding of what Moses actually does convene and share information. Take a look at the program committee of these conferences to get a sense of what the focus might be. Now take a look at the “step-by-step guide”, which students in NLP are expected to be able to handle. It is what you would have to do to build an MT system if you did not have the DIY kit. Most of the instant/simplified Moses engine services in the market focus on simplifying this and only this aspect of developing an MT engine.

Clearly it would be good to have some knowledge of what is going on in the magic box BEFORE you begin, and perhaps it would even be really nice to have some limited team expertise with computational linguistics to make your exploration more useful. Remember that hiding complexity is not quite the same as removing complexity, and it would be smart to not underestimate this complexity BEFORE you begin. Anybody who has ventured into this has probably realized already, that while some of the complexity has been hidden, there is still much that is ugly and complicated to deal with in Moses world, and often it feels like the blind leading the blind.

I have noticed that many in the professional translation industry have trouble even with basics like MT system BLEU scoring, and even some alleged MT experts barely know how to measure BLEU accurately and fairly. Thus I am skeptical that LSPs will be able to jump into this with any real level of competence in the short term, that is, a level of competence that assures or at least raises the probability of business success, i.e. enhances long-term translation productivity. Though it is possible that a hardy few will learn over the next 2-5 years, it is also clear that NLP and computational linguistics are not for everyone. The level and extent of knowledge required is simply too specialized and vast. As Richard Feynman said: “I think it’s much more interesting to live not knowing, than to have answers which might be wrong.” (Though he was talking about beauty, curiosity and mostly about doubt.)

Alon Lavie, AMTA President, CMU NLP professor and President of Safaba (which develops hosted MT solutions that are largely built on top of Moses) says:
“I am of course a strong supporter, and am extremely enthusiastic about Moses and what it has accomplished in both academic research and in the commercial space. I also think there is indeed a lot of value in the various DIY offerings (commercial and Achim's M4L efforts). But these efforts primarily target and solve the *engineering complexity* of deploying Moses. While this undoubtedly is a critical bottleneck, I think there is a potential pitfall here that users that are not MT experts (the vast majority) would come to believe that that's all it takes to build a state-of-the-art MT system. The technology is actually complex and is getting more complex and involved to master. Users may be disappointed with what they get from DIY Moses, and more detrimentally, become convinced that that's the best they can accomplish, when in fact letting expert MT developers do the work can result in far better performance results. I think this is an important message to communicate to potential users, but I'm not sure how best to communicate this message.”

Thus, I will join Alon in trying to convey the message that Moses is a starting point in your exploration of MT and not the final answer, and that experience, expertise and knowledge matter. Perhaps, a way to understand the complexity issue better, is to use some analogies. 

The sewing machine/tailor analogy: Moses can perhaps be viewed as a very basic sewing machine. You still need to understand how to cut cloth, stitching technique, fabric and lining selection, measurement, pocket technique (?), final fit modifications and so on to make clothes. Tailors do it better, and expert tailors that only focus on men's suits do it even better than you or I would with the same sewing equipment. The closest to a ready-made suit would be the free MT engines, except in this analogy they are only available in one size. Expertise really does matter, folks, if you want to customize-to-fit.

The DIY car analogy: In this analogy, Moses is the car engine and perhaps a very basic chassis, one that would be dangerous on a highway or bumpy roads. The DIY task is to build a car that can actually be used as transportation. This will require some understanding of auto systems design, matching key components to each other, tires, braking systems, body design and so on. Finally you also need to learn to drive and you would want the car to turn right when you want to. Again, expert mechanics are more likely to be successful even though there are some great DIY kits out there for NASCAR enthusiasts.


The Learning Curve
Even if you do have a team with some NLP expertise, remember that working with any complex technology involves a process of learning and usually an apprenticeship to get to a point of real skill. The people who build SMT engines at Microsoft, Google, Asia Online and other MT research teams have built thousands of MT engines during their MT careers. The skills developed and lessons learned during this experience are not easily replicated and embedded into open source code. Failure is often the best teacher and most of these teams have failed often enough to understand the many pitfalls along an SMT engine development path. To expect that any “instant” Moses solution is going to capture and encapsulate all of this is naïve and somewhat arrogant. This is the kind of skill where expertise builds slowly, and comes after much experimentation across many different kinds of data and use case scenarios. Just as professional tailors and expert mechanics are likely to produce better results, MT experts who work across many different use scenarios are likely to produce much better results than a do-it-yourself enthusiast might. These results translate into long-term savings that should far exceed an initially higher price.

The objective of MT deployment for most LSP users is to increase translation productivity. (Very few have reached the next phase where they are translating new content that would never be translated were it not for MT). Thus getting the best possible systems that produce the highest possible MT output quality really matters to achieve this core objective of achieving measurable translation productivity. To put this in simpler terms, the difference between instant Moses systems and expert MT systems could be as much as 4,000 words/day versus 10,000+ words a day. Expert MT engine developers like Asia Online have multi-dimensional approaches, NLP skills, and many specialized tools in place to extract the maximum amount of information out of the data they have available. The use of these tools is guided by two team members with deep expertise on the inner workings of Moses and SMT in general. The learning process driving the development of these comprehensive tools takes years, and they enable Asia Online custom systems to produce superior translation output to the free online MT engines consistently. One team member has literally written the book on SMT and created Moses and thus one could presume is quite likely to have the expertise to develop better MT systems than most. 

I have already heard from several translators who, when asked to post-edit “instant Moses” output they know is inferior, simply run the same source material through Google/Bing and edit that instead, to improve their own personal productivity and save themselves some anguish. So if your Moses engine is not as good as these public engines you will find that translators will simply bypass it whenever they can. And they may not actually tell you that they are doing this. Post-editors will generally choose the best MT output they can get access to, so beware if your engine does not compare well. And buyers, insist on seeing how these instant MT engines compare to the public free engines on a meaningful and comprehensive test set, not just 100 or so sentences.

However, I am also aware that some Moses initiatives have produced great results, e.g. Autodesk (for you doubters on the value of PEMT, here is clear evidence from a customer viewpoint), and here I would caution against extrapolating from these results and expecting to achieve the same with any and every Moses attempt. The team that produced these systems was more technically capable and knowledgeable than most, and I am also aware that their training data was better suited for SMT than most of the TM you will find in the TDA or on the web. And even here, I would argue that MT experts would probably produce better results with the same data, especially with the Asian languages where other support tools and processes become much more imperative.

As others have stated before me, the global population of people who actually understand how these data-driven systems work is really quite tiny, minuscule in fact. If you are building Moses systems you should be comparing yourself to the public free engines, as you may find that all your effort was much ado about nothing. One would hope that you will produce systems that compare favorably to these “free” options. And if your competition includes the lads and lassies at Microsoft and Google, one would hope that you know more about how to do this than pushing the instant Make-my-engine button. The financial cost of ignorance is substantially higher than most are able to define in terms of lost opportunity costs, and learning costs (a.k.a. mistakes) should be factored into a real TCO (Total Cost of Ownership).

The bottom line: Success with SMT requires very specialized skills, including some NLP background, massive data handling skills, knowledge of parallel computing, linguistic data management tools, and corpus analysis and linguistic structural analysis capabilities for optimal results, not to mention a culture that nurtures collaboration with translators.

The Data, the Data, The Data
Moses is a data-driven technology and thus is highly dependent on the data that is used. Data volume is required to get good output from the systems and thus users have to gather the data from public sources and it is important to normalize and prepare the data for optimal performance. Most LSPs will not have the data or skills needed to gather the data in an optimal way. I have seen two major SMT engineering initiatives up close, one where training data was scraped off the web by spider programs, and another where data was not allowed to go into training data if it had not passed several human linguistic quality assessment checks. The differing impact of these approaches is quite striking. The dirty data approach requires substantially larger amounts of new data to see any ongoing improvement, while the clean data approach can produce compelling improvement results with much less new data. 

This ability to respond to small amounts of corrective feedback is a critical condition for ongoing improvement, and for continued improvements in productivity, e.g. raising PEMT throughput up to 15,000+ words/day in the shortest time possible. I have already stated that I was surprised how little attention is paid to data quality in instant Moses approaches presented at TAUS. And while data volume matters, for high quality domain-focused systems, the data you exclude may be more important than what you include. We are in a phase of the web's development where “Big Data” is solving many semantic and linguistic problems, but we have also seen that more data is not always the solution to better MT systems.

The upfront data analysis and data preparation, and the development of “good” tuning and test sets, are critical to the short and long-term quality and evolution of an MT engine. This is something that takes experience and experimentation to understand and be skillful at. Experts can add huge value at this formative stage. Remember that this is a technology where “Garbage In Garbage Out” (GIGO) will be particularly true. Many who understand how bad TM can get don’t need any further elaboration on this, even though some people in the SMT community remain unconvinced that clean data does matter.

Many of the people who have jumped into instant Moses do not realize that getting an initial MT engine to improve will require very large amounts of new data with a standard Moses approach. The rule of thumb I have heard used frequently is that you need 20-25% of the initial training data volume to see meaningful improvements. Thus, if you used 10 million words to build your system, you will need roughly 2 to 2.5 million new words to see the system noticeably improve. So most of these instant systems are as good as they are ever going to get when the first engine is produced. In contrast, Asia Online systems can improve dramatically with as little as a few thousand sentences (a single project) and are architected and designed from the outset to improve continuously over time with focused and targeted corrective feedback.
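To make the rule of thumb concrete, here is a minimal sketch of the arithmetic. The function name and its defaults are mine, chosen for illustration; the 20-25% range is only the rule of thumb quoted above, not a measured property of Moses.

```python
def new_data_needed(initial_words, low=0.20, high=0.25):
    """Return the (low, high) range of new words suggested by the
    20-25% rule of thumb for a standard Moses engine.

    The percentages are the rule of thumb quoted in the text,
    not measured values.
    """
    return round(initial_words * low), round(initial_words * high)


# An engine trained on 10 million words
low_words, high_words = new_data_needed(10_000_000)
print(f"{low_words:,} to {high_words:,} new words")
```

For a 10-million-word engine this works out to about 2 to 2.5 million new words, which is far more than a typical single project will ever generate.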

Given the difficulty of getting large amounts of new data, users need systems that can respond to small amounts of corrective feedback and yet show noticeable improvements. One of the major deficiencies of historical MT systems has been the lack of user control, the inability of users to make any meaningful impact on the quality of raw output produced on an ongoing basis. This ability to CONTINUALLY steer the MT engine with financially feasible (i.e. relatively small) amounts of corrective feedback is a key to getting the best long-term productivity results and ROI. I think as users get more informed on how to work with this technology, they will zero in on this ability of some expert MT systems. IMO, it is the single most important criterion when evaluating competitive MT systems:
  • What do I have to do to improve the raw system output quality once an initial engine is in place?
  • And, how much effort/data is required to get meaningful and measurable improvements?
  • Measurable = Rising average throughput of post-editors (By hundreds or thousands of more words a day, and often a multiple of what is possible with instant MT).
The issue of data cleaning is also not well understood. While it is helpful to remove tags and formatting information, it is also important to validate the linguistics and the quality of translations in addition to this to avoid GIGO results. Users should take care to keep data in the cleanest possible state (format wise and linguistically) as it can provide real long-term business production leverage on a scale greater than most TM data can. What most successful users will find is that 90%+ of the time spent in developing the highest quality engines is spent in corpus and data analysis, data preparation and organization, error detection and correction. The Moses step is a tiny component of the whole process of developing superior MT engines.
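As a rough illustration of the mechanical part of this cleaning, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the tag pattern, and the length-ratio threshold are hypothetical and are not taken from Moses or from any vendor's tooling. It covers only the format-level step; it cannot validate the linguistics or the quality of the translations, which still requires human review.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")   # inline markup such as <b> or <ph id="1"/>
SPACE_RE = re.compile(r"\s+")


def strip_markup(text):
    """Replace tags with spaces, then collapse runs of whitespace."""
    return SPACE_RE.sub(" ", TAG_RE.sub(" ", text)).strip()


def clean_pairs(pairs, max_ratio=3.0):
    """Yield cleaned (source, target) pairs, dropping obviously suspect ones.

    max_ratio is an assumed threshold for illustration, not a recommended
    value. Lengths are counted in whitespace-separated tokens, so this check
    is not meaningful for languages written without spaces.
    """
    seen = set()
    for source, target in pairs:
        source, target = strip_markup(source), strip_markup(target)
        if not source or not target:
            continue  # one side is empty after cleaning
        if source == target:
            continue  # probably left untranslated
        src_len, tgt_len = len(source.split()), len(target.split())
        if max(src_len, tgt_len) / min(src_len, tgt_len) > max_ratio:
            continue  # lengths too different: probable misalignment
        if (source, target) in seen:
            continue  # exact duplicate of an earlier pair
        seen.add((source, target))
        yield source, target


sample = [
    ("<b>Save</b> the file.", "<b>Enregistrez</b> le fichier."),
    ("Save the file.", "Enregistrez le fichier."),   # duplicate once tags are gone
    ("Click OK.", "Click OK."),                      # untranslated
    ("Open", ""),                                    # empty target
    ("Yes", "Veuillez redémarrer votre ordinateur maintenant s'il vous plaît"),
]
cleaned = list(clean_pairs(sample))
print(cleaned)
```

Of the five sample pairs only one survives. Filters like these catch the easy problems; a pair that passes every check can still be a poor translation.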

Control & Data Security
One of the reasons why it may make sense to use Moses sometimes is to keep your data and training and translation activity REALLY REALLY private (e.g. translations of interrogation transcripts where persuasion involving water might be used). The need for security and privacy especially makes sense for national security applications, but I find it hard to understand the resistance some global companies have to working in the cloud when a lot of this MT and PEMT content ends up on the web anyway. For most companies cloud computing simply makes sense and spares the user from the substantial IT burden of maintaining the hardware infrastructure needed to play at the highest professional level. (Asia Online actually makes its full training and translation environment available for on-premise installation for large enterprise customers like LexisNexis who process hundreds of millions of words a day and have suitable computing and human resource expertise to handle this).

I have heard of several LSPs who have spent $10K–$20K on servers that will probably only do Moses training once a year. If you do not have the data to drive an improvement in your Moses engine, what is the point of having these kinds of servers? There is no point in trying to re-train an engine when you don’t have enough new data to make any noticeable impact. This is a technology that just makes much more sense in the cloud, for scalability, extensibility, security and effective control. Cloud solutions are often more secure than on-premise installations at LSPs because cloud service providers can afford IT staff with deep expertise in computer security, data protection and data availability management. (BTW I have also seen what happens when hacks try and manage 200 servers = not pretty). Like many other things in today’s world, IT (Information Technology) has become so specialized and complex that it makes more sense to outsource much of it, and work in the cloud rather than try and do it on your own with a meager and barely trained staff. Compare your IT staff capabilities to any cloud service provider. Even Microsoft Office is finally making the transition to the cloud. Some analysts are even saying that the shift to the cloud will challenge the dominance of older stalwarts like HP, Microsoft, Intel, SAP, RIM, Oracle, Cisco and Dell, and that a third of these companies may not be around in 2020. Remember DEC and Wang? In a world where tablets, smartphones and mobile platforms will increasingly drive global commerce, the desktop/server perspective of traditional IT is already fading, and makes less sense with each passing day. It is ironic to see LSPs jumping on the “On-Premise Server” train just as it is about to reach the end of the line.

Cloud-based MT can also be set up to be always improving (assuming you have more than basic Moses MT), as new data is added regularly and feedback is gathered from users, as Google and Bing do. Setting up this kind of infrastructure is a significant undertaking and most Moses users will never get to that point, but this is how the best MT systems will continue to evolve. What some may find is that their domain-focused MT system may be better than the public engines in January, but by June this may no longer be true. You should realize that you are dealing with a moving target and most public engines will continue to improve. All the expert MT developers are constantly updating and enhancing their technology; most have already moved beyond the phrase-based SMT that Moses is today, and are incorporating linguistics in various forms. This can only be done because they understand what they are doing. Some of these enhancements may make it back to Moses years later, but the productivity edge will remain with experts for the foreseeable future, and I expect in 2012 we will see several case studies where expert MT systems outperform instant Moses systems by significant margins. So my advice: be wary of any kind of instant MT solution that is not free.

I started the eMpTy Pages blog in early 2010, and one of my earliest posts was on the importance of clean data for SMT.  It was blasphemy at the time to question the value of sheer data volume for SMT, but in the period since then, many have validated that working with consolidated TM from multiple sources, trusted though they may be, is a tricky affair and data quality does matter. Pooling data can work sometimes but will also fail often without cleaning and standardization.

The origin of the phrase “Let a thousand flowers bloom” is attributed to a misquote of Mao Zedong. The results for Chinese intellectuals who took Mao seriously were quite unfortunate. Fortunately we live in better times (I think?) and this phrase is not likely to have such dire consequences today. However, while a thousand MT systems may bloom (or at least be seeded), I predict that many will fade and die quickly. This is not necessarily bad, as hopefully institutional, community and industry learning will take place, and some practitioners may actually discover that they now have a much better appreciation for corpus linguistics and some of the skills that drive the creation of better MT systems. The experimental evidence from many failed experiments with Moses will also provide useful information for MT experts and further enhance the state of the art and science of MT. The learning curve for this technology is long and arduous and it may take a while for the dust to settle from the current hype, but I fully expect that by December 21st, 2012 it will be clear that expertise, experience and knowledge do matter with something as complex as Moses. Dead flowers are also used to fertilize gardens and help other plants thrive, and as long as we take the long view, we will continue to move onward and upward. I will restate my prediction that the best MT systems will still come from close collaboration between MT experts and linguists, translators and LSPs, and from insight drawn from experience and failure.


And you can send me dead flowers every morning
Send me dead flowers by the mail
Send me dead flowers to my wedding
And I won't forget to put roses on your grave

A celebration for dead flowers