Monday, February 1, 2010

The Impact of “Clean Data” on SMT

This is a summary of a study on translation memory consolidation that I was involved with, and a continued examination of the issue of “clean data”, which I believe is essential to long-term success with data-driven MT initiatives. The Asia Online rating system rates data on how well suited it is for SMT training; it is not a judgment on TM quality for TM purposes. The study, conducted by Asia Online with the kind facilitation of TAUS, took a relatively small set of TM data provided by three members and attempted to answer three questions:
  1. Is there a benefit to sharing TM data for the purpose of building SMT engines?
  2. What are some practical guidelines to help enhance serious data sharing attempts?
  3. What do best practices look like?
As many people continue to believe that sheer data volume alone is enough to solve many problems with SMT, I thought it would be useful to provide an overview of the Asia Online data consolidation study in this blog. Apart from the simple common sense of the "Garbage In, Garbage Out" principle, which is important in any data processing application and perhaps even more so in SMT, does it not make sense that if SMT engines learn from a parallel corpus, it would be wise and efficient to clean this corpus first?

The Google paper that has been referenced by TAUS/TDA as foundational justification for “more data is always better” has been criticized by many. Jaap van Der Meer suggests that Norvig said “forget trying to come up with elegant theories and embrace the unreasonable effectiveness of data.” A more careful examination of the paper reveals that many of its examples involve graphical images, where erroneous pixels are much more tolerable. In fact, Norvig himself has stated in his own blog that his comments were misinterpreted:

To set the record straight: That's a silly statement, I didn't say it, and I disagree with it. … Peter Norvig

So I maintain that both data quality and algorithms matter, and unless we are talking about differences of several orders of magnitude in data volume, clean data will produce better SMT engines that respond more easily to corrective feedback. I have seen this proven over and over again. We are all aware that TM tends to get messy over time and that it is wise to scrub and clean it periodically for best results in any translation automation endeavor.

Basically, the study found that some TM is better suited for SMT and that it is important to understand this BEFORE you consolidate data from multiple sources. The graphic below shows the details of the data in question. We also found that Datasets A and C were more consistent in their use of terminology.

The key findings from the study are as follows:
  • Data quality matters, and not all translation memory data is equally good for SMT engine development.
  • Data quality assessment should be an important first step in any data consolidation exercise for SMT engine development purposes.
  • MORE DATA IS NOT ALWAYS BETTER, and smaller amounts of high quality data can produce better results than large amounts of dirty data.
  • Terminological consistency is an important driver for better quality and success with SMT. Efforts made to standardize terminology across multiple TM datasets will likely yield significant improvements in SMT engine quality.
  • Introducing “known dirty data” into the system decreases the quality of the system and increases the unpredictability of the results.
  • Systems built with clean data and consistent terminology tend to perform better and improve faster.
  • Data cleaning and normalization, together with terminology analysis and standardization, are a critical first step to having success with any project that combines TM for developing SMT engines.
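To make the terminology finding concrete, here is a minimal sketch of how inconsistent terminology might be surfaced before TMs are combined. It is not the method used in the study, which is not described here. It assumes that (source term, target term) pairs have already been extracted from the TMs by some other means, and the function name and sample terms are purely illustrative.

```python
from collections import defaultdict


def find_inconsistent_terms(term_pairs):
    """Return source terms that have more than one target rendering.

    term_pairs: iterable of (source_term, target_term) tuples, assumed to
    have been extracted from the TMs beforehand (hypothetical input format).
    Comparison is case-insensitive.
    """
    renderings = defaultdict(set)
    for source_term, target_term in term_pairs:
        renderings[source_term.lower()].add(target_term.lower())
    # Keep only the terms translated in more than one way.
    return {
        source: sorted(targets)
        for source, targets in renderings.items()
        if len(targets) > 1
    }


# Illustrative pairs, as if pooled from several TM datasets.
pairs = [
    ("file", "fichier"),
    ("File", "dossier"),
    ("folder", "dossier"),
    ("save", "enregistrer"),
    ("save", "Enregistrer"),
]
report = find_inconsistent_terms(pairs)
```

A report like this only flags candidates for review. Deciding which rendering should become the standard is still a linguistic decision.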
In the noise about soft censorship we have gotten distracted from two additional questions that are also worth our attention.

What is the best way to store data so that it is useful for both TM and SMT leverage purposes?
What are the best practices for consolidating TM? What tools are necessary to maximize benefits?

Common Sources of Data Problems in TM
  • Encoding problems and inconsistencies in the data.
  • Large volumes of formatting tags and other metadata that have no linguistic value but which can impact system learning.
  • Punctuation and diacritics are used inconsistently across the TM.
  • English words appear in the French translation. This may be valid in translation memory, but will result in English being embedded in the French training data and make it possible for the SMT engine to think that English is French!
  • Excessive formatting tags and HTML (often representing images) embedded in segments.
  • The French translations frequently contain bracketed terms that are not present in the English source.
  • The capitalization frequently does not match.
  • Terminology is sometimes inconsistent, with different terms being used throughout the data for the same concept or meaning.
  • A large number of variables in many different forms are embedded in the text, and their forms and formats are inconsistent.
  • Multiple sentences in one segment. While this is valid, the job of word alignment becomes more complex. Higher quality SMT word alignment can be achieved when these are broken out into their individual segments.
  • In the French text, abbreviations frequently appear where there should be a complete word.
  • Words are missing on either side.
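Several of the problems listed above can be caught with simple automated checks. The sketch below is only an illustration and not the Asia Online tooling: the function names, issue labels, placeholder patterns and length-ratio thresholds are all assumptions chosen for the example. Detecting English left inside French segments would need a language identification step, which is not shown here.

```python
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")                 # HTML/XML-style inline tags
PLACEHOLDER_RE = re.compile(r"\{\d+\}|%[sd]")   # a few common variable forms (assumed)
SENTENCE_END_RE = re.compile(r"[.!?](?:\s+|$)")  # sentence-final punctuation


def clean_text(text):
    """Normalize encoding and strip markup from one side of a TM segment."""
    text = unicodedata.normalize("NFC", text)  # unify composed/decomposed diacritics
    text = TAG_RE.sub(" ", text)               # drop formatting tags
    text = text.replace("\u00a0", " ")         # non-breaking space to plain space
    return re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace


def flag_segment(source, target):
    """Return illustrative issue labels for one source/target segment pair."""
    source, target = clean_text(source), clean_text(target)
    if not source or not target:
        return ["empty_side"]
    issues = []
    if source == target:
        issues.append("untranslated")
    # Thresholds are arbitrary; a very skewed ratio hints at missing words.
    ratio = len(source.split()) / len(target.split())
    if ratio < 0.5 or ratio > 2.0:
        issues.append("length_mismatch")
    if source[0].isupper() != target[0].isupper():
        issues.append("capitalization_mismatch")
    if ("(" in source) != ("(" in target):
        issues.append("bracket_mismatch")
    if len(SENTENCE_END_RE.findall(source)) > 1:
        issues.append("multiple_sentences")
    if sorted(PLACEHOLDER_RE.findall(source)) != sorted(PLACEHOLDER_RE.findall(target)):
        issues.append("placeholder_mismatch")
    return issues


print(flag_segment("Delete {0} files", "supprimer les fichiers (option)"))
```

Segments that come back with an empty list pass these particular checks, which says nothing about whether the translation itself is good. Flagged segments can be repaired, split or dropped before training, depending on the issue.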

These are early days and we are all still learning, but the tools are getting better and the dialogue is getting more useful and pragmatic as we move away from the naive view that any random pile of data is better than one that has been carefully considered and prepared.

Without understanding the relative cleanliness and quality of the data, data sharing is not necessarily beneficial.

While TM data may often be problematic for SMT in its raw state, some of what is considered “dirt” to SMT can be cleaned through automated tools used by Asia Online and others. However, these tools cannot correct situations when the translations themselves are of a lower quality. This issue has also been highlighted by Don DePalma in a recent article referring to this study where he said: “Our recent MT research contended that many organizations will find that their TMs are not up to snuff — these manually created memories often carve into stone the aggregated work of lots of people of random capabilities, passed back and forth among LSPs over the years with little oversight or management.”

Let's hope that the TDA too will let this taboo subject (data quality) out into the open. I am sure the community will come together and help develop strategies to cope with and overcome the initial problems that we identify when we try to share TM data resources.

1 comment:

  1. We have also found surprising improvements in SMT quality (measured in BLEU, at least) using fairly straightforward cleanup strategies, even such simple things as normalizing British and American spelling in the training data.

    Note that we did not similarly normalize the test data, so the increased BLEU scores we observed were due to more subtle improvements, most likely from better alignment during training.
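
    As a rough illustration of the kind of cleanup described in this comment, the sketch below normalizes British spellings to American ones in training sentences. It is not the commenter's actual procedure: the word list is a tiny illustrative sample and the function name is hypothetical.

```python
import re

# Illustrative mapping only; a real list would be far larger.
BRITISH_TO_AMERICAN = {
    "colour": "color",
    "centre": "center",
    "organise": "organize",
    "licence": "license",
}

WORD_RE = re.compile(r"[A-Za-z]+")


def normalize_spelling(sentence, mapping=BRITISH_TO_AMERICAN):
    """Replace whole words found in the mapping, preserving basic casing."""
    def replace(match):
        word = match.group(0)
        american = mapping.get(word.lower())
        if american is None:
            return word                   # not in the list: leave untouched
        if word.isupper():
            return american.upper()       # COLOUR -> COLOR
        if word[0].isupper():
            return american.capitalize()  # Colour -> Color
        return american
    return WORD_RE.sub(replace, sentence)


print(normalize_spelling("Colour the centre panel."))
```

    Because only whole words are matched, derived forms such as "colourful" are left alone unless they are added to the mapping.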