- Is there a benefit to sharing TM data for the purpose of building SMT engines?
- What are some practical guidelines to help enhance serious data-sharing attempts?
- What do best practices look like?
The Google paper that has also been referenced by TAUS/TDA as foundational justification for “more data is always better” has been criticized by many. Jaap van der Meer suggests that Norvig said to “forget trying to come up with elegant theories and embrace the unreasonable effectiveness of data.” A more careful examination of the paper reveals that many of its examples relate to graphical images, where erroneous pixels are much more tolerable. In fact, Norvig himself has stated on his own blog that his comments were misinterpreted:
> To set the record straight: That's a silly statement, I didn't say it, and I disagree with it. (Peter Norvig)
So I maintain that both data quality and algorithms matter, and unless we are talking about differences of several orders of magnitude in volume, clean data will produce better SMT engines and respond more easily to corrective feedback. I have seen this proven over and over again. We are all aware that TM tends to get messy over time and that it is wise to scrub and clean it periodically for best results in any translation automation endeavor.
In short, the study found that some TM is better suited to SMT than other TM, and that it is important to understand this BEFORE you consolidate data from multiple sources. The graphic below shows the details of the data in question. We also found that Datasets A and C were more consistent in their use of terminology.
The key findings from the study are as follows:
- Data quality matters: not all translation memory data is equally good for SMT engine development.
- Data quality assessment should be an important first step in any data consolidation exercise for SMT engine development.
- MORE DATA IS NOT ALWAYS BETTER: smaller amounts of high-quality data can produce better results than large amounts of dirty data.
- Terminological consistency is an important driver of quality and success with SMT. Efforts to standardize terminology across multiple TM datasets will likely yield significant improvements in SMT engine quality.
- Introducing “known dirty data” into the system decreases the quality of the system and increases the unpredictability of the results.
- Systems built with clean data and consistent terminology tend to perform better and improve faster.
- Data cleaning, normalization, and terminology analysis and standardization are critical first steps for any project that combines TM to develop SMT engines.
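The terminology-consistency finding above can be screened for mechanically. The sketch below, a hypothetical illustration rather than anything from the study, scores a set of extracted source/target term pairs by the fraction of source terms that map to exactly one target term; the function name and input format are assumptions for illustration.

```python
from collections import defaultdict

def term_consistency(term_pairs):
    """Fraction of source terms translated by exactly one target term.

    `term_pairs` is assumed to be a list of (source_term, target_term)
    tuples extracted beforehand, e.g. by a term extractor.
    """
    targets = defaultdict(set)
    for src, tgt in term_pairs:
        # Case-fold so "Printer"/"printer" count as the same term.
        targets[src.lower()].add(tgt.lower())
    if not targets:
        return 1.0
    consistent = sum(1 for tgts in targets.values() if len(tgts) == 1)
    return consistent / len(targets)
```

A score well below 1.0 on a merged dataset would suggest that terminology standardization is needed before the TM is used for SMT training.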
What is the best way to store data so that it is useful for both TM and SMT leverage purposes?
What are the best practices for consolidating TM? What tools are necessary to maximize benefits?
Common Sources of Data Problems in TM
- Encoding problems and inconsistencies in the data.
- Large volumes of formatting tags and other metadata that have no linguistic value but can impact system learning.
- Punctuation and diacritics that are inconsistent across the TM.
- English words appearing in the French translation. This may be valid in a translation memory, but it embeds English in the French training data and can lead the SMT engine to treat English as French!
- Excessive formatting tags and HTML (often representing images) embedded in segments.
- French translations that frequently contain bracketed terms not present in the English source.
- Capitalization that frequently does not match between source and target.
- Inconsistent terminology, with different terms used throughout the data for the same concept or meaning.
- Large numbers of variables, in many different forms and formats, embedded inconsistently in the text.
- Multiple sentences in one segment. While this is valid, it makes word alignment more complex; higher-quality SMT word alignment can be achieved when these are broken out into individual segments.
- French text that frequently uses abbreviations where a complete word is expected.
- Words missing on either side.
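Several of the problems above can be flagged automatically before TM data is consolidated. The following is a minimal sketch of such a screening pass, assuming segment pairs as plain strings; the thresholds, regexes, and function name are illustrative assumptions, not a description of any particular vendor's cleaning tools.

```python
import re

TAG_RE = re.compile(r"<[^>]+>|\{\d+\}")       # HTML tags and {1}-style variables
SENT_SPLIT_RE = re.compile(r"[.!?]\s+[A-Z]")  # crude sentence-boundary hint

def flag_segment(source: str, target: str) -> list:
    """Return a list of hygiene flags for one TM segment pair."""
    flags = []
    # Excessive tags have no linguistic value but complicate word alignment.
    if len(TAG_RE.findall(source)) + len(TAG_RE.findall(target)) > 4:
        flags.append("excessive-tags")
    # Target identical to source suggests untranslated text
    # (e.g. English left embedded in the French side).
    if source.strip() and source.strip() == target.strip():
        flags.append("untranslated-target")
    # Multiple sentences in one segment make alignment harder.
    if SENT_SPLIT_RE.search(source) or SENT_SPLIT_RE.search(target):
        flags.append("multi-sentence")
    # Words missing on either side.
    if not source.strip() or not target.strip():
        flags.append("empty-side")
    # Leading-capitalization mismatch between source and target.
    if source[:1].isupper() != target[:1].isupper():
        flags.append("capitalization-mismatch")
    return flags
```

Flagged segments would then be routed to automated normalization where possible, and to human review where the translation itself is suspect.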
These are early days and we are all still learning, but the tools are getting better, and the dialogue is getting more useful and pragmatic as we move away from the naive view that any random pile of data is better than one that has been carefully considered and prepared.
Without understanding the relative cleanliness and quality of the data, data sharing is not necessarily beneficial.
While TM data may often be problematic for SMT in its raw state, some of what is considered “dirt” to SMT can be cleaned through automated tools used by Asia Online and others. However, these tools cannot correct situations when the translations themselves are of a lower quality. This issue has also been highlighted by Don DePalma in a recent article referring to this study where he said: “Our recent MT research contended that many organizations will find that their TMs are not up to snuff — these manually created memories often carve into stone the aggregated work of lots of people of random capabilities, passed back and forth among LSPs over the years with little oversight or management.”
Let's hope that the TDA, too, will let this taboo subject (data quality) out into the open. I am sure the community will come together and help develop strategies to cope with and overcome the initial problems we identify when we try to share TM data resources.