Friday, July 13, 2012

The Relationship Between Productivity and Effective Use of Translation Technology

As machine translation continues to gain momentum, we are seeing many more instances of LSPs and some enterprise users exploring the potential use of the technology in core production work. Unfortunately, MT is still quite complex, and there are few universally accurate rules of thumb that replace the need for at least a minimal amount of expertise and understanding. Expertise and knowledge are key requirements for those who wish to use MT successfully in a translation production context, yet there are still many misconceptions about the effective use of the technology.
Some of the most common misconceptions include:
  1. All MT systems are about the same. Not really: some MT systems that have undergone expert-managed customization and domain-focused training can produce dramatically better results than generic systems. This also means that you are not likely to get a very good understanding of the capabilities of an MT technology without doing a real pilot project that involves customization. Yet I often see people trying to make judgments about which MT system to use based on running a few paragraphs through a generic engine.
  2. All MT applications are the same. Some MT applications that are focused on localization (documentation, core website content) need much higher quality to be useful than other applications, such as making customer support forum content multilingual, where good gisting quality is adequate. Translator productivity applications are the most difficult to do successfully and are the ones where naïve users (e.g. your average LSP with Moses) are likely to fail.
  3. Post-editors should be paid the same lower rate for all MT post-editing work. CSA stated in 2010 that this magic rate is 61% of the full rate. However, setting a fixed rate without understanding the reality of the MT output quality can often be unfair to editors and cause resentment that can undermine any attempt to build production leverage. Compensation needs to be linked to productivity and to the effort expended to “fix MT”, and the most successful users are respectful and careful to do this well to ensure a stable and motivated workforce.
  4. MT is responsible for falling translation rates. This is a digression, but I want to highlight Luigi Muzii's analysis of why this is NOT true, which he lays out in this article and also in a post called “Changes Ahead” that was characterized as follows by Rob Vandenberg.

I will address the first three issues in this post and provide some more context to clarify these misconceptions.
MT systems can vary and produce very different types and quality of output depending on all of the following factors:
  • Methodology used (RbMT, SMT, Hybrid which can also mean many different things)
  • The skill and knowledge of the practitioners working with the technology and building the systems. MT is still quite complex and needs skills that take time to develop and refine, to get output quality that surpasses the quality produced by public MT engines from Google and Bing.
  • The quality and the volume of the “training data”, which are becoming an ever more important determinant of the quality of the system as SMT approaches lead the way.
  • The language pair: It is much easier to get “good” systems with FIGS than with CJK relative to English. Languages like Hungarian, Finnish and Turkish are just tough in general (relative to English).
  • The ability of the system to respond to small amounts of strategic corrective feedback. This is critical to build real business leverage. While some systems may improve slightly when many millions of new words are added to train them, very few can respond favorably to small volumes of additional data. MT system development is evolutionary and one should enter into development with this mindset.

MT can be useful in many different scenarios, but it should be understood that the usable quality expected for different uses is very different. We live in a world where MT translates billions of words each day for internet users who are trying to understand content of interest on the web or communicate with others across the world. There are also many corporate and business applications where the sheer volume and volatility of the information could not justify anything but MT, e.g. technical knowledge base content, customer forum discussions and hotel reviews, where “good enough” is good enough. Much of this information has little or no value over time, e.g. configuration guidance on DOS 5.0/Windows XP or a 3 year old hotel review, but it could have great value and enhance global customer satisfaction for a brief window in time, even in an imperfect linguistic-quality form. MT use for traditional LSP applications is the most demanding of all MT applications and requires the deepest knowledge, expertise and skill. MT in this context can only add value if the output is of sufficient quality that it actually enhances the productivity of translators and makes the business translation process more cost efficient. It is not a replacement for human translation, and thus needs to be at a quality level where humans acknowledge its utility and actually want to use it.

Much of the early dissatisfaction with MT in the professional translation world is a result of asking translators to edit poor quality output for much lower rates, set in a relatively arbitrary fashion that did not accurately reflect the level of effort involved. The task of post-editing MT to publication quality levels needs an understanding of the average level of effort needed, and very few in the professional translation world have figured this out. Omnilingua is an example of how to do it right, with a very clear and trusted quality measurement profile of the MT output, which then also helps to define productivity and fair compensation for editors. This task of accurately measuring MT output quality and then determining the correct compensation structure is key to successful MT deployment. It is quite possible in high-trust scenarios but much harder to implement when trust is less prevalent.
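One way to make the link between compensation and productivity concrete is to ask what per-word rate would leave an editor's daily earnings unchanged as throughput rises. The following is only a minimal sketch: the function names, the baseline of 2,500 words per day for unaided translation, and the full rate of 0.10 per word are all hypothetical values chosen for illustration, not figures from this post. Only the 61% rate is taken from the text above.

```python
def earnings_neutral_rate(full_rate, baseline_wpd, post_edit_wpd):
    """Per-word post-editing rate that keeps daily earnings equal to unaided translation."""
    if full_rate <= 0 or baseline_wpd <= 0 or post_edit_wpd <= 0:
        raise ValueError("rate and throughput values must be positive")
    return full_rate * baseline_wpd / post_edit_wpd


def break_even_throughput(baseline_wpd, rate_fraction):
    """Words per day an editor must reach for a discounted rate to be earnings-neutral."""
    if baseline_wpd <= 0 or not 0 < rate_fraction <= 1:
        raise ValueError("baseline must be positive and rate_fraction in (0, 1]")
    return baseline_wpd / rate_fraction


# Hypothetical inputs, NOT taken from the post.
FULL_RATE = 0.10      # currency units per word for unaided translation
BASELINE_WPD = 2_500  # words per day without MT

if __name__ == "__main__":
    for wpd in (3_000, 5_000, 9_000):
        rate = earnings_neutral_rate(FULL_RATE, BASELINE_WPD, wpd)
        print(f"{wpd:>6} words/day -> {rate:.3f} per word "
              f"({rate / FULL_RATE:.0%} of full rate)")
    needed = break_even_throughput(BASELINE_WPD, 0.61)
    print(f"A 61% rate breaks even at about {needed:,.0f} words/day")
```

Under these assumed inputs, a flat 61% rate is earnings-neutral only if the MT output lets the editor reach roughly 4,100 words per day. Below that throughput the editor loses income, which is the kind of imbalance that measuring output quality first is meant to avoid.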

In the following largely hypothetical example (which is based on a generalization of actual experiences) I have summarized the possibilities to show how MT system output quality and productivity are related. I have also taken the additional step of showing how lower word rates can often make sense with “good” MT systems, and I hope to demonstrate that it is in the interests of both LSPs and translators/post-editors to figure out the key quality/productivity metrics accurately. Once the productivity is clearly established, lower rates make sense because the throughput is trusted. Both parties need to be willing to make adjustments when the numbers don’t properly balance out.
In this hypothetical comparison we will assume that there are 3 MT systems, all focused on the same production task. These systems are of differing quality and their related productivity impact is characterized below. The objective in every case is to produce final output that cannot be distinguished from a pure human TEP production effort:
  1. Good Instant MT/Moses System – A large majority of these systems do not produce output better than the free generic engines on the internet. I am assuming that perhaps 5% to 10% of these systems can reach a state where they can outperform Google. TAUS has highlighted several case studies where this is documented and where it is clear this is difficult. Typically, productivity for a very successful effort will be around 3,000 words per day or slightly higher.
  2. Average Expert System – A product of a reasonable amount of data and expertise and experience that enables productivity over 5,000 words/day to as much as 7,000 words per day for editors who work on correcting the MT output.
  3. Excellent Expert System – This is possible with data-rich systems developed by experts that have gone through several iterations of improvement and corrective feedback. I have seen systems that enable 9,000 words/day to as much as 12,000 words/day throughput. Some exceptional systems are even higher!
In the following table these 3 systems are profiled to compare the overall time and cost implications for a 500,000 word project. This clearly shows (fabricated though it is) that higher quality MT systems will provide the best overall production benefits. This also implies that it is worth investing in developing this better quality upfront, rather than opting for a low initial cost option that provides less benefit.
The graphs below show these comparisons for different sized projects and also illustrate that both the cost and time savings on larger projects can become quite significant.
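The time side of such a comparison can be sketched directly from the throughput ranges quoted above for the three systems. This is a rough illustration, not a reproduction of the original figures: the `profile` and `editor_days` helpers are hypothetical, and the cost of 200 per editor-day is an assumed placeholder that does not come from this post.

```python
PROJECT_WORDS = 500_000

# Post-editing throughput in words per editor-day as (low, high), from the text.
# The first system is described only as "3,000 words per day and slightly
# higher", so both bounds are set to 3,000 rather than guessing an upper figure.
THROUGHPUT = {
    "Good Instant MT/Moses System": (3_000, 3_000),
    "Average Expert System": (5_000, 7_000),
    "Excellent Expert System": (9_000, 12_000),
}

# Hypothetical cost of one editor-day; NOT a figure from the post.
DAY_COST = 200.0


def editor_days(words, words_per_day):
    """Editor-days needed to post-edit `words` at a given daily throughput."""
    if words < 0 or words_per_day <= 0:
        raise ValueError("words must be >= 0 and words_per_day must be > 0")
    return words / words_per_day


def profile(words, throughput=THROUGHPUT, day_cost=DAY_COST):
    """Return {system: (min_days, max_days, min_cost, max_cost)}."""
    result = {}
    for name, (low, high) in throughput.items():
        max_days = editor_days(words, low)   # slowest throughput, most days
        min_days = editor_days(words, high)  # fastest throughput, fewest days
        result[name] = (min_days, max_days,
                        min_days * day_cost, max_days * day_cost)
    return result


if __name__ == "__main__":
    for name, (d_min, d_max, c_min, c_max) in profile(PROJECT_WORDS).items():
        print(f"{name}: {d_min:.0f}-{d_max:.0f} editor-days, "
              f"cost {c_min:,.0f}-{c_max:,.0f}")
```

Calling `profile` with other word counts gives the same comparison for smaller or larger projects. Because editor-days scale linearly with volume, the absolute gap between the systems widens as the project grows.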
Some actual and very specific examples are described in this post by Sajan (EN>ZH) and in this case study by Hunnect for EN>HU.