An issue that continues to be a source of great confusion and dissatisfaction in the translation industry is how to determine an appropriate compensation rate for post-editing work. Much of the dissatisfaction with MT stems from this being done badly or unfairly.
It is important that the translation industry develop a way to set this compensation that is acceptable to all stakeholders. A scheme that is considered fair and reasonable by the post-editor, the LSP and the final enterprise customer is valuable to everyone in the industry. It is my feeling that economic systems that provide equitable benefits to all stakeholders are the ones most likely to succeed in the long term. Achieving consensus on this issue would enable the professional translation industry to reach higher levels of productivity and also increase the scope and reach of business translation, as enterprises start translating new kinds of content with higher-quality, mature, domain-focused MT engines.
While it took many years for TM compensation rates to reach general consensus within the industry, there is some consensus today on how TM fuzzy match rates relate to the compensation rate, even though some remain dissatisfied with the methodology and with the commoditization that TM and fuzzy-match-based compensation schemes impose on the art of translation. Basically, today it is understood that 100% matches are compensated at a lower rate than fuzzy matches, and that the higher the fuzzy match level, the more value the existing TM contributes to the new translation task and the lower the rate paid for that segment. Today the fuzzy match ratings produced by the different tools in the market are roughly equivalent and, for the most part, trusted. There are some (or many) who complain about how this approach commoditizes translation work, but for the most part translators accept an approach that says they should be paid less for projects that contain a lot of exact repetitions, i.e. 100% matches against the TM provided for the new project.
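To make the TM analogy concrete, here is a small illustrative sketch in Python of the kind of tiered fuzzy-match rate grid described above. The bands and multipliers are hypothetical examples, not industry standards; real grids vary widely by agreement and by tool.

```python
# Illustrative only: these fuzzy-match bands and payment multipliers are
# hypothetical examples of a tiered grid agreed between an LSP and its
# translators; real grids vary by agreement and by tool.

FUZZY_RATE_GRID = [
    (100, 0.25),  # 100% / context matches: small review fee
    (95,  0.40),  # 95-99% matches
    (85,  0.60),  # 85-94% matches
    (75,  0.80),  # 75-84% matches
    (0,   1.00),  # below 75%: treated as new words, full rate
]

def segment_rate(fuzzy_score: int, full_word_rate: float) -> float:
    """Return the per-word rate for a segment given its fuzzy match score."""
    for threshold, multiplier in FUZZY_RATE_GRID:
        if fuzzy_score >= threshold:
            return full_word_rate * multiplier
    return full_word_rate

# Example: a 92% fuzzy match at a (hypothetical) full rate of 0.12 per word
print(segment_rate(92, 0.12))  # -> 0.12 * 0.60 = 0.072
```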
However, in the world of MT it is quite different. Many are just beginning to understand that all MT systems are not equal and that MT output does not necessarily equate to what is available from Google and Bing. Some systems are better and many are worse (especially the instant Moses kind), so applying the same rates to any and all MT editing work is not an intelligent approach. Thus, a quality assessment of the MT output to be edited is a critical task that should precede any large post-editing project. The first wave of MT projects often just applied an arbitrarily lower word rate (e.g. 60%) to any and all MT post-editing work with no regard to the actual quality of the MT output. This has led many to protest both the nature of the work and the compensation. Many still fail to understand that MT should only be used if it does indeed improve productivity; this productivity gain is the key measure of value and should drive compensation calculations.
The first fact to understand and have in hand before you implement any kind of MT is your production rate BEFORE MT is implemented. It is important to know what your translation production throughput is before you use any MT. The better you understand this, the higher the probability that you will be able to measure the impact of MT on your production process. This was pointed out very clearly in this post. (I should state that many of my comments here apply only to PEMT use in localization TEP-type projects.)
Many now understand that the key to production efficiency with MT is to customize it for a specific domain and tune it for a specific purpose. This results in higher-quality MT output, but only if done with skill and expertise. We are now seeing some practitioners attempt quality assessments before undertaking post-editing projects, but there is a lot of confusion since the quality metrics being used are not well understood. In general, any metric, automated or human-assessment based, requires extended use and experience before it can produce useful input to rate-setting practices. BLEU is possibly the most misunderstood metric and has the least value in helping to establish correct rates for PEMT work, mostly because it is usually misused. There is one MT vendor making outlandish claims of getting a BLEU of .9 (90) or better. (This is clearly a bullshit alert!) This is somewhat ridiculous since it is quite typical for two competent human translations of the same source material to score no higher than .7 when compared against each other, unless they use exactly the same phrasing and vocabulary. The value of BLEU in establishing PEMT rates is limited unless the practitioner has long-term experience and a deep understanding of its many flaws.
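To see why a claim of 90+ BLEU should set off alarms, it helps to actually score one competent human translation against another. A minimal sketch, assuming the sacrebleu package (my choice purely for illustration, not something any vendor cites), might look like this:

```python
# Minimal sketch: score one human translation against another with sacrebleu.
# The sentences are invented examples; the point is that two reasonable,
# independent renderings of the same source land far below 90 BLEU.
import sacrebleu

human_translation_a = [
    "The committee approved the budget after a lengthy debate.",
]
human_translation_b = [  # used here as the "reference"
    "After a long debate, the committee passed the budget.",
]

bleu = sacrebleu.corpus_bleu(human_translation_a, [human_translation_b])
print(f"BLEU: {bleu.score:.1f}")  # sacrebleu reports on a 0-100 scale
```

A vendor-reported BLEU is only as meaningful as the reference set and test conditions behind it, which is exactly why the number alone cannot set a rate.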
Another popular approach is to use human-based quality assessment metrics like SAE J2450 or Edit Distance. They work best for companies that have used them over a long period and understand how the measurements relate to past project experience. These are better and more reliable than most automated metrics, but they are much more expensive to deploy, and their link to setting correct compensation levels is not clear. There is much room for misinterpretation, and like BLEU they can be dangerous in the hands of those with little understanding of, or extended experience with, these metrics. Whatever metric is used should be trusted and easily understood by editors if it is to support an efficient and effective production process.
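For readers who want to experiment with Edit Distance as a signal of post-editing effort, here is a minimal word-level Levenshtein sketch in Python. The segments and the interpretation of the ratio are assumptions for illustration only; this is not a calibrated metric.

```python
# A minimal sketch: word-level edit distance (Levenshtein) between raw MT
# output and its post-edited version, one rough proxy for post-editing effort.
# The example segments are invented for illustration.

def edit_distance(a: list, b: list) -> int:
    """Classic dynamic-programming Levenshtein distance over word tokens."""
    dp = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(b) + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                      # deletion
                dp[j - 1] + 1,                  # insertion
                prev + (a[i - 1] != b[j - 1]),  # substitution (or match)
            )
            prev = cur
    return dp[-1]

mt_output   = "the contract must signed by both parties before delivery".split()
post_edited = "the contract must be signed by both parties before delivery".split()

edits = edit_distance(mt_output, post_edited)
print(f"edits: {edits} ({edits / len(post_edited):.0%} of the post-edited length)")
```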
While all these measurements of quality provide valuable information, I think the only metric that should matter is productivity. It only makes sense to use MT if the translation production process is more efficient and more productive with it, meaning the same work is done faster and at a lower cost. This can be stated very simply in terms of average productivity as follows (I chose a number that divides easily by 8 so we can stay with round numbers):
Translator productivity before MT: 2,400 words/day, or 300 words/hour
Any MT system that cannot produce translated output, and related productivity, that beats this throughput is of negative value to your production efficiency, and you should stay with your old pre-MT process or find a better MT system. MT systems must beat this level of productivity to be economically useful to the production goals, and to be useful in general. (BTW, many Moses and instant MT attempts do not meet this requirement.)
Thus it is important to measure the productivity impact of the specific MT system you are dealing with, and the productivity implications of the very specific MT output your editors will be working on. To ensure that post-editors feel compensation rates have been set fairly, it is wise to involve trusted and competent translators in the rate-setting process. It would also be good to do this reliably with a sample, or to have a reconciliation process after the whole job is done to verify that the rate was fair. The simplest way to do this could be as follows:
1. Identify a “trusted” translator and have this person do 2 hours of PEMT work directly related to the material that will be post-edited. Good MT systems will produce output that is very much like high fuzzy match TM; the better the system, the higher the average fuzzy match level. This still means you will get occasional low matches, so make sure you understand what average means in the statistical sampling sense. Thus, if a system produces output that the trusted translator can edit at a rate of 750 words an hour, this is 2.5X the productivity rate without MT. Based on this data point, there is justification to reduce the rate paid to 40% of the regular rate, but since this is a small sample it would be wiser to adjust this upwards to a level that accommodates more variance in the MT output. Perhaps the optimal rate in this specific case would be 50% of the regular rate, based on this trusted measurement (the arithmetic is sketched in code after this list). It may also be advisable to offer incentives for the highest productivity to ensure that editors focus only on necessary modifications and avoid excessive correction. Other editors should be informed that the rates were set based on actual measured work throughput. And at least in the early days, it would be wise to measure the productivity as often and on as large a data set as possible. In time, editors will learn to trust these measurements and will remain motivated to work on ongoing projects, assuming the initial measurements are accurate and fair.
2. Measure the productivity carefully both before and after the use of MT.
3. Establish the PEMT rates based on this productivity rate and err on the side of overpaying editors initially to ensure that they are motivated.
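To make the arithmetic in step 1 explicit, here is a minimal sketch of the rate calculation. The 10% safety padding and the per-word rate are assumptions chosen for illustration; the principle is simply that the discount follows measured throughput, rounded in the editor's favour.

```python
# A minimal sketch of the rate-setting arithmetic described above. The padding
# factor is an assumption: it rounds the discount in the editor's favour to
# absorb variance that a small 2-hour sample cannot capture.

def pemt_rate(baseline_wph: float, pemt_wph: float,
              full_word_rate: float, safety_pad: float = 0.10) -> float:
    """Derive a post-editing word rate from measured throughput."""
    productivity_gain = pemt_wph / baseline_wph              # e.g. 750 / 300 = 2.5x
    raw_fraction = 1.0 / productivity_gain                   # e.g. 0.40 of full rate
    padded_fraction = min(1.0, raw_fraction + safety_pad)    # e.g. 0.50 of full rate
    return full_word_rate * padded_fraction

# Example from the text: 300 wph before MT, 750 wph while post-editing,
# and a hypothetical full rate of 0.12 per word
print(pemt_rate(300, 750, 0.12))  # -> 0.12 * 0.50 = 0.06 per word
```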
It is of course possible to do a larger test in which more translators are used and a longer period is measured, e.g. 3 translators working for 8 hours. However, based on experience across multiple customers at Asia Online, a 2-hour test with a trusted translator provides a very accurate estimate of the productivity and can help establish a rate that is considered fair and reasonable for the work involved. I am sure there are other opinions on this and it would be interesting to hear them, but I would opt for an approach where trusted partners and actual direct production data are the key drivers of rate-setting, over metrics that may or may not be properly implemented.
I continue to see more and more examples of good MT systems that produce output that clearly improves production efficiency, and I hope that we will see more examples of good compensation practices in which translators and editors find that they actually make more money than they would in typical TEP scenarios using just TM, as I pointed out in my last post.
Whatever you may think of this approach, the issue of post-editing MT needs to be linked to an accurate assessment of the quality of the MT output and the resultant productivity benefit. It is in everybody’s interest to do this accurately, fairly, and in a way that builds trust and helps drive translation into new areas. This quality assessment and productivity measurement process may be an area where translators can take the lead and help establish useful procedures and measurement methodologies that the industry could adopt.
I have written previously on post-editing compensation and there are several links to other research material and opinions on this issue in that posting. I would recommend it to anybody interested in the subject.
I totally agree that the increase in productivity is linked to the quality of the MT output, and yet discounts are so frequently considered without any QA assessment of the MT output....
The 2-hour assessment is good (we normally use 4-hour assessments), but it is true that 2 or 4 hours cannot represent 8 hours, nor all translators. So it is only another data point, to get an idea or to confirm a BLEU or other score. A post-edit edit distance (ED) can also provide another data point, to see what actually happened afterwards and in this way evaluate the process over time.
Great post, Kirti.
A full-day workshop on MT Post-Editing, including several case studies and research analyses of post-editing productivity measurement, is taking place at the AMTA 2012 conference in San Diego this coming Sunday (October 28). Onsite registration is welcome, so if you are attending the ATA conference this week and can stick around for another day, please join us!