Pages

Friday, October 28, 2011

The Building Momentum for Post-Edited Machine Translation (PEMT)

This is an (opinionated) summary of interesting findings from a flurry of conferences that I attended earlier this month. The conferences were the TAUS User Conference, Localization World and tekom. Even though it is tiring to have so many so close together, it is interesting to see what sticks out a few weeks later. For me TAUS and tekom were clearly worthwhile, and Localization World was not, and I believe that #LWSV is an event that is losing it’s mojo in spite of big attendance numbers.

Some of the big themes that stand out (mostly from TAUS) were:
  • Detailed case studies that provide clear and specific evidence that customized MT enhances and improves the productivity of traditional (TEP) translation processes
  • The Instant on-demand Moses MT engine parade
  • Initial attempts at defining post-editing effort and difficulty from MemoQ and Memosource
  • A future session on the multilingual web from speakers who actually are involved with big perspective, global web-wide changes and requirements
  • More MT hyperbole
  • The bigger context and content production chain for translation that is visible at tekom
  • Post-editor feedback at tekom
  • The lack of innovation in most of the content presented at Localization World
 The archived twitter stream from TAUS (#tausuc11) is available here, the tekom tag is #tcworld11 and Localization World is #lwsv. Many of the TAUS presentations will be available as web video shortly and I recommend that you check some of them out.


PEMT Case Studies
In the last month I have seen several case studies that document the time and cost savings and overall consistency benefits of good customized MT systems. At TAUS, Caterpillar indicated that their demand for translation was rising rapidly and thus they instituted their famed controlled language (Caterpillar English) based translation production process using MT. The MT process was initially more expensive since 100% of the segments needed to be reviewed but they are now seeing better results on their quality measurements from MT than from human translators on Brazilian Portuguese and Russian according to Don Johnson, Caterpillar. They expect to expand to new kinds of content as these engines mature.

Catherine Dove of PayPal described how the human translation process got bogged down on review and rework cycles (to ensure PayPal brand’s tone and style was intact) and was unable to meet production requirements of 15K words per week with a 3 day turnaround in 25 languages. They found that “machine-aided human translation” delivers better, more consistent terminology in the first pass and thus they were able to focus more on style and fluency. Deadlines are easier to meet and she also commented that MT can handle tags better than humans. They also focus on source cleanup and improvement to leverage the MT efforts and interestingly the MT is also useful in catching errors in the authoring phase. PayPal uses an “edit distance” measurement to determine the amount of rework and have found that the MT process reduces this effort by 20% on 8 of 10 languages they are using MT on. An additional benefit is that there is a new quality improvement process in place that should continue to yield increasing benefits.

A PEMT user case study was also presented by Asia Online and Sajan at the Localization Research Conference in September 2011. The global enterprise customer is a major information technology software developer, hardware/IT OEM manufacturer, and comprehensive IT services provider for mission critical enterprise systems in 100+ countries. This company had a legacy MT system developed internally that had been used in the past by the key customer stakeholders. Sajan and Asia Online customized English to Chinese and English to Spanish engines for this customer. These MT systems have been delivering translated output that even beats the first pass output from their human translators due to the highly technical terminology, especially in Chinese.  A summary of the use case is provided below:
  • 27 million words have been processed by this client using MT
  • Large amounts of quality TM (many millions of words) and glossaries were provided and these engines are expected to continue to improve with additional feedback.
  • The customized engine was focused on the broad IT domain and was intended to translate new documentation and support content from English into Chinese and Spanish.
  • A key objective of the project was to eliminate the need for full translation and limit it to MT + Post-editing as a new modified production process.
  • The custom engine output delivered higher quality than their first pass human translators especially in Chinese
  • All output was proof read to deliver publication quality.
  • Using Asia Online Language Studio the customer saved 60% in costs and 77% in time over previous production processes based on their own structured time and cost measurements.
  • The client also produces an MT product, but the business units prefer to use Asia Online because of considerable quality and cost differences.
  • Client extremely impressed with result especially when compared to the output of their own engine.
  • The new pricing model enabled by MT creates a situation where the higher the volume the more beneficial the outcome.
The video presentation below by Sajan begins at 27 minutes (in case you want to skip over the Asia Online part) and even if you only watch the Sajan presentation for 5 minutes you will get a clear sense for the benefit delivered by the PEMT process.

A session on the multilingual web at TAUS by the trio Bruno Fernandez Ruiz, Yahoo! Fellow and Vice President, Bill Dolan, Head of NLP Research, Microsoft, Addison Phillips, Chair, W3C Internationalization Group / Amazon also produced many interesting observations such as:
  • The impact of “Big Data” and the cloud will affect language perspectives of the future and the tools and processes of the future need to change to handle the new floating content.
  • Future applications will be built once and go to multiple platforms (PC, Web, Mobile, Tablets)
  • The number of small nuggets of information that need to be translated instantly will increase dramatically
  • HTML5 will enable publishers to be much freer in information creation and transformation processes and together with CSS3 and Javascript can handle translation of flowing data across multiple platforms
  • Semantics have not proven to be necessary to solve a lot of MT problems contrary to what many believed even 5 years ago. Big Data will help us to solve many linguistic problems that involve semantics
  • Linking text to location and topic to find cultural meaning will become more important to developing a larger translation perspective
  • Engagement around content happens in communities where there is a definable culture, language and values dimension
  • While data availability continues to explode for the major languages we are seeing a digital divide for the smaller languages and users will need to engage in translation to make more content in these languages happen
  • Even small GUI projects of 2,000 words are found to have better results with MT + crowdsourcing than with professional translation
  • More translation will be of words and small phrases where MT + crowdsourcing can outperform HT
  • User s need to be involved in improving MT and several choices can be presented to users to determine the “best” ones
  • The community that cares about solving language translation problems will grow beyond the professional translation industry.

At TAUS, there were several presentations on Moses tools and instant Moses MT engines via a one or two step push button approach. While these tools facilitate the creation of “quick and dirty data” MT engines, I am skeptical of the value of this approach for real production quality engines where the objective is provide long-term translation production productivity. As Austin Powers once said, “This is emPHASIS on the wrong syllABLE" My professional experience is that the key to long-term success (i.e. really good MT systems) is to really clean the data and this means more than removing formatting tags and removing the most obvious crap. This is harder than most think. Real cleaning also involves linguistic and bilingual human supervised alignment analysis. Also, I have seen that it takes perhaps thousands of attempts across many different language pairs to understand what is happening when you throw data into the hopper, and that this learning is critical to fundamental success with MT and developing continuous improvement architectures. I expect that some Moses initiatives will produce decent gist engines, but are unlikely to do much better than Google/Bing for the most part. I disagree with Jaap’s call to the community to produce thousands of MT systems, what we really need to see are a few hundred really good, kick-ass systems, rather than thousands that do not even measure up to the free online engines. And so far, getting a really good MT engine is not possible without real engagement from linguists and translators and more effort than pushing a button. We all need to be wary of instant solutions, with thousands of MT engines produced rapidly but all lacking in quality and "new" super semantic approaches that promise to solve the automated translation problem without human assistance. I predict that the best systems will still come from close collaboration with linguists and translators and insight borne from experience.

I was also excited to see the initiative from MemoQ to establish a measure of translator productivity or post-editing effort expended, by creating an open source measurement of post-edited output, where the assumption is that an untouched segment is a good one. MemoQ will use an open and published edit distance algorithm that could be helpful in establishing better pricing for MT post-editing and they also stressed the high value of terminology in building productivity. While there is already much criticism of the approach, I think this is a great first step to formulating a useful measurement. At tekom I also got a chance to see the scheme that MemSource has developed where post-edited output is mapped back to a fuzzy matching scheme to establish a more equitable post-editing pricing scheme than advocated by some LSPs. I look forward to seeing this idea spread and hope to cover it in more detail in the coming months.

Localization World was a disappointing affair and I was struck by how mundane, unimaginative and irrelevant much of the content of the conference was. While the focus of the keynotes was apparently innovation, I found the @sarahcuda presentation interesting, but not very compelling or convincing at all in terms of insight into innovation. The second day keynote was just plain bad, filled with clichés and obvious truisms e.g. “You have to have a localization plan” or “I like to sort ideas in a funnel”. (Somebody needs to tell Tapling that he is not the CEO anymore even though it might say so on his card). I heard several others complain about the quality of many sessions, and apparently in some sessions audience members were openly upset. The MT sessions were really weak in comparison to TAUS and rather than broadening the discussion they succeeded in mostly making them vague and insubstantial. The most interesting (and innovative) sessions that I was witness to were the Smartling use case studies and a pre-conference session on Social Translation. Both of these sessions focused on how the production model is changing and both were not particularly well attended. I am sure that there were others that were worthwhile (or maybe not), but it appears that this conference will matter less and less in terms of producing compelling and relevant content that provides value in the Web 2.0 world. This event is useful to meet with people but I truly wonder how many will attend for the quality of the content.

The tekom event is a good event to get a sense for how technical business translation fits into the overall content creation chain and also see how synergies could be created within this chain. There were many excellent sessions and it is the kind of event that helps you to broaden your perspective and understand how you fit into a bigger picture and ecosystem. The event has 3300 visitors so it is also a much larger perspective in terms of many different views points. I had a detailed conversation with some translators about post-editing. They were most concerned about the compensation structure and post-editor recruitment practices. They specifically pointed out how unfair the SDL practice of paying post-editors 60% of standard rates was, and asked that more equitable and fair systems be put into place. LSPs and buyers would be wise to heed this feedback if they want to be able to recruit quality people in future. I got a close look at the MemSource approach to making this more fair, and I think that this approach which measures the actual work done at a segment level should be acceptable to many. This approach measures the effort after the fact. However, we still need to do more on making the difficulty of the task before the translators begin more transparent. This begins with an understanding of how good the individual MT system is and how much effort is needed to get to production quality levels. This is an area that I hope to explore further in the coming weeks.

I continue to see more progress on the PEMT front and I now have good data of measurable productivity even on a language pair as tough as English to Hungarian. I expect that a partnership of language and MT experts will be more likely to produce compelling results than many DIY initiatives, but hopefully we learn from all the efforts being made.

Tuesday, September 27, 2011

The Growing Interest & Concern About the Future of Professional Translation

I have noticed of late that every conference has a session or two that focuses on the future. Probably because many sense that change is in the air. Some of you may also have noticed that the protest from some quarters has grown more strident or even outright rude, to some of the ideas presented at these future outlook sessions. The most vocal protests seem to be directed at predictions about the increasing use of machine translation, anything about “good enough” quality and the process and production process changes necessary to deal with the increasing translation volume. (There are still some who think that the data deluge is a myth). 

Some feel personally threatened by those who speak on these subjects and rush to kill or at least stab the messenger. I think they miss the point that what is happening in translation, is just part of a larger upheaval in the way global enterprises are interacting with customers. The forces causing change in translation are also creating upheaval in marketing, operations and product development departments as many analysts have remarked for some time now. The discussion in the professional translation blogosphere is polarized enough (translators vs. technology advocates) that dialogue is difficult, but hopefully we all continue to speak with increasing clarity, so that the polemic subsides. The truth is that none of us really knows the definite future, but that should not stop us from making educated (or even wild) guesses at where current trends may lead. (I highly recommend you skim The Cluetrain Manifesto to get a sense for the broader forces at play.)
Brian Solis has a new book coming out that describes the overall change quite succinctly. The End of Business As Usual (his new book) explores each layer of the complex consumer revolution that is changing the future of business, media, and culture. As consumers further connect with one another, a vast and efficient information network takes shape and begins to steer experiences, decisions, and markets. It is nothing short of disruptive.
I was watching the Twitter feed from two conferences last week (LRC XVI in Ireland and Translation Forum in Russia)  and I thought it would be interesting to summarize and highlight some of the tweets as they pertain to this changing world and perhaps provide more clarity about the trends from a variety of voices and perspectives. The LRC conference had several speakers from large IT companies who talked about their specific experience, as well as technology vendor and LSP presentations. For those who are not aware, CSA research identifies IT as one of the single largest sectors buying professional translation services. The chart below shows the sectors with the largest share of global business. This chart is also probably a good way to understand where these changes are being felt most strongly.
image

Here are some Twitter highlights from LRC on the increasing volume of translation, changing content, improving MT and changing translation production needs. I would recommend that you check out @therosettafound for the most complete Twitter trail. I have made minor edits to provide context and clarify abbreviations and have attempted to provide some basic organization to the tweet trail to make it somewhat readable.
image
@CNGL Changing content consumption and creation models require new translation and localisation models – (according to) @RobVandenberg
@TheRosettaFound We are all authors, the enterprise is going social - implications for localisation?
@ArleLommel Quality even worse than Rob Vandenberg says: we have no real idea what it is/how to measure, especially in terms of customer impact
Issue is NOT MT vs. human translation (HT). It's becoming MT AND HT. Creates new opportunities for domain experts.
Dion Wiggins. LSPs not using MT will put themselves out of business? Prediction: yes in four/five years
CNGL says 25% of translators/linguists use MT. I wonder how many use it but say they don't use it due to (negative) perception (with peers)?
Waiting for translation technology equivalent of iPhone: something that transforms what we do in ways we can't yet imagine.

Tweets from Jason Rickhard’s presentation on Collaborative Translation (i.e. Crowdsourcing) and IT go Social.
@TheRosettaFound Jason of Symantec giving the enterprise perspective, added 15-20 languages to small but popular product, built tech to support this. Not just linguistic but also legal, organizational issues to be resolved in collaborative, paid-for product.
Is collaborative translation bad & not-timely? #lrcconf Not so, a lot of translators = involved users of the content/product they translate.
Review process is different in collaborative translation. Done by voting, not by editors
The smaller the language gets, the more motivated volunteer translators are and the better collaborative translation works.
Is volunteering something for people who don't have to worry that their day-to-day basics are covered?
Does collaborative translation and collaboration mean that content owners "give up the illusion of control" over their content?
Enterprises do collaborative translation for languages they would/could not cover otherwise - true, for both profit and non-profits
Collaborative/Community will not replace existing service providers but open up more content for more languages
Language Service Providers could play an important role in community translation by building, supporting, moderating communities
It's not correct to say Community Translation = bad; Professional Translation = good
Microsoft appoints moderators with a passion for the project/language for community localization
>1,200 users translated Symantec products into 35 languages
If >1,200 were needed to translate 2 small-ish products, how can millions of translators translating 1 ZB be 'managed'?
@ArleLommel Symantec research: Community involvement in support often leads to ~25% reduction in support spend
“Super users” are what make communities scalable. Key is to identify/cultivate them early in the process
Jason Rickard: Dell is a good example of using Facebook for support. One of few companies with real metrics and insight in this area.
Jason Rickard: Symantec has really cool/systemic/well-thought ways to support community


@TheRosettaFound 21st generation localisation is about the user, about user-generated content - Ellen Langer: Give up the Illusion of Control
@ArleLommel Illusion of control? You mean we can have even less control that we have now? That's a scary thought!
@TheRosettaFound The most dramatic shifts driven by the web happened because communities took over - Imagine: 100000s of user translators translating billions of words into 100s of languages - control that!
Seems the deep and complex problems of localisation are a minute drop in the ocean of digital content management
@CNGL Discovery, analysis, transformation - Alex O'Connor tells how CNGL is addressing the grand challenges of digital content management
@TheRosettaFound Is the L10N industry due for a wave of destruction or positive transformation?
@ArleLommel Yes, Most of the mainstream technologies for translators are non-ergonomic and still in 20-year-old paradigms

Tweets from Tony Allen, Product Manager Intel Localisation Solutions presentation
@TheRosettaFound 30+ langs >200k pages >40% localised @ Intel's web presence. Intel: important to have user-driven content, interaction with the customer. Integration important, e.g. multilingual support chat. Integration, Interoperability key issues for Intel L10N. To figure out how content flows, without loss of metadata, interoperates with internal/external range of systems, is crucial.
2.5b netizens, >15b connected devices >1 zetabyte of traffic by 2015 and companies will interact with their customers using social media - type setups; new challenges for localization.
#intel What does it mean for localization infrastructures if we have >1 zetabyte of content in 2015? Current methods won't keep up
@ArleLommel #intel says that interoperability standards are required for cloud to meet future demands. L10n must evolve to meet this need too.

@ArleLommel Alison Toon (#hp) puts it this way: “localization (people) are the garbage collectors of the documentation world”
@TheRosettaFound 600GB of data in Oracle's Translation Platform - We need concise well-structured content - then we're going to be able to deliver efficient translation services - How to get it right: analyze content, identify problems and throw it back into the face of writers and developers. I18N and l10n have to get into the core curriculum at Universities says Paul Leahy (of Oracle), since we spend too much time teaching it.

Tweets from Sajan / Asia Online MT presentation
@TheRosettaFound MT cannot perform magic on bad source text - user-generated non-native-speaker content is often 'bad'
MT errors make me laugh... but human errors make me cry - an old quote from previously recycled presentations... Asia Online
Dirty Data SMT - what kind of translations would you expect? If there are no humans involved you are training on dirty data, says Asia Online. Sajan achieved 60% reduction in costs and 77% time savings for specific project - a miracle? Probably no, let’s see.
Millions of words faster, cheaper, better translated by Sajan using Asia Online - is this phenomenal success transferable? How?
XLIFF contributed to the success of Sajan/Asia Online's MT project. Asia Online's process rejected 26% of TM training data.

Tweets from Martin Orsted, Microsoft presentation
@TheRosettaFound Cloud will lead to improved cycle times and scalability: 100+ languages, millions of words
Extraordinary scale: 106 languages for the next version of Office. Need a process that scales up & down in proportion.
Microsoft: We have fewer people than ever and we are doing more and more languages than ever
Martin: "The Language Game - Make it fun to review Office"... here is a challenge :) Great idea to involve community via game
How can a "Game" approach be used for translation? Levels of experience, quality, domains, complexity; rewards?
No more 'stop & go', just let it flow @robvandenburg >>Continuous publishing requires continuous translation. New workflows

Tweets from Derek Coffey, Welocalize presentation Are we the FedEx or the WallMart of words?
@TheRosettaFound TMS of SDL = burning stacks of cash - Reality: we support your XLIFF, but not your implementation
Lack of collaboration, workflow integration, content integration = most important bottle necks. Welocalize, MemoQ, Kilgray and Ontram working on reference implementation - Derek: Make it compelling for translators to work for us
It's all about the translators and they will seek to maximise their earning potential according to Derek.

Tweets from Future Panel
@TheRosettaFound Many translators don't know what XML looks like
Rob: more collaborative, community translation - Rob: Users who consume content will have a large input into the translation BINGO
Tony: users will drive localisation decision, translation live
Derek: future is in cooking xxx? Open up a whole new market - user generated, currently untranslatable content. HUGE market
Derek: need to re-invent our industry, with focus on supply chain
The end of the big projects - how are we going to make money (question from audience)
From service/product to community - the radical change for enterprises, according to Fred
No spark, big bang, revolution - but continuous change, Derek
Big Spark (Dion): English will no longer remain the almost exclusive source language

image
The Translation Forum Russia twitter trail has a much more translator oriented focus and is also bilingual. Here are some highlights below, again with minor edits to improve readability.

@antonkunin Listened to an information-packed keynote by @Doug_Lawrence at #tfru this morning. As rates keep falling, translators' income keeps rising.
@ilbarbaro Talking about "the art of interpreting and translation" in the last quarter of 2011 is definitely outdated
Language and "quality" are important for translators, speed and competence for (final) clients. Really?
Translators are the weakest link in the translation process
Bert: here and now translation more important than perfect translation
Bert on fan subbing as an unstoppable new trend in translation
Is Bert anticipating my conclusions? Noah's ark was made and run by amateurs, RMS Titanic by professionals
Carlos Incharraulde: terminology is pivotal in translators training < Primarily as a knowledge transfer tool
To renatobeninatto at who said: Translation companies can do without process standards < I don't agree
@renatobeninatto: Start looking at income rather than price/rates
Evaluating translation is like evaluating haircuts - it's better to be good on time than perfect too late
Few translation companies do like airlines: 1st class/ Economy/ Discount rates – Esselink
Traditional translation models only deal w/ tip of iceberg. New models required for other 90%. Esselink
Good enough revolution. Good enough translation for Wikileaks, for example. Bert Esselink
In 2007 English Russian was $0.22 per word, in 2010 it dropped to $0.16 @Doug_Lawrence
There's much talk on innovation but not much action - don't expect SDL and Lionbridge to be innovative
@Doug_Lawrence all languages except German and French decreased in pricing from 2007 to 2010
@AndreyLeites @ilbarbaro problem-solving is the most important feature translator should acquire - Don't teach translators technology, teach them to solve problems - language is a technology, we need to learn how to use it like technology - 85% of translators are still women
@ilbarbaro 3 points on quality: 1. Quality is never absolute, 2. Quality is defined by the customer, 3. Quality can be measured - it is necessary to learn to define quality requirements (precisely)
@Kilgraymemoq announces that they will open Kilgray Russia before the end of the year

This is of course my biased extraction from the stream, but the original Twitter trail will be there for a few more weeks and you can check it out yourself. It is clear to me from seeing the comments above, that at the enterprise level, MT and Community initiatives will continue to gather momentum.  Translation volumes will continue to rise and production processes will have to change to adapt to this. Also, I believe, there are translators who are seeking ways to add value in this changing world and I hope that they will provide the example that leads the way in this changing flux.

And for a completely different view of "the future" check this out.


Monday, September 12, 2011

Understanding Where Machine Translation (MT) Makes Sense

One of the reasons that many find MT threatening I think, is the claim by some MT enthusiasts that it that it will do EXACTLY the same work that was previously done by multiple individuals in the “translate-edit-proof” chain without the humans, of course. To the best of my knowledge this is not possible today, even though one may produce an occasional sentence where this does indeed happen. If you want final output that is indistinguishable from competent human translation, then you are going to have to use the human “edit-proof” chain to make this happen.
image

Some in the industry have attempted to restate the potential of MT from Fully Automated High Quality Translation - FAHQT (Notice how that sounds suspiciously like f*&ked?) to Fully Automated Useful Translation – FAUT. However, in some highly technical domains it is actually possible to see that carefully customized MT systems can outperform exclusively human-based production, because it is simply not possible to find as many competent technical translators as are required to get the work done.
image

We have seen that both Google and Bing have gotten dramatically better since they switched from RbMT to statistical data-driven approaches, but the free MT solutions have yet to deliver real compelling quality outside of some Romance languages, and the quality is usually far from competent human translation. They also offer very little in terms of control, even if you are not concerned about the serious data privacy issues that their use brings to the user. It is usually worthwhile for professionals to work with specialists who can help them customize these systems to the specific purpose they are intended for. MT systems evolve and can get better with with small amounts of corrective feedback if they are designed from the outset to do this. Somebody who has built thousands of MT systems, across many language combinations, is likely to offer more value and skill than most can get from using tools like Moses building a handful of systems, or even the limited dictionary building input possible with many RbMT systems. And how much better can customized systems get than the free systems? Depending on the data volume and quality, it can range from small but very meaningful improvements to significantly better overall quality.

So where does MT make most sense? Given that there is a significant effort required to customize an MT system, it usually makes most sense when you have ongoing high volume, dynamically created source data and tolerant users or any combination thereof. It is also important to understand that the higher the quality requirements, the greater the need for human editing and proofing. The graphic below elaborates on this.
image

While MT is unlikely to replace human beings in any application where quality is really important, there are a growing number of cases that show that MT is suitable for:
  • Highly repetitive content where productivity gains with MT can exceed what is possible with just using TM alone
  • Content that would just not get translated otherwise
  • Content that simply cannot afford human translation
  • High value content that is changing every hour and every day but has a short shelf life
  • Knowledge content that facilitates and enhances the global spread of critical knowledge, especially for health and social services
  • Content that is created to enhance and accelerate communication with global customers who prefer a self-service model
  • Real-time customer conversations in social networks and customer support scenarios
  • Content that does not need to be perfect but just approximately understandable

So while there are some who would say that MT can be used anywhere and everywhere, I would suggest that a better fit for professional use is where you have ongoing volume, and dynamic but high value source content that can enhance international initiatives. To my mind, customized MT does not make sense for one-time, small localization projects where the customization efforts cannot be leveraged frequently. Free online MT might still prove of some value in these cases, to boost productivity, but as language service providers learn to better use and steer MT, I expect that we will see that they will provide translators access to “highly customized internal systems”  for project work, and the value to the translators will be very similar to the value provided by high quality translation memory.  Simply put – it can and will boost productivity even for things like user documentation and software interfaces.
image

It is worth understanding that while “good” MT systems can enhance translator productivity in traditional localization projects, they can also enable completely new kinds of translation projects that have larger volumes and much more dynamic content. While we can expect that these systems will continue to improve in quality, they are not likely to produce TEP equivalent output. I expect that these new applications will be a major source for work in the professional translation industry but will require production models that differ from traditional TEP production.
image  image

However, we are still at a point in time where there is not a lot of clarity on what post-editing, linguistic steering and MT engine refinement really involve. They do in fact involve many of the same things that are of value in standard localization processes , e.g. unknown word resolution, terminological consistency, DNT lists and style adjustments. They also increasingly include new kinds of linguistic steering designed to “train” the MT system to learn from historical error patterns and corrections. Unfortunately many of the prescriptions on post-editing principles available on LinkedIn and translator forums, are either linked to older generation MT systems (RbMT), systems that really cannot improve much beyond a very limited point or are linked to a specific MT system. In the age of data-driven systems new approaches are necessary and we have only just begun to define this. These new hybrid systems also allow translators and linguists to create linguistic and grammar rules around the pure data patterns. Hopefully we will see much better “user-friendly” post-editing environments that bring powerful error detection and correction utilities into much more linguistically logical and pro-active feedback loops. These tools can only emerge if more savvy translators are involved (and properly compensated) and we successfully liberate translators from dealing with the horrors of file and format conversions and other non-linguistic tedium that translation today requires. This shift to a mostly linguistic focus could also be much easier with better data interchange standards. The best examples of these are from Google and Microsoft rather than the translation industry, thus far.

image

Talking about standards, possibly the most worthwhile of the initiatives focusing on translation data interchange standards is meeting in Warsaw later this month. The XLIFF symposium IMO is the most concrete and most practical standards discussion going on at the moment and includes academics, LSPs, TAUS, tools vendors and large buyers sharing experiences. The future is all about data flowing in and out of translation processes and we all stand to benefit from real, robust standards that work for all constituencies.