Tuesday, June 15, 2010

Machine Translation Themes at Localization World Berlin 2010

I spent last week in Berlin at what is considered the premier conference in the localization industry, and I also attended two other events that were focused on translation automation. About 500 people attended, more if you count the people at the Translingual event.

It is clear that MT has become a much more central issue and is of great interest to the Localization World attendees, as there were several well-attended sessions on it. It was also clear that many have started to experiment with the technology, or at least to explore it seriously, and that people really do want to understand how and when to use it effectively.

I thought it would be useful to review and summarize the sessions. I was involved in some of them, so to some extent this is self-promotion, but I will try to make it useful and keep the dialogue going.

MT Pricing – Buyers, Sellers, Developers
This session had several short presentations from the panelists and a detailed introduction from Josef van Genabith of the CNGL on how MT cost/benefit/value could be viewed. The session was meant to provide some insight into how to price MT and post-editing, and to better understand the value that MT could deliver to the various stakeholders. While some perspective was provided on these issues, I think it failed to answer the question that many in the audience had: What rate should I charge for post-editing MT output? The problem with this question is that the answer depends on the quality of the MT system, the skill levels of the editors and probably the total volume of the project. There is no single clear formula, and for maximum benefit MT requires customization effort and quality assessment before production use. I think this subject is worth further exploration, as there is a relationship between the quality of the MT system and the cost/benefit, and thus the value, of the system. (I think the session was “uneven”, as one of the panelists put it.)

The feedback I got on the session was probably 40% positive and 60% negative, and many said the discussion was derailed when one of the panelists said that MT should be “free” since it only costs the electricity to run. This ignores the real effort and skill required to get a good MT engine in place. The same specious argument could logically be extended to say that all software should be free, since it too only requires electricity and a computer to run.

MT in the Real World — Successes, Challenges and Insight from Teams of Customers and Providers
I felt this was one of the best MT-related sessions (even though I had doubts about the format beforehand), as it provided both the customer and the vendor perspective on the same situation and covered the following issues in three different “real world” situations:
  • The rationale for MT use
  • What were the challenges during the startup phase?
  • What were the costs and metrics used to measure success?
  • Descriptions of the ramp-up experience and potential expansion and scalability outlook
  • Brief overview of the results in terms of MT quality and business benefits
  • Some description of things that went wrong and what could have been done to avoid this
  • Lessons learned and recommendations for best practices
In an attempt at humor, I made a comment in this session about my customers’ unsuccessful evaluation of a ProMT Russian engine for patent translations. (It was not my intent to impugn this product, as many will tell you that it is a fine Russian MT engine.) I also wanted to point out how critical a major terminology effort is in a patent MT application, as I was aware of the 400,000-term effort made for the Japanese engine that we had built. Anyway, I thought I should clear this up, as many seemed to interpret my comment as a deliberate slam on the competition. It was not.

Optimizing Content for Machine Translation
This was another MT session that I think had very high quality content, though much more technically complex and perhaps more demanding of the audience. The slides for this session are worth a look as they are dense and packed with information.
Karen Combe of PTC outlined several kinds of typical TM practices that can be problematic for SMT. She gave specific examples of the kinds of issues I highlighted earlier. This helped to show that TM in its natural state may not be quite ready to pour into an SMT training engine.
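
To make the idea of "not quite ready" a little more concrete, here is a minimal sketch of the kind of cleanup pass one might run over TM segment pairs before SMT training. The specific steps (stripping inline tags, collapsing whitespace, dropping empty or badly mismatched pairs, removing duplicates) are my own assumptions about typical cleanup, not a summary of the examples shown in the session, and the names clean_segment, clean_tm and max_ratio are hypothetical.

```python
import re

# Assumption: inline markup looks like <b>...</b> or <ph id="1"/>.
TAG_RE = re.compile(r"<[^>]+>")
WS_RE = re.compile(r"\s+")


def clean_segment(text):
    """Remove inline tags, then collapse runs of whitespace."""
    return WS_RE.sub(" ", TAG_RE.sub("", text)).strip()


def clean_tm(pairs, max_ratio=3.0):
    """Return cleaned (source, target) pairs.

    Drops pairs that are empty after cleaning, pairs whose character
    lengths differ by more than max_ratio (a crude misalignment check),
    and case-insensitive duplicates.
    """
    seen = set()
    cleaned = []
    for source, target in pairs:
        src, tgt = clean_segment(source), clean_segment(target)
        if not src or not tgt:
            continue  # nothing left to train on
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_ratio:
            continue  # probably a misaligned pair
        key = (src.lower(), tgt.lower())
        if key in seen:
            continue  # duplicate of an earlier pair
        seen.add(key)
        cleaned.append((src, tgt))
    return cleaned


pairs = [
    ("Click <b>OK</b>  to continue.", "Klicken Sie auf <b>OK</b>, um fortzufahren."),
    ("Click OK to continue.", "Klicken Sie auf OK, um fortzufahren."),
    ("", "leer"),
    ("Yes", "Dies ist ein sehr langer Satz ohne Bezug"),
]
result = clean_tm(pairs)
```

With this sample input only the first pair survives: the second is a duplicate once the tags are gone, the third has an empty source, and the fourth fails the length-ratio check. A real pipeline would need thresholds and tag handling tuned to the language pair and the TM format.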

I found the contrast between Kerstin Bier and Olga Beregovaya in how they get data ready for the MT engine very instructive. It showed how fundamentally different the SMT approach is from the RbMT approach, with very specific and concrete examples of the data preparation strategy. It was interesting to see that some of the tags and TM metadata were very useful to an RbMT engine even though they could be a problem for SMT.

Melissa Biggs of Sun (Oracle) and Jessica Roland of EMC provided some examples of unsuccessful and successful uses of “controlled language” technology. A great session, filled with useful information.

TAUS Data Association Update
While this session was not strictly focused on MT, MT was mentioned a lot. A large part of the session focused on why the panelists had joined the TDA, but there was also some useful information:
  • TDA currently has 2.5B words of TM and hopes to double this in the next year
  • TDA currently has 70 members and is trying to make it easier for smaller members to join
  • They are trying to make more open source tools available to help members process the data more easily
  • They have annual operating costs of about $500,000
The three main uses of TDA data so far have been:
  1. Monitor terminology use and practices across an industry
  2. TM Leveraging
  3. Provide a larger training corpus for SMT (but use with care, after cleaning and normalization)

Translingual Europe 2010: International Conference on Advanced Translation Technology was held the day before and covered many MT initiatives across the world, especially in the EU. Dave Grunwald has provided a good summary of this event in his blog.

The EU and the DFKI are working hard to further the state of MT and related language technology, and I think this conference could become a source of interesting initiatives. The moderator of my group of presenters threw us all off balance by suddenly announcing that the presentation time would be a third shorter than we had assumed up to that point, so some of the presentations looked really hurried.

The presentation by Microsoft on their effort to develop the Haitian Creole system and the user presentations by the EC, EPO and Symantec were very interesting. It was also good to hear that the EC will be funding more research to advance the state of the technology and that they are targeting small companies in particular, as Kimmo Rossi stated in his opening presentation.

I hope they make Translingual an annual affair, loosely linked to Localization World, as it has great promise of becoming a major MT and language technology event, especially if it becomes a two-day event with one day focusing on policy issues and the second day focusing on practice. I also had the good fortune to celebrate Hans Uszkoreit's birthday earlier in the weekend at the Wasserwerk facility, which was also the scene of many interesting discussions on broader language technology issues. The open forum approach brought forward many interesting ideas at the "Berlin Theme Tank" proceedings, and Hans was clearly a driving force behind the discussions.

