Saturday, July 2, 2011

The Google Translate API Furor: Analysis of the Impact on the Professional Translation Industry – Part II

This is a continuation of this posting.

Is Google Right For The Professional Language Services Industry?

For more than 40 years, machine translation has promised much but consistently failed to deliver. MT promises had come to be seen more as “empty promises”. In recent years, for enterprises and language service providers who want control of the translation, Google’s machine translation has been an eye-opener for many but still not a real solution to their requirements.

Before Google launched its own SMT translation technology in October 2007, Google used Systran, as Yahoo Babelfish still does today. The measure of machine translation in the public eye was Systran technology, even if the public did not know the name of the technology behind the free translation services. Today that measure has moved to Google, with a common perception that Google is the state-of-the-art in machine translation. Google has shown that it can rapidly improve the quality of the translated output in a generalized context and this has impressed many individual users as well as companies. In turn this has led many companies to consider using machine translation for real world applications.

When Asia Online was founded in 2007, we talked with many companies about machine translation and the comments were consistently negative. Since that time, Google has helped machine translation in terms of credibility and by educating the market that considerable advances in the quality of machine translation have been made. Today, many companies that would have written such technologies off as a bad joke just a few years ago are now using or considering machine translation. For those in the machine translation industry, having such barriers removed and user perceptions adjusted has been a great asset, for which considerable credit must be given to Google.

Where the Google Approach Fails the Professional Language Industry

While Google is most certainly state-of-the-art in terms of machine translation scale and in terms of free translation, there are many reasons that Google may not be right for the professional translation industry. It is too easy to measure Google by looking at mainstream European languages such as French, Italian, German and Spanish, commonly known in the industry as FIGS. While Google does a reasonable job for general translation in these languages, the same cannot be said for Tier 2 European languages or Asian languages or for most languages where the content is in specialized domains.

Google’s one-size-fits-all approach is great to get an understanding of a document, but the results are not suitable for publication in almost all cases. A professional translator is, by definition, a professional and has specific expertise in certain languages and usually also in a specialized domain of knowledge. Consider the following question:

If you owned a newspaper, would you hire a journalist that specialized in finance to write about game strategy for football or the skills of a specific baseball player when compared against another? Would you hire a sports journalist to write about politics and international monetary policy?

A professional with the necessary skills, education, training credentials and work background would be hired to write an article in the desired domain and writing style. Not just to get the story right, but to give the publication the appropriate credibility and focus the text on a specific audience. So why is it that when machine translation is evaluated it is nearly always evaluated as a comparison to “out of the box” or “free online” translation software? There are literally thousands of posts and articles published on this basis. It is the metaphorical equivalent of evaluating the performance of a Ferrari by test driving a Honda Civic.

Consider the differences in target audience, writing style, vocabulary and terminology in Forbes or Economist (Business News) when compared to Wikipedia or Harry Potter (Young Student).

*Spanish Original:*	Se necesitó una gran maniobra política muy prudente a fin de facilitar una cita de los dos enemigos históricos.
*Business News Translation:*	Significant amounts of cautious political maneuvering were required in order to facilitate a rendezvous between the two bitter historical opponents.
*Young Student Translation:*	A lot of care was taken to not upset others when organizing the meeting between the two longtime enemies.

It seems that common sense is discarded and the individuals doing such comparisons expect a machine to automatically understand who the target audience is and to study the topic and write using their preferred writing style with their preferred vocabulary.

In the Google model, all data is equal and significant volumes of data are required in order to get statistical relevance. A professional translator is able to understand the intent of the article, the context and audience it was intended for, and also apply a style guide. Using out-of-the-box or free translation software does not give you the ability to customize, control or guide any of these things. Google Translator Toolkit does offer some of this functionality, but only in a very limited manner.

The bottom line is simple. Until recently machine translation software was at best useful for gaining an understanding of what text in a foreign language was about. It was not designed or intended for use in publication of content. This is exactly what free or out-of-the-box translation systems offer today and it is known simply as “gist translation”.

Beyond Gist Translation - The Professional Language Services Industry Requires Control

Asia Online is one of several machine translation providers who develop custom translation engines for clients that are designed for a specific purpose and domain based on the client’s specific vocabulary, terminology and writing style. Some of the tasks included when customizing an engine for a client include analysis of the client’s target audience, glossary preparation, definition of non-translatable terms and preferred terminology, normalization of content, determination of preferred writing style and grammar. These tasks are not dissimilar to how the professional language community works with clients in order to deliver high quality translation using humans.

In this context, both human and machine translation projects require professional human linguistic input prior to beginning translation. The professional skills provided by LSPs play a critical role in both.

Google describes how it translates as follows:

When Google Translate generates a translation, it looks for patterns in hundreds of millions of documents to help decide on the best translation for you. By detecting patterns in documents that have already been translated by human translators, Google Translate can make intelligent guesses as to what an appropriate translation should be.

Hundreds of millions of documents are what is needed for Google to translate anything to the most common form and meaning. This technique is very appropriate if you want a general understanding or gist translation, but not appropriate if you want to publish. What is missing at even the simplest level is domain knowledge, from which greater relevance of context can be derived. Without context, many words can be ambiguous. Consider the use of the English words “bank” and “banked” when translated using Google.

English Source	Human Translation	Google Translation	Google Context
I went to the bank	Fui al banco	Fui al banco	Bank as in finance
I went to the bank to deposit money	Fui al banco para depositar dinero	Fui al banco a depositar el dinero	Bank as in finance
I went to the bank of the turn in my car	Fui en coche a la inclinación de la vuelta	Fui a la orilla de la vuelta en mi coche	Bank as in river bank
I put my car into the bank of the turn	Puse mi coche en la inclinación de la vuelta.	Pongo mi coche en el banco de la vuelta	X Bank as in finance
I swam to the bank of the river	Nadé en la orilla del río	Nadé hasta la orilla del río	Bank as in river bank
I banked my money	Deposité mi dinero	Yo depositado mi dinero	Banked as in finance
I banked my car into the turn	Incliné mi coche en la vuelta	Yo depositado mi coche en la vuelta	X Banked as in finance
I banked my plane into a steep dive	Incliné mi avión en para una zambullida.	Yo depositado en mi avión en picada	X Banked as in finance

The examples above show clearly the shortcomings of the one-size-fits-all approach. Even with the millions of documents that Google claims to have learned from, Google still favors a particular domain (finance) based on the volume of data in that domain when compared to other domains. For example: There is much more multilingual banking and finance data available than there is aeronautical or water sports data.

Studies by Asia Online, TAUS and others have shown that customized engines built using a lesser quantity of high quality data in the appropriate domain can deliver a considerably higher quality translation than customized engines built with more data that is not necessarily in the domain being translated and is of mixed quality.

Purists will argue that with enough data, the correct and better data will become more relevant statistically, while the lower quality data will become less relevant statistically. However, what Google has shown is that even with “hundreds of millions of documents to help decide on the best translation for you”, it often decides on the translation that has the most data available in a given domain, which in turn statistically overpowers domains with less data. In the above example, a clear bias can be seen towards the finance domain, while the domains of sports, automotive and aeronautics are less statistically relevant.
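The statistical effect described above can be illustrated with a toy sketch. The counts below are invented for illustration and are not taken from Google or any real corpus, and `pick_translation` is a hypothetical helper rather than part of any real MT system. It simply picks the most frequent translation of "bank", first with all domains pooled and then within a single domain.

```python
from collections import Counter

# Hypothetical counts of how often "bank" was translated each way,
# per domain. The numbers are invented purely for illustration.
COUNTS = {
    "finance": Counter({"banco": 9000, "orilla": 5}),
    "rivers": Counter({"orilla": 300, "banco": 2}),
    "motorsport": Counter({"inclinación": 40, "banco": 1}),
}


def pick_translation(counts_by_domain, domain=None):
    """Return the most frequent translation, pooled or within one domain."""
    if domain is not None:
        pooled = counts_by_domain[domain]
    else:
        pooled = Counter()
        for counts in counts_by_domain.values():
            pooled.update(counts)  # adds counts together across domains
    return pooled.most_common(1)[0][0]


generic = pick_translation(COUNTS)               # all data treated as equal
racing = pick_translation(COUNTS, "motorsport")  # domain-restricted data
print(generic, racing)
```

With these invented counts, the pooled choice is the finance sense, while the domain-restricted choice is the sense a sentence about cornering a car needs. Real statistical engines are far more sophisticated than a frequency lookup, but the sketch shows how a large domain can overpower a small one when all data is pooled.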

Google’s approach is right for Google’s purpose – that of trying to translate anything for everyone irrespective of purpose. However, this approach is not right for the professional language services industry – where greater control, style and terminology management is required to meet specific purposes.

Even in the same industry, preferred terms frequently vary. Microsoft, Oracle, IBM and Sybase all produce database software. Each may prefer different terms such as RDBMS, DB, database, relational database, relational database management system, relational DB, etc. when producing documentation. With a human translation project managed by professionals, style guides, glossaries and other guidance are provided to human translators.

Google Translator Toolkit gives you a limited amount of control with glossaries and translation memories, but it is not sufficient to meet the needs of a professional translator.

There are also limits (maximum data sizes) for both the learning material you can provide to Google and the amount of data that you are allowed to process. By proof reading in Google Translator Toolkit, you are not just using a free tool, you are part of an informal, but professional, crowdsourcing initiative that delivers high quality proof read translations directly into Google’s tools, which in turn improve the Google technology for every user – including competitors.

The goal of customizing a translation engine for the professional language services industry must be to produce an output that requires the least amount of human editing in order to publish. It must not be to get a general understanding or the gist of the meaning. By focusing on this goal, the productivity of human translators is greatly improved, more content can be translated and more companies will be attracted to translating their content for alternative language markets.

Managing and Setting Machine Translation Expectations with LSPs

When engaging in discussions with LSPs and professional translators, the most common fear is that machines will replace humans. Oddly, for some LSPs this is also the desire and hope. Consider the following true-to-life LSP anecdotes in the context of expectations for machine translation:

· LSP A: After doing a very minimal amount of customization using just a few thousand relatively low quality translation memory segments, LSP A received the first version of their engine. The first version is a diagnostic engine, from which it is possible to determine the best path to quality improvement. Despite numerous presentations, emails and discussions, LSP A quickly came back with “we want to replace human translation with machine translation and this is not good enough, we are disappointed. We cannot replace our humans with this.”

Subsequent discussions did not help LSP A understand any better that a customized translation engine is only as good as the data and the effort put into creating it, and that the volume of high quality data and effort will determine how much human work can be reduced or accelerated. Ignoring reality, LSP A still expected that machine translation would instantly replace humans. Surely, if it were this easy, every LSP would translate this way and the professional language industry would cease to exist.

· LSP B: After customizing an engine for LSP B, a freelance human translator was hired to review the quality of the machine translation output. LSP B had been in the translation industry for more than 10 years and made it clear that they knew how to measure translation quality. This indeed was true, but only in the context of how to measure the quality of a human translator. LSP B used the same metrics such as grammar, word choice, etc. that they would use for humans and rapidly came to the conclusion that the machine translation was not good enough.

LSP B was taught how to measure the human effort and time to completion for a publication quality translation, after which a human-only approach was compared to a hybrid machine and human approach. Multiple machine translation platforms were compared, including Google. When a generic out-of-the-box or free solution was used, it was often more productive to translate with humans only. However, with a customized translation engine focused on the target audience, domain, vocabulary and writing style of the client, the delivery time and the cost were considerably lower.

Managing expectations of an LSP with machine translation is not an easy task. The quantity and quality of data that is available to customize an engine is often unknown or questionable. Nearly every LSP says they have great data, yet Asia Online tools typically reject between 20%-40% of all data submitted by LSPs. When the data is examined by humans, the reasons are clear and the LSP agrees. It is easy to forget that human work varies also, as do budgets for quality assurance, project management and other tasks such as terminology and glossary definition.
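This post does not describe how Asia Online's tools decide which segments to reject. As a purely illustrative sketch, the following shows the kind of simple sanity checks a translation memory cleaning step might apply. The rules, the length-ratio threshold and the function names are all assumptions made for this example.

```python
def is_clean(source, target, max_length_ratio=3.0):
    """Apply a few illustrative sanity checks to one segment pair.

    These rules are assumptions for the sketch, not the checks any
    real product is known to use.
    """
    source, target = source.strip(), target.strip()
    if not source or not target:
        return False  # one side is empty
    if source == target:
        return False  # likely left untranslated
    longer = max(len(source), len(target))
    shorter = min(len(source), len(target))
    # Reject pairs whose lengths are wildly mismatched.
    return longer / shorter <= max_length_ratio


def rejection_rate(segments):
    """Return the fraction of (source, target) pairs that fail is_clean."""
    rejected = sum(1 for src, tgt in segments if not is_clean(src, tgt))
    return rejected / len(segments)


# Invented sample segments, for illustration only.
segments = [
    ("I went to the bank", "Fui al banco"),
    ("I banked my money", "Deposité mi dinero"),
    ("Click OK", "Click OK"),       # untranslated
    ("See the attached file", ""),  # empty target
    ("Yes", "Sí, por supuesto que sí, sin ninguna duda"),  # length mismatch
]
print(f"{rejection_rate(segments):.0%} of segments rejected")
```

Checks like these only catch mechanical problems. Judging whether a segment is a good translation in the right domain and style still requires linguistic review.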

Using Google as a Base Point Quality Measurement Metric

LSPs want assurances that if they invest in a customized translation engine, they will get a measurable and predictable level of quality. This is one area where we have found Google to be very useful.

The quality of Google can be measured against a human reference. This can be compared easily with both human and automated metrics against other translations. Because of the focused approach to delivering a customized engine based around a specified domain, vocabulary and writing style, Asia Online's customers are able to use Google as a baseline for measurement of Asia Online’s output above which an acceptance criteria level (i.e. an agreed quality level better than Google) can be set. Using this technique, a clear expectation of translation quality can be set with a customer even before their engine is customized.
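As a rough sketch of how such a baseline comparison might be automated, the example below scores two outputs against a human reference using a word-level edit distance, a simple stand-in for the automated metrics mentioned above. The metric choice, the function names and the `margin` parameter are assumptions for illustration. The reference and baseline sentences come from the “bank” table earlier in this post, while the candidate sentence is invented.

```python
def word_edit_distance(hypothesis, reference):
    """Word-level Levenshtein distance between two sentences."""
    hyp, ref = hypothesis.split(), reference.split()
    # previous[j] holds the distance between hyp[:i-1] and ref[:j]
    previous = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        current = [i]
        for j, ref_word in enumerate(ref, start=1):
            substitution = previous[j - 1] + (hyp_word != ref_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def error_rate(hypothesis, reference):
    """Edit distance normalised by reference length (lower is better)."""
    return word_edit_distance(hypothesis, reference) / len(reference.split())


def meets_acceptance(candidate, baseline, reference, margin=0.0):
    """True if the candidate beats the baseline by more than `margin`."""
    return error_rate(candidate, reference) + margin < error_rate(baseline, reference)


reference = "Incliné mi coche en la vuelta"        # human translation from the table
baseline = "Yo depositado mi coche en la vuelta"   # Google output from the table
candidate = "Incliné mi coche en la curva"         # invented output, illustration only

print(error_rate(baseline, reference), error_rate(candidate, reference))
print(meets_acceptance(candidate, baseline, reference))
```

In practice the comparison would be run over a full test set rather than a single sentence, and would be combined with human evaluation. The `margin` argument is one way to express an agreed level above the baseline.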

The Google Pricing Model

Success in the machine translation market requires commitment and a deep understanding of the industry, the level of technology acceptance, technology literacy within LSPs and how automation tools are used. As with all industries, the sector does not adopt new processes, technologies or changes simply or easily. Asia Online has found that persuading professionals in the industry to make even a simple change such as a new technique or method for quality measurement that takes into account machine translation in combination with humans can be a challenge.

As both a machine translation technology vendor and a publisher of content (through Asia Online’s web sites in Asia), Asia Online is in a unique position to understand the issues of production as well as the issues of publication in relation to translation. The learning curve Asia Online has experienced in recent times has been interesting and challenging. At the same time, this combination of production and publication knowledge is what will help enable the professional language and professional publishing industry to go further, faster and with lower cost per unit for translation.

In contrast, Google has made little effort to understand the professional translation industry and its needs. There are established processes and business models that will not be easily changed. There must be a solid business reason or benefit in order to adjust. Machine translation, irrespective of provider, should be a natural fit within established workflows and processes. It should not require new or unnecessary changes.

As an illustration of this lack of understanding of the industry by Google, it has been well established that the pricing model for translation across nearly all languages is measured by the number of words translated. And yet Google has not bothered to take this into account and is instead offering its Translation API V2 on a per character basis. When Google starts charging, it would be much more readily accepted if it were to adapt to the industry’s norms, rather than forcing an unnecessary new method of cost calculation. If a character based model remains and some in the industry decided to use Google Translate, complex calculations on characters will need to be performed to determine the potential cost of a translation job. Google to date has not been clear on what is considered a character from a billing perspective, so even a space character could potentially be charged for.
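To make the calculation concrete, here is a minimal sketch comparing per-character and per-word cost estimates for the same text. Both prices are hypothetical placeholders rather than actual Google or LSP rates, and the `count_spaces` switch reflects the open question of whether spaces are billed.

```python
def per_character_cost(text, price_per_million_chars, count_spaces=True):
    """Estimate the cost of translating `text` under per-character billing."""
    if count_spaces:
        chars = len(text)
    else:
        chars = sum(1 for c in text if not c.isspace())
    return chars * price_per_million_chars / 1_000_000


def per_word_cost(text, price_per_word):
    """Estimate the cost under the industry's usual per-word billing."""
    return len(text.split()) * price_per_word


# Both prices are hypothetical placeholders, not real rates.
PRICE_PER_MILLION_CHARS = 10.0
PRICE_PER_WORD = 0.0001

text = "I went to the bank to deposit money"
print(per_character_cost(text, PRICE_PER_MILLION_CHARS))
print(per_character_cost(text, PRICE_PER_MILLION_CHARS, count_spaces=False))
print(per_word_cost(text, PRICE_PER_WORD))
```

In this sentence, spaces account for 7 of the 35 characters, so whether they are billed changes the per-character estimate by a fifth. Average word length also varies by language, which makes it harder to convert a per-word quote into a per-character budget.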

BEWARE the Fine Print – “Don’t Say You Weren’t Warned!”

Note:

I am not a lawyer. The analysis below is my own interpretation of Google’s legal documents and my own opinions of events in Google’s history. These comments are based on information widely available online. I strongly advise that, prior to using any third party provider or service, professional legal advice is sought.

Of those LSPs and professional translators that have used Google Translate, few that I have talked to have taken into consideration the legal aspects of using the Google Translate service. Some are aware, but are downplaying the issues or simply ignoring them. The task of understanding the legal obligations of using Google’s APIs is complicated by the fact that the terms that apply when a translation is performed are spread across multiple documents spanning multiple Google sites (sometimes more than one document per site).

There are at least five sets of terms, listed below, that have to be taken into consideration, possibly more depending on how Google Translate is being accessed:

http://www.google.com/accounts/TOS
http://www.google.com/accounts/tos/highlights/utos-us-en-h.html
http://translate.google.com/toolkit/TOS.html?hl=en
https://code.google.com/apis/language/translate/terms.html
http://code.google.com/terms.html

In these documents are a number of legal terms that every professional translator or LSP should be aware of when using Google tools and technologies for business purposes. While products from companies such as SDL may warn you that you are submitting content to a third party over a public network, they do not offer any insight into the potential legal ramifications of using any third party service. Language professionals should be careful when submitting any content to Google Translate. Before doing so, they should ensure that:

1. They possess sufficient rights to the content that they are submitting.

2. Both the individual using the service and the company the individual is employed by are authorized to grant Google the specified license to the content.

It is fairly common for LSPs to sign legal agreements with their clients stating that they will protect the intellectual property (e.g. copyright) and other rights (e.g. confidentiality) of their client’s data. As such, in many instances they are not authorized to grant rights to Google. Even without a legal document protecting the client’s rights, the LSP still does not have the right to grant Google rights on the client’s data.

Google’s Terms of Service are very clear about what Google may and may not do with data submitted. It is also very clear that you alone and not Google are responsible if rights are assigned to Google unlawfully.

When you submit content to Google Translate, my lay understanding is that:

· You acknowledge to Google that you are the originator of the content and that you are authorized and have the right to assign rights and license the content to Google.

· You acknowledge that while Google may, you may not, modify, rent, lease, loan, sell, distribute or create derivative works based on the content within Google services without permission of Google or the content owner.

· You retain all your original rights to the content, but also grant Google the rights to do almost anything it wants with the content. This includes using it to build new services, sell services or data to others so that they may build services or even provide it to other parties that may find value in your content in other ways. The services offered through the use of your data may be used to help competitors. The data itself could even be provided to competitors if Google wishes to do so.

· The rights granted to Google cannot be revoked in the future.

· You acknowledge that should there be any legal issue in the future, you are solely responsible and not Google.

This can be further summarized into a single sentence:

By granting rights to Google in data that you do not own, you are taking a considerable risk and can be held legally liable should either Google or the owner of the data wish to take legal action.

So, why is my lay understanding important? Because frankly I think very few LSPs ever seek proper legal advice on Google’s Terms of Service, or even read them. However, even without proper advice, I think much of the intent and result of the terms is clear, and it’s not particularly beneficial to LSPs. At the very least LSPs need to be aware of these terms so that they can operate accordingly.

Google also provides a Terms of Service Highlights page that provides its own plain language summary of what the terms mean, where Google includes the sentence: “Don’t say you weren’t warned.”

Others Give, Google Takes

Charging a fee for the use of Google Translate is one example of how the translation memories or even monolingual content that has been processed using Google Translate or Google Translator Toolkit can be used by Google to directly offer services to competitors using such data and for Google to further profit. Google is already offering fee based API access to three other APIs (Search, Storage and Prediction), two of which have derived their knowledge and information from the data provided by others.

But Google has been known to go further, without permission of the authors, publishers or copyright owners. On March 22, 2011, The New York Times published an article entitled “Judge Rejects Google’s Deal to Digitize Books” which starts out:

“Google’s ambition to create the world’s largest digital library and bookstore has run into the reality of a 300-year-old legal concept: copyright.”

Google has been actively scanning and processing books for some time. During this process, many copyrighted books were scanned without the authorization of their copyright owners. Google proceeded with the project knowing that it was in violation of copyright, but must have decided that it would be willing to go through a long lawsuit and come to some form of settlement at the end.

The gamble that Google is taking is significant, but the rewards, if successful, are equally significant. The means of execution can be described much more simply: If you have deep pockets and there is a huge business opportunity, then bend the law and deal with the potential consequences later – a classic scenario of “it is better to ask for forgiveness than to ask for permission.” There are few enterprises big enough that can take this approach, but it may just pay off. But in doing so, it certainly does not help the remaining companies that work within established legal frameworks.

Enterprises invest hundreds of thousands or even millions of dollars localizing content and products for new markets. In doing so an advantage is gained over competitors who have not made the effort or investment. Giving this work product away for free to Google along with an almost unrestricted license for its use is one means of helping your competition to catch up. When working with a machine translation provider, as many do with language service providers, ownership of the content and data should be tightly controlled and managed. Ironically, once Google has your data, it has controls in place over what others can do with the derived output generated from your data and that of others:

You agree that when using the Service, You will not, and will not permit your end users or other third parties to:

· incorporate Google Results as the primary content on your Property or any page on your Property;

· upload, post, email or transmit or otherwise make available any content that infringes any patent, trademark, copyright, trade secret or other proprietary right of any party, unless You (or the end user posting the content) are the owner of the rights or have the permission of the owner to post such content;

· distribute any file posted by another that You know, or reasonably should know, cannot be legally distributed in such manner;

· use any robot, spider, site search/retrieval application, or other device to retrieve or index any portion of Google services or collect information about users for any unauthorized purpose;

· copy, store, archive, republish or create a database of Google Results, in whole or in part, directly or indirectly

Taken together, the above could even be interpreted to mean that an LSP may not reuse the output of Google Translate or load it into a translation memory or translation management system, even after it has been proofread. Because a translation memory (database/archive) cannot be used, this restriction also means that every time the same sentence appears in a future document, it must be re-translated by machine, human or a combination of both, incurring costs that could have been avoided.

By losing control of high quality and high value data to Google, not only is the professional language services industry putting itself at risk from a legal perspective, but it also gives up competitive advantage. High quality multilingual data is often published on enterprise websites, so it could be argued that this data is already available. However without considerable work effort and investment a competitor is unable to take that data and leverage it for its own benefit as the data is not in an easy to leverage form such as a translation memory.

Should Google decide to compete in any of these industries in the future, the data, knowledge and insights that it has gained from the investment and creativity of others is already in its data library.

Earlier in 2011, Google acquired ITA Software, and it is reasonable to assume that Google will expand into the travel field as a result. There are thousands of travel, hotel and flight websites on the Internet today that could be considerably disadvantaged by the knowledge that Google has accumulated by crawling, indexing, translating and analyzing content, traffic and data from within this industry. It is clear that if Google wishes, it will go to court to battle things out against any industry and challenge both industry and historical legal boundaries. This is but one of many examples where the knowledge that Google has obtained can be leveraged against those who created it in the first place.

How Google uses content submitted or acquired via its various systems, user submissions or other initiatives such as scanning of books is unclear. What is clear is that if you assign Google a license to use data that is submitted as per their Terms of Service, there is no means in future to restrict rights, reclaim rights or enforce your own rights (or those of your clients) to the data, nor can you stop Google or competitors benefiting from it. It is essential that enterprises and LSPs understand the risks, exposure, protection measures and changes in rights for their data once it has been submitted to Google or any other third party.

Friday, July 1, 2011

The Google Translate API Furor: Analysis of the Impact on the Professional Translation Industry – Part I

This is a post further exploring the Google API announcements by guest writer Dion Wiggins, CEO of Asia Online (dion.wiggins@asiaonline.net) and former Gartner Vice President and Research Director. The opinions and analysis are those of the author alone.

Overview

This is Part II of the posting that was posted on June 1, 2011. Part I detailed the reasons behind the Google announcement that it will shut down access to the Google Translate API completely on December 1, 2011, and reduce capacity prior to the shutdown. Part II, which will be released as two posts, analyzes the impact that the announcement will have on the professional language services industry and also explores the implications of Google charging for its MT services.

Summary

Humans will be involved in delivering quality language translation for the foreseeable future. The ability to understand context, language and nuance is beyond the capabilities of any machine today. If machine translation ever becomes perfect, then it truly will be artificially intelligent. But there are many roles for machine translation in the professional language services industry today, despite the limitations of the technology in comparison to human capability.

By combining machine translation technology with human editors, translation output of the same quality as a human-only approach can be delivered in a fraction of the time and at a fraction of the cost. The perception that machine translation is not good enough and that it is easier to translate by human from the outset is outdated. It is time to put that idea to rest, since there are now many examples that clearly prove the validity of using machine translation with human editing to deliver high quality results.

The old adage of “there is no such thing as a free lunch” can be adapted to “there is no such thing as free translation” – you get what you pay for. The professional language service industry needs more than a generalized translation tool – control, protection, quality, security, proprietary rights and management are necessities.

· Google’s decision to move to a payment model for its Translate API is not a trivial one. It is part of a long term strategic initiative that is the right thing to do for Google’s business. Google’s primary rationale is to address issues relating to control of how and when translation is used and by whom, which in turn addresses the problem of “polluted drinking water” and will help clean up some of the lower quality content that Google has been criticized for ranking highly in its search results. This is a key strategic decision that will be part of their core business for the next decade and beyond.

· The professional language services industry (or Language Service Providers – LSPs) will not be negatively or positively impacted by Google moving to a paid translation model. True professionals do not use free or out-of-the-box translation solutions. Google’s business model does not fit well with LSPs and does not deliver the services which make LSPs professional. LSPs are not a market or customer demographic of significance to Google. Google’s customers are primarily Advertisers, not content providers or other peripheral industries. While these tools may give an initial impression that Google is serious about the language industry, the tools are in reality a thinly veiled cover over a professional crowd-sourcing initiative that delivers data and knowledge to Google under license that can then be used by Google to achieve greater advertising revenue and market share.

· Google Translate is a one-size-fits-all approach designed to give a basic understanding (or ‘gist’) of a document. This is insufficient in meeting the needs of the professional language services industry. What the industry needs are customized translation engines based around clean data, focused on a client’s specific audience, vocabulary, terminology, writing style and domain knowledge because this results in a document that is translated with the goal of publication and with reader satisfaction in mind.

· Google is not in the business of constructing data sets based on individual customer requirements or fine tuning to meet a customer’s specific need. The model of individual domain customizations is not economical for Google and, due to the human element required to deliver high-quality translation engines, this model does not scale even remotely close to other Google service offerings or revenue opportunities.

· Where enterprises have a real need for translation and a desire to use technology to help, expect some to experiment with open source. A small number of enterprises will succeed if they have sufficient linguistic skills, technical capability and data resources. Most will not. Others will try commercial machine translation technology. Out-of-the-box solutions will be insufficient, but those who invest the time and energy with commercial translation technology providers and LSPs to deliver higher quality output that is targeted at specific audiences and domains will be more likely to succeed in their machine translation efforts.

· Enterprises considering machine translation should ensure that the machine translation providers and/or LSPs that they are working with will protect their data. Contracts should allow for the use of a customer’s proprietary data with third parties in order to deliver lower cost and faster services, but should ensure that the data is protected and not used for any purpose other than service delivery. It may be wise to ensure a legal sign off process from within your own organization before any third party service is used.

· Like Google, enterprises should protect how, when and by whom their data and knowledge are used. Translations, knowledge, content and ideas are all data that Google gains advantage from and leverages from third party and user efforts. Google does this legally by having users grant them an almost totally non-restrictive license. As Google states in its own highlights of its Terms of Service “Don’t say you weren’t warned.”

Detailed Analysis

And Then Came The Worst Kept Secret in the Translation Industry… Google Wants To Charge!

Somewhat predictably, Google has changed its public position and is now going to charge for access to the Translate API. Announcing the shutdown may have been nothing more than a marketing ploy as there are clear indications that Google was intending to charge all along.

On June 3, Google’s APIs Product Manager, Adam Feldman announced the following with a small edit to the top of their original blog post:

“In the days since we announced the deprecation of the Translate API, we’ve seen the passion and interest expressed by so many of you, through comments here (believe me, we read every one of them) and elsewhere. I’m happy to share that we’re working hard to address your concerns, and will be releasing an updated plan to offer a paid version of the Translate API. Please stay tuned; we’ll post a full update as soon as possible.”

The reason the announcement came as no surprise is very simple – Google already has paid models for the Custom Search API, Google Storage API and the Prediction API via the API Console (https://code.google.com/apis/console/).

Google Translate API V2 has been listed in the API Console for a number of months already and offers a limit of 100,000 characters (approximately 15,000 words) per day. There is also a limit of 100.0 characters per second. While there is a link for requesting a higher quota, clicking on the link currently presents the following information:

Google Translation API Quota Request

The Google Translate API has been officially deprecated as of May 26, 2011. We are not currently able to offer additional quota. If you would like to tell us about your proposed usage of the API, we may be able to take it into account in future development (though we cannot respond to each request individually). In the mean time, for website translations, we encourage you to use the Google Translate Element.

For those who choose to respond, be prepared to reveal potentially sensitive information to Google. The form presented asks for a company profile, number of employees, expected translation volume per week and a field to tell Google how you intend to use the Translate API. I do not believe that Google offers any real privacy guarantees on much of the data it collects, and it is in essence crowd-sourcing interesting innovations and uses of its own API. The Terms of Service at the bottom of the page include a very interesting clause:

By submitting, posting or displaying the content you give Google a perpetual, irrevocable, worldwide, royalty-free, and non-exclusive license to reproduce, adapt, modify, translate, publish, publicly perform, publicly display and distribute any Content which you submit, post or display on or through, the Services.

So quite simply – be careful. You are giving Google your ideas and at the same time granting them the right to do pretty much anything they wish with it. One could argue, as others have done in the past, that this type of broad legal permission is required by an operator such as Google in order to operate its network in a reasonable manner. Even if that is so, it does not give LSPs any comfort with respect to their confidential client data.

Payment for services managed by the API console is via Google Checkout and all Google needs to do now is publicly set a price for a specified number of characters and turn the billing function on in the API console. Meanwhile they have had a number of “developers” testing the API (for free) and ironing out any issues since the launch of the Translate API V2.

Given that the Translate API V2 is already tested and in use, billing and quota management features are already available in the API console, and the API console allows for business registration and authentication, it would appear that Google’s initial announcement that it was shutting down the Translate API was little more than a marketing stunt designed to bring attention to the Translate API ahead of the change to a fee based model.
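To put the free quota quoted earlier in perspective, the arithmetic can be sketched in a few lines of Python. The figures (100,000 characters per day, roughly 15,000 words, 100 characters per second) are the ones listed in the API Console as reported in this post; the function names and the one-million-word example are purely illustrative assumptions.

```python
# Quota figures as quoted from the API Console in this post.
DAILY_CHAR_LIMIT = 100_000       # characters per day
APPROX_DAILY_WORDS = 15_000      # the post's word estimate for that limit
RATE_LIMIT_CHARS_PER_SEC = 100   # characters per second


def seconds_to_exhaust_daily_quota(daily_limit: int, rate: float) -> float:
    """Seconds needed to use up the daily quota when sending at the full rate."""
    return daily_limit / rate


def implied_chars_per_word(daily_limit: int, daily_words: int) -> float:
    """Average characters per word implied by the two quoted figures."""
    return daily_limit / daily_words


def days_to_translate(total_words: int, daily_words: int) -> int:
    """Whole days needed to push a job through the free quota (ceiling division)."""
    return -(-total_words // daily_words)


if __name__ == "__main__":
    seconds = seconds_to_exhaust_daily_quota(DAILY_CHAR_LIMIT, RATE_LIMIT_CHARS_PER_SEC)
    print(f"Daily quota exhausted after {seconds / 60:.1f} minutes at full rate")
    print(f"Implied characters per word: "
          f"{implied_chars_per_word(DAILY_CHAR_LIMIT, APPROX_DAILY_WORDS):.2f}")
    # Hypothetical job size, chosen only for illustration.
    print(f"Days for a 1,000,000-word job: {days_to_translate(1_000_000, APPROX_DAILY_WORDS)}")
```

At the full permitted rate the daily allowance would be used up in well under an hour, which suggests the free tier is sized for testing rather than for production volumes.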

What Does Google Achieve By Charging and Managing the Google Translate API by the API Console

· Control: Google can now control who can and cannot use the API in addition to how much the API is used and at what speed. This solves nearly all of the abuse problems that were discussed in the prior blog post on this topic. Control will most likely be at a level of an individual or at the level of a company, but not at the level of software products. Products will adapt to allow the user to enter their own key and be billed by Google directly.

When you sign up for the paid Translate API or purchase translation capacity via Google Checkout, you are acknowledging the Google Terms of Service. Google has a much more explicit commitment from you and knows who you are. If the Terms of Service are abused in any way, Google has the means to track the use and take the appropriate legal action. This will most certainly have a significant impact on the “polluted drinking water” problem.

· Revenue: It does give Google some revenue. However in comparison to other revenue streams such as advertising, this is likely to be insignificant. It is unlikely that Google will offer post-pay options, so users should expect to pay in advance using Google Checkout.

· Blocking Free Services: Developers that would have (or have already) built applications that offer free translation will cease to use the Google Translate API. In reality, these applications offer little value-add to users and this feature is offered by Google in other tools. Free applications that integrate Google Translate within competitors’ products such as Facebook and Apple will cease to exist, giving Google products such as Android a competitive advantage by being the exclusive developer of products that leverage its translation service without a charge to the user. If so, is this potentially anti-competitive?

Google may wish to keep some free third-party applications around in order to give the perception that it is encouraging innovation and to gather ideas for its own use of the Translate API. It would therefore come as no surprise if Google offers a smaller number of words for free and possibly even requires individual users to log in using a Google User ID, in order to control not just the application’s use of the Translate API, but also the individuals who use the application. By requiring individual users of an application to log in, tracking is extended to an individual level and blocking one errant user will not block an entire application.

· Blocking of Abusive Applications: As Google has control over who accesses the API and Google is also charging, it will no longer be economical to mass translate content in an attempt to build up content for Search Engine Optimization (SEO).

· Encourage Value Add Applications: Developers that have created a true value-added product (i.e. a translation management platform) where the Google Translate API is a component of the overall offering, but not a main feature, will gain from there being less competitive noise in the market place. Google wants to be seen as empowering innovation. User perception of innovation can be further expanded when Google’s technology is embedded in other innovative products. Commercial products of this nature are often expensive and often used by larger corporations. Customers who use their products may be required to get their own access key unless they have a billing agreement with the service provider. This provides yet another mechanism for Google to create a commercial relationship with enterprises.

Impact on the Professional Language Industry

There are many different segments within the professional language industry that are impacted by Google’s decisions about translation technology.

Impact on Machine Translation Providers

Those who offer machine translation free of charge using Google as the back-end will cease to exist unless they are able to generate an alternative revenue stream or other value-added features that users are willing to pay for. Those who offer free translation using other non-Google translation technology will likely see an increase in traffic to their sites as the Google-based providers start to vanish. Google will experience an increase in users going directly to their translation tools instead of via third-party websites.

Anecdotally, Asia Online has seen a considerable increase in inquiries from companies that have a commercial use for machine translation since the Google Translate Shutdown announcement. It is expected that other machine translation providers have seen a similar rise in interest.

There has been some speculation that machine translation providers may increase their prices as a result of the Google announcement. However, this is unlikely. Most offerings are relatively low cost, especially in comparison to large scale human translation costs. Asia Online views the change in Google’s translation strategy as an opportunity to stand above the crowd and demonstrate how customized translation systems can significantly outperform Google in terms of quality.

Impact on Open Source Machine Translation

Open Source machine translation projects will see some additional interest, but implementing these technologies is not at all simple and well beyond the technology maturity of many language industry developers and organizations. There are many open source translation platforms, and they vary in their underlying technique. These include rules-based, example-based and statistical-based machine translation systems. Most of these systems are not intended for real world commercial use, and many open source initiatives are part of ongoing research and development at universities. These are mostly academic development systems and have not been designed nor were they ever intended for commercial projects.

One of the most popular open source machine translation projects is the Moses Decoder, a statistical machine translation (SMT) platform that was originated by Asia Online’s Chief Scientist, Philipp Koehn, with continued development from a large number of developers researching natural language processing (NLP), including Hieu Hoang, who also recently joined Asia Online.

In addition to the complexity of building an SMT translation platform such as Moses, expertise in linguistics is required to build pre- and post-processing modules that are specific to each language pair. But the biggest barrier to building out an SMT platform such as Moses is simply the lack of data. While there are publicly available sets of parallel corpora (collections of bilingual sentences translated by humans), such parallel corpora are usually not in the right domain (subject or topic area) and are usually insufficient in both quantity and quality to produce high quality translation engines.

Many companies will try open source machine translation projects, but few will succeed. The effort, linguistic knowledge and data required to build a quality machine translation platform are often underestimated. As an example, many of Asia Online’s translation engines now have tens of millions of bilingual sentences as data to learn from. For more complex languages, statistics alone are not sufficient. Technologies that perform additional syntactic analysis and data restructuring are required. Every language pair combination has unique differences, and machine translation systems such as Moses accommodate very few of the nuances of each language pair.

Even Google does not handle some of the most basic nuances for some languages. As an example, if you translate a Thai date that represents the current year of 2011, it will be translated from Thai into English in its original Thai Buddhist calendar form of 2554. (e.g. “Part 2 of Harry Potter and the Deathly Hallows film will be released in July 2554”). For languages like Chinese, Japanese, Korean and Thai, additional technologies are required in order to separate words as there are no spaces between words as there are in romanized languages. In Thai, there are not even markers that indicate the end of a sentence. Most commercial machine translation vendors have not yet invested in the necessary expertise required to process more complex languages. As an example, SDL Language Weaver does not even try to determine the end of a sentence in Thai and simply translates the entire paragraph as if it is one long sentence. If commercial machine translation vendors are so far unable to conquer some of these complex technical tasks, it is not realistic to expect that the experimental ambitions of even sophisticated enterprises will be enough to be successful.
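The Thai date example above comes down to a fixed offset: a Thai Buddhist Era year is the Gregorian year plus 543, so 2554 corresponds to 2011. A minimal Python sketch of a post-processing step that normalizes such years in English output is shown below. The function name and the plausibility window are assumptions made for illustration, and this is not a description of how any vendor's engine handles dates.

```python
import re

# Thai Buddhist Era (BE) year = Gregorian (CE) year + 543.
BUDDHIST_ERA_OFFSET = 543

# A standalone four-digit number (not part of a longer digit run).
_FOUR_DIGIT_NUMBER = re.compile(r"(?<![0-9])([0-9]{4})(?![0-9])")


def convert_buddhist_years(text: str, min_be: int = 2400, max_be: int = 2700) -> str:
    """Rewrite four-digit numbers that look like Buddhist Era years as Gregorian years.

    The [min_be, max_be] window is a hypothetical heuristic: it avoids touching
    numbers that are already Gregorian years, but it cannot tell a year from any
    other four-digit number that happens to fall inside the window.
    """
    def _replace(match: re.Match) -> str:
        year = int(match.group(1))
        if min_be <= year <= max_be:
            return str(year - BUDDHIST_ERA_OFFSET)
        return match.group(1)

    return _FOUR_DIGIT_NUMBER.sub(_replace, text)


if __name__ == "__main__":
    raw = "Part 2 of Harry Potter and the Deathly Hallows film will be released in July 2554"
    print(convert_buddhist_years(raw))
```

A purely numeric rule like this is only a starting point; a production system would need to know from the source text that the number is in fact a date before converting it.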

The language industry has been active in building open source technologies that convert various document formats into industry standard formats such as TMX or XLIFF. However, these tools, while improving, still leave much to be desired, and using them often results in format loss. With the demise of LISA, the XLIFF standard is gaining traction faster than ever before, but there still remain much disagreement and many incompatibilities within the XLIFF development community that are unlikely to be resolved in the near term. Companies like SDL continue to “extend” the XLIFF standard, as they did with the TMX standard, by modifying the standard into a proprietary format that is not supported by other tools.

Claims will be made that the standard does not support their tools’ requirements. But the reality is that these requirements can be supported within the extensions to XLIFF without breaking the actual standard format, and the real reason for modifying or “enhancing” the standard is vendor lock-in, a familiar occurrence in the history of software.

There will be an increase in development activity of “Moses for Dummies” or “Do it yourself MT” type projects. These kits will try to dumb down the installation and will mostly be offered as open source. While this will allow for the installation of Moses to be streamlined, it will not resolve many critical technical or linguistic issues and, most importantly, will not resolve issues relating to data volume or quality. Without a robust linguistic skill-set, knowledge of natural language processing (NLP) techniques and vast volumes of data, this is still a daunting challenge. Unfortunately most users will not have such skills, and through attempting this approach will learn a time consuming and often costly lesson. If high quality machine translation were as easy as getting the install process for open source solutions right, these tools would have been built long ago and many companies would already be using them and offering show cases of their high quality output.

These attempts may result in organizations turning to commercial machine translation providers. If a company is willing to invest in trying to build their own machine translation platform, they most likely have a real business need. If these companies fail in using open source machine translation software, the need may be filled by commercial machine translation providers once the experimentation phase with open source machine translation ends, with a portion of the work of data gathering already complete and a customer who understands technical aspects to some degree.

Impact on Language Tools

Expect to see tool vendors like SDL and Kilgray updating their commercial products to support Google’s Translate API V2 and adding features around purchasing, cost management and integration of Google Checkout functionality.

Users of these products will most likely have to get their own access key from Google and will need to set themselves up for Google Checkout. It would be reasonable to assume that Google will update the Translate API V2 to include a purchase feature so that applications that embed Google Translate can integrate the purchase process directly into their workflow and processes.

But updating language tools to support billing is not the end of the story. Current processes, such as the two examples below, will need to be updated:

Pre-translating the entire document: A translation memory should always be used to match against previously translated material.
Mixed source reviews: Some systems provide the ability to show the translation memory output and the machine translation output beside each other.

Both these processes, while useful, can become very expensive if system processes are not updated and such translations occur automatically without authorization of the user. Software updates to manage these processes more effectively will be important.
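One way a tool could keep such costs under control is to consult the translation memory first, estimate the charge for whatever remains, and call the paid service only after the user has approved that estimate. The sketch below assumes exact-match lookup only (real tools also use fuzzy matching) and a hypothetical price per million characters, since Google has not yet published a price. All names are illustrative and no real product's API is shown.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class PretranslationPlan:
    from_memory: Dict[str, str] = field(default_factory=dict)  # segment -> stored translation
    needs_mt: List[str] = field(default_factory=list)          # unique segments with no TM match
    mt_characters: int = 0                                     # characters that would be billed
    estimated_cost: float = 0.0                                # in the currency of the price given


def plan_pretranslation(
    segments: Iterable[str],
    translation_memory: Dict[str, str],
    price_per_million_chars: float,  # hypothetical price, supplied by the caller
) -> PretranslationPlan:
    """Split segments into TM hits and MT candidates, and estimate the MT charge."""
    plan = PretranslationPlan()
    queued = set()
    for segment in segments:
        if segment in translation_memory:
            plan.from_memory[segment] = translation_memory[segment]
        elif segment not in queued:
            # Repeated segments are queued once so they are not paid for twice.
            queued.add(segment)
            plan.needs_mt.append(segment)
            plan.mt_characters += len(segment)
    plan.estimated_cost = plan.mt_characters / 1_000_000 * price_per_million_chars
    return plan


def is_authorized(plan: PretranslationPlan, approved_budget: float) -> bool:
    """Proceed to the paid service only if the estimate fits the approved budget."""
    return plan.estimated_cost <= approved_budget


if __name__ == "__main__":
    tm = {"Hello world.": "Bonjour le monde."}
    doc = ["Hello world.", "Save file", "Save file", "New text"]
    plan = plan_pretranslation(doc, tm, price_per_million_chars=20.0)
    print(plan.needs_mt, plan.mt_characters, is_authorized(plan, approved_budget=1.0))
```

The point of the design is that nothing is sent to the paid service as a side effect of opening a document: the estimate is produced first, and the decision to spend stays with the user.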

As more companies try to leverage open source machine translation technologies, vendors such as NoBabel and Terminotix that provide tools that extract, format, clean and convert data into translation memories may see an increase in their business.

Impact on Language Service Providers (LSPs)

LSPs who use software from SDL, Kilgray, Across and other tools that integrate with the Google API should be prepared to update their software and learn new processes to ensure that they do not get billed inadvertently for machine translation that was sent to Google without consideration for costs. If the software is not updated, the Google Translate function will simply stop working on December 1, 2011 when Google terminates the Google Translate V1.0 API.

LSPs may still use the Google Translator Toolkit or copy and paste content into the Google Translate web page, but should be very aware of the relevant privacy and data security issues.

Overall, there should be little impact on LSPs, with the exception of those who were using Google Translate behind the scenes or offering machine translation to customers using Google as the back-end. While this was a breach of Google’s Terms of Service, I believe a number of LSPs were doing this.

Due to insufficient and unclear warnings, or in some cases no warnings at all, it is understandable that some may not have been aware of, or may not have fully understood, Google’s and other machine translation service providers’ Terms of Service. However, with the Google Translate V2 API, users of the service must expressly sign up and agree to the terms. It will no longer be reasonably possible to claim ignorance of the terms or of the associated risks for customer data when it is submitted to the API.

LSPs are focused on delivering quality services with the higher-end skills required for translating and localizing content for a target market and then ultimately publishing to a market. Without a doubt, machine translation has a significant role in the future of translation, in particular for accelerating production and giving LSPs access to new markets of mass translation. But high quality translation systems that meet the needs of LSPs and ultimately their customers will require customized translation engines that are focused on a much narrower domain of knowledge than Google’s engines and are ultimately combined with both a human post editing effort and a human feedback cycle that continually improves the engine by giving high-quality human driven input back to the engine. It is a built-for-purpose collaboration between machine and human that will ultimately meet the end customer’s needs, not machine translation alone.

Impact on Corpus Providers

Industry organizations such as TAUS may gain some traction from the short term increase in demand for data while companies experiment with open source machine translation. However, as research has clearly shown, data of inconsistent quality drawn from a variety of sources can actually reduce machine translation quality.

In 2008, Asia Online participated in a study with TAUS, during which Asia Online built 29 translation engines using its own data in various combinations with three TAUS members’ data. In the resulting report the impact of each data set can be seen clearly. But another factor was the cleanliness of the data. The study clearly showed that having more data can provide some improvements in quality, but if the source data is not cleaned and processed correctly, the quality of the data can cause considerably lower translation output quality. It was also shown that smaller amounts of clean source data can produce better quality output than source data sets even two times larger. TAUS has recently worked to improve the quality of their data and the data continues to slowly improve. But in reality, the human effort required for such a task is considerable and not practical on a large scale without the right linguistic expertise and tools in each language.
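To give a sense of what “cleaning” a parallel corpus can involve at the most basic level, the following Python sketch applies a few widely used heuristics: whitespace normalization, removal of empty or untranslated pairs, a length-ratio check and de-duplication. These are generic illustrations, not the method used in the study described above; the function name and the ratio threshold are assumptions, and the character-based ratio would need adjusting for language pairs with very different writing systems.

```python
from typing import Iterable, List, Tuple

SentencePair = Tuple[str, str]


def clean_parallel_corpus(
    pairs: Iterable[SentencePair],
    max_length_ratio: float = 3.0,  # hypothetical threshold, tune per language pair
) -> List[SentencePair]:
    """Drop sentence pairs that are obviously unusable for engine training."""
    seen = set()
    cleaned: List[SentencePair] = []
    for source, target in pairs:
        # Collapse runs of whitespace so trivially different copies compare equal.
        source = " ".join(source.split())
        target = " ".join(target.split())
        if not source or not target:
            continue  # one side is empty
        if source == target:
            continue  # target is an untranslated copy of the source
        ratio = max(len(source), len(target)) / min(len(source), len(target))
        if ratio > max_length_ratio:
            continue  # lengths too different, likely misaligned
        if (source, target) in seen:
            continue  # exact duplicate
        seen.add((source, target))
        cleaned.append((source, target))
    return cleaned


if __name__ == "__main__":
    sample = [
        ("Hello  world", "Bonjour le monde"),
        ("Hello world", "Bonjour le monde"),
        ("", "Vide"),
        ("OK", "OK"),
    ]
    print(clean_parallel_corpus(sample))
```

Filters of this kind only remove the most obvious noise. Judging whether a translation is actually correct and in the right domain still requires the linguistic expertise discussed above.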

This analysis is continued in Part II

eMpTy Pages
