This is a continuation of this posting.
Is Google Right For The Professional Language Services Industry?
For more than 40 years, machine translation has promised much but consistently failed to deliver. MT promises had come to be seen more as “empty promises”. In recent years, for enterprises and language service providers who want control of the translation, Google’s machine translation has been an eye-opener for many but still not a real solution to their requirements.
Before Google launched its own SMT translation technology in October 2007, Google used Systran, as Yahoo Babelfish still does today. The measure of machine translation in the public eye was Systran technology, even if the public did not know the name of the technology behind the free translation services. Today that measure has moved to Google, with a common perception that Google is the state-of-the-art in machine translation. Google has shown that it can rapidly improve the quality of the translated output in a generalized context and this has impressed many individual users as well as companies. In turn this has led many companies to consider using machine translation for real world applications.
When Asia Online was founded in 2007, we talked with many companies about machine translation and the comments were consistently negative. Since that time, Google has helped machine translation in terms of credibility and by educating the market that considerable advances in the quality of machine translation have been made. Today, many companies that would have written such technologies off as a bad joke just a few years ago are now using or considering machine translation. For those in the machine translation industry, having such barriers removed and user perceptions adjusted has been a great asset, for which considerable credit must be given to Google.
Where the Google Approach Fails the Professional Language Industry
While Google is most certainly state-of-the-art in terms of machine translation scale and in terms of free translation, there are many reasons that Google may not be right for the professional translation industry. It is too easy to measure Google by looking at mainstream European languages such as French, Italian, German and Spanish, commonly known in the industry as FIGS. While Google does a reasonable job for general translation in these languages, the same cannot be said for Tier 2 European languages or Asian languages or for most languages where the content is in specialized domains.
Google’s one-size-fits-all approach is great to get an understanding of a document, but the results are not suitable for publication in almost all cases. A professional translator is, by definition, a professional and has specific expertise in certain languages and usually also in a specialized domain of knowledge. Consider the following question:
If you owned a newspaper, would you hire a journalist that specialized in finance to write about game strategy for football or the skills of a specific baseball player when compared against another? Would you hire a sports journalist to write about politics and international monetary policy?
A professional with the necessary skills, education, training credentials and work background would be hired to write an article in the desired domain and writing style. Not just to get the story right, but to give the publication the appropriate credibility and focus the text on a specific audience. So why is it that when machine translation is evaluated it is nearly always evaluated as a comparison to “out of the box” or “free online” translation software? There are literally thousands of posts and articles published on this basis. It is the metaphorical equivalent of evaluating the performance of a Ferrari by test driving a Honda Civic.
Consider the differences in target audience, writing style, vocabulary and terminology in Forbes or The Economist (Business News) when compared to Wikipedia or Harry Potter (Young Student).
Spanish Source:
Se necesitó una gran maniobra política muy prudente a fin de facilitar una cita de los dos enemigos históricos.
Business News Translation:
Significant amounts of cautious political maneuvering were required in order to facilitate a rendezvous between the two bitter historical opponents.
Young Student Translation:
A lot of care was taken to not upset others when organizing the meeting between the two longtime enemies.
It seems that common sense is discarded and the individuals doing such comparisons expect a machine to automatically understand who the target audience is and to study the topic and write using their preferred writing style with their preferred vocabulary.
In the Google model, all data is equal and significant volumes of data are required in order to get statistical relevance. A professional translator is able to understand the intent of the article, the context and audience it was intended for, and also apply a style guide. Using out-of-the-box or free translation software does not give you the ability to customize, control or guide any of these things. Google Translator Toolkit does offer some of this functionality, but only in a very limited manner.
The bottom line is simple. Until recently machine translation software was at best useful for gaining an understanding of what text in a foreign language was about. It was not designed or intended for use in publication of content. This is exactly what free or out-of-the-box translation systems offer today and it is known simply as “gist translation”.
Beyond Gist Translation - The Professional Language Services Industry Requires Control
Asia Online is one of several machine translation providers who develop custom translation engines for clients that are designed for a specific purpose and domain based on the client’s specific vocabulary, terminology and writing style. Some of the tasks included when customizing an engine for a client include analysis of the client’s target audience, glossary preparation, definition of non-translatable terms and preferred terminology, normalization of content, determination of preferred writing style and grammar. These tasks are not dissimilar to how the professional language community works with clients in order to deliver high quality translation using humans.
In this context, both human and machine translation projects require professional human linguistic input prior to beginning translation. The professional skills provided by LSPs play a critical role in both.
Google describes how it translates as follows:
When Google Translate generates a translation, it looks for patterns in hundreds of millions of documents to help decide on the best translation for you. By detecting patterns in documents that have already been translated by human translators, Google Translate can make intelligent guesses as to what an appropriate translation should be.
Hundreds of millions of documents are what is needed for Google to translate anything to the most common form and meaning. This technique is very appropriate if you want a general understanding or gist translation, but not appropriate if you want to publish. What is missing at even the simplest level is domain knowledge, from which greater relevance of context can be derived. Without context, many words can be ambiguous. Consider the use of the English words “bank” and “banked” when translated using Google.
| English Source | Human Translation | Google Translation | Google Context |
| I went to the bank | Fui al banco | Fui al banco | Bank as in finance |
| I went to the bank to deposit money | Fui al banco para depositar dinero | Fui al banco a depositar el dinero | Bank as in finance |
| I went to the bank of the turn in my car | Fui en coche a la inclinación de la vuelta | Fui a la orilla de la vuelta en mi coche | Bank as in river bank |
| I put my car into the bank of the turn | Puse mi coche en la inclinación de la vuelta. | Pongo mi coche en el banco de la vuelta | X Bank as in finance |
| I swam to the bank of the river | Nadé en la orilla del río | Nadé hasta la orilla del río | Bank as in river bank |
| I banked my money | Deposité mi dinero | Yo depositado mi dinero | Banked as in finance |
| I banked my car into the turn | Incliné mi coche en la vuelta | Yo depositado mi coche en la vuelta | X Banked as in finance |
| I banked my plane into a steep dive | Incliné mi avión en para una zambullida. | Yo depositado en mi avión en picada | X Banked as in finance |
The examples above show clearly the shortcomings of the one-size-fits-all approach. Even with the millions of documents that Google claims to have learned from, Google still favors a particular domain (finance) based on the volume of data in that domain when compared to other domains. For example: There is much more multilingual banking and finance data available than there is aeronautical or water sports data.
Studies by Asia Online, TAUS and others have shown that customized engines built using a lesser quantity of high quality data in the appropriate domain can deliver a considerably higher quality translation than customized engines built with more data that is not necessarily in the domain being translated and is of mixed quality.
Purists will argue that with enough data, the correct and better data will become more relevant statistically, while the lower quality data will become less relevant statistically. However, what Google has shown is that even with “hundreds of millions of documents to help decide on the best translation for you”, it often decides on the translation that has the most data available in a given domain, which in turn statistically overpowers domains with less data. In the above example, a clear bias can be seen towards the finance domain, while the domains of sports, automotive and aeronautics are less statistically relevant.
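The statistical effect described here can be illustrated with a deliberately simplified sketch. It is not how Google Translate or any real SMT engine is implemented (real systems also use surrounding context and language models), and every count below is invented purely for illustration. It only shows why pooling all data lets the largest domain win, while restricting the counts to one domain does not.

```python
from collections import Counter

# Hypothetical counts of how often "bank" was translated each way in the
# training data of each domain. The numbers are invented for illustration.
DOMAIN_COUNTS = {
    "finance": Counter({"banco": 9000, "orilla": 10}),
    "rivers": Counter({"orilla": 400, "banco": 5}),
    "motorsport": Counter({"inclinación": 60, "banco": 2}),
}


def most_likely(counts):
    """Return the translation with the highest count."""
    return counts.most_common(1)[0][0]


def pooled_choice(domain_counts):
    """One-size-fits-all: add every domain's counts together, then pick."""
    total = Counter()
    for counts in domain_counts.values():
        total.update(counts)
    return most_likely(total)


def domain_choice(domain_counts, domain):
    """Customized: pick using only the counts from the requested domain."""
    return most_likely(domain_counts[domain])


print(pooled_choice(DOMAIN_COUNTS))                # banco
print(domain_choice(DOMAIN_COUNTS, "motorsport"))  # inclinación
```

With the pooled counts, the finance sense overpowers the others simply because there is more of it. The motorsport sense only surfaces when the model is restricted to motorsport data.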
Google’s approach is right for Google’s purpose – that of trying to translate anything for everyone irrespective of purpose. However, this approach is not right for the professional language services industry – where greater control, style and terminology management is required to meet specific purposes.
Even in the same industry, preferred terms frequently vary. Microsoft, Oracle, IBM and Sybase all produce database software. Each may prefer different terms such as RDBMS, DB, database, relational database, relational database management system, relational DB, etc. when producing documentation. With a human translation project managed by professionals, style guides, glossaries and other guidance are provided to human translators.
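As a rough illustration of what terminology control means in practice, the sketch below checks a piece of text against a client glossary. The glossary contents, the function name and the choice of "relational database" as the preferred term are all hypothetical. A real terminology workflow would be considerably richer.

```python
import re

# Hypothetical client glossary: the preferred term and the variants the
# client does not want to see in its documentation.
GLOSSARY = {
    "relational database": ["RDBMS", "relational DB"],
}


def find_term_violations(text, glossary):
    """Return (variant, preferred) pairs for each discouraged variant found."""
    violations = []
    for preferred, variants in glossary.items():
        for variant in variants:
            # Whole-word, case-insensitive match.
            pattern = r"\b" + re.escape(variant) + r"\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                violations.append((variant, preferred))
    return violations


draft = "Restart the RDBMS before you back up the relational database."
print(find_term_violations(draft, GLOSSARY))
# [('RDBMS', 'relational database')]
```

A different client in the same industry would simply supply a different glossary, which is exactly the kind of control a one-size-fits-all engine does not offer.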
Google Translator Toolkit gives you a limited amount of control with glossaries and translation memories, but it is not sufficient to meet the needs of a professional translator.
There are also limits (maximum data sizes) for both the learning material you can provide to Google and the amount of data that you are allowed to process. By proofreading in Google Translator Toolkit, you are not just using a free tool, you are part of an informal, but professional, crowdsourcing initiative that delivers high quality proofread translations directly into Google’s tools, which in turn improve the Google technology for every user – including competitors.
The goal of customizing a translation engine for the professional language services industry must be to produce an output that requires the least amount of human editing in order to publish. It must not be to get a general understanding or the gist of the meaning. By focusing on this goal, the productivity of human translators is greatly improved, more content can be translated and more companies will be attracted to translating their content for alternative language markets.
Managing and Setting Machine Translation Expectations with LSPs
When engaging in discussions with LSPs and professional translators, the most common fear is that machines will replace humans. Oddly, for some LSPs this is also the desire and hope. Consider the following true-to-life LSP anecdotes in the context of expectations for machine translation:
· LSP A: After doing a very minimal amount of customization using just a few thousand relatively low quality translation memory segments, LSP A received their first version of their engine. The first version is a diagnostic engine, from which it is possible to determine the best path to quality improvement. Despite numerous presentations, emails and discussions, LSP A quickly came back with “we want to replace human translation with machine translation and this is not good enough, we are disappointed. We cannot replace our humans with this.”
Subsequent discussions did not help LSP A understand any better that a customized translation engine is as good as the data and the effort put into creating it and that the volume of high quality data and effort will determine how much human work can be reduced or accelerated. Ignoring reality, LSP A still expected that machine translation would instantly replace humans. If it were this easy, surely every LSP would already translate this way and the professional language industry would cease to exist.
· LSP B: After customizing an engine for LSP B, a freelance human translator was hired to review the quality of the machine translation output. LSP B had been in the translation industry for more than 10 years and made it clear that they knew how to measure translation quality. This indeed was true, but only in the context of how to measure the quality of a human translator. LSP B used the same metrics such as grammar, word choice, etc. that they would use for humans and rapidly came to the conclusion that the machine translation was not good enough.
LSP B was taught how to measure the human effort and time to completion for a publication quality translation, after which a human-only approach was compared to a machine and human hybrid approach. Multiple machine translation platforms were compared, including Google. When a generic out-of-the-box or free solution was used, it was often more productive to translate with humans only. However, with a customized translation engine focused on the target audience, domain, vocabulary and writing style of the client, the delivery time and the cost were considerably lower.
Managing expectations of an LSP with machine translation is not an easy task. The quantity and quality of data that is available to customize an engine is often unknown or questionable. Nearly every LSP says they have great data, yet Asia Online tools typically reject between 20% and 40% of all data submitted by LSPs. When the data is examined by humans, the reasons are clear and the LSP agrees. It is easy to forget that human work varies also, as do budgets for quality assurance, project management and other tasks such as terminology and glossary definition.
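To give a feel for why submitted data gets rejected, here is a minimal sketch of a translation memory sanity filter. It does not describe Asia Online's actual tools. The three checks and the `max_ratio` threshold are assumptions chosen only to show the general idea, and the sample segments are a toy set.

```python
def is_usable(source, target, max_ratio=3.0):
    """Apply a few simple, assumed sanity checks to one TM segment pair."""
    source, target = source.strip(), target.strip()
    if not source or not target:
        return False                      # one side is empty
    if source == target:
        return False                      # probably left untranslated
    longer = max(len(source), len(target))
    shorter = min(len(source), len(target))
    if longer / shorter > max_ratio:
        return False                      # lengths differ implausibly
    return True


def clean(segments):
    """Split segment pairs into kept and rejected lists."""
    kept = [pair for pair in segments if is_usable(*pair)]
    rejected = [pair for pair in segments if not is_usable(*pair)]
    return kept, rejected


segments = [
    ("Save the file.", "Guarde el archivo."),
    ("Save the file.", ""),
    ("Click OK.", "Click OK."),
    ("Yes", "Para continuar, haga clic en el botón Aceptar."),
    ("Open the menu.", "Abra el menú."),
]
kept, rejected = clean(segments)
print(f"rejected {len(rejected)} of {len(segments)} segments")
```

Even checks this crude tend to surface problems that the data owner recognizes immediately once a human looks at the rejected pairs.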
Using Google as a Base Point Quality Measurement Metric
LSPs want assurances that if they invest in a customized translation engine, they will get a measurable and predictable level of quality. This is one area where we have found Google to be very useful.
The quality of Google can be measured against a human reference. This can be compared easily with both human and automated metrics against other translations. Because of the focused approach to delivering a customized engine based around a specified domain, vocabulary and writing style, Asia Online's customers are able to use Google as a baseline for measurement of Asia Online’s output above which an acceptance criteria level (i.e. an agreed quality level better than Google) can be set. Using this technique, a clear expectation of translation quality can be set with a customer even before their engine is customized.
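The baseline idea can be sketched with a very simple automated metric. The example below uses a word-level edit distance against a human reference, which is a stand-in chosen for brevity rather than the metric Asia Online actually uses. The `margin` value and the "custom" output are hypothetical. The reference and baseline sentences are taken from the table earlier in this post.

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # drop a hypothesis word
                            curr[j - 1] + 1,      # add a reference word
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def error_rate(hypothesis, reference):
    """Word edits needed to reach the reference, per reference word."""
    hyp, ref = hypothesis.lower().split(), reference.lower().split()
    return word_edit_distance(hyp, ref) / len(ref)


def meets_acceptance(custom, baseline, reference, margin=0.05):
    """True if the custom output beats the baseline by at least `margin`."""
    improvement = error_rate(baseline, reference) - error_rate(custom, reference)
    return improvement >= margin


reference = "Incliné mi coche en la vuelta"        # human translation
baseline = "Yo depositado mi coche en la vuelta"   # Google output from the table
custom = "Incliné mi coche en la curva"            # hypothetical custom engine output

print(meets_acceptance(custom, baseline, reference))  # True
```

In practice the comparison would run over a full test set rather than one sentence, and would be combined with human evaluation, but the acceptance logic stays the same: agree on a margin above the baseline before the engine is built.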
The Google Pricing Model
Success in the machine translation market requires commitment and a deep understanding of the industry, the level of technology acceptance, technology literacy within LSPs and how automation tools are used. As with all industries, the sector does not adopt new processes, technologies or changes simply or easily. Asia Online has found that persuading professionals in the industry to make even a simple change such as a new technique or method for quality measurement that takes into account machine translation in combination with humans can be a challenge.
As both a machine translation technology vendor and a publisher of content (through Asia Online’s web sites in Asia), Asia Online is in a unique position to understand the issues of production as well as the issues of publication in relation to translation. The learning curve Asia Online has experienced in recent times has been interesting and challenging. At the same time, this combination of production and publication knowledge is what will help enable the professional language and professional publishing industry to go further, faster and with lower cost per unit for translation.
In contrast, Google has made little effort to understand the professional translation industry and its needs. There are established processes and business models that will not be easily changed. There must be a solid business reason or benefit in order to adjust. Machine translation, irrespective of provider, should be a natural fit within established workflows and processes. It should not require new or unnecessary changes.
As an illustration of this lack of understanding of the industry by Google, it has been well established that the pricing model for translation across nearly all languages is measured by the number of words translated. And yet Google has not bothered to take this into account and is instead offering its Translation API V2 on a per character basis. When Google starts charging, it would be much more readily accepted if it were to adapt to the industry’s norms, rather than forcing an unnecessary new method of cost calculation. If a character based model remains and some in the industry decided to use Google Translate, complex calculations on characters will need to be performed to determine the potential cost of a translation job. Google to date has not been clear on what is considered a character from a billing perspective, so even a space character could potentially be charged for.
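The extra calculation an LSP would face can be sketched as follows. The rates are placeholders, not published prices, and the `count_spaces` switch exists only because it is unclear whether spaces are billable.

```python
def per_word_cost(text, rate_per_word):
    """Cost under the industry's usual per-word pricing."""
    return len(text.split()) * rate_per_word


def per_character_cost(text, rate_per_million_chars, count_spaces=True):
    """Cost under per-character pricing; whether spaces count is unknown."""
    if count_spaces:
        chars = len(text)
    else:
        chars = sum(1 for ch in text if not ch.isspace())
    return chars * rate_per_million_chars / 1_000_000


text = "I went to the bank to deposit money"

# Both rates below are placeholders, not published prices.
print(len(text.split()), "words")
print(len(text), "characters including spaces")
print(per_word_cost(text, 0.5))
print(per_character_cost(text, 20.0, count_spaces=True))
print(per_character_cost(text, 20.0, count_spaces=False))
```

For this eight-word sentence, counting or not counting spaces changes the billable volume from 35 characters to 28, a difference of 20%. Across a large job, that uncertainty makes quoting against a per-word customer price harder than it needs to be.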
BEWARE the Fine Print – “Don’t Say You Weren’t Warned!”
I am not a lawyer. The analysis below is my own interpretation of Google’s legal documents and my own opinions of events in Google’s history. These comments are based on information widely available online. I strongly advise that, prior to using any third party provider or service, professional legal advice is sought.
Of those LSPs and professional translators that have used Google Translate, few that I have talked to have taken into consideration the legal aspects of using the Google Translate service. Some are aware, but are downplaying the issues or simply ignoring them. The task of understanding the legal obligations of using Google’s APIs is complicated by the fact that the terms that apply when a translation is performed are spread across multiple documents spanning multiple Google sites (sometimes more than one document per site).
There are at least the 5 sets of terms listed below that have to be taken into consideration, possibly more depending on how Google Translate is being accessed:
These documents contain a number of legal terms that every professional translator or LSP should be aware of when using Google tools and technologies for business purposes. While products from companies such as SDL may warn you that you are submitting content to a third party over a public network, they do not offer any insight into the potential legal ramifications of using any third party service. Language professionals should be careful when submitting any content to Google Translate – before doing so, they should ensure that:
1. They possess sufficient rights to the content that they are submitting.
2. Both the individual using the service and the company the individual is employed by are authorized to grant Google the specified license to the content.
It is fairly common for LSPs to sign legal agreements with their clients stating that they will protect the intellectual property (e.g. copyright) and other rights (e.g. confidentiality) of their client’s data. As such, in many instances they are not authorized to grant rights to Google. Even without a legal document protecting the client’s rights, the LSP still does not have the right to grant Google rights on the client’s data.
Google’s Terms of Service are very clear about what Google may and may not do with data submitted. It is also very clear that you alone and not Google are responsible if rights are assigned to Google unlawfully.
When you submit content to Google Translate, my lay understanding is that:
· You acknowledge to Google that you are the originator of the content and that you are authorized and have the right to assign rights and license the content to Google.
· You acknowledge that while Google may, you may not, modify, rent, lease, loan, sell, distribute or create derivative works based on the content within Google services without permission of Google or the content owner.
· You retain all your original rights to the content, but also grant Google the rights to do almost anything it wants with the content. This includes using it to build new services, sell services or data to others so that they may build services or even provide it to other parties that may find value in your content in other ways. The services offered through the use of your data may be used to help competitors. The data itself could even be provided to competitors if Google wishes to do so.
· The rights granted to Google cannot be revoked in the future.
· You acknowledge that should there be any legal issue in the future that you are solely responsible and not Google.
This can be further summarized into a single sentence:
By granting rights to Google in data that you do not own, you are taking a considerable risk and can be held legally liable should either Google or the owner of the data wish to take legal action.
So, why is my lay understanding important? Because frankly I think very few LSPs ever seek proper legal advice on Google’s Terms of Service, or even read them. However, even without proper advice, I think much of the intent and result of the terms is clear, and it’s not particularly beneficial to LSPs. At the very least LSPs need to be aware of these terms so that they can operate accordingly.
Google also provides a Terms of Service Highlights page that provides its own plain language summary of what the terms mean, where Google includes the sentence: “Don’t say you weren’t warned.”
Others Give, Google Takes
Charging a fee for the use of Google Translate is one example of how Google can use the translation memories, or even the monolingual content, processed through Google Translate or Google Translator Toolkit to offer services directly to competitors and to profit further from that data. Google is already offering fee based API access to three other APIs (Search, Storage and Prediction), two of which have derived their knowledge and information from the data provided by others.
But Google has been known to go further, without permission of the authors, publishers or copyright owners. On March 22, 2011, The New York Times published an article entitled “Judge Rejects Google’s Deal to Digitize Books” which starts out:
“Google’s ambition to create the world’s largest digital library and bookstore has run into the reality of a 300-year-old legal concept: copyright.”
Google has been actively scanning and processing books for some time. During this process, many copyrighted books were scanned without the authorization of their copyright owners. Google proceeded with the project knowing that it was in violation of copyright, but must have decided that it would be willing to go through a long lawsuit and come to some form of settlement at the end.
The gamble that Google is taking is significant, but the rewards, if successful, are equally significant. The means of execution can be described much more simply: If you have deep pockets and there is a huge business opportunity, then bend the law and deal with the potential consequences later – a classic scenario of “it is better to ask for forgiveness than to ask for permission.” Few enterprises are big enough to take this approach, but it may just pay off. In doing so, however, Google certainly does not help the remaining companies that work within established legal frameworks.
Enterprises invest hundreds of thousands or even millions of dollars localizing content and products for new markets. In doing so an advantage is gained over competitors who have not made the effort or investment. Giving this work product away for free to Google along with an almost unrestricted license for its use is one means of helping your competition to catch up. When working with a machine translation provider, as many do with language service providers, ownership of the content and data should be tightly controlled and managed. Ironically, once Google has your data, it has controls in place on what others can do with the derived output generated from your data and that of others:
You agree that when using the Service, You will not, and will not permit your end users or other third parties to:
· incorporate Google Results as the primary content on your Property or any page on your Property;
· upload, post, email or transmit or otherwise make available any content that infringes any patent, trademark, copyright, trade secret or other proprietary right of any party, unless You (or the end user posting the content) are the owner of the rights or have the permission of the owner to post such content;
· distribute any file posted by another that You know, or reasonably should know, cannot be legally distributed in such manner;
· use any robot, spider, site search/retrieval application, or other device to retrieve or index any portion of Google services or collect information about users for any unauthorized purpose;
· copy, store, archive, republish or create a database of Google Results, in whole or in part, directly or indirectly
Taken together, the above could even be interpreted to mean that an LSP could not use the output of Google Translate or load it into a translation memory or translation management system, even after it has been proofread. Because a translation memory (database/archive) cannot be used, this restriction also means that every time the same sentence appears in future documents, there will be additional costs that could have been avoided, as the sentence must be re-translated by machine, human or a combination of both.
By losing control of high quality and high value data to Google, not only is the professional language services industry putting itself at risk from a legal perspective, but it also gives up competitive advantage. High quality multilingual data is often published on enterprise websites, so it could be argued that this data is already available. However, without considerable work effort and investment a competitor is unable to take that data and leverage it for its own benefit as the data is not in an easy to leverage form such as a translation memory.
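The cost of not being able to store results is easy to see in a minimal sketch of an exact-match translation memory. The class and function names are hypothetical, and `fake_translate` is only a stand-in for whatever machine or human translation step is used. Real translation memories also support fuzzy matching, which is omitted here.

```python
class TranslationMemory:
    """A minimal exact-match translation memory (illustrative only)."""

    def __init__(self):
        self._segments = {}

    @staticmethod
    def _key(source):
        # Normalize case and whitespace so trivial differences still match.
        return " ".join(source.lower().split())

    def add(self, source, target):
        self._segments[self._key(source)] = target

    def lookup(self, source):
        """Return the stored translation, or None if there is no exact match."""
        return self._segments.get(self._key(source))


def translate_document(sentences, tm, translate):
    """Reuse TM matches; call `translate` only for unseen sentences."""
    output, new_work = [], 0
    for sentence in sentences:
        target = tm.lookup(sentence)
        if target is None:
            target = translate(sentence)   # machine, human or both
            tm.add(sentence, target)
            new_work += 1
        output.append(target)
    return output, new_work


tm = TranslationMemory()
fake_translate = lambda s: f"<{s}>"   # stand-in for a real translation step
doc = ["I banked my money.", "Click OK.", "I banked my money."]
_, first_pass = translate_document(doc, tm, fake_translate)
_, second_pass = translate_document(doc, tm, fake_translate)
print(first_pass, second_pass)  # 2 0
```

On the first pass only two of the three sentences need new work, and on the second pass none do. Without permission to store the output, every pass would cost as much as the first.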
Should Google decide to compete in any of these industries in the future, the data, knowledge and insights that it has gained from the investment and creativity of others is already in its data library.
Earlier in 2011, Google acquired ITA Software, and it is reasonable to assume that Google will expand into the travel field as a result. There are thousands of travel, hotel and flight websites on the Internet today that could be considerably disadvantaged by the knowledge that Google has accumulated by crawling, indexing, translating and analyzing content, traffic and data from within this industry. It is clear that if Google wishes, it will go to court to battle things out against any industry and challenge both industry and historical legal boundaries. This is but one of many examples where the knowledge that Google has obtained can be leveraged against those who created it in the first place.
How Google uses content submitted or acquired via its various systems, user submissions or other initiatives such as scanning of books is unclear. What is clear is that if you assign Google a license to use data that is submitted as per their Terms of Service, there is no means in future to restrict rights, reclaim rights or enforce your own rights (or those of your clients) to the data, nor can you stop Google or competitors benefiting from it. It is essential that enterprises and LSPs understand the risks, exposure, protection measures and changes in rights for their data once it has been submitted to Google or any other third party.