
Tuesday, January 18, 2011

Has Google Translate Reached the Limits of its Ongoing Improvement?

The End of the Road for “The More Data The Better” Assumption?

Many in the world of statistical MT (SMT) have long held that success with SMT depends almost entirely on the volume of data that you have. The experts at Google, ISI and TAUS have all been saying this for years. Some discussion has questioned this “only data matters” assumption, but many in the SMT world continue to believe the statement below because, to some extent, it has actually been true. Many of us have witnessed the steady quality improvements at Google Translate in our casual use to read an occasional web page (especially after they switched to SMT), but for the most part these MT engines rarely rise above gisting quality.

 

"The more data we feed into the system, the better it gets..." Franz Och, Head of SMT at Google


However, in an interesting review of the challenges of Google’s MT efforts in the Guardian, we begin to see some recognition that MT is a REALLY TOUGH problem to solve with machines, data and science alone. The article also quotes Douglas Hofstadter, who questions whether MT will ever work as a human replacement, since language is the most human of human activities. He is very skeptical and suggests that this quest to create accurate MT (as a total replacement for human translators) is basically impossible. While I too have serious doubts whether machines will ever learn meaning and nuance at a level that compares with competent humans, I think we should focus on the real discovery here: more data is not always better, and computers and data alone are not enough. MT is still a valuable tool, and if used correctly can provide great value in many different situations. The Google admission according to this article is as follows:

“Each doubling of the amount of translated data input led to about a 0.5% improvement in the quality of the output,”  and "We are now at this limit where there isn't that much more data in the world that we can use." Andreas Zollmann of Google Translate.
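
To put the quoted figure in perspective, here is a small Python sketch of the arithmetic. It assumes, purely for illustration, that the 0.5% is an additive gain in percentage points for every doubling of data and that the rate stays constant. Neither assumption is stated in the article, and the function names are hypothetical.

```python
import math

GAIN_PER_DOUBLING = 0.5  # assumed: percentage points gained per doubling of data


def data_multiplier(target_gain, gain_per_doubling=GAIN_PER_DOUBLING):
    """How many times more data is needed to reach a target quality gain."""
    doublings = target_gain / gain_per_doubling
    return 2 ** doublings


def gain_from_growth(multiplier, gain_per_doubling=GAIN_PER_DOUBLING):
    """Quality gain (in points) from multiplying the data volume."""
    return gain_per_doubling * math.log2(multiplier)


if __name__ == "__main__":
    for target in (1, 2, 5):
        print(f"{target} point(s) needs {data_multiplier(target):,.0f}x the data")
```

Under these assumptions a 5-point gain takes ten doublings, which is about a thousand times more data. That is why the quote reads as a limit rather than a roadmap.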

 

But Google is hardly throwing in the towel on MT; they will try “to add on different approaches and (explore) rules-based models."

Interestingly, the “more data is better” issue is also being challenged in the search arena. In their zeal to index the world’s information, Google attempts to crawl and index as many sites as possible (because more data is better, right?). However, spammers are creating SEO-focused “crap content” that increasingly shows up at the top of Google searches. (I experienced this first hand when I searched for widgets to enhance this blog. I gave up after going through page after page of SEO-focused crap.) This article describes the impact of this low-quality content created by companies like Demand Media, and the effects are summarized succinctly in the quote below.

“Searching Google is now like asking a question in a crowded flea market of hungry, desperate, sleazy salesmen who all claim to have the answer to every question you ask.” Marco Arment

 

But getting back to the issue of data volume and MT engine improvements, have we reached the end of the road? I think this is possibly true for some languages, i.e. data-rich languages like French, Spanish and Portuguese, where it is quite possible that tens of billions of words underlie the MT systems already. It is not necessarily true for sparse-data, or less present languages on the net (pretty much anything other than FIGS and maybe CJK), and we will hopefully see these other languages continue to improve as more data becomes available. In the graphic below we can see a very rough and generalized relationship between data volume and engine quality. I have a very rough estimate of the Google scale on top, and a lower data volume scale for customized systems at the bottom (that are generally focused on a single domain) where less is often more.
[Graphic: data volume versus MT engine quality]

Ultan O’Broin provides an important clue (I think anyway) for continued progress: “There's a message about information quality there, surely.” At Asia Online we have always been skeptical of the “the more data the better” view and we have ALWAYS claimed that data quality is more important than volume. One of the problems created by large-scale automated data-scraping is that it is all too easy to pick up large amounts of noise and digital dirt, or just plain crap, through this approach. Early SMT developers all use crawler-based web-scraping techniques to acquire the training data to build their baseline systems. We have all learned by now, I hope, that it is very difficult to identify and remove noise from a large corpus, since noise is by definition random and hard to identify through automated cleaning routines, which can usually only target known patterns. (It is interesting to see that “crap content” also undermines the search algorithms, since machines, i.e. spider programs, don’t make quality judgments on the data they crawl. Thus Google can, and easily does, identify crap content as the most relevant and important content for all the wrong reasons, as Arment points out above.)

Though corporate translation memories (TM) can sometimes be of higher quality than web-scraped data, TM also tends to gather digital debris over time. This noise comes from a) tools vendors who try to create lock-in situations by adding proprietary meta-data to the basic linguistic data, b) the lack of uniformity between human translators and c) poor standards that make consolidation and data sharing highly problematic. In a blog article describing a study of TAUS TM data consolidation, Common Sense Advisory describes this problem quite clearly: “Our recent MT research contended that many organizations will find that their TMs are not up to snuff — these manually created memories often carve into stone the aggregated work of lots of people of random capabilities, passed back and forth among LSPs over the years with little oversight or management.”
 

So what is the way forward, if we still want to see ongoing improvements?

 

I cannot really speak to what Google should do (they have lots of people smarter than me thinking about this), but I can share the basic elements of a strategy that I see clearly working to produce continuously improving customized MT systems developed by Asia Online. It is much easier to improve customer-specific systems than a universal baseline.
  • Make sure that your foundation data is squeaky clean and of good linguistic quality (which means that linguistically competent humans are involved in assessing and approving all the data that is used in developing these systems).
  • Normalize, clean and standardize your data on an ongoing and regular basis.
  • Focus 75% of your development effort on data analysis and data preparation.
  • Focus on a single domain.
  • Understand that dealing with MT is more akin to interaction with an idiot-savant than with a competent and intelligent human translator.
  • Involve competent linguists through various stages of the process to ensure that the right quality focused decisions are being made.
  • Use linguistically informed development strategies as pure data based strategies are only likely to work to a point.
  • For language pairs with very different  syntax, morphology and grammar it will probably be necessary to add linguistic rules.
  • Use linguists to identify error patterns and develop corrective strategies.
  • Understand the content that you are going to translate and understand the quality that you need to deliver.
  • Clean and simplify the source content before translation.
  • And if quality really matters, always use human validation and review.
[Graphic: Clean data reduces unpredictability]
All of this could be summarized simply: make sure that your data is of high quality, and use competent human linguists throughout the development process to improve the quality. This is true today and will be true tomorrow.
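
As a rough illustration of the “normalize, clean and standardize” step, here is a minimal Python sketch of pattern-based filtering for parallel segments. The function names, the checks and the length-ratio threshold are all assumptions made for this example; this is not Asia Online's actual pipeline. It also only catches known patterns, which is exactly the limitation of automated cleaning noted earlier, so human review is still needed.

```python
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")   # leftover markup such as <b> or <br/>
SPACE_RE = re.compile(r"\s+")     # runs of whitespace


def normalize(text):
    """Apply Unicode NFC normalization, strip markup, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = TAG_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def has_letters(text):
    """True if the text contains at least one alphabetic character."""
    return any(ch.isalpha() for ch in text)


def keep_pair(source, target, max_ratio=3.0):
    """Return True if a normalized segment pair passes the known-pattern checks."""
    if not source or not target:
        return False              # one side is empty
    if source.lower() == target.lower():
        return False              # assumed to be an untranslated copy
    if not (has_letters(source) and has_letters(target)):
        return False              # numbers or punctuation only
    # Character-length ratio is a crude misalignment check (assumed threshold).
    ratio = max(len(source), len(target)) / min(len(source), len(target))
    return ratio <= max_ratio


def clean_corpus(pairs):
    """Normalize, filter and de-duplicate (source, target) pairs, keeping order."""
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        src, tgt = normalize(src), normalize(tgt)
        if (src, tgt) in seen or not keep_pair(src, tgt):
            continue
        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned
```

Each rule here is a judgment call that a linguist should review for the language pair at hand. For example, identical source and target can be legitimate for product names, and a character-length ratio behaves very differently for pairs such as English and Chinese.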

I suspect that effective man-machine collaborations will outperform pure data-driven approaches in future, as we are already seeing with both MT and search, and I would not be so quick to write off Google. I am sure that they can still find many ways to continue to improve. As long as the 6 billion people not working in the professional translation industry care about getting access to multilingual content, people will continue to try and improve MT. And if somebody tells you that machines can generally outperform or replace human translators (in 5 years no less), don’t believe them (but understand there is great value in learning how to use MT technology more effectively anyway). We have quite a way to go yet until we get there, if we ever do at all.
I recall a conversation with somebody at DARPA a few years ago, who said that the universal translator in Star Trek was the single most complex piece of technology on the Starship Enterprise, and that mankind was likely to invent everything else on the ship before it had anything close to the translator that Captain Kirk used.

MT is actually still making great progress, but it is wise to always be skeptical of the hype. As we have seen lately, huge hype does not necessarily lead to success, as Google Buzz and Wave have shown.


PS: Just a few days after this post was originally published Google admitted that they need to address the search SPAM problem, and this was further reinforced by a story in the Wall St. Journal.





Monday, January 10, 2011

The Most Worthwhile Conferences to Attend in 2011 and Finding the Real Customer

This is an expansion of a conversation I had with Renato Beninatto, which he also blogged on and which was also videotaped here. I am sharing my opinion here not to endorse one event or another; this is just my personal list of preferences and nothing more. I do not claim there is any special status to my personal preferences, and you may notice I tend to like conferences where translation technology is emphasized.

We live in an age where marketing and corporate-speak is increasingly challenged, undermined and sometimes even seen as disingenuous and false. (Raise your hand if you trust and respect corporate press releases.) Today we see customer voices rise above the din of corporate messaging and take control of branding and corporate reputations with their own “authentic” discussions of actual customer experiences, while marketing departments look on haplessly. I think this phenomenon is happening on many fronts, including conferences in the localization industry. There are too many events in the L10N industry that seem formulaic, routine, repetitive and engineered based on the same old viewpoints. This, I think, affects the ability of these events to really spark dialogue and excitement, and to generate the vital learning experiences that make conferences must-attend events. While these events remain useful for “face-time”, they often have little value for really engaging attendees at a professional level and providing insights that drive new action plans.

What makes for a great conference or professional industry event? To my mind: high quality content, interactive and engaged audiences in sessions that broaden one’s horizons, interesting people who continue the professional dialogue outside of the sessions and share learning experiences, and of course a good location. And if you can offer all of this at a reasonable cost, even better. A great professional event is characterized by learning: the more intensive the learning experience, the better. The best ones leave you thinking for a while after the event. Intense learning rarely happens at “really big” events because it is hard to scale this, but hopefully you have a few intense one-on-one interactions.

I also really like events that really focus on the customer: the real customer. The real customer would be the management team that runs, handles and is held accountable for success and failure of international initiatives (rarely the localization department of that company IMO). So the real customer would be senior sales, marketing, product management and customer support people who may also fund and direct the localization department mission in the global enterprise. (We rarely see them at any localization conferences because localization is rarely a central focus for them.) The real customer is more likely to focus on market share trends and customer satisfaction / loyalty  rather than word rates, fuzzy matching rates, TM ownership, SimShip or vendor management.
The question that Ultan O'Broin posed most recently in Quora was:

He also presents a categorization of these conferences as follows: Generalist, Specialty and Geography-focused events. He said that he liked to attend one or two of each category and his preferences are stated in the links above.

I still see myself as more of a pragmatic technology guy, trying to solve meaningful and useful translation problems (hopefully) with technology, rather than an industry insider (not quite a localization professional), so I would organize this a little bit differently but it still has much in common with Renato’s view. For MT especially it is all about understanding and learning how to use it at this point in time.  

I think there are 3 or more categories of conferences that touch localization and translation. The following is my very crude categorization. (Hopefully somebody can suggest a better categorization scheme. Please feel free to tear this apart).
 
1) Traditional Corporate L10N & Translation Focused Conferences
2) Translation Technology & L10N Research Focused Conferences
3) Special Focus & Miscellaneous
and New Opportunities and Events/Industries/Markets to Explore

If I limit myself to three or so in each category I would select the following events.
Category 1: Localization World and ELIA are the best in terms of content quality and networking value IMO. Localization World is the largest industry insider event (and the only one where I have seen a real customer view occasionally) and ELIA is a great example of sharing and collaboration between peers and competitors. This is the most crowded conference category and my recommendation would be for people to choose carefully amongst the available options (using location and content as a guide).  There are some who prefer the LISA and GALA versions of this category and they can also be good and sometimes slightly different like the LISA event at UC Berkeley.

Category 2:  TAUS Annual User Conference for MT focus from enterprise customer perspectives, (not the regional meetups held around the world) even though this event has some overlap with the first category.
AMTA for a deep dive on machine translation related issues that covers both the gory details of the technology and its use in public and private sectors, but this event has a very strong US focus.
LRC for broad and innovative localization research and thinking that is truly focused on the next generation of needs.
Translingual Europe 2010: This was a free event held in Berlin that I think shows promise. It had much about MT and broader language technology initiatives across the world, but especially in the EU.

Category 3: IMTT events have great content and a wonderful collaborative and sharing culture, and I also think they are among the few events where you really get to see both LSP and translator perspectives engage together.
AGIS to understand the issues in the non-profit world where translation is often linked to national development priorities or alleviation of information poverty. A different perspective and much more ambitious initiatives that involve national policy oftentimes. I am willing to bet that the leaders in the non-profit arena will also be the first to really use technology well and drive standards forward. I suspect that the most interesting crowdsourcing initiatives will also come from the non-profit area and passionate community leaders rather than global enterprises.
tekom is an opportunity for localization professionals to connect to a broader customer community and I hope that more of this happens in 2011. The industry can only grow and gain momentum by becoming more involved with larger, broader and vertical-market-focused shows which are important to real customers.
I think some of the smaller events can also be very interesting, e.g. the last LISA Crowdsourcing round table was informative and showed potential and promise but lost momentum because of weak follow-up. I am told that some of the smaller events in Eastern & Southern Europe also have very high quality content and great engagement. Web-based events are growing in popularity but very few have found the right mix of content and engagement.

I hope that we will see more events focused on resolving issues around data interchange and exchange standards, so that translation data flows much more easily, and fewer events focused on process standards.

New Opportunities and Events/Industries/Markets to Explore
Possibly the best places to find exciting business opportunities, and to learn about new long-term strategic opportunities, are shows that have been off the beaten path and ignored by most industry insiders.
 
Some possibilities for translation industry collaborations in future and where I think the best opportunities lie for emerging translation and localization demand are listed below. Again just some suggestions and not a complete list by any means. It would be worth finding out which are the best conferences to meet customers, providers and thought leaders in the following areas and develop marketing communications that interest, educate and engage these attendees, on localization and translation issues.
I think video content will be a major new opportunity, and will likely cover all of the above segments. I have seen that there are video subtitling/dubbing focused conferences but have not attended one yet and I suspect that this is an area worth exploring as a long-term opportunity. Video content translation is likely to be the fastest growing new sector for the industry as Cisco estimates that 90% of IP traffic will be video related by 2013. If that is where the end-customer is, it makes sense to focus on it. I am sure mobile will also be a growing and strategic area.
 
Finally, I think we should all be exploring how to get more connected into BRICI global commerce focused events. (This is more complicated than holding an event in China and/or India.) It is very likely that as the export/import sectors in these fast-growing countries expand, there will be events and conferences that will be worth tapping into to really get access to these markets. I would bet that the best events will be organized by locals in these regions who are interested in globalization issues, possibly even government-sponsored events.

Please join the discussion on Quora or comment on this blog as this is by no means a definitive or authoritative list. What do you think? And let's hope that we all find “a real customer” at the events we go to.

Monday, January 3, 2011

Most Popular Blog Posts of 2010

Award winning picture, part of the Wikimedia Commons Pictures of the Year collection.

As I have been watching people summarize the year in many ways I decided to look into the traffic stats on the eMpTy Pages blog to see what were the most popular posts (in terms of traffic anyway). And here is the list in order of traffic popularity. 

The top 3 entries were written in July so it appears the thoughts and the news were really flowing at the time. 

1. The most popular entry was the summary of a conversation that Renato & Bob had at the IMTT Vendor Management conference in Las Vegas and my additional comments on this issue. This is an ongoing discussion and far from over; I expect we will see more unfold on this subject in the coming year.

2. My thoughts and analysis of the SDL acquisition of Language Weaver, which was clearly between a rock and a hard place in 2010 after doubling manpower/sales investments and overall expenses and seeing hardly a budge in revenue or translation quality of their systems. The facts speak for themselves in spite of careful PR efforts to create impressions that suggested growth and momentum. (Yup Mark, some of us do realize there was hardly any revenue growth in the last three years.)

3. This next one seemed to resonate with all the MT enthusiasts in particular. This one got a lot of comments as well.

4. I am surprised that this entry was so popular, but it is clear that TAUS is becoming more relevant and a great place to find some information (though not always the best and most accurate info IMO) on MT deployments throughout the corporate world. The understanding that clean data does matter is growing and that can only help the quality of future MT systems developed with TAUS data. I hope that TAUS will lead the charge in helping us all understand what the driving factors behind really good MT systems are. My sense is that everything we have seen so far is just about getting familiar with the technology and was mostly driven by localization ROI rather than real raw MT quality tuning efforts. The best is yet to come.

5. This entry was I think the best entry of the year, even though I did not offer any solutions. I think it was the clearest articulation of the problem and explanation of why standards matter and why we need better solutions. I hope that this discussion continues and grows in 2011. I saw that it was also a major issue and area of concern at the AGIS10 conference in Delhi. I am much more optimistic that the best thinking and solutions on standards will come from the non profit translation world since the corporate localization industry has barely delivered a real TM standard after 10 years of trying. There were also many interesting comments and feedback on this entry and I hope we will see more discussion emerge from this. PostRank also tells me that this article continues to get steady traffic over time and might be influencing others.

6. This was a summary of an interview with Rob Vandenberg, CEO of Lingotek about community collaboration tools for translation which is likely to become much more important in future.

7. This was a summary of key messages from ATA leadership to the AMTA community. Hopefully this dialogue grows even though there are some strident voices on both sides. I think the recent admission by Google that they have reached the limits of what is possible in terms of driving MT system improvements by just feeding more data to the engines will lead to an increasing awareness that getting linguistic experts involved and improving information quality (yes, clean data rears its ugly head again) is necessary for continued progress. Another one with interesting comments.

8. This focused on my view of where the highest value translation work would be in the future. I do not believe that transcreation is the best definition of high value/high skill work in translation. Value has to be determined by what customers find most useful at critical stages (pre and post sales) of their relationship with a company, brand or product.

9. This is a summary of key messages from my keynote presentation at ELIA Dublin which was possibly my favorite conference of the year. This article summarizes some details about the content explosion and how it may be impacting global customer interactions and how this relates to the world of professional translation.

10. And more on standards in the localization industry, a subject on which real progress will be key to raising productivity and responding to the rapidly growing volume of content.

I wish you all a Happy, Healthy and Prosperous New Year. I think 2011 will be memorable in many ways, and translation will continue to grow in strategic importance for both global businesses and for countries that are rising economically and entering the knowledge economy.

We are seeing a lot of forecasts as is typical at the beginning of the year but most of these are very technology focused. I found a really interesting forecast  that casts a much wider net and is the most interesting one I have found on the world at large. They have a pretty good track record for 2010 as well. Good news for my new friends in the Ukraine, they see the region rising in 2011. This is worth taking a look at.