Friday, October 8, 2010

Highlights from the TAUS User Conference

Earlier this week, 100+ people gathered in Portland, Oregon for the annual TAUS user conference. The group included a large contingent of translation buyers, mostly from the IT space (Intel, Oracle, EMC, Sun, Adobe, Cisco, Symantec, Sony), a few LSPs with real MT experience, and people from several MT and translation automation technology providers, all there to share information about what was working and what was not. As at any conference, some presentations were much more compelling and interesting than others, and I wanted to share some of the things that stood out for me. It was not always easy to tweet, since the connectivity faded in and out, but a few of us did try to get some coverage out. CSA also has a nice blog entry on the event that focuses on the high-level themes. I think some of this will eventually be made available as streaming video.

If you are interested in the business use of machine translation, this was a useful event: it offered many examples of successful use cases as well as technology presentations from MT vendors in some depth, perhaps too much depth for some. A full day was filled with MT-related presentations from users, MT tool developers, and LSPs using MT. Some of the highlights:

Jaap stated that translation has become much more strategic and that global enterprises will need language strategies. He also said he felt there would not be any substantial breakthroughs in MT technology research in the foreseeable future. I actually agree with him on this, in the sense that the rate of improvement from pure MT technology research alone will decline, but I believe we are only at the beginning of the improvements possible from better man-machine collaboration. In my opinion, many of the systems presented at the event are far from the best systems possible with the technology and people available today. Another way to say this is that I think the improvements from free online MT will slow down, but the systems coming from professional collaborations like the ones Asia Online has with LSP partners will rise rapidly in quality and show clearly that data, computers, and algorithms alone are not enough. I predict that collaboration with skilled, focused, and informed professional linguistic feedback (LSPs) will drive improvements in MT quality faster than anything done in MT research labs. The two most interesting pure MT technology presentations were an overview of morpho-syntactic SMT (a hybrid with 500M rules! from Kevin Knight at USC/ISI) and an overview of the “deep” hybrid RbMT approach from ProMT, where statistical selection is introduced throughout the various stages of an enhanced RbMT transfer process. This is in contrast to the “shallow” hybrid model used by Systran, which uses SMT concepts as a post-process to improve output fluency. All the RbMT vendors/users stressed the value of upfront terminology work to get better quality. Those of you who heard the presentation that Rustin Gibbs (who did most of it, and was a star performer at the conference) and I gave will know that I am much more bullish on what a "vendor-LSP collaboration" could accomplish than on any technology-only approach.


It was good to see (finally!) that many, including Jaap, admit that data quality and cleanliness matter in addition to data volume.


Several people came up to me and mentioned that they had found that sometimes less is more when dealing with the TDA data. TM cleaning, corpus linguistics analysis to better understand the data, and assessing TM quality, at least at the supplier level, are now recognized as real issues that are getting more attention within TAUS. Several presentations mentioned data quality, “harmonizing terminology,” TM scrubbing, and strategies to reduce the risk of data pollution when large amounts of TM are gathered together. The TDA database now has 2.7 billion words, and TAUS admitted that it has become more difficult to search and use; they are therefore integrating the data into a Globalsight-based repository to make it more usable. They are hoping to add more open source components to further add value to the data and its ongoing use. Their ability to further refine this data and really standardize and normalize it in useful ways will, in my opinion, define the TDA’s success as a data resource in the future.
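To make the TM scrubbing point concrete, here is a minimal sketch of the kind of cheap heuristics people use as a first pass over pooled TM data. This is my own illustration, not any tool presented at the conference; the function name, thresholds, and rules are all assumptions for the example.

```python
def clean_tm(pairs, max_ratio=3.0):
    """Filter a list of (source, target) TM segment pairs with simple heuristics.

    Drops empty segments, untranslated segments (source copied to target),
    grossly misaligned pairs (extreme length ratio), and exact duplicates.
    """
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # drop empty segments
        if src == tgt:
            continue  # drop untranslated pairs (source copied into target)
        ratio = max(len(src), len(tgt)) / max(1, min(len(src), len(tgt)))
        if ratio > max_ratio:
            continue  # drop likely misaligned pairs
        key = (src.lower(), tgt.lower())
        if key in seen:
            continue  # drop exact duplicates
        seen.add(key)
        kept.append((src, tgt))
    return kept
```

Real cleaning pipelines go much further (language identification, tag repair, terminology consistency checks), but even filters this crude illustrate why "less is more": a smaller, cleaner corpus often trains a better engine than a bigger polluted one.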

There was an effective and very interesting point-counterpoint session (unlike the ones I have seen at GALA) between Keith Mills of SDL and Smith Yewell of Welocalize, which positioned the “practical” legacy-system view against the open-systems view of the world. It really did justice to both viewpoints in a common framework and thus provided insight to the careful observer. It was interesting to see that SDL used the word “practical” as many as 30 times while presenting its view of the world. In brief, SDL claimed that they will spend $15M in 2010 on R&D to create a platform that links content creation more closely to language. Keith said that SDL does not make money on TMS because there is too little leverage due to too many one-off translation processes. He also said that SDL will not go open source but is “really into standards” and will create APIs to “let” customers integrate with other software infrastructure.

Smith presented a contrasting view of the world: interoperable, “open,” collaborative, and involving multiple vendors, as against the “walled garden” of SDL. Keith responded that they will “eventually” connect to other products and “are working on it”; Smith countered that closed black boxes make it difficult to scale up and meet new customer requirements for translating fast-flowing content from disparate sources. It was interesting to see Jaap characterize the debate by saying that perhaps Smith was “too visionary” (a nice way of saying impractical?) and that the SDL perspective is “realistic and practical.” I captured the flow of the debate in my Twitter stream. Keith also made good points about how MT must learn to deal with formatted data flows to be really usable, but he seemed to completely miss the growing urgency for language data to move in and out of SDL software systems. Smith also pointed out that MT is not revolutionary; rather, it is just another tool that needs to be integrated into the right business processes to add value. I liked the debate because it presented the two viewpoints accurately and authentically and let you see the strengths and weaknesses of both perspectives. I, of course, have an open-systems and collaboration bias, but this session gave me some perspective on the value of the walled-garden view as well.

I had some back-channel Twitter chatter with @paulfilkin and @ajt_sdl about what meaningful openness means. In my view SDL does NOT have it. My advice to SDL: share API information the way the TAUS API is shared, published for self-service use so that any customer/member can connect and get data in and out of the TDA repository efficiently. Facilitate the work to let the data flow, and give customers enough information (API, SDK) so that they can do this themselves, without SDL's permission, and move language data easily to wherever it is needed, e.g. TMS to TMS, TMS to TM, TMS to CMS, TMS to MT, TMS to Web. They should provide basic information access for free, and customers should only have to pay if they need engineering support. SDL needs to understand that language data can be useful outside of SDL tools and that they can make this easier by delivering real openness to both their customers and the overall market. I have written about this previously. My advice and warning to customers: stay away from SDL until they do this, or you will find yourself constantly wounded and tending to “integration friction.”
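The reason openness matters here is that the interchange formats already exist: TMX (Translation Memory eXchange) is the open, vendor-neutral XML format for exactly this kind of TMS-to-TMS or TMS-to-MT movement. As a sketch of how little is needed once data is in an open format, here is a minimal TMX reader using only the Python standard library. The sample document and function are my own illustration, not code from SDL, TAUS, or any vendor.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the built-in xml: prefix to this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# A minimal TMX 1.4 document: one translation unit (tu) holding one
# translation-unit variant (tuv) per language.
SAMPLE_TMX = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="en-US" datatype="plaintext" segtype="sentence"
          creationtool="example" creationtoolversion="1.0"
          adminlang="en-US" o-tmf="example"/>
  <body>
    <tu>
      <tuv xml:lang="en-US"><seg>Save your changes.</seg></tuv>
      <tuv xml:lang="fr-FR"><seg>Enregistrez vos modifications.</seg></tuv>
    </tu>
  </body>
</tmx>"""

def extract_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs for the given language codes."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            lang = tuv.get(XML_LANG)
            seg = tuv.find("seg")
            if lang and seg is not None:
                segs[lang] = seg.text
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs
```

Twenty-odd lines of standard-library code can move TM data between any two systems that speak the standard; the friction comes entirely from tools that will not let the data out.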

One thing that I found really painful about the conference was a horrendously detailed presentation (it refused to end) on the history of MT. I am not sure that there is anything to be learnt from such minutiae. For me this was clearly a case where history was horrifically boring and did not teach any real lessons. It made me want to poke my eyes out and see how much pain I could tolerate before crying out. I hope they never do it again and that the presentation and recordings are destroyed.

I also feel that the conference was way too focused on “internal” content, that is, content created by and within corporations. It is no surprise that SDL (which originally stood for Software and Documentation Localization) is so committed to the walled garden, since in the legacy world command and control has always been the culture, even within most global companies. In an age where social networks and customers sharing information with each other increasingly drive purchase behavior, this is a path to irrelevance or obscurity. I am not a big believer in any Global Content Value Chain (GCVC) that does not have a strong link to high-value community, partner, and customer created content. The future is about BOTH internal and external content. I think the TAUS community would be wise to wake up to this, to stay relevant and attract new and higher levels of corporate sponsorship. We should not forget that the larger goal of localization and translation efforts is to build strong relationships with global customers. Conversations in social networks are how these relationships with brands and customer loyalty are being formed today. Localization will need to learn to connect into these customer conversations and add value on behalf of the company in these living, ever-present conversations happening mostly outside the corporation's content-control initiatives. Richard Margetic from Dell said it quite clearly at Localization World Seattle: “Corporations will have to take their heads out of the sand and listen to their customers; we believe that the engagement of customers on Twitter is critical to success (and sales).” He also said, “We had to teach our corporate culture that negative comments have value. It teaches you to improve your products,” and I hope that the TAUS board will take my comments in that spirit.

There was surprisingly little discussion about data sharing; the focus had moved to much more pragmatic issues like data quality, standards, categorization, and meaningful identification and extraction from the big mother database, but there were few details on this other than some basic demos of how Globalsight and the search engine worked. If you have ever seen a database search-and-lookup demo, you will know that it is seriously underwhelming. The TDA is great if you are looking for Europarl or IT data but is pretty thin if you want data in other domains. A lot of this data is available elsewhere with no strings attached, so the TDA needs to give people a reason to come to their site for it. The value of the TDA in the future, IMO, is going to depend on how they add value to the data: normalize it, clean it, and categorize it. This, to my mind, is the real value-creation opportunity. They also need to find ways to attract more people who are not from the localization world but have an interest in large-scale translation. Additionally, I believe they should expand the supervisory board beyond IT company people to increase the possibility of being a real force for change. When everybody has the same background, groupthink is inevitable. (Aren't they forced to watch those workforce diversity presentations that I was forced to watch when I was at EMC?) I suspect that open innovation / collaboration models will be more likely to come up with ways to add value to data resources and new ways to share data, and I hope the TDA finds a way to engage with people like Meedan, TRF, Yeeyan, and other information-poverty-focused initiatives. While there was a focus on innovation and collaboration, I got the feeling it was too focused on getting more open source tools and nothing else. I think open innovation needs more diversity of opinion and ideas than we had in the room. What is open innovation?
Henry Chesbrough: “Open innovation is a paradigm that assumes that firms can and should use external ideas as well as internal ideas, and internal and external paths to market, as the firms look to advance their technology.”

Here is a presentation on Data Is the New Oil. I think the point it makes is very useful to the TDA: refine, refine, refine to create value. Value is getting the right data to the right people at the right time in the right format. It is worth finding out what that actually means in terms of deliverables. Make it easy to slice, dice, package, and wrap.

Some of you who know me, know that I am a Frank Zappa fan and Frank (well ahead of his time as usual) said it well in 1979 on the Joe’s Garage album: (I would add “Data is not information” for this blog and recommend you look at more Zappa quotes)

"Information is not knowledge.Knowledge is not wisdom.

Wisdom is not truth.Truth is not beauty.

Beauty is not love, Love is not music.

Music is THE BEST" 

Finally, one thing about conferences is that there are always a few conversations that stand out. While there were many professional conversations of substance, for me, the ones that stand out the most are conversations that come from the heart. I was fortunate to have four such conversations at this event. One with Elia Yuste of Pangeanic about building trust, another with Alon Lavie of AMTA about where innovation and the real exciting MT opportunities will come from, a third with Smith of Welocalize about building openness with substance, integrity & honor and finally a conversation with Jessica Roland about life, finding purpose and family. I thank them all for bringing out the best in me.

As I prepare for my ELIA keynote, I am compelled to share one more quote that I think is worth pondering:

“In revolution, the best of the new is incompatible with the best of the old. It’s about doing things a whole new way.” (Clay Shirky)


  1. I followed that Twitter stream live with great interest.

    I was apparently mistaken in my impression that SDL's APIs are now freely accessible to licensed users. What are in fact the objectionable terms?

  2. As far as I can gather, there is always a fee to get the SDK even if you own the product. If you are lucky it is only $5K but it can be $25K or higher for some products.

    I invite somebody from SDL to come and clarify and educate us better on the policy if I am misstating the facts.

  3. Alexandre Rafalovitch, October 9, 2010 at 5:59 PM

    Re: SDL API

    I used SDL MultiTerm 2007 API and it was ok. In some places it was less than ok. The telling thing for me was that I was unable to find either external community or any sort of help from SDL support.

    More recent APIs seem to be getting a bit more effort, but they still feel like lip service rather than something the company is betting on. Same with the SDL developer program, with its participation fees, etc.

  4. The interesting news is that most people were from the IT industry. This means that translation, and especially translation automation, is still appealing, or at least intriguing, to them.
    I really don't think, though, that TAUS has the right approach, as so far it has been disregarding data quality and cleanliness; but I could be mistaken, and this is something I'd appreciate hearing your opinion on.
    For example, I do think that we can expect much news from the MT industry in the near future, and that some of it will actually carry "substantial breakthroughs," because there will be more and more people interested, and maybe involved, in MT and in MT R&D for its practical implications. I'm afraid that Jaap could be confusing his views and predictions on TAUS's future with those on the MT industry.
    A few words on SDL. As long as it keeps the service and product divisions inside one company, sharing funding and cash flow, no change in its attitude can realistically be expected, nor can SDL prove its actual strength.

  5. @Kirti Vashee re getting the SDK

    You are misstating the facts. There is no cost other than that you must be a license holder of the software. You must apply to the developer program, free of charge, through your Customer Centre, and then you will be given access to the APIs and a fully documented SDK with regularly updated online help.
    This is available for the 2009 suite of desktop and server products.
    We're not open source, but we are open in this regard; in my opinion, more so than most commercial applications.

  6. I was at an event in Limerick, Ireland where SDL mentioned something about OpenBeta APIs to improve connectivity and workflow. This was last month, so hopefully we will hear more on that in the future.

  7. I now have additional confirmations and specific evidence that getting access to the SDK involves fees and charges not only for the developer (e.g. Asia Online) but also sometimes for the SDL licensee, so I am not sure under what circumstances the "facts" that Paul Filkin mentions above apply.

    It clearly does not seem to be the case for customers of the translation management systems.