Tuesday, January 12, 2010

Introduction and Soft Censorship

This is my first attempt at blogging, though I have been quite active on LinkedIn and on Twitter.
I am calling this blog "eMpTy Pages" because it will often be about MT (Machine Translation) or Automated Language Translation as I prefer to call it. Also, there is a song by Traffic by the same name that I like, and one of my favorite quotes related to machine translation is: "The history of MT is filled with eMpTy promises."

As we all know, MT is not the most scintillating topic, so it will often, or at least sometimes, be empty, i.e., I won't have anything to say. I am not sure how often I will have anything to say, so I make no commitments at this time. I find Twitter easy to do while I do my real work. I am not sure about blogging. I promise to continue to do this if it is easy, fun and does not become a technical challenge.

I have decided to do this independently, as I may not always be representing the views of my employer and they may at times prefer to keep a distance - to a great extent it will just be my musings and thoughts about the topics listed below.

We have Dave Grunwald @davegrun to thank (or blame) for pushing me into this and moving me out of my inertia by naming me as a top blogger in the industry even though I technically did not have a blog.

I reserve the right to speak about things I find interesting in general, though I will try and keep the bulk of my posts focused on Translation Technology, Localization, Globalization, Internet Trends, Social Networking, Crowdsourcing, Collaboration, Global Business and things like that.
So now that the introduction is done, on to my first entry.

I was recently moved to comment online about what I felt was clearly unwarranted and unjustifiable censorship, and yesterday I realized that I had been subjected to a similar exclusion. I would characterize my experience as "soft censorship" in contrast to the arbitrary deletion that @renatobeninatto faced. So anyway, here is an excerpt of what I wrote yesterday in the discussion on TM sharing in the Automated Language Translation Group on LinkedIn. (Repurposing already.)

As the New Year begins, I notice that the TDA publicized several data consolidation experiments last year, which show the unambiguous and definite benefit of sharing TM. The results are all positive and, wonderfully, there are no problems and no failures. It ALWAYS works. Using TDA data is ALWAYS beneficial.

Unfortunately, not all of us believe this, especially someone like me who has seen that this does not always happen.

I have also noticed that TAUS has decided to keep the results of the study I was involved with on the "down low," i.e., not mention them at all. The results of the Asia Online data consolidation experiment showed that not all data is equal and that some work needed to be done to make TM consolidation work well. This is well documented in this LinkedIn thread and in the detailed 50-page Asia Online report on their specific experiment with TM data consolidation, which can be downloaded from here.

Like everything else in data processing, SMT does indeed follow the "Garbage In, Garbage Out" rule. This should lead to questions that improve your chances of success, like: What is clean data for SMT? How does one keep data in a format useful for both TM and SMT? However, this did not happen.

I bring this up now because there has been a little bit of a storm in the Localization Professional group where I was disturbed by the arbitrary censorship of a carefully stated differing view. Source of the Controversy in this LinkedIn group.

If you have an hour or two to kill, go and take a look; it is both a waste of time and quite interesting. In my opinion, the censorship was also quite wrong and an abuse of moderator power.

It struck me that the unwillingness to share and make the Asia Online report visible is also a kind of soft censorship, and I began to wonder why. This perhaps explains why I felt such a strong urge to make some statements in the @renatobeninatto discussion (apart from it simply being the right thing to do), as I really felt it was wrong.

Could the test that Asia Online conducted have been flawed?
The report shows in detail what the process and methodology were, and it deliberately provides gory detail to invite scrutiny and criticism of the process. I have never heard any direct criticism, so I am not sure.

Could it be that, since we were not and still are not TAUS or TDA members, we were being kept out of view on the reasoning that only TDA members should get the limelight?

Could it be that somebody did not care for the suggestion that not all data is equal and that cleaning and normalizing TM was necessary?

As there was no feedback, I am not really sure, and can only ask these questions and hope that somebody will step up and provide clarification.

My intent is not merely to poke fun or provoke; I truly do believe that the TDA will be taken more seriously if it provides some information about when TM sharing does NOT work, for surely there are some cases where this is true. I know of at least two instances where this is true, even in a common domain.

We all need to learn what needs to be done to make an initiative like TDA work, and I think it would be helpful for everybody to know, as specifically as possible, why it sometimes might not work.

It might also be useful to get a better understanding not only of what clean data is, in terms of SMT, but also of what makes one TM policy/process work better than another when used in SMT. There might be some value in this for all those interested in using shared TM for SMT engine building, even beyond TDA.

I am a strong believer in openness, transparency and in the promise of Web 2.0 and "real" collaboration to bring about change. These characteristics together, I feel, can produce real meritocracies and functioning, effective organizations and governance, so I continue to reach out. And when I sense a lack of openness, I am filled with curiosity about what they are hiding, and why.

Anyway, here's hoping that in 2010 I find answers to these and the many other questions I have.

I wish all those who make it to this blog entry a wonderful, prosperous, discovery and collaboration-filled New Year.

P.S. My guide on the mechanics and spirit of how to start this blog was Penelope Trunk. "Just do it," she says, and I like her because she is authentic and real.

P.P.S. Staring at empty pages
Centered 'round the same old plot
Staring at empty pages
Flowing along the ages

Listen to the song
Empty Pages


  1. Welcome to the world of blogging! I enjoyed your inaugural post very much.

  2. Kirti,

    Welcome to blogging. It is good to see a space where one of the leading minds of the language services industry can share his insights beyond the 140 characters of Twitter's microblogging service. You are now macroblogging!


  3. Kirti, I like the name you chose for your new blog. Happy blogging. Dave

  4. Welcome to the blogosphere. Very good post to start with. No matter how much you are going to blog. I love your tweets and I'm sure I will be caught by your posts.

  5. Kirti,
    Love the title of the blog and your introductory subject. Must get back on the blogging job meself...:)

    Given the developments in China announced today (or yesterday), the issue of censorship is timely.

    One issue in relation to these LinkedIn Firestorms is they are conducted behind 'closed doors', i.e., you need to be an approved member of the group concerned.

    So, let's see the details (within reason of length and permissions) of these discussions externalized outside of these Papal Conclaves and into an indexed searchable forum (like this or Renato's blog, or others).

    Good job!

  6. Hi Kirti,

    I just wanted to drop by to wish you lots of fun with your new blogging adventure.



  7. Many thanks for all your kind wishes. I am now trying to figure out some of the mechanics, like making comments more visible.

  8. There is a very constructive and useful discussion going on at the Pangeanic blog on this subject.

    So we have more validation on the issue of data quality but they also provide some guidance on how TDA data can be useful with careful pruning.

    This was exactly our key finding from the data that we experimented with.

  9. Hi Kirti, Great blog. I have linked to you from my blog, since I am upstream of you :-)