
Monday, January 23, 2017

Finding the Needle in the Digital Multilingual Haystack

There are some kinds of translation applications where MT just makes sense. Usually, this is because these applications have some combination of the following factors: 
  • Very large volume of source content that could NOT be translated without MT
  • Rapid turnaround requirement (days)
  • Tolerance for lower quality translations at least in early stages of information review
  • Triage requirements that help identify the highest-priority content from a large mass of undifferentiated content
  • Cost prohibitions (usually related to volume)
This is a guest post by Pete Afrasiabi of iQwest Information Technologies that goes into some detail on the strategies employed to use MT effectively in a business application area sometimes called eDiscovery (often litigation-related), though in a broader sense it could be any application where it is useful to sort through a large amount of multilingual content to find high-value content. In today's world, we are seeing a lot more litigation involving large volumes of multilingual documents, especially in cases that involve patent infringement and product liability. MT serves a very valuable purpose in these scenarios: it enables some degree of information triage. When Apple sues Samsung for patent infringement, it is possible that tens of thousands of documents and emails are made available by Samsung (in Korean) for review by Apple attorneys. It is NOT POSSIBLE to translate them all through traditional means, so MT, or some other volume-reduction process, must be used to identify the documents that matter. Because these use cases often arise in litigation, it is generally considered risky to use the public MT engines, and most prefer to work within a more controlled environment. I think this is an application area that MT vendors could service much more effectively by working more closely with expert users like the guest author.


 ----------------------------------------------------------------------------------


Whether you manage your organization’s eDiscovery needs, are a litigator working with multi-national corporations, or are a compliance officer, you commonly work with multilingual document collections. If you are an executive who needs to know everything about your organization, you need a triage strategy that helps you get the right information ASAP. If the document count is over 50-100k, you typically employ native-speaking reviewers to perform a linear, one-by-one review of the documents, utilize various search mechanisms to help you in this endeavor, or both. What you may find is that most documents being reviewed by these expensive reviewers are irrelevant or require an expert to review. If the population includes documents in three or more languages, the task becomes even more difficult!

There is a better solution, one that, if used wisely, can benefit your organization and save time, money, and a huge amount of headache. I am proposing that in these document populations the first thing you need to do is eliminate non-relevant documents, and if they are in a foreign language you need to see an accurate translation of each document. In this article, you will learn in detail how to improve the quality of these translations using machines, at a cost hundreds of times lower than human translation and naturally much faster.

With the advent of new machine translation technologies comes the challenge of proving their efficacy in various industries. Historically, MT has been looked at not only as inferior but as something to avoid. Unfortunately, the stigma that comes with this technology is not necessarily far from the truth. Adding to that, the incorrect methods various vendors have used to present its capabilities have undermined its active use across most industries. The general feeling is “if we can human translate them, why should we use an inferior method,” and that is true for the most part, except that human translation is very expensive, especially when the subject matter runs to more than a few hundred documents. So is there really a compromise? Is there a point where we can rely on MT to complement existing human translations?

The goal of this article is to look under the hood of these technologies and provide a defensible argument for how MT can be supercharged with human translations. Human beings’ innate ability to analyze content provides an opportunity to aid these machine learning technologies. Transferring that human analytical insight into a training model for these technologies can produce dramatically improved translation results.

Machine Translation technologies are based on dictionaries, translation memories, and rules-based grammar that differs from one software solution to another. Newer technologies that utilize statistical analysis and mathematical algorithms to construct these rules have been available for the past several years; unfortunately, individuals with the core competencies to utilize them are few and far between. On top of that, these software solutions are not by themselves the whole solution; they are just one part of a larger process that entails understanding language translation and how to utilize various aspects of each language and the features of each software solution.

I have personally worked with most, if not all, of the various technologies utilized in MT, and about 5 years ago I developed a methodology that has proven itself in real-life situations as well. Here is a link to a case study on a regulatory matter that I worked on.


If followed correctly, these instructions can turn machine-translated documents into documents with minimal post-editing requirements, at a cost hundreds of times lower than human translation. They will also read more like their human-translated counterparts, with proper sentence flow and grammatical accuracy far beyond raw machine-translated output. I have referred to this methodology as “Enhanced Machine Translation”: still not human translation, but much improved from where we have been until now.

Language Characteristics

To understand the nuances of language translation we first must standardize our understanding of the simplest components within most, if not all, languages. I have provided a summary of what this may look like below.
  • Definition
    • Standard/Expanded Dictionaries
  • Meaning
    • Dimensions of a word's definition in context
  • Attributes
    • Stereotypical description of characteristics
  • Relations
    • Between concepts, attributes and definitions
  • Linguistics
    • Part of Speech / Grammar Rules
  • Context
    • Common understanding based on existing document examples

Simply accepting that this base of understanding is common amongst most, if not all, languages is important, since the model we will build on assumes that these building blocks provide a solid foundation for any solution we propose.

Furthermore, familiarity with various classes of technologies available is also important, with a clear understanding of each technology solution’s pros and cons. I have included a basic summary below.
  • Basic (linear) rule-based tools
  • Online tools (Google, Microsoft, etc.)
  • Statistical tools
  • Hybrid tools combining the best of both rule-based and statistical analysis


Linear Dictionaries & Translation Memories

Pros
  • Ability to understand the form of a word (noun, verb, etc.) in a dictionary
  • One-to-one relationship between words/phrases in translation memories
  • Fully customizable based on language
Cons
  • Inability to formulate correct sentence structure
  • Ambiguous results, often not understandable
  • Usually a waste of resources in most use cases if relied on exclusively

Statistical Machine Translation

Pros
  • Ability to model the co-occurrence of words and build a statistical model to use as a reference (see the sketch after these lists)
  • Capable of comparing sentence structures based on examples given and further building on the algorithm
  • Can be designed to be case-centric
Cons
  • Words are not numbers
  • No understanding of the form of words
  • Results can resemble those of concept-searching tools, which often fall off a cliff if relied on too heavily
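
To make the "co-occurrence" idea above a little more concrete, here is a minimal sketch of how a statistical engine might count which source and target words tend to appear together across aligned sentence pairs. The parallel sentences and the whitespace tokenization are invented for illustration; real SMT toolkits add word alignment, phrase extraction, and smoothing on top of this, so treat it only as an intuition aid.

```python
from collections import defaultdict

# Tiny illustrative parallel corpus (invented sample data).
pairs = [
    ("共通 仕様書 を 提出 する", "submit the common specifications"),
    ("新しい 仕様書 を 作成 する", "create new specifications"),
]

# Count how often each source token co-occurs with each target token.
cooc = defaultdict(lambda: defaultdict(int))
src_count = defaultdict(int)
for src, tgt in pairs:
    for s in src.split():
        src_count[s] += 1
        for t in tgt.split():
            cooc[s][t] += 1

# A crude "translation score": co-occurrence frequency normalized by
# how often the source token was seen at all.
for s, targets in cooc.items():
    best = max(targets, key=targets.get)
    print(f"{s} -> {best} ({targets[best] / src_count[s]:.2f})")
```

Even this toy shows the weakness listed in the cons: without real alignment, grammatical particles and function words pair up with whatever happens to be frequent, which is why human input becomes so valuable in the process described next.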
Now that we understand what is available, the crucial next step is building a model and process that takes advantage of the benefits of these various technologies while minimizing their disadvantages. In order to enhance any and all of these solutions' capabilities, it is important to understand that machines and machine learning by themselves cannot be the only mechanism we build our processes on. This is where human translations come into the picture. If there were some way to utilize the natural ability of human translators to analyze content and build out a foundation for our solutions, would we be able to improve the resulting translations? The answer is a resounding yes!


BabelQwest: A Combination of Tools Designed to Assist in Enhancing the Quality of MT

To understand how we would accomplish this, we first need to review some machine-based concept analysis terminology. In a nutshell, these definitions are what we have actually based our solutions on. I have referenced the most important of them below, and I have also annotated each definition with how, as linguists and technologists, we will utilize it in building out the “Enhanced Machine Translation” (EMT for short) solution.
  • Classification: Gather a select, representative set of documents from the existing corpus that covers the majority of subject matters to be analyzed
  • Clustering: Build out from the documents selected in the classification stage to find similar documents that match the cluster definitions and algorithms of the representative documents
  • Summarization: Select key sections of these documents as keywords, phrases, and summaries
  • N-Grams: N-grams are the basic co-occurrences of multiple words within any context. We build these N-grams from the summarization stage above and create a spreadsheet listing each N-gram and its raw machine-translated counterpart. The spreadsheet is built into a voting worksheet that allows human translators to analyze each line and provide feedback on the correct translation, and even on whether a given N-gram should be part of the final training seed data at all. This seed data will fine-tune the algorithms built out in the next stage, down to the context level and with human input. A basic depiction of this spreadsheet is shown below, and a code sketch of the extraction step follows this list.

Voting Mechanism


iQwest Information Technologies Sample Translation Native Reviewer Suggestion Table

Japanese | English (raw MT) | YES | NO | SUGGESTED
アナログ・デバイス | Analog Devices | X | |
デバイスの種類によりスティック品 | Stick with the type of product devices | X | |
トレイ品として包装・ | as the product packaging tray | X | |
新たに納入仕様書 | the new technical specifications | | X |
共通仕様書 | Common Specifications | X | | Common Parameters
で新梱包方法を提出してもらうことになった | have had to submit to the new packing method | | X |

  • Simultaneously, human-translate the source documents that generated these N-grams. The human translation stage builds out a number of document pairs, with the original content in the original language in one document and the human-translated English version in another. These are imported into a statistical and analytical model to build the basic algorithms. By incorporating these human-translated documents into the statistical translation engine's training, the engine will discover word co-occurrences and their relations to the sentences they appear in, as well as variations of terms as they appear in different sentences. These will be further fine-tuned with the results of the N-gram extraction and translation performed by the human translators.
  • Define and/or extract the names and titles of key individuals. This stage is crucial and usually the simplest to complete, since most, if not all, parties involved already have references in email addresses, company org charts, etc. that can be gathered easily.
  • Start the training process of the translation engines from the results of the steps above (multilevel, and conditioned on the volume and type of documents)
  • Once a basic training model has been built, machine translate the original representative documents and compare the output with their human-translated counterparts. This stage can be accomplished with fewer than one hundred documents to prove the efficacy of the process, which is why we refer to it as the “Pilot” stage.
  • Repeat the same steps with a larger subset of documents to build a larger training model and to prove that the overall process is fruitful and can be utilized to machine translate the entire document corpus. We refer to this as the “Proof of Concept” stage, and it is the final validation step. We would then stage the entirety of the documents subject to this process in a “Batch Process” stage.
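
As a concrete illustration of the N-gram voting-sheet step referenced in the list above, here is a minimal sketch. The summary sentences and the raw_mt placeholder are hypothetical stand-ins, not the actual iQwest/BabelQwest tooling; the point is simply to show frequent N-grams being extracted from the summarization output and written, alongside their raw machine translations, into a worksheet that human reviewers can vote on.

```python
import csv
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams (joined as strings) from a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def raw_mt(text):
    """Placeholder for a call to the raw MT engine (hypothetical)."""
    return "<raw MT of: " + text + ">"

# Hypothetical output of the summarization stage (pre-tokenized Japanese).
summaries = [
    "共通 仕様書 を 提出 する",
    "新しい 梱包 方法 を 提出 する",
]

# Count 2- and 3-grams across all summaries.
counts = Counter()
for sentence in summaries:
    tokens = sentence.split()
    for n in (2, 3):
        counts.update(ngrams(tokens, n))

# Write the voting worksheet: one row per N-gram, with empty
# YES / NO / SUGGESTED columns for the human reviewers to fill in.
with open("voting_sheet.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Source N-gram", "Raw MT", "YES", "NO", "SUGGESTED"])
    for ngram, freq in counts.most_common(50):
        writer.writerow([ngram, raw_mt(ngram), "", "", ""])
```

The reviewed worksheet, together with the human-translated document pairs, then becomes the seed data for engine training as described above.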
In summary, we are building a foundation based on human intellect and analytical ability to perform the final translations. Using the analogy of a large building: the representative documents and their human-translated counterparts (pairs) serve as the concrete foundation and steel beams, the N-grams serve as the building blocks in between the steel beams, and the names and titles of key individuals serve as the fascia of the building.

Naturally, we are not looking to replace human translation completely, and in cases where certified human translations are necessary (regulatory compliance, court-submitted documents, etc.) we will still rely heavily on that aspect of the solution. Even so, the overall time and expense to complete a large-scale translation project are reduced hundreds of times over. The following chart depicts the ROI of a case on a time scale to help understand the impact such a process can have.
  


This process has additional benefits as well. Imagine for a moment a document production of over 2 million Korean-language documents, produced over a long time scale and from various locations across the world. Your organization has a choice: either review every single document and classify it into various categories utilizing native Korean-speaking reviewers, or utilize an Enhanced Machine Translation process that lets a larger contingent of English-speaking reviewers search for and eliminate non-relevant documents and classify the remainder.

One industry where this solution offers immense benefits is Electronic Discovery & Litigation Support, where the majority of attorneys who are experts in various fields are English-speaking; by utilizing these resources along with elaborate search mechanisms (Boolean, stemming, concept search, etc.) in English, they can quickly reduce the population of documents. On the other hand, if the law firm relied only on native-speaking human reviewers, a crew of 10 expert attorney reviewers, each reviewing 50 documents per hour (4,000 documents per day on an 8-hour shift), would take 500 working days to complete the review, with each reviewer charging hourly rates that add up very quickly.
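
A quick back-of-the-envelope check of the reviewer arithmetic above, using only the figures quoted in the paragraph:

```python
docs = 2_000_000          # documents in the production
reviewers = 10            # expert attorney reviewers
docs_per_hour = 50        # per reviewer
hours_per_day = 8

daily_throughput = reviewers * docs_per_hour * hours_per_day  # 4,000 docs/day
print(docs / daily_throughput)  # 500.0 working days
```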

We have constructed a chart from data gathered over the past 15 years of performing this type of work for some of the largest law firms around the world; it shows the impact a proper document reduction or classification strategy can have at every stage of litigation. Please note the bars run from bottom to top, with MT being the brown shaded area.

The difference is stark, and if proper care is not given to implementation, organizations are often left not knowing the content of documents within their control or supervision. This becomes a real issue for compliance officers, who must rely on knowing every communication that occurs or has occurred within their organization at any given time.


---------------------


Mr. Pete Afrasiabi, the President of iQwest, has been integrating technology-assisted business processes into organizations for almost three decades, and has worked in the litigation support industry for 18 years. He has been involved with projects involving MT (over 100 million documents processed), Managed Services, and eDiscovery since the inception of the company, as well as the deployment of technology solutions (CRM, email, infrastructure, etc.) across large enterprises prior to that. He has deep knowledge of business processes and project management and extensive experience working with C-level executives.


Pete Afrasiabi
iQwest Information Technologies, Inc.

www.iqwestit.com
https://www.linkedin.com/in/peteafrasiabi
 

Wednesday, January 18, 2017

Objective Assessment of Machine Translation Technologies

Here are some comments by John Tinsley, CEO of Iconic Translation Machines, in response to the Lilt evaluation, the CSA comments on that evaluation, and my last post on the variety of problems with quick competitive quality evaluations.

He makes the point that the best MT engines are tuned to a specific business purpose very carefully and deliberately, e.g., the MT systems at eBay and all the IT-domain knowledge bases translated by MT. None of them would do well in an instant competitive evaluation like the one Lilt did, but they are all very high-value systems at a business level. Conversely, I think it is likely that Lilt would not do well translating the type of MT use-case scenario that Iconic specializes in, since Lilt is optimized for other kinds of use cases where active and ongoing PEMT is involved (namely, typical localization).

These comments describe yet another problem with a competitive evaluation of the kind done by LiltLabs. 

John explains this very clearly below, and his statements hold true for others who provide deep, expertise-based customization like tauyou, SYSTRAN, and SDL. However, it is possible that the Lilt evaluation approach could be valid for instant Moses systems and for comparisons to raw generic systems. I thought that these statements were interesting enough that they warranted a separate post.

Emphasis below is all mine.

---------------------------

 


The initiative by Lilt, the post by CSA, and the response from Kirti all serve to shine further light on a challenge we have in the industry that, despite the best efforts of the best minds, is very difficult to overcome. Similar efforts were proposed in the past at a number of TAUS events, and benchmarking continues to be a goal of the DQF (though not just of MT).

The challenge is in making an apples to apples comparison. MT systems put forward for such comparative evaluations are generally trying to cover a very broad type of content (which is what the likes of Google and Microsoft excel at). While most MT providers have such systems, they rarely represent their best offering or full technical capability.

For instance, at Iconic, we have generic engines and domain-specific engines for various language combinations and, on any given test set, they may or may not outperform another system. I certainly would not want our technology judged on this basis, though!

From our perspective, these engines are just foundations upon which we build production-quality engines.

We have a very clear picture internally of how our value-add is extracted when we customise engines for a specific client, use case, and/or content type. This is when MT technology, in general, is most effective. However, the only way these customisations actually get done is through client engagements, and the resulting systems are typically either proprietary or too specific to a particular purpose to be useful for anyone else.

Therefore, the best examples of exceptional technology performance we have are not ones we can put forward in the public domain for the purpose of openness and transparency, however desirable that may be.

I've been saying for a while now that providing MT is a mix of cutting-edge technology, and the expertise and capability to enhance performance. In an ideal world, we will automate the capability to enhance performance as much as possible (which is what Lilt are doing for the post-editing use case) but the reality is that right now, comparative benchmarking is just evaluating the former and not the whole package.

This is why you won't see companies investing in MT technology on the basis of public comparisons just yet.

---------------


These comments are also available at:  


 
 

Tuesday, January 17, 2017

The Trouble With Competitive MT Output Quality Evaluations

The comparative measurement and quality assessment of the output of different MT systems is a task that has always been difficult to do right. Right, in this context, means fair, reasonable, and accurate. The difficulty is closely related to the problems of measuring translation quality in general, which we discussed in this post. It is further aggravated when evaluating customized and/or adapted systems, since doing so requires special skills and real knowledge of each MT platform, in addition to time and money. The costs associated with doing this properly make it somewhat prohibitive.

BLEU is the established measurement of choice that we all use, but it is easy to deceive yourself, deceive others, and paint a picture that has the patina of scientific rigor yet is completely biased and misleading. BLEU, as we know, is deeply flawed, but we don't have anything better, especially for longitudinal studies, and if you use it carefully it can provide some limited insight in a comparative evaluation.
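
For readers who have not computed it themselves, BLEU is typically reported at the corpus level against one or more reference translations. A minimal sketch using the sacrebleu package (the sentences here are invented placeholders):

```python
import sacrebleu  # pip install sacrebleu

# Hypothetical system outputs and a single set of reference translations.
hypotheses = [
    "The new packing method was submitted to the supplier.",
    "Common specifications apply to all analog devices.",
]
references = [[
    "The new packing method was submitted to the vendor.",
    "The common specifications apply to all analog devices.",
]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # corpus-level BLEU, on a 0-100 scale
```

Note that corpus-level BLEU over a handful of segments is extremely noisy; scores only start to mean something over reasonably large, genuinely unseen test sets, which is exactly where competitive snapshots tend to go wrong.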

In the days of the NIST competitive evaluations, the focus was on Chinese and Arabic to English (news domain), and there were some clear and well-understood rules on how this should be done to enable fair competitive comparisons. Google was often a winner (i.e., highest BLEU score), but they sometimes "won" by using systems that took an hour to translate a single sentence, because they evaluated 1000X as many translation candidates as their competitors to produce their best one. Kind of bullshit, right? More recently, we have WMT16, which attempts to go beyond the news domain, does more human evaluations, evaluates PEMT scenarios, and again controls the training data used by participants to attempt a fair assessment of the competitors. Both of these structured evaluation initiatives provide useful information if you understand the data, the evaluation process, and the potential bias, but both are also flawed in many ways, especially in the quality and consistency of the human evaluations.

One big problem for any MT vendor in doing output quality comparisons with Google is that, for a test to be meaningful, it has to use data that the Google MT system does not already have in its knowledge database (training set). Google crawls news sites extensively (and the internet in general) for bilingual text (TM) data, so the odds of finding data they have not seen are very low. If you give a college student all the questions and the answers that are on a test before they take the test, the probability is high that they will do well on that test. This is why Google generally scores better on news-domain tests than most other MT vendors: they likely have 10X to 1,000X the news data that anybody except Microsoft and Baidu has. I have also seen MT vendors and ignorant LSPs show off unnaturally high BLEU scores by having an overlap between the training and test set data. The excitement dies quickly once you get to the actual data you want to translate, which the system has not seen before.
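
A first-pass sanity check for the train/test overlap problem described above is simply to look for exact matches (after light normalization) between the test set and whatever corpora a system may have trained on. A rough sketch, with hypothetical file names; note that it will not catch partial use, or use in the language model only, which is exactly the harder case discussed later in this post:

```python
import re

def normalize(line):
    """Lowercase and strip punctuation so trivial edits don't hide a match."""
    return re.sub(r"[^\w\s]", "", line.lower()).strip()

def load(path):
    with open(path, encoding="utf-8") as f:
        return {normalize(line) for line in f if line.strip()}

train = load("training_corpus.txt")   # hypothetical file names
test = load("test_set.txt")

overlap = train & test
print(f"{len(overlap)} of {len(test)} test sentences also appear in the training data")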

Thus, when an MT technology vendor tells us that they want to create a lab to address the lack of independent and objective information on quality and performance, and create a place where “research and language professionals meet,” one should be at least a little bit skeptical, because there is a conflict of interest here, as Don DePalma pointed out. But, after seeing the first "fair and balanced" evaluation from the labs, I think it might not be over-reaching to say that this effort is neither fair nor balanced, except in the way that Fox News is. At the very least, we have gross self-interest pretending to be in the public interest, just like we now have with Trump in Washington D.C. But sadly, in this case, they actually point out that, even with customization/adaptation, Google NMT outperforms all the competitive MT alternatives, including their own. This is like shooting yourself in the hand and foot at the same time with a single bullet!



A List of Specific Criticisms


Those who read my blog regularly know that I regard the Lilt technology favorably and see it as a meaningful MT technology advance, especially for the business translation industry. The people at Lilt seem to be nice, smart, and competent, and thus this "study" is surprising. Is it deliberately disingenuous, or did they just get really bad marketing advice to do what they did here?

Here is a listing of specific problems that would be clear to any observer who did a careful review of this study and its protocol.

Seriously, This is Not Blind Data.

The probability of this specific data being truly blind data is very low. The problem with ANY publicly available data is that it has a very high likelihood of having been used as training data by Google, Microsoft, and others. This is especially true for data that has been around as long as the SwissAdmin corpus has. Many of the tests typically used to determine whether the data has been used previously are unreliable, as the data may have been used partially, or only in the language model. As Lilt says, "Anything that can be scraped from the web will eventually find its way into some of the (public) systems," and any of the things I listed above happening will compromise the study. If this data, or something very similar, is being used by the big public systems, it will skew the results and lead to erroneous conclusions. How can Lilt assert with any confidence that this data was not used by others, especially Google? If Lilt was able to find this data, why would Google or Microsoft not be able to as well, especially since the SwissAdmin corpus is described in detail in this LREC 2014 paper?

Quality Evaluation Scoring Inconsistencies: Apples vs. Oranges

  • The SDL results and test procedure seem particularly unfair and biased. They state that, "Due to the amount of manual labor required, it was infeasible for us to evaluate an “SDL Interactive” in which the system adapts incrementally to corrected translations." However, this infeasibility does not seem to prevent them from giving SDL a low BLEU score. The "adaptation" that was conducted was done in a way that SDL does not recommend for best results, so publishing such sub-optimal results is rude and unsportsmanlike conduct. Would it not be more reasonable to say it was not possible, and leave it blank?
  • Microsoft and Google released their NMT systems on the same day, November 15, 2016. (Click on the links to see). But Lilt chose to only use the Google NMT in their evaluation.
  • SYSTRAN has been updating their PNMT engines on a very regular basis and it is quite possible that the engine tested was not the most current or best-performing one. At this point in time, they are still focusing on improving throughput performance, and this means that lower quality engines may be used for random, free, public access for fast throughput reasons. 
  • Neither SYSTRAN nor SDL seems to have benefited from the adaptation, which is very suspicious, and should they not be given an opportunity to show this adaptation improvement as well?
  • Finally, one wonders how the “Lilt Interactive” score was produced. How many sentences were reviewed to provide feedback to the engine? I am sure Lilt took great care to put their own best systems forward, but they seem to have been less careful, and even to have executed sub-optimal procedures, with all the others, especially SDL. So how can we trust the scores they come up with?

Customization Irregularities

This is still basically news, or very similar-to-news, domain content. After making a big deal about using content that "is representative of typical paid translation work," they basically chose data that is heavily news-like. Press releases are very news-like, and my review of some of the data suggests it also looks a lot like EU data, which is also in the training sets of the public systems. News content is the default domain that public systems like Google and Microsoft are optimized for, and it is also a primary focus of the WMT systems. And for those who scour the web for training data, this domain has by far the greatest amount of relevant publicly available data. However, in the business translation world, which was supposedly the focus here, most domains that are relevant for customization are exactly UNLIKE the news domain. The precise reason companies need to develop customized MT solutions is that their language and vocabulary differ from what public systems tend to do well (namely, news). The business translation world tends to focus on areas where there is very little public data to harvest, either due to domain specificity (medical, automotive, engineering, legal, eCommerce, etc.) or due to company-specific terminology. So testing on news-like content does not say anything meaningful about the value of customization in a non-news domain. What it does say is that public generic systems do very well on news, which we already knew from years of WMT evaluations that were done with much more experimental rigor and more equitable evaluation conditions.

Secondly, the directionality of the content matters a lot. In “real life”, a global enterprise generates content in a source language, where it is usually created from scratch by native speakers of that language, and needs it translated into one or more target languages. Therefore, this is the kind of source data we should test if we are trying to recreate the localization market scenario. Unfortunately, this study does NOT do that (and to be fair, this problem infects WMT and pretty much the whole academic field; I don't mean to pick on Lilt!). The test data here started out as native Swiss German and was then translated into English and French. In the actual test conducted, it was evaluated in the English⇒French and English⇒German directions, which means that the source input text was obtained from (human) translations, NOT native text. This matters; Microsoft and others have done many evaluations to show this. Even good human translations are quite different from true native content. In the case of English⇒French, both the source and the reference are translated content.

There is also the issue of questionable procedural methodology when working with competitive products. From everything I gathered in my recent conversations with SDL, it is clear that adaptation by importing some TM into Trados is a sub-optimal way to customize an MT engine in their current product architecture. It is even worse when you try to jam a chunk of TM into their adaptive MT system, as Lilt also admitted. One should expect very different, and sub-optimal, outcomes from this kind of effort, since the technology is designed to be used in an interactive mode for best results. I am also aware that most customization efforts with phrase-based SMT involve a refinement process, sometimes called hill-climbing. Just throwing some data in, taking a BLEU snapshot, and then concluding that this is a representative outcome for that platform is wrong and misleading. Most serious customization efforts require days, if not weeks, of effort to complete prior to a production release.


Another problem with using human-translated content as source or reference is that in today's world many human translators start with a Google MT backbone and post-edit. Sometimes the post-edit is very light. This holds true whether you crowd-source, use a low-cost provider such as unBabel (which explicitly specifies that they use Google as a backbone), or use a full-service provider (which may not admit this, but that is what their contract translators are doing, with or without their permission). The only way to get a 100% from-scratch translation is to physically lock the translator in an internet-free room! We already know from multi-reference data sets that there are many equally valid ways to translate a text. When the “human” reference is edited based on Google output, the scores naturally favor Google output.

Finally, the fact that the source data starts as Swiss German, rather than standard German, may also be a minor problem. The differences between these German variants appear to be most pronounced in speech rather than in writing, but Schriftsprache (written Swiss German) does seem to have some differences from standard High German. Wikipedia states: "Swiss German is intelligible to speakers of other Alemannic dialects, but poses greater difficulty in total comprehension to speakers of Standard German. Swiss German speakers on TV or in films are thus usually dubbed or subtitled if shown in Germany."

 Possible Conclusions from the Study


All this suggests that it is rather difficult for any MT vendor to conduct a competitive evaluation in a manner that would be considered satisfactory and fair to, and by, other MT vendor competitors. However, the study does provide some useful information:

  • Do NOT use news-domain or news-like content if you want to understand what the quality implications are for "typical translation work".
  • Google has very good generic systems, which are also likely to be much better with News domain than with other specialized corporate content.
  • Comparative quality studies sponsored by an individual MT vendor are very likely to have a definite bias, especially on comparing customized systems.
  • According to this study, if these results were indeed ACTUALLY true, there would be little point in using anything other than Google NMT. However, it would be wrong to conclude that using Google would be better than properly using any of the customized options available, since, except for Lilt, we can presume they have not been optimally tuned. Lilt responded to my comment on this point saying, "On slightly more repetitive and terminology-heavy domains we can usually observe larger improvements of more than 10% BLEU absolute by adaptation. In those cases, we expect that all adapted systems would outperform Google’s NMT."
  • Go to an independent agent (like me or TAUS) who has no vested interest other than getting accurate and meaningful results, which also means that everybody understands and trusts the study BEFORE they engage. A referee is necessary to ensure fair play in any competitive sport, as we all know from childhood.
  • It appears to me (only my interpretation and not a statement of fact) that Lilt's treatment of SDL was particularly unfair. In the stories of warring tribes in human literature, this usually is a sign that suggests one is particularly fearful of an adversary.  This intrigued me, so I did some exploration and found this patent which was filed and published years BEFORE Lilt came into existence.  The patent summary states: "The present technology relates generally to machine translation methodologies and systems, and more specifically, but not by way of limitation, to personalized machine translation via online adaptation, where translator feedback regarding machine translations may be intelligently evaluated and incorporated back into the translation methodology utilized by a machine translation system to improve and/or personalize the translations produced by the machine translation system."  This clearly shows that SDL was thinking about Adaptive MT long before Lilt. And, Microsoft was thinking about dynamic MT adaptation as far back as 2003. So who really came up with the basic idea of Adaptive MT technology? Not so easy to answer, is it?
  • Lilt has terrible sales and marketing advisors if they were not able to understand the negative ramifications of this "study", and did not try to adjust it or advise against publicizing it in its current form. For some of the people I talked to in my investigation, it even raises some credibility issues for the principals at Lilt.

I am happy to offer Lilt an unedited guest post on eMpTy Pages if they wish to respond to this critique in some detail rather than just through comments. In my eyes, they attempted to do something quite difficult and failed, which should not be condemned per se, but it should be acknowledged that the rankings they produced are not valid for "typical translation work". We should also acknowledge that the basic idea behind the study is useful to many, even if this particular study is questionable in many ways. I could also be wrong on some of my specific criticisms, and am willing to be educated, to ensure that my criticism in this post is also fair. There is only value to this kind of discourse if it furthers the overall science and understanding of this technology; my intent here is to question the experiment's fundamentals and get to useful results, not to bash Lilt. It is good to see this kind of discussion beginning again, as it suggests that the MT marketplace is indeed evolving and maturing.

Peace.

P.S. I have added the Iconic comments as a short separate post here to provide the perspective of MT vendors who perform deep, careful, system customization for their clients and who were not included directly in the evaluation.