Showing posts with label Neural Machine Translation. Show all posts

Monday, May 24, 2021

ModernMT: A Closer Look At An Emerging Enterprise MT Powerhouse

As one observes the continuing evolution of MT use in the professional translation industry, it is clear we have reached a point where some useful insights about producing successful outcomes with MT are available. From my perspective as a long-term observer and expert analyst of enterprise MT use, these include:

  • Adaptation and customization of a generic MT engine done with expertise generally produces a better outcome than simply using a generic public MT system. 
  • Working with enhanced baseline engines built by experts is likely to produce better outcomes than dabbling with Open Source options with limited expertise. While it has gotten easier to produce MT systems with open-source platforms, real expertise requires long-term exposure and repeated experimentation. 
  • The algorithms underlying Neural MT have become largely commoditized and there is little advantage gained by jumping from one NMT platform to another.
  • More data is ONLY better if it is clean, relevant, and applicable to the enterprise use case in focus. It can be said today that (training) data often matters more than the algorithms used, but data quality and organization is a critical factor for creating successful outcomes.
  • A large majority of translators still view MT with great skepticism and see it as marginally useful, mostly because of repeated exposure to incompetently deployed MT systems that are used to reduce translator compensation. Getting active and enthusiastic translator buy-in continues to be a challenge for most MT developers and getting this approval is a clear indicator of superior expertise.
  • Attempts to compare different MT systems are largely unsuccessful or misleading, as they are typically based on irrelevant test data or draw conclusions based on very small samples.
  • A large number of enterprise use cases are limited by scarce training data resources and thus adaptation and customization attempts have limited success.
I have been skeptical of the validity of many of the comparisons of MT systems produced by LSPs and "independent" evaluators nowadays, because of the questionable evaluation methodologies used. The evaluators often produce nice graphics, but just as often produce misleading results that need further investigation. However, these comparative evaluations can still be useful for getting a rough idea of the performance of these vendors' generic systems. Over the last few years ModernMT has consistently shown up amongst the top-performing MT systems in many different evaluations, and thus I decided to sit down with the ModernMT team to better understand their technology and product philosophy, and what might be driving this consistent performance advantage. The transparency and forthcoming nature of the responses from the ModernMT team were refreshing in contrast to conversations I have had with other MT developers.



The MT journey here began over 10 years ago with Moses and Statistical MT, but unlike most other long-term MT initiatives I know of, this effort was very translator-centric right from its inception. The system was used heavily by translators who worked for Translated, and the MT systems were continually adapted and modified to meet the needs of production translators. This is a central design intention, and it is important not to gloss over it, as this is the ONLY MT initiative I know of where Translator Acceptance is used as the primary criterion, on an ongoing basis, in determining whether MT should be used for production work or not. The operations managers will simply not use MT if it does not add value to the production process or causes translator discontent. Over many years the ongoing collaboration with translators at ModernMT has triggered MT system and process development changes to reach the current status quo, where the MT value-add and efficiency is clear to all the stakeholders. The long-term collaboration between translators and MT developers, and the resulting system and process modifications, are a key reason why ModernMT does so well in both generic MT system comparisons, and especially in adapted/customized MT comparisons.
Thus, translators who actively use the ModernMT platform do so most often through MateCat, an open-source CAT tool that ties MyMemory (a large free-access shared TM repository with around 50 billion words) together with ModernMT or other MT platforms. MT is presented to translators as an alternative to TM on a routine basis, and corrections are dynamically and systematically used to drive continuous improvements in the ModernMT engines. Trados and other CAT tools can also seamlessly connect to the ModernMT back-end, but these systems may see less immediate improvements in the MT output quality. However, this has not stopped ~25,000 downloads of the ModernMT plugin for Trados on the SDL Appstore. Translators who do production work for Translated are often given a choice of using Google instead of ModernMT, but most have learned that ModernMT output improves rapidly from corrective feedback and that collaborative input is easier, and thus tend to prefer it, as shown in the surveys below. Over the years the ModernMT product evolution has been driven by changes that identify and reduce post-editing effort, rather than by optimizing BLEU scores as most others have done.

In contrast to most MTPE experiences, the individual translator experience here is characterized by the following:
  • A close and symbiotic relationship between a relevant translation memory and MT, even at the translator UX level
  • An MT system that is constantly updated and can potentially improve with every single interaction and unit of corrective feedback
  • Immediate project startup possibilities as no batch MT training process is necessary
  • Translator control over all steering data used in a project means very straightforward control over terminology and term consistency, mirroring the latest TMs and linguistic preferences
  • Corrective feedback given to the MT system is dynamic and continuous and can have an immediate impact on the next sentence produced by the MT system
  • One of very few MT systems available today that can provide a context-sensitive translation 
  • Measurable and palpable reduction in post-editing effort, and a better translator UX, compared to other MT platforms
  • Continuing free access to the CAT tool needed to integrate MT with TM, and interact proactively with MT with the option to use other highly regarded CAT tools if needed. 
 

Memory here refers to user-supplied data (TMs and glossaries) used to tune the generic system to the needs of the current translation task.

Instance-Based Adaptation


ModernMT describes itself as an "Instance-Based Adaptive MT" platform. This means that it can start adapting and tuning the MT output to the customer subject domain immediately, without a batch customization phase. There is no long-running (hours/days/weeks) data preparation and pre-training process needed upfront. There is also no need to wait and gather a sufficient volume of corrective feedback to update and improve the MT engine on an ongoing basis. It is learning all the time. 
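The instance-based loop described above can be sketched in miniature. The class below is purely illustrative (its names and bag-of-words matching are my own hypothetical stand-ins, not ModernMT's actual mechanism); it only shows the shape of a system that learns from each correction immediately and retrieves the most relevant past instance for every new sentence.

```python
from collections import Counter

class InstanceAdapter:
    """Toy illustration of instance-based adaptation: store (source,
    corrected translation) pairs as they arrive and retrieve the most
    similar past instance for each new sentence. A real adaptive NMT
    system would bias its decoder with this material; here we only
    sketch the retrieval side of the loop."""

    def __init__(self):
        self.instances = []  # list of (source, correction) pairs

    def learn(self, source: str, correction: str) -> None:
        # Every piece of corrective feedback is usable immediately:
        # no batch retraining step is required.
        self.instances.append((source, correction))

    def most_similar(self, sentence: str):
        # Bag-of-words overlap as a crude stand-in for real fuzzy matching.
        query = Counter(sentence.lower().split())
        def overlap(pair):
            source_words = Counter(pair[0].lower().split())
            return sum((query & source_words).values())
        if not self.instances:
            return None
        return max(self.instances, key=overlap)

adapter = InstanceAdapter()
adapter.learn("The brake pad is worn", "Il pattino del freno è usurato")
adapter.learn("Restart the server", "Riavvia il server")
# The very next sentence can already benefit from the stored correction:
match = adapter.most_similar("Replace the worn brake pad")
```

The point of the sketch is the absence of any training phase: `learn` is a constant-time append, so the "engine" is never offline while it improves.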

Rapid adaptation to customer-unique language and terminology is perhaps the single most critical requirement for a global enterprise, and thus this design works optimally with specialized and unique enterprise content; the same is true for LSPs. ModernMT can adapt the MT system with as little as a single sentence, though the results are better if more data is provided. The team told me that 100K words (10-12,000 sentences) would generally produce consistently good results that are superior to any generic engine. The long-term impact of this close collaboration with translators, who provide ongoing corrections and feedback on critical requirements to improve efficiency and process workflow, together with careful acquisition of the right kind of data, results in the kind of relative performance rankings that ModernMT now regularly sees as a matter of course. One might even go so far as to say that they have built a sustainable competitive advantage.

I have always felt that a properly designed man-machine collaboration would very likely outperform an MT design approach that relies entirely on algorithms and/or data alone. We can see this is true from the comparative results of the large public MT portals, which probably have 100X or more of the resources and budget that ModernMT does. The understanding of the translation task, and the resulting direction, that ongoing translator feedback brings to the table is an ingredient that most current MT systems lack. Gary Marcus and other AI experts have been vocal in pointing out that machine learning and data alone are not the best way forward, and that more human steering and symbolic knowledge is needed for better outcomes.
   

Special Features

ModernMT is a context-aware machine translation product that learns from user corrections. There has recently been growing interest in the MT research community in bringing a greater degree of contextual awareness to MT systems, and ModernMT has been investigating such capabilities as well. The current production version already has an implementation of this, and the feature continues to evolve in speed, efficiency, and scope.


The ModernMT Context Analyzer analyzes an entire document text to be translated in milliseconds before producing a translation. This analysis seeks out and identifies the distinctive terminology and intrinsic style of the document. This information is then used to automatically select the most suitable private translation memories loaded by the user for that particular document. This results in the engine selecting the translation memory inventory that best reflects the right terminology and writing style. It is precisely this inventory that the MT engine leverages to customize the output in real-time, for each and every sentence of the document.
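Since the actual Context Analyzer is proprietary, the general technique can only be sketched. The toy function below is a hypothetical illustration of the idea: score each loaded private memory by how much of its vocabulary overlaps the document's terms, and route the document to the best-matching memory.

```python
def select_memory(document: str, memories: dict) -> str:
    """Hypothetical sketch of context-based memory selection: score each
    private TM by the fraction of the document's terms that appear in the
    memory's source segments, and return the best-scoring memory's name.
    This is an illustration of the general idea, not ModernMT's algorithm."""
    doc_terms = set(document.lower().split())

    def score(entries):
        tm_terms = set()
        for source, _target in entries:
            tm_terms.update(source.lower().split())
        return len(doc_terms & tm_terms) / max(len(doc_terms), 1)

    return max(memories, key=lambda name: score(memories[name]))

# Two invented example memories, each a list of (source, target) segments:
memories = {
    "automotive": [("Check the brake fluid level", "Controlla il livello del liquido freni")],
    "legal": [("The parties agree to the terms", "Le parti accettano i termini")],
}
best = select_memory("Brake fluid must be replaced", memories)
```

A production system would use distinctive-term weighting (e.g. TF-IDF) rather than raw overlap, and would run this selection per document in milliseconds, as described above.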

As translators at Translated working with ModernMT regularly have the ability to compare the MT output with that of Google Translate, the developers monitor translator preferences on an ongoing basis. This ensures that translators are always working with the MT output that they find most useful and that developers understand when their own engines need to be improved or enhanced. The following charts are based on feedback from translators during production work and show a very definite preference for the rapidly improving ModernMT engine output. This preference is seen in internal translator assessments working in production mode rather than just a selective test set, and this has also been confirmed by independent third-party assessments with both automated scores and human evaluations. They all consistently show that ModernMT customizations regularly outperform most others in independent comparative evaluations. The forces driving this superior performance are the result of design philosophy and long-term man-machine collaboration that cannot be easily replicated by others.
 

Recent comparative assessments done by independent third parties also confirm this preference using different evaluation methods that include both human and automated metrics as shown below. It is not unreasonable to presume that this performance advantage will remain intact for at least the short term.


Data Privacy

In response to a question on data privacy, Davide Caroselli, VP of Product, ModernMT responded: "Any content sent to ModernMT, whether a “TMX” memory or an MTPE correction from a professional translator, is saved in the user’s private data area. In fact, only you will be able to access your resources and make ModernMT adjust to them; in no way will another user be able to utilize that same inventory for his/her system, nor will ModernMT itself be able to use those contents, other than to exclusively offer your personalized translation service.

In addition, ModernMT uses state-of-the-art encryption technologies to provide its cloud services. Our data centers, employee processes and office operations are ISO 27001:2013 certified." 


On-Premise Capabilities

While the bulk of the current ModernMT customer base works with the secure cloud deployment, the team has also defined a range of on-premise deployment capabilities for those enterprises that need the security, control, and assured data privacy that characterize some National Security, Financial, Legal, and Healthcare/Pharma industry requirements. The open-source foundations of much of the ModernMT infrastructure should make it particularly interesting to US Government intelligence and law-enforcement agencies seeking large-scale multilingual data processing capabilities for eDiscovery and Social Media Surveillance applications.

Given that ModernMT is a continuous learning MT platform that learns with each correction, dynamically, there is a requirement for more GPU infrastructure than some other on-premise solutions in the market. However, there is a strong focus on computational efficiency to minimize the IT footprint needed to deploy it on-premise, and based on information provided to me, their capabilities are quite similar to competitive alternatives both in terms of hardware requirements and software pricing. Hardware costs are linked to throughput expectations with more hardware required for high throughput requirements. As with most machine learning-intensive capabilities, only enterprises with competent IT teams could undertake this as an internal deployment, and most LSPs and localization departments will see a lower total cost of ownership with the cloud deployment. 

Enterprise Readiness  

Because ModernMT evolved from the localization world, it is already optimized for MT use cases that need a machine-first, human-optimized approach. Increasingly, this model is the preferred approach for the exploding volumes of localization content. Localization is possibly the most challenging MT use case out there, as it requires very high-quality initial output that translators are willing to work with, where it can be proven that MT enhances productivity and efficiency. Localization demands the highest-quality MT output from the outset; eDiscovery, social media surveillance, eCommerce, and customer service & support use cases are all more tolerant of lower output quality on much larger volumes of data. Very few MT developers have succeeded with the high-quality and rapid-responsiveness needs of the localization use case; many have tried and failed, which is why LSP adoption of MT is so low. ModernMT's success with this most challenging use case, however, positions them very well for other MT use cases, as their growing traction there proves.

The ASTW case study illustrates the success of ModernMT in Intellectual Property (Patents) and Life Science focused translations, where the ease of customization for complex terminology and morphology, the ability to learn continuously and quickly from corrective feedback, and superior MTPE experience compared to other MT solutions has quickly made it a preferred solution. 
"ModernMT is currently our favorite MT engine, especially in patent translations and in the Life Science sector, because it proves reliable, efficient, qualitatively better than its competitors, easily customizable and advantageous in terms of cost."

Domenico Lombardini, CEO ASTW

The eCommerce giants understand the positive impact that translating huge volumes of catalog and user-generated CX content has on driving international revenue growth, as the examples of eBay, Amazon, and Alibaba show. ModernMT is now the MT engine driving the multilingual expansion of Airbnb web content, translating many billions of words a month for them. User-generated content influences future customers, and there is great value in translating this content to drive and grow international business. Interestingly, ModernMT began this initiative with almost no translation memory and had to perform specialized heuristic analysis on Airbnb content to build the training material.


ModernMT has reached this point with very little investment in sales and marketing infrastructure. As this builds out and expands I will be surprised if ModernMT does not continue to expand and grow its enterprise presence, as enterprise buyers begin to understand that a tightly integrated man-machine collaborative platform that is continuously learning, is key to creating successful MT outcomes. I am aware that many other high-profile enterprise conversations are underway, and I expect that most enterprise buyers who evaluate the ModernMT platform will very likely find it is a preferred, cost-efficient way to implement large-scale MT solutions in a way that dramatically raises the likelihood of success. 


Future Directions

Davide also mentioned to me that his team is very connected to the AI community in Italy and has been experimenting with GPT-3 and BERT, and will continue to do so until clear value-added applications that support and enhance their MT product emerge. ModernMT has a close relationship with Pi Campus and thus has regular interaction with luminaries in the AI community, e.g. Lukasz Kaiser, who will be speaking about improvements in the Transformer architecture later this month.


The team also showed me demos of complex video content that had ModernMT-based automated dubbing from English to Italian injected into it. Apparently, Italy is one of the largest dubbing markets in the world. Who knew? Since my wife speaks Italian, I showed her some National Geographic content on geology, filled with complex terminology and scientific subject matter that she was shocked to find out had been done completely without human modification. The Translated team is exploring Speech Translation and I expect that they will be quality leaders here too.

ModernMT will continue to expand its connectivity to other translation and content management infrastructure to make it easier to get translation-worthy data in and out of their environment. They also continue to explore ways to make the ModernMT continuous training infrastructure more computationally efficient so that it can be more easily deployed on smaller footprint hardware. 

I expect we will see more and more of ModernMT on the enterprise MT stage from now on, as buyers realize that this is a significantly improved next-generation MT solution that is more likely to produce successful outcomes in digital transformation-related enterprise use scenarios. The ModernMT approach reduces the uncertainty that is so common with most MT-related initiatives and does it so seamlessly that most would not realize how sophisticated the underlying technology is until they attempt to replicate the functionality.


On a completely different note, I participated some months ago in responding to a question posed by  Luca Di Biase, the Imminent Research Director. He posed this same question to many luminaries in the translation industry, and also to me. The question has already triggered several discussions on Twitter.

“Is language a technology or a culture?”  

My response was as follows, but I think you may find the many other responses more interesting and complete if you go to this link or look at some of the other Twitter comments.
It is neither. Language is a means of communication and an information-sharing protocol that employs sounds, symbols, and gestures. Language can sometimes use technology to enable amplification, extend the reach of messages, and accelerate information and knowledge sharing. Language can create a culture when shared with(in) a group and used with well-understood protocols and norms. Intercultural communication can also mean cross species, e.g., when communicating with dogs and horses.

Translated's Research Center has just released the Imminent publication, which has a distinctive style coupled with interesting content that I think most in the language industry would find compelling and worth a close look.

Thursday, March 25, 2021

The Impact of MT on the Freelance Translator


The ProZ.com Podcast

This is an interview I did with Paul Urwin of ProZ.com; the links will take you to the podcast.

The conversation covers possible strategies that freelance translators can adopt to deal with PEMT and provides some guidance (hopefully) on potential new skills that professionals can develop. 

It also provides context on how valuable the translator is even with continuously improving MT, and points to a growing awareness that translators are a resource whose value can only grow in importance given the never-ending growth of content that needs to be translated.

This is Part 1.

Paul talks with machine translation expert Kirti Vashee about interactive-adaptive MT, linguistic assets, freelance positioning, how to add value in explosive content situations, e-commerce translation and the Starship Enterprise.

This is Part 2.

Paul continues the fascinating discussion with Kirti Vashee on machine translation. In this episode, they talk about how much better MT can get, which languages it works well for, data, content, pivot languages and machine interpreting.

The Future is NOT just MT

A relatively easy way to understand the power of an adaptive MT solution (that learns from corrective feedback dynamically) is to test ModernMT with the free open-source Matecat CAT tool. A more detailed overview of the capability is given here.


Monday, January 11, 2021

Most Popular Blog Posts of 2020

 This is a summary ranking of the most popular blog posts of the 2020 year based on readership traffic and presence. These rankings are based on the statistics given to me by the hosting platform, which sometimes fluctuate much more than one would expect. 

I am pleased to see increasing awareness of the importance of data analysis in multiple phases of the MT development and deployment process. Data analysis matters for training data selection, improved linguistic pattern handling, and effective testing and quality estimation, among other things. The tools to do this well still lack robustness or need major improvements to make them more usable. As the world shifts from handling translation projects for localization (relatively low volume) to other digital-presence and large-scale content assimilation and dissemination use cases, I think there will be a need for better tools. I am surprised by the continuing stream of new TMS products that emerge; most of them overlap heavily with existing products, and none really change the playing field in a meaningful way.

The single most popular post of the year was this one, an interview with Adam Bittlingmayer on the risk prediction capabilities of ModelFront:

1. Understanding Machine Translation Quality & Risk Prediction

Getting a better understanding of data and identifying the most critical data issues is key to success with MT. Better data analysis means that human effort can be focused on a much smaller set of data and thus yield better overall quality in less time. The risk prediction and quality estimation data provided by ModelFront makes MT use much more efficient. It allows rapid error detection and can help split translation projects into high-touch and low-touch elements. I suspect much of the readership of this post came from outside the translation industry, as I continue to see little focus on this in the localization world. This post is worth a closer look for those LSPs who are investigating a more active use of MT. This link leads to a case study showing how this can help in localization projects.
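The high-touch/low-touch split that risk prediction enables can be sketched very simply. The function and threshold below are hypothetical illustrations of the routing idea, not the ModelFront API: segments whose predicted risk falls under a threshold are published raw, and the rest go to human editors.

```python
def route_segments(segments, risk_threshold=0.3):
    """Illustrative sketch (not a real API): given (text, risk) pairs from
    a quality-estimation step, split a project into a low-touch stream
    that can be published raw and a high-touch stream that is sent to
    human post-editors."""
    low_touch = [text for text, risk in segments if risk < risk_threshold]
    high_touch = [text for text, risk in segments if risk >= risk_threshold]
    return low_touch, high_touch

# Invented example scores for two segments:
scored = [("Good morning", 0.05), ("The gasket torque spec is 12 Nm", 0.55)]
low, high = route_segments(scored)
```

The value is in the economics: when only the high-risk fraction of millions of segments needs human attention, the cost of human review scales with risk rather than with volume.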

Despite the hype around the magic of NMT and deep learning in general, we should understand that deep learning NMT toolkits are going to be viewed as commodities. Data analysis is where value creation will happen. 

The data is your teacher and is where the real value contribution possibilities are. I predict that this will continue to be clearer in the coming year.




If we consider all three posts related to the "Premium Translation", it would easily be the top blog theme for the year. These posts together attracted the most active readership, and also the most articulate and comprehensive comments. MT technologists tend to lump all translators together when making comments about "human" translators, but we should understand that there is a broad spectrum of capabilities when we talk about "human" translators.  And those who collaborate, consult, and advise their customers around critical content communication are unlikely to be replaced by ever-improving MT. Real domain expertise, insight, and the ability to focus on the larger global communication mission of the content is something I do not see MT approach successfully in my lifetime. 

I am extremely skeptical about the notion of the "singularity" as some technologists have described it. It is a modern myth that will not happen as described, IMO -- most AI today, and machine learning in particular, is no more than sophisticated pattern matching in big data, and while it can be quite compelling at times, it is NOT intelligence. Skillful use and integration of multiple deep learning tasks can create the illusion of intelligence, but I feel we still have much to learn about human capabilities before we can build "human equivalent" technology.

Here is a smart discussion on AI that provides informed context and a reality check on the never-ending AI hype that we continue to hear. Translators, in particular, will enjoy this discussion as it reveals how central language understanding is to the evolution of AI possibilities.
"There is not the slightest reason to believe in a coming singularity. Sheer processing power [and big data] is not a pixie dust that magically solves all your problems."
                                                                            Steven Pinker  




The pandemic has forced many more B2C and B2B interactions to be digital. In an increasingly global world where multilingual communication, collaboration, and information sharing are commonplace, MT becomes an enabler of global corporate presence. I estimate that the huge bulk of the need for translation is beyond the scope of the primary focus of localization efforts. It is really much more about digital interactions across all kinds of content and multiple platforms that require instant multilingual capabilities.

However, data security and privacy really matter in these many interactions, and MT technology that does not make data security a primary focus in the deployment should be treated with suspicion and care.

Microsoft offers quite possibly the most robust and secure cloud MT platform for companies that wish to integrate instant translation into all kinds of enterprise content flows. The voice of a Microsoft customer states the needs quite simply and eloquently.

“Ultimately, we expect the Azure environment to provide the same data security as our internal translation portal has offered thus far,” 

Tibor Farkas, Head of IT Cloud at Volkswagen 





This was a guest post by Raymond Doctor that illustrates the significant value that linguists can add to the MT development process. Linguistically informed data can make some MT systems considerably better than simply adding more data can. Many believe that the need is simply more data, but this post clarifies that a smaller amount of the right kind of data can have a much more favorable impact than sheer random data volume.

The success of these MT experiments is yet more proof that the best MT systems come from those who have a deep understanding of both the underlying linguistics, as well as the MT system development methodology.

Here is a great primer on the need for data cleaning in general. This post takes the next step and provides specific examples of how this can be extended to MT.

"True inaccuracy and errors in data are at least relatively straightforward to address because they are generally all logical in nature. Bias, on the other hand, involves changing how humans look at data, and we all know how hard it is to change human behavior."

Michiko Wolcott




I have been surprised at the continuing popularity of this post, which was actually written and published in March 2012, almost 9 years ago. Interestingly, Sharon O'Brien raised this at the recent AMTA2020 conference: she tried to get a discussion going on why the issues being discussed around post-editing have not changed in ten years.

The popularity of this post points to how badly PEMT compensation is being handled even in 2020. Or perhaps it suggests that people are doing research to try and do it better. 

Jay Marciano had a presentation at ATA recently,  where he argued that since there is no discernible and reliable differentiator between fuzzy translation memory matches and machine translation suggestions (assuming that you are using a domain trained machine translation engine), we should stop differentiating them in their pricing. Instead, he suggested that they should all be paid by edit distance. ("Edit distance" is the now widely used approach to evaluating the number of changes the editor or translator had to make to an MT suggestion before delivering it.) 

Doing this, according to Jay, protects the translator from poor-quality machine translation (because the edit distance -- or rewrite from scratch --will, in that case, be large enough for 100% payment) as well as from bad translation memories (same reason). Also, he suggests payment for MT suggestions with no edit distance, i.e., suggestions where no edits were deemed necessary (20% of the word price) at a rate twice as high as a 100% TM match (10%) to compensate for the effort to evaluate their accuracy. He also suggests a 110% rate for an edit distance of 91-100%, taking into account the larger effort needed to "correct" something that was rather useless in the first place. 
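One literal reading of this scheme can be sketched in code. The tiers at the extremes (20% for untouched MT, 10% for a 100% TM match, 110% for an edit distance of 91-100%) come from the description above; the linear interpolation in the middle band is my own assumption, and the helper functions are illustrative only.

```python
def edit_distance(a: str, b: str) -> int:
    # Standard Levenshtein distance over characters (dynamic programming).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def payment_rate(mt_suggestion: str, final_translation: str) -> float:
    """One possible reading of the edit-distance payment scheme: 20% of
    the word price for untouched MT (the accuracy still had to be
    verified), 110% for a near-total rewrite (edit distance 91-100%),
    and the normalized edit distance itself in between, with the 20%
    review floor. The middle band is an assumption, not part of the
    original proposal."""
    max_len = max(len(mt_suggestion), len(final_translation), 1)
    change = edit_distance(mt_suggestion, final_translation) / max_len
    if change == 0:
        return 0.20   # untouched MT suggestion, twice a 100% TM match (10%)
    if change >= 0.91:
        return 1.10   # effectively useless suggestion: full rate plus a premium
    return max(change, 0.20)
```

The sketch also makes the practical objection concrete: the rate is only knowable after the work is done, which is exactly why the scheme is hard to quote in advance.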

This is an attempt to be fair but, practically, it is a hard-to-predict compensation scheme, and most customers like to know costs BEFORE they buy. There are many others who think we should still be looking at an hourly-based compensation scheme. We do not question how long a mechanic, electrician, accountant, or lawyer takes to do a job as a reason not to hire them, and perhaps translation work could evolve to this kind of model. It is not clear how this could work when very large volumes (millions of words/day) are involved, as the edit-distance approach is really only viable in post-editing of MT scenarios.

Nonetheless, much of the current thinking on the proper PEMT compensation model is to use Edit Distance-based methodologies. While this makes sense for localization MT use cases, this approach is almost useless for the other higher volume MT use cases. The quality and error assessment schemes proposed in localization are much too slow and onerous to use in scenarios where millions or hundreds of millions of words are being translated every day.

 It is my sense that 95% of MT use is going to be outside of localization use cases (PEMT) and I think the more forward-looking LSPs will learn to find approaches that work better when the typical translation job handles millions of words a week. Thus, I am much more bullish on quality estimation and risk prediction approaches that are going to be a better way to do rapid error detection and rapid error correction for these higher business value, higher volume MT use cases.

The issue of equitable compensation for post-editors is an important one, and it is important to understand the issues related to post-editing, which many translators find to be a source of great pain and inequity. MT can often fail or backfire if the human factors underlying the work are not properly considered and addressed. 

From my vantage point, it is clear that those who understand these various issues and take steps to address them are most likely to find the greatest success with MT deployments. These practitioners will perhaps pave the way for others in the industry and “show you how to do it right,” as Frank Zappa says. Many of the problems with PEMT are related to ignorance about critical elements, “lazy” strategies, a lack of clarity on what really matters, or simply using MT where it does not make sense. These factors result in the many examples of poor PEMT implementations that antagonize translators. 

 




This is a guest post by Luigi Muzii with his observations on several hot issues in the localization business. He comments often on the misplaced emphasis and attention given to the wrong problems. For some reason, misplaced emphasis on the wrong issues has been a long-term problem in the business translation industry. 

Almost more interesting than disintermediation – removing the middleman – is intermediation that adds the middleman back into the mix. Intermediation occurs when digital platforms inject themselves between the customers and a company.  In this case, the global enterprise and the translators who do the translation work. These platforms are so large that businesses can’t afford not to reach customers through these platforms. Intermediation creates a dependency and disintermediation removes the dependency. There is no such intermediary for translation though some might argue that the big public MT portals have already done this and the localization industry only services the niche needs.

He focuses also on the emergence of low-value proposition, generic MT portals with attached cheap human review capabilities as examples of likely to fail attempts at disintermediation. It is worth a read. An excerpt:
"It is my observation that these allegedly “new offerings” are usually just a response to the same offering from competitors. They should not be equated to disintermediation and they often backfire, both in terms of business impact and brand image deterioration. They all seem to look like dubious, unsound initiatives instigated by Dilbert’s pointy-haired boss. And the Peter principle rules again here and should be considered together with Cipolla’s laws of stupidity, which state that a stupid person is more dangerous than a pillager and often does more damage to the general welfare of others."

 

By Vincedevries - Own work, CC BY-SA 4.0
 

The danger posed by the stupid person is proven by the damage the orange buffoon has caused the US. This man manages to comfortably straddle both the stupid and bandit quadrants with equal ease, even though he started as a bandit. Fortunately for the US, the stupid element was much stronger than the bandit element in this particular case. Unfortunately for the US, stupid bandits can inflict long-term damage on the prospects of a nation, and it may take a decade or so to recover from the damage done. 


“The reason why it is so difficult for existing firms to capitalize on disruptive innovations is that their processes and their business model that make them good at the existing business actually make them bad at competing for the disruption.”

"'Disruption' is, at its core, a really powerful idea. Everyone hijacks the idea to do whatever they want now. It's the same way people hijacked the word 'paradigm' to justify lame things they're trying to sell to mankind."
Clay Christensen


“Life’s too short to build something nobody wants.”

                                                                                    Ash Maurya 



Luigi also wrote a critique of my post on the Premium Market, challenging many of the assumptions and conclusions I had drawn. I thought it only fair to include it in this list so that readers could get both sides of the premium market discussion.




I also noted that the following two posts got an unusual amount of attention in 2020. The BLEU score post has been very popular in two other forums where it has been published. There are now many other quality measurements for adequacy and fluency in use, but I still see a large number of new research findings reported with BLEU, mostly because it is widely understood, in all its imperfection.

The latest WMT results use Direct Assessment (DA) extensively in their results summaries. 

Direct assessment (DA) (Graham et al., 2013, 2014, 2016) is a relatively new human evaluation approach that overcomes previous challenges with respect to the reliability of human judges. DA collects assessments of translations separately, in the form of both fluency and adequacy, on a 0–100 rating scale, and, by combining repeat judgments for translations, produces scores that have been shown to be highly reliable in self-replication experiments. The main component of DA used to provide a primary ranking of systems is adequacy, where the MT output is assessed via a monolingual similarity-of-meaning assessment. In Direct Assessment, humans assess the quality of a given MT output translation by comparison with a reference translation (as opposed to the source and reference). DA is the new standard used in the WMT News Translation Task evaluation, requiring only monolingual evaluators. For system-level evaluation, they use the Pearson correlation r of automatic metrics with DA scores. I have not seen enough comparison data yet to have an opinion on its efficacy.
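The system-level meta-evaluation step is mechanically simple: each automatic metric produces one score per MT system, and that list is correlated against the systems' average human DA scores. A minimal sketch (all the BLEU and DA numbers below are made up for illustration):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: one corpus BLEU score and one average human DA
# score per MT system being compared.
bleu_scores = [28.1, 31.4, 25.9, 33.0]
da_scores   = [68.2, 71.5, 70.1, 74.8]
print(round(pearson_r(bleu_scores, da_scores), 3))
```

A high r means the metric's ranking of systems tracks the human ranking; note that with only a handful of systems, as here, the correlation estimate itself is quite noisy.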


The Most Popular Blog Posts of 2019 post also had an unusually high traffic flow and would rank in the Top 5




I wish you all a Happy, Prosperous and Healthy New Year

Tuesday, December 22, 2020

American Machine Translation Association (AMTA2020) Conference Highlights

This post is, to a great extent, a belated summary of highlights from the AMTA2020 virtual conference, which I felt was one of the best (in terms of signal-to-noise ratio) held in the last ten years. Of course, I can only speak to the sessions I attended, and I am aware that there were many technical sessions I did not attend that were also widely appreciated by others. This post is also a way to summarize many of the key challenges and issues facing MT today, and is thus a good way to review the SOTA of the technology as this less-than-wonderful year ends.


The State of Neural MT

I think that 2020 is the year that Neural MT became just MT. Just regular MT. It is superfluous to add "neural" anymore because most of the MT discussions and applications you see today are NMT-based, and saying so would be like saying Car Transportation. It might still be useful to say SMT or RBMT to point out use that is not a current mainstream approach, but it is less necessary to say neural MT anymore. While NMT was indeed a big or even HUGE leap forward, we have reached a phase where much of the best research and discussion is focused on superior implementation and application of NMT, rather than simply using NMT. There are many open-source NMT toolkits available, and it is clearly the preferred MT methodology in use today, even though neither SMT nor RBMT is completely dead. And some still argue that these older approaches are better for certain specialized kinds of problems.

However, while NMT is a significant step forward in improving generic MT output quality, there are still many challenges and hurdles ahead. Getting back to AMTA2020, one of the sessions (C3) talked specifically about the most problematic NMT errors across many different language combinations and provided a useful snapshot of the situation. The chart below is a summary of the most common kinds of translation errors found across many different language combinations. We see that while the overall level of MT output acceptability has increased, many of the same challenges remain. Semantic confusion around word ambiguity, unknown words, and dialectical variants continues to be challenging. NMT has a particular problem with phantom or hallucinated text: it sometimes simply creates content that is not in the source. But we should be clear that the proportion of useful and acceptable translations continues to climb and is strikingly "good" in some cases.


A concern for all the large public MT portals, which translate billions of words an hour, is the continuing possibility of catastrophic errors that are offensive, insensitive, or just plain outlandish. Some of these are listed below from a presentation made by Mona Diab, a GWU/Facebook researcher, who presented a very interesting overview of something she calls "faithful" translation. 


This is a particularly urgent issue for those platforms like Facebook and Twitter that face the huge volumes of social media commentary on political and social events. Social media, in case you did not know, is increasingly the way that much of the world consumes news.

The following slides show what Mona was pointing to when she talked about "Faithfulness" and I recommend that readers look at her whole presentation which is available here. MT on social media can be quite problematic as shown in the next chart.



She thus urged the community to find better ways to assess and determine acceptable or accurate MT quality, especially in high-volume social media translation settings. Her presentation provided many examples of problems and described the need for a more semantically accurate measure that she calls "Faithful MT". Social media is an increasingly important target of translation focus, and we have seen the huge impact that social media commentary can have on consumer buying behavior, political opinion, brand identity, brand reputation, and even political power outcomes. A modern enterprise, commercial or governmental, that does not monitor ongoing relevant social media feedback is walking blind and is likely to face unfortunate consequences from this lack of foresight. 





Mona Diab's full presentation is available here and is worth a look, as I think it defines several key challenges for the largest users of MT in the world. She mentioned that Facebook processes 20B+ translation transactions per day, which could mean anywhere from 100 billion to 2 trillion words a day. This volume will only increase as more of the world comes online and could double in as little as a year.

Another keynote that was noteworthy (for me) was the presentation by Colin Cherry of Google Research: "Research stories from Google Translate’s Transcribe Mode". He found a way to present his research in a truly compelling way, in a style that was both engaging and accessible. The slides are available here but, without his talk track, they are barely a shadow of the presentation I watched. Hopefully, AMTA will make the video available. 

Chris Wendt from Microsoft also provided insight into the enterprise use of MT and showed some interesting data in his keynote.  He also gave some examples of catastrophic errors and had this slide to summarize the issues.

He pointed out that in some language combinations it is possible to use "Raw MT" across many more types of content than in others, because these combinations tend to perform better across much more content variation. I am surprised by how many LSPs still overlook this basic fact: not all MT language combinations are equivalent. 

He showed a ranking of the "best" language-pair combinations (as in, closest to human references) that is probably most meaningful for localization users, but it could also be useful to others who want a rough understanding of MT system quality ratings by language.



Normally, vendor presentations at conferences have too much sales emphasis and too little information content to be interesting. I was thus surprised by the Intento and Systran presentations, which were both content-rich, educational, and informative. A welcome contrast to the mostly lame product descriptions we normally see.  

While MT technology presentations focused on large-scale use cases (i.e., NOT localization) are making great strides, my feeling is that the localization presentations were inching along, with post-editing management, complicated data analysis, and review-tool themes that really have not changed very much in ten years. A quick look at the most-read blog posts on eMpTy Pages also confirmed that a post I wrote in 2012 on Post-Editing Compensation made it into the Top 10 list for 2020. Localization use cases still struggle to eke out value from MT technology because it is simply not yet equivalent to human translation. There are clear use cases for both approaches (MT and HT), and it has always been my feeling that localization is a somewhat iffy use case that can only work for the most skilled practitioners who make long-term investments in building suitable translation production pipelines. If I were able to find a cooperative MT developer team, I think I could architect a better production flow and man-machine engagement model than much of what I have seen over the years. The reality of MT use in localization still has too much emphasis on the wrong syllable. I hope I get the chance to do this in 2021.


MT Quality Assessment

BLEU is still widely used today to describe and document progress in MT model development, even though it is widely understood to be inadequate for measuring quality changes in NMT models in particular. However, BLEU provides long-term progress milestones for developers, and I think the use of BLEU scores in that context still has some validity, assuming proper experimental rigor is followed. BLEU works for this task because it is relatively easy to set up and implement. 
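For readers who have not looked under the hood, here is a minimal sketch of corpus-level BLEU: the geometric mean of clipped 1- to 4-gram precisions, multiplied by a brevity penalty. It omits the smoothing and standardized tokenization a real implementation needs (this is one reason cross-paper BLEU comparisons are fragile; tools like sacrebleu exist precisely to standardize those choices):

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus BLEU (0-100): geometric mean of clipped n-gram precisions
    times a brevity penalty. Inputs are parallel lists of token lists;
    single reference per segment, no smoothing."""
    matches = [0] * max_n   # clipped n-gram matches, per order n
    totals = [0] * max_n    # candidate n-grams produced, per order n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            matches[n - 1] += sum(min(c, r[g]) for g, c in h.items())
            totals[n - 1] += max(len(hyp) - n + 1, 0)
    if 0 in matches:        # unsmoothed: any empty precision zeroes the score
        return 0.0
    precision = sum(log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if hyp_len > ref_len else exp(1 - ref_len / hyp_len)
    return 100 * bp * exp(precision)
```

Even this toy version makes the metric's limits visible: it rewards surface n-gram overlap with one reference, so a fluent, adequate NMT output phrased differently from the reference is penalized exactly as if it were wrong.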

The use of BLEU to compare MT systems from different vendors, using public-domain test sets, is more problematic: my feeling is that it will lead to erroneous conclusions and sub-optimal system selection. To put it bluntly, it is a bullshit exercise that appears scientific and structured but is laden with deeply flawed assumptions and ignorance. 

None of the allegedly "superior metric" replacements have really taken root, because they simply don't add enough accuracy or precision to warrant the overhead, extra effort, and experimental befuddlement. Human evaluation feedback is now a core requirement for any serious MT system development, because it is still the best way to accurately understand relative MT system quality and to measure development progress for specific use scenarios. The SOTA today is still multiple automated metrics plus human assessments when accuracy is a concern. As of this writing, I think a comparison system that spits out relative rankings of MT systems without meaningful human oversight is suspect and should be quickly dismissed. 

However, the need for better metrics to help both developers and users quickly understand the relative strengths and weaknesses of multiple potential MT systems is even more urgent today. If a developer has 2 or 3 close variants of a potential production MT system, how do they tell which is the best one to commit to?  The need to understand how a production system improves or degrades over time is also very valuable. 

Unbabel presented their new metric, COMET, and provided some initial results on its suitability and its ability to solve the challenge described above, i.e., to successfully rank several high-performing MT systems. 





The Unbabel team seems very enthusiastic about COMET's potential to:
  • Determine the ongoing improvement or degradation of a production MT system
  • Differentiate between multiple high-performing systems with better accuracy than has been possible with other metrics
Both of these can apparently be done with less human involvement, and better automated feedback on these two issues is clearly of high value. It is not clear to me how much overhead is involved in using the metric, as we do not have much experience with it outside of Unbabel. It is being released as open source and may attract a broader user community, at least among sophisticated LSPs, who stand to gain the most from a better understanding of the value of retraining and from better comparisons of multiple MT systems than are possible with BLEU, chrF, and hLepor. I hope to dig deeper into COMET in 2021.


The Translator-Computer Interface

One of the most interesting presentations I saw at AMTA was by Nico Herbig from DFKI on what he called a multi-modal interface for post-editing. I felt it was striking enough that I asked Nico to contribute a guest post to this blog. That post is now the most popular one of the last two months and can be read at the link below. His work has also been covered in more detail and discussion by Jost Zetzsche in his newsletter.





While Nico focused on the value of the multi-modal system to the post-editing task at AMTA, it has great value and applicability for any translation-related task. A few things stand out for me about this initiative:
  • It allows a much richer and more flexible interaction with the computer for any translation-related task.
  • It naturally belongs in the cloud and is likely to offer the most powerful user assistance experience in the cloud setting.
  • It can be connected to many translation assistance capabilities like Linguee, dictionaries, terminology/synonym/antonym databases, MT, and other translator reference aids to transform the current TM-focused desktop. 
  • It creates the possibility of a much more interactive and translator-driven interaction/adaptation model for next-generation MT systems that can learn with each interaction.


I wish you all a Happy Holiday season and wish a Happy, Healthy, and Prosperous New Year.
