Wednesday, January 12, 2022

Most Popular Blog Posts of 2021

Here is the list of most popular blog posts in 2021. The only theme that I can discern in the list is that there is a greater focus on better understanding what is real and viable from a technology viewpoint and looking beyond the hype. The secondary theme is more exploration into the "how" to do it right which is all about better human-machine collaboration and creating a more robust assistant role for MT.

I have noticed that these lists tend to favor the posts that were published earliest in the year and in 2020 the post would have easily been the top post had it been published earlier in the year. 

The most popular post for the year was:

1. The Quest for Human Parity Machine Translation 


Over the last few years, especially since the emergence of Neural MT, we have seen several claims of MT systems having reached human parity. Anyone can show that these claims are not true within minutes, simply by submitting a few sentences to the system in question. The basis of the claim is typically the performance of MT systems on certain measured metrics (scores) on tiny test sets. NLG leaderboards have the same problem, with over-exuberant claims of having reached human parity. Thus, the extrapolations of achieving human-level performance are extravagant, to put it mildly. As soon as you move away from data that is typical of the training data, you notice how brittle and fragile these systems really are.

MT developers should refrain from making claims of achieving human parity until there is clear evidence that this is happening at scale. Most current claims on achieving parity are based on laughably small samples of 100 or 200 sentences. I think it would be useful to the user community-at-large that MT developers refrain from making these claims until they can show all of the following:
    • 90% or more of the translations in a large sample (>100,000 or even 1M sentences) are accurate and fluent and truly look like they were translated by a competent human
    • The system catches obvious errors in the source and possibly even corrects these before attempting to translate
    • The system handles variations in the source with consistency and dexterity
    • The system has at least some nominal amount of contextual referential capability
Note that these are things we would expect without question from an average translator. So why not from the super-duper AI machine? 
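
To make the sample-size point concrete, here is a minimal sketch of how much uncertainty surrounds a "90% human-quality" result at different sample sizes. It uses the standard Wilson score interval for a proportion. The counts (90 of 100, 90,000 of 100,000) are hypothetical illustrations, not results from any real evaluation.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for an observed proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical outcome: 90% of sentences judged human-quality,
# once on a 100-sentence sample and once on a 100,000-sentence sample.
small = wilson_interval(90, 100)
large = wilson_interval(90_000, 100_000)

print(f"n=100:     {small[0]:.3f} to {small[1]:.3f}")
print(f"n=100,000: {large[0]:.4f} to {large[1]:.4f}")
```

With 100 sentences the plausible range spans more than ten percentage points, while with 100,000 sentences it narrows to a fraction of a point. This only addresses sampling noise; it says nothing about whether the sample is representative of real production content.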

 



The second most popular post was a guest post by @VeredShwartz on the challenge of building AI that has common sense.

Common sense has been called the “dark matter of AI”: both essential and frustratingly elusive. That’s because common sense consists of implicit information, the broad (and broadly shared) set of unwritten assumptions and rules of thumb that humans automatically use to make sense of the world. Critics of over-exuberant AI claims frequently point out that two-year-old children have more common sense than existing deep-learning-based AI systems, whose "understanding" is often quite brittle and easily distracted and deranged.

Common sense is easier to detect than to define. The implicit nature of most common-sense knowledge makes it difficult and tedious to represent explicitly. 

"The great irony of common sense, and indeed AI itself, is that it is stuff that pretty much everybody knows, yet nobody seems to know what exactly it is or how to build machines that possess it," said Gary Marcus, CEO and founder of Robust.AI. "Solving this problem is, we would argue, the single most important step towards taking AI to the next level. Common sense is a critical component to building AIs that can understand what they read; that can control robots that can operate usefully and safely in the human environment; that can interact with human users in reasonable ways. Common sense is not just the hardest problem for AI; in the long run, it's also the most important problem."


The third most popular post was based on some research I did on ModernMT, which impressed me enough that I decided to join the company that built it. This decision was further validated when they announced that ModernMT was the heart of the "translation engine" that Airbnb uses to power the translation of user-generated content (UGC) and ensure an optimal global customer experience (CX) for all their customers. This is done by translating billions of words a month through a continuously improving MT infrastructure, and it is quite likely to be one of the largest deployments of MT technology in the world for UGC by any global enterprise.

3. ModernMT: A Closer Look At An Emerging Enterprise MT Powerhouse

The ModernMT system was used heavily by translators who worked for Translated, and the MT systems were continually adapted and modified to meet the needs of production translators. This is a central design intention and it is important not to gloss over this, as this is the ONLY MT initiative I know of where Translator Acceptance is used as the primary criterion, on an ongoing basis, in determining whether MT should be used for production work or not. The operations managers will simply not use MT if it does not add value to the production process or if it causes translator discontent.

The long-term collaboration between translators and MT developers, and the resulting system and process modifications, are the key reasons why ModernMT does so well in generic MT system comparisons by independent testers, and this is especially pronounced in adapted/customized MT comparisons.

Over the years the ModernMT product evolution has been driven by changes to identify and reduce post-editing effort rather than optimizing BLEU scores as most others have done. This makes it the best system available for translators in my opinion as all the heavy lifting for customization is done in the background, seamlessly and transparently.
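
As a rough illustration of what "post-editing effort" can mean in measurable terms, the sketch below computes the word-level edit distance between raw MT output and the translator's post-edited version, normalized by the length of the post-edited text. This is a simplified, TER-like proxy chosen for illustration (it ignores phrase shifts and editing time); it is an assumption of this sketch and not a description of how ModernMT actually measures effort. The function names are hypothetical.

```python
def word_edit_distance(hyp: list[str], ref: list[str]) -> int:
    """Levenshtein distance over words (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a word from hyp
                            curr[j - 1] + 1,      # insert a word into hyp
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


def post_edit_effort(mt_output: str, post_edited: str) -> float:
    """Word edits per post-edited word; 0.0 means the MT output was left untouched."""
    mt = mt_output.split()
    pe = post_edited.split()
    if not pe:
        return 0.0 if not mt else 1.0
    return word_edit_distance(mt, pe) / len(pe)


# Hypothetical example: the translator inserted one missing word.
effort = post_edit_effort("the cat sat on mat", "the cat sat on the mat")
print(f"{effort:.3f}")
```

Unlike a BLEU score against a fixed reference, a number like this is derived from what the translator actually had to change, which is closer to the production cost that matters.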

ModernMT has reached this point with very little investment in sales and marketing infrastructure. As this builds out and expands, I will be surprised if ModernMT does not continue to grow its enterprise presence, as enterprise buyers begin to understand that a tightly integrated human-machine collaborative platform that is continuously learning is key to creating successful MT outcomes.

This was followed by:

4. Building Equity In The Translation Workflow With Blockchain


and an interview with ProZ which was well received and which continues to regularly generate feedback from readers. It includes links to the original podcast.


Midway through the year, I started engaging with ModernMT and Translated in a much more substantial way, and thus there was a continuity break and publishing hiatus for a while. 

The posts since my engagement with Translated are influenced by my increasing exposure to ModernMT, but they are still honest opinions that I would stand by. I expect that these posts will become much more popular as they have time to circulate.

The most popular in 2021 are:



Ideally, the “best” MT system would be identified by a team of competent translators who would run a diverse range of relevant content through the MT system after establishing a structured and repeatable evaluation process. 

This is slow, expensive, and difficult, even if only a small sample of 250 sentences is evaluated.

Thus, automated measurements that attempt to score translation adequacy, fluency, precision, and recall have to be used. They attempt to do what is best done by competent humans. This is often done by comparing MT output to a human translation in what is called a Reference Test set. These reference sets cannot provide all the possible ways a source sentence could be correctly translated. Thus, these scoring methodologies are always an approximation of what a competent human assessment would determine, and can sometimes be wrong or misleading. Small differences in scores are particularly meaningless.
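
A toy example shows why a single reference translation can mislead. The sketch below computes clipped unigram precision, which is one ingredient of BLEU. It is deliberately simplified and is not full BLEU: there is no brevity penalty and no combination of higher-order n-grams. The example sentences are invented for illustration.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Fraction of candidate n-grams found in the reference, with counts clipped."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


reference = "the meeting was postponed until next week"
paraphrase = "they delayed the meeting to the following week"  # correct meaning
near_copy = "the meeting was postponed until next month"       # wrong meaning

print(f"paraphrase: {clipped_precision(paraphrase, reference):.3f}")
print(f"near copy:  {clipped_precision(near_copy, reference):.3f}")
```

The near copy, which gets the date wrong, scores far higher than the paraphrase, which a competent human would accept. Real metrics are more sophisticated, but they inherit the same basic limitation of scoring against a finite set of references.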

Thus, identifying the “best MT” solution is not easily done. Consider the cost of evaluating ten different systems on twenty different language combinations with a human team versus automated scores. Even though it is possible to rank MT systems based on scores like BLEU and hLEPOR, they do not represent production performance. The scores are a snapshot of an ever-changing scene. If you change the angle or the focus, the results change.

It has recently become common practice to use "MT routers" that select the "best" MT system for you, but I maintain that this is a practice that will often lead to sub-optimal choices, as your rankings and selections are only as good as your test set selections, and you are always looking at old, out-of-date data. MT systems are always evolving, and how quickly and easily systems learn to do what you focus on is much more relevant than a score from an old ranking.


The final post in the popularity list for 2021 is this one:

8. The Human-In-The-Loop Driving MT Progress


I expect this post will be an evergreen post since the issues raised are of long-term if not perennial interest. As we see Tesla Self Driving, Alexa, GPT-3, and the other AI fads of the day regularly fumble and fall, more and more people realize that AI can be a super assistant if properly built, but that it is wise and even imperative to keep a human-in-the-loop to keep the AI from doing dangerous or stupid things.

Neural MT “learns to translate” by looking closely (aka "training") at large datasets of human-translated data. Deep learning is self-education for machines; you feed the system huge amounts of data, and it begins to discern complex patterns within the data.

But despite the occasional ability to produce human-like outputs, ML algorithms are at their core only complex mathematical functions that map observations to outcomes. They can forecast patterns that they have previously seen and explicitly learned from. Therefore, they’re only as good as the data they train on and start to break down as real-world data starts to deviate from examples seen during training.

In most cases, the AI learning process happens upfront and only takes place in the development phase. The model that is developed is then brought onto the market as a finished program. Continuous “learning” is usually not planned, and it does not always happen after a model is put into production use. This is also true of most public MT systems. While these systems are updated periodically, they are not easily able to learn and adapt to new, ever-changing production requirements.

With language translation, the critical training data is translation memory (TM). However, the truth is that there is no existing training data set (TM) that is so perfect, complete, and comprehensive as to produce an algorithm that consistently produces perfect translations.

Human-in-the-loop aims to achieve what neither a human being nor a machine can achieve on their own. When a machine isn’t able to solve a problem, humans step in and intervene. This process results in the creation of a continuous feedback loop that produces output that is useful to the humans using the system.

With constant feedback, the algorithm learns and produces better results over time. Active and continuous feedback to improve existing learning and create new learning is a key element of this approach.

As Rodney Brooks, the co-founder of iRobot, said in a post entitled "An Inconvenient Truth About AI":

"Just about every successful deployment of AI has either one of two expedients: It has a person somewhere in the loop, or the cost of failure, should the system blunder, is very low."

In the translation context, with ModernMT, this means that the system is designed from the ground up to actively receive feedback and rapidly incorporate this into the existing model on a daily or even hourly basis.
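
The shape of such a feedback loop can be sketched in a few lines. The class below is a toy stand-in and not ModernMT's API or architecture: it simply remembers human corrections and reuses them on exact repeats of the same source sentence. A real adaptive MT system would generalize from corrections to similar, unseen sentences; this sketch only shows where the human correction enters the cycle. All names here (`AdaptiveTranslator`, `feedback`, `draft_engine`) are hypothetical.

```python
from typing import Callable


class AdaptiveTranslator:
    """Toy human-in-the-loop wrapper around any translation function."""

    def __init__(self, base_translate: Callable[[str], str]) -> None:
        self._base_translate = base_translate  # hypothetical underlying MT engine
        self._corrections: dict[str, str] = {}

    def translate(self, source: str) -> str:
        # Prefer a human-approved translation if one has been recorded.
        if source in self._corrections:
            return self._corrections[source]
        return self._base_translate(source)

    def feedback(self, source: str, corrected: str) -> None:
        # Incorporate the translator's correction immediately.
        self._corrections[source] = corrected


def draft_engine(source: str) -> str:
    # Placeholder for a real MT engine; it just labels the text as a draft.
    return f"[draft] {source}"


mt = AdaptiveTranslator(draft_engine)
before = mt.translate("Buongiorno a tutti")
mt.feedback("Buongiorno a tutti", "Good morning, everyone")
after = mt.translate("Buongiorno a tutti")
print(before)
print(after)
```

The point of the design is that the correction takes effect on the very next request, rather than waiting for a periodic retraining cycle.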

AI lacks a theory of mind, cognition, common sense and causal reasoning, extrapolation capabilities, and a physical body collecting multi-sensory contextual data, and so it is still extremely far from being “better than us” at almost anything slightly complex or general.

This also suggests that humans will remain at the center of complex, knowledge-based AI applications even though the way humans work will continue to change. The future is more likely to be about how to make AI be a useful assistant than it is about replacing humans. 


For those who wonder which post has gotten the most readership over the 12 years that this blog has been in place, the answer is a post I wrote on post-editor compensation in 2012. This is unfortunate, as it suggests that this is an issue that people are still grappling with in 2021 and that it remains unresolved for many. It is still being read thousands of times a year, if Google Analytics is to be believed:

Exploring Issues Related to Post-Editing MT Compensation



My final post of 2021, which I wrote during downtime over the holiday season, has little or nothing to do with MT, but I somehow managed to link it to some musings on the limits of AI and machine learning. It is the post that I had the most fun writing and, based on initial feedback, one that people actually enjoyed reading. I would not be surprised if it is an evergreen post, i.e. one that continues to be popular over many years. I recommend it, as it is about human connection and is something that could be shared with anyone, even those who have little interest in AI, MT, or translation. It is primarily about music, which many have said is the universal language, and about how music connects us to feeling and emotion where language is unnecessary:

The Human Space Beyond Language

  

Peace.


Wishing you a wonderful and successful New Year.