Here is the list of the most popular blog posts of 2021. The only theme I can discern across the list is a greater focus on understanding what is real and viable from a technology viewpoint, looking beyond the hype. A secondary theme is more exploration of how to do it right, which is all about better human-machine collaboration and creating a more robust assistant role for MT.
I have noticed that these lists tend to favor posts published earliest in the year; in 2020, one late-year post would easily have topped the list had it been published earlier in the year.
The most popular post for the year was:
MT developers should refrain from making claims of achieving human parity until there is clear evidence that this is happening at scale. Most current claims of parity are based on laughably small samples of 100 or 200 sentences. I think it would be useful to the user community at large for MT developers to refrain from making these claims until they can show all of the following:
Note that these are things we would expect without question from an average translator. So why not from the super-duper AI machine?
Common sense has been called the “dark matter of AI” — both essential and frustratingly elusive. That’s because common sense consists of implicit information — the broad (and broadly shared) set of unwritten assumptions and rules of thumb that humans automatically use to make sense of the world. Critics of over-exuberant AI claims frequently point out that two-year-old children have more common sense than existing deep-learning-based AI systems, whose "understanding" is often quite brittle and easily distracted and derailed.
Common sense is easier to detect than to define. The implicit nature of most common-sense knowledge makes it difficult and tedious to represent explicitly.
"The great irony of common sense—and indeed AI itself—is that it is stuff that pretty much everybody knows, yet nobody seems to know what exactly it is or how to build machines that possess it," said Gary Marcus, CEO and founder of Robust.AI. "Solving this problem is, we would argue, the single most important step towards taking AI to the next level. Common sense is a critical component to building AIs that can understand what they read; that can control robots that can operate usefully and safely in the human environment; that can interact with human users in reasonable ways. Common sense is not just the hardest problem for AI; in the long run, it's also the most important problem."
The third most popular post was based on some research I did on ModernMT which impressed me enough that I decided to join the company that built it. This decision was further validated when they announced that they were the heart of the "translation engine" that Airbnb uses to power UGC translation and ensure an optimal global CX for all their customers. This is done by translating billions of words a month through a continuously improving MT infrastructure and is quite likely to be one of the largest deployments of MT technology in the world for UGC by any global enterprise.
The ModernMT system was used heavily by translators who worked for Translated, and the MT systems were continually adapted and modified to meet the needs of production translators. This is a central design intention, and it is important not to gloss over it: this is the ONLY MT initiative I know of where translator acceptance is used as the primary, ongoing criterion in determining whether MT should be used for production work or not. The operations managers will simply not use MT if it does not add value to the production process or causes translator discontent.
The long-term collaboration between translators and MT developers, and the resulting system and process modifications, are the key reasons why ModernMT does so well in generic MT system comparisons by independent testers; the advantage is especially pronounced in adapted/customized MT comparisons.
Over the years, ModernMT's product evolution has been driven by efforts to identify and reduce post-editing effort rather than to optimize BLEU scores, as most others have done. This makes it, in my opinion, the best system available for translators, as all the heavy lifting for customization is done in the background, seamlessly and transparently.
ModernMT has reached this point with very little investment in sales and marketing infrastructure. As that infrastructure builds out, I will be surprised if ModernMT does not continue to grow its enterprise presence, as enterprise buyers begin to understand that a tightly integrated, continuously learning man-machine collaborative platform is key to creating successful MT outcomes.
This was followed by:
Ideally, the “best” MT system would be identified by a team of competent translators who would run a diverse range of relevant content through the MT system after establishing a structured and repeatable evaluation process.
This is slow, expensive, and difficult, even if only a small sample of 250 sentences is evaluated.
Thus, automated measurements that attempt to score translation adequacy, fluency, precision, and recall have to be used. These attempt to approximate what is best done by competent humans, typically by comparing MT output to a human translation in what is called a reference test set. Reference sets cannot cover all the possible ways a source sentence could be correctly translated, so these scoring methodologies are always an approximation of what a competent human assessment would determine, and they can sometimes be wrong or misleading. Small differences in scores are particularly meaningless.
Thus, identifying the “best MT” solution is not easily done. Consider the cost of evaluating ten different systems on twenty different language combinations with a human team versus automated scores. Even though it is possible to rank MT systems based on scores like BLEU and hLepor, they do not represent production performance. The scores are a snapshot of an ever-changing scene. If you change the angle or the focus the results would change.
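The limitation of reference-based scoring described above can be sketched with a toy example. The function below computes a clipped n-gram precision in the spirit of BLEU (this is a simplified illustration, not the full BLEU formula, which also combines multiple n-gram orders and a brevity penalty). Note how the same perfectly acceptable MT output scores very differently depending on which valid human reference it happens to be compared against:

```python
from collections import Counter

def ngram_precision(hypothesis: str, reference: str, n: int = 2) -> float:
    """Fraction of hypothesis n-grams that also appear in the reference,
    with counts clipped as in BLEU's modified precision."""
    hyp, ref = hypothesis.split(), reference.split()
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum(min(count, ref_ngrams[g]) for g, count in hyp_ngrams.items())
    total = sum(hyp_ngrams.values())
    return overlap / total if total else 0.0

mt_output = "the committee approved the budget on friday"
ref_a = "the committee approved the budget on friday"  # one valid translation
ref_b = "on friday the panel passed the budget"        # another valid translation

print(ngram_precision(mt_output, ref_a))  # 1.0 - exact match with this reference
print(ngram_precision(mt_output, ref_b))  # ~0.33 - same output, different valid reference
```

The output did not get worse between the two measurements; only the reference changed. This is exactly why automated scores are approximations of human judgment and why small score differences between systems are not meaningful.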
It has recently become common practice to use "MT routers" that select the "best" MT system for you. I maintain that this practice will often lead to sub-optimal choices: your rankings and selections are only as good as your test set, and you are always looking at old, out-of-date data. MT systems are always evolving, and how quickly and easily a system learns to do what you focus on is much more relevant than a score from an old ranking.
The final post in the popularity list for 2021 is this one:
Neural MT “learns to translate” by looking closely (aka "training") at large datasets of human-translated data. Deep learning is self-education for machines; you feed the system huge amounts of data, and it begins to discern complex patterns within the data.
But despite the occasional ability to produce human-like outputs, ML algorithms are at their core only complex mathematical functions that map observations to outcomes. They can forecast patterns that they have previously seen and explicitly learned from. Therefore, they’re only as good as the data they train on and start to break down as real-world data starts to deviate from examples seen during training.
In most cases, the AI learning process happens upfront, during the development phase alone. The model that is developed is then brought to market as a finished program. Continuous “learning” is generally neither planned for nor does it happen after a model is put into production use. This is also true of most public MT systems: while these systems are updated periodically, they are not easily able to learn and adapt to new, ever-changing production requirements.
With language translation, the critical training data is translation memory (TM). However, the truth is that no existing training data set is so perfect, complete, and comprehensive as to produce an algorithm that consistently produces perfect translations.
Human-in-the-loop aims to achieve what neither a human being nor a machine can achieve on their own. When a machine isn’t able to solve a problem, humans step in and intervene. This process results in the creation of a continuous feedback loop that produces output that is useful to the humans using the system.
With constant feedback, the algorithm learns and produces better results over time. Active and continuous feedback to improve existing learning and create new learning is a key element of this approach.
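The feedback loop described above can be sketched in a few lines. This is a deliberately naive toy (a word-for-word phrase table, which is nothing like how a neural system such as ModernMT actually works internally); the class name and dictionary-based "model" are my own illustrative inventions. The point it shows is the loop itself: the human corrects an output, the correction is folded back into the model, and the very next request benefits from it:

```python
class AdaptiveTranslator:
    """Toy human-in-the-loop sketch: the 'model' is a phrase table
    that grows as human corrections are fed back into it."""

    def __init__(self, phrase_table: dict):
        self.phrase_table = dict(phrase_table)  # baseline "model"

    def translate(self, source: str) -> str:
        # Naive word-by-word lookup; unknown words pass through untranslated.
        return " ".join(self.phrase_table.get(w, w) for w in source.split())

    def feedback(self, source: str, corrected: str) -> None:
        # A human correction is folded back into the model immediately,
        # so subsequent requests benefit (the continuous feedback loop).
        for src, tgt in zip(source.split(), corrected.split()):
            self.phrase_table[src] = tgt

mt = AdaptiveTranslator({"hallo": "hello", "welt": "world"})
print(mt.translate("hallo welt freund"))   # "hello world freund" - a gap the model can't fill
mt.feedback("hallo welt freund", "hello world friend")
print(mt.translate("hallo welt freund"))   # "hello world friend" - learned from the correction
```

A static system, by contrast, would return the same flawed output forever; the human effort spent correcting it would be thrown away each time.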
"Just about every successful deployment of AI has either one of two expedients: It has a person somewhere in the loop, or the cost of failure, should the system blunder, is very low."
In the translation context, with ModernMT, this means that the system is designed from the ground up to actively receive feedback and rapidly incorporate this into the existing model on a daily or even hourly basis.
AI lacks a theory of mind, cognition, common sense and causal reasoning, extrapolation capabilities, and a physical body collecting multi-sensory contextual data, and so it is still extremely far from being “better than us” at almost anything slightly complex or general.
This also suggests that humans will remain at the center of complex, knowledge-based AI applications, even though the way humans work will continue to change. The future is more likely to be about how to make AI a useful assistant than about replacing humans.
For those who wonder what post has gotten the most readership over the 12 years this blog has been in place, the answer is a post I wrote on post-editor compensation in 2012. This is unfortunate, as it suggests that this is an issue people are still grappling with in 2021 and that it remains unresolved for many. It is still read thousands of times a year, if Google Analytics is to be believed:
Wishing you a wonderful and successful New Year.