Tuesday, January 19, 2021

Adding Commonsense Reasoning to Natural Language Processing Applications

This article is reprinted with permission from the original poster @VeredShwartz . This post might be challenging reading for the usual reader of this blog, but I think that even skimming through this might be useful to many to get a sense for possibly the most formidable challenge in the artificial intelligence community: building common sense capabilities into existing and emerging AI deployments.  

Commonsense knowledge consists of facts about the everyday world, that all humans are expected to know. Commonsense knowledge helps to solve problems in the face of incomplete information. It is currently considered an unsolved problem in AGI and is a focus of the Allen Institute for Artificial Intelligence which the author is associated with. 

Deep learning is self-education for machines; you feed a machine learning system huge amounts of data, and eventually it begins to discern patterns all by itself.  But despite their remarkable achievements, and occasional ability to produce human-like outputs, machine learning algorithms are at their core complex mathematical functions that map observations to outcomes. Or are able to forecast patterns that they have previously seen and explicitly learned. Therefore, they’re as good as their data and they start to break as the data they face in the world starts to deviate from examples they’ve seen during training. Neural MT is an example, great progress indeed, but far from having solved the translation problem.  

We hear continuously about the relentless "big data" that is driving AI progress, but we are finding more and more cases where the current approach of deep learning and more data is not enough. The path to machine commonsense is unlikely to be brute force training of larger neural networks with deeper layers on more data.  Whilst deep learning excels at pattern recognition, it’s very poor at adapting to changing situations even when small modifications of the original case are encountered, and often has to be re-trained with large amounts of data from scratch. 

"The great irony of common sense—and indeed AI itself—is that it is stuff that pretty much everybody knows, yet nobody seems to know what exactly it is or how to build machines that possess it," said Gary Marcus, CEO and founder of Robust.AI. "Solving this problem is, we would argue, the single most important step towards taking AI to the next level. Common sense is a critical component to building AIs that can understand what they read; that can control robots that can operate usefully and safely in the human environment; that can interact with human users in reasonable ways. Common sense is not just the hardest problem for AI; in the long run, it's also the most important problem." 

Common sense has been called the “dark matter of AI” — both essential and frustratingly elusive. That’s because common sense consists of implicit information — the broad (and broadly shared) set of unwritten assumptions and rules of thumb that humans automatically use to make sense of the world. Critics of over-exhuberant AI claims frequently point out that two-year children have more common sense than existing deep-learning based AI systems whose "understanding" is often quite brittle and easily distracted and deranged.

Common sense is easier to detect than to define. The implicit nature of most common-sense knowledge makes it difficult and tedious to represent explicitly. 

DARPA, the US defense department’s research agency, has also recognized the absence of common sense as being an important issue. They recently launched a project called Machine Common Sense. As they say:“ The absence of common sense prevents intelligent systems from understanding their world, behaving reasonably in unforeseen situations, communicating naturally with people, and learning from new experiences. Its absence is considered the most significant barrier between the narrowly focused AI applications of today and the more general, human-like AI systems hoped for in the future”. 

Gary Marcus suggests combining traditional AI approaches together with deep learning as a way forward. 

"First, classical AI actually IS a framework for building cognitive models of the world that you can then make inferences over. The second thing is, classical AI is perfectly comfortable with rules. It’s a strange sociology right now in deep learning where people want to avoid rules. They want to do everything with neural networks, and do nothing with anything that looks like classical programming. But there are problems that are routinely solved this way that nobody pays attention to, like making your route on Google maps.

We actually need both approaches. The machine-learning stuff is pretty good at learning from data, but it’s very poor at representing the kind of abstraction that computer programs represent. Classical AI is pretty good at abstraction, but it all has to be hand-coded, and there is too much knowledge in the world to manually input everything. So it seems evident that what we want is some kind of synthesis that blends these approaches."

Yejin Choi and her collaborators at the Allen Institute have united traditional symbolic AI approaches with newer machine learning approaches in an attempt to address the commonsense challenge. One initiative, COMET (short for “commonsense transformers”) extends traditional symbolic reasoning with the latest advances in neural language modeling — a kind of deep learning that aims to imbue computers with a statistical “understanding” of written language. COMET is a  fusion of symbolic reasoning with a neural network and tries to solve the coverage and brittleness problems, of purely DL-approaches, at the same time.  COMET works by reimagining common-sense reasoning as a process of generating plausible (if imperfect) responses to novel input, rather than making airtight deductions by consulting a vast encyclopedia-like database.

Gary Marcus, a critic of the deep-learning fanboys and girls, often points out DL-only shortcomings to challenge the over-exhuberance of these fans. To put progress in AI into a more realistic context he says: “Just because you can build a better ladder doesn’t mean you can build a ladder to the moon.” To him and others, COMET’s approach suffers from a fundamental limitation of deep learning: “statistics ≠ understanding.”

Regardless, Vered presents a comprehensive picture of the many challenges faced and attempts at developing solutions in introducing commonsense to NLP applications in arguably one of the most challenging problems in computing today. I think  her post is a great resource for anybody who wants to quickly get a sense for the issue and the SOTA.


Commonsense Reasoning for Natural Language Processing

This long-overdue blog post is based on the Commonsense Tutorial taught by Maarten Sap, Antoine Bosselut, Yejin Choi, Dan Roth, and myself at ACL 2020. Credit for much of the content goes to the co-instructors, but any errors are mine. 

In the last 5 years, popular media has made it seem that AI is nearly---if not already---solved by deep learning, with reports on super-human performance on speech recognition, image captioning, and object recognition. The release of Google Translate’s neural models in 2016 reported large performance improvements: “60% reduction in translation errors on several popular language pairs”. But looking under the hood, these numbers seem to be misleading. Neural models find shortcuts to the correct answers through dataset-specific input-output correlations, essentially solving the dataset but not the underlying task. When models are challenged with adversarial out-of-domain examples, they perform poorly. Small unnoticeable noise added to images confuses object recognition models and changes their predictions. Visual question answering models guess the answer based on the frequency of answers for the same type of question in the training set, e.g. replying "2" to any "how many" question. Image captioning models often learn to recognize objects based solely on their typical environment and fail to recognize them outside their typical environment. In NLP, dialogue systems generate highly generic responses such as “I don’t know” even for simple questions. Open-ended generation is prone to repetition. Question answering systems are easily distracted by the addition of an unrelated sentence to the passage. And more. 

Figure 1: adversarial examples in computer vision (left) and natural language processing tasks (right).

Machine learning models today perform reasonably well on perception tasks (image and speech recognition). However, they mostly lack the ability to perform simple intuitive commonsense inferences that humans do in every minute of their waking hours, regarding pre-and post-conditions of events, understanding other people's motivations and intents, mental and emotional states, etc. 

Table of contents: 

  1. What is commonsense? 
  2. Is commonsense knowledge already captured by pre-trained language models? 
  3. How to create benchmarks to measure commonsense reasoning capabilities? 
  4. How to gather and represent machine-readable commonsense knowledge? 
  5. How to enhance neural models for commonsense reasoning tasks with symbolic knowledge? 
  6. Summary
What is commonsense? 
The boundaries of commonsense are quite challenging to define, but we will go with this working definition:
Commonsense is the basic level of practical knowledge and reasoning concerning everyday situations and events that are commonly shared among most people. 
For example, it's common sense that it's OK to keep the closet door open, but not the fridge door, as the food inside might go bad. 

Types of commonsense: 

Commonsense knowledge can be categorized according to types, including but not limited to:
  • Social commonsense: people are capable of making inferences about other people's mental states, e.g. what motivates them, what they are likely to do next, etc. This kind of inference is captured by the ATOMIC knowledge base discussed later. In addition, we each have a set of social norms of accepted behavior, e.g. knowing that “it's impolite to comment on someone's weight”. While these are often implicit in our actions and decisions, machines need to be taught them explicitly

  • Temporal commonsense: natural language rarely communicates explicit temporal information. Instead, it's vague and relies on the commonsense knowledge of the listener. For example, when told that "Dr. Porter is taking a vacation" we can predict that Dr. Porter will not be able to see us soon, as opposed to when "Dr. Porter is taking a walk". This requires knowing the typical duration of "taking a walk" (minutes) and that of "taking a vacation" (days). Other temporal knowledge is typical times, order, frequency, etc. of events which are addressed by the MC-TACO dataset and the TACO-LM time-aware contextual language model. 

  • Physical commonsense: a glass will likely shatter if it falls to the floor, which is a fact most people (and arguably cats) know. Physical commonsense includes knowledge about the physical properties and affordances of everyday objects, as tested in the PIQA dataset.
Commonsense is essential for humans to navigate everyday situations seamlessly and interact with each other in a reasonable and safe way, and for AI to understand human needs and actions better. Yet, endowing machines with such human-like commonsense reasoning capabilities has remained an elusive goal of AI research for decades. Past attempts, in the 1960s and 1970s, resulted in an AI winter, i.e. reduced interest and funding for AI research due to failed over-hyped research directions. In recent years, a new interest in machine commonsense has emerged, with the availability of stronger computing power and huge amounts of data. With that said, the path to machine commonsense is unlikely to be brute force training larger neural networks with deeper layers.   

Is commonsense knowledge already captured by pre-trained language models?

In the last 3 years, language models have been ubiquitous in NLP. Language models are pre-trained once, in a self-supervised manner that requires only a large text corpus. Traditionally, language models are trained to predict the next word in a sentence (top part of Figure 2, in blue), but they can also predict hidden (masked) words in the middle of the sentence, as in Google's BERT model (top part of Figure 2, in orange). This pre-training phase yields a function that gets a sequence of words (sentence, short paragraph) and returns a vector for each word in the sequence. 

Figure 2: Language models pre-training and fine-tuning.

As opposed to word embeddings which are static, language model-based word vectors are dynamic and re-computed for each context. At the very basic level, they assign different vectors to words when they are used in different senses, as in Figure 3. 

Figure 3: Static vs. dynamic word representations.

Do off-the-shelf pre-trained language models already capture commonsense knowledge? 

✅  They are capable to some extent, of filling incomplete commonsense facts or ranking candidate facts. For example, the language model score (≈ statement plausibility) of a fact like "a musician plays a musical instrument" is higher than "a dancer plays a musical instrument". This is a proof that, in addition to lexical and syntactic knowledge, language models capture general knowledge about the world.  

✅  They can, to some extent, associate concepts with their properties. They distinguish concepts 
associated with a given set of properties, i.e. complete a statement such as "       has fur, is big, and has claws, has teeth, is an animal, ..." with bear (just like playing the "20 question game"). They perform better when they are shown encyclopedic properties (e.g. is an animal) as opposed to perceptual properties (e.g. smooth). They can also, pretty successfully, list the properties 
associated with given concepts, e.g. complete the sentence "Everyone knows that a bear has       " with fur, claws, teeth, etc. 

However, knowledge generated from language models is noisy! 

🚫 Several papers have shown that language models are not sensitive to negation, i.e. they consider the negated version of facts ("birds can't fly") as similarly plausible. 

🚫 They are sensitive to phrasing:

🚫  In distributional word vectors, the vector representing a (sub-)word is learned from the contexts in which it appeared, leading to similar representation for semantically-similar words. In language models, the representation of similar contexts are similar, so the model learns which type of word should appear next (or instead of a masked token). This is generally a positive thing, but it sometimes over-generalizes, leading to examples such as this: 

Figure 4: BERT guesses that the masked token should be a color, but fails to predict the correct color. Using the AllenNLP demo

Here, BERT has seen in its training corpus enough sentences of the type "The color of something is [color]" to know to suggest different colors as substitutes for the masked word. Unfortunately, not every color is suitable in every context that calls for a color. BERT likely didn't see enough sentences discussing the color of a dove, thus it defaults to just predicting any color.  

So knowledge in language models is not the most accurate and reliable. Is it still useful?

Yes, to some extent. One way to show it is through evaluation on tasks requiring commonsense knowledge. We will discuss several such tasks, but for now, let's focus on WinoGrande as an example. It is the large-scale version of the Winograd Schema Challenge. Given a sentence with a cloze, the goal is to fill in the blank with a previously mentioned entity or concept, out of two answer choices. For example: 

Because Brett found an internship while in college but Ian was unable to, _____ found a job less quickly after graduation. 
Choices: Brett, Ian

What makes this task especially difficult is that every instance has a twin sentence which is minimally changed such that the correct answer is the other one (for instance, replacing "less quickly" with "more quickly" will change the correct answer from Ian to Brett). 

Language model-based models top the leaderboards of WinoGrande and other commonsense tasks, but since they are trained on task-specific training data, which often contains tens or hundreds of thousands of training examples, it's hard to attribute the success to the knowledge captured in language models from the pre-training step. A better way to estimate it is with zero-shot (unsupervised) models. Typically, the way zero-shot models address multiple-choice tasks is by phrasing a statement from the instance and each answer choice, and computing the language model score as a proxy for plausibility:

PLM(The answer is answer1
PLM(The answer is answer2
PLM(The answer is answerk)

And then predicting the answer choice with the best language model score (highest probability, which is usually computed as the lowest perplexity). 

In our recent EMNLP paper, we took it one step further and asked whether we can use language models to generate what would otherwise be missing or implicit knowledge needed for solving a multiple-choice commonsense question answering instance. We proposed the unsupervised "self-talk" framework, that uses language models to generate information-seeking questions such as "what is the definition of..." and their corresponding answers (clarifications) to discover additional background knowledge. In the example in Figure 5, knowing that internship experience may help a person get a job is crucial for answering the question (which of Brett and Ian found a job less quickly?). On most benchmarks, the self-talk model performed better than unsupervised models with no additional knowledge, while competing with models that have access to knowledge bases. This is despite the inaccurate and noisy knowledge language models generate. However, when we showed people some of the clarifications that helped the model choose the correct answer choice, they judged only 40% of them as actually providing helpful information. This discrepancy means that our model doesn't imitate the human reasoning process - it works differently. Check out our demo! It's not always accurate but it's often funny :) 

Figure 5: An example of clarification generation for an instance from WinoGrande.

The best performance on commonsense tasks is achieved by fine-tuning language models, i.e. training them on task-specific data. Let's look at some of the benchmarks and the issues we face with supervised learning.  

How to measure commonsense reasoning capabilities? 

Multiple commonsense benchmarks have been released over the last few years. Some of them will be discussed here (see examples in Figure 6), along with the main differences and design choices when creating a benchmark.

Figure 6: Some commonsense benchmarks along with an example instance. 

Type of knowledge: some benchmarks focus on a specific type of commonsense knowledge, such as social commonsense (e.g. Social IQa),  physical commonsense (e.g. PIQA), temporal commonsense (e.g. MC-TACO),  or causes and effects (e.g. COPA), while others target a broader domain of general commonsense knowledge and reasoning (e.g. WSC, WinoGrande, CommonsenseQA, ROCStories).  

Size: most recent datasets include a large training set, in order to facilitate training large neural models. One way to create a benchmark is to hire experts to curate a high-quality dataset such as for WSC and COPA. These datasets are rather expensive to collect and are therefore typically small. The common alternative is to collect data through crowdsourcing or semi-automatically, and split it randomly to train, validation, and test sets. Models that learned data-specific shortcuts in the training set instead of generalized phenomena are likely to perform well on a test set drawn from the same distribution, but this performance is misleading and is likely a lot better than on real-world instances of the task.  Despite this understanding, this is still the dominant approach. 

Format: the vast majority of datasets are in the format of multiple-choice questions, as exemplified in Figure 6. This format is the easiest to evaluate automatically: models are judged for their accuracy, i.e. what percent of the questions they answered correctly. Unfortunately, this type of tasks also makes it possible for a model to guess the correct answer. We're not talking about a random guess, which would leave enough room for improvement. A random guess is expected to result in an accuracy of 100/k %, where k is the number of answer choices, e.g. 50% accuracy for binary tests, 33.3% for tests with 3 choices, 25% for 4 choices, etc. The risk is that the model makes an "educated guess" based on - yes, you guessed it correctly - spurious correlations between the questions and the correct/incorrect answers. 

How do you make sure a model is right for the right reasons?

That's the million-dollar question. We don't have a perfect solution for this problem yet. For a start, when collecting a new benchmark, the process of collecting incorrect answers (=distractors) should be well-designed such that distractors are plausible but unlikely. Using random answers as distractors (e.g. naturally-occurring sentences or correct answers of different questions) would create topically-different distractors, which are easy to detect (remember, relatedness is one of the strengths of distributional text representations). Asking people to come up with the distractors may introduce other annotation artifacts, such as exaggerations, going off-topic, or producing overly emotional texts, which are easy for models to detect. Some solutions have been proposed: for example, the distractors in Social IQa are answers for different questions asked on the same context. In Figure 7, the context "Alex spilt food all over the floor and it made a huge mess." appears in the dataset with two questions: "what happens next?" and "what happened before?". The distractors of "what happens next?" are the correct answers of "what happened before?", e.g. that Alex has slippery hands. A similar approach is taken in CommonsenseQA. 

Figure 7: Creating distractors for a Social IQa instance. Image credit: Maarten Sap.

An alternative solution is to filter out easy questions through "adversarial filtering", i.e. training a weaker model and iteratively removing instances that it succeeds in answering. Variants of adversarial filtering were applied to WinoGrande and PIQA. 

Finally, I believe the future is in generative tasks, in which the model needs to produce a free-text answer without being provided with the candidate answers. Several recent benchmarks are generative, such as TimeTravel (counterfactual reasoning), ART (abductive reasoning), CommonGen, and ProtoQA. The challenge in generative tasks is the lack of reliable automatic evaluation metrics. Given the gold standard reference answer(s), we would like a metric to (1) reward correct generated answers that are different from the reference answer, while (2) penalizing incorrect answers that are similar (e.g. lexically) to the reference. Human evaluation is reliable, but it is costly and is typically done once on the test set. In order to be able to improve models during development, we need automatic metrics. We currently settle for metrics based on lexical overlap such as BLEU and ROUGE which are pretty terrible at (1) and have little correlation with human judgments, or model-based metrics such as BERT score that are not great at (2). 

How to gather and represent machine-readable commonsense knowledge?

Commonsense resources provide machine-readable knowledge about the world. Resources are expected to be large-scale and accurate, consist of diverse knowledge types, and be usable in downstream tasks. ConceptNet is a large (21 million assertions), commonly-used resource consisting of general commonsense knowledge, in over 85 languages. ATOMIC consists of 880,000 triplets reasoning about causes and effects of everyday situations. Other resources are listed in Figure 8.

Figure 8: Overview of existing commonsense resources. Image credit: Maarten Sap. 

Existing resources differ in several aspects:

Representation: how is knowledge represented in the resource? ConceptNet and ATOMIC represent knowledge in natural language (Figure 9), while NELL and Cyc represent knowledge in symbolic logic:

(#$implies (#$and (#$isa ?OBJ ?SUBSET) (#$genls ?SUBSET ?SUPERSET)) (#$isa ?OBJ ?SUPERSET)) 

Figure 9: example knowledge extracted from ConceptNet and ATOMIC. Image credit: Maarten Sap. 

Knowledge type: ConceptNet consists of semantic knowledge, i.e. properties of concepts (e.g. reading is a type of activity). ATOMIC, on the other hand, is inferential: given a templated event with "PersonX" representing the subject and "PersonY" an optional object(s) (e.g. PersonX yells at PersonY), and one of 9 pre-defined relation dimensions (e.g. PersonX's motivation) it provides a second event (e.g. PersonX wanted to express anger). 

Collection method: knowledge can be collected from humans, either experts or crowdsourcing workers. Expert-curated resources are more uniform and accurate and may use complex representations, but it is an expensive collection method, and it is very time-consuming. Alternatively, non-experts can write knowledge in natural language, making the collection faster and more scalable.

The alternative approach is to extract knowledge automatically from texts, as in NELL. This approach works, but it produces less accurate knowledge. In addition, the approach suffers from reporting bias: over-representing the rare at the expense of the trivial. For example, people are reported to murder more often than they are reported to breathe. Default properties of concepts (yellow banana) are mentioned less often than their alternatives (green banana), etc. 

How to enhance neural models for commonsense reasoning tasks with symbolic knowledge?

Most models developed for solving commonsense benchmarks today are based on language models. Typically, each answer choice, along with the context, forms a statement. The language model computes a vector representing each statement. These vectors are then fed into a classifier that assigns a plausibility score for each candidate answer:

Figure 10: An illustration of using BERT to score the answer choices of a WinoGrande instance.

Static neuro-symbolic integration

The knowledge in commonsense resources may enhance models built for solving commonsense benchmarks. For example, we can extract from ConceptNet the assertions that a job is used for making money, that spending money requires making money, that buying requires spending money and that car is something you can buy. Ideally, we would also need the knowledge that a high-paying job is a type of job, specifically one used for making a lot of money, which is required for spending a lot of money, which is required for buying something that costs a lot of money, a car being one of them. Finally, we may want to remove the edge from "buy" to "car" so we can only get to "car" from the node "buy something that costs a lot of money". 

Figure 12: Knowledge extracted from ConceptNet for the WinoGrande instance discussed above.

How do we incorporate knowledge from knowledge resources into a neural model?

The simple recipe (success not guaranteed) calls for 4 ingredients: the task addressed, the knowledge resource used, the neural component, and the combination method. We have already discussed tasks and knowledge resources, so I would only add here that ConceptNet is the main resource utilized for downstream models, although some models incorporate other knowledge sources, such as other knowledge bases (WordNet, ATOMIC), knowledge mined from text, and tools (knowledge base embeddings, sentiment analysis models, COMET - see below). 

Figure 13: Resources used by most knowledge-informed commonsense models.

The neural component is the shiny new neural architecture - language models in the last 3 years, biLSTMs in the years prior, etc. The more interesting component is the combination method. We will look at 3 examples:

Incorporating into the scoring function: Lin et al. (2017) extracted probabilistic "rules" connecting pairs of terms from multiple sources such as WordNet (restaurant→eatery: 1.0), Wikipedia categories (restaurant→business: 1.0), script knowledge mined from text (X went to a restaurant→X ate: 0.32), word embedding-based relatedness scores (restaurant→food: 0.71), and more. The model scores each candidate answer according to the scores of the inference rules used to get from the context (e.g. "Mary walked to a restaurant" in Figure 14) to the candidate answer (e.g. "She ordered foods.").  

Figure 14: "covering" each candidate answer by the original context and the rules extracted from various sources. Image credit: Lin et al. (2017).

Representing symbolic knowledge as vectors: Lin et al. (2019) used BERT as the neural component to represent the instance (statement vector). For their symbolic component, they extracted subgraphs from ConceptNet pertaining to concepts mentioned in the instance and learned to represent them as a vector (graph vector). These two vectors were provided as input to the answer scorer which was trained to predict the correct answer choice. 

Figure 15: extracting subgraphs from ConceptNet pertaining to concepts mentioned in the instance. Image credit: Lin et al. (2019).

Multi-task learning: Xia et al. (2019) fine-tuned a BERT model to solve the multiple-choice questions. They also trained two auxiliary tasks supervised by ConceptNet, in which two concepts were given as input and the classifier had to predict whether they are related or not, and the specific ConceptNet property that connects them. The BERT model was shared between the main and the auxiliary tasks, so that commonsense knowledge from ConceptNet was instilled into BERT, improving its performance on the main task.

Figure 16: multi-task learning aimed at instilling knowledge from ConceptNet into BERT.

Dynamic neuro-symbolic integration

There are two main limitations to the neuro-symbolic integration discussed above:
  1. Coverage: relevant knowledge is often not found as-is in commonsense knowledge resources. As we've seen earlier, commonsense knowledge is immeasurably vast, so much of it is not documented. 

  2. Precision and context: knowledge found in the knowledge base about concept X doesn't necessarily apply to all contexts in which X appears. For example, when provided with "PersonX adopts a cat", ATOMIC says that PersonX had to go to the shelter first (Figure 17), but that's not always the case. It may as well be that PersonX adopted a cat they found on the street or got the cat from a friend who was no longer able to care for it. 

Figure 17: ATOMIC inferences for the event "PersonX adopted a cat".

How do we provide machines with large-scale, contextualized commonsense knowledge?

The solution is to leverage manually curated commonsense knowledge resources, such as ConceptNet and ATOMIC, to train a model that can dynamically produce such knowledge for a given context. Commonsense knowledge resources are typically sparse, making training a knowledge base completion model to extend the resource less efficient. Pre-trained language models and their inherent knowledge come in handy here. Language models (such as GPT) implicitly represent knowledge, so you can re-train them on completing knowledge base assertions (e.g. from ATOMIC) to teach them the structure of knowledge. This is what COMET (COMmonsEnse Transformers) does, as illustrated in Figure 18. 

Figure 18: Illustration of the training process of COMET: The language model is fine-tuned to predict the "tail entity" (e.g. inference in ATOMIC) given the "head entity" and the relation. Image credit: Antoine Bosselut.

COMET is capable of dynamically generating inferences for any context. For example, if we modify the context from ATOMIC to "David adopted his sister's cat because they found out her husband was allergic.", which for obvious reasons does not appear in ATOMIC, COMET no longer predicts that PersonX (David) had to go to the shelter, but instead that he, for example, needed to find out about it.

COMET has been used successfully in various downstream tasks requiring commonsense knowledge. Models trained on ATOMIC or on ConceptNet are available, and the demo for both ATOMIC and COMET can be found here. There is also a Visual COMET that can generate inferences from images. 


We talked about ways to acquire and represent commonsense knowledge in machine-readable format, ways to measure commonsense reasoning abilities, and ways to integrate this kind of knowledge into models. None of these is solved yet. Manually collecting all the commonsense knowledge is infeasible, while extracting it from texts or from language models suffers from inaccuracies, reporting bias, and societal biases. Looking forward, a promising research direction is multi-modal commonsense knowledge acquisition, e.g. learning from texts along with images and videos. For example, looking through enough class photos, you might learn that the kids in the front row typically sit (especially if the kids in the last row are also seated). 

Machines may reach human performance on commonsense benchmarks but it's often due to being right for the wrong reasons rather than actually possessing and successfully applying commonsense knowledge and reasoning abilities. Generative tasks are somewhat less prone to this issue, but we would have to develop reliable automatic evaluation metrics to make them the standard. 

Machine commonsense reasoning is becoming more and more popular within NLP so I am optimistic about future breakthroughs! 

Monday, January 11, 2021

Most Popular Blog Posts of 2020

 This is a summary ranking of the most popular blog posts of the 2020 year based on readership traffic and presence. These rankings are based on the statistics given to me by the hosting platform, which sometimes fluctuate much more than one would expect. 

I am pleased to see that there is an increasing awareness on the importance data analysis in multiple phases of the MT development and deployment process. Data analysis matters for training data selection, improved linguistic pattern handling, effective testing and quality estimation amongst other things. The tools to do this well are still lacking robustness or need major improvements to make them more useable. As the world shifts from handling translation projects for localization (relatively low volume), to other digital presence and large scale content assimiliation and dissemination use cases, I think there will be a need for better tools. I am surprised by the continuing stream new TMS products that continue to emerge, most of these new products have a huge amount of overlap with existing products, and none of the new tools really change the playing field in a meaningful way. 

The single most popular post of the year was this one which was an interview with Adam Bittlingmayer on the risk prediction capabilities of  Modelfront:

1. Understanding Machine Translation Quality & Risk Prediction

Getting a better understanding of data and identifying the most critical data issues is key to success with MT. Better data analysis would mean that human efforts can be focused on a much smaller set of data and thus yield better overall quality in less time. The risk prediction and quality estimation data provided by ModelFront makes MT use much more efficient. It allows rapid error detection and can help isolate translation projects by high-touch and low-touch elements. I suspect much of the readership of this post came from outside the translation industry as I continue to see little focus on this in the localization world. This post is worth a closer look for those LSPs who are investigating a more active use of MT. This link will lead to a case study to show how this can help in localization projects. 

Despite the hype around the magic of NMT and deep learning in general, we should understand that deep learning NMT toolkits are going to be viewed as commodities. Data analysis is where value creation will happen. 

The data is your teacher and is where the real value contribution possibilities are. I predict that this will continue to be clearer in the coming year.

If we consider all three posts related to the "Premium Translation", it would easily be the top blog theme for the year. These posts together attracted the most active readership, and also the most articulate and comprehensive comments. MT technologists tend to lump all translators together when making comments about "human" translators, but we should understand that there is a broad spectrum of capabilities when we talk about "human" translators.  And those who collaborate, consult, and advise their customers around critical content communication are unlikely to be replaced by ever-improving MT. Real domain expertise, insight, and the ability to focus on the larger global communication mission of the content is something I do not see MT approach successfully in my lifetime. 

I am extremely skeptical about the notion of "singularity" as some technologists have described it. It is a modern myth that will not happen as described IMO -- most AI today and machine learning, in particular, is no more than sophisticated pattern matching in big data, and while it can be quite compelling at times, it is NOT intelligence. Skillful use and integration of multiple deep learning tasks can create the illusion of intelligence but I feel that we have yet, much to learn about human capabilities, before we make "human equivalent" technology.

Here is a smart discussion on AI that provides informed context and a reality check on the never-ending AI hype that we continue to hear. Translators, in particular, will enjoy this discussion as it reveals how central language understanding is to the evolution of AI possibilities.
There is not the slightest reason to believe in a coming singularity. Sheer processing power [and big data] is not a pixie dust that magically solves all your problems."
                                                                            Steven Pinker  

The pandemic has forced many more B2C and B2B interactions to be digital. In an increasingly global world where it is commonplace to have multilingual communication, collaboration and information sharing MT becomes an enabler of global corporate presence. I estimate that the huge bulk of the need for translation is beyond the scope of the primary focus of localization efforts. It is really much more about digital interactions across all kinds of content, multiple platforms that require instant multilingual capabilities. 

However, data security and privacy really matter in these many interactions, and MT technology that does not make data security a primary focus in the deployment should be treated with suspicion and care.

Microsft offers quite possibly the most robust and secure cloud MT platform for companies who wish to integrate instant translation into all kinds of enterprise content flows. The voice of a Microsoft customer states the needs quite simply and eloquently.

“Ultimately, we expect the Azure environment to provide the same data security as our internal translation portal has offered thus far,” 

Tibor Farkas, Head of IT Cloud at Volkswagen 

This was a guest post by Raymond Doctor that illustrates the significant added value that linguists can add to the MT development process. Linguistically informed data can make some MT systems considerable better than just adding more data. Many believe that the need is simply more data but this post clarifies that a smaller amount of the right kind of data can have a much more favorable impact than sheer random data volume. 

The success of these MT experiments is yet more proof that the best MT systems come from those who have a deep understanding of both the underlying linguistics, as well as the MT system development methodology.

Here is a great primer on the need for data cleaning in general. This post takes the next step and provides specific examples of how this can be extended to MT.

"True inaccuracy and errors in data are at least relatively straightforward to address because they are generally all logical in nature. Bias, on the other hand, involves changing how humans look at data, and we all know how hard it is to change human behavior."

Michiko Wolcott

I have been surprised at the continuing popularity of this post which was actually written and published in March 2012, almost 9 years ago. Interestingly Sharon O'Brien stated this at the recent AMTA2020 conference. She tried to get a discussion going on why have the issues being discussed around post-editing not changed in ten years. 

The popularity of this post points to how badly PEMT compensation is being handled even in 2020. Or perhaps it suggests that people are doing research to try and do it better. 

Jay Marciano had a presentation at ATA recently,  where he argued that since there is no discernible and reliable differentiator between fuzzy translation memory matches and machine translation suggestions (assuming that you are using a domain trained machine translation engine), we should stop differentiating them in their pricing. Instead, he suggested that they should all be paid by edit distance. ("Edit distance" is the now widely used approach to evaluating the number of changes the editor or translator had to make to an MT suggestion before delivering it.) 

Doing this, according to Jay, protects the translator from poor-quality machine translation (because the edit distance -- or rewrite from scratch --will, in that case, be large enough for 100% payment) as well as from bad translation memories (same reason). Also, he suggests payment for MT suggestions with no edit distance, i.e., suggestions where no edits were deemed necessary (20% of the word price) at a rate twice as high as a 100% TM match (10%) to compensate for the effort to evaluate their accuracy. He also suggests a 110% rate for an edit distance of 91-100%, taking into account the larger effort needed to "correct" something that was rather useless in the first place. 

This is an attempt to be fair but, practically, it is a hard-to-predict compensation scheme and most customers like to know costs BEFORE they buy. There are many others who think we should still be looking at an hourly-based compensation scheme. We do not encounter discussions on how a mechanic, electrician, accountant, or lawyer takes too long to do a job as a reason not to hire them, and perhaps translation work could evolve to this kind of a model.  It is not clear how this could work when very large volumes (millions of words/day) of words are involved as the edit-distance approach really only viable in post-editing of MT use scenarios.

Nonetheless, much of the current thinking on the proper PEMT compensation model is to use Edit Distance-based methodologies. While this makes sense for localization MT use cases, this approach is almost useless for the other higher volume MT use cases. The quality and error assessment schemes proposed in localization are much too slow and onerous to use in scenarios where millions or hundreds of millions of words are being translated every day.

 It is my sense that 95% of MT use is going to be outside of localization use cases (PEMT) and I think the more forward-looking LSPs will learn to find approaches that work better when the typical translation job handles millions of words a week. Thus, I am much more bullish on quality estimation and risk prediction approaches that are going to be a better way to do rapid error detection and rapid error correction for these higher business value, higher volume MT use cases.

The issue of equitable compensation for the post-editors is an important one, and it is important to understand the issues related to post-editing, that many translators find to be a source of great pain and inequity.  MT can often fail or backfire if the human factors underlying work are not properly considered and addressed. 

From my vantage point, it is clear that those who understand these various issues and take steps to address them are most likely to find the greatest success with MT deployments. These practitioners will perhaps pave the way for others in the industry and “show you how to do it right” as Frank Zappa says. Many of the problems with PEMT are related to ignorance about critical elements, “lazy” strategies and lack of clarity on what really matters, or just simply using MT where it does not make sense. These factors result in the many examples of poor PEMT implementations that antagonize translators. 


This is a guest post by Luigi Muzii with his observations on several hot issues in the localization business. He comments often on the misplaced emphasis and attention on the wrong problem. For some reason,  misplaced emphasis on the wrong issues has been a long-term problem in the business translation industry. 

Almost more interesting than disintermediation – removing the middleman – is intermediation that adds the middleman back into the mix. Intermediation occurs when digital platforms inject themselves between the customers and a company.  In this case, the global enterprise and the translators who do the translation work. These platforms are so large that businesses can’t afford not to reach customers through these platforms. Intermediation creates a dependency and disintermediation removes the dependency. There is no such intermediary for translation though some might argue that the big public MT portals have already done this and the localization industry only services the niche needs.

He focuses also on the emergence of low-value proposition, generic MT portals with attached cheap human review capabilities as examples of likely to fail attempts at disintermediation. It is worth a read. An excerpt:
"It is my observation, that these allegedly “new offerings” are usually just a response to the same offering from competitors. They should not be equated to disintermediation and they often backfire, both in terms of business impact and brand image deterioration. They all seem to look like dubious, unsound initiatives instigated by Dilbert’s pointy-haired boss. And the Peter principle rules again here and should be considered together with Cipolla’s laws of stupidity, which state that a stupid person is more dangerous than a pillager and often does more damage to the general welfare of others. "


By Vincedevries - Own work, CC BY-SA 4.0

The danger of the impact of the stupid person is proven by what we have seen from the damage caused by the orange buffoon to the US. This man manages to comfortably straddle both the stupid and bandit quadrants with equal ease even though he started as a bandit. Fortunately for the US, the stupid element was much stronger than the bandit element in this particular case. Unfortunately for the US, stupid bandits can inflict long-term damage on the prospects of a nation and it may take a decade or so to recover from the damage done. 

“The reason why it is so difficult for existing firms to capitalize on disruptive innovations is that their processes and their business model that make them good at the existing business actually make them bad at competing for the disruption.”

'Disruption' is, at its core, a really powerful idea. Everyone hijacks the idea to do whatever they want now. It's the same way people hijacked the word 'paradigm' to justify lame things they're trying to sell to mankind."
'Disruption' is, at its core, a really powerful idea. Everyone hijacks the idea to do whatever they want now. It's the same way people hijacked the word 'paradigm' to justify lame things they're trying to sell to mankind.
'Disruption' is, at its core, a really powerful idea. Everyone hijacks the idea to do whatever they want now. It's the same way people hijacked the word 'paradigm' to justify lame things they're trying to sell to mankind.
Clay Christensen

“Life’s too short to build something nobody wants.”

                                                                                    Ash Maurya 

Luigi also wrote a critique of my post on the Premium Market and challenged many of the assumptions and conclusions I had drawn. I thought it would only be fair to include it in this list so that readers could get both sides of the subject on the premium market discussion.

I also noted that the following two posts got an unusual amount of attention in 2020. The BLEU score post has been very popular in two other forums where it has been published. There are now many other quality measurements for adequacy and fluency being used but I still see a large number of new research findings reporting with BLEU, mostly because it is widely understood in all its imperfection.

The latest WMT results use Direct Assessment (DA) extensively in their results summaries. 

Direct assessment (DA) (Graham et al., 2013,2014, 2016) is a relatively new human evaluation approach that overcomes previous challenges with respect to lack of reliability of human judges. DA collects assessments of translations separately in the form of both fluency and adequacy on a 0–100 rating scale, and, by combination of repeat judgments for translations, produces scores that have been shown to be highly reliable in self-replication experiments. The main component of DA used to provide a primary ranking of systems is adequacy, where the MT output is assessed via a monolingual similarity of meaning assessment. In Direct Assessment humans assess the quality of a given MT output translation by comparison with a reference translation (as opposed to the source and reference). DA is the new standard used in WMT News Translation Task evaluation, requiring only monolingual evaluators. For system-level evaluation, they use the Pearson correlation r of automatic metrics with DA scores.I have not seen enough comparison data of this to have an opinion on efficacy yet.

Most Popular Blog Posts of 2019 had an unusually high traffic flow and would rank in the Top 5

I wish you all a Happy, Prosperous and Healthy New Year