This article is reprinted with permission from the original poster, @VeredShwartz. This post may be challenging reading for the usual reader of this blog, but I think even skimming it will be useful to many as a way to get a sense of possibly the most formidable challenge in the artificial intelligence community: building common sense capabilities into existing and emerging AI deployments.
Commonsense knowledge consists of facts about the everyday world that all humans are expected to know, and it helps us solve problems in the face of incomplete information. Acquiring it is currently considered an unsolved problem on the path to AGI and is a focus of the Allen Institute for Artificial Intelligence, with which the author is associated.
Deep learning is self-education for machines: you feed a machine learning system huge amounts of data, and eventually it begins to discern patterns all by itself. But despite their remarkable achievements, and their occasional ability to produce human-like outputs, machine learning algorithms are at their core complex mathematical functions that map observations to outcomes, forecasting patterns they have previously seen and explicitly learned. They are therefore only as good as their data, and they start to break as the data they face in the world deviates from the examples they saw during training. Neural MT is an example: great progress indeed, but far from having solved the translation problem.
We hear continually about the relentless "big data" that is driving AI progress, but we are finding more and more cases where the current approach of deep learning plus more data is not enough. The path to machine commonsense is unlikely to be brute-force training of larger neural networks with deeper layers on more data. While deep learning excels at pattern recognition, it is very poor at adapting to changing situations, even when only small modifications of the original case are encountered, and it often has to be re-trained from scratch with large amounts of data.
"The great irony of common sense—and indeed AI itself—is that it is stuff that pretty much everybody knows, yet nobody seems to know what exactly it is or how to build machines that possess it," said Gary Marcus, CEO and founder of Robust.AI. "Solving this problem is, we would argue, the single most important step towards taking AI to the next level. Common sense is a critical component to building AIs that can understand what they read; that can control robots that can operate usefully and safely in the human environment; that can interact with human users in reasonable ways. Common sense is not just the hardest problem for AI; in the long run, it's also the most important problem."
Common sense has been called the “dark matter of AI” — both essential and frustratingly elusive. That’s because common sense consists of implicit information — the broad (and broadly shared) set of unwritten assumptions and rules of thumb that humans automatically use to make sense of the world. Critics of over-exuberant AI claims frequently point out that two-year-old children have more common sense than existing deep-learning-based AI systems, whose "understanding" is often quite brittle and easily distracted and derailed.
Common sense is easier to detect than to define. The implicit nature of most common-sense knowledge makes it difficult and tedious to represent explicitly.
DARPA, the US Defense Department’s research agency, has also recognized the absence of common sense as an important issue and recently launched a program called Machine Common Sense. As they put it: “The absence of common sense prevents intelligent systems from understanding their world, behaving reasonably in unforeseen situations, communicating naturally with people, and learning from new experiences. Its absence is considered the most significant barrier between the narrowly focused AI applications of today and the more general, human-like AI systems hoped for in the future.”
Gary Marcus suggests combining traditional AI approaches together with deep learning as a way forward.
"First, classical AI actually IS a framework for building cognitive models of the world that you can then make inferences over. The second thing is, classical AI is perfectly comfortable with rules. It’s a strange sociology right now in deep learning where people want to avoid rules. They want to do everything with neural networks, and do nothing with anything that looks like classical programming. But there are problems that are routinely solved this way that nobody pays attention to, like making your route on Google maps.
We actually need both approaches. The machine-learning stuff is pretty good at learning from data, but it’s very poor at representing the kind of abstraction that computer programs represent. Classical AI is pretty good at abstraction, but it all has to be hand-coded, and there is too much knowledge in the world to manually input everything. So it seems evident that what we want is some kind of synthesis that blends these approaches."
Yejin Choi and her collaborators at the Allen Institute have united traditional symbolic AI approaches with newer machine learning approaches in an attempt to address the commonsense challenge. One initiative, COMET (short for “commonsense transformers”), extends traditional symbolic reasoning with the latest advances in neural language modeling — a kind of deep learning that aims to imbue computers with a statistical “understanding” of written language. COMET is a fusion of symbolic reasoning with a neural network, and it tries to solve the coverage and brittleness problems of purely DL-based approaches at the same time. COMET works by reimagining commonsense reasoning as a process of generating plausible (if imperfect) responses to novel input, rather than making airtight deductions by consulting a vast encyclopedia-like database.
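To make the “reasoning as generation” idea concrete, here is a minimal sketch of how one might query a COMET-style model through the Hugging Face transformers library. The checkpoint name and the query format are illustrative assumptions, not the exact AI2 release, so treat this as a sketch rather than a recipe.

```python
# Sketch: querying a COMET-style commonsense generation model.
# The checkpoint name and relation-query format below are illustrative assumptions.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "some-org/comet-style-model"  # hypothetical checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

# ATOMIC-style query: an event plus a relation, here the likely effect on PersonX.
query = "PersonX adopts a cat xEffect"
inputs = tokenizer(query, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=5, num_return_sequences=5, max_new_tokens=16)

# The model generates plausible (if imperfect) inferences rather than looking them up.
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```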
Gary Marcus, a critic of the deep-learning fanboys and girls, often points out the shortcomings of DL-only approaches to challenge the over-exuberance of these fans. To put progress in AI into a more realistic context he says: “Just because you can build a better ladder doesn’t mean you can build a ladder to the moon.” To him and others, COMET’s approach suffers from a fundamental limitation of deep learning: “statistics ≠ understanding.”
Regardless, Vered presents a comprehensive picture of the many challenges, and the attempts at solutions, involved in introducing commonsense to NLP applications, arguably one of the most challenging problems in computing today. I think her post is a great resource for anybody who wants to quickly get a sense of the issue and the state of the art (SOTA).
******
Commonsense Reasoning for Natural Language Processing
Figure 1: adversarial examples in computer vision (left) and natural language processing tasks (right).
Table of contents:
- What is commonsense?
- Is commonsense knowledge already captured by pre-trained language models?
- How to create benchmarks to measure commonsense reasoning capabilities?
- How to gather and represent machine-readable commonsense knowledge?
- How to enhance neural models for commonsense reasoning tasks with symbolic knowledge?
- Summary
Commonsense is the basic level of practical knowledge and reasoning concerning everyday situations and events that are commonly shared among most people.
- Social commonsense: people are capable of making inferences about other people's mental states, e.g. what motivates them, what they are likely to do next, etc. This kind of inference is captured by the ATOMIC knowledge base discussed later. In addition, we each have a set of social norms of accepted behavior, e.g. knowing that “it's impolite to comment on someone's weight”. While these are often implicit in our actions and decisions, machines need to be taught them explicitly.
- Temporal commonsense: natural language rarely communicates explicit temporal information. Instead, it's vague and relies on the commonsense knowledge of the listener. For example, when told that "Dr. Porter is taking a vacation" we can predict that Dr. Porter will not be able to see us soon, as opposed to when "Dr. Porter is taking a walk". This requires knowing the typical duration of "taking a walk" (minutes) and that of "taking a vacation" (days). Other temporal knowledge includes the typical times, order, frequency, etc. of events, which are addressed by the MC-TACO dataset and the TACO-LM time-aware contextual language model.
- Physical commonsense: a glass will likely shatter if it falls to the floor, which is a fact most people (and arguably cats) know. Physical commonsense includes knowledge about the physical properties and affordances of everyday objects, as tested in the PIQA dataset.
Is commonsense knowledge already captured by pre-trained language models?
Figure 2: Language models pre-training and fine-tuning.
Figure 3: Static vs. dynamic word representations.
Figure 4: BERT guesses that the masked token should be a color, but fails to predict the correct color. Using the AllenNLP demo.
Here, BERT has seen in its training corpus enough sentences of the type "The color of something is [color]" to know to suggest different colors as substitutes for the masked word. Unfortunately, not every color is suitable in every context that calls for a color. BERT likely didn't see enough sentences discussing the color of a dove, so it defaults to just predicting any color.
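This kind of probe is easy to reproduce with the fill-mask pipeline from the Hugging Face transformers library; below is a minimal sketch (the model choice and prompt are mine, while the figure above uses the AllenNLP demo).

```python
# Sketch: probing a masked language model for commonsense color knowledge.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

# BERT reliably predicts *a* color here, but not necessarily the right one.
for pred in fill("The color of a dove is [MASK]."):
    print(f"{pred['token_str']:>10}  {pred['score']:.3f}")
```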
How to measure commonsense reasoning capabilities?
Figure 6: Some commonsense benchmarks along with an example instance.
Figure 7: Creating distractors for a Social IQa instance. Image credit: Maarten Sap.
How to gather and represent machine-readable commonsense knowledge?
Figure 8: Overview of existing commonsense resources. Image credit: Maarten Sap.
Figure 9: example knowledge extracted from ConceptNet and ATOMIC. Image credit: Maarten Sap.
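Of these resources, ConceptNet is the easiest to inspect directly because it exposes a public web API. The following rough sketch pulls a few triples for a single concept; the endpoint and field names are written from memory and worth checking against the current ConceptNet documentation.

```python
# Sketch: fetching ConceptNet edges for the concept "cat" via the public API.
import requests

resp = requests.get("http://api.conceptnet.io/c/en/cat", params={"limit": 10})
resp.raise_for_status()

# Each edge is a (start, relation, end) triple with a confidence weight.
for edge in resp.json()["edges"]:
    print(edge["start"]["label"], edge["rel"]["label"],
          edge["end"]["label"], edge["weight"])
```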
Most models developed for solving commonsense benchmarks today are based on language models. Typically, each answer choice, along with the context, forms a statement. The language model computes a vector representing each statement. These vectors are then fed into a classifier that assigns a plausibility score for each candidate answer:
Figure 10: An illustration of using BERT to score the answer choices of a WinoGrande instance.
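As a rough sketch of this setup (not the exact models behind any leaderboard entry), one can lean on the multiple-choice head that the transformers library provides. The example instance below is invented for illustration, and the classification head here is randomly initialized, so the scores are only meaningful after fine-tuning.

```python
# Sketch: scoring two answer choices for a WinoGrande-style instance with BERT.
import torch
from transformers import AutoModelForMultipleChoice, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMultipleChoice.from_pretrained("bert-base-uncased")  # MC head untrained here

context = "The trophy doesn't fit into the brown suitcase because the _ is too small."
choices = ["trophy", "suitcase"]

# Each candidate fills the blank, giving one statement per answer choice.
statements = [context.replace("_", choice) for choice in choices]
enc = tokenizer(statements, return_tensors="pt", padding=True)
# Shape the batch as (1 instance, num_choices, seq_len), as the multiple-choice head expects.
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}

with torch.no_grad():
    logits = model(**inputs).logits  # one plausibility score per choice

print(dict(zip(choices, logits.squeeze(0).tolist())))
```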
Static neuro-symbolic integration
Figure 12: Knowledge extracted from ConceptNet for the WinoGrande instance discussed above.
Figure 13: Resources used by most knowledge-informed commonsense models.
The neural component is the shiny new neural architecture of the day: language models in the last three years, biLSTMs in the years before that, etc. The more interesting component is the combination method. We will look at three examples:
Figure 14: "covering" each candidate answer by the original context and the rules extracted from various sources. Image credit: Lin et al. (2017). |
Figure 15: extracting subgraphs from ConceptNet pertaining to concepts mentioned in the instance. Image credit: Lin et al. (2019). |
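In the same spirit, here is a rough sketch of the path-finding step over a toy triple set standing in for ConceptNet, using networkx; the triples and the concept matching are simplified illustrations, not Lin et al.'s actual pipeline.

```python
# Sketch: finding short ConceptNet-style paths between instance concepts and an answer concept.
import networkx as nx

# Toy stand-in for ConceptNet triples (relation labels stored on the edges).
triples = [
    ("trophy", "IsA", "award"),
    ("award", "RelatedTo", "prize"),
    ("suitcase", "UsedFor", "packing"),
    ("trophy", "HasProperty", "large"),
    ("suitcase", "HasProperty", "small"),
    ("large", "Antonym", "small"),
]

graph = nx.Graph()
for head, rel, tail in triples:
    graph.add_edge(head, tail, rel=rel)

question_concepts = ["trophy", "suitcase"]
answer_concept = "small"

# Collect short paths connecting each instance concept to the candidate answer.
for concept in question_concepts:
    for path in nx.all_simple_paths(graph, concept, answer_concept, cutoff=2):
        rels = [graph.edges[a, b]["rel"] for a, b in zip(path, path[1:])]
        print(" -> ".join(path), "via", rels)
```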
Multi-task learning: Xia et al. (2019) fine-tuned a BERT model to solve the multiple-choice questions. They also trained it on two auxiliary tasks supervised by ConceptNet, in which two concepts were given as input and a classifier had to predict whether they are related, and which specific ConceptNet property connects them. The BERT model was shared between the main and the auxiliary tasks, so that commonsense knowledge from ConceptNet was instilled into BERT, improving its performance on the main task.
Figure 16: multi-task learning aimed at instilling knowledge from ConceptNet into BERT.
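A schematic of how such parameter sharing can be wired up in PyTorch is given below; this is a rough sketch under my own simplifications, not Xia et al.'s exact architecture.

```python
# Sketch: a shared BERT encoder with a main multiple-choice head and an auxiliary
# head that predicts which ConceptNet relation (if any) links two concepts.
import torch.nn as nn
from transformers import AutoModel

class SharedCommonsenseModel(nn.Module):
    def __init__(self, num_relations: int, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)  # shared between tasks
        hidden = self.encoder.config.hidden_size
        self.choice_scorer = nn.Linear(hidden, 1)                    # main task head
        self.relation_classifier = nn.Linear(hidden, num_relations)  # auxiliary task head

    def encode(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        return out.last_hidden_state[:, 0]  # [CLS] vector

    def forward_main(self, input_ids, attention_mask):
        # input_ids: (batch, num_choices, seq_len); flatten, score each choice, reshape.
        b, c, s = input_ids.shape
        cls = self.encode(input_ids.view(b * c, s), attention_mask.view(b * c, s))
        return self.choice_scorer(cls).view(b, c)

    def forward_auxiliary(self, input_ids, attention_mask):
        # Input encodes a "concept1 [SEP] concept2" pair; predict the ConceptNet relation.
        return self.relation_classifier(self.encode(input_ids, attention_mask))
```

Because both forward passes go through the same shared encoder, gradients from the ConceptNet-supervised auxiliary tasks update the weights used by the main task, which is how the knowledge transfer happens.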
Dynamic neuro-symbolic integration
- Coverage: relevant knowledge is often not found as-is in commonsense knowledge resources. As we've seen earlier, commonsense knowledge is immeasurably vast, so much of it is not documented.
- Precision and context: knowledge found in the knowledge base about concept X doesn't necessarily apply to all contexts in which X appears. For example, when provided with "PersonX adopts a cat", ATOMIC says that PersonX had to go to the shelter first (Figure 17), but that's not always the case. It could just as well be that PersonX adopted a cat they found on the street or got the cat from a friend who was no longer able to care for it.
Figure 17: ATOMIC inferences for the event "PersonX adopted a cat".