This long-overdue blog post is based on the
Commonsense Tutorial taught by Maarten Sap, Antoine Bosselut, Yejin Choi, Dan Roth, and myself at
ACL 2020. Credit for much of the content goes to the co-instructors, but any errors are mine.
Figure 1: adversarial examples in computer vision (left) and natural language processing tasks (right).
Machine learning models today perform reasonably well on perception tasks (image and speech recognition). However, they mostly lack the ability to perform the simple intuitive commonsense inferences that humans make every minute of their waking hours: reasoning about the pre- and post-conditions of events, understanding other people's motivations and intents, their mental and emotional states, etc.
Table of contents:
- What is commonsense?
- Is commonsense knowledge already captured by pre-trained language models?
- How to create benchmarks to measure commonsense reasoning capabilities?
- How to gather and represent machine-readable commonsense knowledge?
- How to enhance neural models for commonsense reasoning tasks with symbolic knowledge?
- Summary
What is commonsense?
The boundaries of commonsense are quite challenging to define, but we will go with this working definition:
Commonsense is the basic level of practical knowledge and reasoning concerning everyday situations and events that are commonly shared among most people.
For example, it's common sense that it's OK to keep the closet door open, but not the fridge door, as the food inside might go bad.
Types of commonsense:
Commonsense knowledge can be categorized according to types, including but not limited to:
- Social commonsense: people are capable of making inferences about other people's mental states, e.g. what motivates them, what they are likely to do next, etc. This kind of inference is captured by the ATOMIC knowledge base discussed later. In addition, we each have a set of social norms of accepted behavior, e.g. knowing that "it's impolite to comment on someone's weight". While these are often implicit in our actions and decisions, machines need to be taught them explicitly.
- Temporal commonsense: natural language rarely communicates explicit temporal information. Instead, it's vague and relies on the commonsense knowledge of the listener. For example, when told that "Dr. Porter is taking a vacation" we can predict that Dr. Porter will not be able to see us soon, as opposed to when "Dr. Porter is taking a walk". This requires knowing the typical duration of "taking a walk" (minutes) and that of "taking a vacation" (days). Other temporal knowledge includes the typical time, order, frequency, etc. of events, which are addressed by the MC-TACO dataset and the TACO-LM time-aware contextual language model.
- Physical commonsense: a glass will likely shatter if it falls to the floor, which is a fact most people (and arguably cats) know. Physical commonsense includes knowledge about the physical properties and affordances of everyday objects, as tested in the PIQA dataset.
Commonsense is essential for humans to navigate everyday situations seamlessly and interact with each other in a reasonable and safe way, and for AI to understand human needs and actions better. Yet, endowing machines with such human-like commonsense reasoning capabilities has remained an elusive goal of AI research for decades. Past attempts, in the 1960s and 1970s, resulted in an AI winter, i.e. reduced interest and funding for AI research due to failed, over-hyped research directions. In recent years, a new interest in machine commonsense has emerged, with the availability of stronger computing power and huge amounts of data. With that said, the path to machine commonsense is unlikely to be brute-force training of larger neural networks with deeper layers.
Is commonsense knowledge already captured by pre-trained language models?
Over the last 3 years, language models have become ubiquitous in NLP.
Language models are pre-trained once, in a self-supervised manner that requires only a large text corpus. Traditionally, language models are trained to predict the next word in a sentence (top part of Figure 2, in blue), but they can also predict hidden (masked) words in the middle of the sentence, as in Google's BERT model (top part of Figure 2, in orange). This pre-training phase yields a function that takes a sequence of words (a sentence or a short paragraph) and returns a vector for each word in the sequence.
Figure 2: Language models pre-training and fine-tuning.
As opposed to static word embeddings, language model-based word vectors are dynamic and re-computed for each context. At the very basic level, they assign different vectors to a word when it is used in different senses, as in Figure 3.
Figure 3: Static vs. dynamic word representations.
Do off-the-shelf pre-trained language models already capture commonsense knowledge?
✅ They are capable, to some extent, of filling in incomplete commonsense facts or ranking candidate facts. For example, the language model score (≈ statement plausibility) of a fact like "a musician plays a musical instrument" is higher than that of "a dancer plays a musical instrument". This is evidence that, in addition to lexical and syntactic knowledge, language models capture general knowledge about the world.
✅ They can, to some extent, associate concepts with their properties. They distinguish concepts associated with a given set of properties, i.e. complete a statement such as "A has fur, is big, has claws, has teeth, is an animal, ..." with bear (just like playing the "20 questions" game). They perform better when they are shown encyclopedic properties (e.g. is an animal) as opposed to perceptual properties (e.g. smooth). They can also, pretty successfully, list the properties associated with given concepts, e.g. complete the sentence "Everyone knows that a bear has ..." with fur, claws, teeth, etc.
However, knowledge generated from language models is noisy!
🚫 Several papers have shown that language models are not sensitive to negation, i.e. they consider the negated version of a fact ("birds can't fly") to be similarly plausible.
🚫 They are sensitive to phrasing: minor rephrasings of the same query can change their predictions.
🚫 In distributional word vectors, the vector representing a (sub-)word is learned from the contexts in which it appeared, leading to similar representations for semantically-similar words. In language models, the representations of similar contexts are similar, so the model learns which type of word should appear next (or instead of a masked token). This is generally a positive thing, but it sometimes over-generalizes, leading to examples such as this:
Figure 4: BERT guesses that the masked token should be a color, but fails to predict the correct color. Using the AllenNLP demo.
Here, BERT has seen in its training corpus enough sentences of the type "The color of something is [color]" to know to suggest different colors as substitutes for the masked word. Unfortunately, not every color is suitable in every context that calls for a color. BERT likely didn't see enough sentences discussing the color of a dove, so it defaults to predicting just any color.
So knowledge in language models is not the most accurate and reliable. Is it still useful?
Yes, to some extent. One way to show it is through evaluation on tasks requiring commonsense knowledge. We will discuss several such tasks, but for now, let's focus on WinoGrande as an example. It is the large-scale version of the Winograd Schema Challenge. Given a sentence with a cloze, the goal is to fill in the blank with a previously mentioned entity or concept, out of two answer choices. For example:
Because Brett found an internship while in college but Ian was unable to, _____ found a job less quickly after graduation.
Choices: Brett, Ian
What makes this task especially difficult is that every instance has a twin sentence which is minimally changed such that the correct answer is the other choice (for instance, replacing "less quickly" with "more quickly" changes the correct answer from Ian to Brett).
Language model-based models top the leaderboards of WinoGrande and other commonsense tasks, but since they are trained on task-specific training data, which often contains tens or hundreds of thousands of training examples, it's hard to attribute their success to the knowledge captured in language models during the pre-training step. A better way to estimate it is with zero-shot (unsupervised) models. Typically, zero-shot models address multiple-choice tasks by phrasing a statement from the instance and each answer choice, and computing the language model score as a proxy for plausibility:

P_LM(The answer is answer_1)
P_LM(The answer is answer_2)
...
P_LM(The answer is answer_k)

and then predicting the answer choice with the best language model score (highest probability, which is usually computed as the lowest perplexity).
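This zero-shot scoring scheme can be sketched in a few lines. The log-probabilities below are invented stand-ins for what a real pre-trained LM (e.g. GPT) would return; the point is only the mechanics of filling the blank with each choice and picking the most plausible statement.

```python
def toy_lm_log_prob(statement):
    """Stand-in for a real language model score; values are made up for
    illustration and just encode that the 'Ian' statement is more plausible."""
    scores = {
        "Brett found a job less quickly after graduation.": -12.0,
        "Ian found a job less quickly after graduation.": -9.5,
    }
    return scores[statement]

def zero_shot_predict(cloze, choices):
    """Fill the blank with each answer choice and return the choice whose
    statement gets the highest LM log-probability (i.e. lowest perplexity)."""
    statements = {c: cloze.replace("_____", c) for c in choices}
    return max(choices, key=lambda c: toy_lm_log_prob(statements[c]))

cloze = "_____ found a job less quickly after graduation."
prediction = zero_shot_predict(cloze, ["Brett", "Ian"])
```

No task-specific training is involved: all the "knowledge" comes from the (here, fake) language model scores.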
In our recent EMNLP paper, we took it one step further and asked whether we can use language models to generate the otherwise missing or implicit knowledge needed for solving a multiple-choice commonsense question answering instance. We proposed the unsupervised "self-talk" framework, which uses language models to generate information-seeking questions such as "what is the definition of..." and their corresponding answers (clarifications) to discover additional background knowledge. In the example in Figure 5, knowing that internship experience may help a person get a job is crucial for answering the question (which of Brett and Ian found a job less quickly?). On most benchmarks, the self-talk model performed better than unsupervised models with no additional knowledge, while competing with models that have access to knowledge bases. This is despite the inaccurate and noisy knowledge language models generate. However, when we showed people some of the clarifications that helped the model choose the correct answer, they judged only 40% of them as actually providing helpful information. This discrepancy means that our model doesn't imitate the human reasoning process - it works differently. Check out our demo! It's not always accurate but it's often funny :)
Figure 5: An example of clarification generation for an instance from WinoGrande.
The best performance on commonsense tasks is achieved by fine-tuning language models, i.e. training them on task-specific data. Let's look at some of the benchmarks and the issues we face with supervised learning.
How to measure commonsense reasoning capabilities?
Multiple commonsense benchmarks have been released over the last few years. Some of them will be discussed here (see examples in Figure 6), along with the main differences and design choices to consider when creating a benchmark.
Figure 6: Some commonsense benchmarks along with an example instance.
Type of knowledge: some benchmarks focus on a specific type of commonsense knowledge, such as social commonsense (e.g. Social IQa), physical commonsense (e.g. PIQA), temporal commonsense (e.g. MC-TACO), or causes and effects (e.g. COPA), while others target a broader domain of general commonsense knowledge and reasoning (e.g. WSC, WinoGrande, CommonsenseQA, ROCStories).
Size: most recent datasets include a large training set, in order to facilitate training large neural models. One way to create a benchmark is to hire experts to curate a high-quality dataset, as was done for WSC and COPA. Such datasets are rather expensive to collect and are therefore typically small. The common alternative is to collect data through crowdsourcing or semi-automatically, and split it randomly into train, validation, and test sets. Models that learn dataset-specific shortcuts from the training set instead of generalizable phenomena are likely to perform well on a test set drawn from the same distribution, but this performance is misleading and is likely much better than on real-world instances of the task. Despite this understanding, this is still the dominant approach.
Format: the vast majority of datasets are in the format of multiple-choice questions, as exemplified in Figure 6. This format is the easiest to evaluate automatically: models are judged by their accuracy, i.e. the percent of questions they answer correctly. Unfortunately, this type of task also makes it possible for a model to guess the correct answer. We're not talking about a random guess, which would leave enough room for improvement: a random guess is expected to result in an accuracy of 100/k %, where k is the number of answer choices, e.g. 50% accuracy for binary tests, 33.3% for tests with 3 choices, 25% for 4 choices, etc. The risk is that the model makes an "educated guess" based on - yes, you guessed it correctly - spurious correlations between the questions and the correct/incorrect answers.
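The gap between chance and an "educated guess" is easy to simulate. The toy dataset below has an invented artifact of the kind crowdsourcing can introduce: correct answers happen to be longer than distractors. A model that only looks at answer length then far outperforms the 50% chance baseline without any understanding of the question.

```python
import random

random.seed(0)

def make_instance():
    """Binary-choice instance with a planted artifact: the correct answer
    is always longer than the distractor."""
    correct = "answer " * random.randint(4, 6)
    distractor = "answer " * random.randint(1, 3)
    choices = [correct, distractor]
    random.shuffle(choices)
    return choices, choices.index(correct)

data = [make_instance() for _ in range(1000)]

def random_guesser(choices):
    return random.randrange(len(choices))

def length_heuristic(choices):
    """'Educated guess' that exploits the artifact: pick the longest answer."""
    return max(range(len(choices)), key=lambda i: len(choices[i]))

def accuracy(model, dataset):
    return sum(model(ch) == gold for ch, gold in dataset) / len(dataset)
```

On this data, `accuracy(length_heuristic, data)` is near perfect while `accuracy(random_guesser, data)` hovers around 0.5, which is exactly why high leaderboard accuracy alone doesn't prove commonsense reasoning.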
How do you make sure a model is right for the right reasons?
That's the million-dollar question, and we don't have a perfect solution for it yet. For a start, when collecting a new benchmark, the process of collecting incorrect answers (= distractors) should be well-designed such that distractors are plausible but unlikely. Using random answers as distractors (e.g. naturally-occurring sentences or correct answers of different questions) would create topically-different distractors, which are easy to detect (remember, relatedness is one of the strengths of distributional text representations). Asking people to come up with distractors may introduce other annotation artifacts, such as exaggerations, going off-topic, or producing overly emotional texts, all of which are easy for models to detect.
Some solutions have been proposed. For example, the distractors in Social IQa are answers to different questions asked about the same context. In Figure 7, the context "Alex spilt food all over the floor and it made a huge mess." appears in the dataset with two questions: "what happens next?" and "what happened before?". The distractors of "what happens next?" are the correct answers of "what happened before?", e.g. that Alex has slippery hands. A similar approach is taken in CommonsenseQA.
Figure 7: Creating distractors for a Social IQa instance. Image credit: Maarten Sap.
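This cross-question distractor scheme can be sketched as follows. The helper and its field names are hypothetical (not from the Social IQa release); the point is that distractors are drawn from the correct answers of the other question on the same context, so every choice stays topically on-context.

```python
def build_mcq(context, qa):
    """Build multiple-choice instances where the distractors for each question
    are the correct answers to the OTHER questions on the same context
    (Social IQa-style)."""
    instances = []
    for question, answers in qa.items():
        # Correct answers of the other questions become this question's distractors.
        distractors = [a for q, a_list in qa.items() if q != question
                       for a in a_list]
        for correct in answers:
            instances.append({
                "context": context,
                "question": question,
                "choices": [correct] + distractors,
                "label": 0,  # index of the correct answer
            })
    return instances

context = "Alex spilt food all over the floor and it made a huge mess."
qa = {
    "What happens next?": ["Alex mops up the mess"],
    "What happened before?": ["Alex had slippery hands"],
}
instances = build_mcq(context, qa)
```

Because "Alex had slippery hands" is a plausible statement about the same context, a topicality shortcut can't separate it from the correct answer.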
An alternative solution is to filter out easy questions through "adversarial filtering", i.e. training a weaker model and iteratively removing instances that it succeeds in answering. Variants of adversarial filtering were applied to WinoGrande and PIQA.
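A stripped-down sketch of that filtering loop is below. Real adversarial filtering re-trains the weak model on each round and uses an ensemble; here the "weak model" is a fixed length heuristic, which is enough to show how instances solvable by a shortcut get removed while the rest survive.

```python
def adversarial_filter(instances, weak_model, rounds=3):
    """Iteratively drop instances the weak model answers correctly, keeping
    only examples that (presumably) require deeper reasoning. (Real AF
    re-trains the model each round; this sketch keeps it fixed.)"""
    kept = list(instances)
    for _ in range(rounds):
        kept = [inst for inst in kept if weak_model(inst) != inst["label"]]
    return kept

def longest_choice(inst):
    """Stand-in weak model: always pick the longest answer choice."""
    return max(range(len(inst["choices"])),
               key=lambda i: len(inst["choices"][i]))

instances = [
    # Easy: the length shortcut solves it, so filtering removes it.
    {"choices": ["go home", "stay at the office until late"], "label": 1},
    # Not solvable by length: the shortcut picks the wrong answer, so it stays.
    {"choices": ["buy groceries", "cook dinner"], "label": 1},
]
hard = adversarial_filter(instances, longest_choice)
```

What remains is, by construction, the subset on which the chosen shortcut fails.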
Finally, I believe the future is in generative tasks, in which the model needs to produce a free-text answer without being provided with candidate answers. Several recent benchmarks are generative, such as TimeTravel (counterfactual reasoning), ART (abductive reasoning), CommonGen, and ProtoQA. The challenge in generative tasks is the lack of reliable automatic evaluation metrics. Given the gold-standard reference answer(s), we would like a metric to (1) reward correct generated answers that are different from the reference answer, while (2) penalizing incorrect answers that are similar (e.g. lexically) to the reference. Human evaluation is reliable, but it is costly and is typically done only once, on the test set. In order to be able to improve models during development, we need automatic metrics. We currently settle for metrics based on lexical overlap, such as BLEU and ROUGE, which are pretty terrible at (1) and have little correlation with human judgments, or model-based metrics such as BERTScore, which are not great at (2).
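Failure mode (1) is easy to demonstrate with a minimal ROUGE-1-style metric (unigram-overlap F1, a simplification of the real ROUGE implementation): a correct paraphrase scores lower than a wrong answer that happens to share most of its words with the reference.

```python
from collections import Counter

def unigram_f1(candidate, reference):
    """ROUGE-1-style score: F1 over unigram token multisets."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

reference = "He went to the shelter to pick a cat"
correct_paraphrase = "A friend gave him the cat"            # right answer, different words
wrong_but_similar = "He went to the shelter to pick a dog"  # wrong answer, heavy overlap
```

Here `unigram_f1(wrong_but_similar, reference)` beats `unigram_f1(correct_paraphrase, reference)` by a wide margin, which is exactly the wrong ordering.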
How to gather and represent machine-readable commonsense knowledge?
Commonsense resources provide machine-readable knowledge about the world. Resources are expected to be large-scale and accurate, to consist of diverse knowledge types, and to be usable in downstream tasks. ConceptNet is a large (21 million assertions), commonly-used resource consisting of general commonsense knowledge, in over 85 languages. ATOMIC consists of 880,000 triplets for reasoning about the causes and effects of everyday situations. Other resources are listed in Figure 8.
Figure 8: Overview of existing commonsense resources. Image credit: Maarten Sap.
Existing resources differ in several aspects:
Representation: how is knowledge represented in the resource? ConceptNet and ATOMIC represent knowledge in natural language (Figure 9), while NELL and Cyc represent knowledge in symbolic logic:
(#$implies
(#$and
(#$isa ?OBJ ?SUBSET)
(#$genls ?SUBSET ?SUPERSET))
(#$isa ?OBJ ?SUPERSET))
Figure 9: example knowledge extracted from ConceptNet and ATOMIC. Image credit: Maarten Sap.
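The Cyc rule above reads: if OBJ is an instance (isa) of SUBSET, and SUBSET is a subclass (genls) of SUPERSET, then OBJ is an instance of SUPERSET. A minimal Python rendering of that one rule (using invented example facts, not actual Cyc data) looks like this:

```python
def isa_closure(isa, genls):
    """Apply the rule (isa obj sub) & (genls sub sup) => (isa obj sup)
    to a fixed point, so multi-step genls chains are also covered."""
    inferred = set(isa)
    changed = True
    while changed:
        changed = False
        for obj, subset in list(inferred):
            for sub, sup in genls:
                if sub == subset and (obj, sup) not in inferred:
                    inferred.add((obj, sup))
                    changed = True
    return inferred

isa = {("Felix", "Cat")}                               # instance-of facts
genls = {("Cat", "Mammal"), ("Mammal", "Animal")}      # subclass facts
facts = isa_closure(isa, genls)
```

After closure, "Felix" is correctly inferred to be a Mammal and an Animal, illustrating why symbolic representations support exact multi-step inference at the cost of much more rigid knowledge entry than natural language.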
Knowledge type: ConceptNet consists of semantic knowledge, i.e. properties of concepts (e.g. that reading is a type of activity). ATOMIC, on the other hand, is inferential: given a templated event with "PersonX" representing the subject and "PersonY" an optional object (e.g. PersonX yells at PersonY), and one of 9 pre-defined relation dimensions (e.g. PersonX's motivation), it provides a second event (e.g. PersonX wanted to express anger).
Collection method: knowledge can be collected from humans, either experts or crowdsourcing workers. Expert-curated resources are more uniform and accurate and may use complex representations, but collecting them is expensive and very time-consuming. Alternatively, non-experts can write knowledge in natural language, making the collection faster and more scalable. The other approach is to extract knowledge automatically from texts, as in NELL. This approach works, but it produces less accurate knowledge. In addition, it suffers from reporting bias: texts over-represent the rare at the expense of the trivial. For example, people are reported to murder more often than they are reported to breathe, default properties of concepts (yellow banana) are mentioned less often than their alternatives (green banana), etc.
How to enhance neural models for commonsense reasoning tasks with symbolic knowledge?
Most models developed for solving commonsense benchmarks today are based on language models. Typically, each answer choice, along with the context, forms a statement, and the language model computes a vector representing each statement. These vectors are then fed into a classifier that assigns a plausibility score to each candidate answer:
Figure 10: An illustration of using BERT to score the answer choices of a WinoGrande instance.
Static neuro-symbolic integration
The knowledge in commonsense resources may enhance models built for solving commonsense benchmarks. For example, we can extract from ConceptNet the assertions that a job is used for making money, that spending money requires making money, that buying requires spending money, and that a car is something you can buy. Ideally, we would also need the knowledge that a high-paying job is a type of job, specifically one used for making a lot of money, which is required for spending a lot of money, which is required for buying something that costs a lot of money, a car being one such thing. Finally, we may want to remove the edge from "buy" to "car" so we can only get to "car" from the node "buy something that costs a lot of money".
Figure 12: Knowledge extracted from ConceptNet for the WinoGrande instance discussed above.
How do we incorporate knowledge from knowledge resources into a neural model?
The simple recipe (success not guaranteed) calls for 4 ingredients: the task addressed, the knowledge resource used, the neural component, and the combination method. We have already discussed tasks and knowledge resources, so I will only add here that ConceptNet is the main resource utilized by downstream models, although some models incorporate other knowledge sources, such as other knowledge bases (WordNet, ATOMIC), knowledge mined from text, and tools (knowledge base embeddings, sentiment analysis models, COMET - see below).
Figure 13: Resources used by most knowledge-informed commonsense models.
The neural component is the shiny new neural architecture - language models in the last 3 years, biLSTMs in the years prior, etc. The more interesting component is the combination method. We will look at 3 examples:
Incorporating knowledge into the scoring function: Lin et al. (2017) extracted probabilistic "rules" connecting pairs of terms from multiple sources such as WordNet (restaurant→eatery: 1.0), Wikipedia categories (restaurant→business: 1.0), script knowledge mined from text (X went to a restaurant→X ate: 0.32), word embedding-based relatedness scores (restaurant→food: 0.71), and more. The model scores each candidate answer according to the scores of the inference rules used to get from the context (e.g. "Mary walked to a restaurant" in Figure 14) to the candidate answer (e.g. "She ordered foods.").
Figure 14: "covering" each candidate answer by the original context and the rules extracted from various sources. Image credit: Lin et al. (2017).
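The spirit of that scoring function can be sketched as follows; the rules and confidences below are toy values in the style of Lin et al. (2017), and scoring an answer as the product of the confidences of the rules that cover it is a simplification of their actual model.

```python
from functools import reduce

# Toy (term -> term, confidence) rules, invented for illustration,
# in the style of WordNet / script-knowledge / embedding-based rules.
RULES = {
    ("restaurant", "eatery"): 1.0,
    ("restaurant", "X ate"): 0.32,
    ("restaurant", "food"): 0.71,
}

def answer_score(context_term, needed_terms):
    """Score a candidate answer by multiplying the confidences of the rules
    that connect the context term to the terms the answer mentions.
    Missing rules contribute 0, so an uncovered answer scores 0."""
    confidences = [RULES.get((context_term, t), 0.0) for t in needed_terms]
    return reduce(lambda a, b: a * b, confidences, 1.0)
```

An answer whose terms are all "covered" by high-confidence rules scores well; an answer requiring even one rule that no source provides drops to zero.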
Representing symbolic knowledge as vectors: Lin et al. (2019) used BERT as the neural component to represent the instance (statement vector). For their symbolic component, they extracted subgraphs from ConceptNet pertaining to concepts mentioned in the instance and learned to represent them as a vector (graph vector). These two vectors were provided as input to an answer scorer trained to predict the correct answer choice.
Figure 15: extracting subgraphs from ConceptNet pertaining to concepts mentioned in the instance. Image credit: Lin et al. (2019).
Multi-task learning: Xia et al. (2019) fine-tuned a BERT model to solve multiple-choice questions. They also trained it on two auxiliary tasks supervised by ConceptNet, in which two concepts were given as input and the classifier had to predict whether they are related, and which specific ConceptNet property connects them. The BERT model was shared between the main and the auxiliary tasks, so that commonsense knowledge from ConceptNet was instilled into BERT, improving its performance on the main task.
Figure 16: multi-task learning aimed at instilling knowledge from ConceptNet into BERT.
Dynamic neuro-symbolic integration
There are two main limitations to the neuro-symbolic integration discussed above:
- Coverage: relevant knowledge is often not found as-is in commonsense knowledge resources. As we've seen earlier, commonsense knowledge is immeasurably vast, so much of it is not documented.
- Precision and context: knowledge found in the knowledge base about a concept X doesn't necessarily apply to all contexts in which X appears. For example, when provided with "PersonX adopts a cat", ATOMIC says that PersonX had to go to the shelter first (Figure 17), but that's not always the case. It may as well be that PersonX adopted a cat they found on the street, or got the cat from a friend who was no longer able to care for it.
Figure 17: ATOMIC inferences for the event "PersonX adopted a cat".
How do we provide machines with large-scale, contextualized commonsense knowledge?
The solution is to leverage manually curated commonsense knowledge resources, such as ConceptNet and ATOMIC, to train a model that can dynamically produce such knowledge for a given context. Commonsense knowledge resources are typically sparse, which makes training a knowledge base completion model to extend the resource less effective. Pre-trained language models and their inherent knowledge come in handy here: language models (such as GPT) implicitly represent knowledge, so you can re-train them to complete knowledge base assertions (e.g. from ATOMIC), teaching them the structure of the knowledge. This is what COMET (COMmonsEnse Transformers) does, as illustrated in Figure 18.
Figure 18: Illustration of the training process of COMET: The language model is fine-tuned to predict the "tail entity" (e.g. inference in ATOMIC) given the "head entity" and the relation. Image credit: Antoine Bosselut.
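The data preparation behind this fine-tuning can be sketched as a triple-linearization step: each knowledge base assertion becomes a (source, target) text pair, where the model reads the head event and relation and learns to generate the tail. The exact tokenization and special tokens in the released COMET code differ; `[GEN]` below is a placeholder separator I made up for the sketch, while `xNeed` is a real ATOMIC relation (what PersonX needs to do beforehand).

```python
def atomic_to_training_example(head, relation, tail):
    """Linearize an ATOMIC triple into the (input, target) pair a language
    model is fine-tuned on: given head + relation, generate the tail."""
    source = f"{head} {relation} [GEN]"
    target = tail
    return source, target

src, tgt = atomic_to_training_example(
    "PersonX adopts a cat", "xNeed", "go to an animal shelter")
```

At inference time the same format is used with a novel head event, and the fine-tuned model generates a tail, which is how COMET produces inferences for contexts that never appear in ATOMIC.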
COMET is capable of dynamically generating inferences for any context. For example, if we modify the context from ATOMIC to "David adopted his sister's cat because they found out her husband was allergic.", which for obvious reasons does not appear in ATOMIC, COMET no longer predicts that PersonX (David) had to go to the shelter, but instead that he, for example, needed to find out about it.
COMET has been used successfully in various downstream tasks requiring commonsense knowledge. Models trained on ATOMIC or on ConceptNet are available, and a demo for both ATOMIC and COMET can be found here. There is also a Visual COMET that can generate inferences from images.
Summary
We talked about ways to acquire and represent commonsense knowledge in machine-readable format, ways to measure commonsense reasoning abilities, and ways to integrate this kind of knowledge into models. None of these is solved yet. Manually collecting all commonsense knowledge is infeasible, while extracting it from texts or from language models suffers from inaccuracies, reporting bias, and societal biases. Looking forward, a promising research direction is multi-modal commonsense knowledge acquisition, e.g. learning from texts along with images and videos. For example, after looking through enough class photos, you might learn that the kids in the front row typically sit (especially if the kids in the last row are also seated).
Machines may reach human performance on commonsense benchmarks, but it's often due to being right for the wrong reasons rather than actually possessing and successfully applying commonsense knowledge and reasoning abilities. Generative tasks are somewhat less prone to this issue, but we would have to develop reliable automatic evaluation metrics to make them the standard.
Machine commonsense reasoning is becoming more and more popular within NLP so I am optimistic about future breakthroughs!