Pages

Monday, March 29, 2021

The Quest for Human Parity Machine Translation



The Challenge of Defining Translation Quality 


The subject of "translation quality" has always been a challenging communication issue for the translation industry. It is particularly difficult to explain this concept in a straightforward way to an industry outsider or a customer whose primary focus is building business momentum in international markets, and who is not familiar with localization-industry translation-quality-speak. These customers tend to focus on creating and managing the dynamic, ever-changing content that enhances a global customer's digital journey, rather than the static content that is the more typical focus of localization managers. Thus, the conventional way in which translation quality is discussed by LSPs is not very useful to them. Since every LSP claims to deliver the "best quality" or "high quality" translations, it is difficult for these buyers to tell one service provider from another on this dimension. The quality claims between vendors thus essentially cancel out. 

These customers also differ in other ways. They need large volumes of content translated rapidly at the lowest possible cost, yet at a quality level that is useful to the customer in digital interactions with the enterprise. For millions of digital interactions with enterprise content, the linguistic perfection of translations is not a meaningful or achievable goal given the volume, short shelf-life, and instant-turnaround expectations a digital customer will have.  
As industry observer and critic Luigi Muzii describes it:
"Because of the intricacies related to the intrinsic nature of languages, objective measurement of translation quality has always been a much researched and debated topic that has borne very little fruit. The notion of understood quality level remains unsolved, together with any kind of generally accepted and clearly understood quality assessment and measurement."
The industry response to this need for a better definition of translation quality is deeply colored by the localization mindset and thus we see the emergence of approaches like the Dynamic Quality Framework (DQF). Many critics consider it too cumbersome and detailed to implement in translating modern fast-flowing content streams needed for superior digital experience. While DQF can be useful in some limited localization use-case scenarios, it will surely confound and frustrate the enterprise managers who are more focused on digital transformation imperatives.  The ability to rapidly handle and translate large volumes of DX-relevant content cost-effectively is increasingly a higher priority and needs a new and different view on monitoring quality. The quality of the translation does matter in delivering superior DX but has a lower priority than speed, cost, and digital agility.

While machines do most of the translation done on the planet today, this does not mean there is no role for higher value-added human translation (increasingly supported by CAT tools). If the content is a critical and high-impact communication, most of us understand that human oversight is critical for success in the business mission. And if translation involves finesse, nuance, and high art, it is probably best to leave the "translating" computers completely out of the picture. 

However, in this age of digitally-driven business transformation and momentum, competent MT solutions are essential to the enterprise mission. Increasingly, more and more content is translated and presented to target customers without EVER going through any post-editing modification. The business value of the translation is often defined by its utility to the consumer in a digital journey, basic understandability, availability-on-demand, and the overall CX impact, rather than linguistic perfection. Generally, useable accuracy and timely delivery matter more than perfect grammar and fluency. The phrase "good enough" is used both disparagingly and as a positive attribute for translation output that is useful to a customer even in a less than “perfect” state.


So we have a situation today where the term translation quality is often meaningless even in "human translation" because it cannot be described to an inexperienced buyer of translation services (or regular human beings) in a clear, objective, and consistently measurable way. Comparing different human translations of the same source material is often an exercise in frustration, or subjective preference at best. Every sentence can have multiple correct, accurate translations, so how do we determine which is the best translation? Since every LSP in the industry claims to provide the "best quality", such a claim is useless to a buyer who does not wish to wade through discussions of error counts, error categories, and error-monitoring dashboards that are sometimes used to illustrate translation quality.


Defining Machine Translation Output Quality


The MT development community has also had difficulty establishing a meaningful and widely useful comparative measurement for translation quality. Fortunately, they had assistance from the National Institute of Standards & Technology (NIST), which developed a methodology to compare the translation quality of multiple competing MT systems under carefully controlled evaluation protocols. NIST used a variant of BLEU scores and other measures of precision, recall, adequacy, and fluency to compare different MT systems rapidly in a standardized and transparent manner. These efforts probably helped to establish BLEU as a preferred scoring methodology for rating both evolving and competing MT systems. 

The competitive evaluation approach works when multiple systems are compared under carefully monitored test protocols, but becomes less useful when an individual developer announces "huge improvements" in BLEU scores, as it is easy to make extravagant claims of improvement that are not easily validated. Some independent evaluations used by many today provide comparisons where several systems may have actually trained on the test sets - this is the equivalent of giving a student the exam with the answers before a formal test. This makes some publicly available comparisons done by independent parties somewhat questionable and misleading. Other reference-test-set-based measurements like hLepor, Meteor, chrF, Rouge, and others are plagued by similar problems. These automated measurements are all useful, but they are unreliable indicators of absolute quality.
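To make this concrete, here is a minimal sketch of how such reference-based scores are typically computed with the open-source sacrebleu library (assuming it is installed). The two sentence pairs are invented placeholders, not a real test set.

```python
# A minimal sketch of automated reference-based scoring with sacrebleu.
# The sentences below are made-up placeholders, not a real test set.
import sacrebleu

hypotheses = [
    "The patient should take the medicine twice a day.",
    "Click the button to save your changes.",
]
references = [[
    "The patient should take the medication twice daily.",
    "Click the button to save your changes.",
]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)

# Higher scores mean closer overlap with the reference, not "better meaning".
print(f"BLEU: {bleu.score:.1f}")
print(f"chrF: {chrf.score:.1f}")
```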

Best practices today suggest that a combination of multiple automated measures needs to be used together with human assessments of MT output to really get a handle on the relative quality of different MT systems. Again, this quickly gets messy as soon as we start asking annoying questions like:
  • What are we testing on?
  • Are we sure that these MT systems have not trained on the test data? 
  • What kinds of translators are evaluating the different sets of MT output?  
  • How do these evaluators determine what is better and worse when comparing different correct translations?
  • How many sentences are needed to make a meaningful assessment and draw accurate conclusions when comparing multiple MT systems' performance on the same source material?
So we see that conducting an accurate evaluation is difficult and messy, and it is easy to draw the wrong conclusions from easy-to-make errors in the evaluation process.
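One practical way to keep the automated and human signals honest with each other is to check how well the automatic metric correlates with human judgments on the same segments. Below is a toy sketch of that check; all the scores are invented for illustration, and in practice the human ratings would come from qualified evaluators.

```python
# A toy sketch of sanity-checking an automatic metric against human judgments.
# The numbers are invented; in practice, use segment-level metric scores and
# ratings from qualified human evaluators on the same segments.
from scipy.stats import pearsonr, spearmanr

metric_scores = [0.62, 0.48, 0.91, 0.33, 0.75, 0.58, 0.80, 0.41]   # e.g. chrF per segment
human_ratings = [4.0, 3.0, 5.0, 2.0, 4.5, 3.5, 4.0, 2.5]            # e.g. adequacy on a 1-5 scale

r, p = pearsonr(metric_scores, human_ratings)
rho, p_rho = spearmanr(metric_scores, human_ratings)

# A low correlation warns you that the automatic metric is not tracking
# what your human evaluators actually value for this content.
print(f"Pearson r = {r:.2f} (p={p:.3f}), Spearman rho = {rho:.2f} (p={p_rho:.3f})")
```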

However, in the last few years, several MT developers have claimed to produce MT systems that have achieved human parity. This has been especially true with the advent of neural machine translation. These claims are useful for creating a publicity buzz among ignorant journalists and fear amongst some translators, but usually disappoint anybody who looks more closely.

I have been especially vocal in challenging the first of these broad human-parity claims, as seen here: The Google Neural Machine Translation Marketing Deception. My challenge was very specific, relating to particular choices in the research approach and how the supporting data was presented. A few years later, Microsoft claimed to have reached human parity on a much narrower task with their Chinese-to-English news system, but also said: 
Achieving human parity for machine translation is an important milestone of machine translation research. However, the idea of computers achieving human quality level is generally considered unattainable and triggers negative reactions from the research community and end-users alike. This is understandable, as previous similar announcements have turned out to be overly optimistic. 
The goal of achieving human parity has become a way to say that MT systems have gotten significantly better, as this Microsoft communication shows. I was also involved with the SDL claim of having "cracked Russian", yet another broad claim that human parity has been reached 😧. 

Many who are less skeptical than I am will take a claim of human parity to mean that the MT engine can produce translations of equal quality to those produced by a human translator. This can indeed be true on a small subset of carefully selected test material, but alas, it is not usually true for much of what we submit with high expectations to these allegedly human-parity MT engines. This is the unfortunate history of MT: over-promising and under-delivering. MT promises are so often empty promises 😏. 

While many in the translation and research communities feel a certain amount of outrage over these exaggerated claims (based on MT output they see in the results of their own independent tests) it is useful to understand what supporting documentation is used to make these claims. 

We should understand that at least among some MT experts there is no deliberate intent to deceive, and it is possible to do these evaluations with enough rigor and competence to make a reasonable claim of breakthrough progress, even if it falls short of the blessed state of human parity. 

There are basically two definitions of human parity generally used to make this claim.
Definition 1. If a bilingual human judges the quality of a candidate translation produced by a human to be equivalent to one produced by a machine, then the machine has achieved human parity.

Definition 2. If there is no statistically significant difference between human quality scores for a test set of candidate translations from a machine translation system and the scores for the corresponding human translations, then the machine has achieved human parity.
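As a rough sketch of what Definition 2 implies in practice, assume we have segment-level human quality scores (for example, direct-assessment scores on a 0-100 scale) for both the MT output and a professional human translation of the same source segments; a paired significance test then decides whether the difference counts as "statistically significant". The numbers below are invented for illustration.

```python
# A minimal sketch of the test implied by Definition 2. Assumes human quality
# scores (e.g. 0-100 direct assessment) exist for MT output and for a human
# translation of the same segments. The numbers are invented for illustration.
from scipy.stats import ttest_rel, wilcoxon

mt_scores    = [78, 85, 62, 90, 71, 88, 66, 80, 74, 83]
human_scores = [82, 84, 70, 91, 75, 86, 72, 85, 77, 88]

t_stat, p_val = ttest_rel(mt_scores, human_scores)      # paired t-test
w_stat, p_w   = wilcoxon(mt_scores, human_scores)       # non-parametric alternative

# "No statistically significant difference" (p above the chosen threshold) is
# what a parity claim under Definition 2 rests on; this is why small samples
# and noisy raters make such claims easy to reach and hard to trust.
print(f"paired t-test: p = {p_val:.3f}; Wilcoxon: p = {p_w:.3f}")
```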
Again the devil is in the details, as the data and the people used in making the determination can vary quite dramatically. There are (50?) shades of grey rather than black and white facts in most cases.  The most challenging issue is that human judges and evaluators are at the heart of the assessment process. These evaluators can vary in competence and expertise and can range from bilingual subject matter experts and professionals to low-cost crowdsourced workers who earn pennies per evaluation. The other big problem is the messy, inconsistent, irrelevant, biased data underlying the assessments.


Ensuring objective, consistent human evaluation is necessary but difficult to do on the required continuous, ongoing basis. If the underlying data used in an evaluation are fuzzy and unclear, we actually move toward obfuscation and confusion rather than clarity. This can be the scientific equivalent of fake news. MT engines evolve over time, and the better the feedback, the faster the evolution, provided developers know how to use this feedback to drive continuous improvements.  

Again, as Luigi Muzii states:
The problem with human evaluation is bias. The red-pen syndrome.

Indeed, human evaluation of machine translation is known for being expensive, time-consuming, and often biased, and yet it is supposed to overcome the drawbacks introduced by the limited accuracy and approximation of automatic evaluation. ... Indeed, translation quality metrics have become more and more complex and overly detailed, and always seem to be based on the error-catching [and categorization] approach that has proved costly and unreliable thus far.  

 



Useful Issues to Understand 


While the parity claims can be roughly true for a small sample of evaluated sentences, it is difficult to extrapolate parity to a broader range of content because it is simply not possible to do machine translation output evaluation on an MT scale (millions of sentences). Some of the same questions that obfuscate quality discussions with human translation services also apply to MT. If we cannot define what a "good translation" is for a human, how is it possible to do this for a mindless, common-sense-free machine, where instruction and direction need to be explicit and clear?  

Here are some validation and claim-verification questions that can help an observer understand the extent to which parity has been reached, or expose the deceptive marketing spin that may motivate the claims.

What was the test data used in the assessments? 
MT systems are often tested and scored on news-domain data, which is the most plentiful. This may not correlate well with system performance on typical global enterprise content. A broad range of different content types needs to be included to make claims as extravagant as having reached human parity. 

What is the quality of the reference test set?
In some cases, researchers found that the test sets had been translated, and then back-translated with MTPE into the original source language. This could mean the content of the test sets was simplified from a linguistic perspective, and thus easier to machine translate. Ideally, test sets should be created by expert humans, contain original source material, and not consist of data translated from another language.

Who produced the reference human translations being used and compared?
The reference translations against which all judgments will be made should be "good" translations. Easily said, but not so easily done. If competent human translators create the reference translations, the test process will be expensive. Thus, it is often more financially expedient to use MT or cheap translators to produce the test material. This can create a positive bias for widely used MT systems like Google Translate. 

How much data was used in the test to make the claim? 
Often human assessments are done with as little as 50 sentences, and automated scoring is rarely done with more than 2,000 sentences. Thus, drawing conclusions on how any MT system will handle the next million sentences it will process is risky, and likely to be overly optimistic. For example, when an MT developer says that over 90% of the system’s output has been labeled as a human translation by professional translators, they may be looking at a sample of only 100 or so sentences. To then claim that human parity has been reached is perhaps over-reaching.  
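A quick back-of-the-envelope calculation shows how wide the uncertainty really is when the sample is small. The sketch below uses a simple normal-approximation confidence interval around a hypothetical "90% judged at parity" result; it is illustrative only.

```python
# Back-of-the-envelope uncertainty behind a claim like "90% of output was
# judged equal to human translation" when only a small sample was evaluated.
# Simple normal-approximation 95% interval; illustrative only.
import math

def approx_95ci(p_hat, n):
    margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

for n in (50, 100, 2000, 100_000):
    lo, hi = approx_95ci(0.90, n)
    print(f"n={n:>7}: 90% observed -> roughly {lo:.1%} to {hi:.1%} plausible")

# With 100 sentences, the "true" rate could plausibly be anywhere from about
# 84% to 96%, and it says nothing about the next million sentences.
```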

Who is making the judgments and what are their credentials?
It is usually cost-prohibitive to use expert professional translators to make the judgments and thus evaluators are often acquired on crowdsourcing platforms where evaluator and translator competence is not easily ascertained. 

It can be seen that doing an evaluation properly would be a significant and expensive task, and MT developers have to do this continuously while building a system. The process needs to be efficient, fast, and consistent. It is often only possible to run such careful tests on the most mission-critical projects, and it is not realistic to follow all these rigorous protocols for typical low-ROI enterprise projects. This is why BLEU and other "imperfect" automated quality scores are so widely used: they provide developers with continuous feedback in a fast and cost-efficient manner if they are applied with care and rigor. Recently there has been much discussion about testing on documents, to assess understanding of context rather than just sentences. This will add complexity, cost, and difficulty to an already difficult evaluation process, and IMO will yield very small incremental benefits in evaluative and predictive accuracy. Improved process recommendations need to be balanced against their cost and the benefit of improved predictability. 


The Academic Response


Recently, several academic researchers published an examination of these MT human-parity claims. The study, “A Set of Recommendations for Assessing Human–Machine Parity in Language Translation,” is worth a look to see the many ways in which evaluations can go wrong. The study showed that human evaluation of MT quality depends on three factors: “the choice of raters, the availability of linguistic context, and the creation of reference translations.”


Some findings from this report in summary:

“Professional translators showed a significant preference for human translation, while non-expert [crowdsourced] raters did not”.

“Human evaluation methods which are currently considered best practice fail to reveal errors in the output of strong NMT systems”

The authors recommend the following design changes to MT developers in their evaluation process:
  • Appoint professional translators as raters
  • Evaluate documents, not sentences
  • Evaluate fluency on top of adequacy
  • Do not heavily edit reference translations for fluency
  • Use original source texts
Most developers would say that implementing all these recommendations would make the evaluation process prohibitively expensive and slow. The researchers agree, and welcome further studies into “alternative evaluation protocols that can demonstrate their validity at a lower cost.” Process changes need to be practical and reasonably achievable, and there is a need to balance improved process benefits against cost and the gain in predictability.  


What Would Human Parity MT Look Like?


MT developers should refrain from making claims of achieving human parity until there is clear evidence that this is happening at scale. Most current claims of achieving parity are based on laughably small samples of 100 or 200 sentences. I think it would be useful to the user community-at-large for MT developers to refrain from making these claims until they can show all of the following:
  • Produce accurate and fluent output for 90% or more of a large sample (>100,000 or even 1M sentences) that truly looks like it was translated by a competent human
  • Catch obvious errors in the source and possibly even correct these before attempting to translate 
  • Handle variations in the source with consistency and dexterity
  • Have at least some nominal amount of contextual referential capability
Note that these are things we would expect without question from an average translator. So why not from the super-duper AI machine? 

Until we reach the point where all of the above is true, it would be useful to CLEARLY state the boundary limits of the claim with key parameters underlying the claim. Such as:
  • How large the test set was (e.g. 90% of 50 sentences where parity was achieved) 
  • What kind of source material was tested
  • How varied the test material was: sentences, paragraphs, phrases, etc...
  • Who judged, scored, and compared the translations
For example, an MT developer might state a parity claim as follows:
We found that 45 of a sample of 50 original, human-sourced sentences translated by the new MT system were judged by a team of three crowdsourced translator/raters to be indistinguishable from the translations produced by two professional human translators. Based on this data, we claim the system has achieved "limited human parity".

Until this minimum set of capabilities is shown at MT scale (>100,000 or even 1M sentences), we should tell MT developers to STFU and give us the claim parameters in a simple, clear, summarized way, so that we can weigh the reality of the data versus the claim for ourselves.  

I am also skeptical that we will achieve human parity by 2029 as some "singularity" enthusiasts have been saying for over a decade. 
 
"There is not the slightest reason to believe in a coming singularity. Sheer processing power [and big data] is not pixie dust that magically solves all your problems." 
Steven Pinker 
Elsewhere, Pinker also says:
"… I’m skeptical, though, about science-fiction scenarios played out in the virtual reality of our imaginations. The imagined futures of the past have all been confounded by boring details: exponential costs, unforeseen technical complications, and insuperable moral and political roadblocks. It remains to be seen how far artificial intelligence and robotics will penetrate into the workforce. (Driving a car is technologically far easier than unloading a dishwasher, running an errand, or changing a baby.) Given the tradeoffs and impediments in every other area of technological development, the best guess is: much farther than it has so far, but not nearly so far as to render humans obsolete."

Recently some in the Singularity community have admitted that "language is hard" as you can see in this attempt to explain why AI has not mastered translation yet.

Michael Housman, a faculty member of Singularity University, explained that the ideal scenario for machine learning and artificial intelligence is something with fixed rules and a clear-cut measure of success or failure. He named chess as an obvious example and noted machines were able to beat the best human Go player. This happened faster than anyone anticipated because of the game’s very clear rules and limited set of moves.

Housman elaborated, “Language is almost the opposite of that. There aren’t as clearly-cut and defined rules. The conversation can go in an infinite number of different directions. And then of course, you need labeled data. You need to tell the machine to do it right or wrong.”

Housman noted that it’s inherently difficult to assign these informative labels. “Two translators won’t even agree on whether it was translated properly or not,” he said. “Language is kind of the wild west, in terms of data.”

Perhaps, we need to admit that human parity MT at scale is not a meaningful or achievable goal. If it is not possible to have a super-competent human translator capable of translating anything and everything with equal ease, why do we presume a machine could?

Perhaps what we really need is an MT platform that can rapidly evolve in quality with specialized human feedback. Post-editing (MTPE) today is generally NOT a positive experience for most translators, but human interaction with the machine can be a significantly better and more positive experience. Developing interactive and highly responsive MT systems that can assist, learn, and improve the humdrum elements of translation tasks instantaneously might be a better research focus. This may be a more worthwhile goal than a God-like machine that can translate anything and everything at human parity. 

Even in the AI-will-solve-all community, we know that "language is hard," so maybe we need more focus on improving the man-machine interface and the quality of the interaction, and on finding more sophisticated collaborative models. Rapid evolution, intuitive and collaborative interaction, and instant learning seem like a more promising vision to me than crawling all the data on the web and throwing machine-learning pixie dust at your ten-trillion-word TM training data. 

Getting to a point where the large majority of translators ALWAYS WANT TO USE MT because it simply makes the work easier, more pleasant, and more efficient is perhaps a better focus for the future. I would also bet that this different vision is a more likely path to better MT systems that consistently produce better output over millions of sentences.   

Thursday, March 25, 2021

The Impact of MT on the Freelance Translator


The ProZ.com Podcast







This is an interview I did with Paul Urwin of ProZ.com; the links will take you to the podcast episodes.

The conversation covers possible strategies that freelance translators can adopt to deal with PEMT and provides some guidance (hopefully) on potential new skills that professionals can develop. 

It also provides context on how valuable the translator remains even with continuously improving MT, and points to a growing awareness that translators are a resource whose value can only grow given the never-ending momentum of content that needs to be translated.




 

This is Part 1.

Paul talks with machine translation expert Kirti Vashee about interactive-adaptive MT, linguistic assets, freelance positioning, how to add value in explosive content situations, e-commerce translation and the Starship Enterprise.







This is Part 2.

Paul continues the fascinating discussion with Kirti Vashee on machine translation. In this episode, they talk about how much better MT can get, which languages it works well for, data, content, pivot languages and machine interpreting.







The Future is NOT just MT








Tuesday, February 16, 2021

Building Equity In The Translation Workflow With Blockchain

This is a guest post by Bob Kuhns on the subject of blockchain use in the translation industry. He presents a very "simple" model showing how a blockchain could enable an ongoing, robust, and trusted Buyer-to-Translator business connection that could quite possibly reduce the role of LSP middlemen, whose primary value-add in business translation work today is project management services and coordination. Though this is a valuable service, it often significantly increases the cost of translation, and also sometimes creates discord, disgruntlement, and enmity amongst the freelance translation-service suppliers who ultimately do the work. 

A blockchain solution is not just about technology, it’s about solving business problems that have been insolvable before due to the inability of the ecosystem to share information in a transparent, immutable, and trusted manner.  

LSPs continue to struggle in their communications on translation quality and the value of ongoing project services, so in theory a blockchain that did in fact deliver direct-to-buyer translation services that are trusted, reliable, and predictable would indeed be a great step forward for the business translation industry. This is also exactly why some in the LSP sector would not want blockchain to succeed. However, there is also a role for a more enlightened LSP in a functioning translation blockchain, one committed to transparency, equitable sharing of business value and benefits, and ultimate customer success in all their globalization initiatives.    

Blockchains show potential to address key concerns of our digitally-driven lives, such as a lack of transparency, accountability, verifiable identity, and control of data. Blockchain has the potential to enable defined quality to be delivered at a defined price in a defined timeframe with minimal administration overhead. The extent to which it is able to do this in a clean and trusted way will likely drive its adoption. Blockchains are poised to catalyze new business models by cutting the costs of verifying the truth.  

Explanations of blockchains tend to get complicated quickly. In very basic language, however, blockchains help us certify that something is true without someone in the middle doing checks and balances. Recent translation market activity perhaps points to some of the vital building blocks for making blockchain more real in the translation industry.     

What are some of the elements that might make a functional blockchain in the translation industry possible? 

  • A robust TMS platform that could enable ANY buyer (not just localization buyers) to engage ANY translator to perform a necessary translation task.
  • Assistive translation technology that can be easily connected into a blockchain workflow (MT, TM, NLP Tools)
  • The ability to create self-sovereign data that would allow more equitable sharing of data. In the vision of many pioneers in the blockchain space, the “ownership” of data would switch from the organization that gathers it to the individual or organization that contributed it.
  • An Independent Translator rating, certification, ranking, identification database to enable competent resources to be identified and selected. 
  • Smart Contracts
  • And....

For those who think blockchain is still a distant dream, there is evidence that it is making meaningful headway in improving efficiency, accuracy, and transparency in some areas that have historically been project-management nightmares. The members of Tradelens, a blockchain joint venture between IBM and the shipper Maersk, control more than 60% of the world’s containership capacity. Seriously, do we really believe that a translation task has more variance and unplanned changes than these goods and trade flows? Watch the short movie clip at the link above to see it at work. Having trusted food quality from all over the world is perhaps an even more challenging scenario. Food Trust is a blockchain ecosystem that covers more than 100 organizations, including Carrefour and the top four grocery retailers in the U.S.     

It is quite likely that the old guard (executives, managers, and localization teams) will not be at the forefront of translation blockchain if it ever does become a reality. Mostly because they are "just too old, too tired, and too blind" as the movie says. Change is most often driven by the young, who see the new potential and have the motivation to solve old, enduring problems. 

"Disruption could also be spurred by an even younger generation. New York Times writer David Brooks traveled to college campuses to understand how students see the world. In a story, he wrote after the experience, starkly titled “A Generation Emerging from the Wreckage,” Brooks describes a cohort with diminished expectations. Their lived experience includes the Iraq war, the financial crisis, police brutality, political fragmentation, and the advent of fake news as a social force. In short, an entire series of important moments in which “big institutions failed to provide basic security, competence, and accountability.” To this cohort, in particular, blockchains’ promise of decentralization, with its built-in ability to ensure trust, is tantalizing. To circumvent and disintermediate institutions that have failed them is a ray of hope—as is establishing trust, accountability, and veracity through technology, or even the potential to forge new connections across fragmented societies.

This latent demand is well aligned with the promise of blockchains. While it’s a long road to maturity, these social forces provide a receptive environment, primed and ready for the moment entrepreneurs strike the right formula. "             

Alison McCauley, author of Unblocked 




===============


Recent news has shed light on unfair work arrangements for freelancers or “gig workers,” with Uber being a prominent example, and more industries have now begun to examine their relationships with freelancers. Though on a modest scale, this self-examination has come to the translation industry as well. The proposed translation model is grounded on the idea that blockchain can bring equity to translators while streamlining the translation workflow.

The Realization For Change

Even before the pandemic, the translation industry was changing and self-reflecting. NMT took center stage and the less-than-equitable working relationships for translators gained notice [1]. In “The State of the Linguist Supply Chain,” Common Sense Advisory examined the translation supply chain from the perspective of over 7,000 linguists, 75% of whom are freelancers representing 178 countries and 155 language pairs [2]. This data-rich survey identified many discrepancies between translators/linguists and their clients - the Buyers of translations and LSPs.

Several illustrative takeaways are:

  • Over half (54%) of the respondents could not live solely on their translation income.
  • Linguists are attracted to the translation profession because of flexible hours (91%) and the diversity of projects (75%). Only 33% rated their pay as being a plus.
  • The frustrations of linguists include fluctuating income (65%), irregularity of work (57%), and lack of respect (25%).
  • There is a preference for working directly for clients (65%) because translators earn more (80%), have more job flexibility (56%), and are paid more quickly: 76% of clients pay in less than 30 days, compared with only 32% of LSPs.
  • The largest benefit of working for LSPs is more work (71%).
  • Translators perceived that cost (40%) and speed (33%) are more important to clients than quality.

Some of the largest challenges facing translators are finding clients (55%), negotiating prices (50%), and dealing with tight deadlines (35%). Just before the pandemic, linguists felt the market changing toward lower prices (64%) and faster turnaround times (56%).

Though not included here, linguists’ comments found in the report provide a more human context of their jobs than the raw percentages.


The Standard Model and Its Beneficiaries

The current translation workflow is one where translation Buyers hire LSPs to provide translators and manage the translation process including day-to-day project management, translation reviews, source-target file transfers, and handling of invoicing/payments between Buyers and translators. Even with translation management systems (TMSs), the tasks of LSPs are still mostly manual with continual updates to project status and translation reviews. In short, LSPs orchestrate the translation process and relieve Buyers from needing dedicated localization departments.

Who Benefits From the Standard Model

The primary beneficiaries of the Standard Model are the Buyers, who can have their content translated without investing heavily in a localization team, and the LSPs. While LSPs do provide a much-needed project management function, they control the purse strings and, like other businesses, they exist to maximize their share of the purse.

Weaknesses of the Standard Model

Inadequacies range from workflow issues to fairness.

Project management overhead is a glaring inefficiency. Despite the use of TMSs, there is simply too much human administration and intervention throughout a project.

Translation delays can result from time differences between LSPs and their geographically-dispersed pool of translators, especially when issues cannot be resolved promptly.

Security is a major problem for the Standard Model. LSPs do not know who is actually doing the translating. Online MT engines have been used to translate texts, risking exposure of proprietary material. [3]

The inequity of the Standard Model, where the Buyer wants to minimize translation costs and the LSPs want to maximize their profits, leaves [especially freelance] translators at the bottom of the food chain.


A Blockchain-based Translation Workflow

Breaking with the Standard Model, the proposed translation workflow reduces human administration and improves translation workflows with blockchain as the backbone. [4]

Blockchain, smart contracts and oracles are the key pieces of the proposed workflow. A blockchain is a decentralized ledger of immutable, encrypted records (blocks) securely denoting asset transfers such as source/target files. Since each block on a blockchain contains the identifiers of a provider and recipient of an asset, the provenance of an asset is traceable.

A smart contract is a computer protocol that is intended to enforce the execution of a contract without a third party. Smart contracts could facilitate direct source-target file transfers between Buyers and translators and quicker payment for translators when projects are completed. These transactions are trackable and irreversible.

For a wide set of applications, a blockchain (more precisely, a smart contract) might require real-world information. Fulfilling that need, a blockchain oracle is an entity providing network-external data through an external transaction. Linguists and MT engines are examples of oracles receiving data (source files) and sending data (target files) in a translation blockchain workflow.

Figure 1: Blockchain Translation Schematic

A skip through the workflow

While this skip through the blockchain translation schematic (Figure 1) glosses over many details of the translation process, it points to where the human workflow administration performed by LSPs is handled by smart contracts, with a blockchain recording the handoffs of files and payments; a toy code sketch of these handoffs follows the numbered steps below.

  1. As with the Standard model, a Buyer assembles source documents and project requirements including target languages, linguistic assets (terminologies and TMs), budgets, and schedules.
  2. The source content, linguistic assets, and requirements are recorded on a blockchain.
  3. Smart contracts execute throughout the workflow directing texts to TMs, then to MT engines, or directly to MT engines or translators. Each file transfer is recorded on the blockchain.
  4. Based on MT review acceptability, smart contracts initiate transfers of translations to Translators/MTPEs for review. The blockchain records the translation transfers.
  5. Once a reviewer approves the translations, a smart contract executes, thereby sending translated material to the Buyer. The blockchain is updated.
  6. The buyer’s acceptance of the completed translations invokes a smart contract that sends payment to the Translators/MTPEs. The blockchain records the details of payments.
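To make the mechanics of these steps a little more tangible, here is a toy, in-memory sketch of a hash-chained ledger and a smart-contract-like payment rule. All of the names, fields, and logic are invented for illustration; a real deployment would of course run on an actual blockchain platform with real smart-contract code.

```python
# A toy, in-memory sketch of the workflow above: each handoff is recorded as a
# hash-chained block, and a "smart contract" releases payment only after the
# buyer accepts the delivery. Names, fields, and logic are illustrative only.
import hashlib
import json
import time

class Ledger:
    def __init__(self):
        self.blocks = []

    def record(self, event: dict) -> dict:
        prev_hash = self.blocks[-1]["hash"] if self.blocks else "0" * 64
        payload = {"timestamp": time.time(), "event": event, "prev_hash": prev_hash}
        # Hash the block contents so any later tampering breaks the chain.
        payload["hash"] = hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode()
        ).hexdigest()
        self.blocks.append(payload)
        return payload

def escrow_contract(ledger: Ledger, buyer: str, translator: str, fee: float):
    """Smart-contract-like closure: pays the translator once the buyer accepts."""
    def on_acceptance(accepted: bool):
        if accepted:
            ledger.record({"type": "payment", "from": buyer, "to": translator, "amount": fee})
        else:
            ledger.record({"type": "rejection", "from": buyer, "job_for": translator})
    return on_acceptance

# Walk through the handoffs from the numbered steps above.
ledger = Ledger()
ledger.record({"type": "source_submitted", "buyer": "Buyer-A", "files": ["doc1.xliff"]})
ledger.record({"type": "mt_output", "engine": "mt-engine-1", "files": ["doc1.mt.xliff"]})
ledger.record({"type": "mtpe_review", "translator": "Translator-7", "files": ["doc1.final.xliff"]})
settle = escrow_contract(ledger, buyer="Buyer-A", translator="Translator-7", fee=120.0)
settle(accepted=True)

for block in ledger.blocks:
    print(block["event"]["type"], block["hash"][:12])
```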


Who Benefits from the Blockchain Model?

The Buyers and Linguists are the primary beneficiaries. That is not to say that there is no role for LSPs in the translation industry. Until there is widespread adoption of a blockchain or some other nearly fully-automated model of translation, LSPs will co-exist with automation. Buyers may also turn to LSPs for projects when they lack available project managers.

Buyers

Buyers want quality translations with tight deadlines and a limited budget.

The streamlining of the translation workflow with blockchain replacing much of the managerial overhead will reduce costs. The savings could be used for other localization projects and ideally for fairly compensating the translators.

Because every file transfer or handoff is recorded on the blockchain, the Buyer is fully aware of the state of the project at any time. They now have access to real-time project management.

The Buyer’s material is secure. Inherent features of blockchains in recording asset transfers are transparency, traceability, and security. With the appropriate service agreements with any of the oracles involved, security weak points can be tightened.

Increased automation leads to faster translations. Since files are transferred and tracked upon execution of smart contracts, time differences are eliminated and translations can be produced 365/24/7.

Linguists

As noted, linguists prefer to work directly for clients with better pay, less time pressure, and recognition.

In the proposed workflow, the work of translators can be tracked. Those producing quality translations would gain recognition and could be compensated in a fair, transparent, and consistent way. There is another side effect of traceability as well. Consistent errors either by humans or an MT engine could be identified and corrections made to improve quality on future projects.

Without the intermediate LSPs, linguists will be working and communicating directly with the Buyer. Yes, the workflow will be automated, but there would be no obstacles to human communication between the Buyer and the translators.

Elimination of time-difference delays and the human management level could allow for more time for actual translation and should lessen the time pressure felt by translators today.


Obstacles to the Blockchain Workflow

While there is much to be gained from the blockchain workflow, there are three broad hurdles for its success. One is technical and the others are due to industry resistance.

  • The technical obstacles are huge. Despite all the hype and predictions, blockchain technology is in its infancy and its future remains uncertain. Nevertheless, blockchain technology is viable and scales, with Bitcoin as the most visible example. So the utility of blockchain cannot be dismissed a priori.
  • Industries do not usually embrace change, especially when it changes their business models. The Standard Model represents business as usual and it has taken time and effort to put the infrastructure into place. LSPs, who play a valuable role in today’s translation process, would be most resistant to change. However, the overhead of LSPs drives up translation costs, perhaps at the expense of the translators at the far end of the supply chain.
  • Another major barrier is the human one. With the diminished roles of LSPs, localization managers would need to adapt to a new work environment, undoubtedly a stressful situation. Translators, especially those who have well-established relationships with LSPs, would lose a conduit for work and would also have to adapt. However, there could be monetary rewards as their work and expertise are recognized via blockchain.

Change is not easy!

A Few Final Remarks

Blockchain can provide the backbone for a supply chain that brings equity, improved work arrangements, and recognition to translators. At the same time, blockchain with smart contracts streamlines the workflow by automating much of the current managerial tasks. Granted, the blockchain model is a heavy lift and faces opposition from stakeholders that control much of the workflow today. In any case, whether technical, legal, or social pressures bring about change to the supply chain, solutions for a more equitable translation environment are being discussed and concrete solutions are being proposed. Change seems inevitable.


[1] See TAUS Webinar “Blockchain: When the Token Economy Meets the Translation Industry” - https://blog.taus.net/blockchain-when-the-token-economy-meets-the-translation-industry; For fair pay, see: TAUS blog “Fair Pay for the translators and data-keepers!” - https://blog.taus.net/2021-according-to-taus

[2] Pielmeier, Hélène, and Paul O’Mara, “The State of the Linguist Supply Chain,” CSA Research, January 2020.

[3] See: https://slator.com/technology/translate-com-exposes-highly-sensitive-information-massive-privacy-breach/ and https://www.nrk.no/urix/warning-about-translation-web-site_-passwords-and-contracts-accessible-on-the-internet-1.13670874

[4] For other blockchain architectures, see: Exfluency - https://www.exfluency.com and Kuhns, Bob, “The Pros and Cons of Blockchains and L10N Workflows,” TAUS White Paper, March 2019 - https://www.taus.net/insights/reports/the-pros-and-cons-of-blockchains-and-l10n-workflows-white-paper






Bob Kuhns is an independent consultant specializing in language technologies. His clients have included the Knowledge Technology Group in the Sun Microsystems Labs and Sun’s Globalization group. In the Labs, Bob was part of a team developing a conceptual indexing system and for the Globalization group, he was the project manager and lead translation technology designer for a controlled language checker, a terminology management system, and a hybrid MT system. He was also responsible for developing translation metrics and leading a competitive MT evaluation. Bob has also conducted research and published reports with Common Sense Advisory, TAUS, and MediaLocate on a variety of topics including managed authoring, advanced leveraging, MT, blockchain, and L10n workflows, and global social media.

Bob’s email is: kuhns@rcn.com

Tuesday, January 19, 2021

Adding Commonsense Reasoning to Natural Language Processing Applications

This article is reprinted with permission from the original poster @VeredShwartz. This post might be challenging reading for the usual reader of this blog, but I think that even skimming through it might be useful to many, to get a sense of possibly the most formidable challenge in the artificial intelligence community: building common sense capabilities into existing and emerging AI deployments.  

Commonsense knowledge consists of facts about the everyday world that all humans are expected to know. Commonsense knowledge helps us solve problems in the face of incomplete information. It is currently considered an unsolved problem in AGI and is a focus of the Allen Institute for Artificial Intelligence, with which the author is associated. 

Deep learning is self-education for machines; you feed a machine learning system huge amounts of data, and eventually it begins to discern patterns all by itself. But despite their remarkable achievements, and occasional ability to produce human-like outputs, machine learning algorithms are at their core complex mathematical functions that map observations to outcomes, or forecast patterns that they have previously seen and explicitly learned. Therefore, they are only as good as their data, and they start to break as the data they face in the world starts to deviate from the examples they saw during training. Neural MT is an example: great progress indeed, but far from having solved the translation problem.  

We hear continuously about the relentless "big data" that is driving AI progress, but we are finding more and more cases where the current approach of deep learning and more data is not enough. The path to machine commonsense is unlikely to be brute force training of larger neural networks with deeper layers on more data.  Whilst deep learning excels at pattern recognition, it’s very poor at adapting to changing situations even when small modifications of the original case are encountered, and often has to be re-trained with large amounts of data from scratch. 

"The great irony of common sense—and indeed AI itself—is that it is stuff that pretty much everybody knows, yet nobody seems to know what exactly it is or how to build machines that possess it," said Gary Marcus, CEO and founder of Robust.AI. "Solving this problem is, we would argue, the single most important step towards taking AI to the next level. Common sense is a critical component to building AIs that can understand what they read; that can control robots that can operate usefully and safely in the human environment; that can interact with human users in reasonable ways. Common sense is not just the hardest problem for AI; in the long run, it's also the most important problem." 

Common sense has been called the “dark matter of AI” — both essential and frustratingly elusive. That’s because common sense consists of implicit information — the broad (and broadly shared) set of unwritten assumptions and rules of thumb that humans automatically use to make sense of the world. Critics of over-exuberant AI claims frequently point out that two-year-old children have more common sense than existing deep-learning-based AI systems, whose "understanding" is often quite brittle and easily distracted and deranged.

Common sense is easier to detect than to define. The implicit nature of most common-sense knowledge makes it difficult and tedious to represent explicitly. 

DARPA, the US defense department’s research agency, has also recognized the absence of common sense as an important issue. They recently launched a project called Machine Common Sense. As they say: “The absence of common sense prevents intelligent systems from understanding their world, behaving reasonably in unforeseen situations, communicating naturally with people, and learning from new experiences. Its absence is considered the most significant barrier between the narrowly focused AI applications of today and the more general, human-like AI systems hoped for in the future.” 

Gary Marcus suggests combining traditional AI approaches together with deep learning as a way forward. 

"First, classical AI actually IS a framework for building cognitive models of the world that you can then make inferences over. The second thing is, classical AI is perfectly comfortable with rules. It’s a strange sociology right now in deep learning where people want to avoid rules. They want to do everything with neural networks, and do nothing with anything that looks like classical programming. But there are problems that are routinely solved this way that nobody pays attention to, like making your route on Google maps.

We actually need both approaches. The machine-learning stuff is pretty good at learning from data, but it’s very poor at representing the kind of abstraction that computer programs represent. Classical AI is pretty good at abstraction, but it all has to be hand-coded, and there is too much knowledge in the world to manually input everything. So it seems evident that what we want is some kind of synthesis that blends these approaches."


Yejin Choi and her collaborators at the Allen Institute have united traditional symbolic AI approaches with newer machine learning approaches in an attempt to address the commonsense challenge. One initiative, COMET (short for “commonsense transformers”) extends traditional symbolic reasoning with the latest advances in neural language modeling — a kind of deep learning that aims to imbue computers with a statistical “understanding” of written language. COMET is a  fusion of symbolic reasoning with a neural network and tries to solve the coverage and brittleness problems, of purely DL-approaches, at the same time.  COMET works by reimagining common-sense reasoning as a process of generating plausible (if imperfect) responses to novel input, rather than making airtight deductions by consulting a vast encyclopedia-like database.

Gary Marcus, a critic of the deep-learning fanboys and girls, often points out DL-only shortcomings to challenge the over-exuberance of these fans. To put progress in AI into a more realistic context, he says: “Just because you can build a better ladder doesn’t mean you can build a ladder to the moon.” To him and others, COMET’s approach suffers from a fundamental limitation of deep learning: “statistics ≠ understanding.”

Regardless, Vered presents a comprehensive picture of the many challenges faced, and the attempts at developing solutions, in introducing commonsense into NLP applications, arguably one of the most challenging problems in computing today. I think her post is a great resource for anybody who wants to quickly get a sense of the issue and the SOTA.



****** 

Commonsense Reasoning for Natural Language Processing

This long-overdue blog post is based on the Commonsense Tutorial taught by Maarten Sap, Antoine Bosselut, Yejin Choi, Dan Roth, and myself at ACL 2020. Credit for much of the content goes to the co-instructors, but any errors are mine. 

In the last 5 years, popular media has made it seem that AI is nearly---if not already---solved by deep learning, with reports on super-human performance on speech recognition, image captioning, and object recognition. The release of Google Translate’s neural models in 2016 reported large performance improvements: “60% reduction in translation errors on several popular language pairs”. But looking under the hood, these numbers seem to be misleading. Neural models find shortcuts to the correct answers through dataset-specific input-output correlations, essentially solving the dataset but not the underlying task. When models are challenged with adversarial out-of-domain examples, they perform poorly. Small unnoticeable noise added to images confuses object recognition models and changes their predictions. Visual question answering models guess the answer based on the frequency of answers for the same type of question in the training set, e.g. replying "2" to any "how many" question. Image captioning models often learn to recognize objects based solely on their typical environment and fail to recognize them outside their typical environment. In NLP, dialogue systems generate highly generic responses such as “I don’t know” even for simple questions. Open-ended generation is prone to repetition. Question answering systems are easily distracted by the addition of an unrelated sentence to the passage. And more. 

Figure 1: adversarial examples in computer vision (left) and natural language processing tasks (right).

Machine learning models today perform reasonably well on perception tasks (image and speech recognition). However, they mostly lack the ability to perform simple intuitive commonsense inferences that humans do in every minute of their waking hours, regarding pre-and post-conditions of events, understanding other people's motivations and intents, mental and emotional states, etc. 

Table of contents: 

  1. What is commonsense? 
  2. Is commonsense knowledge already captured by pre-trained language models? 
  3. How to create benchmarks to measure commonsense reasoning capabilities? 
  4. How to gather and represent machine-readable commonsense knowledge? 
  5. How to enhance neural models for commonsense reasoning tasks with symbolic knowledge? 
  6. Summary
What is commonsense? 
The boundaries of commonsense are quite challenging to define, but we will go with this working definition:
Commonsense is the basic level of practical knowledge and reasoning concerning everyday situations and events that are commonly shared among most people. 
For example, it's common sense that it's OK to keep the closet door open, but not the fridge door, as the food inside might go bad. 

Types of commonsense: 

Commonsense knowledge can be categorized according to types, including but not limited to:
  • Social commonsense: people are capable of making inferences about other people's mental states, e.g. what motivates them, what they are likely to do next, etc. This kind of inference is captured by the ATOMIC knowledge base discussed later. In addition, we each have a set of social norms of accepted behavior, e.g. knowing that “it's impolite to comment on someone's weight”. While these are often implicit in our actions and decisions, machines need to be taught them explicitly

  • Temporal commonsense: natural language rarely communicates explicit temporal information. Instead, it's vague and relies on the commonsense knowledge of the listener. For example, when told that "Dr. Porter is taking a vacation" we can predict that Dr. Porter will not be able to see us soon, as opposed to when "Dr. Porter is taking a walk". This requires knowing the typical duration of "taking a walk" (minutes) and that of "taking a vacation" (days). Other temporal knowledge is typical times, order, frequency, etc. of events which are addressed by the MC-TACO dataset and the TACO-LM time-aware contextual language model. 

  • Physical commonsense: a glass will likely shatter if it falls to the floor, which is a fact most people (and arguably cats) know. Physical commonsense includes knowledge about the physical properties and affordances of everyday objects, as tested in the PIQA dataset.
Commonsense is essential for humans to navigate everyday situations seamlessly and interact with each other in a reasonable and safe way, and for AI to understand human needs and actions better. Yet, endowing machines with such human-like commonsense reasoning capabilities has remained an elusive goal of AI research for decades. Past attempts, in the 1960s and 1970s, resulted in an AI winter, i.e. reduced interest and funding for AI research due to failed over-hyped research directions. In recent years, a new interest in machine commonsense has emerged, with the availability of stronger computing power and huge amounts of data. With that said, the path to machine commonsense is unlikely to be brute force training larger neural networks with deeper layers.   

Is commonsense knowledge already captured by pre-trained language models?

In the last 3 years, language models have been ubiquitous in NLP. Language models are pre-trained once, in a self-supervised manner that requires only a large text corpus. Traditionally, language models are trained to predict the next word in a sentence (top part of Figure 2, in blue), but they can also predict hidden (masked) words in the middle of the sentence, as in Google's BERT model (top part of Figure 2, in orange). This pre-training phase yields a function that gets a sequence of words (sentence, short paragraph) and returns a vector for each word in the sequence. 
  

Figure 2: Language models pre-training and fine-tuning.


As opposed to static word embeddings, language model-based word vectors are dynamic and recomputed for each context. At the most basic level, they assign different vectors to words when they are used in different senses, as in Figure 3. 


Figure 3: Static vs. dynamic word representations.
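To make the static vs. dynamic distinction concrete, here is a minimal sketch, assuming the HuggingFace transformers and PyTorch packages and a bert-base-uncased checkpoint (my choices for illustration, not something this post prescribes), that extracts the contextual vector of "bank" in different sentences and compares them:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    # Contextual vector of the first occurrence of `word` in `sentence`.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_dim)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

v_river1 = word_vector("she sat on the river bank", "bank")
v_river2 = word_vector("the boat drifted toward the bank of the river", "bank")
v_money = word_vector("he deposited cash at the bank", "bank")

cos = torch.nn.functional.cosine_similarity
print(cos(v_river1, v_river2, dim=0).item())  # same sense: typically higher
print(cos(v_river1, v_money, dim=0).item())   # different sense: typically lower

A static embedding cannot make this distinction, since it assigns "bank" a single vector regardless of context.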


Do off-the-shelf pre-trained language models already capture commonsense knowledge? 

✅  They are capable, to some extent, of filling in incomplete commonsense facts or ranking candidate facts. For example, the language model score (≈ statement plausibility) of a fact like "a musician plays a musical instrument" is higher than that of "a dancer plays a musical instrument". This is evidence that, in addition to lexical and syntactic knowledge, language models capture general knowledge about the world.  

✅  They can, to some extent, associate concepts with their properties. They distinguish concepts associated with a given set of properties, i.e. complete a statement such as "_____ has fur, is big, has claws, has teeth, is an animal, ..." with bear (just like playing the "20 questions" game). They perform better when they are shown encyclopedic properties (e.g. is an animal) as opposed to perceptual properties (e.g. smooth). They can also, pretty successfully, list the properties associated with given concepts, e.g. complete the sentence "Everyone knows that a bear has _____" with fur, claws, teeth, etc. 
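To illustrate the mechanics of this kind of probing (a simplified sketch, assuming the HuggingFace transformers fill-mask pipeline and bert-base-uncased; the cited studies use more careful setups):

from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

# Concept -> properties: what does the model think a bear has?
for pred in fill("Everyone knows that a bear has [MASK].", top_k=5):
    print(pred["token_str"], round(pred["score"], 3))

# Properties -> concept: the "20 questions" direction.
for pred in fill("The [MASK] has fur, is big, and has claws.", top_k=5):
    print(pred["token_str"], round(pred["score"], 3))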

However, knowledge generated from language models is noisy! 

🚫 Several papers have shown that language models are not sensitive to negation, i.e. they consider the negated version of facts ("birds can't fly") as similarly plausible. 

🚫 They are sensitive to phrasing: rephrasing the same fact in a slightly different way can substantially change the model's score or prediction.


🚫  In distributional word vectors, the vector representing a (sub-)word is learned from the contexts in which it appeared, leading to similar representations for semantically-similar words. In language models, the representations of similar contexts are similar, so the model learns which type of word should appear next (or instead of a masked token). This is generally a positive thing, but it sometimes over-generalizes, leading to examples such as this: 


Figure 4: BERT guesses that the masked token should be a color, but fails to predict the correct color. Using the AllenNLP demo


Here, BERT has seen enough sentences of the type "The color of something is [color]" in its training corpus to learn to suggest various colors as substitutes for the masked word. Unfortunately, not every color is suitable in every context that calls for a color. BERT likely didn't see enough sentences discussing the color of a dove, so it defaults to predicting just any color.  

So knowledge in language models is not the most accurate and reliable. Is it still useful?

Yes, to some extent. One way to show it is through evaluation on tasks requiring commonsense knowledge. We will discuss several such tasks, but for now, let's focus on WinoGrande as an example. It is the large-scale version of the Winograd Schema Challenge. Given a sentence with a cloze, the goal is to fill in the blank with a previously mentioned entity or concept, out of two answer choices. For example: 

Because Brett found an internship while in college but Ian was unable to, _____ found a job less quickly after graduation. 
Choices: Brett, Ian

What makes this task especially difficult is that every instance has a twin sentence which is minimally changed such that the correct answer is the other one (for instance, replacing "less quickly" with "more quickly" will change the correct answer from Ian to Brett). 

Language model-based models top the leaderboards of WinoGrande and other commonsense tasks, but since they are trained on task-specific training data, which often contains tens or hundreds of thousands of training examples, it's hard to attribute their success to the knowledge captured in language models during pre-training. A better way to estimate it is with zero-shot (unsupervised) models. Typically, zero-shot models address multiple-choice tasks by phrasing the instance together with each answer choice as a statement, and computing the language model score as a proxy for plausibility:

P_LM("The answer is answer_1")
P_LM("The answer is answer_2")
...
P_LM("The answer is answer_k")

The model then predicts the answer choice with the best language model score (highest probability, typically computed as the lowest perplexity). 
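As a rough sketch of this zero-shot recipe (GPT-2 and the HuggingFace transformers API are used here purely for illustration; papers in this area use various models and scoring functions), the WinoGrande example above could be scored like this:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def lm_loss(text):
    # Average per-token negative log-likelihood; lower loss = lower perplexity = more plausible.
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

template = ("Because Brett found an internship while in college but Ian was "
            "unable to, {} found a job less quickly after graduation.")
choices = ["Brett", "Ian"]
losses = {c: lm_loss(template.format(c)) for c in choices}
print(min(losses, key=losses.get))  # predict the choice with the lowest loss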

In our recent EMNLP paper, we took this one step further and asked whether we can use language models to generate what would otherwise be missing or implicit knowledge needed for solving a multiple-choice commonsense question answering instance. We proposed the unsupervised "self-talk" framework, which uses language models to generate information-seeking questions such as "what is the definition of..." and their corresponding answers (clarifications) to discover additional background knowledge. In the example in Figure 5, knowing that internship experience may help a person get a job is crucial for answering the question (which of Brett and Ian found a job less quickly?). On most benchmarks, the self-talk model performed better than unsupervised models with no additional knowledge, while competing with models that have access to knowledge bases. This is despite the inaccurate and noisy knowledge language models generate. However, when we showed people some of the clarifications that helped the model choose the correct answer, they judged only 40% of them as actually providing helpful information. This discrepancy means that our model doesn't imitate the human reasoning process; it works differently. Check out our demo! It's not always accurate, but it's often funny :) 

Figure 5: An example of clarification generation for an instance from WinoGrande.
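A toy version of the self-talk idea, heavily simplified relative to the paper (which uses a curated set of question prefixes, several generation passes, and more careful scoring; the prompt wording below is invented for illustration):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def avg_loss(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return lm(ids, labels=ids).loss.item()

context = ("Because Brett found an internship while in college but Ian was "
           "unable to, _ found a job less quickly after graduation.")

# 1. Let the language model generate a clarification of relevant background knowledge.
prompt = context.replace("_", "someone") + " What is the purpose of an internship? The purpose of an internship is"
ids = tok(prompt, return_tensors="pt").input_ids
out = lm.generate(ids, max_new_tokens=15, do_sample=True, top_p=0.9,
                  pad_token_id=tok.eos_token_id)
clarification = " The purpose of an internship is" + tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)

# 2. Append the clarification to each candidate statement and pick the lower-loss one.
for choice in ["Brett", "Ian"]:
    print(choice, avg_loss(context.replace("_", choice) + clarification))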


The best performance on commonsense tasks is achieved by fine-tuning language models, i.e. training them on task-specific data. Let's look at some of the benchmarks and the issues we face with supervised learning.  

How to measure commonsense reasoning capabilities? 

Multiple commonsense benchmarks have been released over the last few years. Some of them will be discussed here (see examples in Figure 6), along with the main differences and design choices when creating a benchmark.

Figure 6: Some commonsense benchmarks along with an example instance. 


Type of knowledge: some benchmarks focus on a specific type of commonsense knowledge, such as social commonsense (e.g. Social IQa),  physical commonsense (e.g. PIQA), temporal commonsense (e.g. MC-TACO),  or causes and effects (e.g. COPA), while others target a broader domain of general commonsense knowledge and reasoning (e.g. WSC, WinoGrande, CommonsenseQA, ROCStories).  

Size: most recent datasets include a large training set, in order to facilitate training large neural models. One way to create a benchmark is to hire experts to curate a high-quality dataset, as was done for WSC and COPA. These datasets are rather expensive to collect and are therefore typically small. The common alternative is to collect data through crowdsourcing or semi-automatically, and split it randomly into train, validation, and test sets. Models that learn data-specific shortcuts from the training set instead of the underlying phenomena are likely to perform well on a test set drawn from the same distribution, but this performance is misleading and is likely much better than their performance on real-world instances of the task. Despite this understanding, this is still the dominant approach. 

Format: the vast majority of datasets are in the format of multiple-choice questions, as exemplified in Figure 6. This format is the easiest to evaluate automatically: models are judged by their accuracy, i.e. the percentage of questions they answered correctly. Unfortunately, this type of task also makes it possible for a model to guess the correct answer. We're not talking about a random guess, which would leave enough room for improvement. A random guess is expected to result in an accuracy of 100/k %, where k is the number of answer choices, e.g. 50% accuracy for binary tests, 33.3% for tests with 3 choices, 25% for 4 choices, etc. The risk is that the model makes an "educated guess" based on (yes, you guessed it correctly) spurious correlations between the questions and the correct/incorrect answers. 

How do you make sure a model is right for the right reasons?

That's the million-dollar question. We don't have a perfect solution for this problem yet. For a start, when collecting a new benchmark, the process of collecting incorrect answers (= distractors) should be well-designed such that distractors are plausible but incorrect. Using random answers as distractors (e.g. naturally-occurring sentences or correct answers of different questions) would create topically-different distractors, which are easy to detect (remember, relatedness is one of the strengths of distributional text representations). Asking people to come up with distractors may introduce other annotation artifacts, such as exaggerations, going off-topic, or producing overly emotional texts, which are easy for models to detect. Some solutions have been proposed: for example, the distractors in Social IQa are answers to different questions asked about the same context. In Figure 7, the context "Alex spilt food all over the floor and it made a huge mess." appears in the dataset with two questions: "what happens next?" and "what happened before?". The distractors of "what happens next?" are the correct answers of "what happened before?", e.g. that Alex has slippery hands. A similar approach is taken in CommonsenseQA. 

Figure 7: Creating distractors for a Social IQa instance. Image credit: Maarten Sap.

An alternative solution is to filter out easy questions through "adversarial filtering", i.e. training a weaker model and iteratively removing instances that it succeeds in answering. Variants of adversarial filtering were applied to WinoGrande and PIQA. 
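Schematically, adversarial filtering looks roughly like the function below (a generic sketch, not the exact AF/AFLite procedures used for WinoGrande and PIQA; the weak model is supplied by the caller as a pair of functions):

import random

def adversarial_filter(instances, train_fn, predict_fn, n_rounds=5, train_fraction=0.8):
    # instances: list of dicts with at least an "answer" field.
    # train_fn(train_instances) -> weak model (e.g. a linear model over simple features).
    # predict_fn(model, instance) -> predicted answer.
    kept = list(instances)
    for _ in range(n_rounds):
        random.shuffle(kept)
        split = int(train_fraction * len(kept))
        train, heldout = kept[:split], kept[split:]
        weak_model = train_fn(train)
        # Drop held-out instances the weak model already answers correctly:
        # those are likely solvable through shortcuts and spurious correlations.
        heldout = [x for x in heldout if predict_fn(weak_model, x) != x["answer"]]
        kept = train + heldout
    return kept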

Finally, I believe the future is in generative tasks, in which the model needs to produce a free-text answer without being provided with candidate answers. Several recent benchmarks are generative, such as TimeTravel (counterfactual reasoning), ART (abductive reasoning), CommonGen, and ProtoQA. The challenge in generative tasks is the lack of reliable automatic evaluation metrics. Given the gold standard reference answer(s), we would like a metric to (1) reward correct generated answers that are different from the reference answer, while (2) penalizing incorrect answers that are similar (e.g. lexically) to the reference. Human evaluation is reliable, but it is costly and is typically done only once, on the test set. In order to be able to improve models during development, we need automatic metrics. We currently settle for metrics based on lexical overlap, such as BLEU and ROUGE, which are pretty terrible at (1) and correlate poorly with human judgments, or model-based metrics such as BERTScore, which are not great at (2). 
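To see the problem in practice, one can compare a lexical-overlap metric with a model-based one on a hand-made pair of outputs (this assumes the sacrebleu and bert-score packages, which are just one possible choice; the reference and answers are invented):

import sacrebleu
from bert_score import score as bert_score

reference = "He wanted to adopt a pet."
correct_but_different = "He hoped to bring home a cat."  # should be rewarded, but overlap metrics punish it
wrong_but_similar = "He wanted to adopt a child."        # should be penalized, but looks close to the reference

for hyp in [correct_but_different, wrong_but_similar]:
    bleu = sacrebleu.sentence_bleu(hyp, [reference]).score
    _, _, f1 = bert_score([hyp], [reference], lang="en")
    print(hyp, "| BLEU:", round(bleu, 1), "| BERTScore F1:", round(f1.item(), 3))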

How to gather and represent machine-readable commonsense knowledge?

Commonsense resources provide machine-readable knowledge about the world. Resources are expected to be large-scale and accurate, consist of diverse knowledge types, and be usable in downstream tasks. ConceptNet is a large (21 million assertions), commonly-used resource consisting of general commonsense knowledge, in over 85 languages. ATOMIC consists of 880,000 triplets reasoning about causes and effects of everyday situations. Other resources are listed in Figure 8.

Figure 8: Overview of existing commonsense resources. Image credit: Maarten Sap. 


Existing resources differ in several aspects:

Representation: how is knowledge represented in the resource? ConceptNet and ATOMIC represent knowledge in natural language (Figure 9), while NELL and Cyc represent knowledge in symbolic logic:

(#$implies (#$and (#$isa ?OBJ ?SUBSET) (#$genls ?SUBSET ?SUPERSET)) (#$isa ?OBJ ?SUPERSET)) 


Figure 9: example knowledge extracted from ConceptNet and ATOMIC. Image credit: Maarten Sap. 
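For readers who don't speak CycL, the rule above says: if OBJ is an instance of SUBSET, and SUBSET is subsumed by SUPERSET, then OBJ is also an instance of SUPERSET. A tiny Python rendering of that inference step (my paraphrase, with made-up assertions) looks like this:

isa = {("Fido", "Dog")}                               # instance-of assertions
genls = {("Dog", "Mammal"), ("Mammal", "Animal")}     # class-subsumption assertions

def infer_isa(isa, genls):
    inferred = set(isa)
    changed = True
    while changed:  # keep applying the rule until nothing new can be derived
        new = {(obj, sup) for (obj, sub) in inferred for (sub2, sup) in genls if sub == sub2}
        changed = not new <= inferred
        inferred |= new
    return inferred

print(infer_isa(isa, genls))  # Fido is a Dog, a Mammal, and an Animal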


Knowledge type: ConceptNet consists of semantic knowledge, i.e. properties of concepts (e.g. reading is a type of activity). ATOMIC, on the other hand, is inferential: given a templated event with "PersonX" representing the subject and "PersonY" an optional other participant (e.g. PersonX yells at PersonY), and one of 9 pre-defined relation dimensions (e.g. PersonX's motivation), it provides a second event (e.g. PersonX wanted to express anger). 

Collection method: knowledge can be collected from humans, either experts or crowd workers. Expert-curated resources are more uniform and accurate and may use complex representations, but this collection method is expensive and very time-consuming. Alternatively, non-experts can write knowledge in natural language, making collection faster and more scalable.

The alternative approach is to extract knowledge automatically from texts, as in NELL. This approach works, but it produces less accurate knowledge. In addition, the approach suffers from reporting bias: over-representing the rare at the expense of the trivial. For example, people are reported to murder more often than they are reported to breathe. Default properties of concepts (yellow banana) are mentioned less often than their alternatives (green banana), etc. 


How to enhance neural models for commonsense reasoning tasks with symbolic knowledge?

Most models developed for solving commonsense benchmarks today are based on language models. Typically, each answer choice, along with the context, forms a statement. The language model computes a vector representing each statement. These vectors are then fed into a classifier that assigns a plausibility score for each candidate answer:


Figure 10: An illustration of using BERT to score the answer choices of a WinoGrande instance.
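One common way to implement this pattern with off-the-shelf tooling is sketched below (assuming HuggingFace's AutoModelForMultipleChoice; note that the classification head is randomly initialized until the model is fine-tuned on task data, so the prediction is meaningless before training):

import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMultipleChoice.from_pretrained("bert-base-uncased")

prompt = "Because Brett found an internship while in college but Ian was unable to,"
ending = "found a job less quickly after graduation."
choices = ["Brett " + ending, "Ian " + ending]

# Encode (context, choice) pairs and reshape to (batch=1, num_choices, seq_len).
enc = tokenizer([prompt] * len(choices), choices, return_tensors="pt", padding=True)
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}

with torch.no_grad():
    logits = model(**inputs).logits  # one plausibility score per choice
print(choices[logits.argmax(dim=-1).item()])

During fine-tuning, the index of the correct choice is passed as the label, and the encoder and scoring head are trained end-to-end.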


Static neuro-symbolic integration

The knowledge in commonsense resources may enhance models built for solving commonsense benchmarks. For example, we can extract from ConceptNet the assertions that a job is used for making money, that spending money requires making money, that buying requires spending money, and that a car is something you can buy. Ideally, we would also need the knowledge that a high-paying job is a type of job, specifically one used for making a lot of money, which is required for spending a lot of money, which is required for buying something that costs a lot of money, a car being one such thing. Finally, we may want to remove the edge from "buy" to "car" so that we can only get to "car" from the node "buy something that costs a lot of money". 


Figure 12: Knowledge extracted from ConceptNet for the WinoGrande instance discussed above.


How do we incorporate knowledge from knowledge resources into a neural model?

The simple recipe (success not guaranteed) calls for 4 ingredients: the task addressed, the knowledge resource used, the neural component, and the combination method. We have already discussed tasks and knowledge resources, so I will only add here that ConceptNet is the main resource utilized by downstream models, although some models incorporate other knowledge sources, such as other knowledge bases (WordNet, ATOMIC), knowledge mined from text, and tools (knowledge base embeddings, sentiment analysis models, COMET, discussed below). 


Figure 13: Resources used by most knowledge-informed commonsense models.

The neural component is whatever the shiny new neural architecture happens to be: language models in the last 3 years, biLSTMs in the years before that, etc. The more interesting component is the combination method. We will look at 3 examples:

Incorporating into the scoring function: Lin et al. (2017) extracted probabilistic "rules" connecting pairs of terms from multiple sources such as WordNet (restaurant→eatery: 1.0), Wikipedia categories (restaurant→business: 1.0), script knowledge mined from text (X went to a restaurant→X ate: 0.32), word embedding-based relatedness scores (restaurant→food: 0.71), and more. The model scores each candidate answer according to the scores of the inference rules used to get from the context (e.g. "Mary walked to a restaurant" in Figure 14) to the candidate answer (e.g. "She ordered foods.").  


Figure 14: "covering" each candidate answer by the original context and the rules extracted from various sources. Image credit: Lin et al. (2017).


Representing symbolic knowledge as vectors: Lin et al. (2019) used BERT as the neural component to represent the instance (statement vector). For their symbolic component, they extracted subgraphs from ConceptNet pertaining to concepts mentioned in the instance and learned to represent them as a vector (graph vector). These two vectors were provided as input to the answer scorer which was trained to predict the correct answer choice. 

Figure 15: extracting subgraphs from ConceptNet pertaining to concepts mentioned in the instance. Image credit: Lin et al. (2019).

Multi-task learning: Xia et al. (2019) fine-tuned a BERT model to solve the multiple-choice questions. They also trained it on two auxiliary tasks supervised by ConceptNet: given two concepts as input, the classifiers had to predict whether the concepts are related, and which specific ConceptNet property connects them. The BERT model was shared between the main and the auxiliary tasks, so that commonsense knowledge from ConceptNet was instilled into BERT, improving its performance on the main task.


Figure 16: multi-task learning aimed at instilling knowledge from ConceptNet into BERT.
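A schematic PyTorch sketch of the shared-encoder idea (not Xia et al.'s exact architecture; the pooling, heads, and the number of ConceptNet relations below are illustrative):

import torch.nn as nn
from transformers import AutoModel

class SharedEncoderModel(nn.Module):
    def __init__(self, model_name="bert-base-uncased", n_relations=36):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)   # shared BERT
        hidden = self.encoder.config.hidden_size
        self.choice_scorer = nn.Linear(hidden, 1)              # main task: score a (context, choice) statement
        self.related_clf = nn.Linear(hidden, 2)                # auxiliary 1: are two concepts related?
        self.relation_clf = nn.Linear(hidden, n_relations)     # auxiliary 2: which ConceptNet relation?

    def forward(self, input_ids, attention_mask, task):
        pooled = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        if task == "main":
            return self.choice_scorer(pooled)
        if task == "related":
            return self.related_clf(pooled)
        return self.relation_clf(pooled)

During training, batches from the benchmark and from ConceptNet-derived concept pairs are interleaved, and the losses from all tasks update the shared encoder, which is how the ConceptNet signal reaches the main task.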

Dynamic neuro-symbolic integration

There are two main limitations to the neuro-symbolic integration discussed above:
  1. Coverage: relevant knowledge is often not found as-is in commonsense knowledge resources. As we've seen earlier, commonsense knowledge is immeasurably vast, so much of it is not documented. 

  2. Precision and context: knowledge found in the knowledge base about concept X doesn't necessarily apply to all contexts in which X appears. For example, when provided with "PersonX adopts a cat", ATOMIC says that PersonX had to go to the shelter first (Figure 17), but that's not always the case. It may well be that PersonX adopted a cat they found on the street or got the cat from a friend who was no longer able to care for it. 

Figure 17: ATOMIC inferences for the event "PersonX adopted a cat".


How do we provide machines with large-scale, contextualized commonsense knowledge?

The solution is to leverage manually curated commonsense knowledge resources, such as ConceptNet and ATOMIC, to train a model that can dynamically produce such knowledge for a given context. Commonsense knowledge resources are typically sparse, which makes training a knowledge base completion model to extend the resource less effective. Pre-trained language models and their inherent knowledge come in handy here. Language models (such as GPT) implicitly represent knowledge, so they can be re-trained to complete knowledge base assertions (e.g. from ATOMIC) and thereby learn the structure of the knowledge. This is what COMET (COMmonsEnse Transformers) does, as illustrated in Figure 18. 


Figure 18: Illustration of the training process of COMET: The language model is fine-tuned to predict the "tail entity" (e.g. inference in ATOMIC) given the "head entity" and the relation. Image credit: Antoine Bosselut.
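The core trick is mostly data formatting: knowledge base triplets are serialized into text, and the language model is fine-tuned with its usual next-token objective to produce the tail. A minimal sketch of that formatting (the relation names are real ATOMIC dimensions, but the tails, serialization, and special tokens here are simplified stand-ins for what the actual COMET code uses):

# Head event, ATOMIC relation, tail inference.
triples = [
    ("PersonX adopts a cat", "xNeed", "to go to the shelter"),
    ("PersonX yells at PersonY", "xIntent", "to express anger"),
]

def to_training_text(head, relation, tail, eos="<|endoftext|>"):
    # Fine-tuning a GPT-style model on sequences like this teaches it to
    # generate the tail when prompted with "head relation" at inference time.
    return f"{head} {relation} {tail} {eos}"

for head, relation, tail in triples:
    print(to_training_text(head, relation, tail))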


COMET is capable of dynamically generating inferences for any context. For example, if we modify the context from ATOMIC to "David adopted his sister's cat because they found out her husband was allergic.", which for obvious reasons does not appear in ATOMIC, COMET no longer predicts that PersonX (David) had to go to the shelter, but instead that he, for example, needed to find out about it.

COMET has been used successfully in various downstream tasks requiring commonsense knowledge. Models trained on ATOMIC or on ConceptNet are available, and the demo for both ATOMIC and COMET can be found here. There is also a Visual COMET that can generate inferences from images. 

Summary

We talked about ways to acquire and represent commonsense knowledge in machine-readable format, ways to measure commonsense reasoning abilities, and ways to integrate this kind of knowledge into models. None of these is solved yet. Manually collecting all the commonsense knowledge is infeasible, while extracting it from texts or from language models suffers from inaccuracies, reporting bias, and societal biases. Looking forward, a promising research direction is multi-modal commonsense knowledge acquisition, e.g. learning from texts along with images and videos. For example, looking through enough class photos, you might learn that the kids in the front row typically sit (especially if the kids in the last row are also seated). 


Machines may reach human performance on commonsense benchmarks, but it's often due to being right for the wrong reasons rather than actually possessing and successfully applying commonsense knowledge and reasoning abilities. Generative tasks are somewhat less prone to this issue, but we would have to develop reliable automatic evaluation metrics to make them the standard. 

Machine commonsense reasoning is becoming more and more popular within NLP, so I am optimistic about future breakthroughs!