
Wednesday, December 29, 2021

The Human Space Beyond Language

Much of what I write about in this blog concerns language technology and machine translation, with a primary focus on the technology and AI initiatives related to human language translation. That will remain the focus, but I recently came upon something I felt was worth mentioning, especially in this holiday season, when many of us review, consider, and express gratitude for the plenitude in our lives.

Language is a quintessentially human experience through which we share, discover, learn, and express the many different facets of our lives. This is probably why computers are unlikely ever to unravel it fully: there is too much amorphous but critical context about life, living, learning, and the world surrounding most words to capture easily in training data and hand to a computer to learn.

While many of us assume that language is only about words and how words can be strung together to share, express, and understand the world around us, in most cases there is much that is unspoken or not directly referenced that must also be considered if any set of words is to be understood accurately and faithfully. Sometimes the feeling and emotion are enough, and words are not needed.

In 2021, Large Language Models (LLMs) were a big deal, and GPT-3 in particular was all over the news as a symbol of breakthrough AI that, to some, suggests a sentient machine is close at hand. That is, until you look more closely and see that much of what LLMs produce is crude pattern reflection, completely devoid of understanding, comprehension, or cognition in any meaningful sense. The initial enthusiasm for GPT-3 has been followed by increasing concern as people have realized how prone these systems are to producing unpredictable obscenity, prejudiced remarks, misinformation, and so forth. The toxicity and bias inherent in these systems will not be easily overcome without strategies that involve more than more data and more compute.

It is very likely that we will see these increasingly larger LLMs go through the same cycles of over-promising and under-delivering that machine translation has gone through for over 70 years now. 

The problem is the same, the words used to train AI alone do not contain everything needed to establish understanding, comprehension, and cognition. And IMO simply training a deep learning algorithm with many more trillions of words will not somehow create understanding and cognition or even common sense.  

The inability of AI "to understand" was clearly shown by Amazon Alexa recently when it told a child to essentially electrocute herself. "No current AI is remotely close to understanding the everyday physical or psychological world, what we have now is an approximation to intelligence, not the real thing, and as such it will never really be trustworthy," said Gary Marcus in response to the incident. In experiments conducted elsewhere, GPT-3 has also advised suicidal humans to kill themselves.

The machine is not malicious, it simply has no real understanding of the world and life, and lacks common sense. 

The truth is that we are forced to learn to query and instruct Alexa, Siri, and Google Voice so that they can do simple but useful tasks for us. This is "AI" where the human in the loop keeps it basically functional and useful. Expecting any real understanding and comprehension from these systems without many explicit and repeated clarifications is simply not possible in 2021. 

But I digress. I wanted to talk about the areas where humans move beyond language (as in words) yet still communicate, share, and express quintessential humanness in the process.

It is my feeling that entering this space happens most often with music, especially improvised music, where there is some uncertainty or unpredictability about the outcome. Where what happens, happens, often without a plan, yet still within a clear artistic framework and structural outline. I happen to play the sitar, focusing on the classical music of North India, where the "Raga" is the basic blueprint that provides the foundation for highly disciplined improvisatory exploration.

To a great extent what these musicians do is "shape the air" and create something equivalent to sonic sculptures. These sculptures can be pleasing or relaxing in many ways that only humans can understand, and sometimes can be very moving, which means they can trigger emotional release (tears) or establish a deeply emotional presence (left speechless). Often it is not necessary to understand the actual language used in musical performance since there is still a common layer of feeling, emotion, and yearning that all humans can connect and tap into. 

The key difference between this improvisation-heavy approach and a performance of score-based music is that neither the musician nor the audience really knows at the outset how things will turn out. With a score, there is a known and well-defined musical product that both the audience and the musician are aware of and expect; there is more of an elemental structure. Even so, it is possible for an attendee to listen to an unfamiliar language, e.g. an operatic aria in Italian, and be deeply moved, even though the audience member speaks no Italian and may have no knowledge of the operatic drama. The connection is made at a feeling and emotional level, not at the level of word, language, or idea cognition.

I came upon this musical performance of a Sufi (a mystical Muslim tradition) song sung by two musical legends on a commercial platform called Coke Studio Pakistan. Musically, this might be considered "fusion," but it is heavily influenced by Indian classical music and is sung in Urdu, which is so close to Hindi (Hindustani) that they are virtually the same language, except that Urdu uses much more Persian vocabulary. The original poem was written in Braj Bhasha, an antecedent of both Urdu and modern-day Hindi.

This particular performance was a rehearsal and was the first time all the musicians were in the same room, but the producers decided it was not possible to improve on this and published it, as is, since it was quite magical and probably impossible to reproduce. 

There are almost 20,000 comments on the video shown below, and this comment by Matt Dinopoulos typifies much of the feedback: "It hit my soul on so many levels and just brought me to tears and I don’t even know what they’re saying." The figures of speech “pluck at one’s heartstrings” and “strikes a chord in me” have found a home in our language for just this reason.


The song/poem Chaap Tilak was written by Amir Khusrau, the famous poet laureate of the Indian subcontinent, considered one of the most versatile poets and prolific prose-writers of the 13th and 14th centuries. He is also considered by some to be a seminal force in the creation of the sitar and in the development of the Khyal form most prevalent in North Indian classical music today.

This song essentially expresses Khusrau's gratitude, devotion, love, and longing for communion with his Pir (Guru/Spiritual teacher) whose name is Nizam (Nizamuddin Auliya). Sung from the perspective of a young girl awaiting or yearning for her beloved, it is replete with modest yet enchanting symbols, as it celebrates the splendor of losing oneself in love. Both the use of motifs as well as the language itself were deliberate creative choices by Amir Khusrau, to communicate with common people using familiar ideas and aesthetics.

A closer examination of Khusrau's works will reveal that the Beloved in his songs/poems is always the Divine or the Pir. Many poets in India use the perspective of the romantic yearnings of a young maiden for the beloved as an analogy, as the relationship with the Divine is seen as the most intense kind of love. The longing and union they speak of are always about direct contact with the Sacred and so this song should be considered a spiritual lament whose essential intention is to express spiritual love and gratitude. The translations shown in the video are sporadic but still useful.

Chaap Tilak Performance 

At the time of publishing, the video above already had 40 million views. Many thanks to the eminent Raymond Doctor for providing this link, which offers a full translation and useful background for better understanding the thematic influences and artistic inspiration behind this song.


“As long as a spiritual artist respects his craft, peace will prevail. It is wonderful when a singer has a noble cause and spreads the message of love, peace, and brotherhood as presented by our saints, without greed of money or the world. This is the real purpose of qawwali.”

— Rahat Fateh Ali Khan

“Music doesn’t have a language, it’s about the feeling. You have to put a lot of soul into whatever you are making. Music doesn’t work if you’re only doing it for money or professionally. It works only if it’s from the soul. There’s no price to it.”

— Aima Baig

"Information is not knowledge. Knowledge is not wisdom. Wisdom is not truth. Truth is not beauty. Beauty is not love. Love is not music. Music is THE BEST.” 

— Frank Zappa 


I played this song for several friends (mostly musicians) who had no familiarity with Indian music and found that several of them were deeply touched, so much so that some had tears streaming and were unable to speak when the song ended. In fact, they were mystified by how strong an emotional reaction they had to this unfamiliar and alien artistic expression. 

This unexpected, often surprising, emotion-heavy reaction is entirely and uniquely human. This kind of listener impact cannot come from musical virtuosity alone, though virtuosity is abundantly present here; the musicians are also tapping into a deeper substratum of feeling and emotion that exists only in, and is shared only by, humans.

This is the human space beyond language where understanding happens in spite of initial unfamiliarity. There is something in the human psyche that understands and connects to this even if by accidental discovery, and this initial response often leads to a more substantial connection. We could call this learning perhaps, and this is probably how children also gather knowledge about the world. Intensity and connection probably have a more profound impact on students than pedagogy and quite possibly drive intense learning activity in any sphere.

It is interesting that there are many reaction videos on YouTube where music teachers and YT celebrities from around the world share their first reactions to this particular song and other culturally unfamiliar music. Judging by the number of these reaction videos, more and more people are exploring and want to share in the larger human musical experience. Some examples:

  • Latina Ceci Dover left speechless (around 4' 50")
  • British rapper reacts in shock and awe (around 6' 05")
  • A deep and informed analysis of the singing technique and mechanics. "If her voice was an animal it would be an eagle."
  • Seda Nur Turkish German was surprised by the emotional connection. (around 3' 15") It also led her to actually visit Pakistan last week, a trip which she is also sharing in her Vlogs.
  • John Cameron left speechless and in tears (around 11' 30")
  • Waleska & Efra discover a new musical paradigm (around 10' 40")
  • Asian dude is blown away (~2' 09"): "Oh My Goodness," he exclaims at 4' 0", and dances to the chorus like a bird (4' 25"). Hilarious responses throughout the song.
It is not always necessary to have this level of virtuosity to find this substratum, this space beyond language. Artists often communicate without words, with just a look, or with presence and full attention. I too have participated in impromptu musical conversations, where friends simply gather to converse musically from a very simple outline, relying primarily on improvisation and on listening to one another. This is an example that is difficult to reproduce exactly, because it captured a unique instant in time.

All this is to point out that the data we use to build artificial intelligence completely misses these deeper layers of humanness. It is not just a matter of missing subject-matter breadth, common sense, and physical-world context, but especially of missing the non-verbal emotional and feeling layers that also make us human.

Intelligence is barely understood by humans even after we have pondered the question for eons, so how could it possibly be put into a computer algorithm? What kind of data would we use? How do you model emotion and feeling with knowledge, data, and information?

Machine learning and computers are likely to radically transform our lives in the coming decade, but there are some things, like the wordless, feeling-filled states of the human space beyond language and the spiritual substratum that underlies consciousness, that I think are simply not within the province of Man to model, or perhaps even to understand. They can only be experienced.

Peace.

The poem below was copied from Maria Popova's excellent blog about the awe and wonder of being human. Her backgrounder on Rebecca Elson is timely and worth reading, and she has also written about the connection between music and the neurophysiological mechanism of emotion, which I also recommend. As she points out: "Emotions possess the evanescence of a musical note."

FUTURA VECCHIA, NEW YEAR’S EVE

by Rebecca Elson

Returning, like the Earth

To the same point in space,

We go softly to the comfort of destruction,

And consume in flames

A school of fish,

A pair of hens,

A mountain poplar with its moss.

A shiver of sparks sweeps round

The dark shoulder of the Earth,

Frisson of recognition,

Preparation for another voyage,

And our own gentle bubbles

Float curious and mute

Towards the black lake

Boiling with light,

Towards the sharp night

Whistling with sound.





I wish you all a Happy, Healthy, Peaceful, and Prosperous New Year

Tuesday, January 19, 2021

Adding Commonsense Reasoning to Natural Language Processing Applications

This article is reprinted with permission from the original poster @VeredShwartz. This post might be challenging reading for the usual reader of this blog, but I think even skimming it will give many a sense of possibly the most formidable challenge in the artificial intelligence community: building commonsense capabilities into existing and emerging AI deployments.

Commonsense knowledge consists of facts about the everyday world that all humans are expected to know. It helps us solve problems in the face of incomplete information. Machine common sense is currently considered an unsolved problem in AGI and is a focus of the Allen Institute for Artificial Intelligence, with which the author is associated.

Deep learning is self-education for machines; you feed a machine learning system huge amounts of data, and eventually it begins to discern patterns all by itself. But despite their remarkable achievements, and occasional ability to produce human-like outputs, machine learning algorithms are at their core complex mathematical functions that map observations to outcomes, or that forecast patterns they have previously seen and explicitly learned. They are therefore only as good as their data, and they start to break as the data they face in the world deviates from the examples seen during training. Neural MT is an example: great progress indeed, but far from having solved the translation problem.

We hear continuously about the relentless "big data" driving AI progress, but we keep finding cases where the current approach of deep learning plus more data is not enough. The path to machine commonsense is unlikely to be brute-force training of larger neural networks with deeper layers on more data. While deep learning excels at pattern recognition, it is very poor at adapting to changing situations, even when only small modifications of the original case are encountered, and it often has to be re-trained from scratch with large amounts of data.

"The great irony of common sense—and indeed AI itself—is that it is stuff that pretty much everybody knows, yet nobody seems to know what exactly it is or how to build machines that possess it," said Gary Marcus, CEO and founder of Robust.AI. "Solving this problem is, we would argue, the single most important step towards taking AI to the next level. Common sense is a critical component to building AIs that can understand what they read; that can control robots that can operate usefully and safely in the human environment; that can interact with human users in reasonable ways. Common sense is not just the hardest problem for AI; in the long run, it's also the most important problem." 

Common sense has been called the “dark matter of AI” — both essential and frustratingly elusive. That’s because common sense consists of implicit information — the broad (and broadly shared) set of unwritten assumptions and rules of thumb that humans automatically use to make sense of the world. Critics of over-exuberant AI claims frequently point out that two-year-old children have more common sense than existing deep-learning-based AI systems, whose "understanding" is often quite brittle and easily derailed.

Common sense is easier to detect than to define. The implicit nature of most common-sense knowledge makes it difficult and tedious to represent explicitly. 

DARPA, the US defense department’s research agency, has also recognized the absence of common sense as an important issue. It recently launched a project called Machine Common Sense. As they say: “The absence of common sense prevents intelligent systems from understanding their world, behaving reasonably in unforeseen situations, communicating naturally with people, and learning from new experiences. Its absence is considered the most significant barrier between the narrowly focused AI applications of today and the more general, human-like AI systems hoped for in the future.”

Gary Marcus suggests combining traditional AI approaches together with deep learning as a way forward. 

"First, classical AI actually IS a framework for building cognitive models of the world that you can then make inferences over. The second thing is, classical AI is perfectly comfortable with rules. It’s a strange sociology right now in deep learning where people want to avoid rules. They want to do everything with neural networks, and do nothing with anything that looks like classical programming. But there are problems that are routinely solved this way that nobody pays attention to, like making your route on Google maps.

We actually need both approaches. The machine-learning stuff is pretty good at learning from data, but it’s very poor at representing the kind of abstraction that computer programs represent. Classical AI is pretty good at abstraction, but it all has to be hand-coded, and there is too much knowledge in the world to manually input everything. So it seems evident that what we want is some kind of synthesis that blends these approaches."


Yejin Choi and her collaborators at the Allen Institute have united traditional symbolic AI approaches with newer machine learning approaches in an attempt to address the commonsense challenge. One initiative, COMET (short for “commonsense transformers”), extends traditional symbolic reasoning with the latest advances in neural language modeling — a kind of deep learning that aims to imbue computers with a statistical “understanding” of written language. COMET is a fusion of symbolic reasoning with a neural network and tries to solve the coverage and brittleness problems of purely DL-based approaches at the same time. COMET works by reimagining commonsense reasoning as a process of generating plausible (if imperfect) responses to novel input, rather than making airtight deductions by consulting a vast encyclopedia-like database.

Gary Marcus, a critic of the deep-learning fanboys and girls, often points out DL-only shortcomings to challenge the over-exuberance of these fans. To put progress in AI into a more realistic context he says: “Just because you can build a better ladder doesn’t mean you can build a ladder to the moon.” To him and others, COMET’s approach suffers from a fundamental limitation of deep learning: “statistics ≠ understanding.”

Regardless, Vered presents a comprehensive picture of the many challenges faced, and of the attempts at developing solutions for introducing commonsense to NLP applications, arguably one of the most challenging problems in computing today. I think her post is a great resource for anybody who wants to quickly get a sense of the issue and the SOTA.



****** 

Commonsense Reasoning for Natural Language Processing

This long-overdue blog post is based on the Commonsense Tutorial taught by Maarten Sap, Antoine Bosselut, Yejin Choi, Dan Roth, and myself at ACL 2020. Credit for much of the content goes to the co-instructors, but any errors are mine. 

In the last 5 years, popular media has made it seem that AI is nearly---if not already---solved by deep learning, with reports on super-human performance on speech recognition, image captioning, and object recognition. The release of Google Translate’s neural models in 2016 reported large performance improvements: “60% reduction in translation errors on several popular language pairs”. But looking under the hood, these numbers seem to be misleading. Neural models find shortcuts to the correct answers through dataset-specific input-output correlations, essentially solving the dataset but not the underlying task. When models are challenged with adversarial out-of-domain examples, they perform poorly. Small unnoticeable noise added to images confuses object recognition models and changes their predictions. Visual question answering models guess the answer based on the frequency of answers for the same type of question in the training set, e.g. replying "2" to any "how many" question. Image captioning models often learn to recognize objects based solely on their typical environment and fail to recognize them outside their typical environment. In NLP, dialogue systems generate highly generic responses such as “I don’t know” even for simple questions. Open-ended generation is prone to repetition. Question answering systems are easily distracted by the addition of an unrelated sentence to the passage. And more. 

Figure 1: adversarial examples in computer vision (left) and natural language processing tasks (right).

Machine learning models today perform reasonably well on perception tasks (image and speech recognition). However, they mostly lack the ability to perform the simple intuitive commonsense inferences that humans make in every minute of their waking hours, regarding pre- and post-conditions of events, other people's motivations and intents, mental and emotional states, etc.

Table of contents: 

  1. What is commonsense? 
  2. Is commonsense knowledge already captured by pre-trained language models? 
  3. How to create benchmarks to measure commonsense reasoning capabilities? 
  4. How to gather and represent machine-readable commonsense knowledge? 
  5. How to enhance neural models for commonsense reasoning tasks with symbolic knowledge? 
  6. Summary

What is commonsense? 

The boundaries of commonsense are quite challenging to define, but we will go with this working definition:

Commonsense is the basic level of practical knowledge and reasoning concerning everyday situations and events that are commonly shared among most people. 

For example, it's common sense that it's OK to keep the closet door open, but not the fridge door, as the food inside might go bad. 

Types of commonsense: 

Commonsense knowledge can be categorized according to types, including but not limited to:
  • Social commonsense: people are capable of making inferences about other people's mental states, e.g. what motivates them, what they are likely to do next, etc. This kind of inference is captured by the ATOMIC knowledge base discussed later. In addition, we each have a set of social norms of accepted behavior, e.g. knowing that “it's impolite to comment on someone's weight”. While these are often implicit in our actions and decisions, machines need to be taught them explicitly

  • Temporal commonsense: natural language rarely communicates explicit temporal information. Instead, it's vague and relies on the commonsense knowledge of the listener. For example, when told that "Dr. Porter is taking a vacation" we can predict that Dr. Porter will not be able to see us soon, as opposed to when "Dr. Porter is taking a walk". This requires knowing the typical duration of "taking a walk" (minutes) and that of "taking a vacation" (days). Other temporal knowledge is typical times, order, frequency, etc. of events which are addressed by the MC-TACO dataset and the TACO-LM time-aware contextual language model. 

  • Physical commonsense: a glass will likely shatter if it falls to the floor, which is a fact most people (and arguably cats) know. Physical commonsense includes knowledge about the physical properties and affordances of everyday objects, as tested in the PIQA dataset.

Commonsense is essential for humans to navigate everyday situations seamlessly and interact with each other in a reasonable and safe way, and for AI to understand human needs and actions better. Yet, endowing machines with such human-like commonsense reasoning capabilities has remained an elusive goal of AI research for decades. Past attempts, in the 1960s and 1970s, resulted in an AI winter, i.e. reduced interest and funding for AI research due to failed over-hyped research directions. In recent years, a new interest in machine commonsense has emerged, with the availability of stronger computing power and huge amounts of data. With that said, the path to machine commonsense is unlikely to be brute-force training of larger neural networks with deeper layers.

Is commonsense knowledge already captured by pre-trained language models?

In the last 3 years, language models have been ubiquitous in NLP. Language models are pre-trained once, in a self-supervised manner that requires only a large text corpus. Traditionally, language models are trained to predict the next word in a sentence (top part of Figure 2, in blue), but they can also predict hidden (masked) words in the middle of the sentence, as in Google's BERT model (top part of Figure 2, in orange). This pre-training phase yields a function that gets a sequence of words (sentence, short paragraph) and returns a vector for each word in the sequence. 
  

Figure 2: Language models pre-training and fine-tuning.
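To make the masked-word objective concrete, here is a minimal sketch (mine, not from the tutorial) of querying a pre-trained BERT model for masked-token predictions with the Hugging Face transformers library; the model name and example sentence are illustrative choices:

```python
# A sketch of masked-token prediction with a pre-trained BERT model.
# Requires: pip install transformers torch
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The pipeline returns the most likely substitutions for [MASK],
# ranked by the language model's probability.
for prediction in fill_mask("The doctor told me to take the [MASK] twice a day."):
    print(f"{prediction['token_str']:>12}  {prediction['score']:.3f}")
```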


As opposed to word embeddings which are static, language model-based word vectors are dynamic and re-computed for each context. At the very basic level, they assign different vectors to words when they are used in different senses, as in Figure 3. 


Figure 3: Static vs. dynamic word representations.


Do off-the-shelf pre-trained language models already capture commonsense knowledge? 

✅  They are capable, to some extent, of filling in incomplete commonsense facts or ranking candidate facts. For example, the language model score (≈ statement plausibility) of a fact like "a musician plays a musical instrument" is higher than that of "a dancer plays a musical instrument". This is evidence that, in addition to lexical and syntactic knowledge, language models capture general knowledge about the world.  

✅  They can, to some extent, associate concepts with their properties. They distinguish concepts associated with a given set of properties, i.e. complete a statement such as "       has fur, is big, and has claws, has teeth, is an animal, ..." with bear (just like playing the "20 questions" game). They perform better when they are shown encyclopedic properties (e.g. is an animal) as opposed to perceptual properties (e.g. smooth). They can also, pretty successfully, list the properties associated with given concepts, e.g. complete the sentence "Everyone knows that a bear has       " with fur, claws, teeth, etc. 
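To make the "language model score as plausibility" idea concrete, here is a hedged sketch, assuming the Hugging Face transformers library and GPT-2 as a stand-in scorer (the works surveyed in the tutorial use various models); the sentences are the musician/dancer example from above:

```python
# Scoring statement plausibility with a language model: lower average
# loss (negative log-likelihood) = more plausible according to the model.
# Requires: pip install transformers torch
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def avg_neg_log_likelihood(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids the model returns the mean token cross-entropy.
        return model(ids, labels=ids).loss.item()

for s in ["A musician plays a musical instrument.",
          "A dancer plays a musical instrument."]:
    print(f"{avg_neg_log_likelihood(s):.3f}  {s}")
```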

However, knowledge generated from language models is noisy! 

🚫 Several papers have shown that language models are not sensitive to negation, i.e. they consider the negated version of facts ("birds can't fly") as similarly plausible. 

🚫 They are sensitive to phrasing: rephrasing the same query can change the prediction.


🚫  In distributional word vectors, the vector representing a (sub-)word is learned from the contexts in which it appears, leading to similar representations for semantically-similar words. In language models, the representations of similar contexts are similar, so the model learns which type of word should appear next (or instead of a masked token). This is generally a positive thing, but it sometimes over-generalizes, leading to examples such as this: 


Figure 4: BERT guesses that the masked token should be a color, but fails to predict the correct color. Using the AllenNLP demo


Here, BERT has seen in its training corpus enough sentences of the type "The color of something is [color]" to know to suggest different colors as substitutes for the masked word. Unfortunately, not every color is suitable in every context that calls for a color. BERT likely didn't see enough sentences discussing the color of a dove, thus it defaults to just predicting any color.  

So knowledge in language models is not the most accurate and reliable. Is it still useful?

Yes, to some extent. One way to show it is through evaluation on tasks requiring commonsense knowledge. We will discuss several such tasks, but for now, let's focus on WinoGrande as an example. It is the large-scale version of the Winograd Schema Challenge. Given a sentence with a cloze, the goal is to fill in the blank with a previously mentioned entity or concept, out of two answer choices. For example: 

Because Brett found an internship while in college but Ian was unable to, _____ found a job less quickly after graduation. 
Choices: Brett, Ian

What makes this task especially difficult is that every instance has a twin sentence which is minimally changed such that the correct answer is the other one (for instance, replacing "less quickly" with "more quickly" will change the correct answer from Ian to Brett). 

Language model-based models top the leaderboards of WinoGrande and other commonsense tasks, but since they are trained on task-specific training data, which often contains tens or hundreds of thousands of training examples, it's hard to attribute the success to the knowledge captured in language models from the pre-training step. A better way to estimate it is with zero-shot (unsupervised) models. Typically, the way zero-shot models address multiple-choice tasks is by phrasing a statement from the instance and each answer choice, and computing the language model score as a proxy for plausibility:

P_LM(The answer is answer_1)
P_LM(The answer is answer_2)
...
P_LM(The answer is answer_k)

The model then predicts the answer choice with the best language model score (highest probability, usually computed as the lowest perplexity). 
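A minimal sketch of this zero-shot recipe, again assuming GPT-2 via Hugging Face transformers as the scorer (my illustration, not the tutorial's code): substitute each candidate into the blank, score the resulting statement, and pick the one with the lowest loss, i.e. lowest perplexity.

```python
# Zero-shot multiple choice: fill the blank with each candidate, score the
# statement with a language model, and predict the lowest-loss candidate.
# Requires: pip install transformers torch
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def lm_loss(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

context = ("Because Brett found an internship while in college but Ian was "
           "unable to, _ found a job less quickly after graduation.")
choices = ["Brett", "Ian"]

scores = {c: lm_loss(context.replace("_", c)) for c in choices}
prediction = min(scores, key=scores.get)   # lowest loss = lowest perplexity
print(scores, "->", prediction)
```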

In our recent EMNLP paper, we took it one step further and asked whether we can use language models to generate what would otherwise be missing or implicit knowledge needed for solving a multiple-choice commonsense question answering instance. We proposed the unsupervised "self-talk" framework, that uses language models to generate information-seeking questions such as "what is the definition of..." and their corresponding answers (clarifications) to discover additional background knowledge. In the example in Figure 5, knowing that internship experience may help a person get a job is crucial for answering the question (which of Brett and Ian found a job less quickly?). On most benchmarks, the self-talk model performed better than unsupervised models with no additional knowledge, while competing with models that have access to knowledge bases. This is despite the inaccurate and noisy knowledge language models generate. However, when we showed people some of the clarifications that helped the model choose the correct answer choice, they judged only 40% of them as actually providing helpful information. This discrepancy means that our model doesn't imitate the human reasoning process - it works differently. Check out our demo! It's not always accurate but it's often funny :) 

Figure 5: An example of clarification generation for an instance from WinoGrande.


The best performance on commonsense tasks is achieved by fine-tuning language models, i.e. training them on task-specific data. Let's look at some of the benchmarks and the issues we face with supervised learning.  

How to measure commonsense reasoning capabilities? 

Multiple commonsense benchmarks have been released over the last few years. Some of them will be discussed here (see examples in Figure 6), along with the main differences and design choices when creating a benchmark.

Figure 6: Some commonsense benchmarks along with an example instance. 


Type of knowledge: some benchmarks focus on a specific type of commonsense knowledge, such as social commonsense (e.g. Social IQa),  physical commonsense (e.g. PIQA), temporal commonsense (e.g. MC-TACO),  or causes and effects (e.g. COPA), while others target a broader domain of general commonsense knowledge and reasoning (e.g. WSC, WinoGrande, CommonsenseQA, ROCStories).  

Size: most recent datasets include a large training set, in order to facilitate training large neural models. One way to create a benchmark is to hire experts to curate a high-quality dataset such as for WSC and COPA. These datasets are rather expensive to collect and are therefore typically small. The common alternative is to collect data through crowdsourcing or semi-automatically, and split it randomly to train, validation, and test sets. Models that learned data-specific shortcuts in the training set instead of generalized phenomena are likely to perform well on a test set drawn from the same distribution, but this performance is misleading and is likely a lot better than on real-world instances of the task.  Despite this understanding, this is still the dominant approach. 

Format: the vast majority of datasets are in the format of multiple-choice questions, as exemplified in Figure 6. This format is the easiest to evaluate automatically: models are judged by their accuracy, i.e. what percent of the questions they answered correctly. Unfortunately, this type of task also makes it possible for a model to guess the correct answer. We're not talking about a random guess, which would leave enough room for improvement: a random guess is expected to result in an accuracy of 100/k %, where k is the number of answer choices, e.g. 50% accuracy for binary tests, 33.3% for tests with 3 choices, 25% for 4 choices, etc. The risk is that the model makes an "educated guess" based on - yes, you guessed it correctly - spurious correlations between the questions and the correct/incorrect answers. 

How do you make sure a model is right for the right reasons?

That's the million-dollar question. We don't have a perfect solution for this problem yet. For a start, when collecting a new benchmark, the process of collecting incorrect answers (=distractors) should be well-designed such that distractors are plausible but unlikely. Using random answers as distractors (e.g. naturally-occurring sentences or correct answers of different questions) would create topically-different distractors, which are easy to detect (remember, relatedness is one of the strengths of distributional text representations). Asking people to come up with the distractors may introduce other annotation artifacts, such as exaggerations, going off-topic, or producing overly emotional texts, which are easy for models to detect. Some solutions have been proposed: for example, the distractors in Social IQa are answers for different questions asked on the same context. In Figure 7, the context "Alex spilt food all over the floor and it made a huge mess." appears in the dataset with two questions: "what happens next?" and "what happened before?". The distractors of "what happens next?" are the correct answers of "what happened before?", e.g. that Alex has slippery hands. A similar approach is taken in CommonsenseQA. 

Figure 7: Creating distractors for a Social IQa instance. Image credit: Maarten Sap.

An alternative solution is to filter out easy questions through "adversarial filtering", i.e. training a weaker model and iteratively removing instances that it succeeds in answering. Variants of adversarial filtering were applied to WinoGrande and PIQA. 
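The following toy sketch illustrates the adversarial-filtering idea on synthetic data; it is my simplification, not the actual AFLite procedure used for WinoGrande and PIQA, and the features, threshold, and number of rounds are arbitrary:

```python
# Toy adversarial filtering: repeatedly train a weak model on part of the
# data and discard instances it finds too easy (i.e. likely artifact-driven).
# Requires: pip install numpy scikit-learn
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def adversarial_filter(X, y, rounds=3, easy_threshold=0.9):
    keep = np.arange(len(y))
    for _ in range(rounds):
        tr, te = train_test_split(keep, test_size=0.5, random_state=0)
        weak = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
        proba = weak.predict_proba(X[te])
        # An instance is "easy" if the weak model gives the gold label
        # a very high probability; easy instances are filtered out.
        easy = proba[np.arange(len(te)), y[te]] > easy_threshold
        keep = np.concatenate([tr, te[~easy]])
    return np.sort(keep)

# Synthetic stand-in data: 200 instances, 20 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = (X[:, 0] > 0).astype(int)          # feature 0 acts as a giveaway "artifact"
print("kept", len(adversarial_filter(X, y)), "of", len(y), "instances")
```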

Finally, I believe the future is in generative tasks, in which the model needs to produce a free-text answer without being provided with the candidate answers. Several recent benchmarks are generative, such as TimeTravel (counterfactual reasoning), ART (abductive reasoning), CommonGen, and ProtoQA. The challenge in generative tasks is the lack of reliable automatic evaluation metrics. Given the gold standard reference answer(s), we would like a metric to (1) reward correct generated answers that are different from the reference answer, while (2) penalizing incorrect answers that are similar (e.g. lexically) to the reference. Human evaluation is reliable, but it is costly and is typically done once on the test set. In order to be able to improve models during development, we need automatic metrics. We currently settle for metrics based on lexical overlap such as BLEU and ROUGE which are pretty terrible at (1) and have little correlation with human judgments, or model-based metrics such as BERT score that are not great at (2). 
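A tiny illustration of problem (1), with my own made-up reference and answers: a correct paraphrase can score lower under a lexical-overlap metric than a fluent but wrong answer that happens to reuse the reference's words (here using NLTK's sentence-level BLEU):

```python
# Why lexical-overlap metrics are problematic for generative commonsense tasks:
# a correct paraphrase scores lower than a wrong answer with similar wording.
# Requires: pip install nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference  = "he went to the animal shelter".split()
paraphrase = "he visited the local pound".split()     # correct, different words
wrong      = "he went to the grocery store".split()   # wrong, similar words

smooth = SmoothingFunction().method1
print("paraphrase BLEU:", sentence_bleu([reference], paraphrase, smoothing_function=smooth))
print("wrong      BLEU:", sentence_bleu([reference], wrong, smoothing_function=smooth))
```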

How to gather and represent machine-readable commonsense knowledge?

Commonsense resources provide machine-readable knowledge about the world. Resources are expected to be large-scale and accurate, consist of diverse knowledge types, and be usable in downstream tasks. ConceptNet is a large (21 million assertions), commonly-used resource consisting of general commonsense knowledge, in over 85 languages. ATOMIC consists of 880,000 triplets reasoning about causes and effects of everyday situations. Other resources are listed in Figure 8.

Figure 8: Overview of existing commonsense resources. Image credit: Maarten Sap. 


Existing resources differ in several aspects:

Representation: how is knowledge represented in the resource? ConceptNet and ATOMIC represent knowledge in natural language (Figure 9), while NELL and Cyc represent knowledge in symbolic logic:

(#$implies (#$and (#$isa ?OBJ ?SUBSET) (#$genls ?SUBSET ?SUPERSET)) (#$isa ?OBJ ?SUPERSET)) 


Figure 9: example knowledge extracted from ConceptNet and ATOMIC. Image credit: Maarten Sap. 


Knowledge type: ConceptNet consists of semantic knowledge, i.e. properties of concepts (e.g. reading is a type of activity). ATOMIC, on the other hand, is inferential: given a templated event with "PersonX" representing the subject and "PersonY" an optional object (e.g. PersonX yells at PersonY), and one of 9 pre-defined relation dimensions (e.g. PersonX's motivation), it provides a second event (e.g. PersonX wanted to express anger). 
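To get a feel for what ConceptNet assertions look like, here is a hedged sketch that queries its public REST API; the endpoint and JSON field names are as I recall them from api.conceptnet.io, so check the ConceptNet documentation before relying on them:

```python
# Pulling a few ConceptNet assertions for a concept from the public REST API.
# Endpoint and JSON fields are assumptions based on api.conceptnet.io.
# Requires: pip install requests
import requests

resp = requests.get("https://api.conceptnet.io/c/en/job", params={"limit": 10})
for edge in resp.json().get("edges", []):
    rel   = edge["rel"]["label"]
    start = edge["start"]["label"]
    end   = edge["end"]["label"]
    print(f"{start} --{rel}--> {end}  (weight {edge.get('weight', 0):.2f})")
```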

Collection method: knowledge can be collected from humans, either experts or crowdsourcing workers. Expert-curated resources are more uniform and accurate and may use complex representations, but it is an expensive collection method, and it is very time-consuming. Alternatively, non-experts can write knowledge in natural language, making the collection faster and more scalable.

The alternative approach is to extract knowledge automatically from texts, as in NELL. This approach works, but it produces less accurate knowledge. In addition, the approach suffers from reporting bias: over-representing the rare at the expense of the trivial. For example, people are reported to murder more often than they are reported to breathe. Default properties of concepts (yellow banana) are mentioned less often than their alternatives (green banana), etc. 


How to enhance neural models for commonsense reasoning tasks with symbolic knowledge?

Most models developed for solving commonsense benchmarks today are based on language models. Typically, each answer choice, along with the context, forms a statement. The language model computes a vector representing each statement. These vectors are then fed into a classifier that assigns a plausibility score for each candidate answer:


Figure 10: An illustration of using BERT to score the answer choices of a WinoGrande instance.
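A rough sketch of the setup in Figure 10, assuming Hugging Face's multiple-choice head on top of BERT (my code, not the leaderboard models): one statement is built per answer choice, and the classifier head scores each. The head here is randomly initialized; in practice it is fine-tuned on the task's training data.

```python
# One statement per answer choice; a classification head on top of BERT
# assigns a score to each choice.
# Requires: pip install transformers torch
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMultipleChoice.from_pretrained("bert-base-uncased").eval()

context = ("Because Brett found an internship while in college but Ian was "
           "unable to, _ found a job less quickly after graduation.")
choices = ["Brett", "Ian"]

statements = [context.replace("_", c) for c in choices]
enc = tokenizer(statements, return_tensors="pt", padding=True)
# The model expects tensors shaped (batch_size, num_choices, seq_len).
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}

with torch.no_grad():
    logits = model(**inputs).logits        # shape: (1, num_choices)
print(dict(zip(choices, logits[0].tolist())))
```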


Static neuro-symbolic integration

The knowledge in commonsense resources may enhance models built for solving commonsense benchmarks. For example, we can extract from ConceptNet the assertions that a job is used for making money, that spending money requires making money, that buying requires spending money and that car is something you can buy. Ideally, we would also need the knowledge that a high-paying job is a type of job, specifically one used for making a lot of money, which is required for spending a lot of money, which is required for buying something that costs a lot of money, a car being one of them. Finally, we may want to remove the edge from "buy" to "car" so we can only get to "car" from the node "buy something that costs a lot of money". 


Figure 12: Knowledge extracted from ConceptNet for the WinoGrande instance discussed above.


How do we incorporate knowledge from knowledge resources into a neural model?

The simple recipe (success not guaranteed) calls for 4 ingredients: the task addressed, the knowledge resource used, the neural component, and the combination method. We have already discussed tasks and knowledge resources, so I would only add here that ConceptNet is the main resource utilized for downstream models, although some models incorporate other knowledge sources, such as other knowledge bases (WordNet, ATOMIC), knowledge mined from text, and tools (knowledge base embeddings, sentiment analysis models, COMET - see below). 


Figure 13: Resources used by most knowledge-informed commonsense models.

The neural component is the shiny new neural architecture - language models in the last 3 years, biLSTMs in the years prior, etc. The more interesting component is the combination method. We will look at 3 examples:

Incorporating into the scoring function: Lin et al. (2017) extracted probabilistic "rules" connecting pairs of terms from multiple sources such as WordNet (restaurant→eatery: 1.0), Wikipedia categories (restaurant→business: 1.0), script knowledge mined from text (X went to a restaurant→X ate: 0.32), word embedding-based relatedness scores (restaurant→food: 0.71), and more. The model scores each candidate answer according to the scores of the inference rules used to get from the context (e.g. "Mary walked to a restaurant" in Figure 14) to the candidate answer (e.g. "She ordered foods.").  


Figure 14: "covering" each candidate answer by the original context and the rules extracted from various sources. Image credit: Lin et al. (2017).


Representing symbolic knowledge as vectors: Lin et al. (2019) used BERT as the neural component to represent the instance (statement vector). For their symbolic component, they extracted subgraphs from ConceptNet pertaining to concepts mentioned in the instance and learned to represent them as a vector (graph vector). These two vectors were provided as input to the answer scorer which was trained to predict the correct answer choice. 

Figure 15: extracting subgraphs from ConceptNet pertaining to concepts mentioned in the instance. Image credit: Lin et al. (2019).

Multi-task learning: Xia et al. (2019) fine-tuned a BERT model to solve the multiple-choice questions. They also trained two auxiliary tasks supervised by ConceptNet, in which two concepts were given as input and the classifier had to predict whether they are related or not, and the specific ConceptNet property that connects them. The BERT model was shared between the main and the auxiliary tasks, so that commonsense knowledge from ConceptNet was instilled into BERT, improving its performance on the main task.


Figure 16: multi-task learning aimed at instilling knowledge from ConceptNet into BERT.
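A schematic sketch of this kind of multi-task sharing, with my own module and head names rather than Xia et al.'s code: one BERT encoder is shared, and separate linear heads serve the main task and the two ConceptNet-supervised auxiliary tasks.

```python
# A shared BERT encoder with one head per task; training alternates between
# tasks so that ConceptNet supervision shapes the shared encoder.
# Requires: pip install transformers torch
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

NUM_CONCEPTNET_RELATIONS = 10   # placeholder; depends on the relations used

class SharedBertMultiTask(nn.Module):
    def __init__(self, name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)   # shared between tasks
        hidden = self.encoder.config.hidden_size
        self.answer_scorer = nn.Linear(hidden, 1)                        # main task
        self.related_or_not = nn.Linear(hidden, 2)                       # aux task 1
        self.relation_type = nn.Linear(hidden, NUM_CONCEPTNET_RELATIONS) # aux task 2

    def forward(self, task, **enc):
        # Use the [CLS] vector as a summary of the input statement/concept pair.
        cls = self.encoder(**enc).last_hidden_state[:, 0]
        head = {"main": self.answer_scorer,
                "related": self.related_or_not,
                "relation": self.relation_type}[task]
        return head(cls)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = SharedBertMultiTask()
enc = tokenizer("reading", "book", return_tensors="pt")
print(model("related", **enc).shape)   # torch.Size([1, 2])
```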

Dynamic neuro-symbolic integration

There are two main limitations to the neuro-symbolic integration discussed above:
  1. Coverage: relevant knowledge is often not found as-is in commonsense knowledge resources. As we've seen earlier, commonsense knowledge is immeasurably vast, so much of it is not documented. 

  2. Precision and context: knowledge found in the knowledge base about concept X doesn't necessarily apply to all contexts in which X appears. For example, when provided with "PersonX adopts a cat", ATOMIC says that PersonX had to go to the shelter first (Figure 17), but that's not always the case. It may well be that PersonX adopted a cat they found on the street, or got the cat from a friend who was no longer able to care for it. 

Figure 17: ATOMIC inferences for the event "PersonX adopted a cat".


How do we provide machines with large-scale, contextualized commonsense knowledge?

The solution is to leverage manually curated commonsense knowledge resources, such as ConceptNet and ATOMIC, to train a model that can dynamically produce such knowledge for a given context. Commonsense knowledge resources are typically sparse, making training a knowledge base completion model to extend the resource less efficient. Pre-trained language models and their inherent knowledge come in handy here. Language models (such as GPT) implicitly represent knowledge, so you can re-train them on completing knowledge base assertions (e.g. from ATOMIC) to teach them the structure of knowledge. This is what COMET (COMmonsEnse Transformers) does, as illustrated in Figure 18. 


Figure 18: Illustration of the training process of COMET: The language model is fine-tuned to predict the "tail entity" (e.g. inference in ATOMIC) given the "head entity" and the relation. Image credit: Antoine Bosselut.
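A simplified sketch of the training recipe in Figure 18: serialize knowledge-graph triples as text and fine-tune a language model to continue the head-and-relation prefix with the tail. The serialization format, special tokens, and use of GPT-2 here are illustrative assumptions, not the exact COMET setup.

```python
# COMET-style fine-tuning, heavily simplified: triples become text sequences
# and a pre-trained LM is fine-tuned to generate the tail entity.
# Requires: pip install transformers torch
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

triples = [
    ("PersonX adopts a cat", "xNeed", "to go to the shelter"),
    ("PersonX adopts a cat", "xWant", "to buy cat food"),
]

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# head + relation + tail as one sequence. (The real COMET restricts the loss
# to the tail tokens; here the whole sequence, including padding, contributes
# for brevity.)
texts = [f"{h} <{r}> {t}" for h, r, t in triples]
batch = tokenizer(texts, return_tensors="pt", padding=True)

outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()
optimizer.step()
print("one training step done, loss =", outputs.loss.item())
```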


COMET is capable of dynamically generating inferences for any context. For example, if we modify the context from ATOMIC to "David adopted his sister's cat because they found out her husband was allergic.", which for obvious reasons does not appear in ATOMIC, COMET no longer predicts that PersonX (David) had to go to the shelter, but instead that he, for example, needed to find out about it.

COMET has been used successfully in various downstream tasks requiring commonsense knowledge. Models trained on ATOMIC or on ConceptNet are available, and the demo for both ATOMIC and COMET can be found here. There is also a Visual COMET that can generate inferences from images. 

Summary

We talked about ways to acquire and represent commonsense knowledge in machine-readable format, ways to measure commonsense reasoning abilities, and ways to integrate this kind of knowledge into models. None of these is solved yet. Manually collecting all the commonsense knowledge is infeasible, while extracting it from texts or from language models suffers from inaccuracies, reporting bias, and societal biases. Looking forward, a promising research direction is multi-modal commonsense knowledge acquisition, e.g. learning from texts along with images and videos. For example, looking through enough class photos, you might learn that the kids in the front row typically sit (especially if the kids in the last row are also seated). 


Machines may reach human performance on commonsense benchmarks but it's often due to being right for the wrong reasons rather than actually possessing and successfully applying commonsense knowledge and reasoning abilities. Generative tasks are somewhat less prone to this issue, but we would have to develop reliable automatic evaluation metrics to make them the standard. 

Machine commonsense reasoning is becoming more and more popular within NLP so I am optimistic about future breakthroughs! 

Tuesday, May 15, 2018

The Weakness In Data

Another guest post by Luigi who covers a variety of subjects here: AI, Big Data, NMT Hype, and more. Luigi attempts on a regular basis to clarify the conflation that seems rampant in the translation industry and makes my life easier by producing what I perceive as interesting content to keep this blog relevant and current.



AI is a much-misunderstood term, and thus I think it is worth a closer look to further reduce the conflation that surrounds it. The graphic below, from a presentation I made on "Linguistic AI" on behalf of SDL, describes what I think a real AI should do. However, the reality is still quite far from the broad promise implied by the word intelligence, and most of what we see today is narrowly focused ML deployments that do indeed seem to perform some kind of cognitive function around carefully selected data.


There is also a lot of confusion about what machine learning (ML) is and how it relates to AI, so I think the graphic below is also useful for keeping the ongoing discussions clear, especially since we hear some people talking about deep NMT versus your basic NMT. Seriously, how deep are we talking? Most NMT today, to the best of my knowledge, is based on deep learning, as shown below.
 

Luigi also touches upon the hype around NMT, specifically the Microsoft claim of reaching human parity with their Chinese NMT engine. While not untrue under a very narrowly defined and very specific definition of parity, it is an overstatement of the actual achievement in the broader sense that we regular humans might understand. However, seeing the overstatement requires actual intelligence; artificial intelligence is not enough.

It is hyperbole that you can quickly disprove by taking any random Chinese news web page and running it through the engine. You will indeed be disappointed by the complete absence of the alleged human parity, and will probably begin to ask pesky questions about which humans we are talking about. It is also similar to equating a card trick to a miracle. Anyway, this kind of claim is a common marker in the MT world, which is often filled with empty promises. To be fair, it is a much less deceptive and blatant overstatement than the Google announcements of a year or so ago.

It has been my observation that most if not all the do-it-yourself experimentation with SMT produced sub-optimal results. To be explicit, this means that you would have been better off using a public MT portal or working with an expert. NMT has 10+ open source toolkits, so my question(s) to the DIYers is: Which one are you going to use? Why? How do you know the others are not better? The cost and complexity to engage with NMT go way beyond loading low-quality data into an open source or any toolkit. The rate of change in the science and algorithmic evolution is unprecedented. It is my opinion that NMT is not a game for the underfunded and the naive, but I am sure many in the translation industry will expend time and resources to find this out.  

The notion of data in this era of ML and neural nets is interesting, and I recommend that you go down the long and often silly comment trail triggered by this tweet from a partner at VC Andreessen Horowitz, who, it seemed, wanted to make the point that ML apps need very different and specific data to produce useful outcomes, not just generic "data":




 Some of my favorite responses include the examples below, which sound surprisingly like some discussions on translation quality that I have witnessed.


  • "I heard they both go through pipelines"
  • @BatMongoose: Maybe data is more like sand - annoyingly ubiquitous but useless until you figure out how to turn it into something (silicon wafers)
  • @EVplusEV: I prefer: Data is the new bacon
  • Big Data is the new snake oil.
  • @DanielMiessler: Data would be more like the dinosaurs, plants, and sunshine. The oil would be the insights and predictions.
  • @asemota: Data is the new "Oxygen" 🤣🤣



============

In recent years, the blogosphere has lost much of its original appeal, mainly because its connected community has largely moved to social media, which today ends up conveying most content. Indeed, social media help much content emerge that would otherwise remain buried. Social media, as we all know, also convey content that would be better ignored, but even crap has its raison d’être: that’s content marketing, baby, content marketing, and there’s nothing you can do about it, nothing.

Content, skills, and knowledge

Indeed, this content offers a plea to run some basic psychometrics on the small groups of people one follows on social media. Don’t be fooled by the Facebook/Cambridge Analytica scandal; it’s not rocket science: even likes can tell you a lot and help you understand what your contacts are paying attention to and why, especially if they are not just virtual acquaintances.

The social media activity of your contacts can provide you with even more confirmation than you might expect. The fundamentals of content marketing say that the content produced should be of absolute value, but this is hardly ever true, because marketing is supposed to exert its effects anyway and one does not always have something definitive to say.

What would you think, for example, of an acquaintance of yours recommending a post by someone who admits s/he is an absolute beginner with machine translation, has no technical knowledge of it and yet thinks s/he can provide his/her customers with solid advice anyway? And what would you think of the same acquaintance of yours who defines him/herself as an industry professional while admitting his/her revulsion for MT and declaring her cast-iron belief in any professional as being capable of sparing his/her customers a “poor figure”? Well, these people are really telling a lot about themselves with a post and a like.

 

The power of data

 

Seth Stephens-Davidowitz’s Everybody Lies is a terrific book for how simply it shows the power of data. Just like Stephens-Davidowitz in his book, Google’s Mackenzie Nicholson wrong-footed many attendees at the recent Smartling Global Ready EMEA by asking a few classic questions with seemingly obvious and yet invariably incorrect answers. For example, when it comes to clichés, no one would have bet that Italians pay far more attention to price than Germans, Scots, or Israelis, as Google’s data unequivocally shows.

It came as no surprise, then, that analytics generally indicate that in-house reviews mostly turn out to be overly expensive and largely pointless, as Kevin Cohn showed later on the same occasion. Simply put, despite great expectations, almost no actual improvement is recorded: most edits are usually irrelevant and simply a matter of personal taste. Incidentally, Kevin Cohn is a data scientist who only speaks English and admittedly knows almost nothing about translation. Anyway, as the wise man says, data ipsa loquuntur.

 

Hypes you (don’t) expect

 

Of the many expectations that have generated hype over the last few years, the ones about data are not inflated, and people are, maybe slowly but steadily, getting accustomed to reckoning with data-driven predictions. As algorithms grow in number and capability, confidence in their applications will also grow.

In fact, hype is aimed at, and addresses, people outside a given vertical, so Microsoft’s recent hype about NMT achieving human parity, for example, was not meant for the translation industry.

So why all the fuss?

As a matter of fact, the difference between human and machine translation is becoming thinner and thinner, at least as measured by quality scores and statistical incidence. Also, the concept of parity may be quite hard for a layman to grasp. This, if anything, makes the desolation of posts like the one mentioned above even more evident. Indeed, it is pretty unlikely that the general media will get the news right in cases like the Microsoft announcement: however complete and clear the article might have been, even its title was misleading, and the title is usually the only catchphrase the media pick up.

In Microsoft’s much-debated, and yet, don’t forget it, scientific article, parity is defined mostly as a functional feature, i.e. as a measure of the ability to communicate across language barriers. Parity is compared to professional human translations, while keeping clearly in mind that “computers achieving human quality level is generally considered unattainable and triggers negative reactions from the research community and end users alike” and that “this is understandable, as previous similar announcements have turned out to be overly optimistic.”

As a matter of fact, it is made equally clear that the quality of the NMT output in the case examined exceeds that of crowd-sourced non-professional translations, which should come as no surprise to those translation pundits who have actually read the article.

On the other hand, a recent study from the University of Maryland found that “users reacted more strongly to fluency errors than adequacy errors.” Since the main criterion for recruiting participants was their English language ability, the study indirectly confirms that “adequacy” implies a vertical kind of knowledge, the very kind that could prevent hype from arising and spreading.

The unpleasant side of this story is that, once again, many so-called translation professionals still can’t see that MT is just a stress-relieving technology, conceived and developed to enhance translation and make it easier, faster, and possibly better.

That’s why (N)MT is no inflated hype, and it has actually been on the plateau of productivity for years now.

Overcoming language barriers is an ageless aspiration of humankind that does not generate any fears, unlike the much-fabled singularity. Except, possibly, among language professionals, despite their continuous, recurrent self-reassurance (wishful thinking?) that machines will never replace humans, at least in this creative and thus undeniably human task.

In the end, the NMT hype falls within mainstream tech news, sprayed like a toxic gas to win a market war that is being fought on fronts far more profitable than NLP, namely corporate business platforms. Indeed, the NMT arena is dominated by a leading actor, a supporting actor, and many smaller side actors struggling for an appearance on the proscenium. Predictably, a translation industry “star”, which is just a “dwarf” in the global business universe, recently opted to buy rather than build its own NMT engine, citing the scarcity of data scientists (and of money, of course) as the main reason for the decision.

Actually, not only has NMT emerged as the most promising approach, it has also been showing superior performance on public benchmarks, rapid adoption in deployments, and steady improvement. Undeniably, there have also been reports of poor performance, such as with systems built under low-resource conditions, confirming that NMT systems produce lower quality with out-of-domain data. This implies that the learning curve may be quite steep with respect to the amount and, most importantly, the quality of training data. Also, NMT systems are still barely interpretable, meaning that any improvement is extremely complex and haphazard, when not arbitrary.

Anyway, to be unmistakably clear, MT is definitely “at parity” with human translation, especially when the latter falls below expectations, i.e. is sadly average and low-grade. And Arle Lommel is right in writing that an article titled New Study Shows That MT Isn’t Terrible would not generate much attention. At the same time, though, when he writes that “the only translators who need to worry about machine translation are those who translate like machines,” he can’t possibly imagine that this is exactly what most human translators have been doing, perhaps out of necessity, for decades.

Therefore, the NMT hype is such only for people in the translation industry who, on the other hand, are much more open to stuff that insiders in other industries would label as crap.

After all, NMT is just another algorithm and, with the world going increasingly digital and (inter)connected, and so information-intensive, resorting to algorithms is inevitable, because it is necessary.

Data as fuel

 

The fuel of algorithms is data. Unfortunately, despite the long practice of producing language and translation data, translation professionals and businesses have seemingly learned very little about data and are still very late in adopting data-driven applications. Indeed, data can be an asset if you know what to do with it, how to take advantage of it, how to profit from it.

In this respect, besides showing a total ignorance of what “big data” is, the inconsiderate use of the nonsensical “translation big data” has been seriously damaging any chance of effectively trading language and translation data. This is just one of the impacts of fads and hype, especially when ignorantly borrowed from and spread through equally ignorant (social) media.

As Andrew Joscelyne finally wrote in his latest post for the TAUS blog, «Language data […] has never been “big” in the Big Data Sense.»

By the way, what happened with “translation big data” is about to happen with AI, too, because ML (or even DL) is not AI, but too many people don’t care to dig deeper and see the difference.
In fact, with the translation industry processing less than 1% of translation requests, language data can’t exactly be big, while translation businesses don’t have the necessary knowledge, tools, and capability to effectively exploit and benefit from translation (project) data. There are exceptions, of course, but one can count them on the fingers of one hand, and they are all technology providers.

 

Data and quality

 

Unfortunately, the translation industry is affected by a syndrome of blaming technology for replacing services, products, and habits with impoverished and/or simplified ones of lower quality. Luddites, anyone?

Indeed, only human laziness should be blamed for unsatisfactory quality. And this is consistent with the perennial, grueling and inconclusive debate on quality, the magical mystery word that instantly explains everything and forbids further questioning.

A solid example is the anxiety over confidentiality with online MT, which is not quite the issue it is made out to be. Confidentiality is definitely a minor issue for an industry whose players still extensively use email, when not unsecured FTP connections and servers, to exchange files. It is definitely not a major issue when it is mostly delegated to NDAs without any enforcement mechanism, especially when non-disclosure agreements are perceived as offensive for revealing a lack of trust and questioning professionalism. It is not an issue when, bombastic certifications notwithstanding, the violation of confidentiality obligations is always around the corner: customer data left unsecured, no contingency or security plan in place, or the same data re-used for other projects, knowingly or not. Also, in most cases, IPR rather than confidentiality is the real issue.

Anyway, when such issues arise, it is never technology that is to blame but human laziness, sloppiness, helplessness, and ineptness.

Are all these traits also affecting data? Of course they are. It is no accident that translation businesses believe they are so different from other service businesses that no real innovation has ever come from them. Even when they choose to build their own platforms, these are so peculiar that they could never be made available to the whole community, even if their makers wanted to, and they don’t. After all, this is also a reason for the proliferation of unnecessary standards. Narcissism is the boulder blocking the road to change and innovation.

The same dysfunctional approach affects data. For example, if one were to believe the meager results of the perennial, grueling, and inconclusive debate on quality, quality could only be measured downstream, and only by counting and weighing errors, in a typical red-pen syndrome. On the contrary, a predictive quality score can be computed from past project data, which is extremely interesting for buyers.
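To give an idea of how simple the principle is, here is a minimal sketch in Python, assuming a handful of hypothetical project features (domain match, vendor rating, translation-memory leverage, deadline pressure) and the quality scores recorded downstream for past projects. None of these names or figures come from an actual production system; they only illustrate how past project data can be turned into a prediction before a new project starts.

# A minimal sketch of a predictive quality score (hypothetical features and figures).
from sklearn.linear_model import LinearRegression

# Each past project: [domain_match, vendor_rating, tm_leverage, deadline_pressure]
past_features = [
    [0.9, 4.5, 0.70, 0.2],
    [0.4, 3.8, 0.30, 0.8],
    [0.8, 4.9, 0.65, 0.1],
    [0.3, 3.2, 0.20, 0.9],
]
past_quality = [92, 74, 95, 68]  # quality scores measured downstream (0-100)

model = LinearRegression().fit(past_features, past_quality)

# Predict the expected quality of a new project before it even starts,
# instead of only counting errors after delivery.
new_project = [[0.7, 4.1, 0.50, 0.4]]
print(f"Predicted quality score: {model.predict(new_project)[0]:.1f}")

Any regression or gradient-boosting model could play the same role; the point is simply that the data most translation businesses already collect is enough to move quality measurement upstream.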

 

More ML applications

 

Now, imagine a predictive quality score combined with a post-factum score deduced from content profiling and initial requirements (checklists), classic QA, and a linguistic evaluation based on correlation and dependence, precision and recall, and edit distance.
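As a rough illustration of how such a combined post-factum score could be put together, here is a minimal sketch; the QA pass rate, the terminology precision and recall, the edit-distance similarity, and the weights are all illustrative assumptions, not an actual metric in use anywhere.

# A minimal sketch of a post-factum score combining classic QA, precision/recall,
# and edit distance. All inputs and weights are illustrative assumptions.
from difflib import SequenceMatcher

def f1(precision, recall):
    # Harmonic mean of precision and recall (e.g. from terminology checks).
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def edit_similarity(delivered, reviewed):
    # 1.0 means the reviewer changed nothing; lower values mean heavier editing.
    return SequenceMatcher(None, delivered, reviewed).ratio()

def post_factum_score(qa_pass_rate, precision, recall, delivered, reviewed,
                      weights=(0.3, 0.3, 0.4)):
    w_qa, w_term, w_edit = weights
    return 100 * (w_qa * qa_pass_rate
                  + w_term * f1(precision, recall)
                  + w_edit * edit_similarity(delivered, reviewed))

print(round(post_factum_score(0.95, 0.88, 0.91,
                              "The system was configured as requested.",
                              "The system was configured as required."), 1))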

Only one weak point would be left, i.e. how to recruit, vet, compensate, and retain vendors so as to always have the best fit.

During his presentation on KPIs at the recent interpretation and translation congress in Breda, XTRF’s Andrzej Nedoma recalled how project managers always tend to use the same resources, who are not necessarily always the most suitable.

With vendor managers continuously vetting and monitoring vendors and constantly updating the vendor database, project managers could have a reliable repository to pick from. And with project managers updating the vendor database in turn with performance data, this could be combined with assessments and ratings from customers and peers to feed an algorithm that would provide the best fit for any new project and, in short, ultimately start a virtuous circle and maximize customer satisfaction.
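A minimal sketch of what such a best-fit ranking could look like follows; the vendor fields, the figures, and the weights are hypothetical and only meant to show how performance data and customer/peer ratings could be combined into a single ranking.

# A minimal sketch of a vendor best-fit ranking (hypothetical fields and weights).
from dataclasses import dataclass

@dataclass
class Vendor:
    name: str
    domains: set            # domains the vendor has been vetted for
    on_time_rate: float     # share of projects delivered on time (0-1)
    avg_quality: float      # average post-factum quality score (0-100)
    customer_rating: float  # average customer rating (0-5)
    peer_rating: float      # average peer rating (0-5)

def best_fit(vendors, project_domain, weights=(0.25, 0.35, 0.25, 0.15)):
    # Rank only the vendors already vetted for the project's domain.
    w_time, w_qual, w_cust, w_peer = weights
    candidates = [v for v in vendors if project_domain in v.domains]
    return sorted(candidates,
                  key=lambda v: (w_time * v.on_time_rate
                                 + w_qual * v.avg_quality / 100
                                 + w_cust * v.customer_rating / 5
                                 + w_peer * v.peer_rating / 5),
                  reverse=True)

vendors = [
    Vendor("A", {"legal", "finance"}, 0.97, 91, 4.6, 4.4),
    Vendor("B", {"medical"}, 0.99, 88, 4.8, 4.1),
    Vendor("C", {"legal"}, 0.90, 95, 4.2, 4.7),
]
print([v.name for v in best_fit(vendors, "legal")])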

To be unambiguously clear once again, this is by no means an endorsement of translation marketplaces. On the contrary, the inherent vice of translation marketplaces is their ultra-exploitation of information asymmetry: they provide no mechanism for factual vetting and evaluation, and thus no real disintermediation. However, any platform that users from all parties could join, where they are vetted and evaluated and their performance fairly measured, will eventually prevail.

If the idea of translation marketplaces has not worked out so far, it is not because of some supposedly unique nature of translation; on the contrary, this is one of the conditions that make the translation industry an ideal candidate for disruption. In fact, with suitable data and the right algorithms, machine learning, including deep learning, can provide many high-value solutions.

Where is the weakness in data, then? In the humans who misunderstand and misuse it.



=======================


Luigi Muzii has been in the "translation business" since 1982 and has been a business consultant since 2002, in the translation and localization industry through his firm. He focuses on helping customers choose and implement best-suited technologies and redesign their business processes for the greatest effectiveness of translation and localization-related work.

This link provides access to his other blog posts.