With
the advent of Large Language Models (LLMs), there are exciting new
possibilities available. However, we also see a large volume of mostly
vague and poorly defined claims of "using AI" by practitioners with
little or no experience with machine learning technology and algorithms.
The noise-to-signal (hype-to-reality) ratio has never been higher, and
much of the hype fails to meet real business production use case
requirements. Aside from the data privacy issues, copyright problems,
and potential misuse of LLMs by bad actors, hallucinations and
reliability issues also continue to plague LLMs.
Enterprise users expect production IT infrastructure output to be
reliable, consistent, and predictable on an ongoing basis, but there
are very few use cases where this is currently possible with LLM output.
The situation is evolving, and many expect that the expert use of LLMs
could have a dramatic and favorable impact on current translation
production processes.
There are several areas in and around the machine translation
task where LLMs can add considerable value to the overall language
translation process. These include the following:
LLM translations tend to be more fluent and capture more contextual information, albeit in a smaller set of languages
Source text can be improved and enhanced before translation to produce better-quality translations
LLMs can carry out quality assessments on translated output and identify different types of errors (see the sketch after this list)
LLMs can be trained to take corrective actions on translated output to raise overall quality
LLM MT is easier to adapt dynamically and can avoid the large re-training that typical static NMT models require
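As a concrete illustration of the quality-assessment and corrective post-editing items above, the sketch below shows how an LLM might be prompted to flag errors in a translation and propose a fix. The prompt wording, model name, and error categories are illustrative assumptions, not Translated's production setup.

```python
# Illustrative only: a hypothetical LLM-based quality check plus automated post-edit.
# The model name, prompt wording, and error taxonomy are assumptions, not a product spec.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

QE_PROMPT = """You are a translation quality reviewer.
Source ({src_lang}): {source}
Translation ({tgt_lang}): {translation}

1. List any accuracy, fluency, terminology, or style errors you find.
2. Provide a corrected translation."""

def review_translation(source: str, translation: str, src_lang: str, tgt_lang: str) -> str:
    prompt = QE_PROMPT.format(
        src_lang=src_lang, tgt_lang=tgt_lang, source=source, translation=translation
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,   # keep the review as deterministic as possible
    )
    return response.choices[0].message.content

print(review_translation("The invoice is overdue.", "La factura está vencida.", "English", "Spanish"))
```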
At Translated, we have been carrying out extensive research and
development over the past 18 months into these very areas, and the
initial results are extremely promising, as outlined in our recent whitepaper.
The chart below shows some evidence of our progress with LLM MT. It compares Google (static), DeepL (static), Lara RAG-tuned LLM MT, GPT-4o (5-shot), and ModernMT (TM access) for nine high-resource languages. These results for Lara are expected to improve further.
One approach involves using independent LLM modules to handle
each category separately. The other approach is to integrate
these modules into a unified workflow, allowing users to simply submit
their content and receive the best possible translation. This integrated
process includes MTQE as well as automated review and post-editing.
While managing these tasks separately can offer more control,
most users prefer a streamlined workflow that focuses on delivering
optimal results with minimal effort, with the different technology components working
efficiently behind the scenes.
LLM-based machine translation will need to be secure,
reliable, consistent, predictable, and efficient for it to be a serious
contender to replace state-of-the-art (SOTA) NMT models.
This transition
is underway but will need more time to evolve and mature.
Thus, SOTA Neural MT models may continue to dominate MT use in most enterprise production scenarios for the next 12-15 months, except where the highest quality automated translation is required.
Currently,
LLM MT makes the most sense in settings where high throughput, high
volume, and a high degree of automation are not a requirement and where
high quality can be achieved with reduced human review costs enabled by
language AI.
Translators are already using LLMs for high-resource languages
for all the translation-related tasks previously outlined. It is the
author’s opinion that there is a transition period where it is quite
plausible that both NMT and LLM MT might be used together or separately
for different tasks in new LLM-enriched workflows. NMT will likely
perform high-volume, time-critical production work as shown in the chart
below.
In the scenario shown above, information triage is at work.
High-volume content is initially processed by an adaptive NMT model,
followed by an efficient MTQE process that sends a smaller subset to an
LLM for cleanup and refinement. These corrections can be sent back to
improve the MT model and increase the quality of the MTQE (not shown in
the diagram above).
However, as LLMs get faster and it is easier to automate
sequences of tasks, it may be possible to embed both an initial quality
assessment and an automated post-editing step together for an LLM-based
process to manage.
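As a rough illustration of this kind of triage, the sketch below routes segments through an adaptive NMT engine, scores them with MTQE, and sends only the low-scoring subset to an LLM for automated post-editing. The function names and the threshold are hypothetical placeholders, not a description of any specific product's API.

```python
# Illustrative triage sketch: NMT first, MTQE filter, LLM post-edit for weak segments only.
# nmt_translate, mtqe_score, and llm_post_edit are hypothetical stand-ins for real services.
from typing import List

QUALITY_THRESHOLD = 0.85  # assumed MTQE cutoff; tune per content type and risk tolerance

def nmt_translate(segment: str) -> str:
    raise NotImplementedError("call an adaptive NMT engine here")

def mtqe_score(source: str, translation: str) -> float:
    raise NotImplementedError("call an MT quality-estimation model here")

def llm_post_edit(source: str, translation: str) -> str:
    raise NotImplementedError("ask an LLM to correct the translation here")

def triage_translate(segments: List[str]) -> List[str]:
    output = []
    for source in segments:
        translation = nmt_translate(source)
        if mtqe_score(source, translation) < QUALITY_THRESHOLD:
            # Only the weaker subset incurs the slower, costlier LLM pass.
            translation = llm_post_edit(source, translation)
        output.append(translation)
    return output
```

In a fuller version, the post-edited segments would also be fed back to the NMT model and the MTQE component, closing the improvement loop described above.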
An emerging trend among LLM experts is the use of agents. Agentic
AI and the use of agents in large language models (LLMs) represent a
significant evolution in artificial intelligence, moving beyond simple
text generation to create autonomous, goal-driven systems capable of
complex reasoning and task execution.
AI agents are systems that use
LLMs as their core controller to autonomously pursue complex goals and
workflows with minimal human supervision.
They potentially combine
several key components:
An LLM core for language understanding and generation
Memory modules for short-term and long-term information retention
Planning capabilities for breaking down tasks and setting goals
Some ability to iterate to a goal
Tools for accessing external information and executing actions
Interfaces for interacting with users or other systems
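To make these components concrete, here is a minimal, illustrative agent loop; the class, prompts, and tool-calling convention are assumptions for illustration rather than a reference to any particular framework.

```python
# Illustrative agent loop combining an LLM core with memory, planning, tools, and iteration.
# All names and the plain-text tool protocol are hypothetical simplifications.
from typing import Callable, Dict, List

class MinimalAgent:
    def __init__(self, llm: Callable[[str], str], tools: Dict[str, Callable[[str], str]]):
        self.llm = llm               # LLM core for language understanding and generation
        self.tools = tools           # external actions the agent may invoke
        self.memory: List[str] = []  # short-term record of the run so far

    def run(self, goal: str, max_steps: int = 5) -> str:
        plan = self.llm(f"Break this goal into steps: {goal}")  # planning
        self.memory.append(f"PLAN: {plan}")
        result = ""
        for _ in range(max_steps):                               # iterate toward the goal
            context = "\n".join(self.memory)
            action = self.llm(f"Goal: {goal}\nSo far:\n{context}\nNext action (tool: input) or FINISH:")
            if action.strip().startswith("FINISH"):
                result = self.llm(f"Summarize the outcome for: {goal}\n{context}")
                break
            tool_name, _, tool_input = action.partition(":")
            tool = self.tools.get(tool_name.strip())
            observation = tool(tool_input.strip()) if tool else "unknown tool"
            self.memory.append(f"ACTION: {action}\nRESULT: {observation}")
        return result
```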
One approach involves using independent LLM agents to address each translation workflow step as a separate and discrete task.
The other approach is to integrate these steps into a unified and
robust workflow, allowing users to simply submit content and receive
the best possible output through an AI-managed process. This integrated
workflow would include source cleanup, MTQE, and automated post-editing.
Translated is currently evaluating both approaches to identify the best
path forward in different production scenarios.
Agentic AI systems offer several advanced capabilities, including:
Autonomy: Ability to take goal-directed actions with minimal oversight
Reasoning: Contextual decision-making and weighing tradeoffs
Adaptive planning: Dynamically adjusting goals and plans as conditions change
Natural language understanding: Comprehending and following complex instructions
Workflow optimization: Efficiently moving between subtasks to complete processes
A thriving and vibrant open-source community will be a key
requirement for ongoing progress. The open-source community has been
continually improving the capabilities of smaller models and challenging
the notion that scale is all you need. We see an increasing number of recent models that are smaller and more efficient yet still capable, and these are thus often preferred for deployment.
All signs point to an exciting future in which technology's ability to enhance and improve human communication and understanding keeps getting better, and we are likely to see major advances in bringing an increasing portion of humanity into the digital sphere for productive, positive engagement and interaction.
The internet is the primary source of information, economic opportunity, and community for many worldwide. However, the
automated systems that increasingly mediate our interactions online —
such as chatbots, content moderation systems, and search engines — are
primarily designed for and work far more effectively in English than in
the world’s other 7,000 languages.
It is clear to anyone who works with LLMs and multilingual models that there are now many powerful and impressive LLMs available for generating natural and
fluent texts in English. While there has been substantial hype around
the capabilities and actual potential value of a wide range of
applications and use cases, the benefits have been most pronounced for
English-speaking users.
It is also increasingly understood that achieving the same level of quality and performance for
other languages, even the ones that are widely spoken, is not an easy
task. AI chatbots are less fluent in languages other than
English and are thus threatening to amplify the existing language bias
in global commerce, knowledge access, basic internet research, and
innovation.
In the past, it has been difficult to develop
AI systems — and huge language models in particular — in languages other
than English because of what is known as the resourcedness gap.
The
resourcedness gap describes the asymmetry in the availability of
high-quality digitized text that can serve as training data for a large
language model and generative AI solutions in general.
English
is an extremely highly resourced language, whereas other languages,
including those used predominantly in the Global South, often have fewer examples of high-quality text (if any at all) on which to train language models.
English-speaking
users have a better user experience with generative AI than users who
speak other languages, and the current models will only amplify this
English bias further.
Although GPT-3's training data is estimated to consist of more than 90% English text, it did include some non-English text, but not enough to ensure consistent model performance across languages. GPT-3 was the foundation model used to build ChatGPT, and though we do not know what data was used in GPT-4, we can safely assume that no major sources of non-English data have been acquired, primarily because such data is not easily available.
Researchers like Pascale Fung and others have pointed out the
difficulty for many global customers because of the dominance of English
in eCommerce. It is much easier to get information about products in English in online marketplaces than it is in any other language.
Fung,
director of the Center for AI Research at the Hong Kong University of
Science and Technology, who herself speaks seven languages, sees this
bias even in her research field. “If you don’t publish papers in
English, you’re not relevant,” she says. “Non-English speakers tend to
be punished professionally.”
The following table describes the source data for the training corpus of GPT-3, which is the data foundation for ChatGPT:
| Datasets | Quantity (Tokens) | Weight in Training Mix | Epochs elapsed when training for 300 BN tokens |
| --- | --- | --- | --- |
| Common Crawl (filtered) | 410 BN | 60% | 0.44 |
| WebText2 | 19 BN | 22% | 2.90 |
| Books1 | 12 BN | 8% | 1.90 |
| Books2 | 55 BN | 8% | 0.43 |
| Wikipedia | 3 BN | 3% | 3.40 |
Understanding what data has been used to train GPT-3 is useful. This overview provides some valuable details that also help us understand the English bias and US-centric perspective that these models have.
Fung and others are part of a global community of AI researchers testing the language skills of ChatGPT and its rival chatbots, sounding the alarm and providing evidence that they are significantly less capable in languages other than English.
ChatGPT still lacks the ability to understand and generate sentences in low-resource languages. The performance disparity in low-resource languages limits the diversity and inclusivity of NLP.
ChatGPT also lacks the ability to translate sentences in non-Latin script languages, despite the languages being considered high-resource.
“One
of my biggest concerns is we’re going to exacerbate the bias for
English and English speakers,” says Thien Huu Nguyen, a University of
Oregon computer scientist who is also a leading researcher raising
awareness about the often impoverished experience that non-English speakers routinely have with generative AI. Nguyen specifically points out:
ChatGPT’s
performance is generally better for English than for other languages,
especially for higher-level tasks that require more complex reasoning
abilities (e.g., named entity recognition, question answering,
common sense reasoning, and summarization). The performance differences
can be substantial for some tasks and lower-resource languages.
ChatGPT can perform better with English prompts even though the task and input texts are intended for other languages.
ChatGPT performed substantially worse at answering factual questions or summarizing complex text in non-English languages and was more likely to fabricate information.
The research tends to point clearly to the English bias of the most popular LLMs and states: The AI systems are good at translating other languages into English, but they struggle with rewriting English into other languages—especially for languages like Korean, with non-Latin scripts.
“51.3%
of pages are hosted in the United States. The countries with the
estimated 2nd, 3rd, and 4th largest English-speaking populations—India,
Pakistan, Nigeria, and The Philippines—have only 3.4%, 0.06%, 0.03%,
0.1% the URLs of the United States, despite having many tens of millions
of English speakers.”
The chart below, from the Common Sense Advisory research team, takes a deeper look at the linguistic makeup of the Common Crawl data.
Recently though, researchers and technology companies have attempted
to extend the capabilities of large language models into languages other
than English by building what are called multilingual language models.
Instead of being trained on text from only one language, multilingual
language models are trained on text from dozens or hundreds of languages
at once.
Researchers posit that multilingual language models can
infer connections between languages, allowing them to apply word
associations and underlying grammatical rules learned from languages
with more text data available to train on (in particular English) to
those with less.
Languages vary widely in resourcedness, or the
volume, quality, and diversity of text data they have available to train
language models on. English is the highest-resourced language by multiple orders of magnitude, but Spanish, Chinese, German, and a handful of other languages have sufficiently high resources to build language models.
However, they are still expected to be lower in quality than English language
models. Medium-resource languages, with fewer but still high-quality data sets, such as Russian, Hebrew, and Vietnamese, and low-resource languages, with almost no training data sets, such as Amharic, Cherokee, and Haitian Creole, have too little text for training large language models.
However, there are many challenges and complexities
involved in developing multilingual and multicultural LLMs that can
cater to the diverse needs and preferences of different communities.
Multilingual language models are still usually trained
disproportionately on English language text and thus end up transferring
values and assumptions encoded in English into other language contexts
where they may not belong.
Most remedial approaches to addressing the English bias rely on acquiring large amounts of non-English data to add to the core training data; such data is not easily found and is often non-existent, certainly not at the scale, volume, and diversity at which English training data exists.
English is the
closest thing there is to a global lingua franca. It is the dominant
language in science, popular culture, higher education, international
politics, and global capitalism; it has the most total speakers and the third-most first-language speakers.
The
bias in the NLP research community is evident in the chart below. ACL
papers are more likely to be published in English than any other
language by a factor of 11X to 80X!
Languages mentioned in paper abstracts. Top most mentioned languages in abstracts of papers published by the Association for Computational Linguistics, May 2022-January 2023.
Recent US congressional hearings also focused on this language-bias problem
when Senator Alex Padilla (a native Spanish speaker) of California
questioned the CEO of OpenAI about improving the experience for the
growing population of non-English users even in the US and said: “These
new technologies hold great promise for access to information,
education, and enhanced communication, and we must ensure that language
doesn’t become a barrier to these benefits.”
However, the fact remains, as OpenAI clearly states, that the majority of the underlying training data used to power ChatGPT (and most other LLMs) came from English, and that the company’s efforts to fine-tune and study the performance of the model are primarily focused on English “with a US-centric point of view.”
This also results in the models performing better on tasks that involve
going from Language X to English than on tasks that involve going from
English to Language X. Because of the data scarcity and substantial costs involved in correcting this, it is not likely to change soon.
Because the training text data sets used to train GPT models also
have some other languages mixed in, the generative AI models do pick up
some capability in other languages. However, their knowledge is not
necessarily comprehensive or complete enough, and in a development
approach that implicitly assumes that scale is all you need, most
languages simply do not have enough scale in training data to perform at
the same levels as English.
This is likely to change over time to
some extent, and already the Google PaLM model claims to be able to
handle more languages, but early versions show only small incremental improvements in a few select languages.
Each
new language that is "properly supported" will require a separate set
of guardrails and controls to minimize problematic model behavior.
Thus, beyond the monumental task of finding massive amounts of non-English text and re-training the base generative AI model from scratch, researchers are also trying other approaches, e.g., creating new data sets of non-English text to accelerate the development of truly multilingual models, or generating synthetic data from what is available in high-resource languages like English or Chinese. Both approaches are less effective than simply having adequate data volume in the low-resource language in the first place.
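As a rough illustration of the synthetic-data idea just mentioned, the sketch below machine-translates high-resource monolingual text to produce synthetic parallel examples for a low-resource pair. The translate_to function is a hypothetical stand-in for whatever MT system or LLM is used, and real pipelines add filtering and quality checks.

```python
# Illustrative sketch of building synthetic parallel data from high-resource monolingual text.
# translate_to is a hypothetical placeholder; real pipelines also filter aggressively for quality.
from typing import Iterable, List, Tuple

def translate_to(text: str, target_lang: str) -> str:
    raise NotImplementedError("call an MT system or LLM here")

def make_synthetic_pairs(english_sentences: Iterable[str],
                         low_resource_lang: str) -> List[Tuple[str, str]]:
    pairs = []
    for sentence in english_sentences:
        synthetic = translate_to(sentence, low_resource_lang)
        # The (synthetic, original) pair becomes additional training data, typically with the
        # machine-generated side used as the source, back-translation style.
        pairs.append((synthetic, sentence))
    return pairs
```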
Nguyen
and other researchers say they would also like to see AI developers pay
more attention to the data sets they feed into their models and better
understand how that affects each step in the building process, not just
the final results. So far, which data and which languages end up in models has been a "random process," Nguyen says.
So when you
make a prompt request in English, it draws primarily from all the
English language data it has. When you make a request in traditional
Chinese, it draws primarily from the Chinese language data it has. How
and to what extent these two piles of data inform one another or the
resulting outcome is not clear, but at present, experiments suggest that the two are largely independent.
The training data for these models was collected through long-term web crawling initiatives, and much of it was fairly random. More rigorous controls to reach certain thresholds of content for each language (as Google tried to do with PaLM) could improve the quality of non-English output.
It is also possible that more carefully collected and curated data that
is better balanced linguistically could improve performance across more
languages.
The T-LM (Translated Language Model) Offering
The fundamental data acquisition and accessibility problems described above could take years to resolve.
Thus, Translated Srl is introducing a way to address the needs of a
larger global population interested in using GPT-4 for content creation,
content analysis, basic research, and content refinement in their
preferred language.
The following chart shows the improved
performance available with T-LM across several languages. Users can expect performance to continue to improve as they provide corrective feedback daily.
Combining the power of state-of-the-art adaptive machine translation technology with OpenAI's latest language model empowers users across 200 languages to explore the capabilities of GPT-4 in their preferred non-English language and achieve superior performance.
T-LM will help unlock the full potential of GPT-4 for businesses around the world. It provides companies with a cost-effective solution to create and restructure content and do basic content research in 200 languages, bridging the performance gap between GPT-4 in English and non-English languages.
Many users have documented and reported sub-optimal performance from Bing Chat when they query in Spanish rather than English.
In
a separate dialog, when queried in English, Bing Chat correctly
identified Thailand as the rumored location for the next set of the TV
show White Lotus,
but provided “somewhere in Asia” when the query was translated to
Spanish, says Solis, who runs a consultancy called Orainti that helps
websites increase visits from search engines.
Additionally, using GPT-4 in non-English languages can cost up to 15 times more
(see the charts below). Research has shown that speakers of certain
languages may be overcharged for language models while obtaining poorer
results, indicating that tokenization may play a role in both the cost
and effectiveness of language models. This study shows the difference in cost by language family, which can be significantly higher than for English.
Independent researchers point out how the token count for the same prompt varies across languages and that some languages consistently have a higher count. Languages such as Hindi and Bengali (together spoken by over 800 million people) resulted in a median token length of about 5 times
that of English. The ratio is 9 times that of English for Armenian and
over 10 times that of English for Burmese. In other words, to express the same prompt or sentiment, some languages require up to 10 times more tokens.
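The disparity is easy to verify with OpenAI's open-source tiktoken tokenizer, as in the sketch below; the sample sentences and the choice of the cl100k_base encoding are illustrative assumptions.

```python
# Illustrative token-count comparison across languages using OpenAI's tiktoken library.
# The sample sentences and the cl100k_base encoding are assumptions for illustration.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "How do I reset my password?",
    "Spanish": "¿Cómo restablezco mi contraseña?",
    "Hindi": "मैं अपना पासवर्ड कैसे रीसेट करूं?",
}

english_count = len(encoding.encode(samples["English"]))
for language, sentence in samples.items():
    count = len(encoding.encode(sentence))
    print(f"{language}: {count} tokens ({count / english_count:.1f}x English)")
```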
Implications of tokenization language disparity
Overall, requiring more tokens (to tokenize the same message in a different language) means:
Non-English users are limited in how much information they can put in the prompt (because the context window is fixed).
It is more costly as generally more tokens are needed for equivalent prompts.
It is slower to run and often results in more fabrication and other errors.
OpenAI’s
models are increasingly being used in countries where English is not
the dominant language. According to SimilarWeb.com, the United States accounted for only 10% of the traffic sent to ChatGPT in Jan-March 2023. India, Japan, Indonesia, and France each have user populations nearly as large as the US user base.
Translated's
T-LM service integrates the company’s award-winning adaptive machine
translation (ModernMT) with GPT-4 to bring advanced generative AI
capabilities to every business in the languages spoken by 95% of the world's population.
This approach also lowers the cost of using GPT-4 in languages other than English, since the pricing model is based on text segmentation (tokenization) that is optimized for English. Because all prompts submitted to GPT-4 are in English, billing is equivalent to the more favorable and generally lower-cost English tokenization: T-LM always uses the number of tokens in English for billing.
The Adaptive ModernMT technology, unlike most other MT technology available today, can learn and improve dynamically and continuously from ongoing corrective feedback.
Thus, users who work with T-LM can drive continuous improvements in
output produced from GPT-4 by providing corrective feedback on the
translations produced by T-LM. This is something that is not possible
with the most commonly used static MT systems where users would be
confined and limited to generic system performance.
T-LM addresses the performance disparity experienced by non-English users by translating the initial prompt from the source language into English and then translating the output back into the user's language, using a specialized model that has been optimized for the linguistic characteristics typically found in prompts.
T-LM
combines GPT-4 with ModernMT, an adaptive machine translation engine,
to offer GPT-4 near English-level performance in 200 languages.
T-LM
works by translating non-English prompts into English, executing them
using GPT-4, and translating the output back to the original language,
all using the ModernMT adaptive machine translation.
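A minimal sketch of this wrap-around pattern is shown below; modernmt_translate is a hypothetical stand-in for the ModernMT adaptive MT API, and the model name and prompt handling are assumptions rather than a description of the actual T-LM service.

```python
# Illustrative sketch of the translate -> GPT-4 -> translate-back pattern described above.
# modernmt_translate is a hypothetical placeholder for the adaptive MT service; the OpenAI
# model name is an assumption, not the actual T-LM configuration.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def modernmt_translate(text: str, source_lang: str, target_lang: str) -> str:
    raise NotImplementedError("call the adaptive MT service here")

def tlm_style_completion(prompt: str, user_lang: str) -> str:
    english_prompt = modernmt_translate(prompt, user_lang, "en")    # prompt -> English
    response = client.chat.completions.create(
        model="gpt-4",                                              # assumed model
        messages=[{"role": "user", "content": english_prompt}],
    )
    english_answer = response.choices[0].message.content
    return modernmt_translate(english_answer, "en", user_lang)      # answer -> user's language
```

Because the prompt sent to GPT-4 is always English, the billed token count is the English one, which is the source of the cost advantage described below.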
T-LM is available to enterprises via an API and to consumers through a ChatGPT plugin.
The
result is a more uniform language model performance capability across
many languages and enhanced GPT-4 performance in non-English languages.
Customers can optionally use their existing ModernMT keys to employ adaptive models within GPT-4.
An indirect benefit of T-LM is that its cost can be up to 15x lower than GPT-4's, thanks to a reduced number of tokens billed. GPT-4 counts significantly more tokens in non-English languages; T-LM, instead, always uses the English token count for billing.
Therefore,
Translated's integration with OpenAI enhances GPT-4's performance in
non-English languages by combining GPT-4 with the ModernMT adaptive
machine translation, resulting in a more uniform language model
capability across languages and lower costs.
Use cases for T-LM include assisting global content creation teams in a broad range of international commerce-related initiatives, allowing companies from Indonesia, Africa, and various parts of India to make their products visible to US and EU customers on online eCommerce platforms, providing better multilingual customer support, and making global user-generated content visible and understandable in the customer’s language.
T-LM can be used in many text analysis tasks needed in business settings, e.g., breaking down and explaining complicated topics,
outlining blog posts, sentiment analysis, personalized responses to
customers, summarization, creating email sales campaign material, or
suggesting answers to customer agents.
T-LM works together with GPT-4 to create a wide range of written content or to augment existing content by giving it a different tone, softening or professionalizing the language, improving content creation and transformation automation while providing a fast and engaging user experience. This is now possible in the 200 languages that ModernMT supports.
“There are many ways GPT-4 can produce ‘draft’ text that meets the length and style desired, which can then be reviewed by the user,” Gartner said in a report on how to use GPT-4. “Specific uses include drafts of marketing descriptions, letters of recommendation, essays, manuals or instructions, training guides, social media or news posts.”
T-LM will allow students around the world
to access knowledge content, and use GPT-4 as a research assistant and
access a much larger pool of information. In education, GPT-4 can be
used to create personalized learning experiences, as a tutor would. And,
in healthcare, chatbots and applications can provide simple language
descriptions of medical information and treatment recommendations.
T-LM will enhance the ability of large and SME businesses to engage in new international business by assisting with basic communication and understanding, and by providing more complete documentation for business proposals using the strengths of GPT-4 and T-LM working together.
T-LM is available now through API. More information on the service can be found at translatedlabs.com/gpt.
None of the main arguments have changed with the introduction of ChatGPT. All the structural problems identified with LLMs years ago are still present and have not been alleviated with the introduction of either ChatGPT or GPT-4.
Large language models (LLMs) are all the rage nowadays, and it is almost impossible to get away from the news frenzy around ChatGPT, BingGPT, and Bard. There is much talk about reaching artificial general intelligence (AGI), but should we be worried that the machine will shortly take over all kinds of knowledge work, including translation? Is language really something that machines can master?
Machine learning applications around natural language data have been in full swing for over five years now. In 2022, natural language processing (NLP) research announced breakthroughs in multiple areas, especially around improving neural machine translation (NMT) systems and natural language generation (NLG) systems like the Generative Pre-trained Transformer 3 (GPT-3) and ChatGPT, a chat-enabled variation of GPT-3 that can produce human-like, if inconsistent, digital text. It predicts the next word given a text history, and often the generated text is relevant and useful. This is because it has been trained on billions of sentences and can often glean the most relevant material related to the prompt from the data it has seen.
GPT-3 and other LLMs can generate algorithm-written text often nearly indistinguishable from human-written sentences, paragraphs, articles, short stories, and more. They can even generate software code that draws on troves of previously seen code examples. This suggests that these systems could be helpful in many text-heavy business applications and possibly enhance enterprise-to-customer interactions involving textual information in various forms.
The original hype and excitement around GPT-3 have triggered multiple similar initiatives across the world, and the 175 billion parameters used in building GPT-3 have already been overshadowed by several other models that are even larger: Gopher from DeepMind has been built with 280 billion parameters and claims better performance on most benchmark tests used to evaluate the capabilities of these models.
More recently, ChatGPT has taken the world by storm and many knowledge workers are fearful of displacement from all the hype, even though we see the same problems with all LLMs, a lack of common sense, the absence of understanding, and the constant danger of misinformation and hallucinations. Not to mention the complete disregard for data privacy and copyright in their creation.
GPT-4's parameter counts and training data details have been kept secret by OpenAI, which has now decided to cash in on the LLM gold rush, but many estimate the parameters to be in the trillion-plus range.
It is worth noting that the original OpenAI ethos has now faded, and one wonders if it was ever taken seriously. Their original mission statement was:
"Our goal is to advance digital intelligence in the
way that is most likely to benefit humanity as a whole, unconstrained by
a need to generate a financial return. Since our research is free from
financial obligations, we can better focus on a positive human impact."
In an age where we face bullshit everywhere we turn, why should this be an exception?
The hype around some of these “breakthrough” capabilities inevitably raises questions about the increasing role of language AI capabilities in a growing range of knowledge work. Are we likely to see an increased presence of machines in human language-related work? Is there a possibility that machines can replace humans in a growing range of language-related work?
A current trend in LLM development is to design ever-larger models in an attempt to reach new heights, but no company has rigorously analyzed which variables affect the power of these models. These models can often produce amazing output but also have a fairly high level of completely wrong factual “hallucinations” where the machine simply pulls together random text elements in so confident a manner as to often seem valid to an untrained eye. But many critics are saying that larger models are unlikely to solve the problems that have been identified — namely, the textual fabrication of false facts, the absence of comprehension, and common sense.
"ChatGPT “wrote” grammatically flawless but flaccid copy. It served up enough bogus search results to undermine my faith in those that seemed sound at first glance. It regurgitated bargain-bin speculations about the future of artificial intelligence. "
The initial euphoria is giving way to an awareness of the problems that are also inherent in LLMs and an understanding that adding more data and more computing power cannot and will not solve the toxicity and bias problems that have been uncovered. Critics are saying that scale does not seem to help much when it comes to “understanding,” and building GPT-4 with 100 trillion parameters, at a huge expense, may not help at all. The toxicity and bias that are inherent in these systems will not be easily overcome without strategies that involve more than simply adding more data and applying more computing cycles. However, what these strategies are, is not yet clear though many say this will require looking beyond machine learning.
GPT-3 and other LLMs can be fooled into creating incorrect, racist, sexist, and biased content devoid of common sense. The model’s output depends on its input: garbage in, garbage out.
"Just Calm Down About GPT-4 Already and stop confusing performance with competence. What the large language models are good at is saying what an answer should sound like, which is different from what an answer should be."
Techniques like reinforcement learning from human feedback (RLHF) can help to build guardrails against the most egregious errors, but they also reduce the scope of possible right answers. Many say this technique cannot solve all the problems that can emerge from algorithmically produced text, as there are too many unpredictable scenarios.
If you dig deeper, you discover that although its output is grammatical, and even impressively idiomatic, its comprehension of the world is often seriously off. You can never really trust what it says. Unreliable AI that is pervasive and ubiquitous is a potential creator of societal problems on a grand scale.
Despite the occasional or even frequent ability to produce human-like outputs, ML algorithms are at their core only complex mathematical functions that map observations to outcomes. They can forecast patterns that they have previously seen and explicitly learned from. Therefore, they’re only as good as the data they train on and start to break down as real-world data starts to deviate from examples seen during training.
In December 2021, an incident with Amazon Alexa exposed the problem that language AI products have. Alexa told a child to essentially electrocute herself (touch a live electrical plug with a penny) as part of a challenge game. This incident — and many others with LLMs — show that these algorithms lack comprehension and common sense, and can make nonsensical suggestions that could be dangerous or even life-threatening.
“No current AI is remotely close to understanding the everyday physical or psychological world, what we have now is an approximation to intelligence, not the real thing, and as such it will never really be trustworthy,” said Gary Marcus in response.
Large pre-trained statistical models can do almost anything, at least enough for a proof of concept, but there is little they can do reliably because they skirt the required foundations.
Thus we see an increasing acknowledgment from the AI community that language is indeed a hard problem — one that cannot necessarily be solved by using more data and algorithms alone, and other strategies will need to be employed. This does not mean that these systems cannot be useful. Indeed, we understand how they are useful but have to be used with care and human oversight, at least until machines have more robust comprehension and common sense.
We already see that machine translation (MT) today is ubiquitous, and by many estimates is responsible for 99.5% or more of all language translation done on the planet on any given day. But we also see that MT is used mostly to translate material that is voluminous, short-lived, and transitory, and that would never get translated if the machine were not available. Trillions of words are translated by MT daily, yet when it matters, there is always human oversight on translation tasks that may have a high impact, or when there is a greater potential risk or liability from mistranslation.
While machine learning use cases continue to expand dramatically, there is also an increasing awareness that a human-in-the-loop is often necessary since the machine lacks comprehension, cognition, and common sense.
As Rodney Brooks, the co-founder of iRobot, said in a post entitled An Inconvenient Truth About AI:
“Just about every successful deployment of AI has either one of two expedients: It has a person somewhere in the loop, or the cost of failure, should the system blunder, is very low.”
Fig. 1. Linguistic communication of thoughts by speaker and listener
What is it about human language that makes it a challenge for machine learning?
Members of the Singularity University community have summarized the problem quite neatly. They admit that “language is hard” when they explain why AI has not mastered translation yet. Machines perform best at solving problems that have binary outcomes.
Michael Housman, a faculty member of Singularity University, explained that the ideal scenario for machine learning and artificial intelligence is something with fixed rules and a clear-cut measure of success or failure. He named chess as an obvious example and noted machines were able to beat the best human Go player. This happened faster than anyone anticipated because of the game’s very clear rules and limited set of moves.
Machine learning works best when there is one or a defined and limited set of correct answers.
Housman elaborated, “Language is almost the opposite of that. There aren’t as clearly-cut and defined rules. The conversation can go in an infinite number of different directions. And then, of course, you need labeled data. You need to tell the machine to do it right or wrong.”
Housman noted that it’s inherently difficult to assign these informative labels. “Two translators won’t even agree on whether it was translated properly or not,” he said. “Language is kind of the wild west, in terms of data.”
Another issue is that language is surrounded by layers of situational and life context, intent, emotion, and feeling. The machine simply cannot extract all these elements from the words contained in a sentence or even by looking at hundreds of millions of sentences.
The same sequence of words could have multiple different semantic implications. What lies between the words is what provides the more complete semantic perspective, and this is learning that machines cannot extract from a sentence. The proper training data to solve language simply does not exist and will likely never exist even though current models seem to have largely solved the syntax problem with increasing scale.
Concerning GPT-3/4 and other LLMs: The trouble is that you have no way of knowing in advance which formulations will or won’t give you the right answer. GPT’s fundamental flaws remain. Its performance is unreliable, causal understanding is shaky, and incoherence is a constant companion.
Hopefully, we are now beginning to understand that adding more data does not solve the overall problem, even though it appears to have largely solved the syntax issue. More data makes for a better, more fluent approximation to language; it does not make for trustworthy intelligence.
The claim that these systems are early representations of machine sentience or AGI is particularly problematic, and some critics are quite vocal in their criticism of these overreaching pronouncements and forecasts.
Summers-Stay said this about GPT-3: “[It’s] odd, because it doesn’t ‘care’ about getting the right answer to a question you put to it. It’s more like an improv actor who is totally dedicated to their craft, never breaks character, and has never left home but only read about the world in books. Like such an actor, when he doesn’t know something, he will just fake it. You wouldn’t trust an improv actor playing a doctor to give you medical advice.”
Ian P. McCarthy said, “A liar is someone who is interested in the truth, knows it, and deliberately misrepresents it. In contrast, a bullshitter has no concern for the truth and does not know or care what is true or is not.” Gary Marcus and Ernest Davis characterize GPT-3 and new variants as “fluent spouters of bullshit” that even with all the data are not a reliable interpreter of the world.
For example, Alberto Romero says: “The truth is these systems aren’t masters of language. They’re nothing more than mindless ‘stochastic parrots.’ They don’t understand a thing about what they say, and that makes them dangerous. They tend to ‘amplify biases and other issues in the training data’ and regurgitate what they’ve read before, but that doesn’t stop people from ascribing intentionality to their outputs. GPT-3 should be recognized for what it is: a dumb — even if potent — language generator, and not as a machine so close to us in humanness as to call it ‘self-aware.’”
The most compelling explanation I have seen of why language is hard for machine learning is by Walid Saba, founder of Ontologik.Ai. Saba points out that Kenneth Church, a pioneer in the use of empirical methods in NLP, i.e., data-driven, corpus-based, statistical, and machine learning (ML) methods, was only interested in solving simple language tasks; the motivation was never to suggest that this technique could somehow unravel how language works. Rather, he meant, “It is better to do something simple than nothing at all.”
However, subsequent generations misunderstood this empirical, data-driven approach, which was originally intended only to find practical solutions to simple tasks, to be a paradigm that would scale into full natural language understanding (NLU).
This has led to widespread interest in the development of LLMs and what he calls “a futile attempt at trying to approximate the infinite object we call natural language by trying to memorize massive amounts of data.”
While he sees some value in data-driven ML approaches for some NLP tasks (summarization, topic extraction, search, clustering, NER) he sees this approach as irrelevant for natural language understanding (NLU) where understanding requires a much more specific and accurate understanding of the one and only one thought that a speaker is trying to convey. Machine learning works on the specified NLP tasks above because they are consistent with the probably approximately correct (PAC) paradigm that underlies all machine learning approaches, but he insists that this is not the right approach for “understanding” and NLU.
He explains that there are three reasons why NLU or "understanding" is so difficult for machine learning:
1. The missing text phenomenon (MTP) is believed to be at the heart of all challenges in NLU.
In human communication, an utterance by a speaker has to be decoded to get to the specific meaning intended, by the listener, for understanding to occur. There is often a reliance on common background knowledge so that communication utterances do not have to spell out all the context. That is, for effective communication, we do not say what we can assume we all know! This genius optimization process that humans have developed over 200,000 years of evolution works quite well, precisely because we all know what we all know.
But this is where the problem is in NLU: machines don’t know what we leave out because they don’t know what we all know. The net result? NLU is difficult because a software program cannot understand the thoughts behind our linguistic utterances if it cannot somehow “uncover” all that stuff that humans leave out and implicitly assume in their linguistic communication. What we say is a fraction of all that we might have thought of before we speak.
Fig. 2: Deep learning to create $30T in market cap value by 2037? (Source: ARK Invest).
2. ML approaches are not relevant to NLU: ML is compression, and language understanding requires uncompressing.
Our ordinary spoken language is highly (if not optimally) compressed. The challenge is in uncompressing (or uncovering) the missing text. Even in human communications, faulty uncompressing can lead to misunderstanding, and machines do not have the visual, spatial, physical, societal, cultural, and historical context, all of which remain in the common understanding but unstated zone to enable understanding. This is also true to a lesser extent for written communication.
What the above says is the following: machine learning is about discovering a generalization of lots of data into a single function. Natural language understanding, on the other hand, and due to MTP (missing text phenomena), requires intelligent “uncompressing” techniques that would uncover all the missing and implicitly assumed general knowledge text. Thus, he claims machine learning and language understanding are incompatible and contradictory. This is a problem that is not likely to be solved by 1000X more data and computing.
3. Statistical insignificance: ML is essentially a paradigm that is based on finding patterns (correlations) in the data.
Thus, the hope in that paradigm is that there are statistically significant differences to capture the various phenomena in natural language. Using larger data sets assumes that ML will capture all the variations. However, renowned cognitive scientist George Miller said: “To capture all syntactic and semantic variations that an NLU system would require, the number of features [data] a neural network might need is more than the number of atoms in the universe!” The moral here is this: Statistics cannot capture (nor even approximate) semantics even though increasing scale appears to have success with learning syntax. This fluency is what we see as "confidence" in the output.
Pragmatics studies how context contributes to meaning. Pragmatist George Herbert Mead argued that communication is more than the words we use: “It involves the all-important social signs people make when they communicate.” Now, how could an AI system access contextual information? It simply does not exist in the data they train on.
The key issue is that ML systems are fed words (the tip of the iceberg), and these words don’t contain the necessary pragmatic information of common knowledge. Humans can express more than words convey because we share a reality. But AI algorithms don’t. AI is faced with the impossible task of imagining the shape and contours of the whole iceberg given only a few 2D pictures of the tip of the iceberg.
Most of the data needed to achieve "understanding" is not available
Philosopher Hubert Dreyfus, a leading 20th-century critic, argued against current approaches to AI, saying that most of the human expertise comes in the form of tacit knowledge — experiential and intuitive knowledge that can’t be directly transmitted or codified and is thus inaccessible to machine learning. Language expertise is no different, and it’s precisely the pragmatic dimension often intertwined with tacit knowledge.
To summarize: we transmit highly compressed linguistic utterances that need a mind to interpret and “uncover” all the background information. This multi-modal, multi-contextual uncompression leads to “understanding.” Both communication and understanding require that humans fill in the unspoken and unwritten words needed to reach comprehension.
Languages are the external artifacts that we use to encode the infinite number of thoughts we might have.
In so many ways, then, in building ever-larger language models, machine learning and data-driven approaches are trying to chase infinity in a futile attempt to find something that is not even “there” in the data.
Another criticism focuses more on the “general intelligence” claims being made about AI by people like OpenAI. Each of our AI techniques manages to replicate some aspects of what we know about human intelligence. But putting it all together and filling the gaps remains a major challenge. In his book, data scientist Herbert Roitblat provides an in-depth review of different branches of AI and describes why each of them falls short of the dream of creating general intelligence.
The common shortcoming across all AI algorithms is the need for predefined representations, Roitblat asserts. Once we discover a problem and can represent it in a computable way, we can create AI algorithms that can solve it, often more efficiently than ourselves. It is, however, the undiscovered and unrepresentable problems that continue to elude us: the so-called edge cases. There are always problems outside the known set, and thus there are problems that models cannot solve.
“These language models are significant achievements, but they are not general intelligence,” Roitblat said. “Essentially, they model the sequence of words in a language. They are plagiarists with a layer of abstraction. Give it a prompt, and it will create a text that has the statistical properties of the pages it has read, but no relation to anything other than the language. It solves a specific problem, like all current artificial intelligence applications. It is just what it is advertised to be — a language model. That’s not nothing, but it is not general intelligence.”
“Intelligent people can recognize the existence of a problem, define its nature, and represent it,” Roitblat writes. “They can recognize where knowledge is lacking and work to obtain that knowledge. Although intelligent people benefit from structured instructions, they are also capable of seeking out their own sources of information.”
In a sense, humans are optimized to solve unseen and new problems by acquiring and building the knowledge base needed to address these new problems.
The Path Forward
Machine learning is being deployed across a wide range of industries, solving many narrowly focused problems and, when well implemented with relevant data, creating substantial economic value. This trend will likely only build momentum.
Some experts say that we are only at the beginning of a major value-creation cycle driven by machine learning that will have an impact as deep and as widespread as the development of the internet itself. The future giants of the world economy are likely to be companies that have leading-edge ML capabilities.
However, we also know that AI lacks a theory of mind, common sense and causal reasoning, extrapolation capabilities, and a body, so it is far from being “better than us” at almost anything slightly complex or general. These are challenges that are not easily solved by deep learning approaches. We need to think differently and move on from more data plus more computing approaches to solving all our AI-related problems.
“The great irony of common sense — and indeed AI itself — is that it is stuff that pretty much everybody knows, yet nobody seems to know what exactly it is or how to build machines that possess it,” said Gary Marcus, CEO and founder of Robust.AI. “Solving this problem is, we would argue, the single most important step towards taking AI to the next level. Common sense is a critical component to building AIs that can understand what they read; that can control robots that can operate usefully and safely in the human environment; that can interact with human users in reasonable ways. Common sense is not just the hardest problem for AI; in the long run, it’s also the most important problem.”
Common sense has been called the “dark matter of AI” — both essential and frustratingly elusive. That’s because common sense consists of implicit information: the broad (and broadly shared) set of unwritten assumptions and rules of thumb that humans automatically use to make sense of the world. Critics of over-exuberant AI claims frequently point out that two-year-old children have more common sense than existing deep-learning-based AI systems whose “understanding” is often quite brittle.
Common sense is easier to detect than to define, and its implicit nature is difficult to represent explicitly. Gary Marcus suggests combining traditional AI approaches with deep learning: “First, classical AI is a framework for building cognitive models of the world that you can then make inferences over. The second thing is, classical AI is perfectly comfortable with rules. It’s a strange sociology right now in deep learning where people want to avoid rules. They want to do everything with neural networks and do nothing with anything that looks like classical programming. But some problems are solved this way that nobody pays attention to, like making a Google Maps route.
We need both approaches. The machine-learning stuff is pretty good at learning from data, but it’s poor at representing the kind of abstraction that computer programs represent. Classical AI is pretty good at abstraction, but it all has to be hand-coded, and there is too much knowledge in the world to manually input everything. So it seems evident that what we want is some kind of synthesis that blends these approaches.”
This is also the view of Yann LeCun (Chief AI Scientist at Meta), who won a Turing Award and whose company has also released an open-source LLM. He does not believe that the current fine-tuning RLHF approaches can solve the quality problems we see today. The autoregressive models of today generate text by predicting the probability distribution of the next word in a sequence given the previous words in the sequence.
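In standard notation, this autoregressive factorization can be written as follows (a textbook formulation, not LeCun's own):

```latex
% Autoregressive language modeling: each token is predicted from the tokens that precede it.
P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1})
```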
Autoregressive models are “reactive” and do not plan or reason, according to LeCun. They make stuff up or retrieve stuff approximately, and this can be mitigated, but not fixed by human feedback. He sees LLMs as an “off-ramp” and not the destination of AI. LeCun has also said: “A system trained on language alone will never approximate human intelligence, even if trained from now until the heat death of the universe.”
LeCun has also proposed that one of the most important challenges in AI today is devising learning paradigms and architectures allowing machines to supervise their own world-model learning and then use them to predict, reason, and plan.
Thus, when we consider the overarching human goal of understanding and wanting to be understood, we must admit that this is very likely always going to require a human in the loop, even when we get to building deep learning models with trillions of words. The most meaningful progress will be related to the value and extent of the assistive role that language AI will play in enhancing our ability to communicate, share, produce, and digest knowledge.
Human-in-the-loop (HITL) is the process of leveraging machine power and enabling high-value human intelligence interactions to create continuously improving learning-based AI models. Active learning refers to humans handling low-confidence units and feeding improvements back into the model. Human-in-the-loop is broader, encompassing active learning approaches and data set creation through human labeling.
HITL describes the process when the machine is unable to solve a problem based on initial training data alone and needs human intervention to improve both the training and testing stages of building an algorithm. Properly done, this creates an active feedback loop allowing the algorithm to give continuously better results with ongoing use and feedback. With language translation, the critical training data is translation memory.
However, the truth is that there is no existing training data set (TM) so perfect, complete, and comprehensive as to produce an algorithm that consistently produces perfect translations.
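As a rough sketch of this feedback loop applied to translation (the helper functions and the confidence threshold are hypothetical, chosen purely for illustration):

```python
# Illustrative human-in-the-loop / active-learning sketch for translation.
# translate_with_confidence, request_human_edit, and update_model are hypothetical stubs.
from typing import List, Tuple

CONFIDENCE_THRESHOLD = 0.9  # assumed cutoff for routing segments to a human reviewer

def translate_with_confidence(segment: str) -> Tuple[str, float]:
    raise NotImplementedError("return (translation, confidence) from the MT model")

def request_human_edit(segment: str, draft: str) -> str:
    raise NotImplementedError("collect a human post-edit here")

def update_model(corrections: List[Tuple[str, str]]) -> None:
    raise NotImplementedError("feed corrections back into the adaptive model and the TM")

def hitl_translate(segments: List[str]) -> List[str]:
    output, corrections = [], []
    for segment in segments:
        draft, confidence = translate_with_confidence(segment)
        if confidence < CONFIDENCE_THRESHOLD:
            edited = request_human_edit(segment, draft)  # humans handle low-confidence units
            corrections.append((segment, edited))
            draft = edited
        output.append(draft)
    if corrections:
        update_model(corrections)  # the active-learning feedback loop
    return output
```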
Again to quote Roitblat: “Like much of machine intelligence, the real genius [of deep learning] comes from how the system is designed, not from any autonomous intelligence of its own. Clever representations, including clever architecture, make clever machine intelligence.”
This suggests that humans will remain at the center of complex, knowledge-based AI applications involving language even though the way humans work will continue to change.
As the use of machine learning proliferates, there is an increasing awareness that humans working together with machines in an active-learning contribution mode can often outperform the possibilities of machines or humans alone. The future is more likely to be about how to make AI a useful assistant than it is about replacing humans.
The contents of this blog are discussed in a live online interview format (with access to the recording) that Nimdzi shares on LinkedIn.
The actual interview starts 13 minutes after the initial ads.