Anon: LLMs are not ungrounded. They are grounded indirectly through the experiences of other people when they speak, the way a blind person’s knowledge of the visual world is mediated by what they’re told by (sighted) people. Blind people know a great deal about the visual world, even about color, which can only be directly experienced through vision.
SH: You’re perfectly right that the meanings of words can be grounded indirectly through language (i.e., through more words, whether from dictionaries, encyclopedias, textbooks, articles, lectures, chatting or texting – including texting with ChatGPT, the (sightless) statistical parrot with the immense bellyful of other people’s words, along with the computational means to crunch and integrate those words, partly by a kind of formal verbal figure-completion). Indirect grounding is what gives language (which, by the way, also includes symbolic logic, mathematics and computation as a purely syntactic subset) its immense (possibly omnipotent) communicative power.
But language cannot give words their direct grounding. Grounding, like dictionary look-up, cannot be indirect all the way down. Otherwise it is not bottom-up grounding at all, just circling endlessly from meaningless symbol to meaningless symbol.
Let’s recall what “grounding” is: It’s a connection between words and their referents. Between “apples” and apples (in the world). “Apples” is directly grounded (for me) if I can recognize and manipulate apples in the world. But not every word has to be directly grounded. Most aren’t, and needn’t be. Only enough words need to be grounded directly. The rest can be grounded indirectly, with language. That’s what we showed in the paper on the latent structure of dictionaries in the special issue of TICS edited by Gary Lupyan in 2016. We showed that with a “minimal grounding set” of around 1000 grounded words you could go on to ground all the rest of the words in the dictionary through definitions alone. But those 1000 grounding words have to have been directly grounded in some other way, not just indirectly, in terms of other words and their verbal definitions. That would have been circular.
All dictionaries are circular; indeed all of language is. All the words in a dictionary are parasitic on other words in the dictionary. Direct grounding is “parasitic” too, but not on words. It is parasitic on the sensorimotor capacity to recognize and manipulate their referents in the world. Not every word. But enough of them to ground all the rest indirectly.
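The dictionary point can be made concrete with a toy sketch in Python. Everything in it is an assumption made for illustration: the twelve-word dictionary, the function name indirectly_groundable, and the choice of seed words are hypothetical, and the seed set is not claimed to be minimal, nor is this the method used in the dictionary paper. It only shows the logic: a word becomes indirectly grounded once every word in its definition is already grounded, so with nothing grounded directly the look-up just circles, whereas with enough directly grounded words the rest follow from definitions alone.

```python
# Toy dictionary (hypothetical): each word maps to the set of words
# used in its definition. Every defining word is itself an entry,
# so the dictionary is closed and therefore circular.
TOY_DICTIONARY = {
    "apple": {"fruit", "red", "round"},
    "fruit": {"food", "plant"},
    "red": {"color"},
    "round": {"shape"},
    "color": {"thing", "red"},
    "shape": {"thing", "round"},
    "food": {"thing", "eat"},
    "eat": {"food"},
    "plant": {"thing", "grow"},
    "grow": {"big", "thing"},
    "big": {"thing", "grow"},
    "thing": {"shape", "color"},
}


def indirectly_groundable(dictionary, directly_grounded):
    """Return every word that ends up grounded, directly or indirectly.

    A word is grounded indirectly as soon as all the words in its
    definition are already grounded. Repeat until nothing changes.
    """
    grounded = set(directly_grounded)
    changed = True
    while changed:
        changed = False
        for word, definition in dictionary.items():
            if word not in grounded and definition <= grounded:
                grounded.add(word)
                changed = True
    return grounded


# With no directly grounded words, definitions alone ground nothing.
print(sorted(indirectly_groundable(TOY_DICTIONARY, set())))

# Assumed seed: words taken to be grounded directly (by sensorimotor
# means that lie outside the dictionary and outside this sketch).
seed = {"color", "shape", "eat", "big"}
print(sorted(indirectly_groundable(TOY_DICTIONARY, seed)))
```

Note that the sketch simply stipulates which words are directly grounded. How they get grounded is exactly what the code cannot show, because that part is not symbol manipulation.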
You spoke about grounding indirectly “in the experiences of others.” Well, of course. That’s language again. But what is “experience”? It’s not just know-how. I can describe in words what an apple looks like, what to do with it, and how. But I can’t tell that to you (and you can’t understand it) unless enough of my words and yours are already grounded (directly or indirectly), for both you and me, in what you and I can each perceive and do, directly, not just verbally, in the world. We don’t have to have exactly the same minimal grounding set. And we probably don’t just ground the minimal number directly. But what is grounded directly has to be grounded directly, not indirectly, through words.
The reason that blind people (even congenitally blind people, or almost congenitally blind and deaf people like Helen Keller) can learn from what seeing-people tell them is not that they are grounding what they learn in the “experience” of the seeing-person. They ground it in their own direct experience, or at least the subset of it that was enough to ground their own understanding of words. That was what I was trying to explain with Monochrome Mary, GPT, and Me. Indirect grounding can be done vicariously through the words that describe the experience of others. But direct grounding cannot be done that way too; otherwise we are back in the ungrounded dictionary-go-round again.
About Mollo & Milliere’s “Vector Grounding Problem”: I’m afraid M&M miss the point too, about the difference between direct grounding and indirect (verbal or symbolic) grounding. Here are some comments on M&M’s abstract. (I skimmed the paper too, but it became evident that they were talking about something other than what I had meant by symbol grounding.)
M&M: The remarkable performance of Large Language Models (LLMs) on complex linguistic tasks has sparked a lively debate on the nature of their capabilities. Unlike humans, these models learn language exclusively from textual data, without direct interaction with the real world.
SH: “Learn language” is equivocal. LLMs learn to do what they can do. They can produce words (which they do not understand and which mean nothing to them, but those words mean something to us, because they are grounded for each of us, whether directly or indirectly). LLMs have far more capacities than Siri, but in this respect they are the same as Siri: their words are not grounded for them, just for us.
M&M: Nevertheless, [LLMs] can generate seemingly meaningful text about a wide range of topics. This impressive accomplishment has rekindled interest in the classical ‘Symbol Grounding Problem,’ which questioned whether the internal representations and outputs of classical symbolic AI systems could possess intrinsic meaning.
SH: I don’t really know what “intrinsic meaning” means. But for an LLM’s own words (or mine, or the LLM’s enormous stash of text) to mean something “to” an LLM (rather than just to the LLM’s interlocutors, or to the authors of its text stash), the LLM would have to be able to do what no pure wordbot can do, which is to ground at least a minimal grounding set of words, by being able to recognize and manipulate their referents in the world, directly.
An LLM that was also an autonomous sensorimotor robot — able to learn to recognize and manipulate at least the referents of its minimal grounding set in the world — would have a shot at it (provided it could scale up to, or near, robotic Turing Test scale); but ChatGPT (whether 4, 5 or N) certainly would not, as long as it was just a wordbot, trapped in the symbolic circle of the dictionary-go-round. (N.B., the problem is not that dictionary definitions can never be exhaustive, just approximate; it is that they are circular, which means ungrounded.)
M&M: Unlike these systems, modern LLMs are artificial neural networks that compute over vectors rather than symbols.
SH: The symbols of mathematics, including vector algebra, are symbols, whose shape is arbitrary. Maths and computation are purely syntactic subsets of language. Computation is the manipulation of those symbols. Understanding what (if anything) the symbols mean is not needed to execute the recipe (algorithm) for manipulating them, based on the symbols’ arbitrary shapes (which might as well have been 0’s and 1’s), not their meanings.
M&M: However, an analogous problem arises for such systems, which we dub the Vector Grounding Problem. This paper has two primary objectives. First, we differentiate various ways in which internal representations can be grounded in biological or artificial systems…
SH: The notion of “internal representations” is equivocal, and usually refers to symbolic representations, which inherit the symbol grounding problem. Breaking out of the ungrounded symbol/symbol circle requires more than an enormous corpus of words (meaningless symbols), plus computations on them (which are just syntactic manipulations of symbols based on their shape, not their meaning). Breaking out of this circle of symbols requires a direct analog connection between the words in the speaker’s head and the things in the world that the symbols refer to.
M&M: …identifying five distinct notions discussed in the literature: referential, sensorimotor, relational, communicative, and epistemic grounding. Unfortunately, these notions of grounding are often conflated. We clarify the differences between them, and argue that referential grounding is the one that lies at the heart of the Vector Grounding Problem.
SH: Yes, the symbol grounding problem is all about grounding symbols in the capacity to recognize and manipulate their referents in the real (analog, dynamic) world.
M&M: Second, drawing on theories of representational content in philosophy and cognitive science, we propose that certain LLMs, particularly those fine-tuned with Reinforcement Learning from Human Feedback (RLHF), possess the necessary features to overcome the Vector Grounding Problem, as they stand in the requisite causal-historical relations to the world that underpin intrinsic meaning.
SH: The requisite “causal-historical” relation between words and their referents in direct sensorimotor grounding is the capacity to recognize and manipulate the referents of the words. A TT-scale robot could do that, directly, but no LLM can. It lacks the requisite (analog) wherewithal.
M&M: We also argue that, perhaps unexpectedly, multimodality and embodiment are neither necessary nor sufficient conditions for referential grounding in artificial systems.
SH: It’s unclear how many sensory modalities and what kind of body are needed for direct grounding of the referents of words (TT-scale), but Darwinian evolution had a long time to figure that out before language itself evolved.
I’d be ready to believe that a radically different synthetic robot understands and means what it says (as long as it is autonomous and at life-long Turing-indistinguishable scale), but not if it’s just a symbol-cruncher plus a complicated verbal “interpretation,” supplied by me.