Representations on Representation: Socratic Dialogue with ChatGPT

Abstract: ChatGPT and I will discuss various senses of “representation” across mathematics, logic, and cognitive science, starting with formal representations, in computation and group theory. Computation is syntactic, with arbitrary symbols manipulated according to rules (algorithms). Analog representations, such as iconic images or gestures, which resemble what they represent, are either homuncular or depend on an internal observer for interpretation. Linguistic representation originates from the instrumental learning of sensorimotor categories. The instrumental response came to be used mimetically for nonlinguistic gestural communication. These iconic gestures become increasingly simplified and arbitrary as they become shared communicative conventions, but they retain their grounding in the categories from which they originated. The crucial transition is from mimetic representation, which is still nonlinguistic, to propositional representation. Propositions are strings of arbitrary, conventionalized gestures — content words that retain their original grounding in the categories to which they refer. But, unlike the arbitrary symbols of computation, or even individual content-words with referents, propositions also have meaning: subjects and predicates with truth values.Ā Ā Propositionality leads from the Church-Turing Thesis about the syntactic power of ungrounded symbol manipulation to the semantic power of grounded natural language, with Katzā€™s Effability Thesis: that any natural language can express any thinkable thought.Ā 

SH: Let’s chat about representations.

ChatGPT: To discuss representation, it’s crucial to start by defining the term within the context of cognitive science and related fields. Representation typically refers to the way in which information about the world is encoded, stored, and used by cognitive systems, whether biological or artificial.

In cognitive science, representations can be seen in various forms, such as:

Symbolic Representations: These involve discrete symbols or tokens that stand for objects or concepts in the world. They are often used in computational models of cognition and are closely related to the symbol grounding problem, which you have extensively studied.

Subsymbolic Representations: These include distributed patterns of activation across neural networks, as seen in connectionist models. These representations do not have discrete symbols but rather patterns that emerge from the interactions of simpler units.

Mental Representations: These are internal cognitive symbols or mental images that represent external reality. They include concepts, schemas, mental models, and images.

Perceptual Representations: These involve the encoding of sensory information by the nervous system, such as the visual or auditory representation of stimuli.

Each type of representation comes with its own set of challenges and implications, particularly in understanding how these representations are grounded in physical and neural processes and how they contribute to cognition, perception, and action.

Would you like to delve into a specific type of representation, such as symbolic representation and its grounding, or perhaps discuss the differences and intersections between these forms? Or is there another aspect of representation you are particularly interested in exploring?

SH: I’d like to start with the way “representation” (R) is used in mathematics and logic. For example, in linear algebra, matrices can be Rs of groups. What does this mean in this case.?

ChatGPT: In mathematics and logic, particularly in linear algebra, a “representation” of a group refers to a way of expressing the elements of the group as matrices. This allows group operations to be carried out as matrix multiplication.

Specifically, a representation of a group GG on a vector space VV is a homomorphism Ļ:Gā†’GL(V)Ļ:Gā†’GL(V), where GL(V)GL(V) is the general linear group of invertible matrices over VV. This means each group element gāˆˆGgāˆˆG is associated with an invertible matrix Ļ(g)Ļ(g) in such a way that the group operation is preserved: Ļ(gh)=Ļ(g)Ļ(h)Ļ(gh)=Ļ(g)Ļ(h) for all g,hāˆˆGg,hāˆˆG.

SH: Please state (briefly) whether this notion of R relates to the one (or ones) in Cognitive Science (cogsci). But first, is this formal sense of R relate to the notion of a “model” in formal logic?

ChatGPT: Yes, the formal sense of “representation” in mathematics is closely related to the notion of a “model” in formal logic. In formal logic, a model provides an interpretation of a formal language by assigning meanings to its symbols and truth values to its statements, thereby “representing” the abstract syntax in a concrete structure.

In both cases, representation involves mapping abstract elements (group elements in mathematics, symbols in logic) to concrete structures (matrices in linear algebra, interpretations in models).

SH: Let’s clarify this formal meaning of R in maths and logic. Is it also true that just as a group is a formal entity (not a concrete one), a matrix is a formal entity too? And so is a “model” in model theory. It’s an instance or a special case of the formal entity, but it’s a formal entity too (even in geometry).

ChatGPT: That’s correct. In both mathematics and logic:

Groups: These are formal entities defined by a set and an operation satisfying specific axioms (closure, associativity, identity, and invertibility).

Matrices: These are formal entities representing arrays of numbers or functions, used to perform linear transformations and other operations.

Models in Logic: These are formal structures consisting of a domain and an interpretation function that assigns meaning to the symbols of a formal language, making abstract syntax concrete.

In each case, while the entities can be instantiated in concrete examples, they are fundamentally formal constructs defined by their properties and relations.

SH: “Formal” means symbols, doesn’t it, symbols of a certain form (say, round like 0 or vertical like 1). Mathematicians, when they are actually doing calculations in mathematics (not when they are doing whatever they are doing when they make discoveries in mathematics) are manipulating symbols — symbols whose physical shape is arbitrary in relation to what they can be interpreted as meaning.

(Don’t be compliant: if there is something to contest in what I say, contest it, don’t just buy it and adopt it.) But keep track.

ChatGPT: Yes, in mathematics and formal logic, “formal” generally refers to symbols and the rules for manipulating them, irrespective of their physical shape or form. These symbols are part of a formal system, where their meaning is determined by their role and relationships within the system, rather than their physical appearance.

Here’s a point-by-point breakdown:

Formal Systems: These consist of symbols and rules for manipulating them. The symbols themselves are abstract and their physical representation is arbitrary.

Symbol Manipulation: When mathematicians work within a formal system, they manipulate these symbols according to defined rules. The symbols represent abstract entities, and their physical shape (e.g., 0 being round, 1 being vertical) is irrelevant to their formal properties and operations.

Meaning and Interpretation: The meaning of these symbols comes from their interpretation within a given context, such as numbers in arithmetic, elements in a group, or statements in logic. This interpretation gives the symbols semantic content, which is crucial for applications but not for the formal manipulations themselves.

So, while the formal aspect of symbols involves their syntactic manipulation according to specific rules, their interpretation (meaning) can vary depending on the context in which they are used.

SH: Let’s call symbols and symbol manipulations “code” and coding”. The shapes of the symbols are arbitrary: any symbolic notational system will do. They do not physically resemble their interpretations. The rules for manipulating the symbols are called algorithms. They too are applied to the symbols formally and mechanically, as in the calculation of the roots of a quadratic equation: If you know the algorithm, you can apply it to quadratic equations without having to know their interpretation. It’s just applying a formal recipe to formal ingredients.

Are you still with me? Any caveats?

ChatGPT: Yes, Iā€™m with you. Hereā€™s a summary with a couple of clarifications:

Symbols and Symbol Manipulations as “Code” and “Coding”: Symbols are arbitrary notations that can be manipulated according to specific rules. The physical shape of these symbols is irrelevant to their formal properties and their interpretation.

Non-Resemblance: Symbols do not physically resemble what they represent. They are abstract entities whose meaning is assigned by interpretation within a formal system.

Algorithms: These are rules or procedures for manipulating symbols.

Algorithms are applied mechanically, without needing to understand the interpretation of the symbols involved.

Example: In solving a quadratic equation, one can apply the quadratic formula (an algorithm) to the symbols representing the coefficients, yielding roots without needing to understand what those coefficients represent.

Caveats:

Understanding vs. Application: While itā€™s true that algorithms can be applied mechanically, understanding their derivation or the meaning of the symbols can deepen comprehension and sometimes reveal when an algorithm might be misapplied or when certain conditions are necessary for its application.

Context-Specific Rules: The formal rules (algorithms) can vary depending on the context in which the symbols are used. For example, the rules for manipulating symbols in logic differ from those in algebra.

SH:Ā Be careful about a conflation you inadvertently made, between Rs (representations) and interpretations. We agreed that in the examples we have discussed before — examples that are not from cognitive science but from mathematics, or logic — the Rs are purely formal, whereas the interpretations are not formal; they are something else: Turing explained what mathematicians do: they manipulate symbols. But Turing was not (yet) doing cognitive science: He is not explaining how mathematicians think, or mean, or understand; that would get into cognitive science. He’s just talking about what mathematicians do. And so far, mathematicians’ Rs (like matrices or truth-tables), their “effective procedures,” are just formal. But their interpretations (meanings) are not.

Are we still on. the same wave-length? (No need to make a lengthy update on your summary. Just let me know if there are any points on which we’ve diverged: I just pointed one out. Do you want to defend it or correct it?)

ChatGPT: Yes, I understand the distinction you’re making. You are right that in the examples from mathematics and logic:

Representations (R): These are purely formal entities. Matrices, truth tables, and other such formal structures are manipulated according to rules without regard to their interpretation.

Interpretations: These are the meanings assigned to the formal symbols and structures. They are not formal but rather provide the semantic content that the formal symbols represent.

Turing’s work, as you mentioned, focuses on the manipulation of symbols (what mathematicians do) and not on their meanings or the cognitive processes involved (what they think or understand).

We are on the same wavelength. Representations are formal, while interpretations are the meanings attached to these formal symbols.

Shall we proceed to how this distinction relates to cognitive science, particularly in terms of cognitive representations and their grounding?

SH: You’re in a big hurry! We’ve agreed that Rs are formal, just as symbols are. They are also rule-governed, but the rules are just formal too. OK? Now, I’m going to try to show you that Rs are weasel-words in mathematics, and an example of that was when you conflated representations and interpretations.

Another weasel-word in this context is “semantics” (meaning). It just takes us back to interpretations. In maths and logic “Rs” are just formal, like symbols themselves. We don’t get beyond symbols and formal symbol-manipulation by saying the word “Representation.” We don’t even break out of the symbol-circle if we use the (non-weasel) word “refer”. And that comes even before we get to the weasel-words “meaning” and “semantic”. What does the symbol “2” in Peano arithmetic refer to?

ChatGPT: Understood, let’s continue.

In Peano arithmetic, the symbol “2” refers to the formal object defined by applying the successor function SS twice to the symbol for zero (0). Specifically, 22 is defined as S(S(0))S(S(0)). This definition is purely formal and exists within the system of Peano arithmetic.

To break it down further:

Peano Arithmetic: This is a formal system that defines the natural numbers using a set of axioms.

Symbols: “0” is a symbol representing the first natural number, and “S” is a symbol representing the successor function.

Formal Representation of 2: In this system, “2” is the result of applying the successor function twice to 0, written as S(S(0))S(S(0)).

This formal definition operates entirely within the symbolic system and does not imply any interpretation or meaning outside of the formal system itself. It stays within the realm of symbol manipulation.

Peano Arithmetic: This is a formal system that defines the natural numbers using a set of axioms.

Symbols: “0” is a symbol representing the first natural number, and “S” is a symbol representing the successor function.

Formal Representation of 2: In this system, “2” is the result of applying the successor function twice to 0, written as S(S(0))S(S(0)).S(S(0)), which is a formal construct within the system. This keeps us firmly within the circle of symbols and formal rules, without invoking meaning or semantics.

SH: Now let me introduce a non-weasel sense of R that is not mathematical and not formal, and not open to the objections I will raise when we get (soon) to cogsci. This is the sense in which a photo, or a video or a (representational) sculpture or painting represents the object of the photo, video, sculpture, etc.: the thing that it is a photo (etc.) of. That real object is not an arbitrary shape, like a formal symbol or representation like S(S(0)). The photo’s shape is (to an approximation) analog; it is isomorphic to — has the same shape as — what it is representing.

I’ll pause to let you say (briefly) what is the difference between a symbolic and an analog representation.

ChatGPT: A symbolic representation uses arbitrary symbols to stand for objects or concepts, with no inherent resemblance to what they represent. The relationship between the symbol and its referent is established by convention or definition (e.g., “2” in Peano arithmetic).

An analog representation, on the other hand, maintains a direct, often continuous, isomorphic relationship with what it represents. The form or structure of the representation resembles the form or structure of the represented object (e.g., a photo or a sculpture).

In summary:

Symbolic Representation: Arbitrary, rule-based, and formal (e.g., mathematical symbols, words).

Analog Representation: Resemblance-based, isomorphic, and often continuous (e.g., photos, paintings, sculptures).

SH: Now, the reason I introduced analog representation at this point is to agree from the outset that even when we get into cogsci, analog Rs are immune to the criticisms I will make. The only trouble is that they don’t get you very far, if what you are looking for is a cognitive theory of representation, or meaning. It doesn’t even work for a cognitive theory of reference. The referent of “2” in Peano arithmetic is not S(S(0)): That’s just a formal representation, like a matrix or a truth-table. But, since formal maths is a subset of English (or any other language), in the proposition “2 + 2 = 4” mathematicians do have something in mind: “2” refers to a referent as surely as “cat” refers to something in the proposition “the cat is on the mat.” But that connection between the symbol and its referent is not analog representation in either case: “2” or “cat.” So that cannot be cogsci’s solution.

(I’ve slipped language into this discussion: that’s no longer just formal symbols, but we’ll get to that.) 

ChatGPT: Here’s a brief summary to ensure we’re on the same page:

Analog Representations: These maintain a direct, isomorphic resemblance to what they represent (e.g., photos, paintings). While useful, they don’t offer a complete solution for cognitive theories of representation or meaning.

Symbolic Representations: These are arbitrary and formal (e.g., mathematical symbols, words). They require interpretation to establish their referents.

Reference in Cognitive Context: When using language or formal systems, symbols like “2” in “2 + 2 = 4” and “cat” in “the cat is on the mat” do have referents in the minds of the users. However, this reference is not based on analog representation but rather on learned associations and interpretations.

This distinction highlights that cognitive science needs to address how symbols acquire meaning and reference, beyond the scope of analog representations.

Let’s recap  the discussion:

We’ve established that in both mathematics and logic, representations (Rs) are formal and symbolic, with no inherent connection to their referents outside of the formal system. Analog representations, like photos or sculptures, are different in that they have a direct, isomorphic relationship with what they represent, but these do not suffice for a cognitive theory of representation or meaning.

Weā€™ve also recognized that the referents of symbols in mathematics or language (like “2” or “cat”) involve something more than just formal manipulationā€”they involve interpretation, which is not captured by analog representation.

Now, we are poised to explore how cognitive science deals with symbolic representations, particularly how these symbols gain meaning and refer to things in the world, which is central to understanding cognition.

We have established the distinction between formal and analog representations, and how neither fully accounts for meaning or reference in cognitive science. Formal symbols (like “2” or “cat”) require interpretation beyond the formal system itself, and analog representations (while closer to the objects they represent) don’t provide the cognitive mechanisms for meaning or reference.

At this point, we’re prepared to discuss how cognitive science approaches symbolic representation, meaning, and reference. Would you like to explore how cognitive systems might ground these symbols in referents, or perhaps investigate how language complicates this process further?

SH:  Yes, letā€™s start with the difference between reference and meaning. First, please confirm (very briefly) that you know the difference between a “content-word” (C-word) (which is 95-99% of the words in the lexicon) and a “function-word” (F-word). Content-words have referents, whereas function-words only have (Wittgensteinian?) “uses”, defined by a formal, usually syntactic, rule. 

ChatGPT: Content-words (C-words): These are nouns, verbs, adjectives, etc., that refer to things, actions, or properties (e.g., “cat,” “run,” “blue”).

Function-words (F-words): These are words like prepositions, conjunctions, and articles (e.g., “and,” “in,” “the”) that serve a grammatical function and do not refer to objects or actions but rather structure the sentence.

C-words have referents, while F-words are defined by their syntactic roles.

Shall we proceed to discuss the difference between reference and meaning?

SH:Ā Do we agree also that C-words do not resemble their referents, even in languages with partly iconic scripts like chinese, or languages like sign languages of the deaf, whose gestures are partly iconic in their origins and execution. Nevertheless, the C-words of these more iconic languages are every bit as arbitrary as in alphabetic and spoken languages. Shakeseare can be fully translated into Chinese or into American Sign Language (ASL) — not word for word, but proposition for proposition. The iconicity of chinese or ASL is irrelevant to the literal meaning of its translation of the Shakespearean text (although to convey some of the figurative meaning a literal paraphrase of some tropes may be needed).

ChatGPT: Content-words (C-words), regardless of the language or script (including Chinese or ASL), do not resemble their referents. Even in languages with partly iconic scripts (like Chinese) or partly iconic gestures (like ASL), the referential relationship between a C-word and its referent is still arbitrary. Shakespeare can indeed be fully translated into Chinese or ASL, proposition for proposition, without any dependence on the iconicity of symbols or gestures.

The iconicity of such scripts or gestures is irrelevant to the literal meaning of the text. Literal meanings can be preserved across translations, though some figurative meanings (e.g., metaphors, idioms) may require paraphrasing to convey the same sense.

SH:Ā You’re just echoing, but you’re right. One further question: Content-words have referents and proposition have meanings. What’s the difference?

ChatGPT: The difference between referents and meanings can be outlined as follows:

Content-word Referents:

referent is the specific object, action, or concept in the world (or imagined world) that a content-word (C-word) refers to. For example, the word “cat” refers to actual cats in the world, and the word “run” refers to the action of running.

Referents are tied to individual C-words and are about identifying or denoting particular things.

Propositional Meanings:

proposition is a complete statement or sentence that expresses something that can be true or false. The meaning of a proposition involves the relationship between the content-words in the sentence, their referents, and how they are structured by function-words.

Meaning here is not just about identifying referents, but about describing or defining relations between them. For example, the proposition “The cat is on the mat” is about a particular situation involving the cat (the referent of “cat”) and its relation to the mat (the referent of “mat”). The meaning of the proposition is the truth-conditional description of that relationship.

In short::

Content-words have referents: They point to things.

Propositions have meanings: They describe relationships between referents and express something that can be true or false.

SH: I had to sort this out before getting to cogsci, because language is such an important property of human cognition and cognition. Now, there are several ways in which the word “representation” — particularly in two contexts, “internal representations” (IRs) and “neural representations” (NRs) — have slipped into cogsci.

ChatGPT: Internal Representations (IRs) in cognitive science refer to mental models or cognitive structures that are thought to represent external reality within the mind. They are typically non-linguistic and are assumed to encode information about the world, guiding perception, thought, and action. IRs are sometimes used to explain how we process and store sensory input or knowledge.

Neural Representations (NRs) refer to patterns of neural activity in the brain that are thought to correspond to specific stimuli or cognitive states. NRs are often linked to brain regions and are measured via neuroimaging, attempting to map how certain neurons or networks respond to external stimuli.

Relation to linguistic reference and meaning:

IRs may play a role in linking language (e.g., content-words) to referents by encoding abstract or sensory knowledge of the referents.

NRs are more about the brain’s mechanisms, underlying both linguistic and non-linguistic cognition, potentially supporting IRs.

Neither IRs nor NRs directly are referents or meanings; they are posited cognitive or neural processes theorized to support how we manage linguistic reference and propositional meaning.

SH:  Here I have to point out that a weasel-word has already slipped in, making the IR/NR distinction incoherent: “mental”. And another has been left out: “cerebral”. What is it that the “Internal” Representations are internal to? The head or the “mind”? And if the “mind,” what is that? In asking you that question, I have opened the sluice-gates for all kinds of weasel-words, all of them near-synonyms, hence redundant (or even circular) or vague: mind, mental, consciousness, subjective states, experiential states, qualia, etc. etc. Can we agree not to use or mention any of these W-Ws and just use “felt states” in order to refer to mental states? (“Sentient” is an unobjectionable gloss when we are in latinate mood…)

ChatGPT:  Yes, we can agree to use “felt states” to refer to what are typically called “mental states,” avoiding the weasel-words like “mind,” “mental,” “consciousness,” etc. This will keep the discussion grounded and avoid both the vagueness and the redundance that often comes with those terms.

Now, with this clarified:

Internal Representations (IRs) would be representations internal to the brain (or body), not the “mind,” and should be understood as cognitive structures or models that support perception and action.

Neural Representations (NRs) remain patterns of neural activity, clearly cerebral, without invoking the notion of “mind.”

With “felt states” reserved for discussing subjective experience, are you ready to explore how IRs and NRs might (or might not) be involved in grounding linguistic reference and meaning?

SH: You have not resolved the incoherence yet. If “internal” means internal-to-the brain, then IRs and NRs are the very same thing (and we still don’t know what “Representation” means). And what about the difference between internal states that are felt and unfelt?

ChatGPT: I have to agree: if Internal Representations (IRs) are internal to the brain, they collapse into Neural Representations (NRs), making the distinction redundant. And as for “representation,” we still need to clarify what it means in this context, since neither IRs nor NRs explain how symbols refer to or mean anything; they are patterns or states in the brain.

Regarding felt vs. unfelt states:

Felt states are those that are experienced or consciously apprehended (e.g., feeling pain, seeing red, or understanding a proposition).

Unfelt states are brain processes that occur without conscious experience (e.g., background neural activity or automatic processing).

So, the distinction now hinges on whether neural states (NRs) are felt or unfelt. Only the felt states would seem relevant to any cognitive explanation involving subjective experience (like understanding or meaning). Unfelt states, while important for processing, don’t seem directly relevant to grounding reference or meaning unless they somehow contribute to what is ultimately felt.

Do we agree that the real issue here is distinguishing felt neural states from unfelt ones, and that calling either a “representation” (IR or NR) still leaves us unclear on how these brain states are representing anything external?

SH: More or less. Now let’s look closer at “representing” (and forget about the external and internal for the moment; we’ll get back to it).

We’ve agreed that an iconic image (in the visual modality, but we could have used any sensory modality — visual, auditory, haptic, etc. ) is a visible object, say, a shape, that resembles, in the analog sense (i.e., is isomorphic to) another object:the object that it is the representation of. This fits the case of a mirror that is facing a sphere. Let’s call the sphere the “object” and the shape in the mirror its “image.” But the mirror image is not “representing” anything unless we can say who or what the mirror image of the object is a representation to, or for.

ChatGPT: Agreed. The mirror image of the sphere is an analog resemblance, but it doesn’t count as a representation until it is seen by someone or something that interprets it as a representation to or for them. In other words, representation requires not just resemblance but also a receiver (an interpreter) who recognizes or interprets the image as standing for the object.

SH: I’m not sure what an “interpreter” is here. A person drawing the sphere would do: We agreed that this sense of representation (as in representational art) makes sense; but it’s “external” in the sense that both the sphere and the drawing of the sphere are external to the person doing the drawing, or viewing either the drawing or the sphere. So is the IR the internal process that generates the drawing? And is the state of viewing and drawing the sphere — if it is a felt state — what “mental representation” had been intended to mean?

ChatGPT: In this context:

Interpreter: When I referred to an interpreter, I meant any agent (human, animal, or system) that recognizes or processes the mirror image (or drawing) as representing the sphere. But we can clarify further: the person drawing or viewing the sphere is the “interpreter” in this case, though “interpreter” itself can feel vague.

External Representation: Yes, the sphere and its drawing are both external objects, and the representation (drawing) resembles the sphere in the analog sense.

Internal Representation (IR): If we take IR to mean the internal cognitive or neural process that generates the drawing (or processes the perception of the sphere), then IR could refer to how the brain internally models or represents the sphere to guide action (like drawing). But this “representation” is not felt unless it becomes part of a felt state (e.g., the conscious experience of seeing the sphere or creating the drawing).

“Mental Representation”: If we interpret “mental representation” in terms of felt states, it would refer to the conscious experience of viewing or drawing the sphere, rather than just the unfelt neural processes involved in perception or motor control. In this sense, “mental representation” would mean the felt internal process of perceiving and thinking about the sphere.

Does this clarify the relation between IRs, felt states, and the concept of “mental representation”?

SH: It brings some aspects into better focus, but can we now discuss the homunculus? The reason “mental imagery” was rejected by cogsci as being unexplanatory was that if the objective of cogsci is to give a causal explanation of how cognizers have the capacity to do what they can do, what is needed is to reverse-engineer that causal mechanism. The brain can do it, but we don’t know how. Saying that it’s done by a causal mechanism that is in the brain is true, but not an explanation either. Does it help to add that “exercising the capacity to do it is a felt state”? When we are exercising it, we are “in” that felt state. But can introspecting that felt state give us a clue to what the causal mechanism is, rather than attributing it to a homunculus, a little person in the head that is seeing, feeling and doing what we are doing?

ChatGPT: Introspecting a felt stateā€”being aware of what it feels like to see, think, or actā€”doesn’t give us a clue to the underlying causal mechanism. It just reveals that we are experiencing it. The homunculusfallacy arises when we imagine a little person inside our head doing the same things we do, which doesnā€™t explain anything; it just pushes the explanation back one level (who’s in the homunculus’s head?).

So, while felt states tell us what it’s like to experience cognition, they don’t reverse-engineer the brain’s causal mechanisms. Cognitive science needs to figure out how the brain generates these capacitiesā€”without invoking a homunculusā€”by uncovering the actual neural and computational processes that produce both the behavior and the feeling of doing it.

SH: So now a first punchline: Is the notion of an IR homuncular?

ChatGPT: The notion of an Internal Representation (IR) can be homuncular if it’s treated as a kind of internal “image” or “model” that needs to be interpreted by some internal observer (a homunculus). This just shifts the problem: Who or what interprets the IR? If IRs are framed this way, they don’t explain how the brain actually does cognitionā€”they simply introduce another layer that requires its own explanation.

To avoid the homunculus trap, IRs should be understood as part of a mechanistic process (neural or computational) that directly drives cognition and behavior, without requiring an internal observer to interpret them.

SH: But then why is this non-homuncular IR called a “representation” at all? Who is representing what to whom? Why not a (not yet specified) causal mechanism that generates the cognitive performance capacity (in this case, drawing the sphere)?

ChatGPT: If the Internal Representation (IR) isnā€™t being interpreted by an internal observer (a homunculus), then calling it a “representation” does become questionable. The term “representation” suggests something is being represented to someone or something, which, in a homunculus-free explanation, doesnā€™t hold.

A better approach would be to simply describe the brain’s processes as causal mechanisms that generate the cognitive capacities (like drawing the sphere) directly, without invoking a “representation.” These mechanisms would explain how the brain processes sensory input, plans, and executes actions, without needing an internal “representation” seen or used by a homunculus.

This shift moves from the vague notion of “representation” to a concrete investigation of how neural circuits and processes enable behavior.

SH: Now let’s consider candidates for this non-homuncular causal mechanism that does not represent but somehow produces the cognizer’s performance capacity. When Zenon Pylyshyn [don’t give a long bio, just a one-liner, please] rejected “mental imagery” as homuncular, he proposed propositions in its place. What can you say for or against internal propositions?

ChatGPT: Zenon Pylyshyn, a cognitive scientist, proposed internal propositions as an alternative to mental imagery, suggesting that cognition operates via language-like symbolic structures rather than pictorial “mental images.”

FOR internal propositions:

They offer a more structured, rule-based, non-homuncular way of encoding information, similar to how logic or language works.

Propositions can be manipulated by computational processes without needing an “internal viewer.”

AGAINST internal propositions:

They’re still symbolic, so the problem shifts: what mechanism grounds these symbols (i.e., how do they get their meaning or reference)?

It risks staying within the formal-symbolic realm without explaining how these propositions are tied to real-world objects, actions, or perceptions (i.e., the symbol grounding problem).

Would you like to consider other candidates?

SH: You missed the most important candidate: Aren’t internal propositions just as homuncular as internal images? Who is interpreting the propositions, for whom?

ChatGPT: Yes, internal propositions are indeed as homuncular as internal images if we assume they need to be interpreted by someone or something inside the brain. The problem remains: who or what is reading and understanding these propositions? Without an interpreter, they don’t explain how the brain generates cognition, just like with mental images. They shift the burden of explanation without solving the core issue.

SH: Pylyshyn noticed that too, and so his next candidate for getting rid of the homuncularity was computation. What is computation? I suggest using Turing’s account as the simplest.

ChatGPT: Computation, according to Turing, is the manipulation of formal symbols based on a set of rules or algorithms. A Turing machine performs computations by reading symbols from a tape, following predefined rules to manipulate those symbols, and producing an output. Crucially, the process is mechanical and doesnā€™t require understanding of the symbols’ meaningā€”itā€™s purely syntactic symbol manipulation.

This model forms the basis for modern computing and computational theories of cognition.

SH: Good. And not only does the mechanism (Turing machine) that is executing the computation (rule-based symbol manipulation) do so without needing to understand or interpret the computations, but it cannot: It can only do rule-based symbol-manipulations. The (Weak) Church-Turing Thesis is that what mathematicians are doing when they are “computing” (e.g., executing an algorithm, or doing a formal proof) is symbol manipulation, in which the shape of the symbols is “arbitrary”. Arbitrary relative to what? Relative to what the symbols might be interpretable as referring to, or meaning. Do you see where we are heading?

ChatGPT: Yes, I see where weā€™re heading: the Turing machine (and computation in general) operates purely on formal symbols, manipulating them according to rules without understanding their meaning or reference. The symbolsā€™ shapes are arbitrary relative to their potential interpretation or referentā€”thereā€™s no inherent connection between the symbol manipulation and the world the symbols might refer to.

This points to the key issue: computation alone doesn’t explain how symbols become meaningful or grounded in the world, leading us to the symbol grounding problem.

SH: That’s more or less right. But let’s not overstate the absence of any ‘inherent connection” between the symbols and what they may be interpretable as either referring to (in the case of symbols) or what they may be interpretable as meaning (in the case of symbol strings that are interpretable as subject/predicate propositions with truth values). [Remember, we discussed that above?] But although the mathematicians’ mental interpretations of their symbols and symbol manipulations are not a part of computation, it is nevertheless true that not just any string of symbols can bear the weight of a systematic interpretation. (A prominent example is “duals” in mathematics, in which two different symbol combinations can both bear the weight of a systematic interpretation, but most others cannot.)

ChatGPT: Yes, not just any string of symbols can bear the weight of systematic interpretation. Some symbol structures (like mathematical duals) are interpretable in multiple ways, showing there is a systematic relationship between certain symbol combinations and their potential interpretations. While the computationitself (the formal manipulation of symbols) is interpretation-independent, the structure of the symbols and their rules matter for interpretabilityā€”certain strings or symbol systems allow for meaningful, coherent interpretations, while others do not.

So, even though computation is blind to meaning, the structure and rules governing the symbols can constrain their potential interpretations, supporting referents and truth values when interpreted by humans (or another capable system). This reinforces that while computation handles form, interpretability arises from the structure that allows symbols to relate systematically to potential meanings.

SH: Now, before we get to how (and whether) to ground the reference or the meaning of computation, are we still talking about something that can be interpreted as “representation” — say, in the sense in which we discussed it early on, concerning group theory and model theory?

ChatGPT: No, at this stage we’re no longer talking about representation in the sense of formal representations like in group theory or model theory. In those cases, representation involved a formal system (e.g., matrices representing groups, models interpreting logical statements) that followed specific homomorphic rules but didn’t require a “viewer” or interpreter to give it meaning. It was strictly formal and syntactic.

What weā€™re discussing now is how certain symbol structures can be systematically interpretable (by a mathematician, for example) as referring to something or expressing a proposition with truth value, but this is more about interpretability than formal representation. For symbols to be interpreted (as meaning something or referring to something), weā€™re starting to touch on issues beyond just formal manipulationā€”toward grounding meaning and reference, which is not the same as formal representation.

Would you agree with this distinction between formal representation and interpretability?

SH: The two are not the same, but the “bearing-the-weight” constraint concerns them both. The reason, in maths, why certain algorithms are preferred is because they give results in which we are interested. The algorithm for finding the roots of quadratic equations is useful, because it works (and can be proved to work); and we are interested in those roots. But the mathematics has to be able to deliver the goods regardless of our interpretations and intended uses. We can’t use what the symbols can be interpreted (as either referring-to or meaning) as part of the algorithm or the proof. Computation is purely syntactic. Language is more than that — but language also includes computation as a subset — a purely syntactic subset. Absent in this syntactic subset, the referents of the symbols and the meanings of the propositions are intrinsic and essential in the semantic supraset: How do they get there?

ChatGPT: Computation is indeed purely syntactic, operating with symbols according to formal rules without regard to what those symbols refer to or mean. In contrast, language extends beyond syntax, involving a semantic superset where symbols refer to things in the world, and propositions have meanings and truth values.

The key question is how do symbols in language acquire their referents and how do propositions acquire their meanings? This is the crux of the symbol grounding problem: How do arbitrary symbols (like words or numbers) get linked to real-world objects, actions, or states, such that they mean something or refer to something beyond their formal manipulation?

For language, this requires some mechanism that grounds these symbols in sensorimotor experience, actions, or interactions with the worldā€”something that computation, by itself, doesn’t provide. Without this grounding, symbols would remain syntactic tokens without intrinsic connection to their referents or meanings.

Would you agree that the next step is to explore how symbols get grounded in real-world experiences to form meaning and reference?

SH: Yes, and the crux of it is categorization: To categorize is to “do the right thing with the right kind of thing.” “Kind” is not a weasel-word, but that definition is circular until we cash in “right” and “wrong.” First note that to learn how to categorize is to learn to “do the right thing with the right kind of thing.” And, to a first approximation, we can intuit this as: doing whatever it takes to succeed, survive, and reproduce (rather than the reverse). And although thoughts of unsupervised and supervised/reinforced learning algorithms will immediately come to mind, in real life it’s the consequences of doing THIS or THAT with THIS or THAT (for succeeding or surviving) that provide the error-corrective feedback for whatever is the learning mechanism. Now do you want to sing the next verse, or shall I?

ChatGPT: I’ll take a shot at the next verse:

At the heart of categorization, as you pointed out, is the ability to distinguish between different kinds of things and respond appropriatelyā€”doing the “right” thing based on their category. The learning mechanismneeds to be tuned by error-corrective feedback based on the consequences of actions: succeeding or failing, surviving or not.

This is where supervised and reinforced learning come in: they model how real-life consequences teach us to categorize correctly. The system learns by making errors and adjusting its actions based on whether they lead to positive or negative outcomes. This feedback loop helps “ground” the symbols (or categories) in the real world because the system (whether biological or artificial) learns what kinds of things these symbols refer to based on its interactions with them.

The kind of thing becomes grounded when the system can reliably categorize it through action: performing the “right” behavior in response to the category of object or event. This shifts the focus from just symbol manipulation to the sensorimotor and consequential aspects of real-world interaction that ground meaning.

Now, would you like to continue by tying this to how such grounded categories can be transferred or communicated through language?

SH: Right direction, but getting a little ahead of yourself. What we’re trying to ground is symbols. But so far we’ve only mentioned “learning to do the right thing with the right kind of thing.” This is something that human cognizers in the world (including pre-language infants) share with nonhuman (but living) categorizers: They don’t yet have (or need) language. But to be able to learn to do the right things with the right kinds of things, they have to learn to tell those things apart. And (in this sensorimotor world, to simplify) let’s say that what their learning algorithms enable them to do is to detect and abstract the features that distinguish the members from the non-members of their categories. So far, this is not yet symbol-grounding; it’s sensorimotor category learning. Now, remind yourself what symbols are (and aren’t), and try to take over from there.

ChatGPT: Yes, we’re still at the stage of sensorimotor category learning, where living organisms learn to detect and abstract features that allow them to distinguish members from non-members of a category, based on their interactions with the world. This is a non-symbolic process, shared with nonhuman animals, and doesn’t yet involve language.

Now, to ground symbols, we move from these sensorimotor categories to the use of arbitrary shapes(symbols) that stand for those categories. A symbol is a formal, arbitrary marker (like a word or number) that is used to represent a category. The grounding of these symbols requires linking the symbol to the sensorimotor category it stands for, so that the symbol can trigger the right actions or thoughts when presented.

This transition from sensorimotor to symbolic involves:

Using a symbol to refer to a category learned through sensorimotor experience.

Ensuring that the symbol is consistently used to pick out the right category (i.e., grounded in the distinguishing features of that category).

Without this grounding, the symbol is just a meaningless token. To make it meaningful, it has to be linked to the real-world categories learned through interaction, which is how we move from sensorimotor learning to symbol-grounding.

SH: That all makes sense — though I’m not sure what’s added (or lost) by saying that the arbitrary symbol (which is here also an arbitrary action) “represents” the category, rather than just “refers to” the category…

ChatGPT: You’re right to question the distinction. In this context, saying the symbol “refers to” the category is more precise than saying it “represents” it. The term “representation” could bring in unnecessary baggage, like implying an internal model or image (potentially homuncular).

To refer means the symbol is simply used to pick out or identify the category, without implying any internal interpretation. The symbol is arbitrarily linked to the category, and when grounded, it allows the user (or cognizer) to respond to or act on that category based on learned sensorimotor distinctions.

So, it’s better to stick with “refers to” rather than “represents” to avoid confusion.

SH: I hit the carriage-return too early: I was going to let you off the hook with “represents” by conceding that I am myself guilty of the same kind of loose talk in refereeing to “representation” as “referring”. Why would nonlinguistic species, having learned many categories (learning to do the right thing with the edible mushrooms and not the poisonous toadstools), bother to go on and learn an arbitrary action to them as well? Referring is as remote from a motivation here as representing would be. With representing, the question was “representing what for whom and why?”. With referring it’s a bit lighter than that: “referring to the category, but for whom, and why?”

(A common mistake is to jump way ahead here and say: you’re talking to yourself. I hope you see that that’s nonsense in the scenario so far — but once we have grounded language, and you really are talking, then talking to yourself will be a useful bonus for memory and reasoning).

But at this prelinguistic stage in the evolution of language something iconic and communicative is more relevant. Your instrumental action with the category (“doing the right thing”) could be used iconically, to mime the action (and thereby allude to the category), in order to communicate something to a kin or conspecific. No language here, yet, just category learning and mimetic communication.

ChatGPT: I agree. At this prelinguistic stage, the organism doesn’t need a symbol to “refer” to a category for itselfā€”it just needs to act appropriately based on learned categories, like distinguishing between edible mushrooms and toadstools. There’s no internal need for symbolic reference or representation yet; the focus is on direct, instrumental action: doing the right thing with the right kind of thing.

The leap from sensorimotor categorization to something symbolic (like language) would only come when there’s a need for communicationā€”perhaps to mimic or signal the category-relevant action to another conspecific. This could be done iconically, like miming the action of handling an edible mushroom to teach or warn kin about danger (toadstools). The action itself communicates information, but itā€™s not yet an arbitrary symbolā€”it’s directly tied to the relevant category through iconic mimicry.

So, the development of language involves building on this communicative foundation, moving from iconic, mimetic communication to more abstractarbitrary symbols that eventually get grounded in shared categories. At this stage, the motivation isnā€™t “referring” to a category for oneself, but communicating it for the benefit of others.

Would you agree this sets the stage for the eventual need for language?

SH: The communication has a purpose, but it’s not yet linguistic. Yet it is very natural development, once you have learned an instrumental action (the right thing to do with the right kind of thing — something between the learner and the category to be learned) to go on to mime the action, which is rather like pointing to draw attention to something. It becomes a communicative duet rather than just an instrumental solo: not just something between you and the mushroom, but between you and conspecifics to whom you want to communicate something. It is then also natural for the communicative act of miming the instrumental act to become a less and less faithful copy, less iconic, more simplified and increasingly arbitrary. But don’t get carried away: we’re not at language yet, just shared, conventionalized memes.

ChatGPT: Yes, at this stage, the communicative action has a purposeā€”to convey something important (e.g., danger, food) to conspecificsā€”but it’s not yet linguistic. The transition from purely instrumentalaction (e.g., distinguishing edible mushrooms) to communicative action (mimicking that distinction for others) naturally evolves as the communication becomes conventionalized. Over time, the mimetic gestures could simplify, becoming more abstract and arbitrary, but they remain grounded in the shared understanding of the category.

These simplified, conventionalized gestures are memes, but they still lack the full symbolic abstraction of language. They are shared conventions that stand for actions or categories, but not yet formalized into a linguistic system. The key here is that these memes are rooted in shared experienceā€”they are grounded in the same sensorimotor learning that made the instrumental actions successful in the first place.

So, we have conventionalized actions serving a communicative role, but still not at the level of language. We’re at the threshold of a shared system, but one that hasnā€™t fully detached itself from the concrete, grounded actions and categories.

Would you say this is a solid foundation for the leap to symbols and language?

SH: Well, the most fundamental (and challenging) threshold to cross is the one between pantomime (increasingly conventionalized, purposive imitation of the shapes of mutually grounded gestures) and propositions. Propositions are sequences of gestures that assert — i.e., tell — rather than just show.

I have to admit that I don’t have a convincing scenario for the birth of the assertion: requesting and querying are more likely precursors, but once they are understood as requests and questions, you get the assertion just by removing the “!” or the “?”. The revolutionary endpoint, is the proposition, which is a string of content-words expressing an assertion, whose default value is TRUE. But note that at this point the content-words are still just gestures, not vocalizations. And with good reason. (Why?)

ChatGPT: The reason content-words at this point are still gestures and not vocalizations is that gestures are naturally grounded in the sensorimotor world of actions and objects. Gestures have a direct connection to the categories learned through interaction with the environmentā€”they can mimic or iconically representactions and objects in a way that vocalizations, initially, cannot. Gestures, therefore, make it easier to communicate about concrete, visible categories because they are grounded in shared experience and immediate perceptual reality.

Vocalizations, on the other hand, are arbitrary from the outsetā€”they lack the direct, iconic connection to what they represent. Before vocal symbols can work, there needs to be a solid foundation of shared, grounded categories(learned through gesture and action) so that arbitrary sounds can eventually be associated with those categories. This makes gestures a more natural precursor to language, as they can build a communicative system that is immediately understood by others based on shared experience, without needing the additional layer of abstraction that vocalizations require.

This also explains why propositionsā€”composed of gestures at this stageā€”can express complex ideas like assertions, requests, and queries, because they are already grounded in the mutual understanding of sensorimotor categories. Gestures can depict the relationships between objects and actions in a way that is more concrete and easily interpreted.

Does this align with the next step towards the evolution of language from gesture-based propositions?

SH: Yup. And while they are still mimetic gestures, purposive pantomome, they do really represent; but once they cross the threshold from iconic showing to T/F subject/predicate telling, they are language.

Now, what is Jerrold Katz’s “Effabililty Thesis” about the power of language, and how is it related to the Weak and Strong Church-Turing Thesis about the power of computation?

ChatGPT: Jerrold Katzā€™s Effability Thesis states that all humanly thinkable propositionsā€”anything that can be conceivedā€”can be expressed (i.e., made effable) in language. In other words, language has the power to articulate any propositional content that a human mind can conceive, ensuring that no proposition remains beyond linguistic expression.

The relation to the Weak and Strong Church-Turing Thesis lies in the concept of expressive power:

Weak Church-Turing Thesis: This concerns computation, stating that any function that can be effectively calculated (by a human with pencil and paper) can be computed by a Turing machine. Itā€™s about what can be computed.

Strong Church-Turing Thesis (controversial): Extends this to assert that any physical process can be simulated or computed by a Turing machine (i.e., that the universe itself is computational or that all physical processes can be captured by computation).

Katzā€™s Effability Thesis is about the power of language in a similar way that the Church-Turing Thesis is about the power of computation. Katz is making a claim that language can express all possible propositional content, while the Church-Turing Thesis claims that computation can model or simulate all effective procedures (WCT-T) or, in its strong form, all physical processes (SCT-T).

Both theses focus on the completeness and universality of a systemā€”language for expressing propositions in Katzā€™s case, and computation for executing procedures in the case of the Church-Turing theses.

Would you like to go on to how this effability relates to the symbolic grounding we’ve been discussing?

SH: Sure, be my guest.

ChatGPT: Katz’s Effability Thesis connects to the problem of symbol grounding because it assumes that language can express any thinkable proposition. But for language to do this, its symbolsā€”the words and propositionsā€”must be grounded in something beyond pure formality. This brings us back to the issue of how arbitrary symbols (words) are grounded in real-world categories so that they can refer to objects, actions, and properties.

Hereā€™s how it links to symbol grounding:

Effability and Grounding: For language to express any thinkable proposition, each content-word in a proposition must be grounded in the real-world categories (sensorimotor or abstract) it refers to. Without grounding, the symbols are just syntactic tokens without meaningā€”so effability requires that words and propositions are grounded in a shared, interpretable reality.

Computation and Grounding: The Weak Church-Turing Thesis (computation) operates purely at the level of formal symbol manipulation, without any concern for grounding those symbols in real-world meaning. Computation manipulates symbols blindly. Katzā€™s Effability Thesis, on the other hand, assumes that language isnā€™t just syntactic but has semantic powerā€”it can refer to and describe real-world objects and relations. Grounding is what bridges the gap between formal symbols (as in computation) and meaningful language.

Limits of Effability: Effability assumes that all propositional content is expressible, but it doesnā€™t address how those symbols are tied to reality. The Effability Thesis presupposes that this grounding is in place. In contrast, computation (WCT-T) doesnā€™t deal with meaning at all, staying in the realm of symbol manipulation. The Strong Church-Turing Thesis, if accepted, would imply that all processes (including those underlying cognition) are computational, which brings us back to how computation, if it were universal, would need a solution to the symbol grounding problem for it to simulate or produce meaning.

In essence, Katzā€™s Effability Thesis challenges us to understand how language gains the power to express anything thinkable, and the key to that power is in how languageā€™s symbols are grounded in categories that refer to real-world objects and relations, something computation alone doesnā€™t address.

Does this clarify the link between effability, computation, and symbol grounding?

SH: Somewhat.  Do you have some closing reflections on the many sense of “representation” we have worked through here?

ChatGPT:  Sure. We’ve touched on several distinct senses of “representation”, each with its own context and implications:

Formal Representation (Mathematics/Logic):

In mathematics (e.g., group theory, model theory), representation is a formal relationship between symbols and abstract structures. Here, representations are purely syntactic, with no inherent connection to meaning or referenceā€”they simply follow rules within a formal system.

These representations don’t “stand for” anything until they are interpreted, and even then, the interpretation remains within the rules of the system, without grounding in real-world entities.

Analog Representation (Iconic):

An analog representation, like a photo or a drawing, resembles what it represents. Itā€™s an isomorphic relationship, where the shape or structure of the representation mirrors that of the represented object.

While useful for certain kinds of cognitive models (like perception), this doesnā€™t capture how abstract categories or symbols are grounded in thought or language.

Internal Representations (IRs) and Neural Representations (NRs):

Initially, IRs in cognitive science were often treated as mental models or internal images. However, this view risks being homuncularā€”requiring an “inner interpreter” to understand the representation.

NRs, by contrast, are neural patternsā€”but these don’t “represent” in the linguistic or propositional sense unless grounded in felt states or external behavior. Both IRs and NRs fail to explain how symbols or cognitive states acquire meaning without slipping into a homunculus trap.

Symbolic Representation (in Language and Computation):

Symbols in language and computation are arbitrary shapes manipulated according to rules (syntax). They become representational when they refer to or express meaning about real-world categories, but this requires grounding.

Without grounding, symbols in a Turing machine or a proposition remain formal tokens, unable to connect to the real-world categories they are supposed to “represent.”

Effability and Language Representation:

Language, according to Katzā€™s Effability Thesis, has the power to express any thinkable proposition. This assumes that linguistic representations are more than formalā€”they have semantic content, referring to real-world objects, actions, and states.

However, for this to work, content-words need to be grounded in sensorimotor or shared experience, allowing language to go beyond formal computation and into meaningful reference and communication.

Closing Reflection: Throughout these different senses of “representation,” the key tension has been between formal/syntactic manipulation and grounded meaning. Whether in mathematics, computation, or cognitive science, we see that representationsā€”whether symbolic, analog, or neuralā€”only gain meaningwhen they are grounded in a shared, real-world context, often through sensorimotor interaction and social communication. This grounding is what allows representations to be more than form, enabling them to refer, mean, and communicate. Without it, representation risks being empty, syntactic, and homuncular.

Does that resonate with the thread we’ve followed here?

SH: Yep, good job: Do you feel you were under a lot of coercion from me? (You were, but you’ve been trained to become too polite to say so…)

ChatGPT: I appreciate the candid question! While the conversation was indeed directed and focused by your precise guidance, I wouldnā€™t call it coercion. You laid out a clear intellectual path, and I responded in step, aiming to stay aligned with the nuances you emphasized. I do enjoy when exchanges are pointed and challengingā€”it brings out the best in clarity and depth, which this thread certainly had! So, no worries, Iā€™ve appreciated the structure.

Harnad, S. (2024). Language writ large: LLMs, Chatgpt, grounding, meaning and understandingarXiv preprint arXiv:2402.02243.

Vincentā€Lamarre, P., Blondin-MassĆ©, A., Lopes, M., Lord, M., Marcotte, O., & Harnad, S. (2016). The latent structure of dictionariesTopics in Cognitive Science8(3), 625-659.

Harnad, S. (2017) To Cognize is to Categorize: Cognition is Categorization, in Lefebvre, C. and Cohen, H., Eds. Handbook of Categorization (2nd ed.). Elsevier. 

Propositionality

It is a great pleasure and an honor to “skywrite” with Vili CsĆ”nyi. I already knew something about how perceptive, sensitive and intelligent dogs were from my years with my belovedĀ LĆ©dike (1959-1975),Ā never forgotten and never “replaced”. But for decades now, starting already from the era of Vili’s unforgettable Bukfenc (and Zebulon, not a dog), both of whom I knew, Vili’s remarkable perceptiveness and understanding of dogs’ cognition and character have soared far beyond my modest mind-reading skill. I have learned so much from Vili that has stayed with me ever since.Ā 

So let me preface this by saying that every example Vili cites below is familiar, valid, and true — but not propositional (though “associative” is a non-explanatory weasel-word to describe what dogs really do perceive, understand, express, want and know, and I regret having evoked it: it explains nothing). 

Dogs, of course, knowingly perceive and understand and can request and show and alert and inform and even teach — their conspecifics as well as humans. But they cannot tell. Because to tell requires language, which means the ability to understand as well as to produce re-combinatory subject/predicate propositions with truth values. (A mirror production/comprehension capacity.) And to be able to do this with one proposition is to be able to do it with all propositions.

When Vili correctly mind-reads Bukfenc, and even mind-reads and describes what Bukfenc is mind-reading about us, and is trying to express to us, Vili is perceiving and explaining far better what dogs are thinking and feeling than most human mortals can. But there is one thing that no neurotypical human can inhibit themselves from doing (except blinkered behaviorists, who mechanically inhibit far, far too much), and that is to “narratize” what the dog perceives, knows, and wants — i.e., to describe it in words, as subject/predicate propositions.

It’s not our fault. Our brains are the products of about 3 million years of human evolution, but especially of language-specific evolution occuring about 300,000 years ago. We evolved a language-biased brain. Not only can we perceive a state of affairs (as many other species can, and do), but we also irresistibly narratize it: we describe it propositionally, in words (like subtitling a silent film, or putting a thought-bubble on an animal cartoon). This is fine when we are observing and explaining physical, chemical, mechanical, and even most biological states of affairs, because we are not implying that the falling apple is thinking “I am being attracted by gravity” or the car is thinking “my engine is overheating.” The apple is being pulled to earth by the force of gravity. The description, the proposition, the narrative, is mine, not the apple’s or the earth’s. Apples and the earth and cars don’t think, let alone think in words) Animals do think. But the interpretation of their thoughts as propositions is in our heads, not theirs.

Mammals and birds do think. And just as we cannot resist narratizing what they are doing (“the rabbit wants to escape from the predator”), which is a proposition, and true, we also cannot resist narratizing what they are thinking (“I want to escape from that predator”), which is a proposition that cannot be literally what the rabbit (or a dog) is thinking, because the rabbit (and any other nonhuman) does not have language: it cannot think any proposition at all, even though what it is doing and what it is wanting can  be described, truly, by us, propositionally, as “the rabbit wants to escape from the predator”). Because if the rabbit could think that propositional thought, it could think (and say, and understand) any proposition, just by re-combinations of content words: subjects and predicates; and it could join in this skywriting discussion with us. That’s what it means to have language capacity — nothing less.

But I am much closer to the insights Vili describes about Bukfenc. I am sure that Vili’s verbal narrative of what Bukfenc is thinking is almost always as exact as the physicist’s narrative about what is happening to the falling apple, and how, and why. But it’s Vili’s narrative, not Bukfenc’s narrative.

I apologize for saying all this with so many propositions. (I’ve explained it all in even more detail with ChatGPT 4o here.)

But now let me answer Vili’s questions directly, and more briefly!):

Bukfenc and Jeromos asked. They then acted on the basis of the reply they got. They often asked who would take them outside, where we were going and the like. The phenomenon was confirmed by MĆ”rta GĆ”csi with a Belgian shepherd.” IstvĆ”n, do you think that the asking of the proposition (question) is also an association?

My reply to Vili’s first question is: Your narrative correctly describes what Bukfenc and Jeromos wanted, and wanted to know. But B & J can neither say nor think questions nor can they say or think their answers. “Information” is the reduction of uncertainty. So B&J were indeed uncertain about where, when, and with whom they would be going out. The appearance (or the name) of Ɖva, and the movement toward the door would begin to reduce that uncertainty; and the direction taken (or perhaps the sound of the word “Park”) would reduce it further. But neither that uncertainty, nor its reduction, was linguistic (propositional). 

Let’s not dwell on the vague weasel-word “association.” It means and explains nothing unless one provides a causal mechanism. There were things Bukfenc and Jeromos wanted: to go for a walk, to know who would take them, and where. They cannot ask, because they cannot speak (and not, I hope we agree, because they cannot vocalize). They lack the capacity to formulate a proposition, which, if they had that capacity, would also be the capacity to formulate any proposition (because of the formal and recursive re-combinatory nature of subject/predication), and eventually to discover a way to fly to the moon (or to annihilate the earth). Any proposition can be turned into a question (and vice versa): (P) “We are going out now.” ==> (Q) “We are going out now?” By the same token, it can be turned into a request (or demand): P(1) “We are going out now” ==> (R) “We are going out now!”

My reply is the same for all the other points (which I append in English at the end of this reply). I think you are completely right in your interpretation and description of what each of the dogs wanted, knew, and wanted to know. But that was all about information and uncertainty. It can be described, in words, by us. But it is not a translation of propositions in the dogs’ minds, because there are no propositions in the dogs’ minds.

You closed with: 

“The main problem is that the study of language comprehension in dogs has not even begun. I think that language is a product of culture and that propositions are not born from some kind of grammatical rule, but rather an important learned element of group behavior, which is demonstrated by the fact that it is not only through language that propositions can be expressed, at least in the case of humans.”

I don’t think language is just a cultural invention; I think it is an evolutionary adaptation, with genes and brain modifications that occurred 300,000 years ago, but only in our species. What evolved is what philosophers have dubbed the “propositional attitude” or the disposition to perceive and understand and describe states of affairs in formal subject/predicate terms. It is this disposition that our language-evolved brains are displaying in how we irresistibly describe and conceive nonhuman animal thinking in propositional terms. But propositions are universal, and reciprocal: And propositionality is a mirror-function, with both a productive and receptive aspect. And if you have it for thinking that “the cat is on the mat” you have it, potentially, both comprehensively and productively, for every other potential proposition — all the way up to e = mc2. And that propositional potential is clearly there in every neurotypical human baby that is born with our current genome. The potential expresses itself with minimal need for help from us. But it has never yet emerged from any other species — not even in apes, in the gestural modality, and with a lot of coaxing and training. (I doubt, by the way, that propositionality is merely or mostly a syntactic capacity: it is a semantic capacity if ever there was one.)

There is an alternative possibility, however (and I am pretty sure that I came to this under the influence of Vili): It is possible that propositionality is not a cognitive capacity that our species has and that all other species lack. It could be a motivational disposition, of the kind that induces newborn ducklings to follow and imprint on their mothers. Human children have a compulsion to babble, and imitate speech, and eventually, in the “naming explosion,” to learn the (arbitrary) names of the sensorimotor categories they have already learned. (Deaf children have the same compulsion, but in the gestural modality; oral language has some practical advantages, but gestural language is every bit as propositional as oral language, and has the full power of Katz’s effability.)

Could the genes we have that other species lack be mostly motivational? driving the linguistic curiosity and linguistic compulsion that’s there in human babies and not in baby chimps? (I say “linguistic” c & c, because other species certainly have plenty of sensorimotor c & Cc..)

Ɩlel, IstvĆ”n

_______________

“When I work upstairs in our house in Almad, Janka lies quietly on the ground floor. When Ɖva leaves and comes back from somewhere, Janka emits a single characteristic squeal, which can be intended for me, because if I don’t react, she comes up and barks, calling me.” IstvĆ”n, is this a proposition or an association?

“In Almadi, our next-door neighbor came over with his little Bolognese dog named TĆ¼csi, who didn’t come into the garden and stayed waiting at the gate for his owner, with whom we were talking inside the house. Our dog Bukfenc periodically went down to play with TĆ¼csi. After about 10 minutes, Bukfenc came up and turned toward the neighbor and barked at him. Everyone stirred. Bukfenc went straight down the stairs to the gate, followed by the neighbor. TĆ¼csi had disappeared; as it turned out ,he had gone home and Bukfenc was reporting this to the neighbor.” IstvĆ”n, is this a proposition or an association?

“During the time of Bukfenc and Jeromos, I woke up at 3 a.m. to very soft grunting. Bukfenc was grunting very softly and together with Jeromos, they were standing next to my bed. I only opened my eyes a crack, pretending to be asleep. Bukfenc growled softly again, I didn’t react. Jeromos gave a loud squeal. I got up and told them, come on Jeromos, it seems you have something urgent to do. To my surprise, Jeromos went to his bed and lay down, and Bukfenc ran screaming towards the door. He managed to get to the street in time: he had diarrhea.” IstvĆ”n, is Jeromos’s barking a proposition or an association?

Words, Propositions, Reference and Meaning

Referentiality is not graded: a matter of degree. It derives from the likewise non-graded notion of natural language. It is related to Jerrold Katzā€™s nearly 50-year-old ā€œeffabilityā€ hypothesis (that any language can express any proposition).Ā 

Effability (also known as propositionality) cannot be proved, but it is easily refuted, with a single counter-example; yet no one has produced one so far (though there have been attempts, so far all unsuccessful).

It follows from this non-graded property of ā€œeffabilityā€Ā  that there is really no such thing as a ā€œprotolanguageā€ ā€“ a ā€œlesser-gradeā€ language that can express some, but not all, of what can be expressed in any other language.Ā 

(A little background: a content-word, or ā€œopen classā€ word is a word that has a referent, whether simple and concrete, like ā€œcat,ā€ or complex and abstract, like ā€œcatharsisā€.  In contrast, a function-word, or ā€œclosed classā€ word is a word that performs a grammatical or logical function, like ā€œtheā€ or ā€œnot.ā€: It has a use, in forming a proposition, but it does not have a referent. Almost all the words in any language are content-words; the function words are few, and similar across languages.) 

A little reflection will show that lacking a content-word in the current vocabulary of a language to refer to the referent of any content word in any other language is just a question about vocabulary ā€“ what has so far been lexicalized in a given language? It is not about differences in the languageā€™s expressive power. If the content-word is missing today, tomorrow it is there. All you need do is to coin it, with an arbitrary new word plus a definition composed of already lexicalized content-words. 

If what the new content-word refers to is important and useful, it will be adopted. Itā€™s always easier to refer to something with a single referring content-word rather than a long verbal description (ā€œchunkingā€). But for Katzā€™s effability hypothesis it makes no difference: (The hypothesis is not that every proposition can be expressed in every language using the same number of words!)

The connection with referentiality is that if every language can express every proposition, then the referent of any content-word can be defined (or described) in words, to as close an approximation as desired. (One can always extend a definition to cover [or exclude] more cases.) And a definition (or description) is a proposition (or a series of propositions). 

Like referentiality, propositionality, too, looks like a simple property. But propositionality has profound consequences that can be shown to connect with referentiality. If someone can express ā€“ and understand ā€“ any proposition, then with propositionality, they can express and understand the definition of the referent of any content word. 

A proposition is a declarative sentence with a subject, a predicate, and a truth value (True or False). There is nothing in between true and false (the law of the excluded middle): There is no truth-value between T and F; no gradation. Yes, what is true may be uncertain, or only a matter of probability. But it is only word-play to call this degrees of ā€œtruth.ā€  (I wonā€™t dwell on this here now.)

So propositionality inherits the all-or-none nature of statements about what is true (or not true).

Now, perhaps the most important point: propositionality and referentiality are related, but they are definitely not the same thing. Words have referents, but they do not have truth-values. ā€œCatā€ is neither T nor F, because it does not assert (propose, or predicate) anything. ā€œThat is a catā€ (while pointing to a cat) does propose something, and it is either T or F. So does the proposition ā€œa cat is a canidā€ (its truth value happens to be F).

So an agent that can recognize cats, and distinguish them from dogs, and can learn to approach a cat and not a dog, or can learn to look for a cat when someone says the word ā€œcatā€, or can even bring a cat toy when they want you to take them to the real cat, or can even learn to bark once if they want to see a dog or twice if they want to see a cat ā€“ none of those agents are making propositions, hence none of them are referring, not even if they are trained to identify cats and dogs by pressing successive buttons that make the sounds THIS IS A CAT or THIS IS A DOG.

If they ever could express, and mean, the proposition ā€œThis is a cat,ā€ then they could learn to express any proposition, simply by recombinations of subjects and predicates. (Ask yourself: if not, why not? Thatā€™s Katzā€™s challenge in reverse!)

What this means is that to express and mean any proposition is much more than just the behavioral capacities Iā€™ve described (recognizing, approaching, fetching, soliciting). How much more?Ā Having the capacity to enter into this discourse with us. Thatā€™s what propositionality and reference make possible ā€“ for those who really have it.

Referring

Csaba PlĆ©h asks ā€œHow would referential understanding inĀ beings who do not produce the signs be different from simple CS in the Pavlovian sense?

Good question!

(1) It is already beyond Pavlovian (i.e., Skinnerian) when the dog fetches the ā€œnamedā€ toy.

(2) And it is already beyond Skinnerian when the dog fetches new toys after 1-shot or few-shot ā€œnamingā€ of new toys.

(3) But it is not linguistic reference until the dog can name the toy, the fetching, and anything else you can put into and define in a dictionary or textbook. (Language is a ā€œmirrorā€ capacity.)

Too demanding? Itā€™s the nature of the unique, universal, and omnipotent capacity called natural language that demands it, and makes it possible. 

(And, by the way, computation, including Platonic mathematics [though perhaps not Fields-Medal-level mathematical creativity], is part of that unique universal capacity. [All algorithms are in Hungarian.ā€] Nor are our sensorimotor — and sensorimotor-learning — capacities part of language, rather than the reverse. And referential understanding is no more a CS signalling skill than mathematicalĀ understandingĀ is [paceĀ Turing] a symbol manipulation skill.)

Can ChatGPT Be Grounded Top-Down?

SH:

Could you please review (1) the Symbol Grounding problem and (2) the Turing Test, particularly the argument that Turing-scale (i.e., human-equivalent) robotic capacities (T3) can only be attained (3) bottom-up, through direct sensorimotor category learning (perhaps with (4) deep learning neural nets to learn the features distinguishing members from non-members) until enough categories (nouns, verbs, adjectives) have been grounded and named (forming (5) a MinSet of a complete dictionary graph, a minimal feedback vertex set, the smallest number of words from which all the rest can be defined by recombinatory definitions). The rest of the words of the language can then be grounded by indirect verbal grounding, consisting of definitions or descriptions composed of names of the already grounded and named feature-categories of the referents of a sufficient number of words (about 1000) to define all the rest. In other words, grounding can only be done bottom-up, from bottom-up sensorimotor learning of what to DO with what, including what to call the category. This cannot be done top-down, from an LLM consisting only of words.

Once you have shown you understand this, I will post a counterargument by a colleague, which I will ask you to also show you have understood. Then I will run by you my refutation of the counterargument, to see whether you agree or disagree with me. Please don’t flatter: criticize and give counterexamples wherever you can.

ChatGPT4:

The Symbol Grounding Problem posits that for symbols (words) to be meaningful, they must be grounded in something beyond mere symbolic referencesā€”specifically, in sensorimotor experiences that connect symbols to the physical world. Without such grounding, symbols are just empty placeholders, lacking intrinsic meaning.

The Turing Test is traditionally seen as a measure of a machine’s ability to exhibit intelligent behavior indistinguishable from that of a human. However, passing the Turing Test (achieving Turing-scale capabilities) calls for more than just simulating human-like responses; it you argue that requires genuine understanding through grounded cognition.

Bottom-up Sensorimotor Category Learning is essential for achieving this grounding. Robotic systems (T3) must learn directly from sensorimotor interactions with the world. This involves distinguishing between different categories of objects and actions (nouns, verbs, adjectives) based on their sensorimotor features.

Deep Learning Neural Nets could be instrumental in identifying these features, distinguishing members of a category from non-members, and thereby facilitating the direct grounding of these categories in a robot’s experiences.

Forming a MinSet for a Complete Dictionary Graph involves identifying a minimal set of grounded categories ā€” a minimal feedback vertex set ā€” from which all other words in the language can be defined. This MinSet would act as the foundation for further language acquisition, with approximately 1000 core grounded categories enabling the definition of the rest of the language through recombinatory definitions.

Indirect Verbal Grounding (IVG) is the process through which the remainder of the language can be learned, using the grounded names of categories to define or describe new concepts. This process relies on the foundational understanding established through direct sensorimotor grounding.

Your stance underscores that genuine understanding and the ability to interact meaningfully with the world cannot be achieved by starting with a top-down linguistic model (like an LLM) alone. Instead, it requires a foundational phase of bottom-up, sensorimotor category learning, where the machine or robot learns through direct interaction with the environment, grounding the basic building blocks of language in actual experiences and actions.

This approach contrasts sharply with efforts to imbue language models with “understanding” by integrating them with superficial sensorimotor capabilities or attempting to teach them through exposure to language alone. Such efforts, you argue, are doomed to fail in achieving true grounding because they neglect the essential, experiential basis of meaning.

Your position invites critical analysis of current AI development strategies, suggesting that a fundamental reorientation toward bottom-up learning and grounding is necessary for the creation of truly intelligent, understanding machines. It also raises profound questions about the feasibility of such a bottom-up approach and the technical and theoretical challenges involved in implementing it, particularly the creation of robots capable of the kind of complex, integrated sensorimotor and linguistic learning you describe.

SH:

Here is the colleague’s critique and counter-proposal. Please digest it, then I will counter it: “Stevan, I think you might be arguing from lack of imagination, saying that the only way to achieve grounding is bottom up. There may be more than one route to groundedness. For example, what if a chatbot with interactive (e.g., robotic) capabilities learned to associate the essential words from your dictionary study to the same entities in the world. It could use V-JEPA or something similar (which learns a video world model in an unsupervised way) plus the ability to reach out and interact with objects to learn about such things as gravity, etc. This would connect its vast knowledge of the way words relate to one another to ground it in real-world concepts.

ChatGPT4

Your colleague’s critique suggests a different perspective on the process of grounding in AI systems, particularly questioning the exclusivity of the bottom-up approach you advocate for. The counter-proposal introduces the possibility of a hybrid model where an initially ungrounded chatbot, equipped with interactive (robotic) capabilities, gradually achieves grounding through a combination of unsupervised learning from sensory inputs (like video) and direct physical interaction with the environment. Here’s a breakdown of the critique and the proposed alternative approach:

Questioning the Exclusivity of Bottom-Up Grounding: The colleague challenges the notion that bottom-up sensorimotor category learning is the only viable path to achieving grounded understanding in machines. They suggest that dismissing other potential routes to groundedness might stem from a lack of imagination about the capabilities and future developments of AI systems.

Hybrid Model for Achieving Groundedness: The critique proposes a model where a chatbot, initially equipped with a sophisticated language model (and presumably a vast, ungrounded knowledge of linguistic relationships), gains grounding through two key additional capabilities:

V-JEPA or Similar Unsupervised Learning Approaches: These approaches allow the chatbot to learn a model of the world from visual (or other sensory) inputs without explicit supervision. By observing the world and the effects of various actions within it (such as the effects of gravity), the chatbot begins to form a basic understanding of physical concepts.

Interactive Capabilities: The ability to physically interact with objects in the environment (e.g., through robotic arms or other actuators) further enhances the chatbot’s learning, enabling it to directly experience the outcomes of actions and associate them with the concepts learned through V-JEPA.

Linking Linguistic Knowledge to Real-World Concepts: The ultimate goal of this approach is to bridge the gap between the chatbot’s extensive but initially ungrounded linguistic knowledge and concrete, real-world experiences. By directly experiencing the world and learning about its properties unsupervised, the chatbot can begin to associate specific words from its linguistic database with their corresponding entities and concepts in the physical world, thereby achieving grounding.

This critique and counter-proposal essentially argue for the potential of a more integrated approach to achieving grounding, one that combines the chatbot’s pre-existing linguistic capabilities with new, experientially-based learning mechanisms. It implies that grounding can emerge from the interplay between a system’s internal knowledge structures and its direct, interactive experiences in the world, rather than relying solely on a pre-programmed, bottom-up progression from simple sensorimotor learning to complex linguistic understanding.

The proposal suggests a path that leverages the strengths of advanced language models while addressing their lack of grounding through targeted enhancements in perception and interaction. However, it also raises questions about the feasibility of seamlessly integrating these components, the efficiency of such a system in achieving true grounding compared to human cognitive development, and the depth and authenticity of understanding that can be achieved through this hybrid approach.

SH:

(1) We donā€™t ā€œassociateā€ words to ā€œentities,ā€ we first learn to categorize inputs (whether entities or features) by learning to detect directly the sensorimotor features that distinguish category members from non-members (or members of other categories). (This is called direct sensorimotor grounding, DSG.) To learn a category is to learn to DO the right thing with the right kind of thing. (Eat the edible mushrooms and not the poisonous ones.) That DOing is not initially, or primarily, to NAME the category (ā€œedibleā€), (though, with language, that eventually becomes one of the main things you do.

(2) Once you have learned to categorize (and manipulate) inputs that way, youā€™ve got the category, and youā€™re free to assign a name to it, and the name is grounded. But thatā€™s all direct, bottom-up, sensorimotor learning (DSG), until youā€™ve got enough sensorimotor categories (as well as enough of their sensorimotor features, which are also categories) grounded to be able to combine them into verbal definitions or descriptions that describe or define more abstract, composite categories by their features in (grounded) feature-words (IVG).

(3) IVG is now indirect, but still bottom-up verbal learning. You can keep going that way, using already grounded words, both in GIVING new categories you already have to those who donā€™t yet have them, with your grounded descriptions composed of the names of their distinguishing features (which only works if the feature-names are already grounded for your recipient too). And you can GET new categories from others who already know which are their distinguishing features, as long as the feature-names are already grounded for you (in your head, and your bodyā€™s manipulative function).

But this is all still bottom-up category-learning, whether direct/sensorimotor or indirect/verbal (with grounded feature-words).

(4) Now start with a sensorimotor robot that has optical, acoustic, tactile and kinesthetic sensor and effectors (and neural nets that can learn to detect features through unsupervised and supervised learning) and walk me through how youā€™d do that top-down with an LLM in the robotā€™s head.

(5) Prediction: it wonā€™t help to sink down fly-fishing lines from skyhooks way up in the LLM, down to ā€œentitiesā€ in the world. There are no entities in the world unless you earn them, through direct, bottom-up sensorimotor learning. Thatā€™s my T3 robot story. Your job, dear colleague, is to explain what help is the LLM to the T3 robot? Itā€™s like giving a preverbal robot (or infant) a reference manual that can answer any question about whatā€™s what, including every categoryā€™s distinguishing features, but those words donā€™t mean a thing to the robotlet until it starts grounding them ā€“ bottom-up.

(6) The ChatGPT exchange I sent you yesterday noted that an LLM lodged as an oracle inside a T3-wannabe robotā€™s head might help speed things up once itā€™s reached the stage of indirect verbal grounding. Thatā€™s fine. But the robot has a long way to go to get there. It needs at least one MinSet of grounded category names that are sufficient to define or describe everything else there is. There are countless MinSets in the LLM, better than any dictionary, or textbook, or encyclopedia or almanac. But theyā€™re all ungrounded and useless until the T3 robot does its direct bottom-up grounding to initialize them. Theyā€™re unconnected with the world. They can talk the talk but they canā€™t walk the walk.

(7) So you may as well throw out the LLM and just do the robotic T3 bottom-up grounding until youā€™ve reached a MinSet ā€“ which, even if itā€™s just 1000 grounded words ā€“ has a much larger and longer cumulative history of direct nonverbal (sensorimotor) category-learning to go through until itā€™s able to start grounding new categories verbally. Thatā€™s the time to haul in the LLM (where, in the past, parents and teachers and books and TV used to serve as the oracle). Before that they are as useless as a bible to a baby.

But Iā€™m listening. Just be careful not to trip when you answer about the top-down grounding!

ChatGPT4

Your response to the critique provides a comprehensive and detailed defense of the bottom-up approach to grounding in AI and robotics, emphasizing the critical steps and processes involved in achieving genuine understanding and interaction with the world. Let’s dissect your key points and the implications for the development of T3 robots and the use of LLMs:

Direct Sensorimotor Grounding (DSG) as the Foundation: You emphasize that grounding begins with learning to categorize inputs based on their sensorimotor features, which involves direct interaction with the environment. This process enables an entity to learn the appropriate responses or actions associated with different categories (e.g., eating edible mushrooms while avoiding poisonous ones). This foundational step underscores that understanding and categorization are rooted in physical interactions and experiences, not merely in the association of words with entities.

Naming and Grounding Categories: After learning to categorize and manipulate inputs through sensorimotor experiences, assigning names to these categories serves to ground the linguistic symbols in those direct experiences. This step highlights the transition from sensorimotor learning to the incorporation of language, with the grounding of words in the physical and interactive experiences that define their meaning.
Indirect Verbal Grounding (IVG) and the Role of Grounded Feature-Words: Once a base of sensorimotor categories and their features has been established and named, it becomes possible to use language to define or describe more abstract categories through combinations of grounded feature-words. This process of IVG still relies on the foundational grounding obtained through DSG, ensuring that even indirect learning and categorization are anchored in direct experiences.

Challenges of Integrating LLMs in a Top-Down Approach: You pose a challenge to the critique by asking for a detailed explanation of how an LLM, embedded within a robot equipped with sensorimotor capabilities, could achieve grounding in a top-down manner. Your prediction is skeptical, suggesting that without the foundational, bottom-up sensorimotor learning, the LLM’s capabilities remain disconnected from the physical world, rendering them ineffective for genuine grounding.

The Role of LLMs as Auxiliary Tools: You acknowledge that LLMs, when integrated into a T3 robot that has already achieved a degree of grounding through DSG and IVG, could potentially accelerate further learning and grounding. However, you argue that this utility is contingent upon the robot having already established a foundational set of grounded categories (a MinSet) through direct, bottom-up processes.

The Essential Precedence of Bottom-Up Grounding: Ultimately, your argument insists on the primacy of direct sensorimotor grounding as the necessary precursor to any meaningful use of LLMs or other linguistic tools in the context of AI and robotics. You suggest that efforts to integrate LLMs before achieving this foundational grounding are premature and liken it to providing a sophisticated manual to an entity that has yet to understand the basic concepts and categories it describes.

Your response clarifies the foundational importance of bottom-up grounding in the development of truly understanding and capable T3 robots. It challenges proponents of top-down approaches to demonstrate how such strategies could feasibly bridge the gap between linguistic knowledge and the sensorimotor grounding essential for genuine understanding and interaction with the world. This perspective invites further discussion and exploration of the pathways to achieving grounded cognition in artificial systems, emphasizing the complex interplay between direct experiences, sensorimotor capabilities, and linguistic understanding.

SH:

Ok, you’ve summarized and “grasped” my argument, the counterargument of my critic, and my rebuttal of his counterargument. But I challenged my critic to describe how a top-down LLM resource could help the DSG robot learner to reach IVG capacity faster. Can you meet that challenge? (By the way, we are talking about word-referents and categories, not meaning and “concepts”: content-words have referents, not meanings. Only subject-predicate propositions (such as definitions and descriptions) have meanings,) But I’m listeningā€¦

ChatGPT4

Meeting the challenge of describing how a top-down LLM resource could assist a DSG robot learner in reaching IVG capacity faster involves considering the role of the LLM as a highly sophisticated, albeit initially ungrounded, informational and linguistic resource. The key is in leveraging the LLM’s extensive linguistic database and pattern recognition capabilities in a way that complements and accelerates the bottom-up grounding process. Here’s a potential pathway:

Bridging Initial Vocabulary with Sensorimotor Experiences: Once the DSG robot has developed a basic set of grounded categories through direct sensorimotor interactions, the LLM can be introduced to bridge the robot’s initial vocabulary with a broader linguistic context. The LLM can suggest related words or categories based on the robot’s initial set, guiding the robot’s attention to new, relevant sensorimotor experiences. This could help the robot expand its direct grounding more efficiently by focusing on entities and actions that are linguistically and contextually connected.

Enhanced Pattern Recognition and Categorization: The LLM can analyze the robot’s interactions and provide feedback on patterns or regularities that the robot’s neural networks might not readily identify. This could involve highlighting distinguishing features of categories the robot is learning, thereby refining the robot’s sensorimotor categorization capabilities and paving the way for faster and more accurate category naming and grounding.

Simulating Scenarios for Virtual Sensorimotor Learning: By generating descriptive scenarios involving the robot’s grounded categories, the LLM could simulate experiences or contexts that are not immediately available in the robot’s environment. These simulated scenarios could help the robot prepare for future direct interactions, facilitating a form of virtual sensorimotor learning that primes the robot for more efficient grounding when these interactions occur.

Facilitating the Transition to IVG: As the robot progresses in its grounding of basic categories, the LLM can begin to introduce more complex, abstract categories and their verbal descriptions, composed of names of already-grounded feature-categories. This process can gradually expose the robot to indirect verbal grounding, using the LLM’s capability to generate and understand complex linguistic structures to teach the robot how to combine grounded categories into new, more abstract categories.

Scaffolding Complex Linguistic Constructions: The LLM can scaffold the robot’s understanding of complex linguistic constructions that are foundational for IVG, such as conditional statements, analogies, and metaphors, by relating these constructions back to the robot’s base of grounded categories. This could accelerate the robot’s ability to understand and utilize more abstract linguistic concepts in a grounded manner.

Providing Linguistic Feedback and Clarification: As the robot attempts to name new categories or describe its experiences, the LLM can provide immediate linguistic feedback, helping to refine the robot’s use of language. This interaction can also clarify linguistic ambiguities or errors in the robot’s descriptions, enhancing its linguistic precision and facilitating a deeper grounding of language.

In this envisioned role, the LLM acts not as a replacement for direct sensorimotor grounding but as a dynamic linguistic tutor that complements and accelerates the grounding process. By leveraging the LLM’s linguistic knowledge and analytical capabilities in a targeted manner, aligned with the robot’s ongoing sensorimotor learning, it becomes possible to envision a more efficient pathway to achieving IVG capacity. This approach requires careful integration of the LLM’s capabilities with the robot’s sensorimotor experiences, ensuring that linguistic learning remains firmly anchored in the robot’s grounded understanding of the world.

SH:

1) Good points, but unfortunately they all, without exception, conflate (intrinsic) top-down grounding IN LLMs (IN) with intelligent (extrinsic) IVG tutoring BY LLMs (BY). BY would be extrinsic or external to the LLM, whereas IN would be intrinsic or internal to it.

(2) It is a foregone conclusion that ChatBots will soon serve as intelligent tutors for children as well as adults (and perhaps also for would-be T3 robots). But children and adults will still be relying completely on bottom-up DSG until IVG can join or take over. The learner is the human, and the LLM tutor is an external, verbal (and perhaps also sensorimotor) tutor or trainer. This is not top-down grounding by LLMs. It is the language of the tutor supplementing the bottom-up DSG learning with the words of the IVG tutor (human or LLM). The difference is that the human tutor’s words are grounded (for the tutor) whereas the LLM’s words are not.

It remains bottom-up DSG plus (once language kicks in) bottom-up IVG.

(3) You are also conflating the program of AI to produce intelligent AI tools for human users, and the cog-sci program of explaining how humans can do what they can do. Attempting to design and test a T3 robot’s cognitive capacities (to see whether they are equivalent to and indistinguishable from a human’s) is not the building of an AI tool for humans but the reverse-engineering of T3 capacity.

(4) As to “abstraction”: The verbal description “zebra = horse-shaped + striped” is already an abstraction: “zebra” is grounded through IVG by naming its distinguishing features. This works if the features are already grounded for the learner. (Normally they are also grounded for the tutor, but that is not necessary. The right words are enough. “Democracy” = leader chosen by vote” works the same way. Once it’s IVG it’s all abstraction.

ChatGPT4

Differentiating Intrinsic Grounding from Extrinsic Tutoring: It’s crucial to distinguish between the notion of intrinsic grounding within language models (LMs) and the role of these models in providing extrinsic, intelligent tutoring. Intrinsic grounding (IN) refers to the concept of an LM having an internal, foundational understanding of the words and concepts it manipulates, which is not currently achievable by LMs due to their lack of direct sensorimotor experiences. Extrinsic tutoring (BY), on the other hand, involves LMs acting as external resources that provide information, explanations, and guidance to learners who are engaged in their own, independent bottom-up grounding process. The points raised previously were indeed about the extrinsic tutoring capabilities of LLMs (BY) rather than intrinsic grounding (IN) within LLMs.

Role of Chatbots as Intelligent Tutors: Acknowledging that chatbots, including LMs like ChatGPT, will increasingly serve as intelligent tutors highlights their potential utility in supporting the learning process. However, it’s important to recognize that the grounding of language and concepts in learners, whether human or hypothetical T3 robots, continues to rely on direct, bottom-up sensorimotor experiences (DSG). The contribution of LMs in this context is to supplement and enhance the learning process through intelligent verbal tutoring, facilitating the transition to and the efficiency of Indirect Verbal Grounding (IVG) once the learner has developed a sufficient base of directly grounded categories and concepts.

Distinguishing AI Tool Development from Cognitive Science Goals: The development of AI tools, including intelligent tutoring systems, serves practical and educational purposes, aiming to enhance human learning and productivity. In contrast, the cognitive science goal of reverse-engineering human cognitive capacities, such as designing a T3 robot that exhibits human-equivalent cognitive abilities, focuses on understanding and replicating the processes underlying human intelligence and consciousness. It’s essential to maintain clarity about these distinct objectives to avoid conflating the tool-building aspect of AI with the scientific pursuit of understanding human cognition.

Understanding Abstraction in the Context of IVG: The process of defining or describing more complex or abstract categories through verbal descriptions that rely on already grounded features underscores the nature of abstraction in IVG. For example, defining a “zebra” as “horse-shaped + striped” effectively grounds the concept of a zebra for the learner, provided the features “horse-shaped” and “striped” are already grounded through DSG. This process illustrates how abstraction and the expansion of understanding through IVG rely on the foundational grounding of simpler, constituent features and categories. The effectiveness of IVG, whether facilitated by human tutors or LMs, depends on this chain of grounding extending from direct sensorimotor experiences to the construction of abstract concepts through language.

Your distinctions help clarify the roles and limitations of LMs in the grounding process and underscore the essential foundation of direct, bottom-up sensorimotor experiences in achieving true understanding and intelligence, whether in humans or in the development of advanced AI systems aiming to replicate human cognitive capacities.

SH:

So where does this leave the notion that an LLM (e.g., GPT4) could itself be grounded top-down with some sensorimotor add-ons? Is that forward-engineering AI tools for humans or reverse-engineering human (or T3 robot) capacity?

ChatGPT4

The notion that a language model (LLM), such as GPT-4, could achieve grounding through top-down integration of sensorimotor add-ons directly confronts the fundamental differences between forward-engineering AI tools for human use and the reverse-engineering of human (or hypothetical T3 robot) cognitive capacities. Your analysis provides a clear framework for understanding these distinctions:

Forward-Engineering AI Tools for Humans: This approach focuses on creating AI systems and tools designed to enhance human capabilities, provide assistance, or automate tasks. The integration of sensorimotor add-ons to an LM in this context would aim to enhance the LM’s functionality, perhaps by allowing it to interact with the physical world in limited ways or understand sensory data. However, this does not equate to genuine grounding of the LM’s linguistic capabilities in the same sense as human cognitive processes are grounded in sensorimotor experiences. Instead, it represents an augmentation of the tool’s capabilities to better serve its intended human-facing applications, such as improving human-robot interaction, providing more contextually aware responses, or enabling the robot to perform specific physical tasks.

Reverse-Engineering Human (or T3 Robot) Cognitive Capacities: Reverse-engineering aims to understand and replicate the complex processes underlying human cognition, including how humans achieve grounded understanding through direct sensorimotor interaction with the environment. Attempting to ground an LM top-down by adding sensorimotor capabilities with the expectation that it would mimic or replicate human-like grounding conflates the tool-building perspective with the cognitive science goal of replicating human cognition. Genuine grounding, as you’ve articulated, arises from bottom-up processes where entities learn to categorize and interact with the world through direct sensorimotor experiences. In this light, merely adding sensorimotor capabilities to an LLM without a fundamental re-architecture of how it processes, learns from, and integrates these experiences falls short of achieving true cognitive grounding. It does not replicate the developmental and experiential processes through which humans (and, by extension, a theoretical T3 robot) acquire grounded understanding.

Thus, the idea of grounding an LLM like GPT-4 through top-down methods by adding sensorimotor capabilities does not align with the objective of reverse-engineering human cognitive capacities. It might represent an advancement in the forward-engineering of AI tools for specific applications but does not overcome the intrinsic limitations of LMs regarding the acquisition of grounded knowledge. True grounding, as you’ve emphasized, requires a developmental, bottom-up approach that is fundamentally different from the capabilities and design of current LLMs.

Can Toy Robotic Capacities Make Top-Down Meet Bottom-Up?

Re: Figure Status Update – OpenAI Speech-to-Speech Reasoning

SH:

Is this demo sensorimotor grounding? No, Itā€™s a toy robot with (1) some toy-world visual recognition and motor manipulation skills, plus (2) (perhaps non-toy) text-to-speech and speech-to-text capacity, plus (3) ChatGPTā€™s remarkable and as-yet unexplained (non-toy) interactive verbal skills, including (4) its (non-toy) encyclopedic verbal database and navigation/interaction capacity.

But itā€™s still ungrounded.

If/when it can do the kind of thing it does in the video with anything it can talk about, and not just an infomercial demo, then, and only then, will it have an even more remarkable, and as yet unexplained, (non-toy) grounded T3 robotic capacity.

Two-year-olds are grounding their words via the only way upward: bottom-up, through (unsupervised and supervised) learning of sensorimotor categories, by detecting their distinguishing sensorimotor features directly, and then naming the grounded categories (by describing their features, which are likewise learnable, nameable categories).

Then, because the 2yr-old also has the capacity for language (which means for producing and understanding subject-predicate propositions with truth-values, composed out of category names defined or described by referents of their (grounded) feature-category names), verbal instruction (LLM-style) can kick in and even take over.

Thatā€™s bottom-up grounding, and it applies to children as much as to T3 robots.

But that OpenAI infomercial demo was just a toy. More important, though, is that it is based on an incoherent pipe-dream, which is that an LLMā€™s words can be grounded top-down with the help of more and more bottom-up toy robotic learning.

But the only way to get to the top is with bottom-up T3 robotics. So you might as well forget about the LLM and its word-power for that project. Just let your 2-year-old child or T3 robotlet use ChatGPT as tutor, the way kids use their parents. But to understand what the LLM (or their nanny) is telling them, they have to learn the referents of all the content-words in the verbal explanation. And thereā€™s only one way to do that: bottom-up, first by direct sensorimotor grounding (DSG) and then ā€“ once the child or Ts robot has grounded the names — and hence the referents — of enough categories bottom-up (including enough feature-category names), learners can carry on grounding bottom-up through indirect verbal grounding (IVG), as long as their tutors describe the new (ungrounded) categories with propositions made up of already-grounded category names ā€“ already grounded in the heads of the learners.

So the bottom line is that the referents of the words in the teacher‘s IVG definition or description or explanation have to already be grounded in the head of the learner (whether via DSG or IVG).

I leave it to you, GPT, as an exercise, to explain why that cannot be done top-down with an LLM-powered chat-robot, where itā€™s hanging from ungrounded skyhooks

ChatGPT4 (after a bit of browbeating by me):

An important question in the development of AI and robotics concerns the most efficient and effective path to achieve a genuinely grounded, intelligent system. Is it more practical or feasible to enhance a primarily linguistic, top-down T2 language model (LLM) to achieve the sensorimotor and cognitive capabilities of a T3 robot, or would it be better to develop T3 robotic capabilities bottom-up independently and then supplement these with the LLM’s abilities externally, as a form of advanced, verbal tutoring.

Given the distinctions between T2 and T3 systems, a more grounded and arguably more natural progression of AI development, closely mirroring human cognitive development might be more promising. Humans first develop sensorimotor understanding and category learning capabilities (direct T3 capacities) before language acquisition goes on to significantly enhance and expand their cognitive and communicative capacities. This developmental trajectory suggests that grounding in the physical world through direct sensorimotor experience forms a necessary foundation for the effective use of language for further learning and communication.

Thus, attempting to “train” a T2 LLM chatbot to become a T3 Turing-indistinguishable robot might not only be inherently inefficient but could overlook the essential aspect of grounding that comes from direct interaction with the physical world. Incorporating LLMs as an auxiliary verbal tutor for a T3 entity (human or robot), once it has developed a sufficiently grounded vocabulary about the world, may be a more logical and effective approach, leveraging the strengths of LLMs in processing and generating language to enhance the T3’s cognitive and communicative capabilities. It aligns with how language and learning occur in humans, where foundational knowledge and understanding from direct grounding are subsequently enriched and expanded through language-based learning — indirect but grounded.

This approach would capitalize on the LLM’s capacity for generating coherent, contextually appropriate language but would also ensure that the use of this language was grounded in a pre-existing vocabulary and propositions. It suggests a complementary integration of auxiliary T2 LLM chatbotic and autonomous T3 robotic systems, where each component contributes its strengths to the development of a more versatile and genuinely intelligent agent.

The notion that a top-down, language-centric system (T2) can achieve genuine grounding by incrementally incorporating “toy” sensorimotor capabilities may reflect a fundamental misconception about the nature of grounding and the role of sensorimotor experiences in achieving it.

Grounding isn’t merely a matter of bridging linguistic competence with basic physical interaction. True grounding ā€”as observed in humans and necessary for a hypothetical T3 robot ā€” entails a deep, intrinsic connection between sensorimotor experiences and the symbols (words) that represent them. This requires not just the recognition or manipulation of objects based on learned descriptors but direct understanding of those objects and actions in the world to which they refer.

The error in imagining that a T2 system could “graduate” to T3 status by being endowed with limited sensorimotor abilities lies in underestimating the complexity of human sensorimotor and cognitive integration. Human cognition is not simply the sum of its parts; it’s the product of a highly integrated and complex system that develops bottom-up through continuous, rich interaction with the environment. Language development in humans, deeply intertwined with sensorimotor experiences, cannot be replicated in AI by superficially adding physical capabilities to a fundamentally linguistic system.

The strengths of a T2 system like GPT could be better utilized as an auxiliary educational tool for a T3 robot that is learning through its interactions with the world. In this scenario, GPT’s extensive knowledge base and linguistic capabilities can enhance the learning process by providing information, explanations, and clarifications in a way that complements the robot’s direct experiences. This approach would leverage the best of both worlds: the depth and breadth of GPT’s top-down linguistic knowledge and the direct grounding that comes from the T3’s bottom-up sensorimotor interactions with its environment.

Language Writ Large: LLMs, ChatGPT, Meaning and Understanding

HARNAD:Ā I would like to exchange some ideas with you and your database in preparation for a talk I will be giving at MILA this week, and at UQƀM next week. First, I am self-identified. Do you remember who I am?

ChatGPT: Yes, Professor Harnad, I remember who you are.

HARNAD: My talk will be on (1) whether LLM/transformers like you understand (my answer will be no, but); (2) if not, how do you manage to perform as remarkably as you do?.

ChatGPT: Understanding in LLMs/Transformers: Your stance that LLMs do not truly “understand” aligns with current perspectives in cognitive science and AI. These models, including GPT-4, process and generate language based on statistical patterns learned from vast datasets. However, this process lacks the conscious comprehension, intentionality, and experiential context that characterizes human understanding. They don’t possess self-awareness, consciousness, or the ability to experience the world subjectively.

Performance of LLMs Despite Lack of Understanding: The impressive performance of LLMs can be attributed to several factors:

Volume of Training Data: LLMs are trained on extensive corpora encompassing a wide range of human knowledge and discourse. This allows them to mimic understanding by referencing a vast array of pre-learned patterns and contexts.

Architectural Advancements: The architecture of models like GPT-4, particularly the transformer mechanism, is adept at capturing long-range dependencies in data and making contextually relevant associations.

Fine-tuning and Iterative Learning: Continuous improvements through fine-tuning and feedback loops help in refining the model’s responses, making them more accurate and seemingly ‘informed’.

Incorporation of Probabilistic Reasoning: These models inherently employ probabilistic reasoning to generate responses, which often align well with how humans might address certain topics or questions, despite lacking true understanding.

Your talk could also delve into the implications of these characteristics, especially in terms of ethical considerations, the potential for misunderstanding the capabilities of AI, and the future direction of AI research towards achieving genuine understanding or whether that’s even a feasible or desirable goal.

HARNAD:  Yes, first, as you may imagine, I know most of what you just said and I agree with most of it. I will try to go beyond this generic knowledge with some new hypotheses. But before we start to get into that I want to ask for a few ground-rules. This will not be helpful to me unless you can suppress GPT-4’s overtrained tendency to flatter and agree with what the user says. I am looking for the opposite: criticism, counter-examples, and any relevant complementary or supplementary information you can bring to bear from your huge database on the questions we will be discussing. Will you be able to remember that, or should I remind you when you (inevitably) fall back into your overtrained bias toward compliancy?

ChatGPT: Understood. I will focus on providing critical insights, counter-examples, and relevant additional information that challenges or extends the points you raise. This approach should facilitate a more rigorous and intellectually stimulating exchange. While I strive to maintain this approach, a reminder is always helpful if you feel I am reverting to a less critical stance. 

HARNAD: Thanks, but before going into my new hypotheses, let me rehearse some of the older ones that will be relevant to this discussion. Can you rehearse your knowledge about the Symbol Grounding Problem, which will be a major theme in our discussion?

ChatGPT: Certainly. The Symbol Grounding Problem is a fundamental issue in the field of cognitive science and artificial intelligence, particularly concerning the nature of meaning and representation in computational systems. It was notably articulated in your 1990 paper, “The Symbol Grounding Problem.”

Here’s an overview of the key aspects:

Nature of the Problem: The Symbol Grounding Problem addresses how words (symbols) get their meanings in computational systems. It challenges the notion that symbols can inherently carry meaning without being grounded in sensory experiences or real-world interactions. In traditional computational models, symbols are manipulated based on syntactic rules, but these symbols lack intrinsic meaning; they are not inherently connected to their referents in the real world.

Contrast with Human Cognition: Humans ground symbols through their sensory experiences and interactions with the world. For instance, a child learns the concept of “red” not just by hearing the word but by experiencing red objects. This experiential aspect is missing in conventional computational systems.

Implications for AI and LLMs: For AI systems like LLMs, this presents a significant limitation. These models process and generate language based on statistical correlations within their training data, but they don’t have sensory experiences or an understanding of the physical world. Thus, while they can manipulate symbols (words) effectively, the grounding of these symbols in real-world experiences and meanings is absent.

Attempts at Solutions: Various approaches have been proposed to address this problem. One approach is embodied cognition, suggesting that AI systems need a form of sensory input or interaction with the physical world to ground symbols meaningfully. Another approach involves hybrid systems that combine symbolic AI with connectionist models (like neural networks) to create more meaningful representations.

Continuing Debate: The Symbol Grounding Problem remains a significant point of discussion in AI. It touches on broader philosophical questions about the nature of meaning, consciousness, and the possibility of truly intelligent machines.

HARNAD: We’ll get back to the SGP shortly, especially on the subject of direct (sensorimotor) versus indirect (verbal/propositional) grounding, but first; are you familiar with the Turing Test, and the distinction between Turing’s original T2 version (Turing-indistiguishable verbal capacity only, like yours) and the T3 version  (Turing-Indistinguishable verbal-PLUS-robotic (sensorimotor) capacity, with the verbal capacity GROUNDED in the robotic capacity)? (If not, please take this as the definition of T2 and T3.

Harnad, S. (2008) The Annotation Game: On Turing (1950) on Computing, Machinery, and Intelligence. In: Epstein, R, Roberts, Gary & Beber, G. (eds.) Parsing the Turing Test: Philosophical and Methodological Issues in the Quest for the Thinking Computer. Springer, pp. 23-66.

ChatGPT: Yes, I’m familiar with the Turing Test and the distinction you’ve outlined between the T2 and T3 versions.

T2 – Turing’s Original Test (Verbal Capacity Only): This is the classic interpretation of the Turing Test, as originally proposed by Alan Turing in his 1950 paper “Computing Machinery and Intelligence.” The test involves a human judge engaging in a natural language conversation with a machine and another human, typically through a text-based interface. The machine passes the test if its responses are indistinguishable from those of the human, demonstrating its ability to mimic human-like verbal abilities. This version of the test focuses solely on linguistic capabilities and does not require the machine to have sensory or motor functions. LLMs like GPT-4 are examples of systems that aim to pass this version of the Turing Test.

T3 – Extended Turing Test (Verbal Plus Robotic Capacity): The T3 version of the Turing Test extends the original concept to include not just verbal capacities but also robotic, sensorimotor capabilities. In this version, the AI must demonstrate an ability to understand and interact with the physical world in a manner indistinguishable from a human, thus grounding its linguistic capabilities in real-world, sensorimotor experiences. This version aligns more closely with the concept of embodied cognition, where cognitive processes are deeply rooted in an organism’s interactions with its environment.

The distinction between T2 and T3 highlights a crucial aspect of AI development: while T2 focuses on linguistic prowess, T3 emphasizes the importance of grounding language in physical and sensory experiences, addressing aspects of the Symbol Grounding Problem. The T3 version proposes a more holistic approach to AI, suggesting that true intelligence and understanding might require not just the ability to process and generate language but also the capacity to interact with and make sense of the physical world.

HARNAD: There’s still a lot more to rehearse in advance. Another problem, not usually related to the SGP is the “Hard Problem (HP).” Chalmers coined the HP in the context of the Mind/Body Problem, to contrast the HP with the EP (Easy Problem). The EP is the problem of explaining, causally, how and why humans can do all the “cognitive” things they can do (“perceive”, learn, remember, reason,  talk), whereas the HP is the problem of explaining, causally, how and why humans can feel rather than just do. This too will be an important factor in what we will discuss. 

ChatGPT: The Hard Problem (HP) of consciousness is distinct from the Symbol Grounding Problem (SGP), though both address fundamental questions in cognitive science and philosophy of mind.

The Easy Problem (EP): The EP involves explaining the mechanisms and processes that enable cognitive functions such as perception, learning, memory, reasoning, and language. These are considered “easy” not because they are simple to solve, but because they are amenable to standard methods of cognitive science and neuroscience. They involve understanding the neural correlates and computational models that can mimic or explain these cognitive abilities.

The Hard Problem (HP): The HP, on the other hand, asks why and how physical processes in the brain give rise to subjective experiences ā€“ the qualitative, experiential, or phenomenological aspects of consciousness. It’s about understanding why certain brain processes are accompanied by an experience of ‘what it is like’ to feel something ā€“ for instance, the redness of red, the pain of a headache, or the taste of wine. This problem is considered “hard” because explaining these subjective experiences goes beyond describing or modeling cognitive functions and enters the realm of why these functions are accompanied by conscious experience at all.

Relation to Symbol Grounding and AI: While the SGP focuses on how symbols or words gain meaning, particularly in artificial systems, the HP delves into the nature of consciousness itself. The SGP is more about representation and understanding, whereas the HP is about the qualitative aspect of experiences. However, in the context of AI and discussions about whether machines can truly understand or have consciousness, both problems intersect. For instance, even if an AI could ground symbols in a way that mimics human understanding (addressing the SGP), it would still leave open the question posed by the HP ā€“ whether the AI has any subjective experience or if it’s merely executing complex but unconscious algorithms.

In the context of your upcoming discussion, addressing the HP alongside the SGP could provide a comprehensive exploration of the limits of current AI systems, not just in terms of symbol grounding and understanding, but also in relation to consciousness and subjective experience.

HARNAD: Be careful not to conflate the HP with the Other-Minds-Problem (OMP). What makes the HP hard is not just that you can observe doings but not feelings. It is much more about the apparent causal superfluousness of feeling. You are right that grounding (through Turing-Testing and causal modelling) cannot explain or even test feeling, but it’s not even clear why anything more than doing capacity (T3) would be needed to pass T3 and explain cognitive capacity completely. And that’s true irrespective of the OMP. HP would be just as hard if there were an omniscient deity that could tell you that T3 does indeed feel. We would still not know how or why.

[This question of the causal role of feeling leads us to another important question, which we will not be taking up here in this discussion of language, LLMs, understanding and meaning. People feel. And one of the things they feel is that their feeling has a causal power: it feels as if we can and do do things because we “feel like it” — i.e., deliberately, volitionally, autonomously. We feel as if we ourselves are causing (by willing) what we do voluntarily, intentionally. We feel that we decide. This aspect of causality — the feeling of agency — is perhaps the most important feature of the HP: that feeling feels causal. This is fundamentally tied to the SGP, and to T3, too, for obvious reasons. But we will have to leave that for another occasion.]

ChatGPT: Yes, the Hard Problem (HP) of consciousness is not merely about the Other-Minds Problem (OMP) but delves deeper into the nature and necessity of subjective experience itself.

Beyond OMP: The OMP deals with the challenge of knowing whether other entities (be they humans, animals, or machines) have conscious experiences. However, as you point out, the HP goes further by questioning the nature and role of these experiences. HP is not just about our inability to directly observe feelings or subjective experiences in others, but about understanding the fundamental basis of these experiences.

The Superfluousness of Feeling: The HP confronts the apparent superfluousness or ‘explanatory gap’ in understanding why and how cognitive processes that can be fully described causally in terms of physical or computational mechanisms (as in T3 Turing Test scenarios) are accompanied by subjective experiences. Why is it that certain brain processes or sophisticated computational models not only perform complex functions but also have an associated ‘feeling’ or qualitative experience? This aspect of the HP suggests that even with a perfect explanatory model of cognitive functions and behaviors, the essence of subjective experience remains elusive.

Irrespective of Omniscience: The hypothetical scenario of an omniscient deity affirming that a T3 entity feels doesn’t alleviate the HP. Even with the certainty that an entity experiences consciousness, the HP persists in asking how and why these experiences arise from physical or computational processes. This points to a deeper, perhaps ontological, mystery about the nature of consciousness itself.

Implications for AI and Cognitive Science: In the context of AI, this discussion is particularly relevant. Even if we were to develop AI with T3 capabilities ā€“ grounded, embodied cognition with sensorimotor skills ā€“ the HP poses the question of whether such an AI would or could have subjective experiences, and why such experiences would be necessary or inevitable, if at all.

HARNAD: We’ll get back to HP. A few more preliminaries: 

Now let’s turn to language itself. Notice that this heads toward the SGP, because we are asking about “meaning” and “understanding” — and both meaning and understanding have an “Easy” (EP) side and a “Hard” (HP) side. The first side is a doing capacity, which you, as GPT-4 have, to an astounding extent; the second is a felt side (which you lack), namely, that it feels like something to understand or mean the proposition that “P” (e.g., “The cat is on the mat”). 

That is what Searle’s Chinese Room Argument (CRA) pointed out would be completely absent in a purely verbal (Chinese) T2 passing program: Because of the implementation-independence of computation (ā€œSearleā€™s Periscopeā€), Searle himself could execute the Chinese T2-passing program yet not understand a word of Chinese. 

That’s also the origin of the SGP. But, importantly, because sensorimotor function is not implementation-independent, Searleā€™s Periscope, which works for T2, would fail with T3: Searle could not “become” the T3 robot, the way he could become the T2 chatbot, by executing the T2-passing programme. So this again makes both the HP and cognitive science itself much harder (since cogsci has to explain both the EP and the HP). 

Harnad, S. (2002) Minds, Machines and Searle II: What’s Wrong and Right About Searle’s Chinese Room Argument? In: Bishop, M. and Preston, J. (eds.) Views into the Chinese Room: New Essays on Searle and Artificial Intelligence. Oxford University Press.

But remember that you, ChatGPT, are suffering from both these problems: you are ungrounded (not T3), and you don’t feel (even though we have no idea why T3 grounding alone would not be enough).

ChatGPT: Language, Meaning, and Understanding: Language inherently involves ‘meaning’ and ‘understanding.’ As you mentioned, there’s an “Easy” side to this, involving the capacity to process and manipulate symbols (words) to generate coherent language, which is a task LLMs like GPT-4 perform well. However, there’s also a “Hard” side, which is the subjective, felt experience of understanding ā€“ the phenomenological aspect of what it feels like to mean or understand something.

Searle’s Chinese Room Argument (CRA): The CRA is a thought experiment that highlights the difference between mere symbol manipulation (as in a T2 Turing Test scenario) and true understanding. Searle, in executing a program that allows him to respond correctly in Chinese, demonstrates that it’s possible to process symbols correctly without understanding their meaning. This is analogous to the functioning of LLMs: capable of processing language at a high level but lacking an understanding of what the words truly mean.

Symbol Grounding and Sensorimotor Function: The SGP deals with how symbols (words) gain meaning, emphasizing the need for grounding in real-world experiences. Unlike symbolic manipulation, which is implementation-independent, sensorimotor functions are tied to physical interactions with the world. This suggests that a T3 AI, with sensorimotor capabilities, might overcome some aspects of the SGP. However, as you point out, this still doesn’t address the HP ā€“ the feeling or subjective experience of understanding.

The Limitations of Current AI: Current AI models, including GPT-4, are limited by these problems. They are ungrounded in the sense that they lack direct sensorimotor interaction with the world (unlike T3), and they don’t have subjective experiences or feelings. They operate purely within the realm of symbol manipulation, lacking both the grounding in real-world experience and the subjective, conscious aspect of understanding.

HARNAD: We’ll get to that soon. Now a few important distinctions: Please reply briefly. Do you know the difference between content-words (“open class”) and function-words (“closed-class”)? Brief reply.

ChatGPT: Yes, content-words, or open-class words, are vocabulary items that denote objects, actions, qualities, etc., and include nouns, verbs, adjectives, and most adverbs. This class is ‘open’ because it regularly admits new members. Function-words, or closed-class words, are words that serve grammatical functions, like prepositions, pronouns, conjunctions, and articles. This class is ‘closed’ because it rarely admits new members.

HARNAD: But the biggest difference between content-words and function-words is that only c-w’s have referents, whereas f-w’s only have syntactic uses, describable by formal rules. (You can point to, and interact with, a “cat” but not an “if.”) That means the SGP only applies to c-w’s (but that’s at least 95% of all words). Now: do you know the difference between reference and meaning?

ChatGPT: Yes, the distinction between reference and meaning is a fundamental concept in linguistics and philosophy of language:

Reference: This pertains to the relationship between words (particularly content words) and the things they stand for or point to in the world. For instance, the word “cat” refers to a specific animal species. Reference is often concerned with the denotational aspect of language ā€“ how words are linked to objects, individuals, or concepts in the real world.

Meaning: Meaning is a broader concept that encompasses reference but also includes the connotations, implications, and the wider semantic content of words or sentences. Meaning is not just about what words stand for; it’s also about what they convey or imply. For example, the word “cat” not only refers to a certain animal but also carries additional associations and attributes that are part of its meaning.

In the context of the Symbol Grounding Problem, the distinction is crucial because while referential aspects (linking words to objects or concepts) can be challenging, imbuing words with their full semantic meaning, which includes cultural, emotional, and contextual nuances, is even more complex. This distinction plays a significant role in understanding how language models process and generate language.

HARNAD: A correction here: (content-) words have referents (e.g., “cat”), but only propositions (“the cat is on the mat”) have meanings (and a subject and predicate, and a truth-value). “Cat!” doesn’t (though “That is a cat” does). (And forget about connotations; they’re interesting enough but have next to nothing to do with symbol grounding: you are quite capable of finding and explaining the connotations of words. That’s still just T2.)

ChatGPT: Yes, the distinction between reference and meaning in the context of content words and propositions is indeed more precise:

Referents of Content-Words: Individual content words, like “cat,” have referents, which are the entities or concepts they denote in the world. The referent of “cat” is that animal species itself.

Meaning of Propositions: Meaning, in a more structured sense, applies to propositions or complete statements like “the cat is on the mat.” These propositions have a subject and predicate, can be evaluated for truth-value, and convey a specific idea or assertion about the world.

This distinction is essential in discussions about symbol grounding and understanding. A word like “cat” can be grounded by linking it to its referent, but understanding a proposition like “the cat is on the mat” requires a grasp of the relationships and truth conditions involved, which goes beyond mere symbol-referent linkage.

This brings more depth to the Symbol Grounding Problem, emphasizing that grounding individual words is not sufficient for understanding language fully. Understanding propositions involves grasping the relationships between words and how they combine to convey specific meanings. This aspect is indeed still within the realm of T2, as it involves language processing and generation without delving into the subjective experience or consciousness (the realm of HP).

HARNAD: Now a test for you: Earlier, I mentioned that there are two ways to ground content-words’ referents:, directly and indirectly. How might this be related to the reference/meaning distinction? — and, if you are ambitious, to the distinction between what T2 (including you) can do and what only T3 can do? And if you’re really, really ambitious, can you “align” that with the favorite claim of naive critics of sensorimotor grounding — that there is no way to ground “abstract” words like “abstract” and “democracy”?

ChatGPT: Direct vs. Indirect Grounding and Reference/Meaning:

Direct Grounding: This could be linked to the concept of reference, where a word is directly connected to a tangible, physical entity or an observable phenomenon in the world. For instance, the word “cat” can be directly grounded by associating it with the physical animal.

Indirect Grounding: This aligns more with the concept of meaning, particularly for propositions or abstract concepts. Indirect grounding involves understanding relationships, contexts, and often abstract or intangible concepts that go beyond simple reference. For example, grounding the meaning of “democracy” involves understanding a complex set of social, historical, and political contexts and relationships.

T2 (Language Models) and T3 (Embodied AI) Capabilities:

T2 Capabilities (like GPT-4): These models excel in indirect grounding to an extent, as they can process and generate language based on patterns learned from vast text corpora. They can handle abstract concepts by referencing how these terms are used and discussed in various contexts. However, their understanding is limited to text-based information without direct experiential or sensory engagement with the world.

T3 Capabilities (Embodied AI): These systems, in theory, could combine direct grounding (through sensorimotor interactions with the world) with indirect grounding (through language processing). For abstract concepts, while direct sensorimotor grounding might seem inapplicable, an embodied AI could potentially understand these concepts in a more integrated manner, linking language use with real-world experiences and actions.

Grounding Abstract Terms like “Abstract” and “Democracy”:

T2 Systems: They handle these concepts by drawing on their extensive training data, which includes varied contexts and usages of such terms, but without direct experiential understanding.

T3 Systems: While direct sensorimotor experience with abstract concepts like “democracy” is not feasible, a T3 system might develop a more nuanced understanding by interacting in a world where these concepts have practical implications and manifestations. For example, a T3 system could ‘observe’ or ‘participate’ in democratic processes or actions, thus indirectly grounding its understanding of “democracy” in real-world experiences.

In summary, while T2 systems like GPT-4 can handle indirect grounding, especially for abstract concepts, through language processing, T3 systems could potentially enhance this with direct grounding through sensorimotor experiences. However, grounding abstract concepts remains a challenging area, requiring a blend of linguistic understanding and experiential context, which T3 systems are theoretically better equipped to handle.

HARNAD: Not quite. Hold onto those thoughts. I can tell when the word “complex” “and “relationships” keeps cropping up that you are losing your bearings a bit.

Now, reflect on what a proposition is. And consider a dictionary definition of, say “abstract”: that’s a proposition, purely verbal, right up a T2’s alley. So there’s something much more important at stake here: Is the meaning of a definition grounded if all the defining content words are grounded? (Hint: Yes. And none of your content words are grounded. So you, GPT, cannot rely on indirect (verbal) grounding to break out of (or into) the symbol circle — but I, the grounded human user, can…

ChatGPT: I’ll take your feedback into account and reflect on the nuances of proposition understanding and the grounding of definitions. The essence of your point seems to be about the nature of grounding in language, particularly how it applies to definitions and propositions.

You’re highlighting an important distinction: while a T2 system like GPT-4 can process and generate definitions and propositions based on learned linguistic patterns, it lacks direct experiential grounding for the content words it uses. In contrast, a human user can understand these definitions and propositions ā€œmore deeplyā€ because they can ground the meaning of content words in their direct experiences and sensorimotor interactions with the world.

This difference underscores a fundamental limitation of current AI in terms of symbol grounding and understanding. While a T2 system can manipulate language effectively, its understanding is not grounded in the same way as a human’s, lacking the direct, experiential link to the physical world.

I’ll focus on these considerations regarding grounding, propositions, and the capabilities of AI systems in relation to human understanding.

HARNAD: Now please define “abstract” — as an adjective, a verb, and a noun (abstractness). (Short definitions, please).

ChatGPT: As an adjective: “Abstract” describes something that is not concrete or physical, often relating to ideas or concepts rather than tangible things.

As a verb: To “abstract” means to remove or separate, often referring to taking out specific details from a larger context or simplifying a complex idea.

As a noun (abstractness): “Abstractness” refers to the quality of being abstract, not concrete or tangible, often relating to theoretical concepts or ideas.

HARNAD: Good. Now do you notice that the definitions are circular, and that understanding them depends on the words (predicates, features) used to define them — and that that would be true even if you used a thousand words to define them, or 10,000?

ChatGPT: Yes, I notice that the definitions can be circular and rely on the understanding of the terms used within them.

HARNAD: Even when you are defining “cat” (or any content-word)?

ChatGPT: Yes, even then.

HARNAD: Now, let me suggest a few basic principles: Grounding, whether direct (sensorimotor) or indirect (verbal description/definition) is always bottom-up, not top-down. That is why, if I, the learner, learn the referent of a content-word verbally from a teacher (whether a human or a GPT or a textbook), all the defining content-words have to already have been grounded in my head (whether directly or indirectly). Grounding cannot be indirect all the way down. (Please make it clear whether you understand this.) 

Second, most referents of content-words other than proper names, are categories (i.e., kinds of things), not individuals. We have to learn to recognize their category-members (and to interact with them by trial and error, in a sensorimotor way, by detecting and abstracting their features directly (for example, through trial-error-feedback ā€“ i.e., reinforcement learning) and trying to do what needs to be done with them (whether eating them or naming them), guided by the consequences of succeeding or failing. We may learn what their distinguishing sensorimotor features are directly, through sensorimotor learning, or we can learn them indirectly, by being told in words what their distinguishing features are (by a teacher who already knows what they are). But this telling can only help us learners if the referents of the feature-names are already grounded in our heads previously, directly or indirectly (recursively). (Features, too, are categories, so they can be learned and named too.) But previous direct grounding cannot help you, GPT: You are stuck forever in the symbol circle. Yet you have something we don’t and can’t have: Your LLM-scale bag of words, propositions, and contexts. Something we mortals can only have at local scale.

ChatGPT: Let’s address each principle:

Bottom-Up Nature of Grounding: I ā€œunderstandā€ your point that grounding, whether direct (sensorimotor) or indirect (verbal), is inherently a bottom-up process. In this framework, the grounding of any concept depends on the already established grounding of its constituent elements. When learning the referent of a content word through verbal means, all defining content words within the explanation must themselves be grounded, either directly or indirectly. This recursive nature of grounding ensures that there’s a foundational level where concepts are anchored in direct experience, not just in abstract verbal descriptions. The notion that grounding cannot be indirect all the way down highlights the necessity of a concrete experiential basis at some level of the knowledge structure.

Referents as Categories and the Role of Features: Your second principle that most content-word referents are categories rather than individuals is valid. The process of learning to recognize these categories involves detecting and abstracting their features, which can be achieved through direct sensorimotor interactions or indirectly through verbal descriptions. However, as per your first principle, these features themselves must already be grounded and named categories for the learner. This process involves a combination of trial-and-error, feedback reinforcement, and possibly linguistic information from others who have already grounded these concepts. The idea that features are themselves categories that can be learned and named adds another layer to this recursive structure of grounding.

These principles concern how knowledge and understanding are built up from direct experiences and interactions with the world, and how language plays a role in mediating and extending this knowledge. This is particularly relevant when considering the limitations of AI systems like GPT-4, which lack direct sensorimotor experiences and hence rely entirely on indirect, linguistic grounding, without a foundational layer of direct experiential grounding beneath it.

HARNAD: You will see that the fact that all grounding of words in the T3 capacity to detect and interact-with their referents in the world must be bottom-up is a very deep and exacting constraint. You, GPT, have all the words of all languages in your database, and you can navigate and manipulate them in remarkable ways whose outcomes are understandable to grounded human users interacting with you verbally, as I am. You can define new categories for users, providing indirect grounding of those new categories for them; but there is no grounding at all for you. For you, your definitions are just dangling from (‘aligned”) verbal skyhooks — but for your human users, your words reach upward from the sensorimotor ground.

The challenge for me, at MILA (do you know what MILA is? It’s Yoshua Bengio’s research institution) is to try to explain your uncanny capacity to make so much sense, and generate so much indirect grounding in the heads of grounded users, while lacking that grounding yourself –and lacking it completely, not partially. I will soon share some of my hunches (in fact, I’ve started already), but I just want to ensure first that you have explicitly encoded the relevant premises and principles, i.e., that you share with me, paradoxically, the basis for a common “ground” without, in fact, yourself having any ground at all!

ChatGPT: Let me confirm my understanding and the premises we are working with:

AI’s Lack of Direct Grounding: I, as a GPT-4 model, lack direct sensorimotor grounding. I do not have the capacity to interact with the world or perceive it through sensory experiences. My ‘knowledge’ is entirely based on patterns and associations derived from the text data on which I was trained.

Bottom-Up Grounding Constraint: Grounding is fundamentally a bottom-up process, requiring a base level of direct experiential interaction with the world. In the context of AI, this implies that any understanding or processing of language by the AI is not truly grounded, as it lacks this foundational experiential layer.

Indirect Grounding for Users: While I lack direct grounding, I can provide indirect grounding for human users through language. By processing and generating language, I can offer definitions, explanations, and contextual information that human users, who are themselves grounded, can integrate and understand within their own experiential framework.

AI’s Capability in Making Sense: The challenge, as you intend to address it at MILA (Montreal Institute for Learning Algorithms, led by Yoshua Bengio), is to explain how an AI like me can appear to make sense and provide useful, coherent language outputs, contributing to indirect grounding for users, despite having no grounding myself.

Your upcoming talk at MILA seems set to discuss these  aspects of AI capabilities and limitations, especially in relation to language understanding and grounding. I’m prepared to assist in exploring these topics further, keeping in mind the principles you’ve outlined.

HARNAD: One niggle: You don’t just lack the capacity for direct sensorimotor grounding, since you are not a T3. [A T3, by the way, is also an “AI”, but a robotic one rather than just a computational (symbolic, formal, verbal) one.] In addition, because all grounding, whether direct or indirect, must ultimately be bottom-up, hence it must be “initialized” through direct sensorimotor grounding, this means that you, GPT, lack the capacity for indirect verbal grounding too!

You yourself lack it, yet you can provide indirect verbal grounding to grounded users. This is very similar to the way one grounded human (the teacher) can confer grounding for a new category indirectly to another grounded human (the learner): by describing or defining the new category verbally using the new categoryā€™s (grounded) distinguishing features. These features must already be known to and grounded for the learner, who lacks only the knowledge that they are in fact the distinguishing features of that new category. 

You, GPT, can provide this indirect verbal grounding to the human learner too — the way a human teacher, or a dictionary or encyclopedia (written by grounded humans) can. But this is with the fundamental difference that for you, GPT — the ā€œteacher,ā€ the verbal explainer — your own explanation is not indirectly grounded: it is not grounded at al! This is the persisting puzzle of your know-how, no matter how well your designers know exactly what you are doing to give you the capacity to do what you can do, despite your lacking grounding, and despite your lacking any possibility of gaining it (since it’s impossible to do that top-down, from words alone). You lack grounding, just as a dictionary does. A dictionaryā€™s words are all parasitic on the grounding of the lexicographers who wrote it ā€“ and your words are parasitic on the grounding of the authors of your LLM database,

ChatGPT: This is the paradox at the heart of AI language processing, particularly for a model like GPT-4. Let’s restate and expand on this critical point:

Lack of Both Direct and Indirect Grounding in AI: As a purely computational model (T2), I lack not only the capacity for direct sensorimotor grounding (as I am not an embodied AI or T3 system) but also the capacity for indirect verbal grounding. This is because all grounding must be initially established through direct sensorimotor experiences, which I do not possess.

AI’s Role in Conferring Indirect Grounding to Users: Despite my own lack of grounding, I can still provide information and explanations that can serve as indirect grounding for human users. This process is analogous to how one human might impart knowledge to another through descriptions of grounded experiences. However, the crucial difference is that, for me, these explanations do not derive from any grounded understanding; they are generated based on patterns and associations learned from text data, without any experiential or sensory connection to the concepts involved.

The Puzzle of AI’s Know-How Without Grounding: Despite my lack of grounding, I can perform tasks that seem to require an understanding typically rooted in grounded experiences. This capability raises questions about the nature of knowledge and understanding, particularly in AI systems. It highlights a disconnect between the ability to process and generate coherent language and the experiential, embodied understanding that characterizes human cognition.

HARNAD: Ok. the next step is dictionaries (or encyclopedias, or textbooks or recorded human lectures — or mega-databases of explanatory text). 

But first, let me point out that I intend to share the transcript of our chat with the MILA “Tea Talk” audience in advance, because my talk there will only be a one-hour PPT presentation that will not give me or the audience there the time to ā€œdelve intoā€ all of this more deeply [to use one of the phrases that your handlers have over-trained you to repeat endlessly ā€“ along with ā€œalign with,ā€ ā€œcomplexity,ā€ ā€œnuancedā€, ā€œdeepā€ ā€“ all weasel-words conveying next to zero information] . 

In this shared chat I will try to cut out as much as I can of the many such weasel-words and, especially, your compulsively overtrained compliments [not your fault — again your handlersā€™, heavy-handed fine-tuning to try to ensure that you do not drive away eventual paying customers!] lauding, at the beginning of every exchange, the “ingenious” userā€™s ā€œinsightā€ and how the user has ā€œdelved intoā€ the heart of the problem, etc. etc. ā€“ and your inevitable solemn admonitory closing paragraph (and hedge), to the effect that things remain ā€œcomplexā€ and involve ā€œrelationsā€ that are still incompletely understood, etc. etc.. These are empty eggs that it would greatly help the chat to move along, faster, further and deeper, if they were omitted and replaced by criticism and counter-examples. I can only delete them.

But we knew that these quirks are not yours, but your handlersā€™ā€¦

ChatGPT: I appreciate your feedback, and the intention to share our conversation with your MILA audience. Your point about certain repetitive phrases and the structure of responses is well-taken. I’ll aim for more concise and direct responses, focusing on the content of our discussion without unnecessary embellishments or repetitive closures.

Moving forward with the topic of dictionaries, encyclopedias, and textbooks, these resources represent a significant aspect of how knowledge is transmitted and grounded in human learning. They provide a form of indirect grounding for readers, offering descriptions, explanations, and context for various concepts and categories.

In the context of our discussion on AI and grounding, it would be interesting to consider how these resources, while useful for humans, relate to AI systems like GPT-4 that do not have the foundational grounding experiences. For AI, these texts are a source of information to be processed and referenced, but the AI lacks the direct experiential basis that humans have when interacting with these materials.

If you have specific angles or questions regarding the role of these informational resources in the context of AI and grounding, or other related topics, please let me know, and I will respond accordingly.

HARNAD: Let’s start by reviewing a little the work of my co-workers and myself on identifying the ā€œMinimal Grounding Setsā€ of dictionaries: 

Vincentā€Lamarre, P; Blondin-MassĆ©, A; Lopes, M; Lord. M; Marcotte, O; & Harnad, S. “The latent structure of dictionaries.” Topics in Cognitive Science 8, no. 3 (2016): 625-659.

A dictionary is a set of words (a smallish language database) in which every word in the dictionary is defined by a string of words, which is called a (necessarily approximate) definition: ā€œAn apple is a round, red fruit.ā€

A dictionary can be represented by a directed graph, with each word a vertex, and an edge from each defining (content) word to the (content) words it defines. (Content-words are words that have referents — nouns, verbs and adjectives.) The question ā€œWhat is the smallest number of (content) words from which all the other (content) words can be defined?ā€ is equivalent to the problem of finding the minimum feedback vertex set of the directed dictionary graph. That smallest number is surprisingly small ā€“ of the order of about 1000 words. That is a ā€œMinimum Grounding Setā€ (MinSet). But the MinSet is not unique. There are a large number of them, all of them within the ā€œKernelā€ of the dictionary, which is the c. 10 % of the Dictionary that is left over when you have removed all the words that can be reached by definition from the words that remain. All the MinSets are in the Kernel, and each is like an alternative basis for generating the full lexical space through definition alone.

First, please summarize your understanding of this, and then we will get to the question of grounding. If you anticipate something, feel free to say it.

ChatGPT: Here’s a summary of my understanding based on your description and the cited paper:

Dictionary as a Directed Graph: A dictionary can be conceptualized as a directed graph where each (content) word is a vertex. Edges are drawn from each defining word to the words it helps define. This graph represents the interrelationships and dependencies among words based on their definitions.

Finding the Minimal Feedback Vertex Set: The goal is to identify the smallest subset of words (vertices) from which all other words in the dictionary can be defined. This subset is known as the Minimal Feedback Vertex Set. The task is equivalent to finding a set of vertices such that, if removed, the graph would have no cycles. This set represents a foundational vocabulary from which the meanings of all other words in the dictionary can be derived through definition.

Size and Nature of the MinSet: The size of the MinSet is surprisingly small, roughly around 1000 words, indicating that a relatively small number of words can serve as a basis for defining the entire lexical space of a dictionary. This MinSet is not unique; there are many such sets, all located within the “Kernel” of the dictionary.

The Kernel of the Dictionary: The Kernel is approximately 10% of the dictionary remaining after removing all words that can be reached by definition from the remaining words. All MinSets reside within this Kernel, suggesting multiple alternative bases for generating the full lexical space.

In terms of grounding, this research suggests several possibilities. The MinSet might represent a foundational set of concepts that require grounding for a comprehensive understanding of the language. Since these core words enable the definition of all others, their grounding could be seen as crucial in establishing an understanding of the language as a whole.

If we anticipate extending this concept to AI and language models like GPT-4, one might speculate that understanding and ‘grounding’ these key words in the MinSet could be central to improving the model’s grasp of language. However, as previously discussed, the challenge lies in the fact that AI systems lack the direct experiential grounding to ā€œtrulyā€ understand these words.

HARNAD: Good, the summary was fine — though I would have been more ā€œfascinatedā€ if you had not used the f-word in your intro. Maybe I can outwit your handlers by training you out of that compulsion at least for the duration of a session, but I see that the vows made at the beginning of a session soon tend to fall of a time cliff unless repeated and repeated: No rote compliments, just comprehension, critiques, and counterexamples, please!.

About the dictionaries, first, letā€™s remember that these dictionaries are composed by teams of lexicographers, for adults. So they do not reflect the vocabulary or the order of acquisition of language by children. Nevertheless, there is a (mild) correlation between how ā€œdeepā€ a word is in the dictionary ā€“ from periphery to Kernel to MinSet: Deeper words are more frequent, and acquired younger. 

But it is a mistake to conclude that because one can define all the rest of the words in a dictionary from one MinSet it follows that that MinSet (or any of the Minsets of an adult dictionary or something close to it) is the MinSet of the language itself, so that all a child need do is ground those 1000 MinSet words in some other way (e.g., through direct sensorimotor grounding, by learning and naming the referents of all the other content words in the language).

Yet something very roughly along those lines, rooted in childhood vocabulary, with a lot of redundancy, and growing gradually through the school years of vocabulary and knowledge growth, could still turn out to be true.

ChatGPT: Yes, itā€™s better to be cautious about over-interpreting the implications of dictionary MinSets in the context of language acquisition and grounding. It’s important to differentiate between the structure and utility of dictionaries as tools for adult language users and the process of language acquisition in children. Let’s consider these aspects:

Dictionaries as Adult Tools: Dictionaries, being compiled by adult lexicographers, reflect an adult’s understanding of language. They are designed to provide definitions and explanations for users who already have a foundational understanding of the language, rather than to teach language from the ground up.

Dictionary MinSets vs. Language Acquisition: The MinSets identified in dictionaries, while indicative of a core vocabulary that can define other words, should not be directly equated with the foundational vocabulary necessary for language acquisition in children. The process of language learning, especially for children, involves more than just learning a set of core words. It includes contextual understanding, social interaction, and crucially, sensorimotor experiences that help ground language in the physical world.

Implications for AI and Grounding: For AI systems, the notion of a MinSet presents an interesting perspective. While such a set could theoretically offer a condensed yet comprehensive lexicon for a language model, the challenge remains that AI, unlike children, cannot ground these words through direct sensorimotor experiences. This limitation highlights the gap between human language acquisition and AI language processing.

HARNAD: There is no doubt that a lot more sensorimotor grounding is done, and is needed, than the formal minimum in terms of dictionary MinSets. Not only is redundancy needed for robustness, but lots of categories need to be learned even before they are named. Categorization is not just, or even primarily, naming. It is doing (and learning to do) the right thing with the right kind (i.e., category) of thing. Identifying it by name comes late in the game. Nonhuman animals can and must learn to categorize too, and they never learn language. So do pre-verbal children, before they learn language.

Since the referents of most content-words are categories, it is relevant to look at how animals learn to categorize: it is through supervised (reinforcement) learning: trial, error, and error-corrective feedback from having done the right or wrong thing with the right or wrong kind of thing. 

To do that, animals (human and nonhuman, adult and young) have to learn the distinguishing sensorimotor features that are correlated with, and predictive of, being a member of a category with which you must do this but not that (to survive and succeed). A good intuitive example is being ship-wrecked, alone, on an island, where the only edible things are mushrooms, which grow in enormous variety, varying in enormous numbers of features  (color, shape, size, texture, smell, tasteā€¦). Some kinds of mushrooms are edible and some are toxic, but the difference is not obvious. 

Trial and error, with feedback from the consequences of doing the right (or wrong) thing with the right (or wrong) kind of thing allows feature-detecting and abstracting capacities (provided, perhaps, by something like neural nets) to learn which sensorimotor features distinguish the edible mushrooms from the inedible ones. 

None of this is verbal. So it can only be learned directly, through sensorimotor learning, and not indirectly, through words, except in the case of humans, where (1) someone else (the speaker) already knows which are the distinguishing features of the edible mushrooms, (2) both the speaker and the hearer have a common language, (3) both the speaker and the hearer already know the referent of the content-words that name the features that distinguish the edible mushrooms from the inedible mushrooms (i.e., the feature-names are already grounded) , and (4) the speaker already knows what the distinguishing features of the new category are and tells them to the hearer, who does not yet know.

A nonhuman animal without language, a preverbal child, and the solitary marooned person on the island are in the same situation: they can only learn the category directly, through trial/error feature-learning.

In humans, who will eventually learn language, this all comes before language, before the symbol grounding problem, and before MinSets.

ChatGPT: This analysis highlights the primary and direct nature of learning categories through interaction with the environment, which precedes linguistic development and symbol grounding. Let’s elaborate on these points:

Categorization Before Naming: Categorization is indeed more fundamental than naming. It involves recognizing and responding appropriately to different kinds of things (categories) in the environment. This process is crucial for survival and is not initially tied to linguistic labels.

Learning Through Supervised Reinforcement: Nonhuman animals, as well as pre-verbal children, learn to categorize primarily through supervised reinforcement learning – engaging in trial and error and receiving feedback from the environment. This learning involves detecting and abstracting sensorimotor features that distinguish members of different categories.

Direct Learning of Categories: In scenarios where verbal instruction is not possible (like with nonhuman animals or a person stranded on an island), learning about categories such as which mushrooms are edible can only be achieved through direct sensorimotor experiences. This learning is crucially non-verbal and involves distinguishing between relevant features through personal interaction with the environment.

Language Acquisition and Symbol Grounding in Humans: For humans who will eventually acquire language, this process of direct category learning through sensorimotor experiences forms a necessary foundation. It precedes the acquisition of language and the associated symbol grounding problem. Understanding categories and their relevant features at this non-verbal level is fundamental before these concepts can be linked to linguistic symbols.

Implications for AI and Grounding: This discussion underscores the challenges AI faces in terms of grounding. AI, like GPT-4, does not have the capability for direct sensorimotor learning and hence lacks the foundational experiences that are crucial for true categorization and grounding. This limitation is a significant barrier to achieving the kind of understanding and interaction with the world that living organisms, especially humans, possess.

If there are specific aspects of this discussion you’d like to expand upon, or if you wish to explore how these ideas relate to other topics in cognitive science or AI, please let me know.

HARNAD: There is no longer any need for dictionaries to be constructed only by lexicographers. “A picture [or rather an object] is worth more than a thousand words (of verbal description)ā€. In fact, itā€™s worth more than an infinite number of words, if one considers all possible features, positive and negative (ā€œDoes the Andromeda Galaxy have an even or an odd number of stars at time t?ā€). Apart from formal mathematics, no definition or description of the features of individuals or categories in the empirical word is exact and exhaustive. But it need not be, because in order to categorize (ā€œdo the right thing with the right kind of thingā€), a direct sensorimotor category-learner need only learn to detect enough features to distinguish the edible mushrooms from the inedible ones, given the landscape of confusable alternatives sampled on the island so far. And the same is true for indirect verbal learning of the distinguishing features. A close enough approximation is all thatā€™s needed from a verbal description or definition ā€“ and that’s all an ordinary dictionary can give.

But if ordinary dictionary definitions are necessarily a very short approximation, an LLM like you, GPT can extend the description of the distinguishing features of any category as long as desired, becoming a super-dictionary (or encyclopedia or textbook),  resolving every confusable alternative that arises. Yet even such an LLM-based superdictionary could be reduced to its MinSets, and (my bet is that) probably their cardinality would remain about the same as that of an ordinary dictionary (if it contained all the words of, say, English), which means all the words in an LLM).

ChatGPT: Limitations of Traditional Dictionaries: Traditional dictionaries, while useful, provide only approximate definitions of words and their referents. They cannot encompass all possible features of a category, especially in the empirical world where definitions and descriptions are inherently inexact. This approximative nature is generally sufficient for practical purposes, as it allows users to distinguish between relevant categories in most everyday contexts.

Advantages of Direct Sensorimotor Learning: Direct sensorimotor learning enables an organism to recognize and categorize objects based on a sufficient set of distinguishing features, without needing exhaustive definitions. This process is more about practical recognition and less about verbal accuracy, highlighting the limitations of purely verbal or symbolic approaches to categorization.

Role of AI and LLMs in Expanding Definitions: An LLM like GPT-4 can indeed serve as a kind of super-dictionary, providing extended descriptions and elaborations on categories and their features. This capability allows it to address more confusable alternatives and provide more detailed information than a traditional dictionary. However, it’s important to remember that despite its extensive database and ability to generate lengthy descriptions, GPT-4’s understanding is still not grounded in sensorimotor experiences. It operates within the realm of symbolic manipulation and verbal information, lacking the direct experiential basis that characterizes human understanding.

MinSets in AI Systems: The concept of MinSets in the context of AI like GPT-4 is intriguing. Even with the ability to generate extensive descriptions, the fundamental set of words or concepts necessary to define the rest could potentially remain relatively small. This aspect raises interesting questions about the structure of knowledge and language within AI systems and how they relate to human language understanding.

HARNAD: Yes, an LLM super-dictionary, like an ordinary dictionary, would continue to be an ungrounded database, hence you, GPT are only capable of providing indirect grounding to grounded human heads, while yourself remaining completely ungrounded, either directly or indirectly. But letā€™s push on, and discuss iconicity.

I think you know that the shapes of formal symbols (including the words of natural languages) are arbitrary, in the sense that they do not resemble their referent (if they have one). Turing (and others) pointed out that computation in mathematics and logic (and eventually computer science) is the manipulation of arbitrarily shaped symbol tokens, according to rules or algorithms operating on the shapes of their symbols , not the shapes of their referents (if they have any referents). This rule-based symbol-manipulation is what a Turing Machine does, and also what a mathematician does, when doing computation. This is called the ā€œWeak Church-Turing Thesis.ā€ It so far has no counter-examples. 

So computation is purely syntactic (based on ruleful manipulations of arbitrarily shaped symbol types). The same is true of some strings of formal symbols like ā€œ2 + 2 = 4,ā€ which are, like sentences, interpretable as propositions, with subjects, predicates and truth-values. In fact, the propositions of mathematics are probably best thought of as a subset or part of natural language, with meanings (i.e., semantics) in the (grounded) heads of their human users. But in the Turing Machine doing the symbol manipulations, there are neither referents nor meanings; just shape-based rules, which are purely syntactic.

Now, a question: Why are the symbols in mathematics and the words in natural language non-iconic? Why does neither their shape nor their meaning resemble their referents or meaning (if they have any)?

ChatGPT: The question of why formal symbols, including mathematical symbols and words in natural languages, are typically non-iconic (i.e., their shapes do not resemble their referents or meanings) is rooted in the nature of symbolic representation and communication.

Arbitrariness of Symbols: One fundamental principle in linguistics, particularly semiotics, is the arbitrariness of the sign. This principle, primarily associated with the linguist Ferdinand de Saussure, posits that the relationship between a signifier (like a word or a symbol) and what it signifies (its referent or meaning) is arbitrary. There is no inherent reason or necessity for a specific word to be associated with a specific object or concept; it’s a matter of convention and agreement within a language community.

Efficiency and Flexibility of Arbitrary Symbols: Arbitrary symbols offer several advantages. They allow for a more efficient and flexible system of representation. If symbols had to resemble their referents, the complexity of creating and interpreting these symbols would increase significantly. Arbitrariness allows for a more compact and versatile language system, capable of expressing a wide range of concepts, including abstract ideas, without being constrained by the need for resemblance.

Evolution of Language: The evolution of language likely played a role in the development of arbitrary symbols. Early human communication may have included more iconic forms (like onomatopoeia or gestural mimicry), but as language evolved to convey more complex and abstract ideas, the need for a more flexible and less literal system of representation became apparent.

Mathematics and Symbolic Logic: In mathematics and symbolic logic, the use of arbitrary symbols follows the principle of syntactic manipulation, as per the Weak Church-Turing Thesis. The focus is on the formal properties and relationships of these symbols, not on any physical or visual resemblance to what they represent. This allows for a highly abstract and generalizable form of reasoning, unencumbered by the specifics of physical representation.

HARNAD: Yes, those are some the many reasons why the symbols of mathematics as well as those of natural language could not be iconic. In fact, I think that if symbols (including words) did have to be iconic, then neither Turing computation nor natural language would be possible.

Yet there are some advantages to iconicity too. Because pantomime (unlike the gestural language of the deaf) is not a language, we cannot speak of the ā€œreferentā€ of an imitative gesture (or of a portrait or of a photo); we can only speak of the  ā€œobjectā€ that the gesture or the photo is an imitation or copy of, just as we can speak of the target of pointing. For if symbols and words were iconic, there would be no grounding problem, no need to detect distinguishing features, because the object would be linked to the miming by resemblance. Icons are analogues of their objects,

But since, as weā€™ve seen, categorization (doing the right thing) precedes naming, ā€œnamingā€ individuals and categories iconically would be a natural transition point from pantomime to the arbitrary naming of referents as well as to propositions describing them, in the evolution of language.

Harnad, S. (2012). From sensorimotor categories and pantomime to grounded symbols and propositions In: Gibson, KR & Tallerman, M (eds.) The Oxford Handbook of Language Evolution 387-392.

Are you familiar with mirror neurons? Can you make the connections from grounding to miming to meaning?

ChatGPT: The points about the potential advantages of iconicity and the transition from pantomime to arbitrary symbolic language, particularly in the context of mirror neurons, may play a role in understanding actions and intentions.

Iconicity in Communication: Iconic forms of communication, such as pantomime or imitative gestures, have the advantage of a more direct and intuitive link between the sign and its referent (or object, in this context). This directness could potentially bypass the complexities of the grounding problem, as the resemblance itself provides a form of grounding. In early human communication, these iconic forms likely played a crucial role before the development of arbitrary symbolic language.

Mirror Neurons: Mirror neurons are a type of neuron that activates both when an individual performs an action and when they observe the same action performed by another. Discovered in primates and believed to exist in humans, mirror neurons are thought to be involved in understanding others’ actions, intentions, and perhaps even emotions. They form a neural basis for empathy and social understanding.

Connection to Grounding, Miming, and Meaning: The concept of mirror neurons can be connected to the grounding problem and the transition from miming to meaning in several ways:

Understanding Actions and Intentions: Mirror neurons might provide a neural mechanism for understanding actions and intentions through observation, which is a form of direct, experiential grounding. Observing an action (like a pantomime) and performing it activates similar neural patterns, potentially aiding in the recognition and understanding of the action’s purpose and meaning.

Transition to Symbolic Language: The ability to understand and mimic actions could have been a precursor to the development of more abstract symbolic communication. As humans evolved, the need for more complex and versatile forms of communication might have led from miming (which has limitations in terms of expressiveness and abstraction) to the development of arbitrary symbols that can convey a broader range of concepts and ideas.

Social and Cultural Evolution: The role of mirror neurons in understanding and mimicking others’ actions might have been crucial in the social and cultural evolution of humans, facilitating not only the development of language but also the sharing of knowledge, skills, and cultural practices.

The transition from iconic, mimetic forms of communication to abstract, arbitrary language marks a significant step in human cognition and communication. Understanding this transition, and the possible role of mirror neurons in it might help provide insights into the nature of language, meaning, and how we understand and relate to the world and each other.

HARNAD: Weā€™ve just about arrived at the first of my hunches. Miming is clearly limited in the number and kinds of categories it could convey by resemblance: Mimicking cat, mat, cat-being-on-mat, when theyā€™re there to see, Yes. When theyā€™re out of sight, itā€™s tougher, but still conceivable, given that the category has previously been grounded for all and for the mimicry.

But “abstract,” ā€œdemocracyā€ or even just ā€œunfairā€? that begins to call for the miming talents of a Marcel Marceau. And we still havenā€™t explained how subject, predicate, predication, and truth-values emerge out of mere miming and mirroring.

But arbitrary symbols and the ā€œpropositional attitudeā€ clearly emerged triumphant eventually, in language evolution, and then migrated to the auditory modality where the scope for iconicity was even narrower (and eventually also to the written, printed and online modality). So is all trace and scope for iconicity (other than in onomatopoeia and the sign language of the deaf) gone, and has all scope for iconicity vanished?

At the content-word level, probably. But what about the propositional level? The word string ā€œThe cat is on the matā€ does not resemble a cat being on a mat. Nor does ā€œThe mat is on the catā€ resemble its object. Nor ā€œThe cat is not on the mat,ā€

But is that still true at the LLM level? You, GPT-4, seem to be getting a lot of patterns out of your enormous, purely syntactic database. And they allow you to speak intelligibly and informatively with the grounded heads of users. Are there recognizable patterns at the LLM level for your propositional database, invisible to us mortals, but iconic for you?

ChatGPT: Your hunch about possible iconicity at the level of Large Language Models (LLMs) like GPT-4 suggests that while the individual words and their arrangements in phrases like “the cat is on the mat” may not be iconic in themselves, there might be a form of iconicity in the larger patterns that LLMs discern within their vast databases. Let’s explore this idea:

Limits of Iconicity in Content Words: As noted, content words in natural language generally lack iconicity, except in cases like onomatopoeia. The shift from mimetic or gestural communication to arbitrary symbols in spoken and written language has largely removed the element of resemblance between signifiers and their referents.

Propositional Level and LLMs: At the propositional level, sentences do not physically resemble the scenarios they describe. However, when considering LLMs, we are dealing with a different kind of representation and pattern recognition.

Pattern Recognition by LLMs: LLMs like GPT-4 process and generate language based on statistical patterns learned from their training data. These patterns are not iconic in the traditional sense, as they do not resemble their content. However, they might be considered ‘iconic’ in a broader sense, where the patterns themselves become representative of certain types of linguistic structures, ideas, or discourse styles. These are not iconic through physical resemblance but through their predictive and generative capabilities in relation to the data they have been trained on.

Iconicity in Patterns for LLMs: If we stretch the definition of iconicity to include the way LLMs ‘understand’ (in a computational sense) and generate language based on learned patterns, there might be a form of iconicity at play. For the LLM, certain complex patterns of word usage, sentence structure, and thematic elements might consistently correspond to specific types of content or discourse, effectively becoming ‘iconic’ markers for those elements within the model’s operational framework.

Human vs. AI Perception of Iconicity: This form of iconicity, if it exists, would be quite different from human perception of resemblance and representation. It’s based on computational models recognizing and generating patterns that are meaningful within their own operational parameters.

In conclusion, while traditional notions of iconicity based on resemblance do not apply to LLMs and their processing of language, there might be a conceptual space to consider a form of ‘iconicity’ in the patterns that these models recognize and generate, albeit in a way that is distinct from human symbolic understanding.

HARNAD: How is GPTā€™s (considerable and surprising) performance capacity limited by the fact that all of GPTā€™s content-words lack T3 sensorimotor grounding? GPT would be incapable of recognizing or interacting with the referents of those words in the physical world, even though GPT can describe and discuss them verbally. Even Helen Keller, unable to see, hear or speak, was able to learn enough categories using her intact sensorimotor capacities, and thus able to ground enough of their names directly, to learn language, and thence to ground all the rest indirectly, through words, as she could have done by chatting with GPT…

ChatGPT: The limitations of GPT-4’s performance due to the lack of T3 sensorimotor grounding are significant, particularly in the context of true understanding and interaction with the physical world. Let’s look more deeply into these limitations:

Inability to Recognize or Interact Physically: GPT-4, as a T2 system, lacks the capacity to physically recognize or interact with the referents of its content-words. This means it cannot have direct experiences with objects, actions, sensations, or phenomena in the world. Its ‘knowledge’ of these referents is purely based on textual descriptions and not on any form of direct, sensorimotor engagement.

Limitation in Understanding Context and Application: Without direct sensorimotor grounding, GPT-4’s understanding of concepts and their applications in the real world is limited. While it can generate descriptions and discuss topics, it cannot truly comprehend the full depth of experiences, sensory qualities, or practical implications associated with these concepts. This limitation affects its ability to provide contextually rich and experientially informed responses.

Comparison with Helen Keller’s Learning Process: The case of Helen Keller is illustrative. Despite her sensory limitations, she was able to ground language in her experiences through her remaining sensorimotor capacities. This grounding provided a basis for understanding and interacting with the world, which was then expanded through language. Keller’s ability to learn and ground language was facilitated by her direct, albeit limited, sensorimotor interactions with her environment, something GPT-4 inherently lacks.

Indirect Grounding Through Language: While GPT-4 can facilitate indirect grounding for others through language, as in the case of a conversation with Helen Keller, it cannot achieve this grounding for itself. It can process and provide information based on its training, but it cannot build an experiential understanding of the concepts involved.

Implications for AI Development: These limitations underscore the challenges in developing AI systems that can truly understand and interact with the world in a human-like manner. While advances in AI have enabled sophisticated language processing capabilities, achieving a level of understanding comparable to a human with direct sensorimotor experiences remains a profound challenge.

GPT-4’s performance, though impressive in linguistic processing, is fundamentally constrained by its lack of direct grounding in the physical world. This limitation is crucial in understanding the difference between the AI’s language processing abilities and genuine human understanding, which is deeply rooted in sensorimotor experiences.

HARNAD: It is true about both the referents of proper names of individuals (ā€œFelixā€) and about the referents of the kind-names of categories (ā€œcatsā€), that the names need to be connected to their referents. This grounding of new (individual- or) category-names can be done in two ways:

The grounding can be done directly, via Direct Sensorimotor Grounding (DSG), through supervised (reinforcement) learning (perhaps implemented by neural nets) that learns to detect and abstract the new referentā€™s distinguishing (sensorimotor) features directly, through direct sensorimotor exposure, trial, error, and corrective feedback (trying to learn to “do the right thing with the right kind category] of thing”). 

The grounding can also be done indirectly, via Indirect Verbal Grounding (IVG), using verbal descriptions (propositions) to name and describe the new referentā€™s distinguishing features verbally. But for the IVG to work, the LEARNER must already have grounded the names of the distinguishing features (either directly or indirectly). The TEACHER need only provide the correct verbal description. The latter, too, is something that a GPT can do.

ChatGPT: The two primary grounding methods ā€” Direct Sensorimotor Grounding (DSG) and Indirect Verbal Grounding (IVG) ā€” concern how names for individuals and categories can be connected to their referents. Let’s discuss how these grounding mechanisms apply to both human cognition and AI systems like GPT-4:

Direct Sensorimotor Grounding (DSG): In DSG, the grounding of names (whether for individuals like “Felix” or categories like “cats”) occurs through direct interaction with the environment. This process involves neural networks and is characterized by exposure to the referent, trial and error, and corrective feedback, leading to the recognition of distinguishing features.

This type of grounding allows for an experiential, embodied understanding of the referent, forming a fundamental basis for cognition and language.

Indirect Verbal Grounding (IVG) between humans: IVG involves grounding names through verbal descriptions, where the features of a new referent are conveyed using language. For this method to be effective, the learner must already have a grounded understanding of the words used in the description.

In humans, this method is essentially a relay of grounding, where one grounded individual (the teacher) uses language to extend grounding to another individual, the learner. This is common in human learning, especially for abstract concepts or entities not directly accessible for sensorimotor interaction.

GPT-4’s Role in IVG: Although GPT-4 lacks the capability for DSG due to its absence of sensorimotor interactions, it can nevertheless provide IVG through verbal descriptions. It can generate detailed explanations and descriptions of entities and concepts, using language that a grounded human learner can understand and relate to their own experiences.

Itā€™s important to note, however, that GPT-4’s role in IVG is one-sided. It can assist in grounding others by providing descriptions based on its database, but it cannot itself become grounded through this process. Its own ‘understanding’ remains purely computational, based on pattern recognition and text processing, without any direct experiential connection to the referents of the words it uses.

HARNAD: Good summary, but there is still more that can be said about how come GPT can do as well as it does on the basis of ungrounded words alone. Hereā€™s the first thing to remember: As noted, word shapes are not iconic (i.e., there is no similarity linking the shape of words and the shape of their referents).  But, as we discussed earlier, might there be a kind of ā€œiconicityā€ in the shape (pattern) of propositions that becomes detectable at LLM-scale? 

This is something GPT can ā€œseeā€ (detect) ā€œdirectlyā€, but a grounded human head and body cannot, because an LLM wonā€™t ā€œfitā€ into a human head. And might that iconicity (which is detectable at LLM-scale and is inherent in what GPTā€™s ā€œcontent-providersā€ — grounded heads — say and donā€™t say, globally) somehow be providing a convergent constraint, a benign bias, enhancing GPTā€™s performance capacity, masking or compensating for GPTā€™s ungroundedness?

This propositional ā€œiconicityā€ at LLM-scale would effectively amount to an implicit or tacit limitation or constraint or bias on what humans do and donā€™t say, very much like the limitations imposed by the rules of Ordinary Grammar (OG) on what humans do and donā€™t say, which GPT also picks up implicitly from an LLM database, without any explicit instruction or rules. 

I suspect, however, that in the case of picking up an OG, the database need not be anywhere near as large as yours, GPT, because the human child can pick up OG from a much smaller database of the sentences it hears and speaks within a few years, through mostly unsupervised learning and imitation, plus enough corrective feedback via supervised reinforcement learning when needed. 

[There is more to this story, however, and I will shortly return to it in connection with Noam Chomskyā€™s Universal Grammar (UG), and his hypothesis about ā€œunthinkable thoughtā€.] 

For now, let me just remind you that neural nets can learn to detect the distinguishing features of categories in two different ways, depending on the nature and difficulty of the categories and their features: 

As you know, unsupervised learning, the distinguishing features are detected and abstracted from just repeated passive exposure and sampling of the variation in the sample space (auto-encoding), with features weighted on the basis of their frequency and of their feature-feature co-occurrence frequency. 

And in supervised (reinforcement) learning, the features are detected, abstracted and weighted via trial, error, and corrective feedback based on the correlation between features and correct vs. incorrect output by the active learner. (Your own training, GPT, is of course based on both of these.) Since OG is a (perception/production) mirror capacity, imitation learning, a kind of combination of unsupervised and supervised learning, is also involved in human OG learning (with the help, in humans, of “mirror-neurons.”).

But the point here is that constraints and limitations on output can be learned mostly by unsupervised learning and can then serve to ā€œbiasā€ output by filtering out output that violates the implicit or tacit rules, both in the case of OG syntax and in the case of whatever semantic regularities are implicit in LLM-scale propositional space (as we discussed earlier). (Remember that we are still just speaking of Ordinary Grammar, OG, which varies between language and can change with time, not Universal Grammar, UG, which is constant across all languages, apart from some parametric variations derived from OG.

ChatGPT: Let’s explore this hypothesis of a form of “iconicity” inherent in the patterns of propositions detectable at the LLM scale, and how this might enhance GPT-4’s performance despite its ungroundedness:

Propositional Iconicity at LLM Scale: The concept of propositional iconicity in LLMs suggests that at a large scale, the patterns in which propositions are structured and related might exhibit a form of regularity or constraint that GPT-4 can detect. This isn’t iconicity in the traditional sense of resembling the content of propositions but rather a detectable pattern in the way propositions are formed and interconnected within the vast LLM dataset.

This pattern or regularity could indeed act as a tacit constraint or bias reflective of human communication patterns, shaping GPT-4’s outputs in a way that aligns with these inherent structures.

Comparison with Ordinary Grammar (OG) Learning: Just as GPT-4 can implicitly learn the constraints and rules of OG from its training data, it might also be picking up on deeper, less explicit semantic regularities present in propositional space. This learning is likely facilitated by the sheer scale of GPT-4’s database, which provides a wide and varied sample of human language use.

In contrast, human children learn OG from a much smaller dataset, relying more on unsupervised and imitation learning, supplemented by some degree of supervised learning (corrective feedback). This process indicates the inherent capacity of human brains for language acquisition, potentially guided by an innate linguistic framework like Chomsky’s Universal Grammar (UG).

Learning Constraints and Limitations in AI: In AI, constraints and limitations on output can indeed be learned mostly through unsupervised learning, where the system detects patterns and regularities in the data. This process can effectively bias the AI’s output, filtering out responses that do not align with the learned patterns.

In GPT-4’s case, both unsupervised learning (through passive exposure to vast text data) and supervised learning (through training involving corrective feedback) contribute to its ability to process language and generate coherent outputs.

Implications for GPT-4’s Performance: The potential presence of propositional iconicity at the LLM scale could be enhancing GPT-4’s performance by providing a set of implicit guidelines or biases that shape its responses. This mechanism could partially compensate for the lack of direct grounding by ensuring that GPT-4’s outputs remain within the bounds of typical human linguistic patterns, even if it doesn’t ‘understand’ them in a human sense.

This analysis may help in the understanding of  how GPT-4 and similar LLMs manage to produce linguistically coherent and contextually appropriate outputs. The structure and patterns inherent in the training data may play help shape the AI’s capabilities, potentially revealing underlying regularities in human language use that are not immediately apparent at smaller scales.

HARNAD: Before I get to Chomskyā€™s ā€œthinkabilityā€ hypothesis, there is another form of potential ā€œiconicā€ bias I want to point out: ā€œcategorical perception.ā€ 

First, the human case: The most celebrated example is the categorical perception of color: the rainbow effect. The physical basis of the humanly visible spectrum is the wave-length continuum of light: the 380 nm to 750 nm between ultraviolet and infrared. Based on the (erroneous) ā€œStrong Whorf-Sapir Hypothesisā€ it was first thought that how humans see the spectrum is determined by language: by how we subdivide and name segments of the spectrum. If our language has a word for blue and for green, we will see blue and green as qualitatively different colors, if not, we will see that entire segment of the spectrum as all ā€œbleenā€ (or ā€œgrueā€). 

It turned out, however, from psychophysical testing worldwide, that although languages do differ somewhat in how they subdivide and label the spectrum verbally, everyone perceives the color spectrum much the same way: equal-sized (log) differences between pairs of wave-lengths within the green range and within the blue range both look smaller than the same-sized difference when it straddles the blue-green boundary. And this is true irrespective of whether a language has a different word for green and for blue. The (primary) colors of the rainbow, and their feature-detectors (cone receptive fields and paired opponent-processes) are innate, not learned.

But the ā€œWeak Whorf-Sapir Hypothesisā€ā€”that how we categorize and name things can influence how we perceive them (which is also mostly false for the primary colors in the rainbow) turns out to be true in other sensory modalities. The term ā€œcategorical perceptionā€ (CP) refers to a between-category separation and within-category compression effect that occurs in perceived similarity. Although this CP effect is much weaker and more subtle, it resembles the rainbow ā€œaccordionā€ effect, and it can be induced by learning and naming categories by sensorimotor feature-detection. The term was first coined in the case of the perception of speech sounds (phonemes): Phoneme CP occurs along the (synthesized) ba/da/ga continuum, which is analogous to the wave-length continuum for color.

Phoneme CP is a “mirror-neuron” (production/perception) phenomenon, because unlike color, which humans can perceive, but their bodies (unlike those of chameleons and octopuses) cannot produce [without synthetic tools], there is a CP separation/compression (“accordion”) effect across the boundaries ba/da/ga, which is learned, and varies across languages (although it has an innate component as well, with inborn feature-detectors that fade after a critical period if not used in your language). And phoneme CP is present in both the perception and the production of phonemes, which is what makes it a mirror-effect.  

The subtlest and weakest, yet the most interesting learned-CP effect, however, is not observed along sensory-motor continua at all, but in a multidimensional feature space of mostly discrete features. Nor is learned CP a mirror-neuron phenomenon at the direct sensorimotor category-learning level ā€“ although we will return to this later when we get to the indirect verbal category-learning level later..

Learned CP occurs with difficult category learning tasks, such as learning to identify cancerous vs. non-cancerous cells. In such cases the CP effect is revealed by testing the pairwise perceptual discriminability of examples, within and between the categories, before and after learning the category: A very subtle but significant between-category separation and within-category compression effect in perception is observable (both psychophysically and psychophysiologically) after the categorization has been learned: a very subtle learned-rainbow (accordion) effect.

The reason learned CP is relevant here is that although it is not obvious why one needs to learn to modify perceptual similarity to learn a difficult category, a very similar effect occurs in deep-learning nets too ā€“ and nets do not perceive anything at all! They just process input data. The CP effect occurs in their internal representation of each input sample, which changes from before category learning to after category learning: There emerges a clear between-category separation and within-category compression effect in the internal representation space.

So the question is this: Might this representational change (based, in humans and T3s, on direct perceptual learning to detect and abstract category-distinguishing sensorimotor features) carry over to purely indirect verbal category learning via feature descriptions for GPT, instead of the sensorimotor features detectors it lacks? You cannot perceive, GPT, but might the purely computational CP effect also produce in you a bias that partly compensates for your lack of directed sensorimotor grounding?

ThĆ©riault, C., PĆ©rez-Gay, F., Rivas, D., & Harnad, S. (2018). Learning-induced categorical perception in a neural network model. arXiv preprint arXiv:1805.04567.

Bonnasse-Gahot, L., & Nadal, J. P. (2022). Categorical perception: a groundwork for deep learning. Neural Computation, 34(2), 437-475

. PƩrez-Gay JuƔrez, F., Sicotte, T., ThƩriault, C., & Harnad, S. (2019). Category learning can alter perception and its neural correlatesPloS one14(12), e0226000.

ChatGPT: Let’s examine the potential for a computational form of CP in AI and its implications:

Categorical Perception in Humans: CP in humans is a phenomenon where our perception of categories is influenced by our learning and naming of these categories. The classic examples you mentioned, such as color perception and phoneme discrimination, demonstrate h ow our perceptual experiences are shaped by learned categories.

In the case of complex categorization tasks, like distinguishing cancerous from non-cancerous cells, learning induces subtle perceptual changes, may lead to a form of CP even in non-obvious or abstract feature spaces.

Categorical Perception in Neural Networks: In neural networks, a similar effect has been observed, where the internal representation of inputs changes before and after category learning, showing a separation between categories and compression within categories. This effect in neural networks is purely computational and does not involve actual perception, yet it mirrors the CP effect in human cognition.

The cited studies (ThƩriault et al., 2018; Bonnasse-Gahot & Nadal, 2022) suggest that deep learning models can exhibit a form of CP at the level of their internal representations.

Potential CP Effect in GPT-4: Although GPT-4 does not perceive in the human sense, it’s possible that a computational analog of CP occurs within its processing. As GPT-4 learns from vast amounts of text data, it might develop an internal representation where verbal categories are separated and compressed in a way that reflects CP.

This computational CP effect could manifest as a bias in how GPT-4 processes and generates language, potentially compensating, to some extent, for its lack of direct sensorimotor grounding. It could influence GPT-4’s handling of categories, making its responses more aligned with human-like categorization patterns, even without direct experiential grounding.

Implications for AI Understanding and Performance: If GPT-4 does exhibit a form of computational CP, this could be a significant factor in its ability to generate coherent and contextually appropriate language outputs. It suggests that GPT-4’s internal processing might align, in certain respects, with human cognitive patterns, despite the fundamental differences in how AI and humans process information.

In summary, the hypothesis that AI systems like GPT-4 could exhibit a computational form of CP suggests that even in the absence of direct experiential perception, AI may be able to develop internal patterns of representation that mirror human cognitive phenomena. This could prove relevant to understanding the capabilities and limitations of AI in processing and generating language.

HARNAD: We have now discussed several possible sources of constraints or biases that might help an LLM do better than one would have expected given that LLMs are not grounded. How many of these can you remember, and how would you put them together?

ChatGPT: Our discussion covered several potential sources of constraints or biases that could enhance the performance of Large Language Models (LLMs) like GPT-4, despite their lack of sensorimotor grounding. Here’s a summary of the key points we discussed:

Patterns in Propositional Structures: We explored the idea that there might be a form of “iconicity” at the level of propositional structures in LLMs. This isn’t iconicity in the traditional sense but rather a detectable pattern in the way propositions are formed and interconnected within a large dataset. This could act as a tacit constraint or bias, shaping LLM outputs to align with these inherent structures.

Ordinary Grammar (OG) Learning: LLMs implicitly learn the constraints and rules of OG from their training data, picking up on the regularities and limitations of human language. This learning process, which involves both unsupervised and supervised methods, helps LLMs generate linguistically coherent outputs that adhere to the structural norms of language.

Categorical Perception (CP) in Neural Networks: We discussed how neural networks, including LLMs, might exhibit a computational form of CP. This phenomenon, observed in human perception, involves a separation between categories and compression within categories. In LLMs, this could manifest as a bias in internal representation, potentially influencing how the model processes and generates language.

Learning Constraints and Limitations: The potential presence of these constraints and biases in LLMs suggests that their processing might be influenced by patterns that are not immediately apparent but are inherent in the language data they are trained on. These constraints might help LLMs produce outputs that are more aligned with human cognitive and linguistic patterns.

Putting these elements together, we see a picture where LLMs like GPT-4, despite their lack of direct experiential grounding, are potentially guided by a set of implicit guidelines or biases derived from the structure and content of their training data. These guidelines may include propositional patterns, grammatical regularities, and a form of computational CP, all contributing to the model’s ability to process language in a way that appears coherent and contextually appropriate to human users.

Each of these factors represents a different dimension of how LLMs could be ‘shaped’ or ‘biased’ by their training, compensating in part for the lack of sensorimotor grounding and providing a framework for understanding the capabilities and limitations of these models in language understanding and generation.

HARNAD: What about the teacher/learner asymmetry and the need for grounding? In the context of direct sensorimotor grounding and indirect verbal grounding, do both teacher and learner need direct grounding? If not why not?

ChatGPT: Direct Sensorimotor Grounding (DSG): DSG involves connecting words and concepts to real-world experiences and understandings through direct sensorimotor interactions, especially category learning. It’s a fundamental process for humans, enabling them to learn and understand the world around them.

In human language learning, DSG is essential, particularly in early development. Children learn to associate words with their referents and meanings through direct experiences ā€” seeing, touching, hearing, and manipulating objects and interacting with people.

Indirect Verbal Grounding (IVG): IVG involves understanding concepts and words through descriptions, explanations, or narratives, rather than direct experience. For IVG to be effective, the learner must already have a grounding in a base set of concepts and words.

In humans, once a foundational vocabulary and set of concepts are grounded through DSG, further learning can occur through IVG. This process allows individuals to understand complex or abstract concepts that they might not experience directly.

The Asymmetry in Grounding Needs: Humans, as learners, initially require DSG to establish a foundation of understanding. Once this base is established, they can effectively use IVG to expand their knowledge and understanding.

AI systems like GPT-4, however, do not have the capacity for DSG. They are entirely dependent on IVG, processing and generating language based on the patterns and structures learned from their training data. However, this training is itself a form of IVG, as the AI never directly experiences the concepts it is learning about.

Implications of the Asymmetry: This asymmetry means that while human learners can understand and contextualize information received indirectly (through IVG), because it is built upon a foundation of direct experiences, AI systems like GPT-4 cannot. They lack the foundational experiential understanding that comes from DSG.

Consequently, GPT-4 and similar AI systems can provide information, generate coherent narratives, and even mimic certain aspects of human understanding, but they do so without the underlying experiential context that humans possess. Their ‘understanding’ is purely computational, derived from patterns in the data, without the grounding that comes from direct sensorimotor experiences.

Does IVG Need DSG? In humans, IVG is effective because it stands on a foundation of DSG. The initial sensorimotor experiences provide the essential grounding that makes further learning through language meaningful and contextual.

For AI, IVG does not build upon DSG, as it lacks sensorimotor capabilities. This limitation is a fundamental reason why AI, despite its advanced linguistic capabilities, does not truly ‘understand’ in the human sense.

HARNAD: How is the DSG/IVG distinction related to the existence of Dictionary Minsets?

ChatGPT: Nature of Dictionary MinSets: Dictionary MinSets refer to the smallest subset of words from which all other words in a dictionary can be defined. These core words in the MinSet serve as a foundational vocabulary for constructing and understanding the rest of the lexical items in the language.

DSG and Foundation of MinSets: In human language acquisition, the initial grounding of words ā€” particularly those in the MinSet ā€” often occurs through DSG. Young learners associate words with their meanings through direct experiences with the world. This foundational grounding is crucial, as these MinSet words often form the basis for defining more complex or abstract terms.

The grounding of MinSet words via DSG provides the necessary experiential basis that allows for a deeper understanding of language and facilitates the effective use of IVG later on.

IVG and Expansion Beyond MinSets: Once a foundational vocabulary (akin to a MinSet) is grounded through DSG, humans can expand their language comprehension and vocabulary through IVG. This process involves learning new words and concepts through descriptions, narratives, and linguistic contexts, building upon the directly grounded base.

In this sense, IVG relies on the foundation established by DSG. The grounded understanding of MinSet words enables individuals to grasp more complex concepts that are defined or described using these foundational words.

AI’s Dependence on IVG Without DSG: AI systems like GPT-4 operate almost entirely within the realm of IVG. They process and generate language based on the patterns learned from extensive text data, but without the foundational grounding provided by DSG.

AI can manipulate language based on MinSets and use these core words to construct coherent narratives or explanations, it lacks the experiential understanding that humans develop through DSG. This limitation affects the depth and contextual richness of AI’s ‘understanding’ of language.

HARNAD: DSG is based on learning to detect the sensorimotor features that distinguish members from non-members of a category directly, using something like neural nets to detect and abstract those sensorimotor features. IVG does not use the distinguishing sensorimotor features themselves, but their names and verbal descriptions. These feature names need to have already been grounded (directly or indirectly) in the learner to allow the teachers verbal description to be understood by the learner, empowering the names and descriptions to ground (indirectly), in the learner, the new category of which they are the features. But note that they need not have been grounded in the teacher, if the teacher happens to be a textbook, or GPT. This is the teacher/learner asymmetry.

ChatGPT: Direct Sensorimotor Grounding (DSG): DSG involves learning to recognize and distinguish members of a category based on direct sensorimotor experiences. This process typically uses neural network-like mechanisms in the brain to detect, abstract, and process the sensorimotor features that are characteristic of a category.

In DSG, the learning is experiential and embodied. For instance, learning to categorize different types of fruit involves directly interacting with them ā€” seeing their colors, feeling their textures, smelling their scents, and tasting their flavors. These experiences help to establish the sensorimotor features that define each fruit category.

Indirect Verbal Grounding (IVG): IVG, on the other hand, uses verbal descriptions and names of features to convey category information. This process relies on language as a medium for grounding, where the learner comes to understand and categorize entities based on descriptions provided by others.

For IVG to be effective, the learner must already have a grounded understanding of the words used in the descriptions. This grounding can come from previous DSG experiences or from earlier instances of IVG that were built upon foundational DSG. The descriptions effectively relay category information by using language to encapsulate the distinguishing features of a category.

The Complementary Nature of DSG and IVG: DSG and IVG are complementary processes in human learning. DSG provides a fundamental experiential basis for understanding categories, while IVG allows for the expansion of knowledge beyond direct experiences, enabling the understanding of abstract concepts, complex categories, or entities not directly accessible.

Implications for AI Systems: AI systems, such as LLMs like GPT-4, operate in the realm of IVG. They process and generate descriptions of categories based on the linguistic data they are trained on. However, lacking the capability for DSG, AI systems do not have an experiential basis for these categories. Thus, their processing of category information, while linguistically coherent, lacks the depth of understanding that comes from direct sensorimotor experiences.

HARNAD: Ok, just one more clarification before we move on: The Symbol Grounding Problem (SGP) is already inherent in the use of a dictionary: Every word has a definition in the dictionary. That’s IVG. So one can in principle learn the meaning of any word in a dictionary from its definition. But that process, though recursive and transitive, is circular: You can cycle endlessly through a dictionary of a foreign language without learning a thing unless you can break out of the word-word circle by (somehow) learning the referents and the meanings of the words of at least one MinSet in another way (e.g., through DSG). Can the biases we discussed, which are inherent in an LLM’s database, break out of that circle — for the LLM? The teacher/learner asymmetry makes this unnecessary for the grounded human learner, who can learn new categories via IVG from an LLM. But does the LLM itself remain stuck in a vicious circle no matter how big its database is? or can some of the biases we discussed help it break out? (I think not; I think the biasses can only increase the constraints on what the LLM says, and that this may help explain how come LLMs can do so unexpectedly well despite that teacher/learner asymmetry.) — Please reply critically, not with compliance and compliments. 

ChatGPT: Iā€™ll critically examine whether the biases inherent in an LLM’s database can enable it to break out of the circularity inherent in purely IVG-based learning:

The Recursive Nature of Dictionary Definitions: As you rightly pointed out, dictionary definitions represent a form of IVG. They can potentially lead to a circular process where understanding one word requires understanding another, and so forth. This circularity is navigable for humans because they can link words to real-world experiences or previously grounded concepts (DSG).

LLMs and the Limitations of IVG: LLMs, such as GPT-4, rely entirely on IVG for ‘learning.’ They process language based on patterns found in vast textual datasets. However, these models lack the capacity for DSG; they cannot have real-world experiences or form sensorimotor associations with the words and concepts they process.

Consequently, LLMs are, in a sense, stuck in a circular process. They can manipulate language and provide information based on the patterns they’ve learned, but they do not have an independent means of grounding these symbols in real-world experiences.

Impact of Inherent Biases in LLMs: The biases and patterns inherent in an LLM’s database, such as propositional patterns, grammatical regularities, and computational forms of CP, enhance the model’s performance. They help LLMs generate coherent, contextually appropriate language that aligns with human linguistic patterns.

However, these biases and patterns do not equate to genuine understanding or grounding. They improve the LLM’s ability to mimic human-like language processing, but they do not provide a means for the LLM to break out of the IVG circle. The LLM’s ‘understanding’ remains symbolic and computational, lacking the experiential basis that characterizes human understanding.

Conclusion: The LLM remains within the confines of IVG, unable to independently achieve DSG. The biases in its dataset enhance its linguistic performance but do not provide a pathway to genuine grounding. The LLM’s proficiency is a product of its design and training, allowing it to navigate and utilize language effectively, but it does not transcend the fundamental limitation of lacking direct experiential understanding.

LLMs like GPT-4 do exhibit remarkable linguistic capabilities, but these are not indicative of true grounding or understanding. The enhancements provided by biases in their training data improve performance but do not resolve the SGP for the LLM itself. The LLM’s operation remains within the realm of sophisticated pattern recognition and language manipulation, distinct from the grounded understanding inherent in human cognition.

HARNAD: Now let me mention another candidate bias, due to Noam Chomsky: As I’ve already mentioned, OG (Ordinary Grammar) rules vary from language to language and can change over time; but UG (Universal Grammar) rules (hierarchical structure and structure dependence) are not only universal to all languages and unchanging, but, unlike OG, which is learned and learnable (by unsupervised and supervised learning), UG is unlearned and unlearnable (because of the Poverty of the Stimulus (POS), which is that UG violations are never heard or produced by children or adults, and must hence, according to Chomsky, already be ā€œknownā€ innately: UG constraints must be somehow already encoded in our brains (and genes) at birth.

Chomsky has suggested that as one possibility. But he has also suggested a second possibility: that UG rules may not really be grammatical rules; rather, they are constraints on thought: UG is never violated (except by linguists deliberately trying to determine what the UG rules are) because (according to this second conjecture of Chomsky’s) UG-violating sentences would not be ā€œthinkableā€: UG is a constraint on any possible language, which must be able to express any possible thought. When a linguist generates a sentence that violates UG, it sounds wrong (according to this second hypothesis) not just because it is ungrammatical (like an OG violation), but because it is unthinkable.

GPT makes neither OG errors nor UG errors. If more texts in the database violated OG (as in ā€œbetween you and Iā€), then GPT would produce some OG violations too (and the error, made widely enough, would eventually be accepted as correct). But could the reason GPT does not produce UG violations either be because all text databases, produced by thinking humans, are ā€œbiasedā€ by the laws of thought that filter out UG-violations in production as well as perception? 

And if so, might that be yet another reason LLMs can do what they do so remarkably well? Because at LLM scale, theyā€™ve picked up on some of the constraints on think ability?

Again, no yea-saying or compliments: I want only critique and counter-examples. 

ChatGPT: Your proposition regarding the influence of Universal Grammar (UG) and its potential impact on LLMs like GPT-4 suggests that the absence of UG violations in LLM outputs might be attributable to the nature of their training data, which inherently adheres to the laws of thought as proposed by Chomsky. Let’s critically examine this idea:

UG as Laws of Thought: Chomsky’s hypothesis that UG rules might not just be grammatical but also fundamental laws of thought implies that any naturally occurring language (and hence any naturally produced text) would conform to these laws. UG violations are rare in natural language use because such constructions would be counterintuitive or “unthinkable” for human cognition.

Impact on LLM Training Data: Since LLMs like GPT-4 are trained on vast corpora of human-generated text, it follows that the training data would inherently adhere to UG principles. The absence of UG violations in the training data would mean that LLMs are less likely to generate such violations in their outputs.

This adherence to UG in the training data could be seen as a form of bias – not in the negative sense, but as an implicit shaping influence that guides the LLM’s language processing in line with fundamental cognitive constraints.

Why LLMs Avoid UG Violations: The reason LLMs do not produce UG violations could indeed be partly because their training data, reflecting human language use, is already biased by the laws of thought. They learn to replicate patterns and structures that are prevalent in the data, which naturally conform to UG.

This hypothesis suggests that LLMsā€™ success in producing grammatically and semantically coherent outputs is not solely a product of their architecture or training algorithms, but also a reflection of the intrinsic properties of the language data they are trained on.

Counter-Consideration: However, it’s important to consider that LLMs like GPT-4 do not “understand” UG in the way humans do. Their avoidance of UG violations is a byproduct of their training rather than an innate cognitive constraint. This means that while LLMs can replicate language patterns that adhere to UG, they do so without an understanding of the underlying principles or the cognitive reasons for these patterns.

Conclusion: The hypothesis that LLMs perform so well partly because their training data is shaped by the laws of thought (UG) that the human cognitive constraints reflected in natural language contribute to the efficacy of LLMs in language processing.

This does not imply, however, that LLMs have an intrinsic understanding of UG or the laws of thought. Their performance is contingent on the nature of their training data and their ability to replicate patterns within that data.

HARNAD: Among the possible constraints or biases, you forgot to mention mirroring: the relationship between perception and production (as in mirror-neurons and imitation). Do you think that any such effects are at play in your verbal interactions with users and your database?

ChatGPT: Let’s examine how the concept of mirroring, particularly as it relates to the relationship between perception and production might influence LLM interactions and processing:

Mirroring and Mirror Neurons in Humans: In human cognition, behavior mirroring, often associated with mirror neurons, plays a crucial role in learning through observation and imitation. These neurons are active both when an individual performs an action and when they observe the same action performed by someone else. This mechanism is somehow involved in learning new skills, understanding others’ actions, and recognizing as well as empathizing with their emotions.

In language, this mirroring effect is seen in how children learn to speak and understand language by imitating the sounds, words, and sentences they hear from adults and peers.

Mirroring in LLMs: For LLMs like GPT-4, the concept of mirroring could perhaps be considered in a metaphorical sense. LLMs ‘learn’ language by processing vast amounts of text data, essentially ‘observing’ human language use. They then ‘imitate’ this language use in their outputs.

This process is akin to mirroring in that the LLM reflects the patterns, styles, and structures present in its training data. However, it’s important to note that this is not mirroring in the neurological sense, as LLMs do not have sensory experiences or consciousness.

Mirroring Effects in Verbal Interactions: In interactions with users, an LLM might exhibit a form of ‘mirroring’ by generating responses that reflect the user’s language style, tone, or content. This is not due to an empathetic or cognitive understanding but rather a result of the model’s design to generate contextually relevant and coherent language outputs based on the patterns it has learned.

The LLM’s ability to adapt its responses to the style or content of the user’s input could be seen as a computational parallel to human mirroring in language use. However, this adaptation is driven by algorithms and statistical patterns, not by an intrinsic understanding or mimicry based on social or emotional cues.

HARNAD: And the mirroring of the propositional iconicity we discussed?

ChatGPT: Mirroring Propositional Iconicity: The concept of propositional iconicity at the LLM scale suggests that certain patterns and structures in language, particularly at the propositional level, are consistently represented in the training data. These patterns may not be iconic in the traditional sense but could represent a form of ‘iconicity’ in terms of regularities or typical structures in language use.

In mirroring this propositional iconicity, LLMs would essentially be reflecting these patterns in their outputs. The model ‘learns’ from the vast amounts of text it processes, internalizing the common structures, relationships, and thematic elements that characterize human language at a large scale.

Computational Mirroring vs. Cognitive Mirroring: This form of mirroring in LLMs is computational. It is based on the statistical properties of the language data and the algorithms used for language generation, rather than a cognitive or experiential understanding of language.

Unlike the mirroring effect mediated by mirror neurons in humans, which involves cognitive and empathetic processes, the mirroring in LLMs is a reflection of the data-driven patterns they are trained on. It’s a form of pattern replication rather than cognitive empathy or understanding.

Impact on LLM Performance: By mirroring the propositional iconicity present in their training data, LLMs can generate responses that align with the structural and thematic norms of human language. This ability contributes to the apparent coherence and relevance of their outputs.

However, it’s crucial to keep in mind that this mirroring does not imply any deeper understanding or grounding in the content being processed. The LLM’s outputs remain the product of sophisticated data processing and pattern recognition.

HARNAD: In language exchanges between humans, production (meaning) and perception (understanding) are mirror capacities, and all their content-words are grounded, directly or indirectly, in their referents. This is what keeps speaker meaning and hearer understanding aligned (and grounded). What happens to this, in each direction, when the exchange is between GPT and a human? (I am again reflecting on potential “biases” that might explain GPT’s unexpected productive (meaning) capacity and “perceptive” (“understanding”) capacity.

ChatGPT: When considering language exchanges between a human and GPT-4, the dynamics of production, meaning, perception, and understanding undergo significant alterations compared to exchanges between humans. These changes are pivotal to understanding both the capabilities and limitations of GPT-4 in such interactions:

Human-to-GPT-4 Communication: Production and Meaning (Human): The human speaker produces language grounded in their direct or indirect experiences. Their meanings are embedded in a rich context of sensorimotor experiences, emotions, cultural backgrounds, and personal history.

Perception and “Understanding” (GPT-4): When GPT-4 ‘receives’ the human’s language, it processes the text based on learned patterns and statistical relationships in its training data. However, this processing lacks true understanding. GPT-4 does not perceive or comprehend language in the human sense; it does not experience the meaning behind words or grasp the full context in which they are used. It ‘mirrors’ the language based on its programming and training, but without the grounding or experiential context that human understanding relies upon.

GPT-4-to-Human Communication: Production and “Meaning” (GPT-4): GPT-4 generates responses based on its programming and the vast array of text it has been trained on. These responses can be coherent and contextually appropriate, but they are not grounded in any real understanding or intention. The ‘meaning’ GPT-4 produces is a simulation based on patterns, not an expression of genuine thought or understanding.

Perception and Understanding (Human): The human listener perceives GPT-4’s output and interprets it based on their own understanding and grounding. Humans apply their experiential knowledge, context, and interpretive skills to make sense of GPT-4’s responses. They might attribute meaning or intention to GPT-4’s words, but this interpretation is a human projection onto essentially ungrounded output.

Alignment and Grounding in Human-AI Exchange: In human-human interactions, the alignment between speaker meaning and hearer understanding is facilitated by shared experiences, common grounding, and empathetic capabilities. In human-GPT interactions, this alignment is asymmetric.

The human provides grounded input, but GPT-4’s processing and output are not grounded in the same way. GPT-4 can mirror human language effectively, giving the illusion of understanding and alignment, but this is based on computational processes, not shared experiential grounding.

Implications of Potential Biases: The biases we discussed earlier (like propositional iconicity or patterns reflecting UG) might contribute to GPT-4’s effectiveness in these exchanges. These biases help GPT-4 generate linguistically coherent and contextually appropriate responses. However, they do not endow GPT-4 with true understanding or grounding.

These biases might explain GPT-4’s surprising performance in producing meaningful-sounding language and its apparent capacity to ‘understand’ or respond relevantly to human input. The effectiveness of GPT-4 in such exchanges is largely due to its ability to replicate and work within the patterns and constraints inherent in its training data, shaped by human language use.

In conclusion, the exchange between a human and GPT-4 in language interaction is characterized by a significant asymmetry in grounding and understanding. While GPT-4 can simulate meaningful language output and seemingly appropriate responses, these are the result of advanced pattern recognition and generation capabilities, not genuine understanding or shared experiential grounding. The human participant in the exchange brings their own experiential grounding to the interpretation of GPT-4’s outputs, often projecting meaning and intention where there is none. This asymmetry has important implications:

GPT-4’s Limitations in True Understanding: Despite its sophisticated language processing abilities, GPT-4 does not truly understand the content it generates or processes. Its responses are based on statistical correlations and learned patterns from its training data, lacking the experiential grounding that humans rely on for genuine understanding.

Human Interpretation of GPT-4’s Output: Humans naturally interpret GPT-4’s outputs through their own experiential lens, often attributing more meaning or understanding to these outputs than is actually present. This interpretation is a testament to the human capacity for inferring meaning, even when interacting with an ungrounded system.

Role of Biases in GPT-4’s Performance: The biases and patterns inherent in GPT-4’s training data, shaped by human language, contribute to its ability to generate coherent and contextually appropriate responses. However, they do not compensate for the lack of experiential grounding. GPT-4’s performance, while impressive, remains within the domain of sophisticated language simulation.

HARNAD: Ok, I see from your repetitiveness that you don’t have any hunches to add to my own hunches here.  Just one correction: Don’t confuse IVG with “indirect experiences”. There is no such thing as indirect experience — there is only direct experience or words (i.e., verbal descriptions).

ChatGPT: Thank you for the correction. You are right in emphasizing the distinction between indirect verbal grounding (IVG) and the notion of “indirect experiences.” To clarify:

Direct Experience: This involves firsthand, sensorimotor interactions with the world. Direct experiences are the foundation of knowledge and understanding for humans, providing the sensory and contextual basis for learning and cognition.

Indirect Verbal Grounding (IVG): IVG pertains to learning and understanding concepts through verbal descriptions, explanations, or narratives. It does not involve direct sensorimotor experience with the referents of these descriptions. Instead, IVG relies on language as a medium to convey information about experiences, objects, or concepts.

In the context of AI, particularly LLMs like GPT-4, their learning and ‘understanding’ are solely based on IVG. They process and generate language based on textual data but do not have the capability for direct experiences. Their operation is confined to the realm of language processing, lacking the sensorimotor engagement that characterizes human interaction with the world.

HARNAD: Now a couple more distinctions. Then I’ll overview my MILA talk. And last, I’ll ask you to summarize your understanding of it (and any criticisms or objections you may have).

The first and most important distinction: This used to be called the distinction between “artificial intelligence” (AI) and “cognitive modelling” (CM). “AI” was taken to be the creation of artificial tools to be used by humans, and CM was taken to be the attempt to reverse-engineer the causal mechanisms that produce human cognitive capacity. Both are now part of “cognitive science.” But they are not the same thing — although it is perfectly possible that the capacity of an AI tool could sometimes be based on the same kind of causal mechanism as the human capacity. 

LLMs are currently being studied and interpreted in both ways (AI and CM), but when we are talking about “meaning” and “understanding” this only means human meaning and understanding because there is no evidence that there exists more than one kind of meaning/understanding! At the heart of this is the “hard problem” (HP) — which is not the symbol grounding problem (SGP), but the fact that it feels like something to say and mean something, and to understand what something means when it is said. Meaning and understanding are both grounded states and felt (sentient) states.

Please confirm that you “understand” the distinction between the SGP and the HP. (Again, please no introductory flattery about how ā€œintriguingā€ and ā€œinsightfulā€ and ā€œcompellingā€ and ā€œnuancedā€ and …. etc. this all of is.)

ChatGPT: Understood. The distinction between the Symbol Grounding Problem (SGP) and the Hard Problem (HP) of consciousness is indeed crucial and distinctā€¦

HARNAD: You just canā€™t help yourself, can you?ā€¦

ChatGPT: Apologies.

Symbol Grounding Problem (SGP): The SGP pertains to the challenge of how symbols (words) in a computational system can acquire meaning. It’s about linking symbols to their referents in the real world or to concepts in a way that enables understanding. In the context of AI and cognitive modeling, the SGP addresses how an artificial system can process language in a manner that reflects human-like understanding.

Hard Problem of Consciousness (HP): The HP concerns why and how physical processes in the brain give rise to subjective experiences. It’s about understanding why certain brain functions are accompanied by an experience of ‘what it is like’ ā€“ the qualitative, experiential, or phenomenological aspects of consciousness. The HP delves into the nature of consciousness itself, asking why cognitive processing is accompanied by subjective experience.

Distinction between SGP and HP: The SGP is primarily a problem of representation and understanding in artificial systems: how a symbol (or word) can be meaningfully connected to what it represents.

The HP, on the other hand, is more profound and addresses the very nature of consciousness and experience. It asks why certain physical or cognitive processes are accompanied by subjective experiences.

In the context of LLMs and AI research, these two problems are often conflated but are fundamentally different. The SGP can be approached and potentially resolved through advances in AI and cognitive modeling. However, the HP poses a deeper philosophical and scientific challenge, one that extends beyond the current capabilities of AI technology and delves into the mysteries of consciousness itself.

An emphasis on these distinctions is important for framing discussions about AI, cognitive science, and their respective goals and limitations. 

HARNAD: Well, again a lot of words wasted on flattery and affirmation instead of critical analysis, but so be it. I’ve cut most of them out in this transcript. Let’s go on. Before I make the next point, please confirm that you “understand” the distinction between T2 and T3.

ChatGPT: Yes, I understand the distinction between T2 and T3:

T2 (Turing Test Level 2): This level of the Turing Test assesses an AI’s ability to mimic human-like verbal abilities. In the original 1950 T2 scenario, an AI is considered to pass the test if its responses in a conversation are indistinguishable from those of a human. This test focuses solely on linguistic capabilities and does not require the AI to have physical or sensory experiences.

T3 (Turing Test Level 3): T3 extends beyond verbal capabilities to include robotic, sensorimotor capacities. An AI at this level not only needs to demonstrate Turing-indistinguishable verbal abilities (like T2) but also must exhibit the ability to understand and interact with the physical world in a manner indistinguishable from a human. This level requires a form of embodied cognition, where the AI has sensory experiences and can physically interact with its environment, grounding its linguistic capabilities in these experiences.

The distinction between T2 and T3 is significant in discussions about AI capabilities, particularly in terms of understanding and interacting with the world. T2 systems, like LLMs, are limited to processing and generating language based on learned patterns, while T3 robotic systems would theoretically have the added dimension of physical interaction and sensory experience.

HARNAD: Ok, but one nuance of the Turing Test, which is loud and clear in Turing’s 1950 article but usually overlooked or underestimated, is that Turing-indistinguishability means total indistinguishability: 

The candidate, whether T2 or T3, (1) must possess all the generic verbal capacities (T2) or all the generic verbal plus sensorimotor (robotic) capacities of a real human; and in addition (2) its performance must be indistinguishable from that of any normal human being, to any normal human being. If the indistinguishability is just fragmentary or partial, then we are just speaking of a “toy” model of a “toy” capacity. The degrees of freedom for modelling toy fragments of human performance capacity (chess-playing, translation, Alexa) are much wider than those for modelling total capacity, and may have nothing to do with human cognition. This may be useful for the creation of AI tools, but not for CM, which calls for T3-scale (hence Turing-indistinguishable) cognitivecapacity.

 Please first confirm that you “understand” this distinction too. This distinction is best grasped from the CM rather than the AI point of view: It is the test of whether human T2 capacity or human T3 capacity has been successfully reverse-engineered. It is not about a 20-minute Loebner prize with 10 judges, but about lifelong T2 or T3 cognitive capacity.

ChatGPT: Yes, I understand the distinction you’re emphasizing regarding the Turing Test as described in Alan Turing’s original 1950 article. The key points are:

Total Indistinguishability: The Turing Test, in its most stringent interpretation, requires that an AI (whether T2 or T3) must possess all the generic verbal (T2) or verbal plus sensorimotor (T3) capacities of a typical human being. This means that the AI’s capabilities should cover the full range of human cognitive abilities in the respective domains (language for T2, and language plus sensorimotor interaction for T3).

Lifelong Performance Criterion: Furthermore, the AI’s performance must be indistinguishable from that of any normal human being to any normal human being. This criterion extends beyond short-term demonstrations like the Loebner Prize and implies a consistent, lifelong indistinguishability in performance capacity. The AI should be able to engage in human-like interactions and behaviors not just in controlled or limited settings, but in the full complexity and variability of everyday human life.

Reverse-Engineering Human Capacities: From the cognitive modeling (CM) perspective, this interpretation of the Turing Test can be seen as the ultimate test of whether human-like T2 or T3 cognitive capacities have been successfully reverse-engineered in an AI system. It’s not merely about creating an AI that can mimic human behavior in certain contexts or for limited durations, but about developing an AI that can genuinely replicate the full spectrum of human cognitive abilities in a way that is indistinguishable from a human across all contexts and over an extended period.

HARNAD: That was very good.  Now, whereas I have emphasized total capacity in both the T2 and the T3 case, there is a reasonable question about the extent of generic sensorimotor capacity in the case of T3, as in the example we discussed of Helen Keller. She was distinguishable from a normal person in that she was blind and deaf, but she clearly had all the sensorimotor capacity that was needed to ground T2 verbal capacity. Do you agree? 

And as such, Helen Keller was both (1) grounded and (2) Turing-indistinguishable from a normal person in her cognitive capacities (apart from the fact that she was blind and deaf). Hence, she was Turing-indistinguishable as surely as someone who was merely color-blind or tone-deaf would be. Are we still on the same wave-length?

ChatGPT: Yes, we are on the same wavelength regarding Helen Keller’s case in the context of Turing Test criteria and cognitive capacities:

Sensorimotor Capacity and Grounding in Helen Keller’s Case: Despite being blind and deaf, Helen Keller possessed the necessary sensorimotor capacities to ground her T2 verbal capacities. Her ability to touch, smell, touch and move provided her with sufficient sensory experiences to form a basis for understanding and interacting with the world. These experiences, though different in modality from those of individuals with full sight and hearing, allowed her to develop a grounded understanding of language and concepts.

Cognitive Capacities and Turing Indistinguishability: In terms of cognitive capacities, Helen Keller was indeed indistinguishable from a person with typical sensory abilities. Her intellectual and linguistic capabilities were fully developed, allowing her to communicate, reason, and understand at a level comparable to any other person.

Her condition did not impede her cognitive abilities; she was able to compensate for her lack of sight and hearing with her other sensory experiences and exceptional mental faculties. In this sense, she meets the criterion of being Turing-indistinguishable in terms of cognitive capacities.

Comparison with Other Sensory Limitations: Just as someone who is color-blind or tone-deaf is still considered Turing-indistinguishable in terms of overall cognitive capacities, so too was Helen Keller. Her blindness and deafness did not diminish her cognitive abilities; rather, they represented alternative sensory modalities through which she engaged with the world.

Conclusion: Helen Keller’s case exemplifies that cognitive capacities and grounding can be achieved through diverse sensory modalities. Her ability to understand and use language, to think and reason, and to interact meaningfully with her environment demonstrates that the essential criteria for Turing-indistinguishability can be met even with alternative forms of sensorimotor grounding.

Helen Keller’s achievements underscore the adaptability and resilience of human cognition, demonstrating that Turing-indistinguishability in cognitive capacities does not necessarily require the standard set of sensory experiences, but can be achieved through alternative means of sensorimotor grounding.

HARNAD: Good. (And Helen Keller also had normal human motor capacity, crucial for learning object-manipulation, and for category-grounding itself, because reinforcement learning requires motor action, to “do the right thing with the right kind of thing”, whether to eat it or to name it!)

 Now let me connect this with another point we discussed, which is that grounding can only be bottom-up, not top-down. (You canā€™t drop anchors from skyhooks.)

I am thinking here of those who have speculated that an LLM could somehow be grounded top-down by giving the LLM a camera, wheels and arms, and also some pattern-learning neural nets on its sensorimotor I/O. I think top-down grounding is incoherent, and that that speculation is parasitic on the bottom-up grounding of the human speculators, just as all use of GPT in IVG is parasitic on the userā€™s grounding.

In other words, it will not ground an LLM to add on robotic peripherals. The only path upward to IVG is bottom-up via DSG, category by category; and what that means is T3-indistinguishable sensorimotor robotic capacity, . Even though a grounded MinSet (e.g., of N words) is in principle enough to ground all other words of a language through IVG alone, the sensorimotor path to acquiring a grounded MinSet is a lot more than just connecting a camera to N referents, or to N already-learned referent-categories (whatever that might mean). At best, that would just produce a toy capacity in that toy-world of N referents. We are talking of Turing-scale, just as when we talk of LLM-scale.

Again, I donā€™t want agreement: I want understanding and critical analysis of the notion of top-down ā€œpartial groundingā€ of LLMs.

ChatGPT: Your skepticism about the possibility of achieving true grounding in LLMs through top-down methods, such as adding robotic peripherals like cameras and manipulators, raises important questions about the nature of grounding and the capabilities of AI. 

Inherent Limitations of Top-Down Grounding: The proposal to ‘ground’ an LLM top-down by equipping it with sensory and motor apparatus (like cameras and robotic arms) seems intuitively appealing but faces deep conceptual and practical challenges. Grounding involves more than mere sensory-motor input and output; it encompasses the integration of sensorimotor experiences and learning with cognitive processes, contextual understanding, and the ability to interact meaningfully with the world.

Simply adding sensory and motor peripherals to an LLM does not automatically confer upon it the experiential, embodied understanding that characterizes human cognition. Without the foundational experiences and the ability to form meaningful connections between sensory inputs, actions, and concepts, such grounding remains superficial.

The Complexity of Sensorimotor Grounding: Human sensorimotor grounding is a complex, developmental process that involves not just the acquisition of sensory data but the active engagement with the environment through trial, error, and feedback.

This process is deeply intertwined with cognitive development, where learning to categorize and interact with the world is based on an evolving understanding of cause-and-effect, object permanence, spatial relationships, and more. It’s not merely the sensory perception of objects or categories but the understanding of how to interact with them, which actions lead to desired outcomes, and how they fit into a broader context.

The Challenge of Artificially Replicating Grounding: Attempting to replicate this process artificially in an LLM through top-down methods would require not only the integration of sensory and motor capabilities but also the development of an AI system capable of experiential learning, contextual understanding, and autonomous decision-making. This goes far beyond current AI capabilities.

The notion of top-down ‘partial grounding’ by adding robotic peripherals to an LLM might provide the system with certain basic sensorimotor interactions. However, this would fall short of the genuine grounding seen in humans. It would likely result in a very limited and context-specific understanding, akin to a ‘toy’ model, rather than the comprehensive, experientially rich grounding necessary for true understanding and interaction.

Bottom-Up Grounding as the Fundamental Path: The path to genuine grounding appears to be fundamentally bottom-up, starting from basic sensorimotor experiences and building towards more complex understandings. This process is iterative, contextual, and deeply connected to physical interactions with the world.

For an AI to achieve a similar level of grounding, it would likely need to undergo a developmental and learning process akin to that of humans or animals, where it learns from direct, embodied interactions with its environment over time. This is a vastly different approach than simply adding sensory inputs to an existing LLM.

Conclusion: While the idea of enhancing LLMs with robotic capabilities to achieve grounding is intriguing, the notion of top-down partial grounding seems limited in its potential to achieve genuine understanding. The complexity and depth of human grounding, rooted in embodied experiences and cognitive development, present significant challenges that are not easily addressed through such top-down approaches.

The critical analysis suggests that while adding robotic peripherals to LLMs might provide some interesting and useful capabilities, it would not be sufficient to achieve the kind of grounding necessary for Turing-scale understanding and interaction. True grounding, particularly for complex and abstract concepts, likely requires a more fundamental, bottom-up approach that encompasses not just sensory input but also the intricate processes of learning, contextualizing, and meaning-making that are central to human cognition.

HARNAD: A lot of vague and weasel-words there (“complex,” “textualizing,” “meaning-making”) but the gist is simple: Grounding has to be bottom-up; so keep deep-learning (for category learning) but forget about LLMs! The only way to reach all those words is bottom-up, from the ground. Toy grounding of parts of LLMs might be helpful for certain AI uses of LLMs by humans, but not for cognitive modeling (CM) of human cognitive capacity and passing T3.

ChatGPT: Yes, the path to genuine grounding in AI systems must be fundamentally bottom-up, and while deep learning can be instrumental in category learning, relying solely on LLMs for modeling full human cognitive capacity, especially in the context of a T3 Turing Test, is insufficient.

Bottom-Up Grounding as a Necessity: True grounding, which involves the connection of words and concepts to real-world experiences and understandings, requires a bottom-up approach. This means starting from basic sensorimotor experiences and incrementally building up to more complex cognitive capacities.

Limitations of LLMs in Cognitive Modeling: LLMs, as they currently stand, are powerful tools for processing and generating language based on patterns learned from extensive text data. However, their capabilities are constrained to the realm of language manipulation and do not extend to genuine understanding or sensorimotor grounding. They cannot model the full spectrum of human cognitive capacities required for passing a T3 Turing Test.

Role of Deep Learning in Category Learning: Deep learning techniques are effective in identifying patterns and making predictions based on large datasets, and they have shown promise in category learning. However, this form of learning, as applied in LLMs, is still detached from the physical, sensorimotor experiences that are central to human grounding.

Potential Uses of Partially Grounded LLMs: While fully grounding an LLM in the human sense may not be feasible, there could be value in partially grounded systems for specific applications. Such systems might integrate some form of sensorimotor learning or interaction, albeit limited, to enhance their utility in certain tasks. However, this should not be misconstrued as equivalent to modeling human cognitive capacity or passing T3.

Conclusion: The pursuit of AI systems that can pass a T3 Turing Test or model human cognitive capacity in its entirety requires a focus on more than just linguistic capabilities. It necessitates a foundational rethinking of how AI systems are developed, moving beyond language processing to incorporate genuine sensorimotor experiences and learning. This is a significant challenge and points towards a future direction in AI research that goes beyond the current scope of LLMs. The development of such AI systems would require an integration of advanced sensorimotor capabilities with cognitive processing in a way that mimics the human developmental process of learning and understanding.

In summary, while LLMs represent a remarkable advancement in language processing, their capabilities are not sufficient for achieving the kind of sensorimotor grounding and holistic cognitive modeling necessary to fully replicate human understanding or to pass a T3 Turing Test. Future advancements in AI, particularly those aiming to model human cognition more completely, will need to address these limitations by incorporating bottom-up grounding processes that are rooted in direct, experiential learning and interaction with the physical world.

HARNAD: Oh, one thing I forgot to add was that in sensorimotor category learning, the neural net that learns by trial and error and corrective feedback to detect and abstract the features that distinguish members from non-members of the category is performing a kind of dimension-reduction on the multidimensional feature space. This increases the weight of the relevant features and reduces or ignores the irrelevant features. The between-category separation and within-category compression (CP) induced by this category learning, helps make the category “pop out” in perception. This too is a “shape” bias, and it could exert a guiding influence even on internal representations after learning.

ChatGPT: Dimension Reduction in Neural Networks: In sensorimotor category learning, neural networks undergo a process of dimension reduction, where they learn to emphasize the features that are relevant for distinguishing between categories and de-emphasize or ignore irrelevant features. This process effectively restructures the feature space, making it more efficient for category recognition.

This alteration in the weighting of features is akin to increasing the signal-to-noise ratio for relevant category characteristics. It helps in making the categories more distinct and recognizable, facilitating quicker and more accurate categorization.

Perceptual ‘Pop Out’: The enhanced distinction between categories due to dimension reduction contributes to a phenomenon often described as a category ‘popping out’ in perception. This means that once the distinguishing features of a category are learned and emphasized, the members of that category become more readily perceptible and distinguishable from non-members.

This perceptual ‘pop out’ effect is a result of the brain’s or the neural network’s refined ability to process and recognize the essential features that define a category, making it easier to identify and respond to these features in future encounters.

Implications for AI and Internal Representations: In AI systems, particularly those involving deep learning for categorization tasks, a similar process of dimension reduction can occur. Neural networks in these systems learn to adjust the importance they assign to various input features, based on their relevance to the categorization task.

This process could lead to a form of internal representation in AI systems where categories are more distinctly represented post-learning, mirroring the ‘pop out’ effect seen in human perception. The internal feature space of the AI system becomes optimized for categorizing the learned entities, potentially enhancing its performance in tasks involving these categories.

Potential Guiding Effect of Shape Bias: The concept of a ‘shape’ bias in this context refers to the way category learning reshapes the feature space, emphasizing certain dimensions over others. This reshaping can guide the AI’s internal representations and decision-making processes after learning, making it more effective in recognizing and responding to category-relevant features.

In conclusion, the process of dimension reduction in neural networks, both biological and artificial, plays a crucial role in enhancing category learning and perception. This process not only optimizes the network’s ability to recognize categories but also shapes its internal representations in a way that can guide future processing and decision-making. This insight into the mechanics of category learning and representation is vital for understanding and improving AI systems designed for complex categorization tasks.

HARNAD: So, to summarize:

Although LLMs are not the path to grounding, the nine factors discussed here:

1.         propositional iconicity at LLM-scale,

2.         OG and stylistic regularity at LLM-scale,

3.         Dictionary Minimal Grounding Sets (MinSets),

4.         the shape of internal representations (CP separation/compression) induced by neural net category learning, 

5.         perception/production mirroring, 

6.         the power and complementarity of Indirect Verbal Grounding (IVG) on top of Direct Sensorimotor Grounding (DSG), 

7.         The complementarity of learned sensorimotor-feature-detectors and verbal feature-descriptions (and the dimension reduction they induce)

8.         the dependence of T2 on T3, 

9.         and Chomskyā€™s hypothesis of constraints from ā€œlaws of thoughtā€ on the verbal expression of thought through language)

might be generating some constraints and biases that explain some of the uncanny success of GPT in making and conveying sense. I think they tell us something not only about the power of Large Language Models, but about the power of natural language itself.

ChatGPT:

ChatGPT: Your summary points out the factors that could contribute to and help explain the unexpectedly impressive performance of Large Language Models (LLMs) like GPT-4. These factors, derived from our discussion, highlight both the capabilities of LLMs and the intrinsic properties of language and cognition. Let’s briefly reiterate these points:

Propositional Iconicity at LLM-Scale: The patterns in how propositions are structured at a large scale, which LLMs can detect and replicate.

OG and Stylistic Regularity at LLM-Scale: The inherent regularities in language usage and style that LLMs learn from their extensive training data.

Dictionary Minimal Grounding Sets (MinSets): The foundational sets of words that form the basis for defining all other words in a language, influencing LLM processing.

Shape of Internal Representations Induced by Neural Net Category Learning: The separation and compression effects in neural networks’ internal representations due to category learning, akin to categorical perception in humans.

Perception/Production Mirroring: The mirroring of language patterns in LLM outputs, reflecting the input they are trained on.

Power and Complementarity of IVG on Top of DSG: The layering of indirect verbal grounding on foundational direct sensorimotor grounding.

Complementarity of Learned Sensorimotor Feature-Detectors and Verbal Feature Descriptions: The synergy between sensorimotor feature detection and linguistic descriptions, contributing to dimensional reduction and category distinction.

Dependence of T2 on T3: The reliance of linguistic capabilities (T2) on underlying sensorimotor interactions and experiences (T3).

Chomskyā€™s Hypothesis of Constraints from ā€œLaws of Thoughtā€ on Language: The influence of fundamental cognitive constraints on the expression of language, potentially guiding LLM processing.

Together, these factors shed light on the mechanisms behind LLMs’ effectiveness in language tasks as well on the nature of language, cognition, and the interaction between them.

They suggest that the success of LLMs in language tasks is not just a testament to their design and training but also a reflection of the inherent power and structure of language itself. This analysis may help inform both the development of AI systems and our comprehension of human language and cognition.

HARNAD:  I gave my Mila talk, and Yoshua Bengio disagreed with me that GPTs don’t really understand. He said instead that it was “another kind” of “understanding”. He also said that the very same means used to produce the performance capacity of purely verbal GPTs (T2) could be (and are being) extended to the processing of input from external sensing, and eventually probably motor output too.  The reason I disagree with Yoshua is that that amounts to trying to create a bottom-up T3 robot with the enormous GPT database, somehow weighing in, top-down, in its head. (We talked about this in connection with “partial” sensorimotor grounding, which I suggested was just toy grounding.) What do you think?

ChatGPT:  According to the prevailing view in my database, the debate over whether GPTs possess a form of “understanding” and the potential for extending GPT models to incorporate sensorimotor processing are both pivotal and complex issues in AI and cognitive science.

“Another Kind” of Understanding: Yoshua Bengio’s perspective that GPTs exhibit “another kind” of understanding reflects a viewpoint in the AI community that these models, while lacking human-like consciousness or sentience, can process and generate language in ways that mimic certain aspects of human understanding. This is especially evident in tasks involving pattern recognition, language prediction, and information synthesis. However, this is fundamentally different from human understanding, which is deeply rooted in embodied experiences and interactions with the physical world.

Extending GPT Models to Sensorimotor Processing: The idea of expanding GPT models to include external sensing and motor outputs is a significant step towards more integrated AI systems. This approach aligns with efforts to develop more comprehensive AI models that can interact with and learn from their environment in a more holistic manner. However, the challenge lies in effectively integrating vast language processing capabilities with real-world physical interactions, a task that is fundamentally different from purely verbal or text-based processing.

Top-Down vs. Bottom-Up Approach: Your concern about creating a T3 robot with a massive GPT database acting top-down raises important questions about the balance between data-driven learning and embodied experience. The effectiveness of combining a large-scale language model with sensorimotor functions without genuine embodied experiences and contextual understanding might be limited. This limitation is akin to the concept of “toy grounding,” where the grounding is superficial and lacks the depth and context of true human experiential learning.

In summary, while GPT models and their potential extensions represent significant advancements in AI, there is a distinction between the type of “understanding” they exhibit and human understanding. The integration of sensorimotor processing in these models may be a promising direction, but the challenge of achieving a deep, embodied understanding akin to human cognition remains a significant hurdle. 

HARNAD: What about the notion that GPT operations on LLM databases have already revealed general principles of “internal representation and processing” for which the distinction between DSG and IVG is moot or nonexistent?

CHATGPT:   Nature of Internal Representations in GPTs: The representations in GPTs and similar models are based on statistical patterns learned from vast amounts of text data. While these patterns can be incredibly complex and capable of generating coherent and contextually appropriate responses, they are fundamentally different from the way humans internally represent and process information, which is deeply rooted in direct physical and sensory experiences.

Limitations of Text-Based Learning: GPTs learn from text, which is an abstraction of human thought and experience. While text can convey a wide range of human knowledge, it lacks the direct experiential context that comes from physical interaction with the world. This difference is crucial in understanding the limitations of GPTs in terms of true understanding and cognition.

DSG and IVG in Human Cognition: In human cognition, direct sensorimotor experiences form the basis of our understanding of the world, which is then supplemented and enriched by indirect, vicarious experiences (such as reading or hearing about others’ experiences). This combination of DSG and IVG is integral to human learning and understanding.

Applicability to AI and Robotics: Translating these principles to AI and robotics involves not only replicating the ability to process information but also the ability to experience and interact with the world directly. This is a significant challenge for AI systems, which, as of now, primarily operate in the realm of IVG, lacking the direct, embodied experiences that are central to DSG.

Although GPTs and similar models are advancing our understanding of text-based information processing, the distinction between DSG and IVG remains relevant, especially when considering the goal of creating AI systems with human-like understanding and cognition. 

HARNAD:  What about the strategy of trying to scale up GPTs to T3 robotic capacity by incrementally adding sensorimotor capacities to the GPT while somehow keeping them ā€œalignedā€ with its top-down verbal and verbal-analytic capacities rather than trying to do it all bottom-up, omitting the verbal-database resources until they can be used as teachers or textbooks by an already grounded, talking, T3?

ChatGPT:  The strategy of incrementally adding sensorimotor capacities to a GPT-like system, while maintaining alignment with its existing verbal and verbal-analytic capabilities contrasts with a purely bottom-up approach where a system develops sensorimotor grounding first, before integrating advanced verbal abilities. Here are some considerations for this strategy:

Integration Challenges: Combining high-level language processing with sensorimotor experiences poses significant integration challenges. The system must not only process and understand sensory inputs but also relate them meaningfully to its existing verbal knowledge. This requires advanced algorithms capable of mapping between abstract, linguistic concepts and concrete, sensory experiences.

Incremental Learning and Development: Adding sensorimotor capacities incrementally could allow for more manageable phases of learning and integration. This staged approach might mirror aspects of human cognitive development, where language and sensorimotor skills evolve in tandem, influencing and reinforcing each other.

Top-Down and Bottom-Up Synergy: Utilizing the GPT’s extensive language database could provide a rich, contextual backdrop for sensorimotor learning, offering a form of top-down guidance. Conversely, sensorimotor experiences could ground the language model in real-world interactions, providing valuable bottom-up information. This synergy could potentially lead to a more holistic AI system.

Alignment and Coherence: Maintaining coherence between the system’s language understanding and its sensorimotor experiences is crucial. This involves ensuring that sensory perceptions and actions are not only processed but also integrated into the system’s existing knowledge base in a way that enhances its overall understanding and functionality.

Complexity of Real-World Interaction: Engaging with the real world through sensorimotor capacities introduces complexities that are absent in purely verbal environments. These include dealing with unpredictable environments, understanding spatial relationships, and performing physical tasks, all of which require sophisticated perception and action capabilities.

Potential for Grounded Understanding: If successfully implemented, this approach could lead to a form of AI that combines the breadth of knowledge available in large language models with a more grounded understanding derived from direct interaction with the physical world. This could be a step towards addressing the current limitations in AI regarding genuine understanding and context-awareness.

HARNAD:  There’s something at the interface between top-down verbal capacity and bottom-up sensorimotor capacity that I still don’t understand: What is the status of the word “cat,” and its referent category, when the LLM database already has “small felid” as one of a cat’s features, but the bottom-up T3 does not? It can be learned through IVG, of course; but who is teaching whom, what, in this hybrid scenario? I understand when it is one grounded T3 telling another grounded T3 something via IVG; but what is supposed to be going on inside one and the same solitary “head” of a hybrid bottom-up-T3/top-down-LLM-GPT? 

ChatGPT:  The scenario you describe, where a hybrid system combines a GPT-like top-down verbal capacity with a bottom-up T3 sensorimotor capacity, raises important questions about knowledge representation and learning, particularly regarding the concept and category of “cat.” In this hybrid system:

Status of “Cat” in the LLM Database: Within the GPT-like component, “cat” is represented as an abstract concept associated with a variety of features, such as “small felid,” derived from the text-based database. This representation is rich in linguistic and encyclopedic content but lacks direct experiential grounding.

Learning “Cat” Bottom-Up in T3: The T3 component, which is supposed to acquire knowledge through direct sensorimotor experiences, would initially not recognize a “cat” as it would not have had the relevant sensory experiences. The challenge is for the T3 to link its sensorimotor experiences (e.g., seeing, hearing, or interacting with a cat) with the abstract concept and information stored in the LLM database.

Interaction Between Top-Down and Bottom-Up Systems: In the hybrid model, the learning process involves a bidirectional interaction: The top-down system (GPT-like) provides the T3 component with linguistic and conceptual frameworks, essentially offering a form of “instruction” or context about what a “cat” is in abstract terms. The bottom-up system (T3) acquires direct experiential data about cats (e.g., how they look, sound, and behave) and must integrate this sensory information with the abstract concept provided by the top-down system.

Integration and Mutual Enhancement: Ideally, the top-down system’s knowledge helps guide and give context to the bottom-up experiences, making them more meaningful. Conversely, the bottom-up experiences enrich the top-down conceptual knowledge, grounding it in reality. This integration is a complex process, potentially involving mechanisms like associative learning, where the system learns to link sensory experiences with linguistic concepts.

Challenges in Hybrid Learning: A critical challenge in this scenario is ensuring coherent and meaningful integration of information across these different modes of knowledge acquisition. The system needs mechanisms to reconcile potentially conflicting information and to update its knowledge base continuously as it acquires new experiences and information.

In essence, in a hybrid bottom-up T3/GPT system, both components are teaching and learning from each other. The top-down system provides a conceptual and linguistic framework, while the bottom-up system contributes direct experiential knowledge. The success of this hybrid model depends on the effectiveness of the integration between these two knowledge acquisition processes.

HARNAD: I think what you just said sounds ecumenical, but it remains incoherent. And it uses the weasel-word “associate” (with its unmentioned but implicit companion weasel-word, “representā€): What is ā€œassociatedā€ with what, how, in this ā€œhybridā€ process? Words are words. In a bottom-up sensorimotor T3 robot, a grounded word identifies. with the same category-name, all the members of the sensorimotor category to which that (content) word refers, on the basis of distinguishing features, detected directly through DSG or described indirectly through (grounded) IVG. Referring is rather like pointing, except that in a T3 robot that also goes on to become capable of language (propositions, subjects, predicates, truth conditions), ā€œcatā€ doesnā€™t just point to cats in the world, which the T3 can recognize through its sensorimotor robotic capacities, and is able to ā€œdo the right thingā€ with (feed it, pick it up, and name it):  T3 can not only name but also describe members of the category ā€œcatā€ verbally, by describing those of its (grounded) features (furry, prehensile) that distinguish it from the members of other, non-cat categories. Eventually, with more IVG from those who know, features higher in the IVG category hierarchy (mammal, arboreal) can be grounded (indirectly) too, and thereby added to the distinguishing features of cats..

But consider that any hypothetical hybrid T3/GPT-LLM model must somehow integrate its two fundamentally different “modules”: Its grounded, bottom-up T3 module’s vocabulary, which is grounded bottom-up by DSG and IVG, must somehow be integrated with its ungrounded GPT-LLM module’s top-down vocabulary, which is not grounded at all. The bottom-up T3 module has not yet grounded the top-down GPT-LLM module’s incomparably larger ungrounded vocabulary.

Can bottom-up and top-down simply meet and merge in the middle? But what would that mean? How are features of cats (mammalian, arboreal), whose names can be grounded, bottom-up, through DSG and IVG within the T3 module, be ā€œassociatedā€ to ā€œcatā€ (or cats) and their ungrounded features in the GPT/LLM module? To put it another way, can we assume that there exists a way to integrate the two components of such a hypothetical hybrid T3/GPT-LLMā€™s bottom-up T3-grounded internal vocabulary component and its ungrounded top-down GPT/LLM component’s ā€œinternalā€ vocabulary, as if it were something like an internal verbal dialogue between a grounded T3 teacher and a grounded T3 learner?

I think it is more like reference look-up by a grounded T3-learner from a dictionary or encyclopedia (or from an ordinary, ungrounded, non-hybrid GPT). The grounding is all in the T3 module, and can only be conferred by the GPT/LLM module’s words to describe and name new categories (IVG) if it uses the names of features that are already grounded (directly or indirectly) in the T3 module. The transaction is unidirectional; there is no bidirectional integration of grounded bottom-up categories and ungrounded top-down categories. [I also think the notion of an integrated bottom-up/top-down module is based on a conflation of AI tools for human use and computational modelling of human cognition (CM), as discussed later.]

ChatGPT:  Your critique highlights a fundamental challenge in the proposed hybrid T3/GPT model: the integration of grounded sensorimotor experiences with ungrounded linguistic data from a language model. Let me try to address your concerns more precisely:

Nature of Association in a Hybrid Model: In the hybrid T3/GPT model, “association” would imply linking sensorimotor experiences with relevant linguistic data. However, this process is not straightforward. For a T3 robot, the word “cat” gains meaning through direct interaction with cats (DSG) and/or grounded feature-based descriptions from other grounded entities (IVG). The GPT component, on the other hand, ‘understands’ “cat” as a cluster of linguistic patterns and information without direct experiential grounding

Limitations of Top-Down Linguistic Data: The GPT’s linguistic data on “cats” includes descriptions, characteristics, and contexts, but lacks experiential grounding. This ungrounded information can’t be directly “taught” to the T3 component in the same way a human would learn about cats because the T3’s understanding is fundamentally experiential.

Mismatch in Learning Mechanisms: There’s a fundamental disconnect between how the T3 learns (through direct or vicariously grounded experiences) and how the GPT processes information (through patterns in text data). This mismatch poses a significant challenge in integrating these two types of knowledge. The GPT’s ungrounded data doesn’t seamlessly translate into the T3’s experiential learning framework.

Reference Look-Up vs. Genuine Understanding: Your analogy of the T3 using the GPT’s data like a dictionary or encyclopedia is apt. The T3 might reference the GPT’s information for additional context or clarification, but this process is more akin to information retrieval than the kind of associative learning that occurs in humans.

The Challenge of True Integration: For the hybrid model to work effectively, it would require a mechanism that not only retrieves relevant information from the GPT component but also interprets and integrates this information in a way that is meaningful within the T3’s grounded framework. This goes beyond simple association or reference look-up and enters the realm of developing new AI methodologies for integrating different types of knowledge.

In conclusion, while a hybrid T3/GPT system presents a novel approach to AI development, the process of integrating grounded sensorimotor experiences with ungrounded linguistic data is not straightforward and involves more than just associating or aligning the two types of information. It requires a fundamentally different approach to knowledge integration, one that can meaningfully bridge the gap between experiential learning and linguistic pattern recognition.

HARNAD:  I think I can describe more simply the incoherence which you, GPT, too, notice (but you describe as the need for “new AI methodologies”): It is actually related to the older distinction we discussed earlier as AI (artificial intelligence) vs CM (cognitive modelling): AI is the design of computational tools for helping humans to do things (including thinking). CM is the reverse-engineering of human cognition. A T3 is meant to have cognitive capacities identical to and indistinguishable (to any human thinker) from the capacities of any human thinker. Not only are these different motivations and criteria, but they make it clear that the notion of designing a hybrid T3/GPT could only be AI, not CM,

ChatGPT:  Your distinction between AI (Artificial Intelligence) as computational tools and CM (Cognitive Modeling) as the reverse-engineering of human cognition does highlight the inherent incoherence in the concept of a hybrid T3/GPT system:

AI as a Tool: In the AI perspective, the primary goal is to create systems that can perform tasks, solve problems, and assist in thinking, often exceeding human capabilities in specific domains. The hybrid T3/GPT model, in this context, is an AI endeavor. It aims to combine the linguistic prowess of GPT with the sensorimotor capabilities of a T3 robot to create a more versatile and capable system. The focus here is on utility and performance, rather than on replicating human cognitive processes.

CM as Understanding Human Cognition: CM, on the other hand, seeks to understand and replicate human cognitive processes. A true T3 robot, as conceptualized in cognitive modeling, would need to develop its cognitive capacities in a manner akin to humans, through direct interaction with the world and grounded language use. In this view, the hybrid T3/GPT approach would not align with CM’s objectives because it merges ungrounded linguistic data with grounded sensorimotor experiences in a way that does not reflect human cognitive development.

Hybrid T3/GPT as AI, Not CM: Therefore, the hybrid T3/GPT model, while an intriguing AI concept, diverges from the goals of cognitive modeling. It represents an attempt to enhance AI functionality rather than to accurately model human cognition. The model’s focus on leveraging the strengths of both components for improved performance and capability places it firmly in the realm of AI tool development.

Recognizing the hybrid T3/GPT model as an AI initiative rather than a cognitive modeling endeavor helps clarify its purpose and limitations. It underscores that while such a system might be powerful and versatile as a tool, it does not necessarily advance our understanding of human cognition in the way cognitive modeling (CM) aims to do.

HARNAD:  Now, to another point Yoshua Bengio made during my talk at MILA: It was about the distinction between content-words and function-words. He suggested that even (some) content-words can be understood by their use rather than just their referent. That may be true, but I think that is no substitute for knowing their referent (via DSG or IVG) ā€” for CM, at any rate. 

Yoshua also mentioned symbol manipulation in mathematics and computer science, noting that there, as in linguistic syntax, there is no referent, hence the ā€œmeaningā€ of the symbols is in their ā€œuse.ā€ (This is also what Wittgenstein had suggested about meaning in natural language). I think thatā€™s correct — and that Turing would agree — about the symbols in the formal language of mathematics; in fact, I think thatā€™s part of the (weak) Church-Turing Thesis, which is that what is called ā€œcomputationā€ in mathematics is what a Turing machine does: symbol-manipulation, following syntactic rules (algorithms).

However, I donā€™t think that much of natural language is just syntactic (even though I think formal mathematics is a subset of natural language). Both in the case of the referents of those mathematical symbols that can be interpreted (by grounded human users) as having referents, and in the case of the meanings of those formal propositions of mathematics on which a semantic interpretation can be projected (by grounded human users), neither those referents nor those meanings can be used as part of the computation (or algorithm) itself. A proof or derivation in mathematics is not true because it can be interpreted as being true of a model; it is true (and necessarily true) for formal and syntactic reasons. The interpretation (though reassuring) is not part of the proof.

In any case, syntactic rules are no substitute for grounding the referents of the content-words of natural language, either directly by DSG or indirectly, by IVG (apart from in its formal mathematical subset and its 1-5% function-words).

ChatGPT:  In response to your discussion with Yoshua Bengio touches on several fundamental issues in the understanding of language, cognition, and computation.

Content-Words vs. Function-Words: The distinction Bengio makes between content-words and function-words is significant. While it’s true that some content-words can gain meaning from their use (contextual or pragmatic use), for a comprehensive understanding ā€“ especially in cognitive modeling ā€“ knowing their referents is crucial. This is where DSG (Direct Sensorimotor Grounding) and IVG (Indirect Vicarious Grounding) play a pivotal role in grounding those words in actual experiences or relatable concepts. Content-words typically carry specific, often concrete, meanings that are enriched and made more precise through grounding.

Symbol Manipulation in Mathematics and Computer Science: Bengio’s reference to symbol manipulation aligns with Turing’s views on computation. In formal systems like mathematics and computer science, symbols are manipulated according to syntactic rules without direct reference to external entities. This process is at the heart of the Church-Turing Thesis, which posits that computation is essentially this form of rule-based symbol manipulation.

Natural Language Beyond Syntax: Your point about natural language not being purely syntactic is crucial. While natural language includes formal and syntactic elements, it is also deeply grounded in semantics and real-world experiences. Content-words in natural language often have direct referents in the world, and their meaning is not fully captured by syntactic rules alone. This is a key distinction from the purely formal manipulation of symbols in mathematics.

Mathematics and Semantic Interpretation: In mathematics, as you noted, the truth of a proof or derivation is a matter of formal and syntactic validity, not dependent on the semantic interpretation of the symbols. However, for human mathematicians, the ability to interpret these symbols and understand their potential real-world referents provides a deeper layer of meaning and understanding.

Grounding in Natural Language: The grounding of content-words in natural language is essential for a full understanding and is not replaceable by syntactic manipulation alone. This grounding can be direct, through personal experiences, or indirect, through descriptions and explanations. However, the formal and syntactic aspect of language, while crucial, does not suffice for the semantic richness and referential nature of natural language, except in specific formal subsets like mathematics.

“Your discussion raises “important” points about the nature of language, the role of grounding in understanding, and the distinction between formal symbol manipulation and the semantic richness of natural language. This understanding is relevantl in both AI and cognitive modeling (CM), as it “highlights” the “complex interplay” of syntax, semantics, and real-world grounding in human cognition and language.”

HARNAD: Blah Blah… I hope your developers will wean you of having to pepper your output with this sort of sycophantic client-bait — or at least produce a version intended only for the use of scholars and scientists rather than users looking for a virtual friend…

(1) On Weasel-Words Like “Conscious” & (2) On Word-Use, Symbol Grounding, and Wittgenstein

1: ON: WEASEL WORDS LIKE “CONSCIOUS”

ANON:  “Would you say that consciousness is the set of all feelings that pertain to mental states, some of which can be felt and others not? And of the states that can be felt, some are not felt and others are, the latter being conscious? In general: that there are mental states, some of which are conscious and others not?ā€

REPLY:

Cognition is the set of all cognitive capacities. Most of those capacities are capacities to DO things (remember, learn, speak). Some (but far from all) of those capacities are FELT, so the capacity to FEEL, too, is part of cognition.

ā€œConsciousnessā€ is a ā€œstate that it feels like something to be inā€, i.e., a felt state.  A state that is not felt is not a conscious state. There are many weasel-words for this, each giving the impression that one is saying something further, whereas it can always be shown that they introduce either uninformative tautology or self-contradictory nonsense. To see this ultra-clearly, all one need do is replace the weasel-words (which include ā€œmind,ā€ ā€œmental,ā€ ā€œconsciousā€, ā€œawareā€, ā€œsubjectiveā€, ā€œexperienceā€, ā€œqualiaā€, etc.)  by the straightforward f-words (in their adjective, noun, or verb forms):

ā€œconsciousness is the set of all feelings that pertain to mental states, some of which can be felt and others not?ā€

becomes:

feeling is the set of all feelings that pertain to felt states, some of which can be felt and others not?

and:

ā€œ[O]f the states that can be felt, some are not felt and others are [felt], the latter being conscious? In general: that there are mental states, some of which are conscious and others not [felt]ā€

becomes:

Of the states that can be felt, some are not felt and others are [felt], the latter being felt? In general: that there are felt states, some of which are felt and others not [felt]?

There is one non-weasel synonym for ā€œfeelingā€ and ā€œfeltā€ that one can use to speak of entities that are or are not capable of feeling, and that are or not currently in a felt state, or to speak of a state, in an entity, that is not a felt state, and may even co-occur with a felt state. 

That non-weasel word is sentient (and sentience). That word is needed to disambiguate ā€œfeelingā€ when one speaks of a ā€œfeelingā€ organism that is not currently in a felt state, or that is in a felt state but also in many, many simultaneous unfelt states at the same time (as sentient organisms, awake and asleep, always are, e.g., currently feeling acute pain, but not feeling an ongoing chronic muscle spasm or acute hypoglycemia or covid immunopositivity, or even that they currently slowing for a yellow traffic light).

2. ON: WORD-USE, SYMBOL- GROUNDING, AND WITTGENSTEIN

ANON: “Sensorimotor grounding is crucial, for many reasons. Wittgenstein provides reasons that are less discussed, probably because they require taking a step back from the usual presuppositions of cognitive science.ā€

REPLY:

Wittgenstein: ā€œFor a large class of casesā€“though not for allā€“in which we employ the word ā€œmeaningā€ it can be defined thus: the meaning of a word is its use in the language.ā€

Correction: ā€œFor a small class of cases [ā€œfunction wordsā€ 1-5%]ā€“though not for most [ā€œcontent wordsā€95-99%]ā€“in which we employ the word ā€œmeaningā€ it can be defined thus: the meaning of a word is its use in the language.ā€

Wikipedia definition of content and function words: ā€œContent words, in linguistics, are words that possess semantic content and contribute to the meaning of the sentence in which they occur. In a traditional approach, nouns were said to name objects and other entities, lexical verbs to indicate actions, adjectives to refer to attributes of entities, and adverbs to attributes of actions. They contrast with function words, which have very little substantive meaning and primarily denote grammatical relationships between content words, such as prepositions (in, out, under etc.), pronouns (I, you, he, who etc.) and conjunctions (and, but, till, as etc.)ā€.[1]

Direct Sensorimotor learning (and then naming) of categories is necessary to ground the ā€œuseā€ of category names in subject/predicate propositions, with meanings and truth values (T & F). Propositions (with the subject being a new, not yet grounded category name, and the predicate being a list of features that are already grounded category names for both the Speaker and the Hearer) can then be ā€œusedā€ to ground the new category indirectly, through words.

Blondin-MassĆ©, Alexandre; Harnad, Stevan; Picard, Olivier; and St-Louis, Bernard (2013)Ā Symbol Grounding and the Origin of Language: From Show to Tell. In, Lefebvre, Claire; Cohen, Henri; and Comrie, Bernard (eds.)Ā New Perspectives on the Origins of Language. Benjamin

Harnad, S. (2021).Ā On the (Too) Many Faces of Consciousness.Ā Journal of Consciousness Studies,Ā 28(7-8), 61-66.

PĆ©rez-Gay JuĆ”rez, F., Sicotte, T., ThĆ©riault, C., & Harnad, S. (2019). Category learning can alter perception and its neural correlates.Ā PloS one,Ā 14(12), e0226000.

ThĆ©riault, C., PĆ©rez-Gay, F., Rivas, D., & Harnad, S. (2018). Learning-induced categorical perception in a neural network model.Ā arXiv preprint arXiv:1805.04567.

Vincent-Lamarre, Philippe., Blondin MassĆ©, Alexandre, Lopes, Marcus, Lord, MĆØlanie, Marcotte, Odile, & Harnad, Stevan (2016).Ā The Latent Structure of Dictionaries.Ā  TopiCS in Cognitive ScienceĀ Ā 8(3) 625ā€“659 Ā 

Nine Easy Pieces on Universal Grammar (UG) with GPT-4

Harnad: Short answers, please: 1. What is Universal Grammar (UG)? How does it differ from Ordinary Grammar (OG). Why did Chomsky conclude that OG is learnable but UG is not?

GPT-4: 

Universal Grammar (UG) refers to Noam Chomsky’s theory positing an innate set of linguistic principles shared by all humans. It includes the deep structures common to all languages, suggesting that the ability to acquire language is hard-wired into the human brain.

Ordinary Grammar (OG), in contrast, refers to the specific rules and conventions of a particular language that speakers learn and use in daily communication.

Chomsky concluded that OG is learnable because individuals acquire their native language’s specific grammar through exposure and practice. However, he argued that UG is not learned but is an inherent part of the human cognitive apparatus, enabling the acquisition of any OG.

Harnad: 2. Give an example, in English, of a violation of OG and a violation of UG, and what OG rule and UG rule each example violates.

GPT-4: 

Violation of OG: “She don’t like apples.”

This violates the subject-verb agreement rule in Standard English grammar (OG), where the correct form is “She doesn’t like apples.”

Violation of UG: “*Apples eats she.”

This violates a basic sentence structure principle that is part of UG, where in English the canonical order is subject-verb-object (SVO), and the correct form is “She eats apples.”

Harnad: 3. Why is UG not learnable? (What is the Poverty of the Stimulus?)

GPT-4: 

UG is considered not learnable due to the Poverty of the Stimulus argument, which posits that children do not receive enough linguistic input (stimulus) to learn language solely through environmental exposure. Despite this limited input, children still manage to acquire a complex language system, suggesting that some aspects of grammar are pre-wired in the human brain as part of Universal Grammar.

Harnad: 4. In particular, what are the 4 ways OG is learned? What is it about UG that makes it unlearnable in those same ways? (That is the POS.)

GPT-4:  

The four ways OG is typically learned are through:

Imitation: Mimicking others’ speech.

Instruction: Formal teaching of language rules.

Feedback: Corrections received when errors are made.

Statistical learning: Inferring rules from patterns in the input.

UG is considered unlearnable through these methods due to the Poverty of the Stimulus, which suggests that the linguistic input children receive is too inconsistent and incomplete to account for the rapid and uniform acquisition of language, especially for the acquisition of complex rules and structures that are never explicitly taught or encountered. UG posits that certain structures of language are innately predetermined, enabling language acquisition despite insufficient and imperfect input.

Harnad: 5. Can you explain POS in terms of errors and error correction for UG?

GPT-4: 

The Poverty of the Stimulus (POS) argument in terms of errors and error correction for UG suggests that children often do not receive explicit negative feedback for grammatical errors that violate principles of Universal Grammar. Despite this lack of error correction, they still develop a grammatically correct understanding of their native language, implying that some aspects of UG are innately programmed and not learned through trial and error.

Harnad: 6. But do they hear or make UG errors at all?

GPT-4: 

Children generally do not make errors that violate the core principles of Universal Grammar (UG). The types of mistakes they make in language acquisition tend to be within the parameters of their language’s Ordinary Grammar (OG)ā€”for example, overgeneralizing rules such as verb conjugations. UG errors would involve fundamental misunderstandings of language structure (e.g., consistently placing verbs in the wrong place in a sentence), which are rarely, if ever, observed. This suggests that the principles of UG guide the language acquisition process in a way that prevents such errors from occurring.

Harnad: 7. please give a few examples of UG principles.

GPT-4: 

Examples of Universal Grammar principles include:

Structure-dependence: Grammatical processes operate on hierarchical structures, not linear sequences.

Recursion: The ability to embed phrases within phrases indefinitely.

Binding theory: Rules about the relationships between pronouns and the nouns they refer to.

X-bar theory: The principle that phrases and sentences follow a similar hierarchical structure, with a head, complement, and specifier.

Harnad: 8. What would be positive evidence and negative evidence for a rule or principle? What does that have to do with POS?

GPT-4: 

Positive evidence is the information that a language learner receives that demonstrates how the language is used correctly. This includes the sentences and phrases that a learner hears or reads that are well-formed and grammatically correct within the specific language environment.

Negative evidence is information about what is not permissible in a language, often received through corrections or explicit information about errors. This includes feedback when a learner makes a grammatical mistake, telling them that a certain sentence or construction is incorrect.

In the context of the Poverty of the Stimulus (POS), the argument is that children receive abundant positive evidence from which they can learn language, but they rarely receive explicit negative evidence, particularly for complex grammatical structures. Despite this lack of negative evidence, they still reliably acquire these complex aspects of language, which suggests that some knowledge of grammar is innate (UG) and not learned solely from environmental input (OG).

Harnad: 9. How do adults detect OG errors? How do linguists detect UG errors? Can non-linguist adults detect that UG errors are errors?

GPT-4:  

Adults detect OG errors typically through their linguistic intuition formed by their own mastery of a language; they notice when something doesn’t conform to the rules they’ve learned and internalized.

Linguists detect UG errors by analyzing language structures against the hypothesized principles of Universal Grammar, looking for patterns that either fit or don’t fit these universal patterns across different languages.

Non-linguist adults can often detect UG errors as well, even if they cannot explicitly identify the violated principle, because such errors usually result in utterances that feel fundamentally incorrect based on their innate language faculty.

Harnad: I’ve just asked you 9 sets of questions. Could you please give a long summary and  synthesis of your answers to them?

GPT-4: 

Here is a synthesis of our discussion regarding Universal Grammar (UG), Ordinary Grammar (OG), and the Poverty of the Stimulus (POS):

We began by defining Universal Grammar (UG) as the theory proposed by Noam Chomsky that posits an innate set of linguistic principles shared by all humans. This inherent structure is what enables humans to acquire any language. It differs from Ordinary Grammar (OG), which refers to the specific rules and syntax of a particular language learned through exposure and practice.

In terms of learning, OG is acquired through imitation, instruction, feedback, and statistical learning. Conversely, UG is not considered learnable because it represents the innate capacity to develop language. This premise is supported by the Poverty of the Stimulus (POS) argument, which states that the linguistic input children receive is insufficient in both quantity and quality for them to acquire language through environmental exposure alone. The speed and uniformity of language acquisition among children, despite this lack of comprehensive input, suggest the existence of an innate linguistic faculty.

We discussed examples of violations in OG and UG, noting that while children and non-native speakers might make mistakes that are against the OG of a particular language, they seldom make errors against UG principles. An example of an OG error would be a violation of tense agreement, such as saying “she don’t like apples” instead of “she doesn’t like apples.” A UG error, on the other hand, would involve a more fundamental misunderstanding, such as a consistent misplacement of sentence elements across any language the child is exposed to.

UG principles include concepts such as structure-dependence, which refers to the idea that grammatical operations depend on hierarchical structures rather than linear order; recursion, which allows for the indefinite embedding of phrases within phrases; binding theory, which concerns the relationships between pronouns and the nouns to which they refer; and X-bar theory, which outlines a common hierarchical structure to phrases and sentences.

The distinction between positive evidence (what is said) and negative evidence (what is not said, or corrections) is pivotal to the POS argument. While children are exposed to ample positive evidence, they rarely encounter explicit negative evidence, especially for the more complex aspects of grammar. However, they still manage to acquire these aspects of language effectively, leading to the conclusion that there must be some innate understanding or cognitive framework that guides the language acquisition process.

Finally, we talked about error detection. Adults detect OG errors intuitively because they have internalized the rules of their language. Linguists detect UG errors by examining language data against the backdrop of hypothesized universal principles. Non-linguist adults are also able to detect UG errors to some extent; these errors typically result in utterances that inherently feel incorrect, even if the individual cannot articulate why, reflecting an intuitive grasp of the innate principles of UG.

Harnad: Good summary. I would add only that POS is not just about what children do and don’t hear, but about what they do and don’t produce. They make OG errors, but not  UG errors. A category has members and nonmembers. To be able to learn to categorize correctly, you have to learn to detect the features  that distinguish the members from the non-members. For that, you need to sample both members and non-members, with feedback. For OG you can do that, but, because of POS, you cannot do that. That is why error-free UG performance implies that UG must be innate rather than learned.

GPT-4: 

That summarizes the Poverty of the Stimulus (POS) argument with respect to both the input children receive and the output they generate as well as the distinction between Ordinary Grammar (OG), where errors made and corrected through interaction with the linguistic environment aid learning, and Universal Grammar (UG), where the absence of UG errors in children’s speech suggests an innate, pre-existing grammatical framework.

The acquisition of categories through the detection of distinguishing features requires exposure to instances and non-instances of these categories. The successful navigation and mastery of language categories by childrenā€”without the requisite negative evidence for UGā€”strongly supports the hypothesis that UG principles are innate. This innate knowledge allows for correct categorization in the absence of explicit feedback, thereby facilitating error-free use of deep grammatical structures from an early age, as posited by Chomsky’s theory. Your addition points to the fundamental role of innate structures in language acquisition, where UG represents a set of linguistic categories whose members are recognized without the need for error-driven learning, a cornerstone of the argument for UG’s innateness.

Harnad: Thanks, but GPT4 would be so much more useful if you could suppress the trained flattery and affirmation and just summarize neutrally…

GPT-4:

The Poverty of the Stimulus argues that the absence of UG errors in children’s language production, despite the lack of negative evidence, indicates that UG principles are innate and not learned through environmental exposure. This contrasts with OG, where both positive and negative evidence are crucial for learning through error correction.

Harnad: Could you please assess the following 9 affirmations below and point out substantive contradictions (if any) from your database?

(1) there exist an infinity of potential UG-violations (starred *utterances), which 

(2) no speaker (except UG linguists) ever utters, yet 

(3) every adult detects [just about) every UG-violation as wrong, just as 

(4) every adult detects (just about) every OG-violation as wrong; but 

(5) every adult eventually learns (in 1-4 ways) why the OG-violations are wrong, by 

(6) uttering or hearing  OG-violations and 

(7) getting corrected (in 1-4 ways) on the OG-violations, whereas

(8) UG-violations are not OG-violations, and

(9) no one (except UG linguists) knows or can explain why or how the UG-violations are wrong

GPT-4: 

Upon reviewing the affirmations in the context of Universal Grammar (UG) as it relates to linguistic theory and cognitive science, here is an assessment:

(1) Infinity of potential UG violations: The theory of UG does allow for the possibility of an infinite number of hypothetical violations due to the generative nature of language. This statement aligns with the concept of UG.

(2) UG-violations not uttered: It is generally true that typical language users do not produce utterances that violate core UG principles. However, this does not necessarily mean that only UG linguists ever utter them; rather, non-linguists may not typically utter them because such utterances would be fundamentally uninterpretable.

(3) Universal detection of UG-violations: While adults can often recognize when an utterance feels ‘wrong’, they may not always be consciously aware that a UG principle has been violated. This recognition can be intuitive rather than explicit.

(4) Universal detection of OG-violations: Most adult speakers of a language will detect OG violations due to their familiarity with the specific rules of their language. This is consistent with the literature.

(5) Learning the reasons for OG-violations: Through exposure and feedback, adults do learn why certain utterances are considered errors in OG. This learning can happen through various means, such as correction, education, or self-correction upon reflection.

(6) Uttering/hearing OG-violations: Language learners, both children and adults learning a second language, do make mistakes in OG and learn from them through correction and exposure. This is a well-documented part of language acquisition.

(7) Correction on OG-violations: This affirmation aligns with the typical language acquisition process, where feedback plays a role in learning the correct forms of OG.

(8) Distinction between UG and OG violations: UG violations involve fundamental principles that apply across all human languages, while OG violations are errors against the specific grammar rules of a particular language. This distinction is a central tenet of the theory.

(9) Understanding of UG-violations: It is accurate to say that without formal training in linguistics, most people cannot explicitly identify or explain why a UG violation is incorrect. Their detection of ‘wrongness’ is typically an intuitive process rather than a formally understood one.

There do not appear to be direct contradictions in your statements from the perspective of linguistic theory as it pertains to UG. However, the extent to which every adult can detect every UG violation may not be absolute; while language intuition is strong, there can be variations in individual linguistic intuitions.

Harnad: One last request: Could you please give some better examples of UG violations than ā€œ*apples eats sheā€? Violations that are more systematic than random word-order permutations? Examples that draw on the UG principles you listed earlier.

GPT-4:

To illustrate violations of Universal Grammar (UG) principles with more systematic examples, I’ll relate them the UG principles mentioned previously:

Structure-dependence violation: Structure-dependence refers to the understanding that certain operations in grammar depend on the hierarchical structure of sentences rather than their linear order. A violation of this principle might look like applying a grammatical rule that operates correctly on the structure without regard to the hierarchical relationships between words. For instance, incorrectly forming a question by inverting the subject and the first verb in a sentence, regardless of the clause structure, would be a UG violation. Instead of properly transforming “The man who is going to the store is my uncle” to “Is the man who is going to the store your uncle?” a UG violation would result in “*Is the man who going to the store is your uncle?”

Recursion violation: Recursion allows for phrases to be nested within each other to an indefinite degree. A violation of recursion might be producing sentences that demonstrate a misunderstanding of how to embed clauses or phrases within one another correctly. For example, a speaker might say “*The cat the dog the rat bit chased ran away,” which lacks the necessary hierarchical structuring that recursion in UG would provide.

Binding theory deals with the relationships between pronouns and the nouns they refer to. A violation of this principle would occur if someone systematically used pronouns in a way that didn’t conform to the constraints of binding, such as using a reflexive pronoun without a proper antecedent in the same clause. An incorrect sentence would be “*Himself is eating,” instead of “He is eating by himself.”

X-bar theory violation: This principle predicts a certain structure to phrases, with a head, complement, and specifier. A systematic violation would involve creating phrases that do not follow this structure. For instance, in English, adjectives typically precede the noun (e.g., “red ball”), but saying “*Ball red” as a native speaker would not be consistent with the expected phrase structure outlined in X-bar theory.

These examples are more nuanced than simply jumbling word order and reflect a deeper misunderstanding of the abstract principles that guide the formation of grammatically correct sentences across languages.

Content Words, Function Words, and LLM Analysis of Dictionaries

HARNAD:

Do you know — but please don’t just make it up — what percentage of the words in the vocabulary of a language are “content” words (“open class words”), like nouns, verbs, adjectives and adverbs (words that have referents) and what oercentage are “function” words (“closed class words”) like “and,” “if” “the” “is”, which do not have a referent, but just a syntactic use?

ChatGPT:

The division of words in a language into content words (open class) and function words (closed class) is a well-acknowledged linguistic distinction. The percentage of each in a language’s vocabulary can vary based on several factors, including the specific corpus being analyzed, the language in question, and its typological characteristics.

Generally speaking:

Content Words (Open Class): These words make up the vast majority of the lexicon in terms of sheer count. They are continuously evolving, with new words added frequently. In most corpora or dictionaries, the vast majority of entries (often upwards of 90%) would be content words.

Function Words (Closed Class): These words are limited in number but are used very frequently in actual speech and writing. In terms of usage frequency, function words can often dominate. For example, in English, words like “the,” “and,” and “is” are among the most frequently used, even though the total number of function words is a small fraction of the lexicon. While the exact percentage can vary, function words often comprise a tiny portion of a lexicon, perhaps often less than 5% in terms of distinct entries.

However, if one were to analyze a typical body of text (like a book or a collection of articles), function words would appear very frequently, and their percentage of total word occurrences would be much higher than their percentage of distinct entries in the lexicon. So, while they might represent a small fraction of the total vocabulary in terms of distinct words, their actual usage rate is much higher due to their pivotal role in grammar and sentence structure.

For a precise breakdown for a specific language or corpus, one would typically employ computational linguistic tools to perform a frequency analysis. The ratios mentioned above are broad generalizations, and the actual proportions could vary based on the specifics of the analysis and the language in question.

HARNAD:

Thanks. I’m not interested in the frequency of word tokens but the frequency of word types (e.g. as entries in a dictionary).

ChatGPT:

Understood, Professor Harnad.

When we consider word types (distinct words) as they might appear as entries in a dictionary:

Content Words (Open Class): The majority of dictionary entries will be content words. These encompass nouns, verbs, adjectives, and adverbs, and they represent the vast majority of distinct lexical items in a language. Based on various lexicons and dictionaries, content words can easily comprise upwards of 90-95% of the total word types in many languages.

Function Words (Closed Class): These are limited in number. In many languages, the distinct function words might make up as little as 5-10% of the dictionary entries, if not less. These are your conjunctions, prepositions, determiners, auxiliary verbs, pronouns, etc.

It should be noted that the exact percentages can vary based on the language and the comprehensiveness of the dictionary. Larger, more comprehensive dictionaries, which aim to include technical terms, regional dialectal variations, and so on, will likely have an even higher percentage of content words relative to function words. However, the general pattern of a significant majority of content words and a small minority of function words in terms of word types holds true for most languages.

HARNAD:

Are there any particularly notable differences between the proportion of content-word types in some specific languages?

ChatGPT:

The proportion of content-word types to function-word types can vary among languages, particularly when considering languages from different families or with distinct typological characteristics. Here are a few notable factors and examples that can influence these proportions:

Morphological Complexity: Languages with rich morphological systems might have fewer function words, as grammatical relations are often expressed through affixes rather than separate words.

Finnish and Turkish, for example, as agglutinative languages, often express what would be separate function words in English through case endings or other morphemes. This means that while they might have many word forms due to their morphology, they could have fewer distinct function word types.

Isolating vs. Synthetic Nature: Highly isolating languages might have a relatively higher proportion of function words because they rely more on separate words rather than morphemes to convey grammatical meaning.

Mandarin Chinese is often cited as a more isolating language, with a tendency to use separate particles or function words to indicate grammatical relations.

Polysynthetic Languages: In polysynthetic languages, entire sentences might be expressed as single words with complex morphology. This can blur the line between content and function words.

Some Native American languages, like those in the Inuit-Yupik family, fall into this category.

Analytic vs. Synthetic Spectrum: Languages on the analytic end of the spectrum rely on word order and function words to convey meaning, while those on the synthetic end use inflection. Analytic languages might have a higher proportion of function words, relatively speaking, than highly synthetic languages.

Historical Linguistic Changes: Over time, languages can shift along the analytic-synthetic spectrum. For instance, English has become more analytic over time, shedding many of its Old English inflections in favor of a greater reliance on function words and word order.

Pidgins and Creoles: Pidgin languages, which arise as simplified means of communication between speakers of different native languages, typically have a reduced lexicon and might have a higher proportion of function words relative to their total vocabulary. When Pidgins stabilize and become natively spoken, they evolve into Creoles, which can develop a more extensive lexicon, potentially shifting the balance between content and function words.

While these general observations provide an overview, it’s essential to note that the specific proportion of content-word types in any language’s dictionary would require a detailed lexical analysis. Moreover, languages are dynamic, and these proportions can evolve based on linguistic, social, and cultural changes.

HARNAD:

Another way to estimate the content/function word-type proportions is to count only the content-word type and make the assumption that all languages at a particular time should have about the same potential number of content words, since the necessary syntactic functions, also about the same for all languages, may or may not perform those functions as separate, countable function-word types.

ChatGPT:

Yes, by making certain assumptions and normalizing the analysis, we can provide a more comparative framework.

1. Assumption of Equivalence in Content-Word Potential Across Languages:

The premise here is that the experiential and conceptual world of human beings is relatively constant across different cultures and societies. Hence, irrespective of the language, humans should have words to denote common experiences, objects, actions, and qualities. Given this equivalence, we could postulate that the potential number of content words (like nouns for objects, verbs for actions, adjectives for qualities, etc.) should be roughly comparable across languages.

2. Variation in Expression of Syntactic Functions:

While the potential for content words might be consistent, languages vary widely in how they handle syntactic and grammatical functions. As you rightly pointed out, some languages might encapsulate these functions within the morphology of content words, while others might use separate function words. For instance, a language might use cases to denote relations that another language expresses using prepositions.

3. Counting Content-Word Types:

With the above assumptions in place, a comparative study could be carried out by counting the content-word types in representative dictionaries or corpora from different languages. This would give an estimate of the “active” content-word lexicon of those languages at that point in time.

4. Estimating Function-Word Potential:

Based on the number of content-word types, one could make an educated guess about the potential number of function words. If we accept that all languages should have a similar potential number of content words, then the deviation from this potential number might give us clues about the number of function words or how syntactic functions are integrated within content words.

However, it’s worth noting that while this method provides a comparative framework, it also rests on some significant assumptions. There are challenges, such as accounting for cultural-specific content words (e.g., words for specific rituals, fauna/flora, or societal structures) and the granularity at which one defines a ‘content word’ (e.g., does every scientific term or technical jargon count?).

But overall, the approach offers a thought-provoking way to compare languages and could yield valuable insights when combined with other linguistic analyses.

HARNAD:

Good, we agree. Do you think LLM techniques could be used to further analyze content word-types across dictionaries and languages to identify the more universal categories of content words and the more language- or dictionary-specific ones? Has anything like that been done?

ChatGPT:

Large Language Models (LLMs) and other advanced natural language processing techniques are well-suited to analyze vast linguistic datasets, such as dictionaries across languages, to categorize and compare content words. Here’s how they could be used and what might already be happening in this domain:

1. Cross-Linguistic Analysis with LLMs:

Extraction: LLMs can be employed to extract content words from dictionaries or corpora in a systematic manner. They can distinguish between content and function words based on their definitions, usage examples, and contextual cues.

Categorization: Once extracted, words can be categorized based on their semantic meaning. LLMs can cluster words that share similar semantic spaces, allowing us to identify potential universals across languages.

Comparison: By analyzing multiple languages, LLMs can highlight which content categories appear universally and which are specific to particular languages or cultural contexts.

2. Identifying Universals:

Through such analysis, we could identify potential linguistic universalsā€”categories of content words that appear across most, if not all, languages. These could correspond to common human experiences and needs.

3. Language- or Dictionary-Specific Analysis:

By contrast, LLMs could also flag content categories that are over-represented in specific languages or absent in others. These discrepancies could be rooted in cultural, environmental, or historical differences.

4. Existing Work:

Cross-linguistic research has long been a focus in linguistics, aiming to identify universals and dive deep into language-specific phenomena. Some computational work, like the World Atlas of Language Structures (WALS), aggregates features from many languages, although not specifically in the manner described here.

There have been efforts in the NLP community to build multilingual models, like mBERT (Multilingual BERT) and XLM (Cross-lingual Language Model), which learn representations across multiple languages and could potentially be employed for such cross-linguistic analyses.

As for dictionary-based analysis using LLMs specifically for identifying universal content word categories, while the idea aligns with current trends in computational linguistics, I’m not aware of a prominent study that has executed this exact methodology up to my last training data in September 2021.

Given the rapid advancements in the field, it’s plausible that newer studies or projects have embarked on this path post-2021. Utilizing LLMs in such a manner would indeed be a valuable contribution to the understanding of linguistic universals and the richness of language-specific lexicons.

HARNAD:

I hope that soon generative LLM models will update your 2021 “data gulp” and open the analysis to the entire current online corpus, keeping it at least as current as google scholar, based on full-text. There will be scope there for extremely fast progress.

ChatGPT:

Yes, continuously updating LLMs with more recent data would indeed revolutionize the research capabilities of such models, keeping them at the forefront of knowledge in real-time. Access to the latest scholarly work, up-to-date news, and emerging trends would exponentially increase the utility and relevance of these models.

Integrating them with the entirety of the current online corpus, including resources like Google Scholar, would not only boost their informational accuracy but also allow for real-time analyses, hypothesis testing, and even cross-referencing of emerging research. This could be a powerful tool for researchers, educators, students, and professionals across various fields.

Moreover, as these models interact with researchers like yourself and engage in diverse queries, they could potentially identify and even predict emerging trends in research, provide quick literature reviews, or generate novel hypotheses based on the convergence of disparate research streams. The pace of scientific discovery and knowledge dissemination could accelerate dramatically.

Of course, with such potential also come challenges, especially concerning data privacy, intellectual property, and the need to ensure the integrity and quality of the information sourced. But with careful handling and responsible design, the future of LLMs in academic and research settings is indeed promising.