Can Toy Robotic Capacities Make Top-Down Meet Bottom-Up?

Re: Figure Status Update – OpenAI Speech-to-Speech Reasoning

SH:

Is this demo sensorimotor grounding? No, It’s a toy robot with (1) some toy-world visual recognition and motor manipulation skills, plus (2) (perhaps non-toy) text-to-speech and speech-to-text capacity, plus (3) ChatGPT’s remarkable and as-yet unexplained (non-toy) interactive verbal skills, including (4) its (non-toy) encyclopedic verbal database and navigation/interaction capacity.

But it’s still ungrounded.

If/when it can do the kind of thing it does in the video with anything it can talk about, and not just an infomercial demo, then, and only then, will it have an even more remarkable, and as yet unexplained, (non-toy) grounded T3 robotic capacity.

Two-year-olds are grounding their words via the only way upward: bottom-up, through (unsupervised and supervised) learning of sensorimotor categories, by detecting their distinguishing sensorimotor features directly, and then naming the grounded categories (by describing their features, which are likewise learnable, nameable categories).

Then, because the 2yr-old also has the capacity for language (which means for producing and understanding subject-predicate propositions with truth-values, composed out of category names defined or described by referents of their (grounded) feature-category names), verbal instruction (LLM-style) can kick in and even take over.

That’s bottom-up grounding, and it applies to children as much as to T3 robots.

But that OpenAI infomercial demo was just a toy. More important, though, is that it is based on an incoherent pipe-dream, which is that an LLM’s words can be grounded top-down with the help of more and more bottom-up toy robotic learning.

But the only way to get to the top is with bottom-up T3 robotics. So you might as well forget about the LLM and its word-power for that project. Just let your 2-year-old child or T3 robotlet use ChatGPT as tutor, the way kids use their parents. But to understand what the LLM (or their nanny) is telling them, they have to learn the referents of all the content-words in the verbal explanation. And there’s only one way to do that: bottom-up, first by direct sensorimotor grounding (DSG) and then – once the child or Ts robot has grounded the names — and hence the referents — of enough categories bottom-up (including enough feature-category names), learners can carry on grounding bottom-up through indirect verbal grounding (IVG), as long as their tutors describe the new (ungrounded) categories with propositions made up of already-grounded category names – already grounded in the heads of the learners.

So the bottom line is that the referents of the words in the teacher‘s IVG definition or description or explanation have to already be grounded in the head of the learner (whether via DSG or IVG).

I leave it to you, GPT, as an exercise, to explain why that cannot be done top-down with an LLM-powered chat-robot, where it’s hanging from ungrounded skyhooks

ChatGPT4 (after a bit of browbeating by me):

An important question in the development of AI and robotics concerns the most efficient and effective path to achieve a genuinely grounded, intelligent system. Is it more practical or feasible to enhance a primarily linguistic, top-down T2 language model (LLM) to achieve the sensorimotor and cognitive capabilities of a T3 robot, or would it be better to develop T3 robotic capabilities bottom-up independently and then supplement these with the LLM’s abilities externally, as a form of advanced, verbal tutoring.

Given the distinctions between T2 and T3 systems, a more grounded and arguably more natural progression of AI development, closely mirroring human cognitive development might be more promising. Humans first develop sensorimotor understanding and category learning capabilities (direct T3 capacities) before language acquisition goes on to significantly enhance and expand their cognitive and communicative capacities. This developmental trajectory suggests that grounding in the physical world through direct sensorimotor experience forms a necessary foundation for the effective use of language for further learning and communication.

Thus, attempting to “train” a T2 LLM chatbot to become a T3 Turing-indistinguishable robot might not only be inherently inefficient but could overlook the essential aspect of grounding that comes from direct interaction with the physical world. Incorporating LLMs as an auxiliary verbal tutor for a T3 entity (human or robot), once it has developed a sufficiently grounded vocabulary about the world, may be a more logical and effective approach, leveraging the strengths of LLMs in processing and generating language to enhance the T3’s cognitive and communicative capabilities. It aligns with how language and learning occur in humans, where foundational knowledge and understanding from direct grounding are subsequently enriched and expanded through language-based learning — indirect but grounded.

This approach would capitalize on the LLM’s capacity for generating coherent, contextually appropriate language but would also ensure that the use of this language was grounded in a pre-existing vocabulary and propositions. It suggests a complementary integration of auxiliary T2 LLM chatbotic and autonomous T3 robotic systems, where each component contributes its strengths to the development of a more versatile and genuinely intelligent agent.

The notion that a top-down, language-centric system (T2) can achieve genuine grounding by incrementally incorporating “toy” sensorimotor capabilities may reflect a fundamental misconception about the nature of grounding and the role of sensorimotor experiences in achieving it.

Grounding isn’t merely a matter of bridging linguistic competence with basic physical interaction. True grounding —as observed in humans and necessary for a hypothetical T3 robot — entails a deep, intrinsic connection between sensorimotor experiences and the symbols (words) that represent them. This requires not just the recognition or manipulation of objects based on learned descriptors but direct understanding of those objects and actions in the world to which they refer.

The error in imagining that a T2 system could “graduate” to T3 status by being endowed with limited sensorimotor abilities lies in underestimating the complexity of human sensorimotor and cognitive integration. Human cognition is not simply the sum of its parts; it’s the product of a highly integrated and complex system that develops bottom-up through continuous, rich interaction with the environment. Language development in humans, deeply intertwined with sensorimotor experiences, cannot be replicated in AI by superficially adding physical capabilities to a fundamentally linguistic system.

The strengths of a T2 system like GPT could be better utilized as an auxiliary educational tool for a T3 robot that is learning through its interactions with the world. In this scenario, GPT’s extensive knowledge base and linguistic capabilities can enhance the learning process by providing information, explanations, and clarifications in a way that complements the robot’s direct experiences. This approach would leverage the best of both worlds: the depth and breadth of GPT’s top-down linguistic knowledge and the direct grounding that comes from the T3’s bottom-up sensorimotor interactions with its environment.