Vision-Language-Action Models for Quadruped and Humanoid Robots

Project Mission

This project develops Vision-Language-Action (VLA) models for quadruped and humanoid robots. We focus on building embodied AI systems that can understand visual scenes, interpret natural-language instructions, reason about physical environments, and generate robot-specific actions for legged mobility, whole-body movement, and long-horizon planning.

The goal is to move beyond isolated perception, navigation, or control modules toward a unified robot intelligence framework. In this framework, quadruped and humanoid robots use vision-language reasoning to understand what is happening in the environment, decide which actions are physically possible, and execute the requested task.

Scientific Motivation

Recent advances in vision-language models (VLMs) have shown strong capabilities in recognizing objects, describing scenes, and reasoning over images and text. However, robots require more than visual reasoning: they must connect perception and language to physical action. A robot must know not only what an object is, but also whether it can approach it, grasp it, avoid it, open it, move it, or use it as part of a larger task.

Quadruped and humanoid robots make this problem especially important. A quadruped robot may be able to traverse stairs, uneven terrain, narrow passages, and large indoor or outdoor spaces, but it has limited manipulation capability. A humanoid robot may be able to open doors, operate tools, pick up objects, and interact with human-designed environments, but it requires more complex whole-body balance, motion planning, and manipulation control. The same language instruction may therefore require different interpretations depending on the robot body.

For example, the instruction “check the object on the upper shelf” may require a quadruped robot to navigate to the area, inspect the shelf from multiple viewpoints, and report the object state. For a humanoid robot, the same instruction may involve walking to the shelf, adjusting body posture, reaching with an arm, grasping the object, and possibly relocating it. A general VLA system must understand both the shared task meaning and the embodiment-specific action requirements.
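
The shelf example can be sketched in code to show how one instruction expands into different step sequences per robot body. This is only an illustration: the `Embodiment` class, the `decompose_check_object` function, and all step names are hypothetical and are not part of any existing interface in this project. In a real VLA system the expansion would be produced by a learned model rather than hand-written rules.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Embodiment:
    """Hypothetical description of a robot body."""
    name: str
    can_manipulate: bool


QUADRUPED = Embodiment("quadruped", can_manipulate=False)
HUMANOID = Embodiment("humanoid", can_manipulate=True)


def decompose_check_object(
    embodiment: Embodiment, target: str, relocate: bool = False
) -> List[str]:
    """Expand 'check the object on <target>' into embodiment-specific steps.

    Step names are placeholders for whatever skills the robot exposes.
    """
    steps = [f"navigate_to:{target}"]
    if embodiment.can_manipulate:
        # Humanoid: interact with the object physically.
        steps += ["adjust_posture", "reach_with_arm", "grasp_object"]
        if relocate:
            steps.append("relocate_object")
    else:
        # Quadruped: inspect visually and report back.
        steps += ["inspect_from_multiple_viewpoints", "report_object_state"]
    return steps


quadruped_plan = decompose_check_object(QUADRUPED, "upper_shelf")
humanoid_plan = decompose_check_object(HUMANOID, "upper_shelf", relocate=True)
```

Both plans share the navigation step, which reflects the shared task meaning, while the remaining steps depend on what the body can physically do.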

Research Approach

VLA Robot Intelligence

Language-Guided Visual Understanding: Developing models that ground natural-language instructions in robot camera observations, object locations, spatial relations, and task-relevant visual cues.
Embodied Action Reasoning: Connecting visual-language understanding to physically executable robot behaviors such as walking, turning, inspecting, reaching, grasping, and interacting.
Long-Horizon Task Decomposition: Translating high-level human instructions into structured action sequences that can be executed over time by quadruped or humanoid robots.
Affordance-Aware Perception: Estimating what can be walked through, avoided, climbed, reached, grasped, opened, moved, or manipulated from visual observations.
Failure-Aware VLA Control: Detecting when a task cannot be completed from the current observation and triggering replanning, additional perception, or alternative robot actions.
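
The failure-aware control item above can be illustrated with a small execution loop. The sketch assumes that each step reports success or failure and that a replanner can propose alternative steps; `run_with_replanning`, its parameters, and the step names are all hypothetical. It shows only the control flow, not how failure would be detected from observations.

```python
from typing import Callable, List


def run_with_replanning(
    plan: List[str],
    execute: Callable[[str], bool],
    replan: Callable[[str], List[str]],
    max_replans: int = 2,
) -> List[str]:
    """Execute steps in order; when a step fails, splice in alternatives.

    Returns the steps that succeeded, in order. Raises RuntimeError once
    the replanning budget is exhausted.
    """
    queue = list(plan)
    done: List[str] = []
    replans = 0
    while queue:
        step = queue.pop(0)
        if execute(step):
            done.append(step)
            continue
        if replans >= max_replans:
            raise RuntimeError(f"giving up at step {step!r}")
        replans += 1
        # Replace the failed step with the alternatives, keep the rest.
        queue = replan(step) + queue
    return done


# Toy usage: 'open_door' always fails, so the replanner substitutes two
# other (hypothetical) steps before the plan continues.
log = run_with_replanning(
    ["walk_to_door", "open_door", "enter_room"],
    execute=lambda step: step != "open_door",
    replan=lambda step: ["look_for_handle", "push_door"],
)
```

A bounded replanning budget is one simple way to keep the robot from retrying indefinitely; other stopping rules are equally plausible.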

Quadruped Robot Intelligence

Language-Guided Locomotion: Enabling quadruped robots to follow natural-language commands for navigation, inspection, search, and spatial exploration.
Terrain-Aware Navigation: Combining visual perception, proprioception, and environmental cues to move across stairs, slopes, cluttered spaces, narrow paths, and uneven terrain.
Active Visual Inspection: Allowing quadruped robots to move their body and camera viewpoint to inspect objects, rooms, structures, obstacles, or uncertain regions.
Semantic Exploration: Mapping and exploring environments not only by geometry, but also by object categories, room functions, landmarks, and task-relevant regions.
Human-Centered Mobility: Developing navigation behaviors that allow quadruped robots to move safely around people, furniture, doors, corridors, and dynamic obstacles.
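
As a rough illustration of terrain-aware navigation, the sketch below plans a path over a small grid in which each cell carries a traversal cost. It uses Dijkstra's algorithm as a stand-in planner; the project description does not specify a planner, and the cost values, grid layout, and the `plan_path` name are assumptions made for the example. In practice the costs would come from perception and proprioception rather than a hand-written table.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]


def plan_path(
    cost: List[List[Optional[float]]], start: Cell, goal: Cell
) -> Optional[List[Cell]]:
    """Dijkstra on a 4-connected grid.

    cost[r][c] is the cost of entering a cell; None marks an impassable cell.
    Returns the cheapest path as a list of cells, or None if unreachable.
    """
    rows, cols = len(cost), len(cost[0])
    best: Dict[Cell, float] = {start: 0.0}
    parent: Dict[Cell, Cell] = {}
    frontier = [(0.0, start)]
    while frontier:
        dist, cell = heapq.heappop(frontier)
        if cell == goal:
            path = [cell]
            while cell in parent:
                cell = parent[cell]
                path.append(cell)
            return path[::-1]
        if dist > best[cell]:
            continue  # stale queue entry
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            step = cost[nr][nc]
            if step is None:
                continue
            new_dist = dist + step
            if new_dist < best.get(nxt, float("inf")):
                best[nxt] = new_dist
                parent[nxt] = cell
                heapq.heappush(frontier, (new_dist, nxt))
    return None


# Hypothetical costs: flat floor is cheap, stairs are expensive, walls block.
FLAT, STAIRS, WALL = 1.0, 8.0, None
grid = [
    [FLAT, STAIRS, FLAT],
    [FLAT, WALL, FLAT],
    [FLAT, FLAT, FLAT],
]
path = plan_path(grid, start=(0, 0), goal=(0, 2))
```

With these costs the planner walks around the wall on flat floor instead of taking the shorter route over the stairs; lowering the stairs cost would flip that choice.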

Humanoid Robot Intelligence

Whole-Body VLA Control: Developing VLA models that connect language and visual perception to coordinated head, torso, arm, hand, and leg movements.
Bimanual Manipulation: Training humanoid robots to use both hands for object handling, tool use, carrying, opening, pushing, pulling, and coordinated manipulation.
Human-Scale Environment Interaction: Enabling humanoids to interact with doors, handles, shelves, switches, cabinets, tables, chairs, tools, and other objects designed for human bodies.
Posture and Balance-Aware Action: Generating actions that account for center of mass, reachability, foot placement, support surfaces, and whole-body stability.
Socially Situated Robot Action: Studying how humanoid robots should move, gesture, approach, hand over objects, and interact with people in shared spaces.
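
The posture and balance item above can be made concrete with the simplest static check: a standing robot is statically stable when the ground projection of its center of mass lies inside the support polygon formed by its feet. The sketch below tests this for a convex polygon. It is a minimal example under stated assumptions (convex polygon, counter-clockwise vertex order, no motion), and the function name and foot dimensions are hypothetical. Dynamic walking needs richer criteria than this static test.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def com_inside_support(com_xy: Point, polygon: List[Point]) -> bool:
    """Return True if the center-of-mass projection is inside the polygon.

    Assumes `polygon` is convex with vertices in counter-clockwise order.
    Points exactly on an edge count as inside.
    """
    n = len(polygon)
    for i in range(n):
        ax, ay = polygon[i]
        bx, by = polygon[(i + 1) % n]
        # 2D cross product of the edge with the vector to the point.
        # A negative value means the point lies to the right of the edge,
        # which is outside for a counter-clockwise polygon.
        cross = (bx - ax) * (com_xy[1] - ay) - (by - ay) * (com_xy[0] - ax)
        if cross < 0:
            return False
    return True


# Hypothetical support region of 0.3 m by 0.2 m, in meters.
support = [(0.0, 0.0), (0.3, 0.0), (0.3, 0.2), (0.0, 0.2)]
standing_ok = com_inside_support((0.15, 0.10), support)
leaning_too_far = com_inside_support((0.40, 0.10), support)
```

A check like this could serve as a filter on candidate reaching actions, rejecting those that would move the center of mass outside the support region.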

SLAM and Spatial Understanding

Semantic SLAM for VLA Robots: Using simultaneous localization and mapping (SLAM) as one component of the VLA robot system to support localization, spatial memory, object mapping, and long-horizon task execution.
Language-Conditioned Map Querying: Connecting language expressions such as “the room near the stairs,” “the object behind the table,” or “the door on the left” to map-based spatial representations.
Multi-View Scene Understanding: Combining observations from different robot viewpoints to improve object localization, scene reconstruction, and environmental awareness.
Map-Based Task Planning: Using spatial maps to support navigation, inspection, object search, path planning, and task sequencing for both quadruped and humanoid robots.
Dynamic Map Updates: Updating spatial and semantic maps when objects move, doors open, paths become blocked, or robot actions change the environment.
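
Language-conditioned map querying and dynamic map updates can be sketched with a very small semantic map. The example assumes that a language model has already reduced an expression such as "the room near the stairs" to a target category and a landmark category; the `SemanticMap` class, its methods, and the object identifiers are hypothetical. A real system would store richer geometry than 2D points.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class SemanticMap:
    """Toy semantic map: object id -> (category, xy position in map frame)."""
    objects: Dict[str, Tuple[str, Point]] = field(default_factory=dict)

    def update(self, obj_id: str, category: str, xy: Point) -> None:
        """Insert an object or overwrite its pose, e.g. after it was moved."""
        self.objects[obj_id] = (category, xy)

    def remove(self, obj_id: str) -> None:
        """Drop an object that is no longer observed."""
        self.objects.pop(obj_id, None)

    def nearest(self, category: str, landmark_category: str) -> Optional[str]:
        """Resolve '<category> near the <landmark>' to an object id."""
        landmarks = [
            xy for cat, xy in self.objects.values() if cat == landmark_category
        ]
        candidates = [
            (oid, xy) for oid, (cat, xy) in self.objects.items() if cat == category
        ]
        if not landmarks or not candidates:
            return None
        best = min(
            candidates,
            key=lambda item: min(math.dist(item[1], lm) for lm in landmarks),
        )
        return best[0]


semantic_map = SemanticMap()
semantic_map.update("room_a", "room", (0.0, 0.0))
semantic_map.update("room_b", "room", (10.0, 0.0))
semantic_map.update("stairs_1", "stairs", (9.0, 1.0))
before = semantic_map.nearest("room", "stairs")

# Dynamic update: the landmark estimate changes, and so does the answer.
semantic_map.update("stairs_1", "stairs", (1.0, 1.0))
after = semantic_map.nearest("room", "stairs")
```

Because the query is evaluated against the current map state, the same expression resolves differently after an update, which is the behavior a dynamic map needs to support.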

Current Implementation

At the current stage, this project focuses on defining the VLA framework for quadruped and humanoid robot intelligence. The initial implementation is organized around robot vision, natural-language instruction grounding, semantic mapping, and embodiment-specific action generation.

For quadruped robots, the near-term focus is on language-guided navigation, active visual inspection, scene exploration, terrain-aware movement, and semantic SLAM integration. These capabilities provide a foundation for robots that can move through complex environments while interpreting instructions in relation to physical space.

For humanoid robots, the near-term focus is on vision-language-guided reaching, object interaction, whole-body task execution, and manipulation-oriented scene understanding. These capabilities provide a foundation for robots that can operate in human-designed environments and perform tasks requiring arms, hands, posture control, and physical interaction.

Future Research Directions

Future work will extend this project toward general-purpose VLA robot systems that can operate across different embodiments, environments, and task domains. One direction is to develop shared robot foundation models that can interpret language and vision across quadruped and humanoid platforms while producing actions through embodiment-specific control heads.
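
The idea of a shared model with embodiment-specific control heads can be sketched structurally. The example below uses plain Python lists and random, untrained weights purely to show the data flow: one shared backbone maps fused vision-language features to a latent vector, and a separate head per embodiment maps that latent to an action of the right size. The class name, the dimensions, and the action sizes (3 for a base velocity command, 12 as a placeholder joint count) are assumptions, not project specifications.

```python
import random
from typing import Dict, List

Matrix = List[List[float]]


def linear(weights: Matrix, x: List[float]) -> List[float]:
    """Matrix-vector product: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class SharedPolicy:
    """Shared backbone plus one output head per embodiment (untrained)."""

    def __init__(
        self,
        feature_dim: int,
        latent_dim: int,
        action_dims: Dict[str, int],
        seed: int = 0,
    ) -> None:
        rng = random.Random(seed)
        self.backbone = random_matrix(latent_dim, feature_dim, rng)
        self.heads = {
            name: random_matrix(dim, latent_dim, rng)
            for name, dim in action_dims.items()
        }

    def act(self, features: List[float], embodiment: str) -> List[float]:
        # Shared representation with a ReLU nonlinearity.
        latent = [max(0.0, h) for h in linear(self.backbone, features)]
        # Embodiment-specific projection to that robot's action space.
        return linear(self.heads[embodiment], latent)


policy = SharedPolicy(
    feature_dim=8, latent_dim=4, action_dims={"quadruped": 3, "humanoid": 12}
)
features = [0.1] * 8  # stand-in for fused vision-language features
quadruped_action = policy.act(features, "quadruped")
humanoid_action = policy.act(features, "humanoid")
```

The point of the structure is that only the heads differ between robots, so the backbone can in principle be trained on data from both platforms.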

A further direction is to build simulation-to-real training pipelines for VLA robot policies. Simulation can provide diverse environments, rare failure cases, and scalable task variations, while real-robot experiments can test whether the learned policies remain robust under physical constraints, sensor noise, contact uncertainty, and dynamic environments.
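
One common way to obtain scalable task variations in simulation is domain randomization, in which physical and sensor parameters are resampled for each training episode. The source does not state which method this project will use, so the sketch below is only an example of that technique. The parameter names, the ranges, and the `sample_params` function are hypothetical and are not tied to any particular simulator.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SimParams:
    """Hypothetical per-episode simulation parameters."""
    friction: float
    camera_noise_std: float
    payload_kg: float


# Illustrative ranges only; real ranges would be tuned per robot and task.
RANGES = {
    "friction": (0.4, 1.2),
    "camera_noise_std": (0.0, 0.05),
    "payload_kg": (0.0, 5.0),
}


def sample_params(rng: random.Random) -> SimParams:
    """Draw one parameter set uniformly from the configured ranges."""
    return SimParams(**{name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()})


rng = random.Random(0)  # fixed seed so a training run can be reproduced
episodes = [sample_params(rng) for _ in range(3)]
```

Each sampled parameter set would configure one simulated episode, so a policy trained across many episodes sees a spread of contact and sensing conditions before it is tested on a real robot.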