<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>SLAM | Research Lab</title><link>https://sd-lab-page.github.io/tags/slam/</link><atom:link href="https://sd-lab-page.github.io/tags/slam/index.xml" rel="self" type="application/rss+xml"/><description>SLAM</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Sat, 30 May 2026 00:00:00 +0000</lastBuildDate><image><url>https://sd-lab-page.github.io/media/icon_hu_77cf8b59efcb710e.png</url><title>SLAM</title><link>https://sd-lab-page.github.io/tags/slam/</link></image><item><title>Vision-Language-Action Models for Quadruped and Humanoid Robots</title><link>https://sd-lab-page.github.io/projects/vla/</link><pubDate>Sat, 30 May 2026 00:00:00 +0000</pubDate><guid>https://sd-lab-page.github.io/projects/vla/</guid><description>&lt;h2 id="project-mission"&gt;Project Mission&lt;/h2&gt;
&lt;p&gt;This project develops Vision-Language-Action models for quadruped and humanoid robots. We focus on building embodied AI systems that can understand visual scenes, interpret natural-language instructions, reason about physical environments, and generate robot-specific actions for legged mobility, whole-body movement, and long-horizon planning.&lt;/p&gt;
&lt;p&gt;The goal is to move beyond isolated perception, navigation, or control modules toward a unified robot intelligence framework. In this framework, quadruped and humanoid robots use vision-language reasoning to understand what is happening in the environment, decide which actions are physically possible, and carry out the instructed tasks.&lt;/p&gt;
&lt;!-- Quadruped robots provide robust mobility, terrain traversal, active inspection, and spatial exploration. Humanoid robots provide human-scale interaction, bimanual manipulation, tool use, and whole-body task execution in environments originally designed for people. This project studies how Vision-Language-Action models can support both robot types while respecting their different embodiments, sensors, action spaces, and physical constraints. --&gt;
&lt;h2 id="scientific-motivation"&gt;Scientific Motivation&lt;/h2&gt;
&lt;p&gt;Recent advances in vision-language models (VLMs) have shown strong capabilities in recognizing objects, describing scenes, and reasoning over images and text. However, robots require more than visual reasoning: they must connect perception and language to physical action. A robot must know not only what an object is, but also whether it can approach it, grasp it, avoid it, open it, move it, or use it as part of a larger task.&lt;/p&gt;
&lt;p&gt;Quadruped and humanoid robots make this problem especially pronounced. A quadruped robot may be able to traverse stairs, uneven terrain, narrow passages, and large indoor or outdoor spaces, but it has limited manipulation capability. A humanoid robot may be able to open doors, operate tools, pick up objects, and interact with human-designed environments, but it requires more complex whole-body balance, motion planning, and manipulation control. The same language instruction may therefore require different interpretations depending on the robot body.&lt;/p&gt;
&lt;p&gt;For example, the instruction &amp;ldquo;check the object on the upper shelf&amp;rdquo; may require a quadruped robot to navigate to the area, inspect the shelf from multiple viewpoints, and report the object state. For a humanoid robot, the same instruction may involve walking to the shelf, adjusting body posture, reaching with an arm, grasping the object, and possibly relocating it. A general VLA system must understand both the shared task meaning and the embodiment-specific action requirements.&lt;/p&gt;
&lt;!-- This project treats VLA as a bridge between robot vision and embodied control. The central research problem is not only how to generate robot actions from images and language, but how to make those actions appropriate for different robot bodies, different physical environments, and different levels of task complexity. --&gt;
&lt;h2 id="research-approach"&gt;Research Approach&lt;/h2&gt;
&lt;h3 id="vla-robot-intelligence"&gt;VLA Robot Intelligence&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Language-Guided Visual Understanding&lt;/strong&gt;: Developing models that ground natural-language instructions in robot camera observations, object locations, spatial relations, and task-relevant visual cues.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Embodied Action Reasoning&lt;/strong&gt;: Connecting visual-language understanding to physically executable robot behaviors such as walking, turning, inspecting, reaching, grasping, and interacting.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Long-Horizon Task Decomposition&lt;/strong&gt;: Translating high-level human instructions into structured action sequences that can be executed over time by quadruped or humanoid robots.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Affordance-Aware Perception&lt;/strong&gt;: Estimating what can be walked through, avoided, climbed, reached, grasped, opened, moved, or manipulated from visual observations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Failure-Aware VLA Control&lt;/strong&gt;: Detecting when a task cannot be completed from the current observation and triggering replanning, additional perception, or alternative robot actions.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="quadruped-robot-intelligence"&gt;Quadruped Robot Intelligence&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Language-Guided Locomotion&lt;/strong&gt;: Enabling quadruped robots to follow natural-language commands for navigation, inspection, search, and spatial exploration.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Terrain-Aware Navigation&lt;/strong&gt;: Combining visual perception, proprioception, and environmental cues to move across stairs, slopes, cluttered spaces, narrow paths, and uneven terrain.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Active Visual Inspection&lt;/strong&gt;: Allowing quadruped robots to move their body and camera viewpoint to inspect objects, rooms, structures, obstacles, or uncertain regions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Semantic Exploration&lt;/strong&gt;: Mapping and exploring environments not only by geometry, but also by object categories, room functions, landmarks, and task-relevant regions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Human-Centered Mobility&lt;/strong&gt;: Developing navigation behaviors that allow quadruped robots to move safely around people, furniture, doors, corridors, and dynamic obstacles.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="humanoid-robot-intelligence"&gt;Humanoid Robot Intelligence&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Whole-Body VLA Control&lt;/strong&gt;: Developing VLA models that connect language and visual perception to coordinated head, torso, arm, hand, and leg movements.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bimanual Manipulation&lt;/strong&gt;: Training humanoid robots to use both hands for object handling, tool use, carrying, opening, pushing, pulling, and coordinated manipulation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Human-Scale Environment Interaction&lt;/strong&gt;: Enabling humanoids to interact with doors, handles, shelves, switches, cabinets, tables, chairs, tools, and other objects designed for human bodies.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Posture and Balance-Aware Action&lt;/strong&gt;: Generating actions that account for center of mass, reachability, foot placement, support surfaces, and whole-body stability.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Socially Situated Robot Action&lt;/strong&gt;: Studying how humanoid robots should move, gesture, approach, hand over objects, and interact with people in shared spaces.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="slam-and-spatial-understanding"&gt;SLAM and Spatial Understanding&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Semantic SLAM for VLA Robots&lt;/strong&gt;: Using SLAM as one component of the VLA robot system to support localization, spatial memory, object mapping, and long-horizon task execution.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Language-Conditioned Map Querying&lt;/strong&gt;: Connecting language expressions such as &amp;ldquo;the room near the stairs,&amp;rdquo; &amp;ldquo;the object behind the table,&amp;rdquo; or &amp;ldquo;the door on the left&amp;rdquo; to map-based spatial representations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-View Scene Understanding&lt;/strong&gt;: Combining observations from different robot viewpoints to improve object localization, scene reconstruction, and environmental awareness.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Map-Based Task Planning&lt;/strong&gt;: Using spatial maps to support navigation, inspection, object search, path planning, and task sequencing for both quadruped and humanoid robots.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dynamic Map Updates&lt;/strong&gt;: Updating spatial and semantic maps when objects move, doors open, paths become blocked, or robot actions change the environment.&lt;/li&gt;
&lt;/ul&gt;
&lt;!-- ### Cross-Embodiment Learning
- **Shared VLA Representations**: Learning common vision-language representations that can be used by both quadruped and humanoid robots.
- **Embodiment-Specific Action Heads**: Designing separate action interfaces for quadruped locomotion and humanoid whole-body manipulation while preserving a shared reasoning backbone.
- **Skill Transfer Across Robot Bodies**: Studying which parts of robot intelligence can transfer across embodiments and which require body-specific training.
- **Simulation-Based Data Generation**: Using simulated environments to generate diverse visual scenes, language commands, robot trajectories, and task variations for both robot types.
- **Sim-to-Real Adaptation**: Transferring VLA policies trained or evaluated in simulation to real quadruped and humanoid platforms under sensor noise, actuation limits, and environmental variation. --&gt;
&lt;!-- ## System Design
The proposed system is structured around a shared VLA reasoning layer and embodiment-specific robot execution modules. The VLA layer receives visual observations, language instructions, robot state information, and spatial context. It interprets the task, identifies relevant objects or regions, reasons about the environment, and produces an action plan suitable for the available robot platform.
For quadruped robots, the execution module converts VLA outputs into locomotion goals, navigation waypoints, inspection poses, camera viewpoints, terrain-aware motion commands, and exploration behaviors. The quadruped system emphasizes mobility, spatial coverage, environmental observation, and safe movement through complex terrain.
For humanoid robots, the execution module converts VLA outputs into whole-body actions, arm trajectories, hand poses, grasp targets, tool-use sequences, walking motions, and interaction behaviors. The humanoid system emphasizes manipulation, human-scale physical interaction, balance-aware movement, and coordinated use of the body.
The SLAM and spatial understanding module supports both robot types by maintaining localization, scene structure, semantic object locations, and spatial relations. It is not treated as the only core of the project, but as an important enabling component for long-horizon VLA robot behavior. The map provides continuity across time, allowing robots to remember previously observed places, revisit task-relevant areas, and plan beyond the current camera frame.
The system is designed to operate across three levels of autonomy. At the perception level, robots recognize objects, spaces, terrain, and human-relevant cues. At the reasoning level, the VLA model interprets instructions and selects body-appropriate strategies. At the execution level, quadruped and humanoid controllers convert high-level actions into stable, safe, and physically feasible robot motion. --&gt;
&lt;h2 id="current-implementation"&gt;Current Implementation&lt;/h2&gt;
&lt;p&gt;At the current stage, this project focuses on defining the VLA framework for quadruped and humanoid robot intelligence. The initial implementation is organized around robot vision, natural-language instruction grounding, semantic mapping, and embodiment-specific action generation.&lt;/p&gt;
&lt;p&gt;For quadruped robots, the near-term focus is on language-guided navigation, active visual inspection, scene exploration, terrain-aware movement, and semantic SLAM integration. These capabilities provide a foundation for robots that can move through complex environments while interpreting instructions in relation to physical space.&lt;/p&gt;
&lt;p&gt;For humanoid robots, the near-term focus is on vision-language-guided reaching, object interaction, whole-body task execution, and manipulation-oriented scene understanding. These capabilities provide a foundation for robots that can operate in human-designed environments and perform tasks requiring arms, hands, posture control, and physical interaction.&lt;/p&gt;
&lt;!-- The project initially treats quadruped and humanoid systems as separate but related embodiments. The shared research layer is the VLA model, while the robot-specific layers handle locomotion, manipulation, balance, and control. This allows the project to study both common embodied intelligence and robot-specific action constraints. --&gt;
&lt;h2 id="future-research-directions"&gt;Future Research Directions&lt;/h2&gt;
&lt;p&gt;Future work will extend this project toward general-purpose VLA robot systems that can operate across different embodiments, environments, and task domains. One direction is to develop shared robot foundation models that can interpret language and vision across quadruped and humanoid platforms while producing actions through embodiment-specific control heads.&lt;/p&gt;
&lt;!-- Another direction is to integrate richer spatial memory through semantic SLAM, object-level mapping, and dynamic environment modeling. This will allow robots to reason over longer time horizons, remember previously observed places, and perform tasks that require navigation, search, inspection, and manipulation across multiple locations. --&gt;
&lt;p&gt;A further direction is to build simulation-to-real training pipelines for VLA robot policies. Simulation can provide diverse environments, rare failure cases, and scalable task variations, while real-robot experiments can test whether the learned policies remain robust under physical constraints, sensor noise, contact uncertainty, and dynamic environments.&lt;/p&gt;
&lt;!-- The long-term objective is to develop embodied VLA systems that allow quadruped and humanoid robots to understand human instructions, perceive complex environments, coordinate perception and action, and execute useful tasks in real-world spaces. Rather than treating robot vision, navigation, manipulation, and SLAM as isolated modules, this project aims to study how these capabilities can be integrated into practical robot intelligence for legged and humanoid platforms. --&gt;</description></item></channel></rss>