Related works:

CLIPort: What and Where Pathways for Robotic Manipulation

Main Source:

🪁 KITE

Grounding: grounding an instruction means working out from the language what to manipulate

Scene semantics: which object to manipulate

Object semantics: which part of that object to manipulate

6-DoF reasoning + precision
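Toy sketch of the grounding idea, assuming the grounding step outputs a per-pixel score map (heatmap) over the image and the keypoint is simply the highest-scoring pixel. The function name, the heatmap values, and the example instruction are all made up for illustration; this is not the KITE implementation.

```python
from typing import List, Tuple


def heatmap_to_keypoint(heatmap: List[List[float]]) -> Tuple[int, int]:
    """Return the (row, col) of the highest-scoring pixel in a 2D heatmap."""
    best_score = float("-inf")
    best_rc = (0, 0)
    for r, row in enumerate(heatmap):
        for c, score in enumerate(row):
            if score > best_score:
                best_score = score
                best_rc = (r, c)
    return best_rc


# Hypothetical 3x4 heatmap for an instruction like "pick up the mug by the handle".
# Scene semantics decides which object lights up; object semantics decides
# which part of it (here, the single peak at row 1, col 2).
toy_heatmap = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.2, 0.9, 0.1],
    [0.0, 0.1, 0.3, 0.0],
]

keypoint = heatmap_to_keypoint(toy_heatmap)
print(keypoint)  # (1, 2)
```

The keypoint is only a 2D pixel; turning it into a 6-DoF action still needs depth and a skill that decides orientation, which is the "6-DoF reasoning + precision" part.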

CLIPort: keypoint-based grounding, but only simple planar manipulation

RobotMoo: VLM (vision-language model) grounding; only pick-and-place because it captures scene semantics and much less object semantics

BC-Z/SayCan: require excessive data collection

PerAct: voxelized, so actions are discretized to the voxel grid, which bounds precision

Words

Language-Based Manipulation:

Skill-based Manipulation: