Prereq/Plug: CLIPort is a combination of CLIP and Transporter networks.
If you haven’t already, I highly recommend reading the following articles before continuing 🩵
CLIP: Learning Transferable Visual Models From Natural Language Supervision
Transporter Networks: Robot actions - Embeddings → Spatial displacements
Cool, after two papers, we finally get to the robotics part…
So! The elephant in the room: how do we combine the two?
CLIP: enables broad semantic understanding (the "what")
Transporter: provides spatial precision (the "where")
The goal is to learn a goal-conditioned policy $\pi$ that outputs an action given the input $\gamma_t = (o_t, l_t)$, where $o_t$ is the visual observation and $l_t$ is the language instruction.
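To make the interface concrete, here is a minimal sketch of what such a policy looks like as a function. The names and the random heatmaps are placeholders, not CLIPort's actual implementation; the real model predicts dense per-pixel affordance heatmaps and takes their argmax to get pick and place locations.

```python
import numpy as np

def policy(observation, instruction):
    """Hypothetical goal-conditioned policy pi given gamma_t = (o_t, l_t).

    observation: top-down image, shape (H, W, channels)
    instruction: language string, e.g. "put the blue block in the bowl"
    Returns pick and place locations as pixel coordinates (a sketch).
    """
    h, w = observation.shape[:2]
    # Stand-ins for the dense affordance heatmaps Q_pick and Q_place.
    pick_heatmap = np.random.rand(h, w)
    place_heatmap = np.random.rand(h, w)
    # The action is the argmax pixel of each heatmap.
    pick = np.unravel_index(pick_heatmap.argmax(), pick_heatmap.shape)
    place = np.unravel_index(place_heatmap.argmax(), place_heatmap.shape)
    return pick, place

obs = np.zeros((320, 160, 4))
pick, place = policy(obs, "put the blue block in the bowl")
```

The key idea carried over into the real architecture: both streams below produce features that feed these per-pixel heatmaps.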
Semantic stream (yellow and blue): First we encode the RGB image into an image embedding with CLIP’s visual encoder; then we tile the CLIP language embedding over the spatial dimensions and multiply it element-wise with the decoder features.
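The tile-and-multiply step is just broadcasting. A small sketch with stand-in tensors (the shapes here are illustrative, not the paper's exact dimensions):

```python
import numpy as np

C, H, W = 512, 40, 20                 # channels and feature-map size (illustrative)
img_feat = np.random.randn(C, H, W)   # stand-in for CLIP visual-encoder features
lang_emb = np.random.randn(C)         # stand-in for the CLIP sentence embedding

# Broadcast (tile) the language embedding across the H x W grid and
# fuse by element-wise multiplication; the shape stays (C, H, W).
fused = img_feat * lang_emb[:, None, None]
```

So every spatial location is modulated by the same instruction embedding, which is how the language conditions the whole feature map.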
Spatial stream (green): The spatial decoder layers are simply concatenated with the corresponding semantic decoder layers, so the fused features carry both "what" and "where" information.
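That concatenation can be sketched in one line. The shapes are again illustrative, assuming the two streams produce feature maps at the same resolution:

```python
import numpy as np

C, H, W = 64, 80, 40
semantic_feat = np.random.randn(C, H, W)  # semantic-stream decoder layer
spatial_feat = np.random.randn(C, H, W)   # spatial-stream decoder layer

# Stack the two streams along the channel axis; the next layer then
# sees both semantic and spatial features at every pixel.
fused = np.concatenate([semantic_feat, spatial_feat], axis=0)  # (2C, H, W)
```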