Prereq/Plug: CLIPort is a combination of CLIP and Transporter networks.
If you haven’t already, I highly recommend reading the following articles before continuing 🩵
CLIP: Learning Transferable Visual Models From Natural Language Supervision
Transporter Networks: Robot actions - Embeddings → Spatial displacements
Cool, after two papers, we finally get to the robotics part…
So! The elephant in the room: how do we combine the two?
CLIP: enables broad semantic understanding (the "what")
Transporter: provides spatial precision (the "where")
The goal is to learn a goal-conditioned policy $\pi$ that outputs an action given the input $\gamma_t = (o_t, l_t)$, where $o_t$ is the visual observation and $l_t$ is the language instruction.
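To make the interface concrete, here is a minimal sketch of what such a policy looks like as a function. The names and the random heatmaps are placeholders, not CLIPort's actual implementation; the real model predicts dense per-pixel affordance heatmaps and takes their argmax to get pick and place locations.

```python
import numpy as np

def policy(observation, instruction):
    """Hypothetical goal-conditioned policy pi given gamma_t = (o_t, l_t).

    observation: top-down image, shape (H, W, channels)
    instruction: language string, e.g. "put the blue block in the bowl"
    Returns pick and place locations as pixel coordinates (a sketch).
    """
    h, w = observation.shape[:2]
    # Stand-ins for the dense affordance heatmaps Q_pick and Q_place.
    pick_heatmap = np.random.rand(h, w)
    place_heatmap = np.random.rand(h, w)
    # The action is the argmax pixel of each heatmap.
    pick = np.unravel_index(pick_heatmap.argmax(), pick_heatmap.shape)
    place = np.unravel_index(place_heatmap.argmax(), place_heatmap.shape)
    return pick, place

obs = np.zeros((320, 160, 4))
pick, place = policy(obs, "put the blue block in the bowl")
```

The key idea carried over into the real architecture: both streams below produce features that feed these per-pixel heatmaps.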
Semantic stream (yellow and blue): First we encode the RGB image into an image embedding with CLIP’s visual encoder; then we tile the CLIP language embedding over the spatial dimensions and multiply it element-wise with the decoder features.
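The tile-and-multiply step is just broadcasting. A small sketch with stand-in tensors (the shapes here are illustrative, not the paper's exact dimensions):

```python
import numpy as np

C, H, W = 512, 40, 20                 # channels and feature-map size (illustrative)
img_feat = np.random.randn(C, H, W)   # stand-in for CLIP visual-encoder features
lang_emb = np.random.randn(C)         # stand-in for the CLIP sentence embedding

# Broadcast (tile) the language embedding across the H x W grid and
# fuse by element-wise multiplication; the shape stays (C, H, W).
fused = img_feat * lang_emb[:, None, None]
```

So every spatial location is modulated by the same instruction embedding, which is how the language conditions the whole feature map.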
Spatial stream (green): The spatial decoder layers are simply concatenated with the corresponding semantic decoder layers, so the fused features carry both "what" and "where" information.
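That concatenation can be sketched in one line. The shapes are again illustrative, assuming the two streams produce feature maps at the same resolution:

```python
import numpy as np

C, H, W = 64, 80, 40
semantic_feat = np.random.randn(C, H, W)  # semantic-stream decoder layer
spatial_feat = np.random.randn(C, H, W)   # spatial-stream decoder layer

# Stack the two streams along the channel axis; the next layer then
# sees both semantic and spatial features at every pixel.
fused = np.concatenate([semantic_feat, spatial_feat], axis=0)  # (2C, H, W)
```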