Main Source:
Transporter Networks: Rearranging the Visual World for Robotic Manipulation
https://github.com/google-research/ravens
Goal: rearrange deep features to infer spatial displacements from visual input, which are then used to parameterise robot actions
Intuition:
https://transporternets.github.io/images/animation.mp4
Method
Top: real-world scene at the initial time step
Bottom: top-down view of the goal
The aim of Transporter Networks is to recover the underlying distribution (affordance) of successful picks and their corresponding places purely from visual observations, with no assumptions of “objectness”, so that unseen objects can be handled
Suppose we have an observation $o_t$ defined on a regular grid of pixels $\{(u, v)\}$ at timestep $t$, and candidate pick poses $\mathcal{T}_{pick} \sim (u, v) \in o_t$. The distribution of successful picks over pixels in $o_t$ can be multi-modal (especially when the scene contains several identical objects to pick).
A fully convolutional network (FCN) is used to model the action-value function $\mathcal{Q}_{pick}((u, v) \mid o_t)$.
Given an observation, we take the pick pose that maximizes this action-value function.
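Following the paper's notation, this is simply the pixel with the highest predicted value:

$$\mathcal{T}_{pick} = \arg\max_{(u, v)} \, \mathcal{Q}_{pick}((u, v) \mid o_t)$$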
FCNs are translationally equivariant
This means that if the object to be picked is translated in the scene, the learned picking pose is translated with it. This equivariance can be formalised as $f_{pick}(g \circ o_t) = g \circ f_{pick}(o_t)$, where $g$ is a translation.
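A quick numerical sanity check of this property (a minimal sketch, not code from the ravens repo; the toy architecture, input size and shift amounts are arbitrary assumptions): with circular padding the equality holds exactly under circular shifts, while with zero padding it only holds approximately away from the image borders.

```python
import torch
import torch.nn as nn

# Toy FCN producing a single-channel Q_pick heatmap over pixels (illustrative only).
fcn = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1, padding_mode="circular"),
    nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1, padding_mode="circular"),
)

o_t = torch.randn(1, 3, 64, 64)   # observation on a regular (u, v) grid
du, dv = 5, -3                    # translation g, in pixels

g_o_t = torch.roll(o_t, shifts=(du, dv), dims=(2, 3))      # g ∘ o_t
lhs = fcn(g_o_t)                                            # f_pick(g ∘ o_t)
rhs = torch.roll(fcn(o_t), shifts=(du, dv), dims=(2, 3))    # g ∘ f_pick(o_t)

print(torch.allclose(lhs, rhs, atol=1e-5))  # True: shifting the input shifts the heatmap
```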
Spatial equivariance works best when the appearance of an object remains constant across different camera views; this property is known as spatial consistency
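In the paper, spatial consistency is obtained by unprojecting the RGB-D views into a 3D point cloud and rendering it as an orthographic top-down image of the workspace. Below is a minimal sketch of such a projection (not the repo's implementation; the function name, workspace-bounds handling and pixel size are illustrative assumptions):

```python
import numpy as np

def topdown_heightmap(points, colors, bounds, pixel_size):
    """Project a coloured point cloud into an orthographic top-down view.

    points: (N, 3) xyz in the workspace frame (metres); colors: (N, 3) RGB;
    bounds: (3, 2) min/max per axis; pixel_size: metres per output pixel.
    """
    width = int(np.round((bounds[0, 1] - bounds[0, 0]) / pixel_size))
    height = int(np.round((bounds[1, 1] - bounds[1, 0]) / pixel_size))
    heightmap = np.zeros((height, width), dtype=np.float32)
    colormap = np.zeros((height, width, 3), dtype=np.uint8)

    # Keep only points inside the workspace bounds.
    valid = np.all((points >= bounds[:, 0]) & (points <= bounds[:, 1]), axis=1)
    points, colors = points[valid], colors[valid]

    # Sort by height so the tallest point wins when several fall on one pixel.
    order = np.argsort(points[:, 2])
    points, colors = points[order], colors[order]

    # Convert metric (x, y) to pixel (u, v) and scatter into the grid.
    u = np.clip(((points[:, 0] - bounds[0, 0]) / pixel_size).astype(int), 0, width - 1)
    v = np.clip(((points[:, 1] - bounds[1, 0]) / pixel_size).astype(int), 0, height - 1)
    heightmap[v, u] = points[:, 2] - bounds[2, 0]
    colormap[v, u] = colors
    return heightmap, colormap
```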
Extension: (Future blog?)
Why not vision transformers?