Main Source:
Transporter Networks: Rearranging the Visual World for Robotic Manipulation
https://github.com/google-research/ravens
Goal: rearrange deep features to infer spatial displacements from visual input, which are then used to parameterise robot actions
Intuition:
https://transporternets.github.io/images/animation.mp4
Method
Top: real-world scene at the initial time step
Bottom: top-down view of the goal
The aim of Transporter Networks is to recover the underlying distribution (affordance) of successful picks and their corresponding places purely from visual observations, with no assumptions of “objectness”, so that unseen objects can be handled
Suppose we have an observation $o_t$ defined on a regular grid of pixels $\{(u, v)\}$ at timestep $t$, and candidate pick poses $\mathcal{T}_{pick} \sim (u, v) \in o_t$. The distribution of successful picks over pixels in $o_t$ can be multi-modal (especially when the scene contains several identical objects to pick).
A fully convolutional network (FCN) is used to model the action-value function $\mathcal{Q}_{pick}((u, v) \mid o_t)$.
Given an observation, we take the pick pose that maximizes this action-value function.
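Following the paper's notation, this is simply the pixel with the highest predicted value:

$$\mathcal{T}_{pick} = \arg\max_{(u, v)} \, \mathcal{Q}_{pick}((u, v) \mid o_t)$$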
FCNs are translationally equivariant
This means that if the object to be picked is translated in the scene, the learned picking pose is translated with it. This equivariance can be formalised as $f_{pick}(g \circ o_t) = g \circ f_{pick}(o_t)$, where $g$ is a translation.
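A quick numerical sanity check of this property (a minimal sketch, not code from the ravens repo; the toy architecture, input size and shift amounts are arbitrary assumptions): with circular padding the equality holds exactly under circular shifts, while with zero padding it only holds approximately away from the image borders.

```python
import torch
import torch.nn as nn

# Toy FCN producing a single-channel Q_pick heatmap over pixels (illustrative only).
fcn = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1, padding_mode="circular"),
    nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1, padding_mode="circular"),
)

o_t = torch.randn(1, 3, 64, 64)   # observation on a regular (u, v) grid
du, dv = 5, -3                    # translation g, in pixels

g_o_t = torch.roll(o_t, shifts=(du, dv), dims=(2, 3))      # g ∘ o_t
lhs = fcn(g_o_t)                                            # f_pick(g ∘ o_t)
rhs = torch.roll(fcn(o_t), shifts=(du, dv), dims=(2, 3))    # g ∘ f_pick(o_t)

print(torch.allclose(lhs, rhs, atol=1e-5))  # True: shifting the input shifts the heatmap
```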
Spatial equivariance works best when the appearance of an object remains constant across different camera views; this property is known as spatial consistency
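In the paper, spatial consistency is obtained by unprojecting the RGB-D views into a 3D point cloud and rendering it as an orthographic top-down image of the workspace. Below is a minimal sketch of such a projection (not the repo's implementation; the function name, workspace-bounds handling and pixel size are illustrative assumptions):

```python
import numpy as np

def topdown_heightmap(points, colors, bounds, pixel_size):
    """Project a coloured point cloud into an orthographic top-down view.

    points: (N, 3) xyz in the workspace frame (metres); colors: (N, 3) RGB;
    bounds: (3, 2) min/max per axis; pixel_size: metres per output pixel.
    """
    width = int(np.round((bounds[0, 1] - bounds[0, 0]) / pixel_size))
    height = int(np.round((bounds[1, 1] - bounds[1, 0]) / pixel_size))
    heightmap = np.zeros((height, width), dtype=np.float32)
    colormap = np.zeros((height, width, 3), dtype=np.uint8)

    # Keep only points inside the workspace bounds.
    valid = np.all((points >= bounds[:, 0]) & (points <= bounds[:, 1]), axis=1)
    points, colors = points[valid], colors[valid]

    # Sort by height so the tallest point wins when several fall on one pixel.
    order = np.argsort(points[:, 2])
    points, colors = points[order], colors[order]

    # Convert metric (x, y) to pixel (u, v) and scatter into the grid.
    u = np.clip(((points[:, 0] - bounds[0, 0]) / pixel_size).astype(int), 0, width - 1)
    v = np.clip(((points[:, 1] - bounds[1, 0]) / pixel_size).astype(int), 0, height - 1)
    heightmap[v, u] = points[:, 2] - bounds[2, 0]
    colormap[v, u] = colors
    return heightmap, colormap
```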
Extension: (Future blog?)
Why not vision transformers?