Asymmetric self-play

Main Idea: Alice sets the goal by acting in a setup identical to Bob's, so any goal she produces must also be solvable by Bob.

The objective for Bob is $L = L_{RL} + \beta L_{abc}$

where $L_{RL}$ is an RL objective, $L_{abc}$ is the Alice Behavioural Cloning (ABC) loss, and $\beta$ is a hyperparameter controlling the relative importance of the behavioural cloning term.
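
As a rough illustration of how the two terms combine, here is a minimal Python sketch. It assumes the behavioural cloning term is the mean negative log-likelihood that Bob's policy assigns to the actions Alice took; the notes do not specify its exact form. The function names `abc_loss` and `bob_objective` and all numbers are hypothetical, chosen only for illustration.

```python
import math


def abc_loss(bob_log_probs):
    """Assumed form of L_abc: mean negative log-likelihood that Bob's
    policy assigns to the actions Alice took on her trajectory."""
    return -sum(bob_log_probs) / len(bob_log_probs)


def bob_objective(rl_loss, bc_loss, beta):
    """Combined objective L = L_RL + beta * L_abc."""
    return rl_loss + beta * bc_loss


# Hypothetical values: probabilities Bob's policy gives to Alice's actions.
bob_probs_of_alice_actions = [0.5, 0.25]
log_probs = [math.log(p) for p in bob_probs_of_alice_actions]

l_abc = abc_loss(log_probs)
l_rl = 1.0  # placeholder for whatever RL loss is in use

# beta = 0 recovers the pure RL objective; larger beta weights imitation more.
total = bob_objective(l_rl, l_abc, beta=0.5)
print(total)
```

Setting `beta` to zero removes the cloning term entirely, while a large `beta` pushes Bob towards imitating Alice rather than optimising the RL objective.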