Main Idea: Since Alice, who has a setup identical to Bob's, set up the goal, Bob must be able to solve it too.
The objective for Bob is $L = L_{RL} + \beta L_{abc}$,
where $L_{RL}$ is an RL objective, $L_{abc}$ is the Alice Behavioural Cloning (ABC) loss, and $\beta$ is a hyperparameter controlling the relative importance of the BC term.
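
As a rough illustration of how the two terms combine, here is a minimal Python sketch. It assumes the BC term is the mean negative log-likelihood of Alice's actions under Bob's policy, which is a common choice but is not stated in these notes. The function names (`bc_loss`, `total_loss`), the toy probabilities, and the value of `beta` are all hypothetical, and the RL loss is passed in as a plain number since its form is not specified here.

```python
import math

def bc_loss(bob_action_probs, alice_actions):
    """Assumed BC term: mean negative log-likelihood of Alice's actions
    under Bob's policy (one probability vector per time step)."""
    nll = [-math.log(probs[a]) for probs, a in zip(bob_action_probs, alice_actions)]
    return sum(nll) / len(nll)

def total_loss(rl_loss, abc_loss, beta):
    """Combined objective L = L_RL + beta * L_abc."""
    return rl_loss + beta * abc_loss

# Toy example: two time steps, two possible actions (values are made up).
bob_action_probs = [[0.5, 0.5], [0.25, 0.75]]  # Bob's policy at each step
alice_actions = [0, 1]                         # actions Alice actually took

l_abc = bc_loss(bob_action_probs, alice_actions)
loss = total_loss(rl_loss=1.0, abc_loss=l_abc, beta=0.1)
print(loss)
```

Setting `beta = 0` recovers the pure RL objective, while larger values push Bob harder toward imitating Alice's trajectory.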