Online RL - deployment & learning happen together
cannot deploy an arbitrary policy, since the still-learning policy may take random (exploratory) actions in the real system
Offline RL - a separate behavioural policy is deployed to collect the data; learning happens offline from that fixed dataset (contrast sketched below)
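A minimal sketch of the difference in data flow (the toy environment, `update` placeholder, and policies here are illustrative assumptions, not from the source): in online RL the still-learning policy is what acts in the environment, while offline RL only ever reads a fixed dataset produced by the behaviour policy.

```python
import random

class ToyEnv:
    """Tiny 1-D chain: states 0..4, actions -1/+1, episode ends at either boundary."""
    def reset(self):
        self.s = 2
        return self.s

    def step(self, a):
        self.s = max(0, min(4, self.s + a))
        done = self.s in (0, 4)
        reward = 1.0 if self.s == 4 else 0.0
        return self.s, reward, done

def update(policy, batch):
    # Placeholder for any RL update rule (Q-learning, actor-critic, ...).
    return policy

# Online RL: learning and deployment are entangled -- the exploratory,
# still-bad policy is what actually acts in the real system.
def online_rl(policy, env, episodes=10):
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = random.choice([-1, 1])             # exploratory action hits the environment
            s2, r, done = env.step(a)
            policy = update(policy, [(s, a, r, s2, done)])
            s = s2
    return policy

# Offline RL: a separate behaviour policy produced `dataset` beforehand;
# training never touches the environment.
def offline_rl(policy, dataset, epochs=10):
    for _ in range(epochs):
        policy = update(policy, dataset)
    return policy
```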
Surprising finding: with wrong reward labels in the data, ATAC still outperforms BC / the behavioural policy, and sometimes even training on the correct reward labels (the effect shows up for other offline RL algorithms and tasks as well) - relabelling sketch below
So its objective is not to maximize cumulative reward?
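A hedged sketch of how such a wrong-reward experiment can be set up (the relabelling schemes and the `train_offline_rl` / `evaluate` helpers are assumptions for illustration, not the exact protocol): the dataset's reward column is overwritten before training, and the resulting policy is still evaluated against the true reward.

```python
import random

def relabel(dataset, scheme):
    """dataset: list of (s, a, r, s2, done) tuples; returns a copy with altered rewards."""
    out = []
    for (s, a, r, s2, done) in dataset:
        if scheme == "zero":
            r = 0.0
        elif scheme == "random":
            r = random.uniform(-1.0, 1.0)
        elif scheme == "negated":
            r = -r
        out.append((s, a, r, s2, done))
    return out

# for scheme in ("true", "zero", "random", "negated"):
#     data = dataset if scheme == "true" else relabel(dataset, scheme)
#     policy = train_offline_rl(data)   # e.g. ATAC or another pessimistic method (hypothetical helper)
#     evaluate(policy)                  # evaluation always uses the *true* reward
```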
Explanation
Pessimism
the pessimistic learner stays within the seen state-action distribution, and within that support it is $\delta$-approximately optimal (illustrated in the toy sketch after this block)
Positive data-bias
If the collected trajectories terminate whenever the agent fails, there is a positive data bias: the observed trajectories never contain failures. Even when the rewards are wrongly assigned, the RL algorithm learns from these surviving trajectories and performs surprisingly well
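A toy, hedged illustration of how the two effects combine (the loop environment, the fixed out-of-support penalty, and all constants are made up for this sketch): the dataset contains only surviving behaviour, every reward is relabelled to zero, and a pessimistic tabular Q-iteration still recovers a policy that stays in support and therefore never fails.

```python
import numpy as np

n_states, gamma, penalty = 5, 0.9, 1.0
SAFE, RISKY = 0, 1   # RISKY would fail in the real environment, but it never appears in the data

# Positive data bias: the behaviour policy only ever took SAFE, so the logged
# trajectories contain no failures. Rewards are relabelled to 0 (wrong labels).
dataset = [(s, SAFE, 0.0, (s + 1) % n_states) for s in range(n_states)]

# Pessimistic Q-iteration: any (state, action) pair not in the data is pinned
# to a low value instead of being bootstrapped optimistically.
Q = np.zeros((n_states, 2))
for _ in range(200):
    Q_new = np.full_like(Q, -penalty)             # unseen pairs get the pessimistic value
    for (s, a, r, s2) in dataset:
        Q_new[s, a] = r + gamma * Q[s2].max()     # ordinary Bellman backup for seen pairs
    Q = Q_new

print(Q)                    # SAFE column stays at 0, RISKY column at -penalty
print(Q.argmax(axis=1))     # greedy policy picks SAFE everywhere despite zero rewards
```

Because the data never contains a failure, "stay within the data support" and "never fail" coincide here, so the greedy policy is safe even though its reward labels carry no information; this is the sense in which pessimism turns a positive data bias into good behaviour under wrong rewards.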
Further work might investigate dropping reward labels in labelling-intensive settings, and how to best exploit positive data bias