Online RL - the deployed policy and the learning target are the same policy

cannot deploy an arbitrary learning policy, since it may take random (potentially unsafe) actions in the environment

Offline RL - a separate behavioural policy is deployed to collect data; the learning policy is trained offline from that fixed dataset
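
A minimal sketch of the contrast (the `env`, `policy`, and `update` interfaces below are hypothetical placeholders, not any specific library):

```python
# Sketch only: `env`, `policy`, and `update` are hypothetical interfaces.

def online_rl(env, policy, update, num_steps):
    """Online RL: the policy being learned is also the policy being deployed."""
    obs = env.reset()
    for _ in range(num_steps):
        action = policy.act(obs)                      # learning target == deployed policy
        next_obs, reward, done = env.step(action)
        policy = update(policy, (obs, action, reward, next_obs, done))
        obs = env.reset() if done else next_obs
    return policy


def offline_rl(dataset, policy, update, num_epochs):
    """Offline RL: learning only sees a fixed dataset collected by a behavioural policy."""
    for _ in range(num_epochs):
        for transition in dataset:                    # no new interaction with the environment
            policy = update(policy, transition)
    return policy
```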

Surprising finding: even when wrong reward labels are provided in the data, ATAC still outperforms BC/the behavioural policy, and sometimes even outperforms training on the correct reward labels (this applies to other offline RL algorithms and tasks as well)
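
A hedged sketch of what "wrong reward labels" could look like in practice; the transition format and the `corrupt_rewards` helper are illustrative, not the actual experimental protocol:

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_rewards(dataset, mode="random"):
    """Overwrite rewards in an offline dataset of (obs, action, reward, next_obs, done)
    tuples (assumed format) with zeros or random noise unrelated to the true reward."""
    corrupted = []
    for obs, action, _reward, next_obs, done in dataset:
        wrong_reward = 0.0 if mode == "zero" else float(rng.normal())
        corrupted.append((obs, action, wrong_reward, next_obs, done))
    return corrupted

# e.g. policy = offline_rl(corrupt_rewards(dataset), policy, update, num_epochs)
```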

So its objective is not to maximize cumulative reward?

Explanation

Pessimism

under pessimism, the learned policy wants to stay within the seen state-action distribution - $\delta$-approximately optimal
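
One way this guarantee is often stated (a generic pessimism bound, not necessarily ATAC's exact theorem): for any comparator policy $\pi$ whose state-action distribution is covered by the dataset,

$$ J(\pi) - J(\hat{\pi}) \le \delta, $$

where $J(\pi)$ denotes the expected cumulative reward of $\pi$ and $\hat{\pi}$ is the pessimistically learned policy. So $\hat{\pi}$ is $\delta$-approximately optimal relative to any policy the data can support, which is what pushes it to stay within the seen state-action distribution.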

Positive data-bias

If the collected trajectories terminate whenever the agent fails, there is a positive data bias: the agent never fails within the observed trajectories, so the data over-represents surviving behaviour. Even when rewards are wrongly assigned, the RL algorithm learns from these positively biased trajectories and performs surprisingly well
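
A toy sketch of that mechanism (the random-walk environment and arbitrary reward labels below are placeholders, not from the paper): trajectories are cut as soon as the agent fails, so every recorded transition comes from surviving behaviour regardless of the reward attached to it.

```python
import random

def collect_trajectory(max_len=100, fail_prob=0.05):
    """Toy data collection that terminates on failure; only pre-failure steps are logged."""
    trajectory = []
    for t in range(max_len):
        state, action = t, random.choice([-1, +1])    # placeholder state/action
        if random.random() < fail_prob:
            break                                     # failure ends the trajectory, unlogged
        reward = random.random()                      # arbitrary ("wrong") reward label
        trajectory.append((state, action, reward))
    return trajectory

dataset = [collect_trajectory() for _ in range(1000)]
# Every logged (state, action) pair survived; an algorithm that stays close to this
# data (as a pessimistic offline method does) inherits that bias towards not failing.
print(sum(len(traj) for traj in dataset), "logged transitions, none from failure steps")
```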

Further work might investigate dropping reward labels in labelling-intensive settings, and how best to make use of positive data bias