https://openreview.net/pdf?id=HkglHcSj2N
Goal-conditioned policy (Kaelbling, 1993; Schaul et al., 2015)
- The value function is designed as $D_G(s, a, g)$, where $D_G$ is the expected number of steps to reach the goal $g$ from state $s$ after taking action $a$
- Idea: optimize $D_G$ toward the optimum $D_G^*(s, a, g) = 0$, which is attained when the state equals the goal; the expectation is taken over the transition probability $T$
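For concreteness, one standard way to write this as a shortest-path recursion (my reconstruction from the notes above, not necessarily the exact form in the paper):

$$
D_G^*(s, a, g) \;=\; \mathbb{1}[s \neq g]\Big(1 + \mathbb{E}_{s' \sim T(\cdot \mid s, a)}\big[\min_{a'} D_G^*(s', a', g)\big]\Big)
$$

so $D_G^*$ is zero exactly when the state already equals the goal, and otherwise counts one step plus the expected remaining steps under the best follow-up actions.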

- Seeks to obtain the indicator reward: the observation must exactly match the goal
- No extra instrumentation is needed to determine the reward
- Hard to achieve in practice, since the exact same observation is essentially never seen twice in robotics, especially in continuous state spaces
- However, HER (Hindsight Experience Replay) lets us use off-policy RL and "relabel" a trajectory by replacing its goal with a state it actually visited
- e.g. Say we collected 40 episodes with no successes, so all rewards are -1. How do we make use of these trajectories?
- We can change the goal of those trajectories (uniformly at random, or more strategically) and recompute the rewards, which then gives us some successes to learn from (see the sketch below)
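A minimal sketch of this relabelling step in Python (the episode container, the tolerance, and the "use the final state as the new goal" choice are my assumptions for illustration, not necessarily the paper's):

```python
import numpy as np

def sparse_reward(obs, goal, tol=1e-3):
    # Indicator-style reward: 0 when the observation matches the goal, -1 otherwise.
    return 0.0 if np.linalg.norm(np.asarray(obs) - np.asarray(goal)) <= tol else -1.0

def relabel_episodes(episodes):
    """Hindsight relabelling: replace each episode's goal with a state it actually
    reached (here: its final state) and recompute the sparse rewards.

    `episodes` is assumed to be a list of dicts {"states": [...], "actions": [...]},
    with len(states) == len(actions) + 1.
    """
    relabelled = []
    for ep in episodes:
        states, actions = ep["states"], ep["actions"]
        new_goal = states[-1]  # "final" strategy; uniform or "future" sampling also works
        transitions = [
            (states[t], actions[t], states[t + 1], new_goal,
             sparse_reward(states[t + 1], new_goal))
            for t in range(len(actions))
        ]
        relabelled.append(transitions)  # the last transition now has reward 0 (a success)
    return relabelled
```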


Method
- Relabelling trajectories
- Instead of the original transitions $(s_t, a_t, s_{t+1}, g)$, consider the relabelled transitions $(s_t, a_t, s_{t+1}, g = s_{t+k})$, where $s_{t+k}$ is a state visited later in the same trajectory (see the sketch below)
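A rough sketch of this per-transition relabelling (the uniform choice of $k$ and the cap `k_max` are my assumptions, not necessarily the paper's sampling scheme):

```python
import random

def future_relabel(states, actions, k_max=8):
    # For each transition (s_t, a_t, s_{t+1}), emit a relabelled copy with
    # g = s_{t+k}, a state the trajectory actually visits k steps later.
    transitions = []
    T = len(actions)  # states is assumed to have T + 1 entries
    for t in range(T):
        k = random.randint(1, min(k_max, T - t))
        goal = states[t + k]
        # Sparse reward: the relabelled goal is reached at s_{t+1} only when k == 1
        # (ignoring the unlikely case of revisiting the exact same state).
        reward = 0.0 if k == 1 else -1.0
        transitions.append((states[t], actions[t], states[t + 1], goal, reward))
    return transitions
```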