https://openreview.net/pdf?id=HkglHcSj2N
Goal-conditioned policy (Kaelbling, 1993; Schaul et al., 2015)
- The value function is designed as $D_G(s, a, g)$, where $D_G$ is the expected number of steps to reach the goal $g$ from state $s$ after taking action $a$
- Idea: optimize $D_G$ toward the optimum $D_G^*(s, a, g) = 0$, which is attained when the state equals the goal; the expectation is taken over the transition probability $T$
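For concreteness, one standard way to write this as a shortest-path recursion (my reconstruction from the notes above, not necessarily the exact form in the paper):

$$
D_G^*(s, a, g) \;=\; \mathbb{1}[s \neq g]\Big(1 + \mathbb{E}_{s' \sim T(\cdot \mid s, a)}\big[\min_{a'} D_G^*(s', a', g)\big]\Big)
$$

so $D_G^*$ is zero exactly when the state already equals the goal, and otherwise counts one step plus the expected remaining steps under the best follow-up actions.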

- Seeks to obtain the indicator reward: the observation must exactly match the goal
- No extra instrumentation is needed to determine the reward
- Hard to achieve in practice, since the exact same observation is essentially never seen twice in robotics, especially in continuous state spaces
- However, HER (Hindsight Experience Replay) lets us use off-policy RL and "relabel" a trajectory by replacing its goal with a state it actually visited
- e.g. Say we collected 40 episodes with no successes, so all rewards are -1. How do we make use of these trajectories?
- We can change the goal of those trajectories (uniformly at random, or more strategically) and recompute the rewards, which then gives us some successes to learn from (see the sketch below)
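A minimal sketch of this relabelling step in Python (the episode container, the tolerance, and the "use the final state as the new goal" choice are my assumptions for illustration, not necessarily the paper's):

```python
import numpy as np

def sparse_reward(obs, goal, tol=1e-3):
    # Indicator-style reward: 0 when the observation matches the goal, -1 otherwise.
    return 0.0 if np.linalg.norm(np.asarray(obs) - np.asarray(goal)) <= tol else -1.0

def relabel_episodes(episodes):
    """Hindsight relabelling: replace each episode's goal with a state it actually
    reached (here: its final state) and recompute the sparse rewards.

    `episodes` is assumed to be a list of dicts {"states": [...], "actions": [...]},
    with len(states) == len(actions) + 1.
    """
    relabelled = []
    for ep in episodes:
        states, actions = ep["states"], ep["actions"]
        new_goal = states[-1]  # "final" strategy; uniform or "future" sampling also works
        transitions = [
            (states[t], actions[t], states[t + 1], new_goal,
             sparse_reward(states[t + 1], new_goal))
            for t in range(len(actions))
        ]
        relabelled.append(transitions)  # the last transition now has reward 0 (a success)
    return relabelled
```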


Method
- Relabelling trajectories
- Instead of the original transitions $(s_t, a_t, s_{t+1}, g)$, consider the relabelled transitions $(s_t, a_t, s_{t+1}, g = s_{t+k})$, where $s_{t+k}$ is a state visited later in the same trajectory (see the sketch below)
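A rough sketch of this per-transition relabelling (the uniform choice of $k$ and the cap `k_max` are my assumptions, not necessarily the paper's sampling scheme):

```python
import random

def future_relabel(states, actions, k_max=8):
    # For each transition (s_t, a_t, s_{t+1}), emit a relabelled copy with
    # g = s_{t+k}, a state the trajectory actually visits k steps later.
    transitions = []
    T = len(actions)  # states is assumed to have T + 1 entries
    for t in range(T):
        k = random.randint(1, min(k_max, T - t))
        goal = states[t + k]
        # Sparse reward: the relabelled goal is reached at s_{t+1} only when k == 1
        # (ignoring the unlikely case of revisiting the exact same state).
        reward = 0.0 if k == 1 else -1.0
        transitions.append((states[t], actions[t], states[t + 1], goal, reward))
    return transitions
```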