Overall goal: the performance of IL is upper-bounded by the performance of the demonstrator. How can we better extrapolate the reward function beyond the demonstrations in order to perform better?
Method of reward inference
Goal: rank preferences automatically with few labelled rankings
Define general loss function
Recall the goal of IRL: minimise the error between the expected feature counts of the learned policy and of the demonstrations
$\tau_1, ..., \tau_m$
trajectories ranked from worst to best
We want to approximate the reward such that better trajectories accumulate higher total reward:
$\sum_{s \in \tau_i} r_\theta(s) < \sum_{s \in \tau_j} r_\theta(s), \ i < j$
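As an illustration, a minimal PyTorch sketch of this setup; `reward_net`, its architecture, and the state dimension are assumptions made for the example, not part of the notes:

```python
import torch
import torch.nn as nn

# Hypothetical per-state reward model r_theta (architecture and state_dim are assumptions).
state_dim = 4
reward_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))

def predicted_return(trajectory: torch.Tensor) -> torch.Tensor:
    """J_theta(tau) = sum of r_theta(s) over states s in tau; trajectory has shape (T, state_dim)."""
    return reward_net(trajectory).sum()

# For a pair with i < j (tau_j ranked better), training should push
# predicted_return(tau_i) < predicted_return(tau_j).
```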
The generalized loss function that enforces this ranking therefore looks like this:
$\mathcal{L}(\theta) = \mathbb{E}_{\tau_i, \tau_j \sim \Pi}\left[\xi\big(P(J_\theta(\tau_i) < J_\theta(\tau_j)),\ \tau_i \prec \tau_j\big)\right]$
$\Pi$
distribution of all demonstrations
$\xi$
binary classification loss function
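A rough Monte-Carlo sketch of this generalized loss, taking per-trajectory returns computed as in the snippet above; `ranking_loss`, `preference_prob`, `xi`, and `num_pairs` are hypothetical names chosen for the example:

```python
import random
import torch

def ranking_loss(returns, preference_prob, xi, num_pairs=64):
    """Monte-Carlo estimate of the loss: sample pairs (tau_i, tau_j) with i < j from the
    ranked demonstrations and apply the binary classification loss xi to the predicted
    preference probability versus the ground-truth ranking.

    returns         : list of scalar tensors [J_theta(tau_1), ..., J_theta(tau_m)], ranked worst to best
    preference_prob : maps a pair of returns (J_i, J_j) to P(J_theta(tau_i) < J_theta(tau_j))
    xi              : binary classification loss, e.g. torch.nn.functional.binary_cross_entropy
    """
    losses = []
    for _ in range(num_pairs):
        i, j = sorted(random.sample(range(len(returns)), 2))  # i < j, so tau_j is ranked higher
        p = preference_prob(returns[i], returns[j])            # predicted P(tau_i worse than tau_j)
        losses.append(xi(p, torch.tensor(1.0)))                # ground-truth label: tau_j is preferred
    return torch.stack(losses).mean()
```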
To implement $\xi$, which predicts whether one trajectory is preferable to another, we use a cross-entropy loss following the Bradley-Terry and Luce-Shepard models of preferences.
Review: score functions
We represent the probability that $J_\theta(\tau_i) < J_\theta(\tau_j)$ as a softmax over the two trajectories' accumulated predicted rewards:
$P(J_\theta(\tau_i) < J_\theta(\tau_j)) \approx \frac{\exp\left(\sum_{s \in \tau_j} r_\theta(s)\right)}{\exp\left(\sum_{s \in \tau_i} r_\theta(s)\right) + \exp\left(\sum_{s \in \tau_j} r_\theta(s)\right)}$
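A possible implementation of this probability, plus the equivalent cross-entropy form that treats the two accumulated rewards as logits (function names are assumptions; computing the softmax / cross-entropy with the library utilities keeps it numerically stable):

```python
import torch
import torch.nn.functional as F

def bradley_terry_prob(J_i: torch.Tensor, J_j: torch.Tensor) -> torch.Tensor:
    """P(J_theta(tau_i) < J_theta(tau_j)) ~= exp(J_j) / (exp(J_i) + exp(J_j)), via a stable softmax."""
    return torch.softmax(torch.stack([J_i, J_j]), dim=0)[1]

def pairwise_cross_entropy(J_i: torch.Tensor, J_j: torch.Tensor) -> torch.Tensor:
    """Cross-entropy form of xi: treat the two returns as logits, with label index 1 = tau_j preferred."""
    logits = torch.stack([J_i, J_j]).unsqueeze(0)  # shape (1, 2)
    target = torch.tensor([1])                     # tau_j is the better trajectory
    return F.cross_entropy(logits, target)
```

Plugging `bradley_terry_prob` together with binary cross-entropy into the generic `ranking_loss` sketch above (or using `pairwise_cross_entropy` directly) would give a concrete training objective for $r_\theta$.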