Overall goal: the performance of IL is upper-bounded by the performance of the demonstrator. How can we better extrapolate the reward function beyond the demonstrations in order to perform better?
Method of reward inference
Goal: rank preferences automatically with few labelled rankings
Define general loss function
Recall the goal of IRL: minimise the error between the expected feature counts of the learned policy and of the demonstrations
$\tau_1, ..., \tau_m$
trajectories ranked from worst to best
We want to approximate the reward such that better trajectories accumulate higher total reward:
$\sum_{s \in \tau_i} r_\theta(s) < \sum_{s \in \tau_j} r_\theta(s), \ i < j$
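As an illustration, a minimal PyTorch sketch of this setup; `reward_net`, its architecture, and the state dimension are assumptions made for the example, not part of the notes:

```python
import torch
import torch.nn as nn

# Hypothetical per-state reward model r_theta (architecture and state_dim are assumptions).
state_dim = 4
reward_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))

def predicted_return(trajectory: torch.Tensor) -> torch.Tensor:
    """J_theta(tau) = sum of r_theta(s) over states s in tau; trajectory has shape (T, state_dim)."""
    return reward_net(trajectory).sum()

# For a pair with i < j (tau_j ranked better), training should push
# predicted_return(tau_i) < predicted_return(tau_j).
```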
The generalized loss function that enforces this ranking therefore looks like this:
$\mathcal{L}(\theta) = \mathbb{E}_{\tau_i, \tau_j \sim \Pi}\left[\xi\big(P(J_\theta(\tau_i) < J_\theta(\tau_j)),\ \tau_i \prec \tau_j\big)\right]$
$\Pi$
distribution of all demonstrations
$\xi$
binary classification loss function
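A rough Monte-Carlo sketch of this generalized loss, taking per-trajectory returns computed as in the snippet above; `ranking_loss`, `preference_prob`, `xi`, and `num_pairs` are hypothetical names chosen for the example:

```python
import random
import torch

def ranking_loss(returns, preference_prob, xi, num_pairs=64):
    """Monte-Carlo estimate of the loss: sample pairs (tau_i, tau_j) with i < j from the
    ranked demonstrations and apply the binary classification loss xi to the predicted
    preference probability versus the ground-truth ranking.

    returns         : list of scalar tensors [J_theta(tau_1), ..., J_theta(tau_m)], ranked worst to best
    preference_prob : maps a pair of returns (J_i, J_j) to P(J_theta(tau_i) < J_theta(tau_j))
    xi              : binary classification loss, e.g. torch.nn.functional.binary_cross_entropy
    """
    losses = []
    for _ in range(num_pairs):
        i, j = sorted(random.sample(range(len(returns)), 2))  # i < j, so tau_j is ranked higher
        p = preference_prob(returns[i], returns[j])            # predicted P(tau_i worse than tau_j)
        losses.append(xi(p, torch.tensor(1.0)))                # ground-truth label: tau_j is preferred
    return torch.stack(losses).mean()
```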
To implement $\xi$, which predicts whether one trajectory is preferable to another, we use a cross-entropy loss following the Bradley-Terry and Luce-Shepard models of preferences.
Review: score functions
We represent the probability that $J_\theta(\tau_i) < J_\theta(\tau_j)$ as a softmax over the two trajectories' accumulated predicted rewards:
$P(J_\theta(\tau_i) < J_\theta(\tau_j)) \approx \frac{\exp\left(\sum_{s \in \tau_j} r_\theta(s)\right)}{\exp\left(\sum_{s \in \tau_i} r_\theta(s)\right) + \exp\left(\sum_{s \in \tau_j} r_\theta(s)\right)}$
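A possible implementation of this probability, plus the equivalent cross-entropy form that treats the two accumulated rewards as logits (function names are assumptions; computing the softmax / cross-entropy with the library utilities keeps it numerically stable):

```python
import torch
import torch.nn.functional as F

def bradley_terry_prob(J_i: torch.Tensor, J_j: torch.Tensor) -> torch.Tensor:
    """P(J_theta(tau_i) < J_theta(tau_j)) ~= exp(J_j) / (exp(J_i) + exp(J_j)), via a stable softmax."""
    return torch.softmax(torch.stack([J_i, J_j]), dim=0)[1]

def pairwise_cross_entropy(J_i: torch.Tensor, J_j: torch.Tensor) -> torch.Tensor:
    """Cross-entropy form of xi: treat the two returns as logits, with label index 1 = tau_j preferred."""
    logits = torch.stack([J_i, J_j]).unsqueeze(0)  # shape (1, 2)
    target = torch.tensor([1])                     # tau_j is the better trajectory
    return F.cross_entropy(logits, target)
```

Plugging `bradley_terry_prob` together with binary cross-entropy into the generic `ranking_loss` sketch above (or using `pairwise_cross_entropy` directly) would give a concrete training objective for $r_\theta$.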