Inverse Reinforcement Learning is essentially the problem of inferring a reward function from demonstrations

CS 285: Lecture 20, Inverse Reinforcement Learning, Part 1

Initialize some policy $\pi$ and reward function $r$


If the reward is a linear combination of state features, the objective $J(\pi)$ reduces to a weighted sum of the expected feature counts under the policy.
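
Concretely, assuming hand-designed state features $f(s)$ with weights $\psi$:

$$
r_\psi(s) = \psi^\top f(s), \qquad J(\pi) = \mathbb{E}_\pi\Big[\textstyle\sum_t r_\psi(s_t)\Big] = \psi^\top\, \mathbb{E}_\pi\Big[\textstyle\sum_t f(s_t)\Big]
$$

so the objective depends on the policy only through its expected feature counts.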


We can then learn the parameters based on how different the expected feature counts are. We want the expected features under our policy to match those of the demonstrations, so we update the weights using the difference in expected features.
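
A minimal sketch of this feature-matching loop, assuming hypothetical helpers `solve_policy` (a forward RL solver) and `expected_feature_counts` (e.g. a Monte Carlo estimate from rollouts); the full algorithm (e.g. apprenticeship learning) adds a max-margin or projection step on top of this:

```python
import numpy as np

def feature_matching_irl(expert_features, mdp, feature_fn, n_iters=100, lr=0.1):
    """Toy feature-matching IRL: push the policy's expected features toward the expert's."""
    psi = np.zeros_like(expert_features)  # reward weights, r_psi(s) = psi . f(s)
    for _ in range(n_iters):
        # 1. Solve the forward RL problem for the current reward r_psi.
        policy = solve_policy(mdp, reward_fn=lambda s: psi @ feature_fn(s))  # hypothetical helper
        # 2. Estimate expected feature counts under that policy.
        policy_features = expected_feature_counts(mdp, policy, feature_fn)   # hypothetical helper
        # 3. Update the weights with the difference in expected features.
        psi += lr * (expert_features - policy_features)
    return psi
```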


We can assume that expert demonstrations have high reward; therefore, the probability of an expert trajectory is taken to be proportional to the exponential of its reward, then normalized over all possible trajectories to get a valid probability (the term over all possibilities can be rewritten in terms of the state visitation distribution $p(s \mid \psi)$ and $r_\psi(s)$).
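
Written out, this is a standard maximum-entropy formulation, with $\tau$ a trajectory and $Z$ the partition function over all possible trajectories:

$$
p(\tau \mid \psi) = \frac{1}{Z} \exp\big(r_\psi(\tau)\big), \qquad Z = \int \exp\big(r_\psi(\tau)\big)\, d\tau
$$

and $\psi$ is fit by maximizing the log-likelihood of the demonstrations, $\max_\psi \frac{1}{N}\sum_i r_\psi(\tau_i) - \log Z$.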


Feature matching - max margin


Find a reward function that maximizes the margin by which the expert's actions outperform all other actions (or policies).

The behavior it induces will still be similar to the expert's actions, since we want to match as many of the expert's features as possible.
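
One way to write this down, using the feature expectations from above and the standard max-margin formulation (with a norm constraint so the margin cannot be inflated just by rescaling $\psi$):

$$
\max_{\psi,\, m}\ m \quad \text{s.t.} \quad \psi^\top \mathbb{E}_{\pi^\star}\big[f(s)\big] \;\ge\; \max_{\pi \in \Pi}\ \psi^\top \mathbb{E}_{\pi}\big[f(s)\big] + m, \qquad \|\psi\|_2 \le 1
$$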

Learning optimality

Since expert/human actions are not entirely optimal, we should also learn how optimal the actions are, in order to keep the learned reward function correct.
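
One common way to capture this (an assumption here, not spelled out in the notes): model the expert as soft-optimal, so suboptimal trajectories still get probability mass that decays with how much reward they give up,

$$
p(\tau) \propto \exp\big(\beta\, r(\tau)\big),
$$

where the temperature $\beta$ encodes how close to optimal the demonstrator is assumed to be; this is the exponential model above with a rationality parameter added.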
