Inverse Reinforcement Learning is essentially trying to infer the reward function from demonstrations
CS 285: Lecture 20, Inverse Reinforcement Learning, Part 1
Initialize some policy $\pi$ and reward function $r$
We can express the reward as a linear combination of state features, $r_\psi(s) = \psi^\top f(s)$, so the objective $J(\pi)$ is linear in the expected features under the policy.
We can then learn the parameters $\psi$ based on how different the expected features are between our policy and the expert. We want the expected features under the learned policy to match those of the demonstrations, i.e. $E_{\pi^{r_\psi}}[f(s)] = E_{\pi^\star}[f(s)]$, so we update the weights using the difference in the expected features.
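A minimal sketch of this update, assuming the expert and policy feature expectations have already been estimated from demonstrations and rollouts (the learning rate and the toy feature vectors below are illustrative assumptions, not from the lecture):

```python
import numpy as np

def feature_matching_update(psi, expert_features, policy_features, lr=0.1):
    """One step of the feature-matching weight update.

    psi:             current reward weights, shape (d,)
    expert_features: estimate of E_{pi*}[f(s)] from demonstrations
    policy_features: estimate of E_{pi_psi}[f(s)] from rollouts of the
                     current policy under r_psi(s) = psi^T f(s)
    """
    # Push the reward weights toward features the expert visits more
    # often than our current policy does.
    grad = expert_features - policy_features
    return psi + lr * grad

# Toy usage with 3 state features
psi = np.zeros(3)
expert_f = np.array([0.8, 0.1, 0.1])   # averaged feature counts from demos
policy_f = np.array([0.3, 0.4, 0.3])   # averaged feature counts from rollouts
psi = feature_matching_update(psi, expert_f, policy_f)
```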
We can assume that expert demonstrations achieve high reward; therefore the probability of an expert trajectory is proportional to the exponential of its reward, and we normalize by the scores of all possible trajectories to get a probability. The gradient of this normalizer can be rewritten in terms of the state marginal, $\int p(s \mid \psi)\, \nabla_\psi r_\psi(s)\, ds$.
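Written out in the standard maximum-entropy IRL form (here $\tau_i$ are the demonstrated trajectories, $Z$ is the partition function, and $\mathcal{L}$ is the log-likelihood of the demonstrations):

$$
p(\tau \mid \psi) = \frac{1}{Z} \exp\big(r_\psi(\tau)\big), \qquad Z = \int \exp\big(r_\psi(\tau)\big)\, d\tau
$$

$$
\nabla_\psi \mathcal{L} = \frac{1}{N}\sum_{i=1}^{N} \nabla_\psi r_\psi(\tau_i) \;-\; E_{\tau \sim p(\tau \mid \psi)}\big[\nabla_\psi r_\psi(\tau)\big]
$$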
Feature matching - max margin
Find a reward function that maximizes the margin between the expert's actions and all other (incorrect) actions, so the expert scores clearly higher under the learned reward
The resulting policy will still be similar to the expert's actions, since we want to match as many of the expert's features as possible
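One common SVM-style way to write this (the exact margin term $D(\pi, \pi^\star)$, e.g. a difference in feature expectations, is an assumed choice here):

$$
\min_\psi \; \frac{1}{2}\|\psi\|^2 \quad \text{s.t.} \quad \psi^\top E_{\pi^\star}[f(s)] \;\geq\; \max_{\pi \in \Pi} \; \psi^\top E_{\pi}[f(s)] + D(\pi, \pi^\star)
$$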
Learning optimality
Since expert/human actions are not perfectly optimal, we should also learn how optimal the actions are, so that the learned reward function remains correct despite suboptimal demonstrations
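A minimal sketch of one way to model this suboptimality, assuming a Boltzmann (soft-optimal) action model with a rationality parameter $\beta$; the Q-values and $\beta$ below are illustrative assumptions, not values from the lecture:

```python
import numpy as np

def action_probabilities(q_values, beta=1.0):
    """Soft-optimal (Boltzmann) model of a not-quite-optimal expert.

    q_values: array of Q(s, a) for each action in the current state
    beta:     rationality parameter; large beta approaches a fully
              optimal expert, beta near 0 gives nearly random actions
    """
    # Numerically stable softmax over actions, scaled by beta.
    z = beta * (q_values - np.max(q_values))
    p = np.exp(z)
    return p / p.sum()

# Usage: the expert mostly, but not always, picks the best action.
print(action_probabilities(np.array([1.0, 0.5, -0.2]), beta=2.0))
```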