One key problem: the distribution of training environments can easily contain errors.

Two proposed (combined) ideas:

Unsupervised Environment Design (UED) - tunable environments that provide valid, solvable environments

PAIRED - Protagonist Antagonist Induced Regret Environment Design

PAIRED

Regret

Assume we are given a fixed environment with parameters $\theta$ and a fixed policy for the protagonist agent, $\pi^P$, and we then train a second antagonist agent, $\pi^A$, to optimality in this environment. Then the difference between the reward obtained by the antagonist, $U^\theta(\pi^A)$, and by the protagonist, $U^\theta(\pi^P)$, is the regret:

$$\text{Regret}^{\theta}(\pi^P, \pi^A) = U^{\theta}(\pi^A) - U^{\theta}(\pi^P)$$
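A minimal sketch (not the paper's code) of how this regret might be estimated from sampled rollouts; the function name and the return values are illustrative, and the max-vs-mean combination follows PAIRED's flexible regret estimate:

```python
import numpy as np

def estimate_regret(antagonist_returns, protagonist_returns):
    """Regret estimate for one environment theta: the antagonist's best
    observed return minus the protagonist's average return."""
    return np.max(antagonist_returns) - np.mean(protagonist_returns)

# Illustrative episode returns from a few rollouts in the same environment.
antagonist_returns = [0.9, 0.7, 0.8]
protagonist_returns = [0.2, 0.4, 0.3]
print(estimate_regret(antagonist_returns, protagonist_returns))  # ~0.6
```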

The adversary's and the antagonist's goal is to maximize regret, while the protagonist tries to minimize regret:

$$\pi^P = \arg\min_{\pi^P} \; \max_{\theta,\, \pi^A} \; \text{Regret}^{\theta}(\pi^P, \pi^A)$$
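A rough Python sketch of one training iteration under this objective; the stub agents, the random rollout returns, and the `propose`/`update(reward=...)` interface are placeholders assumed for illustration, not the paper's implementation:

```python
import random

class StubAgent:
    """Placeholder policy; in PAIRED these are RL-trained networks (e.g. PPO)."""
    def propose(self):
        return random.random()  # adversary's environment parameters (toy: a scalar)
    def update(self, reward):
        pass  # a policy-gradient step on `reward` would go here

def rollout_return(theta, agent):
    """Toy stand-in for an episode return in the environment with parameters theta."""
    return random.random()

def paired_iteration(adversary, protagonist, antagonist, n_rollouts=4):
    theta = adversary.propose()  # adversary proposes an environment

    protagonist_returns = [rollout_return(theta, protagonist) for _ in range(n_rollouts)]
    antagonist_returns = [rollout_return(theta, antagonist) for _ in range(n_rollouts)]

    # Regret estimate: antagonist's best return minus protagonist's mean return.
    regret = max(antagonist_returns) - sum(protagonist_returns) / n_rollouts

    protagonist.update(reward=-regret)  # protagonist minimizes regret
    antagonist.update(reward=+regret)   # antagonist maximizes regret
    adversary.update(reward=+regret)    # adversary maximizes regret
    return regret

adversary, protagonist, antagonist = StubAgent(), StubAgent(), StubAgent()
print(paired_iteration(adversary, protagonist, antagonist))
```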

In the bookshelf problem, environment parameters can be