One key problem:
- Unsupervised RL is extremely fragile, especially with respect to environment design: automatic curriculum methods have adapted to the agent's learning progress but paid less attention to whether the generated environments are actually solvable
Two proposed (combined) ideas:
Distribution of environments can easily contain errors
- unsolvable states
- Too little domain randomization, or so much that training fails to converge
- minimax adversarial tuning often produces distributions that are impossible for the agent to solve
- (this is because the minimax adversary's policy is defined by $\arg\min_\theta U^{\theta}(\pi)$, where $U^{\theta}(\pi)$ is the agent's cumulative discounted reward under environment parameters $\theta$, so the adversary is rewarded simply for making the agent fail; the two objectives are compared right after this list)
Unsupervised Environment Design (UED) - use tunable environment parameters to generate valid, solvable environments
PAIRED - Protagonist Antagonist Induced Regret Environment Design
- train the adversary on the difference between the reward of the antagonist and that of the protagonist (the agent)
- aiming for
- Feasible environments where the antagonist can get high reward while the protagonist gets low reward
- since both agents are learning, the adversary is encouraged to create environments that are solvable but difficult
- At a Nash equilibrium of this three-player game, the adversary and antagonist maximize regret while the protagonist minimizes it, so the protagonist ends up following a minimax-regret policy
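One way to write the two adversary objectives side by side, using $\pi^P$ for the protagonist and $\pi^A$ for the antagonist (formalized in the Regret section below):

$$\theta^{*}_{\text{minimax}} = \arg\min_{\theta}\, U^{\theta}(\pi^P) \qquad \text{vs.} \qquad \theta^{*}_{\text{PAIRED}} = \arg\max_{\theta}\, \big[\, U^{\theta}(\pi^A) - U^{\theta}(\pi^P) \,\big]$$

The minimax objective can be driven down simply by making the environment unsolvable, whereas the PAIRED objective is only large when the antagonist can still obtain high reward, i.e. when the environment remains solvable.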
PAIRED
Regret
- the difference between the payoff obtained under the current policy and the optimal payoff that could have been obtained in the same setting with a different decision
Assume we are given a fixed environment with parameters $\theta$ and a fixed policy for the protagonist agent, $\pi^P$, and we then train a second antagonist agent, $\pi^A$, to optimality in this environment. Then the difference between the reward obtained by the antagonist, $U^{\theta}(\pi^A)$, and the protagonist, $U^{\theta}(\pi^P)$, is the regret:

$$\text{Regret}^{\theta}(\pi^P, \pi^A) = U^{\theta}(\pi^A) - U^{\theta}(\pi^P)$$
The adversary's and the antagonist's goal is to maximize regret,
while the protagonist tries to minimize it.
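Below is a minimal sketch of how a regret estimate turns into per-player training signals, using a toy one-dimensional environment and a hypothetical `rollout_return` helper. It assumes regret is estimated as the antagonist's best rollout return minus the protagonist's average return; it illustrates the objective, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout_return(policy_mean: float, theta: float) -> float:
    """Toy stand-in for U^theta(pi): one noisy episode return that is
    higher the closer the agent's action is to the environment parameter."""
    action = policy_mean + rng.normal(scale=0.1)
    return -abs(action - theta)

def paired_regret_signals(theta: float,
                          protagonist_mean: float,
                          antagonist_mean: float,
                          n_episodes: int = 8) -> dict:
    """Estimate regret for a fixed environment theta and split it into the
    per-player training signals: adversary and antagonist receive +regret,
    the protagonist receives -regret (i.e. it just maximizes its own return)."""
    ant_returns = [rollout_return(antagonist_mean, theta) for _ in range(n_episodes)]
    pro_returns = [rollout_return(protagonist_mean, theta) for _ in range(n_episodes)]
    regret = max(ant_returns) - float(np.mean(pro_returns))
    return {
        "regret": regret,
        "adversary_reward": regret,     # maximize regret
        "antagonist_reward": regret,    # maximize regret
        "protagonist_reward": -regret,  # minimize regret
    }

# An environment the antagonist handles well but the protagonist does not -> high regret.
print(paired_regret_signals(theta=0.7, protagonist_mean=0.2, antagonist_mean=0.7))
```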

In the bookshelf problem, the environment parameters can include (see the sketch after this list):
- number of objects in bin
- positions of objects (front/mid/back)
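A minimal sketch of what such an environment parameterization $\theta$ could look like in code; the class and field names are hypothetical, and in PAIRED these values would be produced by the learned adversary rather than sampled uniformly at random.

```python
import random
from dataclasses import dataclass
from typing import List

@dataclass
class BookshelfEnvParams:
    """Hypothetical theta for the bookshelf problem."""
    num_objects: int       # number of objects in the bin
    positions: List[str]   # "front" / "mid" / "back", one entry per object

def sample_env_params(max_objects: int = 5) -> BookshelfEnvParams:
    """Uniform random sampling of theta (a domain-randomization baseline);
    the PAIRED adversary would instead choose these values to maximize regret."""
    n = random.randint(1, max_objects)
    positions = [random.choice(["front", "mid", "back"]) for _ in range(n)]
    return BookshelfEnvParams(num_objects=n, positions=positions)

print(sample_env_params())
```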