One key problem:
- Unsupervised RL is extremely fragile, especially with respect to environment design: automatic curriculum methods have adapted to the agent's learning progress but paid less attention to whether the generated environments are actually solvable
Two proposed (combined) ideas:
Distribution of environments can easily contain errors
- unsolvable states
- Too little domain randomization, or so much that training fails to converge
- minimax adversarial tuning often produces distributions that are impossible for the agent to solve
- (this is because the minimax adversary's policy is defined by $\arg\min_\theta U^{\theta}(\pi)$, where $U^{\theta}(\pi)$ is the agent's cumulative discounted reward under environment parameters $\theta$, so the adversary is rewarded simply for making the agent fail; the two objectives are compared right after this list)
Unsupervised Environment Design (UED) - use tunable environment parameters to generate valid, solvable environments
PAIRED - Protagonist Antagonist Induced Regret Environment Design
- train the adversary on the difference between the reward of the antagonist and that of the protagonist (the agent)
- aiming for
- Feasible environments where the antagonist can get high reward while the protagonist gets low reward
- since both agents are learning, the adversary is encouraged to create environments that are solvable but difficult
- At a Nash equilibrium of this three-player game, the adversary and antagonist maximize regret while the protagonist minimizes it, so the protagonist ends up following a minimax-regret policy
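One way to write the two adversary objectives side by side, using $\pi^P$ for the protagonist and $\pi^A$ for the antagonist (formalized in the Regret section below):

$$\theta^{*}_{\text{minimax}} = \arg\min_{\theta}\, U^{\theta}(\pi^P) \qquad \text{vs.} \qquad \theta^{*}_{\text{PAIRED}} = \arg\max_{\theta}\, \big[\, U^{\theta}(\pi^A) - U^{\theta}(\pi^P) \,\big]$$

The minimax objective can be driven down simply by making the environment unsolvable, whereas the PAIRED objective is only large when the antagonist can still obtain high reward, i.e. when the environment remains solvable.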
PAIRED
Regret
- the difference between the payoff obtained under the current policy and the optimal payoff that could have been obtained in the same setting with a different decision
Assume we are given a fixed environment with parameters $\theta$ and a fixed policy for the protagonist agent, $\pi^P$, and we then train a second antagonist agent, $\pi^A$, to optimality in this environment. Then the difference between the reward obtained by the antagonist, $U^{\theta}(\pi^A)$, and the protagonist, $U^{\theta}(\pi^P)$, is the regret:

$$\text{Regret}^{\theta}(\pi^P, \pi^A) = U^{\theta}(\pi^A) - U^{\theta}(\pi^P)$$
The adversary's and the antagonist's goal is to maximize regret,
while the protagonist tries to minimize it.
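Below is a minimal sketch of how a regret estimate turns into per-player training signals, using a toy one-dimensional environment and a hypothetical `rollout_return` helper. It assumes regret is estimated as the antagonist's best rollout return minus the protagonist's average return; it illustrates the objective, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout_return(policy_mean: float, theta: float) -> float:
    """Toy stand-in for U^theta(pi): one noisy episode return that is
    higher the closer the agent's action is to the environment parameter."""
    action = policy_mean + rng.normal(scale=0.1)
    return -abs(action - theta)

def paired_regret_signals(theta: float,
                          protagonist_mean: float,
                          antagonist_mean: float,
                          n_episodes: int = 8) -> dict:
    """Estimate regret for a fixed environment theta and split it into the
    per-player training signals: adversary and antagonist receive +regret,
    the protagonist receives -regret (i.e. it just maximizes its own return)."""
    ant_returns = [rollout_return(antagonist_mean, theta) for _ in range(n_episodes)]
    pro_returns = [rollout_return(protagonist_mean, theta) for _ in range(n_episodes)]
    regret = max(ant_returns) - float(np.mean(pro_returns))
    return {
        "regret": regret,
        "adversary_reward": regret,     # maximize regret
        "antagonist_reward": regret,    # maximize regret
        "protagonist_reward": -regret,  # minimize regret
    }

# An environment the antagonist handles well but the protagonist does not -> high regret.
print(paired_regret_signals(theta=0.7, protagonist_mean=0.2, antagonist_mean=0.7))
```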

In the bookshelf problem, the environment parameters can include (see the sketch after this list):
- number of objects in bin
- positions of objects (front/mid/back)
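A minimal sketch of what such an environment parameterization $\theta$ could look like in code; the class and field names are hypothetical, and in PAIRED these values would be produced by the learned adversary rather than sampled uniformly at random.

```python
import random
from dataclasses import dataclass
from typing import List

@dataclass
class BookshelfEnvParams:
    """Hypothetical theta for the bookshelf problem."""
    num_objects: int       # number of objects in the bin
    positions: List[str]   # "front" / "mid" / "back", one entry per object

def sample_env_params(max_objects: int = 5) -> BookshelfEnvParams:
    """Uniform random sampling of theta (a domain-randomization baseline);
    the PAIRED adversary would instead choose these values to maximize regret."""
    n = random.randint(1, max_objects)
    positions = [random.choice(["front", "mid", "back"]) for _ in range(n)]
    return BookshelfEnvParams(num_objects=n, positions=positions)

print(sample_env_params())
```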