In model-free RL, the optimal action is based on a learned policy.
We do not attempt to learn the system dynamics; we just let the policy take in the observations from the sensors. We learn only $\pi_\theta$, the optimal policy, in order to get the best actions, and we assume that the system dynamics will be learned implicitly.
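As a rough illustration (not from the notes), a model-free agent maps sensor observations directly to actions through $\pi_\theta$; all learning happens in $\theta$ and no dynamics model is stored. The class and dimensions below are hypothetical, a sketch of such a policy rather than any particular implementation.

```python
# Minimal model-free policy sketch: observations in, actions out, no dynamics model.
import numpy as np

class SoftmaxPolicy:
    """Linear-softmax policy pi_theta(a | s) over a discrete action set."""

    def __init__(self, obs_dim, n_actions, seed=0):
        self.rng = np.random.default_rng(seed)
        # theta holds every learnable parameter; nothing about the dynamics is kept.
        self.theta = 0.01 * self.rng.standard_normal((obs_dim, n_actions))

    def act(self, obs):
        logits = obs @ self.theta
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        # Sample an action; a policy-gradient method would update theta so that
        # high-reward actions become more likely.
        return self.rng.choice(len(probs), p=probs)

policy = SoftmaxPolicy(obs_dim=4, n_actions=2)
action = policy.act(np.array([0.1, -0.2, 0.05, 0.0]))
```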
In model-based learning, the optimal action is based on known or unknown (and learned) system dynamics.
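As a minimal sketch of what a learned dynamics model can look like (not from the notes), the snippet below fits a linear model $s_{t+1} \approx A s_t + B a_t$ to logged transitions by ordinary least squares; the function name and the synthetic data are illustrative.

```python
# Minimal sketch (illustrative): fit linear dynamics s_{t+1} ~ A s_t + B a_t
# from logged transitions with least squares.
import numpy as np

def fit_linear_dynamics(states, actions, next_states):
    """Solve next_states ~ [states actions] @ W, then split W into A and B."""
    X = np.hstack([states, actions])                  # (N, state_dim + action_dim)
    W, *_ = np.linalg.lstsq(X, next_states, rcond=None)
    A = W[:states.shape[1]].T                         # dynamics w.r.t. the state
    B = W[states.shape[1]:].T                         # dynamics w.r.t. the action
    return A, B

# Synthetic transitions just to make the example runnable.
rng = np.random.default_rng(0)
S = rng.standard_normal((100, 3))
U = rng.standard_normal((100, 1))
S_next = S @ np.diag([0.9, 0.8, 0.7]) + U @ np.array([[0.1, 0.0, 0.2]])
A, B = fit_linear_dynamics(S, U, S_next)
# A planner (MCTS, LQR, ...) can now search for good actions through this model.
```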
A generic Monte Carlo Tree Search (MCTS) procedure repeats three steps (a code sketch follows the list):
1. $s_{next}$ = TreePolicy($s_1$), where $s_{next}$ is the next leaf.
2. Evaluate the value of $s_{next}$ with DefaultPolicy($s_{next}$).
3. Update all values in the tree between $s_1$ and $s_{next}$.
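A compact sketch of those three steps, assuming a generic simulator interface (`step`, `actions`, `reward`), UCB1 for TreePolicy, and random rollouts for DefaultPolicy; all names are illustrative rather than from the notes.

```python
# UCT-style MCTS sketch: TreePolicy (select/expand), DefaultPolicy (rollout), backup.
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}          # action -> child Node
        self.visits, self.value = 0, 0.0

def tree_policy(node, actions, step, c=1.4):
    """Descend with UCB1 until an untried action is found; expand that leaf."""
    while True:
        untried = [a for a in actions(node.state) if a not in node.children]
        if untried:
            a = random.choice(untried)
            child = Node(step(node.state, a), parent=node)
            node.children[a] = child
            return child
        node = max(node.children.values(),
                   key=lambda n: n.value / n.visits
                   + c * math.sqrt(math.log(node.visits) / n.visits))

def default_policy(state, actions, step, reward, horizon=20):
    """Evaluate a leaf with a random rollout of bounded length."""
    total = 0.0
    for _ in range(horizon):
        a = random.choice(actions(state))
        state = step(state, a)
        total += reward(state)
    return total

def backup(node, value):
    """Update every node on the path between the leaf and the root s_1."""
    while node is not None:
        node.visits += 1
        node.value += value
        node = node.parent

def mcts(s1, actions, step, reward, n_iters=200):
    root = Node(s1)
    root.visits = 1
    for _ in range(n_iters):
        leaf = tree_policy(root, actions, step)                     # step 1
        value = default_policy(leaf.state, actions, step, reward)   # step 2
        backup(leaf, value)                                         # step 3
    # act from s_1 by picking the most-visited child
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```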
RL: LQR & iLQR (Linear Quadratic Regulator)
LQR uses the second derivative of the cost function, i.e., a quadratic approximation of the cost.
Backward pass: optimize the cost ($Q$) at each timestep, working backwards until we reach the initial state; at each timestep we find the optimal action.
Forward pass: using the calculated matrices, we find the initial optimal action to take and propagate forward from the initial state until the target state (see the figure on the right); at each timestep we find the next state.
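A minimal sketch of the backward and forward passes above, assuming known linear dynamics $x_{t+1} = A x_t + B u_t$ and quadratic cost $x^\top Q x + u^\top R u$ (the plain LQR setting); the function names and the double-integrator example are illustrative. iLQR would re-linearize the dynamics and re-quadratize the cost around the current trajectory and repeat both passes until convergence.

```python
# LQR sketch: backward Riccati pass for the gains, forward pass for actions/states.
import numpy as np

def lqr_backward(A, B, Q, R, T):
    """Backward pass: from the final timestep back to the first, compute the
    cost-to-go matrix V_t and the feedback gain K_t at every timestep."""
    V = Q.copy()
    gains = []
    for _ in range(T):
        K = np.linalg.solve(R + B.T @ V @ B, B.T @ V @ A)  # gain of the cost-minimizing action
        V = Q + A.T @ V @ (A - B @ K)                      # cost-to-go one step earlier
        gains.append(K)
    return gains[::-1]                                     # gains[t] is K_t

def lqr_forward(A, B, gains, x0):
    """Forward pass: start at the initial state, apply u_t = -K_t x_t,
    and roll the dynamics forward to get the next state at each timestep."""
    xs, us = [x0], []
    for K in gains:
        u = -K @ xs[-1]                  # optimal action at this timestep
        us.append(u)
        xs.append(A @ xs[-1] + B @ u)    # next state
    return xs, us

# Illustrative double integrator regulated toward the origin.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), 0.1 * np.eye(1)
gains = lqr_backward(A, B, Q, R, T=20)
xs, us = lqr_forward(A, B, gains, x0=np.array([5.0, 0.0]))
```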