Reinforcement learning works on the idea that robots can execute trajectories from a policy and collect experience, which is labelled by some heuristic called a reward function. We then optimise the policy so as to improve these trajectories, with the overall goal of maximizing the expected return.
Above, we have essentially described the data collection and improvement loop: the robot executes actions based on the current state, and each transition is then labelled with a reward.
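As a rough sketch, the data-collection step might look like the following. The `env` and `policy` objects here are placeholders (a gym-style `reset()`/`step()` interface is assumed purely for illustration, not something defined in this post):

```python
# A minimal sketch of collecting one trajectory of experience.
# `env` and `policy` are placeholders: any environment exposing
# reset()/step() and any function mapping states to actions will do.
def collect_trajectory(env, policy, max_steps=200):
    trajectory = []                                   # list of (state, action, reward)
    state = env.reset()
    for _ in range(max_steps):
        action = policy(state)                        # policy picks an action from the state
        next_state, reward, done = env.step(action)   # the reward labels this transition
        trajectory.append((state, action, reward))
        if done:
            break
        state = next_state
    return trajectory
```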
This evolution of states is modelled as a Markov Decision Process (MDP). The essence of an MDP is the Markov property: the next state depends only on the present state and action, not on the past.
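To make the Markov property concrete, here is a toy transition model written as a plain dictionary. The state and action names are made up for illustration; the point is that the next-state distribution is keyed only by the current state and action, never by the history:

```python
import random

# Toy transition model: p(s' | s, a) as a dictionary keyed by (state, action).
P = {
    ("at_door", "push"): {"in_room": 0.9, "at_door": 0.1},
    ("at_door", "wait"): {"at_door": 1.0},
    ("in_room", "push"): {"in_room": 1.0},
}

def sample_next_state(state, action):
    """Sample s' from p(s' | s, a); the history is never consulted."""
    next_states = list(P[(state, action)].keys())
    probs = list(P[(state, action)].values())
    return random.choices(next_states, weights=probs, k=1)[0]
```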
There are two main value functions in RL: the state-value function $V(s)$ and the action-value function $Q(s, a)$. We use them to evaluate the value of the current state and the value of a state-action pair respectively.
$V(s) = \mathbb{E}\{R_t|s_t = s\} = \mathbb{E}\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1}|s_t = s\}$
$Q(s, a) = \mathbb{E}\{R_t|s_t = s, a_t = a\} = \mathbb{E}\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1}|s_t = s, a_t = a\}$
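The quantity inside both expectations is the discounted return $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$. A quick sketch of computing it from a list of observed rewards (the rewards below are invented), along with a simple Monte Carlo estimate of $V(s)$ as the average return over sampled trajectories:

```python
# Discounted return R_t = sum_k gamma^k * r_{t+k+1}, computed from the
# rewards observed after time t.
def discounted_return(rewards, gamma=0.99):
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# V(s) is the expectation of this return over trajectories starting in s,
# so a simple Monte Carlo estimate is just the average of sampled returns.
def monte_carlo_value(sampled_reward_sequences, gamma=0.99):
    returns = [discounted_return(rs, gamma) for rs in sampled_reward_sequences]
    return sum(returns) / len(returns)

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.0 + 0.81*2.0 = 2.62
```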
The Bellman Equation allows us to express the expected return recursively: the value of a state is written in terms of the values of its successor states $V(s')$. Our policy can take advantage of this recursion to evaluate states and choose the best actions.
$V(s) = \sum_a\pi(a|s) \sum_{s'}p(s' | s, a) [r + \gamma V(s')]$
Note that $p(s'|s, a)$ is our motion model.
The first summation sums over all possible actions our policy can take given the current state.
The second summation is over our motion model. Recall that the motion model gives the probability of each state we can end up in given the current state-action pair.
To sum up, the Bellman Equation for the value function gives the total expected reward obtainable from the current state, averaged over all the possible actions and next states we could travel to.
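Putting the pieces together, here is a sketch of repeatedly applying the Bellman Equation for $V(s)$ under a fixed policy on a tiny tabular MDP (the states, actions, probabilities, and rewards below are invented purely for illustration):

```python
# One sweep of the Bellman equation for V under a fixed policy pi:
#   V(s) = sum_a pi(a|s) * sum_{s'} p(s'|s,a) * (r + gamma * V(s'))
# Repeating this sweep until V stops changing is iterative policy evaluation.
gamma = 0.9
states = ["s0", "s1"]

pi = {  # pi(a|s): action probabilities for each state
    "s0": {"left": 0.5, "right": 0.5},
    "s1": {"left": 1.0},
}
model = {  # p(s'|s,a) together with the reward for each transition
    ("s0", "left"):  [("s0", 1.0, 0.0)],                    # (next_state, prob, reward)
    ("s0", "right"): [("s1", 0.8, 1.0), ("s0", 0.2, 0.0)],
    ("s1", "left"):  [("s1", 1.0, 2.0)],
}

def bellman_backup(V):
    new_V = {}
    for s in states:
        total = 0.0
        for a, p_a in pi[s].items():                # first sum: over actions
            for s_next, p_sn, r in model[(s, a)]:   # second sum: over next states
                total += p_a * p_sn * (r + gamma * V[s_next])
        new_V[s] = total
    return new_V

V = {s: 0.0 for s in states}
for _ in range(100):    # repeat until (approximately) converged
    V = bellman_backup(V)
print(V)                # V("s1") approaches 2 / (1 - 0.9) = 20
```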
There are many variants of RL algorithms; we will go through value and policy iteration, Q-Learning, and SARSA.