Monte Carlo (MC) Recap
In a typical Monte Carlo algorithm, the complete episode is used to compute the empirical return $G_t = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}$, so all episodes must eventually terminate.
Empirical mean return (value function) for state $s$:
$V(s) = \frac{\sum_t \mathbb{1}[S_t = s]\, G_t}{\sum_t \mathbb{1}[S_t = s]}$
Q function for a state-action pair $(s, a)$:
$Q(s, a) = \frac{\sum_t \mathbb{1}[S_t = s, A_t = a]\, G_t}{\sum_t \mathbb{1}[S_t = s, A_t = a]}$
where $\mathbb{1}$ is the binary indicator function, so only the returns observed from the corresponding state (or state-action pair) contribute to the mean. A minimal sketch of this idea appears below.
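A minimal sketch of every-visit Monte Carlo evaluation, assuming episodes are supplied as lists of `(state, action, reward)` tuples and a discount factor `gamma`; the data layout and defaults here are illustrative assumptions, not taken from the notes.

```python
from collections import defaultdict

def mc_evaluate(episodes, gamma=0.99):
    """Every-visit Monte Carlo estimates of V(s) and Q(s, a).

    `episodes` is assumed to be a list of finished episodes, each a list of
    (state, action, reward) tuples, where reward is R_{t+1} received after (S_t, A_t).
    """
    v_sum, v_count = defaultdict(float), defaultdict(int)
    q_sum, q_count = defaultdict(float), defaultdict(int)

    for episode in episodes:
        G = 0.0
        # Walk backwards so that G_t = R_{t+1} + gamma * G_{t+1}
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            v_sum[state] += G
            v_count[state] += 1
            q_sum[(state, action)] += G
            q_count[(state, action)] += 1

    # Empirical mean return per state and per state-action pair
    V = {s: v_sum[s] / v_count[s] for s in v_sum}
    Q = {sa: q_sum[sa] / q_count[sa] for sa in q_sum}
    return V, Q
```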
Temporal Difference
In contrast to MC, complete episodes are not required: we do not need to wait for the episode to terminate before updating.
Given the current state $S_t$, the reward $R_{t+1}$, and the next state $S_{t+1}$, we can update the estimated value function by bootstrapping from $V(S_{t+1})$. Starting from the incremental (running-mean) MC update with step size $\alpha$:
$V(S_t) \leftarrow (1-\alpha)V(S_t) + \alpha G_t$
$V(S_t) \leftarrow V(S_t) + \alpha (G_t - V(S_t))$
Replacing the full return $G_t$ with the bootstrapped TD target $R_{t+1} + \gamma V(S_{t+1})$ gives the TD(0) update:
$V(S_t) \leftarrow V(S_t) + \alpha (R_{t+1} + \gamma V(S_{t+1}) - V(S_t))$
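A minimal sketch of one TD(0) step, assuming a tabular value function stored in a dict; the function name, defaults, and the `done` flag for terminal transitions are illustrative assumptions.

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.99, done=False):
    """One TD(0) step: V(S_t) <- V(S_t) + alpha * (R_{t+1} + gamma * V(S_{t+1}) - V(S_t)).

    `V` maps states to value estimates; a terminal next state contributes
    zero bootstrap value.
    """
    target = reward + (0.0 if done else gamma * V.get(next_state, 0.0))
    td_error = target - V.get(state, 0.0)
    V[state] = V.get(state, 0.0) + alpha * td_error
    return td_error
```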
Off-Policy TD: Q-Learning
Main observation: Q-Learning selects actions ε-greedily (the behaviour policy is ε-greedy).
However, the TD target is estimated greedily: it bootstraps from $\max_a Q(S_{t+1}, a)$, so the target policy differs from the behaviour policy (off-policy).
In SARSA, the target is also estimated ε-greedily (no $\max_a$; it uses the action actually taken at the next step, i.e. $Q(S_{t+1}, A_{t+1})$), and actions are selected ε-greedily, so behaviour and target policies coincide (on-policy). A sketch of the two updates follows.
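A side-by-side sketch of the two tabular updates, assuming `Q` is a dict keyed by `(state, action)` and that the caller chooses the ε-greedy actions; function names, the `actions` argument, and defaults are illustrative assumptions.

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99, done=False):
    """Off-policy: the target bootstraps from the greedy action max_a' Q(s', a')."""
    best_next = 0.0 if done else max(Q.get((s_next, a_next), 0.0) for a_next in actions)
    target = r + gamma * best_next
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99, done=False):
    """On-policy: the target bootstraps from the ε-greedy action actually taken next."""
    next_q = 0.0 if done else Q.get((s_next, a_next), 0.0)
    target = r + gamma * next_q
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
```

The only difference is the bootstrap term: `max` over next-state actions for Q-Learning versus the value of the action SARSA actually takes next.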