Temporal Difference

Monte Carlo (MC) Recap

In a typical Monte Carlo algorithm, the complete episode is used to compute the empirical return $G_t = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}$, and all episodes must eventually terminate.
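As a quick illustration, here is a minimal Python sketch (hypothetical reward list; `gamma` is the discount factor) of computing that empirical return once an episode has finished:

```python
def discounted_return(rewards, gamma=0.99):
    """Empirical return G_t = sum_k gamma^k * R_{t+k+1},
    given the rewards R_{t+1}, ..., R_T collected after time t."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Example: rewards observed after time t in a finished episode
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```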

Empirical mean return (Value function) for state s

$V(s) = \dfrac{\sum_{t} \mathbf{1}[S_t = s]\, G_t}{\sum_{t} \mathbf{1}[S_t = s]}$

Q Function (s, a)

$Q(s, a) = \dfrac{\sum_{t} \mathbf{1}[S_t = s, A_t = a]\, G_t}{\sum_{t} \mathbf{1}[S_t = s, A_t = a]}$

where $\mathbf{1}$ is the binary indicator function, so the mean is taken only over the returns observed at the corresponding state (or state-action pair).

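A minimal every-visit Monte Carlo sketch of these estimates, assuming each episode is represented as a list of `(state, reward)` pairs where `reward` is $R_{t+1}$ (the state names below are illustrative):

```python
from collections import defaultdict

def mc_value_estimate(episodes, gamma=0.99):
    """Every-visit Monte Carlo estimate of V(s): the mean of the returns G_t
    observed at every time step t where S_t == s."""
    returns_sum = defaultdict(float)
    visit_count = defaultdict(int)
    for episode in episodes:                     # episode: [(state, reward), ...]
        G = 0.0
        for state, reward in reversed(episode):  # walk backwards to accumulate G_t
            G = reward + gamma * G
            returns_sum[state] += G
            visit_count[state] += 1
    return {s: returns_sum[s] / visit_count[s] for s in returns_sum}

# Example with two tiny episodes over hypothetical states "A" and "B"
V = mc_value_estimate([[("A", 1.0), ("B", 0.0)], [("A", 0.0), ("B", 2.0)]], gamma=0.9)
```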

Temporal Difference

In contrast to Monte Carlo methods, complete episodes are not required, and we do not need to run the episode to termination before updating.

Given the current state $S_t$, the reward $R_{t+1}$, and the next state $S_{t+1}$, we can update the value estimate immediately. Starting from the incremental Monte Carlo update and replacing the full return $G_t$ with the bootstrapped target $R_{t+1} + \gamma V(S_{t+1})$:

$V(S_t) \leftarrow (1-\alpha)V(S_t) + \alpha G_t$

$V(S_t) \leftarrow V(S_t) + \alpha (G_t - V(S_t))$

$V(S_t) \leftarrow V(S_t) + \alpha \left(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right)$
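A minimal sketch of this TD(0) update, assuming a tabular value function stored in a Python dict where unseen states default to 0:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One TD(0) step: move V(s) toward the bootstrapped target r + gamma * V(s')."""
    td_target = r + gamma * V.get(s_next, 0.0)
    td_error = td_target - V.get(s, 0.0)         # the TD error
    V[s] = V.get(s, 0.0) + alpha * td_error
    return V

# A single observed transition (S_t="A", R_{t+1}=1.0, S_{t+1}="B") is enough to update
V = td0_update({}, s="A", r=1.0, s_next="B")
```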

Off-Policy TD: Q-Learning

$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left(R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t)\right)$

Main observation: Q-Learning selects its actions $\varepsilon$-greedily (the behaviour policy).

However, $Q^*$ is estimated greedily: the update target uses $\max_a Q(S_{t+1}, a)$ rather than the action actually taken, which is why Q-Learning is off-policy.

In SARSA, $Q^*$ is estimated $\varepsilon$-greedily (there is no $\max_a$; the target uses the action the policy actually takes at the next step), and actions are also selected $\varepsilon$-greedily, so SARSA is on-policy.
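A sketch contrasting the two update rules, assuming a tabular `Q` stored as a `defaultdict` keyed by `(state, action)` (all names here are illustrative):

```python
import random
from collections import defaultdict

Q = defaultdict(float)  # tabular Q; unseen (state, action) pairs default to 0.0

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Behaviour policy used by both Q-Learning and SARSA."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Off-policy: the target bootstraps from the greedy action, max_a' Q(s', a')."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: the target bootstraps from the action A_{t+1} actually taken."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```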