
What is: Expected Sarsa?

Year: 2000
Data Source: CC BY-SA - https://paperswithcode.com

Expected Sarsa is like Q-learning, but instead of taking the maximum over next state-action values, it uses the expected value of the next state's action values, weighting each action by its probability under the current policy:

$$Q\left(S_{t}, A_{t}\right) \leftarrow Q\left(S_{t}, A_{t}\right) + \alpha\left[R_{t+1} + \gamma\sum_{a}\pi\left(a \mid S_{t+1}\right)Q\left(S_{t+1}, a\right) - Q\left(S_{t}, A_{t}\right)\right]$$

Except for this change to the update rule, the algorithm otherwise follows the scheme of Q-learning. Expected Sarsa is more computationally expensive than Sarsa, since it must compute the expectation over all actions, but it eliminates the variance due to the random selection of $A_{t+1}$.
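The update rule above can be sketched as a single function. This is a minimal illustration, not a reference implementation: it assumes a tabular Q array and an epsilon-greedy policy derived from Q, and the function and parameter names are hypothetical.

```python
import numpy as np

def expected_sarsa_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9, epsilon=0.1):
    """One Expected Sarsa update on a tabular Q (shape: [n_states, n_actions]).

    Assumes the behavior/target policy is epsilon-greedy with respect to Q
    (an illustrative choice; any policy's action probabilities would do).
    """
    n_actions = Q.shape[1]
    # Action probabilities of the epsilon-greedy policy at the next state:
    # every action gets epsilon / n_actions, the greedy action gets the rest.
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(Q[s_next])] += 1.0 - epsilon
    # Expectation over next actions instead of Q-learning's max.
    expected_q = np.dot(probs, Q[s_next])
    Q[s, a] += alpha * (r + gamma * expected_q - Q[s, a])
    return Q

# Usage: a tiny 2-state, 2-action table; one update after reward r=1.
Q = np.zeros((2, 2))
Q = expected_sarsa_update(Q, s=0, a=0, r=1.0, s_next=1)
print(Q[0, 0])  # → 0.5, i.e. alpha * (r + gamma * 0 - 0)
```

Setting the policy probabilities to put all mass on the greedy action recovers Q-learning's max, which is why Expected Sarsa is often described as a generalization of it.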

Source: Sutton and Barto, Reinforcement Learning, 2nd Edition