
What is: Double Q-learning?

Source: Double Q-learning
Year: 2000
Data Source: CC BY-SA - https://paperswithcode.com

Double Q-learning is an off-policy reinforcement learning algorithm that utilises double estimation to counteract overestimation problems with traditional Q-learning.

The max operator in standard Q-learning and DQN uses the same values both to select and to evaluate an action. This makes it more likely to select overestimated values, resulting in overoptimistic value estimates. To prevent this, we can decouple the selection from the evaluation, which is the idea behind Double Q-learning. To make both roles of the weights explicit, the standard Q-learning target can first be rewritten as:
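The overestimation caused by using the same values for selection and evaluation can be seen in a small simulation. The sketch below assumes all true action values are zero, so any positive estimate of the maximum is pure bias; the number of actions and samples are illustrative choices, not from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials = 10, 10_000

# Two independent sets of noisy, zero-mean estimates of the same
# (all-zero) true action values.
noise_a = rng.normal(0.0, 1.0, (n_trials, n_actions))
noise_b = rng.normal(0.0, 1.0, (n_trials, n_actions))

# Single estimator: select AND evaluate with the same noisy values.
# E[max of noisy estimates] > 0, an upward bias.
single = noise_a.max(axis=1).mean()

# Double estimation: select with one set, evaluate with an
# independent set; the bias cancels in expectation.
a_star = noise_a.argmax(axis=1)
double = noise_b[np.arange(n_trials), a_star].mean()

print(single)  # clearly positive despite true values of zero
print(double)  # close to zero
```

The single estimator's mean is far above the true value of zero, while the decoupled estimate is approximately unbiased.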

Y^{Q}\_{t} = R\_{t+1} + \gamma Q\left(S\_{t+1}, \arg\max\_{a} Q\left(S\_{t+1}, a; \theta\_{t}\right); \theta\_{t}\right)

The Double Q-learning error can then be written as:

Y^{DoubleQ}\_{t} = R\_{t+1} + \gamma Q\left(S\_{t+1}, \arg\max\_{a} Q\left(S\_{t+1}, a; \theta\_{t}\right); \theta'\_{t}\right)

Here the selection of the action in the argmax is still due to the online weights θ_t, but we use a second set of weights θ′_t to fairly evaluate the value of this policy.
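The decoupled target above can be sketched in a tabular setting, where two value tables play the roles of θ_t and θ′_t. The table contents, sizes, and the sample transition are illustrative assumptions, not values from the source.

```python
import numpy as np

gamma, alpha = 0.9, 0.1  # discount factor and learning rate (assumed)

# Online table (used for action selection) and a second table
# (used for evaluation), mirroring theta_t and theta'_t above.
Q_online = np.array([[1.0, 2.0],
                     [0.5, 0.0]])
Q_eval = np.array([[0.0, 1.0],
                   [3.0, 4.0]])

def double_q_target(r, s_next):
    # Select the greedy action with the online weights...
    a_star = np.argmax(Q_online[s_next])
    # ...but evaluate it with the second set of weights.
    return r + gamma * Q_eval[s_next, a_star]

# One update for the transition (s=0, a=1, r=1.0, s'=1):
# argmax over Q_online[1] picks a=0, which Q_eval values at 3.0,
# so the target is 1.0 + 0.9 * 3.0 = 3.7.
target = double_q_target(r=1.0, s_next=1)
Q_online[0, 1] += alpha * (target - Q_online[0, 1])
```

In the tabular algorithm the roles of the two tables are swapped at random on each step so both keep learning; in Double DQN, θ′_t is simply the target network already present in DQN.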

Source: Deep Reinforcement Learning with Double Q-learning