
What is: Target Policy Smoothing?

Source: Addressing Function Approximation Error in Actor-Critic Methods
Year: 2018
Data Source: CC BY-SA - https://paperswithcode.com

Target Policy Smoothing is a regularization strategy for the value function in reinforcement learning. Deterministic policies can overfit to narrow peaks in the value estimate, making them highly susceptible to function approximation error and increasing the variance of the target. To reduce this variance, target policy smoothing adds a small amount of random noise to the target policy's action and averages over mini-batches, approximating a SARSA-like expectation (integral) over actions near the target action.

The modified target update is:

$$y = r + \gamma Q_{\theta'}\left(s', \pi_{\theta'}(s') + \epsilon\right)$$

$$\epsilon \sim \operatorname{clip}\left(\mathcal{N}(0, \sigma), -c, c\right)$$

where the added noise is clipped to keep the target close to the original action. The outcome is an algorithm reminiscent of Expected SARSA, where the value estimate is instead learned off-policy and the noise added to the target policy is chosen independently of the exploration policy. The value estimate is therefore learned with respect to a noisy policy defined by the parameter σ.
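
The target computation can be sketched in a few lines. The snippet below is a minimal PyTorch illustration, not the reference implementation: the network objects `actor_target` and `critic_target` are hypothetical, and the hyperparameter values (`gamma`, `sigma`, `c`) are common choices assumed for the example.

```python
import torch

def smoothed_td_target(reward, next_state, done,
                       actor_target, critic_target,
                       gamma=0.99, sigma=0.2, c=0.5):
    """Compute y = r + γ Q_{θ'}(s', π_{θ'}(s') + ε) with ε ~ clip(N(0, σ), -c, c)."""
    with torch.no_grad():
        # Action proposed by the target policy for the next state.
        next_action = actor_target(next_state)
        # Clipped Gaussian noise keeps the perturbed action close to the original one.
        noise = (torch.randn_like(next_action) * sigma).clamp(-c, c)
        smoothed_action = (next_action + noise).clamp(-1.0, 1.0)  # respect action bounds
        # Bootstrapped value of the smoothed action; averaging this target over the
        # mini-batch approximates the expectation over ε.
        target_q = critic_target(next_state, smoothed_action)
        y = reward + gamma * (1.0 - done) * target_q
    return y
```

Because the noise is drawn inside the target computation, it is independent of whatever exploration noise was used to collect the data, matching the off-policy, Expected-SARSA-like interpretation above.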