
What is: Stein Variational Policy Gradient?

Source: Stein Variational Policy Gradient
Year: 2017
Data Source: CC BY-SA - https://paperswithcode.com

Stein Variational Policy Gradient, or SVPG, is a policy-gradient-based method in reinforcement learning that uses Stein Variational Gradient Descent to allow simultaneous exploitation and exploration of multiple policies. Unlike traditional policy optimization, which attempts to learn a single policy, SVPG models a distribution over policy parameters, where samples from this distribution represent strong policies. SVPG optimizes this distribution with (relative) entropy regularization: the entropy term explicitly encourages exploration in parameter space, while the objective also maximizes the expected utility of policies drawn from the distribution. Stein variational gradient descent (SVGD) is then used to optimize this distribution; it leverages efficient deterministic dynamics to transport a set of particles so that they approximate a given target posterior distribution.

The update takes the form:

$$\nabla \theta_i = \frac{1}{n}\sum_{j=1}^{n} \left[ \nabla_{\theta_j}\left(\frac{1}{\alpha} J(\theta_j) + \log q_0(\theta_j)\right) k(\theta_j, \theta_i) + \nabla_{\theta_j} k(\theta_j, \theta_i) \right]$$
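
Below is a minimal NumPy sketch of this particle update, assuming an RBF kernel with a median-heuristic bandwidth. The function and argument names (`svpg_step`, `grad_J`, `grad_log_prior`, `step_size`) are illustrative placeholders, not part of the method's specification; the caller is assumed to supply per-particle policy-gradient and prior-score estimates.

```python
import numpy as np

def rbf_kernel(theta, h=-1.0):
    """RBF kernel matrix K[j, i] = k(theta_j, theta_i) and the repulsive term
    sum_j grad_{theta_j} k(theta_j, theta_i), using the median heuristic for
    the bandwidth when h < 0."""
    sq_dist = np.sum((theta[:, None, :] - theta[None, :, :]) ** 2, axis=-1)  # (n, n)
    if h < 0:
        h = np.median(sq_dist) / np.log(theta.shape[0] + 1.0) + 1e-8
    K = np.exp(-sq_dist / h)
    # grad_{theta_j} k(theta_j, theta_i) = -(2 / h) * (theta_j - theta_i) * K[j, i];
    # summing over j for each i gives the (n, d) repulsive term.
    repulsive = (2.0 / h) * (theta * K.sum(axis=1, keepdims=True) - K @ theta)
    return K, repulsive


def svpg_step(theta, grad_J, grad_log_prior, alpha, step_size=1e-3):
    """One SVPG particle update following the equation above (a sketch).

    theta:          (n, d) array, one row per policy-parameter particle.
    grad_J:         (n, d) policy-gradient estimates, grad_J[j] ~ grad of J at theta_j.
    grad_log_prior: (n, d) gradients of log q_0 at each particle.
    alpha:          temperature trading off exploitation against exploration.
    """
    n = theta.shape[0]
    K, repulsive = rbf_kernel(theta)
    # Driving term: sum_j grad_{theta_j}((1/alpha) J(theta_j) + log q_0(theta_j)) k(theta_j, theta_i)
    driving = K.T @ (grad_J / alpha + grad_log_prior)
    phi = (driving + repulsive) / n
    return theta + step_size * phi
```

In such a sketch, `grad_J` would come from any standard policy-gradient estimator (e.g. REINFORCE or advantage actor-critic) evaluated independently for each particle's policy.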

Note that here the magnitude of $\alpha$ adjusts the relative importance between the driving term $\nabla_{\theta_j} \left(\frac{1}{\alpha} J(\theta_j) + \log q_0(\theta_j)\right)k(\theta_j, \theta_i)$, which combines the policy gradient and the prior, and the repulsive term $\nabla_{\theta_j} k(\theta_j, \theta_i)$. The repulsive functional diversifies the particles to enable parameter exploration. A suitable $\alpha$ provides a good trade-off between exploitation and exploration. If $\alpha$ is too large, the Stein gradient only drives the particles to be consistent with the prior $q_0$. As $\alpha \to 0$, the algorithm reduces to running $n$ copies of independent policy gradient algorithms, provided the $\{\theta_i\}$ are initialized very differently. A careful annealing scheme for $\alpha$ allows efficient exploration early in training and shifts the focus to exploitation towards the end.
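
As one illustration of such a scheme (the exponential form and the `alpha_init` / `alpha_final` values here are assumptions, not prescribed by the method), the temperature could simply be decayed over training:

```python
def annealed_alpha(step, total_steps, alpha_init=10.0, alpha_final=0.1):
    """Hypothetical exponential annealing of alpha: a large alpha early in
    training lets the prior and repulsive terms dominate (exploration), while
    a small alpha later lets the policy gradient dominate (exploitation)."""
    frac = min(step / float(total_steps), 1.0)
    return alpha_init * (alpha_final / alpha_init) ** frac
```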