
What is: Generalized State-Dependent Exploration?

Source: Smooth Exploration for Robotic Reinforcement Learning
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

Generalized State-Dependent Exploration, or gSDE, is an exploration method for reinforcement learning that uses more general features and re-samples the noise periodically.

State-Dependent Exploration (SDE) is an intermediate solution for exploration that consists in adding noise as a function of the state $\mathbf{s}_t$ to the deterministic action $\mu(\mathbf{s}_t)$. At the beginning of an episode, the parameters $\theta_\epsilon$ of that exploration function are drawn from a Gaussian distribution. The resulting action $\mathbf{a}_t$ is as follows:

$$\mathbf{a}_t = \mu(\mathbf{s}_t; \theta_\mu) + \epsilon(\mathbf{s}_t; \theta_\epsilon), \quad \theta_\epsilon \sim \mathcal{N}(0, \sigma^2)$$

This episode-based exploration is smoother and more consistent than the unstructured step-based exploration. Thus, during one episode, instead of oscillating around a mean value, the action $\mathbf{a}$ for a given state $\mathbf{s}$ will be the same.
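To make the episode-based scheme concrete, here is a minimal NumPy sketch of SDE with a linear noise function: the exploration parameters $\theta_\epsilon$ are drawn once at the start of the episode and reused at every step. The dimensions, weights, and `env_step` transition function are illustrative placeholders, not taken from the paper.

```python
import numpy as np

state_dim, action_dim = 4, 2
theta_mu = np.random.randn(action_dim, state_dim) * 0.1  # deterministic policy weights (placeholder)
sigma = 0.5 * np.ones((state_dim, action_dim))           # exploration std, one entry per (state, action) pair

def sde_episode(env_step, s0, horizon=100):
    """Roll out one episode with State-Dependent Exploration (linear noise)."""
    # Sample the exploration weights once per episode: theta_eps ~ N(0, sigma^2)
    theta_eps = np.random.normal(0.0, sigma)             # shape (state_dim, action_dim)
    s = s0
    for _ in range(horizon):
        mu = theta_mu @ s                                # deterministic action mu(s)
        a = mu + theta_eps.T @ s                         # a = mu(s) + eps(s; theta_eps)
        s = env_step(s, a)                               # placeholder environment transition
    return s

# Example usage with a dummy linear dynamics model (purely illustrative):
s_final = sde_episode(lambda s, a: 0.9 * s + 0.1 * (a @ theta_mu), np.ones(state_dim))
```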

In the case of a linear exploration function $\epsilon(\mathbf{s}; \theta_\epsilon) = \theta_\epsilon \mathbf{s}$, by operation on Gaussian distributions, Rückstieß et al. show that the action element $\mathbf{a}_j$ is normally distributed:

$$\pi_j(\mathbf{a}_j \mid \mathbf{s}) \sim \mathcal{N}\left(\mu_j(\mathbf{s}), \hat{\sigma}_j^2\right)$$

where $\hat{\sigma}$ is a diagonal matrix with elements $\hat{\sigma}_j = \sqrt{\sum_i \left(\sigma_{ij} \mathbf{s}_i\right)^2}$.
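As a quick sanity check (a sketch, not from the paper), one can verify empirically that for linear noise each action element is indeed distributed with standard deviation $\hat{\sigma}_j = \sqrt{\sum_i (\sigma_{ij} \mathbf{s}_i)^2}$; the dimensions and values below are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim = 4, 2
s = rng.normal(size=state_dim)
mu = rng.normal(size=action_dim)                          # mu(s) for this fixed state (placeholder values)
sigma = rng.uniform(0.1, 1.0, size=(state_dim, action_dim))

# Closed-form std of each action element: sigma_hat_j = sqrt(sum_i (sigma_ij * s_i)^2)
sigma_hat = np.sqrt(np.sum((sigma * s[:, None]) ** 2, axis=0))

# Empirical std over many draws of theta_eps ~ N(0, sigma^2)
theta_eps = rng.normal(0.0, sigma, size=(100_000, state_dim, action_dim))
actions = mu + np.einsum("nij,i->nj", theta_eps, s)       # a = mu(s) + theta_eps^T s for each draw
print(sigma_hat)                                          # analytical
print(actions.std(axis=0))                                # empirical, should be close
```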

Because we know the policy distribution, we can obtain the derivative of the log-likelihood $\log \pi(\mathbf{a} \mid \mathbf{s})$ with respect to the variance $\sigma$:

$$\frac{\partial \log \pi(\mathbf{a} \mid \mathbf{s})}{\partial \sigma_{ij}} = \frac{\left(\mathbf{a}_j - \mu_j\right)^2 - \hat{\sigma}_j^2}{\hat{\sigma}_j^3} \, \frac{\mathbf{s}_i^2 \sigma_{ij}}{\hat{\sigma}_j}$$
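The expression above can be checked numerically. The short sketch below (an illustration, not the paper's code) compares the closed-form gradient of $\log \pi(\mathbf{a} \mid \mathbf{s})$ with respect to one entry $\sigma_{ij}$ against a central finite-difference estimate; all values are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
state_dim, action_dim = 3, 2
s = rng.normal(size=state_dim)
mu = rng.normal(size=action_dim)
a = rng.normal(size=action_dim)
sigma = rng.uniform(0.2, 1.0, size=(state_dim, action_dim))

def log_pi(sigma):
    """Log-likelihood of a under the Gaussian policy N(mu(s), diag(sigma_hat^2))."""
    sigma_hat = np.sqrt(np.sum((sigma * s[:, None]) ** 2, axis=0))
    return np.sum(-0.5 * ((a - mu) / sigma_hat) ** 2 - np.log(sigma_hat) - 0.5 * np.log(2 * np.pi))

i, j = 1, 0
sigma_hat = np.sqrt(np.sum((sigma * s[:, None]) ** 2, axis=0))

# Closed form: ((a_j - mu_j)^2 - sigma_hat_j^2) / sigma_hat_j^3 * (s_i^2 * sigma_ij / sigma_hat_j)
grad = ((a[j] - mu[j]) ** 2 - sigma_hat[j] ** 2) / sigma_hat[j] ** 3 \
       * (s[i] ** 2 * sigma[i, j] / sigma_hat[j])

# Finite-difference check
eps = 1e-6
sigma_p = sigma.copy()
sigma_p[i, j] += eps
sigma_m = sigma.copy()
sigma_m[i, j] -= eps
print(grad, (log_pi(sigma_p) - log_pi(sigma_m)) / (2 * eps))  # the two values should agree
```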

This can be easily plugged into the likelihood ratio gradient estimator, which allows $\sigma$ to be adapted during training. SDE is therefore compatible with standard policy gradient methods, while addressing most shortcomings of unstructured exploration.

For gSDE, two improvements are suggested (both are illustrated in the sketch after the list):

  1. We sample the parameters $\theta_\epsilon$ of the exploration function every $n$ steps instead of every episode.
  2. Instead of the state $\mathbf{s}$, we can in fact use any features. We chose policy features $\mathbf{z}_\mu(\mathbf{s}; \theta_{\mathbf{z}_\mu})$ (the last layer before the deterministic output $\mu(\mathbf{s}) = \theta_\mu \mathbf{z}_\mu(\mathbf{s}; \theta_{\mathbf{z}_\mu})$) as input to the noise function: $\epsilon(\mathbf{s}; \theta_\epsilon) = \theta_\epsilon \mathbf{z}_\mu(\mathbf{s})$.
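Here is a minimal sketch of the two gSDE changes in NumPy: noise weights are re-sampled every `sde_sample_freq` steps, and the noise is applied to the policy features $\mathbf{z}_\mu(\mathbf{s})$ rather than the raw state. The network sizes, the feature extractor, and the resampling interval are illustrative assumptions; Stable-Baselines3 provides a full implementation of gSDE.

```python
import numpy as np

rng = np.random.default_rng(2)
state_dim, feature_dim, action_dim = 4, 8, 2
W_hidden = rng.normal(scale=0.1, size=(feature_dim, state_dim))   # feature extractor (placeholder)
theta_mu = rng.normal(scale=0.1, size=(action_dim, feature_dim))  # last layer: mu(s) = theta_mu z_mu(s)
sigma = 0.3 * np.ones((feature_dim, action_dim))                  # exploration std (learned in practice)

def z_mu(s):
    """Policy features: the last layer before the deterministic output."""
    return np.tanh(W_hidden @ s)

def gsde_rollout(env_step, s0, horizon=200, sde_sample_freq=16):
    s, theta_eps = s0, None
    for t in range(horizon):
        # (1) Re-sample the exploration matrix every `sde_sample_freq` steps, not once per episode.
        if t % sde_sample_freq == 0:
            theta_eps = rng.normal(0.0, sigma)          # theta_eps ~ N(0, sigma^2)
        z = z_mu(s)
        # (2) The noise depends on the policy features z_mu(s) instead of the raw state s.
        a = theta_mu @ z + theta_eps.T @ z              # a = mu(s) + eps(s; theta_eps)
        s = env_step(s, a)                              # placeholder environment transition
    return s

# Example usage with a dummy linear dynamics model (purely illustrative):
s_final = gsde_rollout(lambda s, a: 0.9 * s + 0.1 * np.repeat(a, state_dim // action_dim),
                       np.ones(state_dim))
```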