
What is: Trust Region Policy Optimization?

Source: Trust Region Policy Optimization
Year: 2015
Data Source: CC BY-SA - https://paperswithcode.com

Trust Region Policy Optimization, or TRPO, is a policy gradient method in reinforcement learning that avoids parameter updates that change the policy too much, by enforcing a KL divergence constraint on the size of the policy update at each iteration.

Take the case of off-policy reinforcement learning, where the policy $\beta$ for collecting trajectories on rollout workers is different from the policy $\pi$ to optimize for. The objective function in an off-policy model measures the total advantage over the state visitation distribution and actions, while the mismatch between the training data distribution and the true policy state distribution is compensated with an importance sampling estimator:

$$J(\theta) = \sum_{s \in S} p^{\pi_{\theta_{old}}} \sum_{a \in \mathcal{A}} \left( \pi_{\theta}(a \mid s)\, \hat{A}_{\theta_{old}}(s, a) \right)$$

$$J(\theta) = \sum_{s \in S} p^{\pi_{\theta_{old}}} \sum_{a \in \mathcal{A}} \left( \beta(a \mid s)\, \frac{\pi_{\theta}(a \mid s)}{\beta(a \mid s)}\, \hat{A}_{\theta_{old}}(s, a) \right)$$

$$J(\theta) = \mathbb{E}_{s \sim p^{\pi_{\theta_{old}}},\, a \sim \beta} \left[ \frac{\pi_{\theta}(a \mid s)}{\beta(a \mid s)}\, \hat{A}_{\theta_{old}}(s, a) \right]$$
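
As a concrete illustration, this expectation can be approximated by a Monte Carlo average over a batch of transitions collected under $\beta$. The sketch below assumes PyTorch and placeholder tensor names (`logp_pi`, `logp_beta`, `advantages`) that are not from the paper; it is a minimal estimator of the objective, not the full TRPO machinery:

```python
import torch

def off_policy_objective(logp_pi: torch.Tensor,
                         logp_beta: torch.Tensor,
                         advantages: torch.Tensor) -> torch.Tensor:
    """Monte Carlo estimate of J(theta) over a batch sampled from beta.

    logp_pi    -- log pi_theta(a|s) for the sampled (s, a) pairs
    logp_beta  -- log beta(a|s) for the same pairs (fixed, no gradient)
    advantages -- advantage estimates A_hat_{theta_old}(s, a)
    """
    # Importance weight pi_theta(a|s) / beta(a|s), computed in log space
    # for numerical stability.
    ratio = torch.exp(logp_pi - logp_beta)
    return (ratio * advantages).mean()
```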

When training on-policy, the policy for collecting data is, in theory, the same as the policy we want to optimize. However, when rollout workers and optimizers run asynchronously in parallel, the behavior policy can become stale. TRPO accounts for this subtle difference: it labels the behavior policy as $\pi_{\theta_{old}}(a \mid s)$, and the objective function becomes:

$$J(\theta) = \mathbb{E}_{s \sim p^{\pi_{\theta_{old}}},\, a \sim \pi_{\theta_{old}}} \left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{old}}(a \mid s)}\, \hat{A}_{\theta_{old}}(s, a) \right]$$
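
In code, this is the same importance-sampled estimator with $\beta$ replaced by $\pi_{\theta_{old}}$; the old log-probabilities are treated as constants so that gradients flow only through the current policy. Again a minimal sketch with assumed tensor names, not the reference implementation:

```python
import torch

def trpo_surrogate(logp_pi: torch.Tensor,
                   logp_old: torch.Tensor,
                   advantages: torch.Tensor) -> torch.Tensor:
    """Surrogate objective J(theta) with pi_theta_old as the behavior policy."""
    # logp_old was computed by pi_theta_old before the update; detach it so
    # it contributes no gradient to the current parameters theta.
    ratio = torch.exp(logp_pi - logp_old.detach())
    return (ratio * advantages).mean()
```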

TRPO aims to maximize the objective function $J(\theta)$ subject to a trust region constraint which enforces the distance between old and new policies, measured by KL divergence, to be small enough, within a parameter $\delta$:

$$\mathbb{E}_{s \sim p^{\pi_{\theta_{old}}}} \left[ D_{KL}\left( \pi_{\theta_{old}}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s) \right) \right] \leq \delta$$
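
In the paper this constrained problem is solved with a natural-gradient step (computed via conjugate gradient) followed by a backtracking line search; the sketch below only makes the role of $\delta$ concrete for discrete action distributions. It estimates the mean KL divergence over a batch of states and accepts a candidate policy only if it stays inside the trust region. Function and variable names here are illustrative assumptions, not part of the original algorithm's API:

```python
import torch

def mean_kl(probs_old: torch.Tensor, probs_new: torch.Tensor) -> torch.Tensor:
    """E_s[ D_KL(pi_theta_old(.|s) || pi_theta(.|s)) ] for batched discrete
    action distributions of shape (batch, num_actions)."""
    kl = (probs_old * (probs_old.log() - probs_new.log())).sum(dim=-1)
    return kl.mean()

def inside_trust_region(probs_old: torch.Tensor,
                        probs_new: torch.Tensor,
                        delta: float = 0.01) -> bool:
    """Accept a candidate policy only if the average KL divergence to the
    old policy does not exceed the trust-region radius delta."""
    return mean_kl(probs_old, probs_new).item() <= delta

# Dummy usage: 4 states, 3 actions, random policies.
old = torch.softmax(torch.randn(4, 3), dim=-1)
new = torch.softmax(torch.randn(4, 3), dim=-1)
print(mean_kl(old, new).item(), inside_trust_region(old, new))
```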