What is: Retrace?

Source: Safe and Efficient Off-Policy Reinforcement Learning
Year: 2016
Data Source: CC BY-SA - https://paperswithcode.com

Retrace is an off-policy Q-value estimation algorithm with guaranteed convergence for any target and behaviour policy pair $\left(\pi, \beta\right)$. With off-policy rollouts for TD learning, we must use importance sampling for the update:

$$\Delta{Q}^{\text{imp}}\left(S_{t}, A_{t}\right) = \gamma^{t}\prod_{1\leq{\tau}\leq{t}}\frac{\pi\left(A_{\tau}\mid{S_{\tau}}\right)}{\beta\left(A_{\tau}\mid{S_{\tau}}\right)}\delta_{t}$$
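As a rough illustration of the importance-sampled update, the sketch below evaluates the formula for one time step. The function name and argument layout are hypothetical, and `delta_t` (the TD error term $\delta_{t}$) is assumed to be computed elsewhere and passed in as a number.

```python
import math

def importance_weighted_update(pi_probs, beta_probs, delta_t, gamma):
    """Evaluate gamma^t * prod_{tau=1..t} pi/beta * delta_t.

    pi_probs[k] and beta_probs[k] are assumed to hold pi(A_tau | S_tau) and
    beta(A_tau | S_tau) for tau = k + 1, so t equals len(pi_probs).
    """
    t = len(pi_probs)
    ratios = [p / b for p, b in zip(pi_probs, beta_probs)]
    return gamma ** t * math.prod(ratios) * delta_t

# Ten steps on which the target policy is twice as likely as the behaviour
# policy: the product of ratios is 2**10, so the update is scaled by 1024.
demo = importance_weighted_update([0.5] * 10, [0.25] * 10, delta_t=1.0, gamma=1.0)
print(demo)  # 1024.0
```

The demo shows how quickly the product can grow: a modest per-step ratio of 2 already multiplies the update by 1024 after ten steps.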

This product term can have high variance, so Retrace modifies $\Delta{Q}$ by truncating each importance weight to be at most a constant $c$:

$$\Delta{Q}^{\text{ret}}\left(S_{t}, A_{t}\right) = \gamma^{t}\prod_{1\leq{\tau}\leq{t}}\min\left(c, \frac{\pi\left(A_{\tau}\mid{S_{\tau}}\right)}{\beta\left(A_{\tau}\mid{S_{\tau}}\right)}\right)\delta_{t}$$
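The truncated update can be sketched in the same way. This is a minimal illustration rather than a reference implementation: the function name, the default `c=1.0`, and the convention that `delta_t` (the TD error term $\delta_{t}$) is supplied by the caller are all assumptions made for the example.

```python
import math

def truncated_update(pi_probs, beta_probs, delta_t, gamma, c=1.0):
    """Evaluate gamma^t * prod_{tau=1..t} min(c, pi/beta) * delta_t.

    pi_probs[k] and beta_probs[k] are assumed to hold pi(A_tau | S_tau) and
    beta(A_tau | S_tau) for tau = k + 1, so t equals len(pi_probs).
    """
    t = len(pi_probs)
    # Each per-step importance weight is capped at c before multiplying.
    weights = [min(c, p / b) for p, b in zip(pi_probs, beta_probs)]
    return gamma ** t * math.prod(weights) * delta_t

# Ten steps with a per-step ratio of 2: every weight is capped at c = 1,
# so the product is 1 instead of 2**10.
capped = truncated_update([0.5] * 10, [0.25] * 10, delta_t=1.0, gamma=1.0, c=1.0)
print(capped)  # 1.0
```

With $c = 1$ the product of weights can never exceed 1, which bounds the scale of the update. Ratios below $c$ pass through unchanged.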