Retrace is an off-policy Q-value estimation algorithm with guaranteed convergence for any target and behavior policy pair (π, β). With off-policy rollouts for TD learning, we must correct the update with importance sampling, where δ_t is the TD error at step t:
$$\Delta Q^{\text{imp}}(S_t, A_t) = \gamma^t \prod_{1 \leq \tau \leq t} \frac{\pi(A_\tau \mid S_\tau)}{\beta(A_\tau \mid S_\tau)} \delta_t$$
This product of importance ratios can have very high variance, so Retrace modifies ΔQ to truncate each importance weight at a constant c:
$$\Delta Q^{\text{ret}}(S_t, A_t) = \gamma^t \prod_{1 \leq \tau \leq t} \min\Big(c, \frac{\pi(A_\tau \mid S_\tau)}{\beta(A_\tau \mid S_\tau)}\Big) \delta_t$$
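As a minimal numerical sketch of this correction term (the function name and array-based interface are illustrative, not from the source), the truncated product can be computed per trajectory like so, with `pi_probs` and `beta_probs` holding π(A_τ | S_τ) and β(A_τ | S_τ) for τ = 1, …, t:

```python
import numpy as np

def retrace_delta_q(pi_probs, beta_probs, td_error, gamma=0.99, c=1.0):
    """Sketch of one Retrace correction term:
    gamma^t * prod_{1<=tau<=t} min(c, pi/beta) * delta_t,
    where t = len(pi_probs). Truncating each ratio at c bounds
    the variance of the importance-weight product."""
    ratios = np.asarray(pi_probs, dtype=float) / np.asarray(beta_probs, dtype=float)
    t = len(ratios)
    return gamma ** t * np.prod(np.minimum(c, ratios)) * td_error
```

With c = ∞ this reduces to the plain importance-sampling update, whose weight product can blow up when π assigns much more probability to the taken actions than β; with a finite c each factor is bounded, at the cost of some bias.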