
What is: V-trace?

Source: IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures
Year: 2018
Data Source: CC BY-SA - https://paperswithcode.com

V-trace is an off-policy actor-critic reinforcement learning algorithm that helps tackle the lag between when actions are generated by the actors and when the learner estimates the gradient. Consider a trajectory $\left(x_t, a_t, r_t\right)_{t=s}^{t=s+n}$ generated by the actor following some policy $\mu$. We can define the $n$-step V-trace target for $V\left(x_s\right)$, our value approximation at state $x_s$, as:

$$v_s = V\left(x_s\right) + \sum_{t=s}^{s+n-1}\gamma^{t-s}\left(\prod_{i=s}^{t-1}c_i\right)\delta_t V$$

where $\delta_t V = \rho_t\left(r_t + \gamma V\left(x_{t+1}\right) - V\left(x_t\right)\right)$ is a temporal difference for $V$, and $\rho_t = \min\left(\bar{\rho}, \frac{\pi\left(a_t \mid x_t\right)}{\mu\left(a_t \mid x_t\right)}\right)$ and $c_i = \min\left(\bar{c}, \frac{\pi\left(a_i \mid x_i\right)}{\mu\left(a_i \mid x_i\right)}\right)$ are truncated importance sampling weights. We assume that the truncation levels satisfy $\bar{\rho} \geq \bar{c}$.
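
To make the target concrete, here is a minimal NumPy sketch that computes the V-trace targets for a single trajectory using the equivalent backward recursion $v_s = V\left(x_s\right) + \delta_s V + \gamma c_s\left(v_{s+1} - V\left(x_{s+1}\right)\right)$. The function name `vtrace_targets` and its argument layout are illustrative assumptions, not code from the paper; it presumes the importance ratios $\pi(a_t \mid x_t)/\mu(a_t \mid x_t)$ have already been computed.

```python
import numpy as np

def vtrace_targets(values, rewards, ratios, gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Compute V-trace targets v_s for one trajectory (illustrative sketch).

    values:  V(x_s), ..., V(x_{s+n})            -- shape (n + 1,)
    rewards: r_s, ..., r_{s+n-1}                -- shape (n,)
    ratios:  pi(a_t | x_t) / mu(a_t | x_t)      -- shape (n,), untruncated
    """
    n = len(rewards)
    rhos = np.minimum(rho_bar, ratios)   # truncated rho_t
    cs = np.minimum(c_bar, ratios)       # truncated c_i

    # Temporal differences: delta_t V = rho_t * (r_t + gamma * V(x_{t+1}) - V(x_t))
    deltas = rhos * (rewards + gamma * values[1:] - values[:-1])

    # Backward pass: accumulate acc_t = delta_t V + gamma * c_t * acc_{t+1},
    # so that v_t = V(x_t) + acc_t, with the horizon shrinking toward the
    # end of the trajectory (v_{s+n} = V(x_{s+n}) as the bootstrap).
    vs = np.zeros(n + 1)
    vs[-1] = values[-1]
    acc = 0.0
    for t in reversed(range(n)):
        acc = deltas[t] + gamma * cs[t] * acc
        vs[t] = values[t] + acc
    return vs
```

In the IMPALA setup these targets serve as the regression targets for the value function; note that with $\bar{\rho} = \bar{c} = 1$ and $\pi = \mu$ (on-policy), the targets reduce to the usual $n$-step Bellman target.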