V-trace is an off-policy actor-critic reinforcement learning algorithm that corrects for the lag between when actions are generated by the actors and when the learner estimates the gradient. Consider a trajectory $(x_t, a_t, r_t)_{t=s}^{t=s+n}$ generated by the actor following some policy $\mu$. We can define the n-step V-trace target for $V(x_s)$, our value approximation at state $x_s$, as:
$$v_s = V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{t-s} \left( \prod_{i=s}^{t-1} c_i \right) \delta_t V$$
where $\delta_t V = \rho_t \left( r_t + \gamma V(x_{t+1}) - V(x_t) \right)$ is a temporal difference for $V$, and $\rho_t = \min\!\left(\bar{\rho}, \frac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)}\right)$ and $c_i = \min\!\left(\bar{c}, \frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)}\right)$ are truncated importance sampling weights. We assume that the truncation levels are such that $\bar{\rho} \geq \bar{c}$.
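To make the target concrete, here is a minimal NumPy sketch that computes $v_s$ for a single trajectory directly from the formula above. The function name `vtrace_target`, the argument names, and the choice to pass log importance ratios are illustrative assumptions, not the IMPALA implementation:

```python
import numpy as np

def vtrace_target(values, rewards, log_rhos, gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Compute the n-step V-trace target v_s for the first state of a trajectory.

    Args:
        values:   array of shape [n + 1], V(x_s), ..., V(x_{s+n}).
        rewards:  array of shape [n], r_s, ..., r_{s+n-1}.
        log_rhos: array of shape [n], log(pi(a_t | x_t) / mu(a_t | x_t)).
        gamma:    discount factor.
        rho_bar:  truncation level for rho_t (we assume rho_bar >= c_bar).
        c_bar:    truncation level for c_i.

    Returns:
        Scalar v_s, the V-trace target for V(x_s).
    """
    ratios = np.exp(log_rhos)
    rhos = np.minimum(rho_bar, ratios)   # truncated rho_t
    cs = np.minimum(c_bar, ratios)       # truncated c_i
    # delta_t V = rho_t * (r_t + gamma * V(x_{t+1}) - V(x_t))
    deltas = rhos * (rewards + gamma * values[1:] - values[:-1])

    v_s = values[0]
    discount, c_prod = 1.0, 1.0          # gamma^{t-s} and prod_{i=s}^{t-1} c_i
    for t in range(len(rewards)):
        v_s += discount * c_prod * deltas[t]
        discount *= gamma
        c_prod *= cs[t]                  # extend the product for the next term
    return v_s
```

Note that when $\pi = \mu$ and $\bar{\rho}, \bar{c} \geq 1$, all the ratios equal 1 and the sum telescopes, so $v_s$ reduces to the standard on-policy n-step Bellman target bootstrapped on $V(x_{s+n})$.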