
What is: TD Lambda?

Year: 2000
Data Source: CC BY-SA - https://paperswithcode.com

TD($\lambda$) is a generalisation of $n$-step TD reinforcement learning algorithms, but it employs an eligibility trace with decay parameter $\lambda$ and $\lambda$-weighted returns. The eligibility trace vector is initialized to zero at the beginning of the episode, is incremented on each time step by the value gradient, and then fades away by $\gamma\lambda$:

$$\mathbf{z}_{-1} = \mathbf{0}$$

$$\mathbf{z}_{t} = \gamma\lambda\mathbf{z}_{t-1} + \nabla\hat{v}\left(S_{t}, \mathbf{w}_{t}\right), \quad 0 \leq t \leq T$$

The eligibility trace keeps track of which components of the weight vector have contributed to recent state valuations. Here $\nabla\hat{v}\left(S_{t}, \mathbf{w}_{t}\right)$ is the gradient of the value estimate with respect to the weights, which under linear function approximation is simply the feature vector.
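
As a minimal sketch of this update, assuming linear function approximation with one-hot features (so the gradient of $\hat{v}$ is just the feature vector) and illustrative values for $\gamma$ and $\lambda$:

```python
import numpy as np

num_features = 8                       # assumed feature dimension
gamma, lam = 0.99, 0.9                 # discount and trace-decay parameters (illustrative values)

def features(state):
    """Hypothetical feature map: one-hot encoding of a discrete state index."""
    x = np.zeros(num_features)
    x[state] = 1.0
    return x

# With v_hat(s, w) = w @ features(s), the gradient of v_hat w.r.t. w is features(s).
z = np.zeros(num_features)             # z_{-1} = 0 at the start of an episode

state = 3                              # example state index
# Each step the trace decays by gamma*lambda and accumulates the current gradient:
z = gamma * lam * z + features(state)  # z_t = gamma*lambda*z_{t-1} + grad v_hat(S_t, w_t)
```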

The TD error for state-value prediction is:

$$\delta_{t} = R_{t+1} + \gamma\hat{v}\left(S_{t+1}, \mathbf{w}_{t}\right) - \hat{v}\left(S_{t}, \mathbf{w}_{t}\right)$$
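
Continuing the same linear sketch, the TD error compares the reward plus the discounted value of the next state against the current state's value (the variable names and values here are assumptions for illustration):

```python
import numpy as np

w = np.zeros(8)                              # weight vector (assumed dimension)
x_t, x_next = np.eye(8)[3], np.eye(8)[4]     # hypothetical one-hot features for S_t and S_{t+1}
reward, gamma = 1.0, 0.99                    # R_{t+1} and discount factor (illustrative values)

v_t, v_next = w @ x_t, w @ x_next            # v_hat(S_t, w_t) and v_hat(S_{t+1}, w_t)
delta = reward + gamma * v_next - v_t        # TD error delta_t
```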

In TD($\lambda$), the weight vector is updated on each step in proportion to the scalar TD error and the vector eligibility trace:

$$\mathbf{w}_{t+1} = \mathbf{w}_{t} + \alpha\delta_{t}\mathbf{z}_{t}$$
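
Putting the three updates together, here is a minimal sketch of semi-gradient TD($\lambda$) prediction with one-hot linear features. The 5-state random-walk environment, step size, and hyperparameter values are assumptions chosen only to make the example runnable; they are not part of the source text.

```python
import numpy as np

def semi_gradient_td_lambda(num_episodes=500, alpha=0.1, gamma=1.0, lam=0.9, seed=0):
    """Estimate state values of a hypothetical 5-state random walk with TD(lambda).

    States 0..4, start in the middle; stepping off the right end gives reward +1,
    off the left end gives 0. True values under gamma=1 are roughly 1/6 .. 5/6.
    """
    rng = np.random.default_rng(seed)
    n_states = 5
    w = np.zeros(n_states)                        # with one-hot features, v_hat(s, w) = w[s]

    for _ in range(num_episodes):
        s = n_states // 2                         # start in the centre
        z = np.zeros(n_states)                    # eligibility trace: z_{-1} = 0
        done = False
        while not done:
            # Random-walk dynamics: move left or right with equal probability.
            s_next = s + (1 if rng.random() < 0.5 else -1)
            if s_next < 0:
                reward, v_next, done = 0.0, 0.0, True    # terminated off the left end
            elif s_next >= n_states:
                reward, v_next, done = 1.0, 0.0, True    # terminated off the right end
            else:
                reward, v_next = 0.0, w[s_next]

            x = np.zeros(n_states)
            x[s] = 1.0                            # feature vector = gradient of v_hat at S_t
            z = gamma * lam * z + x               # eligibility-trace update
            delta = reward + gamma * v_next - w[s]          # TD error delta_t
            w = w + alpha * delta * z             # weight update: w_{t+1} = w_t + alpha*delta_t*z_t
            s = s_next
    return w

print(semi_gradient_td_lambda())                  # approximately [1/6, 2/6, 3/6, 4/6, 5/6]
```

With $\lambda = 0$ the trace reduces to the current gradient and the method is ordinary one-step TD(0); with $\lambda = 1$ the trace accumulates over the whole episode and the method behaves like a Monte Carlo method.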

Source: Sutton and Barto, Reinforcement Learning: An Introduction, 2nd Edition