
What is: Taylor Expansion Policy Optimization?

Source: Taylor Expansion Policy Optimization
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

TayPO, or Taylor Expansion Policy Optimization, refers to a set of algorithms that apply $k$-th order Taylor expansions for policy optimization. This generalizes prior work, including TRPO as a special case, and can be thought of as unifying ideas from trust-region policy optimization and off-policy corrections. Taylor expansions share high-level similarities with both trust-region policy search and off-policy corrections. To get a high-level intuition for these similarities, consider a simple 1D example of Taylor expansions. Given a sufficiently smooth real-valued function on the real line $f : \mathbb{R} \rightarrow \mathbb{R}$, the $k$-th order Taylor expansion of $f(x)$ at $x_0$ is

$$f_k(x) = f(x_0) + \sum_{i=1}^{k} \frac{f^{(i)}(x_0)}{i!} \left(x - x_0\right)^{i}$$

where $f^{(i)}(x_0)$ is the $i$-th order derivative of $f$ at $x_0$. First, a common feature shared by Taylor expansions and trust-region policy search is the inherent notion of a trust region constraint. Indeed, for convergence to take place, a trust-region constraint $|x - x_0| < R(f, x_0)$ is required, where $R(f, x_0)$ is the radius of convergence. Second, when using the truncation as an approximation to the original function, $f_k(x) \approx f(x)$, Taylor expansions satisfy the requirement of off-policy evaluation: evaluating the target policy with behavior data. Indeed, to evaluate the truncation $f_k(x)$ at any $x$ (the target policy), we only require the behavior policy "data" at $x_0$ (i.e., the derivatives $f^{(i)}(x_0)$).
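
To make the 1D analogy concrete, here is a minimal Python sketch (an illustration, not code from the TayPO paper): it evaluates the truncation $f_k(x)$ using only derivatives collected at the expansion point $x_0$, the analogue of behavior data, and shows how the approximation degrades once $x$ leaves the trust region $|x - x_0| < R(f, x_0)$. The example function $f(x) = \log(1 + x)$ and the order $k = 8$ are arbitrary choices made for this illustration.

```python
import math

def taylor_truncation(derivs_at_x0, x0, x):
    """Evaluate the k-th order truncation f_k(x) = sum_i f^(i)(x0)/i! * (x - x0)^i.

    derivs_at_x0: [f(x0), f'(x0), f''(x0), ...] -- all the information about f
    is gathered at x0 (the "behavior policy" side of the analogy).
    """
    return sum(d / math.factorial(i) * (x - x0) ** i
               for i, d in enumerate(derivs_at_x0))

# Illustrative example: f(x) = log(1 + x) expanded at x0 = 0.
# Its derivatives at 0 are f(0) = 0 and f^(i)(0) = (-1)^(i-1) * (i-1)! for i >= 1,
# and the expansion converges only inside the trust region |x - 0| < 1.
def log1p_derivs_at_zero(k):
    return [0.0] + [(-1) ** (i - 1) * math.factorial(i - 1) for i in range(1, k + 1)]

k = 8
derivs = log1p_derivs_at_zero(k)
for x in (0.2, 0.8, 1.5):  # 1.5 lies outside the radius of convergence
    approx = taylor_truncation(derivs, x0=0.0, x=x)
    exact = math.log(1.0 + x)
    print(f"x={x:4.1f}  f_k(x)={approx:9.4f}  f(x)={exact:7.4f}  |error|={abs(approx - exact):.4f}")
```

Inside the trust region the truncation tracks $f$ closely; outside it the error grows quickly, which mirrors why trust-region methods constrain how far the target policy may move away from the behavior policy.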