What is: Taylor Expansion Policy Optimization?
Source | Taylor Expansion Policy Optimization |
Year | 2020 |
Data Source | CC BY-SA - https://paperswithcode.com |
TayPO, or Taylor Expansion Policy Optimization, refers to a set of algorithms that apply $K$-th order Taylor expansions to policy optimization. It generalizes prior work, including TRPO as a special case, and can be thought of as unifying ideas from trust-region policy search and off-policy corrections, since Taylor expansions share high-level similarities with both. To get a high-level intuition for these similarities, consider a simple 1D example. Given a sufficiently smooth real-valued function on the real line $f: \mathbb{R} \rightarrow \mathbb{R}$, the $K$-th order Taylor expansion of $f(x)$ at $x_0$ is

$$f_K(x) = f(x_0) + \sum_{k=1}^{K} \frac{f^{(k)}(x_0)}{k!} (x - x_0)^k,$$
where $f^{(k)}(x_0)$ are the $k$-th order derivatives of $f$ at $x_0$. First, a common feature shared by Taylor expansions and trust-region policy search is the inherent notion of a trust region: for the expansion to converge, a constraint of the form $|x - x_0| < R$ is required, where $R$ is the convergence radius. Second, when the truncation $f_K(x)$ is used as an approximation to the original function $f(x)$, Taylor expansions satisfy the requirement of off-policy evaluation: evaluating a target policy with behavior data. Indeed, to evaluate the truncation $f_K(x)$ at any $x$ (the target policy), we only require the behavior policy "data" at $x_0$ (i.e., the derivatives $f^{(k)}(x_0)$).
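This 1D picture can be sanity-checked numerically. The sketch below is purely illustrative (not code from the paper): it picks $f(x) = \log(1 + x)$ expanded at $x_0 = 0$, builds the truncation $f_K(x)$ from derivatives at $x_0$ only (the behavior "data"), and checks that the approximation error shrinks with $K$ only inside the convergence radius $|x - x_0| < 1$, the analogue of the trust region. The helper name `taylor_log1p` is hypothetical.

```python
# Illustrative sketch (not from the paper): K-th order Taylor truncation of
# f(x) = log(1 + x) at x0 = 0, i.e. f_K(x) = sum_{k=1}^{K} (-1)^(k+1) x^k / k.
# Only quantities at x0 (the "behavior" point) are used to evaluate f_K at x
# (the "target" point).
import math

def taylor_log1p(x: float, order: int) -> float:
    """K-th order Taylor truncation of log(1 + x) around x0 = 0."""
    return sum((-1) ** (k + 1) * x ** k / k for k in range(1, order + 1))

if __name__ == "__main__":
    for x in (0.5, 0.9, 1.5):          # inside, near, and outside the radius
        exact = math.log1p(x)
        for order in (1, 3, 7, 15):
            approx = taylor_log1p(x, order)
            print(f"x={x:4.1f}  K={order:2d}  "
                  f"f_K(x)={approx:9.4f}  error={abs(approx - exact):.4f}")
```

Running it shows the error at $x = 0.5$ dropping quickly as $K$ grows, while at $x = 1.5$ increasing $K$ makes the truncation diverge: the approximation is only trustworthy inside a region around $x_0$, which is the intuition behind the trust-region constraint above.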