What is: Decentralized Distributed Proximal Policy Optimization?
Source | DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames |
Year | 2019 |
Data Source | CC BY-SA - https://paperswithcode.com |
Decentralized Distributed Proximal Policy Optimization (DD-PPO) is a method for distributed reinforcement learning in resource-intensive simulated environments. DD-PPO is distributed (uses multiple machines), decentralized (lacks a centralized server), and synchronous (no computation is ever "stale"), making it conceptually simple and easy to implement.
Proximal Policy Optimization, or PPO, is a policy gradient method for reinforcement learning. The motivation was to have an algorithm with the data efficiency and reliable performance of TRPO (Trust Region Policy Optimization), while using only first-order optimization.
Let $r_t(\theta)$ denote the probability ratio $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$, so $r_t(\theta_{old}) = 1$. TRPO maximizes a "surrogate" objective:

$$L^{CPI}(\theta) = \hat{\mathbb{E}}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\,\hat{A}_t\right] = \hat{\mathbb{E}}_t\!\left[r_t(\theta)\,\hat{A}_t\right],$$

where $\hat{A}_t$ is an estimator of the advantage function at timestep $t$ and CPI refers to conservative policy iteration.
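As a minimal sketch of this objective in PyTorch (function and parameter names here are illustrative, not from the source; the clipping term is PPO's standard modification of the surrogate and is an assumption, since the text above only states the unclipped TRPO objective):

```python
import torch

def surrogate_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    # Probability ratio r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t),
    # computed from log-probabilities for numerical stability.
    ratio = torch.exp(log_probs_new - log_probs_old)

    # Unclipped TRPO surrogate: L^{CPI} = E_t[ r_t(theta) * A_t ].
    surrogate = ratio * advantages

    # PPO's clipped variant (assumed here) keeps r_t(theta) near 1
    # to penalize excessively large policy updates.
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages

    # Negated because optimizers minimize, while the objective is maximized.
    return -torch.min(surrogate, clipped).mean()
```

In use, `log_probs_old` would come from the behavior policy and be detached from the graph, so gradients flow only through `log_probs_new`.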
As a general abstraction, DD-PPO implements the following: at step $k$, worker $n$ has a copy of the parameters, $\theta^n_k$, calculates the gradient, $\partial\theta^n_k$, and updates $\theta$ via

$$\theta^n_{k+1} = \mathrm{ParamUpdate}\!\left(\theta^n_k,\ \mathrm{AllReduce}\!\left(\partial\theta^1_k, \ldots, \partial\theta^N_k\right)\right) = \mathrm{ParamUpdate}\!\left(\theta^n_k,\ \frac{1}{N}\sum_{i=1}^{N}\partial\theta^i_k\right),$$

where $\mathrm{ParamUpdate}$ is any first-order optimization technique (e.g. gradient descent) and $\mathrm{AllReduce}$ performs a reduction (e.g. mean) over all copies of a variable and returns the result to all workers. Distributed DataParallel scales very well (near-linear scaling up to 32,000 GPUs), and is reasonably simple to implement (all workers synchronously running identical code).
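This abstraction can be sketched with `torch.distributed`; the snippet below is an illustrative reimplementation, not the paper's code (the helper name `train_step` and the explicit per-parameter loop are assumptions; PyTorch's `DistributedDataParallel` performs the equivalent gradient averaging automatically, with bucketing to overlap communication and computation):

```python
import torch
import torch.distributed as dist

def train_step(model, optimizer, loss):
    """One synchronous, decentralized update: every worker runs this identical code.

    Assumes the process group was already initialized, e.g. with
    dist.init_process_group(backend="nccl"); there is no central parameter server.
    """
    optimizer.zero_grad()
    loss.backward()  # each worker computes its local gradient, d(theta_k^n)

    world_size = dist.get_world_size()  # N workers
    for p in model.parameters():
        if p.grad is not None:
            # AllReduce: sum this gradient across all N workers and return
            # the result to every worker, then divide to take the mean.
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size

    optimizer.step()  # ParamUpdate: any first-order optimizer (SGD, Adam, ...)
```

Because every worker applies the same averaged gradient to the same parameters, all copies of $\theta$ stay identical without any coordinating server, and no gradient is ever stale.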