What is: A3C?
| Source | Asynchronous Methods for Deep Reinforcement Learning |
| Year | 2016 |
| Data Source | CC BY-SA - https://paperswithcode.com |
A3C, Asynchronous Advantage Actor-Critic, is a policy gradient algorithm in reinforcement learning that maintains a policy $\pi(a_t \mid s_t; \theta)$ and an estimate of the value function $V(s_t; \theta_v)$. It operates in the forward view and uses a mix of $n$-step returns to update both the policy and the value function. The policy and the value function are updated after every $t_{\max}$ actions or when a terminal state is reached. The update performed by the algorithm can be seen as $\nabla_{\theta'} \log \pi(a_t \mid s_t; \theta')\, A(s_t, a_t; \theta, \theta_v)$, where $A(s_t, a_t; \theta, \theta_v)$ is an estimate of the advantage function given by:

$$\sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V(s_{t+k}; \theta_v) - V(s_t; \theta_v),$$

where $k$ can vary from state to state and is upper-bounded by $t_{\max}$.
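To make the advantage estimate concrete, here is a minimal Python sketch (the function name and signature are illustrative, not from the paper) that computes the mix of $n$-step return advantages for one rollout segment by sweeping backwards over the collected rewards:

```python
def nstep_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """Advantage estimates for one segment of up to t_max steps.

    rewards         -- r_t, ..., r_{t+k-1} collected by the actor
    values          -- V(s_t), ..., V(s_{t+k-1}) from the critic
    bootstrap_value -- V(s_{t+k}), or 0.0 if s_{t+k} is terminal
    """
    R = bootstrap_value
    advantages = [0.0] * len(rewards)
    # Sweep backwards: each state's return uses all remaining rewards
    # in the segment plus the bootstrapped value, so k shrinks from
    # t_max at the start of the segment down to 1 at the end.
    for i in reversed(range(len(rewards))):
        R = rewards[i] + gamma * R
        advantages[i] = R - values[i]
    return advantages
```

For a 3-step segment the first state effectively gets a 3-step return and the last a 1-step return, which is exactly the state-dependent $k$ described above.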
The critics in A3C learn the value function while multiple actors are trained in parallel, each periodically syncing with the global parameters. Gradients are accumulated over several steps before each update, which improves stability; the overall scheme works like a parallelized form of stochastic gradient descent.
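The following is a hedged sketch of one such actor-learner loop, assuming PyTorch, a classic Gym-style environment API, and a shared `optimizer` built over the global model's parameters; all names here are hypothetical, and batching details are elided:

```python
import copy
import torch

def worker(global_model, optimizer, env_fn, t_max=5, gamma=0.99):
    # One asynchronous actor-learner. Assumes `global_model(obs)` returns
    # (policy_logits, value) and `env_fn()` builds a Gym-style environment
    # where reset() -> obs and step() -> (obs, reward, done, info).
    local_model = copy.deepcopy(global_model)  # thread-local copy
    env = env_fn()
    state, done = env.reset(), False
    while True:  # runs until stopped externally
        local_model.load_state_dict(global_model.state_dict())  # sync
        local_model.zero_grad()
        log_probs, values, rewards = [], [], []
        for _ in range(t_max):  # collect up to t_max transitions
            obs = torch.as_tensor(state, dtype=torch.float32)
            logits, value = local_model(obs)
            dist = torch.distributions.Categorical(logits=logits)
            action = dist.sample()
            state, reward, done, _ = env.step(action.item())
            log_probs.append(dist.log_prob(action))
            values.append(value)
            rewards.append(reward)
            if done:
                state = env.reset()
                break
        # Bootstrap from the last state unless the episode ended.
        with torch.no_grad():
            R = 0.0 if done else local_model(
                torch.as_tensor(state, dtype=torch.float32))[1].item()
        policy_loss = value_loss = 0.0
        for log_p, v, r in reversed(list(zip(log_probs, values, rewards))):
            R = r + gamma * R                 # n-step return
            advantage = R - v.item()          # constant w.r.t. the actor
            policy_loss = policy_loss - log_p * advantage
            value_loss = value_loss + (v - R) ** 2
        (policy_loss + 0.5 * value_loss).backward()
        # Hand the accumulated gradients to the shared parameters.
        for lp, gp in zip(local_model.parameters(),
                          global_model.parameters()):
            gp.grad = lp.grad
        optimizer.step()
```

In the paper the workers run Hogwild-style in parallel threads sharing parameters; a full implementation would also add an entropy bonus to the policy loss and launch many such workers concurrently.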
Note that while the parameters $\theta$ of the policy and $\theta_v$ of the value function are shown as being separate for generality, we always share some of the parameters in practice. We typically use a convolutional neural network that has one softmax output for the policy $\pi(a_t \mid s_t; \theta)$ and one linear output for the value function $V(s_t; \theta_v)$, with all non-output layers shared.
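A minimal PyTorch sketch of that shared architecture follows; the filter sizes match the standard Atari network used in the paper for $84 \times 84$ inputs, while the class name is illustrative:

```python
import torch
import torch.nn as nn

class ACNet(nn.Module):
    """Shared trunk with a softmax policy head and a linear value head."""
    def __init__(self, in_channels, num_actions):
        super().__init__()
        self.trunk = nn.Sequential(  # all non-output layers are shared
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256), nn.ReLU(),  # 9x9 for 84x84 input
        )
        self.policy_head = nn.Linear(256, num_actions)  # softmax logits
        self.value_head = nn.Linear(256, 1)             # linear V(s)

    def forward(self, x):
        h = self.trunk(x)
        return self.policy_head(h), self.value_head(h).squeeze(-1)
```

The softmax policy is obtained from the `policy_head` logits (e.g. via `torch.distributions.Categorical(logits=...)`), while the value head is a plain linear output, as described above.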