
What is: A3C?

Source: Asynchronous Methods for Deep Reinforcement Learning
Year: 2016
Data Source: CC BY-SA - https://paperswithcode.com

A3C, Asynchronous Advantage Actor Critic, is a policy gradient algorithm in reinforcement learning that maintains a policy $\pi\left(a_{t}\mid s_{t}; \theta\right)$ and an estimate of the value function $V\left(s_{t}; \theta_{v}\right)$. It operates in the forward view and uses a mix of $n$-step returns to update both the policy and the value function. The policy and the value function are updated after every $t_{\text{max}}$ actions or when a terminal state is reached. The update performed by the algorithm can be seen as $\nabla_{\theta'}\log\pi\left(a_{t}\mid s_{t}; \theta'\right)A\left(s_{t}, a_{t}; \theta, \theta_{v}\right)$, where $A\left(s_{t}, a_{t}; \theta, \theta_{v}\right)$ is an estimate of the advantage function given by:

$$\sum^{k-1}_{i=0}\gamma^{i}r_{t+i} + \gamma^{k}V\left(s_{t+k}; \theta_{v}\right) - V\left(s_{t}; \theta_{v}\right)$$

where $k$ can vary from state to state and is upper-bounded by $t_{\text{max}}$.
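
Below is a minimal Python sketch of how a worker might turn one rollout of rewards and critic values into these $k$-step advantage estimates. The function name and arguments are illustrative, not taken from the paper.

```python
# Minimal sketch of the k-step advantage estimate used in the A3C update.
# Rewards, values, and the bootstrap value are assumed to have been collected
# by one worker over at most t_max steps; all names here are illustrative.

def compute_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """Return k-step advantage estimates A(s_t, a_t) for one rollout.

    rewards:         [r_t, r_{t+1}, ..., r_{t+k-1}] for k <= t_max steps
    values:          [V(s_t), ..., V(s_{t+k-1})] from the critic
    bootstrap_value: V(s_{t+k}), or 0.0 if the rollout ended in a terminal state
    """
    advantages = []
    R = bootstrap_value
    # Work backwards so each step reuses the discounted return of its successor.
    for r, v in zip(reversed(rewards), reversed(values)):
        R = r + gamma * R          # k-step return: discounted rewards plus bootstrap
        advantages.append(R - v)   # advantage = k-step return minus the baseline V(s_t)
    advantages.reverse()
    return advantages
```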

In A3C, the critic learns the value function while multiple actor-learners run in parallel, each in its own copy of the environment, and periodically synchronize with a set of global parameters. Each worker accumulates gradients over its rollout before applying them to the global parameters, which makes the update behave like a parallelized form of stochastic gradient descent and improves stability.
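
As an illustration of this accumulate-and-sync pattern, here is a hedged PyTorch-style sketch of a single worker update. It assumes a shared `global_model`, a per-worker `local_model` with the same architecture, and an optimizer built over the global parameters; all names are assumptions for the example, not the paper's code.

```python
import torch

# Hedged sketch of one A3C worker update. Assumed wiring (illustrative):
#   global_model = SomeActorCriticNet(...); global_model.share_memory()
#   optimizer    = torch.optim.Adam(global_model.parameters(), lr=1e-4)
#   local_model  = SomeActorCriticNet(...)  # one per worker process

def worker_update(global_model, local_model, optimizer, loss):
    # Clear the worker's old gradients, then accumulate fresh ones locally.
    local_model.zero_grad()
    loss.backward()
    # Copy the accumulated local gradients onto the shared global parameters.
    for local_p, global_p in zip(local_model.parameters(), global_model.parameters()):
        global_p.grad = local_p.grad
    optimizer.step()  # optimizer was constructed over global_model.parameters()
    # Re-sync the worker with the freshly updated global parameters.
    local_model.load_state_dict(global_model.state_dict())
```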

Note that while the parameters $\theta$ of the policy and $\theta_{v}$ of the value function are shown as being separate for generality, we always share some of the parameters in practice. We typically use a convolutional neural network that has one softmax output for the policy $\pi\left(a_{t}\mid s_{t}; \theta\right)$ and one linear output for the value function $V\left(s_{t}; \theta_{v}\right)$, with all non-output layers shared.
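
A hedged sketch of such a shared network in PyTorch is shown below; the layer sizes assume the common 84x84 preprocessed inputs and are illustrative rather than prescribed by the text above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the shared actor-critic network described above: a small
# convolutional trunk shared by both heads, a softmax output for the policy
# and a single linear output for the value. Layer sizes are an assumption.

class ActorCritic(nn.Module):
    def __init__(self, in_channels, num_actions):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, 16, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=4, stride=2)
        self.fc = nn.Linear(32 * 9 * 9, 256)            # assumes 84x84 inputs
        self.policy_head = nn.Linear(256, num_actions)  # π(a_t | s_t; θ)
        self.value_head = nn.Linear(256, 1)             # V(s_t; θ_v)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.fc(x.flatten(start_dim=1)))
        # Policy head: softmax over actions; value head: scalar state value.
        return F.softmax(self.policy_head(x), dim=-1), self.value_head(x)
```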