
What is: Prioritized Experience Replay?

Source: Prioritized Experience Replay
Year: 2015
Data Source: CC BY-SA - https://paperswithcode.com

Prioritized Experience Replay is a type of experience replay in reinforcement learning where we more frequently replay transitions with high expected learning progress, as measured by the magnitude of their temporal-difference (TD) error. This prioritization can lead to a loss of diversity, which is alleviated with stochastic prioritization, and can introduce bias, which can be corrected with importance sampling.

The stochastic sampling method interpolates between pure greedy prioritization and uniform random sampling. The probability of being sampled is ensured to be monotonic in a transition's priority, while guaranteeing a non-zero probability even for the lowest-priority transition. Concretely, define the probability of sampling transition $i$ as

$$P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}$$

where $p_i > 0$ is the priority of transition $i$. The exponent $\alpha$ determines how much prioritization is used, with $\alpha = 0$ corresponding to the uniform case.
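As a minimal sketch of this sampling step (using NumPy; the priority values and function names here are purely illustrative), the probabilities and a sampled minibatch can be computed directly:

```python
import numpy as np

def sampling_probabilities(priorities, alpha=0.6):
    # P(i) = p_i^alpha / sum_k p_k^alpha
    scaled = np.asarray(priorities, dtype=np.float64) ** alpha
    return scaled / scaled.sum()

# Illustrative priorities, e.g. |TD error| + epsilon for each stored transition
priorities = [2.0, 0.5, 1.0, 0.1]
probs = sampling_probabilities(priorities, alpha=0.6)
batch_indices = np.random.choice(len(priorities), size=2, p=probs)
```

In a real implementation the denominator is not recomputed over the whole buffer every step; the paper implements the proportional variant with a sum-tree so that sampling and priority updates stay logarithmic in the buffer size.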

Prioritized replay introduces bias because it changes the sampling distribution in an uncontrolled fashion, and therefore changes the solution that the estimates will converge to. We can correct this bias by using importance-sampling (IS) weights:

$$w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^{\beta}$$

which fully compensate for the non-uniform probabilities $P(i)$ if $\beta = 1$. These weights can be folded into the Q-learning update by using $w_i \delta_i$ instead of $\delta_i$ (weighted IS rather than ordinary IS). For stability reasons, we always normalize weights by $1/\max_i w_i$ so that they only scale the update downwards.
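Continuing the sketch above (all names illustrative), the IS weights for a sampled minibatch could be computed and max-normalized like this:

```python
import numpy as np

def importance_weights(probs, batch_indices, beta=0.5):
    # w_i = (1/N * 1/P(i))^beta, normalized by max_i w_i so updates only scale down
    N = len(probs)
    weights = (N * probs[batch_indices]) ** (-beta)
    return weights / weights.max()

# probs: sampling probabilities over the whole buffer; batch_indices: sampled minibatch
probs = np.array([0.45, 0.20, 0.30, 0.05])
batch_indices = np.array([0, 2])
weights = importance_weights(probs, batch_indices, beta=0.5)
# The Q-learning update then uses weights * td_errors in place of td_errors.
```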

The two variants of prioritization are proportional, where $p_i = |\delta_i| + \epsilon$, and rank-based, where $p_i = \frac{1}{\text{rank}(i)}$ and $\text{rank}(i)$ is the rank of transition $i$ when the replay memory is sorted according to $|\delta_i|$. The hyperparameters used were $\alpha = 0.7$, $\beta_0 = 0.5$ for the rank-based variant and $\alpha = 0.6$, $\beta_0 = 0.4$ for the proportional variant, where $\beta_0$ is the initial value of $\beta$, annealed towards 1 over the course of training.
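Both priority rules are straightforward to express; a sketch follows, with $\epsilon$ as a small positive constant (the `1e-6` default here is a hypothetical choice, not taken from the paper):

```python
import numpy as np

def proportional_priorities(td_errors, epsilon=1e-6):
    # p_i = |delta_i| + epsilon, so every transition keeps a non-zero priority
    return np.abs(td_errors) + epsilon

def rank_based_priorities(td_errors):
    # p_i = 1 / rank(i), with rank 1 given to the largest |delta_i|
    order = np.argsort(-np.abs(td_errors))
    ranks = np.empty(len(td_errors), dtype=np.float64)
    ranks[order] = np.arange(1, len(td_errors) + 1)
    return 1.0 / ranks

td_errors = np.array([0.8, -0.2, 1.5, 0.05])
print(proportional_priorities(td_errors))  # ~[0.8, 0.2, 1.5, 0.05]
print(rank_based_priorities(td_errors))    # [1/2, 1/3, 1/1, 1/4]
```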