What is: DistilBERT?
| Source | DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter |
| Year | 2019 |
| Data Source | CC BY-SA - https://paperswithcode.com |
DistilBERT is a small, fast, cheap, and light Transformer model based on the BERT architecture. Knowledge distillation is performed during the pre-training phase, reducing the size of a BERT model by 40% while retaining most of its language understanding capability. To leverage the inductive biases learned by larger models during pre-training, the authors introduce a triple loss combining masked language modeling, distillation (soft targets from the teacher), and cosine-distance losses between the student's and teacher's hidden states.
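The triple loss can be sketched numerically. The following is a minimal NumPy illustration, not the paper's implementation: the function name `triple_loss`, the temperature `T`, and the weights `alpha`, `beta`, `gamma` are assumed for illustration, not taken from the paper.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax over the vocabulary axis.
    e = np.exp(z / T - np.max(z / T, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def triple_loss(student_logits, teacher_logits, target_ids,
                student_hidden, teacher_hidden,
                T=2.0, alpha=1.0, beta=1.0, gamma=1.0):
    """Illustrative weighted sum of the three DistilBERT training losses.
    T, alpha, beta, gamma are hypothetical hyperparameters."""
    p_s = softmax(student_logits, T)
    p_t = softmax(teacher_logits, T)
    # Distillation loss: cross-entropy of the student against the
    # teacher's temperature-softened output distribution (soft targets).
    l_ce = -np.mean(np.sum(p_t * np.log(p_s + 1e-12), axis=-1))
    # Masked language modeling loss: cross-entropy against true tokens.
    probs = softmax(student_logits)
    l_mlm = -np.mean(np.log(
        probs[np.arange(len(target_ids)), target_ids] + 1e-12))
    # Cosine-distance loss: align the directions of student and
    # teacher hidden-state vectors.
    cos = np.sum(student_hidden * teacher_hidden, axis=-1) / (
        np.linalg.norm(student_hidden, axis=-1)
        * np.linalg.norm(teacher_hidden, axis=-1) + 1e-12)
    l_cos = np.mean(1.0 - cos)
    return alpha * l_ce + beta * l_mlm + gamma * l_cos
```

Note that when the student's hidden states exactly match the teacher's, the cosine term vanishes, so that component of the loss pushes the smaller model's internal representations toward the larger model's.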