What is: DeBERTa?
Source | DeBERTa: Decoding-enhanced BERT with Disentangled Attention |
Year | 2020 |
Data Source | CC BY-SA - https://paperswithcode.com |
DeBERTa is a Transformer-based neural language model that improves on the BERT and RoBERTa models with two techniques: a disentangled attention mechanism and an enhanced mask decoder. In the disentangled attention mechanism, each word is represented using two vectors that encode its content and its position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions. The enhanced mask decoder replaces the output softmax layer to predict the masked tokens during model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve the model's generalization on downstream tasks.
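To make the disentangled attention mechanism concrete, below is a minimal single-head sketch in PyTorch. The function name `disentangled_attention`, the projection matrices `Wq_c`, `Wk_c`, `Wq_r`, `Wk_r`, and the flat relative-position table are illustrative assumptions, not the paper's code; the actual DeBERTa implementation adds multi-head projections, bucketed relative distances, and dropout. It does show the three score terms the paper combines: content-to-content, content-to-position, and position-to-content.

```python
# Minimal sketch of disentangled attention (single head, toy sizes).
# Names and shapes here are assumptions for illustration only.
import torch
import torch.nn.functional as F

def disentangled_attention(content, rel_pos_emb, Wq_c, Wk_c, Wq_r, Wk_r):
    """content:      (L, d) content vectors, one per token
    rel_pos_emb:  (2L-1, d) embeddings for relative distances -(L-1)..(L-1)
    Wq_c, Wk_c:   (d, d) query/key projections for content
    Wq_r, Wk_r:   (d, d) query/key projections for relative positions
    """
    L, d = content.shape
    Qc, Kc = content @ Wq_c, content @ Wk_c          # content projections
    Qr, Kr = rel_pos_emb @ Wq_r, rel_pos_emb @ Wk_r  # position projections

    # delta[i, j] indexes the embedding for relative distance i - j.
    idx = torch.arange(L)
    delta = idx[:, None] - idx[None, :] + L - 1      # values in [0, 2L-2]

    # Content-to-content: standard attention between content vectors.
    c2c = Qc @ Kc.T
    # Content-to-position: token i's content scores against the key
    # embedding of relative distance delta(i, j).
    c2p = torch.gather(Qc @ Kr.T, 1, delta)
    # Position-to-content: the query embedding of relative distance
    # delta(j, i) scores against token j's content.
    p2c = torch.gather(Kc @ Qr.T, 1, delta).T

    scores = (c2c + c2p + p2c) / (3 * d) ** 0.5      # scale by sqrt(3d)
    return F.softmax(scores, dim=-1) @ content

# Toy usage with random weights.
L, d = 8, 16
x = torch.randn(L, d)
rel = torch.randn(2 * L - 1, d)
W = [torch.randn(d, d) for _ in range(4)]
out = disentangled_attention(x, rel, *W)             # shape (8, 16)
```

Note the sqrt(3d) scaling factor: because three score terms are summed instead of one, the paper scales by the variance of the combined score rather than the usual sqrt(d).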