**ALBERT** is a [Transformer](https://paperswithcode.com/method/transformer) architecture based on [BERT](https://paperswithcode.com/method/bert) but with far fewer parameters. It achieves this through two parameter reduction techniques. The first is a factorized embedding parameterization. By decomposing the large vocabulary embedding matrix into two small matrices, the size of the hidden layers is separated from the size of the vocabulary embedding. This makes it easier to grow the hidden size without significantly increasing the number of parameters in the vocabulary embeddings. The second technique is cross-layer parameter sharing, which prevents the number of parameters from growing with the depth of the network.
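The effect of both techniques on parameter counts can be sketched with simple arithmetic. The snippet below is a minimal illustration, not ALBERT's implementation: the function names are hypothetical, and the vocabulary size, hidden size, embedding size, and per-layer parameter count are assumed values chosen only to make the comparison concrete.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Count embedding parameters, with or without factorization."""
    if embedding_size is None:
        # Single V x H matrix: vocabulary embedding size tied to hidden size.
        return vocab_size * hidden_size
    # V x E lookup table followed by an E x H projection.
    return vocab_size * embedding_size + embedding_size * hidden_size


def encoder_params(per_layer_params, num_layers, shared):
    """Count encoder parameters with or without cross-layer sharing."""
    # With sharing, one set of layer weights is reused at every depth.
    return per_layer_params if shared else per_layer_params * num_layers


# Illustrative sizes (assumptions, not taken from the text above).
V, H, E = 30_000, 768, 128
full = embedding_params(V, H)
factorized = embedding_params(V, H, E)
print(full, factorized)  # 23040000 3938304

PER_LAYER, LAYERS = 7_000_000, 12  # hypothetical per-layer parameter count
print(encoder_params(PER_LAYER, LAYERS, shared=False))  # 84000000
print(encoder_params(PER_LAYER, LAYERS, shared=True))   # 7000000
```

Under these assumed sizes, growing the hidden size only enlarges the small E x H projection, and adding layers leaves the shared encoder count unchanged.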

Additionally, ALBERT utilises a self-supervised loss for sentence-order prediction (SOP). SOP primarily focuses on inter-sentence coherence and is designed to address the ineffectiveness of the next sentence prediction (NSP) loss proposed in the original BERT.
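As a rough sketch of how SOP training pairs can be built: a positive example is two consecutive segments in their original order, and a negative example is the same two segments with the order swapped. The function name, the 50% swap probability, and the label convention below are assumptions made for illustration.

```python
import random


def make_sop_example(segment_a, segment_b, rng):
    """Build one SOP pair from two consecutive segments of a document.

    Returns ((first, second), label), where label 1 means the original
    order was kept and label 0 means the segments were swapped.
    """
    if rng.random() < 0.5:  # assumed: keep and swap are equally likely
        return (segment_a, segment_b), 1
    return (segment_b, segment_a), 0


rng = random.Random(0)  # seeded only to make the sketch repeatable
pair, label = make_sop_example("She opened the door.", "Then she walked in.", rng)
print(pair, label)
```

Because both segments always come from the same document, the classifier cannot rely on topic differences and has to model ordering and coherence instead.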

The **Convolutional vision Transformer (CvT)** is an architecture which incorporates convolutions into the [Transformer](https://paperswithcode.com/method/transformer). The CvT design introduces convolutions to two core sections of the ViT architecture.

First, the Transformers are partitioned into multiple stages that form a hierarchical structure. Each stage begins with a convolutional token embedding that applies an overlapping, strided [convolution](https://paperswithcode.com/method/convolution) to a 2D-reshaped token map (i.e., the flattened token sequence reshaped back to the spatial grid), followed by [layer normalization](https://paperswithcode.com/method/layer-normalization). This lets the model capture local information while progressively decreasing the sequence length and increasing the token feature dimension across stages, which mirrors the spatial downsampling and growing number of feature maps found in CNNs.
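The shrinking sequence length can be traced with the standard convolution output-size formula. The sketch below is illustrative only: the helper names are hypothetical, and the input resolution, kernel sizes, strides, and paddings are assumed example values rather than settings stated in the text above.

```python
def conv_output_size(size, kernel, stride, padding):
    """Standard output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def stage_token_grid(height, width, kernel, stride, padding):
    """Token grid (rows, cols, token count) after a convolutional token embedding."""
    h = conv_output_size(height, kernel, stride, padding)
    w = conv_output_size(width, kernel, stride, padding)
    return h, w, h * w


# Assumed three-stage configuration: (kernel, stride, padding) per stage.
stages = [(7, 4, 2), (3, 2, 1), (3, 2, 1)]
h, w = 224, 224  # assumed input resolution
for kernel, stride, padding in stages:
    h, w, tokens = stage_token_grid(h, w, kernel, stride, padding)
    print(h, w, tokens)
# 56 56 3136
# 28 28 784
# 14 14 196
```

The kernel is larger than the stride in every stage, so neighbouring tokens are computed from overlapping patches.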

Second, the linear projection prior to every self-attention block in the Transformer module is replaced with a proposed convolutional projection, which applies an s × s depth-wise separable convolution to a 2D-reshaped token map. This allows the model to further capture local spatial context and reduce semantic ambiguity in the attention mechanism. It also permits management of computational complexity, as the stride of the convolution can be used to subsample the key and value matrices to improve efficiency by 4× or more, with minimal degradation of performance.
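The 4× figure follows from subsampling keys and values on a 2D grid: a stride of 2 halves each spatial axis, so the attention score matrix has a quarter as many columns. The following is a minimal counting sketch; the function names, the 56 × 56 token map, and the 3 × 3 kernel with padding 1 are assumptions for illustration.

```python
def conv_output_size(size, kernel, stride, padding):
    """Standard output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def attention_score_entries(height, width, kv_stride, kernel=3, padding=1):
    """Number of query-key score entries for one attention head.

    Queries keep the full token map; keys and values are produced by a
    convolutional projection with the given stride.
    """
    q_tokens = height * width
    kv_h = conv_output_size(height, kernel, kv_stride, padding)
    kv_w = conv_output_size(width, kernel, kv_stride, padding)
    return q_tokens * kv_h * kv_w


dense = attention_score_entries(56, 56, kv_stride=1)
strided = attention_score_entries(56, 56, kv_stride=2)
print(dense, strided, dense / strided)  # 9834496 2458624 4.0
```

A larger stride reduces the key and value count further, which is where the "or more" comes from.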

CvT: Introducing Convolutions to Vision Transformers

ALBERT

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

**WaveGrad DBlocks** are used to downsample the temporal dimension of the noisy waveform in [WaveGrad](https://paperswithcode.com/method/wavegrad). They are similar to UBlocks except that only one [residual block](https://paperswithcode.com/method/residual-block) is included. The dilation factors in the main branch are 1, 2, and 4. Orthogonal initialization is used.
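To give a feel for what dilation factors of 1, 2, and 4 buy, the sketch below computes the receptive field of a stack of stride-1 dilated convolutions. The function name is hypothetical and the kernel size of 3 is an assumption; the text above does not state the kernel size.

```python
def stacked_receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of stacked stride-1 dilated convolutions."""
    field = 1
    for dilation in dilations:
        # Each layer widens the field by (kernel_size - 1) * dilation samples.
        field += (kernel_size - 1) * dilation
    return field


# Assumed kernel size of 3 with the dilation factors named above.
print(stacked_receptive_field(3, [1, 2, 4]))  # 15
print(stacked_receptive_field(3, [1, 1, 1]))  # 7
```

Under this assumption, the dilated stack covers roughly twice as many samples as three undilated layers with the same number of weights.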

Source	ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
Year	2019
Data Source	CC BY-SA - https://paperswithcode.com