What is: XCiT Layer?
Source | XCiT: Cross-Covariance Image Transformers |
Year | 2021 |
Data Source | CC BY-SA - https://paperswithcode.com |
An XCiT Layer is the main building block of the XCiT architecture, which uses a cross-covariance attention operator as its principal operation. The layer consists of three main blocks, each preceded by LayerNorm and followed by a residual connection: (i) the core cross-covariance attention (XCA) operation, (ii) the local patch interaction (LPI) module, and (iii) a feed-forward network (FFN). Because XCA transposes the query-key interaction, attention is computed over feature channels rather than over tokens, so its computational complexity is linear in the number of tokens N rather than quadratic as in conventional self-attention.
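The structure above maps naturally to code. Below is a minimal PyTorch sketch of the three sub-blocks, not the reference implementation: the module names CrossCovarianceAttention, LocalPatchInteraction, and XCiTLayer are illustrative, and details of the full model such as LayerScale, dropout, and the class-attention stage are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossCovarianceAttention(nn.Module):
    """XCA sketch: attention is computed over the feature (channel) dimension,
    giving a (head_dim x head_dim) map per head instead of an (N x N) map over tokens."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # learnable temperature rescaling the normalized attention logits
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))

    def forward(self, x):                                   # x: (B, N, C)
        B, N, C = x.shape
        head_dim = C // self.num_heads
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, head_dim)
        q, k, v = qkv.permute(2, 0, 3, 4, 1).unbind(0)      # each: (B, heads, head_dim, N)

        # L2-normalize queries and keys along the token axis
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)

        # cross-covariance attention map: cost scales linearly with N
        attn = (q @ k.transpose(-2, -1)) * self.temperature  # (B, heads, head_dim, head_dim)
        attn = attn.softmax(dim=-1)

        out = attn @ v                                       # (B, heads, head_dim, N)
        out = out.permute(0, 3, 1, 2).reshape(B, N, C)
        return self.proj(out)


class LocalPatchInteraction(nn.Module):
    """Simplified LPI sketch: two depth-wise 3x3 convolutions that let
    neighbouring patches communicate (XCA itself mixes channels, not tokens)."""

    def __init__(self, dim):
        super().__init__()
        self.conv1 = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.act = nn.GELU()
        self.bn = nn.BatchNorm2d(dim)
        self.conv2 = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)

    def forward(self, x, H, W):                              # x: (B, N, C) with N == H * W
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, H, W)            # back to a 2D patch grid
        x = self.conv2(self.bn(self.act(self.conv1(x))))
        return x.reshape(B, C, N).transpose(1, 2)


class XCiTLayer(nn.Module):
    """XCiT layer: pre-norm XCA, LPI, and FFN sub-blocks, each with a residual connection."""

    def __init__(self, dim, num_heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.xca = CrossCovarianceAttention(dim, num_heads)
        self.norm2 = nn.LayerNorm(dim)
        self.lpi = LocalPatchInteraction(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim)
        )

    def forward(self, x, H, W):
        x = x + self.xca(self.norm1(x))
        x = x + self.lpi(self.norm2(x), H, W)
        x = x + self.ffn(self.norm3(x))
        return x


# usage: a batch of 2 images tokenized into a 14x14 grid of 192-dim patch embeddings
tokens = torch.randn(2, 14 * 14, 192)
layer = XCiTLayer(dim=192, num_heads=4)
print(layer(tokens, H=14, W=14).shape)                       # torch.Size([2, 196, 192])
```

In this sketch the attention map has shape (head_dim x head_dim) regardless of how many patches are processed, which is what makes the cost of XCA grow linearly with the number of tokens N.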