What is: Fixup Initialization?
Source | Fixup Initialization: Residual Learning Without Normalization |
Year | 2019 |
Data Source | CC BY-SA - https://paperswithcode.com |
Fixup Initialization, or Fixed-Update Initialization, is an initialization method that rescales the standard initialization of residual branches to account for the network architecture. Fixup aims to enable stable training of very deep residual networks at a maximal learning rate without normalization.
The steps are as follows (a minimal code sketch follows the list):

- Initialize the classification layer and the last layer of each residual branch to 0.
- Initialize every other layer using a standard method, e.g. Kaiming Initialization, and scale only the weight layers inside residual branches by $L^{-\frac{1}{2m-2}}$, where $L$ is the number of residual branches and $m$ is the number of layers inside each branch.
- Add a scalar multiplier (initialized at 1) in every branch and a scalar bias (initialized at 0) before each convolution, linear, and element-wise activation layer.
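The recipe above maps directly onto a few lines of PyTorch. The sketch below is illustrative rather than the authors' reference implementation: it applies Fixup to a toy fully-connected residual network (names such as `FixupBlock`, `FixupNet`, `width`, and `num_blocks` are assumptions made here), zero-initializing the classifier and the last layer of each branch, rescaling the other branch layer by $L^{-\frac{1}{2m-2}}$, and adding the scalar multiplier and biases.

```python
import torch
import torch.nn as nn


class FixupBlock(nn.Module):
    """A residual branch with m = 2 linear layers, no normalization (sketch)."""

    def __init__(self, width: int, num_blocks: int, branch_depth: int = 2):
        super().__init__()
        self.fc1 = nn.Linear(width, width, bias=False)
        self.fc2 = nn.Linear(width, width, bias=False)
        self.relu = nn.ReLU(inplace=True)

        # Scalar biases (initialized at 0) before each weight layer / activation,
        # and a scalar multiplier (initialized at 1) on the residual branch.
        self.bias1a = nn.Parameter(torch.zeros(1))
        self.bias1b = nn.Parameter(torch.zeros(1))
        self.bias2a = nn.Parameter(torch.zeros(1))
        self.bias2b = nn.Parameter(torch.zeros(1))
        self.scale = nn.Parameter(torch.ones(1))

        # Standard (Kaiming) init, then rescale by L^(-1/(2m-2)),
        # where L is the number of residual branches and m the branch depth.
        nn.init.kaiming_normal_(self.fc1.weight, nonlinearity="relu")
        with torch.no_grad():
            self.fc1.weight.mul_(num_blocks ** (-1.0 / (2 * branch_depth - 2)))

        # Last layer of the residual branch starts at 0, so the block is the
        # identity map at initialization.
        nn.init.zeros_(self.fc2.weight)

    def forward(self, x):
        out = self.fc1(x + self.bias1a)
        out = self.relu(out + self.bias1b)
        out = self.fc2(out + self.bias2a)
        return x + self.scale * out + self.bias2b


class FixupNet(nn.Module):
    def __init__(self, width: int = 64, num_blocks: int = 8, num_classes: int = 10):
        super().__init__()
        self.blocks = nn.ModuleList(
            [FixupBlock(width, num_blocks) for _ in range(num_blocks)]
        )
        self.classifier = nn.Linear(width, num_classes)
        # Classification layer initialized to 0.
        nn.init.zeros_(self.classifier.weight)
        nn.init.zeros_(self.classifier.bias)

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return self.classifier(x)


net = FixupNet()
logits = net(torch.randn(4, 64))  # all zeros at initialization
```

At initialization every residual block reduces to the identity map and the classifier outputs zeros, which is the property Fixup relies on to train very deep residual networks at a large learning rate without normalization layers.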