XGPT is a cross-modal generative pre-training method for image captioning. It pre-trains text-to-image caption generators through three novel generation tasks: image-conditioned masked language modeling (IMLM), image-conditioned denoising autoencoding (IDA), and text-conditioned image feature generation (TIFG). The pre-trained XGPT can be fine-tuned without any task-specific architecture modifications to build strong image captioning models.

**Layer-Sequential Unit-Variance Initialization** (**LSUV**) is a simple weight initialization method for deep network learning. The initialization strategy involves the following two steps:

1) First, pre-initialize the weights of each [convolution](https://paperswithcode.com/method/convolution) or inner-product layer with orthonormal matrices.

2) Second, proceed from the first to the final layer, normalizing the variance of the output of each layer to be equal to one.
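The two steps above can be sketched in NumPy for a small stack of fully connected layers. This is a minimal illustration, not a reference implementation: the function names `orthonormal` and `lsuv_init`, the ReLU between layers, the tolerance `tol`, and the iteration cap `max_iter` are all assumptions made for the example.

```python
import numpy as np


def orthonormal(shape, rng):
    """Return a (rows, cols) matrix whose columns (or rows) are orthonormal."""
    rows, cols = shape
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, _ = np.linalg.qr(a)  # reduced QR: q has orthonormal columns
    return q if rows >= cols else q.T


def lsuv_init(layer_shapes, batch, tol=0.01, max_iter=10, seed=0):
    """Sketch of LSUV for dense layers with weights of shape (in, out).

    `tol`, `max_iter`, and the ReLU non-linearity are illustrative choices.
    """
    rng = np.random.default_rng(seed)

    # Step 1: pre-initialize every layer with an orthonormal matrix.
    weights = [orthonormal(shape, rng) for shape in layer_shapes]

    # Step 2: go from the first to the last layer, rescaling each weight
    # matrix until the variance of that layer's output on the batch is ~1.
    x = batch
    for i, w in enumerate(weights):
        for _ in range(max_iter):
            var = np.var(x @ w)
            if abs(var - 1.0) < tol:
                break
            w = w / np.sqrt(var)
        weights[i] = w
        x = np.maximum(x @ w, 0.0)  # feed the activated output to the next layer
    return weights
```

A data batch is required because the output variance is measured empirically, layer by layer, rather than derived from the weight shapes alone.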

LSUV Initialization

All you need is a good init

XGPT

XGPT: Cross-modal Generative Pre-Training for Image Captioning

**Serf**, or **Log-Softplus ERror activation Function**, is a type of activation function which is self-regularized and nonmonotonic in nature. It belongs to the [Swish](https://paperswithcode.com/method/swish) family of functions. Serf is defined as:

$$f\left(x\right) = x\text{erf}\left(\ln\left(1 + e^{x}\right)\right)$$
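As a quick illustration of the formula, the scalar sketch below evaluates Serf using only the Python standard library. The helper name `softplus` and its numerically stable form of $\ln(1 + e^{x})$ are choices made for this example.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def serf(x: float) -> float:
    """Serf activation: x * erf(ln(1 + e^x))."""
    return x * math.erf(softplus(x))


if __name__ == "__main__":
    # Print a few sample points across negative and positive inputs.
    for value in (-5.0, -1.0, -0.1, 0.0, 1.0, 5.0):
        print(f"serf({value}) = {serf(value):.4f}")
```

For large positive inputs the function approaches the identity, while for negative inputs it dips below zero and then returns toward zero, which is the nonmonotonic behavior mentioned above.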

Source	XGPT: Cross-modal Generative Pre-Training for Image Captioning
Year	2020
Data Source	CC BY-SA - https://paperswithcode.com