**Shrink and Fine-Tune**, or **SFT**, is a type of distillation that avoids explicit distillation by copying parameters to a student model and then fine-tuning. Specifically, it extracts a student model from the maximally spaced layers of a fine-tuned teacher. Each layer $l \in L'$ is copied fully from $L$. For example, when creating a [BART](https://paperswithcode.com/method/bart) student with 3 decoder layers from a teacher with 12 encoder layers and 12 decoder layers, we copy the teacher’s full $Enc^{L}$ and decoder layers 0, 6, and 11 to the student. When deciding which layers to copy, we break ties arbitrarily; copying layers 0, 5, and 11 might work just as well. When copying only 1 decoder layer, we copy layer 0, which was found to work better than copying layer 11. The impact of initialization on performance is measured experimentally in Section 6.1. After initialization, the student model continues to fine-tune on the summarization dataset, with the objective of minimizing $\mathcal{L}_{Data}$.
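
The layer selection described above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function names `pick_layers` and `shrink` are hypothetical, teacher layers are stood in for by plain Python objects, and ties are broken by rounding half up (an assumption chosen so that 3 of 12 layers gives 0, 6, 11 as in the example).

```python
import copy


def pick_layers(n_teacher, n_student):
    """Return indices of maximally spaced teacher layers to copy.

    Assumption: ties are broken by rounding half up; the text says the
    tie-break is arbitrary.
    """
    if not 1 <= n_student <= n_teacher:
        raise ValueError("need 1 <= n_student <= n_teacher")
    if n_student == 1:
        # The text reports that copying layer 0 worked better than layer 11.
        return [0]
    span = n_teacher - 1   # distance between first and last teacher layer
    gaps = n_student - 1   # number of intervals between chosen layers
    # Integer form of round_half_up(i * span / gaps).
    return [(2 * i * span + gaps) // (2 * gaps) for i in range(n_student)]


def shrink(teacher_layers, n_student):
    """Copy the selected teacher layers in full to build the student stack."""
    indices = pick_layers(len(teacher_layers), n_student)
    return [copy.deepcopy(teacher_layers[i]) for i in indices]


# Stand-in "layers": each is just a dict holding a name.
teacher_decoder = [{"name": f"decoder_layer_{i}"} for i in range(12)]
student_decoder = shrink(teacher_decoder, 3)
print([layer["name"] for layer in student_decoder])
```

After this initialization step, the student would be fine-tuned on the summarization data as usual; that training loop is outside the scope of the sketch.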

**Precise RoI Pooling**, or **PrRoI Pooling**, is a region of interest feature extractor that avoids any quantization of coordinates and has a continuous gradient on bounding box coordinates. Given the feature map $\mathcal{F}$ before RoI/PrRoI Pooling (e.g. from Conv4 in [ResNet](https://paperswithcode.com/method/resnet)-50), let $w_{i,j}$ be the feature at one discrete location $(i,j)$ on the feature map. Using bilinear interpolation, the discrete feature map can be treated as continuous at any real-valued coordinates $(x,y)$:

$$
f(x,y) = \sum_{i,j}IC(x,y,i,j) \times w_{i,j},
$$

where $IC(x,y,i,j) = \max(0,1-|x-i|)\times \max(0,1-|y-j|)$ is the interpolation coefficient. Then denote a bin of a RoI as $bin=\{(x_1,y_1),(x_2,y_2)\}$, where $(x_1,y_1)$ and $(x_2,y_2)$ are the continuous coordinates of the top-left and bottom-right points, respectively. We perform pooling (e.g. [average pooling](https://paperswithcode.com/method/average-pooling)) given $bin$ and feature map $\mathcal{F}$ by computing a double integral:

$$
\text{PrPool}(bin, \mathcal{F}) = \frac{\int_{y_1}^{y_2}\int_{x_1}^{x_2} f(x,y)\,dx\,dy}{(x_2-x_1)\times(y_2-y_1)}
$$
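
A minimal Python sketch of the interpolation and of average pooling over a single bin follows. It assumes the pooled value is the integral of $f$ over the bin divided by the bin area, and that `w[i][j]` stores the feature at $(i,j)$ with $x$ paired with the first index. The function names are illustrative only. Because $IC$ factorises into two 1-D hat functions, the integral is evaluated in closed form rather than by sampling.

```python
def interp_coeff(x, y, i, j):
    """IC(x, y, i, j): product of two 1-D hat functions centred at i and j."""
    return max(0.0, 1.0 - abs(x - i)) * max(0.0, 1.0 - abs(y - j))


def f(x, y, w):
    """Continuous feature value at (x, y) from the discrete map w[i][j]."""
    return sum(interp_coeff(x, y, i, j) * w[i][j]
               for i in range(len(w)) for j in range(len(w[0])))


def hat_integral(a, b, center):
    """Exact integral of max(0, 1 - |t - center|) for t in [a, b]."""
    def cdf(t):
        # Integral of the hat function from -infinity up to t.
        if t <= -1.0:
            return 0.0
        if t <= 0.0:
            return 0.5 * (t + 1.0) ** 2
        if t < 1.0:
            return 1.0 - 0.5 * (1.0 - t) ** 2
        return 1.0
    return cdf(b - center) - cdf(a - center)


def prroi_pool(w, x1, y1, x2, y2):
    """Average of f over the bin [x1, x2] x [y1, y2], with no quantization."""
    total = 0.0
    for i in range(len(w)):
        gx = hat_integral(x1, x2, i)
        if gx == 0.0:
            continue  # this row of features does not overlap the bin
        for j in range(len(w[0])):
            total += w[i][j] * gx * hat_integral(y1, y2, j)
    return total / ((x2 - x1) * (y2 - y1))


# Feature map whose value equals the first coordinate, so f(x, y) = x inside the grid.
w = [[float(i) for _ in range(5)] for i in range(5)]
print(f(1.5, 2.0, w), prroi_pool(w, 0.5, 1.0, 2.5, 3.0))
```

Since the pooled value is a smooth function of $x_1, y_1, x_2, y_2$, its gradient with respect to the box coordinates is well defined, which is the property the text highlights.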

Acquisition of Localization Confidence for Accurate Object Detection

Pre-trained Summarization Distillation

**Layer-Sequential Unit-Variance Initialization** (**LSUV**) is a simple weight initialization method for deep networks. The initialization strategy involves the following two steps:

1) First, pre-initialize the weights of each [convolution](https://paperswithcode.com/method/convolution) or inner-product layer with orthonormal matrices.

2) Second, proceed from the first to the final layer, normalizing the variance of the output of each layer to be equal to one.
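
The two steps can be sketched in plain Python for a stack of inner-product layers. Several details here are assumptions made for illustration rather than taken from the text: the orthonormal matrices are produced by Gram-Schmidt on Gaussian rows (which requires each layer to have no more outputs than inputs), a ReLU is placed between layers, variance is measured on one data batch, and the names `lsuv_init`, `tol`, and `max_iter` are hypothetical.

```python
import math
import random


def orthonormal_rows(n_out, n_in, rng):
    """Gaussian matrix with rows orthonormalised by Gram-Schmidt (n_out <= n_in)."""
    rows = []
    for _ in range(n_out):
        v = [rng.gauss(0.0, 1.0) for _ in range(n_in)]
        for u in rows:
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows


def linear(W, batch):
    """Apply the inner-product layer W to every sample in the batch."""
    return [[sum(w * x for w, x in zip(row, sample)) for row in W]
            for sample in batch]


def variance(batch):
    """Variance over all output values in the batch."""
    values = [v for sample in batch for v in sample]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def lsuv_init(layer_sizes, batch, tol=0.01, max_iter=10, seed=0):
    rng = random.Random(seed)
    weights, activations = [], batch
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        # Step 1: orthonormal pre-initialization.
        W = orthonormal_rows(n_out, n_in, rng)
        # Step 2: rescale until the layer's output variance is close to one.
        for _ in range(max_iter):
            var = variance(linear(W, activations))
            if abs(var - 1.0) < tol:
                break
            scale = 1.0 / math.sqrt(var)
            W = [[w * scale for w in row] for row in W]
        weights.append(W)
        # Assumed ReLU nonlinearity before feeding the next layer.
        activations = [[max(0.0, v) for v in sample]
                       for sample in linear(W, activations)]
    return weights


data_rng = random.Random(1)
batch = [[data_rng.gauss(0.0, 3.0) for _ in range(16)] for _ in range(64)]
weights = lsuv_init([16, 16, 8], batch)
```

Layers are processed in order because each layer's output variance depends on the already-normalized layers before it.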

Source	Pre-trained Summarization Distillation
Year	2000
Data Source	CC BY-SA - https://paperswithcode.com