
What is: Discriminative Fine-Tuning?

Source: Universal Language Model Fine-tuning for Text Classification
Year: 2018
Data Source: CC BY-SA - https://paperswithcode.com

Discriminative Fine-Tuning is a fine-tuning strategy used for ULMFiT-type models. Instead of using the same learning rate for all layers of the model, discriminative fine-tuning allows us to tune each layer with a different learning rate. For context, the regular stochastic gradient descent (SGD) update of a model's parameters $\theta$ at time step $t$ looks like the following (Ruder, 2016):

$$\theta_{t} = \theta_{t-1} - \eta \cdot \nabla_{\theta} J\left(\theta\right)$$
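As a point of reference, here is a minimal sketch of this single-learning-rate update on a toy quadratic objective, using PyTorch autograd; the objective and the numerical values are purely illustrative.

```python
import torch

# Toy objective J(theta) = 0.5 * ||theta||^2, so grad J(theta) = theta.
theta = torch.tensor([1.0, -2.0, 0.5], requires_grad=True)
eta = 0.1  # single learning rate shared by all parameters

loss = 0.5 * (theta ** 2).sum()
loss.backward()

with torch.no_grad():
    theta -= eta * theta.grad  # theta_t = theta_{t-1} - eta * grad J(theta)
theta.grad.zero_()
```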

where $\eta$ is the learning rate and $\nabla_{\theta} J\left(\theta\right)$ is the gradient with regard to the model's objective function. For discriminative fine-tuning, we split the parameters $\theta$ into $\{\theta_{1}, \ldots, \theta_{L}\}$, where $\theta_{l}$ contains the parameters of the model at the $l$-th layer and $L$ is the number of layers of the model. Similarly, we obtain $\{\eta_{1}, \ldots, \eta_{L}\}$, where $\eta_{l}$ is the learning rate of the $l$-th layer. The SGD update with discriminative fine-tuning is then:

$$\theta_{t}^{l} = \theta_{t-1}^{l} - \eta^{l} \cdot \nabla_{\theta^{l}} J\left(\theta\right)$$

The authors find that it empirically worked well to first choose the learning rate $\eta^{L}$ of the last layer by fine-tuning only the last layer, and then use $\eta^{l-1} = \eta^{l}/2.6$ as the learning rate for the lower layers.
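A common way to realize this in practice is with per-layer optimizer parameter groups. Below is a minimal PyTorch sketch, assuming a generic stack of layers; the model, layer sizes, and the last-layer learning rate are placeholders rather than values from the paper. Each group's learning rate shrinks by a factor of 2.6 per layer going down the stack.

```python
import torch
import torch.nn as nn

# Hypothetical 4-layer model standing in for a pre-trained network;
# the layer sizes and the last-layer learning rate are illustrative.
layers = nn.ModuleList([nn.Linear(128, 128) for _ in range(4)])
num_layers = len(layers)

eta_last = 0.01   # learning rate chosen for the last (top) layer
decay = 2.6       # eta^{l-1} = eta^l / 2.6, as suggested by the authors

# One optimizer parameter group per layer, with geometrically
# decreasing learning rates toward the lower layers.
param_groups = []
for l, layer in enumerate(layers):
    lr = eta_last / (decay ** (num_layers - 1 - l))
    param_groups.append({"params": layer.parameters(), "lr": lr})

# The per-group lr overrides the default lr passed here.
optimizer = torch.optim.SGD(param_groups, lr=eta_last)
```

Each call to `optimizer.step()` then applies the per-layer update above, since every parameter group carries its own learning rate.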