What is: Highway Layer?

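A Highway Layer is a neural network layer that gives its input a direct path, an "information highway", to its output, so that information can flow across many layers without being altered at each one. It is characterised by gating units that learn to regulate how much of the input is transformed and how much is carried through unchanged.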

A plain feedforward neural network typically consists of $L$ layers, where the $l$-th layer ($l \in \lbrace 1, 2, \dots, L \rbrace$) applies a nonlinear transform $H$ (parameterized by $\mathbf{W\_{H,l}}$) to its input $\mathbf{x\_{l}}$ to produce its output $\mathbf{y\_{l}}$. Thus, $\mathbf{x\_{1}}$ is the input to the network and $\mathbf{y\_{L}}$ is the network's output. Omitting the layer index and biases for clarity,

$\mathbf{y} = H\left(\mathbf{x},\mathbf{W\_{H}}\right)$

$H$ is usually an affine transform followed by a non-linear activation function, but in general it may take other forms.
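As a point of reference for the highway formulas that follow, here is a minimal NumPy sketch of one plain layer. The choice of `tanh` as the activation, the layer sizes, and the names `plain_layer`, `W_H` and `b_H` are assumptions made for illustration; the text only requires that $H$ be some nonlinear transform.

```python
import numpy as np

def plain_layer(x, W_H, b_H):
    """y = H(x, W_H): an affine transform followed by a tanh nonlinearity."""
    return np.tanh(W_H @ x + b_H)

rng = np.random.default_rng(0)
x = rng.normal(size=4)           # input vector x
W_H = rng.normal(size=(3, 4))    # weights mapping 4 inputs to 3 outputs
b_H = np.zeros(3)                # bias (omitted in the text's notation)
y = plain_layer(x, W_H, b_H)
print(y.shape)  # (3,)
```

Note that a plain layer is free to change the dimensionality of its input, which is not the case once the input is also carried to the output.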

For a highway network, we additionally define two nonlinear transforms $T\left(\mathbf{x},\mathbf{W\_{T}}\right)$ and $C\left(\mathbf{x},\mathbf{W\_{C}}\right)$ such that:

$\mathbf{y} = H\left(\mathbf{x},\mathbf{W\_{H}}\right) \cdot T\left(\mathbf{x},\mathbf{W\_{T}}\right) + \mathbf{x} \cdot C\left(\mathbf{x},\mathbf{W\_{C}}\right)$
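The general form with two independent gates can be sketched as follows, reading the products as elementwise multiplications. This is only an illustration: `highway_combine` is a hypothetical helper, and the vectors `h`, `t` and `c` are hand-picked stand-ins for the outputs of $H$, $T$ and $C$ rather than values computed from learned weights. Because $\mathbf{x}$ is added to the output, all four vectors must have the same shape.

```python
import numpy as np

def highway_combine(x, h, t, c):
    """y = h * t + x * c, elementwise; every argument has the shape of x."""
    x, h, t, c = (np.asarray(a, dtype=float) for a in (x, h, t, c))
    if not (x.shape == h.shape == t.shape == c.shape):
        raise ValueError("x, H(x), T(x) and C(x) must share one shape")
    return h * t + x * c

x = np.array([1.0, -2.0, 0.5])
h = np.array([0.3, 0.9, -0.4])   # stand-in for H(x, W_H)
t = np.array([1.0, 0.0, 0.5])    # transform gate values
c = np.array([0.0, 1.0, 0.5])    # carry gate values
y = highway_combine(x, h, t, c)
# Unit 0 is fully transformed, unit 1 is fully carried, unit 2 is an equal blend.
```

Each output unit has its own pair of gate values, so a single layer can transform some components of its input while passing others through untouched.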

$T$ is called the transform gate and $C$ the carry gate, since they express how much of the output is produced by transforming the input and by carrying it, respectively. In the original paper, the authors set $C = 1 - T$, giving:

$\mathbf{y} = H\left(\mathbf{x},\mathbf{W\_{H}}\right) \cdot T\left(\mathbf{x},\mathbf{W\_{T}}\right) + \mathbf{x} \cdot \left(1-T\left(\mathbf{x},\mathbf{W\_{T}}\right)\right)$
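To see what the coupled gates imply, it may help to substitute the two extreme gate values into the equation above, writing $\mathbf{0}$ and $\mathbf{1}$ for the all-zeros and all-ones vectors:

$T\left(\mathbf{x},\mathbf{W\_{T}}\right) = \mathbf{0} \;\Rightarrow\; \mathbf{y} = \mathbf{x}$

$T\left(\mathbf{x},\mathbf{W\_{T}}\right) = \mathbf{1} \;\Rightarrow\; \mathbf{y} = H\left(\mathbf{x},\mathbf{W\_{H}}\right)$

With the gate fully closed the layer copies its input, and with the gate fully open it behaves like a plain layer. Intermediate gate values interpolate between these two behaviours, unit by unit.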

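The authors define the transform gate as:

$T\left(\mathbf{x}\right) = \sigma\left(\mathbf{W\_{T}}^{T}\mathbf{x} + \mathbf{b\_{T}}\right)$

where $\sigma$ is the logistic sigmoid, $\mathbf{W\_{T}}$ is the weight matrix of the gate and $\mathbf{b\_{T}}$ is its bias vector. Each component of $T\left(\mathbf{x}\right)$ therefore lies strictly between 0 and 1.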
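Putting the pieces together, a complete highway layer with a sigmoid transform gate might look like the sketch below. Several details are assumptions not fixed by the text: the class name `HighwayLayer`, `tanh` as the nonlinearity inside $H$, the random weight initialisation, and the default `gate_bias` of -2.0. The negative default follows the original paper's suggestion to start $\mathbf{b\_{T}}$ at a negative value so the layer initially favours carrying its input. The weight matrices are square because the output must have the same size as the input.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class HighwayLayer:
    """y = H(x) * T(x) + x * (1 - T(x)), with T(x) = sigmoid(W_T^T x + b_T)."""

    def __init__(self, dim, gate_bias=-2.0, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(dim)                 # assumed initialisation scale
        self.W_H = rng.normal(scale=scale, size=(dim, dim))
        self.b_H = np.zeros(dim)
        self.W_T = rng.normal(scale=scale, size=(dim, dim))
        self.b_T = np.full(dim, gate_bias)         # negative bias favours carrying

    def gate(self, x):
        """Transform gate T(x); every component lies in (0, 1)."""
        return sigmoid(self.W_T.T @ x + self.b_T)

    def __call__(self, x):
        h = np.tanh(self.W_H.T @ x + self.b_H)     # H(x, W_H), tanh assumed
        t = self.gate(x)
        return h * t + x * (1.0 - t)               # carry gate is 1 - T(x)

x = np.array([0.5, -1.0, 2.0, 0.0])
layer = HighwayLayer(dim=4)
y = layer(x)                                       # same shape as x

closed = HighwayLayer(dim=4, gate_bias=-50.0)      # T(x) close to 0: carries x
opened = HighwayLayer(dim=4, gate_bias=50.0)       # T(x) close to 1: acts like H
```

With a strongly negative gate bias, `closed(x)` is numerically indistinguishable from `x`, while `opened(x)` matches the plain `tanh` layer; a trained layer would sit somewhere in between, with the gate values depending on the input.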

Image: Sik-Ho Tsang

Source	Highway Networks
Year	2015
Data Source	CC BY-SA - https://paperswithcode.com

Viet-Anh on Software
