
What is: Highway Layer?

Source: Highway Networks
Year: 2015
Data Source: CC BY-SA - https://paperswithcode.com

A Highway Layer contains an information highway to other layers: a path along which the input can pass through the layer largely unchanged. It is characterised by gating units that regulate how much information takes this path.

A plain feedforward neural network typically consists of $L$ layers, where the $l$-th layer ($l \in \{1, 2, \dots, L\}$) applies a nonlinear transform $H$ (parameterized by $\mathbf{W}_{H,l}$) to its input $\mathbf{x}_l$ to produce its output $\mathbf{y}_l$. Thus, $\mathbf{x}_1$ is the input to the network and $\mathbf{y}_L$ is the network's output. Omitting the layer index and biases for clarity,

$$\mathbf{y} = H\left(\mathbf{x}, \mathbf{W}_H\right)$$

$H$ is usually an affine transform followed by a nonlinear activation function, but in general it may take other forms.
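As a minimal sketch of the plain layer described above, assuming $H$ is an affine transform followed by `tanh` (the choice of `tanh`, the explicit bias `b_H`, and the helper names `affine`, `plain_layer`, and `plain_network` are illustrative, not taken from the source):

```python
import math

def affine(W, x, b):
    """Return W x + b, where W is a list of rows and x, b are lists of floats."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def plain_layer(x, W_H, b_H):
    """y = H(x, W_H): here, an affine transform followed by tanh (an assumed choice)."""
    return [math.tanh(z) for z in affine(W_H, x, b_H)]

def plain_network(x, layers):
    """Apply the layers in order: the output y_l of layer l is the input x_{l+1}."""
    for W_H, b_H in layers:
        x = plain_layer(x, W_H, b_H)
    return x

# Two stacked layers on a 2-dimensional input (weights are arbitrary).
layers = [([[0.5, -0.2], [0.1, 0.3]], [0.0, 0.1]),
          ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])]
print(plain_network([1.0, -1.0], layers))
```

Every layer here fully transforms its input; there is no path that passes $\mathbf{x}$ through unchanged.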

For a highway network, we additionally define two nonlinear transforms $T\left(\mathbf{x}, \mathbf{W}_T\right)$ and $C\left(\mathbf{x}, \mathbf{W}_C\right)$ such that:

$$\mathbf{y} = H\left(\mathbf{x}, \mathbf{W}_H\right) \cdot T\left(\mathbf{x}, \mathbf{W}_T\right) + \mathbf{x} \cdot C\left(\mathbf{x}, \mathbf{W}_C\right)$$

We refer to $T$ as the transform gate and $C$ as the carry gate, since they express how much of the output is produced by transforming the input and by carrying it, respectively. In the original paper, the authors set $C = 1 - T$, giving:

$$\mathbf{y} = H\left(\mathbf{x}, \mathbf{W}_H\right) \cdot T\left(\mathbf{x}, \mathbf{W}_T\right) + \mathbf{x} \cdot \left(1 - T\left(\mathbf{x}, \mathbf{W}_T\right)\right)$$
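The highway formula with $C = 1 - T$ can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than a reference implementation: $H$ is taken to be affine plus `tanh`, the gate $T$ is taken to be affine plus a logistic sigmoid, the products are elementwise, and the function names and explicit bias arguments are hypothetical.

```python
import math

def affine(W, x, b):
    """Return W x + b, where W is a list of rows and x, b are lists of floats."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def sigmoid(z):
    """Logistic sigmoid; adequate for the moderate inputs used in this sketch."""
    return 1.0 / (1.0 + math.exp(-z))

def highway_layer(x, W_H, b_H, W_T, b_T):
    """y = H(x) * T(x) + x * (1 - T(x)), computed elementwise.

    Assumes H = tanh(affine) and T = sigmoid(affine); W_H and W_T must be
    square so that H(x), T(x), and x all have the same length.
    """
    h = [math.tanh(z) for z in affine(W_H, x, b_H)]  # transformed input
    t = [sigmoid(z) for z in affine(W_T, x, b_T)]    # transform gate in (0, 1)
    return [h_i * t_i + x_i * (1.0 - t_i) for h_i, t_i, x_i in zip(h, t, x)]

x = [1.0, -2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
# With zero gate weights and zero gate bias, T = 0.5 everywhere, so the
# output is the average of H(x) and x.
print(highway_layer(x, identity, [0.0, 0.0], zeros, [0.0, 0.0]))
```

Because the formula mixes $H(\mathbf{x})$ and $\mathbf{x}$ elementwise, the input, the transform output, and the gate output must all have the same dimensionality, which is why the sketch uses square weight matrices.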

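The authors define the transform gate as:

$$T\left(\mathbf{x}\right) = \sigma\left(\mathbf{W}_T^{T}\mathbf{x} + \mathbf{b}_T\right)$$

where $\sigma$ is the logistic sigmoid, $\sigma(z) = 1/(1 + e^{-z})$, so every gate value lies in $(0, 1)$.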
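To make the gate concrete, the sketch below evaluates $T(\mathbf{x})$ for a few bias values. It assumes $\sigma$ is the logistic sigmoid and stores $\mathbf{W}_T$ as a list of rows, applying the transpose explicitly; the function names and the specific bias values are arbitrary illustrations, not settings taken from the source.

```python
import math

def sigmoid(z):
    """Logistic sigmoid; adequate for the moderate inputs used in this sketch."""
    return 1.0 / (1.0 + math.exp(-z))

def transform_gate(x, W_T, b_T):
    """T(x) = sigmoid(W_T^T x + b_T).

    W_T[i][j] is the entry in row i, column j, so component j of W_T^T x
    is the sum over i of W_T[i][j] * x[i].
    """
    return [sigmoid(sum(W_T[i][j] * x[i] for i in range(len(x))) + b_T[j])
            for j in range(len(b_T))]

x = [1.0, -2.0]
W_T = [[0.0, 0.0], [0.0, 0.0]]  # zero weights isolate the effect of the bias
for bias in (-20.0, 0.0, 20.0):
    print(bias, transform_gate(x, W_T, [bias, bias]))
```

A strongly negative bias drives the gate toward 0, so the layer mostly carries $\mathbf{x}$ through; a strongly positive bias drives it toward 1, so the layer mostly outputs $H(\mathbf{x})$; a zero bias with zero weights gives an even mix.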

Image: Sik-Ho Tsang