Viet-Anh on Software Logo

What is: Mix-FFN?

SourceSegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers
Year2000
Data SourceCC BY-SA - https://paperswithcode.com

Mix-FFN is a feedforward layer used in the SegFormer architecture. ViT uses positional encoding (PE) to introduce the location information. However, the resolution of PE\mathrm{PE} is fixed. Therefore, when the test resolution is different from the training one, the positional code needs to be interpolated and this often leads to dropped accuracy. To alleviate this problem, CPVT uses 3×33 \times 3 Conv together with the PE to implement a data-driven PE. The authors of Mix-FFN argue that positional encoding is actually not necessary for semantic segmentation. Instead, they use Mix-FFN which considers the effect of zero padding to leak location information, by directly using a 3×33 \times 3 Conv in the feed-forward network (FFN). Mix-FFN can be formulated as:

x_out =MLP(GELU(Conv_3×3(MLP(x_in))))+x_in\mathbf{x}\_{\text {out }}=\operatorname{MLP}\left(\operatorname{GELU}\left(\operatorname{Conv}\_{3 \times 3}\left(\operatorname{MLP}\left(\mathbf{x}\_{i n}\right)\right)\right)\right)+\mathbf{x}\_{i n}

where x_in\mathbf{x}\_{i n} is the feature from a self-attention module. Mix-FFN mixes a 3×33 \times 3 convolution and an MLP into each FFN.