What is: Mix-FFN?
Source | SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers |
Year | 2021 |
Data Source | CC BY-SA - https://paperswithcode.com |
Mix-FFN is a feed-forward layer used in the SegFormer architecture. ViT uses positional encoding (PE) to introduce location information. However, the resolution of the PE is fixed, so when the test resolution differs from the training resolution, the positional code must be interpolated, which often reduces accuracy. To alleviate this problem, CPVT uses a $3 \times 3$ Conv together with the PE to implement a data-driven PE. The authors of SegFormer argue that positional encoding is not actually necessary for semantic segmentation. Instead, they introduce Mix-FFN, which exploits the fact that zero padding leaks location information by using a $3 \times 3$ Conv directly in the feed-forward network (FFN). Mix-FFN can be formulated as:

$$ \mathbf{x}_{out} = \text{MLP}(\text{GELU}(\text{Conv}_{3 \times 3}(\text{MLP}(\mathbf{x}_{in})))) + \mathbf{x}_{in} $$

where $\mathbf{x}_{in}$ is the feature from the self-attention module. Mix-FFN mixes a $3 \times 3$ convolution and an MLP into each FFN.
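
To make the layer concrete, here is a minimal pure-Python sketch of a Mix-FFN block: a per-token linear layer (the MLP), a zero-padded 3 × 3 convolution, GELU, a second linear layer, and a residual connection. It is an illustration, not the SegFormer implementation. The function names, the toy weights, and the choice of a depthwise convolution (one 3 × 3 kernel per channel) are assumptions made for this example. The input gives every token the same feature vector, so any difference between output positions can only come from the zero padding.

```python
import math

def gelu(v):
    # Exact GELU: v * Phi(v), where Phi is the standard normal CDF.
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def linear(token, weight, bias):
    # Per-token linear layer; weight is indexed as [out_channel][in_channel].
    return [sum(w * t for w, t in zip(row, token)) + b
            for row, b in zip(weight, bias)]

def depthwise_conv3x3(grid, kernels):
    # grid: H x W x C nested lists; kernels: C x 3 x 3 (assumed depthwise).
    # Zero padding: neighbours outside the grid contribute nothing.
    h, w, c = len(grid), len(grid[0]), len(grid[0][0])
    out = [[[0.0] * c for _ in range(w)] for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for ch in range(c):
                acc = 0.0
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < h and 0 <= jj < w:
                            acc += kernels[ch][di + 1][dj + 1] * grid[ii][jj][ch]
                out[i][j][ch] = acc
    return out

def mix_ffn(grid, w1, b1, kernels, w2, b2):
    # x_out = MLP(GELU(Conv3x3(MLP(x_in)))) + x_in
    hidden = [[linear(tok, w1, b1) for tok in row] for row in grid]
    hidden = depthwise_conv3x3(hidden, kernels)
    hidden = [[[gelu(v) for v in tok] for tok in row] for row in hidden]
    out = [[linear(tok, w2, b2) for tok in row] for row in hidden]
    # Residual connection back to the input feature.
    return [[[o + x for o, x in zip(o_tok, x_tok)]
             for o_tok, x_tok in zip(o_row, x_row)]
            for o_row, x_row in zip(out, grid)]

# Toy, hand-picked weights (hypothetical values, 2 channels -> 3 hidden -> 2).
W1 = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
B1 = [0.1, 0.0, -0.1]
ONES = [[1.0] * 3 for _ in range(3)]
KERNELS = [ONES, ONES, ONES]  # each kernel sums the 3 x 3 neighbourhood
W2 = [[0.3, 0.1, -0.2], [-0.1, 0.2, 0.4]]
B2 = [0.0, 0.0]

# A 5 x 5 grid in which every token carries the same feature vector.
x = [[[1.0, 2.0] for _ in range(5)] for _ in range(5)]
y = mix_ffn(x, W1, B1, KERNELS, W2, B2)

corner, interior = y[0][0], y[2][2]
print("corner  :", corner)
print("interior:", interior)
```

Interior tokens see a full 3 × 3 neighbourhood while corner and edge tokens see padded zeros, so the corner output differs from the interior output even though the input is uniform. This is the sense in which zero padding leaks location information without an explicit positional encoding.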