What is: Deep-MAC?

Source: The surprising impact of mask-head architecture on novel class segmentation
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

Deep-MAC, or Deep Mask-heads Above CenterNet, is a type of anchor-free instance segmentation model based on CenterNet. The motivation for this new architecture is that boxes are much cheaper to annotate than masks, so the authors address the “partially supervised” instance segmentation problem, where all classes have bounding box annotations but only a subset of classes have mask annotations.

For predicting bounding boxes, CenterNet outputs three tensors: (1) a class-specific heatmap giving the probability that a bounding-box center is present at each location, (2) a class-agnostic 2-channel tensor giving the height and width of the bounding box at each center pixel, and (3) a 2-channel offset in the x and y directions at each center pixel, which recovers the discretization error introduced because the output feature map is typically smaller than the image (stride 4 or 8).
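The way these three tensors combine into boxes can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the function name `decode_centernet`, the `score_thresh` parameter, and the plain score threshold (used here instead of proper peak extraction) are assumptions, as is the convention that sizes and offsets are expressed in feature-map cells.

```python
import numpy as np

def decode_centernet(heatmap, size, offset, stride=4, score_thresh=0.5):
    """Turn the three CenterNet output tensors into boxes (illustrative sketch).

    heatmap: (H, W, C) per-class probability of a box center at each cell
    size:    (H, W, 2) class-agnostic (height, width), assumed in feature-map cells
    offset:  (H, W, 2) sub-cell (dy, dx) correction, assumed in feature-map cells
    Returns a list of (y_min, x_min, y_max, x_max, class_id, score) in image pixels.
    """
    boxes = []
    # Assumption: a plain threshold stands in for local-maximum peak picking.
    ys, xs, cs = np.nonzero(heatmap >= score_thresh)
    for y, x, c in zip(ys, xs, cs):
        dy, dx = offset[y, x]
        h, w = size[y, x]
        # The offset undoes the rounding caused by the output stride.
        cy = (y + dy) * stride
        cx = (x + dx) * stride
        h_img, w_img = h * stride, w * stride
        boxes.append((
            float(cy - h_img / 2), float(cx - w_img / 2),
            float(cy + h_img / 2), float(cx + w_img / 2),
            int(c), float(heatmap[y, x, c]),
        ))
    return boxes
```

With stride 4, a center detected at cell (1, 2) with offset (0.5, 0.25) maps back to image position (6.0, 9.0), which the integer cell index alone could not represent.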

For Deep-MAC, the authors add a fourth head in parallel to the box-related prediction heads: a pixel embedding branch $P$. For each bounding box $b$, a region $P_b$ corresponding to $b$ is cropped from $P$ via ROIAlign, which yields a 32 × 32 tensor. Each $P_b$ is then fed to a mask-head. The final prediction is a class-agnostic 32 × 32 tensor, which is passed through a sigmoid to get per-pixel probabilities. The mask-head is trained with a per-pixel cross-entropy loss averaged over all pixels and instances. During post-processing, the predicted mask is re-aligned according to the predicted box and resized to the resolution of the image.
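The crop and the loss described above can be sketched as follows. This is a simplified stand-in under stated assumptions: `bilinear_crop` takes a single bilinear sample at the center of each output bin, whereas real ROIAlign typically averages several samples per bin; boxes are assumed to be given in feature-map coordinates with integer coordinates at cell centers; and the mask-head itself is omitted. The function names and the `eps` parameter are hypothetical.

```python
import numpy as np

def bilinear_crop(features, box, out_size=32):
    """Simplified ROIAlign: one bilinear sample per output bin (assumption).

    features: (H, W, C) pixel embedding
    box:      (y_min, x_min, y_max, x_max) in feature-map coordinates
    Returns an (out_size, out_size, C) crop.
    """
    H, W, _ = features.shape
    y0, x0, y1, x1 = box
    # Sample at bin centers, evenly spaced inside the box.
    ys = y0 + (np.arange(out_size) + 0.5) * (y1 - y0) / out_size
    xs = x0 + (np.arange(out_size) + 0.5) * (x1 - x0) / out_size
    ys = np.clip(ys, 0, H - 1)
    xs = np.clip(xs, 0, W - 1)
    yf = np.floor(ys).astype(int)
    xf = np.floor(xs).astype(int)
    yc = np.minimum(yf + 1, H - 1)
    xc = np.minimum(xf + 1, W - 1)
    wy = (ys - yf)[:, None, None]   # vertical interpolation weights
    wx = (xs - xf)[None, :, None]   # horizontal interpolation weights
    top = features[yf][:, xf] * (1 - wx) + features[yf][:, xc] * wx
    bottom = features[yc][:, xf] * (1 - wx) + features[yc][:, xc] * wx
    return top * (1 - wy) + bottom * wy

def mask_loss(logits, targets, eps=1e-7):
    """Per-pixel binary cross-entropy, averaged over all pixels and instances.

    logits, targets: (N, S, S) arrays; targets hold 0/1 mask labels.
    """
    probs = 1.0 / (1.0 + np.exp(-logits))   # sigmoid
    probs = np.clip(probs, eps, 1 - eps)    # avoid log(0)
    bce = -(targets * np.log(probs) + (1 - targets) * np.log(1 - probs))
    return float(bce.mean())
```

In a training loop, each crop would pass through the mask-head to produce the logits; here any array of shape (N, 32, 32) can stand in for that output when experimenting with the loss.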

In addition to this 32 × 32 cropped feature map, the authors add two inputs that improve the stability of some mask-heads: (1) Instance embedding: an additional head on the backbone predicts a per-pixel embedding. For each bounding box $b$, its embedding is extracted from the center pixel, tiled to a size of 32 × 32, and concatenated to the pixel embedding crop. This helps condition the mask-head on a particular instance and disambiguate it from others. (2) Coordinate embedding: inspired by CoordConv, the authors add a 32 × 32 × 2 tensor holding normalized $(x, y)$ coordinates relative to the bounding box $b$.
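Assembling the mask-head input from these three pieces amounts to a tile and a channel-wise concatenation. The sketch below is illustrative only: the function name `build_mask_head_input`, the channel ordering, and the choice of normalizing coordinates to the range [-1, 1] are assumptions, since the text above only says the coordinates are normalized relative to the box.

```python
import numpy as np

def build_mask_head_input(pixel_crop, instance_embedding_map, center_yx):
    """Concatenate the pixel crop, tiled instance embedding, and coordinates.

    pixel_crop:             (S, S, C_p) crop of the pixel embedding for one box
    instance_embedding_map: (H, W, C_i) per-pixel instance embedding
    center_yx:              (y, x) integer feature-map location of the box center
    Returns an (S, S, C_p + C_i + 2) array.
    """
    size = pixel_crop.shape[0]
    cy, cx = center_yx
    # Take the embedding at the box center and tile it over the crop.
    inst = instance_embedding_map[cy, cx]
    inst_tiled = np.broadcast_to(inst, (size, size, inst.shape[0]))
    # Box-relative coordinates; the [-1, 1] range is an assumption.
    lin = np.linspace(-1.0, 1.0, size)
    yy, xx = np.meshgrid(lin, lin, indexing="ij")
    coords = np.stack([xx, yy], axis=-1)  # channel 0 is x, channel 1 is y
    return np.concatenate([pixel_crop, inst_tiled, coords], axis=-1)
```

Because the coordinate channels depend only on the crop size, they could be computed once and reused for every box.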