
What is: Support-set Based Cross-Supervision?

Source: Support-Set Based Cross-Supervision for Video Grounding
Year: 2000
Data Source: CC BY-SA - https://paperswithcode.com

Sscs, or Support-set Based Cross-Supervision, is a module for video grounding that consists of two main components: a discriminative contrastive objective and a generative caption objective. The contrastive objective learns effective representations through contrastive learning, while the caption objective trains a powerful video encoder under text supervision. Because some visual entities co-exist in both the ground-truth and background intervals (i.e., mutual exclusion), naive contrastive learning is unsuitable for video grounding. Sscs addresses this problem by boosting the cross-supervision with the support-set concept, which collects visual information from the whole video and eliminates the mutual exclusion of entities.

Specifically, in the accompanying figure, two video-text pairs $\{V_i, L_i\}$ and $\{V_j, L_j\}$ in the batch are presented for clarity. After feeding them into a video encoder and a text encoder, the clip-level and sentence-level embeddings ($\{X_i, Y_i\}$ and $\{X_j, Y_j\}$) in a shared space are acquired. Based on the support-set module, the weighted averages of $X_i$ and $X_j$ are computed to obtain $\bar{X}_i$ and $\bar{X}_j$, respectively. Finally, the contrastive and caption objectives are combined to pull together the representations of clips and text from the same sample and push apart those from other pairs.
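
The pooling and contrastive steps described above can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation: the softmax-over-dot-product weighting, the InfoNCE-style loss, the `temperature` value, and the function names `support_set_pool` and `contrastive_loss` are all assumptions chosen for illustration, and the caption objective is left out entirely.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def support_set_pool(clips, sentence):
    """Weighted average over ALL clip embeddings of one video.

    Assumption: weights are a softmax over clip-sentence dot products.
    The source only states that a weighted average is computed.
    """
    weights = softmax([dot(c, sentence) for c in clips])
    dim = len(sentence)
    return [sum(w * c[d] for w, c in zip(weights, clips)) for d in range(dim)]

def contrastive_loss(pooled, sentences, temperature=0.1):
    """InfoNCE-style loss (assumed form): pooled[i] should score highest
    with sentences[i] and lower with every other sentence in the batch."""
    loss = 0.0
    for i, x_bar in enumerate(pooled):
        probs = softmax([dot(x_bar, y) / temperature for y in sentences])
        loss -= math.log(probs[i])
    return loss / len(pooled)

# Toy batch: two videos with three 2-D clip embeddings each,
# plus one sentence embedding per video (all values are made up).
X_i = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
X_j = [[0.0, 1.0], [0.1, 0.9], [1.0, 0.0]]
Y_i = [1.0, 0.0]
Y_j = [0.0, 1.0]

X_bar_i = support_set_pool(X_i, Y_i)
X_bar_j = support_set_pool(X_j, Y_j)
loss = contrastive_loss([X_bar_i, X_bar_j], [Y_i, Y_j])
print(X_bar_i, X_bar_j, loss)
```

Because every clip of the video contributes to the pooled embedding, entities that appear in both ground-truth and background clips are no longer treated as negatives of each other within the same video; the negatives come only from the other pairs in the batch.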