What is: Support-set Based Cross-Supervision?
Source | Support-Set Based Cross-Supervision for Video Grounding |
Year | 2021 |
Data Source | CC BY-SA - https://paperswithcode.com |
SSCS, or Support-set Based Cross-Supervision, is a module for video grounding that consists of two main components: a discriminative contrastive objective and a generative caption objective. The contrastive objective learns effective representations through contrastive learning, while the caption objective trains a powerful video encoder under text supervision. Because some visual entities co-exist in both ground-truth and background intervals (the mutual-exclusion problem), naive contrastive learning is unsuitable for video grounding. SSCS addresses this by strengthening cross-supervision with the support-set concept, which collects visual information from the whole video and eliminates the mutual exclusion of entities.
Specifically, the accompanying figure presents two video-text pairs from the batch for clarity. After the pairs are fed into a video encoder and a text encoder, clip-level and sentence-level embeddings in a shared space are obtained for each pair. Based on the support-set module, a weighted average of each video's clip-level embeddings is computed to obtain a support-set representation for that video. Finally, the contrastive and caption objectives are combined to pull together the representations of clips and text from the same sample and push apart those from different pairs.
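
The pooling and contrastive steps described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`support_set_pool`, `contrastive_loss`), the softmax weighting by clip-sentence similarity, the symmetric InfoNCE-style loss, and the `temperature` value are all hypothetical choices made here for illustration. The generative caption objective is omitted.

```python
import numpy as np


def l2_normalize(v, axis=-1, eps=1e-8):
    """Scale vectors to unit length along `axis`."""
    return v / (np.linalg.norm(v, axis=axis, keepdims=True) + eps)


def log_softmax(z, axis):
    """Numerically stable log-softmax along `axis`."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))


def support_set_pool(clips, sentences, temperature=0.1):
    """Weighted average over ALL clips of each video (the support set).

    clips:     (B, T, D) clip-level embeddings
    sentences: (B, D)    sentence-level embeddings
    Assumption: weights are a softmax over clip-sentence similarity.
    """
    scores = np.einsum("btd,bd->bt", clips, sentences) / temperature
    weights = np.exp(log_softmax(scores, axis=1))       # (B, T), rows sum to 1
    pooled = np.einsum("bt,btd->bd", weights, clips)    # (B, D)
    return pooled, weights


def contrastive_loss(video, text, temperature=0.1):
    """Symmetric InfoNCE-style loss (assumed form): matched pairs sit on
    the diagonal of the similarity matrix, other pairs act as negatives."""
    video, text = l2_normalize(video), l2_normalize(text)
    logits = video @ text.T / temperature               # (B, B)
    idx = np.arange(logits.shape[0])
    video_to_text = log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_video = log_softmax(logits, axis=0)[idx, idx].mean()
    return -(video_to_text + text_to_video) / 2


# Toy batch: 2 video-text pairs, 5 clips per video, 8-dimensional embeddings.
rng = np.random.default_rng(0)
clips = l2_normalize(rng.normal(size=(2, 5, 8)))
sentences = l2_normalize(rng.normal(size=(2, 8)))

pooled, weights = support_set_pool(clips, sentences)
loss = contrastive_loss(pooled, sentences)
```

Because the pooled vector draws on every clip in the video rather than only the ground-truth interval, entities that also appear in background clips contribute to the positive representation instead of being pushed away as negatives, which is the intuition behind the support set.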