VideoBERT adapts the powerful [BERT](https://paperswithcode.com/method/bert) model to learn a joint visual-linguistic representation for video. It is used in numerous tasks, including action classification and video captioning.

**Sscs**, or **Support-set Based Cross-Supervision**, is a module for video grounding that consists of two main components: a discriminative contrastive objective and a generative caption objective. The contrastive objective learns effective representations through contrastive learning, while the caption objective trains a powerful video encoder under text supervision. Because some visual entities co-exist in both ground-truth and background intervals (i.e., mutual exclusion), naive contrastive learning is unsuitable for video grounding. This problem is addressed by boosting the cross-supervision with the support-set concept, which collects visual information from the whole video and eliminates the mutual exclusion of entities.

Specifically, in the figure, two video-text pairs $\{V_{i}, L_{i}\}$ and $\{V_{j}, L_{j}\}$ in the batch are presented for clarity. After feeding them into a video encoder and a text encoder, the clip-level and sentence-level embeddings ($\{X_{i}, Y_{i}\}$ and $\{X_{j}, Y_{j}\}$) in a shared space are acquired. Based on the support-set module, the weighted averages of $X_{i}$ and $X_{j}$ are computed to obtain $\bar{X}_{i}$ and $\bar{X}_{j}$, respectively. Finally, the contrastive and caption objectives are combined to pull together the representations of the clips and text from the same sample and push apart those from other pairs.
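The pipeline described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. It assumes that the support-set weights are a softmax over clip-sentence dot products and that the contrastive objective takes an InfoNCE-style form; the description above specifies neither. The function names, the temperature value, and the toy embeddings are hypothetical, and the caption objective is omitted.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def support_set_embedding(clips, sentence):
    """Weighted average of all clip embeddings of one video.

    Assumption: the weights are a softmax over clip-sentence similarities.
    """
    weights = softmax([dot(c, sentence) for c in clips])
    dim = len(sentence)
    return [sum(w * c[d] for w, c in zip(weights, clips)) for d in range(dim)]

def contrastive_loss(video_embs, text_embs, temperature=0.1):
    """InfoNCE-style loss (assumed form): pair i is the positive for video i,
    and every other sentence in the batch acts as a negative."""
    n = len(video_embs)
    loss = 0.0
    for i in range(n):
        logits = [dot(video_embs[i], text_embs[j]) / temperature for j in range(n)]
        loss -= math.log(softmax(logits)[i])
    return loss / n

# Toy batch: two videos, each with three 2-D clip embeddings (X),
# and one 2-D sentence embedding per video (Y).
X = [
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
    [[0.0, 1.0], [0.1, 0.9], [1.0, 0.0]],
]
Y = [[1.0, 0.0], [0.0, 1.0]]

# Support-set embeddings, one per video (the X-bar vectors in the text).
X_bar = [support_set_embedding(X[i], Y[i]) for i in range(len(X))]

loss_matched = contrastive_loss(X_bar, Y)        # correct video-text pairing
loss_swapped = contrastive_loss(X_bar, Y[::-1])  # deliberately mismatched pairing
```

In this toy batch the correctly paired loss is lower than the mismatched one, which is the behaviour the contrastive objective is meant to reward. Because each pooled embedding draws on every clip in the video, entities shared between ground-truth and background intervals are not forced apart.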

Sscs

Support-Set Based Cross-Supervision for Video Grounding

VideoBERT

VideoBERT: A Joint Model for Video and Language Representation Learning

Source	VideoBERT: A Joint Model for Video and Language Representation Learning
Year	2019
Data Source	CC BY-SA - https://paperswithcode.com