In the OneR method, the model input can be an image, text, or image+text, and the CMC objective is combined with the traditional image-text contrastive (ITC) loss. Masked modeling is also carried out for all three input types (i.e., image, text, and multi-modal). The framework employs no modality-specific architectural component except for the initial token embedding layer, making the model generic and modality-agnostic with minimal inductive bias.
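The ITC term mentioned above can be illustrated with a minimal sketch. It assumes the common symmetric InfoNCE formulation over a batch of paired embeddings. The function names, the temperature of 0.07, and the random embeddings are illustrative assumptions, not details from the OneR paper, and the CMC and masked-modeling objectives are not shown.

```python
import numpy as np

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def itc_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over paired (image, text) embeddings.

    Hypothetical helper: temperature and shapes are assumptions for illustration.
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # shape (batch, batch)
    idx = np.arange(logits.shape[0])               # matching pairs sit on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))        # stand-in embeddings from a shared encoder
aligned = itc_loss(emb, emb)         # every image matches its own caption
shuffled = itc_loss(emb, emb[::-1])  # every image is paired with the wrong caption
```

In this toy setup the correctly paired batch yields a lower loss than the mismatched one, which is the behavior the contrastive objective rewards.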

**GLOW** is a flow-based generative model built on an invertible $1 \times 1$ [convolution](https://paperswithcode.com/method/convolution). It builds on the flows introduced by [NICE](https://paperswithcode.com/method/nice) and [RealNVP](https://paperswithcode.com/method/realnvp). It consists of a series of steps of flow, combined in a multi-scale architecture; see the Figure to the right. Each step of flow consists of activation normalization (actnorm) followed by an *invertible $1 \times 1$ convolution* followed by an [affine coupling](https://paperswithcode.com/method/affine-coupling) layer.
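As a rough illustration of the invertible $1 \times 1$ convolution alone (actnorm and the affine coupling layer are omitted), the sketch below treats the convolution as one $C \times C$ matrix applied at every spatial position. The function names, tensor shapes, and the QR-based orthogonal initialization are assumptions made for this example, not the reference implementation.

```python
import numpy as np

def invertible_1x1_conv(x, weight):
    """Apply a 1x1 convolution to x of shape (C, H, W).

    Returns the output and the log-determinant of the Jacobian.
    """
    c, h, w = x.shape
    # A 1x1 convolution is the same C x C matrix multiply at every position.
    y = np.einsum("oc,chw->ohw", weight, x)
    # Each of the H*W positions contributes log|det W| to the log-determinant.
    _, logabsdet = np.linalg.slogdet(weight)
    return y, h * w * logabsdet

def invert_1x1_conv(y, weight):
    """Recover the input by applying the inverse matrix at every position."""
    return np.einsum("oc,chw->ohw", np.linalg.inv(weight), y)

rng = np.random.default_rng(0)
# Orthogonal initialization via QR (an assumption here): log|det W| starts at 0.
weight, _ = np.linalg.qr(rng.normal(size=(3, 3)))
x = rng.normal(size=(3, 4, 4))
y, logdet = invertible_1x1_conv(x, weight)
x_rec = invert_1x1_conv(y, weight)
```

Because the layer is an invertible linear map per position, the input is recovered up to floating-point error, and the log-determinant needed for the flow's likelihood costs only one determinant of a small $C \times C$ matrix.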

GLOW

Glow: Generative Flow with Invertible 1x1 Convolutions

OneR

Unifying Vision-Language Representation Space with Single-tower Transformer

Channel & spatial attention combines the advantages of channel attention and spatial attention. It adaptively selects both important objects and regions.

Source	Unifying Vision-Language Representation Space with Single-tower Transformer
Year	2022
Data Source	CC BY-SA - https://paperswithcode.com

What is: One Representation?
