What is: Multiscale Attention ViT with Late fusion?

Source: Class-agnostic Object Detection with Multi-modal Transformer
Data Source: CC BY-SA

Multiscale Attention ViT with Late fusion (MAVL) is a multi-modal network, trained on aligned image-text pairs, that performs targeted detection from human-understandable natural-language text queries. It uses multi-scale image features and deformable convolutions with late multi-modal fusion. The authors report that MAVL performs strongly as a class-agnostic object detector when queried with generic natural-language commands such as "all objects" or "all entities".
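
To make the idea of late fusion more concrete, here is a minimal toy sketch in plain Python. It is not MAVL's implementation: the function names (`cosine`, `late_fusion_scores`, `select_regions`), the two-dimensional feature vectors, the boxes, and the 0.5 threshold are all hypothetical, and a simple cosine similarity stands in for the learned transformer layers. The only point it illustrates is that the image and the text are encoded separately and are combined only in a final step, where a generic query such as "all objects" scores every candidate region.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def late_fusion_scores(region_features, text_embedding):
    """Score each image region against the text query.

    Both modalities arrive here already encoded, independently of each
    other. They meet only in this last step, which is what makes the
    fusion "late" (as opposed to mixing them inside the encoders).
    """
    return [cosine(region, text_embedding) for region in region_features]


def select_regions(boxes, scores, threshold=0.5):
    """Keep the boxes whose query score reaches the threshold."""
    return [(box, score) for box, score in zip(boxes, scores) if score >= threshold]


# Hypothetical toy data: three candidate boxes (x1, y1, x2, y2) and their
# made-up image features, plus a made-up embedding standing in for the
# encoded text query "all objects".
boxes = [(10, 10, 50, 50), (60, 20, 120, 90), (0, 0, 5, 5)]
region_features = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
text_embedding = [1.0, 0.0]

scores = late_fusion_scores(region_features, text_embedding)
kept = select_regions(boxes, scores, threshold=0.5)

for box, score in kept:
    print(box, round(score, 2))
```

In this toy setup the first two boxes pass the threshold and the third does not. In a real model the region features and the query embedding would come from trained image and text encoders, and the scoring step would be learned rather than a fixed similarity.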