What is: Multiscale Attention ViT with Late fusion?
Source | Class-agnostic Object Detection with Multi-modal Transformer |
Year | 2000 |
Data Source | CC BY-SA - https://paperswithcode.com |
Multiscale Attention ViT with Late fusion (MAVL) is a multi-modal network, trained with aligned image-text pairs, capable of performing targeted detection using human understandable natural language text queries. It utilizes multi-scale image features and uses deformable convolutions with late multi-modal fusion. The authors demonstrate excellent ability of MAVL as class-agnostic object detector when queried using general human understandable natural language command, such as "all objects", "all entities", etc.