What is: MDETR?
Source | MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding |
Year | 2000 |
Data Source | CC BY-SA - https://paperswithcode.com |
MDETR is an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, like a caption or a question. It utilizes a transformer-based architecture to reason jointly over text and image by fusing the two modalities at an early stage of the model. The network is pre-trained on 1.3M text-image pairs, mined from pre-existing multi-modal datasets having explicit alignment between phrases in text and objects in the image. The network is then fine-tuned on several downstream tasks such as phrase grounding, referring expression comprehension and segmentation.