What is: DeLighT Block?
Source | DeLighT: Deep and Light-weight Transformer |
Year | 2020 |
Data Source | CC BY-SA - https://paperswithcode.com |
A DeLighT Block is the building block of the DeLighT transformer architecture. It uses a DExTra transformation to reduce the dimensionality of the vectors fed into the attention layer, where a single-head attention module is used. Because the DeLighT block learns wider representations of the input across its layers using DExTra, the authors can replace multi-head attention with single-head attention. This is followed by a light-weight FFN which, rather than expanding the dimension (standard Transformers widen it to four times the model dimension), imposes a bottleneck and squeezes it. Again, the reason for this is that the DExTra transformation has already incorporated wider representations, so the FFN can squeeze instead of expand.
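The PyTorch sketch below illustrates this data flow under stated assumptions; it is not the authors' implementation. In particular, the real DExTra is built from stacked group linear transformations with input mixing, which the `DExTraSketch` module here approximates with plain expand-and-reduce linear layers, and names such as `DExTraSketch`, `DeLighTBlockSketch`, `d_attn`, and `ffn_reduction` are illustrative choices, not from the paper.

```python
import torch
import torch.nn as nn


class DExTraSketch(nn.Module):
    """Simplified stand-in for the DExTra transformation.

    The real DExTra stacks group linear transformations with input
    mixing to first expand and then reduce the dimension; this sketch
    approximates that expand-and-reduce shape with plain linear layers.
    """

    def __init__(self, d_model, width_mult=2, d_out=None):
        super().__init__()
        # DExTra's output is narrower than its input, so the attention
        # layer that follows operates on smaller vectors.
        d_out = d_out if d_out is not None else d_model // 2
        self.expand = nn.Linear(d_model, width_mult * d_model)
        self.reduce = nn.Linear(width_mult * d_model, d_out)
        self.act = nn.GELU()

    def forward(self, x):
        return self.reduce(self.act(self.expand(x)))


class DeLighTBlockSketch(nn.Module):
    """Hypothetical DeLighT-style block:
    DExTra -> single-head attention -> light-weight (bottleneck) FFN."""

    def __init__(self, d_model=512, d_attn=256, ffn_reduction=4):
        super().__init__()
        self.dextra = DExTraSketch(d_model, d_out=d_attn)
        # Single-head attention: DExTra has already widened the
        # representations, so multiple heads are not needed.
        self.attn = nn.MultiheadAttention(d_attn, num_heads=1, batch_first=True)
        self.proj = nn.Linear(d_attn, d_model)  # back to the block's I/O width
        # Light-weight FFN: squeeze the dimension by `ffn_reduction`
        # instead of the usual 4x expansion.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_model // ffn_reduction),
            nn.GELU(),
            nn.Linear(d_model // ffn_reduction, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.dextra(x)                # (B, T, d_attn), reduced width
        a, _ = self.attn(h, h, h)         # single-head self-attention
        x = self.norm1(x + self.proj(a))  # residual at d_model
        x = self.norm2(x + self.ffn(x))   # bottleneck FFN + residual
        return x


if __name__ == "__main__":
    block = DeLighTBlockSketch()
    out = block(torch.randn(2, 10, 512))
    print(out.shape)  # torch.Size([2, 10, 512])
```

Note the inverted shape relative to a standard Transformer block: attention runs on vectors narrower than the model dimension, and the FFN contracts rather than expands, since the widening work has already been done by DExTra.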