**NesT** stacks canonical transformer layers to perform local self-attention on every image block independently, and then "nests" them hierarchically. Information is exchanged between spatially adjacent blocks through a proposed block aggregation operation between every two hierarchies. The overall hierarchical structure is determined by two key hyper-parameters: the patch size $S \times S$ and the number of block hierarchies $T_d$. All blocks inside each hierarchy share one set of parameters. Given an input image, each image patch is linearly projected to an embedding. All embeddings are then partitioned into blocks and flattened to generate the final input. Each transformer layer is composed of a multi-head self-attention (MSA) layer followed by a feed-forward fully-connected network (FFN), with skip connections and layer normalization. Positional embeddings are added to encode spatial information before the input is fed into the blocks. Lastly, a nested hierarchy with block aggregation is built: every four spatially connected blocks are merged into one.
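The partition-and-merge structure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the helper names `blockify`, `unblockify`, and `aggregate` are hypothetical, the transformer layers that would run inside each block are omitted, and the aggregation step uses a plain 2×2 max pooling as a simplified stand-in for the aggregation operation proposed in the paper.

```python
import numpy as np

def blockify(grid, block_size):
    """Split an (H, W, C) grid of patch embeddings into non-overlapping blocks.

    Returns an array of shape (num_blocks, block_size * block_size, C),
    i.e. one flattened token sequence per block.
    """
    h, w, c = grid.shape
    assert h % block_size == 0 and w % block_size == 0
    gh, gw = h // block_size, w // block_size
    x = grid.reshape(gh, block_size, gw, block_size, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, block_size, block_size, C)
    return x.reshape(gh * gw, block_size * block_size, c)

def unblockify(blocks, grid_h, grid_w, block_size):
    """Inverse of blockify: rebuild the full (H, W, C) grid from its blocks."""
    c = blocks.shape[-1]
    x = blocks.reshape(grid_h, grid_w, block_size, block_size, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, block_size, gw, block_size, C)
    return x.reshape(grid_h * block_size, grid_w * block_size, c)

def aggregate(blocks, grid_h, grid_w, block_size):
    """Merge every four spatially adjacent blocks into one.

    Assumption: 2x2 max pooling on the full grid is used here as a
    simplified stand-in for the paper's block aggregation.
    """
    grid = unblockify(blocks, grid_h, grid_w, block_size)
    h, w, c = grid.shape
    pooled = grid.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))
    return blockify(pooled, block_size)

rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 16, 8))     # 16x16 patch embeddings, 8 channels
level0 = blockify(patches, block_size=4)   # 16 blocks of 16 tokens each
level1 = aggregate(level0, 4, 4, 4)        # 4 blocks of 16 tokens each
level2 = aggregate(level1, 2, 2, 4)        # 1 block of 16 tokens
print(level0.shape, level1.shape, level2.shape)
```

In this toy setup there are three hierarchies: the number of blocks shrinks by a factor of four at each level while the sequence length per block stays fixed, so self-attention always operates on short, local sequences.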

The mixture model network (MoNet) is a general framework for designing convolutional deep architectures on non-Euclidean domains such as graphs and manifolds.

Image and description from: [Geometric deep learning on graphs and manifolds using mixture model CNNs](https://arxiv.org/pdf/1611.08402.pdf)

MoNet

Geometric deep learning on graphs and manifolds using mixture model CNNs

NesT

Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding

An **Associative LSTM** combines an [LSTM](https://paperswithcode.com/method/lstm) with ideas from Holographic Reduced Representations (HRRs) to enable key-value storage of data. HRRs use a “binding” operator to implement key-value binding between two vectors (the key and its associated content). They natively implement associative arrays; as a byproduct, they can also easily implement stacks, queues, or lists.
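To make the binding idea concrete, the sketch below uses circular convolution, the classical HRR binding operator, to store several key-value pairs in a single vector and retrieve one of them. It illustrates HRRs only, not the Associative LSTM itself, which may use a different form of the operator; the function names, the dimension, and the number of pairs are choices made for this example.

```python
import numpy as np

def bind(key, value):
    """Bind key and value by circular convolution (computed via the FFT)."""
    n = len(key)
    return np.fft.irfft(np.fft.rfft(key) * np.fft.rfft(value), n=n)

def unbind(key, trace):
    """Approximately recover the value bound to `key` by circular correlation."""
    n = len(key)
    return np.fft.irfft(np.conj(np.fft.rfft(key)) * np.fft.rfft(trace), n=n)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
dim, num_pairs = 1024, 3
# Random keys and values with element variance 1/dim (expected unit norm).
keys = rng.normal(0.0, 1.0 / np.sqrt(dim), size=(num_pairs, dim))
values = rng.normal(0.0, 1.0 / np.sqrt(dim), size=(num_pairs, dim))

# The whole associative array is one vector: the sum of all bound pairs.
trace = sum(bind(k, v) for k, v in zip(keys, values))

# Query with the second key; the result is a noisy copy of the second value.
noisy = unbind(keys[1], trace)
similarities = [cosine(noisy, v) for v in values]
best = int(np.argmax(similarities))
print(best, similarities)
```

Retrieval is approximate: the unbound vector is the stored value plus noise from the other pairs, so it is cleaned up by comparing against the known candidate values. The noise grows with the number of stored pairs.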

Source	Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding
Year	2021
Data Source	CC BY-SA - https://paperswithcode.com