What is: Siamese Multi-depth Transformer-based Hierarchical Encoder?
Source | Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching |
Year | 2020 |
Data Source | CC BY-SA - https://paperswithcode.com |
SMITH, or Siamese Multi-depth Transformer-based Hierarchical Encoder, is a Transformer-based model for document representation learning and matching. It incorporates several design choices that adapt self-attention models to long text inputs. For pre-training, a masked sentence block language modeling task is used in addition to the masked word language modeling task from BERT, in order to capture relations between sentence blocks within a document. The input document is split into sentence blocks, each of which is first encoded by a sentence-level Transformer; given the resulting sequence of sentence block representations, document-level Transformers then learn a contextual representation for each sentence block and the final document representation.
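The two-level encoding can be illustrated with a minimal PyTorch sketch, not the authors' implementation: a document arrives pre-split into fixed-length sentence blocks, a sentence-level Transformer contextualizes tokens within each block, the pooled block vectors pass through a document-level Transformer, and a final pooling step yields the document embedding. All hyperparameters and the mean-pooling choices here are illustrative assumptions; SMITH's actual block construction and pooling differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalEncoder(nn.Module):
    """Sketch of a two-level (sentence-block then document) Transformer encoder."""

    def __init__(self, vocab_size=30522, dim=256, heads=4,
                 sent_layers=2, doc_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        sent_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        doc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        # Sentence-block-level Transformer: contextualizes tokens within a block.
        self.sent_encoder = nn.TransformerEncoder(sent_layer, sent_layers)
        # Document-level Transformer: contextualizes block representations.
        self.doc_encoder = nn.TransformerEncoder(doc_layer, doc_layers)

    def forward(self, token_ids):
        # token_ids: (batch, num_blocks, block_len), the document already
        # split into fixed-length sentence blocks (an assumption here).
        b, n, l = token_ids.shape
        tokens = self.embed(token_ids).view(b * n, l, -1)
        tokens = self.sent_encoder(tokens)
        # Pool tokens into one vector per sentence block (mean pooling is a
        # simplification of the paper's pooling scheme).
        blocks = tokens.mean(dim=1).view(b, n, -1)
        blocks = self.doc_encoder(blocks)
        # Pool block representations into the final document embedding.
        return blocks.mean(dim=1)

# Siamese matching: one shared encoder embeds both documents, and a
# similarity function (cosine here) scores the pair.
encoder = HierarchicalEncoder()
doc_a = torch.randint(0, 30522, (1, 8, 32))  # 8 blocks of 32 tokens each
doc_b = torch.randint(0, 30522, (1, 8, 32))
score = F.cosine_similarity(encoder(doc_a), encoder(doc_b))
print(score)
```

Because each document is encoded independently by the shared encoder, document embeddings can be precomputed and compared cheaply at matching time, which is the main appeal of the Siamese setup for long-form document pairs.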