What is: GPT-NeoX?
Source | GPT-NeoX-20B: An Open-Source Autoregressive Language Model |
Year | 2022 |
Data Source | CC BY-SA - https://paperswithcode.com |
GPT-NeoX is an autoregressive transformer decoder model whose architecture largely follows that of GPT-3, with a few notable deviations. The model has 20 billion parameters, 44 layers, a hidden dimension of 6144, and 64 attention heads. The main differences from GPT-3 are a different tokenizer, the addition of Rotary Positional Embeddings (RoPE), the parallel computation of the attention and feed-forward layers (sketched below), and a different initialization scheme and choice of hyperparameters.
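The parallel residual is the most code-visible of these changes. The PyTorch sketch below contrasts it with the serial GPT-3-style block; it substitutes a standard `nn.MultiheadAttention` for GPT-NeoX's rotary-embedding attention, and the class and variable names are illustrative, not the actual GPT-NeoX implementation.

```python
import torch
import torch.nn as nn

class ParallelTransformerBlock(nn.Module):
    """Minimal sketch of GPT-NeoX's parallel attention + feed-forward block.

    A GPT-3-style block runs its sublayers in series:
        x = x + Attn(LN1(x));  x = x + FF(LN2(x))
    GPT-NeoX instead computes them in parallel and sums the results:
        x = x + Attn(LN1(x)) + FF(LN2(x))
    Rotary embeddings and model-parallel details are omitted here.
    """

    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.ln_attn = nn.LayerNorm(hidden_size)
        self.ln_ff = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len = x.size(1)
        # Causal mask: each position may attend only to itself and earlier positions.
        causal = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        h = self.ln_attn(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        # Parallel residual: both sublayers read the same input x,
        # so they can be computed concurrently and summed.
        return x + attn_out + self.ff(self.ln_ff(x))


# Toy dimensions for illustration; GPT-NeoX-20B uses hidden_size=6144
# and num_heads=64 across 44 such layers.
block = ParallelTransformerBlock(hidden_size=256, num_heads=8)
out = block(torch.randn(2, 16, 256))  # (batch, seq, hidden)
```

Because the two sublayers no longer depend on each other's output, their matrix multiplications can be fused or overlapped, which is the throughput motivation the GPT-NeoX authors cite for this design.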