What is SentencePiece?
Source | SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing |
Year | 2018 |
Data Source | CC BY-SA - https://paperswithcode.com |
SentencePiece is a subword tokenizer and detokenizer for natural language processing. It performs subword segmentation, supporting the byte-pair-encoding (BPE) algorithm and the unigram language model, and then converts the text into an ID sequence, guaranteeing perfect reproducibility of the normalization and subword segmentation.
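
To make the segment-then-map-to-IDs idea concrete, here is a minimal pure-Python sketch of a toy BPE tokenizer. It is not SentencePiece's implementation or API: the function names (`learn_merges`, `encode`, `decode`, and so on) and the tiny corpus are hypothetical, it skips normalization and unknown-token handling, and it only borrows the convention of replacing spaces with the "▁" symbol so that decoding can restore the original text.

```python
from collections import Counter

SPACE = "\u2581"  # the "▁" symbol, used here to mark whitespace


def to_symbols(text):
    # Treat whitespace as an ordinary symbol so decoding is lossless.
    return list(text.replace(" ", SPACE))


def apply_merge(seq, pair):
    # Replace every adjacent occurrence of `pair` with one merged piece.
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_merges(corpus, num_merges):
    # Toy BPE training: repeatedly merge the most frequent adjacent pair.
    seqs = [to_symbols(line) for line in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        # Ties are broken by the pair itself, so training is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        seqs = [apply_merge(seq, best) for seq in seqs]
    return merges


def build_vocab(corpus, merges):
    # Base symbols first, then merged pieces, each with a stable integer ID.
    symbols = sorted({ch for line in corpus for ch in to_symbols(line)})
    pieces = list(dict.fromkeys(symbols + [a + b for a, b in merges]))
    return {piece: i for i, piece in enumerate(pieces)}


def encode(text, merges, vocab):
    # Apply the learned merges in order, then map pieces to IDs.
    # Assumption: every character of `text` was seen in training
    # (this sketch has no unknown-token handling).
    seq = to_symbols(text)
    for pair in merges:
        seq = apply_merge(seq, pair)
    return [vocab[piece] for piece in seq]


def decode(ids, vocab):
    # Inverse mapping: IDs back to pieces, then "▁" back to spaces.
    inverse = {i: piece for piece, i in vocab.items()}
    return "".join(inverse[i] for i in ids).replace(SPACE, " ")


corpus = ["low lower lowest", "new newer newest"]
merges = learn_merges(corpus, num_merges=10)
vocab = build_vocab(corpus, merges)
ids = encode("lower newest", merges, vocab)
restored = decode(ids, vocab)
```

Because the merge list and vocabulary fully determine the output, encoding the same text twice gives the same ID sequence, and decoding those IDs returns the original string. That is the reproducibility property described above, shown here on a deliberately small scale.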