Common Models • Collection • The first generation of models pretrained on Common Corpus. • 5 items • Updated Dec 5, 2024 • 28
BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training • Paper • 2409.04599 • Published Sep 6, 2024 • 1
Toxicity of the Commons: Curating Open-Source Pre-Training Data • Paper • 2410.22587 • Published Oct 29, 2024 • 10