Article Releasing the largest multilingual open pretraining dataset By Pclanglais • Nov 13, 2024 • 98
Toxicity of the Commons: Curating Open-Source Pre-Training Data Paper • 2410.22587 • Published Oct 29, 2024 • 10
Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risk of Language Models Paper • 2408.08926 • Published Aug 15, 2024 • 5
Toxic Commons Collection Tools for de-toxifying public domain data, especially multilingual and historical text data and data with OCR errors. • 3 items • Updated Oct 31, 2024 • 5
Article wHy DoNt YoU jUsT uSe ThE lLaMa ToKeNiZeR?? By catherinearnett • Sep 27, 2024 • 38