HuggingFaceFW
🤗 HuggingFace 🍷 FineWeb datasets
Read our technical report!
This organization hosts the 🍷 FineWeb datasets, a collection of text datasets sourced from the web (CommonCrawl) and released under a permissive license (ODC-By).
The creation of 🍷 FineWeb involved careful processing and filtering of large amounts of web data, with the aim of lowering the barrier to entry for anyone intending to pretrain high-performance large language models.
All code and artefacts needed for reproduction are public and built on top of open source libraries, such as the 🤗 libraries datatrove, nanotron, or lighteval.
Version 1 of the 🍷 FineWeb dataset is available here. Our ablation models can be found here.
Version 2 of the 🥂 FineWeb dataset (a multilingual extension covering 1,800+ languages/scripts) is available here.