leonardlin's Collections
data
A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity
Paper • 2305.13169 • Published • 3
A Survey on Data Selection for Language Models
Paper • 2402.16827 • Published • 4
HuggingFaceFW/fineweb-edu
Viewer • Updated • 3.24B • 211k • 585
Updated • 29k • 132
Viewer • Updated • 7.18B • 10.6k • 491
Best Practices and Lessons Learned on Synthetic Data for Language Models
Paper • 2404.07503 • Published • 29
Scaling Synthetic Data Creation with 1,000,000,000 Personas
Paper • 2406.20094 • Published • 96
DDK: Distilling Domain Knowledge for Efficient Large Language Models
Paper • 2407.16154 • Published • 21
Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models
Paper • 2408.02085 • Published • 17
Better Alignment with Instruction Back-and-Forth Translation
Paper • 2408.04614 • Published • 14
The ShareLM Collection and Plugin: Contributing Human-Model Chats for the Benefit of the Community
Paper • 2408.08291 • Published • 11