Lists of URLs from various training datasets
Nick Hagar
nhagar
AI & ML interests
digital media, collective attention, computational social science
Recent Activity
Updated a collection 17 days ago: LLM-training-URLs
Updated a dataset 17 days ago: nhagar/cultura_urls
Updated a collection 18 days ago: CC-domain-counts
Collections: 2
Models: none public yet
Datasets: 6
nhagar/fineweb_urls
nhagar/cultura_urls: 7.18B rows, 34 downloads, 1 like
nhagar/CC_MAIN_2024_18_urls: 64.1M rows, 24 downloads
nhagar/CC_MAIN_2017_47_urls: 75.8M rows, 94 downloads, 1 like
nhagar/falcon_urls: 968M rows, 16 downloads, 1 like
nhagar/c4_en_urls: 365M rows, 98 downloads, 1 like