victormiller committed
Commit 8115558 · verified · 1 Parent(s): 4f82979

Update main.py

Files changed (1)
  1. main.py +5 -5
main.py CHANGED
@@ -128,11 +128,11 @@ def intro():
  We present TxT360, the Trillion eXtracted Text corpus, a 5.7T token dataset for pretraining projects that:
 
 
- 1. Curates commonly used pretraining datasets, including all CommonCrawl
- 2. Employs carefully selected filters designed for each data source
- 3. Provides only unique data elements via globally deduplicated across all datasets
- 4. Retains all deduplication metadata for custom upweighting
- 5. Is Production ready! Download here [link to HF repo]
+ - 1. Curates commonly used pretraining datasets, including all CommonCrawl
+ - 2. Employs carefully selected filters designed for each data source
+ - 3. Provides only unique data elements via globally deduplicated across all datasets
+ - 4. Retains all deduplication metadata for custom upweighting
+ - 5. Is Production ready! Download here [link to HF repo]
  """),
  id="section1",
  ),
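
Items 3 and 4 of the intro text in this diff (global deduplication across all datasets, with deduplication metadata retained for upweighting) can be illustrated with a small sketch. This is not the TxT360 pipeline: exact-match SHA-256 hashing is an assumption made here for brevity, and the function names, the record fields (`dup_count`, `sources`), and the `cap` parameter are all hypothetical.

```python
import hashlib


def dedup_with_metadata(datasets):
    """Keep one copy of each document across all sources, plus duplicate metadata.

    `datasets` maps a source name to a list of document strings
    (a hypothetical layout chosen for this sketch).
    """
    seen = {}  # content hash -> record for the first copy encountered
    for source, docs in datasets.items():
        for text in docs:
            # Assumption: exact-match dedup via a content hash.
            key = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if key not in seen:
                seen[key] = {"text": text, "dup_count": 1, "sources": [source]}
            else:
                record = seen[key]
                record["dup_count"] += 1
                if source not in record["sources"]:
                    record["sources"].append(source)
    return list(seen.values())


def upweight(records, cap=3):
    """Repeat each unique document up to `cap` times, based on how often it was seen."""
    sampled = []
    for record in records:
        sampled.extend([record["text"]] * min(record["dup_count"], cap))
    return sampled
```

Because the duplicate count and source list travel with each unique record, a downstream user could choose a different `cap` or weighting rule without re-running deduplication.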