victormiller
committed on
Update main.py
main.py
CHANGED
@@ -128,11 +128,11 @@ def intro():
 We present TxT360, the Trillion eXtracted Text corpus, a 5.7T token dataset for pretraining projects that:
 
 
-1. Curates commonly used pretraining datasets, including all CommonCrawl
-2. Employs carefully selected filters designed for each data source
-3. Provides only unique data elements via globally deduplicated across all datasets
-4. Retains all deduplication metadata for custom upweighting
-5. Is Production ready! Download here [link to HF repo]
+- 1. Curates commonly used pretraining datasets, including all CommonCrawl
+- 2. Employs carefully selected filters designed for each data source
+- 3. Provides only unique data elements via globally deduplicated across all datasets
+- 4. Retains all deduplication metadata for custom upweighting
+- 5. Is Production ready! Download here [link to HF repo]
 """),
 id="section1",
 ),