---
language:
- en
---
V1 of an English/code tokenizer. Byte-level BPE, 64k vocab, split digits (the difference with v1). Trained on an equal mix of the following sources.
On the NL side:
- Books
- C4
- v1 of our CC (helen quality classifier)
- enwiki
- Gutenberg
- Reddit
On the code side:
- Jupyter notebooks (0.5 weight, it was small)
- GH issues
- Stackexchange
- The cleaned Python Stack
For a total of roughly 1/3 code data (although Stackexchange and GH issues also contain a lot of English).
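
As a rough sanity check on the 1/3 figure, the mix can be sketched as per-source weights. This assumes "equal mix" means a weight of 1.0 for every source except Jupyter notebooks (0.5, as stated above); the dictionary names and the helper `code_fraction` are hypothetical, not part of the actual training setup.

```python
# Assumed sampling weights: 1.0 per source, except Jupyter notebooks (0.5).
# Only the 0.5 weight is stated in this card; the rest is an assumption.
nl_weights = {
    "Books": 1.0,
    "C4": 1.0,
    "CC v1": 1.0,
    "enwiki": 1.0,
    "Gutenberg": 1.0,
    "Reddit": 1.0,
}
code_weights = {
    "Jupyter notebooks": 0.5,
    "GH issues": 1.0,
    "Stackexchange": 1.0,
    "Python Stack": 1.0,
}


def code_fraction(nl, code):
    """Return the share of the total weight that goes to code-side sources."""
    total = sum(nl.values()) + sum(code.values())
    return sum(code.values()) / total


fraction = code_fraction(nl_weights, code_weights)
print(f"{fraction:.3f}")  # 0.368
```

Under these assumptions the code side gets 3.5 out of 9.5 total weight, a little over one third, which is consistent with the figure quoted above.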