# training llama tokenizer

How does Meta train their sentencepiece tokenizer? You can print the config as follows:

```python
import sentencepiece.sentencepiece_model_pb2
# tokenizer.model is a serialized protobuf; parse it and dump the specs
mp = sentencepiece.sentencepiece_model_pb2.ModelProto()
mp.ParseFromString(open("tokenizer.model", "rb").read())
print(mp.trainer_spec)
print(mp.normalizer_spec)
```

this gives:

```
trainer_spec {
  input: "/large_experiments/theorem/datasets/MERGED/all.test1.merged"
  model_prefix: "spm_model_32k_200M_charcov099995_allowWSO__v2"
  model_type: BPE
  vocab_size: 32000
  self_test_sample_size: 0
  input_format: "text"
  character_coverage: 0.9999499917030334
  input_sentence_size: 200000000
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  num_threads: 80
  num_sub_iterations: 2
  max_sentence_length: 4192
  shuffle_input_sentence: true
  max_sentencepiece_length: 16
  split_by_unicode_script: true
  split_by_whitespace: true
  split_by_number: true
  treat_whitespace_as_suffix: false
  split_digits: true
  allow_whitespace_only_pieces: true
  vocabulary_output_piece_score: true
  hard_vocab_limit: true
  use_all_vocab: false
  byte_fallback: true
  required_chars: ""
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_surface: " \342\201\207 "
  unk_piece: "<unk>"
  bos_piece: "<s>"
  eos_piece: "</s>"
  pad_piece: "<pad>"
  train_extremely_large_corpus: false
  enable_differential_privacy: false
  differential_privacy_noise_level: 0.0
  differential_privacy_clipping_threshold: 0
}
normalizer_spec {
  name: "identity"
  precompiled_charsmap: ""
  add_dummy_prefix: true
  remove_extra_whitespaces: false
  normalization_rule_tsv: ""
}
```
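
By the way, the octal escapes in `unk_surface` are just the UTF-8 bytes of U+2047 ("⁇"), the surface string sentencepiece substitutes for unknown tokens when decoding. A one-liner to verify, assuming nothing beyond Python itself:

```python
# decode the octal-escaped UTF-8 bytes from unk_surface above
print(b" \342\201\207 ".decode("utf-8"))  # ' ⁇ ' (U+2047, double question mark)
```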

We can use sentencepiece's `spm_train` to train the same kind of model, optionally with a smaller vocabulary. Their [options docs](https://github.com/google/sentencepiece/blob/master/doc/options.md) are the reference to consult. It's not much, but it helps.

We'll depart from Meta's config on one setting: I recommend changing `character_coverage` to 1.0. We also want to note the following important settings, which come up in the paper and are not necessarily the sentencepiece defaults:

```
--split_digits=true
--allow_whitespace_only_pieces=true
--byte_fallback=true
--normalization_rule_name=identity
```
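
To see what two of these flags buy you, here is a quick sanity-check sketch. It assumes the stock Llama `tokenizer.model` sits in the working directory; the exact pieces you get back may differ:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

# split_digits=true: numbers get broken into individual digit pieces
print(sp.encode("year 2023", out_type=str))

# byte_fallback=true: characters missing from the vocab decompose into raw
# byte pieces like <0xE2> rather than collapsing to <unk>
print(sp.encode("snowman: ☃", out_type=str))
```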

With this in mind, we can train a sentencepiece vocab in what I believe is probably the same way Meta trained theirs:

```
spm_train --input="$input" \
          --model_prefix="$model_prefix" \
          --model_type=bpe \
          --vocab_size="$vocab_size" \
          --self_test_sample_size=0 \
          --input_format="text" \
          --character_coverage=1.0 \
          --num_threads="$(nproc)" \
          --split_digits=true \
          --allow_whitespace_only_pieces=true \
          --byte_fallback=true \
          --unk_surface=" \342\201\207 " \
          --normalization_rule_name=identity
```

Here `$input` is the input text file, `$model_prefix` is the output path prefix, and `$vocab_size` is the desired vocabulary size; with `--num_threads="$(nproc)"` we're by default taking over all CPU cores of the machine.

Lastly, note that sentencepiece is a bit weird and expects "sentences" delimited by newlines as input. You can't just feed it one massive block of text. It also has a hyperparameter that controls the maximum size of a "sentence" (`max_sentence_length`, 4192 in the config above). Fwiw I really dislike this design choice around a weird concept of a "sentence". It should just be a block of text with no assumptions. But here we are.
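
For example, here is a minimal sketch of flattening raw text blobs into that newline-delimited format; the helper name and the 4192-byte budget (mirroring `max_sentence_length` above) are my own choices, not anything sentencepiece ships:

```python
# hypothetical helper: write raw text blobs as the newline-delimited
# "sentences" file that spm_train expects
def write_spm_input(texts, path, max_bytes=4192):
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            for line in text.split("\n"):
                line = line.strip()
                # sentencepiece skips over-long "sentences", so filter early
                if line and len(line.encode("utf-8")) <= max_bytes:
                    f.write(line + "\n")

write_spm_input(["one big\nblock of text...", "another document"], "corpus.txt")
```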

Look into the file `tinystories.py`, where we train the vocab in the same way, but using the Python bindings instead.
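
For reference, that Python-bindings version looks roughly like this; the input path, model prefix, and the 4096 vocab size are placeholders rather than the exact values in `tinystories.py`:

```python
import os
import sentencepiece as spm

# the same flags as the spm_train command above, via the Python bindings
spm.SentencePieceTrainer.train(
    input="corpus.txt",          # newline-delimited "sentences"
    model_prefix="tok4096",      # writes tok4096.model and tok4096.vocab
    model_type="bpe",
    vocab_size=4096,
    self_test_sample_size=0,
    input_format="text",
    character_coverage=1.0,
    num_threads=os.cpu_count(),
    split_digits=True,
    allow_whitespace_only_pieces=True,
    byte_fallback=True,
    unk_surface=r" \342\201\207 ",
    normalization_rule_name="identity",
)
```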