teragron committed
Commit 651da3a · 1 Parent(s): ae54289

Delete doc/train_llama_tokenizer.md

Files changed (1)
  1. doc/train_llama_tokenizer.md +0 -99
doc/train_llama_tokenizer.md DELETED
@@ -1,99 +0,0 @@
# training llama tokenizer

How does Meta train their sentencepiece tokenizer? You can print the config as follows:

```python
import sentencepiece.sentencepiece_model_pb2

# parse the protobuf stored inside tokenizer.model and dump its training and normalization config
mp = sentencepiece.sentencepiece_model_pb2.ModelProto()
mp.ParseFromString(open("tokenizer.model", "rb").read())
print(mp.trainer_spec)
print(mp.normalizer_spec)
```

this gives:

```
trainer_spec {
  input: "/large_experiments/theorem/datasets/MERGED/all.test1.merged"
  model_prefix: "spm_model_32k_200M_charcov099995_allowWSO__v2"
  model_type: BPE
  vocab_size: 32000
  self_test_sample_size: 0
  input_format: "text"
  character_coverage: 0.9999499917030334
  input_sentence_size: 200000000
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  num_threads: 80
  num_sub_iterations: 2
  max_sentence_length: 4192
  shuffle_input_sentence: true
  max_sentencepiece_length: 16
  split_by_unicode_script: true
  split_by_whitespace: true
  split_by_number: true
  treat_whitespace_as_suffix: false
  split_digits: true
  allow_whitespace_only_pieces: true
  vocabulary_output_piece_score: true
  hard_vocab_limit: true
  use_all_vocab: false
  byte_fallback: true
  required_chars: ""
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_surface: " \342\201\207 "
  unk_piece: "<unk>"
  bos_piece: "<s>"
  eos_piece: "</s>"
  pad_piece: "<pad>"
  train_extremely_large_corpus: false
  enable_differential_privacy: false
  differential_privacy_noise_level: 0.0
  differential_privacy_clipping_threshold: 0
}
normalizer_spec {
  name: "identity"
  precompiled_charsmap: ""
  add_dummy_prefix: true
  remove_extra_whitespaces: false
  normalization_rule_tsv: ""
}
```
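
As an optional sanity check, you can also load the same `tokenizer.model` with the regular `SentencePieceProcessor` API and confirm a few of the values above directly. A minimal sketch (the printed values just restate what the config dump already says):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

# should match vocab_size and unk_id/bos_id/eos_id/pad_id in trainer_spec above
print(sp.vocab_size())                                     # 32000
print(sp.unk_id(), sp.bos_id(), sp.eos_id(), sp.pad_id())  # 0 1 2 -1

# byte_fallback=true means raw byte pieces like <0x41> are part of the vocab
print(sp.id_to_piece(sp.piece_to_id("<0x41>")))            # <0x41>

# split_digits=true: digits get tokenized one at a time
print(sp.encode("year 2023", out_type=str))
```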

We can use the sentencepiece `spm_train` tool to train the same kind of model, optionally smaller. Here are the [options docs](https://github.com/google/sentencepiece/blob/master/doc/options.md) we can refer to. It's not much, but it helps.

We'll depart from their settings in one place: I recommend changing `character_coverage` -> 1.0. We also want to make sure to note the following important settings that come up in the paper and are not necessarily the default sentencepiece settings:

```
--split_digits = true
--allow_whitespace_only_pieces = true
--byte_fallback = true
--normalization_rule_name = identity
```

With this in mind we can train a sentencepiece vocab in what I believe is probably the same way Meta trained theirs:

```
spm_train --input="$input" \
          --model_prefix="$model_prefix" \
          --model_type=bpe \
          --vocab_size="$vocab_size" \
          --self_test_sample_size=0 \
          --input_format="text" \
          --character_coverage=1.0 \
          --num_threads="$(nproc)" \
          --split_digits=true \
          --allow_whitespace_only_pieces=true \
          --byte_fallback=true \
          --unk_surface=" \342\201\207 " \
          --normalization_rule_name=identity
```

Where $input is the input text file, $model_prefix is the output path prefix, and $vocab_size is the desired vocab size. Note that with `--num_threads="$(nproc)"` we're by default taking over all the CPU cores of the machine.

Lastly, note that sentencepiece is a bit weird and expects "sentences" delimited by newlines as its input. You can't just feed it one massive block of text. And there is a hyperparameter that controls the maximum size of a "sentence" (`max_sentence_length`, 4192 in the dump above). Fwiw I really dislike this design choice around the concept of a "sentence". It should just be a block of text with no assumptions. But here we are.
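
A minimal sketch of preparing such a file, assuming a hypothetical `corpus.txt` output and an in-memory list of documents (names and data are placeholders, not from the original recipe):

```python
# hypothetical sketch: write one "sentence" (here: one document) per line,
# which is the newline-delimited input format spm_train expects
docs = ["first document text ...", "second document text ..."]  # stand-in for your corpus

with open("corpus.txt", "w", encoding="utf-8") as f:
    for doc in docs:
        # collapse internal newlines so a document isn't split into multiple "sentences"
        f.write(doc.replace("\n", " ").strip() + "\n")
```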

Look into the file `tinystories.py` where we train the vocab in the same way, but using the Python bindings instead.
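
For reference, a training call through the Python bindings looks roughly like the sketch below, mirroring the `spm_train` flags above; the file names and vocab size are placeholders, and `tinystories.py` is the authoritative version:

```python
import os
import sentencepiece as spm

# roughly the Python-bindings equivalent of the spm_train command above
# ("corpus.txt" and "tok4096" are placeholder names)
spm.SentencePieceTrainer.train(
    input="corpus.txt",          # newline-delimited "sentences"
    model_prefix="tok4096",      # writes tok4096.model and tok4096.vocab
    model_type="bpe",
    vocab_size=4096,
    self_test_sample_size=0,
    input_format="text",
    character_coverage=1.0,
    num_threads=os.cpu_count(),
    split_digits=True,
    allow_whitespace_only_pieces=True,
    byte_fallback=True,
    unk_surface=r" \342\201\207 ",
    normalization_rule_name="identity",
)
```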