Discrepancy between tokenizer and vocab

#5
by ahmedsqrd - opened

Hi all,

It seems that in the config for this model and Olmo-7b-hf the vocab size is 50304, but the length of the tokenizer appears to be 50280, judging from the tokenizer json file. Why is this the case? Shouldn't the vocab size be 50280?

Transformers assumes that the vocabulary size (the length of the tokenizer) is the same as the embedding size, i.e. the number of rows in the embedding matrix. This assumption doesn't hold for the OLMo codebase, since we found that making the embedding size divisible by a larger power of 2 can make the model run faster. From our codebase:

If ``vocab_size`` is not a multiple of 128, setting this to the
next multiple of 128 that's greater than ``vocab_size`` can improve throughput
substantially.

50304 is the next multiple of 128 after 50280, so we set our embedding size to 50304. The corresponding setting in transformers is vocab_size, which is why the config reports 50304 rather than the tokenizer length.
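For anyone who wants to check the arithmetic, here is a minimal sketch of the rounding. The helper name `pad_to_multiple` is hypothetical and not taken from the OLMo codebase; only the numbers 50280, 128 and 50304 come from this thread.

```python
def pad_to_multiple(vocab_size: int, multiple: int = 128) -> int:
    """Round vocab_size up to the nearest multiple of `multiple`.

    Hypothetical helper for illustration only. A value that is already
    a multiple is returned unchanged.
    """
    return ((vocab_size + multiple - 1) // multiple) * multiple


tokenizer_vocab_size = 50280  # length of the tokenizer, as reported above
embedding_size = pad_to_multiple(tokenizer_vocab_size)

# Rows of the embedding matrix that no tokenizer entry maps to.
unused_rows = embedding_size - tokenizer_vocab_size

print(embedding_size)  # 50304
print(unused_rows)     # 24
```

So the 24 extra embedding rows are padding: they exist in the weight matrix, but the tokenizer never produces their IDs.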

OK, that makes sense. Just to clarify, is the tokenizer the same for the 1B and 7B models? I noticed that the 7B model has the vocab size set to 50304 (as opposed to 50280 for the 7B model currently).

The models without -hf and the models with -hf have tokenizers that should behave the same way, but their implementations (and therefore their configs) differ.

okay that makes sense, thank you!

ahmedsqrd changed discussion status to closed
