Why does the 72B model have a different vocab size compared with other models?

#1
by Mikasaka - opened

I found that this 72B model has a vocab size of 152064, while other models (7B, 4B, etc.) have a vocab size of 151936. Why is it designed this way?

I have a similar problem. For Qwen 1.8B they mention that the vocab size is 151851, and the tokenizer also has 151851 entries, but in the model weights the vocab_size is 151936. Can someone explain why that is? Thanks.

The vocabularies are actually the same. The vocab sizes differ because of our distributed training: for larger models trained across devices, we need to pad the vocab.

jklj077 changed discussion status to closed

The problem is that vLLM checks for vocab size and if it doesn't match, the speculative decoding is not enabled. If you pad, then maybe pad all models to the same vocab size.

Hi, did you solve this problem?

Yes and no. I modified the model to have the same vocab size. However, the vLLM speculative decoding performance is so terrible that it is not worth using.

For tokenizers in transformers, by convention, tokenizer.vocab_size (as documented) is the size of the base vocabulary, without the added tokens. To get the actual vocabulary size, you need to use len(tokenizer), which is 151646 for Qwen1.5 models.

The vocab_size in config.json is the number of embeddings, which can be larger than the actual vocabulary size because of optimization for GPU computation and other considerations. 152064 is divisible by 256 and 151936 is divisible by 128.
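To make the arithmetic concrete, here is a minimal sketch of rounding a vocabulary size up to a multiple. The helper name round_up and the choice of 151851 as the starting size (the Qwen figure quoted earlier in this thread) are assumptions for illustration. The thread does not confirm that this is the exact padding rule used in training, only that the results are consistent with the two config values.

```python
def round_up(n: int, multiple: int) -> int:
    """Round n up to the nearest multiple of `multiple` (hypothetical helper)."""
    return ((n + multiple - 1) // multiple) * multiple


# Assumption: start from the 151851 vocab size quoted earlier in the thread.
base_vocab = 151851

padded_128 = round_up(base_vocab, 128)
padded_256 = round_up(base_vocab, 256)

print(padded_128)  # 151936
print(padded_256)  # 152064

# Divisibility checks for the two vocab_size values seen in config.json.
print(152064 % 256)  # 0
print(151936 % 128)  # 0
print(151936 % 256)  # 128, so 151936 is not a multiple of 256
```

The extra rows in the embedding matrix are just padding; they do not correspond to tokens the tokenizer can produce.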

How to solve it?
