Model generates tokens > 50280?
Hello.
I'm using this model with vLLM to generate some synthetic data. I sample very simply: temperature 1, no top_k, no top_p. However, the generation occasionally outputs token IDs greater than 50280. I assume this is because, IIRC, for technical reasons the model was trained with a vocab size of 50304 so that it would be a multiple of 128.
However, why is this leaking into the generation? Should I maybe apply a cutoff probability?
Thanks for the help.
Hey @frotaur, you're absolutely right. The vocabulary size is 50280, but during training the embedding matrix was padded to 50304 to make it a multiple of 128 for computational efficiency. The extra token IDs are not associated with any meaningful content in the tokenizer, so when they show up in generations it is effectively leakage from those padded slots. To fix this, filter out the unused token IDs at generation time by masking their logits before sampling, something like logits[50280:] = -float('inf'). Doing this prevents the extra tokens from being picked at all.
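For context, here is a minimal sketch of how that masking could be wired into vLLM, assuming a vLLM version that still exposes the logits_processors hook on SamplingParams (this interface has changed across releases, so check your version); the model name is a placeholder:

```python
from vllm import LLM, SamplingParams

REAL_VOCAB_SIZE = 50280  # IDs >= this are padding-only slots (embedding padded to 50304)

def mask_padded_vocab(token_ids, logits):
    # Called at each decoding step: push the padded slots to -inf so the
    # sampler can never pick them, even at temperature 1 with no top_k/top_p.
    logits[REAL_VOCAB_SIZE:] = -float("inf")
    return logits

llm = LLM(model="your-model-here")  # placeholder model name
params = SamplingParams(
    temperature=1.0,
    logits_processors=[mask_padded_vocab],
)
outputs = llm.generate(["Write a short story about a robot."], params)
print(outputs[0].outputs[0].text)
```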
Indeed, that's what I figured (I looked at the frequencies, and the probabilities assigned to those IDs were around 1e-8, which is consistent with floats that are close to, but not exactly, zero). Thanks for the answer though!
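For anyone wanting to reproduce that sanity check, here is a small sketch (not from the thread) that loads the same checkpoint with transformers, runs one forward pass, and measures how much probability mass lands on the padded vocab slots; "your-model-here" is again a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "your-model-here"  # placeholder model name
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tok("Hello, world!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # last-token logits, shape [50304]

probs = torch.softmax(logits, dim=-1)
# Tiny but nonzero: the padded embeddings still produce finite logits.
print("probability mass on padded IDs:", probs[50280:].sum().item())
```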