Model generates tokens > 50280?

#7
by frotaur - opened

Hello.

I'm using this model with vLLM to generate some synthetic data. I generate very simply with temperature 1, no top_k, no top_p. However, the generation rarely outputs tokens with IDs greater than 50280. I assume this is because, IIRC, the model was trained with a vocab size of 50304 so that it would be a multiple of 128, for technical reasons.

However, why is this leaking into the generation? Should I maybe apply a cutoff probability?

Thanks for the help.

Hey @frotaur, you're absolutely right. The vocab size is 50280, but during training the embedding matrix was padded to 50304 to make it a multiple of 128 for computational efficiency. These extra tokens are not associated with any meaningful content in the tokenizer, so their appearance in the output can be considered leakage. To fix this, filter out the unused token IDs by masking the logits before sampling, e.g. `logits[50280:] = -float('inf')`. This prevents the padded tokens from ever being picked.
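For reference, here is a minimal sketch of how that masking could be wired into a vLLM sampling call. It assumes a vLLM release whose `SamplingParams` accepts `logits_processors` (the exact API has changed across versions), and the model name is a placeholder:

```python
import torch
from vllm import LLM, SamplingParams

REAL_VOCAB_SIZE = 50280  # IDs >= this are padding slots added to reach 50304


def mask_padded_tokens(token_ids, logits):
    # Set the logits of the padded vocabulary slots to -inf so they can
    # never be sampled, regardless of temperature or sampling settings.
    logits[REAL_VOCAB_SIZE:] = -float("inf")
    return logits


llm = LLM(model="your-model-name-here")  # placeholder, not from the thread
params = SamplingParams(
    temperature=1.0,
    logits_processors=[mask_padded_tokens],
)
outputs = llm.generate(["Some prompt"], params)
```

If your vLLM version does not expose `logits_processors`, the same idea applies anywhere you have access to the raw logits: zero out (mask to -inf) everything at or above index 50280 before the softmax/sampling step.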

Indeed, that's what I figured (I looked at the frequencies, and the probabilities assigned to those tokens were ~1e-8, which is consistent with floats that are essentially zero). Thanks for the answer!

frotaur changed discussion status to closed
