Why is there a discrepancy between the size of the tokenizer and the word embedding matrix?
The word embedding matrix holds a vector representation for each token in the vocabulary. Generally, the number of tokens in the tokenizer (i.e., the vocabulary size) matches the number of rows in the word embedding matrix.
nomic
- size of tokenizer and word_embeddings differs: 30522 != 30528
bge
- size of tokenizer and word_embeddings is the same: 30522 == 30522
During training, we padded the embedding matrix to a multiple of 64, similar to MosaicBert, as it speeds up training. The padded tokens never get updated during training.
https://x.com/karpathy/status/1621578354024677377
Thanks @zpn. So the following row ids of word_embeddings will never get used, because the tokenizer object will never generate these token ids: [30522, 30523, 30524, 30525, 30526, 30527]
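The padding arithmetic described above can be checked with a few lines of plain Python. This is only a minimal sketch: `pad_to_multiple` is a hypothetical helper written for illustration (not part of any library), and the vocabulary size 30522 and the multiple 64 are the numbers quoted in this thread.

```python
def pad_to_multiple(n: int, multiple: int) -> int:
    """Round n up to the nearest multiple of `multiple` (hypothetical helper)."""
    return ((n + multiple - 1) // multiple) * multiple


tokenizer_vocab_size = 30522  # tokenizer size quoted in this thread
padded_rows = pad_to_multiple(tokenizer_vocab_size, 64)

# Rows that exist in the padded matrix but have no token id in the tokenizer.
unused_row_ids = list(range(tokenizer_vocab_size, padded_rows))

print(padded_rows)
print(unused_row_ids)
```

Under these assumptions the padded row count comes out to 30528, and the leftover rows are exactly the six ids listed above.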
On a side but related note, I have added new tokens to the bert-base-uncased
tokenizer. The new token ids start from 30522 (since token ids up to 30521
already exist in the default tokenizer). Now, in order to fine-tune the model, I'm using the following code to extend the embedding layer.
import torch.nn as nn

# Row count is taken from the (padded) embedding matrix
old_num_tokens, old_embedding_dim = model.embeddings.word_embeddings.weight.shape
new_embeddings = nn.Embedding(
    old_num_tokens + num_new_tokens, old_embedding_dim
)
Because of the padding, the new token ids are getting mapped to the wrong rows in the word_embeddings matrix.
Instead, I should have used the following code to extend the embedding layer.
_, old_embedding_dim = model.embeddings.word_embeddings.weight.shape
old_num_tokens = tokenizer.vocab_size  # original number of tokens
new_embeddings = nn.Embedding(
    old_num_tokens + num_new_tokens, old_embedding_dim
)
Just wanted to double-check whether my understanding is correct.
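To make the difference between the two snippets concrete, here is a rough sketch of the row arithmetic in plain Python, without loading any model. It assumes the numbers from this thread (30528 matrix rows, 30522 tokenizer entries, new ids starting at 30522); `num_new_tokens = 3` is an arbitrary illustrative value, not something stated in the question.

```python
matrix_rows = 30528          # rows in the padded word_embeddings matrix
tokenizer_vocab_size = 30522  # size of the original tokenizer vocabulary
num_new_tokens = 3           # hypothetical number of added tokens

# Ids the tokenizer assigns to the added tokens (they start right after 30521).
new_token_ids = list(range(tokenizer_vocab_size, tokenizer_vocab_size + num_new_tokens))

# First snippet: the row count is derived from the padded matrix shape.
rows_from_matrix_shape = matrix_rows + num_new_tokens

# Second snippet: the row count is derived from the tokenizer vocabulary size.
rows_from_tokenizer = tokenizer_vocab_size + num_new_tokens

# With the first sizing, the rows appended past the old matrix are never
# indexed by the new ids; the new ids fall on the former padding rows.
appended_row_ids = list(range(matrix_rows, rows_from_matrix_shape))
ids_on_padding_rows = [i for i in new_token_ids if i < matrix_rows]

print(new_token_ids, rows_from_matrix_shape, rows_from_tokenizer)
print(appended_row_ids, ids_on_padding_rows)
```

With these assumed numbers, the new ids line up with the last rows of the matrix only under the second sizing. Under the first sizing they land on rows that were originally padding, and the appended rows stay unused.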