Why is there a discrepancy between the size of the tokenizer and the word embedding matrix?

#27 · opened by rajydv

The word embedding matrix holds a vector representation for each token in the vocabulary. Generally, the number of tokens in the tokenizer (i.e., the vocabulary size) matches the number of rows in the word embedding matrix.

nomic

(screenshot: nomic.png)

  • tokenizer vocab size != number of word embedding rows: 30522 != 30528

bge

(screenshot: image.png)

  • tokenizer vocab size == number of word embedding rows in bge: 30522 == 30522
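
A quick way to reproduce those numbers (a minimal sketch; the full model ids are assumed from the screenshots, and the embeddings.word_embeddings attribute path matches the snippet used later in this thread):

from transformers import AutoModel, AutoTokenizer

for name in ["nomic-ai/nomic-embed-text-v1", "BAAI/bge-base-en-v1.5"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name, trust_remote_code=True)
    # len(tokenizer) is 30522 for both; only the nomic checkpoint has 30528 embedding rows
    print(name, len(tokenizer), model.embeddings.word_embeddings.weight.shape[0])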
zpn (Nomic AI org)

During training, we padded the embedding matrix to a multiple of 64, similar to MosaicBERT, as it speeds up training. The padded tokens never get updated during training.
https://x.com/karpathy/status/1621578354024677377
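
For reference, the round-up works out as follows (a minimal sketch, assuming plain rounding up to the next multiple of 64):

import math

vocab_size = 30522                        # bert-base-uncased vocabulary
padded = math.ceil(vocab_size / 64) * 64
print(padded)                             # 30528, so rows 30522..30527 are padding only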

zpn changed discussion status to closed

Thanks @zpn. So the following row ids of word_embeddings will never get used, because the tokenizer will never generate these token ids:
[30522, 30523, 30524, 30525, 30526, 30527]

On a side but related note, I have added new tokens to the bert-base-uncased tokenizer. The new token_ids start from 30522 (since token ids up to 30521 already exist in the default tokenizer). Now, in order to fine-tune the model, I'm using the following code to extend the embedding layer.

import torch.nn as nn

# weight.shape[0] is the padded matrix size (30528 here), not the tokenizer vocab size
old_num_tokens, old_embedding_dim = model.embeddings.word_embeddings.weight.shape
new_embeddings = nn.Embedding(old_num_tokens + num_new_tokens, old_embedding_dim)

Because of the padding, the new token_ids get mapped to the wrong rows of the word_embeddings matrix: they land on the unused padding rows rather than on the rows appended for them.
Instead, I should have used the following code to extend the embedding layer:

_, old_embedding_dim = model.embeddings.word_embeddings.weight.shape
old_num_tokens = tokenizer.vocab_size  # original vocabulary size (30522), ignoring padded rows
new_embeddings = nn.Embedding(old_num_tokens + num_new_tokens, old_embedding_dim)
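
For completeness, here is a minimal sketch of that corrected flow end to end, using bert-base-uncased as a stand-in checkpoint and placeholder token strings; the key detail is copying only the first tokenizer.vocab_size rows of the old weights, so existing token embeddings are preserved and any padded rows are ignored:

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")  # stand-in; a padded checkpoint works the same way

num_new_tokens = tokenizer.add_tokens(["[NEW_TOKEN_1]", "[NEW_TOKEN_2]"])  # placeholder tokens

old_embeddings = model.embeddings.word_embeddings
old_num_tokens = tokenizer.vocab_size                  # 30522, ignores any padded rows
old_embedding_dim = old_embeddings.weight.shape[1]

new_embeddings = nn.Embedding(old_num_tokens + num_new_tokens, old_embedding_dim)
with torch.no_grad():
    # keep the original vocabulary's vectors; rows for the new tokens stay randomly initialized
    new_embeddings.weight[:old_num_tokens] = old_embeddings.weight[:old_num_tokens]

model.set_input_embeddings(new_embeddings)

With this sizing, the new token ids (30522 and up) line up exactly with the freshly added rows.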

Just wanted to double check if my understanding is correct.
