metadata
license: mit
language:
  - en
datasets:
  - sentence-transformers/all-nli
  - sentence-transformers/gooaq

wordllama

Installation

Use the GitHub repo (https://github.com/dleemiller/WordLlama) or install via pip:

pip install wordllama

Intended Use

This model is intended for natural language processing applications that require text embeddings, such as text classification, sentiment analysis, and document clustering. It is a token embedding model, comparable to word embedding models but substantially smaller (16 MB for the default 256-dim model).

from wordllama import WordLlama

wl = WordLlama.load()
similarity_score = wl.similarity("i went to the car", "i went to the pawn shop")
print(similarity_score)  # Output: 0.06641249096796882

Model Architecture

WordLlama is based on token embedding codebooks extracted from large language models. It is trained like a general-purpose embedding model, with MultipleNegativesRankingLoss from the sentence transformers library, and with Matryoshka Representation Learning so that embeddings can be truncated to 64, 128, 256, 512 or 1024 dimensions.
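
To illustrate what truncation means in practice, here is a minimal sketch in plain Python. It assumes that truncating keeps the leading components of a vector and that the result is re-normalized before comparison; the helper names `truncate_embedding` and `cosine` and the toy vectors are hypothetical and are not part of the wordllama API.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` components and L2-normalize the result."""
    if dim > len(vec):
        raise ValueError("dim exceeds embedding length")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy 8-dim vectors standing in for real embeddings (hypothetical values).
a = [0.9, 0.1, -0.3, 0.2, 0.05, -0.02, 0.01, 0.03]
b = [0.8, 0.2, -0.1, 0.3, -0.04, 0.06, 0.02, -0.01]

full = cosine(a, b)                                                  # all 8 dims
short = cosine(truncate_embedding(a, 4), truncate_embedding(b, 4))   # first 4 dims
print(full, short)
```

With Matryoshka-style training the leading dimensions are meant to carry most of the signal, so the truncated similarity should stay close to the full one. The toy vectors here only demonstrate the mechanics.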

To create WordLlama L2 "supercat", we extract and concatenate the token embedding codebooks from several large language models that use the llama2 tokenizer vocabulary (32k vocab size). This includes models like Llama2 70B and Phi-3 Medium. Then we add a trainable token weight parameter and initialize the weights of stopwords to a smaller value (0.1). Finally, we train a projection from the large, concatenated codebook down to a smaller dimension and average pool.
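
The steps above can be sketched with toy numbers. This is an illustration only: it assumes a single linear projection with no bias and applies the token weight after projecting. The function names, the three-token vocabulary and all values are hypothetical, and the actual training code may differ.

```python
def concat_codebooks(codebooks, token_id):
    """Concatenate one token's vectors from every source codebook."""
    out = []
    for cb in codebooks:
        out.extend(cb[token_id])
    return out

def project(matrix, vec):
    """Multiply an (out_dim x in_dim) matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def embed(token_ids, codebooks, token_weights, projection):
    """Project each token, scale by its weight, then average pool."""
    if not token_ids:
        raise ValueError("need at least one token")
    projected = []
    for t in token_ids:
        x = concat_codebooks(codebooks, t)
        y = project(projection, x)
        projected.append([token_weights[t] * v for v in y])
    n = len(projected)
    return [sum(col) / n for col in zip(*projected)]

# Toy setup: 3-token vocabulary, two 2-dim source codebooks (hypothetical values).
codebook_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
codebook_b = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
token_weights = [1.0, 0.1, 1.0]          # token 1 plays the role of a stopword
projection = [[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]]      # 4-dim concatenation -> 2 dims

sentence = embed([0, 1, 2], [codebook_a, codebook_b], token_weights, projection)
print(sentence)
```

Because the stopword token starts with weight 0.1, it contributes far less to the pooled vector than the other two tokens.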

We use popular embedding datasets from sentence transformers, and Matryoshka Representation Learning (MRL) so that dimensions can be truncated. For "binary" models, we train using a straight-through estimator, so that the embeddings can be binarized, e.g. (x>0).sign(), and packed into integers for Hamming distance computation.
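
As a rough illustration of the inference-side binarization and packing (the straight-through estimator applies only during training and is not shown), the following plain-Python sketch thresholds each component at zero, packs the bits into a single integer and counts differing bits. The helper names and example vectors are hypothetical, and the library's own packing layout may differ.

```python
def binarize_and_pack(vec):
    """Map each component to one bit (1 if x > 0, else 0) and pack into an int."""
    packed = 0
    for x in vec:
        packed = (packed << 1) | (1 if x > 0 else 0)
    return packed

def hamming(a, b):
    """Number of differing bits between two packed embeddings."""
    return bin(a ^ b).count("1")

# Toy 8-dim embeddings (hypothetical values).
u = [0.3, -1.2, 0.7, 0.0, -0.1, 2.4, -0.5, 0.9]
v = [0.1, 0.4, 0.6, -0.2, -0.3, -1.0, -0.8, 1.1]

pu = binarize_and_pack(u)   # bits 1,0,1,0,0,1,0,1
pv = binarize_and_pack(v)   # bits 1,1,1,0,0,0,0,1
dist = hamming(pu, pv)      # positions 2 and 6 differ
print(dist)
```

Packing reduces storage to one bit per dimension, and the distance becomes an XOR followed by a bit count.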

After training, we save a new, small token embedding codebook, which is analogous to vectors of a word embedding.

MTEB Results (l2_supercat)

| Metric | WL64 | WL128 | WL256 (X) | WL512 | WL1024 | GloVe 300d | Komninos | all-MiniLM-L6-v2 |
|---|---|---|---|---|---|---|---|---|
| Clustering | 30.27 | 32.20 | 33.25 | 33.40 | 33.62 | 27.73 | 26.57 | 42.35 |
| Reranking | 50.38 | 51.52 | 52.03 | 52.32 | 52.39 | 43.29 | 44.75 | 58.04 |
| Classification | 53.14 | 56.25 | 58.21 | 59.13 | 59.50 | 57.29 | 57.65 | 63.05 |
| Pair Classification | 75.80 | 77.59 | 78.22 | 78.50 | 78.60 | 70.92 | 72.94 | 82.37 |
| STS | 66.24 | 67.53 | 67.91 | 68.22 | 68.27 | 61.85 | 62.46 | 78.90 |
| CQA DupStack | 18.76 | 22.54 | 24.12 | 24.59 | 24.83 | 15.47 | 16.79 | 41.32 |
| SummEval | 30.79 | 29.99 | 30.99 | 29.56 | 29.39 | 28.87 | 30.49 | 30.81 |
