---
license: apache-2.0
base_model: google/flan-ul2
pipeline_tag: feature-extraction
tags:
- embedding
- text embedding
---

# flan-ul2-text-encoder

This is the encoder model extracted from [flan-ul2](https://huggingface.co/google/flan-ul2) via a new class added [in a recent release of transformers](https://github.com/huggingface/transformers/releases/tag/v4.31.0).

⚠️ This model is 17.44 GB in `bfloat16` precision ⚠️
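
For context, the extraction presumably amounts to something like the following sketch (hypothetical; it downloads the full flan-ul2 checkpoint, so in practice you would just load this repo directly as shown below):

```python
# hypothetical sketch of the extraction step, not needed to use this repo:
# AutoModelForTextEncoding loads only the encoder stack from a seq2seq checkpoint
import torch
from transformers import AutoModelForTextEncoding, AutoTokenizer

encoder = AutoModelForTextEncoding.from_pretrained(
    "google/flan-ul2", torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")

encoder.save_pretrained("flan-ul2-text-encoder")
tokenizer.save_pretrained("flan-ul2-text-encoder")
```
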
## basic usage

```python
from transformers import AutoTokenizer, AutoModelForTextEncoding

tokenizer = AutoTokenizer.from_pretrained("pszemraj/flan-ul2-text-encoder")
model = AutoModelForTextEncoding.from_pretrained("pszemraj/flan-ul2-text-encoder")

inputs = tokenizer("Hello, my dog loves memes", return_tensors="pt")
outputs = model(**inputs)

# token-level representations with shape (batch_size, seq_len, hidden_dim)
last_hidden_states = outputs.last_hidden_state
```
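
Note that the snippet above loads the weights in the default `float32` precision, which roughly doubles the memory footprint quoted above. To load in `bfloat16` instead (as the `load_model_and_tokenizer` helper further down also does), pass `torch_dtype` at load time; a minimal sketch, assuming `accelerate` is installed for `device_map="auto"`:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTextEncoding

tokenizer = AutoTokenizer.from_pretrained("pszemraj/flan-ul2-text-encoder")
model = AutoModelForTextEncoding.from_pretrained(
    "pszemraj/flan-ul2-text-encoder",
    torch_dtype=torch.bfloat16,  # keep the weights in bfloat16 (~17.44 GB)
    device_map="auto",  # requires accelerate; places the weights on the available device(s)
)
```
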
## usage: semantic similarity

> note: this is 'one way' to use the encoder, not 'the only way'; suggestions and ideas are welcome.

Below is an example, along with a set of helper functions, for computing the cosine similarity between the embeddings of different texts with this model.

## Functions

### load_model_and_tokenizer

Loads the model and tokenizer based on `model_name`, returning a tuple containing the loaded model and tokenizer.

<details>
<summary><b>Details</b></summary>

```python
from typing import List, Tuple

import torch
from transformers import AutoModel, AutoModelForTextEncoding, AutoTokenizer


def load_model_and_tokenizer(model_name: str) -> Tuple[AutoModel, AutoTokenizer]:
    """
    Load the model and tokenizer based on the given model name.

    Args:
        model_name (str): The name of the model to be loaded.

    Returns:
        Tuple[AutoModelForTextEncoding, AutoTokenizer]: The loaded model and tokenizer.
    """
    # load the encoder in bfloat16 and let accelerate place it on the available device(s)
    model = AutoModelForTextEncoding.from_pretrained(
        model_name, torch_dtype=torch.bfloat16, device_map="auto"
    ).eval()
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, tokenizer
```

</details>

### get_embeddings

Computes embeddings for the given texts with the provided model and tokenizer, using position-weighted mean pooling across `seq_len` (as in [SGPT](https://github.com/Muennighoff/sgpt#symmetric-semantic-search-be)), so later tokens contribute more to the pooled embedding.

<details>
<summary><b>Details</b></summary>

```python
def get_embeddings(
    model: AutoModel, tokenizer: AutoTokenizer, texts: List[str]
) -> torch.Tensor:
    """
    Compute text embeddings via weighted mean pooling across seq_len.

    Args:
        model (AutoModel): The model to be used for getting embeddings.
        tokenizer (AutoTokenizer): The tokenizer to be used for tokenizing the texts.
        texts (List[str]): The texts for which embeddings are to be calculated.

    Returns:
        torch.Tensor: The calculated embeddings.
    """
    # Tokenize input texts and move them to the model's device
    batch_tokens = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    batch_tokens = batch_tokens.to(model.device)

    # Get the token-level hidden states
    with torch.no_grad():
        last_hidden_state = model(
            **batch_tokens, output_hidden_states=True, return_dict=True
        ).last_hidden_state

    # Position weights 1..seq_len, so later tokens are weighted more heavily (as in SGPT)
    weights = (
        torch.arange(start=1, end=last_hidden_state.shape[1] + 1)
        .unsqueeze(0)
        .unsqueeze(-1)
        .expand(last_hidden_state.size())
        .float()
        .to(last_hidden_state.device)
    )

    # Attention mask expanded to the hidden size, so padding tokens are ignored
    input_mask_expanded = (
        batch_tokens["attention_mask"]
        .unsqueeze(-1)
        .expand(last_hidden_state.size())
        .float()
    )

    # Perform weighted mean pooling across seq_len: bs, seq_len, hidden_dim -> bs, hidden_dim
    sum_embeddings = torch.sum(last_hidden_state * input_mask_expanded * weights, dim=1)
    sum_mask = torch.sum(input_mask_expanded * weights, dim=1)

    embeddings = sum_embeddings / sum_mask

    return embeddings
```

</details>
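
For more than a handful of texts, it can help to embed in mini-batches rather than padding everything into one large batch. `get_embeddings_batched` below is a hypothetical convenience wrapper around `get_embeddings`, shown only as a sketch:

```python
def get_embeddings_batched(
    model: AutoModel,
    tokenizer: AutoTokenizer,
    texts: List[str],
    batch_size: int = 8,
) -> torch.Tensor:
    """Hypothetical helper: embed `texts` in chunks of `batch_size` to limit peak memory."""
    chunks = [texts[i : i + batch_size] for i in range(0, len(texts), batch_size)]
    # each chunk is padded independently; the pooling above normalizes per example, so results are comparable
    return torch.cat([get_embeddings(model, tokenizer, chunk) for chunk in chunks], dim=0)
```
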
### calculate_cosine_similarity

Helper function to compute and print the cosine similarity between the first text and each of the other texts.

<details>
<summary><b>click to expand</b></summary>

```python
from scipy.spatial.distance import cosine


def calculate_cosine_similarity(embeddings: torch.Tensor, texts: List[str]) -> None:
    """Compute and print the cosine similarity between the first text and all others."""
    # scipy works on numpy arrays, so move the embeddings to CPU float32 first
    embeddings = embeddings.float().cpu()

    # Calculate cosine similarities
    for i in range(1, len(embeddings)):
        cosine_sim = 1 - cosine(embeddings[0], embeddings[i])
        print(
            'Cosine similarity between "%s" and "%s" is: %.3f'
            % (texts[0], texts[i], cosine_sim)
        )
```

</details>
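
If you prefer to stay in PyTorch rather than go through scipy/NumPy, the same scores can be computed with `torch.nn.functional.cosine_similarity`; a minimal alternative sketch:

```python
from typing import List

import torch
import torch.nn.functional as F


def calculate_cosine_similarity_torch(embeddings: torch.Tensor, texts: List[str]) -> None:
    """Same idea as the scipy version above, but computed entirely in torch (GPU tensors work directly)."""
    # similarity of the first embedding against every other embedding, via broadcasting
    sims = F.cosine_similarity(embeddings[0].unsqueeze(0), embeddings[1:], dim=-1)
    for text, sim in zip(texts[1:], sims.tolist()):
        print(f'Cosine similarity between "{texts[0]}" and "{text}" is: {sim:.3f}')
```
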
## Usage

Install packages:

```bash
pip install transformers accelerate sentencepiece scipy
```

Then, you can use the functions to compute embeddings and similarity scores:

```python
model_name = "pszemraj/flan-ul2-text-encoder"
model, tokenizer = load_model_and_tokenizer(model_name)

texts = [
    "deep learning",
    "artificial intelligence",
    "deep diving",
    "artificial snow",
]

embeddings = get_embeddings(model, tokenizer, texts)
calculate_cosine_similarity(embeddings, texts)
```

This will print the cosine similarity between the first text and all other texts in the `texts` list.

## References

Inference with this model (and the example above) is based on the ideas and examples in the [SGPT repository](https://github.com/Muennighoff/sgpt#symmetric-semantic-search-be).

```bibtex
@article{muennighoff2022sgpt,
  title={SGPT: GPT Sentence Embeddings for Semantic Search},
  author={Muennighoff, Niklas},
  journal={arXiv preprint arXiv:2202.08904},
  year={2022}
}
```