---
license: apache-2.0
base_model: google/flan-ul2
pipeline_tag: feature-extraction
tags:
- embedding
- text embedding
---

# flan-ul2-text-encoder

This is the encoder model extracted from [flan-ul2](https://huggingface.co/google/flan-ul2) via a new class added [in a recent release of transformers](https://github.com/huggingface/transformers/releases/tag/v4.31.0).

⚠️ This model is 17.44 GB in `bfloat16` precision ⚠️
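
For context, the extraction presumably amounts to something like the following sketch (hypothetical; it downloads the full flan-ul2 checkpoint, so in practice you would just load this repo directly as shown below):

```python
# hypothetical sketch of the extraction step, not needed to use this repo:
# AutoModelForTextEncoding loads only the encoder stack from a seq2seq checkpoint
import torch
from transformers import AutoModelForTextEncoding, AutoTokenizer

encoder = AutoModelForTextEncoding.from_pretrained(
    "google/flan-ul2", torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")

encoder.save_pretrained("flan-ul2-text-encoder")
tokenizer.save_pretrained("flan-ul2-text-encoder")
```
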
## basic usage

```python
from transformers import AutoTokenizer, AutoModelForTextEncoding

tokenizer = AutoTokenizer.from_pretrained("pszemraj/flan-ul2-text-encoder")
model = AutoModelForTextEncoding.from_pretrained("pszemraj/flan-ul2-text-encoder")

inputs = tokenizer("Hello, my dog loves memes", return_tensors="pt")
outputs = model(**inputs)

# token-level representations with shape (batch_size, seq_len, hidden_dim)
last_hidden_states = outputs.last_hidden_state
```
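
Note that the snippet above loads the weights in the default `float32` precision, which roughly doubles the memory footprint quoted above. To load in `bfloat16` instead (as the `load_model_and_tokenizer` helper further down also does), pass `torch_dtype` at load time; a minimal sketch, assuming `accelerate` is installed for `device_map="auto"`:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTextEncoding

tokenizer = AutoTokenizer.from_pretrained("pszemraj/flan-ul2-text-encoder")
model = AutoModelForTextEncoding.from_pretrained(
    "pszemraj/flan-ul2-text-encoder",
    torch_dtype=torch.bfloat16,  # keep the weights in bfloat16 (~17.44 GB)
    device_map="auto",  # requires accelerate; places the weights on the available device(s)
)
```
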
## usage: semantic similarity

> note: this is 'one way' to use the encoder, not 'the only way'; suggestions and ideas are welcome.

Below is an example, along with a set of helper functions, for computing the cosine similarity between the embeddings of different texts with this model.

## Functions

### load_model_and_tokenizer

Loads the model and tokenizer based on `model_name`, returning a tuple containing the loaded model and tokenizer.

<details>
<summary><b>Details</b></summary>

```python
from typing import List, Tuple

import torch
from transformers import AutoModel, AutoModelForTextEncoding, AutoTokenizer


def load_model_and_tokenizer(model_name: str) -> Tuple[AutoModel, AutoTokenizer]:
    """
    Load the model and tokenizer based on the given model name.

    Args:
        model_name (str): The name of the model to be loaded.

    Returns:
        Tuple[AutoModelForTextEncoding, AutoTokenizer]: The loaded model and tokenizer.
    """
    # load the encoder in bfloat16 and let accelerate place it on the available device(s)
    model = AutoModelForTextEncoding.from_pretrained(
        model_name, torch_dtype=torch.bfloat16, device_map="auto"
    ).eval()
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, tokenizer
```

</details>

### get_embeddings

Computes embeddings for the given texts with the provided model and tokenizer, using position-weighted mean pooling across `seq_len` (as in [SGPT](https://github.com/Muennighoff/sgpt#symmetric-semantic-search-be)), so later tokens contribute more to the pooled embedding.

<details>
<summary><b>Details</b></summary>

```python
def get_embeddings(
    model: AutoModel, tokenizer: AutoTokenizer, texts: List[str]
) -> torch.Tensor:
    """
    Compute text embeddings via weighted mean pooling across seq_len.

    Args:
        model (AutoModel): The model to be used for getting embeddings.
        tokenizer (AutoTokenizer): The tokenizer to be used for tokenizing the texts.
        texts (List[str]): The texts for which embeddings are to be calculated.

    Returns:
        torch.Tensor: The calculated embeddings.
    """
    # Tokenize input texts and move them to the model's device
    batch_tokens = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    batch_tokens = batch_tokens.to(model.device)

    # Get the token-level hidden states
    with torch.no_grad():
        last_hidden_state = model(
            **batch_tokens, output_hidden_states=True, return_dict=True
        ).last_hidden_state

    # Position weights 1..seq_len, so later tokens are weighted more heavily (as in SGPT)
    weights = (
        torch.arange(start=1, end=last_hidden_state.shape[1] + 1)
        .unsqueeze(0)
        .unsqueeze(-1)
        .expand(last_hidden_state.size())
        .float()
        .to(last_hidden_state.device)
    )

    # Attention mask expanded to the hidden size, so padding tokens are ignored
    input_mask_expanded = (
        batch_tokens["attention_mask"]
        .unsqueeze(-1)
        .expand(last_hidden_state.size())
        .float()
    )

    # Perform weighted mean pooling across seq_len: bs, seq_len, hidden_dim -> bs, hidden_dim
    sum_embeddings = torch.sum(last_hidden_state * input_mask_expanded * weights, dim=1)
    sum_mask = torch.sum(input_mask_expanded * weights, dim=1)

    embeddings = sum_embeddings / sum_mask

    return embeddings
```

</details>
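
For more than a handful of texts, it can help to embed in mini-batches rather than padding everything into one large batch. `get_embeddings_batched` below is a hypothetical convenience wrapper around `get_embeddings`, shown only as a sketch:

```python
def get_embeddings_batched(
    model: AutoModel,
    tokenizer: AutoTokenizer,
    texts: List[str],
    batch_size: int = 8,
) -> torch.Tensor:
    """Hypothetical helper: embed `texts` in chunks of `batch_size` to limit peak memory."""
    chunks = [texts[i : i + batch_size] for i in range(0, len(texts), batch_size)]
    # each chunk is padded independently; the pooling above normalizes per example, so results are comparable
    return torch.cat([get_embeddings(model, tokenizer, chunk) for chunk in chunks], dim=0)
```
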
### calculate_cosine_similarity

Helper function to compute and print the cosine similarity between the first text and each of the other texts.

<details>
<summary><b>click to expand</b></summary>

```python
from scipy.spatial.distance import cosine


def calculate_cosine_similarity(embeddings: torch.Tensor, texts: List[str]) -> None:
    """Compute and print the cosine similarity between the first text and all others."""
    # scipy works on numpy arrays, so move the embeddings to CPU float32 first
    embeddings = embeddings.float().cpu()

    # Calculate cosine similarities
    for i in range(1, len(embeddings)):
        cosine_sim = 1 - cosine(embeddings[0], embeddings[i])
        print(
            'Cosine similarity between "%s" and "%s" is: %.3f'
            % (texts[0], texts[i], cosine_sim)
        )
```

</details>
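
If you prefer to stay in PyTorch rather than go through scipy/NumPy, the same scores can be computed with `torch.nn.functional.cosine_similarity`; a minimal alternative sketch:

```python
from typing import List

import torch
import torch.nn.functional as F


def calculate_cosine_similarity_torch(embeddings: torch.Tensor, texts: List[str]) -> None:
    """Same idea as the scipy version above, but computed entirely in torch (GPU tensors work directly)."""
    # similarity of the first embedding against every other embedding, via broadcasting
    sims = F.cosine_similarity(embeddings[0].unsqueeze(0), embeddings[1:], dim=-1)
    for text, sim in zip(texts[1:], sims.tolist()):
        print(f'Cosine similarity between "{texts[0]}" and "{text}" is: {sim:.3f}')
```
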
## Usage

Install packages:

```bash
pip install transformers accelerate sentencepiece scipy
```

Then, you can use the functions to compute embeddings and similarity scores:

```python
model_name = "pszemraj/flan-ul2-text-encoder"
model, tokenizer = load_model_and_tokenizer(model_name)

texts = [
    "deep learning",
    "artificial intelligence",
    "deep diving",
    "artificial snow",
]

embeddings = get_embeddings(model, tokenizer, texts)
calculate_cosine_similarity(embeddings, texts)
```

This will print the cosine similarity between the first text and all other texts in the `texts` list.

## References

Inference with this model (and the example above) is based on the ideas and examples in the [SGPT repository](https://github.com/Muennighoff/sgpt#symmetric-semantic-search-be).

```bibtex
@article{muennighoff2022sgpt,
  title={SGPT: GPT Sentence Embeddings for Semantic Search},
  author={Muennighoff, Niklas},
  journal={arXiv preprint arXiv:2202.08904},
  year={2022}
}
```