---
license: mit
language:
- en
tags:
- sentence-embedding
- sentence-similarity
- transformers
- feature-extraction
pipeline_tag: sentence-similarity
---

# Phi-2-Text-Embedding-cft

## Description

This is a fine-tuned version of [Phi-2](https://huggingface.co/microsoft/phi-2) for text embedding tasks. The model is fine-tuned with contrastive fine-tuning (cft) and LoRA on NLI datasets. The paper can be found [here](https://arxiv.org/abs/2408.00690).

## Base Model

[Phi-2](https://huggingface.co/microsoft/phi-2)

## Usage

1. Clone the Phi-2 repository

```bash
git clone https://huggingface.co/microsoft/phi-2
```

2. In the cloned repository's `tokenizer_config.json`, set `add_eos_token` to `true` so that an EOS token is appended to every input (the sentence embedding is taken from the last token's hidden state); you can edit the file by hand or programmatically, as sketched after the snippet below

```json
"add_eos_token": true
```
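
If you prefer to apply this change programmatically rather than editing the file by hand, a minimal sketch (assuming the repository was cloned to `./phi-2`; adjust the path to your clone) is:

```python
import json

# Hypothetical path to the cloned Phi-2 repository
config_path = "./phi-2/tokenizer_config.json"

with open(config_path) as f:
    config = json.load(f)

# Make the tokenizer append an EOS token to every input
config["add_eos_token"] = True

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```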

3. Use the model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import numpy as np

class PhiSentenceEmbedding:
    def __init__(self, model_path='microsoft/phi-2', adapter_path=None):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForCausalLM.from_pretrained(model_path,
                                                          torch_dtype=torch.bfloat16,
                                                          device_map='cuda',
                                                          trust_remote_code=True)
        if adapter_path is not None:
            # Load the fine-tuned LoRA adapter
            self.model.load_adapter(adapter_path)

    def get_last_hidden_state(self, text):
        inputs = self.tokenizer(text, return_tensors="pt").to('cuda')
        with torch.no_grad():
            # Use the final-layer hidden state of the last token (EOS) as the embedding
            out = self.model(**inputs, output_hidden_states=True).hidden_states[-1][0, -1, :]
        return out.squeeze().float().cpu().numpy()

    def encode(self, sentences: list[str], **kwargs) -> list[np.ndarray]:
        """
        Returns a list of embeddings for the given sentences.

        Args:
            sentences: List of sentences to encode

        Returns:
            List of embeddings for the given sentences
        """
        out = []

        for s in sentences:
            out.append(self.get_last_hidden_state(s))

        return out

phi_sentence_embedding = PhiSentenceEmbedding('<your-cloned-base-model-path>', 'trapoom555/Phi-2-Text-Embedding-cft')

example_sentences = ["I don't like apples", "I like apples"]

encoded_sentences = phi_sentence_embedding.encode(example_sentences)

print(encoded_sentences)
```
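
Since the model targets sentence similarity, a typical next step is to compare the embeddings with cosine similarity. A minimal sketch, reusing `encoded_sentences` and `np` from the example above:

```python
# Cosine similarity between the two example embeddings
a, b = encoded_sentences
cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine_similarity)
```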

## Training Details

| **Training Details**    | **Value**         |
|-------------------------|-------------------|
| Loss                    | InfoNCE           |
| Batch Size              | 60                |
| InfoNCE Temperature     | 0.05              |
| Learning Rate           | 5e-05             |
| Warmup Steps            | 100               |
| Learning Rate Scheduler | CosineAnnealingLR |
| LoRA Rank               | 8                 |
| LoRA Alpha              | 32                |
| LoRA Dropout            | 0.1               |
| Training Precision      | bf16              |
| Max Epoch               | 1                 |
| GPU                     | RTX3090           |
| Num GPUs                | 4                 |
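
For reference, the InfoNCE objective listed above can be written as a minimal sketch with in-batch negatives and the 0.05 temperature; this is an illustration of the loss family, not the exact training code (see the training scripts below for that):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchors: torch.Tensor, positives: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE with in-batch negatives; anchors and positives have shape (batch, dim)."""
    anchors = F.normalize(anchors, dim=-1)
    positives = F.normalize(positives, dim=-1)
    # Entry (i, j) holds the scaled cosine similarity between anchor i and positive j
    logits = anchors @ positives.T / temperature
    # The matching positive for anchor i sits on the diagonal
    labels = torch.arange(anchors.size(0), device=anchors.device)
    return F.cross_entropy(logits, labels)
```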

## Training Scripts

The training script for this model is available in this [GitHub repository](https://github.com/trapoom555/Language-Model-STS-CFT/tree/main).

## Checkpoints

We provide checkpoints every 500 training steps, which can be found [here](https://huggingface.co/trapoom555/Phi-2-Text-Embedding-cft-checkpoints).

## Evaluation Results

| **Benchmarks** | **Before cft** | **After cft** |
|----------------|----------------|---------------|
| STS12          | 23.04          | 61.62         |
| STS13          | 20.79          | 71.87         |
| STS14          | 17.06          | 60.46         |
| STS15          | 24.56          | 71.18         |
| STS16          | 48.68          | 74.77         |
| STS17          | 41.43          | 80.20         |
| STSBenchmark   | 37.87          | 79.46         |
| BIOSSES        | 28.04          | 64.06         |
| SICK-R         | 48.40          | 74.37         |
| **Overall**    | **32.21**      | **70.89**     |

## Contributors

Trapoom Ukarapol, Zhicheng Lee, Amy Xin

## Footnotes

This work is the final project of the Natural Language Processing Spring 2024 course at Tsinghua University 🟣. We would like to express our sincere gratitude to this course!