---
license: mit
language:
- en
tags:
- sentence-embedding
- sentence-similarity
- transformers
- feature-extraction
pipeline_tag: sentence-similarity
---

# Phi-2-Text-Embedding-cft

## Description

This is a fine-tuned version of [Phi-2](https://huggingface.co/microsoft/phi-2) for text embedding tasks. The model is fine-tuned with contrastive fine-tuning (cft) and LoRA on NLI datasets. The paper can be found [here](https://arxiv.org/abs/2408.00690).

## Base Model

[Phi-2](https://huggingface.co/microsoft/phi-2)

## Usage

1. Clone the Phi-2 repository

```bash
git clone https://huggingface.co/microsoft/phi-2
```

2. In the cloned repository's `tokenizer_config.json`, set `add_eos_token` to `true` so that an EOS token is appended to every input (the sentence embedding is taken from the last token's hidden state); you can edit the file by hand or programmatically, as sketched after the snippet below

```json
"add_eos_token": true
```
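
If you prefer to apply this change programmatically rather than editing the file by hand, a minimal sketch (assuming the repository was cloned to `./phi-2`; adjust the path to your clone) is:

```python
import json

# Hypothetical path to the cloned Phi-2 repository
config_path = "./phi-2/tokenizer_config.json"

with open(config_path) as f:
    config = json.load(f)

# Make the tokenizer append an EOS token to every input
config["add_eos_token"] = True

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```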

3. Use the model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import numpy as np

class PhiSentenceEmbedding:
    def __init__(self, model_path='microsoft/phi-2', adapter_path=None):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForCausalLM.from_pretrained(model_path,
                                                          torch_dtype=torch.bfloat16,
                                                          device_map='cuda',
                                                          trust_remote_code=True)
        if adapter_path is not None:
            # Load the fine-tuned LoRA adapter
            self.model.load_adapter(adapter_path)

    def get_last_hidden_state(self, text):
        inputs = self.tokenizer(text, return_tensors="pt").to('cuda')
        with torch.no_grad():
            # Use the final-layer hidden state of the last token (EOS) as the embedding
            out = self.model(**inputs, output_hidden_states=True).hidden_states[-1][0, -1, :]
        return out.squeeze().float().cpu().numpy()

    def encode(self, sentences: list[str], **kwargs) -> list[np.ndarray]:
        """
        Returns a list of embeddings for the given sentences.

        Args:
            sentences: List of sentences to encode

        Returns:
            List of embeddings for the given sentences
        """
        out = []

        for s in sentences:
            out.append(self.get_last_hidden_state(s))

        return out

phi_sentence_embedding = PhiSentenceEmbedding('<your-cloned-base-model-path>', 'trapoom555/Phi-2-Text-Embedding-cft')

example_sentences = ["I don't like apples", "I like apples"]

encoded_sentences = phi_sentence_embedding.encode(example_sentences)

print(encoded_sentences)
```
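
Since the model targets sentence similarity, a typical next step is to compare the embeddings with cosine similarity. A minimal sketch, reusing `encoded_sentences` and `np` from the example above:

```python
# Cosine similarity between the two example embeddings
a, b = encoded_sentences
cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine_similarity)
```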

## Training Details

| **Training Details**    | **Value**         |
|-------------------------|-------------------|
| Loss                    | InfoNCE           |
| Batch Size              | 60                |
| InfoNCE Temperature     | 0.05              |
| Learning Rate           | 5e-05             |
| Warmup Steps            | 100               |
| Learning Rate Scheduler | CosineAnnealingLR |
| LoRA Rank               | 8                 |
| LoRA Alpha              | 32                |
| LoRA Dropout            | 0.1               |
| Training Precision      | bf16              |
| Max Epoch               | 1                 |
| GPU                     | RTX3090           |
| Num GPUs                | 4                 |
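
For reference, the InfoNCE objective listed above can be written as a minimal sketch with in-batch negatives and the 0.05 temperature; this is an illustration of the loss family, not the exact training code (see the training scripts below for that):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchors: torch.Tensor, positives: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE with in-batch negatives; anchors and positives have shape (batch, dim)."""
    anchors = F.normalize(anchors, dim=-1)
    positives = F.normalize(positives, dim=-1)
    # Entry (i, j) holds the scaled cosine similarity between anchor i and positive j
    logits = anchors @ positives.T / temperature
    # The matching positive for anchor i sits on the diagonal
    labels = torch.arange(anchors.size(0), device=anchors.device)
    return F.cross_entropy(logits, labels)
```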

## Training Scripts

The training script for this model is available in this [GitHub repository](https://github.com/trapoom555/Language-Model-STS-CFT/tree/main).

## Checkpoints

We provide checkpoints every 500 training steps, which can be found [here](https://huggingface.co/trapoom555/Phi-2-Text-Embedding-cft-checkpoints).

## Evaluation Results

| **Benchmarks** | **Before cft** | **After cft** |
|----------------|----------------|---------------|
| STS12          | 23.04          | 61.62         |
| STS13          | 20.79          | 71.87         |
| STS14          | 17.06          | 60.46         |
| STS15          | 24.56          | 71.18         |
| STS16          | 48.68          | 74.77         |
| STS17          | 41.43          | 80.20         |
| STSBenchmark   | 37.87          | 79.46         |
| BIOSSES        | 28.04          | 64.06         |
| SICK-R         | 48.40          | 74.37         |
| **Overall**    | **32.21**      | **70.89**     |

## Contributors

Trapoom Ukarapol, Zhicheng Lee, Amy Xin

## Footnotes

This work is the final project of the Natural Language Processing Spring 2024 course at Tsinghua University 🟣. We would like to express our sincere gratitude to this course!