---
license: mit
language:
- en
---
# BGE-base-en-v1.5-rag-int8-static
A quantized version of [BAAI/BGE-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5), produced with [Intel® Neural Compressor](https://github.com/huggingface/optimum-intel) and compatible with [Optimum-Intel](https://github.com/huggingface/optimum-intel).
The model can be used through the [Optimum-Intel](https://github.com/huggingface/optimum-intel) API as a standalone model, or as an embedder or ranker module in a [fastRAG](https://github.com/IntelLabs/fastRAG) RAG pipeline.
## Technical details
Quantized using post-training static quantization.
| | |
|---|:---:|
| Calibration set | [qasper](https://huggingface.co/datasets/allenai/qasper) (80 random samples) |
| Quantization tool | [Optimum-Intel](https://github.com/huggingface/optimum-intel) |
| Backend | `IPEX` |
| Original model | [BAAI/BGE-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) |
Instructions for reproducing the quantized model can be found [here](https://github.com/IntelLabs/fastRAG/tree/main/scripts/optimizations/embedders).
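For readers unfamiliar with post-training static quantization, the idea can be sketched in plain Python: value ranges are observed on a calibration set, a fixed scale and zero point are derived from them, and values are then mapped to 8-bit integers. This is only a generic illustration of affine INT8 quantization. It is not the Intel® Neural Compressor implementation, and the function names `calibrate`, `quantize`, and `dequantize` are hypothetical.

```python
def calibrate(samples, num_bits=8):
    """Derive a (scale, zero_point) pair from calibration values.

    Hypothetical helper for illustration; real toolkits use more
    elaborate range estimators and operate per tensor or per channel.
    """
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    lo = min(min(samples), 0.0)  # the range must contain 0 so it maps exactly
    hi = max(max(samples), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, num_bits=8):
    """Map a float to a signed integer, clamping to the representable range."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map a quantized integer back to an approximate float."""
    return (q - zero_point) * scale


# "Static" means scale and zero_point are fixed once, after calibration,
# and reused unchanged at inference time.
calibration_values = [-1.0, 0.0, 0.5, 3.0]
scale, zero_point = calibrate(calibration_values)
restored = [dequantize(quantize(v, scale, zero_point), scale, zero_point)
            for v in calibration_values]
```

Each restored value differs from its original by at most one quantization step (`scale`), which is the source of the small accuracy changes a quantized model can show.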
## Evaluation - MTEB
Model performance on the [Massive Text Embedding Benchmark (MTEB)](https://huggingface.co/spaces/mteb/leaderboard) *retrieval* and *reranking* tasks.
| | `INT8` | `FP32` | % diff |
|---|:---:|:---:|:---:|
| Reranking | 0.5886 | 0.5886 | 0.0% |
| Retrieval | 0.5242 | 0.5325 | -1.55% |
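The `% diff` column is presumably the relative change of the `INT8` score with respect to the `FP32` score. Under that assumption it can be recomputed as in the sketch below; `relative_diff_percent` is a hypothetical helper name. Because the scores in the table are rounded to four decimals, a recomputed value may differ from the reported one in the last digit.

```python
def relative_diff_percent(quantized, baseline):
    """Relative change of `quantized` with respect to `baseline`, in percent."""
    return (quantized - baseline) / baseline * 100.0


reranking_diff = relative_diff_percent(0.5886, 0.5886)
retrieval_diff = relative_diff_percent(0.5242, 0.5325)
print(f"Reranking: {reranking_diff:.2f}%")
print(f"Retrieval: {retrieval_diff:.2f}%")
```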
## Usage
### Using with Optimum-Intel
See the [Optimum-Intel](https://github.com/huggingface/optimum-intel) installation page for installation instructions, or run:
``` sh
pip install -U "optimum[neural-compressor,ipex]" intel-extension-for-transformers
```
Loading a model:
``` python
from optimum.intel import IPEXModel
model = IPEXModel.from_pretrained("Intel/bge-base-en-v1.5-rag-int8-static")
```
Running inference:
``` python
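import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Intel/bge-base-en-v1.5-rag-int8-static")
# example input; replace with your own list of strings
sentences = ["This is an example sentence."]
# padding and truncation allow batching sentences of different lengths
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    # `model` is the IPEXModel loaded in the previous snippet
    outputs = model(**inputs)
    # get the vector of [CLS]
    embedded = outputs[0][:, 0]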
```
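BGE embeddings are typically L2-normalized before being compared, so that a dot product equals cosine similarity. The sketch below shows that step on plain Python lists, which could be obtained from the tensor above with `embedded.tolist()`. The helper names `l2_normalize` and `cosine_similarity` are hypothetical, and the standard library is used only to keep the example dependency-free.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity computed as the dot product of normalized vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


# Toy 2-dimensional vectors standing in for real 768-dimensional embeddings.
same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
```

With PyTorch tensors the same normalization can be done with `torch.nn.functional.normalize(embedded, p=2, dim=1)`.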
### Using with a fastRAG RAG pipeline
Get started by installing [fastRAG](https://github.com/IntelLabs/fastRAG) as instructed [here](https://github.com/IntelLabs/fastRAG).
Below is an example of loading the model into a ranker node that embeds and re-ranks all the documents it receives as input in a pipeline.
``` python
from fastrag.rankers import QuantizedBiEncoderRanker
ranker = QuantizedBiEncoderRanker("Intel/bge-base-en-v1.5-rag-int8-static")
```
Then plug the ranker into a pipeline:
``` python
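from haystack import Pipeline

# `retriever` is assumed to be a retriever component you have already created
p = Pipeline()
p.add_node(component=retriever, name="retriever", inputs=["Query"])
# the ranker re-orders the documents returned by the retriever
p.add_node(component=ranker, name="ranker", inputs=["retriever"])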
```
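Conceptually, a bi-encoder ranker scores each retrieved document by the similarity between its embedding and the query embedding, then sorts the documents by that score. The sketch below illustrates the idea on precomputed, already normalized vectors. It is not the `QuantizedBiEncoderRanker` implementation, whose internals may differ, and `rank_documents` is a hypothetical function.

```python
def rank_documents(query_vec, doc_vecs, top_k=None):
    """Return (document index, score) pairs sorted by descending dot product.

    Assumes all vectors are already L2-normalized, so the dot product
    equals cosine similarity.
    """
    scores = [sum(q * d for q, d in zip(query_vec, doc)) for doc in doc_vecs]
    order = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    return [(i, scores[i]) for i in order]


# Toy unit vectors standing in for real query and document embeddings.
query = [1.0, 0.0]
documents = [[0.0, 1.0], [1.0, 0.0], [0.6, 0.8]]
ranking = rank_documents(query, documents)
best = rank_documents(query, documents, top_k=1)
```

Here the second document points in the same direction as the query and is ranked first, while the first document is orthogonal to the query and is ranked last.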
See a more complete example notebook [here](https://github.com/IntelLabs/fastRAG/blob/main/examples/optimized-embeddings.ipynb).