ldwang committed on
Commit
f57f860
·
1 Parent(s): bc309d0
Files changed (1)
  1. README.md +34 -16
README.md CHANGED
@@ -14,6 +14,7 @@ license: mit
14
  <a href="#evaluation">Evaluation</a> |
15
  <a href="#train">Train</a> |
16
  <a href="#contact">Contact</a> |
 
17
  <a href="#license">License</a>
18
  <p>
19
  </h4>
@@ -27,6 +28,7 @@ FlagEmbedding can map any text to a low-dimensional dense vector which can be us
27
  And it also can be used in vector databases for LLMs.
28
 
29
  ************* 🌟**Updates**🌟 *************
 
30
  - 09/12/2023: New Release:
31
  - **New reranker model**: release cross-encoder models `BAAI/bge-reranker-base` and `BAAI/bge-reranker-large`, which are more powerful than embedding model. We recommend to use/fine-tune them to re-rank top-k documents returned by embedding models.
32
  - **update embedding model**: release `bge-*-v1.5` embedding model to alleviate the issue of the similarity distribution, and enhance its retrieval ability without instruction.
@@ -61,10 +63,9 @@ And it also can be used in vector databases for LLMs.
61
 
62
  \*: If you need to search the relevant passages to a query, we suggest to add the instruction to the query; in other cases, no instruction is needed, just use the original query directly. In all cases, **no instruction** needs to be added to passages.
63
 
64
- \**: Different embedding model, reranker is a cross-encoder, which cannot be used to generate embedding. To balance the accuracy and time cost, cross-encoder is widely used to re-rank top-k documents retrieved by other simple models.
65
  For examples, use bge embedding model to retrieve top 100 relevant documents, and then use bge reranker to re-rank the top 100 document to get the final top-3 results.
66
 
67
-
68
  ## Frequently asked questions
69
 
70
  <details>
@@ -127,7 +128,9 @@ If it doesn't work for you, you can see [FlagEmbedding](https://github.com/FlagO
127
  from FlagEmbedding import FlagModel
128
  sentences_1 = ["样例数据-1", "样例数据-2"]
129
  sentences_2 = ["样例数据-3", "样例数据-4"]
130
- model = FlagModel('BAAI/bge-large-zh', query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:")
 
 
131
  embeddings_1 = model.encode(sentences_1)
132
  embeddings_2 = model.encode(sentences_2)
133
  similarity = embeddings_1 @ embeddings_2.T
@@ -158,7 +161,7 @@ pip install -U sentence-transformers
158
  from sentence_transformers import SentenceTransformer
159
  sentences_1 = ["样例数据-1", "样例数据-2"]
160
  sentences_2 = ["样例数据-3", "样例数据-4"]
161
- model = SentenceTransformer('BAAI/bge-large-zh')
162
  embeddings_1 = model.encode(sentences_1, normalize_embeddings=True)
163
  embeddings_2 = model.encode(sentences_2, normalize_embeddings=True)
164
  similarity = embeddings_1 @ embeddings_2.T
@@ -173,7 +176,7 @@ queries = ['query_1', 'query_2']
173
  passages = ["样例文档-1", "样例文档-2"]
174
  instruction = "为这个句子生成表示以用于检索相关文章:"
175
 
176
- model = SentenceTransformer('BAAI/bge-large-zh')
177
  q_embeddings = model.encode([instruction+q for q in queries], normalize_embeddings=True)
178
  p_embeddings = model.encode(passages, normalize_embeddings=True)
179
  scores = q_embeddings @ p_embeddings.T
@@ -184,7 +187,7 @@ scores = q_embeddings @ p_embeddings.T
184
  You can use `bge` in langchain like this:
185
  ```python
186
  from langchain.embeddings import HuggingFaceBgeEmbeddings
187
- model_name = "BAAI/bge-small-en"
188
  model_kwargs = {'device': 'cuda'}
189
  encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity
190
  model = HuggingFaceBgeEmbeddings(
@@ -208,8 +211,8 @@ import torch
208
  sentences = ["样例数据-1", "样例数据-2"]
209
 
210
  # Load model from HuggingFace Hub
211
- tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh')
212
- model = AutoModel.from_pretrained('BAAI/bge-large-zh')
213
  model.eval()
214
 
215
  # Tokenize sentences
@@ -229,6 +232,7 @@ print("Sentence embeddings:", sentence_embeddings)
229
 
230
  ### Usage for Reranker
231
 
 
232
  You can get a relevance score by inputting query and passage to the reranker.
233
  The reranker is optimized based cross-entropy loss, so the relevance score is not bounded to a specific range.
234
 
@@ -238,10 +242,10 @@ The reranker is optimized based cross-entropy loss, so the relevance score is no
238
  pip install -U FlagEmbedding
239
  ```
240
 
241
- Get relevance score:
242
  ```python
243
  from FlagEmbedding import FlagReranker
244
- reranker = FlagReranker('BAAI/bge-reranker-base', use_fp16=True) #use fp16 can speed up computing
245
 
246
  score = reranker.compute_score(['query', 'passage'])
247
  print(score)
@@ -255,10 +259,10 @@ print(scores)
255
 
256
  ```python
257
  import torch
258
- from transformers import AutoModelForSequenceClassification, AutoTokenizer, BatchEncoding, PreTrainedTokenizerFast
259
 
260
- tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-base')
261
- model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-base')
262
  model.eval()
263
 
264
  pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]
@@ -324,7 +328,7 @@ Please refer to [C_MTEB](https://github.com/FlagOpen/FlagEmbedding/blob/master/C
324
  - **Reranking**:
325
  See [C_MTEB](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB/) for evaluation script.
326
 
327
- | Model | T2Reranking | T2RerankingZh2En\* | T2RerankingEn2Zh\* | MmarcoReranking | CMedQAv1 | CMedQAv2 | Avg |
328
  |:-------------------------------|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
329
  | text2vec-base-multilingual | 64.66 | 62.94 | 62.51 | 14.37 | 48.46 | 48.6 | 50.26 |
330
  | multilingual-e5-small | 65.62 | 60.94 | 56.41 | 29.91 | 67.26 | 66.54 | 57.78 |
@@ -337,13 +341,13 @@ See [C_MTEB](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB/) for
337
  | [BAAI/bge-reranker-base](https://huggingface.co/BAAI/bge-reranker-base) | 67.28 | 63.95 | 60.45 | 35.46 | 81.26 | 84.1 | 65.42 |
338
  | [BAAI/bge-reranker-large](https://huggingface.co/BAAI/bge-reranker-large) | 67.6 | 64.03 | 61.44 | 37.16 | 82.15 | 84.18 | 66.09 |
339
 
340
- \* : T2RerankingZh2En and T2RerankingEn2Zh are cross-language retrieval task
341
 
342
  ## Train
343
 
344
  ### BAAI Embedding
345
 
346
- We pre-train the models using retromae and train them on large-scale pairs data using contrastive learning.
347
  **You can fine-tune the embedding model on your data following our [examples](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune).**
348
  We also provide a [pre-train example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain).
349
  Note that the goal of pre-training is to reconstruct the text, and the pre-trained model cannot be used for similarity calculation directly, it needs to be fine-tuned.
@@ -366,6 +370,20 @@ If you have any question or suggestion related to this project, feel free to ope
366
  You also can email Shitao Xiao([email protected]) and Zheng Liu([email protected]).
367
 
368
 
369
  ## License
370
  FlagEmbedding is licensed under the [MIT License](https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE). The released models can be used for commercial purposes free of charge.
371
 
 
14
  <a href="#evaluation">Evaluation</a> |
15
  <a href="#train">Train</a> |
16
  <a href="#contact">Contact</a> |
17
+ <a href="#citation">Citation</a> |
18
  <a href="#license">License</a>
19
  <p>
20
  </h4>
 
28
  And it also can be used in vector databases for LLMs.
29
 
30
  ************* 🌟**Updates**🌟 *************
31
+ - 09/15/2023: Release [paper](https://arxiv.org/pdf/2309.07597.pdf) and [dataset](https://data.baai.ac.cn/details/BAAI-MTP).
32
  - 09/12/2023: New Release:
33
  - **New reranker model**: release cross-encoder models `BAAI/bge-reranker-base` and `BAAI/bge-reranker-large`, which are more powerful than embedding model. We recommend to use/fine-tune them to re-rank top-k documents returned by embedding models.
34
  - **update embedding model**: release `bge-*-v1.5` embedding model to alleviate the issue of the similarity distribution, and enhance its retrieval ability without instruction.
 
63
 
64
  \*: If you need to search the relevant passages to a query, we suggest to add the instruction to the query; in other cases, no instruction is needed, just use the original query directly. In all cases, **no instruction** needs to be added to passages.
65
 
66
+ \**: Unlike an embedding model, the reranker takes a question and a document as input and directly outputs a similarity score instead of an embedding. To balance accuracy and time cost, a cross-encoder is widely used to re-rank the top-k documents retrieved by simpler models.
67
  For examples, use bge embedding model to retrieve top 100 relevant documents, and then use bge reranker to re-rank the top 100 document to get the final top-3 results.
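The two-stage flow described above (retrieve many candidates cheaply, then re-rank a short list) can be sketched in plain Python. This is only an illustration under stated assumptions: `retrieve_then_rerank` and `toy_rerank_score` are hypothetical names, the 2-d vectors stand in for real bge embeddings, and the word-overlap scorer stands in for a bge reranker.

```python
from typing import Callable, List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product; equals cosine similarity when both vectors are unit-length."""
    return sum(a * b for a, b in zip(u, v))


def retrieve_then_rerank(
    query: str,
    query_vec: Sequence[float],
    passages: List[str],
    passage_vecs: List[Sequence[float]],
    rerank_score: Callable[[str, str], float],
    top_k: int = 100,
    final_n: int = 3,
) -> List[Tuple[str, float]]:
    # Stage 1: cheap retrieval by embedding similarity, keeping the top_k candidates.
    order = sorted(
        range(len(passages)),
        key=lambda i: dot(query_vec, passage_vecs[i]),
        reverse=True,
    )
    candidates = [passages[i] for i in order[:top_k]]
    # Stage 2: score each (query, candidate) pair with the more expensive reranker.
    scored = [(p, rerank_score(query, p)) for p in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:final_n]


def toy_rerank_score(query: str, passage: str) -> float:
    # Hypothetical stand-in for a cross-encoder: counts words shared by query and passage.
    return float(len(set(query.lower().split()) & set(passage.lower().split())))


# Toy data: unit-length 2-d vectors in place of real embeddings.
passages = ["the giant panda is a bear", "hi", "pandas eat bamboo", "a bear species in china"]
passage_vecs = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6], [0.6, 0.8]]

results = retrieve_then_rerank(
    "what is a panda bear", [1.0, 0.0], passages, passage_vecs,
    toy_rerank_score, top_k=3, final_n=2,
)
print(results)
```

With real models, the first stage would use embeddings from a bge embedding model and `rerank_score` would wrap a bge reranker call; the control flow stays the same.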
68
 
 
69
  ## Frequently asked questions
70
 
71
  <details>
 
128
  from FlagEmbedding import FlagModel
129
  sentences_1 = ["样例数据-1", "样例数据-2"]
130
  sentences_2 = ["样例数据-3", "样例数据-4"]
131
+ model = FlagModel('BAAI/bge-large-zh-v1.5',
132
+ query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:",
133
+ use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation
134
  embeddings_1 = model.encode(sentences_1)
135
  embeddings_2 = model.encode(sentences_2)
136
  similarity = embeddings_1 @ embeddings_2.T
 
161
  from sentence_transformers import SentenceTransformer
162
  sentences_1 = ["样例数据-1", "样例数据-2"]
163
  sentences_2 = ["样例数据-3", "样例数据-4"]
164
+ model = SentenceTransformer('BAAI/bge-large-zh-v1.5')
165
  embeddings_1 = model.encode(sentences_1, normalize_embeddings=True)
166
  embeddings_2 = model.encode(sentences_2, normalize_embeddings=True)
167
  similarity = embeddings_1 @ embeddings_2.T
 
176
  passages = ["样例文档-1", "样例文档-2"]
177
  instruction = "为这个句子生成表示以用于检索相关文章:"
178
 
179
+ model = SentenceTransformer('BAAI/bge-large-zh-v1.5')
180
  q_embeddings = model.encode([instruction+q for q in queries], normalize_embeddings=True)
181
  p_embeddings = model.encode(passages, normalize_embeddings=True)
182
  scores = q_embeddings @ p_embeddings.T
 
187
  You can use `bge` in langchain like this:
188
  ```python
189
  from langchain.embeddings import HuggingFaceBgeEmbeddings
190
+ model_name = "BAAI/bge-large-en-v1.5"
191
  model_kwargs = {'device': 'cuda'}
192
  encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity
193
  model = HuggingFaceBgeEmbeddings(
 
211
  sentences = ["样例数据-1", "样例数据-2"]
212
 
213
  # Load model from HuggingFace Hub
214
+ tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh-v1.5')
215
+ model = AutoModel.from_pretrained('BAAI/bge-large-zh-v1.5')
216
  model.eval()
217
 
218
  # Tokenize sentences
 
232
 
233
  ### Usage for Reranker
234
 
235
+ Unlike an embedding model, the reranker takes a question and a document as input and directly outputs a similarity score instead of an embedding.
236
  You can get a relevance score by inputting query and passage to the reranker.
237
  The reranker is optimized based cross-entropy loss, so the relevance score is not bounded to a specific range.
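Because the score is unbounded, one common option when a value in (0, 1) is more convenient is to pass the raw score through a sigmoid. This is an assumption for illustration, not something the README prescribes; the `sigmoid` helper and the sample scores below are made up. The mapping is monotonic, so the ranking order does not change.

```python
import math


def sigmoid(score: float) -> float:
    """Map an unbounded relevance score to the open interval (0, 1)."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)


# Made-up raw scores, not real model outputs.
raw_scores = [-5.6, 0.0, 5.3]
probs = [sigmoid(s) for s in raw_scores]
print(probs)
```

If only the ranking matters, the raw scores can be used directly and this step is unnecessary.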
238
 
 
242
  pip install -U FlagEmbedding
243
  ```
244
 
245
+ Get relevance scores (higher scores indicate more relevance):
246
  ```python
247
  from FlagEmbedding import FlagReranker
248
+ reranker = FlagReranker('BAAI/bge-reranker-large', use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation
249
 
250
  score = reranker.compute_score(['query', 'passage'])
251
  print(score)
 
259
 
260
  ```python
261
  import torch
262
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
263
 
264
+ tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-large')
265
+ model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-large')
266
  model.eval()
267
 
268
  pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]
 
328
  - **Reranking**:
329
  See [C_MTEB](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB/) for evaluation script.
330
 
331
+ | Model | T2Reranking | T2RerankingZh2En\* | T2RerankingEn2Zh\* | MMarcoReranking | CMedQAv1 | CMedQAv2 | Avg |
332
  |:-------------------------------|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
333
  | text2vec-base-multilingual | 64.66 | 62.94 | 62.51 | 14.37 | 48.46 | 48.6 | 50.26 |
334
  | multilingual-e5-small | 65.62 | 60.94 | 56.41 | 29.91 | 67.26 | 66.54 | 57.78 |
 
341
  | [BAAI/bge-reranker-base](https://huggingface.co/BAAI/bge-reranker-base) | 67.28 | 63.95 | 60.45 | 35.46 | 81.26 | 84.1 | 65.42 |
342
  | [BAAI/bge-reranker-large](https://huggingface.co/BAAI/bge-reranker-large) | 67.6 | 64.03 | 61.44 | 37.16 | 82.15 | 84.18 | 66.09 |
343
 
344
+ \* : T2RerankingZh2En and T2RerankingEn2Zh are cross-language retrieval tasks
345
 
346
  ## Train
347
 
348
  ### BAAI Embedding
349
 
350
+ We pre-train the models using [RetroMAE](https://github.com/staoxiao/RetroMAE) and train them on large-scale paired data using contrastive learning.
351
  **You can fine-tune the embedding model on your data following our [examples](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune).**
352
  We also provide a [pre-train example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain).
353
  Note that the goal of pre-training is to reconstruct the text, and the pre-trained model cannot be used for similarity calculation directly, it needs to be fine-tuned.
 
370
  You also can email Shitao Xiao([email protected]) and Zheng Liu([email protected]).
371
 
372
 
373
+ ## Citation
374
+
375
+ If you find our work helpful, please cite us:
376
+ ```
377
+ @misc{bge_embedding,
378
+ title={C-Pack: Packaged Resources To Advance General Chinese Embedding},
379
+ author={Shitao Xiao and Zheng Liu and Peitian Zhang and Niklas Muennighoff},
380
+ year={2023},
381
+ eprint={2309.07597},
382
+ archivePrefix={arXiv},
383
+ primaryClass={cs.CL}
384
+ }
385
+ ```
386
+
387
  ## License
388
  FlagEmbedding is licensed under the [MIT License](https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE). The released models can be used for commercial purposes free of charge.
389