Shitao committed · Commit f11a3f1 · verified · 1 Parent(s): 8017dbb

Update README.md

Files changed (1)
  1. README.md +11 -5
README.md CHANGED
@@ -9,7 +9,8 @@ license: mit
9
 
10
  For more details please refer to our github repo: https://github.com/FlagOpen/FlagEmbedding
11
 
12
- # BGE-M3
 
13
  In this project, we introduce BGE-M3, which is distinguished for its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity.
14
  - Multi-Functionality: It can simultaneously perform the three common retrieval functionalities of an embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval.
15
  - Multi-Linguality: It can support more than 100 working languages.
@@ -26,12 +27,14 @@ Utilizing the re-ranking model (e.g., [bge-reranker](https://github.com/FlagOpen
26
 
27
 
28
  ## News:
 
29
  - 2/1/2024: **Thanks for the excellent tool from Vespa.** You can easily use multiple modes of BGE-M3 following this [notebook](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb)
30
 
31
 
32
  ## Specs
33
 
34
  - Model
 
35
  | Model Name | Dimension | Sequence Length | Introduction |
36
  |:----:|:---:|:---:|:---:|
37
  | [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) | 1024 | 8192 | multilingual; unified fine-tuning (dense, sparse, and colbert) from bge-m3-unsupervised|
@@ -48,7 +51,6 @@ Utilizing the re-ranking model (e.g., [bge-reranker](https://github.com/FlagOpen
48
  | [MLDR](https://huggingface.co/datasets/Shitao/MLDR) | Document Retrieval Dataset, covering 13 languages|
49
 
50
 
51
-
52
  ## FAQ
53
 
54
  **1. Introduction for different retrieval methods**
@@ -57,7 +59,6 @@ Utilizing the re-ranking model (e.g., [bge-reranker](https://github.com/FlagOpen
57
  - Sparse retrieval (lexical matching): a vector of size equal to the vocabulary, with the majority of positions set to zero, calculating a weight only for tokens present in the text. e.g., BM25, [unicoil](https://arxiv.org/pdf/2106.14807.pdf), and [splade](https://arxiv.org/abs/2107.05720)
58
  - Multi-vector retrieval: use multiple vectors to represent a text, e.g., [ColBERT](https://arxiv.org/abs/2004.12832).
59
 
60
-
61
  **2. Comparison with BGE-v1.5 and other monolingual models**
62
 
63
  BGE-M3 is a multilingual model, and its ability in monolingual embedding retrieval may not surpass models specifically designed for single languages.
@@ -77,6 +78,11 @@ For sparse retrieval methods, most open-source libraries currently do not suppor
77
  Contributions from the community are welcome.
78
 
79
 
 
 
 
 
 
80
  **4. How to fine-tune bge-M3 model?**
81
 
82
  You can follow the common practice in this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune)
@@ -218,10 +224,10 @@ print(model.compute_score(sentence_pairs,
218
  - Long Document Retrieval
219
  - MLDR:
220
  ![avatar](./imgs/long.jpg)
221
- Please note that MLDR is a document retrieval dataset we constructed via LLM,
222
  covering 13 languages, including test set, validation set, and training set.
223
  We utilized the training set from MLDR to enhance the model's long document retrieval capabilities.
224
- Therefore, comparing baseline with `Dense w.o.long`(fine-tuning without long document dataset) is more equitable.
225
  Additionally, this long document retrieval dataset will be open-sourced to address the current lack of open-source multilingual long text retrieval datasets.
226
  We believe that this data will be helpful for the open-source community in training document retrieval models.
227
 
 
9
 
10
  For more details please refer to our github repo: https://github.com/FlagOpen/FlagEmbedding
11
 
12
+ # BGE-M3 ([paper](https://arxiv.org/pdf/2402.03216.pdf), [code](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3))
13
+
14
  In this project, we introduce BGE-M3, which is distinguished for its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity.
15
  - Multi-Functionality: It can simultaneously perform the three common retrieval functionalities of an embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval.
16
  - Multi-Linguality: It can support more than 100 working languages.
 
27
 
28
 
29
  ## News:
30
+ - 2/6/2024: We release the [MLDR](https://huggingface.co/datasets/Shitao/MLDR), a long document retrieval dataset covering 13 languages.
31
  - 2/1/2024: **Thanks for the excellent tool from Vespa.** You can easily use multiple modes of BGE-M3 following this [notebook](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb)
32
 
33
 
34
  ## Specs
35
 
36
  - Model
37
+
38
  | Model Name | Dimension | Sequence Length | Introduction |
39
  |:----:|:---:|:---:|:---:|
40
  | [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) | 1024 | 8192 | multilingual; unified fine-tuning (dense, sparse, and colbert) from bge-m3-unsupervised|
 
51
  | [MLDR](https://huggingface.co/datasets/Shitao/MLDR) | Document Retrieval Dataset, covering 13 languages|
52
 
53
 
 
54
  ## FAQ
55
 
56
  **1. Introduction for different retrieval methods**
 
59
  - Sparse retrieval (lexical matching): a vector of size equal to the vocabulary, with the majority of positions set to zero, calculating a weight only for tokens present in the text. e.g., BM25, [unicoil](https://arxiv.org/pdf/2106.14807.pdf), and [splade](https://arxiv.org/abs/2107.05720)
60
  - Multi-vector retrieval: use multiple vectors to represent a text, e.g., [ColBERT](https://arxiv.org/abs/2004.12832).
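
The three retrieval styles listed in this FAQ entry differ mainly in how a query-document score is computed. The sketch below illustrates that difference with plain NumPy. It is not the FlagEmbedding implementation: the function names are hypothetical, the sparse representation is assumed to be a token-to-weight dictionary, and the multi-vector score is assumed to average the per-token maxima (the original ColBERT sums them).

```python
import numpy as np


def dense_score(q_vec, d_vec):
    """Dense retrieval: cosine similarity between one vector per text."""
    q = q_vec / np.linalg.norm(q_vec)
    d = d_vec / np.linalg.norm(d_vec)
    return float(q @ d)


def sparse_score(q_weights, d_weights):
    """Sparse (lexical) retrieval: only tokens present in both texts contribute.

    Each argument maps token -> weight; absent tokens are implicitly zero.
    """
    shared = q_weights.keys() & d_weights.keys()
    return float(sum(q_weights[t] * d_weights[t] for t in shared))


def multi_vector_score(q_vecs, d_vecs):
    """Multi-vector retrieval (ColBERT-style late interaction).

    For every query token vector, take its best cosine match among the
    document token vectors, then average over query tokens (an assumption
    of this sketch).
    """
    q = q_vecs / np.linalg.norm(q_vecs, axis=1, keepdims=True)
    d = d_vecs / np.linalg.norm(d_vecs, axis=1, keepdims=True)
    sim = q @ d.T  # shape: (query tokens, document tokens)
    return float(sim.max(axis=1).mean())
```

The dictionary form of the sparse vector is equivalent to a vocabulary-sized vector with mostly zero entries, but avoids materializing the zeros.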
61
 
 
62
  **2. Comparison with BGE-v1.5 and other monolingual models**
63
 
64
  BGE-M3 is a multilingual model, and its ability in monolingual embedding retrieval may not surpass models specifically designed for single languages.
 
78
  Contributions from the community are welcome.
79
 
80
 
81
+ In our experiments, we use [Pyserini](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR#hybrid-retrieval-dense--sparse) and Faiss to do hybrid retrieval.
82
+ **Now you can try the hybrid mode of BGE-M3 in [Vespa](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb
83
+ ). Thanks @jobergum.**
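
One common way to combine dense and sparse results in hybrid retrieval is a weighted sum of the two scores per document. The sketch below shows only that idea. It is not how Pyserini, Faiss, or Vespa implement hybrid search, and the function name, default weights, and document ids are hypothetical.

```python
def hybrid_rank(dense_scores, sparse_scores, dense_weight=0.5, sparse_weight=0.5):
    """Fuse two score dictionaries (doc id -> score) by weighted sum.

    A document missing from one result list is assumed to score 0.0 there.
    Returns (doc id, fused score) pairs, best first; ties are broken by doc id.
    """
    doc_ids = dense_scores.keys() | sparse_scores.keys()
    fused = {
        doc_id: dense_weight * dense_scores.get(doc_id, 0.0)
        + sparse_weight * sparse_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical scores from a dense index and a sparse index
ranking = hybrid_rank({"d1": 0.9, "d2": 0.2}, {"d2": 0.8, "d3": 0.4})
print([doc_id for doc_id, _ in ranking])
```

In practice the two score types live on different scales, so the weights usually need tuning on a validation set.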
84
+
85
+
86
  **4. How to fine-tune bge-M3 model?**
87
 
88
  You can follow the common practice in this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune)
 
224
  - Long Document Retrieval
225
  - MLDR:
226
  ![avatar](./imgs/long.jpg)
227
+ Please note that [MLDR](https://huggingface.co/datasets/Shitao/MLDR) is a document retrieval dataset we constructed via LLM,
228
  covering 13 languages, including test set, validation set, and training set.
229
  We utilized the training set from MLDR to enhance the model's long document retrieval capabilities.
230
+ Therefore, comparing baselines with `Dense w.o.long` (fine-tuning without the long document dataset) is more equitable.
231
  Additionally, this long document retrieval dataset will be open-sourced to address the current lack of open-source multilingual long text retrieval datasets.
232
  We believe that this data will be helpful for the open-source community in training document retrieval models.
233