MiniCPM-Reranker

MiniCPM-Reranker is a bilingual and cross-lingual text re-ranking model developed by ModelBest Inc., THUNLP (the Tsinghua University Natural Language Processing Lab), and NEUIR (the Northeastern University Information Retrieval Group). It features:

  • Exceptional Chinese and English re-ranking capabilities.
  • Outstanding cross-lingual re-ranking capabilities between Chinese and English.

MiniCPM-Reranker is trained from MiniCPM-2B-sft-bf16 and uses bidirectional attention in its architecture. The model underwent multi-stage training on approximately 6 million training examples, including open-source, synthetic, and proprietary data.

We also invite you to explore the RAG toolkit series:

Model Information

  • Model Size: 2.4B

  • Max Input Tokens: 1024

Usage

Input Format

MiniCPM-Reranker supports instructions in the following format:

<s>Instruction: {{ instruction }} Query: {{ query }}</s>{{ document }}

For example:

<s>Instruction: 为这个医学问题检索相关回答。Query: 咽喉癌的成因是什么?</s>(document omitted)
<s>Instruction: Given a claim about climate change, retrieve documents that support or refute the claim. Query: However the warming trend is slower than most climate models have forecast.</s>(document omitted)

MiniCPM-Reranker also works in instruction-free mode, using the following format:

<s>Query: {{ query }}</s>{{ document }}

When evaluating on BEIR and C-MTEB/Retrieval, we use the instructions in instructions.json. Other evaluations do not use instructions.
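The two input formats above can be sketched as a small helper that builds the query-side text. This is a minimal sketch, not part of the released code: the function name `build_query_text` is hypothetical, and it assumes that the tokenizer (as in the demo scripts) adds `<s>` and `</s>` itself, so only the text between them is produced here.

```python
def build_query_text(query, instruction=None):
    """Build the query-side text for a (query, document) pair.

    Hypothetical helper: the <s> and </s> markers are assumed to be added
    by the tokenizer, so they are not included in the returned string.
    """
    if instruction:
        # Instruction mode: "Instruction: {instruction} Query: {query}"
        return f"Instruction: {instruction} Query: {query}"
    # Instruction-free mode: "Query: {query}"
    return f"Query: {query}"


# Usage: pair the formatted query with each candidate document.
query_text = build_query_text(
    "However the warming trend is slower than most climate models have forecast.",
    instruction="Given a claim about climate change, retrieve documents that support or refute the claim.",
)
pairs = [[query_text, doc] for doc in ["first document", "second document"]]
```

The resulting `pairs` list has the same shape as the `[query, document]` pairs passed to the tokenizer in the demo scripts.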

Requirements

transformers==4.37.2
flash-attn>2.3.5

Demo

Hugging Face Transformers

from transformers import LlamaTokenizer, AutoModelForSequenceClassification
import torch
import numpy as np

# from https://github.com/huggingface/transformers/blob/v4.44.2/src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py
class MiniCPMRerankerLLamaTokenizer(LlamaTokenizer):
    def build_inputs_with_special_tokens(
            self, token_ids_0, token_ids_1 = None
        ):
            """
            - single sequence: `<s> X </s>`
            - pair of sequences: `<s> A </s> B`

            Args:
                token_ids_0 (`List[int]`):
                    List of IDs to which the special tokens will be added.
                token_ids_1 (`List[int]`, *optional*):
                    Optional second list of IDs for sequence pairs.

            Returns:
                `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
            """

            if token_ids_1 is None:
                return super().build_inputs_with_special_tokens(token_ids_0)
            bos = [self.bos_token_id]
            sep = [self.eos_token_id]
            return bos + token_ids_0 + sep + token_ids_1

model_name = "openbmb/MiniCPM-Reranker"
tokenizer = MiniCPMRerankerLLamaTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.padding_side = "right"

model = AutoModelForSequenceClassification.from_pretrained(model_name, trust_remote_code=True, attn_implementation="flash_attention_2", torch_dtype=torch.float16).to("cuda")
model.eval()

@torch.no_grad()
def rerank(input_query, input_docs):
    tokenized_inputs = tokenizer([[input_query, input_doc] for input_doc in input_docs], return_tensors="pt", padding=True, truncation=True, max_length=1024)

    # Move every input tensor to the GPU that holds the model.
    for k in tokenized_inputs:
        tokenized_inputs[k] = tokenized_inputs[k].to("cuda")

    outputs = model(**tokenized_inputs)
    score = outputs.logits
    return score.float().detach().cpu().numpy()

queries = ["中国的首都是哪里?"]  # "Where is the capital of China?"
passages = [["beijing", "shanghai"]]

INSTRUCTION = "Query: "
queries = [INSTRUCTION + query for query in queries]

scores = []
for i in range(len(queries)):
    print(queries[i])
    scores.append(rerank(queries[i], passages[i]))

print(np.array(scores))  # [[[-4.7460938][-8.8515625]]]
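The raw logits returned by `rerank` are only scores; turning them into a ranking means sorting the documents by score in descending order. The following is a minimal sketch of that step, assuming one score per document in either a flat list or the `(n_docs, 1)` layout shown in the output comment above. The name `sort_by_score` is hypothetical and not part of the model's code.

```python
def sort_by_score(docs, scores):
    """Return (document, score) pairs sorted from most to least relevant.

    Hypothetical helper. `scores` may be a flat sequence or a sequence of
    one-element rows, such as the (n_docs, 1) logits from the demo above.
    """
    flat = [float(row[0]) if hasattr(row, "__len__") else float(row) for row in scores]
    if len(flat) != len(docs):
        raise ValueError("expected exactly one score per document")
    # Higher logit means more relevant, so sort in descending order.
    return sorted(zip(docs, flat), key=lambda pair: pair[1], reverse=True)


# Usage with the two scores from the demo output, given in shuffled order.
ranked = sort_by_score(["shanghai", "beijing"], [[-8.8515625], [-4.7460938]])
for doc, score in ranked:
    print(f"{score:.4f}\t{doc}")
```

Only the relative order of the logits matters for re-ranking, so no normalization is applied here.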

Sentence Transformers

from sentence_transformers import CrossEncoder
from transformers import LlamaTokenizer
import torch

# from https://github.com/huggingface/transformers/blob/v4.44.2/src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py
class MiniCPMRerankerLLamaTokenizer(LlamaTokenizer):
    def build_inputs_with_special_tokens(
            self, token_ids_0, token_ids_1 = None
        ):
            """
            - single sequence: `<s> X </s>`
            - pair of sequences: `<s> A </s> B`

            Args:
                token_ids_0 (`List[int]`):
                    List of IDs to which the special tokens will be added.
                token_ids_1 (`List[int]`, *optional*):
                    Optional second list of IDs for sequence pairs.

            Returns:
                `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
            """

            if token_ids_1 is None:
                return super().build_inputs_with_special_tokens(token_ids_0)
            bos = [self.bos_token_id]
            sep = [self.eos_token_id]
            return bos + token_ids_0 + sep + token_ids_1

model_name = "openbmb/MiniCPM-Reranker"
model = CrossEncoder(model_name, max_length=1024, trust_remote_code=True, automodel_args={"attn_implementation": "flash_attention_2", "torch_dtype": torch.float16})
model.tokenizer = MiniCPMRerankerLLamaTokenizer.from_pretrained(model_name, trust_remote_code=True)
model.tokenizer.padding_side = "right"

query = "中国的首都是哪里?"  # "Where is the capital of China?"
passages = ["beijing", "shanghai"]  # flat list: one string per candidate document

INSTRUCTION = "Query: "
query = INSTRUCTION + query

sentence_pairs = [[query, doc] for doc in passages]

scores = model.predict(sentence_pairs, convert_to_tensor=True).tolist()
rankings = model.rank(query, passages, return_documents=True, convert_to_tensor=True)

print(scores) # [0.0087432861328125, 0.00020503997802734375]
for ranking in rankings:
    print(f"ID: {ranking['corpus_id']}, Score: {ranking['score']:.4f}, Text: {ranking['text']}")

# ID: 0, Score: 0.0087, Text: beijing
# ID: 1, Score: 0.0002, Text: shanghai

Evaluation Results

CN/EN Re-ranking Results

For Chinese, we re-rank the top-100 documents retrieved by bge-large-zh-v1.5 on C-MTEB/Retrieval; for English, we re-rank the top-100 documents retrieved by bge-large-en-v1.5 on BEIR.

| Model | C-MTEB/Retrieval (NDCG@10) | BEIR (NDCG@10) |
|---|---|---|
| bge-large-zh-v1.5 (Retriever for Chinese) | 70.46 | - |
| bge-large-en-v1.5 (Retriever for English) | - | 54.29 |
| bge-reranker-v2-m3 | 71.82 | 55.36 |
| bge-reranker-v2-minicpm-28 | 73.51 | 59.86 |
| bge-reranker-v2-gemma | 71.74 | 60.71 |
| bge-reranker-v2.5-gemma2 | - | 63.67 |
| MiniCPM-Reranker | 76.79 | 61.32 |
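For readers unfamiliar with the metric, NDCG@10 can be sketched as follows. This is an illustrative implementation, not the evaluation code behind the numbers above: it assumes linear gain with a log2 rank discount, and by default it computes the ideal ordering from the same list of relevance labels (pass `judged_relevances` when relevant documents may be missing from the ranked list). The table reports the metric scaled by 100.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance labels."""
    # Position 0 is rank 1, so the discount is log2(rank + 1) = log2(index + 2).
    return sum(rel / math.log2(index + 2) for index, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10, judged_relevances=None):
    """NDCG@k for one query (sketch; linear gain is an assumption).

    ranked_relevances: relevance labels in the order the system ranked the documents.
    judged_relevances: optional labels of all judged documents, used for the ideal ranking.
    """
    ideal_source = ranked_relevances if judged_relevances is None else judged_relevances
    ideal = dcg_at_k(sorted(ideal_source, reverse=True), k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal


# Usage: the single relevant document is ranked second instead of first.
print(round(ndcg_at_k([0, 1, 0]), 4))
```

A re-ranker improves this metric by moving relevant documents from lower positions in the retriever's top-100 list into the top 10.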

CN-EN Cross-lingual Re-ranking Results

We re-rank the top-100 documents retrieved by bge-m3 (Dense).

| Model | MKQA En-Zh_CN (Recall@20) | NeuCLIR22 (NDCG@10) | NeuCLIR23 (NDCG@10) |
|---|---|---|---|
| bge-m3 (Dense) (Retriever) | 66.4 | 30.49 | 41.09 |
| jina-reranker-v2-base-multilingual | 69.33 | 36.66 | 50.03 |
| bge-reranker-v2-m3 | 69.75 | 40.98 | 49.67 |
| gte-multilingual-reranker-base | 68.51 | 38.74 | 45.3 |
| MiniCPM-Reranker | 71.73 | 43.65 | 50.59 |
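Recall@20 measures how much of the relevant material survives in the top 20 positions after re-ranking. The sketch below uses the common set-based definition; the exact variant used for MKQA is not stated here, so treat this as an illustration rather than the benchmark's scoring code. The function name and the document IDs are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k=20):
    """Fraction of relevant documents that appear in the top k of a ranking.

    Sketch using the set-based definition; a benchmark may instead use an
    answer-match or hit-rate variant.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = relevant.intersection(ranked_ids[:k])
    return len(hits) / len(relevant)


# Usage: two of the three relevant documents are in the top 3.
print(recall_at_k(["d4", "d1", "d7", "d2"], ["d1", "d2", "d4"], k=3))
```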

License

  • The code in this repo is released under the Apache-2.0 License.
  • The usage of MiniCPM-Reranker model weights must strictly follow MiniCPM Model License.md.
  • The models and weights of MiniCPM-Reranker are completely free for academic research. After filling out a "questionnaire" for registration, MiniCPM-Reranker weights are also available for free commercial use.