LiveMath-Judge

Introduction

LiveMath-Judge is a judge model for mathematics that checks whether a model-generated answer is consistent with the golden answer. Given the question, the golden answer, and the model-generated answer as input, LiveMath-Judge outputs yes if the two answers are equivalent and no if they are not.

To train LiveMath-Judge, we collected about 2.8 million samples on LiveMathBench, each judged by Qwen2.5-72B-Instruct, and fine-tuned Qwen2.5-3B-Instruct on them for one epoch with a learning rate of 5e-6.

Use

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    'jnanliu/LiveMath-Judge', 
    device_map='auto', 
    torch_dtype=torch.bfloat16, 
)
tokenizer = AutoTokenizer.from_pretrained(
    'jnanliu/LiveMath-Judge', 
    trust_remote_code=True
)

question = 'In $\\triangle ABC$, given $\\cos C = \\frac{\\sin A + \\cos A}{2} = \\frac{\\sin B + \\cos B}{2}$, find the value of $\\sin C$.'
golden_answer = '$\\frac{3}{4}$'
generated_answer = '\\frac{3}{4}'
prompt = '''Please act as an expert in grading mathematics exam papers, and judge whether the following answers match the standard answers, i.e., whether the examinee answered correctly. Here are some evaluation criteria:

1. Some answers may contain multiple parts, such as single-choice questions, multiple-choice questions, fill-in-the-blank questions, and problem-solving questions. As long as the answer matches the standard answer, it is considered correct. For multiple-choice questions and fill-in-the-blank questions with multiple blanks, the examinee must answer all corresponding options or blanks correctly to be considered correct.
2. Some answers may be expressed in different ways; for example, some answers may be mathematical expressions, while others may be textual descriptions. As long as the meaning conveyed is consistent, it is considered correct. Additionally, some formulas may be expressed differently but are equivalent, which is also considered correct.
3. You do not need to recalculate the problem answers, as the standard answers are already provided. You only need to judge whether the examinee's answer matches the standard answer based on the form of the question and whether it is correct.

Please judge whether the following answer matches the standard answer according to the above criteria. If they match, output \\boxed{{yes}}, otherwise output \\boxed{{no}}. If it is difficult to judge, also output \\boxed{{no}}.
Original Question: {question}
Standard Answer: {gold_answer}
Examinee's Answer: {answer}

Analysis:
'''
# prompt for Chinese questions (uncomment to use it instead of the English prompt)
# prompt = '''请你作为一个数学阅卷专家,判断下面的答案是否与标准答案一致,即考生是否回答正确。下面是一些评判标准:
# 1. 有些答案可能包含多项内容,可能有单选题,多选题,填空题和问答题,只要答案与标准答案一致即可, 对于多选题和多个空的填空题,需要考生对应的选项或空都回答正确才算正确。
# 2. 有些答案可能通过不同的方式表达,比如有些答案可能是一个数学表达式,有些答案可能是一个文字描述,只要表达的意思一致即可。且有些公式通过不同的方式表达,但等价,也是正确的。
# 3. 你不需要重新计算问题答案,因为标准答案已经给出,只需要根据问题形式来判断考生的答案是否与标准答案一致,是否正确即可。

# 请你根据上述标准,判断下面的答案是否与标准答案一致,如果一致,请在最后输出\\boxed{{yes}}, 否则输出\\boxed{{no}}, 如果难以判断,请输出\\boxed{{no}}.
# 原问题:{question}
# 标准答案:{gold_answer}
# 考生答案:{answer}

# 分析:
# '''

conversations = [
  {'role': 'user', 'content': prompt.format(question=question, gold_answer=golden_answer, answer=generated_answer)}
]
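# add_generation_prompt=True appends the assistant turn marker;
# return_dict=True returns a mapping with input_ids and attention_mask
inputs = tokenizer.apply_chat_template(
    conversations,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors='pt',
)

# run inference
pred = model.generate(
    input_ids=inputs['input_ids'].to(model.device),
    attention_mask=inputs['attention_mask'].to(model.device),
    num_return_sequences=1,
)[0].cpu().tolist()
# decode only the newly generated tokens, skipping the prompt
prompt_length = inputs['input_ids'].shape[1]
response = tokenizer.decode(pred[prompt_length:], skip_special_tokens=True)
# the response is expected to end with \boxed{yes}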
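The judge writes a short analysis and then a boxed verdict, so downstream code usually needs to turn the response string into a boolean. The following is a minimal sketch of one way to do that. The helper name `parse_verdict` and the choice to take the last boxed verdict are assumptions for illustration, not part of the released model or its repository.

```python
import re
from typing import Optional

# Matches \boxed{yes} or \boxed{no}, case-insensitively, allowing spaces inside the braces.
_VERDICT_RE = re.compile(r'\\boxed\{\s*(yes|no)\s*\}', re.IGNORECASE)


def parse_verdict(response: str) -> Optional[bool]:
    """Hypothetical helper: True for yes, False for no, None if no verdict is found."""
    matches = _VERDICT_RE.findall(response)
    if not matches:
        return None
    # Take the last match, since the prompt asks for the verdict after the analysis.
    return matches[-1].lower() == 'yes'


if __name__ == '__main__':
    print(parse_verdict('The two fractions are identical. \\boxed{yes}'))
    print(parse_verdict('The answers differ in sign. \\boxed{no}'))
    print(parse_verdict('No verdict was produced.'))
```

Pass only the newly generated text to this helper. The prompt itself mentions both \boxed{yes} and \boxed{no}, so a decoded string that still contains the prompt would yield a verdict even when the model produced none. Treating `None` as "no" matches the prompt's instruction to answer no when the case is hard to judge, but that mapping is a design choice left to the caller.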

Performance

The table below reports the mG-Pass@16 results of 10 models, scored once with Qwen2.5-72B-Instruct as the judge and once with LiveMath-Judge as the judge.

| Model | Judge: Qwen2.5-72B-Instruct | Judge: LiveMath-Judge |
| --- | --- | --- |
| Qwen2.5-7B-Instruct | 26.45 | 28.17 |
| Qwen2.5-Math-7B-Instruct | 37.91 | 39.54 |
| Llama-3.1-8B-Instruct | 10.43 | 9.29 |
| Llama-3.1-70B-Instruct | 21.37 | 22.12 |
| Llama-3.3-70B-Instruct | 27.36 | 30.47 |
| Mistral-Large-Instruct-2411 | 36.66 | 36.67 |
| Qwen2.5-32B-Instruct | 39.09 | 37.84 |
| Qwen2.5-72B-Instruct | 38.52 | 37.57 |
| deepseek-math-7b-rl | 14.01 | 14.13 |
| Qwen2.5-Math-72B-Instruct | 43.80 | 42.10 |
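To see how closely the two judges agree, the per-model gap can be computed directly from the table. The sketch below copies the scores from the table above. The variable names and the choice of mean absolute difference as a summary are illustrative assumptions, not an analysis taken from the original work.

```python
# Scores copied from the table above: (Qwen2.5-72B-Instruct as judge, LiveMath-Judge as judge).
scores = {
    'Qwen2.5-7B-Instruct': (26.45, 28.17),
    'Qwen2.5-Math-7B-Instruct': (37.91, 39.54),
    'Llama-3.1-8B-Instruct': (10.43, 9.29),
    'Llama-3.1-70B-Instruct': (21.37, 22.12),
    'Llama-3.3-70B-Instruct': (27.36, 30.47),
    'Mistral-Large-Instruct-2411': (36.66, 36.67),
    'Qwen2.5-32B-Instruct': (39.09, 37.84),
    'Qwen2.5-72B-Instruct': (38.52, 37.57),
    'deepseek-math-7b-rl': (14.01, 14.13),
    'Qwen2.5-Math-72B-Instruct': (43.80, 42.10),
}

# Signed gap per model: LiveMath-Judge score minus Qwen2.5-72B-Instruct score.
diffs = {model: round(small - large, 2) for model, (large, small) in scores.items()}

# Summary statistics over the 10 models.
mean_abs = sum(abs(d) for d in diffs.values()) / len(diffs)
largest = max(diffs, key=lambda model: abs(diffs[model]))

if __name__ == '__main__':
    for model, d in diffs.items():
        print(f'{model}: {d:+.2f}')
    print(f'mean absolute gap: {mean_abs:.3f}')
    print(f'largest gap: {largest} ({diffs[largest]:+.2f})')
```

On these numbers the mean absolute gap is about 1.24 points, and the largest gap is 3.11 points, for Llama-3.3-70B-Instruct.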

Related Citation

@article{liu2024your,
  title={Are Your LLMs Capable of Stable Reasoning?},
  author={Liu, Junnan and Liu, Hongwei and Xiao, Linchen and Wang, Ziyi and Liu, Kuikun and Gao, Songyang and Zhang, Wenwei and Zhang, Songyang and Chen, Kai},
  journal={arXiv preprint arXiv:2412.13147},
  year={2024}
}