---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-14B-Instruct
pipeline_tag: text-generation
tags:
- GenRM
- Reasoning
- o1
---

### Model Description

This model is fine-tuned on reward modeling data in two stages of training: **Supervised Fine-Tuning (SFT)** followed by **Direct Preference Optimization (DPO)**. The released checkpoint is the post-DPO model, optimized for reasoning and text generation tasks.

The model expects conversations in a three-role chat format, where a `reason` turn precedes the final `assistant` turn:

```python
chat_message = [
    {"role": "user", "content": ...},
    {"role": "reason", "content": ...},
    {"role": "assistant", "content": ...},
]
```

### Intended Use

While this model is specifically designed for **reward modeling tasks**, it also adapts to **general-purpose tasks**, maintaining a reasonable degree of correctness and reliability across a range of applications.

### Limitations

- The model's performance may vary with the domain and specificity of the input.
- It may inherit biases present in the training data.

### Code and Resources

The code and additional resources for this model are available on [GitHub](https://github.com/Freder-chen/ReasonGenRM).
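### Example Usage

As a rough sketch of how the chat format above might be driven with 🤗 Transformers: the snippet below uses a hypothetical model id (substitute the actual repository id) and assumes the tokenizer's chat template knows how to render the `reason` role. Neither is confirmed by this card, so treat it as illustrative rather than definitive.

```python
# Minimal sketch, not the confirmed inference recipe for this model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/this-model"  # hypothetical placeholder; use the real repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# A single user turn; the model is expected to produce the `reason`
# turn followed by the `assistant` turn in its completion.
chat_message = [
    {"role": "user", "content": "Which of the two responses below is better, and why? ..."},
]

# apply_chat_template renders the conversation with the model's own template.
input_ids = tokenizer.apply_chat_template(
    chat_message, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=1024)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

A 14B model needs correspondingly large GPU memory; see the [GitHub repository](https://github.com/Freder-chen/ReasonGenRM) for the authors' actual training and inference code.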