---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-14B-Instruct
pipeline_tag: text-generation
tags:
- GenRM
- Reasoning
- o1
---

### Model Description

This model is fine-tuned on reward modeling data in two stages of training: **Supervised Fine-Tuning (SFT)** followed by **Direct Preference Optimization (DPO)**. The released checkpoint is the post-DPO model, optimized for reasoning and text generation tasks.

The model expects conversations in a three-role chat format, where a `reason` turn precedes the final `assistant` turn:

```python
chat_message = [
    {"role": "user", "content": ...},
    {"role": "reason", "content": ...},
    {"role": "assistant", "content": ...},
]
```

### Intended Use

While this model is specifically designed for **reward modeling tasks**, it also adapts to **general-purpose tasks**, maintaining a reasonable degree of correctness and reliability across a range of applications.

### Limitations

- The model's performance may vary with the domain and specificity of the input.
- It may inherit biases present in the training data.

### Code and Resources

The code and additional resources for this model are available on [GitHub](https://github.com/Freder-chen/ReasonGenRM).
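### Example Usage

As a rough sketch of how the chat format above might be driven with 🤗 Transformers: the snippet below uses a hypothetical model id (substitute the actual repository id) and assumes the tokenizer's chat template knows how to render the `reason` role. Neither is confirmed by this card, so treat it as illustrative rather than definitive.

```python
# Minimal sketch, not the confirmed inference recipe for this model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/this-model"  # hypothetical placeholder; use the real repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# A single user turn; the model is expected to produce the `reason`
# turn followed by the `assistant` turn in its completion.
chat_message = [
    {"role": "user", "content": "Which of the two responses below is better, and why? ..."},
]

# apply_chat_template renders the conversation with the model's own template.
input_ids = tokenizer.apply_chat_template(
    chat_message, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=1024)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

A 14B model needs correspondingly large GPU memory; see the [GitHub repository](https://github.com/Freder-chen/ReasonGenRM) for the authors' actual training and inference code.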