Step-level Value Preference Optimization for Mathematical Reasoning

This is the official repository for the paper Step-level Value Preference Optimization for Mathematical Reasoning. The code was extracted from our internal corporate codebase, so reproduced numbers may differ slightly from those reported in the paper, but they should be very close.

The implementation of SVPO builds on AlphaMath, including its MCTS and step-level beam search (SBS) components. Therefore, this repository provides the code for constructing step-level preference pairs to facilitate reproduction.
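To give a rough idea of what constructing step-level preference pairs from a search tree can look like, here is a minimal sketch. It is not the code shipped in this repository: the `Node` class, its `value` field, the `min_gap` threshold, and the best-versus-worst sibling selection rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """Hypothetical search-tree node: one reasoning step plus its value estimate."""
    step: str
    value: float
    children: List["Node"] = field(default_factory=list)


def build_step_pairs(
    node: Node, prefix: str = "", min_gap: float = 0.5
) -> List[Tuple[str, str, str]]:
    """Collect (prefix, chosen_step, rejected_step) triples from a tree.

    At each node with at least two children, the highest-valued child is
    treated as "chosen" and the lowest-valued child as "rejected", but only
    when their value gap is at least `min_gap` (an assumed filter).
    """
    pairs: List[Tuple[str, str, str]] = []
    if len(node.children) >= 2:
        ranked = sorted(node.children, key=lambda c: c.value, reverse=True)
        best, worst = ranked[0], ranked[-1]
        if best.value - worst.value >= min_gap:
            pairs.append((prefix, best.step, worst.step))
    # Recurse so that deeper steps share the partial solution as their prefix.
    for child in node.children:
        pairs.extend(build_step_pairs(child, prefix + child.step, min_gap))
    return pairs


if __name__ == "__main__":
    # Toy tree with made-up steps and values.
    root = Node("", 0.0, [
        Node("x = 2\n", 0.9, [Node("answer: 4\n", 1.0), Node("answer: 5\n", -1.0)]),
        Node("x = 3\n", -0.8),
    ])
    for prefix, chosen, rejected in build_step_pairs(root, prefix="Question\n"):
        print(repr(prefix), repr(chosen), repr(rejected))
```

Each triple shares a common partial solution and differs only in the next step, which is what makes the preference step-level rather than solution-level.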

Citation

SVPO

@misc{chen2024steplevel,
      title={Step-level Value Preference Optimization for Mathematical Reasoning}, 
      author={Guoxin Chen and Minpeng Liao and Chengxi Li and Kai Fan},
      year={2024},
      eprint={2406.10858},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

AlphaMath

@misc{chen2024alphamath,
      title={AlphaMath Almost Zero: Process Supervision without Process},
      author={Guoxin Chen and Minpeng Liao and Chengxi Li and Kai Fan},
      year={2024},
      eprint={2405.03553},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}