HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation
Abstract
We introduce self-invoking code generation, a new task designed to evaluate the progressive reasoning and problem-solving capabilities of LLMs. In this task, models are presented with a base problem and a related, more complex problem. They must solve the base problem and then utilize its solution to address the more complex one. This work makes three key contributions. First, we propose a general recipe for generating more challenging versions of existing benchmarks, resulting in three new benchmarks: HumanEval Pro, MBPP Pro, and BigCodeBench-Lite Pro, specifically designed to assess LLMs on self-invoking code generation. Second, from our analysis of experimental results on twenty LLMs across these benchmarks, we make two important observations: (i) Most LLMs excel on traditional code generation benchmarks like HumanEval and MBPP, but their performance declines on self-invoking tasks. For example, o1-mini achieves 96.2% pass@1 on HumanEval but only 76.2% on HumanEval Pro. (ii) On the self-invoking code generation task, instruction-tuned models demonstrate only marginal improvements over their base models. Third, we characterize the failure modes observed in our evaluation results. All these results underscore the need for further advancements in self-invoking code generation tasks and provide a new direction for future research on enhancing LLMs' code reasoning capabilities.
Community
We present HumanEval Pro and MBPP Pro, two expanded versions of the traditional HumanEval and MBPP benchmarks, designed to evaluate LLMs on the self-invoking code generation task. Self-invoking code generation is a new task that assesses the progressive reasoning and problem-solving capabilities of LLMs. In this task, models are presented with a base problem and a related, more complex problem: they must solve the base problem and then utilize its solution to address the more complex one.
Check out our GitHub repo: https://github.com/CodeEval-Pro/CodeEval-Pro
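To make the task format concrete, here is a minimal, hypothetical illustration in Python (not an actual item from HumanEval Pro or MBPP Pro): a base problem and a more complex self-invoking problem whose reference solution reuses the base solution. The function names and test cases are invented for this sketch.

```python
# Hypothetical self-invoking problem pair (illustrative only, not a benchmark item).

# Base problem: count the vowels in a word.
def count_vowels(word: str) -> int:
    return sum(1 for ch in word.lower() if ch in "aeiou")

# Self-invoking problem: given a sentence, return the word with the most vowels.
# The reference solution invokes the base solution above.
def most_vowel_word(sentence: str) -> str:
    return max(sentence.split(), key=count_vowels)

# Example test cases in the style of HumanEval/MBPP unit tests.
assert count_vowels("Evaluation") == 6
assert most_vowel_word("the model reuses its base solution") == "solution"
```

In the benchmarks, a model is given both problem statements and must produce the two solutions together; pass@1 is computed over the combined test cases of the pair.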
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models (2024)
- SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Code Generation (2024)
- MdEval: Massively Multilingual Code Debugging (2024)
- WarriorCoder: Learning from Expert Battles to Augment Code Large Language Models (2024)
- On the Adversarial Robustness of Instruction-Tuned Large Language Models for Code (2024)
- Evaluating and Aligning CodeLLMs on Human Preference (2024)
- CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models (2024)