Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models • 2312.04724 • Published Dec 7, 2023
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution • 2401.03065 • Published Jan 5, 2024
Narrowing the Knowledge Evaluation Gap: Open-Domain Question Answering with Multi-Granularity Answers • 2401.04695 • Published Jan 9, 2024
TravelPlanner: A Benchmark for Real-World Planning with Language Agents • 2402.01622 • Published Feb 2, 2024
Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios • 2401.17167 • Published Jan 30, 2024
Language Models, Agent Models, and World Models: The LAW for Machine Reasoning and Planning • 2312.05230 • Published Dec 8, 2023
LongAlign: A Recipe for Long Context Alignment of Large Language Models • 2401.18058 • Published Jan 31, 2024
Premise Order Matters in Reasoning with Large Language Models • 2402.08939 • Published Feb 14, 2024
In Search of Needles in a 10M Haystack: Recurrent Memory Finds What LLMs Miss • 2402.10790 • Published Feb 16, 2024
LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks • 2412.15204 • Published Dec 19, 2024
RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios • 2412.08972 • Published Dec 12, 2024