- Rethinking FID: Towards a Better Evaluation Metric for Image Generation
  Paper • 2401.09603 • Published • 16
- LLM Comparator: Visual Analytics for Side-by-Side Evaluation of Large Language Models
  Paper • 2402.10524 • Published • 22
- Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming
  Paper • 2402.14261 • Published • 10
- RewardBench: Evaluating Reward Models for Language Modeling
  Paper • 2403.13787 • Published • 21
Collections including paper arxiv:2402.14261
- Self-Rewarding Language Models
  Paper • 2401.10020 • Published • 145
- ReFT: Reasoning with Reinforced Fine-Tuning
  Paper • 2401.08967 • Published • 29
- Tuning Language Models by Proxy
  Paper • 2401.08565 • Published • 21
- TrustLLM: Trustworthiness in Large Language Models
  Paper • 2401.05561 • Published • 66