floom's Collections: Evaluation
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference (arXiv:2403.04132, 38 upvotes)
Evaluating Very Long-Term Conversational Memory of LLM Agents (arXiv:2402.17753, 18 upvotes)
The FinBen: An Holistic Financial Benchmark for Large Language Models (arXiv:2402.12659, 17 upvotes)
TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization (arXiv:2402.13249, 11 upvotes)
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models (arXiv:2405.01535, 119 upvotes)
To Believe or Not to Believe Your LLM (arXiv:2406.02543, 32 upvotes)
Evaluating Open Language Models Across Task Types, Application Domains, and Reasoning Types: An In-Depth Experimental Analysis (arXiv:2406.11402, 6 upvotes)
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges (arXiv:2406.12624, 36 upvotes)
Paper (arXiv:2408.02666, 27 upvotes)