Quantization made by Richard Erkhov.

Github | Discord | Request more models

Themis - GGUF

| Name | Quant method | Size |
| ---- | ------------ | ---- |
| Themis.Q2_K.gguf | Q2_K | 2.96GB |
| Themis.Q3_K_S.gguf | Q3_K_S | 3.41GB |
| Themis.Q3_K.gguf | Q3_K | 3.74GB |
| Themis.Q3_K_M.gguf | Q3_K_M | 3.74GB |
| Themis.Q3_K_L.gguf | Q3_K_L | 4.03GB |
| Themis.IQ4_XS.gguf | IQ4_XS | 4.18GB |
| Themis.Q4_0.gguf | Q4_0 | 4.34GB |
| Themis.IQ4_NL.gguf | IQ4_NL | 4.38GB |
| Themis.Q4_K_S.gguf | Q4_K_S | 4.37GB |
| Themis.Q4_K.gguf | Q4_K | 4.58GB |
| Themis.Q4_K_M.gguf | Q4_K_M | 4.58GB |
| Themis.Q4_1.gguf | Q4_1 | 4.78GB |
| Themis.Q5_0.gguf | Q5_0 | 5.21GB |
| Themis.Q5_K_S.gguf | Q5_K_S | 5.21GB |
| Themis.Q5_K.gguf | Q5_K | 5.34GB |
| Themis.Q5_K_M.gguf | Q5_K_M | 5.34GB |
| Themis.Q5_1.gguf | Q5_1 | 5.65GB |
| Themis.Q6_K.gguf | Q6_K | 6.14GB |
| Themis.Q8_0.gguf | Q8_0 | 7.95GB |
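
As a quick usage sketch (not part of the original card): the files above can be fetched with `huggingface_hub` and run with `llama-cpp-python`. The repo id below is an assumption; substitute this repository's actual Hub id, and pick whichever quant fits your memory budget.

```python
# Hedged sketch: download one quant from the table above and run it locally.
# Requires: pip install huggingface_hub llama-cpp-python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="RichardErkhov/PKU-ONELab_-_Themis-gguf",  # assumed repo id; replace with this repo's id
    filename="Themis.Q4_K_M.gguf",                     # 4.58GB; a common size/quality trade-off
)

llm = Llama(model_path=model_path, n_ctx=4096)
out = llm("Evaluate the following summary for coherence:\n...", max_tokens=256)
print(out["choices"][0]["text"])
```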

Original model description:

License: apache-2.0

Themis

Paper: https://arxiv.org/abs/2406.18365

Github: https://github.com/PKU-ONELab/Themis

Introduction

We propose Themis, an 8B-parameter large language model (LLM) designed and trained specifically for NLG evaluation, with more comprehensive capabilities than prior evaluators.

Themis can evaluate a wide range of NLG tasks, including uncommon ones such as question-answering evaluation (Versatility), in a reference-free manner (Independence). It also supports specific and customized evaluation aspects and criteria, from overall quality down to fine-grained aspects (Flexibility), and each evaluation pairs the rating with a corresponding analysis and explanation (Interpretability). An illustrative prompt sketch follows below.
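
To make the flexibility concrete, here is one illustrative way to assemble a reference-free evaluation prompt with a custom aspect and criterion. This is a hypothetical sketch: the actual template Themis was trained on is defined in the GitHub repo linked above, so treat the field names here as placeholders.

```python
# Illustrative only: the real Themis prompt template lives in the
# PKU-ONELab/Themis GitHub repo; every field name here is hypothetical.
def build_eval_prompt(task: str, aspect: str, criterion: str,
                      source: str, target: str) -> str:
    """Assemble a reference-free evaluation prompt for a custom aspect."""
    return (
        f"Task: {task}\n"
        f"Aspect to evaluate: {aspect}\n"
        f"Criterion: {criterion}\n"
        f"Source: {source}\n"
        f"Output to evaluate: {target}\n"
        "Give an analysis and explanation, then a rating from 1 to 5."
    )

prompt = build_eval_prompt(
    task="Summarization",
    aspect="Coherence",
    criterion="The summary should be well structured and logically ordered.",
    source="(article text)",
    target="(generated summary)",
)
```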

We believe that an ideal evaluator should be convenient to use and possess these characteristics. The comparison between related methods and Themis is shown in the table below.

| Method | Versatility | Independence | Flexibility | Interpretability | Open-source |
| ------ | ----------- | ------------ | ----------- | ---------------- | ----------- |
| UniEval | ❌ | ❌ | ✔️ | ❌ | ✔️ |
| G-Eval | ✔️ | ✔️ | ✔️ | ✔️ | ❌ |
| X-Eval | ✔️ | ❌ | ✔️ | ❌ | ❌ |
| Prometheus | ✔️ | ❌ | ✔️ | ✔️ | ✔️ |
| Auto-J | ✔️ | ✔️ | ❌ | ✔️ | ✔️ |
| InstructScore | ✔️ | ❌ | ❌ | ✔️ | ✔️ |
| TIGERScore | ✔️ | ✔️ | ❌ | ✔️ | ✔️ |
| Themis (Ours) | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |

Performance

We conduct experiments on several common NLG evaluation tasks and datasets to compare Themis with other methods: SummEval for summarization, Topical-Chat for dialogue response generation, SFRES&SFHOT for data-to-text, QAGS for factuality, MANS for story generation, and WMT23 zh-en for machine translation. The results show that Themis achieves better overall evaluation performance than other evaluation models, including GPT-4.

| Method | SummEval | Topical-Chat | SFRES&SFHOT | QAGS | MANS | WMT23 | Average Spearman |
| ------ | -------- | ------------ | ----------- | ---- | ---- | ----- | ---------------- |
| BLEU | 0.075 | 0.388 | 0.024 | - | 0.032 | 0.021 | - |
| ROUGE | 0.152 | 0.412 | 0.101 | - | -0.002 | 0.151 | - |
| BARTScore | 0.329 | 0.086 | 0.208 | 0.425 | 0.350 | 0.118 | 0.253 |
| BERTScore | 0.231 | 0.394 | 0.139 | - | 0.285 | 0.219 | - |
| BLEURT | 0.152 | 0.388 | 0.244 | - | 0.138 | 0.263 | - |
| CometKiwi | 0.228 | 0.340 | 0.251 | 0.094 | 0.251 | 0.343 | 0.251 |
| UniEval | 0.474 | 0.577 | 0.282 | - | - | - | - |
| G-Eval (GPT-3.5) | 0.409 | 0.585 | - | 0.461 | - | - | - |
| G-Eval (GPT-4) | 0.523 | 0.588 | - | 0.611 | - | - | - |
| GPT-3.5 Turbo | 0.416 | 0.578 | 0.306 | 0.431 | 0.328 | 0.347 | 0.401 |
| GPT-4 Turbo | 0.511 | 0.746 | 0.320 | 0.637 | 0.473 | 0.437 | 0.521 |
| X-Eval | 0.480 | 0.605 | 0.303 | 0.578 | - | - | - |
| Prometheus-13B | 0.163 | 0.434 | 0.173 | - | 0.007 | 0.129 | - |
| Auto-J-13B | 0.198 | 0.425 | 0.141 | 0.226 | 0.380 | 0.104 | 0.246 |
| TIGERScore-13B | 0.384 | 0.346 | 0.200 | 0.504 | 0.231 | 0.248 | 0.319 |
| InstructScore-7B | 0.258 | 0.241 | 0.247 | - | 0.298 | 0.219 | - |
| Themis-8B (ours) | 0.553 | 0.725 | 0.333 | 0.684 | 0.551 | 0.405 | 0.542 |
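
The numbers above are Spearman correlations between automatic ratings and human judgments. As a minimal sketch of how such a score is computed (with made-up ratings, not data from the paper):

```python
# Minimal sketch: Spearman correlation between model and human ratings.
# The scores below are invented for illustration; they are not paper data.
from scipy.stats import spearmanr

human_scores = [4, 2, 5, 3, 1, 4, 2]   # hypothetical human ratings
model_scores = [5, 2, 4, 3, 2, 4, 1]   # hypothetical model ratings

rho, p_value = spearmanr(human_scores, model_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```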

We further conduct more in-depth analyses, including generalization tests on unseen tasks such as instruction-following evaluation, as well as aspect-targeted perturbation tests; Themis again exhibits superior evaluation performance. For more experimental results and details, please refer to our paper.

Requirements and Usage

Please refer to our GitHub repo for more details.
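
For a rough idea of loading the original (non-quantized) weights with `transformers`, here is a hedged sketch; the Hub id `PKU-ONELab/Themis` and the prompt wording are assumptions, so check the GitHub repo for the exact details.

```python
# Hedged sketch: load the original weights with transformers.
# "PKU-ONELab/Themis" is an assumed Hub id; verify against the GitHub repo.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PKU-ONELab/Themis"  # assumed
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

inputs = tokenizer("(evaluation prompt here)", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```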

Citation

@article{hu2024themis,
  title={Themis: Towards Flexible and Interpretable NLG Evaluation},
  author={Hu, Xinyu and Lin, Li and Gao, Mingqi and Yin, Xunjian and Wan, Xiaojun},
  journal={arXiv preprint arXiv:2406.18365},
  year={2024}
}