Quantization made by Richard Erkhov.

Github | Discord | Request more models

Themis - GGUF

| Name | Quant method | Size |
| ---- | ------------ | ---- |
| Themis.Q2_K.gguf | Q2_K | 2.96GB |
| Themis.Q3_K_S.gguf | Q3_K_S | 3.41GB |
| Themis.Q3_K.gguf | Q3_K | 3.74GB |
| Themis.Q3_K_M.gguf | Q3_K_M | 3.74GB |
| Themis.Q3_K_L.gguf | Q3_K_L | 4.03GB |
| Themis.IQ4_XS.gguf | IQ4_XS | 4.18GB |
| Themis.Q4_0.gguf | Q4_0 | 4.34GB |
| Themis.IQ4_NL.gguf | IQ4_NL | 4.38GB |
| Themis.Q4_K_S.gguf | Q4_K_S | 4.37GB |
| Themis.Q4_K.gguf | Q4_K | 4.58GB |
| Themis.Q4_K_M.gguf | Q4_K_M | 4.58GB |
| Themis.Q4_1.gguf | Q4_1 | 4.78GB |
| Themis.Q5_0.gguf | Q5_0 | 5.21GB |
| Themis.Q5_K_S.gguf | Q5_K_S | 5.21GB |
| Themis.Q5_K.gguf | Q5_K | 5.34GB |
| Themis.Q5_K_M.gguf | Q5_K_M | 5.34GB |
| Themis.Q5_1.gguf | Q5_1 | 5.65GB |
| Themis.Q6_K.gguf | Q6_K | 6.14GB |
| Themis.Q8_0.gguf | Q8_0 | 7.95GB |
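
As a quick usage sketch (not part of the original card): the files above can be fetched with `huggingface_hub` and run with `llama-cpp-python`. The repo id below is an assumption; substitute this repository's actual Hub id, and pick whichever quant fits your memory budget.

```python
# Hedged sketch: download one quant from the table above and run it locally.
# Requires: pip install huggingface_hub llama-cpp-python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="RichardErkhov/PKU-ONELab_-_Themis-gguf",  # assumed repo id; replace with this repo's id
    filename="Themis.Q4_K_M.gguf",                     # 4.58GB; a common size/quality trade-off
)

llm = Llama(model_path=model_path, n_ctx=4096)
out = llm("Evaluate the following summary for coherence:\n...", max_tokens=256)
print(out["choices"][0]["text"])
```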

Original model description:

License: apache-2.0

Themis

Paper: https://arxiv.org/abs/2406.18365

Github: https://github.com/PKU-ONELab/Themis

Introduction

We propose Themis, an 8B-parameter large language model (LLM) designed and trained specifically for NLG evaluation, with more comprehensive capabilities than prior evaluators.

Themis can evaluate a wide range of NLG tasks, including uncommon ones such as question-answering evaluation (Versatility), in a reference-free manner (Independence). It also supports specific and customized evaluation aspects and criteria, from overall quality down to fine-grained aspects (Flexibility), and each evaluation pairs the rating with a corresponding analysis and explanation (Interpretability). An illustrative prompt sketch follows below.
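
To make the flexibility concrete, here is one illustrative way to assemble a reference-free evaluation prompt with a custom aspect and criterion. This is a hypothetical sketch: the actual template Themis was trained on is defined in the GitHub repo linked above, so treat the field names here as placeholders.

```python
# Illustrative only: the real Themis prompt template lives in the
# PKU-ONELab/Themis GitHub repo; every field name here is hypothetical.
def build_eval_prompt(task: str, aspect: str, criterion: str,
                      source: str, target: str) -> str:
    """Assemble a reference-free evaluation prompt for a custom aspect."""
    return (
        f"Task: {task}\n"
        f"Aspect to evaluate: {aspect}\n"
        f"Criterion: {criterion}\n"
        f"Source: {source}\n"
        f"Output to evaluate: {target}\n"
        "Give an analysis and explanation, then a rating from 1 to 5."
    )

prompt = build_eval_prompt(
    task="Summarization",
    aspect="Coherence",
    criterion="The summary should be well structured and logically ordered.",
    source="(article text)",
    target="(generated summary)",
)
```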

We believe that an ideal evaluator should be convenient to use and possess these characteristics. The comparison between related methods and Themis is shown in the table below.

| Method | Versatility | Independence | Flexibility | Interpretability | Open-source |
| ------ | ----------- | ------------ | ----------- | ---------------- | ----------- |
| UniEval | ❌ | ❌ | ✔️ | ❌ | ✔️ |
| G-Eval | ✔️ | ✔️ | ✔️ | ✔️ | ❌ |
| X-Eval | ✔️ | ❌ | ✔️ | ❌ | ❌ |
| Prometheus | ✔️ | ❌ | ✔️ | ✔️ | ✔️ |
| Auto-J | ✔️ | ✔️ | ❌ | ✔️ | ✔️ |
| InstructScore | ✔️ | ❌ | ❌ | ✔️ | ✔️ |
| TIGERScore | ✔️ | ✔️ | ❌ | ✔️ | ✔️ |
| Themis (Ours) | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |

Performance

We conduct experiments on several common NLG evaluation tasks and datasets to compare Themis with other methods: SummEval for summarization, Topical-Chat for dialogue response generation, SFRES&SFHOT for data-to-text, QAGS for factuality, MANS for story generation, and WMT23 zh-en for machine translation. The results show that Themis achieves better overall evaluation performance than other evaluation models, including GPT-4.

| Method | SummEval | Topical-Chat | SFRES&SFHOT | QAGS | MANS | WMT23 | Average Spearman |
| ------ | -------- | ------------ | ----------- | ---- | ---- | ----- | ---------------- |
| BLEU | 0.075 | 0.388 | 0.024 | - | 0.032 | 0.021 | - |
| ROUGE | 0.152 | 0.412 | 0.101 | - | -0.002 | 0.151 | - |
| BARTScore | 0.329 | 0.086 | 0.208 | 0.425 | 0.350 | 0.118 | 0.253 |
| BERTScore | 0.231 | 0.394 | 0.139 | - | 0.285 | 0.219 | - |
| BLEURT | 0.152 | 0.388 | 0.244 | - | 0.138 | 0.263 | - |
| CometKiwi | 0.228 | 0.340 | 0.251 | 0.094 | 0.251 | 0.343 | 0.251 |
| UniEval | 0.474 | 0.577 | 0.282 | - | - | - | - |
| G-Eval (GPT-3.5) | 0.409 | 0.585 | - | 0.461 | - | - | - |
| G-Eval (GPT-4) | 0.523 | 0.588 | - | 0.611 | - | - | - |
| GPT-3.5 Turbo | 0.416 | 0.578 | 0.306 | 0.431 | 0.328 | 0.347 | 0.401 |
| GPT-4 Turbo | 0.511 | 0.746 | 0.320 | 0.637 | 0.473 | 0.437 | 0.521 |
| X-Eval | 0.480 | 0.605 | 0.303 | 0.578 | - | - | - |
| Prometheus-13B | 0.163 | 0.434 | 0.173 | - | 0.007 | 0.129 | - |
| Auto-J-13B | 0.198 | 0.425 | 0.141 | 0.226 | 0.380 | 0.104 | 0.246 |
| TIGERScore-13B | 0.384 | 0.346 | 0.200 | 0.504 | 0.231 | 0.248 | 0.319 |
| InstructScore-7B | 0.258 | 0.241 | 0.247 | - | 0.298 | 0.219 | - |
| Themis-8B (ours) | 0.553 | 0.725 | 0.333 | 0.684 | 0.551 | 0.405 | 0.542 |
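
The numbers above are Spearman correlations between automatic ratings and human judgments. As a minimal sketch of how such a score is computed (with made-up ratings, not data from the paper):

```python
# Minimal sketch: Spearman correlation between model and human ratings.
# The scores below are invented for illustration; they are not paper data.
from scipy.stats import spearmanr

human_scores = [4, 2, 5, 3, 1, 4, 2]   # hypothetical human ratings
model_scores = [5, 2, 4, 3, 2, 4, 1]   # hypothetical model ratings

rho, p_value = spearmanr(human_scores, model_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```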

We further conduct more in-depth analyses, including generalization tests on unseen tasks such as instruction-following evaluation, as well as aspect-targeted perturbation tests; Themis again exhibits superior evaluation performance. For more experimental results and details, please refer to our paper.

Requirements and Usage

Please refer to our GitHub repo for more details.
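
For a rough idea of loading the original (non-quantized) weights with `transformers`, here is a hedged sketch; the Hub id `PKU-ONELab/Themis` and the prompt wording are assumptions, so check the GitHub repo for the exact details.

```python
# Hedged sketch: load the original weights with transformers.
# "PKU-ONELab/Themis" is an assumed Hub id; verify against the GitHub repo.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PKU-ONELab/Themis"  # assumed
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

inputs = tokenizer("(evaluation prompt here)", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```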

Citation

@article{hu2024themis,
  title={Themis: Towards Flexible and Interpretable NLG Evaluation},
  author={Hu, Xinyu and Lin, Li and Gao, Mingqi and Yin, Xunjian and Wan, Xiaojun},
  journal={arXiv preprint arXiv:2406.18365},
  year={2024}
}