Adding Evaluation Results (#4)

- Adding Evaluation Results (416127de28557d05ee2c694171f7b22bffe11e0d)

Co-authored-by: Open LLM Leaderboard PR Bot <[email protected]>

Files changed (1): README.md

@@ -115,4 +115,17 @@ state of the art, but rather further show that chat-like behaviors in LLMs can b
 *DLite is an experimental technology and is not designed for use in any environment without significant testing and safety consideration.
 Furthermore, the model can sometimes exhibit undesired behaviors. Some of these behaviors include, but are not limited to: factual
 inaccuracies, biases, offensive responses, toxicity, and hallucinations. Just as with any other LLM, we advise users of this technology
-to exercise good judgment when applying this technology.*

+to exercise good judgment when applying this technology.*
+# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
+Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_aisquared__dlite-v2-774m)
+| Metric              | Value |
+|---------------------|-------|
+| Avg.                | 29.01 |
+| ARC (25-shot)       | 30.12 |
+| HellaSwag (10-shot) | 47.68 |
+| MMLU (5-shot)       | 25.37 |
+| TruthfulQA (0-shot) | 40.0  |
+| Winogrande (5-shot) | 53.99 |
+| GSM8K (5-shot)      | 0.0   |
+| DROP (3-shot)       | 5.93  |
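
The `Avg.` row is consistent with an unweighted mean of the seven benchmark scores listed in the table. The sketch below checks that arithmetic. It assumes equal weighting, which the PR itself does not state, and it uses the rounded values as displayed. The names `scores` and `mean_score` are illustrative only and are not part of any leaderboard tooling.

```python
# Benchmark scores copied from the table above (already rounded as displayed).
scores = {
    "ARC (25-shot)": 30.12,
    "HellaSwag (10-shot)": 47.68,
    "MMLU (5-shot)": 25.37,
    "TruthfulQA (0-shot)": 40.0,
    "Winogrande (5-shot)": 53.99,
    "GSM8K (5-shot)": 0.0,
    "DROP (3-shot)": 5.93,
}


def mean_score(values):
    """Return the unweighted arithmetic mean of the given scores."""
    values = list(values)
    return sum(values) / len(values)


# Assumption: every benchmark carries equal weight in the average.
average = round(mean_score(scores.values()), 2)
print(average)  # prints 29.01
```

The result matches the reported `Avg.` of 29.01. Note that the GSM8K score of 0.0 and the DROP score of 5.93 pull the mean well below the scores on the other five benchmarks.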