---
language:
- en
- de
license: apache-2.0
tags:
- text-generation-inference
- transformers
- unsloth
- llama
- trl
- orpo
base_model: cstr/phi-3-orpo-v8_16
---

# Model details

These are q4 GGUF quants of a quick experiment on a llamafied phi-3, trained for only 1000 ORPO steps on an AzureML-translated, binarized German Orca dataset (johannhartmann/mistralorpo), using the original phi-3 prompt template. The immediate result is not really good, but also not bad enough to discourage further experiments.
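Since the original phi-3 prompt template was kept, prompts should be assembled in that format. A minimal sketch, assuming the commonly documented phi-3 instruct tokens (`<|system|>`, `<|user|>`, `<|end|>`, `<|assistant|>`); for production use, prefer the tokenizer's built-in chat template:

```python
from typing import Optional


def build_phi3_prompt(user_message: str, system_message: Optional[str] = None) -> str:
    """Assemble a single-turn prompt in the phi-3 instruct format.

    The special tokens used here follow the phi-3 template as commonly
    documented; this is a sketch, not the canonical tokenizer chat template.
    """
    parts = []
    if system_message is not None:
        parts.append(f"<|system|>\n{system_message}<|end|>\n")
    # The trailing <|assistant|> tag cues the model to generate the reply.
    parts.append(f"<|user|>\n{user_message}<|end|>\n<|assistant|>\n")
    return "".join(parts)


prompt = build_phi3_prompt("Wie heisst die Hauptstadt von Deutschland?")
print(prompt)
```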

# Benchmark results

This was an experiment on a German dataset snippet which, as expected, worsened results on English benchmarks:

| Metric                          |Value|
|---------------------------------|----:|
|Avg.                             |64.40|
|AI2 Reasoning Challenge (25-Shot)|60.41|
|HellaSwag (10-Shot)              |78.37|
|MMLU (5-Shot)                    |65.26|
|TruthfulQA (0-shot)              |49.76|
|Winogrande (5-shot)              |70.24|
|GSM8k (5-shot)                   |62.32|

On the German EQ-Bench (v2_de) the model scores 51.82: insignificantly above the 51.41 of the original llamafied model, but significantly better than the intermediate cstr/phi-3-orpo-v8_16 checkpoint, which achieved 46.38 after the initial 150 test steps. However, still only 164/171 answers were parsed correctly.

Note: parsing correctness can be improved, among other things, with only a few SFT steps, as shown with cas/phi3-mini-4k-llamafied-sft-v3 (170/171 parsed correctly, but with a v2_de score of only 39.46; that run was also an experiment in changing the prompt template).
All of this was done quickly with bnb and q4 quants only, which might, in theory, significantly affect especially such small dense models.
But it at least served its purpose for both proof-of-concept experiments. Further improving the results would probably be possible, but would take some time and compute.

# Training setup

This llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.
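For reference, the ORPO objective optimized during those steps augments the standard SFT loss with an odds-ratio preference term. A sketch following the ORPO formulation (notation ours; $\lambda$ is the weighting hyperparameter, $y_w$ and $y_l$ the chosen and rejected completions):

$$\mathcal{L}_{\mathrm{ORPO}} = \mathbb{E}_{(x,\, y_w,\, y_l)}\left[\,\mathcal{L}_{\mathrm{SFT}} + \lambda \cdot \mathcal{L}_{\mathrm{OR}}\,\right]$$

$$\mathcal{L}_{\mathrm{OR}} = -\log \sigma\!\left(\log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)}\right), \qquad \mathrm{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}$$

Unlike DPO, this requires no separate reference model, which is part of why a short 1000-step run like this one is cheap to try.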