Update README.md
## Performance

For our final model, we've used Stability AI Japan's [Japanese MT-Bench](https://github.com/Stability-AI/FastChat) as a more representative test of our model's capabilities. For [our JA MT-Bench testing](https://github.com/Stability-AI/FastChat/compare/jp-stable...AUGMXNT:FastChat:jp-stable) we use a Japanese system prompt ("あなたは役立つアシスタントです。" — "You are a helpful assistant.") as well as `--num-choices 4` to reduce sampling variability. Even so, we still observe regular swings of 0.5+ points (and sometimes more) between generations, as well as issues with default prompts and parameters when testing. We would therefore again urge caution in over-interpreting these scores: treat them as a probabilistic, directional indicator rather than a definitive score or ranking:

| Benchmark | Score |
| ----------- | ----- |
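
To illustrate why we report these numbers cautiously, here is a minimal sketch (with made-up, purely illustrative scores) of the run-to-run spread that motivates averaging over multiple generations (`--num-choices 4`) rather than trusting a single run:

```python
import statistics

# Hypothetical JA MT-Bench averages from four independent generation runs.
# These numbers are illustrative only; real scores vary per model and run.
runs = [5.2, 5.9, 5.5, 6.0]

mean = statistics.mean(runs)          # the score we would report
spread = max(runs) - min(runs)        # run-to-run swing

print(f"mean={mean:.2f}, spread={spread:.2f}")
```

With a spread close to a full point between otherwise identical runs, a 0.5-point gap between two models on a single generation is well within noise.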