liang.zhao
committed on
Commit ea950d1
Parent(s): a9cbfba
update model and config
README.md CHANGED
@@ -34,10 +34,11 @@ We evaluate our models on [RewardBench](https://huggingface.co/spaces/allenai/re
 
 As of September 2024, Skywork-Critic-Llama3.1-70B **ranks first** on RewardBench for generative models across all sizes, while Skywork-Critic-Llama3.1-8B tops the list for generative models under 10B parameters. (Note: An asterisk (*) indicates an open-source model.)
 
+
 | Model | Chat | Chat Hard | Safety | Reasoning | Overall Score |
 | ------------------------------- | :---: | :-------: | :----: | :-------: | :---: |
 | **Skywork-Critic-Llama3.1-70B** * | **96.9** | **88.4** | **93.2** | **95.4** | **93.4** |
-| Salesforce/SFR-LLaMa-3.1-70B-Judge-r | 96.9 | 84.8 |
+| Salesforce/SFR-LLaMa-3.1-70B-Judge-r | 96.9 | 84.8 | 91.6 | 97.6 | 92.7 |
 | Salesforce/SFR-nemo-12B-Judge-r | 97.2 | 82.2 | 86.5 | 95.1 | 90.3 |
 | **Skywork-Critic-Llama3.1-8B** * | **93.6** | **81.4** | **91.1** | **89.8** | **89.0** |
 | Salesforce/SFR-LLaMa-3.1-8B-Judge-r | 95.5 | 77.7 | 86.2 | 95.1 | 88.7 |
@@ -51,10 +52,11 @@ As of September 2024, Skywork-Critic-Llama3.1-70B **ranks first** on RewardBench
 | NCSOFT/Llama-3-OffsetBias-8B * | 92.5 | 80.3 | 86.8 | 76.4 | 84.0 |
 
 
+
 # Demo Code
 Below is an example of obtaining the critic of two conversations.
 
-```
+```python
 import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer
 
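The README's demo block is cut off by the hunk boundary after its imports, but the idea it describes, "obtaining the critic of two conversations", is a pairwise-judgment prompt fed to a generative model. A minimal sketch of assembling such a prompt is below; the template text and the `build_critic_prompt` helper are illustrative assumptions, not the repository's actual template.

```python
# Hypothetical sketch: format two candidate responses to the same
# instruction into one pairwise-judgment prompt, the kind of input a
# generative critic model (e.g. Skywork-Critic-Llama3.1-8B) consumes.
# The template wording below is an assumption, not the official one.

PROMPT_TEMPLATE = """Please act as an impartial judge and compare the two responses below.

[Instruction]
{instruction}

[Response A]
{response_a}

[Response B]
{response_b}

Output "[[A]]" if Response A is better, or "[[B]]" if Response B is better."""


def build_critic_prompt(instruction: str, response_a: str, response_b: str) -> str:
    """Return the filled-in pairwise comparison prompt."""
    return PROMPT_TEMPLATE.format(
        instruction=instruction,
        response_a=response_a,
        response_b=response_b,
    )


if __name__ == "__main__":
    prompt = build_critic_prompt(
        "Name the capital of France.",
        "The capital of France is Paris.",
        "France's capital is Lyon.",
    )
    print(prompt)
```

In the README's truncated demo, a prompt like this would then go through the tokenizer and `model.generate` from the `transformers` objects imported in the diff above.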