Taekyoon committed on
Commit
076a14a
·
verified ·
1 Parent(s): e714105

Update Evaluation contents


Add eval scripts and modify xwinograd metric scores

Files changed (1)
  1. README.md +40 -12
README.md CHANGED
@@ -101,6 +101,34 @@ Training was done using [beomi/Gemma-EasyLM](https://github.com/Beomi/Gemma-Easy
101
 
102
  Model evaluation metrics and results.
103
 
104
  ### Benchmark Results
105
 
106
  | Category | Metric | Shots | 7b |
@@ -117,24 +145,24 @@ Model evaluation metrics and results.
117
  | | Hellaswag (acc-norm) | | 63.2 |
118
  | | Sentineg | | 97.98 |
119
  | | WiC | | 70.95 |
120
- | **JP Eval Harness (Prompt ver 0.3)** | JcommonsenseQA | 3-shot | 85.97 |
121
- | | JNLI | 3-shot | 39.11 |
122
- | | Marc_ja | 3-shot | 96.48 |
123
- | | JSquad | 2-shot | 70.69 |
124
- | | Jaqket | 1-shot | 81.53 |
125
- | | MGSM | 5-shot | 28.8 |
126
- | **XWinograd (5-shot)** | EN | | 90.71 |
127
- | | FR | | 80.72 |
128
- | | JP | | 84.15 |
129
- | | PT | | 80.99 |
130
- | | RU | | 76.51 |
131
- | | ZH | | 76.98 |
132
  | **XCOPA (5-shot)** | IT | | 72.8 |
133
  | | ID | | 76.4 |
134
  | | TH | | 60.2 |
135
  | | TR | | 65.6 |
136
  | | VI | | 77.2 |
137
  | | ZH | | 80.2 |
138
 
139
 
140
 
 
101
 
102
  Model evaluation metrics and results.
103
 
104
+ ### Evaluation Scripts
105
+
106
+ - For Knowledge / KoBest / XCOPA / XWinograd
107
+ - [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) v0.4.2
108
+ ```bash
109
+ !git clone https://github.com/EleutherAI/lm-evaluation-harness.git
110
+ !cd lm-evaluation-harness && pip install -r requirements.txt && pip install -e .
111
+
112
+ !lm_eval --model hf \
113
+ --model_args pretrained=beomi/gemma-mling-7b,dtype="float16" \
114
+ --tasks "haerae,kobest,kmmlu_direct,cmmlu,ceval-valid,mmlu,xwinograd,xcopa" \
115
+ --num_fewshot "0,5,5,5,5,5,0,5" \
116
+ --device cuda
117
+ ```
118
+ - For JP Eval Harness
119
+ - [Stability-AI/lm-evaluation-harness (`jp-stable` branch)](https://github.com/Stability-AI/lm-evaluation-harness/tree/jp-stable)
120
+ ```bash
121
+ !git clone -b jp-stable https://github.com/Stability-AI/lm-evaluation-harness.git
122
+ !cd lm-evaluation-harness && pip install -e ".[ja]"
123
+ !pip install 'fugashi[unidic]' && python -m unidic download
124
+
125
+ !cd lm-evaluation-harness && python main.py \
126
+ --model hf-causal \
127
+ --model_args pretrained=beomi/gemma-mling-7b,torch_dtype='auto' \
128
+ --tasks "jcommonsenseqa-1.1-0.3,jnli-1.3-0.3,marc_ja-1.1-0.3,jsquad-1.1-0.3,jaqket_v2-0.2-0.3,xlsum_ja,mgsm" \
129
+ --num_fewshot "3,3,3,2,1,1,5"
130
+ ```
131
+
132
  ### Benchmark Results
133
 
134
  | Category | Metric | Shots | 7b |
 
145
  | | Hellaswag (acc-norm) | | 63.2 |
146
  | | Sentineg | | 97.98 |
147
  | | WiC | | 70.95 |
148
  | **XCOPA (5-shot)** | IT | | 72.8 |
149
  | | ID | | 76.4 |
150
  | | TH | | 60.2 |
151
  | | TR | | 65.6 |
152
  | | VI | | 77.2 |
153
  | | ZH | | 80.2 |
154
+ | **JP Eval Harness (Prompt ver 0.3)** | JcommonsenseQA | 3-shot | 85.97 |
155
+ | | JNLI | 3-shot | 39.11 |
156
+ | | Marc_ja | 3-shot | 96.48 |
157
+ | | JSquad | 2-shot | 70.69 |
158
+ | | Jaqket | 1-shot | 81.53 |
159
+ | | MGSM | 5-shot | 28.8 |
160
+ | **XWinograd (0-shot)** | EN | | 89.03 |
161
+ | | FR | | 72.29 |
162
+ | | JP | | 82.69 |
163
+ | | PT | | 73.38 |
164
+ | | RU | | 68.57 |
165
+ | | ZH | | 79.17 |
166
 
167
 
168
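Both evaluation commands in this commit pass `--tasks` and `--num_fewshot` as comma-separated lists. The sketch below is a small sanity check, assuming the two lists are meant to be read positionally (the commit does not state how each harness parses them). The helper name `pair_tasks_with_shots` is hypothetical and is not part of either harness.

```python
def pair_tasks_with_shots(tasks: str, num_fewshot: str) -> dict[str, int]:
    """Pair each task name with its few-shot count, by position.

    Hypothetical helper for checking the README commands; it does not
    call or imitate any lm-evaluation-harness API.
    """
    task_names = [t.strip() for t in tasks.split(",") if t.strip()]
    shots = [int(s) for s in num_fewshot.split(",") if s.strip()]
    if len(task_names) != len(shots):
        raise ValueError(
            f"{len(task_names)} tasks but {len(shots)} few-shot values"
        )
    return dict(zip(task_names, shots))


# Argument strings copied from the two commands in the README diff.
harness_tasks = "haerae,kobest,kmmlu_direct,cmmlu,ceval-valid,mmlu,xwinograd,xcopa"
harness_shots = "0,5,5,5,5,5,0,5"

jp_tasks = (
    "jcommonsenseqa-1.1-0.3,jnli-1.3-0.3,marc_ja-1.1-0.3,"
    "jsquad-1.1-0.3,jaqket_v2-0.2-0.3,xlsum_ja,mgsm"
)
jp_shots = "3,3,3,2,1,1,5"

if __name__ == "__main__":
    for name, shots in pair_tasks_with_shots(harness_tasks, harness_shots).items():
        print(f"{name}: {shots}-shot")
    for name, shots in pair_tasks_with_shots(jp_tasks, jp_shots).items():
        print(f"{name}: {shots}-shot")
```

Under that positional reading, `xwinograd` pairs with 0 shots and `xcopa` with 5, which agrees with the "XWinograd (0-shot)" and "XCOPA (5-shot)" rows of the results table.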