gonzalo-santamaria-iic
committed on
Update README.md
README.md CHANGED
@@ -235,39 +235,59 @@ As can be seen in the time used, in eight and a half hours we have managed to im
## Evaluation

-To evaluate, we use the following datasets:
-
-1. [IIC/AQuAS](https://huggingface.co/datasets/IIC/AQuAS).
-2. [IIC/RagQuAS](https://huggingface.co/datasets/IIC/RagQuAS).
-3. Private datasets
-
### Testing Data, Factors & Metrics

#### Testing Data

-<!-- This should link to a Dataset Card if possible. -->
+To assess the performance of Large Language Models (LLMs), we have developed and utilized several high-quality corpora tailored to specific evaluation needs:
+
+1. [IIC/AQuAS](https://huggingface.co/datasets/IIC/AQuAS): A manually curated corpus created by two computational linguists to evaluate language models on Abstractive Question Answering in Spanish. It includes examples from domains such as finance, insurance, healthcare, law, and music.
+
+2. [IIC/RagQuAS](https://huggingface.co/datasets/IIC/RagQuAS): Another manually curated corpus, developed by the same linguists to evaluate full RAG systems and language models on Abstractive Question Answering in Spanish. It spans a wide range of domains, including hobbies, linguistics, pets, health, astronomy, customer service, cars, daily life, documentation, energy, skiing, fraud, gastronomy, languages, games, nail care, music, skating, first aid, recipes, recycling, complaints, insurance, tennis, transportation, tourism, veterinary, travel, and yoga.
+
+3. **CAM:** Designed for all the CAM tasks, this corpus consists of frequently asked questions (FAQs) sourced from consumer-related topics on the websites of the Comunidad de Madrid. The questions are categorized into three levels of degradation (E1, E2, and E3), intended to measure the LLMs' ability to understand and respond effectively to poorly formulated queries caused by spelling errors, varying degrees of colloquialism, and similar issues. This task also falls under the Abstractive Question Answering category.
+
+4. **Shops:** A multi-turn conversational corpus centered on the policies of various clothing companies. The task involves Multi-turn Abstractive Question Answering.
+
+5. **Insurance:** Another multi-turn conversational corpus, this time focused on the policies of various insurance companies. It also involves Multi-turn Abstractive Question Answering.
+
+Each corpus includes the following columns: question, answer, and context(s) containing the relevant information from which the model can derive the answer. In the multi-turn tasks, a chat history is also provided.
+
+The scoring process measures the similarity between the original (reference) answer and the one generated by the model. All corpora are private except AQuAS and RagQuAS, which are publicly available and can serve as examples of the structure and content of the others.
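As a quick way to check that structure, the sketch below loads the two public corpora with the Hugging Face `datasets` library and prints their columns and a sample row; the split and field names are assumptions to verify against the dataset cards.

```python
# Minimal sketch: inspect the two public evaluation corpora.
# Assumes `datasets` is installed; split and column names should be
# verified against the dataset cards (IIC/AQuAS, IIC/RagQuAS).
from datasets import load_dataset

for name in ("IIC/AQuAS", "IIC/RagQuAS"):
    ds = load_dataset(name)
    split = next(iter(ds))                # first available split
    print(name, ds[split].column_names)   # question / answer / context fields
    print(ds[split][0])                   # one full example
```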
+#### Factors
+
+These evaluations are narrow in scope and do not cover all the general scenarios the model could be exposed to, since they all focus on RAG tasks in very specific domains.
+
+#### Metrics
+
+The evaluation uses [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) to score each generated answer against the reference answer.
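The judging prompt and scoring scale behind the numbers reported below are not published in this card, so the following is only a hedged LLM-as-a-judge sketch with `transformers`, assuming a hypothetical 0-100 similarity score; it is not the authors' exact setup.

```python
# Hedged sketch of scoring with Llama-3.1-8B-Instruct as a judge.
# The prompt and the 0-100 scale are assumptions for illustration only;
# the gated model requires accepting its license on the Hugging Face Hub.
from transformers import pipeline

judge = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",
    device_map="auto",
)

def score_answer(question: str, reference: str, candidate: str) -> str:
    messages = [
        {"role": "system", "content": (
            "You are an evaluator. Rate how close in meaning the candidate "
            "answer is to the reference answer, from 0 to 100. "
            "Reply with the number only.")},
        {"role": "user", "content": (
            f"Question: {question}\n"
            f"Reference answer: {reference}\n"
            f"Candidate answer: {candidate}")},
    ]
    out = judge(messages, max_new_tokens=8, do_sample=False)
    # With chat-style input, the pipeline returns the whole conversation;
    # the last message is the judge's reply (the numeric score as text).
    return out[0]["generated_text"][-1]["content"]
```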
### Results

+| **Model** | **Average** | **AQuAS** | **RagQuAS** | **CAM** | **CAM_E1** | **CAM_E2** | **CAM_E3** | **Shops** | **Insurance** |
+|----------------------------|-------------|-----------|-------------|----------|------------|------------|------------|-----------|---------------|
+| **RigoChat-7b-v2** | **79.01** | 82.06 | 77.91 | **78.91**| **79.27** | 76.55 | 75.27 | **81.05** | **81.04** |
+| GPT-4o | 78.26 | **85.23** | 77.91 | 78.00 | 74.91 | 73.45 | **77.09** | 78.60 | 80.89 |
+| stablelm-2-12b-chat | 77.74 | 78.88 | 78.21 | 77.82 | 78.73 | **77.27** | 74.73 | 77.03 | 79.26 |
+| Mistral-Small-Instruct-2409| 77.29 | 80.56 | 78.81 | 77.82 | 75.82 | 73.27 | 73.45 | 78.25 | 80.36 |
+| Qwen2.5-7B-Instruct | 77.17 | 80.93 | 77.41 | 77.82 | 75.09 | 75.45 | 72.91 | 78.08 | 79.67 |
+| Meta-Llama-3.1-8B-Instruct | 76.55 | 81.87 | 80.50 | 72.91 | 73.45 | 75.45 | 71.64 | 77.73 | 78.88 |
+| GPT-4o-mini | 76.48 | 82.80 | 75.82 | 76.36 | 74.36 | 72.36 | 71.82 | 78.25 | 80.08 |
+| Phi-3.5-mini-instruct | 76.38 | 81.68 | **81.09** | 75.82 | 74.73 | 71.45 | 70.36 | 77.43 | 78.45 |
+| gemma-2-9b-it | 75.80 | 82.80 | 78.11 | 72.91 | 73.45 | 71.09 | 71.27 | 77.08 | 79.72 |
+| Ministral-8B-Instruct-2410 | 75.19 | 79.63 | 77.31 | 76.00 | 73.45 | 72.36 | 70.18 | 76.44 | 76.14 |
+| GPT-3.5-turbo-0125 | 74.78 | 80.93 | 73.53 | 76.73 | 72.55 | 72.18 | 69.09 | 75.63 | 77.64 |
+| Llama-2-7b-chat-hf | 71.18 | 67.10 | 77.31 | 71.45 | 70.36 | 70.73 | 68.55 | 72.07 | 71.90 |
+| granite-3.0-8b-instruct | 71.08 | 73.08 | 72.44 | 72.36 | 71.82 | 69.09 | 66.18 | 69.97 | 73.73 |
+| RigoChat-7b-v1 | 62.13 | 72.34 | 67.46 | 61.27 | 59.45 | 57.45 | 57.64 | 62.10 | 59.34 |
+| salamandra-7b-instruct | 61.96 | 63.74 | 60.70 | 64.91 | 63.27 | 62.36 | 60.55 | 59.94 | 60.23 |
+#### Summary
+
+RigoChat-7b-v2 significantly improves over Qwen-2.5 on the tasks for which it was (indirectly) designed, and it outperforms most state-of-the-art models on these tasks, demonstrating that LLMs can be aligned to specific use cases with few resources.

## Environmental Impact