gonzalo-santamaria-iic committed · verified
Commit 2c20427 · 1 Parent(s): fbe429f

Update README.md

Files changed (1):
  README.md +36 -16

README.md CHANGED
@@ -235,39 +235,59 @@ As can be seen in the time used, in eight and a half hours we have managed to im

  ## Evaluation

- To evaluate, we use the following datasets:
-
- 1. [IIC/AQuAS](https://huggingface.co/datasets/IIC/AQuAS).
- 2. [IIC/RagQuAS](https://huggingface.co/datasets/IIC/RagQuAS).
- 3. Private datasets.
-
  ### Testing Data, Factors & Metrics

  #### Testing Data

- <!-- This should link to a Dataset Card if possible. -->
-
- [More Information Needed]
-
- #### Factors
-
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
- [More Information Needed]
-
- #### Metrics
-
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
- [More Information Needed]
+ To assess the performance of Large Language Models (LLMs), we have developed and utilized several high-quality corpora tailored to specific evaluation needs:
+
+ 1. [IIC/AQuAS](https://huggingface.co/datasets/IIC/AQuAS): A manually curated corpus created by two computational linguists to evaluate language models in the task of Abstractive Question Answering in Spanish. It includes examples from domains such as finance, insurance, healthcare, law, and music.
+
+ 2. [IIC/RagQuAS](https://huggingface.co/datasets/IIC/RagQuAS): Another manually curated corpus developed by the same linguists to evaluate full RAG systems and language models in Abstractive Question Answering tasks in Spanish. This corpus spans a wide range of domains, including hobbies, linguistics, pets, health, astronomy, customer service, cars, daily life, documentation, energy, skiing, fraud, gastronomy, languages, games, nail care, music, skating, first aid, recipes, recycling, complaints, insurance, tennis, transportation, tourism, veterinary care, travel, and yoga.
+
+ 3. **CAM:** Designed for the CAM tasks, this corpus consists of frequently asked questions (FAQs) sourced from consumer-related topics on the websites of the Comunidad de Madrid. The questions are categorized into three levels of degradation (E1, E2, and E3), intended to measure the LLMs' ability to understand and effectively respond to poorly formulated queries caused by spelling errors, varying levels of colloquialism, and similar issues. This task also falls under the Abstractive Question Answering category.
+
+ 4. **Shops:** A multi-turn conversational corpus centered on policies from various clothing companies. The task involves Multi-turn Abstractive Question Answering.
+
+ 5. **Insurance:** Another multi-turn conversational corpus, this one focused on policies from various insurance companies. It also involves Multi-turn Abstractive Question Answering.
+
+ Each corpus includes the following columns: the question, the answer, and the context(s) containing the relevant information from which the model can derive the answer. In multi-turn tasks, a chat history is also provided.
+
+ The scoring process for LLMs involves measuring the similarity between the original answer and the one generated by the model. All corpora are private except for AQuAS and RagQuAS, which are publicly available and can serve as examples of the structure and content of the others.
+
+ #### Factors
+
+ These evaluations are very specific and do not cover all the general scenarios the model could be exposed to, since they all focus on solving RAG tasks in very specific domains.
+
+ #### Metrics
+
+ The evaluation uses [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) as a judge to score each generated answer against the reference answer.

  ### Results

- [More Information Needed]
+ | **Model**                  | **Average** | **AQuAS** | **RagQuAS** | **CAM**  | **CAM_E1** | **CAM_E2** | **CAM_E3** | **Shops** | **Insurance** |
+ |----------------------------|-------------|-----------|-------------|----------|------------|------------|------------|-----------|---------------|
+ | **RigoChat-7b-v2**         | **79.01**   | 82.06     | 77.91       | **78.91**| **79.27**  | 76.55      | 75.27      | **81.05** | **81.04**     |
+ | GPT-4o                     | 78.26       | **85.23** | 77.91       | 78.00    | 74.91      | 73.45      | **77.09**  | 78.60     | 80.89         |
+ | stablelm-2-12b-chat        | 77.74       | 78.88     | 78.21       | 77.82    | 78.73      | **77.27**  | 74.73      | 77.03     | 79.26         |
+ | Mistral-Small-Instruct-2409| 77.29       | 80.56     | 78.81       | 77.82    | 75.82      | 73.27      | 73.45      | 78.25     | 80.36         |
+ | Qwen2.5-7B-Instruct        | 77.17       | 80.93     | 77.41       | 77.82    | 75.09      | 75.45      | 72.91      | 78.08     | 79.67         |
+ | Meta-Llama-3.1-8B-Instruct | 76.55       | 81.87     | 80.50       | 72.91    | 73.45      | 75.45      | 71.64      | 77.73     | 78.88         |
+ | GPT-4o-mini                | 76.48       | 82.80     | 75.82       | 76.36    | 74.36      | 72.36      | 71.82      | 78.25     | 80.08         |
+ | Phi-3.5-mini-instruct      | 76.38       | 81.68     | **81.09**   | 75.82    | 74.73      | 71.45      | 70.36      | 77.43     | 78.45         |
+ | gemma-2-9b-it              | 75.80       | 82.80     | 78.11       | 72.91    | 73.45      | 71.09      | 71.27      | 77.08     | 79.72         |
+ | Ministral-8B-Instruct-2410 | 75.19       | 79.63     | 77.31       | 76.00    | 73.45      | 72.36      | 70.18      | 76.44     | 76.14         |
+ | GPT-3.5-turbo-0125         | 74.78       | 80.93     | 73.53       | 76.73    | 72.55      | 72.18      | 69.09      | 75.63     | 77.64         |
+ | Llama-2-7b-chat-hf         | 71.18       | 67.10     | 77.31       | 71.45    | 70.36      | 70.73      | 68.55      | 72.07     | 71.90         |
+ | granite-3.0-8b-instruct    | 71.08       | 73.08     | 72.44       | 72.36    | 71.82      | 69.09      | 66.18      | 69.97     | 73.73         |
+ | RigoChat-7b-v1             | 62.13       | 72.34     | 67.46       | 61.27    | 59.45      | 57.45      | 57.64      | 62.10     | 59.34         |
+ | salamandra-7b-instruct     | 61.96       | 63.74     | 60.70       | 64.91    | 63.27      | 62.36      | 60.55      | 59.94     | 60.23         |

- #### Summary
+ #### Summary
+
+ RigoChat-7b-v2 significantly improves on Qwen-2.5 in the tasks it was (indirectly) designed for. It also outperforms most state-of-the-art models on these tasks, demonstrating that LLMs can be aligned to specific use cases with few resources.

  ## Environmental Impact

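
As a concrete illustration of the corpus structure described in the updated Testing Data section above, the sketch below loads AQuAS (one of the two public corpora) and builds a single-turn RAG-style prompt for the model. The field names `question` and `context`, the split handling, the Spanish system prompt, and the repo id `IIC/RigoChat-7b-v2` are assumptions made for illustration, not details taken from this commit; check the dataset and model cards for the exact schema.

```python
# Illustrative sketch only: the dataset field names, split handling, system prompt, and
# the model repo id are assumptions for this example, not details from the commit.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "IIC/RigoChat-7b-v2"  # assumed Hugging Face repo id for this model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# AQuAS is one of the two public evaluation corpora mentioned in the card.
dataset = load_dataset("IIC/AQuAS")
split = next(iter(dataset.values()))  # take the first available split, whatever its name
example = split[0]

# Single-turn RAG-style prompt: the context(s) plus the user question.
messages = [
    {"role": "system", "content": "Responde en español usando únicamente el contexto proporcionado."},
    {"role": "user", "content": f"Contexto:\n{example['context']}\n\nPregunta: {example['question']}"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

For the multi-turn corpora (Shops and Insurance), the provided chat history would simply be prepended to `messages` as alternating user/assistant turns before the final question.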
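
The updated Metrics section only states that Llama-3.1-8B-Instruct scores the answers; the judging prompt, scale, and aggregation behind the results table are not given here. The following is a minimal, generic LLM-as-judge sketch of that idea (comparing a generated answer against the reference), not the authors' actual evaluation code; the rubric and the 0-100 scale are assumptions.

```python
# Minimal, generic LLM-as-judge sketch: the rubric, the 0-100 scale, and the prompt below
# are assumptions for illustration; they are not the evaluation prompt used for the table.
import re

from transformers import pipeline

judge = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",  # gated repo; may require accepting its license
    device_map="auto",
)

def judge_answer(question: str, reference: str, candidate: str) -> float:
    """Ask the judge model for a 0-100 score and parse the first number in its reply."""
    messages = [
        {"role": "system", "content": "You are an impartial evaluator of Spanish question answering."},
        {
            "role": "user",
            "content": (
                f"Question: {question}\n"
                f"Reference answer: {reference}\n"
                f"Candidate answer: {candidate}\n"
                "Rate how well the candidate matches the reference on a 0-100 scale. "
                "Reply with the number only."
            ),
        },
    ]
    out = judge(messages, max_new_tokens=8, do_sample=False)[0]["generated_text"]
    # With chat-style input, the pipeline returns the whole conversation; the reply is the last message.
    reply = out[-1]["content"] if isinstance(out, list) else out
    match = re.search(r"\d+(\.\d+)?", reply)
    return float(match.group()) if match else 0.0
```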