Updated by Author
Browse files
README.md
CHANGED
@@ -210,6 +210,81 @@ The training utilized **Google Colab GPUs, which provided the necessary computat
|
|
210 |
The training process was carried out using **PyTorch** as the primary framework, leveraging libraries such as **Hugging Face Transformers** for model implementation and training.
|
211 |
|
212 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
213 |
|
214 |
## Glossary [optional]
|
215 |
|
|
|
210 |
The training process was carried out using **PyTorch** as the primary framework, leveraging libraries such as **Hugging Face Transformers** for model implementation and training.
|
211 |
|
212 |
|
213 |
+
## ROUGE Evaluation
|
214 |
+
|
215 |
+
To evaluate the quality of the generated summaries, we employed the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scoring system. This method compares the generated summaries against reference summaries to quantify their similarity and overall quality.
|
216 |
+
|
217 |
+
### Evaluation Code
|
218 |
+
|
219 |
+
We used the `rouge_score` library to compute the ROUGE scores for our summaries. Below is the implementation:
|
220 |
+
|
221 |
+
```python
|
222 |
+
from rouge_score import rouge_scorer
|
223 |
+
|
224 |
+
reference_summaries = [
|
225 |
+
"AI systems in healthcare improve diagnostics and personalize treatments.",
|
226 |
+
"Algorithms analyze market trends and help in fraud detection.",
|
227 |
+
]
|
228 |
+
|
229 |
+
generated_summaries = [
|
230 |
+
"In healthcare, AI systems are used for predictive analytics and improving diagnostics.",
|
231 |
+
"In finance, algorithms analyze market trends and assist in fraud detection."
|
232 |
+
]
|
233 |
+
|
234 |
+
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
|
235 |
+
|
236 |
+
for reference, generated in zip(reference_summaries, generated_summaries):
|
237 |
+
scores = scorer.score(reference, generated)
|
238 |
+
print(f"Reference: {reference}")
|
239 |
+
print(f"Generated: {generated}")
|
240 |
+
print(f"ROUGE Scores: {scores}\n")
|
241 |
+
```
|
242 |
+
|
243 |
+
### ROUGE Scores
|
244 |
+
|
245 |
+
#### Summary 1
|
246 |
+
- **Reference**: "AI systems in healthcare improve diagnostics and personalize treatments."
|
247 |
+
- **Generated**: "In healthcare, AI systems are used for predictive analytics and improving diagnostics."
|
248 |
+
|
249 |
+
**ROUGE-1**:
|
250 |
+
- Precision: 72.73%
|
251 |
+
- Recall: 88.89%
|
252 |
+
- F1-Score: 80.00%
|
253 |
+
|
254 |
+
This score indicates a strong overlap, showing that the generated summary captures a significant amount of relevant information.
|
255 |
+
|
256 |
+
**ROUGE-2**:
|
257 |
+
- Precision: 60.00%
|
258 |
+
- Recall: 75.00%
|
259 |
+
- F1-Score: 66.67%
|
260 |
+
|
261 |
+
This indicates a good capture of bigrams, reflecting the generated summary's effectiveness in retaining key phrases.
|
262 |
+
|
263 |
+
**ROUGE-L**:
|
264 |
+
- Precision: 72.73%
|
265 |
+
- Recall: 88.89%
|
266 |
+
- F1-Score: 80.00%
|
267 |
+
|
268 |
+
This score confirms that the sequence of words in the generated summary closely follows that of the reference.
|
269 |
+
|
270 |
+
#### Summary 2
|
271 |
+
- **Reference**: "Algorithms analyze market trends and help in fraud detection."
|
272 |
+
- **Generated**: "In finance, algorithms analyze market trends and assist in fraud detection."
|
273 |
+
|
274 |
+
**ROUGE-1**:
|
275 |
+
- Precision: 72.73%
|
276 |
+
- Recall: 88.89%
|
277 |
+
- F1-Score: 80.00%
|
278 |
+
|
279 |
+
**ROUGE-2**:
|
280 |
+
- Precision: 60.00%
|
281 |
+
- Recall: 75.00%
|
282 |
+
- F1-Score: 66.67%
|
283 |
+
|
284 |
+
**ROUGE-L**:
|
285 |
+
- Precision: 72.73%
|
286 |
+
- Recall: 88.89%
|
287 |
+
- F1-Score: 80.00%
|
288 |
|
289 |
## Glossary [optional]
|
290 |
|