small tweaks
README.md
CHANGED
@@ -72,7 +72,7 @@ print(decoded_output)
  # Model Details

  ## Model Description
- The `t5-small-spoken-typo` model is specifically designed to tackle the challenges of text correction within user-generated content, particularly in short, conversation-like sentences. It corrects for missing spaces, removes unnecessary punctuation,
  It has been trained on
  - [BNC 2014 Spoken](http://cass.lancs.ac.uk/cass-projects/spoken-bnc2014/)
  - [Daily Dialog](https://huggingface.co/datasets/daily_dialog)
@@ -90,7 +90,7 @@ Then injecting typos from a range of places
  - Find these resources [here](https://www.dcs.bbk.ac.uk/~ROGER/corpora.html).
  - **TOEFL Spell** A dataset of Spelling Annotations for English language learner essays written for TOEFL exams.
  - Find this [here](https://github.com/EducationalTestingService/TOEFL-Spell/tree/master)
- - **Homonyms** We occasionally replace words in BNC and Daily Dialog with homonyms from this list: https://github.com/pimentel/homophones
  - **Our own typo augment function** This generates errors that are likely on an English QWERTY layout, as well as substitutions, deletions, etc.

  We then add compressed versions of the sentences (i.e. with spaces removed), both correct and typo-injected, to our dataset. (This addresses the problem that some people write without spaces.)
@@ -136,6 +136,14 @@ Users are encouraged to critically assess the model's output, especially when us

  # Training Details

  ## Training Data
  The model was trained on a curated subset of the DailyDialog and BNC corpora (2014 spoken), focusing on sentences 2-5 words in length, with manual introduction of typos and removal of spaces for robustness in text correction tasks. You can see the pre-processing code [here](https://github.com/willwade/dailyDialogCorrections/tree/main)

@@ -264,9 +272,6 @@ We hope to build on this by further fine-tuning in time on real corpous of indvi
  ## Model Architecture and Objective
  The model follows the T5 architecture, fine-tuned for the specific task of text correction with a focus on typo correction and space insertion.

- ## Compute Infrastructure
- - **Hardware**: T4 GPU (Google Colab)
- - **Software**: PyTorch 1.8.1 with Transformers 4.8.2

  # Citation

  # Model Details

  ## Model Description
+ The `t5-small-spoken-typo` model is specifically designed to tackle the challenges of text correction within user-generated content, particularly in short, conversation-like sentences. It corrects for missing spaces, removes unnecessary punctuation, corrects typos, and normalizes text by replacing informal contractions and abbreviations with their full forms.
  It has been trained on
  - [BNC 2014 Spoken](http://cass.lancs.ac.uk/cass-projects/spoken-bnc2014/)
  - [Daily Dialog](https://huggingface.co/datasets/daily_dialog)

  - Find these resources [here](https://www.dcs.bbk.ac.uk/~ROGER/corpora.html).
  - **TOEFL Spell** A dataset of Spelling Annotations for English language learner essays written for TOEFL exams.
  - Find this [here](https://github.com/EducationalTestingService/TOEFL-Spell/tree/master)
+ - **Homonyms** We occasionally replace words in BNC and Daily Dialog with homonyms from this list: https://github.com/pimentel/homophones
  - **Our own typo augment function** This generates errors that are likely on an English QWERTY layout, as well as substitutions, deletions, etc.

  We then add compressed versions of the sentences (i.e. with spaces removed), both correct and typo-injected, to our dataset. (This addresses the problem that some people write without spaces.)
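
The augmentation steps listed in this hunk (homonym swaps, keyboard-style typos, and space removal) can be illustrated with a minimal sketch. It is not the project's actual augment function: the adjacency map, the homophone table, the rates, and every function name below are hypothetical stand-ins, and pairing each noisy variant with the clean sentence as its target is an assumption about how the dataset is assembled.

```python
import random

# Hypothetical, partial QWERTY adjacency map (the project's real map is not shown).
QWERTY_NEIGHBOURS = {
    "a": "qwsz", "e": "wsdr", "i": "ujko", "o": "iklp",
    "s": "awedxz", "t": "rfgy", "n": "bhjm",
}

# Hypothetical homophone groups; the README points to a much larger list.
HOMOPHONES = {"there": ["their", "they're"], "to": ["too", "two"], "here": ["hear"]}


def inject_typo(word, rng):
    """Apply one random edit: neighbour-key substitution, deletion, or transposition."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    op = rng.choice(["substitute", "delete", "transpose"])
    if op == "substitute":
        neighbours = QWERTY_NEIGHBOURS.get(word[i].lower())
        if not neighbours:
            return word  # no known neighbours for this key, leave the word alone
        return word[:i] + rng.choice(neighbours) + word[i + 1:]
    if op == "delete":
        return word[:i] + word[i + 1:]
    j = min(i, len(word) - 2)  # keep j + 1 inside the word
    return word[:j] + word[j + 1] + word[j] + word[j + 2:]


def swap_homophone(word, rng):
    """Replace a word with one of its homophones, if any are known."""
    options = HOMOPHONES.get(word.lower())
    return rng.choice(options) if options else word


def corrupt(sentence, rng, typo_rate=0.3, homophone_rate=0.1):
    """Corrupt each word independently with a homophone swap or a typo."""
    out = []
    for word in sentence.split():
        roll = rng.random()
        if roll < homophone_rate:
            word = swap_homophone(word, rng)
        elif roll < homophone_rate + typo_rate:
            word = inject_typo(word, rng)
        out.append(word)
    return " ".join(out)


def compress(sentence):
    """Remove all whitespace, mimicking users who type without spaces."""
    return "".join(sentence.split())


def make_pairs(sentence, rng):
    """Return (noisy_input, clean_target) pairs for one clean sentence."""
    typod = corrupt(sentence, rng)
    return [
        (typod, sentence),
        (compress(sentence), sentence),
        (compress(typod), sentence),
    ]


rng = random.Random(0)
for noisy, clean in make_pairs("see you there tomorrow", rng):
    print(f"{noisy!r} -> {clean!r}")
```

Seeding `random.Random` makes the generated pairs reproducible, which helps when comparing training runs on the same augmented data.
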

  # Training Details

+ ## System
+
+ - System configuration: Linux-6.2.0-37-generic-x86_64-with-glibc2.35
+ - Runtime: Python 3.10.12
+ - Hardware: NVIDIA A10 GPU with 24GB GDDR6 dedicated memory
+ - CPU Cores: 30 logical cores @ 2.59GHz
+ - Disk Space: Approximately 1.3TB
+
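
The "System configuration" string added in this hunk has the shape of what Python's `platform.platform()` returns. Whether the author collected it that way is an assumption. A minimal standard-library sketch that gathers comparable fields follows; the `describe_system` name is hypothetical, and GPU details are left out because they need a vendor tool or a deep learning framework.

```python
import os
import platform
import shutil


def describe_system(path="/"):
    """Collect a few host details using only the standard library."""
    total_bytes = shutil.disk_usage(path).total
    return {
        "System configuration": platform.platform(),
        "Runtime": f"Python {platform.python_version()}",
        "CPU Cores": f"{os.cpu_count()} logical cores",
        "Disk Space": f"Approximately {total_bytes / 1e12:.1f}TB",
    }


for key, value in describe_system().items():
    print(f"- {key}: {value}")
```
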
  ## Training Data
  The model was trained on a curated subset of the DailyDialog and BNC corpora (2014 spoken), focusing on sentences 2-5 words in length, with manual introduction of typos and removal of spaces for robustness in text correction tasks. You can see the pre-processing code [here](https://github.com/willwade/dailyDialogCorrections/tree/main)
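
The 2-5 word length filter mentioned above could look roughly like the sketch below. It assumes words are counted by splitting on whitespace; the linked pre-processing repository may tokenize differently, and both function names are hypothetical.

```python
def is_short_utterance(sentence, min_words=2, max_words=5):
    """True when the sentence has between min_words and max_words whitespace-separated words."""
    return min_words <= len(sentence.split()) <= max_words


def select_short(sentences, min_words=2, max_words=5):
    """Keep only short utterances, with surrounding whitespace trimmed."""
    return [s.strip() for s in sentences if is_short_utterance(s, min_words, max_words)]


corpus = [
    "Hello",
    "How are you today",
    "I was wondering whether you could help me out",
    "  see you later  ",
]
print(select_short(corpus))  # ['How are you today', 'see you later']
```
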
  ## Model Architecture and Objective
  The model follows the T5 architecture, fine-tuned for the specific task of text correction with a focus on typo correction and space insertion.


  # Citation