willwade committed on
Commit
c88183e
·
verified ·
1 Parent(s): 0a569d1

adding new facts

Files changed (1)
  1. README.md +6 -1
README.md CHANGED
@@ -40,6 +40,7 @@ The primary task of this model is **Text Correction**, with a focus on:
40
 
41
  This model is intended to support processing user-generated content where informal language, abbreviations, and typos are prevalent, with the goal of improving text clarity for further processing or human reading.
42
 
 
43
 
44
  ## Usage
45
 
@@ -94,6 +95,7 @@ It has been training on
94
  - [Coedit](https://huggingface.co/datasets/grammarly/coedit)
95
  - [Conversation Enders](https://huggingface.co/Chakshu/conversation_ender)
96
  - [Conversation Starters](https://huggingface.co/Langame/conversation-starters)
 
97
 
98
 
99
  Then injecting typos from a range of places
@@ -103,13 +105,16 @@ Then injecting typos from a range of places
103
  - **TOEFL Spell** A dataset of Spelling Annotations for English language learner essays written for TOEFL exams.
104
  - Find this [here](https://github.com/EducationalTestingService/TOEFL-Spell/tree/master)
105
  - **Homonyms** We occasionally replace words in BNC and Daily Dialog with homonyms from this list: https://github.com/pimentel/homophones
106
- - **Our own typo augment function** This produces errors likely to occur on an English QWERTY layout, as well as substitutions, deletions, etc.
107
 
108
  We then add compressed versions of the sentences (i.e. with spaces removed), both correct and typo'd, to our dataset. (This is to solve a problem where some people write without spaces.)
109
  Note we use a ``grammar: `` prefix for each sentence in training.
110
 
111
  Full script to build the [dataset is here](https://colab.research.google.com/drive/1VkKU9KKIWkWQZ-pPzdDFLeRnwFxdWUtq?usp=sharing)
112
 
 
 
 
113
 
114
  ## Developed by:
115
  - **Name**: Will Wade
 
40
 
41
  This model is intended to support processing user-generated content where informal language, abbreviations, and typos are prevalent, with the goal of improving text clarity for further processing or human reading.
42
 
43
+ **Note: as of 15 March, this model is primarily tuned to fix positional errors on a QWERTY keyboard.**
44
 
45
  ## Usage
46
 
 
95
  - [Coedit](https://huggingface.co/datasets/grammarly/coedit)
96
  - [Conversation Enders](https://huggingface.co/Chakshu/conversation_ender)
97
  - [Conversation Starters](https://huggingface.co/Langame/conversation-starters)
98
+ - 5% AAC-Like Open Subtitles (Private dataset with thanks to Keith Vertanen)
99
 
100
 
101
  Then injecting typos from a range of places
 
105
  - **TOEFL Spell** A dataset of Spelling Annotations for English language learner essays written for TOEFL exams.
106
  - Find this [here](https://github.com/EducationalTestingService/TOEFL-Spell/tree/master)
107
  - **Homonyms** We occasionally replace words in BNC and Daily Dialog with homonyms from this list: https://github.com/pimentel/homophones
108
+ - **Our own typo augment function** This produces errors likely to occur on an English QWERTY layout, as well as substitutions, deletions, etc. The full weightings can be seen in our training script.
109
 
110
  We then add compressed versions of the sentences (i.e. with spaces removed), both correct and typo'd, to our dataset. (This is to solve a problem where some people write without spaces.)
111
  Note we use a ``grammar: `` prefix for each sentence in training.
112
 
113
  Full script to build the [dataset is here](https://colab.research.google.com/drive/1VkKU9KKIWkWQZ-pPzdDFLeRnwFxdWUtq?usp=sharing)
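
The augmentation steps described above (QWERTY-neighbour typos, deletions, space-removed variants, and the ``grammar: `` prefix) can be illustrated with a rough sketch. This is not the actual augment function from the training script: the adjacency map, the function names `inject_typos` and `make_pairs`, and the `rate` and 80/20 substitution/deletion split are all assumptions made for illustration, and the real weightings live in the linked script.

```python
import random

# Hypothetical, simplified QWERTY adjacency map (letters only, illustrative).
QWERTY_NEIGHBORS = {
    "q": "wa", "w": "qes", "e": "wrd", "r": "etf", "t": "ryg",
    "y": "tuh", "u": "yij", "i": "uok", "o": "ipl", "p": "o",
    "a": "qsz", "s": "awdx", "d": "sefc", "f": "drgv", "g": "fthb",
    "h": "gyjn", "j": "hukm", "k": "jil", "l": "ko",
    "z": "ax", "x": "zsc", "c": "xdv", "v": "cfb", "b": "vgn",
    "n": "bhm", "m": "nj",
}


def inject_typos(text, rate=0.1, rng=None):
    """Corrupt letters at the given rate: substitute a neighbouring key or delete.

    Substituted characters are always lowercase in this sketch.
    """
    if rng is None:
        rng = random.Random()
    out = []
    for ch in text:
        if ch.lower() in QWERTY_NEIGHBORS and rng.random() < rate:
            if rng.random() < 0.8:  # assumed weighting: mostly substitutions
                out.append(rng.choice(QWERTY_NEIGHBORS[ch.lower()]))
            # otherwise the character is dropped (a deletion)
        else:
            out.append(ch)
    return "".join(out)


def make_pairs(sentence, rate=0.1, rng=None):
    """Build (input, target) pairs: typo'd, typo'd without spaces, correct without spaces."""
    if rng is None:
        rng = random.Random()
    noisy = inject_typos(sentence, rate, rng)
    sources = [noisy, noisy.replace(" ", ""), sentence.replace(" ", "")]
    # Every training input carries the "grammar: " prefix; the target is the clean sentence.
    return [("grammar: " + src, sentence) for src in sources]


if __name__ == "__main__":
    for src, tgt in make_pairs("see you at the park later", rate=0.15, rng=random.Random(0)):
        print(repr(src), "->", repr(tgt))
```

Passing a seeded `random.Random` keeps the generated pairs reproducible, which helps when comparing different typo weightings.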
114
 
115
+ ## To Do:
116
+
117
+ We really want to be able to deal with errors in switch scanning, which may be linear ABC, frequency, or block scanning. It's relatively straightforward, but we are still thinking about the best way forward for this.
118
 
119
  ## Developed by:
120
  - **Name**: Will Wade