---
language: en
widget:
- text: "He had also stgruggled with addiction during his time in Congress ."
- text: "The review thoroughla assessed all aspects of JLENS SuR and CPG esign maturit and confidence ."
- text: "Letterma also apologized two his staff for the satyation ."
- text: "Vincent Jay had earlier won France 's first gold in gthe 10km biathlon sprint ."
- text: "It is left to the directors to figure out hpw to bring the stry across to tye audience ."
---
# Typo Detector
## Dataset Information
For this task, I used the [NeuSpell](https://github.com/neuspell/neuspell) corpus as my raw data.
## Evaluation
The following table summarizes the scores obtained by the model, overall and per class.
| # | precision | recall | f1-score | support |
|:------------:|:---------:|:--------:|:--------:|:--------:|
| TYPO | 0.992332 | 0.985997 | 0.989154 | 416054.0 |
| micro avg | 0.992332 | 0.985997 | 0.989154 | 416054.0 |
| macro avg | 0.992332 | 0.985997 | 0.989154 | 416054.0 |
| weighted avg | 0.992332 | 0.985997 | 0.989154 | 416054.0 |
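As a quick sanity check on the table, the F1 score should be the harmonic mean of precision and recall. The sketch below recomputes it from the TYPO row; the values are copied from the table and the variable names are only illustrative.

```python
# Precision and recall for the TYPO class, copied from the table above.
precision = 0.992332
recall = 0.985997

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"{f1:.6f}")
```

The result agrees with the reported f1-score of 0.989154 up to rounding. Because TYPO is the only class, the micro, macro, and weighted averages are identical to the class row.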
## How to use
You can use this model with the Transformers pipeline for NER (token classification).
### Installing requirements
```bash
pip install transformers torch
```
### Prediction using pipeline
```python
import torch
from transformers import AutoConfig, AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
model_name_or_path = "m3hrdadfi/typo-detector-distilbert-en"
config = AutoConfig.from_pretrained(model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForTokenClassification.from_pretrained(model_name_or_path, config=config)
nlp = pipeline('token-classification', model=model, tokenizer=tokenizer, aggregation_strategy="average")
```
```python
sentences = [
"He had also stgruggled with addiction during his time in Congress .",
"The review thoroughla assessed all aspects of JLENS SuR and CPG esign maturit and confidence .",
"Letterma also apologized two his staff for the satyation .",
"Vincent Jay had earlier won France 's first gold in gthe 10km biathlon sprint .",
"It is left to the directors to figure out hpw to bring the stry across to tye audience .",
]
for sentence in sentences:
    # Slice each detected typo out of the sentence using the character offsets
    typos = [sentence[r["start"]: r["end"]] for r in nlp(sentence)]

    # Wrap every detected typo in <i>...</i> tags
    detected = sentence
    for typo in typos:
        detected = detected.replace(typo, f'<i>{typo}</i>')

    print("   [Input]: ", sentence)
    print("[Detected]: ", detected)
    print("-" * 130)
```
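The loop above marks typos with `str.replace`, which also wraps any other occurrence of the same substring elsewhere in the sentence. A minimal sketch of an offset-based alternative follows, assuming each pipeline result exposes integer `start` and `end` character offsets as used above. `highlight_spans` is a hypothetical helper, not part of the model or of Transformers, and the spans here are written by hand so the snippet runs without downloading the model.

```python
def highlight_spans(sentence, spans, open_tag="<i>", close_tag="</i>"):
    """Wrap each (start, end) character span of `sentence` in tags."""
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        if start < cursor:
            continue  # skip spans that overlap one already emitted
        pieces.append(sentence[cursor:start])  # untouched text before the span
        pieces.append(open_tag + sentence[start:end] + close_tag)
        cursor = end
    pieces.append(sentence[cursor:])  # untouched tail
    return "".join(pieces)


sentence = "Letterma also apologized two his staff for the satyation ."
# Hand-written offsets for illustration; with the pipeline they would be
# [(r["start"], r["end"]) for r in nlp(sentence)].
spans = [(0, 8), (25, 28), (47, 56)]
highlighted = highlight_spans(sentence, spans)
print(highlighted)
```

Working from offsets keeps each tag tied to the exact position the model flagged, so a repeated word is marked only where it was detected.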
Output:
```text
[Input]: He had also stgruggled with addiction during his time in Congress .
[Detected]: He had also <i>stgruggled</i> with addiction during his time in Congress .
----------------------------------------------------------------------------------------------------------------------------------
[Input]: The review thoroughla assessed all aspects of JLENS SuR and CPG esign maturit and confidence .
[Detected]: The review <i>thoroughla</i> assessed all aspects of JLENS SuR and CPG <i>esign</i> <i>maturit</i> and confidence .
----------------------------------------------------------------------------------------------------------------------------------
[Input]: Letterma also apologized two his staff for the satyation .
[Detected]: <i>Letterma</i> also apologized <i>two</i> his staff for the <i>satyation</i> .
----------------------------------------------------------------------------------------------------------------------------------
[Input]: Vincent Jay had earlier won France 's first gold in gthe 10km biathlon sprint .
[Detected]: Vincent Jay had earlier won France 's first gold in <i>gthe</i> 10km biathlon sprint .
----------------------------------------------------------------------------------------------------------------------------------
[Input]: It is left to the directors to figure out hpw to bring the stry across to tye audience .
[Detected]: It is left to the directors to figure out <i>hpw</i> to bring the <i>stry</i> across to <i>tye</i> audience .
----------------------------------------------------------------------------------------------------------------------------------
```
## Questions?
Post a GitHub issue on the [TypoDetector Issues](https://github.com/m3hrdadfi/typo-detector/issues) page.