---
language: en
datasets: COCA
---
# docusco-bert
## Model description
**docusco-bert** is a fine-tuned BERT model that is ready to use for **token classification**. The model was trained on data from the Corpus of Contemporary American English ([COCA](https://www.english-corpora.org/coca/)) and classifies tokens and token sequences according to a system developed for the [**DocuScope**](https://www.cmu.edu/dietrich/english/research-and-publications/docuscope.html) dictionary-based tagger. Descriptions of the categories are included in a table below.
## About DocuScope
DocuScope is a dictionary-based tagger that has been developed at Carnegie Mellon University by **David Kaufer** and **Suguru Ishizaki** since the early 2000s. Its categories are rhetorical in their orientation (as opposed to part-of-speech tags, for example, which are morphosyntactic).
DocuScope has been used in [a wide variety of studies](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=docuscope&btnG=). Here, for example, is [a short analysis of King Lear](https://graphics.cs.wisc.edu/WP/vep/2017/02/14/guest-post-data-mining-king-lear/), and here is [a published study of Tweets](https://journals.sagepub.com/doi/full/10.1177/2055207619844865).
## Intended uses & limitations
#### How to use
The model was trained on data tagged in [IOB](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)) format, the same scheme used in common tasks such as Named Entity Recognition (NER). Thus, you can use this model with a Transformers NER *pipeline*.
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load the fine-tuned tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("browndw/docusco-bert")
model = AutoModelForTokenClassification.from_pretrained("browndw/docusco-bert")

# Build a token-classification ("ner") pipeline around the model
nlp = pipeline("ner", model=model, tokenizer=tokenizer)

example = "Globalization is the process of interaction and integration among people, companies, and governments worldwide."
ds_results = nlp(example)  # a list with one dict per tagged token
print(ds_results)
```
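Because the tags follow IOB, consecutive tokens can be merged back into the multi-word sequences that DocuScope categories often cover. The following is a minimal sketch of that grouping step, not part of the model's own code. It assumes the labels carry `B-`/`I-` prefixes (for example `B-Citation`), and the `tagged` list is hand-written illustration, not real model output.

```python
def group_iob(tokens):
    """Merge (word, tag) pairs into (category, phrase) spans.

    Assumes tags look like "B-Label", "I-Label", or "O".
    """
    spans = []
    current_label, current_words = None, []
    for word, tag in tokens:
        prefix, _, label = tag.partition("-")
        # An I- tag that continues the open span extends it.
        if prefix == "I" and label == current_label:
            current_words.append(word)
            continue
        # Anything else closes the open span, if there is one.
        if current_label is not None:
            spans.append((current_label, " ".join(current_words)))
        if prefix in ("B", "I"):
            # Start a new span (a stray I- tag is treated like B-).
            current_label, current_words = label, [word]
        else:
            current_label, current_words = None, []
    if current_label is not None:
        spans.append((current_label, " ".join(current_words)))
    return spans


# Hypothetical tags, written by hand for illustration only.
tagged = [
    ("according", "B-Citation"),
    ("to", "I-Citation"),
    ("the", "B-InformationPlace"),
    ("city", "I-InformationPlace"),
    (",", "O"),
    ("maybe", "B-ConfidenceHedged"),
]
print(group_iob(tagged))
```

The Transformers pipeline also accepts an `aggregation_strategy` argument (for example `"simple"`) that performs a similar grouping internally, which may be preferable when working directly with pipeline output.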
#### Limitations and bias
This model is limited by its training dataset of American English texts. Moreover, the current version is trained on only a small subset of the corpus. The goal is to train later versions on more data, which should increase accuracy.
## Training data
This model was fine-tuned on data from the Corpus of Contemporary American English ([COCA](https://www.english-corpora.org/coca/)). The training data contain 1,500 randomly sampled texts from each of five text types: Academic, Fiction, Magazine, News, and Spoken.
#### Number of texts, chunks, and tokens per dataset
Dataset |Texts |Chunks |Tokens
-|-|-|-
Train |7,500 |1,167,584 |32,203,828
Test |500 |58,117 |1,567,997
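
As a quick sanity check on the table above, the per-text and per-chunk averages can be derived from the reported counts. This is only an illustrative calculation; the dictionary layout and the `averages` helper are assumptions made for the sketch.

```python
# Counts copied from the table above.
datasets = {
    "Train": {"texts": 7500, "chunks": 1_167_584, "tokens": 32_203_828},
    "Test": {"texts": 500, "chunks": 58_117, "tokens": 1_567_997},
}


def averages(row):
    """Return simple ratios between the three reported counts."""
    return {
        "chunks_per_text": row["chunks"] / row["texts"],
        "tokens_per_chunk": row["tokens"] / row["chunks"],
        "tokens_per_text": row["tokens"] / row["texts"],
    }


# 1,500 texts from each of 5 text types gives the training total.
assert 1500 * 5 == datasets["Train"]["texts"]

for name, row in datasets.items():
    stats = averages(row)
    print(name, {key: round(value, 1) for key, value in stats.items()})
```

Both splits come out at roughly 27 tokens per chunk, which suggests the chunks are short, sentence-sized units.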
## Training procedure
This model was fine-tuned on a single 2.3 GHz dual-core Intel Core i5 CPU, using the hyperparameters recommended in the [original BERT paper](https://arxiv.org/pdf/1810.04805).
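For orientation, the BERT paper suggests a small search space for fine-tuning: batch sizes of 16 or 32, Adam learning rates of 5e-5, 3e-5, or 2e-5, and 2 to 4 epochs. The sketch below simply enumerates that grid. This card does not state which combination was used for docusco-bert, so none of these values should be read as the actual training configuration.

```python
from itertools import product

# Fine-tuning ranges suggested in the BERT paper (Devlin et al., 2018).
# Which values were used for this model is NOT stated in the card.
BATCH_SIZES = (16, 32)
LEARNING_RATES = (5e-5, 3e-5, 2e-5)
EPOCHS = (2, 3, 4)

# Every combination of the three settings.
grid = [
    {"batch_size": batch, "learning_rate": rate, "epochs": epochs}
    for batch, rate, epochs in product(BATCH_SIZES, LEARNING_RATES, EPOCHS)
]

print(len(grid), "candidate configurations")
print(grid[0])
```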
## Eval results
### Overall
metric|test
-|-
f1 |.743
accuracy |.801
### By category
category|precision|recall|f1-score|support
-|-|-|-|-
AcademicTerms|0.76|0.77|0.76|140805
AcademicWritingMoves|0.36|0.46|0.40|8182
Character|0.74|0.78|0.76|123856
Citation|0.73|0.81|0.77|13428
CitationAuthority|0.55|0.49|0.51|4552
CitationHedged|0.58|0.89|0.70|285
ConfidenceHedged|0.76|0.84|0.79|14765
ConfidenceHigh|0.64|0.72|0.68|11462
ConfidenceLow|0.70|0.39|0.50|380
Contingent|0.68|0.69|0.69|9537
Description|0.60|0.67|0.63|108186
Facilitate|0.63|0.63|0.63|7421
FirstPerson|0.62|0.73|0.67|6235
ForceStressed|0.65|0.72|0.69|37910
Future|0.63|0.69|0.66|9049
InformationChange|0.64|0.72|0.68|14560
InformationChangeNegative|0.59|0.57|0.58|1840
InformationChangePositive|0.61|0.58|0.60|4265
InformationExposition|0.80|0.83|0.82|84977
InformationPlace|0.80|0.82|0.81|18783
InformationReportVerbs|0.71|0.79|0.75|17572
InformationStates|0.74|0.80|0.77|21048
InformationTopics|0.69|0.72|0.70|58677
Inquiry|0.50|0.58|0.53|12735
Interactive|0.64|0.70|0.67|18135
MetadiscourseCohesive|0.90|0.93|0.92|33312
MetadiscourseInteractive|0.54|0.62|0.58|6888
Narrative|0.70|0.76|0.73|116896
Negative|0.63|0.69|0.66|60534
Positive|0.60|0.67|0.63|54374
PublicTerms|0.70|0.74|0.72|38229
Reasoning|0.71|0.76|0.74|30157
Responsibility|0.59|0.63|0.61|3451
Strategic|0.60|0.62|0.61|28064
SyntacticComplexity|0.83|0.87|0.85|297387
Uncertainty|0.43|0.44|0.43|2915
Updates|0.52|0.53|0.53|6156
-|-|-|-|-
micro avg|0.72|0.77|0.74|1427008
macro avg|0.65|0.69|0.67|1427008
weighted avg|0.72|0.77|0.74|1427008
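
To make the columns above concrete: F1 is the harmonic mean of precision and recall, a macro average weights every category equally, and a weighted average weights each category by its support. The sketch below recomputes these for three rows copied from the table. Because it uses only a subset of rows and rounded inputs, its averages will not reproduce the full-table figures.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported f1, support), copied from the table.
rows = {
    "CitationHedged": (0.58, 0.89, 0.70, 285),
    "ConfidenceLow": (0.70, 0.39, 0.50, 380),
    "SyntacticComplexity": (0.83, 0.87, 0.85, 297387),
}

for name, (precision, recall, reported, _) in rows.items():
    print(name, round(f1_score(precision, recall), 3), "reported:", reported)

f1_values = [row[2] for row in rows.values()]
supports = [row[3] for row in rows.values()]

# Macro: plain mean over categories. Weighted: mean weighted by support.
macro = sum(f1_values) / len(f1_values)
weighted = sum(f * s for f, s in zip(f1_values, supports)) / sum(supports)
print("macro:", round(macro, 3), "weighted:", round(weighted, 3))
```

In this subset the weighted average sits well above the macro average because the high-support category also scores highest, which is the same pattern visible in the full table (macro 0.67 versus weighted 0.74).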
## DocuScope Category Descriptions
Category (Cluster)|Description|Examples
-|-|-
Academic Terms|Abstract, rare, specialized, or disciplinary-specific terms that are indicative of informationally dense writing|*market price*, *storage capacity*, *regulatory*, *distribution*
Academic Writing Moves|Phrases and terms that indicate academic writing moves, which are common in research genres and are derived from the work of Swales (1981) and Cotos et al. (2015, 2017)|*in the first section*, *the problem is that*, *payment methodology*, *point of contention*
Character|References multiple dimensions of a character or human being as a social agent, both individual and collective|*Pauline*, *her*, *personnel*, *representatives*
Citation|Language that indicates the attribution of information to, or citation of, another source.|*according to*, *is proposing that*, *quotes from*
Citation Authority|Referencing the citation of another source that is represented as true and not arguable|*confirm that*, *provide evidence*, *common sense*
Citation Hedged|Referencing the citation of another source that is presented as arguable|*suggest that*, *just one opinion*
Confidence Hedged|Referencing language that presents a claim as uncertain|*tends to get*, *maybe*, *it seems that*
Confidence High|Referencing language that presents a claim with certainty|*most likely*, *ensure that*, *know that*, *obviously*
Confidence Low|Referencing language that presents a claim as extremely unlikely|*unlikely*, *out of the question*, *impossible*
Contingent|Referencing contingency, typically contingency in the world, rather than contingency in one's knowledge|*subject to*, *if possible*, *just in case*, *hypothetically*
Description|Language that evokes sights, sounds, smells, touches and tastes, as well as scenes and objects|*stay quiet*, *gas-fired*, *solar panels*, *soft*, *on my desk*
Facilitate|Language that enables or directs one through specific tasks and actions|*let me*, *worth a try*, *I would suggest*
First Person|This cluster captures first person.|*I*, *as soon as I*, *we have been*
Force Stressed|Language that is forceful and stressed, often using emphatics, comparative forms, or superlative forms|*really good*, *the sooner the better*, *necessary*
Future|Referencing future actions, states, or desires|*will be*, *hope to*, *expected changes*
Information Change|Referencing changes of information, particularly changes that are more neutral|*changes*, *revised*, *growth*, *modification to*
Information Change Negative|Referencing negative change|*going downhill*, *slow erosion*, *get worse*
Information Change Positive|Referencing positive change|*improving*, *accrued interest*, *boost morale*
Information Exposition|Information in the form of expository devices, or language that describes or explains, frequently with regard to quantities and comparisons|*final amount*, *several*, *three*, *compare*, *80%*
Information Place|Language designating places|*the city*, *surrounding areas*, *Houston*, *home*
Information Report Verbs|Informational verbs and verb phrases of reporting|*report*, *posted*, *release*, *point out*
Information States|Referencing information states, or states of being|*is*, *are*, *existing*, *been*
Information Topics|Referencing topics, usually nominal subjects or objects, that indicate the “aboutness” of a text|*time*, *money*, *stock price*, *phone interview*
Inquiry|Referencing inquiry, or language that points to some kind of inquiry or investigation|*find out*, *let me know if you have any questions*, *wondering if*
Interactive|Addresses from the author to the reader or from persons in the text to other persons. The address comes in the language of everyday conversation, colloquy, exchange, questions, attention-getters, feedback, interactive genre markers, and the use of the second person.|*can you*, *thank you for*, *please see*, *sounds good to me*
Metadiscourse Cohesive|The use of words to build cohesive markers that help the reader navigate the text and signal linkages in the text, which are often additive or contrastive|*or*, *but*, *also*, *on the other hand*, *notwithstanding*, *that being said*
Metadiscourse Interactive|The use of words to build cohesive markers that interact with the reader|*I agree*, *let’s talk*, *by the way*
Narrative|Language that involves people, description, and events extending in time|*today*, *tomorrow*, *during the*, *this weekend*
Negative|Referencing dimensions of negativity, including negative acts, emotions, relations, and values|*does not*, *sorry for*, *problems*, *confusion*
Positive|Referencing dimensions of positivity, including actions, emotions, relations, and values|*thanks*, *approval*, *agreement*, *looks good*
Public Terms|Referencing public terms, concepts from public language, media, the language of authority, institutions, and responsibility|*discussion*, *amendment*, *corporation*, *authority*, *settlement*
Reasoning|Language that has a reasoning focus, supporting inferences about cause, consequence, generalization, concession, and linear inference either from premise to conclusion or conclusion to premise|*because*, *therefore*, *analysis*, *even if*, *as a result*, *indicating that*
Responsibility|Referencing the language of responsibility|*supposed to*, *requirements*, *obligations*
Strategic|This dimension is active when the text involves strategies, activism, advantage-seeking, game-playing cognition, plans, and goal-seeking.|*plan*, *trying to*, *strategy*, *decision*, *coordinate*, *look at the*
Syntactic Complexity|The features in this category are often what are called “function words,” like determiners and prepositions.|*the*, *to*, *for*, *in*, *a lot of*
Uncertainty|References uncertainty, when confidence levels are unknown|*kind of*, *I have no idea*, *for some reason*
Updates|References updates that anticipate someone searching for information and receiving it|*already*, *a new*, *now that*, *here are some*
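
The evaluation table writes category labels as single run-together words (`AcademicTerms`), while the table above spaces them out (Academic Terms). A small helper like the one below can convert between the two forms when looking up descriptions. It is a hypothetical convenience function, and it assumes that any IOB prefix takes the form `B-` or `I-`.

```python
import re


def label_to_name(label):
    """Turn a label such as 'B-AcademicWritingMoves' into 'Academic Writing Moves'."""
    # Drop an IOB prefix if one is present (assumed to be 'B-' or 'I-').
    if label[:2] in ("B-", "I-"):
        label = label[2:]
    # Insert a space wherever a lowercase letter is followed by an uppercase one.
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", label)


for label in ("AcademicTerms", "B-InformationChangeNegative", "I-Future"):
    print(label, "->", label_to_name(label))
```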
### BibTeX entry and citation info
```
@incollection{ishizaki2012computer,
title = {Computer-aided rhetorical analysis},
author = {Ishizaki, Suguru and Kaufer, David},
booktitle= {Applied natural language processing: Identification, investigation and resolution},
pages = {276--296},
year = {2012},
publisher= {IGI Global},
url = {https://www.igi-global.com/chapter/content/61054}
}
```
```
@article{DBLP:journals/corr/abs-1810-04805,
author = {Jacob Devlin and
Ming{-}Wei Chang and
Kenton Lee and
Kristina Toutanova},
title = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language
Understanding},
journal = {CoRR},
volume = {abs/1810.04805},
year = {2018},
url = {http://arxiv.org/abs/1810.04805},
archivePrefix = {arXiv},
eprint = {1810.04805},
timestamp = {Tue, 30 Oct 2018 20:39:56 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
```