---
language: en
datasets: COCA
---

# docusco-bert

## Model description

**docusco-bert** is a fine-tuned BERT model that is ready to use for **token classification**. The model was trained on data from the Corpus of Contemporary American English ([COCA](https://www.english-corpora.org/coca/)) and classifies tokens and token sequences according to a system developed for the [**DocuScope**](https://www.cmu.edu/dietrich/english/research-and-publications/docuscope.html) dictionary-based tagger. Descriptions of the categories are included in a table below.

## About DocuScope

DocuScope is a dictionary-based tagger that has been developed at Carnegie Mellon University by **David Kaufer** and **Suguru Ishizaki** since the early 2000s. Its categories are rhetorical in their orientation (as opposed to part-of-speech tags, for example, which are morphosyntactic).

DocuScope has been used in [a wide variety of studies](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=docuscope&btnG=). Here, for example, is [a short analysis of King Lear](https://graphics.cs.wisc.edu/WP/vep/2017/02/14/guest-post-data-mining-king-lear/), and here is [a published study of Tweets](https://journals.sagepub.com/doi/full/10.1177/2055207619844865).

## Intended uses & limitations

#### How to use

The model was trained on data with tags formatted using [IOB](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)), like those used in common tasks such as Named Entity Recognition (NER). Thus, you can use this model with a Transformers NER *pipeline*.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("browndw/docusco-bert")
model = AutoModelForTokenClassification.from_pretrained("browndw/docusco-bert")

# Because the tags follow the IOB scheme, the standard NER pipeline applies
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "Globalization is the process of interaction and integration among people, companies, and governments worldwide."

ds_results = nlp(example)
print(ds_results)
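
The raw output above is one prediction per word piece, each still carrying its B- or I- prefix. If you prefer one contiguous span per category, recent versions of `transformers` (roughly v4.7+) can merge the IOB continuations for you; a minimal sketch:

```python
# Merge B-/I- word-piece predictions into contiguous category spans
nlp_grouped = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",  # groups adjacent tokens of the same category
)

for span in nlp_grouped(example):
    print(span["entity_group"], span["word"], round(span["score"], 3))
```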

#### Limitations and bias

This model is limited by its training dataset of American English texts. Moreover, the current version is trained on only a small subset of the corpus. The goal is to train later versions on more data, which should increase accuracy.

## Training data

This model was fine-tuned on data from the Corpus of Contemporary American English ([COCA](https://www.english-corpora.org/coca/)). The training data contain 1,500 randomly sampled texts from each of 5 text-types: Academic, Fiction, Magazine, News, and Spoken.

#### Texts, chunks, and tokens per dataset

Dataset|Texts|Chunks|Tokens
-|-|-|-
Train|7,500|1,167,584|32,203,828
Test|500|58,117|1,567,997
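
The chunk counts above reflect that each text is split into smaller units before training, since BERT can only attend over a fixed-length input window. The exact chunking scheme is not documented here, but a sketch of the general approach, with an assumed window size, looks like this:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def chunk_ids(text: str, window: int = 256) -> list[list[int]]:
    """Split a text into consecutive windows of at most `window` word pieces.

    The window size here is an assumption for illustration, not the value
    used to build this model's training data.
    """
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    return [ids[i : i + window] for i in range(0, len(ids), window)]
```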

## Training procedure

This model was trained on a single 2.3 GHz Dual-Core Intel Core i5 with recommended hyperparameters from the [original BERT paper](https://arxiv.org/pdf/1810.04805).
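
For token classification, those recommendations amount to a small grid: batch size 16 or 32, learning rate 5e-5, 3e-5, or 2e-5, and 2-4 epochs. A minimal sketch of such a run with the `transformers` Trainer, picking one point from that grid (the dataset preparation and the exact values used for this model are not documented here):

```python
from transformers import (
    AutoModelForTokenClassification,
    Trainer,
    TrainingArguments,
)

# Placeholder: a tokenized dataset with aligned IOB label ids, built elsewhere
train_dataset = ...

# 37 DocuScope categories x (B-, I-) tags, plus "O"
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=75
)

args = TrainingArguments(
    output_dir="docusco-bert",
    learning_rate=2e-5,              # one point from the recommended grid
    per_device_train_batch_size=16,  # likewise an assumed choice
    num_train_epochs=3,
)

Trainer(model=model, args=args, train_dataset=train_dataset).train()
```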

## Eval results

### Overall

metric|test
-|-
f1|.743
accuracy|.801
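
Scores like these can be computed with the `seqeval` library, a standard evaluation tool for IOB-tagged sequences. A minimal sketch, assuming `y_true` and `y_pred` hold the gold and predicted tag sequences for each test chunk:

```python
from seqeval.metrics import accuracy_score, classification_report, f1_score

# Toy stand-ins: each inner list is the tag sequence for one chunk
y_true = [["B-Narrative", "I-Narrative", "O", "B-Positive"]]
y_pred = [["B-Narrative", "I-Narrative", "O", "O"]]

print(f1_score(y_true, y_pred))               # span-level F1, as reported above
print(accuracy_score(y_true, y_pred))         # token-level accuracy
print(classification_report(y_true, y_pred))  # per-category breakdown, as below
```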

### By category

category|precision|recall|f1-score|support
-|-|-|-|-
AcademicTerms|0.76|0.77|0.76|140805
AcademicWritingMoves|0.36|0.46|0.40|8182
Character|0.74|0.78|0.76|123856
Citation|0.73|0.81|0.77|13428
CitationAuthority|0.55|0.49|0.51|4552
CitationHedged|0.58|0.89|0.70|285
ConfidenceHedged|0.76|0.84|0.79|14765
ConfidenceHigh|0.64|0.72|0.68|11462
ConfidenceLow|0.70|0.39|0.50|380
Contingent|0.68|0.69|0.69|9537
Description|0.60|0.67|0.63|108186
Facilitate|0.63|0.63|0.63|7421
FirstPerson|0.62|0.73|0.67|6235
ForceStressed|0.65|0.72|0.69|37910
Future|0.63|0.69|0.66|9049
InformationChange|0.64|0.72|0.68|14560
InformationChangeNegative|0.59|0.57|0.58|1840
InformationChangePositive|0.61|0.58|0.60|4265
InformationExposition|0.80|0.83|0.82|84977
InformationPlace|0.80|0.82|0.81|18783
InformationReportVerbs|0.71|0.79|0.75|17572
InformationStates|0.74|0.80|0.77|21048
InformationTopics|0.69|0.72|0.70|58677
Inquiry|0.50|0.58|0.53|12735
Interactive|0.64|0.70|0.67|18135
MetadiscourseCohesive|0.90|0.93|0.92|33312
MetadiscourseInteractive|0.54|0.62|0.58|6888
Narrative|0.70|0.76|0.73|116896
Negative|0.63|0.69|0.66|60534
Positive|0.60|0.67|0.63|54374
PublicTerms|0.70|0.74|0.72|38229
Reasoning|0.71|0.76|0.74|30157
Responsibility|0.59|0.63|0.61|3451
Strategic|0.60|0.62|0.61|28064
SyntacticComplexity|0.83|0.87|0.85|297387
Uncertainty|0.43|0.44|0.43|2915
Updates|0.52|0.53|0.53|6156
micro avg|0.72|0.77|0.74|1427008
macro avg|0.65|0.69|0.67|1427008
weighted avg|0.72|0.77|0.74|1427008

## DocuScope Category Descriptions

Category (Cluster)|Description|Examples
-|-|-
Academic Terms|Abstract, rare, specialized, or disciplinary-specific terms that are indicative of informationally dense writing|*market price*, *storage capacity*, *regulatory*, *distribution*
Academic Writing Moves|Phrases and terms that indicate academic writing moves, which are common in research genres and are derived from the work of Swales (1981) and Cotos et al. (2015, 2017)|*in the first section*, *the problem is that*, *payment methodology*, *point of contention*
Character|References multiple dimensions of a character or human being as a social agent, both individual and collective|*Pauline*, *her*, *personnel*, *representatives*
Citation|Language that indicates the attribution of information to, or citation of, another source|*according to*, *is proposing that*, *quotes from*
Citation Authorized|Referencing the citation of another source that is represented as true and not arguable|*confirm that*, *provide evidence*, *common sense*
Citation Hedged|Referencing the citation of another source that is presented as arguable|*suggest that*, *just one opinion*
Confidence Hedged|Referencing language that presents a claim as uncertain|*tends to get*, *maybe*, *it seems that*
Confidence High|Referencing language that presents a claim with certainty|*most likely*, *ensure that*, *know that*, *obviously*
Confidence Low|Referencing language that presents a claim as extremely unlikely|*unlikely*, *out of the question*, *impossible*
Contingent|Referencing contingency, typically contingency in the world rather than contingency in one's knowledge|*subject to*, *if possible*, *just in case*, *hypothetically*
Description|Language that evokes sights, sounds, smells, touches, and tastes, as well as scenes and objects|*stay quiet*, *gas-fired*, *solar panels*, *soft*, *on my desk*
Facilitate|Language that enables or directs one through specific tasks and actions|*let me*, *worth a try*, *I would suggest*
First Person|This cluster captures first-person references|*I*, *as soon as I*, *we have been*
Force Stressed|Language that is forceful and stressed, often using emphatics, comparative forms, or superlative forms|*really good*, *the sooner the better*, *necessary*
Future|Referencing future actions, states, or desires|*will be*, *hope to*, *expected changes*
Information Change|Referencing changes of information, particularly changes that are more neutral|*changes*, *revised*, *growth*, *modification to*
Information Change Negative|Referencing negative change|*going downhill*, *slow erosion*, *get worse*
Information Change Positive|Referencing positive change|*improving*, *accrued interest*, *boost morale*
Information Exposition|Information in the form of expository devices, or language that describes or explains, frequently with regard to quantities and comparisons|*final amount*, *several*, *three*, *compare*, *80%*
Information Place|Language designating places|*the city*, *surrounding areas*, *Houston*, *home*
Information Report Verbs|Informational verbs and verb phrases of reporting|*report*, *posted*, *release*, *point out*
Information States|Referencing information states, or states of being|*is*, *are*, *existing*, *been*
Information Topics|Referencing topics, usually nominal subjects or objects, that indicate the “aboutness” of a text|*time*, *money*, *stock price*, *phone interview*
Inquiry|Referencing inquiry, or language that points to some kind of inquiry or investigation|*find out*, *let me know if you have any questions*, *wondering if*
Interactive|Addresses from the author to the reader or from persons in the text to other persons. The address comes in the language of everyday conversation, colloquy, exchange, questions, attention-getters, feedback, interactive genre markers, and the use of the second person|*can you*, *thank you for*, *please see*, *sounds good to me*
Metadiscourse Cohesive|The use of words to build cohesive markers that help the reader navigate the text and signal linkages in the text, which are often additive or contrastive|*or*, *but*, *also*, *on the other hand*, *notwithstanding*, *that being said*
Metadiscourse Interactive|The use of words to build cohesive markers that interact with the reader|*I agree*, *let’s talk*, *by the way*
Narrative|Language that involves people, description, and events extending in time|*today*, *tomorrow*, *during the*, *this weekend*
Negative|Referencing dimensions of negativity, including negative acts, emotions, relations, and values|*does not*, *sorry for*, *problems*, *confusion*
Positive|Referencing dimensions of positivity, including actions, emotions, relations, and values|*thanks*, *approval*, *agreement*, *looks good*
Public Terms|Referencing public terms, concepts from public language, media, the language of authority, institutions, and responsibility|*discussion*, *amendment*, *corporation*, *authority*, *settlement*
Reasoning|Language that has a reasoning focus, supporting inferences about cause, consequence, generalization, concession, and linear inference either from premise to conclusion or conclusion to premise|*because*, *therefore*, *analysis*, *even if*, *as a result*, *indicating that*
Responsibility|Referencing the language of responsibility|*supposed to*, *requirements*, *obligations*
Strategic|This dimension is active when the text structures strategies: activism, advantage-seeking, game-playing cognition, plans, and goal-seeking|*plan*, *trying to*, *strategy*, *decision*, *coordinate*, *look at the*
Syntactic Complexity|The features in this category are often what are called “function words,” like determiners and prepositions|*the*, *to*, *for*, *in*, *a lot of*
Uncertainty|References uncertainty, when confidence levels are unknown|*kind of*, *I have no idea*, *for some reason*
Updates|References updates that anticipate someone searching for information and receiving it|*already*, *a new*, *now that*, *here are some*
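
Note that the model's own label set uses the compact CamelCase names from the evaluation tables (e.g., `B-AcademicTerms`), while this table uses spaced display names. A small hypothetical helper to convert between the two:

```python
import re

def display_name(tag: str) -> str:
    """Turn an IOB label like 'B-AcademicTerms' into 'Academic Terms'."""
    category = tag.split("-", 1)[-1]  # strip the B-/I- prefix, if any
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", category)

print(display_name("B-AcademicTerms"))     # -> Academic Terms
print(display_name("I-ConfidenceHedged"))  # -> Confidence Hedged
```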

### BibTeX entry and citation info

```
@incollection{ishizaki2012computer,
  title     = {Computer-aided rhetorical analysis},
  author    = {Ishizaki, Suguru and Kaufer, David},
  booktitle = {Applied natural language processing: Identification, investigation and resolution},
  pages     = {276--296},
  year      = {2012},
  publisher = {IGI Global},
  url       = {https://www.igi-global.com/chapter/content/61054}
}
```

```
@article{DBLP:journals/corr/abs-1810-04805,
  author        = {Jacob Devlin and Ming{-}Wei Chang and Kenton Lee and Kristina Toutanova},
  title         = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language Understanding},
  journal       = {CoRR},
  volume        = {abs/1810.04805},
  year          = {2018},
  url           = {http://arxiv.org/abs/1810.04805},
  archivePrefix = {arXiv},
  eprint        = {1810.04805},
  timestamp     = {Tue, 30 Oct 2018 20:39:56 +0100},
  biburl        = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}
```