---
language: en
datasets: COCA
---
# docusco-bert

## Model description

**docusco-bert** is a fine-tuned BERT model that is ready to use for **token classification**. The model was trained on data sampled from the Corpus of Contemporary American English ([COCA](https://www.english-corpora.org/coca/)) and classifies tokens and token sequences according to a system developed for the [**DocuScope**](https://www.cmu.edu/dietrich/english/research-and-publications/docuscope.html) dictionary-based tagger. Descriptions of the categories are included in a table below.
## About DocuScope

DocuScope is a dictionary-based tagger that has been developed at Carnegie Mellon University by **David Kaufer** and **Suguru Ishizaki** since the early 2000s. Its categories are rhetorical in their orientation (as opposed to part-of-speech tags, for example, which are morphosyntactic).

DocuScope has been used in [a wide variety of studies](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=docuscope&btnG=). Here, for example, is [a short analysis of King Lear](https://graphics.cs.wisc.edu/WP/vep/2017/02/14/guest-post-data-mining-king-lear/), and here is [a published study of Tweets](https://journals.sagepub.com/doi/full/10.1177/2055207619844865).
## Intended uses & limitations

#### How to use

The model was trained on data with tags formatted using [IOB](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)), like those used in common tasks such as Named Entity Recognition (NER). Thus, you can use this model with a Transformers NER *pipeline*.
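For illustration, here is how a hedged phrase such as *it seems that* (one of the Confidence Hedged examples in the category table below) is encoded in this scheme; the label strings shown follow the category names used in the evaluation tables:

```
It      B-ConfidenceHedged
seems   I-ConfidenceHedged
that    I-ConfidenceHedged
prices  O
rose    O
```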
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("browndw/docusco-bert")
model = AutoModelForTokenClassification.from_pretrained("browndw/docusco-bert")

# A standard NER pipeline works because the model uses IOB-formatted labels.
nlp = pipeline("ner", model=model, tokenizer=tokenizer)

example = "Globalization is the process of interaction and integration among people, companies, and governments worldwide."

ds_results = nlp(example)
print(ds_results)
```
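By default the pipeline returns one prediction per word piece. Continuing the example above, recent versions of `transformers` can also merge contiguous **B**/**I** word pieces into labeled spans; `aggregation_strategy` is a standard pipeline argument rather than anything specific to this model:

```python
# Merge word pieces and contiguous B-/I- tokens into labeled spans.
nlp_spans = pipeline("ner", model=model, tokenizer=tokenizer,
                     aggregation_strategy="simple")
print(nlp_spans(example))
```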
#### Limitations and bias

This model is limited by its training dataset of American English texts. Moreover, the current version is trained on only a small subset of the corpus. The goal is to train later versions on more data, which should increase accuracy.

## Training data
This model was fine-tuned on data from the Corpus of Contemporary American English ([COCA](https://www.english-corpora.org/coca/)). The training data contain chunks of text randomly sampled from 5 text-types: Academic, Fiction, Magazine, News, and Spoken.

Typically, BERT models are trained on sentence segments. However, DocuScope tags can span sentences. Thus, the data were split into chunks that don't split **B + I** sequences and that end with sentence-final punctuation marks (i.e., a period, question mark, or exclamation point).
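A minimal sketch of that chunking rule follows. This is an illustration rather than the authors' actual preprocessing code, and the `target_len` threshold is a hypothetical knob:

```python
# Illustrative sketch of the chunking rule described above (not the
# authors' preprocessing code; target_len is a hypothetical parameter).
def chunk_tagged_tokens(tagged, target_len=200):
    """Split a list of (token, iob_tag) pairs into chunks that end on
    sentence-final punctuation and never cut a B-/I- sequence in half."""
    chunks, current = [], []
    for i, (token, tag) in enumerate(tagged):
        current.append((token, tag))
        next_tag = tagged[i + 1][1] if i + 1 < len(tagged) else "O"
        at_sentence_end = token in {".", "?", "!"}
        continues_span = next_tag.startswith("I-")  # next token continues a tagged span
        if at_sentence_end and not continues_span and len(current) >= target_len:
            chunks.append(current)
            current = []
    if current:  # keep any trailing partial chunk
        chunks.append(current)
    return chunks
```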
Additionally, the order of the chunks was randomized prior to sampling, and stratified sampling was used to provide enough training data for low-frequency categories. The resulting training data consist of:

* 21,460,177 tokens
* 15,796,305 chunks

The specific counts for each category appear in the following table.
Category|Count
-|-
O|3528038
Syntactic Complexity|2032808
Character|1413771
Description|1224744
Narrative|1159201
Negative|651012
Academic Terms|620932
Interactive|594908
Information Exposition|578228
Positive|463914
Force Stressed|432631
Information Topics|394155
First Person|249744
Metadiscourse Cohesive|240822
Strategic|238255
Public Terms|234213
Reasoning|213775
Information Place|187249
Information States|173146
Information Report Verbs|119092
Confidence High|112861
Confidence Hedged|110008
Future|96101
Inquiry|94995
Contingent|94860
Information Change|89063
Metadiscourse Interactive|84033
Updates|81424
Citation|71241
Facilitate|50451
Uncertainty|35644
Academic Writing Moves|29352
Information Change Positive|28475
Responsibility|25362
Citation Authority|22414
Information Change Negative|15612
Confidence Low|2876
Citation Hedged|895
**Total**|15796305
## Training procedure

This model was trained on a single 2.3 GHz Dual-Core Intel Core i5 with the recommended hyperparameters from the [original BERT paper](https://arxiv.org/pdf/1810.04805).
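For orientation, a minimal fine-tuning sketch appears below. It is not the authors' training script: the base checkpoint (`bert-base-cased`), the truncated label set, the one-chunk toy dataset, and the specific hyperparameter values are all assumptions; the BERT paper recommends selecting a learning rate from {5e-5, 3e-5, 2e-5}, a batch size of 16 or 32, and 2-4 epochs.

```python
from datasets import Dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

# Hypothetical, truncated label set; the full model uses B-/I- pairs for all
# 37 DocuScope categories plus O.
labels = ["O", "B-ConfidenceHedged", "I-ConfidenceHedged"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # assumed base checkpoint
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(labels))

# One toy IOB-tagged chunk, aligned to word pieces (-100 masks special tokens).
words = ["It", "seems", "that", "prices", "rose", "."]
tags = [1, 2, 2, 0, 0, 0]
enc = tokenizer(words, is_split_into_words=True)
aligned = [-100 if w is None else tags[w] for w in enc.word_ids()]
train_dataset = Dataset.from_dict({
    "input_ids": [enc["input_ids"]],
    "attention_mask": [enc["attention_mask"]],
    "labels": [aligned],
})

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="docusco-bert", learning_rate=2e-5,
                           per_device_train_batch_size=32, num_train_epochs=4),
    train_dataset=train_dataset,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```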
## Eval results

### Overall

metric|test
-|-
f1|0.927
accuracy|0.943
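Scores of this kind are conventionally computed with the [seqeval](https://github.com/chakki-works/seqeval) library, which evaluates IOB sequences at the span level. The snippet below is a minimal sketch of that scoring convention on made-up data; the card does not include the actual evaluation script:

```python
# Span-level F1 and token-level accuracy over IOB sequences (illustrative data).
from seqeval.metrics import accuracy_score, classification_report, f1_score

y_true = [["B-ConfidenceHedged", "I-ConfidenceHedged", "O", "O"]]
y_pred = [["B-ConfidenceHedged", "I-ConfidenceHedged", "O", "B-Future"]]

print(f1_score(y_true, y_pred))        # entity-level F1
print(accuracy_score(y_true, y_pred))  # token-level accuracy
print(classification_report(y_true, y_pred))
```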
### By category

category|precision|recall|f1-score|support
-|-|-|-|-
AcademicTerms|0.91|0.92|0.92|486399
AcademicWritingMoves|0.76|0.82|0.79|20017
Character|0.94|0.95|0.94|1260272
Citation|0.92|0.94|0.93|50812
CitationAuthority|0.86|0.88|0.87|17798
CitationHedged|0.91|0.94|0.92|632
ConfidenceHedged|0.94|0.96|0.95|90393
ConfidenceHigh|0.92|0.94|0.93|113569
ConfidenceLow|0.79|0.81|0.80|2556
Contingent|0.92|0.94|0.93|81366
Description|0.87|0.89|0.88|1098598
Facilitate|0.87|0.90|0.89|41760
FirstPerson|0.96|0.98|0.97|330658
ForceStressed|0.93|0.94|0.93|436188
Future|0.90|0.93|0.92|93365
InformationChange|0.88|0.91|0.89|72813
InformationChangeNegative|0.83|0.85|0.84|12740
InformationChangePositive|0.82|0.86|0.84|22994
InformationExposition|0.94|0.95|0.95|468078
InformationPlace|0.95|0.96|0.96|147688
InformationReportVerbs|0.91|0.93|0.92|95563
InformationStates|0.95|0.95|0.95|139429
InformationTopics|0.90|0.92|0.91|328152
Inquiry|0.85|0.89|0.87|79030
Interactive|0.95|0.96|0.95|602857
MetadiscourseCohesive|0.97|0.98|0.98|195548
MetadiscourseInteractive|0.92|0.94|0.93|73159
Narrative|0.92|0.94|0.93|1023452
Negative|0.88|0.89|0.88|645810
Positive|0.87|0.89|0.88|409775
PublicTerms|0.91|0.92|0.91|184108
Reasoning|0.93|0.95|0.94|169208
Responsibility|0.83|0.87|0.85|21819
Strategic|0.88|0.90|0.89|193768
SyntacticComplexity|0.95|0.96|0.96|1635918
Uncertainty|0.87|0.91|0.89|33684
Updates|0.91|0.93|0.92|77760
**micro avg**|0.92|0.93|0.93|10757736
**macro avg**|0.90|0.92|0.91|10757736
**weighted avg**|0.92|0.93|0.93|10757736
## DocuScope Category Descriptions

Category (Cluster)|Description|Examples
-|-|-
Academic Terms|Abstract, rare, specialized, or discipline-specific terms that are indicative of informationally dense writing|*market price*, *storage capacity*, *regulatory*, *distribution*
Academic Writing Moves|Phrases and terms that indicate academic writing moves, which are common in research genres and are derived from the work of Swales (1981) and Cotos et al. (2015, 2017)|*in the first section*, *the problem is that*, *payment methodology*, *point of contention*
Character|References multiple dimensions of a character or human being as a social agent, both individual and collective|*Pauline*, *her*, *personnel*, *representatives*
Citation|Language that indicates the attribution of information to, or citation of, another source|*according to*, *is proposing that*, *quotes from*
Citation Authority|Referencing the citation of another source that is represented as true and not arguable|*confirm that*, *provide evidence*, *common sense*
Citation Hedged|Referencing the citation of another source that is presented as arguable|*suggest that*, *just one opinion*
Confidence Hedged|Referencing language that presents a claim as uncertain|*tends to get*, *maybe*, *it seems that*
Confidence High|Referencing language that presents a claim with certainty|*most likely*, *ensure that*, *know that*, *obviously*
Confidence Low|Referencing language that presents a claim as extremely unlikely|*unlikely*, *out of the question*, *impossible*
Contingent|Referencing contingency, typically contingency in the world, rather than contingency in one's knowledge|*subject to*, *if possible*, *just in case*, *hypothetically*
Description|Language that evokes sights, sounds, smells, touches, and tastes, as well as scenes and objects|*stay quiet*, *gas-fired*, *solar panels*, *soft*, *on my desk*
Facilitate|Language that enables or directs one through specific tasks and actions|*let me*, *worth a try*, *I would suggest*
First Person|This cluster captures first-person references|*I*, *as soon as I*, *we have been*
Force Stressed|Language that is forceful and stressed, often using emphatics, comparative forms, or superlative forms|*really good*, *the sooner the better*, *necessary*
Future|Referencing future actions, states, or desires|*will be*, *hope to*, *expected changes*
Information Change|Referencing changes of information, particularly changes that are more neutral|*changes*, *revised*, *growth*, *modification to*
Information Change Negative|Referencing negative change|*going downhill*, *slow erosion*, *get worse*
Information Change Positive|Referencing positive change|*improving*, *accrued interest*, *boost morale*
Information Exposition|Information in the form of expository devices, or language that describes or explains, frequently in regard to quantities and comparisons|*final amount*, *several*, *three*, *compare*, *80%*
Information Place|Language designating places|*the city*, *surrounding areas*, *Houston*, *home*
Information Report Verbs|Informational verbs and verb phrases of reporting|*report*, *posted*, *release*, *point out*
Information States|Referencing information states, or states of being|*is*, *are*, *existing*, *been*
Information Topics|Referencing topics, usually nominal subjects or objects, that indicate the “aboutness” of a text|*time*, *money*, *stock price*, *phone interview*
Inquiry|Referencing inquiry, or language that points to some kind of inquiry or investigation|*find out*, *let me know if you have any questions*, *wondering if*
Interactive|Addresses from the author to the reader or from persons in the text to other persons. The address comes in the language of everyday conversation, colloquy, exchange, questions, attention-getters, feedback, interactive genre markers, and the use of the second person|*can you*, *thank you for*, *please see*, *sounds good to me*
Metadiscourse Cohesive|The use of words to build cohesive markers that help the reader navigate the text and signal linkages in the text, which are often additive or contrastive|*or*, *but*, *also*, *on the other hand*, *notwithstanding*, *that being said*
Metadiscourse Interactive|The use of words to build cohesive markers that interact with the reader|*I agree*, *let’s talk*, *by the way*
Narrative|Language that involves people, description, and events extending in time|*today*, *tomorrow*, *during the*, *this weekend*
Negative|Referencing dimensions of negativity, including negative acts, emotions, relations, and values|*does not*, *sorry for*, *problems*, *confusion*
Positive|Referencing dimensions of positivity, including actions, emotions, relations, and values|*thanks*, *approval*, *agreement*, *looks good*
Public Terms|Referencing public terms, concepts from public language, media, the language of authority, institutions, and responsibility|*discussion*, *amendment*, *corporation*, *authority*, *settlement*
Reasoning|Language that has a reasoning focus, supporting inferences about cause, consequence, generalization, concession, and linear inference either from premise to conclusion or conclusion to premise|*because*, *therefore*, *analysis*, *even if*, *as a result*, *indicating that*
Responsibility|Referencing the language of responsibility|*supposed to*, *requirements*, *obligations*
Strategic|This dimension is active when the text structures strategies: activism, advantage-seeking, game-playing cognition, plans, and goal-seeking|*plan*, *trying to*, *strategy*, *decision*, *coordinate*, *look at the*
Syntactic Complexity|The features in this category are often what are called “function words,” like determiners and prepositions|*the*, *to*, *for*, *in*, *a lot of*
Uncertainty|References uncertainty, when confidence levels are unknown|*kind of*, *I have no idea*, *for some reason*
Updates|References updates that anticipate someone searching for information and receiving it|*already*, *a new*, *now that*, *here are some*
### BibTeX entry and citation info

```
@incollection{ishizaki2012computer,
  title     = {Computer-aided rhetorical analysis},
  author    = {Ishizaki, Suguru and Kaufer, David},
  booktitle = {Applied natural language processing: Identification, investigation and resolution},
  pages     = {276--296},
  year      = {2012},
  publisher = {IGI Global},
  url       = {https://www.igi-global.com/chapter/content/61054}
}
```
```
@article{DBLP:journals/corr/abs-1810-04805,
  author        = {Jacob Devlin and Ming{-}Wei Chang and Kenton Lee and Kristina Toutanova},
  title         = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language Understanding},
  journal       = {CoRR},
  volume        = {abs/1810.04805},
  year          = {2018},
  url           = {http://arxiv.org/abs/1810.04805},
  archivePrefix = {arXiv},
  eprint        = {1810.04805},
  timestamp     = {Tue, 30 Oct 2018 20:39:56 +0100},
  biburl        = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}
```