---
language: en
datasets: COCA
---

# docusco-bert

## Model description

**docusco-bert** is a fine-tuned BERT model that is ready to use for **token classification**. The model was trained on data from the Corpus of Contemporary American English ([COCA](https://www.english-corpora.org/coca/)) and classifies tokens and token sequences according to a system developed for the [**DocuScope**](https://www.cmu.edu/dietrich/english/research-and-publications/docuscope.html) dictionary-based tagger. Descriptions of the categories are included in a table below.

## About DocuScope

DocuScope is a dictionary-based tagger that has been developed at Carnegie Mellon University by **David Kaufer** and **Suguru Ishizaki** since the early 2000s. Its categories are rhetorical in their orientation (as opposed to part-of-speech tags, for example, which are morphosyntactic).

DocuScope has been used in [a wide variety of studies](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=docuscope&btnG=). Here, for example, is [a short analysis of King Lear](https://graphics.cs.wisc.edu/WP/vep/2017/02/14/guest-post-data-mining-king-lear/), and here is [a published study of Tweets](https://journals.sagepub.com/doi/full/10.1177/2055207619844865).

## Intended uses & limitations

#### How to use

The model was trained on data with tags formatted using [IOB](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)), like those used in common tasks such as Named Entity Recognition (NER). Thus, you can use this model with a Transformers NER *pipeline*.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("browndw/docusco-bert")
model = AutoModelForTokenClassification.from_pretrained("browndw/docusco-bert")

# Because the tags follow the IOB scheme, the standard NER pipeline applies
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "Globalization is the process of interaction and integration among people, companies, and governments worldwide."

ds_results = nlp(example)
print(ds_results)
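
The raw output above is one prediction per word piece, each still carrying its B- or I- prefix. If you prefer one contiguous span per category, recent versions of `transformers` (roughly v4.7+) can merge the IOB continuations for you; a minimal sketch:

```python
# Merge B-/I- word-piece predictions into contiguous category spans
nlp_grouped = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",  # groups adjacent tokens of the same category
)

for span in nlp_grouped(example):
    print(span["entity_group"], span["word"], round(span["score"], 3))
```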

#### Limitations and bias

This model is limited by its training dataset of American English texts. Moreover, the current version is trained on only a small subset of the corpus. The goal is to train later versions on more data, which should increase accuracy.

## Training data

This model was fine-tuned on data from the Corpus of Contemporary American English ([COCA](https://www.english-corpora.org/coca/)). The training data contain 1,500 randomly sampled texts from each of 5 text-types: Academic, Fiction, Magazine, News, and Spoken.

#### Texts, chunks, and tokens per dataset

Dataset|Texts|Chunks|Tokens
-|-|-|-
Train|7,500|1,167,584|32,203,828
Test|500|58,117|1,567,997
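
The chunk counts above reflect that each text is split into smaller units before training, since BERT can only attend over a fixed-length input window. The exact chunking scheme is not documented here, but a sketch of the general approach, with an assumed window size, looks like this:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def chunk_ids(text: str, window: int = 256) -> list[list[int]]:
    """Split a text into consecutive windows of at most `window` word pieces.

    The window size here is an assumption for illustration, not the value
    used to build this model's training data.
    """
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    return [ids[i : i + window] for i in range(0, len(ids), window)]
```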

## Training procedure

This model was trained on a single 2.3 GHz Dual-Core Intel Core i5 with recommended hyperparameters from the [original BERT paper](https://arxiv.org/pdf/1810.04805).
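
For token classification, those recommendations amount to a small grid: batch size 16 or 32, learning rate 5e-5, 3e-5, or 2e-5, and 2-4 epochs. A minimal sketch of such a run with the `transformers` Trainer, picking one point from that grid (the dataset preparation and the exact values used for this model are not documented here):

```python
from transformers import (
    AutoModelForTokenClassification,
    Trainer,
    TrainingArguments,
)

# Placeholder: a tokenized dataset with aligned IOB label ids, built elsewhere
train_dataset = ...

# 37 DocuScope categories x (B-, I-) tags, plus "O"
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=75
)

args = TrainingArguments(
    output_dir="docusco-bert",
    learning_rate=2e-5,              # one point from the recommended grid
    per_device_train_batch_size=16,  # likewise an assumed choice
    num_train_epochs=3,
)

Trainer(model=model, args=args, train_dataset=train_dataset).train()
```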

## Eval results

### Overall

metric|test
-|-
f1|.743
accuracy|.801
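
Scores like these can be computed with the `seqeval` library, a standard evaluation tool for IOB-tagged sequences. A minimal sketch, assuming `y_true` and `y_pred` hold the gold and predicted tag sequences for each test chunk:

```python
from seqeval.metrics import accuracy_score, classification_report, f1_score

# Toy stand-ins: each inner list is the tag sequence for one chunk
y_true = [["B-Narrative", "I-Narrative", "O", "B-Positive"]]
y_pred = [["B-Narrative", "I-Narrative", "O", "O"]]

print(f1_score(y_true, y_pred))               # span-level F1, as reported above
print(accuracy_score(y_true, y_pred))         # token-level accuracy
print(classification_report(y_true, y_pred))  # per-category breakdown, as below
```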

### By category

category|precision|recall|f1-score|support
-|-|-|-|-
AcademicTerms|0.76|0.77|0.76|140805
AcademicWritingMoves|0.36|0.46|0.40|8182
Character|0.74|0.78|0.76|123856
Citation|0.73|0.81|0.77|13428
CitationAuthority|0.55|0.49|0.51|4552
CitationHedged|0.58|0.89|0.70|285
ConfidenceHedged|0.76|0.84|0.79|14765
ConfidenceHigh|0.64|0.72|0.68|11462
ConfidenceLow|0.70|0.39|0.50|380
Contingent|0.68|0.69|0.69|9537
Description|0.60|0.67|0.63|108186
Facilitate|0.63|0.63|0.63|7421
FirstPerson|0.62|0.73|0.67|6235
ForceStressed|0.65|0.72|0.69|37910
Future|0.63|0.69|0.66|9049
InformationChange|0.64|0.72|0.68|14560
InformationChangeNegative|0.59|0.57|0.58|1840
InformationChangePositive|0.61|0.58|0.60|4265
InformationExposition|0.80|0.83|0.82|84977
InformationPlace|0.80|0.82|0.81|18783
InformationReportVerbs|0.71|0.79|0.75|17572
InformationStates|0.74|0.80|0.77|21048
InformationTopics|0.69|0.72|0.70|58677
Inquiry|0.50|0.58|0.53|12735
Interactive|0.64|0.70|0.67|18135
MetadiscourseCohesive|0.90|0.93|0.92|33312
MetadiscourseInteractive|0.54|0.62|0.58|6888
Narrative|0.70|0.76|0.73|116896
Negative|0.63|0.69|0.66|60534
Positive|0.60|0.67|0.63|54374
PublicTerms|0.70|0.74|0.72|38229
Reasoning|0.71|0.76|0.74|30157
Responsibility|0.59|0.63|0.61|3451
Strategic|0.60|0.62|0.61|28064
SyntacticComplexity|0.83|0.87|0.85|297387
Uncertainty|0.43|0.44|0.43|2915
Updates|0.52|0.53|0.53|6156
micro avg|0.72|0.77|0.74|1427008
macro avg|0.65|0.69|0.67|1427008
weighted avg|0.72|0.77|0.74|1427008

## DocuScope Category Descriptions

Category (Cluster)|Description|Examples
-|-|-
Academic Terms|Abstract, rare, specialized, or disciplinary-specific terms that are indicative of informationally dense writing|*market price*, *storage capacity*, *regulatory*, *distribution*
Academic Writing Moves|Phrases and terms that indicate academic writing moves, which are common in research genres and are derived from the work of Swales (1981) and Cotos et al. (2015, 2017)|*in the first section*, *the problem is that*, *payment methodology*, *point of contention*
Character|References multiple dimensions of a character or human being as a social agent, both individual and collective|*Pauline*, *her*, *personnel*, *representatives*
Citation|Language that indicates the attribution of information to, or citation of, another source|*according to*, *is proposing that*, *quotes from*
Citation Authorized|Referencing the citation of another source that is represented as true and not arguable|*confirm that*, *provide evidence*, *common sense*
Citation Hedged|Referencing the citation of another source that is presented as arguable|*suggest that*, *just one opinion*
Confidence Hedged|Referencing language that presents a claim as uncertain|*tends to get*, *maybe*, *it seems that*
Confidence High|Referencing language that presents a claim with certainty|*most likely*, *ensure that*, *know that*, *obviously*
Confidence Low|Referencing language that presents a claim as extremely unlikely|*unlikely*, *out of the question*, *impossible*
Contingent|Referencing contingency, typically contingency in the world rather than contingency in one's knowledge|*subject to*, *if possible*, *just in case*, *hypothetically*
Description|Language that evokes sights, sounds, smells, touches, and tastes, as well as scenes and objects|*stay quiet*, *gas-fired*, *solar panels*, *soft*, *on my desk*
Facilitate|Language that enables or directs one through specific tasks and actions|*let me*, *worth a try*, *I would suggest*
First Person|This cluster captures first-person references|*I*, *as soon as I*, *we have been*
Force Stressed|Language that is forceful and stressed, often using emphatics, comparative forms, or superlative forms|*really good*, *the sooner the better*, *necessary*
Future|Referencing future actions, states, or desires|*will be*, *hope to*, *expected changes*
Information Change|Referencing changes of information, particularly changes that are more neutral|*changes*, *revised*, *growth*, *modification to*
Information Change Negative|Referencing negative change|*going downhill*, *slow erosion*, *get worse*
Information Change Positive|Referencing positive change|*improving*, *accrued interest*, *boost morale*
Information Exposition|Information in the form of expository devices, or language that describes or explains, frequently with regard to quantities and comparisons|*final amount*, *several*, *three*, *compare*, *80%*
Information Place|Language designating places|*the city*, *surrounding areas*, *Houston*, *home*
Information Report Verbs|Informational verbs and verb phrases of reporting|*report*, *posted*, *release*, *point out*
Information States|Referencing information states, or states of being|*is*, *are*, *existing*, *been*
Information Topics|Referencing topics, usually nominal subjects or objects, that indicate the “aboutness” of a text|*time*, *money*, *stock price*, *phone interview*
Inquiry|Referencing inquiry, or language that points to some kind of inquiry or investigation|*find out*, *let me know if you have any questions*, *wondering if*
Interactive|Addresses from the author to the reader or from persons in the text to other persons. The address comes in the language of everyday conversation, colloquy, exchange, questions, attention-getters, feedback, interactive genre markers, and the use of the second person|*can you*, *thank you for*, *please see*, *sounds good to me*
Metadiscourse Cohesive|The use of words to build cohesive markers that help the reader navigate the text and signal linkages in the text, which are often additive or contrastive|*or*, *but*, *also*, *on the other hand*, *notwithstanding*, *that being said*
Metadiscourse Interactive|The use of words to build cohesive markers that interact with the reader|*I agree*, *let’s talk*, *by the way*
Narrative|Language that involves people, description, and events extending in time|*today*, *tomorrow*, *during the*, *this weekend*
Negative|Referencing dimensions of negativity, including negative acts, emotions, relations, and values|*does not*, *sorry for*, *problems*, *confusion*
Positive|Referencing dimensions of positivity, including actions, emotions, relations, and values|*thanks*, *approval*, *agreement*, *looks good*
Public Terms|Referencing public terms, concepts from public language, media, the language of authority, institutions, and responsibility|*discussion*, *amendment*, *corporation*, *authority*, *settlement*
Reasoning|Language that has a reasoning focus, supporting inferences about cause, consequence, generalization, concession, and linear inference either from premise to conclusion or conclusion to premise|*because*, *therefore*, *analysis*, *even if*, *as a result*, *indicating that*
Responsibility|Referencing the language of responsibility|*supposed to*, *requirements*, *obligations*
Strategic|This dimension is active when the text structures strategies: activism, advantage-seeking, game-playing cognition, plans, and goal-seeking|*plan*, *trying to*, *strategy*, *decision*, *coordinate*, *look at the*
Syntactic Complexity|The features in this category are often what are called “function words,” like determiners and prepositions|*the*, *to*, *for*, *in*, *a lot of*
Uncertainty|References uncertainty, when confidence levels are unknown|*kind of*, *I have no idea*, *for some reason*
Updates|References updates that anticipate someone searching for information and receiving it|*already*, *a new*, *now that*, *here are some*
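
Note that the model's own label set uses the compact CamelCase names from the evaluation tables (e.g., `B-AcademicTerms`), while this table uses spaced display names. A small hypothetical helper to convert between the two:

```python
import re

def display_name(tag: str) -> str:
    """Turn an IOB label like 'B-AcademicTerms' into 'Academic Terms'."""
    category = tag.split("-", 1)[-1]  # strip the B-/I- prefix, if any
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", category)

print(display_name("B-AcademicTerms"))     # -> Academic Terms
print(display_name("I-ConfidenceHedged"))  # -> Confidence Hedged
```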

### BibTeX entry and citation info

```
@incollection{ishizaki2012computer,
  title     = {Computer-aided rhetorical analysis},
  author    = {Ishizaki, Suguru and Kaufer, David},
  booktitle = {Applied natural language processing: Identification, investigation and resolution},
  pages     = {276--296},
  year      = {2012},
  publisher = {IGI Global},
  url       = {https://www.igi-global.com/chapter/content/61054}
}
```

```
@article{DBLP:journals/corr/abs-1810-04805,
  author        = {Jacob Devlin and Ming{-}Wei Chang and Kenton Lee and Kristina Toutanova},
  title         = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language Understanding},
  journal       = {CoRR},
  volume        = {abs/1810.04805},
  year          = {2018},
  url           = {http://arxiv.org/abs/1810.04805},
  archivePrefix = {arXiv},
  eprint        = {1810.04805},
  timestamp     = {Tue, 30 Oct 2018 20:39:56 +0100},
  biburl        = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}
```