browndw committed
Commit 108c904 · 1 Parent(s): 4238a92

commit from user

Files changed (6)
  1. README.md +170 -0
  2. config.json +180 -0
  3. pytorch_model.bin +3 -0
  4. tf_model.h5 +3 -0
  5. tokenizer_config.json +1 -0
  6. vocab.txt +0 -0
README.md ADDED
@@ -0,0 +1,170 @@
+ ---
+ language: en
+ datasets: COCA
+ ---
+ # docusco-bert
+
+ ## Model description
+
+ **docusco-bert** is a fine-tuned BERT model that is ready to use for **token classification**. The model was trained on data from the Corpus of Contemporary American English ([COCA](https://www.english-corpora.org/coca/)) and classifies tokens and token sequences according to a system developed for the [DocuScope](https://www.cmu.edu/dietrich/english/research-and-publications/docuscope.html) dictionary-based tagger. Descriptions of the categories are included in a table below.
+
+ ## About DocuScope
+ DocuScope is a dictionary-based tagger that has been developed at Carnegie Mellon University by David Kaufer and Suguru Ishizaki since the early 2000s. Its categories are rhetorical in their orientation (as opposed to part-of-speech tags, for example, which are morphosyntactic).
+
+ ## Intended uses & limitations
+
+ #### How to use
+
+ The model was trained on data with tags formatted using [IOB](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)), like those used in common tasks such as Named Entity Recognition (NER). Thus, you can use this model with a Transformers NER *pipeline*.
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForTokenClassification
+ from transformers import pipeline
+
+ tokenizer = AutoTokenizer.from_pretrained("browndw/docusco-bert")
+ model = AutoModelForTokenClassification.from_pretrained("browndw/docusco-bert")
+
+ nlp = pipeline("ner", model=model, tokenizer=tokenizer)
+ example = "My name is Wolfgang and I live in Berlin"
+
+ ner_results = nlp(example)
+ print(ner_results)
+ ```
+
+ #### Limitations and bias
+
+ This model is limited by its training dataset: texts sampled from COCA, which represent particular text-types and a specific span of time. It may not generalize well to use cases in other domains. Furthermore, the model occasionally assigns tags to subword tokens, and post-processing of results may be necessary to handle those cases.
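
One way to do the post-processing mentioned above is to glue WordPiece continuation pieces (those starting with `##`) back onto the preceding token and then group `B-`/`I-` tags into spans. The sketch below is only an illustration under stated assumptions: it expects a list of dicts with the `word`, `entity`, and `index` keys that the Transformers token-classification pipeline returns, the helper name `merge_iob` is hypothetical, and the sample predictions are invented for the example rather than real model output.

```python
def merge_iob(tokens):
    """Group pipeline-style token predictions into category/text spans.

    Assumes each item has "word", "entity", and "index" keys. A "##" piece
    that directly follows a kept token is appended to that token's span,
    whatever tag the piece itself received.
    """
    spans = []
    prev_index = None  # index of the last token that went into a span
    for tok in tokens:
        word, tag, index = tok["word"], tok["entity"], tok["index"]
        adjacent = prev_index is not None and index == prev_index + 1

        if word.startswith("##") and adjacent:
            # Subword continuation: extend the current span without a space.
            spans[-1]["text"] += word[2:]
            prev_index = index
            continue
        if tag in ("O", "PAD"):
            continue  # untagged or padding tokens never open a span

        prefix, _, category = tag.partition("-")
        text = word[2:] if word.startswith("##") else word
        if prefix == "I" and adjacent and spans[-1]["category"] == category:
            # Continuation of a multi-token sequence in the same category.
            spans[-1]["text"] += " " + text
        else:
            spans.append({"category": category, "text": text})
        prev_index = index
    return spans


# Invented predictions, shaped like pipeline output (scores omitted).
predictions = [
    {"word": "according", "entity": "B-Citation", "index": 1},
    {"word": "to", "entity": "I-Citation", "index": 2},
    {"word": "the", "entity": "B-SyntacticComplexity", "index": 3},
    {"word": "report", "entity": "B-InformationReportVerbs", "index": 4},
    {"word": "##ers", "entity": "B-Character", "index": 5},
]

for span in merge_iob(predictions):
    print(span["category"], "->", span["text"])
```

Letting the first piece of a word decide the category is a design choice, not something the model card prescribes; a majority vote over the pieces or taking the highest-scoring piece would be equally defensible.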
+
+ ## Training data
+
+ This model was fine-tuned on data from the Corpus of Contemporary American English ([COCA](https://www.english-corpora.org/coca/)). The training data contain 1,500 randomly sampled texts from each of 5 text-types: Academic, Fiction, Magazine, News, and Spoken.
+
+ #### Number of texts, chunks, and tokens per dataset
+ Dataset |Texts |Chunks |Tokens
+ -|-|-|-
+ Train |7,500 |1,167,584 |32,203,828
+ Test |500 |58,117 |1,567,997
+
+ ## Training procedure
+
+ This model was trained on a single 2.3 GHz Dual-Core Intel Core i5 using the hyperparameters recommended in the [original BERT paper](https://arxiv.org/pdf/1810.04805).
+
+ ## Eval results
+ ### Overall
+ metric|test (%)
+ -|-
+ f1 |66.3
+ accuracy |74.7
+
+ ### By category
+ Category|precision|recall|f1-score|support
+ -|-|-|-|-
+ AcademicTerms|0.69|0.70|0.69|54204
+ AcademicWritingMoves|0.31|0.40|0.35|2860
+ Character|0.68|0.70|0.69|86213
+ Citation|0.61|0.47|0.53|4798
+ CitationAuthority|0.48|0.39|0.43|1871
+ CitationHedged|0.58|0.81|0.67|135
+ ConfidenceHedged|0.65|0.74|0.69|8209
+ ConfidenceHigh|0.51|0.61|0.56|8755
+ ConfidenceLow|0.88|0.03|0.07|202
+ Contingent|0.58|0.60|0.59|6802
+ Description|0.56|0.62|0.59|98697
+ Facilitate|0.57|0.54|0.55|3597
+ FirstPerson|0.61|0.70|0.65|21285
+ ForceStressed|0.55|0.59|0.57|29420
+ Future|0.53|0.60|0.56|7253
+ InformationChange|0.56|0.61|0.59|7427
+ InformationChangeNegative|0.52|0.50|0.51|1338
+ InformationChangePositive|0.53|0.48|0.51|2161
+ InformationExposition|0.75|0.77|0.76|43144
+ InformationPlace|0.78|0.84|0.81|14058
+ InformationReportVerbs|0.64|0.69|0.66|8418
+ InformationStates|0.69|0.74|0.71|11195
+ InformationTopics|0.61|0.65|0.63|30214
+ Inquiry|0.37|0.45|0.41|7109
+ Interactive|0.59|0.64|0.61|32523
+ MetadiscourseCohesive|0.89|0.91|0.90|17301
+ MetadiscourseInteractive|0.40|0.48|0.44|4245
+ Narrative|0.63|0.69|0.66|91799
+ Negative|0.55|0.60|0.58|49234
+ Positive|0.49|0.57|0.53|32228
+ PublicTerms|0.63|0.64|0.63|18302
+ Reasoning|0.71|0.75|0.73|16178
+ Responsibility|0.51|0.43|0.47|1837
+ Strategic|0.47|0.52|0.50|16889
+ SyntacticComplexity|0.78|0.81|0.80|156361
+ Uncertainty|0.36|0.32|0.34|2680
+ Updates|0.44|0.39|0.41|6036
+ -|-|-|-|-
+ micro avg|0.64|0.68|0.66|904978
+ macro avg|0.59|0.59|0.58|904978
+ weighted avg|0.64|0.68|0.66|904978
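
The f1-score column is the harmonic mean of precision and recall, so individual rows can be sanity-checked from the other two columns. A minimal sketch, using rounded values copied from the table above (a recomputed figure can therefore differ from the published one in the last digit); `f1_score` here is a hypothetical local helper, not a library call:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the by-category table.
rows = {
    "MetadiscourseCohesive": (0.89, 0.91),
    "InformationPlace": (0.78, 0.84),
    "Uncertainty": (0.36, 0.32),
}

for name, (p, r) in rows.items():
    print(f"{name}: {f1_score(p, r):.2f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a category with high precision but very low recall (ConfidenceLow, for instance) still ends up with a low f1-score.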
+
+
+ ## DocuScope Category Descriptions
+
+ Category (Cluster)|Description|Examples
+ -|-|-
+ Academic Terms|Abstract, rare, specialized, or discipline-specific terms that are indicative of informationally dense writing|*market price*, *storage capacity*, *regulatory*, *distribution*
+ Academic Writing Moves|Phrases and terms that indicate academic writing moves, which are common in research genres and are derived from the work of Swales (1981) and Cotos et al. (2015, 2017)|*in the first section*, *the problem is that*, *payment methodology*, *point of contention*
+ Character|References multiple dimensions of a character or human being as a social agent, both individual and collective|*Pauline*, *her*, *personnel*, *representatives*
+ Citation|Language that indicates the attribution of information to, or citation of, another source|*according to*, *is proposing that*, *quotes from*
+ Citation Authority|Referencing the citation of another source that is represented as true and not arguable|*confirm that*, *provide evidence*, *common sense*
+ Citation Hedged|Referencing the citation of another source that is presented as arguable|*suggest that*, *just one opinion*
+ Confidence Hedged|Referencing language that presents a claim as uncertain|*tends to get*, *maybe*, *it seems that*
+ Confidence High|Referencing language that presents a claim with certainty|*most likely*, *ensure that*, *know that*, *obviously*
+ Confidence Low|Referencing language that presents a claim as extremely unlikely|*unlikely*, *out of the question*, *impossible*
+ Contingent|Referencing contingency, typically contingency in the world, rather than contingency in one's knowledge|*subject to*, *if possible*, *just in case*, *hypothetically*
+ Description|Language that evokes sights, sounds, smells, touches and tastes, as well as scenes and objects|*stay quiet*, *gas-fired*, *solar panels*, *soft*, *on my desk*
+ Facilitate|Language that enables or directs one through specific tasks and actions|*let me*, *worth a try*, *I would suggest*
+ First Person|Language in the first person|*I*, *as soon as I*, *we have been*
+ Force Stressed|Language that is forceful and stressed, often using emphatics, comparative forms, or superlative forms|*really good*, *the sooner the better*, *necessary*
+ Future|Referencing future actions, states, or desires|*will be*, *hope to*, *expected changes*
+ Information Change|Referencing changes of information, particularly changes that are more neutral|*changes*, *revised*, *growth*, *modification to*
+ Information Change Negative|Referencing negative change|*going downhill*, *slow erosion*, *get worse*
+ Information Change Positive|Referencing positive change|*improving*, *accrued interest*, *boost morale*
+ Information Exposition|Information in the form of expository devices, or language that describes or explains, frequently with regard to quantities and comparisons|*final amount*, *several*, *three*, *compare*, *80%*
+ Information Place|Language designating places|*the city*, *surrounding areas*, *Houston*, *home*
+ Information Report Verbs|Informational verbs and verb phrases of reporting|*report*, *posted*, *release*, *point out*
+ Information States|Referencing information states, or states of being|*is*, *are*, *existing*, *been*
+ Information Topics|Referencing topics, usually nominal subjects or objects, that indicate the “aboutness” of a text|*time*, *money*, *stock price*, *phone interview*
+ Inquiry|Referencing inquiry, or language that points to some kind of inquiry or investigation|*find out*, *let me know if you have any questions*, *wondering if*
+ Interactive|Addresses from the author to the reader or from persons in the text to other persons. The address comes in the language of everyday conversation, colloquy, exchange, questions, attention-getters, feedback, interactive genre markers, and the use of the second person.|*can you*, *thank you for*, *please see*, *sounds good to me*
+ Metadiscourse Cohesive|The use of words to build cohesive markers that help the reader navigate the text and signal linkages in the text, which are often additive or contrastive|*or*, *but*, *also*, *on the other hand*, *notwithstanding*, *that being said*
+ Metadiscourse Interactive|The use of words to build cohesive markers that interact with the reader|*I agree*, *let’s talk*, *by the way*
+ Narrative|Language that involves people, description, and events extending in time|*today*, *tomorrow*, *during the*, *this weekend*
+ Negative|Referencing dimensions of negativity, including negative acts, emotions, relations, and values|*does not*, *sorry for*, *problems*, *confusion*
+ Positive|Referencing dimensions of positivity, including actions, emotions, relations, and values|*thanks*, *approval*, *agreement*, *looks good*
+ Public Terms|Referencing public terms, concepts from public language, media, the language of authority, institutions, and responsibility|*discussion*, *amendment*, *corporation*, *authority*, *settlement*
+ Reasoning|Language that has a reasoning focus, supporting inferences about cause, consequence, generalization, concession, and linear inference either from premise to conclusion or conclusion to premise|*because*, *therefore*, *analysis*, *even if*, *as a result*, *indicating that*
+ Responsibility|Referencing the language of responsibility|*supposed to*, *requirements*, *obligations*
+ Strategic|This dimension is active when the text structures strategies, activism, advantage-seeking, game-playing cognition, plans, and goal-seeking.|*plan*, *trying to*, *strategy*, *decision*, *coordinate*, *look at the*
+ Syntactic Complexity|The features in this category are often what are called “function words,” like determiners and prepositions.|*the*, *to*, *for*, *in*, *a lot of*
+ Uncertainty|References uncertainty, when confidence levels are unknown|*kind of*, *I have no idea*, *for some reason*
+ Updates|References updates that anticipate someone searching for information and receiving it|*already*, *a new*, *now that*, *here are some*
+
+
+
+ ### BibTeX entry and citation info
+
+ ```
+ @article{DBLP:journals/corr/abs-1810-04805,
+   author    = {Jacob Devlin and
+                Ming{-}Wei Chang and
+                Kenton Lee and
+                Kristina Toutanova},
+   title     = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language
+                Understanding},
+   journal   = {CoRR},
+   volume    = {abs/1810.04805},
+   year      = {2018},
+   url       = {http://arxiv.org/abs/1810.04805},
+   archivePrefix = {arXiv},
+   eprint    = {1810.04805},
+   timestamp = {Tue, 30 Oct 2018 20:39:56 +0100},
+   biburl    = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
+   bibsource = {dblp computer science bibliography, https://dblp.org}
+ }
+ ```
+
config.json ADDED
@@ -0,0 +1,180 @@
+ {
+   "_name_or_path": "bert-base-cased",
+   "architectures": [
+     "BertForTokenClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "B-Description",
+     "1": "B-Responsibility",
+     "2": "I-AcademicWritingMoves",
+     "3": "B-Strategic",
+     "4": "I-Contingent",
+     "5": "I-InformationStates",
+     "6": "I-Interactive",
+     "7": "B-SyntacticComplexity",
+     "8": "B-InformationChange",
+     "9": "I-Description",
+     "10": "I-Narrative",
+     "11": "B-InformationTopics",
+     "12": "B-MetadiscourseInteractive",
+     "13": "I-InformationPlace",
+     "14": "I-Responsibility",
+     "15": "I-Reasoning",
+     "16": "B-InformationExposition",
+     "17": "B-ForceStressed",
+     "18": "B-ConfidenceHedged",
+     "19": "B-Character",
+     "20": "B-Updates",
+     "21": "I-InformationReportVerbs",
+     "22": "B-InformationChangePositive",
+     "23": "B-PublicTerms",
+     "24": "I-MetadiscourseCohesive",
+     "25": "O",
+     "26": "B-AcademicTerms",
+     "27": "I-MetadiscourseInteractive",
+     "28": "I-Updates",
+     "29": "I-Negative",
+     "30": "B-InformationPlace",
+     "31": "B-Interactive",
+     "32": "I-AcademicTerms",
+     "33": "I-CitationAuthority",
+     "34": "I-Citation",
+     "35": "B-Narrative",
+     "36": "I-PublicTerms",
+     "37": "B-CitationAuthority",
+     "38": "B-Reasoning",
+     "39": "I-InformationExposition",
+     "40": "I-Facilitate",
+     "41": "B-FirstPerson",
+     "42": "I-ConfidenceHedged",
+     "43": "I-FirstPerson",
+     "44": "I-Character",
+     "45": "B-ConfidenceLow",
+     "46": "B-MetadiscourseCohesive",
+     "47": "B-InformationChangeNegative",
+     "48": "B-Uncertainty",
+     "49": "B-AcademicWritingMoves",
+     "50": "I-ConfidenceLow",
+     "51": "I-Strategic",
+     "52": "I-SyntacticComplexity",
+     "53": "B-Negative",
+     "54": "I-Inquiry",
+     "55": "I-InformationChangeNegative",
+     "56": "I-InformationTopics",
+     "57": "B-Future",
+     "58": "I-ConfidenceHigh",
+     "59": "B-Positive",
+     "60": "B-CitationHedged",
+     "61": "I-CitationHedged",
+     "62": "I-ForceStressed",
+     "63": "B-Inquiry",
+     "64": "I-InformationChangePositive",
+     "65": "B-ConfidenceHigh",
+     "66": "I-Uncertainty",
+     "67": "B-InformationReportVerbs",
+     "68": "I-InformationChange",
+     "69": "B-Citation",
+     "70": "B-InformationStates",
+     "71": "I-Future",
+     "72": "B-Facilitate",
+     "73": "I-Positive",
+     "74": "B-Contingent",
+     "75": "PAD"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "B-AcademicTerms": 26,
+     "B-AcademicWritingMoves": 49,
+     "B-Character": 19,
+     "B-Citation": 69,
+     "B-CitationAuthority": 37,
+     "B-CitationHedged": 60,
+     "B-ConfidenceHedged": 18,
+     "B-ConfidenceHigh": 65,
+     "B-ConfidenceLow": 45,
+     "B-Contingent": 74,
+     "B-Description": 0,
+     "B-Facilitate": 72,
+     "B-FirstPerson": 41,
+     "B-ForceStressed": 17,
+     "B-Future": 57,
+     "B-InformationChange": 8,
+     "B-InformationChangeNegative": 47,
+     "B-InformationChangePositive": 22,
+     "B-InformationExposition": 16,
+     "B-InformationPlace": 30,
+     "B-InformationReportVerbs": 67,
+     "B-InformationStates": 70,
+     "B-InformationTopics": 11,
+     "B-Inquiry": 63,
+     "B-Interactive": 31,
+     "B-MetadiscourseCohesive": 46,
+     "B-MetadiscourseInteractive": 12,
+     "B-Narrative": 35,
+     "B-Negative": 53,
+     "B-Positive": 59,
+     "B-PublicTerms": 23,
+     "B-Reasoning": 38,
+     "B-Responsibility": 1,
+     "B-Strategic": 3,
+     "B-SyntacticComplexity": 7,
+     "B-Uncertainty": 48,
+     "B-Updates": 20,
+     "I-AcademicTerms": 32,
+     "I-AcademicWritingMoves": 2,
+     "I-Character": 44,
+     "I-Citation": 34,
+     "I-CitationAuthority": 33,
+     "I-CitationHedged": 61,
+     "I-ConfidenceHedged": 42,
+     "I-ConfidenceHigh": 58,
+     "I-ConfidenceLow": 50,
+     "I-Contingent": 4,
+     "I-Description": 9,
+     "I-Facilitate": 40,
+     "I-FirstPerson": 43,
+     "I-ForceStressed": 62,
+     "I-Future": 71,
+     "I-InformationChange": 68,
+     "I-InformationChangeNegative": 55,
+     "I-InformationChangePositive": 64,
+     "I-InformationExposition": 39,
+     "I-InformationPlace": 13,
+     "I-InformationReportVerbs": 21,
+     "I-InformationStates": 5,
+     "I-InformationTopics": 56,
+     "I-Inquiry": 54,
+     "I-Interactive": 6,
+     "I-MetadiscourseCohesive": 24,
+     "I-MetadiscourseInteractive": 27,
+     "I-Narrative": 10,
+     "I-Negative": 29,
+     "I-Positive": 73,
+     "I-PublicTerms": 36,
+     "I-Reasoning": 15,
+     "I-Responsibility": 14,
+     "I-Strategic": 51,
+     "I-SyntacticComplexity": 52,
+     "I-Uncertainty": 66,
+     "I-Updates": 28,
+     "O": 25,
+     "PAD": 75
+   },
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "transformers_version": "4.3.3",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 28996
+ }
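
The `id2label` and `label2id` maps in this config are meant to be exact inverses, and the labels decompose into a `B-`/`I-` pair per category plus `O` and `PAD`. The sketch below checks both properties on a parsed config. To stay runnable without downloads, the inline `config_text` is a two-category excerpt copied from the file above, and `categories_from` is a hypothetical helper.

```python
import json

# Excerpt of config.json (two categories plus "O" and "PAD"), copied from the diff.
config_text = """
{
  "id2label": {"25": "O", "34": "I-Citation", "57": "B-Future",
               "69": "B-Citation", "71": "I-Future", "75": "PAD"},
  "label2id": {"B-Citation": 69, "B-Future": 57, "I-Citation": 34,
               "I-Future": 71, "O": 25, "PAD": 75}
}
"""

config = json.loads(config_text)
# JSON object keys are always strings, so convert the ids back to integers.
id2label = {int(i): label for i, label in config["id2label"].items()}
label2id = config["label2id"]


def categories_from(labels):
    """Return the sorted category names with the B-/I- prefix removed."""
    return sorted({lab.split("-", 1)[1] for lab in labels if lab[:2] in ("B-", "I-")})


is_inverse = len(id2label) == len(label2id) and all(
    label2id[label] == i for i, label in id2label.items()
)
categories = categories_from(label2id)
print(is_inverse, categories)
```

Run against the full file (for example, the result of `json.load` on a local copy of config.json), the same check should report 37 categories, which accounts for the 76 label ids: 37 × 2 tagged labels plus `O` and `PAD`.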
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:89ce3ddfce5a3b7f1b62c7f7f16728642c9c81284505155b8683fae6e5e501b3
+ size 431198655
tf_model.h5 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0d04ece69d04b890153ea3bd5c2ef5706f9181495a0778a2593c6118f7ce2dc3
+ size 526681800
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"do_lower_case": false, "max_len": 512, "init_inputs": []}
vocab.txt ADDED
The diff for this file is too large to render. See raw diff