---
license: bsd-3-clause-clear
datasets:
- cnn_dailymail
language:
- en
metrics:
- f1
---
# FactCC factuality prediction model

Original paper: [Evaluating the Factual Consistency of Abstractive Text Summarization](https://arxiv.org/abs/1910.12840) (Kryściński et al., 2020)

This model is trained to predict whether a summary is factually consistent with the original text. Basic usage:
```python
from transformers import BertForSequenceClassification, BertTokenizer

model_path = 'manueldeprada/FactCC'

tokenizer = BertTokenizer.from_pretrained(model_path)
model = BertForSequenceClassification.from_pretrained(model_path)

text = '''The US has "passed the peak" on new coronavirus cases, the White House reported. They predict that some states would reopen this month.
The US has over 637,000 confirmed Covid-19 cases and over 30,826 deaths, the highest for any country in the world.'''
wrong_summary = '''The pandemic has almost not affected the US'''

input_dict = tokenizer(text, wrong_summary, max_length=512, padding='max_length', truncation='only_first', return_tensors='pt')
logits = model(**input_dict).logits
pred = logits.argmax(dim=1)
model.config.id2label[pred.item()]  # prints: INCORRECT
```

It can also be used with a pipeline. Beware that pipelines are not designed for sentence pairs, so you have to pass each (text, summary) pair with this double-list workaround:
```python
>>> from transformers import pipeline

>>> pipe = pipeline(model="manueldeprada/FactCC")
>>> pipe([[[text1, summary1]], [[text2, summary2]]], truncation='only_first', padding='max_length')
# output: [{'label': 'INCORRECT', 'score': 0.9979124665260315}, {'label': 'CORRECT', 'score': 0.879124665260315}]
```

Example of how to perform batched inference to reproduce the authors' results on the test set:
```python
import json

import torch
from sklearn.metrics import balanced_accuracy_score, f1_score
from tqdm import tqdm

def batched_FactCC(text_l, summary_l, max_length=512):
    input_dict = tokenizer(text_l, summary_l, max_length=max_length, padding='max_length', truncation='only_first', return_tensors='pt')
    with torch.no_grad():
        logits = model(**input_dict).logits
        preds = logits.argmax(dim=1)
    return logits, preds

texts = []
claims = []
labels = []
with open('factCC/annotated_data/test/data-dev.jsonl', 'r') as file:
    for line in file:
        obj = json.loads(line)  # Load the JSON data from each line
        texts.append(obj['text'])
        claims.append(obj['claim'])
        labels.append(model.config.label2id[obj['label']])

preds = []
batch_size = 8
for i in tqdm(range(0, len(texts), batch_size)):
    batch_texts = texts[i:i+batch_size]
    batch_claims = claims[i:i+batch_size]
    _, batch_preds = batched_FactCC(batch_texts, batch_claims)
    preds.extend(batch_preds.tolist())

print(f"F1 micro: {f1_score(labels, preds, average='micro')}")
print(f"Balanced accuracy: {balanced_accuracy_score(labels, preds)}")
```