---
license: cc-by-sa-4.0
language:
- 'no'
pipeline_tag: text-classification
---

The Entailment Model is a pre-trained classifier that generates an entailment score for fact-verification purposes.

Specifically, we fine-tune NorBERT on a machine-translated version of the [VitaminC](https://huggingface.co/datasets/tals/vitaminc) dataset, which is designed to determine whether a piece of evidence supports a claim and is therefore well suited for training a model to judge whether a given context entails the generated text. We then use the fine-tuned model as our entailment model.
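
As a rough illustration of that setup (not the authors' exact recipe), fine-tuning starts from a NorBERT checkpoint with a three-way classification head; the base checkpoint id below is an assumption:

```python
# Minimal sketch, assuming "ltg/norbert" as the NorBERT base checkpoint;
# the VitaminC training data and all hyperparameters are omitted.
from transformers import AutoTokenizer, BertForSequenceClassification

base_model = "ltg/norbert"  # assumed checkpoint id, not confirmed by this card
base_tokenizer = AutoTokenizer.from_pretrained(base_model)
# Three labels, matching the mapping used below:
# 0 = contradiction, 1 = entailment, 2 = neutral.
base_clf = BertForSequenceClassification.from_pretrained(base_model, num_labels=3)
```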

Prompt format:
```
{article}[SEP]{positive_sample}
```
Inference format:
```
{article}[SEP]{generated_text}
```

## Run the Model
```python
import torch
from transformers import AutoTokenizer, BertForSequenceClassification

model_id = "NorGLM/Entailment"
# use_fast=True selects the fast (Rust-backed) tokenizer.
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
tokenizer.add_special_tokens({'pad_token': '[PAD]'})

model = BertForSequenceClassification.from_pretrained(model_id)
```
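
As a quick sanity check, a single article/summary pair can be scored directly. This is a minimal sketch with invented Norwegian strings; the label mapping (0 = contradiction, 1 = entailment, 2 = neutral) follows the evaluation code below:

```python
# Hypothetical example pair; replace with your own article and generated text.
article = "Oslo er hovedstaden i Norge."
generated = "Norges hovedstad er Oslo."

# The literal [SEP] string is mapped to BERT's separator token by the tokenizer.
inputs = tokenizer(article + " [SEP] " + generated, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits

label = logits.argmax(dim=-1).item()
print({0: "contradiction", 1: "entailment", 2: "neutral"}[label])
```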

## Inference Example
```python
import numpy as np
import pandas as pd
from torch.utils.data import TensorDataset, DataLoader

# Run on GPU when available; the evaluation loop moves each batch to this device.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def entailment_score(texts, references, generated_texts):
    # Label mapping: Entailment: 1, Contradict: 0, Neutral: 2
    # (references is accepted for interface symmetry but not used here.)
    # Concatenate news articles and generated summaries as model input.
    input_texts = [t + ' [SEP] ' + g for t, g in zip(texts, generated_texts)]
    # Set the maximum sequence length according to the NorBERT config.
    MAX_LEN = 512
    batch_size = 16

    test_inputs = tokenizer(text=input_texts, add_special_tokens=True, return_attention_mask=True, return_tensors="pt", padding=True, truncation=True, max_length=MAX_LEN)
    validation_data = TensorDataset(test_inputs['input_ids'], test_inputs['attention_mask'])
    validation_dataloader = DataLoader(validation_data, batch_size=batch_size)

    model.eval()

    results = []
    for batch in validation_dataloader:
        # Move the batch to the GPU (a no-op on CPU)
        batch = tuple(t.to(device) for t in batch)
        # Unpack the inputs from our dataloader
        b_input_ids, b_input_mask = batch
        # Tell the model not to compute or store gradients, saving memory and
        # speeding up validation
        with torch.no_grad():
            # Forward pass, calculate logit predictions
            logits = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)

        # Move logits to CPU and take the argmax as the predicted class
        logits = logits[0].to('cpu').numpy()
        pred_flat = np.argmax(logits, axis=1).flatten()

        results.extend(pred_flat)

    ent_ratio = results.count(1) / float(len(results))
    neu_ratio = results.count(2) / float(len(results))
    con_ratio = results.count(0) / float(len(results))
    print("Entailment ratio: {}; Neutral ratio: {}; Contradict ratio: {}.".format(ent_ratio, neu_ratio, con_ratio))
    return ent_ratio, neu_ratio, con_ratio

# Load the texts to evaluate
eva_file_name = "<input csv file for evaluation>"
eval_df = pd.read_csv(eva_file_name)

# Mask cells that contain this overflow marker, then drop the affected rows.
remove_str = 'Token indices sequence length is longer than 2048.'
eval_df = eval_df[eval_df != remove_str]
eval_df = eval_df.dropna()
references = eval_df['positive_sample'].to_list()
hypo_list = eval_df['generated_text'].to_list()
articles = eval_df['article'].to_list()
ent_ratio, neu_ratio, con_ratio = entailment_score(articles, references, hypo_list)
```
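
The same function also works on in-memory lists without a CSV file; the strings below are invented for illustration:

```python
# Hypothetical toy inputs; in practice these come from your evaluation set.
articles = ["Oslo er hovedstaden i Norge.", "Bergen ligger på Vestlandet."]
references = ["Norges hovedstad er Oslo.", "Bergen er en by på Vestlandet."]
generated = ["Norges hovedstad er Oslo.", "Bergen ligger i Nord-Norge."]

ent_ratio, neu_ratio, con_ratio = entailment_score(articles, references, generated)
```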