add details
README.md
CHANGED
# Longformer Encoder-Decoder (LED) fine-tuned on Booksum

- This model is a fine-tuned version of [allenai/led-large-16384](https://huggingface.co/allenai/led-large-16384) on the booksum dataset.
- The goal was to create a model that generalizes well and is useful for summarizing long text in academic and everyday settings.
- All generation parameters on the API are the same as for [the base model](https://huggingface.co/pszemraj/led-base-book-summary), to allow easy comparison between versions.
- Works well on long text and can handle up to 16,384 tokens per batch (see the quick length check below).
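
Since 16,384 tokens is the encoder limit, it can help to count tokens before summarizing. A minimal sketch (the checkpoint name is the same assumed name used in the pipeline example below):

```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("pszemraj/led-large-book-summary")

# count tokens to see whether the document fits in a single 16,384-token batch
n_tokens = len(tokenizer.encode("your long document here"))
print(f"{n_tokens} tokens -> fits in one batch: {n_tokens <= 16384}")
```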

---

# Usage - Basics

- It is recommended to use `encoder_no_repeat_ngram_size=3` when calling the pipeline object, as it improves summary quality.
- This parameter forces the model to use new vocabulary and produce an abstractive summary; otherwise it may just compile the best _extractive_ summary from the input provided.
- Create the pipeline object:

```
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

# this card's checkpoint (assumed to be published as pszemraj/led-large-book-summary;
# the base variant lives at pszemraj/led-base-book-summary)
hf_name = 'pszemraj/led-large-book-summary'

_model = AutoModelForSeq2SeqLM.from_pretrained(
    hf_name,
    low_cpu_mem_usage=True,
)

_tokenizer = AutoTokenizer.from_pretrained(hf_name)

summarizer = pipeline(
    "summarization",
    model=_model,
    tokenizer=_tokenizer,
)
```
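
If a GPU is available, the pipeline can be placed on it with the standard `transformers` `device` argument (nothing here is specific to this model):

```
summarizer = pipeline(
    "summarization",
    model=_model,
    tokenizer=_tokenizer,
    device=0,  # first CUDA device; omit (or use -1) to stay on CPU
)
```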

- Put words into the pipeline object:

```
wall_of_text = "your words here"

result = summarizer(
    wall_of_text,
    min_length=16,
    max_length=256,
    no_repeat_ngram_size=3,
    encoder_no_repeat_ngram_size=3,  # blocks copying 3-grams from the source (see note above)
    clean_up_tokenization_spaces=True,
    repetition_penalty=3.7,
    num_beams=4,
    early_stopping=True,
)
```
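
The summarization pipeline returns a list with one dict per input; the generated text is stored under the `summary_text` key:

```
print(result[0]['summary_text'])
```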

## Training and evaluation data

- The [booksum](https://arxiv.org/abs/2105.08209) dataset.
- During training, the input was the full text of a chapter, and the target output was the chapter's summary (a loading sketch follows below).
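
For reference, a minimal sketch of what one training pair looks like, assuming the public `kmfoda/booksum` copy of the dataset on the Hub (the column names here are assumptions; check the dataset card):

```
from datasets import load_dataset

ds = load_dataset("kmfoda/booksum", split="train")

example = ds[0]
source = example["chapter"]       # model input: full chapter text (assumed column name)
target = example["summary_text"]  # training target: reference summary (assumed column name)
print(len(source), len(target))
```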

## Training procedure