update model
Browse files
- README.md +65 -11
- adapter_model.safetensors +1 -1
- ggml-adapter-model.bin +1 -1
README.md
CHANGED
@@ -71,10 +71,10 @@ wandb_log_model:
 
 gradient_accumulation_steps: 8
 micro_batch_size: 1
-num_epochs:
+num_epochs: 4
 optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
-learning_rate: 0.
+learning_rate: 0.00002
 
 train_on_inputs: false
 group_by_length: false
@@ -118,25 +118,79 @@ This is a LoRA for the [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.
 
 ## Model description
 
-Given text extracted from pages of a sustainability report, this model extracts the scope 1, 2 and 3 emissions in JSON format. The JSON object also contains the pages containing this information. For example, the [2022 sustainability report by the Bristol-Myers Squibb Company](https://www.bms.com/assets/bms/us/en-us/pdf/bmy-2022-esg-report.pdf) leads to the following output: `{"scope_1":202290,"scope_2":161907,"scope_3":1696100,"sources":[88,89]}`.
+Given text extracted from pages of a sustainability report, this model extracts the scope 1, 2 and 3 emissions in JSON format. The JSON object also contains the pages containing this information. For example, the [2022 sustainability report by the Bristol-Myers Squibb Company](https://www.bms.com/assets/bms/us/en-us/pdf/bmy-2022-esg-report.pdf) leads to the following output: `{"scope_1":202290,"scope_2":161907,"scope_3":1696100,"sources":[88,89]}`.
+
+Reaches an emission value extraction accuracy of 65% (up from 46% for the base model) and a source citation accuracy of 69% (base model: 52%) on the [corporate-emission-reports](https://huggingface.co/datasets/nopperl/corporate-emission-reports) dataset. For more information, refer to the [GitHub repo](https://github.com/nopperl/corporate_emission_reports).
 
 ## Intended uses & limitations
 
-The model is intended to be used together with the [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) model using the `inference.py` script from the [
+The model is intended to be used together with the [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) model using the `inference.py` script from the [accompanying python package](https://github.com/nopperl/corporate_emission_reports). The script ensures that the prompt string and token ids exactly match the ones used for training.
+
+### Example usage
+
+#### CLI
+
+Using [transformers](https://github.com/huggingface/transformers) as the inference engine:
 
-python inference.py --
+```
+python -m corporate_emission_reports.inference --model_path mistralai/Mistral-7B-Instruct-v0.2 --lora nopperl/emissions-extraction-lora --model_context_size 32768 --engine hf https://www.bms.com/assets/bms/us/en-us/pdf/bmy-2022-esg-report.pdf
+```
 
 Compare to base model without LoRA:
 
-python inference.py --
+```
+python -m corporate_emission_reports.inference --model_path mistralai/Mistral-7B-Instruct-v0.2 --model_context_size 32768 --engine hf https://www.bms.com/assets/bms/us/en-us/pdf/bmy-2022-esg-report.pdf
+```
+
+Alternatively, it is possible to use [llama.cpp](https://github.com/ggerganov/llama.cpp) as the inference engine. In this case, follow the installation instructions in the [package readme](https://github.com/nopperl/corporate_emission_reports/blob/main/README.md). In particular, the model needs to be downloaded beforehand. Then:
+
+```
+python -m corporate_emission_reports.inference --model mistral --lora ./emissions-extraction-lora/ggml-adapter-model.bin https://www.bms.com/assets/bms/us/en-us/pdf/bmy-2022-esg-report.pdf
+```
+
+Compare to base model without LoRA:
+
+```
+python -m corporate_emission_reports.inference --model mistral https://www.bms.com/assets/bms/us/en-us/pdf/bmy-2022-esg-report.pdf
+```
+
+#### Programmatically
+
+The package also provides a function for inference from python code:
+
+```
+from corporate_emission_reports.inference import extract_emissions
+
+document_path = "https://www.bms.com/assets/bms/us/en-us/pdf/bmy-2022-esg-report.pdf"
+model_kwargs = {}  # Optional arguments which are passed to the HF model
+emissions = extract_emissions(document_path, "mistralai/Mistral-7B-Instruct-v0.2", lora="nopperl/emissions-extraction-lora", engine="hf", **model_kwargs)
+```
+
+It's also possible to use it directly with [transformers](https://github.com/huggingface/transformers):
+
+```
+from corporate_emission_reports.inference import construct_prompt
+from peft import AutoPeftModelForCausalLM
+from transformers import AutoTokenizer
+
+document_path = "https://www.bms.com/assets/bms/us/en-us/pdf/bmy-2022-esg-report.pdf"
+lora_path = "nopperl/emissions-extraction-lora"
+tokenizer = AutoTokenizer.from_pretrained(lora_path)
+prompt_text = construct_prompt(document_path, tokenizer)
+model = AutoPeftModelForCausalLM.from_pretrained(lora_path)
+prompt_tokenized = tokenizer.encode(prompt_text, return_tensors="pt").to(model.device)
+outputs = model.generate(prompt_tokenized, max_new_tokens=120)
+output = outputs[0][prompt_tokenized.shape[1]:]
+```
+
+Additionally, it is possible to enforce valid JSON output and convert it into a Pydantic object using [lm-format-enforcer](https://github.com/noamgat/lm-format-enforcer):
+
+```
+from corporate_emission_reports.pydantic_types import Emissions
+from lmformatenforcer import JsonSchemaParser
+from lmformatenforcer.integrations.transformers import build_transformers_prefix_allowed_tokens_fn
+
+...
+parser = JsonSchemaParser(Emissions.model_json_schema())
+prefix_function = build_transformers_prefix_allowed_tokens_fn(tokenizer, parser)
+outputs = model.generate(prompt_tokenized, max_new_tokens=120, prefix_allowed_tokens_fn=prefix_function)
+output = outputs[0][prompt_tokenized.shape[1]:]
+if tokenizer.eos_token:
+    output = output[:-1]
+output = tokenizer.decode(output)
+emissions = Emissions.model_validate_json(output, strict=True)
+```
+
+## Training and evaluation data
 
+Finetuned on the [sustainability-report-emissions-instruction-style](https://huggingface.co/datasets/nopperl/sustainability-report-emissions-instruction-style) dataset and evaluated on the [corporate-emission-reports](https://huggingface.co/datasets/nopperl/corporate-emission-reports) dataset.
 
 ## Training procedure
 
@@ -169,4 +223,4 @@ The following hyperparameters were used during training:
 - Transformers 4.37.1
 - Pytorch 2.0.1
 - Datasets 2.16.1
-- Tokenizers 0.15.0
+- Tokenizers 0.15.0
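The snippets above reference the `Emissions` Pydantic type from `corporate_emission_reports.pydantic_types` without showing its definition. As a rough orientation only, here is a minimal sketch consistent with the example output `{"scope_1":202290,"scope_2":161907,"scope_3":1696100,"sources":[88,89]}`; the field names come from that output, while the exact types are assumptions and the authoritative definition lives in the GitHub repo:

```
# Hypothetical sketch of the Emissions schema; the real definition is
# corporate_emission_reports.pydantic_types.Emissions and may differ.
from pydantic import BaseModel


class Emissions(BaseModel):
    scope_1: int  # scope 1 emission value, as in the example output
    scope_2: int  # scope 2 emission value
    scope_3: int  # scope 3 emission value
    sources: list[int]  # report pages the values were extracted from
```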
adapter_model.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:c9cf7ae7c20b80e1a17041b5e0f8b12788db0bc46943fa01b4ebeb96f8059615
 size 167832688

ggml-adapter-model.bin
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:d915ae9cd2bd2f1909eea73bdba5a7b14ac5b423e5f32d5ab45f82c4ffbfccf8
 size 335572992
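Both binary files are Git LFS pointers rather than the weights themselves; per the [git-lfs spec](https://git-lfs.github.com/spec/v1), the `oid` is the SHA-256 digest of the actual file content. A downloaded adapter can therefore be checked against the pointer, as in this small sketch (the file path is assumed to match the repo file name):

```
import hashlib


def file_sha256(path: str) -> str:
    # Stream the file in chunks so large blobs are not loaded into memory.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Should match the oid of the ggml-adapter-model.bin pointer above:
# d915ae9cd2bd2f1909eea73bdba5a7b14ac5b423e5f32d5ab45f82c4ffbfccf8
print(file_sha256("ggml-adapter-model.bin"))
```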