@@ -1,15 +1,22 @@
 ---
 library_name: transformers
-tags: []
 ---
-# Model Card for Model ID
 <!-- Provide a quick summary of what the model is/does. -->
 ## Model Details
 ### Model Description
@@ -49,6 +56,120 @@ This is the model card of a 🤗 transformers model that has been pushed on the
 [More Information Needed]
 ### Out-of-Scope Use
 <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

 ---
 library_name: transformers
+tags: [Llama2Dictionary]
 ---
+# Llama2Dictionary
 <!-- Provide a quick summary of what the model is/does. -->
+```FrancescoPeriti/Llama2Dictionary``` is a fine-tuned version of ```meta-llama/Llama-2-7b-chat-hf```.
+To use it, you first need access to the base model: visit the AI at Meta website, accept the Meta License, and submit the [form](https://llama.meta.com/llama-downloads/).
+You will then need to log in with your Hugging Face token (hereafter, ```[HF-TOKEN]```).
 ## Model Details
+This model is fine-tuned on English datasets of sense definitions. Given a target word and a usage example, the model generates a sense definition for the target word in-context.
+You can find more details in the paper [Automatically Generated Definitions and their utility for Modeling Word Meaning](link) by Francesco Periti, David Alfter, Nina Tahmasebi.
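As a rough illustration of the input the model sees, the sketch below formats one (target word, usage example) pair as a chat prompt. The system message and the instruction template are the ones used in this model card's usage code; the `[INST]` / `<<SYS>>` layout is assumed to be the standard Llama-2 chat format, and `build_prompt` is a hypothetical helper, not part of any library. In practice the prompt should be produced with `tokenizer.apply_chat_template`.

```python
# Minimal sketch of the assumed Llama-2 chat prompt layout.
# Compare with the output of tokenizer.apply_chat_template before relying on it.
SYSTEM_MESSAGE = "You are a lexicographer familiar with providing concise definitions of word meanings."
TEMPLATE = 'Please provide a concise definition for the meaning of the word "{}" in the following sentence: {}'


def build_prompt(target: str, example: str) -> str:
    """Hypothetical helper: format one (target, example) pair as a chat prompt."""
    user_message = TEMPLATE.format(target, example)
    # System block first, then the user request, closed by the [/INST] marker.
    return f"<s>[INST] <<SYS>>\n{SYSTEM_MESSAGE}\n<</SYS>>\n\n{user_message} [/INST]"


prompt = build_prompt("jam", "The traffic jam on the highway made everyone late for work.")
print(prompt)
```

The model's generated definition is whatever follows the closing `[/INST]` marker.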
 ### Model Description
 [More Information Needed]
+```python
+import torch
+import warnings
+from peft import PeftModel # parameter-efficient fine-tuning
+from datasets import Dataset
+from huggingface_hub import login
+from typing import Literal, Sequence, TypedDict
+from transformers import AutoTokenizer, AutoModelForCausalLM
+login("[HF-TOKEN]") # replace with your own token, e.g., hf_aGPI...ELal
+model_name = "meta-llama/Llama-2-7b-chat-hf" # chat model
+ft_model_name = "FrancescoPeriti/Llama2Dictionary" # fine-tuned model
+# load models
+chat_model = AutoModelForCausalLM.from_pretrained(model_name, device_map='auto')
+ft_model = PeftModel.from_pretrained(chat_model, ft_model_name)
+ft_model.eval()
+# load tokenizer
+tokenizer = AutoTokenizer.from_pretrained(
+    model_name,
+    padding_side="left",
+    add_eos_token=True,
+    add_bos_token=True,
+)
+tokenizer.pad_token = tokenizer.eos_token
+# end of sequence for stop condition
+eos_tokens = [tokenizer.encode(token, add_special_tokens=False)[0]
+              for token in [';', ' ;', '.', ' .']]
+eos_tokens.append(tokenizer.eos_token_id)
+# chat format
+Role = Literal["system", "user"]
+class Message(TypedDict):
+    role: Role
+    content: str
+Dialog = Sequence[Message]
+# load dataset
+examples = [{'target': 'jam', 'example': 'The traffic jam on the highway made everyone late for work.'},
+            {'target': 'jam', 'example': 'I spread a generous layer of strawberry jam on my toast this morning'}]
+dataset = Dataset.from_list(examples)
+# apply template
+def apply_chat_template(tokenizer, dataset):
+    system_message = "You are a lexicographer familiar with providing concise definitions of word meanings."
+    template = 'Please provide a concise definition for the meaning of the word "{}" in the following sentence: {}'
+    def apply_chat_template_func(record):
+        dialog: Dialog = (Message(role='system', content=system_message),
+                          Message(role='user', content=template.format(record['target'], record['example'])))
+        prompt = tokenizer.decode(tokenizer.apply_chat_template(dialog, add_generation_prompt=True))
+        return {'text': prompt}
+    return dataset.map(apply_chat_template_func)
+dataset = apply_chat_template(tokenizer, dataset)
+# tokenization
+max_length = 512
+def formatting_func(record):
+    return record['text']
+def tokenization(dataset):
+    result = tokenizer(formatting_func(dataset),
+                       truncation=True,
+                       max_length=max_length,
+                       padding="max_length",
+                       add_special_tokens=False)
+    return result
+tokenized_dataset = dataset.map(tokenization)
+# definition generation
+batch_size = 32
+max_time = 4.5 # sec
+sense_definitions = list()
+with torch.no_grad():
+    for i in range(0, len(tokenized_dataset), batch_size):
+        batch = tokenized_dataset[i:i + batch_size]
+        model_input = dict()
+        for k in ['input_ids', 'attention_mask']:
+            model_input[k] = torch.tensor(batch[k]).to('cuda')
+        output_ids = ft_model.generate(**model_input,
+                                       max_length = max_length * batch_size,
+                                       forced_eos_token_id = eos_tokens,
+                                       max_time = max_time * batch_size,
+                                       eos_token_id = eos_tokens,
+                                       temperature = 0.00001,
+                                       pad_token_id = tokenizer.eos_token_id)
+        answers = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
+        for j, answer in enumerate(answers):
+            answer = answer.split('[/INST]')[-1].strip(" .,;:")
+            if 'SYS>>' in answer:
+                answer=''
+                warnings.warn("Something went wrong. The input example might be too long; try reducing it.")
+            sense_definitions.append(answer.replace('\n', ' ') + '\n')
+# output
+dataset = dataset.add_column('definition', sense_definitions)
+for row in dataset:
+    print(f"Target: {row['target']}\nExample: {row['example']}\nSense definition: {row['definition']}")
+```
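The post-processing step in the loop above can be tried in isolation, without loading the model. The sketch below wraps the same logic in a hypothetical helper, `extract_definition`, and runs it on a hand-written decoded string; the sample text is made up for illustration and is not real model output.

```python
import warnings


def extract_definition(decoded: str) -> str:
    """Hypothetical helper: keep only the text generated after the last [/INST] marker."""
    answer = decoded.split("[/INST]")[-1].strip(" .,;:")
    if "SYS>>" in answer:
        # No [/INST] marker was found, so the prompt itself leaked into the answer.
        # This likely means the prompt was truncated before its closing marker.
        warnings.warn("Something went wrong. The input example might be too long; try reducing it.")
        return ""
    return answer.replace("\n", " ")


# Hand-written stand-in for one decoded generation (illustrative only).
decoded = (
    "[INST] <<SYS>>\nYou are a lexicographer.\n<</SYS>>\n\n"
    "Define the word 'jam'. [/INST] A sweet spread made from fruit and sugar."
)
print(extract_definition(decoded))  # prints: A sweet spread made from fruit and sugar
```

An empty string is returned when the marker is missing, which makes failed generations easy to filter out afterwards.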
 ### Out-of-Scope Use
 <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->