mmoreirast committed on
Commit
eb5e5e3
·
verified ·
1 Parent(s): d03fabf

Update README.md

  1. README.md +136 -3
README.md CHANGED
@@ -1,3 +1,136 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ license: apache-2.0
3
+ datasets:
4
+ - mmoreirast/medicine-training-pt
5
+ - mmoreirast/medicine-evaluation-pt
6
+ language:
7
+ - pt
8
+ metrics:
9
+ - perplexity
10
+ library_name: transformers
11
+ tags:
12
+ - llama-2
13
+ - pt
14
+ - medicine
15
+ ---
16
+ # Doctor Llama Chat
17
+
18
+
19
+ This repository contains a version of [TeenyTinyLlama-460m](https://huggingface.co/nicholasKluge/TeenyTinyLlama-460m) fine-tuned on the [aira-med-training-pt](https://huggingface.co/datasets/mmoreirast/aira-med-training-pt) dataset.
20
+
21
+ The main objective of the Doctor Llama model was to study the step-by-step process involved in fine-tuning models in Portuguese, taking into account the challenges encountered in the medical field.
22
+
23
+ This model was created as part of the course completion project for **Biomedical Informatics at the Federal University of Paraná**. For more information, access the full text at the following link.
24
+
25
+ ## Author
26
+ Mariana Moreira dos Santos ([LinkedIn](https://www.linkedin.com/in/mmoreirast/))
27
+
28
+ ## Code
29
+ The code used to fine-tune the model is available in this [Google Colab](https://colab.research.google.com/drive/1SvJvTcH3IRnsEv72UxkVmV0oClCZARtE?usp=sharing) notebook.
30
+
31
+ ## Fine-tuning details
32
+ - **Base model:** [TeenyTinyLlama 460m](https://huggingface.co/nicholasKluge/TeenyTinyLlama-460m)
33
+ - **Context length:** 2048 tokens
34
+ - **Dataset for fine-tuning:** [medicine-training-pt](https://huggingface.co/datasets/mmoreirast/medicine-training-pt)
35
+ - **Dataset for evaluation:** [medicine-evaluation-pt](https://huggingface.co/datasets/mmoreirast/medicine-evaluation-pt)
36
+ - **Language:** Portuguese
37
+ - **GPU:** NVIDIA A100-SXM4-40GB
38
+ - **Training time:** ~5 hours
39
+
40
+ ## Parameters
41
+ - **Number of Epochs:** 4
42
+ - **Batch size:** 8
43
+ - **Optimizer:** torch.optim.AdamW (learning_rate = 1e-5, epsilon = 1e-8), with warmup_steps = 1e3 for the learning-rate schedule
44
+
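The parameters above give a peak learning rate and a number of warmup steps, but not the shape of the schedule. The following is a minimal sketch that assumes a linear warmup followed by a constant rate. The function name `lr_at` and the constant-after-warmup behaviour are illustrative assumptions, not taken from the training notebook.

```python
BASE_LR = 1e-5       # learning_rate from the parameter list
WARMUP_STEPS = 1000  # warmup_steps = 1e3 from the parameter list


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step under an assumed linear warmup."""
    if step < WARMUP_STEPS:
        # Ramp linearly from 0 up to BASE_LR over the warmup period.
        return BASE_LR * step / WARMUP_STEPS
    # Assumption: the rate is held constant afterwards; any decay is not documented.
    return BASE_LR


for step in (0, 250, 500, 1000, 5000):
    print(f"step {step:>5}: lr = {lr_at(step):.2e}")
```

With a small learning rate such as 1e-5, the warmup mainly avoids large updates in the first steps, while the optimizer statistics are still unreliable.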
45
+ ## Evaluations
46
+
47
+
48
+ | Model |Perplexity |Evaluation Loss |
49
+ |---------------------------|-----------------|-------------------|
50
+ | TeenyTinyLlama 160m | 22.51 | 3.11 |
51
+ | **Doctor Llama 160m** | 15.68 | 2.75 |
52
+ | TeenyTinyLlama 460m | 13.09 | 2.57 |
53
+ | **Doctor Llama 460m** | 10.94 | 2.39 |
54
+ | TeenyTinyLlama 460m Chat | 21.22 | 3.05 |
55
+ | **Doctor Llama Chat** | 11.13 | 2.41 |
56
+
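The two columns in the table are related: perplexity is the exponential of the mean cross-entropy loss. The sketch below recomputes perplexity from the reported losses, assuming the loss is measured in nats. Small differences from the table are expected, because the losses are rounded to two decimals.

```python
import math

# (reported perplexity, reported evaluation loss), copied from the table above.
reported = {
    "TeenyTinyLlama 160m": (22.51, 3.11),
    "Doctor Llama 160m": (15.68, 2.75),
    "TeenyTinyLlama 460m": (13.09, 2.57),
    "Doctor Llama 460m": (10.94, 2.39),
    "TeenyTinyLlama 460m Chat": (21.22, 3.05),
    "Doctor Llama Chat": (11.13, 2.41),
}


def perplexity(loss: float) -> float:
    """Perplexity for a mean cross-entropy loss, assumed to be in nats."""
    return math.exp(loss)


for name, (ppl, loss) in reported.items():
    print(f"{name}: reported {ppl:.2f}, exp(loss) = {perplexity(loss):.2f}")
```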
57
+
58
+ ## Basic usage
59
+ Using the `pipeline`:
60
+
61
+ ```python
62
+ from transformers import pipeline
63
+
64
+ generator = pipeline("text-generation", model="mmoreirast/Doctor-Llama-460m")
65
+
66
+ completions = generator("Me fale sobre o sistema nervoso", do_sample=True, num_return_sequences=2, max_new_tokens=100)  # greedy decoding cannot return more than one sequence
67
+
68
+ for comp in completions:
69
+     print(f"🤖 {comp['generated_text']}")
70
+ ```
71
+
72
+ Using the `AutoTokenizer` and `AutoModelForCausalLM`:
73
+
74
+ ```python
75
+ from transformers import AutoTokenizer, AutoModelForCausalLM
76
+ import torch
77
+
78
+ # Load model and the tokenizer
79
+ tokenizer = AutoTokenizer.from_pretrained("mmoreirast/Doctor-Llama-460m", revision='main')
80
+ model = AutoModelForCausalLM.from_pretrained("mmoreirast/Doctor-Llama-460m", revision='main')
81
+
82
+ # Pass the model to your device
83
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
84
+ model.eval()
85
+ model.to(device)
86
+
87
+ # Tokenize the inputs and pass them to the device
88
+ inputs = tokenizer("Me fale sobre o sistema nervoso", return_tensors="pt").to(device)
89
+
90
+ # Generate some text
91
+ completions = model.generate(**inputs, do_sample=True, num_return_sequences=2, max_new_tokens=100)  # greedy decoding cannot return more than one sequence
92
+
93
+ # Print the generated text
94
+ for i, completion in enumerate(completions):
95
+     print(f'🤖 {tokenizer.decode(completion)}')
96
+ ```
97
+ ## Intended Uses
98
+
99
+ The main objective of the Doctor Llama model was to study the step-by-step process involved in fine-tuning models in Portuguese, taking into account the challenges encountered in the medical field. You may also further fine-tune and adapt Doctor Llama for deployment, as long as your use complies with the Apache 2.0 license. If you decide to use Doctor Llama as a basis for your own fine-tuned model, please conduct your own risk and bias assessment.
100
+
101
+ ## Out-of-scope Use
102
+
103
+ Doctor Llama is not intended for deployment. It is not a product and should not be used for human-facing interactions.
104
+
105
+ Doctor Llama models support Brazilian Portuguese only and are not suitable for translation or for generating text in other languages.
106
+
107
+ ## Limitations
108
+
109
+ Like the TeenyTinyLlama base models, Doctor Llama has the following limitations:
110
+
111
+ - **Hallucinations:** This model can produce content that can be mistaken for truth but is, in fact, misleading or entirely false, i.e., hallucination.
112
+
113
+ - **Biases and Toxicity:** This model inherits the social and historical stereotypes from the data used to train it. Given these biases, the model can produce toxic content, i.e., harmful, offensive, or detrimental to individuals, groups, or communities.
114
+
115
+ - **Unreliable Code:** The model may produce incorrect code snippets and statements. These code generations should not be treated as suggestions or accurate solutions.
116
+
117
+ - **Language Limitations:** The model is primarily designed to understand standard Brazilian Portuguese. Other languages might challenge its comprehension, leading to potential misinterpretations or errors in response.
118
+
119
+ - **Repetition and Verbosity:** The model may get stuck in repetition loops (especially if the repetition penalty during generation is set to a low value) or produce verbose responses unrelated to the prompt it was given.
120
+
121
+ Hence, even though our models are released under a permissive license, we urge users to perform their own risk analysis before using them in real-world applications. Where the models interact with an audience, humans should moderate the outputs, and users should always be made aware that they are interacting with a language model.
122
+
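The repetition penalty mentioned under Limitations rescales the scores of tokens that have already been generated, so they become less likely to be chosen again. The sketch below shows the general idea in plain Python. The function `apply_repetition_penalty` and the example values are hypothetical illustrations, not code from this repository.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Return a copy of `logits` in which already-generated tokens score lower."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # For penalty > 1, dividing a positive score or multiplying a
        # negative one both push the score down.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


logits = [2.0, -1.0, 0.5, 3.0]
# Tokens 0 and 1 were already generated, so only their scores change.
print(apply_repetition_penalty(logits, generated_ids=[0, 1], penalty=2.0))
# → [1.0, -2.0, 0.5, 3.0]
```

In `transformers`, the equivalent setting is the `repetition_penalty` argument of `generate`. A value of 1.0 disables the penalty, and the 1.2 default used here is only an illustrative choice.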
123
+ ## Cite as 🤗
124
+ ```latex
125
+ @misc{moreira2024docllama,
126
+ title = {Um Estudo sobre LLMs em Português para a Área Médica},
127
+ author = {Mariana Moreira dos Santos and André Ricardo Abed Grégio},
128
+ url = {},
129
+ year={2024}
130
+ }
131
+ ```
132
+ ## Acknowledgements
133
+ The TeenyTinyLlama base models used here were created by Nicholas Kluge Corrêa and his team. For more information, visit [TeenyTinyLlama](https://huggingface.co/collections/nicholasKluge/teenytinyllama-6582ea8129e72d1ea4d384f1).
134
+
135
+ ## License
136
+ Doctor Llama is licensed under the Apache License, Version 2.0. See the [LICENSE](LICENSE) file for more details.