---
license: apache-2.0
---

# MiniSymposium Demo Release

MiniSymposium is an experimental QLoRA model based on Mistral 7B. I created it attempting to achieve these goals:

1. Demonstrate the untapped potential of a small, focused dataset of handwritten examples, as opposed to training on a large amount of synthetic GPT outputs.
2. Create a dataset that allows the model to explore different possible answers from multiple perspectives before reaching a conclusion.
3. Develop a model that performs well across various prompt formats, rather than overfitting to one specific format.

The current trend in QLoRA/LoRA-based finetuning (and finetuning in general for local LLMs) is to use large synthetic datasets, usually GPT-generated, trained at relatively high learning rates.

However, I believe there is a lot of potential in small, handwritten datasets paired with low learning rates, even for general-purpose instruction following, as long as you train for many epochs at a learning rate low enough to avoid overfitting.

This approach, I hypothesize, helps the model learn the deeper pattern of instruction following instead of fitting toward shallow data biases (like "As an AI made by OpenAI" and other GPT-isms) that ignore those deeper patterns.

My initial configuration for this QLoRA model used a constant learning rate of 1e-6 (0.000001), which resulted in overfitting after approximately 100 epochs: the model started reproducing the original dataset almost verbatim and exhibited poor generalization across different prompt formats, including obvious hallucinations and, for some reason, Chinese-language outputs.

However, lowering the learning rate to one-tenth of that (1e-7, or 0.0000001) significantly improved the model. I trained for about 10 hours on my RTX 3060, reaching 600 epochs. I think it is still a little undertrained, but I encourage people to try the demo model out.

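For readers who want to reproduce a similar run, the setup described above (QLoRA on Mistral 7B with a constant low learning rate over many epochs) could be sketched roughly as follows. This is a minimal sketch, not my actual training script: the LoRA rank, alpha, target modules, batch size, and output path are illustrative assumptions.

```python
# Hypothetical QLoRA setup matching the hyperparameters described above.
# Rank, alpha, target modules, and batch size are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model

# 4-bit quantization of the base model (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",  # assumed base checkpoint
    quantization_config=bnb_config,
)

# Attach low-rank adapters to the attention projections (assumed targets)
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# The key choices from the text: constant schedule, 1e-7, many epochs
args = TrainingArguments(
    output_dir="minisymposium-qlora",
    learning_rate=1e-7,
    lr_scheduler_type="constant",
    num_train_epochs=600,
    per_device_train_batch_size=1,
)
```

The important part is the combination of `learning_rate=1e-7` with `lr_scheduler_type="constant"` and a very high epoch count; everything else is a placeholder.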
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6491e00e057b0928b3e07b75/WOebITt3TuTUPZSdi4VDV.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6491e00e057b0928b3e07b75/vyWIl_NBf6COtFoW29w7I.png)

The dataset is about 200 lines of data in a special format that presents multiple 'perspectives' on the same answer.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6491e00e057b0928b3e07b75/hZ4cmmSTUk1R6WCS54Ees.png)