---
language:
- en
license: apache-2.0
datasets:
- HuggingFaceTB/cosmopedia
inference:
parameters:
temperature: 0.6
top_p: 0.95
top_k: 50
repetition_penalty: 1.2
widget:
- text: Photosynthesis is
example_title: Textbook
group: Completion
- text: '<s> [INST] How to take care of plants? [/INST] '
example_title: Wikihow
group: Completion
- text: '<s> [INST] Generate a story about a flying cat [/INST] '
example_title: Story
group: Completion
model-index:
- name: cosmo-1b
results:
- task:
type: text-generation
name: Text Generation
dataset:
name: AI2 Reasoning Challenge (25-Shot)
type: ai2_arc
config: ARC-Challenge
split: test
args:
num_few_shot: 25
metrics:
- type: acc_norm
value: 38.57
name: normalized accuracy
source:
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=HuggingFaceTB/cosmo-1b
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: HellaSwag (10-Shot)
type: hellaswag
split: validation
args:
num_few_shot: 10
metrics:
- type: acc_norm
value: 55.13
name: normalized accuracy
source:
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=HuggingFaceTB/cosmo-1b
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: MMLU (5-Shot)
type: cais/mmlu
config: all
split: test
args:
num_few_shot: 5
metrics:
- type: acc
value: 26.69
name: accuracy
source:
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=HuggingFaceTB/cosmo-1b
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: TruthfulQA (0-shot)
type: truthful_qa
config: multiple_choice
split: validation
args:
num_few_shot: 0
metrics:
- type: mc2
value: 38.15
source:
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=HuggingFaceTB/cosmo-1b
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: Winogrande (5-shot)
type: winogrande
config: winogrande_xl
split: validation
args:
num_few_shot: 5
metrics:
- type: acc
value: 55.49
name: accuracy
source:
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=HuggingFaceTB/cosmo-1b
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: GSM8k (5-shot)
type: gsm8k
config: main
split: test
args:
num_few_shot: 5
metrics:
- type: acc
value: 5.53
name: accuracy
source:
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=HuggingFaceTB/cosmo-1b
name: Open LLM Leaderboard
---
# Model Summary
This is a 1.8B-parameter model trained on the [Cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) synthetic dataset.
# Training dataset
The training corpus consisted of 30B tokens, 25B of which are synthetic from Cosmopedia. Since we didn't explore the synthetic generation of code, we augmented the dataset with 5B tokens of non-synthetic sources like the `code-python-0.60-to-1.00` and `web-0.50-to-1.00` subsets of [AutoMathText](https://huggingface.co/datasets/math-ai/AutoMathText). We also added 1M files from [The Stack](https://huggingface.co/datasets/bigcode/the-stack)'s Jupyter Notebooks, converted to scripts. They tend to contain educational code interleaved with text.
We also included [UltraChat](https://huggingface.co/datasets/stingning/ultrachat) formatted in the chat format of `Llama` models, so we don't have to instruction-tune the model after pre-training. Additionally, we upsampled the data from these seed sources twice to help with commonsense and reasoning: stories, AutoMathText & KhanAcademy.
We trained for 6 epochs, resulting in a model trained on 180B tokens, with a sequence length of 2k, a global batch size of 1.3M tokens, and a learning rate of 3e-4 with a cosine schedule over 140k steps.
We used the tokenizer from [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1/).
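Since UltraChat was formatted with the `[INST]` chat template of `Llama` models, the tokenizer shipped with this model can render a conversation into that format. The snippet below is a minimal sketch to inspect the rendered prompt (it assumes the tokenizer on the Hub includes the chat template, which the usage examples below rely on):
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/cosmo-1b")
messages = [{"role": "user", "content": "How to take care of plants?"}]

# Render the conversation without tokenizing, just to see the prompt format
prompt = tokenizer.apply_chat_template(messages, tokenize=False)
print(prompt)  # expected to look like: <s> [INST] How to take care of plants? [/INST]
```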
# How to use
Although the model wasn't instruction-tuned after pre-training, UltraChat was included in the pre-training data, so you can use it in a chat format:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda" # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/cosmo-1b")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/cosmo-1b").to(device)
prompt = "Generate a story involving a dog, an astronaut and a baker"
prompt = tokenizer.apply_chat_template([{"role": "user", "content": prompt}], tokenize=False)
inputs = tokenizer(prompt, return_tensors="pt").to(device)
output = model.generate(**inputs, max_length=300, do_sample=True, temperature=0.6, top_p=0.95, repetition_penalty=1.2)
print(tokenizer.decode(output[0]))
```
Output:
```
<s><s> [INST] Generate a story involving a dog, an astronaut and a baker [/INST] Once upon a time, there was a sweet little terrier named Max who lived in the bustling city of New York. He loved nothing more than chasing after his owner, Dr. Sarah Johnson, as she worked tirelessly on her latest invention - a spaceship that would take humans to distant galaxies!
One day, Dr. Johnson decided it was time for her to leave Earth's atmosphere behind and embark on this exciting adventure with her loyal companion, Max. She knew he had always been fascinated by space travel, so she hoped he would be just as excited about the journey ahead.
As they boarded their rocket ship and blasted off into outer space, Max felt both nervous and thrilled at the same time. His ears perked up every time they passed clouds or saw stars twinkling far out from earth. But as days turned into weeks, Max started feeling homesick. The vast emptiness around him made him feel lonely and isolated.
Meanwhile back on planet Earth, Mr. Baker was busy baking cookies when suddenly, an idea popped into his head. Why not send some treats along with Dr. Johnson's family? It might make them all feel better knowing that someone else was also having fun exploring the universe.
```
You can also use the model in text completion mode, i.e. without applying the chat template, but it might not follow instructions.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda" # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/cosmo-1b")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/cosmo-1b").to(device)
prompt = "Photosynthesis is"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
output = model.generate(**inputs, max_length=300, do_sample=True, temperature=0.6, top_p=0.95, repetition_penalty=1.2)
print(tokenizer.decode(output[0]))
```
Output:
```
<s> Photosynthesis is the process by which green plants, algae and some bacteria convert light energy into chemical energy in order to fuel their metabolic processes. The reaction takes place within specialized cells called chloroplasts. This article focuses on the electron transport chain (ETC), a critical part of photosystem II where most of the solar-driven electrons are passed through before being reduced to water.
```
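If you prefer not to manage the tokenizer and model objects yourself, the same completion can be run through the `pipeline` helper. This is just a convenience sketch with the sampling settings used above (the device index and `max_new_tokens` value are illustrative):
```python
from transformers import pipeline

# device=0 selects the first GPU; use device=-1 to run on CPU
generator = pipeline("text-generation", model="HuggingFaceTB/cosmo-1b", device=0)

output = generator(
    "Photosynthesis is",
    max_new_tokens=256,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.2,
)
print(output[0]["generated_text"])
```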
# Evaluation
Below are the evaluation results of Cosmo-1B. The model is better than TinyLlama 1.1B on ARC-easy, ARC-challenge, OpenBookQA and MMLU, and has comparable performance to Qwen-1.5-1B on ARC-challenge and OpenBookQA.
However, we notice some performance gaps compared to Phi-1.5, suggesting better synthetic generation quality on their side, which can be related to the LLM used for generation, topic coverage or prompts.
![image/png](https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/GgWzl6k9BO9jGhGd5O45y.png)
# Limitations
This is a small 1.8B model trained on synthetic data, so it might hallucinate and give incomplete or incorrect answers.
# Training
## Model
- **Architecture:** Llama-2
- **Pretraining steps:** 120k
- **Pretraining tokens:** 180B
- **Precision:** bfloat16
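As a quick sanity check of the architecture and size listed above, you can inspect the config and count the parameters (a small sketch, not part of the training code):
```python
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("HuggingFaceTB/cosmo-1b")
print(config.model_type)  # expected: "llama"

model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/cosmo-1b")
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e9:.2f}B parameters")  # roughly 1.8B
```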
## Hardware
- **GPUs:** 160 H100
- **Training time:** 15 hours
The training loss:
![image/png](https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/rJobY7F6tqTAvIox1ZGKR.png)
# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_HuggingFaceTB__cosmo-1b).
| Metric |Value|
|---------------------------------|----:|
|Avg. |36.59|
|AI2 Reasoning Challenge (25-Shot)|38.57|
|HellaSwag (10-Shot) |55.13|
|MMLU (5-Shot) |26.69|
|TruthfulQA (0-shot) |38.15|
|Winogrande (5-shot) |55.49|
|GSM8k (5-shot) | 5.53|
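To reproduce numbers in this range locally, one option is the Python API of [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). The snippet below is a hedged sketch: it assumes a recent release (v0.4+) exposing `simple_evaluate`, and the task names and harness version may not exactly match the Leaderboard configuration.
```python
import lm_eval  # pip install lm-eval

# Evaluate one leaderboard-style task; the few-shot setting mirrors the table above,
# but the exact Leaderboard harness version/config may differ.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=HuggingFaceTB/cosmo-1b",
    tasks=["arc_challenge"],
    num_fewshot=25,
    batch_size=8,
)
print(results["results"]["arc_challenge"])
```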