---
license: apache-2.0
---
**Note: internal model, not ready for use**
This is an intermediate model used as the base model for further Pythia 12B SFT-8 experiments.
It was trained on a wider set of instruction-tuning datasets for >12.5k steps with a batch size of 128 and a context size of 2048.
The gpt4all dataset had "as a language model" *contamination* (>1.8k entries). We added filtering later, but this model (pre-v8) was trained on the raw, unfiltered gpt4all dataset.
- wandb: https://wandb.ai/open-assistant/supervised-finetuning/runs/sytsyhrp
- [sampling report](https://open-assistant.github.io/oasst-model-eval/?f=https%3A%2F%2Fraw.githubusercontent.com%2FOpen-Assistant%2Foasst-model-eval%2Fmain%2Fsampling_reports%2Foasst-pretrained%2F2023-05-05_OpenAssistant_pythia-12b-pre-v8-12_5k-steps_sampling_noprefix2.json)
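
The card does not show the filter that was added later, so the following is only a minimal sketch of how entries containing "as a language model" phrasing could be dropped. The regular expression, the `response` field name, and both helper functions are hypothetical and not taken from the Open-Assistant code.

```python
import re

# Hypothetical pattern: matches "as a language model" and "as an AI language model".
# The filter actually used by the project may differ.
CONTAMINATION_PATTERN = re.compile(r"\bas an? (?:ai )?language model\b", re.IGNORECASE)


def is_contaminated(text: str) -> bool:
    """Return True if the text contains the boilerplate phrase."""
    return CONTAMINATION_PATTERN.search(text) is not None


def filter_entries(entries):
    """Keep only entries whose (assumed) 'response' field is free of the phrase."""
    return [entry for entry in entries if not is_contaminated(entry["response"])]


sample = [
    {"prompt": "Tell me a joke.", "response": "Why did the chicken cross the road?"},
    {"prompt": "What is your opinion?", "response": "As a language model, I do not have opinions."},
    {"prompt": "Who are you?", "response": "As an AI language model, I cannot say."},
]
clean = filter_entries(sample)
print(f"kept {len(clean)} of {len(sample)} entries")
```
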
Datasets:
```
pretrain:
  num_train_epochs: 1
  weight_decay: 0.0
  use_custom_sampler: true
  sort_by_length: false
  datasets:
    - gpteacher_roleplay:
        val_split: 0.05
    - red_pajama:
        fraction: 0.25
        max_val_set: 1000
    - wizardlm_70k:
        val_split: 0.05
        max_val_set: 500
    - joke:
        val_split: 0.05
    - poem_instructions:
        val_split: 0.025
    - oa_stackexchange:
        val_split: 0.05
        fraction: 0.1
        max_val_set: 1000
    - tell_a_joke:
        val_split: 0.05
        max_val_set: 250
    - webgpt:
        val_split: 0.05
        max_val_set: 250
    - gpt4all:
        val_split: 0.01
        max_val_set: 1000
    - alpaca_gpt4:
        val_split: 0.025
        max_val_set: 250
    - code_alpaca:
        val_split: 0.05
        max_val_set: 250
    - vicuna:
        max_val_set: 250
    - oig_file:
        source_url: https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl
        max_count: 10000
        min_length: 250
        val_split: 0.05
        max_val_set: 250
    - minimath:
        val_split: 0.05
    - humaneval_mbpp_codegen_qa:
        val_split: 0.05
    - humaneval_mbpp_testgen_qa:
        val_split: 0.05
    - grade_school_math_instructions:
        val_split: 0.05
    - recipes:
        val_split: 0.05
    - cmu_wiki_qa:
        val_split: 0.05
    - oa_wiki_qa_bart_10000row:
        val_split: 0.05
        max_val_set: 250
    - prosocial_dialogue:
        fraction: 0.1
        max_val_set: 250
    - explain_prosocial:
        fraction: 0.075
        max_val_set: 250
    - soda:
        fraction: 0.25
        max_val_set: 1000
    - oa_leet10k:
        val_split: 0.05
        max_val_set: 250
    - dolly15k:
        val_split: 0.05
        max_val_set: 300
```
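
The card does not define `val_split` or `max_val_set`. One plausible reading is that `val_split` is the fraction of a dataset held out for validation and `max_val_set` caps the held-out set at an absolute number of examples. The sketch below illustrates that reading only; the helper `val_set_size` is hypothetical, the trainer's real logic may differ, and the dataset sizes are made-up inputs.

```python
from typing import Optional


def val_set_size(num_examples: int, val_split: float, max_val_set: Optional[int] = None) -> int:
    """Validation set size under the assumed semantics: a fraction, optionally capped."""
    size = int(num_examples * val_split)
    if max_val_set is not None:
        size = min(size, max_val_set)
    return size


# Illustrative dataset sizes only; they are not taken from the card.
capped = val_set_size(70_000, val_split=0.05, max_val_set=500)
uncapped = val_set_size(1_000, val_split=0.05)
print(capped, uncapped)
```

Under this reading, the cap keeps large datasets from dominating evaluation time while small datasets still contribute a proportional validation split.
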
Pythia:
```
pythia-12b-pretrain:
  dtype: fp16
  log_dir: "pythia_log_12b"
  learning_rate: 6e-6
  model_name: EleutherAI/pythia-12b-deduped
  output_dir: pythia_model_12b
  weight_decay: 0.0
  max_length: 2048
  warmup_steps: 100
  gradient_checkpointing: true
  gradient_accumulation_steps: 4
  per_device_train_batch_size: 4
  per_device_eval_batch_size: 4
  eval_steps: 251
  save_steps: 500
  num_train_epochs: 1
  save_total_limit: 2
  deepspeed_config: configs/zero_config_pretrain.json
```
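
As a rough sanity check, the stated batch size of 128 can be related to the per-device settings in this config. The device count below is inferred from that arithmetic and is not stated anywhere in the card; the token figures are upper bounds that assume every sequence fills the full 2048-token context.

```python
# Values taken from the card.
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
global_batch_size = 128
max_length = 2048
steps = 12_500

# Sequences contributed by one device per optimizer step.
per_device_effective = per_device_train_batch_size * gradient_accumulation_steps

# Inferred, not stated in the card: number of devices needed to reach the global batch size.
implied_devices = global_batch_size // per_device_effective

# Upper bounds: real batches contain sequences shorter than max_length.
max_tokens_per_step = global_batch_size * max_length
max_tokens_total = max_tokens_per_step * steps

print(f"per-device effective batch: {per_device_effective}")
print(f"implied device count:       {implied_devices}")
print(f"max tokens per step:        {max_tokens_per_step:,}")
print(f"max tokens over {steps:,} steps: {max_tokens_total:,}")
```
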
Command used: `deepspeed trainer_sft.py --show_dataset_stats --configs defaults pythia-12b-pretrain pretrain --cache_dir .cache/ --output_dir .saved/pythia-12b-super-pretrain2 --deepspeed`
# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_OpenAssistant__pythia-12b-pre-v8-12.5k-steps)
| Metric | Value |
|-----------------------|---------------------------|
| Avg. | 35.93 |
| ARC (25-shot) | 41.47 |
| HellaSwag (10-shot) | 68.8 |
| MMLU (5-shot) | 26.58 |
| TruthfulQA (0-shot) | 36.82 |
| Winogrande (5-shot) | 65.27 |
| GSM8K (5-shot) | 7.66 |
| DROP (3-shot) | 4.89 |
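
The `Avg.` row is consistent with an unweighted mean of the seven benchmark scores in the table, which can be checked with a few lines of Python. The dictionary simply restates the table values.

```python
# Benchmark scores copied from the table above.
scores = {
    "ARC (25-shot)": 41.47,
    "HellaSwag (10-shot)": 68.8,
    "MMLU (5-shot)": 26.58,
    "TruthfulQA (0-shot)": 36.82,
    "Winogrande (5-shot)": 65.27,
    "GSM8K (5-shot)": 7.66,
    "DROP (3-shot)": 4.89,
}

# Unweighted mean, rounded to two decimals like the reported value.
avg = round(sum(scores.values()) / len(scores), 2)
print(avg)
```
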