---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen2.5-7B
datasets:
- allenai/tulu-3-sft-mixture
---

# Teleut 7b

![image/png](https://cdn-uploads.huggingface.co/production/uploads/634262af8d8089ebaefd410e/UqIi8eztdptvt52Mak_1K.png)

A replication attempt of Tulu 3 on the Qwen 2.5 base models.

## Evals (so far)

| Benchmark | Teleut 7B (measured) | Tülu 3 SFT 8B (reported) | Qwen 2.5 7B Instruct (reported) | Ministral 8B (reported) | Mistral 7B v0.3 (reported) |
|-------------------------|----------------------|--------------------------|---------------------------------|-------------------------|----------------------------|
| BBH (3 shot, CoT) | *64.4%* | **67.9%** | 21.7% | 56.2% | 47.0% (NLL) |
| GSM8K (8 shot, CoT) | 78.5% | 76.2% | **83.8%** | *80.0%* | xx.x% |
| IFEval (prompt loose) | 66.3% | *72.8%* | **74.7%** | 56.4% | 53.0% |
| MMLU (0 shot, CoT) | *73.2%* | 65.9% | **76.6%** | 68.5% | 30.7% (5-shot) |
| MMLU Pro (0 shot, CoT) | *48.3%* | 44.3% | **56.3%** (Unknown) | 32.9% (5-shot) | 30.7% (5-shot) |
| PopQA (15 shot) | 18.9% | **29.3%** | 18.1% | *20.2%* | xx.x% |
| TruthfulQA | 47.2% | 46.8% | **63.1%** | *55.5%* | xx.x% |

## Credits

Big thanks to Retis Labs for providing my 8xH100 polycule used to train and test this model!

Another big thanks to AllenAI for publishing the Tülu 3 data and model series (along with the paper and training details), and to Alibaba for training the original Qwen 2.5 base model series!

```
@article{lambert2024tulu3,
  title = {Tülu 3: Pushing Frontiers in Open Language Model Post-Training},
  author = {
    Nathan Lambert and
    Jacob Morrison and
    Valentina Pyatkin and
    Shengyi Huang and
    Hamish Ivison and
    Faeze Brahman and
    Lester James V. Miranda and
    Alisa Liu and
    Nouha Dziri and
    Shane Lyu and
    Yuling Gu and
    Saumya Malik and
    Victoria Graf and
    Jena D. Hwang and
    Jiangjiang Yang and
    Ronan Le Bras and
    Oyvind Tafjord and
    Chris Wilhelm and
    Luca Soldaini and
    Noah A. Smith and
    Yizhong Wang and
    Pradeep Dasigi and
    Hannaneh Hajishirzi
  },
  year = {2024},
  email = {tulu@allenai.org}
}
```

## Training procedure

[Built with Axolotl](https://github.com/axolotl-ai-cloud/axolotl)

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 3.5e-06
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 2
- total_train_batch_size: 128
- total_eval_batch_size: 64
- optimizer: paged_ademamix_8bit (no additional optimizer arguments)
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 370
- num_epochs: 1

### Framework versions

- Transformers 4.46.3
- PyTorch 2.5.1+cu124
- Datasets 3.1.0
- Tokenizers 0.20.3

### Configuration
Axolotl config (axolotl version: `0.5.2`):

```yaml
base_model: Qwen/Qwen2.5-7B

plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: true

strict: false

chat_template: chatml
datasets:
  - path: allenai/tulu-3-sft-mixture
    type: chat_template
    split: train
    field_messages: messages

dataset_prepared_path: last_run_prepared
#val_set_size: 0.02
output_dir: ./ckpts

sequence_len: 8192
#sample_packing: true
pad_to_sequence_len: true

wandb_project: qwen-2.5-7b-sft
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 2
micro_batch_size: 8
num_epochs: 1
optimizer: paged_ademamix_8bit
lr_scheduler: cosine
learning_rate: 3.5e-6

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
logging_steps: 1
xformers_attention:
flash_attention: true

deepspeed: deepspeed_configs/zero3_bf16.json

warmup_steps: 370
#evals_per_epoch: 4
eval_table_size:
saves_per_epoch: 2
debug:
weight_decay: 0.0
```
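
The batch-size and schedule fields in the config above can be sanity-checked with a few lines of Python. This is only an illustrative sketch, not the code Axolotl runs: it assumes the `cosine` scheduler is the usual linear warmup followed by a half-cosine decay to zero, and `total_steps` is a hypothetical argument because the card does not state the number of optimizer steps in the run. The function names are made up for this example.

```python
import math

# Values copied from the config above.
MICRO_BATCH_SIZE = 8
GRADIENT_ACCUMULATION_STEPS = 2
NUM_DEVICES = 8
PEAK_LR = 3.5e-6
WARMUP_STEPS = 370


def effective_batch_size(micro_batch_size, grad_accum_steps, num_devices):
    """Sequences consumed per optimizer step, summed over all devices."""
    return micro_batch_size * grad_accum_steps * num_devices


def lr_at_step(step, total_steps, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS):
    """Learning rate at a given optimizer step.

    Assumption: linear warmup from 0 to peak_lr over warmup_steps, then a
    half-cosine decay to 0 at total_steps. total_steps is hypothetical here,
    since the card does not report it.
    """
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # clamp in case step overshoots total_steps
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print(effective_batch_size(MICRO_BATCH_SIZE,
                               GRADIENT_ACCUMULATION_STEPS,
                               NUM_DEVICES))
    # Hypothetical run length, chosen only to show the shape of the schedule.
    for step in (0, 185, 370, 1000):
        print(step, lr_at_step(step, total_steps=1000))
```

The first function reproduces the `total_train_batch_size: 128` entry in the hyperparameter list (8 sequences per device, 2 accumulation steps, 8 devices). The second shows that under the assumed schedule the learning rate reaches its 3.5e-6 peak exactly at step 370 and decays from there.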