cosmosage-v3 / README.md
Tijmen2's picture
Upload folder using huggingface_hub
98a9e82 verified
|
raw
history blame
3.64 kB
metadata
tags:
  - generated_from_trainer
model-index:
  - name: home/005/th5351/output
    results: []

Built with Axolotl

See axolotl config

axolotl version: 0.4.1

base_model: /home/005/th5351/models/cosmosage-llama3-8b-base/
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

chat_template: llama3
datasets:
  - path: /home/005/th5351/datasets/combined_sft.jsonl
    type: chat_template
    chat_template: llama3
    field_messages: conversations
    message_field_role: from
    message_field_content: value
    roles:
      system:
        - system
      user:
        - human
      assistant:
        - gpt
    

dataset_prepared_path: /home/005/th5351/output/last_run_prepared
val_set_size: 0.001
eval_sample_packing: false
output_dir: /home/005/th5351/output

sequence_len: 8192
sample_packing: true
pad_to_sequence_len: true

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 2
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 5e-5
cosine_min_lr_ratio: 0.2
cosine_constant_lr_ratio: 0.8
max_grad_norm: 3.0

seed: 42

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 100
evals_per_epoch: 5
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed: /home/005/th5351/packages/axolotl/deepspeed_configs/zero2.json
ddp_timeout: 3600000
weight_decay: 0.0
fsdp:
fsdp_config:

home/005/th5351/output

This model was trained from scratch on the None dataset. It achieves the following results on the evaluation set:

  • Loss: nan

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-05
  • train_batch_size: 1
  • eval_batch_size: 1
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 4
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 32
  • total_eval_batch_size: 4
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 100
  • num_epochs: 2

Training results

Training Loss Epoch Step Validation Loss
1.3757 0.0005 1 nan
0.8083 0.1999 388 nan
0.8005 0.3998 776 nan
0.7389 0.5998 1164 nan
0.7269 0.7997 1552 nan
0.7069 0.9996 1940 nan
0.5786 1.1613 2328 nan
0.5385 1.3613 2716 nan
0.5381 1.5612 3104 nan
0.5273 1.7611 3492 nan
0.527 1.9610 3880 nan

Framework versions

  • Transformers 4.41.1
  • Pytorch 2.3.0+cu121
  • Datasets 2.19.1
  • Tokenizers 0.19.1