---
base_model: gpt2
datasets:
  - wikimedia/wikipedia
library_name: Distily
license: mit
tags:
  - bitnet
  - 1.58b
  - generated_from_trainer
model-index:
  - name: distily_multi_experiment
    results: []
---

Summary

Distilled with the Distily library, using gpt2 as the teacher model, on the wikimedia/wikipedia dataset.

Model Architecture:

  • Architecture: GPT2LMHeadModel
  • Total Parameters: 124,439,808
  • Data Type (dtype): torch.bfloat16
  • Model Size: 0.24 GB
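
For reference, a minimal loading and generation sketch using the Transformers library is shown below. The repository id is illustrative (substitute the actual Hub path of this model); the bfloat16 dtype matches the architecture details above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative repository id; replace with the actual Hub path of this model.
repo_id = "lapp0/distily_multi_experiment"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.bfloat16)

inputs = tokenizer("The history of Wikipedia begins with", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```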

Evaluation Metrics Comparison

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 43.25 | 61.25 | | | | | 11.6875 | 19.125 |
| 0 | 0 | 2473901162496.0 | 170424302305280.0 | 25.7744 | 25.131 | 99.479 | 12.455 | 4060086272.0 | 71468255805440.0 |
| 2500 | 0.0404 | 960.0 | 8064.0 | 6.1231 | 25.2285 | 99.094 | 12.407 | 652.0 | 6816.0 |
| 5000 | 0.0808 | 380.0 | 1896.0 | 5.0307 | 25.2563 | 98.985 | 12.393 | 270.0 | 286.0 |
| 7500 | 0.1212 | 230.0 | 824.0 | 4.5129 | 25.128 | 99.491 | 12.456 | 202.0 | 174.0 |
| 10000 | 0.1616 | 171.0 | 628.0 | 4.2261 | 25.2755 | 98.91 | 12.384 | 151.0 | 173.0 |
| 12500 | 0.2020 | 126.5 | 482.0 | 3.8533 | 25.2021 | 99.198 | 12.42 | 106.0 | 156.0 |
| 15000 | 0.2424 | 109.5 | 430.0 | 3.6650 | 25.2487 | 99.015 | 12.397 | 88.0 | 155.0 |
| 17500 | 0.2828 | 93.0 | 350.0 | 3.5198 | 25.1751 | 99.305 | 12.433 | 73.5 | 119.0 |
| 20000 | 0.3232 | 77.5 | 282.0 | 3.3352 | 25.2573 | 98.981 | 12.392 | 63.25 | 135.0 |
| 22500 | 0.3636 | 66.5 | 213.0 | 3.1511 | 25.1782 | 99.292 | 12.431 | 50.75 | 80.0 |
| 25000 | 0.4040 | 63.25 | 197.0 | 3.0803 | 25.2258 | 99.105 | 12.408 | 44.5 | 80.5 |
| 27500 | 0.4444 | 58.5 | 212.0 | 3.0299 | 25.2357 | 99.066 | 12.403 | 41.75 | 68.5 |
| 30000 | 0.4848 | 58.5 | 202.0 | 3.0169 | 25.2481 | 99.017 | 12.397 | 43.25 | 91.5 |
| 32500 | 0.5253 | 58.75 | 173.0 | 3.0014 | 25.2575 | 98.981 | 12.392 | 41.5 | 62.75 |
| 35000 | 0.5657 | 57.25 | 164.0 | 2.9385 | 25.2523 | 99.001 | 12.395 | 38.0 | 49.0 |
| 37500 | 0.6061 | 57.0 | 157.0 | 2.9163 | 25.1539 | 99.388 | 12.443 | 39.25 | 61.75 |
| 40000 | 0.6465 | 54.75 | 172.0 | 2.8984 | 25.2388 | 99.054 | 12.402 | 35.0 | 67.5 |
| 42500 | 0.6869 | 53.0 | 151.0 | 2.8789 | 25.2418 | 99.042 | 12.4 | 35.25 | 49.75 |
| 45000 | 0.7273 | 49.5 | 134.0 | 2.7753 | 25.2511 | 99.005 | 12.395 | 30.25 | 42.25 |
| 47500 | 0.7677 | 50.0 | 124.0 | 2.7506 | 25.2475 | 99.02 | 12.397 | 29.5 | 38.75 |
| 50000 | 0.8081 | 49.0 | 124.5 | 2.7361 | 25.2146 | 99.149 | 12.413 | 28.75 | 38.25 |
| 52500 | 0.8485 | 48.25 | 120.0 | 2.7262 | 25.1855 | 99.264 | 12.428 | 29.125 | 35.0 |
| 55000 | 0.8889 | 47.75 | 117.0 | 2.7099 | 25.2332 | 99.076 | 12.404 | 28.25 | 33.0 |
| 57500 | 0.9293 | 47.25 | 117.5 | 2.7045 | 25.2693 | 98.934 | 12.387 | 28.0 | 32.5 |
| 60000 | 0.9697 | 47.25 | 116.5 | 2.7013 | 25.2549 | 98.991 | 12.394 | 27.875 | 32.25 |
| 61875 | 1.0 | 47.25 | 116.5 | 2.7009 | 25.2212 | 99.123 | 12.41 | 28.0 | 32.25 |

Resource Usage Comparison

  • VRAM Use: 7.7830 GB

Distillation (Teacher -> Student) Architecture Difference:

  • Architecture: GPT2LMHeadModel -> GPT2LMHeadModel
  • Total Parameters: 124,439,808 -> 124,439,808
  • Data Type (dtype): torch.bfloat16 -> torch.bfloat16
  • Model Size: 0.24 GB -> 0.24 GB

Train Dataset

Trained on 145,744,973 tokens from the wikimedia/wikipedia dataset.

  • Num Samples: 247,500
  • Subset: 20231101.en
  • Split: train
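
A sketch of how this training text could be loaded with the Datasets library is shown below. The subset, sample count, and test-split fraction mirror the dataset settings listed under Hyperparameters; Distily's actual preprocessing (tokenization, packing) is not reproduced here.

```python
from datasets import load_dataset

# English Wikipedia dump used for distillation.
dataset = load_dataset("wikimedia/wikipedia", "20231101.en", split="train")

# Subsample and carve out a held-out slice, mirroring
# dataset_sample_size=250000 and dataset_test_size=0.01 below
# (250,000 * 0.99 = 247,500 training samples).
subset = dataset.shuffle(seed=42).select(range(250_000))
splits = subset.train_test_split(test_size=0.01, seed=42)
train_texts = splits["train"]["text"]
```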

Training Objective

DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=5, loss_fn=cos, layer_mapper=layer-2))
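As a rough illustration of what this objective computes, the sketch below combines a KL-divergence loss on logits (weight 1) with a cosine-distance loss on attention maps (weight 5). It is an approximation, not Distily's implementation, and the `layer-2` mapper is interpreted here as a simple one-to-one layer match between the identical 12-layer teacher and student.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out):
    """Illustrative KL-on-logits plus cosine-on-attentions objective.

    `student_out` / `teacher_out` are model outputs produced with
    output_attentions=True, exposing `.logits` and `.attentions`.
    """
    # Logits component (weight 1): KL divergence between teacher and student distributions.
    s_logp = F.log_softmax(student_out.logits, dim=-1)
    t_p = F.softmax(teacher_out.logits, dim=-1)
    logits_loss = F.kl_div(s_logp, t_p, reduction="batchmean")

    # Attention component (weight 5): mean cosine distance per mapped layer.
    # Student layer i is matched to teacher layer i here; the exact "layer-2"
    # mapping is Distily-specific and assumed for illustration.
    attn_losses = []
    for s_attn, t_attn in zip(student_out.attentions, teacher_out.attentions):
        cos = F.cosine_similarity(s_attn.flatten(1), t_attn.flatten(1), dim=-1)
        attn_losses.append((1.0 - cos).mean())
    attn_loss = torch.stack(attn_losses).mean()

    return 1.0 * logits_loss + 5.0 * attn_loss
```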

Hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0001
  • train_batch_size: 4
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.5
  • num_epochs: 1.0
  • distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=5, loss_fn=cos, layer_mapper=layer-2))
  • train_embeddings: True
  • lr_scheduler: <torch.optim.lr_scheduler.LambdaLR object at 0x7f14d416e830>
  • student_model_name_or_path: None
  • student_config_name_or_path: None
  • student_model_config: None
  • reinitialize_weights: None
  • copy_teacher_modules: [('lm_head', False)]
  • student_model_as_bitnet: True
  • student_model_compile: False
  • dropout: None
  • teacher_model_name_or_path: gpt2
  • teacher_load_in_8bit: False
  • teacher_load_in_4bit: False
  • teacher_model_compile: False
  • dataset_uri: wikimedia/wikipedia
  • dataset_subset: 20231101.en
  • dataset_split: train
  • dataset_column_name: text
  • dataset_sample_size: 250000
  • dataset_test_size: 0.01
  • gradient_accumulation_steps: 1
  • weight_decay: 0.0
  • max_grad_norm: 1.0
  • warmup_ratio: 0.5
  • warmup_steps: 0
  • gradient_checkpointing: True
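
The optimizer and learning-rate schedule implied by these settings can be reconstructed roughly as in the sketch below. It is illustrative only: the stand-in student model and the total step count (61,875, the final step in the evaluation table) are assumptions drawn from this card, not Distily code.

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM, get_linear_schedule_with_warmup

# Stand-in student: a freshly initialised GPT-2-sized model. The actual student
# uses Distily's BitNet variant (student_model_as_bitnet: True above).
student = AutoModelForCausalLM.from_config(AutoConfig.from_pretrained("gpt2"))

total_steps = 61_875                     # final step in the evaluation table
warmup_steps = int(0.5 * total_steps)    # lr_scheduler_warmup_ratio: 0.5

optimizer = torch.optim.Adam(student.parameters(), lr=1e-4,
                             betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps)
```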

Framework Versions

  • Distily 0.2.0
  • Transformers 4.44.1
  • PyTorch 2.5.0.dev20240821+cu121
  • Datasets 2.21.0