---
base_model: gpt2
datasets:
- wikimedia/wikipedia
library_name: Distily
license: mit
tags:
- bitnet
- 1.58b
- generated_from_trainer
model-index:
- name: distily_multi_experiment
  results: []
---
# Summary

Distilled with the Distily library, using teacher model gpt2 on the wikimedia/wikipedia dataset. A minimal loading sketch follows the architecture details below.
# Model Architecture
- Architecture: GPT2LMHeadModel
- Total Parameters: 124,439,808
- Data Type (dtype): torch.bfloat16
- Model Size: 0.24 GB
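
The student is a drop-in transformers causal LM. A minimal loading sketch, assuming the model is published on the Hugging Face Hub (the repo id below is hypothetical; substitute the id this card is published under):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "distily/distily_multi_experiment"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.bfloat16)

inputs = tokenizer("Knowledge distillation is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```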
# Evaluation Metrics Comparison

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 43.25 | 61.25 | | | | | 11.6875 | 19.125 |
| 0 | 0 | 2473901162496.0 | 170424302305280.0 | 25.7744 | 25.131 | 99.479 | 12.455 | 4060086272.0 | 71468255805440.0 |
| 2500 | 0.0404 | 960.0 | 8064.0 | 6.1231 | 25.2285 | 99.094 | 12.407 | 652.0 | 6816.0 |
| 5000 | 0.0808 | 380.0 | 1896.0 | 5.0307 | 25.2563 | 98.985 | 12.393 | 270.0 | 286.0 |
| 7500 | 0.1212 | 230.0 | 824.0 | 4.5129 | 25.128 | 99.491 | 12.456 | 202.0 | 174.0 |
| 10000 | 0.1616 | 171.0 | 628.0 | 4.2261 | 25.2755 | 98.91 | 12.384 | 151.0 | 173.0 |
| 12500 | 0.2020 | 126.5 | 482.0 | 3.8533 | 25.2021 | 99.198 | 12.42 | 106.0 | 156.0 |
| 15000 | 0.2424 | 109.5 | 430.0 | 3.6650 | 25.2487 | 99.015 | 12.397 | 88.0 | 155.0 |
| 17500 | 0.2828 | 93.0 | 350.0 | 3.5198 | 25.1751 | 99.305 | 12.433 | 73.5 | 119.0 |
| 20000 | 0.3232 | 77.5 | 282.0 | 3.3352 | 25.2573 | 98.981 | 12.392 | 63.25 | 135.0 |
| 22500 | 0.3636 | 66.5 | 213.0 | 3.1511 | 25.1782 | 99.292 | 12.431 | 50.75 | 80.0 |
| 25000 | 0.4040 | 63.25 | 197.0 | 3.0803 | 25.2258 | 99.105 | 12.408 | 44.5 | 80.5 |
| 27500 | 0.4444 | 58.5 | 212.0 | 3.0299 | 25.2357 | 99.066 | 12.403 | 41.75 | 68.5 |
| 30000 | 0.4848 | 58.5 | 202.0 | 3.0169 | 25.2481 | 99.017 | 12.397 | 43.25 | 91.5 |
| 32500 | 0.5253 | 58.75 | 173.0 | 3.0014 | 25.2575 | 98.981 | 12.392 | 41.5 | 62.75 |
| 35000 | 0.5657 | 57.25 | 164.0 | 2.9385 | 25.2523 | 99.001 | 12.395 | 38.0 | 49.0 |
| 37500 | 0.6061 | 57.0 | 157.0 | 2.9163 | 25.1539 | 99.388 | 12.443 | 39.25 | 61.75 |
| 40000 | 0.6465 | 54.75 | 172.0 | 2.8984 | 25.2388 | 99.054 | 12.402 | 35.0 | 67.5 |
| 42500 | 0.6869 | 53.0 | 151.0 | 2.8789 | 25.2418 | 99.042 | 12.4 | 35.25 | 49.75 |
| 45000 | 0.7273 | 49.5 | 134.0 | 2.7753 | 25.2511 | 99.005 | 12.395 | 30.25 | 42.25 |
| 47500 | 0.7677 | 50.0 | 124.0 | 2.7506 | 25.2475 | 99.02 | 12.397 | 29.5 | 38.75 |
| 50000 | 0.8081 | 49.0 | 124.5 | 2.7361 | 25.2146 | 99.149 | 12.413 | 28.75 | 38.25 |
| 52500 | 0.8485 | 48.25 | 120.0 | 2.7262 | 25.1855 | 99.264 | 12.428 | 29.125 | 35.0 |
| 55000 | 0.8889 | 47.75 | 117.0 | 2.7099 | 25.2332 | 99.076 | 12.404 | 28.25 | 33.0 |
| 57500 | 0.9293 | 47.25 | 117.5 | 2.7045 | 25.2693 | 98.934 | 12.387 | 28.0 | 32.5 |
| 60000 | 0.9697 | 47.25 | 116.5 | 2.7013 | 25.2549 | 98.991 | 12.394 | 27.875 | 32.25 |
| 61875 | 1.0 | 47.25 | 116.5 | 2.7009 | 25.2212 | 99.123 | 12.41 | 28.0 | 32.25 |
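
The `*ppl` columns report perplexity on held-out text from each dataset. As a rough illustration of how such a number is computed (the exact evaluation harness lives in Distily and may differ), a sketch:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text: str) -> float:
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        # With labels supplied, the model returns mean token cross-entropy.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

tok = AutoTokenizer.from_pretrained("gpt2")
m = AutoModelForCausalLM.from_pretrained("gpt2")
print(perplexity(m, tok, "The quick brown fox jumps over the lazy dog."))
```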
# Resource Usage Comparison
- VRAM Use: 7.7830 GB
# Distillation (Teacher -> Student) Architecture Difference
- Architecture: GPT2LMHeadModel -> GPT2LMHeadModel
- Total Parameters: 124,439,808 -> 124,439,808
- Data Type (dtype): torch.bfloat16 -> torch.bfloat16
- Model Size: 0.24 GB -> 0.24 GB
# Train Dataset

Trained on 145,744,973 tokens from the wikimedia/wikipedia dataset (a loading sketch follows the list below).
- Num Samples: 247,500
- Subset: 20231101.en
- Split: train
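
Reproducing this split with the datasets library, assuming the sample size (250,000), test fraction (0.01), and seed (42) listed under Hyperparameters below:

```python
from datasets import load_dataset

ds = load_dataset("wikimedia/wikipedia", "20231101.en", split="train")
ds = ds.select(range(250_000))                         # dataset_sample_size
split = ds.train_test_split(test_size=0.01, seed=42)   # dataset_test_size, seed
print(split["train"].num_rows, split["test"].num_rows)  # 247500 2500
```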
# Training Objective

`DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=5, loss_fn=cos, layer_mapper=layer-2))`
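
In plain PyTorch terms, this objective combines a KL divergence over the logits (weight 1) with a cosine distance over attention maps (weight 5). The sketch below is an illustration, not Distily's implementation; in particular, the `layer-2` mapper is assumed here to pair each student layer with the teacher layer two positions deeper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out):
    # Both forward passes must be run with output_attentions=True.
    # Logits component (weight 1): KL(teacher || student) over the vocabulary.
    s_logp = F.log_softmax(student_out.logits, dim=-1)
    t_prob = F.softmax(teacher_out.logits, dim=-1)
    logits_loss = F.kl_div(s_logp, t_prob, reduction="batchmean")

    # Attention component (weight 5): cosine distance between mapped layers.
    # Assumed "layer-2" mapping: student layer i <-> teacher layer i + 2.
    pairs = [(i, i + 2) for i in range(len(student_out.attentions) - 2)]
    attn_loss = torch.stack([
        (1 - F.cosine_similarity(
            student_out.attentions[s].flatten(1),   # (B, H*T*T)
            teacher_out.attentions[t].flatten(1),
            dim=-1,
        )).mean()
        for s, t in pairs
    ]).mean()

    return logits_loss + 5.0 * attn_loss
```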
# Hyperparameters

The following hyperparameters were used during training; a sketch of an equivalent optimizer and scheduler setup follows the list.
- learning_rate: 0.0001
- train_batch_size: 4
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.5
- num_epochs: 1.0
- distillation_objective: `DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=5, loss_fn=cos, layer_mapper=layer-2))`
- train_embeddings: True
- lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at 0x7f14d416e830>`
- student_model_name_or_path: None
- student_config_name_or_path: None
- student_model_config: None
- reinitialize_weights: None
- copy_teacher_modules: [('lm_head', False)]
- student_model_as_bitnet: True
- student_model_compile: False
- dropout: None
- teacher_model_name_or_path: gpt2
- teacher_load_in_8bit: False
- teacher_load_in_4bit: False
- teacher_model_compile: False
- dataset_uri: wikimedia/wikipedia
- dataset_subset: 20231101.en
- dataset_split: train
- dataset_column_name: text
- dataset_sample_size: 250000
- dataset_test_size: 0.01
- gradient_accumulation_steps: 1
- weight_decay: 0.0
- max_grad_norm: 1.0
- warmup_ratio: 0.5
- warmup_steps: 0
- gradient_checkpointing: True
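
For reference, a hedged sketch of an equivalent optimizer and scheduler setup in plain PyTorch (Distily drives training through the Hugging Face Trainer, so the actual wiring differs):

```python
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(8, 8)  # stand-in for the student model
optimizer = torch.optim.Adam(
    model.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0
)

# 247,500 samples / batch size 4 = 61,875 steps for one epoch (matching the
# final row of the metrics table); warmup covers the first half of training.
num_training_steps = 61_875
num_warmup_steps = int(0.5 * num_training_steps)  # lr_scheduler_warmup_ratio
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps, num_training_steps
)
```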
# Framework Versions
- Distily 0.2.0
- Transformers 4.44.1
- Pytorch 2.5.0.dev20240821+cu121
- Datasets 2.21.0