---
base_model: roneneldan/TinyStories-33M
library_name: Distily
tags:
  - generated_from_trainer
model-index:
  - name: distily_bench_obj_cross_v2.1
    results: []
---

# distily_bench_obj_cross_v2.1

This student model was distilled from the teacher model roneneldan/TinyStories-33M using an unspecified dataset.

The Distily library was used for this distillation.
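The student is a standard causal language model, so it can be loaded with Hugging Face Transformers. A minimal sketch, assuming the repo id is `lapp0/distily_bench_obj_cross_v2.1` (inferred from the model name above; adjust if the actual repo id differs):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer


def generate_story(model_id: str = "lapp0/distily_bench_obj_cross_v2.1",
                   prompt: str = "Once upon a time") -> str:
    """Download the student model and sample a short continuation.

    The repo id above is an assumption based on the model name in this card.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=40)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```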

It achieves the following results on the evaluation set:

- eval_enwikippl: 3650.0444
- eval_frwikippl: 29470.7617
- eval_zhwikippl: 52791.2461
- eval_tinystoriesppl: 1183.5695
- eval_loss: 5.1097
- eval_runtime: 6.5331
- eval_samples_per_second: 76.533
- eval_steps_per_second: 9.643
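The `eval_*ppl` values are perplexities, i.e. the exponential of the mean per-token negative log-likelihood on each corpus. Note that each perplexity is measured on a different dataset, so `exp(eval_loss)` need not match any single `eval_*ppl` value. A minimal sketch of the conversion:

```python
import math


def perplexity(mean_nll: float) -> float:
    """Perplexity is exp of the mean per-token negative log-likelihood."""
    return math.exp(mean_nll)


# The reported eval_loss of 5.1097 corresponds to a perplexity of roughly
# 165.6 on whatever token mixture eval_loss averages over.
print(perplexity(5.1097))
```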

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=0, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
- train_embeddings: True
- learning_rate: 0.0004
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 8
- total_train_batch_size: 64
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: constant_with_warmup
- num_epochs: 1.0
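The distillation objective above uses only the logits component (weight 1, KL loss); the hidden-state and attention components have weight 0. A minimal sketch of such a logits-only KL objective, with illustrative names rather than Distily's actual API:

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL divergence from the teacher's to the student's token distribution.

    Illustrative sketch of a logits-only KL objective; not Distily's code.
    """
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # batchmean matches the mathematical definition of KL divergence;
    # the temperature**2 factor keeps gradient magnitudes comparable.
    return F.kl_div(s_log_probs, t_probs, reduction="batchmean") * temperature ** 2


# Example with random logits (batch=2, seq=4, vocab=16)
student = torch.randn(2, 4, 16)
teacher = torch.randn(2, 4, 16)
loss = distillation_loss(student, teacher)
```

Note that total_train_batch_size (64) is train_batch_size (8) multiplied by gradient_accumulation_steps (8).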

## Resource Usage

Peak GPU Memory: 8.0568 GB

### Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 169.9865 | 47377.9414 | | | | | 3.9789 | 4998.1294 |
| 0 | 0 | 21321.3555 | 56774.5312 | 6.6010 | 6.5152 | 76.744 | 9.67 | 11289.9248 | 60744.7383 |
| 500 | 0.6464 | 3754.7207 | 29462.4434 | 5.1110 | 6.5528 | 76.303 | 9.614 | 1235.8627 | 53887.0117 |
| 773 | 0.9994 | 3650.0444 | 29470.7617 | 5.1097 | 6.5331 | 76.533 | 9.643 | 1183.5695 | 52791.2461 |

### Framework versions

- Distily 0.2.0
- Transformers 4.44.0
- Pytorch 2.3.0
- Datasets 2.21.0