---
base_model: roneneldan/TinyStories-33M
library_name: Distily
tags:
  - generated_from_trainer
model-index:
  - name: distily_bench_obj_cross_v2.1
    results: []
---

# distily_bench_obj_cross_v2.1

This student model was distilled from the teacher model roneneldan/TinyStories-33M using an unspecified dataset.

The Distily library was used for this distillation.
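The student is a standard causal language model, so it can be loaded with Hugging Face Transformers. A minimal sketch, assuming the repo id is `lapp0/distily_bench_obj_cross_v2.1` (inferred from the model name above; adjust if the actual repo id differs):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer


def generate_story(model_id: str = "lapp0/distily_bench_obj_cross_v2.1",
                   prompt: str = "Once upon a time") -> str:
    """Download the student model and sample a short continuation.

    The repo id above is an assumption based on the model name in this card.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=40)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```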

It achieves the following results on the evaluation set:

- eval_enwikippl: 3650.0444
- eval_frwikippl: 29470.7617
- eval_zhwikippl: 52791.2461
- eval_tinystoriesppl: 1183.5695
- eval_loss: 5.1097
- eval_runtime: 6.5331
- eval_samples_per_second: 76.533
- eval_steps_per_second: 9.643
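The `eval_*ppl` values are perplexities, i.e. the exponential of the mean per-token negative log-likelihood on each corpus. Note that each perplexity is measured on a different dataset, so `exp(eval_loss)` need not match any single `eval_*ppl` value. A minimal sketch of the conversion:

```python
import math


def perplexity(mean_nll: float) -> float:
    """Perplexity is exp of the mean per-token negative log-likelihood."""
    return math.exp(mean_nll)


# The reported eval_loss of 5.1097 corresponds to a perplexity of roughly
# 165.6 on whatever token mixture eval_loss averages over.
print(perplexity(5.1097))
```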

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=0, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
- train_embeddings: True
- learning_rate: 0.0004
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 8
- total_train_batch_size: 64
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: constant_with_warmup
- num_epochs: 1.0
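The distillation objective above uses only the logits component (weight 1, KL loss); the hidden-state and attention components have weight 0. A minimal sketch of such a logits-only KL objective, with illustrative names rather than Distily's actual API:

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL divergence from the teacher's to the student's token distribution.

    Illustrative sketch of a logits-only KL objective; not Distily's code.
    """
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # batchmean matches the mathematical definition of KL divergence;
    # the temperature**2 factor keeps gradient magnitudes comparable.
    return F.kl_div(s_log_probs, t_probs, reduction="batchmean") * temperature ** 2


# Example with random logits (batch=2, seq=4, vocab=16)
student = torch.randn(2, 4, 16)
teacher = torch.randn(2, 4, 16)
loss = distillation_loss(student, teacher)
```

Note that total_train_batch_size (64) is train_batch_size (8) multiplied by gradient_accumulation_steps (8).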

## Resource Usage

Peak GPU Memory: 8.0568 GB

### Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 169.9865 | 47377.9414 | | | | | 3.9789 | 4998.1294 |
| 0 | 0 | 21321.3555 | 56774.5312 | 6.6010 | 6.5152 | 76.744 | 9.67 | 11289.9248 | 60744.7383 |
| 500 | 0.6464 | 3754.7207 | 29462.4434 | 5.1110 | 6.5528 | 76.303 | 9.614 | 1235.8627 | 53887.0117 |
| 773 | 0.9994 | 3650.0444 | 29470.7617 | 5.1097 | 6.5331 | 76.533 | 9.643 | 1183.5695 | 52791.2461 |

### Framework versions

- Distily 0.2.0
- Transformers 4.44.0
- Pytorch 2.3.0
- Datasets 2.21.0