i wanted to learn more about exposure bias mitigation in language models and came across ReMask. it's a neat idea, and i wanted to give it a go.

  • during training, the model processes each input sequence twice: once as the full sequence and once with tokens masked.
  • model outputs are computed for both passes.
  • a divergence loss is computed as the average of the forward and backward KL divergences between the two outputs.
  • the final loss is a weighted sum of the two cross-entropy losses and the divergence loss.
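
the steps above can be sketched in plain python. this is a minimal illustration under stated assumptions, not the actual training code: `remask_style_loss`, `ce_weight` and `div_weight` are hypothetical names, the weights default to 1.0 only because the list above doesn't give values, and the per-position logits are assumed to come from the full and masked forward passes.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl(log_p, log_q):
    """KL(p || q), computed from log-probabilities."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def symmetric_kl(log_p, log_q):
    """Average of the forward and backward KL divergences."""
    return 0.5 * (kl(log_p, log_q) + kl(log_q, log_p))

def remask_style_loss(full_logits, masked_logits, targets,
                      ce_weight=1.0, div_weight=1.0):
    """Weighted sum of both cross-entropy terms and the divergence term.

    full_logits / masked_logits: one list of logits per target position,
    from the full-sequence pass and the masked-sequence pass.
    ce_weight / div_weight are assumed hyperparameters (hypothetical names).
    """
    ce_full = ce_masked = div = 0.0
    for lf, lm, t in zip(full_logits, masked_logits, targets):
        log_p = log_softmax(lf)   # distribution from the full pass
        log_q = log_softmax(lm)   # distribution from the masked pass
        ce_full -= log_p[t]
        ce_masked -= log_q[t]
        div += symmetric_kl(log_p, log_q)
    n = len(targets)
    # average over positions
    return (ce_weight * (ce_full + ce_masked) + div_weight * div) / n

# toy example: 2 positions, vocabulary of 3 tokens
full = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
masked = [[1.5, 0.7, -0.5], [0.0, 0.4, 0.2]]
print(remask_style_loss(full, masked, targets=[0, 2]))
```

when both passes produce the same logits, the divergence term is zero and the loss reduces to the weighted cross-entropy terms alone.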

impl on github

<|user|>
Could Moulin Rouge have been hypothetically used as Spain's Spanish American War triage center?
<|logic|>
The Moulin Rouge cabaret in France had a capacity of 850 people. Spain had 700-800 injured during Spanish American War.
<|answer|>
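
a small sketch of building a prompt in the layout shown above. `build_prompt` is a hypothetical helper, and it assumes the three markers are separated by single newlines and that the prompt ends right after `<|answer|>` so the model can continue from there.

```python
def build_prompt(question, logic):
    """Assemble a prompt using the <|user|> / <|logic|> / <|answer|> markers.

    Hypothetical helper: newline separators are an assumption based on the
    example layout, not a documented format.
    """
    return (
        "<|user|>\n"
        f"{question.strip()}\n"
        "<|logic|>\n"
        f"{logic.strip()}\n"
        "<|answer|>\n"
    )

prompt = build_prompt(
    "Could Moulin Rouge have been hypothetically used as Spain's "
    "Spanish American War triage center?",
    "The Moulin Rouge cabaret in France had a capacity of 850 people. "
    "Spain had 700-800 injured during Spanish American War.",
)
print(prompt)
```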

model: aloobun/ReMask-135m (134M params, F32, safetensors)