Error when running example code: size mismatches for every layer
This is the example code from the documentation for MegaForCausalLM (https://huggingface.co/docs/transformers/main/model_doc/mega):
from transformers import AutoTokenizer, MegaForCausalLM, AutoConfig
import torch
tokenizer = AutoTokenizer.from_pretrained("mnaylor/mega-base-wikitext")
config = AutoConfig.from_pretrained("mnaylor/mega-base-wikitext")
config.is_decoder = True
config.bidirectional = False
model = MegaForCausalLM.from_pretrained("mnaylor/mega-base-wikitext", config=config)
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)
prediction_logits = outputs.logits
After installing Transformers from source, when I run the above code snippet on Colab, I get this error:
RuntimeError: Error(s) in loading state_dict for MegaForCausalLM:
size mismatch for mega.layers.0.mega_layer.ema_gate.damping_factor: copying a param with shape torch.Size([256, 16, 1]) from checkpoint, the shape in current model is torch.Size([128, 16, 1]).
size mismatch for mega.layers.0.mega_layer.ema_gate.decay_factor: copying a param with shape torch.Size([256, 16, 1]) from checkpoint, the shape in current model is torch.Size([128, 16, 1]).
size mismatch for mega.layers.0.mega_layer.ema_gate.ema_expansion_matrix: copying a param with shape torch.Size([256, 16, 1]) from checkpoint, the shape in current model is torch.Size([128, 16, 1]).
size mismatch for mega.layers.0.mega_layer.ema_gate.kernel_projection_matrix: copying a param with shape torch.Size([256, 16]) from checkpoint, the shape in current model is torch.Size([128, 16]).
size mismatch for mega.layers.1.mega_layer.ema_gate.damping_factor: copying a param with shape torch.Size([256, 16, 1]) from checkpoint, the shape in current model is torch.Size([128, 16, 1]).
size mismatch for mega.layers.1.mega_layer.ema_gate.decay_factor: copying a param with shape torch.Size([256, 16, 1]) from checkpoint, the shape in current model is torch.Size([128, 16, 1]).
size mismatch for mega.layers.1.mega_layer.ema_gate.ema_expansion_matrix: copying a param with shape torch.Size([256, 16, 1]) from checkpoint, the shape in current model is torch.Size([128, 16, 1]).
size mismatch for mega.layers.1.mega_layer.ema_gate.kernel_projection_matrix: copying a param with shape torch.Size([256, 16]) from checkpoint, the shape in current model is torch.Size([128, 16]).
size mismatch for mega.layers.2.mega_layer.ema_gate.damping_factor: copying a param with shape torch.Size([256, 16, 1]) from checkpoint, the shape in current model is torch.Size([128, 16, 1]).
size mismatch for mega.layers.2.mega_layer.ema_gate.decay_factor: copying a param with shape torch.Size([256, 16, 1]) from checkpoint, the shape in current model is torch.Size([128, 16, 1]).
size mismatch for mega.layers.2.mega_layer.ema_gate.ema_expansion_matrix: copying a param with shape torch.Size([256, 16, 1]) from checkpoint, the shape in current model is torch.Size([128, 16, 1]).
size mismatch for mega.layers.2.mega_layer.ema_gate.kernel_projection_matrix: copying a param with shape torch.Size([256, 16]) from checkpoint, the shape in current model is torch.Size([128, 16]).
size mismatch for mega.layers.3.mega_layer.ema_gate.damping_factor: copying a param with shape torch.Size([256, 16, 1]) from checkpoint, the shape in current model is torch.Size([128, 16, 1]).
size mismatch for mega.layers.3.mega_layer.ema_gate.decay_factor: copying a param with shape torch.Size([256, 16, 1]) from checkpoint, the shape in current model is torch.Size([128, 16, 1]).
size mismatch for mega.layers.3.mega_layer.ema_gate.ema_expansion_matrix: copying a param with shape torch.Size([256, 16, 1]) from checkpoint, the shape in current model is torch.Size([128, 16, 1]).
size mismatch for mega.layers.3.mega_layer.ema_gate.kernel_projection_matrix: copying a param with shape torch.Size([256, 16]) from checkpoint, the shape in current model is torch.Size([128, 16]).
You may consider adding ignore_mismatched_sizes=True
in the model from_pretrained
method.
Ahh, good catch that the example in the docs doesn't work. The reason is that the checkpoint is for a masked LM, and it seems that bidirectional and unidirectional models have different parameter sizes in the EMA layer. I must have missed that when setting it up initially. There is no unidirectional checkpoint at the moment as far as I'm aware.
As a side note, this isn't really the place for this sort of discussion - I am neither a Hugging Face employee nor a maintainer of the library. I just contributed the initial implementation into the Hugging Face library by translating the original repo, and I made this MLM checkpoint for that + getting started with BERT-like tasks. If you run into issues with Transformers package usage, you'll get better help by opening an issue on their GitHub. I'll try to answer the question in your other thread as well, but I hope you understand that I won't be able to provide much detailed or ongoing support.
Thank you! I appreciate your help. I will post this on their Github.