How does MergeKit's MoE integration work?

#8
by arhanovich - opened

I have a few questions. For each prompt sent to that model, is only one of the expert models activated? Can the expert models have different parameter sizes? And is it necessary to specify which prompts each expert should activate on?

Two experts are selected for each token at each layer. The experts need to have the same parameter size and the same architecture. The prompts are there to initialize the gates, but you could get better results with fine-tuning (a lot trickier, however!).
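To make "two experts per token at each layer" concrete, here is a minimal sketch of top-2 routing for a single token in plain Python. The function name `top2_route`, the toy gate rows, and the toy experts are all made up for illustration; a real MoE layer does this with tensors, and nothing here is taken from MergeKit's code.

```python
import math

def top2_route(hidden, gate_weights, experts):
    """Route one token's hidden vector through its two best-scoring experts."""
    # One logit per expert: dot product of the token with that expert's gate row.
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in gate_weights]
    # Indices of the two largest logits, best first.
    top2 = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:2]
    # Softmax restricted to the two selected experts.
    peak = max(logits[i] for i in top2)
    exps = [math.exp(logits[i] - peak) for i in top2]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the two selected experts' outputs.
    output = [0.0] * len(hidden)
    for weight, idx in zip(weights, top2):
        expert_out = experts[idx](hidden)
        output = [o + weight * x for o, x in zip(output, expert_out)]
    return top2, weights, output

# Toy setup (hypothetical): four "experts" that just scale their input.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]

chosen, weights, out = top2_route([2.0, 1.0], gate_weights, experts)
print(chosen, weights, out)
```

Because the weighted sum adds the experts' outputs together, every expert has to produce a vector of the same shape, which is one reason the experts must share an architecture and size.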

Thank you for sharing your thoughts. You suggested that fine-tuning the merged MoE model could lead to better results. Does this mean we should approach SFT the same way as for a standard dense model? I am also curious whether continued pretraining on the merged model still works well (I want to extend the context length and inject some in-domain knowledge).

@muziyongshixin There are two different things that can be improved with this frankenMoE:

  • The router weights: positive prompts provide a good initialization, but further fine-tuning (continuous pretraining or SFT) could improve the expert selection
  • Standard layers: as you said, just like a dense model. I'd expect continuous pretraining to still work with this architecture.
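As a rough illustration of how positive prompts could initialize router weights, the sketch below averages hidden-state vectors per expert and uses each average as that expert's gate row. This is a simplified assumption for illustration, not MergeKit's actual procedure; `init_gate_rows` and the toy hidden states are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def init_gate_rows(prompt_hidden_states):
    """Build one gate row per expert from its positive prompts.

    prompt_hidden_states[i] holds one hidden vector per positive prompt
    of expert i (assumed to be precomputed by a base model).
    """
    return [mean_vector(vectors) for vectors in prompt_hidden_states]

def gate_scores(token, gate_rows):
    """Router logits for a token: dot product with each expert's gate row."""
    return [sum(t * w for t, w in zip(token, row)) for row in gate_rows]

# Toy hidden states (made up): expert 0 prompts point one way, expert 1 the other.
gate_rows = init_gate_rows([
    [[1.0, 0.0], [0.8, 0.2]],  # positive prompts for expert 0
    [[0.0, 1.0], [0.2, 0.8]],  # positive prompts for expert 1
])

# A token that resembles expert 0's prompts scores higher on expert 0's row.
scores = gate_scores([1.0, 0.1], gate_rows)
print(gate_rows, scores)
```

An initialization like this only encodes what the prompts happen to capture, which is why further training of the router can still improve expert selection.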

When fine-tuning the MoE model, is it recommended to freeze the attention and MLP weights and train only the gating layer, or should the entire model be trained end-to-end? Any suggestions would be greatly appreciated. Thank you in advance.

There's no particular recommendation; it really depends on what you want to do (improve the selection process vs. improve the model).
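The two options can be sketched as a choice of which parameters stay trainable. The helper below is hypothetical and assumes Mixtral-style parameter names, where the router lives under `...block_sparse_moe.gate.weight`; check the names in your own merged model before relying on the `.gate.` match.

```python
def trainable_flags(param_names, mode):
    """Map each parameter name to True (train) or False (freeze).

    mode="router": train only the gating layers (improve expert selection).
    mode="full":   train everything end-to-end (improve the model itself).
    """
    if mode not in ("router", "full"):
        raise ValueError(f"unknown mode: {mode!r}")
    # ".gate." matches the router but not projections such as "gate_proj".
    return {name: mode == "full" or ".gate." in name for name in param_names}

# Example parameter names (assumed Mixtral-style layout).
names = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.block_sparse_moe.gate.weight",
    "model.layers.0.block_sparse_moe.experts.0.w1.weight",
]

router_only = trainable_flags(names, "router")
full = trainable_flags(names, "full")
print(router_only)

# With a PyTorch model, the flags could be applied like this:
#   flags = trainable_flags([n for n, _ in model.named_parameters()], "router")
#   for name, param in model.named_parameters():
#       param.requires_grad = flags[name]
```

Training only the router touches a tiny fraction of the weights, so it is cheap, but it can only change which experts are picked, not what they know.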
