|
--- |
|
license: apache-2.0 |
|
language: |
|
- en |
|
tags: |
|
- MoE |
|
--- |
|
# LLaMA-MoE-v2-3.8B (1+1/7) SFT |
|
|
|
[[💻 Code]](https://github.com/OpenSparseLLMs/LLaMA-MoE-v2) | [[📃 Technical Report]](https://arxiv.org/pdf/2411.15708)
|
|
|
LLaMA-MoE-v2 is a series of open-source Mixture-of-Experts (MoE) models based on [LLaMA3](https://github.com/facebookresearch/llama).
|
We build LLaMA-MoE-v2 in the following two steps:
|
1. **Partition** LLaMA's FFN layers or attention layers into sparse experts and insert a top-K gate in front of each layer of experts (see the gating sketch after this list).
|
2. **Supervised fine-tune** the constructed MoE models on open-source data with two-stage training.
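
The gate is the core of step 1: for every token, a small linear router scores all experts and only the top-K of them are executed. Below is a minimal, self-contained PyTorch sketch of this idea. It is illustrative only and not the repository's implementation; names such as `TopKMoELayer`, `num_experts`, and `top_k` are assumptions for the example, and the experts here are plain MLPs rather than slices of LLaMA's SwiGLU FFN.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoELayer(nn.Module):
    """Illustrative top-K MoE layer: route each token to K of N experts."""

    def __init__(self, hidden_size: int, intermediate_size: int,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Linear router that scores every expert for every token.
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        # Each expert is a small FFN; in LLaMA-MoE-v2 the experts are
        # partitioned from the original dense layers instead.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, intermediate_size),
                nn.SiLU(),
                nn.Linear(intermediate_size, hidden_size),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_size)
        scores = F.softmax(self.gate(x), dim=-1)          # routing probabilities
        weights, idx = scores.topk(self.top_k, dim=-1)    # keep the top-K experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                     # tokens routed to expert e at slot k
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out
```

In the residual (1+1/7) variant listed below, one shared expert is always activated and a top-1 gate selects one of the remaining seven routed experts, which is why both models report two activated experts per token.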
|
|
|
|
|
| Model | \#Activated Experts | \#Experts | \#Activated Params | SFT Model | |
|
| :-----------------------: | :-----------------: | :-------: | :----------------: | :------------------------------------------------------------------------: | |
|
| **LLaMA-MLP-MoE (2/8)**   | 2 | 8 | 3.8B | [🤗 SFT](https://huggingface.co/llama-moe/LLaMA-MoE-v2-3_8B-2_8-sft) |
|
| **LLaMA-MLP-MoE (1+1/7)** | 2 | 8 | 3.8B | [🤗 SFT](https://huggingface.co/llama-moe/LLaMA-MoE-v2-3_8B-residual-sft) |
|
|
|
|
|
## 🚀 QuickStart
|
|
|
```python |
|
# python>=3.10 |
|
|
|
import torch |
|
from transformers import AutoTokenizer, AutoModelForCausalLM |
|
|
|
model_dir = "llama-moe/LLaMA-MoE-v2-3_8B-residual-sft" |
|
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True) |
|
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16, trust_remote_code=True) |
|
model.eval() |
|
model.cuda() |
|
|
|
input_text = "Could you recommend me some mystery novels?" |
|
input_text = f"<|start_header_id|>user<|end_header_id|>\n\n{input_text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" |
|
inputs = tokenizer(input_text, return_tensors="pt") |
|
input_ids = inputs["input_ids"].cuda() |
|
|
|
# Sample up to 200 tokens in total (prompt + completion)
pred = model.generate(input_ids, max_length=200, temperature=1.0, do_sample=True, use_cache=True)
|
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True)) |
|
""" |
|
I'd be delighted to recommend some mystery novels to you! Here are a few suggestions across various sub-genres: |
|
|
|
**Classic Whodunit** |
|
|
|
1. "And Then There Were None" by Agatha Christie - A timeless tale of ten strangers who are invited to an isolated island, only to be killed off one by one. |
|
2. "The Murder on the Orient Express" by Agatha Christie - A classic whodunit set on a luxurious train traveling from Istanbul to Paris, where a famous author goes missing. |
|
3. "The Devil in the White City" by Erik Larson - A non-fiction book that combines historical events with a mystery, exploring the 1893 World's Columbian Exposition in Chicago and the serial killer H.H. Holmes. |
|
|
|
**Modern Whodunits** |
|
|
|
1. "Gone Girl" by Gillian Flynn - A twisty, psychological thriller about a couple whose seemingly perfect ... |
|
""" |
|
``` |
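
If the bundled tokenizer ships a chat template (an assumption; check `tokenizer_config.json`), the prompt above can be built equivalently with `apply_chat_template`. This is a convenience sketch that continues from the `tokenizer` and `model` defined in the snippet above:

```python
# Equivalent prompt construction via the tokenizer's chat template, assuming
# one is provided; otherwise fall back to the manual formatting shown above.
messages = [{"role": "user", "content": "Could you recommend me some mystery novels?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).cuda()

pred = model.generate(input_ids, max_new_tokens=200, temperature=1.0, do_sample=True)
print(tokenizer.decode(pred[0], skip_special_tokens=True))
```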
|
|
|
## 📊 Performance
|
|
|
| Model | #Training Tokens | MMLU(5) | GSM8k(8) | HumanEval(pass@10) | IFEval | BoolQ(32) | SciQ | PIQA | ARC-c(25) | TruthfulQA | HellaSwag(10) | |
|
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:| |
|
| [LLaMA3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | 15T | 67.2 | 76.5 | 71.4 | 76.5 | 83.0 | 93.2 | 78.5 | 61.9 | 51.7 | 78.8 | |
|
| [INCITE-3B](https://huggingface.co/togethercomputer/RedPajama-INCITE-Instruct-3B-v1) | 1T | 25.1 | 2.1 | 6.92 | 30.1 | 66.5 | 94.7 | 74.4 | 40.2 | 36.4 | 65.6 | |
|
| [Sheared-LLaMA-2.7B](https://huggingface.co/princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT) | 50B | 28.2 | 1.9 | 3.2 | 28.8 | 67.6 | 75.8 | 41.1 | 47.6 | 71.2 | 39.0 | |
|
| [Gemma-2-2b](https://huggingface.co/google/gemma-2-2b-it) | 2T | 53.0 | 26.3 | 46.1 | 34.9 | 72.3 | 75.8 | 67.5 | 52.6 | 50.8 | 69.0 | |
|
| [Salamandra-2b](https://huggingface.co/BSC-LT/salamandra-2b-instruct) | 7.8T | 25.1 | 1.90 | 5.82 | 27.7 | 68.0 | 89.8 | 74.7 | 46.3 | 43.4 | 62.3 | |
|
| [SmolLM2-1.7B](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct) | 11T | 50.4 | 38.5 | 39.1 | 29.0 | 68.2 | 84.3 | 76.0 | 53.2 | 39.9 | 72.6 | |
|
| [OpenMoE-3B-9B](https://huggingface.co/OrionZheng/openmoe-8b-chat) | 1T | 26.5 | 1.36 | 1.01 | 31.2 | 61.7 | 68.4 | 65.7 | 33.3 | 40.5 | 56.5 | |
|
| [LLaMA-MoE-3B-7B](https://huggingface.co/llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft) | 200B | 28.2 | 4.62 | 12.0 | 28.1 | 68.1 | 88.8 | 77.9 | 44.0 | 33.3 | 73.2 | |
|
| [OLMoE-1B-7B](https://huggingface.co/allenai/OLMoE-1B-7B-0924-SFT) | 1T | 53.8 | 40.9 | 40.5 | 35.5 | 80.9 | 94.9 | 80.1 | 55.6 | 43.3 | 79.6 | |
|
| **MLP-MoE (8top2)** | **7B** | 40.6 | 53.1 | 53.5 | 32.7 | 74.6 | 90.6 | 69.3 | 42.8 | 45.6 | 59.0 | |
|
| **MLP-MoE (8top2)** | **8.4B** | 41.0 | **59.6** | **57.1** | 31.7 | 74.5 | 90.2 | 69.5 | 43.3 | 46.9 | 58.1 | |
|
| **MLP-MoE (1+7top1)** | **7B** | 42.7 | 55.0 | 51.2 | **36.0** | 76.9 | 88.8 | 67.9 | 40.2 | 46.9 | 53.7 | |
|
|
|
|
|
## 📑 Citation
|
|
|
```bibtex |
|
@misc{llama-moe-v2, |
|
title={LLaMA-MoE v2: Exploring Sparsity of LLaMA from Perspective of Mixture-of-Experts with Post-Training}, |
|
author={Xiaoye Qu and Daize Dong and Xuyang Hu and Tong Zhu and Weigao Sun and Yu Cheng},
|
year={2024}, |
|
month={Nov}, |
|
url={https://arxiv.org/abs/2411.15708} |
|
} |
|
``` |
|
|