|
--- |
|
base_model: meta-llama/Meta-Llama-3-8B |
|
inference: true |
|
model_type: llama |
|
pipeline_tag: text-generation |
|
tags: |
|
- sparse |
|
--- |
|
|
|
# SparseLlama-3-8B-pruned_50.2of4 |
|
|
|
This repo contains model files for a 2:4 (N:M) sparse [Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) model, pruned in one-shot with [SparseGPT](https://arxiv.org/abs/2301.00774) and then further retrained with [SquareHead](https://arxiv.org/abs/2310.06927) knowledge distillation while maintaining the 2:4 sparsity mask.
|
|
|
**Note:** This is still a work in progress and subject to change. We expect to release new weights with even better accuracy soon. |
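To illustrate what the 2:4 pattern above means: in every group of 4 consecutive weights along a row, at most 2 are nonzero, which is what lets hardware-accelerated sparse kernels skip half the multiplications. The helper below is a minimal, hypothetical sketch (not part of any library) for checking that a weight row satisfies this constraint:

```python
import numpy as np

def satisfies_2of4(weights):
    """Return True if every group of 4 consecutive weights has at most 2 nonzeros."""
    w = np.asarray(weights, dtype=np.float64).reshape(-1, 4)
    return bool(np.all((w != 0).sum(axis=1) <= 2))

# First row: each group of 4 has exactly 2 nonzeros -> valid 2:4 pattern
print(satisfies_2of4([0.5, 0.0, -1.2, 0.0, 0.0, 0.0, 0.3, 0.9]))  # True
# This group has 3 nonzeros -> violates the mask
print(satisfies_2of4([1.0, 1.0, 1.0, 0.0]))  # False
```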
|
|
|
## Running the model |
|
|
|
For testing purposes, the model can be run naively in `transformers` (without taking advantage of the sparsity):
|
```python |
|
# pip install transformers accelerate |
|
from transformers import AutoTokenizer, AutoModelForCausalLM |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("nm-testing/SparseLlama-3-8B-pruned_50.2of4") |
|
model = AutoModelForCausalLM.from_pretrained("nm-testing/SparseLlama-3-8B-pruned_50.2of4", device_map="auto") |
|
|
|
input_text = "A poem about Machine Learning goes as follows:" |
|
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda") |
|
|
|
outputs = model.generate(**input_ids) |
|
print(tokenizer.decode(outputs[0])) |
|
``` |
|
|
|
To take advantage of the 2:4 sparsity present, install [nm-vllm](https://github.com/neuralmagic/nm-vllm) for fast inference and low memory-usage: |
|
```bash |
|
pip install nm-vllm[sparse] --extra-index-url https://pypi.neuralmagic.com/simple |
|
``` |
|
|
|
```python |
|
from vllm import LLM, SamplingParams |
|
|
|
model = LLM("nm-testing/SparseLlama-3-8B-pruned_50.2of4", sparsity="semi_structured_sparse_w16a16") |
|
|
|
prompt = "A poem about Machine Learning goes as follows:" |
|
sampling_params = SamplingParams(max_tokens=100, temperature=0) |
|
|
|
outputs = model.generate(prompt, sampling_params=sampling_params) |
|
print(outputs[0].outputs[0].text) |
|
``` |
|
|
|
## Evaluation Benchmark Results |
|
|
|
Model evaluation results obtained via [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) following the configuration of [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard). |
|
|
|
| Benchmark | Meta-Llama-3-8B | SparseLlama-3-8B-pruned_50.2of4<br>(this model) | |
|
|:----------------------------------------------:|:-----------:|:-----------------------------:| |
|
| [ARC-c](https://arxiv.org/abs/1911.01547)<br> 25-shot | 59.47% | 57.76% | |
|
| [MMLU](https://arxiv.org/abs/2009.03300)<br> 5-shot | 65.29% | 60.44% | |
|
| [HellaSwag](https://arxiv.org/abs/1905.07830)<br> 10-shot |82.14% | 79.97% | |
|
| [WinoGrande](https://arxiv.org/abs/1907.10641)<br> 5-shot |77.27% | 77.19% | |
|
| [GSM8K](https://arxiv.org/abs/2110.14168)<br> 5-shot | 44.81% | 47.92% | |
|
| [TruthfulQA](https://arxiv.org/abs/2109.07958)<br> 0-shot | 43.96% | 41.02% | |
|
| **Average<br>Accuracy** | **62.16%** | **60.72%** | |
|
| **Recovery** | **100%** | **97.68%** | |
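The **Recovery** row above is simply the sparse model's average accuracy expressed as a fraction of the dense baseline's average. A quick sanity check of that arithmetic, using the per-benchmark numbers from the table:

```python
# Per-benchmark accuracies from the Open LLM Leaderboard table above
dense = [59.47, 65.29, 82.14, 77.27, 44.81, 43.96]   # Meta-Llama-3-8B
sparse = [57.76, 60.44, 79.97, 77.19, 47.92, 41.02]  # this model

dense_avg = sum(dense) / len(dense)
sparse_avg = sum(sparse) / len(sparse)
recovery = sparse_avg / dense_avg * 100

print(round(dense_avg, 2))   # 62.16
print(round(sparse_avg, 2))  # 60.72
print(round(recovery, 2))    # 97.68
```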
|
|
|
|
|
Model evaluation results obtained via [Mosaic Eval Gauntlet](https://github.com/mosaicml/llm-foundry/blob/main/scripts/eval/local_data/EVAL_GAUNTLET.md) following the configuration of [Eval Gauntlet v0.3](https://github.com/mosaicml/llm-foundry/blob/main/scripts/eval/yamls/eval_gauntlet_v0.3.yaml). |
|
|
|
| Benchmark | Meta-Llama-3-8B | SparseLlama-3-8B-pruned_50.2of4<br>(this model) | |
|
|:------------------------:|:----------------:|:----------------------------------------------:| |
|
| World Knowledge | 58.08% | 54.61% | |
|
| Commonsense Reasoning | 47.66% | 47.62% | |
|
| Language Understanding | 71.13% | 67.58% | |
|
| Symbolic Problem Solving | 38.44% | 32.15% | |
|
| Reading Comprehension | 57.48% | 55.76% | |
|
| **Average Accuracy** | **54.70%** | **51.54%** | |
|
| **Recovery** | **100%** | **94.22%** | |
|
|
|
|
|
## Help |
|
|
|
For further support, and for discussions on these models and AI in general, join [Neural Magic's Slack Community](https://join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ).
|
|
|
## Acknowledgment |
|
|
|
This model is built with Meta Llama 3. For more details on its license, please check the model card of [Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B).