SparseLlama-3-8B-pruned_50.2of4

This repo contains model files for a 2:4 (N:M) sparse Meta-Llama-3-8B model pruned in one-shot with SparseGPT, and then additionally retrained with the SquareHead knowledge distillation while maintaining the 2:4 sparsity mask.

Note: This is still a work in progress and subject to change. We expect to release new weights with even better accuracy soon.

Running the model

It can be run naively in transformers for testing purposes:

# pip install transformers accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nm-testing/SparseLlama-3-8B-pruned_50.2of4")
model = AutoModelForCausalLM.from_pretrained("nm-testing/SparseLlama-3-8B-pruned_50.2of4", device_map="auto")

input_text = "A poem about Machine Learning goes as follows:"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))

To take advantage of the 2:4 sparsity present, install nm-vllm for fast inference and low memory-usage:

pip install nm-vllm[sparse] --extra-index-url https://pypi.neuralmagic.com/simple
from vllm import LLM, SamplingParams

model = LLM("nm-testing/SparseLlama-3-8B-pruned_50.2of4", sparsity="semi_structured_sparse_w16a16")

prompt = "A poem about Machine Learning goes as follows:"
sampling_params = SamplingParams(max_tokens=100, temperature=0)

outputs = model.generate(prompt, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

Evaluation Benchmark Results

Model evaluation results obtained via lm-evaluation-harness following the configuration of Open LLM Leaderboard.

Benchmark Meta-Llama-3-8B SparseLlama-3-8B-pruned_50.2of4
(this model)
ARC-c
25-shot
59.47% 57.76%
MMLU
5-shot
65.29% 60.44%
HellaSwag
10-shot
82.14% 79.97%
WinoGrande
5-shot
77.27% 77.19%
GSM8K
5-shot
44.81% 47.92%
TruthfulQA
0-shot
43.96% 41.02%
Average
Accuracy
62.16% 60.72%
Recovery 100% 97.68%

Model evaluation results obtained via Mosaic Eval Gauntlet following the configuration of Eval Gauntlet v0.3.

Benchmark Meta-Llama-3-8B SparseLlama-3-8B-pruned_50.2of4
(this model)
World Knowledge 58.08% 54.61%
Commonsense Reasoning 47.66% 47.62%
Language Understanding 71.13% 67.58%
Symbolic Problem Solving 38.44% 32.15%
Reading Comprehension 57.48% 55.76%
Average Accuracy 54.70% 51.54%
Recovery 100% 94.22%

Help

For further support, and discussions on these models and AI in general, join Neural Magic's Slack Community

Acknowledgment

This model is built with Meta Llama 3. For more details on its licence please check the model card of Meta-Llama-3-8B.

Downloads last month
65
Safetensors
Model size
8.03B params
Tensor type
BF16
·
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Model tree for neuralmagic/SparseLlama-3-8B-pruned_50.2of4

Finetuned
(380)
this model
Quantizations
1 model