Abstract
We empirically study a simple layer-pruning strategy for popular families of open-weight pretrained LLMs, finding minimal degradation of performance on different question-answering benchmarks until after a large fraction (up to half) of the layers are removed. To prune these models, we identify the optimal block of layers to prune by considering similarity across layers; then, to "heal" the damage, we perform a small amount of finetuning. In particular, we use parameter-efficient finetuning (PEFT) methods, specifically quantization and Low Rank Adapters (QLoRA), such that each of our experiments can be performed on a single A100 GPU. From a practical perspective, these results suggest that layer pruning methods can complement other PEFT strategies to further reduce computational resources of finetuning on the one hand, and can improve the memory and latency of inference on the other hand. From a scientific perspective, the robustness of these LLMs to the deletion of layers implies either that current pretraining methods are not properly leveraging the parameters in the deeper layers of the network or that the shallow layers play a critical role in storing knowledge.
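The block-selection step described above can be sketched as follows. This is a minimal illustration, not the authors' code. The abstract only says "similarity across layers", so the sketch assumes angular distance (a metric named in the discussion below) between the hidden state entering a block and the hidden state leaving it. The function names and the `hidden_states` array layout are hypothetical.

```python
import numpy as np


def angular_distance(x, y, eps=1e-12):
    """Angular distance in [0, 1]: arccos(cosine similarity) / pi."""
    cos = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + eps)
    return float(np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi)


def pick_block_to_prune(hidden_states, n):
    """Find the block of n consecutive layers whose input and output differ least.

    hidden_states: array of shape (num_layers + 1, num_samples, hidden_dim).
    Entry l is assumed to hold the representation entering layer l, and the
    final entry the output of the last layer (an assumed layout).
    Returns (start, mean_distance) for the block [start, start + n).
    """
    num_layers = hidden_states.shape[0] - 1
    if not 1 <= n <= num_layers:
        raise ValueError("n must be between 1 and the number of layers")
    best_start, best_dist = -1, float("inf")
    for start in range(num_layers - n + 1):
        # Average the distance over all samples for this candidate block.
        dists = [
            angular_distance(a, b)
            for a, b in zip(hidden_states[start], hidden_states[start + n])
        ]
        mean_dist = float(np.mean(dists))
        if mean_dist < best_dist:
            best_start, best_dist = start, mean_dist
    return best_start, best_dist


if __name__ == "__main__":
    # Synthetic stand-in for real activations: 6 layers, 3 samples, 16 dims.
    rng = np.random.default_rng(0)
    states = rng.normal(size=(7, 3, 16))
    print(pick_block_to_prune(states, n=2))
```

With a real model, the activations would come from a forward pass over a sample of text. The layers in the selected block would then be dropped before the healing finetune.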
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- ShortGPT: Layers in Large Language Models are More Redundant Than You Expect (2024)
- Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes (2024)
- Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models (2024)
- Model Compression and Efficient Inference for Large Language Models: A Survey (2024)
- LLM Inference Unveiled: Survey and Roofline Model Insights (2024)
Thanks for sharing your work! I was able to demonstrate the model healing process here while using ShortGPT's block influence metric for layer removal/pruning.
@gromovand @kushaltirumala @hassansh @pglo @danintheory super cool! any plans to release the code?
https://github.com/arcee-ai/PruneMe
We tried to replicate the results, and they seem to hold: deeper layers can be removed, and the resulting model can still generate text.
Working on reproducing this and similar pruning criteria here:
https://github.com/melisa-writer/short-transformers
A linear approximation of the last token is there, along with angular distances, BI score, etc.
The goal of the library: choose your distance (layer importance metric), get a cropped model.
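For readers following along, the "choose a metric, get a cropped model" idea can be sketched roughly as below. This is not the short-transformers API. The function names are hypothetical, and the score assumes the block influence (BI) definition as one minus the mean cosine similarity between a layer's input and output hidden states. Unlike the contiguous-block strategy in the abstract, this sketch drops the lowest-scoring layers individually.

```python
import numpy as np


def block_influence(x_in, x_out, eps=1e-12):
    """1 - mean cosine similarity between a layer's input and output.

    x_in, x_out: arrays of shape (num_tokens, hidden_dim).
    A low score means the layer barely changes its input.
    """
    num = np.sum(x_in * x_out, axis=-1)
    den = np.linalg.norm(x_in, axis=-1) * np.linalg.norm(x_out, axis=-1) + eps
    return float(1.0 - np.mean(num / den))


def crop_layers(layers, scores, num_to_remove):
    """Drop the num_to_remove lowest-scoring layers, keeping the original order."""
    if len(layers) != len(scores):
        raise ValueError("need exactly one score per layer")
    if not 0 <= num_to_remove <= len(layers):
        raise ValueError("num_to_remove out of range")
    drop = set(np.argsort(scores, kind="stable")[:num_to_remove].tolist())
    return [layer for i, layer in enumerate(layers) if i not in drop]
```

Here `layers` can be any ordered list, for example the decoder blocks of a model, and `scores` can come from any importance metric, not only BI.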