BitDelta: Your Fine-Tune May Only Be Worth One Bit
Abstract
Large Language Models (LLMs) are typically trained in two phases: pre-training on large internet-scale datasets, and fine-tuning for downstream tasks. Given the higher computational demand of pre-training, it's intuitive to assume that fine-tuning adds less new information to the model, and is thus more compressible. We explore this assumption by decomposing the weights of fine-tuned models into their pre-trained components and an additional delta. We introduce a simple method, BitDelta, which successfully quantizes this delta down to 1 bit without compromising performance. This finding not only highlights the potential redundancy of information added during fine-tuning, but also has significant implications for the multi-tenant serving and storage of fine-tuned models. By enabling the use of a single high-precision base model accompanied by multiple 1-bit deltas, BitDelta reduces GPU memory requirements by more than 10x, which also translates into lower generation latency in multi-tenant settings. We validate BitDelta through experiments across the Llama-2 and Mistral model families, and on models of up to 70B parameters, showing minimal performance degradation in all tested settings.
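The abstract does not spell out the quantizer itself. As a rough illustration of what "a 1-bit delta" can mean, the sketch below assumes a sign-plus-single-scale scheme: each entry of the delta is stored as its sign, and one floating-point scale per tensor is set to the mean absolute delta (the value that minimizes squared error once the signs are fixed). The function names `quantize_delta` and `reconstruct` and the toy weights are hypothetical, and any further calibration of the scale is not shown.

```python
from typing import List, Tuple


def quantize_delta(base: List[float], fine: List[float]) -> Tuple[List[int], float]:
    """Return (signs, scale) for the 1-bit delta of one flattened weight tensor."""
    delta = [f - b for f, b in zip(fine, base)]
    # One bit per weight: +1 for non-negative entries, -1 otherwise.
    signs = [1 if d >= 0 else -1 for d in delta]
    # Assumed choice: a single scale per tensor, the mean absolute delta,
    # which minimizes the squared reconstruction error for fixed signs.
    scale = sum(abs(d) for d in delta) / len(delta)
    return signs, scale


def reconstruct(base: List[float], signs: List[int], scale: float) -> List[float]:
    """Approximate the fine-tuned weights from the base weights and the 1-bit delta."""
    return [b + scale * s for b, s in zip(base, signs)]


# Toy example with made-up weights.
base = [0.50, -0.20, 0.10, 0.80]
fine = [0.52, -0.23, 0.11, 0.76]
signs, scale = quantize_delta(base, fine)
approx = reconstruct(base, signs, scale)
```

In a multi-tenant setting, only `signs` and `scale` would need to be stored per fine-tune, while `base` is shared across all of them.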
Community
So a 7B model would take 28GB VRAM for the base model plus 700MB for the delta. Great for data centers, but for consumers, QLoRA, GPTQ, etc. are far more efficient. I think most ordinary people will happily accept a minor drop in accuracy in exchange for a huge drop in VRAM requirements.
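Estimates like these can be checked with a few lines of arithmetic. The sketch below counts weight storage only, assuming 1 GB = 10^9 bytes and ignoring per-tensor scales, any layers kept in full precision, activations, and the KV cache. Under those assumptions, 28 GB for a 7B base corresponds to 32-bit weights (16-bit weights would be 14 GB), and a pure 1-bit delta comes to about 0.875 GB. The helper names are hypothetical.

```python
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight storage in GB (10^9 bytes) for a given parameter count and precision."""
    return n_params * bits_per_weight / 8 / 1e9


def shared_base_gb(n_params: float, n_finetunes: int, base_bits: int = 16) -> float:
    """One high-precision base model plus one 1-bit delta per fine-tune."""
    return weight_gb(n_params, base_bits) + n_finetunes * weight_gb(n_params, 1)


def separate_models_gb(n_params: float, n_finetunes: int, base_bits: int = 16) -> float:
    """Every fine-tune stored as its own full-precision model."""
    return n_finetunes * weight_gb(n_params, base_bits)


n = 7e9
base_fp16 = weight_gb(n, 16)   # about 14 GB
base_fp32 = weight_gb(n, 32)   # about 28 GB
delta_1bit = weight_gb(n, 1)   # about 0.875 GB

# Eight fine-tunes of the same 7B base, 16-bit weights.
shared = shared_base_gb(n, 8)        # about 21 GB
separate = separate_models_gb(n, 8)  # about 112 GB
```

The saving only shows up when several fine-tunes share one base; for a single model on a consumer GPU, the full-precision base still dominates, which is the point the comment above is making.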
Can you combine this idea with this paper? https://huggingface.co/papers/2312.15166
You could probably compress that model quite a bit, since it has 24 duplicated layers with (presumably) only a small change between the two copies.
If it does work, the idea could be applied to larger models without making them too expensive to run.
@timothelaborie Sounds interesting, will take a look! For a model with 32 layers, the 16 extra layers in the depth up-scaled model can be represented as 1-bit deltas. Main concern would be if they use a lot of data for continued pre-training - BitDelta tends to fail if the weight delta is too large.
E.g., this happened when we tried to compress Mixtral experts, which are hypothesized to have been continually pre-trained from Mistral 7B.
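One way to build intuition for why a large weight delta is a problem is a toy simulation. This is an illustration under stated assumptions, not the authors' analysis: it assumes Gaussian base weights with unit standard deviation, a Gaussian delta with standard deviation `sigma`, and a sign-plus-mean-absolute-scale 1-bit quantizer. The function `relative_weight_error` is hypothetical. The error left in the reconstructed weights, measured against the size of the base weights, grows in proportion to the delta.

```python
import math
import random


def relative_weight_error(sigma: float, n: int = 10_000, seed: int = 0) -> float:
    """Norm of the 1-bit quantization error of the delta, divided by the base norm."""
    rng = random.Random(seed)
    base = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Same seed gives the same noise pattern, so only the magnitude changes with sigma.
    delta = [sigma * rng.gauss(0.0, 1.0) for _ in range(n)]
    # Assumed quantizer: sign of each entry times the mean absolute delta.
    scale = sum(abs(d) for d in delta) / n
    err_sq = sum((d - math.copysign(scale, d)) ** 2 for d in delta)
    base_sq = sum(b * b for b in base)
    return math.sqrt(err_sq / base_sq)


small = relative_weight_error(sigma=0.01)  # light fine-tune (toy value)
large = relative_weight_error(sigma=0.10)  # heavier continued pre-training (toy value)
ratio = large / small
```

In this toy setup a ten-times larger delta leaves a ten-times larger error in the reconstructed weights, which fits the observation that heavily continued-pre-trained models are harder to compress this way.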