PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
Abstract
This paper introduces PowerInfer, a high-speed Large Language Model (LLM) inference engine for a personal computer (PC) equipped with a single consumer-grade GPU. The key insight underlying the design of PowerInfer is exploiting the high locality inherent in LLM inference, characterized by a power-law distribution in neuron activation. This distribution indicates that a small subset of neurons, termed hot neurons, are consistently activated across inputs, while the majority, cold neurons, vary based on specific inputs. PowerInfer exploits this insight to design a GPU-CPU hybrid inference engine: hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU, thus significantly reducing GPU memory demands and CPU-GPU data transfers. PowerInfer further integrates adaptive predictors and neuron-aware sparse operators, optimizing the efficiency of neuron activation and computational sparsity. Evaluation shows that PowerInfer attains an average token generation rate of 13.20 tokens/s, with a peak of 29.08 tokens/s, across various LLMs (including OPT-175B) on a single NVIDIA RTX 4090 GPU, only 18% lower than that achieved by a top-tier server-grade A100 GPU. This significantly outperforms llama.cpp by up to 11.69x while retaining model accuracy.
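For intuition, here is a minimal NumPy sketch of the hot/cold split and predictor-gated sparse computation described above. It is not the authors' implementation: the Zipf-sampled activation counts, the 20% hot fraction, and the random stand-in predictor are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: one FFN projection with 1024 neurons over a hidden size of 256.
N_NEURONS, D = 1024, 256
W = rng.standard_normal((N_NEURONS, D)).astype(np.float32)

# Offline profiling step: count how often each neuron fired on sample inputs.
# A power-law distribution means a small "hot" subset fires almost always.
activation_counts = rng.zipf(2.0, N_NEURONS)

# Preload the most frequently activated ~20% of neurons ("hot") to the GPU;
# the remaining "cold" neurons stay in ordinary CPU memory.
order = np.argsort(activation_counts)[::-1]
n_hot = N_NEURONS // 5
hot_ids, cold_ids = order[:n_hot], order[n_hot:]
W_gpu, W_cpu = W[hot_ids], W[cold_ids]   # W_gpu would live in VRAM

def predictor(x):
    """Stand-in for the paper's adaptive per-layer predictor (a small MLP
    trained offline). A random 10%-dense mask is used purely for illustration;
    the input x is ignored here."""
    return rng.random(N_NEURONS) < 0.10

def hybrid_ffn(x):
    """Compute only the rows the predictor marks active, split across the
    two weight partitions, and scatter the partial results into one output."""
    active = predictor(x)
    out = np.zeros(N_NEURONS, dtype=np.float32)
    hot_sel, cold_sel = active[hot_ids], active[cold_ids]
    out[hot_ids[hot_sel]] = W_gpu[hot_sel] @ x     # would execute on the GPU
    out[cold_ids[cold_sel]] = W_cpu[cold_sel] @ x  # executes on the CPU
    return out

x = rng.standard_normal(D).astype(np.float32)
y = hybrid_ffn(x)
print(f"computed {(y != 0).sum()} of {N_NEURONS} neurons")
```

The design point this illustrates is that the dense weight matrix never has to fit in VRAM: only the small hot partition is GPU-resident, while the predictor keeps per-token compute sparse on both devices.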
Community
With LM Studio on a mid-range laptop with 8 GB of VRAM, I get roughly 30 tokens/s with a 4-bit 7B model.
PowerInfer should let me run the full-precision model at about the same speed, or a 4-bit quant at roughly 240 tokens/s (if the fuzzy calculations in my head are right; see the worked numbers below).
This is absolutely nuts. Most exciting paper I've read for a while
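A rough reconstruction of the back-of-envelope estimate above. Every number here is the commenter's assumption, not a measurement from the paper, and the paper's speedups were measured against llama.cpp on desktop GPUs rather than laptop hardware.

```python
# Back-of-envelope check; all inputs are the commenter's assumptions.
baseline_q4 = 30       # tokens/s: 4-bit 7B model on the commenter's laptop
assumed_speedup = 8    # assumed engine speedup (paper reports up to 11.69x)
fp16_vs_q4_size = 4    # fp16 weights take ~4x the bytes of 4-bit weights

# If generation is memory-bandwidth-bound, full precision is ~4x slower,
# so an ~8x engine speedup would leave fp16 near the old 4-bit throughput:
fp16_estimate = baseline_q4 * assumed_speedup / fp16_vs_q4_size  # 60 tokens/s
# Applying the same speedup to the already-quantized model instead:
q4_estimate = baseline_q4 * assumed_speedup                      # 240 tokens/s
print(fp16_estimate, q4_estimate)
```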
If this, or a derivative of this, works with MoE models, then this plus Mixtral is basically a local ChatGPT: super fast, super private, super uncensored, on a mid-range laptop.
Can't wait for the repo to come out. Great paper.