FA Increases possible context length @Q4
Using the Flash Attention implementation in KoboldCPP, it is possible to fit 16K context into 8GB of VRAM @Q4_K_M.
With the display running off an iGPU, I can fit 16K @Q5_K_S with FA and a batch size of 512 into 8GB.
For the usual use case, with a monitor running on the GPU, it's still possible. This is with one monitor on my GPU at 16K context.
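For a rough sense of where the memory goes, here is a back-of-the-envelope sketch. The model shape (32 layers, 32 query heads, 8 KV heads, head dimension 128) is an assumption based on a Llama-3-8B-style model, not something stated above, and the helper names are made up for illustration. The real buffers KoboldCPP allocates will differ. The idea is only that the KV cache grows with context either way, while Flash Attention avoids materializing the full attention score matrix for each batch.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Bytes for the K and V caches over n_ctx tokens (FP16 elements by default)."""
    # Factor 2 covers the separate K and V tensors.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


def attn_scores_bytes(n_heads, n_batch, n_ctx, bytes_per_elem=4):
    """Bytes for one layer's full score matrix (FP32 elements by default)."""
    # This is the tensor that a fused flash-attention kernel does not materialize.
    return n_heads * n_batch * n_ctx * bytes_per_elem


GIB = 1024 ** 3

# ASSUMPTION: Llama-3-8B-like shape; swap in the numbers for your own model.
kv_gib = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=16 * 1024) / GIB
scores_gib = attn_scores_bytes(n_heads=32, n_batch=512, n_ctx=16 * 1024) / GIB

print(f"KV cache at 16K context: {kv_gib:.2f} GiB")
print(f"Score matrix for a 512-token batch: {scores_gib:.2f} GiB")
```

Under those assumptions the KV cache comes to 2 GiB and the score matrix to about 1 GiB, which is the kind of headroom that matters on an 8GB card.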
And FA support for cards without tensor cores is coming: https://github.com/LostRuins/koboldcpp/issues/844
That's great news. Hurray. I still keep my usual recommendation for now because of the Tensor Core requirement, but if that's lifted I'll add it as an additional recommendation if speeds are good.
Just adding a small data point: with KoboldCPP compiled with this, running a Q8_K 11B model on a 2 x 1080 Ti (Pascal) setup, I get:
~20.2 T/s avg (proc + gen) with FP32 FA enabled.
~13.4 T/s avg (proc + gen) with FP32 FA disabled.
So a significant improvement in my case. Whereas with FP16 FA, I saw a decrease. So it definitely has utility for a subset of users.
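To put the two figures above side by side, a quick calculation (the function name is just for illustration):

```python
def speedup(new_tps, old_tps):
    """Ratio of two throughput figures in tokens per second."""
    return new_tps / old_tps


# Averages reported above for the 2 x 1080 Ti setup.
fa_on, fa_off = 20.2, 13.4
ratio = speedup(fa_on, fa_off)

print(f"{ratio:.2f}x, {(ratio - 1) * 100:.0f}% faster")
```

That works out to roughly 1.5x, or about 51% more tokens per second with FP32 FA enabled.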
This and the PR graphs look very promising!
Using Nexesenex's KCPP since it already merged it, things look good, performance is good and it seems to work well.
I've only seen a slight increase in processing speed with FA, from about 1K T/s to 1.1K T/s when ingesting 8K context (Turing).
I imagine it'll be a big deal for Pascal users though.
I'm trying out how well it squishes Phi-3 context now.
Does phi-3 have GQA?
I guess this might be the cause?
Hardware
Note that by default, the Phi-3-mini model uses flash attention, which requires certain types of GPU hardware to run. We have tested on the following GPU types:
NVIDIA A100
NVIDIA A6000
NVIDIA H100
If you want to run the model on:
NVIDIA V100 or earlier generation GPUs: call AutoModelForCausalLM.from_pretrained() with attn_implementation="eager"
Optimized inference on GPU, CPU, and Mobile: use the ONNX models 128K
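A minimal sketch of what that note amounts to in code. The `attn_implementation` argument to `from_pretrained()` is the one named in the model card. The helper names and the rule of thumb used here (compute capability 8.x and newer for FlashAttention-2, since V100 is 7.0 and the tested cards are all Ampere or newer) are my own assumptions for illustration.

```python
def pick_attn_implementation(compute_capability_major):
    """Choose an attention backend from the GPU's major compute capability."""
    # ASSUMPTION: FlashAttention-2 kernels target Ampere (8.x) and newer.
    # V100 is 7.0, so it falls back to the eager implementation.
    return "flash_attention_2" if compute_capability_major >= 8 else "eager"


def load_model(model_id, compute_capability_major):
    """Load a causal LM with the backend chosen above (hypothetical helper)."""
    # Imported lazily so the picker works without transformers installed.
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_id,
        attn_implementation=pick_attn_implementation(compute_capability_major),
    )


print(pick_attn_implementation(7))  # V100-class card
print(pick_attn_implementation(8))  # A100-class card
```

With PyTorch installed, `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple that can feed the picker.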
P.S. I don't know how to check for GQA. There's nothing stated on the model card.
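One way to check, as a sketch: many Hugging Face models ship a `config.json` with `num_attention_heads` and `num_key_value_heads` fields, and GQA shows up as fewer KV heads than attention heads. This assumes the model's config uses those key names, which not every architecture does. The function name and the example dictionaries are illustrative, not read from any particular model.

```python
def attention_kind(config):
    """Classify attention as MHA, GQA, or MQA from a config.json-style dict."""
    n_heads = config["num_attention_heads"]
    # Configs that omit the KV-head field are treated as plain multi-head attention.
    n_kv_heads = config.get("num_key_value_heads", n_heads)
    if n_kv_heads == n_heads:
        return "MHA"  # every query head has its own K/V head
    if n_kv_heads == 1:
        return "MQA"  # all query heads share a single K/V head
    return "GQA"      # query heads share K/V heads in groups


# Illustrative values only; load the real config.json for the model you care about.
print(attention_kind({"num_attention_heads": 32, "num_key_value_heads": 8}))
print(attention_kind({"num_attention_heads": 32, "num_key_value_heads": 32}))
```

To try it on a real model, download its `config.json` and pass the result of `json.load()` to the function.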
I noticed some higher token numbers too but didn't compare directly to get accurate measurements. It's at least not worse, and a bigger context for the same amount of VRAM is a win-win if the quality remains the same.
I can at least continue to act smug over the EXL2 users and cope that LlamaCpp is the best thing to ever exist.
I haven't noticed any degradation of context quality, and there have been no issues opened on the official koboldcpp repo relating to context problems with FA.
I used to get about 35 T/s when Llama 3 first came out, and that was at 8K context. So there have been major improvements somewhere? I've made no hardware changes at all.
Is it possible that the gains are also from CUDA 12?
Or did you test against CUDA 12 koboldcpp?
Old testing was done with the CUDA 12 Nexesenex forks (their forks have been on Cublas 12+ since like V1.58?)
New testing uses Nexesenex forks too, Cublas 12.2
Looks like I gotta compile this to test and see how my speeds are. I'm on Pascal, so once I read about it I was super excited.
It gets even better for older gpus
https://github.com/Nexesenex/kobold.cpp/releases/tag/v1.66d_b2902%2B2
I was using my Pascal card w/ CUDA 12.2 before and it was good, but these additional PRs for speedups are great. Will try later, but if it is even faster, that's crazy.
We're eating good boys.
@saishf , I'll open a new discussion in LLM-Discussions for this topic, to keep things organized.
Since this seems quite relevant I'll move things to here so it's better documented:
https://huggingface.co/LWDCLS/LLM-Discussions/discussions/11