VRAM Estimates

#3
by ernestr - opened

Thanks so much for your reviews and merges!

Could you provide estimates of VRAM usage for the EXL2 quants at different context sizes, e.g. 16K and 32K? Or could you point me to the steps for calculating it myself, given the specific tokenizer and the size of the repo?

I put that information on the EXL2 versions' model cards:

Max Context w/ 48 GB VRAM: (24 GB VRAM is not enough, even for 2.4bpw, use GGUF instead!)

  • 2.4bpw: 32K (32768 tokens) w/ 8-bit cache, 21K (21504 tokens) w/o 8-bit cache
  • 2.65bpw: 30K (30720 tokens) w/ 8-bit cache, 15K (15360 tokens) w/o 8-bit cache
  • 3.0bpw: 12K (12288 tokens) w/ 8-bit cache, 6K (6144 tokens) w/o 8-bit cache
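For anyone who wants to calculate this themselves, here is a rough sketch of the usual back-of-the-envelope estimate: quantized weights plus KV cache. The architecture numbers below are placeholders, not this model's actual figures. For Llama-style models the real values are normally found in the repo's config.json (`num_hidden_layers`, `num_key_value_heads`, and `hidden_size / num_attention_heads` for the head dimension). The weight size can also be read directly from the total size of the repo's weight files instead of being derived from the parameter count. Activation buffers and framework overhead are not included, so real usage will be higher than what this prints.

```python
GIB = 2**30


def weights_gib(n_params: float, bpw: float) -> float:
    """Approximate size of the quantized weights in GiB (bpw = average bits per weight)."""
    return n_params * bpw / 8 / GIB


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 n_tokens: int, cache_bits: int = 16) -> float:
    """Size of the KV cache in GiB: one K and one V vector per layer per token."""
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * cache_bits / 8
    return bytes_per_token * n_tokens / GIB


# Placeholder architecture values for illustration only (assumptions,
# NOT taken from this model's config.json).
N_PARAMS = 120e9
N_LAYERS = 140
N_KV_HEADS = 8
HEAD_DIM = 128

for bpw in (2.4, 2.65, 3.0):
    w = weights_gib(N_PARAMS, bpw)
    for cache_bits in (16, 8):
        kv = kv_cache_gib(N_LAYERS, N_KV_HEADS, HEAD_DIM, 32768, cache_bits)
        print(f"{bpw}bpw, {cache_bits}-bit cache, 32K ctx: "
              f"weights {w:.1f} GiB + cache {kv:.1f} GiB = {w + kv:.1f} GiB")
```

The cache term scales linearly with both context length and cache precision, which is why switching to an 8-bit cache roughly halves the cache's share of VRAM.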

Thanks! In case folks are curious about 64 GB of VRAM, here is where I maxed out with 3.5bpw. I'm getting ~3-4 t/s with minimal context in cache.

3.5bpw: 20K (20,000 tokens) w/ 8-bit cache


At 5.0bpw with 4-bit cache and full context, I'm using 76.8 GB of VRAM and it's generating at 11-13 t/s. This is with an A100 80 GB. Also, Wolfram, I absolutely love this model. Thank you so much for making something this godly!

Thanks, guys, for all of this information. And now I want an A100, too! ;)

I'm happy with how it turned out, but I didn't do much besides merging, converting, and quantizing the already godly components others provided. But I'm glad you like it so much! :)

I recently bought a Mac with an M2 Ultra and 192 GB of RAM. Can the EXL2 versions of the model run on a Mac?

Just curious: I have a dual 3090 setup and cannot run 3.0bpw on it at all, even with 8K context and a 4-bit cache. Any tips on how I can get it to work?
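One way to sanity-check a setup like this is to work backwards from the VRAM budget: subtract the weights and some fixed overhead, then see how many tokens of cache fit in what is left. The sketch below does only that arithmetic. The function name, the overhead figure, and the example numbers are all assumptions for illustration, not measurements. On a multi-GPU setup the budget is also split per card, and anything else using the GPUs (for example a desktop session) reduces it further, so the real limit can be noticeably lower than a single pooled estimate suggests.

```python
GIB = 2**30


def max_context_tokens(vram_gib: float, weights_gib: float,
                       n_layers: int, n_kv_heads: int, head_dim: int,
                       cache_bits: int = 16, overhead_gib: float = 2.0) -> int:
    """Tokens of KV cache that fit after weights and a fixed overhead are subtracted.

    overhead_gib is a guessed allowance for activations and framework buffers.
    """
    free_bytes = (vram_gib - weights_gib - overhead_gib) * GIB
    if free_bytes <= 0:
        return 0  # the weights alone do not fit in the budget
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * cache_bits / 8
    return int(free_bytes // bytes_per_token)


# Hypothetical example: 48 GiB total, 42 GiB of weights, placeholder architecture.
for cache_bits in (16, 8, 4):
    tokens = max_context_tokens(48, 42, n_layers=140, n_kv_heads=8, head_dim=128,
                                cache_bits=cache_bits)
    print(f"{cache_bits}-bit cache: about {tokens} tokens")
```

If the result is zero or very small, the quant is too large for the budget regardless of cache precision, and a lower bpw is the only option within EXL2.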
