Minimum VRAM?
I'm not very familiar with MoE models. Does it require 685GB or 37GB of VRAM?
need a100 x 10
@CHNtentes it needs about 1tb vram
What if you have a single GPU with 48GB VRAM and 1tb ordinary system RAM? Someone told me that it's possible to separate the layers so that only the active expert (37GB if using a Q8) is in VRAM at any given time, and the rest is in system RAM...
I have no doubt this is possible to do - but would the performance be even close to usable??
You could try vLLM, since it supports CPU offloading via --cpu-offload-gb 900.
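A minimal sketch of what that might look like with vLLM's offline API, assuming the `cpu_offload_gb` engine argument (the Python counterpart of `--cpu-offload-gb`) and that the model fits this way on a single 48GB GPU plus ~1TB of system RAM; none of this is verified for DeepSeek-V3 specifically:

```python
# Untested sketch: vLLM offline inference with most weights spilled to system RAM.
# The 900GB offload figure and max_model_len are assumptions, not benchmarks.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",
    trust_remote_code=True,
    cpu_offload_gb=900,   # keep ~900GB of weights in ordinary RAM
    max_model_len=4096,   # small context to limit KV-cache VRAM usage
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```

Expect CPU offloading at this scale to be slow; whether it's "usable" depends on how much of the active path ends up on the GPU per token.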
Is it feasible this will run on only 160gb VRAM with the right quantization?
I mean, anything can theoretically run anywhere if you quantize it enough. The usual rule of thumb is that 4bpw/Q4 is the minimum to retain good quality, which for DeepSeek V3 works out to around 380GB of VRAM (with a small context size). Once/if llama.cpp/GGUF support lands, we can offload some layers to CPU RAM, and being a MoE it has the benefit of still maintaining decent speed even while running from RAM.
So I'd say a total of around 400GB of VRAM+RAM is necessary, and the higher the proportion of VRAM, the better. A rough estimate is sketched below.
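For reference, a back-of-the-envelope calculation behind those numbers, assuming ~685B total parameters and ignoring KV cache and runtime overhead (so real usage will be somewhat higher):

```python
# Rough weight-memory estimate for a ~685B-parameter model at different
# quantization levels; overhead and context are not included.
PARAMS_B = 685  # total parameters, in billions

for name, bits in [("Q8", 8), ("Q6", 6), ("Q4", 4)]:
    gb = PARAMS_B * bits / 8  # 1B params at 1 byte/param ~= 1GB
    print(f"{name}: ~{gb:.0f} GB for weights alone")
```

Q4 comes out to roughly 343GB of weights alone, which is where the ~380-400GB working figure (with context and overhead) comes from.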