When GGUF?
for what? this is too big to run as GGUF on any reasonable AI rig
This is up to the llama.cpp team, but after digging into the instruct model and the technical report, it "shouldn't" be too long now: it shares a lot with the DeepSeek-V2.5 architecture. I should admit I am not a llama.cpp contributor, so calling it "easy" would be a massive blunder on my part. All respect to the llama.cpp contributors, and godspeed!
After reading through the technical report, I think this is actually a very reasonable model to run on a standard enthusiast setup like mine (those who bought into the tinybox hype aside). I for one am VERY excited to run this model locally.
Is that this computer?
https://tinygrad.org/#tinybox
yeah but $15k - $25k is way out of scope for most people
Yes, this is my custom build, and no, it didn't cost me anywhere close to $15K. I think the tinybox is a borderline grift, as it wouldn't be able to run this seemingly SOTA model locally...
4x 3090s IS a reasonable AI rig though, and it didn't cost me the $15k-$25k you estimated.
I have 5 3090s, and my current computer is not working well… Do you have any low-cost computer suggestions that work with 4 or 5 3090s?
I do wonder how fast DeepSeek-V3 would be on 4 3090s, any thoughts?
I went off the link for these tinyboxes, which are $15k to $25k, with a $40k option shipping next year.
If you have 5 3090s, what motherboard do you have? Are your 3090s NVLinked together? Managing a server-grade system is HARD, and I would consider anything above 4x 3090s server grade.
That's fair, but I consider the tinyboxes to be a borderline grift, so I wouldn't buy one come hell or high water. Basically, don't buy a tinybox, just for the sake of your wallet. My DMs are open on X @Nottlespike if you want some help putting together an actually "budget" friendly AI rig that will work.
@lunahr some of us rent cloud GPUs with GGUF-based solutions to experiment with models like this. GGUF support is welcome, but of course it depends on the llama.cpp project getting updated.
GGUF in general scales well across hardware. Sure, it's not the absolute fastest solution for multi-GPU (currently), but it's still faster than Hugging Face transformers and comes close to the fastest solutions.
So while buying an AI rig that can run this is out of reach for most hobbyists, renting a machine that can run a quantized version of this should be doable, and you still get that DIY feel that I and many others appreciate from using the tools you like rather than an external site or API, even if it's more expensive to do so.
Renting for that means, at bare minimum and even at 4-bit, 4 H200s' worth of compute, i.e. we're talking in the area of 512 GB of VRAM/RAM.
Oh, my mistake: a TB of memory. Still, that will be slow as heck.
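For anyone who wants to sanity-check the memory figures being thrown around, here is a rough back-of-the-envelope sketch. It counts weights only (no KV cache, activations, or runtime overhead), uses the 671B parameter count quoted in this thread, and assumes 141 GB per H200; the function name is purely illustrative.

```python
def weight_footprint_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight-only memory in GB (1 GB = 1e9 bytes).

    Ignores KV cache, activations, and any per-tensor quantization overhead,
    so real requirements will be higher.
    """
    return total_params_billion * bits_per_weight / 8


TOTAL_PARAMS_B = 671   # parameter count as quoted in the thread
H200_MEMORY_GB = 141   # assumption: per-GPU memory of one H200

for bits in (4, 8, 16):
    need = weight_footprint_gb(TOTAL_PARAMS_B, bits)
    print(f"{bits}-bit weights: ~{need:.1f} GB, "
          f"or ~{need / H200_MEMORY_GB:.1f} H200s for weights alone")
```

Under these assumptions, 4-bit weights alone come out well under 512 GB, while 8-bit and 16-bit land in the 0.7 TB to 1.3 TB range, so which estimate is "right" depends mostly on the quantization you have in mind.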
Why? If we implement it correctly in transformers, I'm calculating ~10 tps, and maybe more if I forgo the MTP as such and use it for speculative decoding instead. I have 8 memory channels and 3000 MHz OC'ed DIMMs, so a total bandwidth of 384 GB/s, where a 3070 has 448 GB/s... so with the sparsity, nah, not slow.
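The arithmetic behind that estimate can be sketched as follows. This is a minimal sketch, not the poster's actual calculation: it assumes "3000 MHz" means DDR at 6000 MT/s, a 64-bit (8-byte) bus per channel, and that decoding is purely memory-bandwidth bound, with every active weight read once per token. The 37B active-parameter figure is the one quoted in this thread; the function names are illustrative.

```python
def dram_bandwidth_gbs(channels: int, mega_transfers_per_s: float,
                       bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s.

    Assumes a 64-bit (8-byte) data bus per channel.
    """
    return channels * mega_transfers_per_s * bus_bytes / 1000


def decode_tps_upper_bound(bandwidth_gbs: float, active_params_billion: float,
                           bits_per_weight: float) -> float:
    """Upper bound on tokens/s if each token reads all active weights once.

    Ignores KV-cache reads, compute limits, and expert-routing overhead,
    so real throughput will be lower.
    """
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gbs / gb_read_per_token


# Assumption: 3000 MHz DDR clock -> 6000 MT/s, across 8 channels.
bw = dram_bandwidth_gbs(channels=8, mega_transfers_per_s=6000)
tps_8bit = decode_tps_upper_bound(bw, active_params_billion=37, bits_per_weight=8)
tps_4bit = decode_tps_upper_bound(bw, active_params_billion=37, bits_per_weight=4)

print(f"peak bandwidth: {bw:.0f} GB/s")
print(f"8-bit upper bound: {tps_8bit:.1f} tok/s")
print(f"4-bit upper bound: {tps_4bit:.1f} tok/s")
```

Under these assumptions the bound is about 10 tok/s at 8-bit and about 21 tok/s at 4-bit, which is at least in the same ballpark as the numbers quoted here. Treat these as ceilings rather than predictions.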
Slow as heck, as I said. Mind you, it's 37B active params at ~10 tps for single-query inference, and batching on AVX, well... For that kind of hassle, let alone KV lookup speeds over the context, the API is the way, way cheaper and saner option. Given their pricing, it's essentially free. I mean, YES, YOU CAN RUN IT LOCALLY with that setup; however, it will probably end up costing more in power than the API costs you in production ^^
Lmao. Getting 10/18 tps on a 685/671B model is SLOW? I do enjoy companies logging my API requests, of course! No way that could possibly backfire! Why did I run Llama 3.1 405B locally? For prod? No, for fun....
18 TPS is fast enough regardless of the params so I don't know what the hell you are talking about
Waiting for a Q4 GGUF to test on a 2x Xeon server with 512 GB RAM and 12 DDR4 channels. Hoping to get 2 tok/sec using CPU only. Obviously just for kicks; the DeepSeek-V3 API is almost free at the moment.