abubakar

stormchaser

AI & ML interests

None yet

Recent Activity

updated a collection about 1 month ago
papers

Organizations

None yet

stormchaser's activity

reacted to bartowski's post with 👀👍 22 days ago
Looks like Q4_0_N_M file types are going away

Before you panic: there's a new "preferred" method, online repacking (I prefer the term on-the-fly). If you download Q4_0 and your setup can benefit from repacking the weights into interleaved rows (what Q4_0_4_4 was doing), it will do that automatically and give you similar performance (minor losses, I think, due to using intrinsics instead of assembly, but intrinsics are more maintainable)

You can see the reference PR here:

https://github.com/ggerganov/llama.cpp/pull/10446

So if you update your llama.cpp past that point, you won't be able to run Q4_0_4_4 (unless backwards compatibility is added later), but Q4_0 should run at the same speed (though it may currently be bugged on some platforms)

As such, I'll stop making those newer model formats soon, probably by the end of this week unless something changes, but you should be safe to download Q4_0 quants and use those!

IQ4_NL also supports repacking, though not in as many shapes yet, but it should get a respectable speedup on ARM chips. The PR for that can be found here: https://github.com/ggerganov/llama.cpp/pull/10541

Remember, these are not meant for Apple silicon, since those use the GPU and don't benefit from the repacking of weights.
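To make the takeaway concrete, here is a minimal sketch of the "just grab Q4_0" workflow through llama-cpp-python (which wraps llama.cpp). The repo id and filename are placeholders, and it assumes a build recent enough to include the repacking change; the repacking itself happens inside llama.cpp at load time, so nothing extra needs to be configured.

```python
# Minimal sketch (not from the post): download a plain Q4_0 quant and run it.
# Assumes huggingface_hub and a llama-cpp-python build that includes the
# repacking change; repo_id/filename below are illustrative placeholders.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="bartowski/SomeModel-GGUF",   # placeholder repo id
    filename="SomeModel-Q4_0.gguf",       # plain Q4_0, not Q4_0_4_4
)

# llama.cpp decides at load time whether to repack Q4_0 weights into
# interleaved rows for the current CPU; no special flag is needed here.
llm = Llama(model_path=model_path, n_ctx=2048)
print(llm("Hello, world", max_tokens=32)["choices"][0]["text"])
```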
updated a collection about 1 month ago
reacted to akhaliq's post with 👀 9 months ago
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models (2404.02258)

Transformer-based language models spread FLOPs uniformly across input sequences. In this work we demonstrate that transformers can instead learn to dynamically allocate FLOPs (or compute) to specific positions in a sequence, optimising the allocation along the sequence for different layers across the model depth. Our method enforces a total compute budget by capping the number of tokens (k) that can participate in the self-attention and MLP computations at a given layer. The tokens to be processed are determined by the network using a top-k routing mechanism. Since k is defined a priori, this simple procedure uses a static computation graph with known tensor sizes, unlike other conditional computation techniques. Nevertheless, since the identities of the k tokens are fluid, this method can expend FLOPs non-uniformly across the time and model depth dimensions. Thus, compute expenditure is entirely predictable in sum total, but dynamic and context-sensitive at the token-level. Not only do models trained in this way learn to dynamically allocate compute, they do so efficiently. These models match baseline performance for equivalent FLOPS and wall-clock times to train, but require a fraction of the FLOPs per forward pass, and can be upwards of 50% faster to step during post-training sampling.
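For intuition, here is a rough PyTorch sketch of the top-k routing idea described in the abstract (my own illustration, not the authors' code): a scalar router scores every token, only the k highest-scoring tokens pass through the block, and the remaining tokens ride the residual stream unchanged.

```python
# Illustrative Mixture-of-Depths-style routing (assumed structure, not the
# paper's implementation): per-layer capacity k, top-k token selection.
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    def __init__(self, dim, block, capacity):
        super().__init__()
        self.block = block            # any (B, k, dim) -> (B, k, dim) transformer block
        self.router = nn.Linear(dim, 1)
        self.capacity = capacity      # k: tokens processed per sequence at this layer

    def forward(self, x):                                 # x: (B, T, dim)
        scores = self.router(x).squeeze(-1)               # (B, T) routing scores
        k = min(self.capacity, x.shape[1])
        topk = scores.topk(k, dim=-1).indices             # (B, k) selected positions
        idx = topk.unsqueeze(-1).expand(-1, -1, x.shape[-1])
        selected = torch.gather(x, 1, idx)                # (B, k, dim)
        # Scale the block output by the router weight so routing gets gradients.
        w = torch.sigmoid(torch.gather(scores, 1, topk)).unsqueeze(-1)
        out = self.block(selected) * w
        # Unselected tokens skip the block entirely via the residual path.
        return x.scatter_add(1, idx, out)

# Example: wrap a standard encoder layer, processing at most 64 tokens per layer.
layer = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
mod = MoDBlock(dim=256, block=layer, capacity=64)
y = mod(torch.randn(2, 128, 256))                         # y: (2, 128, 256)
```

Because k is fixed ahead of time, the tensor shapes (and hence the computation graph) stay static even though which tokens get processed changes from sequence to sequence.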
New activity in YanweiLi/MGM-2B 9 months ago

Can I get GGUF files?

#1 opened 9 months ago by stormchaser
reacted to natolambert's post with 👍 11 months ago
Today, we’re releasing our first pretrained Open Language Models (OLMo) at the Allen Institute for AI (AI2), a set of 7 billion parameter models and one 1 billion parameter variant. This line of work was probably the main reason I joined AI2 and is the biggest lever I see possible to enact meaningful change in how AI is used, studied, and discussed in the short term.

Links at the top because that's what you want:
* Core 7B model: allenai/OLMo-7B
* 7B model twin (different GPU hardware): allenai/OLMo-7B-Twin-2T
* 1B model: allenai/OLMo-1B
* Dataset: allenai/dolma
* Paper (arxiv soon): https://allenai.org/olmo/olmo-paper.pdf
* My personal blog post: https://www.interconnects.ai/p/olmo


OLMo represents a new type of LLM, enabling new approaches to ML research and deployment, because on a key axis of openness it is something entirely different. OLMo is built so scientists can develop research directions at every point in the development process and execute on them, which was previously not possible due to incomplete information and tools.

Depending on the evaluation methods, OLMo 1 is either the best 7 billion parameter base model available for download or one of the best. This relies on a new way of thinking where models are judged on parameter plus token budget, similar to how scaling laws are measured for LLMs.
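For readers who just want to poke at the weights, a minimal loading sketch using the model ids listed above, assuming they load through the standard transformers interface (the original allenai/OLMo-7B repo may require trust_remote_code=True and possibly the ai2-olmo / hf_olmo package, depending on your transformers version):

```python
# Minimal sketch (not from the post): load the released checkpoint with transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("allenai/OLMo-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("allenai/OLMo-7B", trust_remote_code=True)

inputs = tok("Language modeling is ", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```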

We're just getting started, so please help us learn how to be more scientific with LLMs!
New activity in stabilityai/stable-code-3b 11 months ago