Nicolas Patry

Narsil

Performance leap: TGI v3 is out. It processes 3x more tokens and is 13x faster than vLLM on long prompts. Zero config!



3x more tokens.

By reducing our memory footprint, we’re able to ingest many more tokens, and more dynamically than before. A single L4 (24GB) can handle 30k tokens on Llama 3.1-8B, while vLLM barely gets 10k. A lot of work went into reducing the footprint of the runtime, and its effects are best seen in smaller, constrained environments.
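
To give a feel for why memory footprint bounds the number of tokens a GPU can hold, here is a rough back-of-the-envelope sketch of the KV cache cost per token. It is not taken from the TGI code base: the layer count, KV head count, and head dimension are the publicly listed Llama 3.1-8B configuration values, and the 2-byte (fp16/bf16) cache precision is an assumption. It ignores model weights, activations, and runtime overhead, which is where the post says the savings came from.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int) -> int:
    """Bytes of KV cache needed for one token.

    The factor 2 accounts for storing both a key and a value vector
    per layer and per KV head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_gib(num_tokens: int, per_token_bytes: int) -> float:
    """Total KV cache size in GiB for a given number of tokens."""
    return num_tokens * per_token_bytes / 2**30


# Assumed configuration for Llama 3.1-8B (grouped-query attention).
LLAMA_3_1_8B = dict(num_layers=32, num_kv_heads=8, head_dim=128)

# Assumption: the cache is stored in 16-bit precision (2 bytes per value).
per_token = kv_cache_bytes_per_token(**LLAMA_3_1_8B, bytes_per_value=2)

for tokens in (10_000, 30_000):
    print(f"{tokens} tokens -> {kv_cache_gib(tokens, per_token):.2f} GiB of KV cache")
```

Under these assumptions each token costs 128 KiB of cache, so every gigabyte the runtime frees up translates into roughly eight thousand additional tokens.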

13x faster

On long prompts (200k+ tokens), a conversation reply takes 27.5s in vLLM but only 2s in TGI. How so? We keep the initial conversation around, so when a new reply comes in, we can answer almost instantly. The overhead of the lookup is ~5us. Thanks @Daniël de Kok for the beast data structure.
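
The idea of keeping the earlier conversation around can be illustrated with a toy prefix cache. This is a minimal sketch, not TGI's actual data structure (the post does not describe it): `PrefixCache`, `longest_cached_prefix`, and `tokens_to_compute` are hypothetical names, and a real server would store KV cache blocks alongside each prefix rather than just the token IDs.

```python
from typing import Dict, List


class PrefixNode:
    """One node per token in a trie of previously processed prompts."""

    def __init__(self) -> None:
        self.children: Dict[int, "PrefixNode"] = {}


class PrefixCache:
    """Toy prefix cache: remembers which token prefixes were already processed."""

    def __init__(self) -> None:
        self.root = PrefixNode()

    def insert(self, tokens: List[int]) -> None:
        """Record a fully processed token sequence."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixNode())

    def longest_cached_prefix(self, tokens: List[int]) -> int:
        """Return how many leading tokens of `tokens` are already cached."""
        node = self.root
        matched = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node = child
            matched += 1
        return matched


def tokens_to_compute(cache: PrefixCache, tokens: List[int]) -> int:
    """Number of tokens that still need processing, then cache the prompt."""
    reused = cache.longest_cached_prefix(tokens)
    cache.insert(tokens)
    return len(tokens) - reused


cache = PrefixCache()
first_turn = list(range(1000))           # stand-in for a long initial prompt
follow_up = first_turn + [5000, 5001]    # same conversation plus a short reply

print(tokens_to_compute(cache, first_turn))  # whole prompt must be processed
print(tokens_to_compute(cache, follow_up))   # only the new tokens remain
```

In this sketch the follow-up turn only pays for its two new tokens, which is the effect the post describes: the long shared prefix is found by a lookup instead of being recomputed.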

Zero config

That’s it. Remove all the flags you are using and you’re likely to get the best performance. By evaluating the hardware and model, TGI automatically selects values that give the best performance. In production, we no longer use any flags in our deployments. We kept all existing flags around, as they may come in handy in niche scenarios.

Read more: https://huggingface.co/docs/text-generation-inference/conceptual/chunking