Hi,
Why is llama-server so limited in the number of concurrent requests it can handle, even when using -cb and -np? SGLang and vLLM servers are not nearly as limited as llama.cpp, even with hundreds of requests.

o3 response:

Replies: 2 comments
- Hi, this is mainly due to memory management (the KV cache). vLLM uses paged attention, which reduces the wasted portion of the KV cache to nearly zero. At the moment llama.cpp does not use this strategy, and I cannot tell you whether it is planned for the future. (See: #1955)
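  As a rough illustration of the point above, here is a toy accounting sketch of why per-slot preallocation wastes far more KV cache than paged, block-granular allocation when request lengths vary. This is not llama.cpp's or vLLM's actual code; the block size, context size, slot count, and function names are made up for the example.

  ```python
  # Toy accounting model contrasting two KV-cache allocation strategies.
  # Illustrative only -- not llama.cpp's or vLLM's real implementation.

  BLOCK_TOKENS = 16      # paged allocator hands out KV memory in 16-token blocks
  CTX_TOKENS = 32768     # total KV-cache capacity of the server, in tokens
  N_SLOTS = 8            # e.g. roughly what -np 8 gives you: 8 fixed slots

  def slot_waste(request_lengths):
      """Fixed slots: each active request reserves CTX_TOKENS / N_SLOTS tokens
      of KV cache up front, whether it uses them or not; extra requests queue."""
      per_slot = CTX_TOKENS // N_SLOTS
      active = request_lengths[:N_SLOTS]
      used = sum(min(n, per_slot) for n in active)
      reserved = per_slot * len(active)
      return reserved - used  # KV-cache tokens sitting idle

  def paged_waste(request_lengths):
      """Paged allocation: each request only holds ceil(len / BLOCK_TOKENS) blocks,
      so the only waste is the unused tail of the last block per request."""
      waste = 0
      for n in request_lengths:
          blocks = -(-n // BLOCK_TOKENS)        # ceiling division
          waste += blocks * BLOCK_TOKENS - n    # at most BLOCK_TOKENS - 1 per request
      return waste

  if __name__ == "__main__":
      # Eight concurrent requests of very different lengths (in tokens).
      reqs = [120, 300, 45, 2048, 90, 512, 64, 30]
      print("idle KV tokens with fixed slots :", slot_waste(reqs))   # ~29.5k idle
      print("idle KV tokens with paged blocks:", paged_waste(reqs))  # a few dozen idle
  ```

  With mixed request lengths, the fixed-slot model leaves most of its reserved KV cache idle, and that idle memory cannot be used to admit more concurrent requests; the paged model's waste is bounded by one partially filled block per request, which is why servers built around it handle large numbers of concurrent requests more gracefully.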
- Also see #10860