Hi,
Why is llama-server so limited in the number of concurrent requests it can handle, even when using -cb and -np? SGLang and vLLM servers are not nearly as limited as llama.cpp, even with hundreds of requests.

o3 response:

Replies: 2 comments
- Hi, this is mainly due to memory management (the KV cache). vLLM uses paged attention, which reduces the wasted portion of the KV cache to nearly zero. At the moment llama.cpp does not use this strategy, and I cannot tell you whether it is planned for the future. (See: #1955)
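  As a rough illustration of the point above, here is a toy accounting sketch of why per-slot preallocation wastes far more KV cache than paged, block-granular allocation when request lengths vary. This is not llama.cpp's or vLLM's actual code; the block size, context size, slot count, and function names are made up for the example.

  ```python
  # Toy accounting model contrasting two KV-cache allocation strategies.
  # Illustrative only -- not llama.cpp's or vLLM's real implementation.

  BLOCK_TOKENS = 16      # paged allocator hands out KV memory in 16-token blocks
  CTX_TOKENS = 32768     # total KV-cache capacity of the server, in tokens
  N_SLOTS = 8            # e.g. roughly what -np 8 gives you: 8 fixed slots

  def slot_waste(request_lengths):
      """Fixed slots: each active request reserves CTX_TOKENS / N_SLOTS tokens
      of KV cache up front, whether it uses them or not; extra requests queue."""
      per_slot = CTX_TOKENS // N_SLOTS
      active = request_lengths[:N_SLOTS]
      used = sum(min(n, per_slot) for n in active)
      reserved = per_slot * len(active)
      return reserved - used  # KV-cache tokens sitting idle

  def paged_waste(request_lengths):
      """Paged allocation: each request only holds ceil(len / BLOCK_TOKENS) blocks,
      so the only waste is the unused tail of the last block per request."""
      waste = 0
      for n in request_lengths:
          blocks = -(-n // BLOCK_TOKENS)        # ceiling division
          waste += blocks * BLOCK_TOKENS - n    # at most BLOCK_TOKENS - 1 per request
      return waste

  if __name__ == "__main__":
      # Eight concurrent requests of very different lengths (in tokens).
      reqs = [120, 300, 45, 2048, 90, 512, 64, 30]
      print("idle KV tokens with fixed slots :", slot_waste(reqs))   # ~29.5k idle
      print("idle KV tokens with paged blocks:", paged_waste(reqs))  # a few dozen idle
  ```

  With mixed request lengths, the fixed-slot model leaves most of its reserved KV cache idle, and that idle memory cannot be used to admit more concurrent requests; the paged model's waste is bounded by one partially filled block per request, which is why servers built around it handle large numbers of concurrent requests more gracefully.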
- Also see #10860