llama : add high-throughput mode #14363
Right now I am comparatively less busy with my PhD, so it would be a good time for me to write CUDA code that is still missing, if there is any.
For now, these are the necessary CUDA changes:
// old
// q: [n_embd_k, n_batch, n_head, 1]
// k: [n_embd_k, n_kv, n_head_kv, 1]
// v: [n_embd_v, n_kv, n_head_kv, 1] !! not transposed !!
// mask: [n_kv, n_batch_pad, 1, 1] !! n_batch_pad = GGML_PAD(n_batch, GGML_KQ_MASK_PAD) !!
// res: [n_embd_v, n_head, n_batch, 1] !! permuted !!
GGML_API struct ggml_tensor * ggml_flash_attn_ext(
...);
// new - supports `n_seq` dimension:
// q: [n_embd_k, n_batch, n_head, n_seq]
// k: [n_embd_k, n_kv, n_head_kv, n_seq]
// v: [n_embd_v, n_kv, n_head_kv, n_seq] !! not transposed !!
// mask: [n_kv, n_batch_pad, n_seq, 1] !! n_batch_pad = GGML_PAD(n_batch, GGML_KQ_MASK_PAD) !!
// res: [n_embd_v, n_head, n_batch, n_seq] !! permuted !!
GGML_API struct ggml_tensor * ggml_flash_attn_ext(
    ...);
CPU might also need to be extended (not sure yet).
Edit: the CPU versions of …
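For reference, a minimal usage sketch of the extended interface. This is a hedged illustration: the "..." above elides the tail parameters, so the scale/max_bias/logit_softcap arguments and the dimension variables below are assumed for the example.

// hedged sketch: building the inputs with the new n_seq dimension
struct ggml_tensor * q    = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, n_embd_k, n_batch, n_head,    n_seq);
struct ggml_tensor * k    = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, n_embd_k, n_kv,    n_head_kv, n_seq);
struct ggml_tensor * v    = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, n_embd_v, n_kv,    n_head_kv, n_seq); // not transposed
struct ggml_tensor * mask = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, n_kv, GGML_PAD(n_batch, GGML_KQ_MASK_PAD), n_seq, 1);

// res: [n_embd_v, n_head, n_batch, n_seq] (permuted), per the comment above
struct ggml_tensor * res = ggml_flash_attn_ext(ctx, q, k, v, mask, scale, max_bias, logit_softcap);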
src/llama-kv-cache-unified.cpp
    v_cells[s].resize(kv_size);
}

// by default, all sequence ids are mapped to the 0th virtual sequence
I'd like to understand the purpose of virtual sequences.
- Is it to make the unified cache not unified?
- Should it be a separate cache type instead?
- Why is n_seq_virt a number and not a bool of whether or not the cache is unified? Is it to eventually allow n_seq_max % n_seq_virt == 0 for a partially-unified cache?
- Are virtual sequences intended to be used with other types of caches eventually (e.g. recurrent)? The concept here seems specific to the self-attention KV cache (unless I'm misunderstanding).
Today I found a better term instead of "virtual sequences": "streams". So I'll use "streams" here and will update the code later today or tomorrow.
Is it to make the unified cache not unified?
Roughly yes. The user will be able to select between unified (i.e. single stream) or non-unified (multiple streams). Each mode has advantages in different scenarios. Single stream is good when the sequences share large common prefixes. Multiple streams are good when the sequences are mostly or completely independent from each other.
The first iteration will support 1 stream (i.e. same as master, vanilla unified KV cache) and n_seq_max streams. The latter means that each sequence id is assigned to a separate stream.
In theory, we could assign multiple sequence ids to the same stream to get a partially-unified KV cache, but this would need extra work and it might not have any useful applications. So out of scope for now.
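In other words, the seq_id → stream mapping is trivial in both supported configurations. Conceptually it looks like this (illustrative sketch only, not the actual code):

// unified: all sequence ids share stream 0
// split:   sequence id s gets its own stream s (requires s < n_seq_max)
static uint32_t seq_to_stream(bool unified, llama_seq_id seq_id) {
    return unified ? 0u : (uint32_t) seq_id;
}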
Should it be a separate cache type instead?
There is too much similar logic. Still thinking about it, but most likely it will end up in the same cache type.
The concept here seems specific to the self-attention KV cache (unless I'm misunderstanding)
Yes.
src/llama-batch.h
// if sequential == true, the tokens in the ubatch will have increasing sequential sequence ids
llama_ubatch split_equal(uint32_t n_ubatch, bool sequential);
Why are sequential seq_ids required when virtual sequences are used? Is it because a contiguous (along the virtual sequence dimension) slice of the KV cache is used?
I wonder if there could be a way to avoid this requirement with ggml_get_rows and/or ggml_mul_mat_id. Might not be worth the extra indirection, though.
Why are sequential seq_ids required when virtual sequences are used?
Is it because a contiguous (along the virtual sequence dimension) slice of the KV cache is used?
Yes, we make a view of the KV cache across the streams here:
llama.cpp/src/llama-kv-cache-unified.cpp
Lines 976 to 992 in dbcfcaa
ggml_tensor * llama_kv_cache_unified::get_k(ggml_context * ctx, int32_t il, uint32_t n_kv, const slot_info & sinfo) const {
    const int32_t ikv = map_layer_ids.at(il);

    auto * k = layers[ikv].k;

    const uint32_t ns = sinfo.s1 - sinfo.s0 + 1;

    const uint64_t kv_size = get_size();

    return ggml_view_4d(ctx, k,
            hparams.n_embd_head_k, hparams.n_head_kv(il), n_kv, ns,
            ggml_row_size(k->type, hparams.n_embd_head_k),
            ggml_row_size(k->type, hparams.n_embd_k_gqa(il)),
            ggml_row_size(k->type, hparams.n_embd_k_gqa(il)*kv_size),
            ggml_row_size(k->type, hparams.n_embd_k_gqa(il)*kv_size)*sinfo.s0);
}
The ns var is the number of streams that participate in the current ubatch. Their stream indices range from [s0, s1].

I wonder if there could be a way to avoid this requirement with ggml_get_rows and/or ggml_mul_mat_id. Might not be worth the extra indirection, though.

It should be possible. But I'm not sure if it would be worth it - both in performance and in complexity. We can explore, though.
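To illustrate with made-up numbers why the participating streams must form a dense range (a sketch reusing the names from the quoted function, not actual code):

// suppose streams 2..4 take part in the current ubatch:
const uint32_t s0 = 2;
const uint32_t s1 = 4;
const uint32_t ns = s1 - s0 + 1; // 3 streams viewed at once

// streams are laid out back-to-back along the last dimension of k, so the whole
// range is addressable with a single byte offset + extent:
//   offset = ggml_row_size(k->type, n_embd_k_gqa*kv_size) * s0
// a non-contiguous set such as {0, 2} has no such single view, which is why
// split_equal() requires sequential (contiguous) sequence ids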
src/llama-kv-cache-unified.cpp
@@ -45,7 +46,7 @@ llama_kv_cache_unified::llama_kv_cache_unified(
     auto it = ctx_map.find(buft);
     if (it == ctx_map.end()) {
         ggml_init_params params = {
-            /*.mem_size =*/ size_t(2u*n_layer_cache*ggml_tensor_overhead()),
+            /*.mem_size =*/ size_t(2u*(1 + n_seq_virt)*n_layer_cache*ggml_tensor_overhead()),
Is the 1 + intended? Why was it added?
For the per-stream views of the KV cache:
llama.cpp/src/llama-kv-cache-unified.cpp
Lines 125 to 133 in dbcfcaa
std::vector<ggml_tensor *> k_seq;
std::vector<ggml_tensor *> v_seq;

for (uint32_t s = 0; s < n_seq_virt; ++s) {
    k_seq.push_back(ggml_view_2d(ctx, k, n_embd_k_gqa, kv_size, k->nb[1], s*k->nb[2]));
    v_seq.push_back(ggml_view_2d(ctx, v, n_embd_v_gqa, kv_size, v->nb[1], s*v->nb[2]));
}
These are used to implement llama_memory_seq_cp(). This operation is no longer just assigning ids - it performs an actual copy of the buffers in memory when we use multiple streams. Using these helper views, the operation is quite simple to implement:
llama.cpp/src/llama-kv-cache-unified.cpp
Lines 289 to 329 in dbcfcaa
bool is_full = true;

if (p0 > 0 && p0 + 1 < (int) get_size()) {
    is_full = false;
}

if (p1 > 0 && p1 + 1 < (int) get_size()) {
    is_full = false;
}

GGML_ASSERT(is_full && "seq_cp() is only supported for full KV buffers");

//LLAMA_LOG_WARN("%s: copying KV buffer from %d (virt = %d) to %d (virt = %d)\n", __func__, seq_id_src, s0, seq_id_dst, s1);

for (uint32_t il = 0; il < layers.size(); ++il) {
    const auto & layer = layers[il];

    ggml_backend_tensor_copy(layer.k_seq[s0], layer.k_seq[s1]);
    ggml_backend_tensor_copy(layer.v_seq[s0], layer.v_seq[s1]);

    // TODO: do we need synchronization here?
}

// TODO: support this:
GGML_ASSERT(v_cells[s0].get_has_shift() == false && "cannot copy a KV buffer that has a pending shift");

v_cells[s1].reset();

for (uint32_t i = 0; i < v_cells[s0].size(); ++i) {
    if (v_cells[s0].seq_has(i, seq_id_src)) {
        v_cells[s1].pos_set(i, v_cells[s0].pos_get(i));
        v_cells[s1].seq_add(i, seq_id_dst);
    }
}

v_heads[s1] = v_heads[s0];

//for (uint32_t s = 0; s < n_seq_virt; ++s) {
//    LLAMA_LOG_WARN("%s: seq %d: min = %d, max = %d\n", __func__, s, v_cells[s].seq_pos_min(s), v_cells[s].seq_pos_max(s));
//}
}
Though we cannot copy partial sequences when using multiple streams.
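From the public API side, a hedged usage sketch (assuming the llama_memory_* functions in the current llama.h): with multiple streams you would copy the full range, e.g.:

// copy everything from sequence 0 into sequence 1;
// p0 = -1, p1 = -1 means "the whole sequence", which is the only case
// supported when each sequence has its own KV stream (see the assert above)
llama_memory_t mem = llama_get_memory(ctx);
llama_memory_seq_cp(mem, /*seq_id_src=*/ 0, /*seq_id_dst=*/ 1, /*p0=*/ -1, /*p1=*/ -1);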
src/llama-batch.cpp
// accept only increasing sequence ids
if (sequential) {
    add = add && (cur_seq_set.empty() || batch.seq_id[i][0] == last_seq_id + 1);
}
What about decreasing sequence ids? Is the requirement that they are increasing, or that the included seq_ids should be in a contiguous range?
(decreasing sequence ids might not really happen often in practice though)
Decreasing would also work - we just need a contiguous range. We could either add this, if there is an elegant way to search for it, or add a batch pre-processing step to move the complexity to a higher level, or just delegate it to the user by warning when the batch is not arranged optimally.
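One possible way to express the relaxed check (illustrative C++ sketch, not the actual implementation): a set of unique sequence ids forms a contiguous range iff max - min + 1 equals its size, regardless of the order in which the ids appear in the batch.

#include <cstdint>
#include <set>

static bool is_contiguous_seq_range(const std::set<int32_t> & seq_ids) {
    if (seq_ids.empty()) {
        return false;
    }
    const int32_t mn = *seq_ids.begin();  // std::set iterates in ascending order
    const int32_t mx = *seq_ids.rbegin();
    return (mx - mn + 1) == (int32_t) seq_ids.size();
}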
Is KV cache quantization not supported yet? I've just updated to master, set LLAMA_SET_ROWS=1 and ran llama-server with my usual command line, but got the following error (even with a very small context and just saying "hi" from the webui):
I removed …
@rujialiu You can see a summary of the supported backends here: #14661. For example, with CUDA, only … Btw, how are you using …
Thanks for the info! That message appeared before sending any request:
BTW: This is a surprisingly usable setup with 12GB VRAM, allowing 4 concurrent coding agents with context length 128k. I've already been using it with cline and am able to get some non-trivial jobs done (though it should be used with care) :D
There was indeed a bug - will be fixed with #14733.
@ggerganov It looks like the mask is not correctly padded with parallel processing:
Edit: With …
@CISC The …

diff --git a/src/llama-memory-hybrid.cpp b/src/llama-memory-hybrid.cpp
index d8e2086c8..ab6470bdf 100644
--- a/src/llama-memory-hybrid.cpp
+++ b/src/llama-memory-hybrid.cpp
@@ -31,21 +31,21 @@ llama_memory_hybrid::llama_memory_hybrid(
hparams(model.hparams),
mem_attn(new llama_kv_cache_unified(
model,
filter_attn == nullptr ?
[&](int32_t il) { return !hparams.is_recurrent(il); }
: filter_attn,
type_k,
type_v,
v_trans,
offload,
- 1,
+ false,
kv_size,
n_seq_max,
n_pad,
n_swa,
swa_type
)),
mem_recr(new llama_memory_recurrent(
model,
filter_recr == nullptr ?
[&](int32_t il) { return hparams.is_recurrent(il); }
I tested again after #14712 was merged. I can confirm that my use case is working without concurrent access. The VRAM usage is reduced from 11GB to 10GB at startup, and I've used it for a while with cline without any problem. Then I used claude code to test concurrent access and, after some time, got an assertion failure. Here is some log at the end:
The command line is almost the same, except for the server port number.
@rujialiu Thanks for reporting these - very useful information. The CUDA backend needs to be updated to handle quantized KV cache with the new split KV mode (#14756). Btw, are you interested in writing a tutorial for using Claude Code with …
Yes, I'd love to share my experience! Recently I tried a lot of approaches to agentic coding with …
Since the code is not merged yet, I manually tried that branch with the same command line (q4_0 KV cache):
I also tested q8 KV cache for a while (only a few minutes), too. No crash or assertion failure. BTW: I'm planning to try Qwen3-Coder-480B-A35B-Instruct soon, to tune what to offload to the CPU and see how your new changes work. @ggerganov
OK, I see #14820 is replaced by #14822.
Yes, but make sure to apply the patch that I posted there to remove the extra alloc. Otherwise it will assert.
Unfortunately, it exited without any message after a few minutes (the session was ~30 minutes, but I think there was a gap of >15 minutes without access). At startup, VRAM usage was 9.8GB, and when it exited, it was 9.95GB:
You can add some more debug prints temporarily in that branch so I can give you more information next time. @ggerganov
I'm unable to make it exit a second time. I've been using it till now (though not very intensively). One hour has passed, and VRAM usage is now 10.07GB. I even explicitly asked cc to spawn parallel tasks (and can indeed see the requests being handled by different slots). So it looks like it's a very rare problem. Are you able to reproduce this with e.g. …
Yes. Now I think it might be related to peak memory. I can see that the peak RAM usage is 18GB greater than its current usage. After doing some math, I realized that at the "peak RAM" time it's very close to OOM. I think …
I believe when CUDA fails to allocate, it will print an error instead of silently exiting. Though I'm not 100% sure.
I actually mean the CPU failed to allocate RAM, not CUDA. In my use case, most RAM is used by the KV cache, I think (I used -nkvo). It looks like it not only allocates memory, but also frees a lot of memory after some idle time?
I conducted an experiment using the L40S GPU and the Gemma 27B model (Q_4), and I noticed an unusual increase in latency. With 1 CCU, the latency consistently hovered around 600ms. However, after experiencing a high load with 32 CCU, I checked the latency again using 1 CCU, and it had risen to approximately 1500ms. Do you have any insights on this issue? Here is my command:
@prd-tuong-nguyen Most likely you didn't set …
@ggerganov I've already set it, but I've noticed that after a high load, my model consistently returns an empty response while the server logs indicate that it generates the maximum number of tokens (which is set to 36 in my configuration), and that's why the latency increases. It may be a bug.
It's possible that you ran out of context - try to add …
target #14285
Overview
Improve multi-sequence decoding performance by avoiding the cross-sequence attention compute.
Note
To enable this functionality, there is a temporary requirement for LLAMA_SET_ROWS=1 to be set in your environment. In the future, this will become the default. See below for more info. If you try to use the "split KV" cache and haven't added LLAMA_SET_ROWS=1, you will see the following warning:

Description
One significant drawback of the unified KV cache is that it leads to a lot of unnecessary computation in the attention when the unified buffer is shared between many large independent sequences. The reason is that we have to view this buffer contiguously and therefore we end up computing large portions of "cross-sequence attention" which we then simply discard. For example, with 4 independent sequences of similar length packed into one unified buffer, each new token attends to roughly 4x more KV cells than actually belong to its own sequence.
With this change, we add an option to split the unified KV cache buffer into multiple buffers - one for each sequence. This decouples the sequences from each other and improves the performance and memory usage of the attention when more than one sequence is used. To achieve that, when the batch reaches the attention, we split it into multiple "streams":
llama.cpp/src/llama-graph.cpp
Lines 1035 to 1044 in c96c48c
Each stream has its own KV cache buffer and thus no longer "sees" the other streams - it attends only to the tokens that belong to the same stream.
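A rough sketch of the idea (shapes and variable names simplified and assumed for illustration; the real logic is the build_attn_mha section referenced above): because the ubatch orders tokens stream by stream, the token dimension of the query can simply be folded into (tokens-per-stream, streams).

// q starts out as [n_embd_head, n_head, n_tokens] with n_tokens % n_stream == 0
// and the tokens sorted by stream, so a plain reshape splits it per stream:
ggml_tensor * q4 = ggml_reshape_4d(ctx, q, n_embd_head, n_head, n_tokens/n_stream, n_stream);

// k and v are per-stream views of the cache, and the mask has an n_stream dimension,
// so each stream attends only to its own KV cells
// (see the extended ggml_flash_attn_ext() interface discussed above)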
With this approach we now have 2 modes:
The new "split" mode is enabled by default. However it requires the
LLAMA_SET_ROWS=1
environment variable to be set. Otherwise, a warning will be printed and the context will fallback to "unified" mode. In the future, after there is enoughggml_set_rows()
coverage in the backends (#14661) this will become the default mode.To force the old "unified" mode, use
--kv-unified
CLI arg.API Changes
bool llama_context_params::kv_unified. Default is false.
llama.cpp/include/llama.h
Lines 336 to 340 in fb8150d
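A hedged usage sketch of the new flag (other fields left at their defaults):

llama_context_params cparams = llama_context_default_params();
cparams.n_seq_max  = 4;       // serve up to 4 parallel sequences
cparams.kv_unified = false;   // the default in this PR: one KV cache stream per sequence
// set cparams.kv_unified = true to restore the previous behaviour (single shared KV buffer)

llama_context * ctx = llama_init_from_model(model, cparams);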
Testing
Use LLAMA_SET_ROWS=1 llama-[command] ...
Qwen 2.5 Coder 3B Q8_0, M2 Ultra
Gemma 3 4B Q8_0, M2 Ultra
Using a more real-world example with llama-parallel:

TODO
- ggml_soft_max_ext() support for virtual sequences
- llama_memory_seq_cp support for virtual sequences
- split_equal support sequential ids
- LLAMA_HT become regular compute parameter
- n_ctx meaning (total vs per-sequence)
- Require n_embd_v_gqa(il) == const when FA is off (no longer needed)

Next PRs
- … (split_equal + padding) and stream split [TAG_NO_CACHE_PAD]
- … ggml_set_rows() is fully adopted
- llama-parallel to use different RNG seeds for the different clients