graph : reduce splits for recurrent and hybrid models #14825
Conversation
Amazing! I'll get numbers on Metal tomorrow, but this looks like a great improvement.
@gabe-l-hart Your investigation in #14743 (comment) (and nearby comments) really helped with this, thank you! The remaining 3 splits look like this on CUDA, for granite-4.0-tiny-preview:
Most of the model graph is in that last split.
You could remove the last split by using …
One tiny comment suggestion, but otherwise this looks great. I combined your changes here with #14743 and saw the following speedup for G4 tiny:
Apple M3 Max / 64GB
`./bin/llama-batched-bench -m ~/models/granite-4.0-tiny-preview/ggml-model-Q4_K_M.gguf -c 131072 -b 2048 -ub 512 -npp 256,2560 -ntg 256 -npl 1,2 -ngl 99`
Current master
PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
---|---|---|---|---|---|---|---|---|---|
256 | 256 | 1 | 512 | 0.844 | 303.46 | 6.187 | 41.38 | 7.031 | 72.82 |
256 | 256 | 2 | 1024 | 1.636 | 313.02 | 10.742 | 47.66 | 12.377 | 82.73 |
2560 | 256 | 1 | 2816 | 8.300 | 308.44 | 6.308 | 40.58 | 14.608 | 192.77 |
2560 | 256 | 2 | 5632 | 16.473 | 310.80 | 11.213 | 45.66 | 27.687 | 203.42 |
With #14743
PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
---|---|---|---|---|---|---|---|---|---|
256 | 256 | 1 | 512 | 0.273 | 936.38 | 5.489 | 46.64 | 5.762 | 88.85 |
256 | 256 | 2 | 1024 | 0.510 | 1004.44 | 9.650 | 53.05 | 10.160 | 100.79 |
2560 | 256 | 1 | 2816 | 2.467 | 1037.78 | 5.646 | 45.34 | 8.113 | 347.10 |
2560 | 256 | 2 | 5632 | 4.971 | 1029.87 | 9.799 | 52.25 | 14.770 | 381.31 |
With both
PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
---|---|---|---|---|---|---|---|---|---|
256 | 256 | 1 | 512 | 0.270 | 946.54 | 4.697 | 54.50 | 4.967 | 103.07 |
256 | 256 | 2 | 1024 | 0.496 | 1031.39 | 8.748 | 58.53 | 9.244 | 110.77 |
2560 | 256 | 1 | 2816 | 2.433 | 1052.04 | 4.843 | 52.85 | 7.277 | 386.98 |
2560 | 256 | 2 | 5632 | 4.900 | 1044.89 | 8.846 | 57.88 | 13.746 | 409.73 |
I tried @slaren's suggestion by adding …
@compilade any objection to me hitting the merge button on this?
@ggerganov would it be OK to get this merged soon? It looks like @compilade may be otherwise occupied, and I'm hoping to get it into my big bump PR in Ollama.
* graph : avoid creating redundant s_copy views
* graph : comment the s_copy views
This should fix the problem noticed in #14743 (comment) by @ggerganov. It should avoid making redundant views of `s_copy`.

This was inspired by the potential solutions in #14743 (comment) and #14743 (comment), by @slaren and @gabe-l-hart, respectively. Thank you for looking into this problem!

The approach taken here is to create and store the two views of `s_copy` in `llm_graph_input_rs` as `s_copy_main` and `s_copy_extra` when the graph input is created, instead of creating them in `build_rs` (which is called on each model layer).

The following tests were run on a build compiled with `-DGGML_CUDA=ON` (with an NVIDIA GeForce RTX 3080).
With `./bin/llama-batched-bench -m /workspace/models/Granite-4.0-Tiny-Preview-62x915M-Q8_0.gguf -c 2048 -b 2048 -ub 512 -npp 128,256,512,1024 -ntg 256 -npl 1 -ngl 99`:

master:

This PR:
This noticeably increases generation speed (the `S_TG t/s` column).