graph : reduce splits for recurrent and hybrid models #14825


Merged · 2 commits merged into master from compilade/recurrent-reduce-splits on Jul 31, 2025

Conversation

@compilade (Collaborator) commented on Jul 23, 2025

This should fix the problem noticed in #14743 (comment) by @ggerganov, by avoiding redundant views of s_copy.

This was inspired by the potential solutions in #14743 (comment) and #14743 (comment), by @slaren and @gabe-l-hart, respectively. Thank you for looking into this problem!

The approach taken here is to create and store the two views of s_copy in llm_graph_input_rs as s_copy_main and s_copy_extra when the graph input is created, instead of creating them in build_rs (which is called once per model layer).
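
A minimal sketch of the idea (the field names follow the description above; the surrounding types, helper name, and split sizes are assumptions for illustration, not the literal PR diff):

```cpp
// Sketch only: create the two views of s_copy once, when the graph input
// is built, instead of re-creating them in build_rs() for every layer.
struct llm_graph_input_rs {
    ggml_tensor * s_copy;       // I32 [n_rs]: source row for each recurrent state
    ggml_tensor * s_copy_main;  // I32 [n_seqs]: view of the first n_seqs entries
    ggml_tensor * s_copy_extra; // I32 [n_rs - n_seqs]: view of the remaining entries
};

static void build_rs_input(ggml_context * ctx, llm_graph_input_rs & inp,
                           int64_t n_rs, int64_t n_seqs) {
    inp.s_copy = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, n_rs);

    // both views are created exactly once here; build_rs() only reuses them,
    // so the scheduler no longer sees a fresh pair of views per model layer
    inp.s_copy_main  = ggml_view_1d(ctx, inp.s_copy, n_seqs, 0);
    inp.s_copy_extra = ggml_view_1d(ctx, inp.s_copy, n_rs - n_seqs,
                                    n_seqs * ggml_element_size(inp.s_copy));
}
```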

The following tests were run on a build compiled with -DGGML_CUDA=ON (on an NVIDIA GeForce RTX 3080).

With ./bin/llama-batched-bench -m /workspace/models/Granite-4.0-Tiny-Preview-62x915M-Q8_0.gguf -c 2048 -b 2048 -ub 512 -npp 128,256,512,1024 -ntg 256 -npl 1 -ngl 99:

master:

```
llama_context: graph nodes  = 3580
llama_context: graph splits = 17
```

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 128 | 256 | 1 | 384 | 0.059 | 2181.55 | 2.837 | 90.22 | 2.896 | 132.59 |
| 256 | 256 | 1 | 512 | 0.077 | 3329.60 | 2.795 | 91.59 | 2.872 | 178.27 |
| 512 | 256 | 1 | 768 | 0.132 | 3864.18 | 2.805 | 91.26 | 2.938 | 261.43 |
| 1024 | 256 | 1 | 1280 | 0.262 | 3910.52 | 2.809 | 91.15 | 3.070 | 416.87 |

This PR:

```
llama_context: graph nodes  = 3438
llama_context: graph splits = 3
```

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 128 | 256 | 1 | 384 | 0.058 | 2217.99 | 2.548 | 100.49 | 2.605 | 147.39 |
| 256 | 256 | 1 | 512 | 0.075 | 3412.65 | 2.500 | 102.39 | 2.575 | 198.82 |
| 512 | 256 | 1 | 768 | 0.131 | 3911.20 | 2.509 | 102.01 | 2.640 | 290.87 |
| 1024 | 256 | 1 | 1280 | 0.263 | 3893.83 | 2.516 | 101.76 | 2.779 | 460.63 |

This noticeably increases generation speed (the S_TG t/s column, roughly 11-12% faster here).



@compilade added the performance (Speed related topics) label on Jul 23, 2025
@gabe-l-hart (Collaborator) commented

Amazing! I'll get numbers on Metal tomorrow, but this looks like a great improvement.

@compilade (Collaborator, Author) commented on Jul 23, 2025

@gabe-l-hart Your investigation in #14743 (comment) (and nearby comments) really helped with this, thank you!

The remaining 3 splits look like this on CUDA, for granite-4.0-tiny-preview:

```
## SPLIT #0: CUDA0 # 2 inputs: [ (view) (   0K)] [ (view) (   0K)]
node #  2 (     SCALE): cache_r_l0 (reshaped (   0K) [CUDA0    1.dst] use=0: cache_r_l0 (reshaped (   0K) [CUDA0    1.dst]
node #  4 (  GET_ROWS):               node_4 (  39K) [CUDA0    2.sup] use=1: cache_r_l0 (reshaped (  39K) [CUDA0    1.dst]      CUDA0# (view)#0 (   0K) [ NULL    4.cpy]
node #  6 (  GET_ROWS):               node_6 (   0K) [CUDA0    2.sup] use=1: cache_r_l0 (reshaped (  39K) [CUDA0    1.dst]      CUDA0# (view)#0 (   0K) [ NULL    4.cpy]
node #  8 (       CPY): cache_r_l0 (view) (c (   0K) [CUDA0    1.dst] use=0:               node_6 (   0K) [CUDA0    2.sup]    cache_r_l0 (view) (   0K) [CUDA0    1.dst]

## SPLIT #1: CPU # 0 inputs
node # 10 (  GET_ROWS):              node_10 (   6K) [  CPU   1.wgt0] use=1:    token_embd.weight (  76M) [  CPU    1.dst]               leaf_4 (   0K) [  CPU    1.inp]

## SPLIT #2: CUDA0 # 3 inputs: [node_10 (   6K)] [leaf_104 (  40K)] [leaf_742 (   0K)]
node # 11 (     SCALE):             inp_embd (   6K) [CUDA0    2.sup] use=2:      CUDA0#node_10#0 (   6K) [ NULL    4.cpy]
node # 12 (  RMS_NORM):               norm-0 (   6K) [CUDA0      usr] use=1:             inp_embd (   6K) [CUDA0    2.sup]
node # 13 (       MUL):          attn_norm-0 (   6K) [CUDA0   1.wgt1] use=1:               norm-0 (   6K) [CUDA0      usr] blk.0.attn_norm.weig (   6K) [CUDA0    1.dst]
node # 15 (   MUL_MAT):              node_15 (  25K) [CUDA0   1.wgt0] use=3:  blk.0.ssm_in.weight (  10M) [CUDA0    1.dst] attn_norm-0 (reshape (   6K) [CUDA0   4.vsrc]
node # 18 (    CONCAT):              node_18 (  52K) [CUDA0    2.sup] use=2:    node_4 (reshaped) (  39K) [CUDA0   4.vsrc]  (view) (transposed) (  13K) [CUDA0   4.vsrc]
node # 21 (       CPY): cache_r_l0 (view) (c (  39K) [CUDA0    1.dst] use=0:               (view) (  51K) [CUDA0   4.vsrc]    cache_r_l0 (view) (  39K) [CUDA0    1.dst]
...
node #3434 (  RMS_NORM):                 norm (   6K) [CUDA0    2.sup] use=1:             l_out-39 (   6K) [CUDA0    2.sup]
node #3435 (       MUL):          result_norm (   6K) [CUDA0   1.wgt1] use=1:                 norm (   6K) [CUDA0    2.sup]   output_norm.weight (   6K) [CUDA0    1.dst]
node #3436 (   MUL_MAT):            node_3436 ( 192K) [CUDA0   1.wgt0] use=1:    token_embd.weight (  76M) [CUDA0    1.dst]          result_norm (   6K) [CUDA0   1.wgt1]
node #3437 (     SCALE):        result_output ( 192K) [CUDA0    2.sup] use=0:            node_3436 ( 192K) [CUDA0   1.wgt0]
```

Most of the model graph is in that last split.
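
(For anyone wanting to reproduce this: the dump above is the ggml backend scheduler's split/assignment debug output; if I remember correctly it is printed when running with GGML_SCHED_DEBUG=2 set in the environment, e.g. GGML_SCHED_DEBUG=2 ./bin/llama-batched-bench ... with the same arguments as above.)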

@slaren (Member) commented on Jul 23, 2025

You could remove the last split by using ggml_build_forward_expand to make the CPU GET_ROWS on token_embd the first operation in the graph.
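
Something along these lines, as a sketch (gf, ctx, tok_embd, and inp_tokens here are placeholders for the usual graph-build context, not the actual llama.cpp code):

```cpp
// inp_embd is the result of the CPU GET_ROWS on token_embd
ggml_tensor * inp_embd = ggml_get_rows(ctx, tok_embd, inp_tokens);

// expanding it into the graph before anything else pins the CPU
// operation at the front, so the scheduler does not need a separate
// split for it in the middle of the graph
ggml_build_forward_expand(gf, inp_embd);
```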

@gabe-l-hart (Collaborator) left a review comment

One tiny comment suggestion, but otherwise this looks great. I combined your changes here with #14743 and saw the following speedup for G4 tiny:

Apple M3 Max / 64GB

./bin/llama-batched-bench -m ~/models/granite-4.0-tiny-preview/ggml-model-Q4_K_M.gguf -c 131072 -b 2048 -ub 512 -npp 256,2560 -ntg 256 -npl 1,2 -ngl 99

Current master

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 256 | 256 | 1 | 512 | 0.844 | 303.46 | 6.187 | 41.38 | 7.031 | 72.82 |
| 256 | 256 | 2 | 1024 | 1.636 | 313.02 | 10.742 | 47.66 | 12.377 | 82.73 |
| 2560 | 256 | 1 | 2816 | 8.300 | 308.44 | 6.308 | 40.58 | 14.608 | 192.77 |
| 2560 | 256 | 2 | 5632 | 16.473 | 310.80 | 11.213 | 45.66 | 27.687 | 203.42 |

With #14743

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 256 | 256 | 1 | 512 | 0.273 | 936.38 | 5.489 | 46.64 | 5.762 | 88.85 |
| 256 | 256 | 2 | 1024 | 0.510 | 1004.44 | 9.650 | 53.05 | 10.160 | 100.79 |
| 2560 | 256 | 1 | 2816 | 2.467 | 1037.78 | 5.646 | 45.34 | 8.113 | 347.10 |
| 2560 | 256 | 2 | 5632 | 4.971 | 1029.87 | 9.799 | 52.25 | 14.770 | 381.31 |

With both

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 256 | 256 | 1 | 512 | 0.270 | 946.54 | 4.697 | 54.50 | 4.967 | 103.07 |
| 256 | 256 | 2 | 1024 | 0.496 | 1031.39 | 8.748 | 58.53 | 9.244 | 110.77 |
| 2560 | 256 | 1 | 2816 | 2.433 | 1052.04 | 4.843 | 52.85 | 7.277 | 386.98 |
| 2560 | 256 | 2 | 5632 | 4.900 | 1044.89 | 8.846 | 57.88 | 13.746 | 409.73 |

@gabe-l-hart (Collaborator) commented

> You could remove the last split by using ggml_build_forward_expand to make the CPU GET_ROWS on token_embd the first operation in the graph.

I tried @slaren's suggestion by adding ggml_build_forward_expand(gf, cur); here. It did successfully remove the extra split. There was no noticeable change in the perf numbers (on Metal), but it may be worth adding for the future.

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 256 | 256 | 1 | 512 | 0.266 | 960.74 | 4.767 | 53.70 | 5.034 | 101.71 |
| 256 | 256 | 2 | 1024 | 0.496 | 1031.28 | 8.545 | 59.92 | 9.041 | 113.26 |
| 2560 | 256 | 1 | 2816 | 2.436 | 1051.05 | 4.879 | 52.47 | 7.315 | 384.98 |
| 2560 | 256 | 2 | 5632 | 4.891 | 1046.76 | 8.755 | 58.48 | 13.646 | 412.72 |

@gabe-l-hart (Collaborator) commented

@compilade any objection to me hitting the merge button on this?

@gabe-l-hart (Collaborator) commented

@ggerganov would it be OK to get this merged soon? It looks like @compilade may be otherwise occupied and I'm hoping to get it into my big bump PR in Ollama.

@ggerganov merged commit 66625a5 into master on Jul 31, 2025 (50 of 51 checks passed) and deleted the compilade/recurrent-reduce-splits branch on July 31, 2025 at 05:03.
Nexesenex pushed a commit to Nexesenex/croco.cpp referencing this pull request on Aug 1, 2025:

- graph : avoid creating redundant s_copy views
- graph : comment the s_copy views