graph : reduce splits for recurrent and hybrid models #14825


Merged · 2 commits merged into master from compilade/recurrent-reduce-splits on Jul 31, 2025

Conversation

@compilade (Collaborator) commented on Jul 23, 2025

This should fix the problem noticed in #14743 (comment) by @ggerganov, by avoiding redundant views of s_copy.

This was inspired by the potential solutions in #14743 (comment) and #14743 (comment), by @slaren and @gabe-l-hart, respectively. Thank you for looking into this problem!

The approach taken here is to create and store the two views of s_copy in llm_graph_input_rs as s_copy_main and s_copy_extra when the graph input is created, instead of creating them in build_rs (which is called once per model layer).
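
A minimal sketch of the idea (the field names follow the description above; the surrounding types, helper name, and split sizes are assumptions for illustration, not the literal PR diff):

```cpp
// Sketch only: create the two views of s_copy once, when the graph input
// is built, instead of re-creating them in build_rs() for every layer.
struct llm_graph_input_rs {
    ggml_tensor * s_copy;       // I32 [n_rs]: source row for each recurrent state
    ggml_tensor * s_copy_main;  // I32 [n_seqs]: view of the first n_seqs entries
    ggml_tensor * s_copy_extra; // I32 [n_rs - n_seqs]: view of the remaining entries
};

static void build_rs_input(ggml_context * ctx, llm_graph_input_rs & inp,
                           int64_t n_rs, int64_t n_seqs) {
    inp.s_copy = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, n_rs);

    // both views are created exactly once here; build_rs() only reuses them,
    // so the scheduler no longer sees a fresh pair of views per model layer
    inp.s_copy_main  = ggml_view_1d(ctx, inp.s_copy, n_seqs, 0);
    inp.s_copy_extra = ggml_view_1d(ctx, inp.s_copy, n_rs - n_seqs,
                                    n_seqs * ggml_element_size(inp.s_copy));
}
```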

The following tests were run on a build compiled with -DGGML_CUDA=ON (on an NVIDIA GeForce RTX 3080).

With ./bin/llama-batched-bench -m /workspace/models/Granite-4.0-Tiny-Preview-62x915M-Q8_0.gguf -c 2048 -b 2048 -ub 512 -npp 128,256,512,1024 -ntg 256 -npl 1 -ngl 99:

master:

```
llama_context: graph nodes  = 3580
llama_context: graph splits = 17
```

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 128 | 256 | 1 | 384 | 0.059 | 2181.55 | 2.837 | 90.22 | 2.896 | 132.59 |
| 256 | 256 | 1 | 512 | 0.077 | 3329.60 | 2.795 | 91.59 | 2.872 | 178.27 |
| 512 | 256 | 1 | 768 | 0.132 | 3864.18 | 2.805 | 91.26 | 2.938 | 261.43 |
| 1024 | 256 | 1 | 1280 | 0.262 | 3910.52 | 2.809 | 91.15 | 3.070 | 416.87 |

This PR:

```
llama_context: graph nodes  = 3438
llama_context: graph splits = 3
```

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 128 | 256 | 1 | 384 | 0.058 | 2217.99 | 2.548 | 100.49 | 2.605 | 147.39 |
| 256 | 256 | 1 | 512 | 0.075 | 3412.65 | 2.500 | 102.39 | 2.575 | 198.82 |
| 512 | 256 | 1 | 768 | 0.131 | 3911.20 | 2.509 | 102.01 | 2.640 | 290.87 |
| 1024 | 256 | 1 | 1280 | 0.263 | 3893.83 | 2.516 | 101.76 | 2.779 | 460.63 |

This noticeably increases generation speed (the S_TG t/s column, roughly 11-12% faster here).



@compilade added the performance (Speed related topics) label on Jul 23, 2025
@gabe-l-hart (Collaborator) commented

Amazing! I'll get numbers on Metal tomorrow, but this looks like a great improvement.

@compilade (Collaborator, Author) commented on Jul 23, 2025

@gabe-l-hart Your investigation in #14743 (comment) (and nearby comments) really helped with this, thank you!

The remaining 3 splits look like this on CUDA, for granite-4.0-tiny-preview:

```
## SPLIT #0: CUDA0 # 2 inputs: [ (view) (   0K)] [ (view) (   0K)]
node #  2 (     SCALE): cache_r_l0 (reshaped (   0K) [CUDA0    1.dst] use=0: cache_r_l0 (reshaped (   0K) [CUDA0    1.dst]
node #  4 (  GET_ROWS):               node_4 (  39K) [CUDA0    2.sup] use=1: cache_r_l0 (reshaped (  39K) [CUDA0    1.dst]      CUDA0# (view)#0 (   0K) [ NULL    4.cpy]
node #  6 (  GET_ROWS):               node_6 (   0K) [CUDA0    2.sup] use=1: cache_r_l0 (reshaped (  39K) [CUDA0    1.dst]      CUDA0# (view)#0 (   0K) [ NULL    4.cpy]
node #  8 (       CPY): cache_r_l0 (view) (c (   0K) [CUDA0    1.dst] use=0:               node_6 (   0K) [CUDA0    2.sup]    cache_r_l0 (view) (   0K) [CUDA0    1.dst]

## SPLIT #1: CPU # 0 inputs
node # 10 (  GET_ROWS):              node_10 (   6K) [  CPU   1.wgt0] use=1:    token_embd.weight (  76M) [  CPU    1.dst]               leaf_4 (   0K) [  CPU    1.inp]

## SPLIT #2: CUDA0 # 3 inputs: [node_10 (   6K)] [leaf_104 (  40K)] [leaf_742 (   0K)]
node # 11 (     SCALE):             inp_embd (   6K) [CUDA0    2.sup] use=2:      CUDA0#node_10#0 (   6K) [ NULL    4.cpy]
node # 12 (  RMS_NORM):               norm-0 (   6K) [CUDA0      usr] use=1:             inp_embd (   6K) [CUDA0    2.sup]
node # 13 (       MUL):          attn_norm-0 (   6K) [CUDA0   1.wgt1] use=1:               norm-0 (   6K) [CUDA0      usr] blk.0.attn_norm.weig (   6K) [CUDA0    1.dst]
node # 15 (   MUL_MAT):              node_15 (  25K) [CUDA0   1.wgt0] use=3:  blk.0.ssm_in.weight (  10M) [CUDA0    1.dst] attn_norm-0 (reshape (   6K) [CUDA0   4.vsrc]
node # 18 (    CONCAT):              node_18 (  52K) [CUDA0    2.sup] use=2:    node_4 (reshaped) (  39K) [CUDA0   4.vsrc]  (view) (transposed) (  13K) [CUDA0   4.vsrc]
node # 21 (       CPY): cache_r_l0 (view) (c (  39K) [CUDA0    1.dst] use=0:               (view) (  51K) [CUDA0   4.vsrc]    cache_r_l0 (view) (  39K) [CUDA0    1.dst]
...
node #3434 (  RMS_NORM):                 norm (   6K) [CUDA0    2.sup] use=1:             l_out-39 (   6K) [CUDA0    2.sup]
node #3435 (       MUL):          result_norm (   6K) [CUDA0   1.wgt1] use=1:                 norm (   6K) [CUDA0    2.sup]   output_norm.weight (   6K) [CUDA0    1.dst]
node #3436 (   MUL_MAT):            node_3436 ( 192K) [CUDA0   1.wgt0] use=1:    token_embd.weight (  76M) [CUDA0    1.dst]          result_norm (   6K) [CUDA0   1.wgt1]
node #3437 (     SCALE):        result_output ( 192K) [CUDA0    2.sup] use=0:            node_3436 ( 192K) [CUDA0   1.wgt0]
```

Most of the model graph is in that last split.
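
(For anyone wanting to reproduce this: the dump above is the ggml backend scheduler's split/assignment debug output; if I remember correctly it is printed when running with GGML_SCHED_DEBUG=2 set in the environment, e.g. GGML_SCHED_DEBUG=2 ./bin/llama-batched-bench ... with the same arguments as above.)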

@slaren (Member) commented on Jul 23, 2025

You could remove the last split by using ggml_build_forward_expand to make the CPU GET_ROWS on token_embd the first operation in the graph.
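
Something along these lines, as a sketch (gf, ctx, tok_embd, and inp_tokens here are placeholders for the usual graph-build context, not the actual llama.cpp code):

```cpp
// inp_embd is the result of the CPU GET_ROWS on token_embd
ggml_tensor * inp_embd = ggml_get_rows(ctx, tok_embd, inp_tokens);

// expanding it into the graph before anything else pins the CPU
// operation at the front, so the scheduler does not need a separate
// split for it in the middle of the graph
ggml_build_forward_expand(gf, inp_embd);
```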

@gabe-l-hart (Collaborator) left a review comment

One tiny comment suggestion, but otherwise this looks great. I combined your changes here with #14743 and saw the following speedup for G4 tiny:

Apple M3 Max / 64GB

./bin/llama-batched-bench -m ~/models/granite-4.0-tiny-preview/ggml-model-Q4_K_M.gguf -c 131072 -b 2048 -ub 512 -npp 256,2560 -ntg 256 -npl 1,2 -ngl 99

Current master

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 256 | 256 | 1 | 512 | 0.844 | 303.46 | 6.187 | 41.38 | 7.031 | 72.82 |
| 256 | 256 | 2 | 1024 | 1.636 | 313.02 | 10.742 | 47.66 | 12.377 | 82.73 |
| 2560 | 256 | 1 | 2816 | 8.300 | 308.44 | 6.308 | 40.58 | 14.608 | 192.77 |
| 2560 | 256 | 2 | 5632 | 16.473 | 310.80 | 11.213 | 45.66 | 27.687 | 203.42 |

With #14743

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 256 | 256 | 1 | 512 | 0.273 | 936.38 | 5.489 | 46.64 | 5.762 | 88.85 |
| 256 | 256 | 2 | 1024 | 0.510 | 1004.44 | 9.650 | 53.05 | 10.160 | 100.79 |
| 2560 | 256 | 1 | 2816 | 2.467 | 1037.78 | 5.646 | 45.34 | 8.113 | 347.10 |
| 2560 | 256 | 2 | 5632 | 4.971 | 1029.87 | 9.799 | 52.25 | 14.770 | 381.31 |

With both

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 256 | 256 | 1 | 512 | 0.270 | 946.54 | 4.697 | 54.50 | 4.967 | 103.07 |
| 256 | 256 | 2 | 1024 | 0.496 | 1031.39 | 8.748 | 58.53 | 9.244 | 110.77 |
| 2560 | 256 | 1 | 2816 | 2.433 | 1052.04 | 4.843 | 52.85 | 7.277 | 386.98 |
| 2560 | 256 | 2 | 5632 | 4.900 | 1044.89 | 8.846 | 57.88 | 13.746 | 409.73 |

@gabe-l-hart (Collaborator) commented

> You could remove the last split by using ggml_build_forward_expand to make the CPU GET_ROWS on token_embd the first operation in the graph.

I tried @slaren's suggestion by adding ggml_build_forward_expand(gf, cur); here. It did successfully remove the extra split. There was no noticeable change in the perf numbers (on Metal), but it may be worth adding for the future.

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 256 | 256 | 1 | 512 | 0.266 | 960.74 | 4.767 | 53.70 | 5.034 | 101.71 |
| 256 | 256 | 2 | 1024 | 0.496 | 1031.28 | 8.545 | 59.92 | 9.041 | 113.26 |
| 2560 | 256 | 1 | 2816 | 2.436 | 1051.05 | 4.879 | 52.47 | 7.315 | 384.98 |
| 2560 | 256 | 2 | 5632 | 4.891 | 1046.76 | 8.755 | 58.48 | 13.646 | 412.72 |

@gabe-l-hart (Collaborator) commented

@compilade any objection to me hitting the merge button on this?

@gabe-l-hart (Collaborator) commented

@ggerganov would it be OK to get this merged soon? It looks like @compilade may be otherwise occupied and I'm hoping to get it into my big bump PR in Ollama.

@ggerganov merged commit 66625a5 into master on Jul 31, 2025 (50 of 51 checks passed) and deleted the compilade/recurrent-reduce-splits branch on July 31, 2025 at 05:03.
Nexesenex pushed a commit to Nexesenex/croco.cpp referencing this pull request on Aug 1, 2025:

- graph : avoid creating redundant s_copy views
- graph : comment the s_copy views