
Jan 2025: prompt processing performance comparison


Setup

  • Latest ik_llama.cpp (build: 3c5f8722 (3531)) and llama.cpp (build: 9f7add1c (4517))
  • LLaMA-3.1-8B-Instruct
  • Zen4 (Ryzen-7950X), AVX2 (Ryzen-5975WX) and ARM_NEON (M2-Max) CPUs
  • Quantization is done with llama.cpp; the same quantized model is then run with both llama.cpp and ik_llama.cpp (using run-time repacking)
  • Performance is measured via llama-bench -m model -p 512 -n 0 -ngl 0 -t num_threads [-rtr 1]
  • BLAS is not used and Accelerate is disabled on the M2-Max so that the respective CPU implementation is tested
  • tinyBLAS is enabled in mainline llama.cpp (although it seems that not all types provided by tinyBLAS are actually utilized by current llama.cpp)
  • No attempts are made to fix the automatic llama.cpp configuration (e.g., on the M2-Max, I8 matrix multiplications get enabled despite the CPU not actually supporting them, which somewhat lowers Q4_0 performance)
  • The list of quantization types tested is long but not exhaustive (the QX_K_M/L and QX_1 variants were skipped)
  • To keep things simple, no Flash Attention is used

This is the script used to run the benchmarks:

```sh
#! /bin/sh

# Quantization types to test (the QX_K_M/L and QX_1 variants are skipped)
quants="q8_0 q4_0 q5_0 q2_K_S q3_K_S q4_K_S q5_K_S q6_K iq2_xxs iq2_xs iq2_m iq3_xxs iq3_s iq4_xs iq4_nl"

imatrix="../../ik_llama.cpp/nbuild/il31_imat_2048.dat"
model="../../llama.cpp/models/il31_8B/Meta-Llama-3.1-8B-Instruct-F16.gguf"
ofile="perf_neon_jan_2025.out"
nt_pp=8          # threads for the pp512 runs (adjusted per platform)
nt_tg="1,2,4,8"  # thread counts for the tg128 runs

for q in $quants; do
    echo "--- Working on $q"
    echo "    quantizing"
    ./bin/llama-quantize --imatrix $imatrix $model junk.bin $q >/dev/null 2>&1
    sleep 30     # pause to reduce thermal throttling
    echo "    running pp512"
    ./bin/llama-bench -m junk.bin -p 512 -ngl 0 -n 0 -t $nt_pp >>$ofile
    sleep 30
    echo "    running tg128"
    ./bin/llama-bench -m junk.bin -p 0 -ngl 0 -n 128 -t $nt_tg -r 3 >>$ofile
done
```

I have added some delays between the various runs to avoid (or at least reduce) thermal throttling (it takes quite some time to execute the above script). The number of threads used is adjusted according to the platform.
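
For reference, the pp512 numbers in the tables below come from two invocations that differ only in the binary being run and in the -rtr flag, which enables run-time repacking in ik_llama.cpp. A minimal sketch (binary paths and the thread count are placeholders and were adjusted per platform):

```sh
# mainline llama.cpp: CPU-only pp512 benchmark
./llama.cpp/build/bin/llama-bench -m junk.bin -p 512 -n 0 -ngl 0 -t 16

# ik_llama.cpp: same benchmark, with run-time repacking enabled via -rtr 1
./ik_llama.cpp/build/bin/llama-bench -m junk.bin -p 512 -n 0 -ngl 0 -t 16 -rtr 1
```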

Zen4 (Ryzen-7950X)

| model | size | test | t/s (llama.cpp) | t/s (ik_llama.cpp) | Speedup |
|---|---|---|---|---|---|
| F16 | 14.96 GiB | pp512 | 63.24 ± 0.18 | 139.08 ± 0.64 | 2.199 |
| BF16 | 14.96 GiB | pp512 | 112.03 ± 0.17 | 259.47 ± 0.71 | 2.316 |
| Q8_0 | 7.95 GiB | pp512 | 137.79 ± 2.44 | 257.81 ± 0.79 | 1.871 |
| Q4_0 | 4.35 GiB | pp512 | 146.20 ± 0.48 | 260.71 ± 1.21 | 1.783 |
| Q5_0 | 5.22 GiB | pp512 | 104.30 ± 3.30 | 243.08 ± 0.51 | 2.331 |
| Q2_K - Small | 2.78 GiB | pp512 | 115.23 ± 0.24 | 259.24 ± 1.06 | 2.250 |
| Q3_K - Small | 3.41 GiB | pp512 | 80.68 ± 0.11 | 245.12 ± 1.08 | 3.038 |
| Q4_K - Small | 4.36 GiB | pp512 | 102.85 ± 0.31 | 258.78 ± 1.05 | 2.516 |
| Q5_K - Small | 5.21 GiB | pp512 | 71.82 ± 0.12 | 246.19 ± 0.65 | 3.428 |
| Q6_K | 6.14 GiB | pp512 | 78.24 ± 0.16 | 257.25 ± 0.96 | 3.288 |
| IQ2_XXS - 2.0625 bpw | 2.23 GiB | pp512 | 43.54 ± 0.10 | 154.17 ± 0.39 | 3.541 |
| IQ2_XS - 2.3125 bpw | 2.42 GiB | pp512 | 47.46 ± 0.09 | 149.02 ± 1.36 | 3.140 |
| IQ2_M - 2.7 bpw | 2.74 GiB | pp512 | 41.91 ± 0.11 | 176.70 ± 0.97 | 4.216 |
| IQ3_XXS - 3.0625 bpw | 3.04 GiB | pp512 | 32.71 ± 0.05 | 143.16 ± 0.57 | 4.377 |
| IQ3_S - 3.4375 bpw | 3.42 GiB | pp512 | 28.96 ± 0.06 | 149.98 ± 2.85 | 5.179 |
| IQ4_XS - 4.25 bpw | 4.13 GiB | pp512 | 70.67 ± 0.21 | 256.51 ± 1.78 | 3.630 |
| IQ4_NL - 4.5 bpw | 4.35 GiB | pp512 | 114.51 ± 1.33 | 253.36 ± 0.49 | 2.213 |

To get a better idea of how things have evolved since my last comparison, below is a graph showing the ik_llama.cpp to llama.cpp performance ratio in July 2024 (black symbols) and in Jan 2025 (red symbols). The x-axis has no real meaning; it is simply the index of the quantization type in the above table. We see that the performance gap has increased significantly. The black and red horizontal dashed lines represent the geometric means of the data points for July 2024 (2.087) and Jan 2025 (2.883), respectively. I.e., on average, the performance of ik_llama.cpp relative to llama.cpp has improved by a factor of 2.883/2.087 = 1.38. The only two types where the performance ratio has remained about constant are Q5_0 (2.3X) and IQ4_NL (2.2X).

[Graph: ik_llama.cpp / llama.cpp pp512 performance ratio on Zen4, July 2024 (black) vs Jan 2025 (red)]
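
For anyone wanting to reproduce the quoted averages, the geometric mean of a column of speedup values is simply the exponential of the mean of their logarithms. A small awk sketch (the input file name is a placeholder for the Speedup column pasted one value per line):

```sh
# geometric mean = exp(average of log(speedup))
awk '{ s += log($1); n++ } END { printf "geometric mean = %.3f\n", exp(s/n) }' speedups.txt
# applied to the Speedup column of the Zen4 table this gives 2.883 for Jan 2025;
# the improvement over July 2024 is then 2.883/2.087 = 1.38
```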

AVX2 (Ryzen-5975WX)

| model | size | test | t/s (llama.cpp) | t/s (ik_llama.cpp) | Speedup |
|---|---|---|---|---|---|
| F16 | 14.96 GiB | pp512 | 91.61 ± 0.24 | 152.65 ± 0.30 | 1.666 |
| Q8_0 | 7.95 GiB | pp512 | 165.14 ± 0.42 | 251.75 ± 0.44 | 1.524 |
| Q4_0 | 4.35 GiB | pp512 | 177.90 ± 0.33 | 283.89 ± 0.70 | 1.600 |
| Q5_0 | 5.22 GiB | pp512 | 121.01 ± 0.27 | 266.92 ± 0.81 | 2.206 |
| Q2_K - Small | 2.78 GiB | pp512 | 168.97 ± 0.18 | 290.18 ± 0.33 | 1.717 |
| Q3_K - Small | 3.41 GiB | pp512 | 118.88 ± 0.29 | 267.69 ± 0.70 | 2.252 |
| Q4_K - Small | 4.36 GiB | pp512 | 148.69 ± 0.26 | 291.90 ± 0.64 | 1.963 |
| Q5_K - Small | 5.21 GiB | pp512 | 108.75 ± 0.23 | 273.59 ± 0.37 | 2.516 |
| Q6_K | 6.14 GiB | pp512 | 104.15 ± 0.24 | 264.83 ± 0.65 | 2.543 |
| IQ2_XXS - 2.0625 bpw | 2.23 GiB | pp512 | 73.37 ± 0.20 | 224.84 ± 0.26 | 3.064 |
| IQ2_XS - 2.3125 bpw | 2.42 GiB | pp512 | 68.90 ± 0.12 | 221.45 ± 0.57 | 3.214 |
| IQ2_M - 2.7 bpw | 2.74 GiB | pp512 | 68.71 ± 0.17 | 221.95 ± 0.31 | 3.230 |
| IQ3_XXS - 3.0625 bpw | 3.04 GiB | pp512 | 52.67 ± 0.16 | 211.67 ± 0.35 | 4.019 |
| IQ3_S - 3.4375 bpw | 3.42 GiB | pp512 | 44.88 ± 0.12 | 230.03 ± 0.30 | 5.125 |
| IQ4_XS - 4.25 bpw | 4.13 GiB | pp512 | 113.57 ± 0.18 | 265.45 ± 0.52 | 2.337 |
| IQ4_NL - 4.5 bpw | 4.35 GiB | pp512 | 131.26 ± 0.25 | 247.19 ± 0.45 | 1.883 |

Here the performance gains are slightly lower than on Zen4. This is simply due to the fact that llama.cpp still mostly does not take advantage of AVX512 features in its CPU back-end, so ik_llama.cpp is relatively faster on Zen4 than on vanilla AVX2. I did not run the comparison on this CPU back in July 2024, so there is no performance evolution graph like the one in the Zen4 section above.

ARM_NEON (M2-Max)

One would normally use the GPU when running LLMs on the M-series chips. But the performance on the M2-Max CPU is indicative of the performance on ARM-based systems that have no GPU, or where building with GPU support may prove difficult or impossible. So, here is the current state of affairs between ik_llama.cpp and llama.cpp on ARM_NEON:

| model | size | test | t/s (llama.cpp) | t/s (ik_llama.cpp) | Speedup |
|---|---|---|---|---|---|
| F16 | 14.96 GiB | pp512 | 29.20 ± 0.10 | 94.00 ± 0.41 | 3.219 |
| Q8_0 | 7.95 GiB | pp512 | 55.51 ± 0.92 | 132.28 ± 0.28 | 2.023 |
| Q4_0 | 4.35 GiB | pp512 | 114.44 ± 2.77 | 124.13 ± 0.26 | 1.085 |
| Q5_0 | 5.22 GiB | pp512 | 26.28 ± 0.97 | 104.28 ± 1.76 | 3.968 |
| Q2_K - Small | 2.78 GiB | pp512 | 33.05 ± 0.13 | 107.81 ± 2.50 | 3.262 |
| Q3_K - Small | 3.41 GiB | pp512 | 25.01 ± 0.00 | 108.19 ± 2.03 | 4.326 |
| Q4_K - Small | 4.36 GiB | pp512 | 42.59 ± 0.90 | 128.18 ± 1.30 | 3.010 |
| Q5_K - Small | 5.21 GiB | pp512 | 27.13 ± 0.66 | 112.03 ± 0.05 | 4.129 |
| Q6_K | 6.14 GiB | pp512 | 26.91 ± 0.33 | 103.93 ± 2.61 | 3.862 |
| IQ2_XXS - 2.0625 bpw | 2.23 GiB | pp512 | 18.96 ± 0.19 | 90.23 ± 0.12 | 4.759 |
| IQ2_XS - 2.3125 bpw | 2.42 GiB | pp512 | 20.59 ± 0.33 | 70.75 ± 0.35 | 3.436 |
| IQ2_M - 2.7 bpw | 2.74 GiB | pp512 | 14.59 ± 0.07 | 65.75 ± 1.05 | 4.507 |
| IQ3_XXS - 3.0625 bpw | 3.04 GiB | pp512 | 13.46 ± 0.13 | 77.43 ± 1.30 | 5.753 |
| IQ3_S - 3.4375 bpw | 3.42 GiB | pp512 | 11.34 ± 0.33 | 79.66 ± 1.86 | 7.025 |
| IQ4_XS - 4.25 bpw | 4.13 GiB | pp512 | 38.13 ± 0.74 | 136.29 ± 0.78 | 3.574 |
| IQ4_NL - 4.5 bpw | 4.35 GiB | pp512 | 96.63 ± 2.99 | 120.23 ± 1.63 | 1.244 |

The graph below shows the performance evolution since my last comparison for the above data (see the Zen4 section). The geometric means of the data points, represented by the black and red dashed lines, are 2.121 (July 2024, black) and 3.350 (Jan 2025, red). I.e., here the performance gap has increased even more than on Zen4 (3.35/2.12 = 1.58X). The only two types where llama.cpp has improved relative to ik_llama.cpp are Q4_0 (at index 4: performance ratio 1.164 in July 2024, 1.085 in Jan 2025) and IQ4_NL (at index 17: 2.212 in July 2024, 1.244 in Jan 2025). This is thanks to the special-purpose, ARM-specific implementation that has been added to llama.cpp. The downside of this improvement is that llama.cpp has lost support for tinyBLAS on ARM, leading to dismal performance for fp16 (performance ratio 1.056 in July 2024, 3.219 in Jan 2025) and Q8_0 (performance ratio 1.219 in July 2024, 2.023 in Jan 2025).

[Graph: ik_llama.cpp / llama.cpp pp512 performance ratio on the M2-Max (ARM_NEON), July 2024 (black) vs Jan 2025 (red)]
