## Jan 2025: prompt processing performance comparison
- Latest `ik_llama.cpp` (build: 3c5f8722 (3531)) and `llama.cpp` (build: 9f7add1c (4517))
- LLaMA-3.1-8B-Instruct
- `Zen4` (Ryzen-7950X), `AVX2` (Ryzen-5975WX) and `ARM_NEON` (M2-Max) CPUs
- Quantization is done with `llama.cpp`; the same quantized model is run with `llama.cpp` and `ik_llama.cpp` (using run-time repacking)
- Performance is measured via `llama-bench -m model -p 512 -n 0 -ngl 0 -t num_threads [-rtr 1]` (see the example invocations just after this list)
- BLAS is not used and Accelerate is disabled on the M2-Max so that the respective CPU implementation is tested
- tinyBLAS is enabled in mainline `llama.cpp` (although it seems that not all types provided by tinyBLAS are utilized with current `llama.cpp`)
- No attempt is made to fix the automatic `llama.cpp` configuration (e.g., on the M2-Max I8 matrix multiplications get enabled despite the CPU not actually supporting them, which somewhat lowers `Q4_0` performance)
- The list of quantization types tested is long but not exhaustive (the `QX_K_M/L` and `QX_1` types were skipped)
- To keep things simple, no Flash Attention is used
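For concreteness, this is roughly what a single measurement looks like. The binary locations, model file name, and thread count below are illustrative; only the extra `-rtr 1` flag differs between the two builds.

```sh
# Mainline llama.cpp (paths, model file and thread count are illustrative)
./llama.cpp/build/bin/llama-bench -m model-Q4_K_S.gguf -p 512 -n 0 -ngl 0 -t 16

# ik_llama.cpp, same command with run-time repacking enabled
./ik_llama.cpp/build/bin/llama-bench -m model-Q4_K_S.gguf -p 512 -n 0 -ngl 0 -t 16 -rtr 1
```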
This is the script used to run the benchmarks:

```sh
#!/bin/sh

# Quantization types to test
quants="q8_0 q4_0 q5_0 q2_K_S q3_K_S q4_K_S q5_K_S q6_K iq2_xxs iq2_xs iq2_m iq3_xxs iq3_s iq4_xs iq4_nl"
# Importance matrix used for quantization
imatrix="../../ik_llama.cpp/nbuild/il31_imat_2048.dat"
# F16 base model
model="../../llama.cpp/models/il31_8B/Meta-Llama-3.1-8B-Instruct-F16.gguf"
ofile="perf_neon_jan_2025.out"
nt_pp=8          # threads for prompt processing (pp512)
nt_tg="1,2,4,8"  # thread counts for token generation (tg128)

for q in $quants; do
    echo "--- Working on $q"
    echo "    quantizing"
    ./bin/llama-quantize --imatrix $imatrix $model junk.bin $q >/dev/null 2>&1
    sleep 30
    echo "    running pp512"
    ./bin/llama-bench -m junk.bin -p 512 -ngl 0 -n 0 -t $nt_pp >>$ofile
    sleep 30
    echo "    running tg128"
    ./bin/llama-bench -m junk.bin -p 0 -ngl 0 -n 128 -t $nt_tg -r 3 >>$ofile
done
```
I have added some delays between the various runs to avoid (or at least reduce) thermal throttling (it takes quite some time to execute the above script). The number of threads used is adjusted according to the platform.
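The Speedup column in the tables below is simply the ratio of the two pp512 means. Here is a minimal sketch of how such a ratio can be pulled out of the `llama-bench` output, assuming one model per output file and the default markdown table format in which the `t/s` column (reported as `mean ± stddev`) is the last one; the file names and the helper function are hypothetical.

```sh
# Hypothetical helper: extract the mean pp512 t/s from a llama-bench output file
# (assumes the default markdown table, with "mean ± stddev" in the last column)
pp512_tps () {
    grep pp512 "$1" | awk -F'|' '{ split($(NF-1), a, "±"); gsub(/ /, "", a[1]); print a[1] }'
}

main=$(pp512_tps mainline.out)   # e.g. 63.24
ik=$(pp512_tps ik.out)           # e.g. 139.08
echo "$ik $main" | awk '{ printf "Speedup: %.3f\n", $1 / $2 }'
```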
### Zen4 (Ryzen-7950X)

model | size | test | t/s (llama.cpp) | t/s (ik_llama.cpp) | Speedup |
---|---|---|---|---|---|
F16 | 14.96 GiB | pp512 | 63.24 ± 0.18 | 139.08 ± 0.64 | 2.199 |
BF16 | 14.96 GiB | pp512 | 112.03 ± 0.17 | 259.47 ± 0.71 | 2.316 |
Q8_0 | 7.95 GiB | pp512 | 137.79 ± 2.44 | 257.81 ± 0.79 | 1.871 |
Q4_0 | 4.35 GiB | pp512 | 146.20 ± 0.48 | 260.71 ± 1.21 | 1.783 |
Q5_0 | 5.22 GiB | pp512 | 104.30 ± 3.30 | 243.08 ± 0.51 | 2.331 |
Q2_K - Small | 2.78 GiB | pp512 | 115.23 ± 0.24 | 259.24 ± 1.06 | 2.250 |
Q3_K - Small | 3.41 GiB | pp512 | 80.68 ± 0.11 | 245.12 ± 1.08 | 3.038 |
Q4_K - Small | 4.36 GiB | pp512 | 102.85 ± 0.31 | 258.78 ± 1.05 | 2.516 |
Q5_K - Small | 5.21 GiB | pp512 | 71.82 ± 0.12 | 246.19 ± 0.65 | 3.428 |
Q6_K | 6.14 GiB | pp512 | 78.24 ± 0.16 | 257.25 ± 0.96 | 3.288 |
IQ2_XXS - 2.0625 bpw | 2.23 GiB | pp512 | 43.54 ± 0.10 | 154.17 ± 0.39 | 3.541 |
IQ2_XS - 2.3125 bpw | 2.42 GiB | pp512 | 47.46 ± 0.09 | 149.02 ± 1.36 | 3.140 |
IQ2_M - 2.7 bpw | 2.74 GiB | pp512 | 41.91 ± 0.11 | 176.70 ± 0.97 | 4.216 |
IQ3_XXS - 3.0625 bpw | 3.04 GiB | pp512 | 32.71 ± 0.05 | 143.16 ± 0.57 | 4.377 |
IQ3_S - 3.4375 bpw | 3.42 GiB | pp512 | 28.96 ± 0.06 | 149.98 ± 2.85 | 5.179 |
IQ4_XS - 4.25 bpw | 4.13 GiB | pp512 | 70.67 ± 0.21 | 256.51 ± 1.78 | 3.630 |
IQ4_NL - 4.5 bpw | 4.35 GiB | pp512 | 114.51 ± 1.33 | 253.36 ± 0.49 | 2.213 |
To get a better idea of how things have evolved since my last comparison, below is a graph showing the `ik_llama.cpp` to `llama.cpp` performance ratio in July 2024 (black symbols) and January 2025 (red symbols). The x-axis has no real meaning; it is simply the index of the type in the above table. We see that the performance gap has increased significantly. The only two types where the performance ratio has remained roughly constant are `Q5_0` (2.3X) and `IQ4_NL` (2.2X).
### AVX2 (Ryzen-5975WX)

model | size | test | t/s (llama.cpp) | t/s (ik_llama.cpp) | Speedup |
---|---|---|---|---|---|
F16 | 14.96 GiB | pp512 | 91.61 ± 0.24 | 152.65 ± 0.30 | 1.666 |
Q8_0 | 7.95 GiB | pp512 | 165.14 ± 0.42 | 251.75 ± 0.44 | 1.524 |
Q4_0 | 4.35 GiB | pp512 | 177.90 ± 0.33 | 283.89 ± 0.70 | 1.600 |
Q5_0 | 5.22 GiB | pp512 | 121.01 ± 0.27 | 266.92 ± 0.81 | 2.206 |
Q2_K - Small | 2.78 GiB | pp512 | 168.97 ± 0.18 | 290.18 ± 0.33 | 1.717 |
Q3_K - Small | 3.41 GiB | pp512 | 118.88 ± 0.29 | 267.69 ± 0.70 | 2.252 |
Q4_K - Small | 4.36 GiB | pp512 | 148.69 ± 0.26 | 291.90 ± 0.64 | 1.963 |
Q5_K - Small | 5.21 GiB | pp512 | 108.75 ± 0.23 | 273.59 ± 0.37 | 2.516 |
Q6_K | 6.14 GiB | pp512 | 104.15 ± 0.24 | 264.83 ± 0.65 | 2.543 |
IQ2_XXS - 2.0625 bpw | 2.23 GiB | pp512 | 73.37 ± 0.20 | 224.84 ± 0.26 | 3.064 |
IQ2_XS - 2.3125 bpw | 2.42 GiB | pp512 | 68.90 ± 0.12 | 221.45 ± 0.57 | 3.214 |
IQ2_M - 2.7 bpw | 2.74 GiB | pp512 | 68.71 ± 0.17 | 221.95 ± 0.31 | 3.230 |
IQ3_XXS - 3.0625 bpw | 3.04 GiB | pp512 | 52.67 ± 0.16 | 211.67 ± 0.35 | 4.019 |
IQ3_S - 3.4375 bpw | 3.42 GiB | pp512 | 44.88 ± 0.12 | 230.03 ± 0.30 | 5.125 |
IQ4_XS - 4.25 bpw | 4.13 GiB | pp512 | 113.57 ± 0.18 | 265.45 ± 0.52 | 2.337 |
IQ4_NL - 4.5 bpw | 4.35 GiB | pp512 | 131.26 ± 0.25 | 247.19 ± 0.45 | 1.883 |
Here the performance gains are slightly lower compared to `Zen4`. This is simply because `llama.cpp` still mostly does not take advantage of `AVX512` features in its CPU back-end, so `ik_llama.cpp` is relatively faster on `Zen4` than on vanilla `AVX2`. I did not run the comparison on this CPU back in July 2024, so there is no performance evolution graph like the one in the `Zen4` section above.