Performance of llama.cpp with Vulkan #10879
-
AMD FirePro W8100
-
AMD RX 470
-
Ubuntu 24.04, Vulkan and CUDA installed from the official APT packages.
build: 4da69d1 (4351)
vs. CUDA on the same build/setup:
build: 4da69d1 (4351)
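For anyone reproducing the comparison, a minimal sketch of how the two backends can be configured from the same checkout (the build directory names are my own):

```sh
# Vulkan backend
cmake -B build-vulkan -DGGML_VULKAN=ON
cmake --build build-vulkan --config Release -j $(nproc)
# CUDA backend, same commit
cmake -B build-cuda -DGGML_CUDA=ON
cmake --build build-cuda --config Release -j $(nproc)
```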
-
MacBook Air M2 on Asahi Linux
ggml_vulkan: Found 1 Vulkan devices:
-
Gentoo Linux on ROG Ally (2023), Ryzen Z1 Extreme
ggml_vulkan: Found 1 Vulkan devices:
-
ggml_vulkan: Found 4 Vulkan devices:
-
build: 0d52a69 (4439)
NVIDIA GeForce RTX 3090 (NVIDIA)
AMD Radeon RX 6800 XT (RADV NAVI21) (radv)
AMD Radeon (TM) Pro VII (RADV VEGA20) (radv)
Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver)
-
@netrunnereve Some of the tg results here are a little low, I think they might be debug builds. The cmake step (at least on Linux) might require an explicit -DCMAKE_BUILD_TYPE=Release.
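A minimal sketch of a Release-mode Vulkan build, with the build-type flag being the point in question:

```sh
# without an explicit build type some setups produce unoptimized (debug) binaries
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)
```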
-
Build: 8d59d91 (4450)
Lack of proper Xe coopmat support in the ANV driver is a setback, honestly.
edit: retested both with the default batch size.
-
Here's something exotic: an AMD FirePro S10000 dual GPU from 2012 with 2x 3GB GDDR5.
build: 914a82d (4452)
-
Latest Arch with … For the sake of consistency I run every bit in a script and also build every target from scratch (for some reason …). The script wraps each run like this:

```sh
kill -STOP -1          # pause every other process this user can signal, to reduce noise
timeout 240s $COMMAND  # run the benchmark with a four-minute cap
kill -CONT -1          # resume the paused processes
```

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Xe Graphics (TGL GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | warp size: 32 | matrix cores: none
build: ff3fcab (4459)
This bit seems to underutilise both the GPU and CPU in real conditions based on …
-
Intel ARC A770 on Windows:
build: ba8a1f9 (4460)
-
Single GPU Vulkan
Radeon Instinct MI25
ggml_vulkan: 0 = AMD Radeon Instinct MI25 (RADV VEGA10) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)
Radeon Pro VII
ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)
Multi GPU Vulkan
ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)
ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)
Single GPU ROCm
Device 0: AMD Radeon Instinct MI25, compute capability 9.0, VMM: no
build: 2739a71 (4461)
Device 0: AMD Radeon Pro VII, compute capability 9.0, VMM: no
build: 2739a71 (4461)
Multi GPU ROCm
Device 0: AMD Radeon Pro VII, compute capability 9.0, VMM: no
build: 2739a71 (4461)
Layer split
build: 2739a71 (4461)
Row split
build: 2739a71 (4461)
Single GPU speed is decent, but multi GPU trails ROCm by a wide margin, especially with large models, due to the lack of row split.
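The row split being referred to is llama-bench's -sm flag; a hedged sketch of the two multi-GPU invocations (model path assumed):

```sh
# layer split: whole layers are assigned to each GPU (works on Vulkan and ROCm)
./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -sm layer
# row split: tensors are split row-wise across GPUs; at the time of this
# comment only the CUDA/ROCm backend implements it
./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -sm row
```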
-
AMD Radeon RX 5700 XT on Arch, using mesa-git and a higher GPU power limit than the stock card.
I also think it would be interesting to add the flash attention results to the scoreboard (even if Vulkan's support for it still isn't as mature as CUDA's); see the sketch below.
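If the scoreboard does pick FA up, llama-bench can already emit both rows in one run; a small sketch:

```sh
# -fa 0,1 benchmarks each test with flash attention disabled and then enabled
./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
```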
-
I tried, but there was nothing after an hour... ok, maybe 40 minutes. Anyway, I ran llama-cli for a sample eval...
Meanwhile, OpenBLAS:
-
Lunar Lake 258V with 140V iGPU and Coopmat (Windows 11)
PS C:\Users\julia\Downloads\llama-b5598-bin-win-vulkan-x64> .\llama-bench.exe -m ..\llama-2-7b.Q4_0.gguf -ngl 99
build: 669c13e (5598)
-
OS: Arch Linux x86_64
ggml_vulkan: Found 1 Vulkan devices:
build: 228f34c (5604)

OS: Arch Linux x86_64
ggml_vulkan: Found 1 Vulkan devices:
build: 228f34c (5604)
-
OS: Arch Linux x86_64
vulkaninfo:
-
ggml_vulkan: 0 = Intel(R) Graphics (Intel Corporation) | uma: 1 | fp16: 1 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: none
build: 2e89f76 (5640)
-
Lenovo ThinkPad T14s
WSL2 Ubuntu 24.04:
build: ed52f36 (5648)
Windows:
ggml_vulkan: 0 = AMD Radeon(TM) Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 32768 | int dot: 0 | matrix cores: none
build: ed52f36 (5648)
Windows:
ggml_vulkan: 0 = AMD Radeon(TM) Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 32768 | int dot: 0 | matrix cores: none
build: ed52f36 (5648)
-
395+ on Arch
build: b7cc774 (5656)
-
Operating System: Microsoft Windows 11 24H2 (Build 26100.4061)
AMD Radeon RX 6600
build: ed52f36 (5648)
AMD Radeon RX 9060 XT (16 GB)
build: ed52f36 (5648)
-
RTX 3060 (ASUS Phoenix single fan)
ggml_vulkan: Found 1 Vulkan devices:
build: 860a9e4 (5688)
Better than the existing result, but still lower than CUDA:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: 860a9e4 (5688)
For some reason only one GPU was detected using Vulkan; running vulkaninfo displayed two GPUs, but each with a different Vulkan API version.
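As a workaround for picking between the reported devices, the Vulkan backend can be pointed at a specific one; a sketch, assuming the GGML_VK_VISIBLE_DEVICES environment variable honored by recent builds:

```sh
# restrict ggml's Vulkan instance to device index 0 (comma-separated indices)
GGML_VK_VISIBLE_DEVICES=0 ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99
```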
-
A dump of most of my AI-capable hardware. OS in all cases is Arch Linux x86_64; llama.cpp build: 860a9e4 (5688).
NVIDIA GeForce RTX 3090 (PCIe 4x16): ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100 -fa 0,1 -mg 1 -sm none -r 30
NVIDIA GeForce GTX 980: ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100 -fa 0,1
NVIDIA GeForce GTX 1650 Mobile: ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 30 -fa 0,1 -sm none -mg 1
NVIDIA GeForce MX 330: ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 15 -fa 0,1 -mg 0 -sm none
AMD Radeon Pro W5500 (PCIe 3x2): ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100 -fa 0,1 -mg 0 -sm none -r 30
AMD Radeon Pro WX 4100: ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100 -fa 0,1
AMD Radeon HD 7790: ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 7 -fa 0,1
AMD Radeon HD 7750: ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 7 -fa 0,1
AMD Ryzen 7 5800HS (DDR4 3200 @ 2ch): ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100 -fa 0,1 -sm none -mg 0
Intel Core i7-1165G7 (DDR4 2666 @ 1ch): ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100 -fa 0,1 -mg 1 -sm none
Intel Core i5-6300U: ./llama-bench -m ../../llama-2-7b.Q4_0.gguf -ngl 31 -fa 0,1
-
Update for Radeon RX 9070 XT (16GB) w/ Ryzen 9 5900X:
build: fa4a9f2 (5740)
-
I would like to add a mining GPU (Nvidia P106-100, Pascal arch) to the benchmark, since it can run llama-bench with CUDA at around 410-420 t/s. Unfortunately, it is not detected by Vulkan. In Windows it is listed under Display adapters but not as a GPU in Task Manager. I looked into Nvidia's Vulkan driver list and it seems this card is not a supported GPU.
-
NVIDIA GeForce GTX 1070 Ti
build: 860a9e4 (5688)
On CUDA it can reach over 700 t/s:
build: 860a9e4 (5688)
-
Again on an RX 6600. Comparing with my previous bench in March, the pp512 test improved by about 60%; the previous pp512 was 380.87 ± 0.21.
build: 72babea (5767)
-
Okay, gonna try submitting my first benchmarks. My understanding is that performance might vary a lot depending on the exact kernel drivers as well as the user-space implementations and versions in the Linux AMD ecosystem, so I will try to capture those details too, including how to gather the info for others to reproduce if desired. (Is there a script that does all this already? This is rather ad-hoc, and I'm not sure it accurately describes all the moving parts in play sufficiently to recreate such an environment. Haha, the AMD GPU stack is clear as mud to me...)
Hardware Details
Driver Details
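For capturing those driver details, a short sketch of the commands I'd reach for on Arch (package names are assumptions):

```sh
uname -r                        # kernel (amdgpu KMS driver) version
pacman -Qi mesa vulkan-radeon   # userspace Mesa / RADV package versions
vulkaninfo --summary            # active Vulkan driver, API version, device list
```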
Benchmarks
Vulkan
ROCm/HIP

```sh
$ export HIPCXX="$(hipconfig -l)/clang"
$ export HIP_PATH="$(hipconfig -R)"
$ cmake -B build -DGGML_VULKAN=0 -DGGML_HIP=ON -DGPU_TARGETS=gfx1100 -DGGML_HIP_ROCWMMA_FATTN=ON -DCMAKE_BUILD_TYPE=Release
$ cmake --build build --config Release -j $(nproc)
# run as sudo because permissions are borked despite adding user to group etc...
$ sudo ./build/bin/llama-bench \
    -m /models/TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_0.gguf \
    -ngl 100 \
    -fa 0,1 \
    -t 1
```

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
sweep-bench
To get a better understanding of the performance across the kv-cache size curve, I've run the same test model on both the Vulkan and ROCm/HIP backends as compiled above. I maintain a branch of llama-sweep-bench, ported from ik_llama.cpp, for mainline llama.cpp. This tool gives the full view across a wider context length rather than sampling a single point along the curve (e.g. pp512 or tg128). Also useful for checking various …

```sh
./build/bin/llama-sweep-bench \
    -m /models/TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_0.gguf \
    -fa \
    -c 18432 \
    -ngl 99 \
    --threads 1
```

The winner of this specific benchmark roundup, to me, is ROCm/HIP with flash attention enabled, despite Vulkan FA=1 having a slight advantage in token generation. At longer contexts FA=1 is a must for both Vulkan and ROCm/HIP; otherwise performance falls off rapidly despite similar tg128 scores. As a baseline I've added my home rig's 3090 Ti FE (24GB VRAM) running at the default 450W cap. While it gets much warmer, you can see the performance disparity. I've not tried the Vulkan backend with an NVIDIA GPU, but per some recent NVIDIA slides, for some models Vulkan may interestingly be faster than native CUDA. Thanks for this informative thread and all the help figuring out the best combination of kernel/userspace drivers and API/backend/compiler libraries to get the best performance out of this GPU on llama.cpp. Cheers!
-
Curious that with DeepSeek-Coder-V2-Lite-Instruct-Q4_K_M, DeepSeek-R1-Distill-Qwen-7B-Q4_K_M, or Devstral-Small-2505-Q4_K_M I see only about a 5% degradation in prompt processing/text generation speed when running RADV vs. AMDVLK, but with llama-2-7b.Q4_0 prompt processing is twice as fast with AMDVLK.
build: 27208bf (5774)
-
This is similar to the Apple Silicon benchmark thread, but for Vulkan! Many improvements have been made to the Vulkan backend and I think it's good to consolidate and discuss our results here.
We'll be testing the Llama 2 7B model like the other thread to keep things consistent, and use Q4_0 as it's simple to compute and small enough to fit on a 4GB GPU. You can download it here.
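The file used throughout this thread matches TheBloke's GGUF conversion (see the model paths in later comments); one way to fetch it, assuming the huggingface-cli tool:

```sh
# downloads llama-2-7b.Q4_0.gguf (~3.8 GB) into the current directory
huggingface-cli download TheBloke/Llama-2-7B-GGUF llama-2-7b.Q4_0.gguf --local-dir .
```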
Instructions
Either run the commands below or download one of our Vulkan releases. If you have multiple GPUs, please run the test on a single GPU using -sm none -mg YOUR_GPU_NUMBER unless the model is too big to fit in VRAM.
Share your llama-bench results along with the git hash and Vulkan info string in the comments. Feel free to try other models and compare backends, but only valid runs will be placed on the scoreboard.
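A minimal sketch of the usual build-and-bench sequence on Linux (assuming the Vulkan SDK and glslc are installed):

```sh
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j $(nproc)
# multi-GPU systems: add -sm none -mg YOUR_GPU_NUMBER to pin a single GPU
./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99
```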
If multiple entries are posted for the same device, newer commits with substantial Vulkan updates are prioritized; otherwise the one with the highest tg128 score will be used. Performance may vary depending on driver, operating system, board manufacturer, etc., even if the chip is the same. For integrated graphics, note that your memory speed and number of channels will greatly affect your inference speed!
Vulkan Scoreboard for Llama 2 7B, Q4_0 (no FA)
Vulkan Scoreboard for Llama 2 7B, Q4_0 (with FA)