Performance of llama.cpp on AMD HIP/ROCm #15021
-
RX 7800 XT (Sapphire Pulse 280W)

ggml_cuda_init: found 1 ROCm devices:
build: 00131d6 (6031)

ggml_vulkan: Found 1 Vulkan devices:
build: baad948 (6056)

Notes:
-
Happy to replicate:

ggml_cuda_init: found 1 ROCm devices:
build: 9c35706 (6060)

On Linux.
-
RX 7600 XT

ggml_cuda_init: found 1 ROCm devices:
build: 9c35706 (6060)

Running on Linux 6.12.32, mainline amdgpu, ROCm 6.4.1.

ggml_vulkan: Found 1 Vulkan devices:
build: 9c35706 (6060)
-
AMD MI60. Happy to contribute.

I will post FA=1 and Vulkan results once I have time over the weekend.
-
MI100

Using `./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -sm none -mg 0`

build: 9c35706 (6060)

I'm running Ubuntu 24.04.2 and ROCm 6.4.1.
-
AMD Instinct MI300X

```
root@0-4-9-gpu-mi300x1-192gb-devcloud-atl1:~/llama.cpp# ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
```

build: 2bf3fbf (6069)

Ref: #14640
-
Pro V620

Why does FA slow down the V620 so much? That's a question I've been trying to answer for a while now.

build: 03d4698 (6074)

Linux, ROCm 6.4.1 (will try upgrading soon)
-
Powercolor Hellhound RX 7900 XTX (400W power limit)

openSUSE Tumbleweed system with ROCm packages from
build: 5c0eb5e (6075)

Sapphire Nitro 7900 XTX (400W power limit)

In a different PC, unfortunately, because these GPUs are too chonky to fit in a regular case.
build: 9c35706 (6060)
-
This is similar to the Performance of llama.cpp on Apple Silicon M-series and Performance of llama.cpp with Vulkan threads, but for ROCm! I think it's good to consolidate and discuss our results here.
We'll be testing the Llama 2 7B model like the other threads to keep things consistent, using Q4_0 since it's simple to compute and small enough to fit on a 4 GB GPU. You can download it here.
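The link above points to the post's own copy of the file. As a sketch, one commonly used mirror of the same quantization is TheBloke's GGUF conversion on Hugging Face; the repo below is an assumption, not necessarily the post's link:

```sh
# Fetch Llama 2 7B Q4_0 (assumed mirror: TheBloke/Llama-2-7B-GGUF;
# the original post links its own copy, which may differ)
wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_0.gguf
```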
Instructions
Either run the commands below or download one of our ROCm (HIP) releases. If you have multiple GPUs, please run the test on a single GPU using `-sm none -mg YOUR_GPU_NUMBER` unless the model is too big to fit in VRAM.
Share your llama-bench results along with the git hash and ROCm info string in the comments. Feel free to try other models and compare backends, but only valid runs will be placed on the scoreboard.
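The exact commands from the original post aren't reproduced here; as a rough sketch, assuming a ROCm 6.x install and the HIP build flags from llama.cpp's build guide (the gfx target below is an example you must match to your own GPU), a typical build-and-bench run looks like:

```sh
# Build llama.cpp with the HIP/ROCm backend
# (gfx1100 = RDNA3; adjust AMDGPU_TARGETS to your GPU's gfx target)
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
  cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -- -j "$(nproc)"

# Benchmark with all layers offloaded, flash attention off and on;
# on multi-GPU systems add -sm none -mg N to pin the test to one device
./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
```

llama-bench prints the `build: <hash> (<number>)` line quoted in the reports above, so copy it verbatim into your comment.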
If multiple entries are posted for the same device, I'll prioritize newer commits with substantial ROCm updates; otherwise I'll pick the one with the highest overall score at my discretion. Performance may vary with driver, operating system, board manufacturer, etc., even when the chip is the same. For integrated graphics, note that your memory speed and number of channels will greatly affect your inference speed!
ROCm Scoreboard for Llama 2 7B, Q4_0 (no FA)
ROCm Scoreboard for Llama 2 7B, Q4_0 (with FA)