Performance of llama.cpp with Vulkan #10879
-
AMD FirePro W8100
-
AMD RX 470
-
Ubuntu 24.04, Vulkan and CUDA installed from the official APT packages.
build: 4da69d1 (4351)
vs. CUDA on the same build/setup:
build: 4da69d1 (4351)
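For anyone reproducing the comparison, a minimal sketch of how the two backends can be configured from the same checkout (the build directory names are my own):

```sh
# Vulkan backend
cmake -B build-vulkan -DGGML_VULKAN=ON
cmake --build build-vulkan --config Release -j $(nproc)
# CUDA backend, same commit
cmake -B build-cuda -DGGML_CUDA=ON
cmake --build build-cuda --config Release -j $(nproc)
```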
-
MacBook Air M2 on Asahi Linux
ggml_vulkan: Found 1 Vulkan devices:
-
Gentoo Linux on ROG Ally (2023), Ryzen Z1 Extreme
ggml_vulkan: Found 1 Vulkan devices:
-
ggml_vulkan: Found 4 Vulkan devices:
-
build: 0d52a69 (4439)
NVIDIA GeForce RTX 3090 (NVIDIA)
AMD Radeon RX 6800 XT (RADV NAVI21) (radv)
AMD Radeon (TM) Pro VII (RADV VEGA20) (radv)
Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver)
-
@netrunnereve Some of the tg results here are a little low, I think they might be debug builds. The cmake step (at least on Linux) might require an explicit -DCMAKE_BUILD_TYPE=Release.
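A minimal sketch of a Release-mode Vulkan build, with the build-type flag being the point in question:

```sh
# without an explicit build type some setups produce unoptimized (debug) binaries
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)
```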
-
Build: 8d59d91 (4450)
Lack of proper Xe coopmat support in the ANV driver is a setback, honestly.
edit: retested both with the default batch size.
-
Here's something exotic: an AMD FirePro S10000 dual GPU from 2012 with 2x 3GB GDDR5.
build: 914a82d (4452)
-
Latest Arch with … For the sake of consistency I run every bit in a script and also build every target from scratch (for some reason …). The script wraps each run like this:

```sh
kill -STOP -1          # pause every other process this user can signal, to reduce noise
timeout 240s $COMMAND  # run the benchmark with a four-minute cap
kill -CONT -1          # resume the paused processes
```

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Xe Graphics (TGL GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | warp size: 32 | matrix cores: none
build: ff3fcab (4459)
This bit seems to underutilise both the GPU and CPU in real conditions based on …
-
Intel ARC A770 on Windows:
build: ba8a1f9 (4460)
-
Single GPU Vulkan
Radeon Instinct MI25
ggml_vulkan: 0 = AMD Radeon Instinct MI25 (RADV VEGA10) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)
Radeon Pro VII
ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)
Multi GPU Vulkan
ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)
ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)
Single GPU ROCm
Device 0: AMD Radeon Instinct MI25, compute capability 9.0, VMM: no
build: 2739a71 (4461)
Device 0: AMD Radeon Pro VII, compute capability 9.0, VMM: no
build: 2739a71 (4461)
Multi GPU ROCm
Device 0: AMD Radeon Pro VII, compute capability 9.0, VMM: no
build: 2739a71 (4461)
Layer split
build: 2739a71 (4461)
Row split
build: 2739a71 (4461)
Single GPU speed is decent, but multi GPU trails ROCm by a wide margin, especially with large models, due to the lack of row split.
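The row split being referred to is llama-bench's -sm flag; a hedged sketch of the two multi-GPU invocations (model path assumed):

```sh
# layer split: whole layers are assigned to each GPU (works on Vulkan and ROCm)
./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -sm layer
# row split: tensors are split row-wise across GPUs; at the time of this
# comment only the CUDA/ROCm backend implements it
./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -sm row
```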
-
AMD Radeon RX 5700 XT on Arch, using mesa-git and a higher GPU power limit than the stock card.
I also think it would be interesting to add the flash attention results to the scoreboard (even if Vulkan's support for it still isn't as mature as CUDA's); see the sketch below.
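If the scoreboard does pick FA up, llama-bench can already emit both rows in one run; a small sketch:

```sh
# -fa 0,1 benchmarks each test with flash attention disabled and then enabled
./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
```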
-
I tried, but there was nothing after an hour... ok, maybe 40 minutes. Anyway, I ran llama-cli for a sample eval...
Meanwhile, OpenBLAS:
-
Lunar Lake 258V with 140V iGPU and Coopmat (Windows 11)
PS C:\Users\julia\Downloads\llama-b5598-bin-win-vulkan-x64> .\llama-bench.exe -m ..\llama-2-7b.Q4_0.gguf -ngl 99
build: 669c13e (5598)
-
OS: Arch Linux x86_64
ggml_vulkan: Found 1 Vulkan devices:
build: 228f34c (5604)

OS: Arch Linux x86_64
ggml_vulkan: Found 1 Vulkan devices:
build: 228f34c (5604)
-
OS: Arch Linux x86_64
vulkaninfo:
-
ggml_vulkan: 0 = Intel(R) Graphics (Intel Corporation) | uma: 1 | fp16: 1 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: none
build: 2e89f76 (5640)
-
Lenovo ThinkPad T14s
WSL2 Ubuntu 24.04:
build: ed52f36 (5648)
Windows:
ggml_vulkan: 0 = AMD Radeon(TM) Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 32768 | int dot: 0 | matrix cores: none
build: ed52f36 (5648)
Windows:
ggml_vulkan: 0 = AMD Radeon(TM) Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 32768 | int dot: 0 | matrix cores: none
build: ed52f36 (5648)
-
395+ on Arch
build: b7cc774 (5656)
-
Operating System: Microsoft Windows 11 24H2 (Build 26100.4061)
AMD Radeon RX 6600
build: ed52f36 (5648)
AMD Radeon RX 9060 XT (16 GB)
build: ed52f36 (5648)
-
RTX 3060 (ASUS Phoenix single fan)
ggml_vulkan: Found 1 Vulkan devices:
build: 860a9e4 (5688)
Better than the existing result, but still lower than CUDA:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: 860a9e4 (5688)
For some reason only one GPU was detected using Vulkan; running vulkaninfo displayed two GPUs, but each with a different Vulkan API version.
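As a workaround for picking between the reported devices, the Vulkan backend can be pointed at a specific one; a sketch, assuming the GGML_VK_VISIBLE_DEVICES environment variable honored by recent builds:

```sh
# restrict ggml's Vulkan instance to device index 0 (comma-separated indices)
GGML_VK_VISIBLE_DEVICES=0 ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99
```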
-
A dump of most of my AI-capable hardware. OS in all cases is Arch Linux x86_64; llama.cpp build: 860a9e4 (5688).
NVIDIA GeForce RTX 3090 (PCIe 4x16): ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100 -fa 0,1 -mg 1 -sm none -r 30
NVIDIA GeForce GTX 980: ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100 -fa 0,1
NVIDIA GeForce GTX 1650 Mobile: ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 30 -fa 0,1 -sm none -mg 1
NVIDIA GeForce MX 330: ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 15 -fa 0,1 -mg 0 -sm none
AMD Radeon Pro W5500 (PCIe 3x2): ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100 -fa 0,1 -mg 0 -sm none -r 30
AMD Radeon Pro WX 4100: ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100 -fa 0,1
AMD Radeon HD 7790: ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 7 -fa 0,1
AMD Radeon HD 7750: ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 7 -fa 0,1
AMD Ryzen 7 5800HS (DDR4 3200 @ 2ch): ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100 -fa 0,1 -sm none -mg 0
Intel Core i7-1165G7 (DDR4 2666 @ 1ch): ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 100 -fa 0,1 -mg 1 -sm none
Intel Core i5-6300U: ./llama-bench -m ../../llama-2-7b.Q4_0.gguf -ngl 31 -fa 0,1
-
Update for Radeon RX 9070 XT (16GB) w/ Ryzen 9 5900X:
build: fa4a9f2 (5740)
-
I would like to add a mining GPU (Nvidia P106-100, Pascal arch) to the benchmark, since it can run llama-bench with CUDA at around 410-420 t/s. Unfortunately, it is not detected by Vulkan. In Windows it is listed under Display adapters but not as a GPU in Task Manager. I looked into Nvidia's Vulkan driver list and it seems this card is not a supported GPU.
-
NVIDIA GeForce GTX 1070 Ti
build: 860a9e4 (5688)
On CUDA it can reach over 700 t/s:
build: 860a9e4 (5688)
-
Again on an RX 6600. Comparing with my previous bench in March, the pp512 test improved by about 60%; the previous pp512 was 380.87 ± 0.21.
build: 72babea (5767)
-
Okay, gonna try submitting my first benchmarks. My understanding is that performance might vary a lot depending on the exact kernel drivers as well as the user-space implementations and versions in the Linux AMD ecosystem, so I will try to capture those details too, including how to gather the info for others to reproduce if desired. (Is there a script that does all this already? This is rather ad-hoc, and I'm not sure it accurately describes all the moving parts in play sufficiently to recreate such an environment. Haha, the AMD GPU stack is clear as mud to me...)
Hardware Details
Driver Details
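For capturing those driver details, a short sketch of the commands I'd reach for on Arch (package names are assumptions):

```sh
uname -r                        # kernel (amdgpu KMS driver) version
pacman -Qi mesa vulkan-radeon   # userspace Mesa / RADV package versions
vulkaninfo --summary            # active Vulkan driver, API version, device list
```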
Benchmarks
Vulkan
ROCm/HIP

```sh
$ export HIPCXX="$(hipconfig -l)/clang"
$ export HIP_PATH="$(hipconfig -R)"
$ cmake -B build -DGGML_VULKAN=0 -DGGML_HIP=ON -DGPU_TARGETS=gfx1100 -DGGML_HIP_ROCWMMA_FATTN=ON -DCMAKE_BUILD_TYPE=Release
$ cmake --build build --config Release -j $(nproc)
# run as sudo because permissions are borked despite adding user to group etc...
$ sudo ./build/bin/llama-bench \
    -m /models/TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_0.gguf \
    -ngl 100 \
    -fa 0,1 \
    -t 1
```

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
sweep-bench
To get a better understanding of the performance across the kv-cache size curve, I've run the same test model on both the Vulkan and ROCm/HIP backends as compiled above. I maintain a branch of llama-sweep-bench, ported from ik_llama.cpp, for mainline llama.cpp. This tool gives the full view across a wider context length rather than sampling a single point along the curve (e.g. pp512 or tg128). Also useful for checking various …

```sh
./build/bin/llama-sweep-bench \
    -m /models/TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_0.gguf \
    -fa \
    -c 18432 \
    -ngl 99 \
    --threads 1
```

The winner of this specific benchmark roundup, to me, is ROCm/HIP with flash attention enabled, despite Vulkan FA=1 having a slight advantage in token generation. At longer contexts FA=1 is a must for both Vulkan and ROCm/HIP; otherwise performance falls off rapidly despite similar tg128 scores. As a baseline I've added my home rig's 3090 Ti FE (24GB VRAM) running at the default 450W cap. While it gets much warmer, you can see the performance disparity. I've not tried the Vulkan backend with an NVIDIA GPU, but per some recent NVIDIA slides, for some models Vulkan may interestingly be faster than native CUDA. Thanks for this informative thread and all the help figuring out the best combination of kernel/userspace drivers and API/backend/compiler libraries to get the best performance out of this GPU on llama.cpp. Cheers!
-
Curious that with DeepSeek-Coder-V2-Lite-Instruct-Q4_K_M, DeepSeek-R1-Distill-Qwen-7B-Q4_K_M, or Devstral-Small-2505-Q4_K_M I see only about a 5% degradation in prompt processing/text generation speed when running RADV vs. AMDVLK, but with llama-2-7b.Q4_0 prompt processing is twice as fast with AMDVLK.
build: 27208bf (5774)
-
This is similar to the Apple Silicon benchmark thread, but for Vulkan! Many improvements have been made to the Vulkan backend and I think it's good to consolidate and discuss our results here.
We'll be testing the Llama 2 7B model like the other thread to keep things consistent, and use Q4_0 as it's simple to compute and small enough to fit on a 4GB GPU. You can download it here.
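The file used throughout this thread matches TheBloke's GGUF conversion (see the model paths in later comments); one way to fetch it, assuming the huggingface-cli tool:

```sh
# downloads llama-2-7b.Q4_0.gguf (~3.8 GB) into the current directory
huggingface-cli download TheBloke/Llama-2-7B-GGUF llama-2-7b.Q4_0.gguf --local-dir .
```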
Instructions
Either run the commands below or download one of our Vulkan releases. If you have multiple GPUs, please run the test on a single GPU using -sm none -mg YOUR_GPU_NUMBER unless the model is too big to fit in VRAM.
Share your llama-bench results along with the git hash and Vulkan info string in the comments. Feel free to try other models and compare backends, but only valid runs will be placed on the scoreboard.
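A minimal sketch of the usual build-and-bench sequence on Linux (assuming the Vulkan SDK and glslc are installed):

```sh
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j $(nproc)
# multi-GPU systems: add -sm none -mg YOUR_GPU_NUMBER to pin a single GPU
./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99
```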
If multiple entries are posted for the same device, newer commits with substantial Vulkan updates are prioritized; otherwise the one with the highest tg128 score will be used. Performance may vary depending on driver, operating system, board manufacturer, etc., even if the chip is the same. For integrated graphics, note that your memory speed and number of channels will greatly affect your inference speed!
Vulkan Scoreboard for Llama 2 7B, Q4_0 (no FA)
Vulkan Scoreboard for Llama 2 7B, Q4_0 (with FA)