Implementation of GGML_NUMA_MIRROR for inferencing performance gain on numa systems #14969
Conversation
Should it also work on a 1920X?
It uses the standard NUMA libraries so it should work on any system with multiple NUMA nodes. I only have access to a dual Xeon though, so feel free to try it out.
For my next trick, I discovered that several quant types in the GGML CPU backend don't have AVX 512 optimisations. I'll do that on a different PR...
By the way, if you want to have hugepages allocated in a NUMA-aware way at boot time, you can write a systemd service like this:
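For example, something along these lines as /etc/systemd/system/numa-hugepages.service (a minimal sketch; the unit name matches the systemctl enable command further down, but the script path is an assumption):
[Unit]
Description=Allocate NUMA-aware hugepages
After=local-fs.target

[Service]
Type=oneshot
RemainAfterExit=yes
# The script path is an assumption - point this at wherever you save the allocation script
ExecStart=/usr/local/sbin/numa-hugepages.sh

[Install]
WantedBy=multi-user.target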
With a script like this:
#!/bin/bash
# Setup NUMA-aware hugepages - 188,928 per node (368GB per node)
# Clear existing hugepages
echo 0 > /proc/sys/vm/nr_hugepages
# Allocate on NUMA node 0
numactl --cpunodebind=0 --membind=0 bash -c 'echo 188928 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages'
# Allocate on NUMA node 1
numactl --cpunodebind=1 --membind=1 bash -c 'echo 188928 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages'
# Verify allocation
echo "Node 0 hugepages:"
cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
echo "Node 1 hugepages:"
cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
Note that the above numbers need to be adjusted based on your system and the number of nodes you have. I have a two-socket Xeon (hence, 2 NUMA nodes) with 768GB of memory (384GB per node), so my magic numbers are 188928 hugepages per node. Then just
systemctl daemon-reload
systemctl enable numa-hugepages.service
and on boot you will get:
Keep in mind this reserves all but 32GB of system RAM for hugepages. You can adjust those numbers up and down depending on your needs. Each 'page' is 2MB so just do the math.
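As a quick sanity check of that math (using only the figures from the script above):
# 188928 pages per node * 2MB per page, expressed in GB
echo $(( 188928 * 2 / 1024 ))   # 369 GB per node, ~738 GB reserved in total on a 768 GB box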
Just tested on a dual
(I get around 6.5-6.75 tokens/s with my optimised settings already.) I used the same number of hugepages as you:
# Allocate on NUMA node 0
numactl --cpunodebind=0 --membind=0 bash -c 'echo 188928 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages'
# Allocate on NUMA node 1
numactl --cpunodebind=1 --membind=1 bash -c 'echo 188928 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages'
and can confirm it used them and loaded fine (but a lot slower). For reference:
#!/bin/bash
host_address=192.168.1.1
port_number=8080
# Turn off NUMA balancing
echo 0 | sudo tee /proc/sys/kernel/numa_balancing > /dev/null
# Ask for permission to drop caches
read -p "Do you want to drop caches? (y/n) " -n 1 -r
echo # Move to a new line
if [[ $REPLY =~ ^[Yy]$ ]]
then
echo "Dropping caches..."
echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
fi
# Run the main command
CUDA_VISIBLE_DEVICES=0 ~/llama.cpp/build/bin/llama-server \
--host "$host_address" \
--port "$port_number" \
--alias "DeepSeek-R1-0528" \
--jinja \
--chat-template-file ~/models/DeepSeek-R1-0528.jinja \
--model ~/models/gguf/DeepSeek-R1-0528-Q6_K_X.gguf \
--n-gpu-layers 99 \
--numa distribute \
--threads 80 \
--override-tensor exps=CPU \
--flash-attn \
--ctx_size 65536 \
--batch-size 8192 \
--ubatch-size 8192
This offloads the non-shared experts to the CPU.
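As a side note, one way to sanity-check where the weights actually landed once the server has loaded (a sketch, assuming numastat from the numactl package is installed and the binary name matches):
# Per-NUMA-node memory breakdown for the newest llama-server process
numastat -p "$(pgrep -nf llama-server)"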
I've optimised these settings by trying just about every possible combination though: Using
I forgot to add that this is lucky, as it's also the best 4-bit quant type to use for the non-shared experts in terms of perplexity (IIRC, +0.5% compared to
The added PP speed of
It would be helpful if you guys could compile and run the
And can you also make sure you run with
I think you mean
These are just my stock settings and not this PR though - I probably won't have a chance to run that until tomorrow or Monday now. According to https://en.wikichip.org/wiki/intel/xeon_gold/6248:
No, I mean run llama-server with
Looks like you are getting symmetric bandwidth usage on both nodes though. I think that's what I would expect...
Oh, sorry
Anyway, I think I'll add a very detailed log at the start of memory allocation with:
Then it will be very clear and easy to debug. There could be strange things like NUMA sub-clustering going on; I read that's a thing.
It also occurs to me that PCIe slots are always attached to a single socket, so NUMA-aware allocation might impact that; that would be an interesting side effect. I'm learning more about memory every day... 😄
I tried this out on my dual Xeon 4216 system (no GPU) with Cohere Command-A on RHEL 8. I had to make changes to the
Unfortunately, I didn't see any change to performance on my system. Here's the command I used:
Edit: I tried allocating hugepages with a script similar to what you shared above, except with 49152 2048kB hugepages per node. Still no performance change.
Thinking about this more today, for offloading shared experts only: if the sampling frequency is high enough, then I might be able to hack
It might be worth trying without the
Do you find that using
IIRC, the current CUDA offloading code only uses a single GPU for the offloaded calculations, so having 2 copies won't really help it. I do think there is a bottleneck somewhere as
If the threads doing the offloading are located on socket 1, but the GPU is on a PCIe slot attached to socket 2, maybe that would be sending the traffic over the UPI link? Might be worth investigating. I'll try to get better visibility of thread/NUMA assignments in soon.
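One way to check which socket a GPU hangs off (a sketch; the PCI address below is a placeholder, and nvidia-smi is assumed for NVIDIA cards):
# List GPUs and their PCI addresses, then ask the kernel for the slot's NUMA node
lspci | grep -i -e nvidia -e vga
cat /sys/bus/pci/devices/0000:3b:00.0/numa_node   # substitute your GPU's address; -1 means no affinity reported
# nvidia-smi topo -m also prints per-GPU CPU/NUMA affinity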
Without this PR, I had a slight speedup from using HyperThreading (i.e. --threads 64 instead of --threads 32). Removing -np 2 had no impact on performance (for a single request, nothing running concurrently). However, I noticed that with -np 2 the generated tokens were gibberish, while without -np 2 it was giving valid/correct outputs. Looks like a bug. With this PR, switching from --threads 64 to --threads 32 had the same slowdown I had without this PR.
Installed libnuma and pulled the latest changes (9d66473). Had to disable building RPC to build successfully. Tried to run it with Qwen3-235B-A22B-Instruct-2507-UD-Q4_K_XL. I set vm.nr_hugepages to 160000 to make sure the model and context had enough space. The first time I ran it, it took much longer than regular llama.cpp to load - I didn't time it, but it felt like 5 minutes, whereas regular llama.cpp takes 1 minute or less. Subsequent loads were very quick, much quicker than llama.cpp.

I haven't been able to get any output. Prompt processing takes forever even on short six-word prompts (e.g. "write a pong game in C++"). In htop, I see only two cores (on CPU0) at 100%, while all others are at 0%. The cores are the first and the 24th in htop. The system is a dual 24-core Xeon (ES QQ89, with HT enabled). I think there's a bug in thread pinning. The 24th core would have been the first core of the second CPU if HT was disabled. All threads get pinned to those two cores regardless of whether I set -t or not in llama-server. Tried using numactl with --physcpubind=$(seq -s, 1 2 95), which usually pins one worker to each physical core, but all threads still get mapped to the same two cores (0 and 24). Waited a couple of minutes on that pong prompt to see if I get any output, but not a single token.

EDIT: Got my dual Epyc back online, and can confirm the same behavior as on the dual Xeon. Compiled the branch and ran with --threads 96. Can see all threads get crammed onto cpuid 00 and 48 in the log output, as well as in htop. Can also confirm what @aifartist mentioned about SMT threads not being the same as Intel consumer. Running
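A quick way to see that pinning from outside the process (a sketch; assumes the server binary is called llama-server):
# psr is the logical CPU each thread last ran on; with the bug described above it clusters on 0 and 24/48
ps -T -o tid,psr,pcpu,comm -p "$(pgrep -nf llama-server)" | sort -n -k2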
Personally, if that's an issue, I'd put GPU support aside while the patch is being developed.
I've done quite a bit of testing and code deep diving over the weekend. What I've realised is that:
All of this said, now I can see what needs to be done to get this over the line. Each socket needs its own threadpool and the matrix operations need to be divvied up between the NUMA nodes / sockets; then we can leverage data parallelism. I am iterating on this locally at the moment and will update when I have something to test.
Some information that may be useful.
Looking at the code architecture, COSMA would really need to be its own new backend, and just throw away ggml-cpu. This could be good or bad, I'm not sure :D I like the idea. As a pedagogical exercise, I'll carry on with the framework I've created up to now, and maybe attempt that as a new PR when I feel more confident.
Just a draft for now. Uses code from the fork by @wkgcass, added cleanup and merged it with a recent cut of master. This strategy mirrors the model in the local memory of each NUMA node on your system to eliminate the slow UPI link bottleneck.
Headline Improvements
Test system is a dual Xeon Gold 6240 with 768GB of DDR4 @ 2933MHz, 6 channels per socket.
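For reference, the back-of-the-envelope peak bandwidth for that configuration (assuming 8 bytes per transfer at 2933 MT/s):
# 6 channels * 2933 MT/s * 8 bytes, per socket
echo $(( 6 * 2933 * 8 / 1000 ))   # ~140 GB/s per socket, ~281 GB/s across both sockets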
I see a performance improvement during inferencing of 64.6% on my system:
I see both memory banks being fully utilised during inference using the Intel pcm-memory tool:
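For anyone wanting to reproduce that view, Intel PCM's memory tool prints per-socket read/write bandwidth at a fixed sampling interval (depending on how PCM was packaged, the binary may be named pcm-memory or pcm-memory.x):
# Sample the memory bandwidth counters once per second
sudo pcm-memory 1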
Instructions
sudo apt-get install -y libnuma-dev
Check out the source and build with -DGGML_NUMA_MIRROR=ON.
Make sure you run as a user with the ability to write to /dev/hugepages.
Allocate some hugepages on your system. This allocates about 80GB, enough for 2x Qwen3-32B:
sudo sysctl -w vm.nr_hugepages=40000
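To confirm the reservation took (40000 pages x 2MB is roughly the 80GB mentioned above):
# HugePages_Total should report 40000 once the sysctl has been applied
grep Huge /proc/meminfo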
Run on the CPU (-ngl 0 or whatever) and with --numa distribute.
You should see the following: