
Implementation of GGML_NUMA_MIRROR for inferencing performance gain on numa systems #14969


Draft: wants to merge 46 commits into master

Conversation


@dbsanfte dbsanfte commented Jul 30, 2025

Just a draft for now. It uses code from the fork by @wkgcass, cleaned up and merged with a recent cut of master.

This strategy mirrors the model weights into the local memory of each NUMA node on your system, eliminating the slow cross-socket UPI link as a bottleneck.
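To make the mechanism concrete, here is a minimal conceptual sketch (my reading of the approach, not code from this PR): each NUMA node holds its own mirror of the weights, and a worker thread reads from the copy local to the node it is currently running on.

// Conceptual sketch (assumption, not this PR's code): pick the weight mirror
// that is local to the NUMA node the calling thread is running on, so tensor
// reads never have to cross the UPI link.
#define _GNU_SOURCE
#include <sched.h>   // sched_getcpu()
#include <numa.h>    // numa_node_of_cpu(); link with -lnuma

static void * local_mirror(void * mirrors[], int n_nodes) {
    int node = numa_node_of_cpu(sched_getcpu());
    if (node < 0 || node >= n_nodes) {
        node = 0;  // fall back to node 0 if the lookup fails
    }
    return mirrors[node];  // pointer into this node's copy of the model data
}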

Headline Improvements

Test system is a dual Xeon Gold 6240 with 768GB of DDR4 @ 2933 MHz, 6 channels per socket.

I see a 64.6% performance improvement during inference on my system:

root@xeon:/home/dbsanfte/llama-cpp-dbsanfte# /home/dbsanfte/llama-cpp-dbsanfte/build/bin/llama-bench -m /home/dbsanfte/models/Qwen3-32B-Q6_K.gguf -ngl 0 -ot ".*=CPU" --numa distribute
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
| model                          |       size |     params | backend    | ngl | ot                    |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------------- | --------------: | -------------------: |
| qwen3 32B Q6_K                 |  25.03 GiB |    32.76 B | ROCm       |   0 | .*=CPU                |           pp512 |         64.55 ± 0.01 |
| qwen3 32B Q6_K                 |  25.03 GiB |    32.76 B | ROCm       |   0 | .*=CPU                |           tg128 |          2.43 ± 0.00 |

build: fa72aa39 (6010)


root@xeon:/home/dbsanfte/llama-cpp-dbsanfte# /home/dbsanfte/llama.cpp-rocm/build/bin/llama-bench -m /home/dbsanfte/models/Qwen3-32B-Q6_K.gguf -ngl 0 -ot ".*=CPU" --numa distribute
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
| model                          |       size |     params | backend    | ngl | ot                    |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------------- | --------------: | -------------------: |
| qwen3 32B Q6_K                 |  25.03 GiB |    32.76 B | ROCm       |   0 | .*=CPU                |           pp512 |         64.22 ± 0.11 |
| qwen3 32B Q6_K                 |  25.03 GiB |    32.76 B | ROCm       |   0 | .*=CPU                |           tg128 |          1.57 ± 0.00 |

build: a86f52b2 (5973)

I see both sockets' memory banks being fully utilised during inference, as shown by the Intel pcm-memory tool:

|---------------------------------------||---------------------------------------|
|--            System DRAM Read Throughput(MB/s):      61880.60                --|
|--           System DRAM Write Throughput(MB/s):       2604.99                --|
|--             System PMM Read Throughput(MB/s):          0.00                --|
|--            System PMM Write Throughput(MB/s):          0.00                --|
|--                 System Read Throughput(MB/s):      61880.60                --|
|--                System Write Throughput(MB/s):       2604.99                --|
|--               System Memory Throughput(MB/s):      64485.59                --|
|---------------------------------------||---------------------------------------|
|---------------------------------------||---------------------------------------|
|--             Socket  0             --||--             Socket  1             --|
|---------------------------------------||---------------------------------------|
|--     Memory Channel Monitoring     --||--     Memory Channel Monitoring     --|
|---------------------------------------||---------------------------------------|
|-- Mem Ch  0: Reads (MB/s):  4990.48 --||-- Mem Ch  0: Reads (MB/s):  5244.65 --|
|--            Writes(MB/s):    37.25 --||--            Writes(MB/s):   400.60 --|
|--      PMM Reads(MB/s)   :     0.00 --||--      PMM Reads(MB/s)   :     0.00 --|
|--      PMM Writes(MB/s)  :     0.00 --||--      PMM Writes(MB/s)  :     0.00 --|
|-- Mem Ch  1: Reads (MB/s):  4989.23 --||-- Mem Ch  1: Reads (MB/s):  5198.51 --|
|--            Writes(MB/s):    35.61 --||--            Writes(MB/s):   333.43 --|
|--      PMM Reads(MB/s)   :     0.00 --||--      PMM Reads(MB/s)   :     0.00 --|
|--      PMM Writes(MB/s)  :     0.00 --||--      PMM Writes(MB/s)  :     0.00 --|
|-- Mem Ch  2: Reads (MB/s):  4988.42 --||-- Mem Ch  2: Reads (MB/s):  5291.83 --|
|--            Writes(MB/s):    34.97 --||--            Writes(MB/s):   480.57 --|
|--      PMM Reads(MB/s)   :     0.00 --||--      PMM Reads(MB/s)   :     0.00 --|
|--      PMM Writes(MB/s)  :     0.00 --||--      PMM Writes(MB/s)  :     0.00 --|
|-- Mem Ch  3: Reads (MB/s):  4990.57 --||-- Mem Ch  3: Reads (MB/s):  5259.35 --|
|--            Writes(MB/s):    32.57 --||--            Writes(MB/s):   438.86 --|
|--      PMM Reads(MB/s)   :     0.00 --||--      PMM Reads(MB/s)   :     0.00 --|
|--      PMM Writes(MB/s)  :     0.00 --||--      PMM Writes(MB/s)  :     0.00 --|
|-- Mem Ch  4: Reads (MB/s):  4991.53 --||-- Mem Ch  4: Reads (MB/s):  5245.06 --|
|--            Writes(MB/s):    33.23 --||--            Writes(MB/s):   399.35 --|
|--      PMM Reads(MB/s)   :     0.00 --||--      PMM Reads(MB/s)   :     0.00 --|
|--      PMM Writes(MB/s)  :     0.00 --||--      PMM Writes(MB/s)  :     0.00 --|
|-- Mem Ch  5: Reads (MB/s):  4992.82 --||-- Mem Ch  5: Reads (MB/s):  5233.07 --|
|--            Writes(MB/s):    34.20 --||--            Writes(MB/s):   389.41 --|
|--      PMM Reads(MB/s)   :     0.00 --||--      PMM Reads(MB/s)   :     0.00 --|
|--      PMM Writes(MB/s)  :     0.00 --||--      PMM Writes(MB/s)  :     0.00 --|
|-- SKT  0 Mem Read (MB/s) : 29943.05 --||-- SKT  1 Mem Read (MB/s) : 31472.47 --|
|-- SKT  0 Mem Write(MB/s) :   207.83 --||-- SKT  1 Mem Write(MB/s) :  2442.21 --|
|-- SKT  0 PMM Read (MB/s):      0.00 --||-- SKT  1 PMM Read (MB/s):      0.00 --|
|-- SKT  0 PMM Write(MB/s):      0.00 --||-- SKT  1 PMM Write(MB/s):      0.00 --|
|-- SKT  0.0 NM read hit rate :  1.00 --||-- SKT  1.0 NM read hit rate :  1.03 --|
|-- SKT  0.1 NM read hit rate :  1.00 --||-- SKT  1.1 NM read hit rate :  1.03 --|
|-- SKT  0.2 NM read hit rate :  0.00 --||-- SKT  1.2 NM read hit rate :  0.00 --|
|-- SKT  0.3 NM read hit rate :  0.00 --||-- SKT  1.3 NM read hit rate :  0.00 --|
|-- SKT  0.4 NM read hit rate :  0.00 --||-- SKT  1.4 NM read hit rate :  0.00 --|
|-- SKT  0.5 NM read hit rate :  0.00 --||-- SKT  1.5 NM read hit rate :  0.00 --|
|-- SKT  0.6 NM read hit rate :  0.00 --||-- SKT  1.6 NM read hit rate :  0.00 --|
|-- SKT  0.7 NM read hit rate :  0.00 --||-- SKT  1.7 NM read hit rate :  0.00 --|
|-- SKT  0.8 NM read hit rate :  0.00 --||-- SKT  1.8 NM read hit rate :  0.00 --|
|-- SKT  0.9 NM read hit rate :  0.00 --||-- SKT  1.9 NM read hit rate :  0.00 --|
|-- SKT  0.10 NM read hit rate :  0.00 --||-- SKT  1.10 NM read hit rate :  0.00 --|
|-- SKT  0.11 NM read hit rate :  0.00 --||-- SKT  1.11 NM read hit rate :  0.00 --|
|-- SKT  0.12 NM read hit rate :  0.00 --||-- SKT  1.12 NM read hit rate :  0.00 --|
|-- SKT  0.13 NM read hit rate :  0.00 --||-- SKT  1.13 NM read hit rate :  0.00 --|
|-- SKT  0.14 NM read hit rate :  0.00 --||-- SKT  1.14 NM read hit rate :  0.00 --|
|-- SKT  0.15 NM read hit rate :  0.00 --||-- SKT  1.15 NM read hit rate :  0.00 --|
|-- SKT  0.16 NM read hit rate :  0.00 --||-- SKT  1.16 NM read hit rate :  0.00 --|
|-- SKT  0.17 NM read hit rate :  0.00 --||-- SKT  1.17 NM read hit rate :  0.00 --|
|-- SKT  0.18 NM read hit rate :  0.00 --||-- SKT  1.18 NM read hit rate :  0.00 --|
|-- SKT  0.19 NM read hit rate :  0.00 --||-- SKT  1.19 NM read hit rate :  0.00 --|
|-- SKT  0.20 NM read hit rate :  0.00 --||-- SKT  1.20 NM read hit rate :  0.00 --|
|-- SKT  0.21 NM read hit rate :  0.00 --||-- SKT  1.21 NM read hit rate :  0.00 --|
|-- SKT  0.22 NM read hit rate :  0.00 --||-- SKT  1.22 NM read hit rate :  0.00 --|
|-- SKT  0.23 NM read hit rate :  0.00 --||-- SKT  1.23 NM read hit rate :  0.00 --|
|-- SKT  0.24 NM read hit rate :  0.00 --||-- SKT  1.24 NM read hit rate :  0.00 --|
|-- SKT  0.25 NM read hit rate :  0.00 --||-- SKT  1.25 NM read hit rate :  0.00 --|
|-- SKT  0.26 NM read hit rate :  0.00 --||-- SKT  1.26 NM read hit rate :  0.00 --|
|-- SKT  0.27 NM read hit rate :  0.00 --||-- SKT  1.27 NM read hit rate :  0.00 --|
|-- SKT  0.28 NM read hit rate :  0.00 --||-- SKT  1.28 NM read hit rate :  0.00 --|
|-- SKT  0.29 NM read hit rate :  0.00 --||-- SKT  1.29 NM read hit rate :  0.00 --|
|-- SKT  0.30 NM read hit rate :  0.00 --||-- SKT  1.30 NM read hit rate :  0.00 --|
|-- SKT  0.31 NM read hit rate :  0.00 --||-- SKT  1.31 NM read hit rate :  0.00 --|
|-- SKT  0 Memory (MB/s):    30150.88 --||-- SKT  1 Memory (MB/s):    33914.68 --|
|---------------------------------------||---------------------------------------|
|---------------------------------------||---------------------------------------|
|--            System DRAM Read Throughput(MB/s):      61415.52                --|
|--           System DRAM Write Throughput(MB/s):       2650.04                --|
|--             System PMM Read Throughput(MB/s):          0.00                --|
|--            System PMM Write Throughput(MB/s):          0.00                --|
|--                 System Read Throughput(MB/s):      61415.52                --|
|--                System Write Throughput(MB/s):       2650.04                --|
|--               System Memory Throughput(MB/s):      64065.56                --|
|---------------------------------------||---------------------------------------|

Instructions

  1. sudo apt-get install -y libnuma-dev

  2. Check out the source and build with -DGGML_NUMA_MIRROR=ON.

  3. Make sure you run as a user with the ability to write to /dev/hugepages.

  4. Allocate some hugepages on your system. This allocates about 80GB, enough for 2x Qwen3-32B:

sudo sysctl -w vm.nr_hugepages=40000

  5. Run llama-server with CPU offload (-ngl 0 or whatever) and with --numa distribute.

You should see output like the following:

Aug 01 14:54:52 xeon bash[52251]: load_tensors: tensor 'token_embd.weight' (q4_K) (and 174 others) cannot be used with preferred buffer type ROCm_Host, using CPU instead
Aug 01 14:54:52 xeon bash[52251]: Creating unified NUMA mapping for 6 multi-part GGUF files
Aug 01 14:54:52 xeon bash[52251]: Detected 2 NUMA nodes for unified multi-part mapping
Aug 01 14:54:52 xeon bash[52251]: Total unified model size: 272915622592 bytes across 6 files
Aug 01 14:54:52 xeon bash[52251]: Creating unified mapping: 255 hugepages (273804165120 bytes total) for 272915622592 bytes across 6 files
Aug 01 14:54:52 xeon bash[52251]: numa_set_preferred(0) - creating single unified mapping
Aug 01 14:56:20 xeon bash[52251]: mmap(/dev/hugepages/llama-unified-node0-0) desire=0x200000000000 size=273804165120 result=0x200000000000 is_new_mem[0]=yes
Aug 01 14:57:33 xeon bash[52251]: numa_set_preferred(1) - creating single unified mapping
Aug 01 14:58:18 xeon bash[52251]: mmap(/dev/hugepages/llama-unified-node1-0) desire=0x400000000000 size=273804165120 result=0x400000000000 is_new_mem[1]=yes
Aug 01 14:58:54 xeon bash[52251]: begin to copy unified model data from disk to mem...
Aug 01 14:58:54 xeon bash[52251]: copying file data at offset 0, size 49913955424
Aug 01 14:59:31 xeon bash[52251]: copying file data at offset 49913955424, size 48440972736
Aug 01 15:00:05 xeon bash[52251]: copying file data at offset 98354928160, size 49385348672
Aug 01 15:00:41 xeon bash[52251]: copying file data at offset 147740276832, size 49200580288
Aug 01 15:01:16 xeon bash[52251]: copying file data at offset 196940857120, size 49567581888
Aug 01 15:01:51 xeon bash[52251]: copying file data at offset 246508439008, size 26407183584
Aug 01 15:02:10 xeon bash[52251]: begin to copy unified model from numa0 to numa1...
Aug 01 15:03:04 xeon bash[52251]: load_tensors: offloading 61 repeating layers to GPU
Aug 01 15:03:04 xeon bash[52251]: load_tensors: offloading output layer to GPU
Aug 01 15:03:04 xeon bash[52251]: load_tensors: offloaded 62/62 layers to GPU
Aug 01 15:03:04 xeon bash[52251]: load_tensors:        ROCm0 model buffer size =  9562.48 MiB
Aug 01 15:03:04 xeon bash[52251]: load_tensors:   CPU_Mapped model buffer size = 46857.20 MiB
Aug 01 15:03:04 xeon bash[52251]: load_tensors:   CPU_Mapped model buffer size = 46189.46 MiB
Aug 01 15:03:04 xeon bash[52251]: load_tensors:   CPU_Mapped model buffer size = 46985.99 MiB
Aug 01 15:03:04 xeon bash[52251]: load_tensors:   CPU_Mapped model buffer size = 46810.22 MiB
Aug 01 15:03:04 xeon bash[52251]: load_tensors:   CPU_Mapped model buffer size = 47160.22 MiB
Aug 01 15:03:04 xeon bash[52251]: load_tensors:   CPU_Mapped model buffer size = 25175.97 MiB
Aug 01 15:03:06 xeon bash[52251]: ...................................................................................................
Aug 01 15:03:06 xeon bash[52251]: .
Aug 01 15:03:06 xeon bash[52251]: llama_context: constructing llama_context
Aug 01 15:03:06 xeon bash[52251]: llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
Aug 01 15:03:06 xeon bash[52251]: llama_context: n_seq_max     = 1
Aug 01 15:03:06 xeon bash[52251]: llama_context: n_ctx         = 131072
Aug 01 15:03:06 xeon bash[52251]: llama_context: n_ctx_per_seq = 131072
Aug 01 15:03:06 xeon bash[52251]: llama_context: n_batch       = 2048
Aug 01 15:03:06 xeon bash[52251]: llama_context: n_ubatch      = 512
Aug 01 15:03:06 xeon bash[52251]: llama_context: causal_attn   = 1
Aug 01 15:03:06 xeon bash[52251]: llama_context: flash_attn    = 1
Aug 01 15:03:06 xeon bash[52251]: llama_context: kv_unified    = true
Aug 01 15:03:06 xeon bash[52251]: llama_context: freq_base     = 10000.0
Aug 01 15:03:06 xeon bash[52251]: llama_context: freq_scale    = 0.025
Aug 01 15:03:06 xeon bash[52251]: llama_context: n_ctx_per_seq (131072) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
Aug 01 15:03:06 xeon bash[52251]: set_abort_callback: call
Aug 01 15:03:06 xeon bash[52251]: llama_context:  ROCm_Host  output buffer size =     0.49 MiB
Aug 01 15:03:06 xeon bash[52251]: create_memory: n_ctx = 131072 (padded)
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer   0: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer   1: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer   2: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer   3: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer   4: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer   5: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer   6: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer   7: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer   8: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer   9: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  10: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  11: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  12: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  13: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  14: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  15: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  16: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  17: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  18: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  19: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  20: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  21: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  22: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  23: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  24: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  25: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  26: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  27: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  28: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  29: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  30: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  31: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  32: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  33: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  34: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  35: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  36: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  37: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  38: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  39: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  40: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  41: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  42: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  43: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  44: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  45: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  46: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  47: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  48: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  49: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  50: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  51: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  52: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  53: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  54: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  55: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  56: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  57: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  58: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  59: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  60: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified:      ROCm0 KV buffer size = 16592.00 MiB
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: size = 16592.00 MiB (131072 cells,  61 layers,  1/ 1 seqs), K (f16): 8784.00 MiB, V (f16): 7808.00 MiB
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
Aug 01 15:03:06 xeon bash[52251]: llama_context: enumerating backends
Aug 01 15:03:06 xeon bash[52251]: llama_context: backend_ptrs.size() = 2
Aug 01 15:03:06 xeon bash[52251]: llama_context: max_nodes = 8688
Aug 01 15:03:06 xeon bash[52251]: llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
Aug 01 15:03:06 xeon bash[52251]: graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
Aug 01 15:03:06 xeon bash[52251]: graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
Aug 01 15:03:06 xeon bash[52251]: graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
Aug 01 15:03:06 xeon bash[52251]: llama_context:      ROCm0 compute buffer size =  2077.50 MiB
Aug 01 15:03:06 xeon bash[52251]: llama_context:  ROCm_Host compute buffer size =   784.01 MiB
Aug 01 15:03:06 xeon bash[52251]: llama_context: graph nodes  = 4907
Aug 01 15:03:06 xeon bash[52251]: llama_context: graph splits = 298 (with bs=512), 240 (with bs=1)
Aug 01 15:03:06 xeon bash[52251]: clear_adapter_lora: call
Aug 01 15:03:06 xeon bash[52251]: common_init_from_params: added <|end▁of▁sentence|> logit bias = -inf
Aug 01 15:03:06 xeon bash[52251]: common_init_from_params: setting dry_penalty_last_n to ctx_size = 131072
Aug 01 15:03:06 xeon bash[52251]: common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
Aug 01 15:03:06 xeon bash[52251]: set_warmup: value = 1
Aug 01 15:03:06 xeon bash[52251]: thread_id = 00, target_node = 0, actual_node = 0, cpuid = 00, n_threads = 1
Aug 01 15:03:06 xeon bash[52251]: thread_id = 18, target_node = 0, actual_node = 0, cpuid = 00, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 15, target_node = 1, actual_node = 1, cpuid = 18, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 34, target_node = 0, actual_node = 0, cpuid = 00, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 03, target_node = 1, actual_node = 1, cpuid = 18, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 04, target_node = 0, actual_node = 0, cpuid = 00, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 07, target_node = 1, actual_node = 1, cpuid = 18, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 02, target_node = 0, actual_node = 0, cpuid = 00, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 25, target_node = 1, actual_node = 1, cpuid = 18, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 26, target_node = 0, actual_node = 0, cpuid = 00, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 01, target_node = 1, actual_node = 1, cpuid = 18, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 24, target_node = 0, actual_node = 0, cpuid = 00, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 17, target_node = 1, actual_node = 1, cpuid = 18, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 32, target_node = 0, actual_node = 0, cpuid = 00, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 05, target_node = 1, actual_node = 1, cpuid = 18, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 20, target_node = 0, actual_node = 0, cpuid = 00, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 29, target_node = 1, actual_node = 1, cpuid = 18, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 12, target_node = 0, actual_node = 0, cpuid = 00, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 09, target_node = 1, actual_node = 1, cpuid = 18, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 10, target_node = 0, actual_node = 0, cpuid = 00, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 27, target_node = 1, actual_node = 1, cpuid = 18, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 06, target_node = 0, actual_node = 0, cpuid = 00, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 11, target_node = 1, actual_node = 1, cpuid = 18, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 22, target_node = 0, actual_node = 0, cpuid = 00, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 23, target_node = 1, actual_node = 1, cpuid = 18, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 08, target_node = 0, actual_node = 0, cpuid = 00, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 31, target_node = 1, actual_node = 1, cpuid = 18, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 16, target_node = 0, actual_node = 0, cpuid = 00, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 21, target_node = 1, actual_node = 1, cpuid = 18, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 14, target_node = 0, actual_node = 0, cpuid = 00, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 35, target_node = 1, actual_node = 1, cpuid = 18, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 30, target_node = 0, actual_node = 0, cpuid = 00, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 19, target_node = 1, actual_node = 1, cpuid = 18, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 28, target_node = 0, actual_node = 0, cpuid = 00, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 13, target_node = 1, actual_node = 1, cpuid = 18, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 33, target_node = 1, actual_node = 1, cpuid = 18, n_threads = 36
Aug 01 15:03:14 xeon bash[52251]: set_warmup: value = 0
Aug 01 15:03:14 xeon bash[52251]: srv          init: initializing slots, n_slots = 1
Aug 01 15:03:14 xeon bash[52251]: slot         init: id  0 | task -1 | new slot n_ctx_slot = 131072
Aug 01 15:03:14 xeon bash[52251]: slot        reset: id  0 | task -1 |
Aug 01 15:03:14 xeon bash[52251]: main: model loaded

@jacekpoplawski
Contributor

Should it also work on a 1920X?

@dbsanfte
Author

dbsanfte commented Aug 1, 2025

It uses the standard NUMA libraries, so it should work on any system with multiple NUMA nodes. I only have access to a dual Xeon though, so feel free to try it out.

@dbsanfte
Author

dbsanfte commented Aug 1, 2025

For my next trick: I discovered several quant types in the GGML CPU backend don't have AVX-512 optimisations. I'll tackle that in a different PR...

@dbsanfte
Author

dbsanfte commented Aug 1, 2025

By the way, if you want hugepages allocated in a NUMA-aware way at boot time, you can write a systemd service like this:

[Unit]
Description=Configure NUMA-aware hugepages
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/setup-numa-hugepages.sh
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

With a script like this:

#!/bin/bash
# Setup NUMA-aware hugepages - 188,928 per node (368GB per node)

# Clear existing hugepages
echo 0 > /proc/sys/vm/nr_hugepages

# Allocate on NUMA node 0
numactl --cpunodebind=0 --membind=0 bash -c 'echo 188928 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages'

# Allocate on NUMA node 1
numactl --cpunodebind=1 --membind=1 bash -c 'echo 188928 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages'

# Verify allocation
echo "Node 0 hugepages:"
cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
echo "Node 1 hugepages:"
cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages

Note that the above numbers need to be adjusted based on your system and the number of nodes you have.

I have a two-socket Xeon (hence, 2 NUMA nodes) with 768GB of memory (384GB per node). So my magic numbers are 188928 for nr_hugepages, with --cpunodebind=0 --membind=0 for the first node/CPU and --cpunodebind=1 --membind=1 for the second node/CPU.

Then just

systemctl daemon-reload
systemctl enable numa-hugepages.service

and shutdown -r now

and on boot you will get:

Jul 31 13:51:40 xeon systemd[1]: Starting numa-hugepages.service - Configure NUMA-aware hugepages...
Jul 31 13:53:49 xeon setup-numa-hugepages.sh[2447]: Node 0 hugepages:
Jul 31 13:53:49 xeon setup-numa-hugepages.sh[3067]: 188928
Jul 31 13:53:49 xeon setup-numa-hugepages.sh[2447]: Node 1 hugepages:
Jul 31 13:53:49 xeon setup-numa-hugepages.sh[3068]: 188928
Jul 31 13:53:49 xeon systemd[1]: Finished numa-hugepages.service - Configure NUMA-aware hugepages.
Jul 31 14:51:10 xeon systemd[1]: numa-hugepages.service: Deactivated successfully.
Jul 31 14:51:10 xeon systemd[1]: Stopped numa-hugepages.service - Configure NUMA-aware hugepages.
Jul 31 14:51:10 xeon systemd[1]: numa-hugepages.service: Consumed 2min 9.225s CPU time.

Keep in mind this reserves all but about 32GB of system RAM for hugepages. You can adjust those numbers up or down depending on your needs. Each page is 2 MiB, so just do the math: for example, 188928 pages × 2 MiB is roughly 369 GiB per node.

@jukofyork
Collaborator

Test system is a dual Xeon Gold 6240 with 768GB of DDR4 @ 2933 MHz, 6 channels per socket.

I see a 64.6% performance improvement during inference on my system:

Just tested on a dual Xeon Gold 6248 (40 cores / 80 threads) with 1.5TB of DDR4 @ 2699 MHz (6 channels per socket), and didn't get any improvement, sadly:

  • Without setting --threads it uses 40 threads and I get around 5.5 tokens/s generation.
  • Setting --threads 80 gives around 6.5 tokens/s generation.

(I get around 6.5-6.75 tokens/s with my optimised settings already)

I used the same number of hugepages as you:

# Allocate on NUMA node 0
numactl --cpunodebind=0 --membind=0 bash -c 'echo 188928 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages'

# Allocate on NUMA node 1
numactl --cpunodebind=1 --membind=1 bash -c 'echo 188928 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages'

and can confirm it used them and loaded fine (but a lot slower).

For reference:

#!/bin/bash

host_address=192.168.1.1
port_number=8080

# Turn off NUMA balancing
echo 0 | sudo tee /proc/sys/kernel/numa_balancing > /dev/null

# Ask for permission to drop caches
read -p "Do you want to drop caches? (y/n) " -n 1 -r
echo    # Move to a new line
if [[ $REPLY =~ ^[Yy]$ ]]
then
    echo "Dropping caches..."
    echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
fi

# Run the main command
CUDA_VISIBLE_DEVICES=0 ~/llama.cpp/build/bin/llama-server \
        --host "$host_address" \
        --port "$port_number" \
        --alias "DeepSeek-R1-0528" \
        --jinja \
        --chat-template-file ~/models/DeepSeek-R1-0528.jinja \
        --model ~/models/gguf/DeepSeek-R1-0528-Q6_K_X.gguf \
        --n-gpu-layers 99 \
        --numa distribute \
        --threads 80 \
        --override-tensor exps=CPU \
        --flash-attn \
        --ctx_size 65536 \
        --batch-size 8192 \
        --ubatch-size 8192

This offloads the non-shared experts in q4_K to the CPU, and everything else is on the (RTX 6000 Ada) GPU:

llama_model_loader: - type  f32:  361 tensors
llama_model_loader: - type q8_0:  183 tensors
llama_model_loader: - type q4_K:  174 tensors
llama_model_loader: - type q6_K:  368 tensors

I've optimised these settings by trying just about every possible combination though:

Using --numa distribute and --threads 80

  • On my other dual E5-2699 v4 (44 cores / 88 threads) systems, --numa interleave and --threads 30 was optimal.
  • Changing --threads vs --threads-batch just makes things worse.
  • I suspect it's not actually the use of hyper-threading that is helping here, but how it causes the tensors to get laid out in RAM: it makes the model load faster, and I noticed the batch transfer rate to the GPU via the PCIe 3.0 x16 bus is higher too.
  • I see 7800-8000% CPU use during inference (unlike the dual E5-2699 v4, which will see something like 3000-3200% if you try to use the full 88 threads).

Using q4_k for the non-shared expert tensors

  • This seems important too, and q4_0, iq4_nl, iq4_xs, etc. all perform slightly worse for TG and significantly worse for PP (when performed on the CPU in RAM for small batches).

I hack ggml/src/ggml-cuda/ggml-cuda.cu to use const int min_batch_size = 2800, as this is the break-even point for copying tensors to the GPU (and hence why my batch sizes above are so large).
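For context, a minimal sketch of the kind of threshold being described (the names here are illustrative; the actual constant lives in ggml/src/ggml-cuda/ggml-cuda.cu and may differ between versions):

// Illustrative sketch (assumed shape, not the exact upstream code): only copy
// an op's data to the GPU when the batch is large enough to beat the PCIe
// transfer cost; raising the threshold to ~2800 keeps small batches on the CPU.
static bool should_offload_to_gpu(int batch_size) {
    const int min_batch_size = 2800;  // measured break-even point on this system
    return batch_size >= min_batch_size;
}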

It might be worth trying some similar settings for your dual Xeon Gold 6240 to see if the speedup is coming from the new NUMA_MIRROR, or just these same quirks! :)

@jukofyork
Collaborator

Using q4_k for the non-shared expert tensors

* This seems important too, and `q4_0`, `iq4_nl`, `iq4_xs`, etc all perform slightly worse for TG and significantly worse for PP (when performed on the CPU in RAM for small batches).

I forgot to add that this is lucky, as it's also the best 4-bit quant type to use for the non-shared experts in terms of perplexity (IIRC, +0.5% compared to q8_0). If you check the old MLA thread, I posted the results of running models with everything else in BF16 apart from the non-shared expert tensors, and iq4_xs was significantly worse (something like +1% compared to q8_0) and q4_0 was much worse (something like +1.5% compared to q8_0).

The added PP speed of q4_k makes an even bigger difference if you want to use speculative decoding too (in another thread I plotted the asymptotes of the different 4bit quants for batch size 1 to 64).

@dbsanfte
Author

dbsanfte commented Aug 1, 2025

It would be helpful if you could compile and run the pcm-memory tool, so we can see what your real-time memory bandwidth utilisation figures look like during inference:

https://github.com/intel/pcm

And can you also make sure you run with -v and post the log here, so we can see which NUMA nodes it's putting all those threads on.

@jukofyork
Collaborator

And can you also make sure you run with -v and post the log here, so we can see which NUMA nodes it's putting all those threads on.

I think you mean -u?

|---------------------------------------||---------------------------------------|
|--             Socket  0             --||--             Socket  1             --|
|---------------------------------------||---------------------------------------|
|--     Memory Channel Monitoring     --||--     Memory Channel Monitoring     --|
|---------------------------------------||---------------------------------------|
|-- Mem Ch  0: Reads (MB/s):  6724.50 --||-- Mem Ch  0: Reads (MB/s):  6727.73 --|
|--            Writes(MB/s):   271.34 --||--            Writes(MB/s):   249.71 --|
|--      PMM Reads(MB/s)   :     0.00 --||--      PMM Reads(MB/s)   :     0.00 --|
|--      PMM Writes(MB/s)  :     0.00 --||--      PMM Writes(MB/s)  :     0.00 --|
|-- Mem Ch  1: Reads (MB/s):  6702.11 --||-- Mem Ch  1: Reads (MB/s):  6734.47 --|
|--            Writes(MB/s):   244.82 --||--            Writes(MB/s):   258.78 --|
|--      PMM Reads(MB/s)   :     0.00 --||--      PMM Reads(MB/s)   :     0.00 --|
|--      PMM Writes(MB/s)  :     0.00 --||--      PMM Writes(MB/s)  :     0.00 --|
|-- Mem Ch  2: Reads (MB/s):  6706.66 --||-- Mem Ch  2: Reads (MB/s):  6682.26 --|
|--            Writes(MB/s):   250.18 --||--            Writes(MB/s):   181.19 --|
|--      PMM Reads(MB/s)   :     0.00 --||--      PMM Reads(MB/s)   :     0.00 --|
|--      PMM Writes(MB/s)  :     0.00 --||--      PMM Writes(MB/s)  :     0.00 --|
|-- Mem Ch  3: Reads (MB/s):  6724.16 --||-- Mem Ch  3: Reads (MB/s):  6687.68 --|
|--            Writes(MB/s):   270.95 --||--            Writes(MB/s):   185.54 --|
|--      PMM Reads(MB/s)   :     0.00 --||--      PMM Reads(MB/s)   :     0.00 --|
|--      PMM Writes(MB/s)  :     0.00 --||--      PMM Writes(MB/s)  :     0.00 --|
|-- Mem Ch  4: Reads (MB/s):  6707.93 --||-- Mem Ch  4: Reads (MB/s):  6737.81 --|
|--            Writes(MB/s):   246.61 --||--            Writes(MB/s):   261.74 --|
|--      PMM Reads(MB/s)   :     0.00 --||--      PMM Reads(MB/s)   :     0.00 --|
|--      PMM Writes(MB/s)  :     0.00 --||--      PMM Writes(MB/s)  :     0.00 --|
|-- Mem Ch  5: Reads (MB/s):  6697.25 --||-- Mem Ch  5: Reads (MB/s):  6717.85 --|
|--            Writes(MB/s):   231.48 --||--            Writes(MB/s):   225.57 --|
|--      PMM Reads(MB/s)   :     0.00 --||--      PMM Reads(MB/s)   :     0.00 --|
|--      PMM Writes(MB/s)  :     0.00 --||--      PMM Writes(MB/s)  :     0.00 --|
|-- SKT  0 Mem Read (MB/s) : 40262.61 --||-- SKT  1 Mem Read (MB/s) : 40287.79 --|
|-- SKT  0 Mem Write(MB/s) :  1515.37 --||-- SKT  1 Mem Write(MB/s) :  1362.54 --|
|-- SKT  0 PMM Read (MB/s):      0.00 --||-- SKT  1 PMM Read (MB/s):      0.00 --|
|-- SKT  0 PMM Write(MB/s):      0.00 --||-- SKT  1 PMM Write(MB/s):      0.00 --|
|-- SKT  0.0 NM read hit rate :  1.00 --||-- SKT  1.0 NM read hit rate :  1.01 --|
|-- SKT  0.1 NM read hit rate :  1.00 --||-- SKT  1.1 NM read hit rate :  1.01 --|
|-- SKT  0.2 NM read hit rate :  0.00 --||-- SKT  1.2 NM read hit rate :  0.00 --|
|-- SKT  0.3 NM read hit rate :  0.00 --||-- SKT  1.3 NM read hit rate :  0.00 --|
|-- SKT  0.4 NM read hit rate :  0.00 --||-- SKT  1.4 NM read hit rate :  0.00 --|
|-- SKT  0.5 NM read hit rate :  0.00 --||-- SKT  1.5 NM read hit rate :  0.00 --|
|-- SKT  0.6 NM read hit rate :  0.00 --||-- SKT  1.6 NM read hit rate :  0.00 --|
|-- SKT  0.7 NM read hit rate :  0.00 --||-- SKT  1.7 NM read hit rate :  0.00 --|
|-- SKT  0.8 NM read hit rate :  0.00 --||-- SKT  1.8 NM read hit rate :  0.00 --|
|-- SKT  0.9 NM read hit rate :  0.00 --||-- SKT  1.9 NM read hit rate :  0.00 --|
|-- SKT  0.10 NM read hit rate :  0.00 --||-- SKT  1.10 NM read hit rate :  0.00 --|
|-- SKT  0.11 NM read hit rate :  0.00 --||-- SKT  1.11 NM read hit rate :  0.00 --|
|-- SKT  0.12 NM read hit rate :  0.00 --||-- SKT  1.12 NM read hit rate :  0.00 --|
|-- SKT  0.13 NM read hit rate :  0.00 --||-- SKT  1.13 NM read hit rate :  0.00 --|
|-- SKT  0.14 NM read hit rate :  0.00 --||-- SKT  1.14 NM read hit rate :  0.00 --|
|-- SKT  0.15 NM read hit rate :  0.00 --||-- SKT  1.15 NM read hit rate :  0.00 --|
|-- SKT  0.16 NM read hit rate :  0.00 --||-- SKT  1.16 NM read hit rate :  0.00 --|
|-- SKT  0.17 NM read hit rate :  0.00 --||-- SKT  1.17 NM read hit rate :  0.00 --|
|-- SKT  0.18 NM read hit rate :  0.00 --||-- SKT  1.18 NM read hit rate :  0.00 --|
|-- SKT  0.19 NM read hit rate :  0.00 --||-- SKT  1.19 NM read hit rate :  0.00 --|
|-- SKT  0.20 NM read hit rate :  0.00 --||-- SKT  1.20 NM read hit rate :  0.00 --|
|-- SKT  0.21 NM read hit rate :  0.00 --||-- SKT  1.21 NM read hit rate :  0.00 --|
|-- SKT  0.22 NM read hit rate :  0.00 --||-- SKT  1.22 NM read hit rate :  0.00 --|
|-- SKT  0.23 NM read hit rate :  0.00 --||-- SKT  1.23 NM read hit rate :  0.00 --|
|-- SKT  0.24 NM read hit rate :  0.00 --||-- SKT  1.24 NM read hit rate :  0.00 --|
|-- SKT  0.25 NM read hit rate :  0.00 --||-- SKT  1.25 NM read hit rate :  0.00 --|
|-- SKT  0.26 NM read hit rate :  0.00 --||-- SKT  1.26 NM read hit rate :  0.00 --|
|-- SKT  0.27 NM read hit rate :  0.00 --||-- SKT  1.27 NM read hit rate :  0.00 --|
|-- SKT  0.28 NM read hit rate :  0.00 --||-- SKT  1.28 NM read hit rate :  0.00 --|
|-- SKT  0.29 NM read hit rate :  0.00 --||-- SKT  1.29 NM read hit rate :  0.00 --|
|-- SKT  0.30 NM read hit rate :  0.00 --||-- SKT  1.30 NM read hit rate :  0.00 --|
|-- SKT  0.31 NM read hit rate :  0.00 --||-- SKT  1.31 NM read hit rate :  0.00 --|
|-- SKT  0 Memory (MB/s):    41777.98 --||-- SKT  1 Memory (MB/s):    41650.33 --|
|---------------------------------------||---------------------------------------|
|---------------------------------------||---------------------------------------|
|--            System DRAM Read Throughput(MB/s):      80550.40                --|
|--           System DRAM Write Throughput(MB/s):       2877.91                --|
|--             System PMM Read Throughput(MB/s):          0.00                --|
|--            System PMM Write Throughput(MB/s):          0.00                --|
|--                 System Read Throughput(MB/s):      80550.40                --|
|--                System Write Throughput(MB/s):       2877.91                --|
|--               System Memory Throughput(MB/s):      83428.31                --|
|---------------------------------------||---------------------------------------|
slot update_slots: id  0 | task 1087 | new prompt, n_ctx_slot = 65536, n_keep = 0, n_prompt_tokens = 16
slot update_slots: id  0 | task 1087 | need to evaluate at least 1 token for each active slot, n_past = 16, n_prompt_tokens = 16
slot update_slots: id  0 | task 1087 | kv cache rm [15, end)
slot update_slots: id  0 | task 1087 | prompt processing progress, n_past = 16, n_tokens = 1, progress = 0.062500
slot update_slots: id  0 | task 1087 | prompt done, n_past = 16, n_tokens = 1
slot      release: id  0 | task 1087 | stop processing: n_past = 1570, truncated = 0
slot print_timing: id  0 | task 1087 | 
prompt eval time =     244.06 ms /     1 tokens (  244.06 ms per token,     4.10 tokens per second)
       eval time =  230359.52 ms /  1555 tokens (  148.14 ms per token,     6.75 tokens per second)
      total time =  230603.58 ms /  1556 tokens

These are just my stock settings and not this PR though; I probably won't have a chance to run that until tomorrow or Monday now.

According to:

https://en.wikichip.org/wiki/intel/xeon_gold/6248

(2699/2933)*131.13 = ~120GB/s per socket is the theoretical maximum bandwidth, so it looks like I'm getting about 1/3rd of this.

@dbsanfte
Author

dbsanfte commented Aug 1, 2025

No, I mean run llama-server with -v and post its startup log so we can see where my NUMA code is putting your threads.

Looks like you are getting symmetric bandwidth usage on both nodes though. I think that's what I would expect...

@jukofyork
Collaborator

No, I mean run llama-server with -v and post its startup log so we can see where my NUMA code is putting your threads.

Oh, sorry: -u is to make pcm-memory act like you used watch on it! :)

@dbsanfte
Author

dbsanfte commented Aug 1, 2025

Anyway, I think I'll add a very detailed log at the start of memory allocation with:

  • what numa topology it sees

  • what sockets and cores it sees

  • what kind of cores they are (performance/efficiency/hyper-threaded)

  • where it plans to put the threads: on which nodes

Then it will be very clear and easy to debug.

There could be strange things like NUMA sub-clustering going on; I read that's a thing.

@dbsanfte
Author

dbsanfte commented Aug 1, 2025

It also occurs to me that PCIe slots are always attached to a single socket, so NUMA-aware allocation might impact that; that would be an interesting side effect. I'm learning more about memory every day... 😄

@sultanqasim

sultanqasim commented Aug 2, 2025

I tried this out on my dual Xeon 4216 system (no GPU) with Cohere Command-A on RHEL 8. I had to make changes to the gettid and getcpu calls (replacing them with raw syscalls) because those were added in glibc 2.30/2.29, while RHEL 8 uses glibc 2.28. I got it to build, and it appears to allocate mirrored copies of the model for the two sockets.
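For reference, a minimal sketch of the kind of raw-syscall fallback described above (my assumption of the shape of the change, not the exact patch):

// Sketch (assumption): fall back to raw syscalls on glibc < 2.30 / 2.29,
// where the gettid() and getcpu() wrappers are not available (e.g. RHEL 8).
#include <unistd.h>
#include <sys/syscall.h>
#include <stddef.h>

static pid_t my_gettid(void) {
    return (pid_t) syscall(SYS_gettid);
}

static int my_getcpu(unsigned * cpu, unsigned * node) {
    // The third argument (tcache) is ignored by modern kernels.
    return (int) syscall(SYS_getcpu, cpu, node, NULL);
}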

Unfortunately, I didn't see any change to performance on my system. Here's the command I used:

sudo sysctl -w vm.nr_hugepages=120000
echo 3 | sudo tee /proc/sys/vm/drop_caches
sudo ~/llama.cpp/build/bin/llama-server -m ~/llm/c4ai-command-a-03-2025-Q4_K_M-00001-of-00002.gguf --numa distribute --threads 64 -c 65536 -np 2 --temp 0.6 -fa -ctk q8_0 -ctv q8_0 -ot ".*=CPU" --chat-template-file ~/llm/c4ai-command-a-template-notool.txt

Edit: I tried allocating hugepages with a script similar to what you shared above, except with 49152 2048k hugepages per node. Still no performance change.

@jukofyork
Collaborator

According to:

https://en.wikichip.org/wiki/intel/xeon_gold/6248

(2699/2933)*131.13 = ~120GB/s per socket is the theoretical maximum bandwidth, so it looks like I'm getting about 1/3rd of this.

Thinking about this more today: for offloading the shared experts only, pcm-memory is probably giving quite deceptive throughput statistics, as it really depends on what fraction of the time is spent in the non-offloaded calculation vs the offloaded calculation...

If the sampling frequency is high enough, then I might be able to hack pcm-memory or, failing that, add some timing stats to llama.cpp to track this.

@jukofyork
Collaborator

Here's the command I used:

sudo sysctl -w vm.nr_hugepages=120000
echo 3 | sudo tee /proc/sys/vm/drop_caches
sudo ~/llama.cpp/build/bin/llama-server -m ~/llm/c4ai-command-a-03-2025-Q4_K_M-00001-of-00002.gguf --numa distribute --threads 64 -c 65536 -np 2 --temp 0.6 -fa -ctk q8_0 -ctv q8_0 -ot ".*=CPU" --chat-template-file ~/llm/c4ai-command-a-template-notool.txt

It might be worth trying without the -np 2 in case llama.cpp does anything different with that.

Do you find that using --threads 64 (ie: using your hyperthreading threads) gives better performance than using --threads 32 without this PR?

@jukofyork
Collaborator

jukofyork commented Aug 2, 2025

It also occurs to me that PCIe slots are always attached to a single socket, so NUMA-aware allocation might impact that; that would be an interesting side effect. I'm learning more about memory every day... 😄

IIRC, the current CUDA offloading code only uses a single GPU for the offloaded calculations, so having 2 copies won't really help it.

I do think there is a bottleneck somewhere, as PCIe 3.0 x16 has a max bandwidth of ~16GB/s, yet watching nvtop during large-batch processing it often only gets around a third of this (but, as with the pcm-memory test, I can't be sure whether this is due to averaging a bimodal set of timings).

@dbsanfte
Author

dbsanfte commented Aug 2, 2025

If the threads doing the offloading are located on socket 1, but the GPU is in a PCIe slot attached to socket 2, maybe that would send the traffic over the UPI link? Might be worth investigating. I'll try to get that better visibility of thread/NUMA assignments in soon.

@sultanqasim

It might be worth trying without the -np 2 in case llama.cpp does anything different with that.

Do you find that using --threads 64 (ie: using your hyperthreading threads) gives better performance than using --threads 32 without this PR?

Without this PR, I had a slight speedup from using HyperThreading (i.e. --threads 64 instead of --threads 32).

Removing -np 2 had no impact on performance (for a single request, nothing running concurrently). However, I noticed that with -np 2 the generated tokens were gibberish, while without -np 2 it was giving valid/correct outputs. Looks like a bug.

With this PR, switching from --threads 64 to --threads 32 showed the same slowdown as without this PR.

GGML_NUMA_MIRROR, with 64 threads

prompt eval time =   70104.59 ms /   321 tokens (  218.39 ms per token,     4.58 tokens per second)
       eval time =   85848.91 ms /   165 tokens (  520.30 ms per token,     1.92 tokens per second)
      total time =  155953.50 ms /   486 tokens

GGML_NUMA_MIRROR, with 32 threads

prompt eval time =   75922.60 ms /   321 tokens (  236.52 ms per token,     4.23 tokens per second)
       eval time =  118967.32 ms /   201 tokens (  591.88 ms per token,     1.69 tokens per second)
      total time =  194889.92 ms /   522 tokens

@FullstackSensei

FullstackSensei commented Aug 3, 2025

Installed libnuma and pulled the latest changes (9d66473). I had to disable building RPC to build successfully.

Tried to run it with Qwen3-235B-A22B-Instruct-2507-UD-Q4_K_XL. I set vm.nr_hugepages to 160000 to make sure the model and context had enough space. The first time I ran it, it took much longer than regular llama.cpp to load; I didn't time it, but it felt like 5 minutes, whereas regular llama.cpp takes a minute or less. Subsequent loads were very quick, much quicker than regular llama.cpp.

I haven't been able to get any output. Prompt processing takes forever even on short six-word prompts (e.g. "write a pong game in C++"). In htop, I see only two cores (on CPU0) at 100%, while all others are at 0%. The cores are the first and the 24th in htop.

The system is a dual 24-core Xeon (ES QQ89, with HT enabled). I think there's a bug in thread pinning. The 24th core would have been the first core of the second CPU if HT were disabled. All threads get pinned to those two cores regardless of whether I set -t in llama-server or not.

Tried using numactl with --physcpubind=$(seq -s, 1 2 95), which usually pins one worker to each physical core, but all threads still get mapped to the same two cores (0 and 24). Waited a couple of minutes on that pong prompt to see if I'd get any output, but not a single token.

EDIT: Got my dual Epyc back online, and can confirm the same behaviour as on the dual Xeon. I compiled the branch and ran with --threads 96. I can see all threads get crammed onto cpuid 00 and 48 in the log output, as well as in htop. I can also confirm what @aifartist mentioned about SMT threads not being laid out the same way as on Intel consumer parts. Running cat /sys/devices/system/cpu/cpu{0..NN}/topology/thread_siblings_list (where NN is the total number of cores/threads reported in, for example, htop) on my dual Xeon, dual Epyc, and single Epyc all reports the same pairing: physical cores come first, then SMT ones.

@ptempier

ptempier commented Aug 3, 2025

It also occurs to me that PCIe slots are always attached to a single socket, so numa-aware allocation might impact that, that would be an interesting side effect. I'm learning more about memory every day... 😄

IIRC, the current CUDA offloading code only uses a single GPU for the offloaded calculations, so having 2 copies won't really help it.

I do think there is a bottleneck somewhere as PCIe 3.0 x16 has a max bandwidth of ~16GB/s, yet watching nvtop during large batch processing it often only gets around 1/3rd of this (but like for the pcm-memory test; I can't be sure if this is due to averaging a bimodal set of timings).

Personally, if that's an issue, I'd put GPU support aside while the patch is being developed.
Datacenters are riddled with old ESXi hosts with a lot of memory.
They are often more powerful than personal computers, but with no GPU and sometimes no space to put one.
If the patch works on this type of older machine with slower memory, that would already be nice.

@dbsanfte changed the title from "Implementation of GGML_NUMA_MIRROR for 64% inferencing performance gain on numa systems" to "Implementation of GGML_NUMA_MIRROR for inferencing performance gain on numa systems" on Aug 5, 2025
@dbsanfte
Author

dbsanfte commented Aug 5, 2025

I've done quite a bit of testing and code deep-diving over the weekend. What I've realised is that:

  1. My performance gain in the original PR post was illusory: it was real, but only because the NUMA_MIRROR code was undoing the effects of kernel NUMA balancing. I had forgotten to disable it before testing... (facepalm). With it disabled, I see no speedup over master.

  2. The NUMA mirroring is a valid strategy and works in theory, but the threadpool in ggml-cpu.c does not split the matrix work up between sockets, just between threads, and they all wait on each other in the barrier method. So this solves the issue of cross-NUMA memory access but does not provide any real speedup yet: all NUMA nodes are still bound to each other and to a single threadpool.

All of this said, I can now see what needs to be done to get this over the line. Each socket needs its own threadpool, and the matrix operations need to be divvied up between the NUMA nodes/sockets; then we can leverage data parallelism. I am iterating on this locally at the moment and will update when I have something to test.
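To illustrate the direction (my reading of the plan, not code from this branch): the row range of each matrix operation would be partitioned across NUMA nodes, with each socket's threadpool working only on the slice that lives in its local mirror.

// Conceptual sketch (assumption): split the rows of an operation across NUMA
// nodes so each socket's threadpool only touches rows resident in its local
// memory mirror.
#include <cstdint>
#include <algorithm>

static void rows_for_node(int64_t nrows, int n_nodes, int node,
                          int64_t * row0, int64_t * row1) {
    const int64_t per_node = (nrows + n_nodes - 1) / n_nodes;  // ceiling division
    *row0 = std::min<int64_t>((int64_t) node * per_node, nrows);
    *row1 = std::min<int64_t>(*row0 + per_node, nrows);
}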

@FullstackSensei

@dbsanfte take a look at COSMA. I've read the paper and it's supposed to solve all these issues, including distributing the workload across several nodes.

I have 56Gb InfiniBand on my nodes and can test the dual Xeon with the dual Epyc, and can even add a single Epyc as a third node.

@rankaiyx

rankaiyx commented Aug 5, 2025

Some information that may be useful.
https://lmsys.org/blog/2025-07-14-intel-xeon-optimization/#multi-numa-parallelism
https://arxiv.org/abs/2502.10923
https://lwn.net/Articles/221885/

@dbsanfte
Author

dbsanfte commented Aug 5, 2025

Looking at the code architecture, COSMA would really need to be its own new backend, and just throw away ggml-cpu.

This could be good or bad, I'm not sure :D I like the idea.

As a pedagogical exercise, I'll carry on with the framework I've created up to now, and maybe attempt that as a new PR when I feel more confident.
