
NUMA-awareness fixes #652


Closed
wants to merge 7 commits

Conversation

@lucjaulmes (Contributor) commented Jun 20, 2025

v3.7 was sensible on NUMA machines “by default” through first-touch initialization. This had been lost in v4, as identified by James Richings. Here’s some basic NUMA-aware allocation I use, wrapped in NUMA_AWARE guards, and a little love for general parallel/OpenMP usage.

  • If we’re on *nix and we find libnuma, we enable NUMA-aware allocations
  • Add & use cpu_allocNumaArray() and cpu_deallocNumaArray() for the state-vector allocations (as the current alloc functions are also used for many smaller regions); see the sketch after this list. Fall back to normal allocation functions if NUMA-unaware.
  • Perform zero-initialization in parallel (still with std::fill() but use a parallel region)
  • Make getCurrentNumThreads() work inside parallel regions (!)
  • Add getAvailableNumThreads() to get thread count outside parallel regions. Improve this from previous getCurrentNumThreads() to only call the omp function once (rather than once per thread).
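
For illustration, here is a minimal sketch of the allocation idea (not the PR's exact cpu_allocNumaArray(): it uses libnuma's numa_alloc_interleaved() rather than the PR's mmap-based approach, and assumes cpu_allocArray() as the existing calloc-based fallback):

    // sketch only: NUMA-interleaved allocation with a plain fallback,
    // guarded the same way as the PR (NUMA_AWARE defined when libnuma is found)
    #if NUMA_AWARE
        #include <numa.h>    // numa_alloc_interleaved, numa_free
    #endif

    qcomp* cpu_allocNumaArray(qindex length) {
    #if NUMA_AWARE
        // spread pages across all NUMA nodes instead of relying on first-touch
        return (qcomp*) numa_alloc_interleaved(length * sizeof(qcomp));
    #else
        // cpu_allocArray() is the existing calloc-based allocator
        return cpu_allocArray(length);
    #endif
    }

And the distinction between the two thread-count helpers, assuming they map onto plain OpenMP calls:

    #include <omp.h>

    // usable inside a parallel region: size of the current team (1 in serial code)
    int cpu_getCurrentNumThreads() {
        return omp_get_num_threads();
    }

    // usable outside parallel regions: how many threads a region would receive
    int cpu_getAvailableNumThreads() {
        return omp_get_max_threads();
    }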

@TysonRayJones (Member) commented Jun 21, 2025

Wew amazing! Great call on renaming cpu_getCurrentNumThreads() to cpu_getAvailableNumThreads(), and revising cpu_getCurrentNumThreads() to actually do what it says on the tin 😅

Some notes:

  • Could we rename distribute() in utils to include the util_ prefix and be more illustrative?
  • Is it possible to refactor the changes to cpu_statevector_anyCtrlPauliTensorOrGadget_subA() such that distribute() does not need to be a static function in the utilities header? I note its inclusion in fastmath.hpp instead would require it is CUDA-kernel compatible (so no standard types)
  • Is sysconf definitely platform agnostic?
  • Can we replace use of perror with custom error messages in errors.cpp? That file is probably over-engineered but is at least nice to centralise all error-throwing (i.e. all termination of QuEST execution)
  • Can get_page_size() and get_numa_nodes() be changed to camelCase for consistency?

@TysonRayJones (Member):

Btw for speed, I can commit those aforementioned changes myself (and fix the CI - Github Actions is treating the cmake warning as an error) but still suggest giving you full merge credit (i.e. not coauthoring with me) since they're merely stylistic. Let me know if you prefer that and I'll do so!

@lucjaulmes (Contributor, Author):

No worries @TysonRayJones I’m on it now!

@lucjaulmes (Contributor, Author):

Re sysconf: it’s POSIX, and NUMA_AWARE is not enabled on Windows. I’m not sure there are other systems (neither Windows nor Unix-based) that this code is supposed to run on?

@lucjaulmes (Contributor, Author):

I’ve moved util_distribute() out of headers. It’s literally two divmods, so I was thinking there’s no point in it being its own standalone function; but to be fair it’s never called anywhere performance-sensitive (nor should it be), so it’s not worth the headache of including it in headers either.

@otbrown (Contributor) commented Jun 24, 2025

Love this, thanks @lucjaulmes!

@TysonRayJones (Member) commented Jun 26, 2025

Just helping collate the CI issues:

  • All serial unit tests are failing for the below functions:

    calcTotalProb
    initBlankState
    initClassicalState
    initPlusState
    collapseToOutcome
    

    I believe the first four are due to the change in cpu_statevec_initUniformState_sub() (I'm unsure about collapseToOutcome(), but it is likely also affected). The bug might lie within either:

    void cpu_statevec_initUniformState_sub(Qureg qureg, qcomp amp) {

        // faster on average (though perhaps not for large quregs)
        // than a custom multithreaded loop
        #pragma omp parallel if(qureg.isMultithreaded)
        {
            // Distribute number of tasks and convert to indexes. 4kB page standard?
            const auto [start, end] = util_distribute(
                qureg.numAmpsPerNode, 4096 / sizeof(qcomp),
                cpu_getOpenmpThreadInd(), cpu_getCurrentNumThreads());

            std::fill(qureg.cpuAmps + start, qureg.cpuAmps + end, amp);
        }
    }

    std::pair<qindex, qindex>
    util_distribute(const qindex work, const qindex block, const int id, const int n) {

        // ASSUME(work % block == 0);
        const qindex blocks = work / block;
        qindex spread = blocks / n;
        qindex extra = blocks % n;
        qindex prev_extra = (id * extra) / n;
        qindex prev_shift = (id * extra) % n;
        qindex here_extra = (prev_shift + extra) >= n;
        qindex pos = id * spread + prev_extra;
        return std::make_pair(pos * block, (pos + spread + here_extra) * block);
    }

    It's hard to gauge without some comments or a more illustrative name - would it be appropriate to rename

    util_distribute(const qindex work, const qindex block, const int id, const int n) {
    
         ...
    }

    to

    util_getThreadSubRangeAsBlockMultiple(qindex rangeLen, qindex blockLen, int threadInd, int numThreads) {
        
         // distributes rangeLen non-uniformly between threads (returning [start,end) of a sub-range)
         // such that each thread is allocated a non-overlapping integer multiple of blockLen, or zero 
         // (as indicated by end==start)
    
         ...
    }

    Have I correctly understood the function's intention?

    Note too that the commented assumption work % block == 0 is equivalent to requiring that qureg.numAmpsPerNode is a multiple of 4096 / sizeof(qcomp), which is not satisfied by small Qureg (e.g. at double precision, it is only satisfied by 8+ qubit quregs, which excludes those of the unit tests).

    Note too that the 4096 / sizeof(qcomp) snippet looks like a magic number; is it worth moving it to a constant in core/memory.hpp with an illustrative name?
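
    For what it's worth, here is a sketch (not the PR's code) of one way the work % block == 0 assumption could be dropped, by rounding the block count up and clamping the final sub-range to the true length, so small quregs degenerate gracefully:

    #include <algorithm>    // std::min
    #include <utility>      // std::pair

    // sketch: distribute `work` items over `n` workers in multiples of `block`,
    // without assuming work % block == 0 (the partial final block is clamped)
    std::pair<qindex, qindex>
    util_distribute(const qindex work, const qindex block, const int id, const int n) {

        const qindex blocks = (work + block - 1) / block;   // round up
        const qindex spread = blocks / n;
        const qindex extra  = blocks % n;

        // the first `extra` workers each receive one additional block
        const qindex start = id * spread + std::min<qindex>(id, extra);
        const qindex end   = start + spread + (id < extra ? 1 : 0);

        // clamp so the final (possibly partial) block never overruns `work`
        return { std::min(start * block, work), std::min(end * block, work) };
    }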

  • The memory checks CI is reporting a "still reachable" memory leak:

    ==4388== LEAK SUMMARY:
    ==4388==    definitely lost: 0 bytes in 0 blocks
    ==4388==    indirectly lost: 0 bytes in 0 blocks
    ==4388==      possibly lost: 0 bytes in 0 blocks
    ==4388==    still reachable: 72,704 bytes in 1 blocks
    ==4388==         suppressed: 0 bytes in 0 blocks
    ==4388== Reachable blocks (those to which a pointer was found) are not shown.
    ==4388== To see them, rerun with: --leak-check=full --show-leak-kinds=all
    

    but it may be caused by the failing unit tests - I'm not sure!

  • All non-Windows OMP-enabled compilation CI is failing with message:

    -- Checking for one of the modules 'numa'
    CMake Error at CMakeLists.txt:340 (compile_option):
      compile_option Function invoked with incorrect arguments for function
      named: compile_option
    

@lucjaulmes (Contributor, Author) commented Jun 29, 2025

Hmm, I likely didn’t test this enough on tiny examples. Pushed a few fixes; letting the CIs run again.

Regarding distribute(): it’s distributing as uniformly as possible under the constraint that assignments are block-multiples. I’m not sure it’s limited to threads; “getBlockMultipleSubRange” maybe? Still quite a mouthful though.

Rebasing on latest devel for good measure.

@lucjaulmes force-pushed the devel branch 2 times, most recently from e53c97b to b547470 on June 29, 2025 11:28
In the case where libnuma is not found, NUMA_FOUND is (can be?) left undefined instead of being set to 0.
@TysonRayJones (Member) commented Jun 30, 2025

I'm intending to make some mostly stylistic and defensive changes but I'm having trouble pulling the PR branch locally - so I'm going to experiment with Codespaces, hence please excuse any commit spam from failed tests!

Things I intend to tinker with:

  • cpu_getPageSize() guards based on NUMA_AWARE = 0 but that's really a separate consideration which might confuse a reader. Ideally it should guard based on whether e.g. _WIN32 is defined.

  • cpu_getPageSize() queries the page-size on POSIX systems but assumes FALLBACK_PAGE_SIZE on Windows, though that mightn't be fully motivated; the function is called on Windows (informing an argument to util_distribute()) and so should try "as hard" as POSIX to learn the page size. Even though Windows isn't yet benefitting from the application (distributing std::fill over NUMA nodes), it should avoid astonishment, in case e.g. we do later add NUMA awareness on Windows. Beware it's worthwhile to avoid including the full windows.h however (e.g. via WIN32_LEAN_AND_MEAN), since that header drastically slows compilation.
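
    Something along these lines, perhaps (a sketch only, not the PR's cpu_getPageSize(); the Windows branch assumes GetSystemInfo() from windows.h, kept lean via WIN32_LEAN_AND_MEAN):

    // sketch of a platform-guarded page-size query, independent of NUMA_AWARE
    #if defined(_WIN32)
        #define WIN32_LEAN_AND_MEAN
        #include <windows.h>    // GetSystemInfo
    #else
        #include <unistd.h>     // sysconf
    #endif

    size_t cpu_getPageSize() {
    #if defined(_WIN32)
        SYSTEM_INFO info;
        GetSystemInfo(&info);
        return info.dwPageSize;
    #else
        long size = sysconf(_SC_PAGESIZE);
        return (size > 0) ? (size_t) size : 4096;   // conservative fallback
    #endif
    }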

  • the "cache" logic in the form

     static int value = 0;
     if (!value) {
         value = getValue();
         if (isBad(value))
             error();
     }
     return value;

    is a little verbose and confusing, and can be (if I understand correctly) simplified to:

    static int value = getValue();
    if (isBad(value))
        error();
    return value;

    The isBad validation is now called on every invocation, though it is as trivial as the otherwise invoked if (!value). In this revised form, the function looks just like the "unoptimised" form (which would call getValue() at every invocation) but merely adds the static keyword, so is not compromising readability for premature optimisation.

  • The trailing // NUMA_AWARE comment in

     #if NUMA_AWARE
         #include <sys/mman.h>
         #include <unistd.h>
         #include <numaif.h>
         #include <numa.h>
     #endif // NUMA_AWARE

    (and similar) is made redundant by the indentation. So far, such comments have only been used for clarifying the end of a header's #ifndef, since the contents therein are not indented.

  • as discussed, a statevector with < 8 qubits has an array smaller than a page of memory. Currently, cpu_allocNumaArray() will still use mmap with pages=1 to "wastefully" map a full page. An alternative use of cpu_allocArray()'s calloc() permits the OS to squeeze multiple Qureg into a page. Do we care? Is there a sensible use-case where a user simultaneously instantiates tens of millions of tiny Qureg which cannot fit into memory if each is given an entire page? On the other hand, is there a performance benefit from forcing max one Qureg per page?

    If we're indifferent, would it be clearer to allocate a sub-page-size Qureg with cpu_allocArray() rather than cpu_allocNumaArray()? I.e.

    qcomp* cpu_allocNumaArray(qindex length) {
    #if !NUMA_AWARE
        return cpu_allocArray(length);
    #else
        unsigned long page_size = cpu_getPageSize();
        qindex array_size = length * sizeof(qcomp);
        
        // avoid mapping an entire page for a sub-page Qureg
        if (array_size < page_size)
            return cpu_allocArray(length);
    
        ...
        mmap(...)
    #endif 
    }

    and where cpu_deallocNumaArray() would need a similar branch (sketched below). This circumvents the need to use clever expressions like pages = (array_size + page_size - 1) / page_size instead of pages = array_size / page_size.

    Alternatively, would a comment in cpu_allocNumaArray() explaining that it's harmless to allocate a full page per sub-page Qureg be better?
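
    For concreteness, the matching dealloc branch mentioned above might look something like this sketch (assuming cpu_deallocArray() as the counterpart of cpu_allocArray(), and that the array length is known at dealloc time so the byte size can be recomputed):

    void cpu_deallocNumaArray(qcomp* array, qindex length) {
    #if !NUMA_AWARE
        cpu_deallocArray(array);
    #else
        qindex array_size = length * sizeof(qcomp);

        // sub-page arrays came from cpu_allocArray(), so release them the same way
        if (array_size < (qindex) cpu_getPageSize())
            cpu_deallocArray(array);
        else
            munmap(array, array_size);   // from <sys/mman.h>, already included under NUMA_AWARE
    #endif
    }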

  • The function cpu_getNumaNodes()...

    #if NUMA_AWARE
    unsigned long cpu_getNumaNodes() {

        static int n_nodes = 0;
        if (!n_nodes) {
            n_nodes = numa_num_configured_nodes();
            if (n_nodes < 1) {
                error_gettingNumaNodesFailed();
            }
        }
        return n_nodes;
    }
    #endif

    is defined within #if NUMA_AWARE guards (which makes one a little frightened), and unnecessarily given a cpu_ prefix as if it were visible outside the cpu_config.cpp file (it's instead private). But thankfully, it doesn't need to exist! It is only called by cpu_allocNumaArray() (within the same guards), which could instead just call n_nodes = numa_num_configured_nodes() directly with that variable being static. There again, the validation check (whether n_nodes < 1) is just as cheap as checking whether the static var has already been set.

  • Though wordy, I think getBlockMultipleSubRange() (with a clarifying comment therein) is indeed better than distribute(). Pixels are cheap, but time lost due to misinterpreting a terse function can be expensive! ;)

Sorry to appear pedantic - QuEST v4's explosion in complexity has made me extremely attentive to defensive design since I foresee a million new ways for insidious bugs. Ergo changes to very fundamental things (like allocating memory) make me overly scrutinous!

@lucjaulmes (Contributor, Author) commented Jul 1, 2025

Feel free to edit the PR as you see fit, of course!

@TysonRayJones (Member) commented Jul 1, 2025

BAH clicking through the inconspicuous Github desktop prompts has caused that unhelpful merge commit. I wasn't able to pull your changes until I opened the "resolve conflicts" menu which reported "no conflicts" and gave a new option to "continue merge", which caused that merge commit without warning, duplicating a number of commits to devel. I cannot now seem to force undo that merge into your PR branch like I would on an ordinary branch. I should know better by now than to use Github Desktop 💢

Are you able to force undo my merge from your end, or otherwise know the secret git source to pull your changes to my local repo? 🫤 Edit: this should require only forcefully reverting my commit onto your lucjaulmes/devel branch:

git checkout lucjaulmes/devel
git reset --hard 99e2705
git push origin --force

which keeps your latest commit, discarding my merge. I don't have permission to do this myself (despite being able to make commits to it).

That should put me back into the strange situation of not being able to pull your changes with Github desktop, but I'll handle it manually now :^)

@lucjaulmes (Contributor, Author):

Not sure how to do it with GitHub Desktop, but with the CLI you should be able to fetch refs/pull/652/head into your local repo. That should point to the head here.
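
E.g. something like this (assuming origin points at the main QuEST repo; the local branch name pr-652 is arbitrary):

git fetch origin pull/652/head:pr-652
git checkout pr-652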

@TysonRayJones (Member):

Wew brilliant, sorry for the nuisance there! I'll make those edits tomorrow evening after finishing marking 62 masters QIT assignments 😭

@TysonRayJones (Member):

Closing this in favour of #658

TysonRayJones added a commit that referenced this pull request Jul 9, 2025
Luc:
v3.7 was sensible on NUMA machines “by default” through first-touch initialization. This had been lost in v4, as identified by James Richings. Here’s some basic NUMA-aware allocation, and a little love for general parallel/OpenMP usage.

- If we’re on *nix _and_ we find libnuma, we enable NUMA-aware allocations
- Add & use cpu_allocNumaArray() and cpu_deallocNumaArray() for the state-vector allocations (as the current alloc functions are also used for many smaller regions). Fall back to normal allocation functions if NUMA-unaware.
- Perform zero-initialization in parallel (still with std::fill() but use a parallel region)
- Make getCurrentNumThreads() work inside parallel regions (!)
- Add getAvailableNumThreads() to get thread count outside parallel regions. Improve this from previous getCurrentNumThreads() to only call the omp function once (rather than once per thread).

Luc coded the logic and Tyson added docs and error-handling. PR #658 replaced the original #652.

---------

Co-authored-by: Luc Jaulmes <ljaulmes@ed.ac.uk>