
NUMA-awareness fixes #652


Closed
wants to merge 7 commits

Conversation

@lucjaulmes (Contributor) commented Jun 20, 2025

v3.7 was sensible on NUMA machines “by default” through first-touch initialization. This had been lost in v4, as identified by James Richings. Here’s some basic NUMA-aware allocation I use, wrapped in NUMA_AWARE guards, and a little love for general parallel/OpenMP usage.

  • If we’re on *nix and we find libnuma, we enable NUMA-aware allocations
  • Add & use cpu_allocNumaArray() and cpu_deallocNumaArray() for the state-vector allocations (as the current alloc functions are also used for many smaller regions); see the sketch after this list. Fall back to normal allocation functions if NUMA-unaware.
  • Perform zero-initialization in parallel (still with std::fill() but use a parallel region)
  • Make getCurrentNumThreads() work inside parallel regions (!)
  • Add getAvailableNumThreads() to get thread count outside parallel regions. Improve this from previous getCurrentNumThreads() to only call the omp function once (rather than once per thread).
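
For illustration, here is a minimal sketch of the allocation idea (not the PR's exact cpu_allocNumaArray(): it uses libnuma's numa_alloc_interleaved() rather than the PR's mmap-based approach, and assumes cpu_allocArray() as the existing calloc-based fallback):

    // sketch only: NUMA-interleaved allocation with a plain fallback,
    // guarded the same way as the PR (NUMA_AWARE defined when libnuma is found)
    #if NUMA_AWARE
        #include <numa.h>    // numa_alloc_interleaved, numa_free
    #endif

    qcomp* cpu_allocNumaArray(qindex length) {
    #if NUMA_AWARE
        // spread pages across all NUMA nodes instead of relying on first-touch
        return (qcomp*) numa_alloc_interleaved(length * sizeof(qcomp));
    #else
        // cpu_allocArray() is the existing calloc-based allocator
        return cpu_allocArray(length);
    #endif
    }

And the distinction between the two thread-count helpers, assuming they map onto plain OpenMP calls:

    #include <omp.h>

    // usable inside a parallel region: size of the current team (1 in serial code)
    int cpu_getCurrentNumThreads() {
        return omp_get_num_threads();
    }

    // usable outside parallel regions: how many threads a region would receive
    int cpu_getAvailableNumThreads() {
        return omp_get_max_threads();
    }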

@TysonRayJones (Member) commented Jun 21, 2025

Wew amazing! Great call on renaming cpu_getCurrentNumThreads() to cpu_getAvailableNumThreads(), and revising cpu_getCurrentNumThreads() to actually do what it says on the tin 😅

Some notes:

  • Could we rename distribute() in utils to include the util_ prefix and be more illustrative?
  • Is it possible to refactor the changes to cpu_statevector_anyCtrlPauliTensorOrGadget_subA() such that distribute() does not need to be a static function in the utilities header? I note its inclusion in fastmath.hpp instead would require it is CUDA-kernel compatible (so no standard types)
  • Is sysconf definitely platform agnostic?
  • Can we replace use of perror with custom error messages in errors.cpp? That file is probably over-engineered but is at least nice to centralise all error-throwing (i.e. all termination of QuEST execution)
  • Can get_page_size() and get_numa_nodes() be changed to camelCase for consistency?

@TysonRayJones (Member):

Btw for speed, I can commit those aforementioned changes myself (and fix the CI - Github Actions is treating the cmake warning as an error) but still suggest giving you full merge credit (i.e. not coauthoring with me) since they're merely stylistic. Let me know if you prefer that and I'll do so!

@lucjaulmes (Contributor, Author):

No worries @TysonRayJones I’m on it now!

@lucjaulmes (Contributor, Author):

Re sysconf: it’s POSIX, and NUMA_AWARE is not enabled on Windows. I’m not sure there are other systems (neither Windows nor Unix-based) that this code is supposed to run on?

@lucjaulmes (Contributor, Author):

I’ve moved util_distribute() out of headers. It’s literally two divmods, so I was thinking there’s no point in it being its own standalone function; but to be fair it’s never called anywhere performance-sensitive (nor should it be), so it’s not worth the headache of including it in headers either.

@otbrown (Contributor) commented Jun 24, 2025

Love this, thanks @lucjaulmes!

@TysonRayJones (Member) commented Jun 26, 2025

Just helping collate the CI issues:

  • All serial unit tests are failing for the below functions:

    calcTotalProb
    initBlankState
    initClassicalState
    initPlusState
    collapseToOutcome
    

    I believe the first four are due to the change in cpu_statevec_initUniformState_sub() (I'm unsure about collapseToOutcome(), but it is likely also affected). The bug might lie within either:

    void cpu_statevec_initUniformState_sub(Qureg qureg, qcomp amp) {

        // faster on average (though perhaps not for large quregs)
        // than a custom multithreaded loop
        #pragma omp parallel if(qureg.isMultithreaded)
        {
            // Distribute number of tasks and convert to indexes. 4kB page standard?
            const auto [start, end] = util_distribute(
                qureg.numAmpsPerNode, 4096 / sizeof(qcomp),
                cpu_getOpenmpThreadInd(), cpu_getCurrentNumThreads());

            std::fill(qureg.cpuAmps + start, qureg.cpuAmps + end, amp);
        }
    }

    std::pair<qindex, qindex>
    util_distribute(const qindex work, const qindex block, const int id, const int n) {

        // ASSUME(work % block == 0);
        const qindex blocks = work / block;
        qindex spread = blocks / n;
        qindex extra = blocks % n;
        qindex prev_extra = (id * extra) / n;
        qindex prev_shift = (id * extra) % n;
        qindex here_extra = (prev_shift + extra) >= n;
        qindex pos = id * spread + prev_extra;
        return std::make_pair(pos * block, (pos + spread + here_extra) * block);
    }

    It's hard to gauge without some comments or a more illustrative name - would it be appropriate to rename

    util_distribute(const qindex work, const qindex block, const int id, const int n) {
    
         ...
    }

    to

    util_getThreadSubRangeAsBlockMultiple(qindex rangeLen, qindex blockLen, int threadInd, int numThreads) {
        
         // distributes rangeLen non-uniformly between threads (returning [start,end) of a sub-range)
         // such that each thread is allocated a non-overlapping integer multiple of blockLen, or zero 
         // (as indicated by end==start)
    
         ...
    }

    Have I correctly understood the function's intention?

    Note too that the commented assumption work % block == 0 is equivalent to requiring that qureg.numAmpsPerNode is a multiple of 4096 / sizeof(qcomp), which is not satisfied by small Qureg (e.g. at double precision, it is only satisfied by 8+ qubit quregs, which excludes those of the unit tests).

    Note too that the 4096 / sizeof(qcomp) snippet looks like a magic number; is it worth moving it to a constant in core/memory.hpp with an illustrative name?
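
    For what it's worth, here is a sketch (not the PR's code) of one way the work % block == 0 assumption could be dropped, by rounding the block count up and clamping the final sub-range to the true length, so small quregs degenerate gracefully:

    #include <algorithm>    // std::min
    #include <utility>      // std::pair

    // sketch: distribute `work` items over `n` workers in multiples of `block`,
    // without assuming work % block == 0 (the partial final block is clamped)
    std::pair<qindex, qindex>
    util_distribute(const qindex work, const qindex block, const int id, const int n) {

        const qindex blocks = (work + block - 1) / block;   // round up
        const qindex spread = blocks / n;
        const qindex extra  = blocks % n;

        // the first `extra` workers each receive one additional block
        const qindex start = id * spread + std::min<qindex>(id, extra);
        const qindex end   = start + spread + (id < extra ? 1 : 0);

        // clamp so the final (possibly partial) block never overruns `work`
        return { std::min(start * block, work), std::min(end * block, work) };
    }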

  • The memory checks CI is reporting a "still reachable" memory leak:

    ==4388== LEAK SUMMARY:
    ==4388==    definitely lost: 0 bytes in 0 blocks
    ==4388==    indirectly lost: 0 bytes in 0 blocks
    ==4388==      possibly lost: 0 bytes in 0 blocks
    ==4388==    still reachable: 72,704 bytes in 1 blocks
    ==4388==         suppressed: 0 bytes in 0 blocks
    ==4388== Reachable blocks (those to which a pointer was found) are not shown.
    ==4388== To see them, rerun with: --leak-check=full --show-leak-kinds=all
    

    but it may be caused by the failing unit tests - I'm not sure!

  • All non-Windows OMP-enabled compilation CI is failing with message:

    -- Checking for one of the modules 'numa'
    CMake Error at CMakeLists.txt:340 (compile_option):
      compile_option Function invoked with incorrect arguments for function
      named: compile_option
    

@lucjaulmes (Contributor, Author) commented Jun 29, 2025

Hmm, I likely didn’t test this enough on tiny examples. Pushed a few fixes; letting the CIs run again.

Regarding distribute(): it’s distributing as uniformly as possible under the constraint that assignments are block-multiples. I’m not sure it’s limited to threads; “getBlockMultipleSubRange” maybe? Still quite a mouthful though.

Rebasing on latest devel for good measure.

@lucjaulmes force-pushed the devel branch 2 times, most recently from e53c97b to b547470 on June 29, 2025 11:28
In the case where libnuma is not found, NUMA_FOUND is (can be?) left undefined instead of being set to 0.
@TysonRayJones (Member) commented Jun 30, 2025

I'm intending to make some mostly stylistic and defensive changes but I'm having trouble pulling the PR branch locally - so I'm going to experiment with Codespaces, hence please excuse any commit spam from failed tests!

Things I intend to tinker with:

  • cpu_getPageSize() guards based on NUMA_AWARE = 0 but that's really a separate consideration which might confuse a reader. Ideally it should guard based on whether e.g. _WIN32 is defined.

  • cpu_getPageSize() queries the page-size on POSIX systems but assumes FALLBACK_PAGE_SIZE on Windows, though that mightn't be fully motivated; the function is called on Windows (informing an argument to util_distribute()) and so should try "as hard" as POSIX to learn the page size. Even though Windows isn't yet benefitting from the application (distributing std::fill over NUMA nodes), it should avoid astonishment, in case e.g. we do later add NUMA awareness on Windows. Beware it's worthwhile to avoid including the full windows.h however (e.g. via WIN32_LEAN_AND_MEAN), since that header drastically slows compilation.
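
    Something along these lines, perhaps (a sketch only, not the PR's cpu_getPageSize(); the Windows branch assumes GetSystemInfo() from windows.h, kept lean via WIN32_LEAN_AND_MEAN):

    // sketch of a platform-guarded page-size query, independent of NUMA_AWARE
    #if defined(_WIN32)
        #define WIN32_LEAN_AND_MEAN
        #include <windows.h>    // GetSystemInfo
    #else
        #include <unistd.h>     // sysconf
    #endif

    size_t cpu_getPageSize() {
    #if defined(_WIN32)
        SYSTEM_INFO info;
        GetSystemInfo(&info);
        return info.dwPageSize;
    #else
        long size = sysconf(_SC_PAGESIZE);
        return (size > 0) ? (size_t) size : 4096;   // conservative fallback
    #endif
    }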

  • the "cache" logic in the form

     static int value = 0;
     if (!value) {
         value = getValue();
         if (isBad(value))
             error();
     }
     return value;

    is a little verbose and confusing, and can be (if I understand correctly) simplified to:

    static int value = getValue();
    if (isBad(value))
        error();
    return value;

    The isBad validation is now called on every invocation, though it is as trivial as the otherwise invoked if (!value). In this revised form, the function looks just like the "unoptimised" form (which would call getValue() at every invocation) but merely adds the static keyword, so is not compromising readability for premature optimisation.

  • The trailing // NUMA_AWARE comment in

     #if NUMA_AWARE
         #include <sys/mman.h>
         #include <unistd.h>
         #include <numaif.h>
         #include <numa.h>
     #endif // NUMA_AWARE

    (and similar) is made redundant by the indentation. So far, such comments have only been used for clarifying the end of a header's #ifndef, since the contents therein are not indented.

  • as discussed, a statevector with < 8 qubits has an array smaller than a page of memory. Currently, cpu_allocNumaArray() will still use mmap with pages=1 to "wastefully" map a full page. An alternative use of cpu_allocArray()'s calloc() permits the OS to squeeze multiple Qureg into a page. Do we care? Is there a sensible use-case where a user simultaneously instantiates tens of millions of tiny Qureg which cannot fit into memory if each is given an entire page? On the other hand, is there a performance benefit from forcing max one Qureg per page?

    If we're indifferent, would it be clearer to allocate a sub-page-size Qureg with cpu_allocArray() rather than cpu_allocNumaArray()? I.e.

    qcomp* cpu_allocNumaArray(qindex length) {
    #if !NUMA_AWARE
        return cpu_allocArray(length);
    #else
        unsigned long page_size = cpu_getPageSize();
        qindex array_size = length * sizeof(qcomp);
        
        // avoid mapping an entire page for a sub-page Qureg
        if (array_size < page_size)
            return cpu_allocArray(length);
    
        ...
        mmap(...)
    #endif 
    }

    and where cpu_deallocNumaArray() would need a similar branch (sketched below). This circumvents the need to use clever expressions like pages = (array_size + page_size - 1) / page_size instead of pages = array_size / page_size.

    Alternatively, would a comment in cpu_allocNumaArray() explaining that it's harmless to allocate a full page per sub-page Qureg be better?
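
    For concreteness, the matching dealloc branch mentioned above might look something like this sketch (assuming cpu_deallocArray() as the counterpart of cpu_allocArray(), and that the array length is known at dealloc time so the byte size can be recomputed):

    void cpu_deallocNumaArray(qcomp* array, qindex length) {
    #if !NUMA_AWARE
        cpu_deallocArray(array);
    #else
        qindex array_size = length * sizeof(qcomp);

        // sub-page arrays came from cpu_allocArray(), so release them the same way
        if (array_size < (qindex) cpu_getPageSize())
            cpu_deallocArray(array);
        else
            munmap(array, array_size);   // from <sys/mman.h>, already included under NUMA_AWARE
    #endif
    }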

  • The function cpu_getNumaNodes()...

    #if NUMA_AWARE
    unsigned long cpu_getNumaNodes() {

        static int n_nodes = 0;
        if (!n_nodes) {
            n_nodes = numa_num_configured_nodes();
            if (n_nodes < 1) {
                error_gettingNumaNodesFailed();
            }
        }
        return n_nodes;
    }
    #endif

    is defined within #if NUMA_AWARE guards (which makes one a little frightened), and unnecessarily given a cpu_ prefix as if it were visible outside the cpu_config.cpp file (it's instead private). But thankfully, it doesn't need to exist! It is only called by cpu_allocNumaArray() (within the same guards), which could instead just call n_nodes = numa_num_configured_nodes() directly with that variable being static. There again, the validation check (whether n_nodes < 1) is just as cheap as checking whether the static var has already been set.

  • Though wordy, I think getBlockMultipleSubRange() (with a clarifying comment therein) is indeed better than distribute(). Pixels are cheap, but time lost due to misinterpreting a terse function can be expensive! ;)

Sorry to appear pedantic - QuEST v4's explosion in complexity has made me extremely attentive to defensive design since I foresee a million new ways for insidious bugs. Ergo changes to very fundamental things (like allocating memory) make me overly scrutinous!

@lucjaulmes (Contributor, Author) commented Jul 1, 2025

Feel free to edit the PR as you see fit, of course!

@TysonRayJones (Member) commented Jul 1, 2025

BAH clicking through the inconspicuous Github desktop prompts has caused that unhelpful merge commit. I wasn't able to pull your changes until I opened the "resolve conflicts" menu which reported "no conflicts" and gave a new option to "continue merge", which caused that merge commit without warning, duplicating a number of commits to devel. I cannot now seem to force undo that merge into your PR branch like I would on an ordinary branch. I should know better by now than to use Github Desktop 💢

Are you able to force undo my merge from your end, or otherwise know the secret git source to pull your changes to my local repo? 🫤 Edit: this should require only forcefully reverting my commit onto your lucjaulmes/devel branch:

git checkout lucjaulmes/devel
git reset --hard 99e2705
git push origin --force

which keeps your latest commit, discarding my merge. I don't have permission to do this myself (despite being able to make commits to it).

That should put me back into the strange situation of not being able to pull your changes with Github desktop, but I'll handle it manually now :^)

@lucjaulmes (Contributor, Author):

Not sure how to do it with GitHub Desktop, but with the CLI you should be able to fetch refs/pull/652/head into your local repo. That should point to the head here.
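
E.g. something like this (assuming origin points at the main QuEST repo; the local branch name pr-652 is arbitrary):

git fetch origin pull/652/head:pr-652
git checkout pr-652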

@TysonRayJones (Member):

Wew brilliant, sorry for the nuisance there! I'll make those edits tomorrow evening after finishing marking 62 masters QIT assignments 😭

@TysonRayJones (Member):

Closing this in favour of #658

TysonRayJones added a commit that referenced this pull request Jul 9, 2025
Luc:
v3.7 was sensible on NUMA machines “by default” through first-touch initialization. This had been lost in v4, as identified by James Richings. Here’s some basic NUMA-aware allocation, and a little love for general parallel/OpenMP usage.

- If we’re on *nix _and_ we find libnuma, we enable NUMA-aware allocations
- Add & use cpu_allocNumaArray() and cpu_deallocNumaArray() for the state-vector allocations (as the current alloc functions are also used for many smaller regions). Fall back to normal allocation functions if NUMA-unaware.
- Perform zero-initialization in parallel (still with std::fill() but use a parallel region)
- Make getCurrentNumThreads() work inside parallel regions (!)
- Add getAvailableNumThreads() to get thread count outside parallel regions. Improve this from previous getCurrentNumThreads() to only call the omp function once (rather than once per thread).

Luc coded the logic and Tyson added docs and error-handling. PR #658 replaced the original #652.

---------

Co-authored-by: Luc Jaulmes <ljaulmes@ed.ac.uk>