
performance of thread_pool_executor in fork-join is very bad #174


Open

tzcnt opened this issue Apr 11, 2025 · 9 comments

@tzcnt commented Apr 11, 2025

I am putting together a cross-cutting comparison of async runtimes. You can see the current benchmark results here: https://github.com/tzcnt/runtime-benchmarks

The performance of concurrencpp is, at the moment, quite bad. I want to give you the opportunity to put your library in the best light. Perhaps there is a more optimal implementation for these benchmarks? Here is the concurrencpp implementation folder: https://github.com/tzcnt/runtime-benchmarks/tree/main/cpp/concurrencpp

One thing I tried was to remove the std::shared_ptr<thread_pool_executor> from the function parameters and replace it with a thread_pool_executor&; however, this doesn't work with when_all(), because that function requires a shared_ptr in its interface.
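For context, the recursive shape follows the parallel-Fibonacci example from concurrencpp's own README fairly closely (this is a sketch, not the exact benchmark code):

```cpp
#include "concurrencpp/concurrencpp.h"
#include <iostream>

using namespace concurrencpp;

result<int> fibonacci(executor_tag, std::shared_ptr<thread_pool_executor> tpe, int n) {
    if (n < 2) {
        co_return n;
    }
    // Every recursive call is forked onto the pool. The executor is threaded
    // through by shared_ptr, which is also the form when_all() requires.
    auto fib_1 = fibonacci({}, tpe, n - 1);
    auto fib_2 = fibonacci({}, tpe, n - 2);
    co_return co_await fib_1 + co_await fib_2;
}

int main() {
    runtime rt;
    std::cout << fibonacci({}, rt.thread_pool_executor(), 30).get() << std::endl;
    return 0;
}
```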

@David-Haim (Owner)

Looking at it.

@David-Haim (Owner) commented Apr 20, 2025

Update:

I looked thoroughly into the other libraries and into concurrencpp.
It's worth mentioning that the tests don't all do exactly the same things (even if the final result is the same), so they don't measure exactly the same work.

For example, at least in the Fibonacci test, some implementations fork the first step (fib(n-1)) but call the second step (fib(n-2)) directly, meaning that 50% of the tasks run inline rather than asynchronously. concurrencpp forks every step.
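Schematically, using std::async as a stand-in for whatever fork primitive a given library exposes (an illustration of the two shapes, not any library's actual code):

```cpp
#include <future>

// Shape used by some implementations: fork the first step, run the second inline.
long fib_half_inline(long n) {
    if (n < 2) return n;
    auto first = std::async(std::launch::async, fib_half_inline, n - 1); // forked
    long second = fib_half_inline(n - 2);                                // runs inline
    return first.get() + second;
}

// Shape measured for concurrencpp: both steps are forked.
long fib_fork_both(long n) {
    if (n < 2) return n;
    auto first  = std::async(std::launch::async, fib_fork_both, n - 1);  // forked
    auto second = std::async(std::launch::async, fib_fork_both, n - 2);  // forked too
    return first.get() + second.get();
}

int main() {
    return fib_half_inline(10) == fib_fork_both(10) ? 0 : 1;
}
```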

Also, concurrencpp, along with the other C++20 coroutine libraries, actually passes the asynchronous result as a real function return. Other libraries require weird and unsafe workarounds to do it, from passing the asynchronous result as a reference into the async function (it's UB if you read or write it before the task is done and you've synchronized on the value) to strange syntax (libfork is especially odd with its fork[...](...) construct).
Obviously, simplicity and safety cost more, and concurrencpp is willing to pay that price.

But at the end of the day - you're right - parallel coroutines are bad for F/J.
Over the last few days I've been working on a real F/J implementation for concurrencpp (it uses a variation of lazy_result, which is much better), along with some improvements to the thread-pool mechanism and a new task implementation.

From internal benchmarks, Fibonacci now takes 1/5 to 1/6 of the time it took with parallel coroutines, and skynet takes about 1/3 of the time.
I'll finish the implementation in the next few days, and hopefully it will be released in 0.1.9.

Also:

> One thing I tried was to remove the std::shared_ptr<thread_pool_executor> from the function parameters and replace it with a thread_pool_executor&; however, this doesn't work with when_all(), because that function requires a shared_ptr in its interface.

Noted, it will be added in the future.

@tzcnt (Author) commented Apr 20, 2025

Due to differences in the library APIs, the benchmarks don't all measure exactly the same behavior. I tried to get the implementations as close as possible, but the API differences make exact equivalence impossible.

The goal is simply to measure the best performance of "run N tasks in parallel and get N results", in whatever manner each library makes available. The reader can then judge how usable and expressive they find the syntax required to do it. As you observed, libfork is very fast but not expressive, requiring arcane syntax and exposing only fork and join operators. Other libraries, like coros, can be made much faster, but that requires actively fighting against the library API - this change results in about a 6x speedup. For concurrencpp, though, I was just a bit stumped about how to modify the implementation for better performance.

One thing - I understand that safety is an important feature of your library. Maybe you can look at why passing a range directly to when_all here breaks the program? The issue is resolved by materializing a vector first, but that adds more overhead.
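The pattern looks roughly like this (a sketch assuming the iterator-pair overload of when_all; compute() and the counts here are hypothetical, not the actual benchmark code):

```cpp
#include "concurrencpp/concurrencpp.h"
#include <ranges>
#include <vector>

using namespace concurrencpp;

result<int> compute(executor_tag, std::shared_ptr<thread_pool_executor> tpe, int x) {
    co_return x * x;
}

result<int> sum_all(executor_tag, std::shared_ptr<thread_pool_executor> tpe) {
    // A lazy view: each dereference mints a new task.
    auto lazy = std::views::iota(0, 100)
              | std::views::transform([&](int i) { return compute({}, tpe, i); });

    // Passing lazy.begin()/lazy.end() straight into when_all is what broke;
    // materializing into a vector first works, at the cost of an extra pass:
    std::vector<result<int>> tasks;
    for (auto&& t : lazy) {
        tasks.push_back(std::move(t));
    }

    auto done = co_await when_all(tpe, tasks.begin(), tasks.end());
    int sum = 0;
    for (auto& r : done) {
        sum += co_await r;
    }
    co_return sum;
}
```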

@David-Haim (Owner)

Hi

Regarding the UB, I assume it's a compiler bug.

I pushed my ForkJoin implementation to the fork_join branch.
Here are the modified test files (as .txt, because GitHub prohibits attaching .cpp or .zip files) that work with this branch.
From my benchmarking, there's a huge improvement over just using result with when_all - sometimes 10% of the original execution time.
Would you mind benchmarking the new implementation on your machine and telling me the new results?
I will merge this implementation into the next version.

fork_join_fibbonacci.cpp.txt
fork_join_nqueens.cpp.txt
fork_join_skynet.cpp.txt
fork_join_mat_mull.cpp.txt

@tzcnt (Author) commented Apr 27, 2025

That makes a marked improvement in performance. I'm doing a full benchmark run and will report back with the results shortly.

A couple notes on the implementation:

I re-optimized a couple of the benches (these changes are ~10% faster on my test setup):

  • fib: pass an immediately invoked function to fork_join instead of creating temporaries
  • nqueens: still using the range, but now reserving the vector size first

I suspect the reason passing the range begin/end directly to when_all fails is that it enumerates the entire range once to assert the elements are non-null, and then again to save them. Since this range produces elements on demand, that results in creating additional tasks that are never executed. To be compatible with this kind of iterator, you would need to do everything in one pass.

I suspect fj_result may be susceptible to the same issue. Even though you materialize the values into m_results first and then operate on that - a single pass over the original iterator - you call std::distance on the iterator first. For the nqueens range, which is not random-access (it has an unpredictable filter), computing the distance requires iterating the entire range. So this still results in 2 passes over the range.
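A minimal standalone illustration of the double-pass hazard (generic ranges code, not concurrencpp's internals):

```cpp
#include <iostream>
#include <ranges>

int main() {
    int tasks_created = 0;
    auto range = std::views::iota(0, 4)
               | std::views::transform([&](int i) { ++tasks_created; return i; });

    for (int t : range) { (void)t; }  // pass 1: e.g. validating the elements
    for (int t : range) { (void)t; }  // pass 2: e.g. saving them

    // Each pass re-invokes the lambda, so 8 "tasks" were minted for 4 elements.
    std::cout << tasks_created << '\n';  // prints 8
}
```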

@David-Haim (Owner) commented Apr 27, 2025

> I suspect the reason passing the range begin/end directly to when_all fails is that it enumerates the entire range once to assert the elements are non-null, and then again to save them. Since this range produces elements on demand, that results in creating additional tasks that are never executed. To be compatible with this kind of iterator, you would need to do everything in one pass.

> I suspect fj_result may be susceptible to the same issue. Even though you materialize the values into m_results first and then operate on that - a single pass over the original iterator - you call std::distance on the iterator first. For the nqueens range, which is not random-access (it has an unpredictable filter), computing the distance requires iterating the entire range. So this still results in 2 passes over the range.

Interesting. I have to say that while I love C# LINQ, I don't like the C++ version and don't have much experience with it - but thanks for the direction, I'll try to see what's going on there.

> I'm doing a full benchmark run and will report back with the results shortly.

Looking forward to it 😎

@tzcnt (Author) commented Apr 27, 2025

New benchmark results based on the development version are temporarily available at: https://fleetcode.com/runtime-benchmarks/release_test/. You can compare them to the original at: https://fleetcode.com/runtime-benchmarks

Generated from this commit: tzcnt/runtime-benchmarks@658a8ca

@David-Haim (Owner)

These results are very good!
I'm very pleased with them, and I think I can optimize the thread pool a bit more.
It's interesting that concurrencpp doesn't work well with a huge number of cores/threads (48 and more); I guess the work-sharing algorithm needs fine-tuning for this case (is it NUMA?).
Thanks for all the input and research - it made concurrencpp much better. I also had fun implementing this fork/join mechanism.
It will be pushed to either 0.1.8 or 0.1.9, depending on how fast I write tests for it.

@tzcnt (Author) commented Apr 29, 2025

The 7742 has multiple independent L3 caches, with 4 cores sharing each cache, so it could be considered a "NUCA" architecture. See https://www.anandtech.com/show/16529/amd-epyc-milan-review/4 for reference. My machine can be configured as up to 4 NUMA nodes, but I have it configured as a single NUMA node. These L3 caches also have multiple tiers of latency domains, since their communication is mediated through the I/O die, which has 4 quadrants.

Both libfork and TooManyCooks use hwloc in some way to understand the structure of the underlying machine. I use hwloc to partition the thread groups according to L3 cache, with a work-stealing strategy that is aware of the locality of threads to nearby L3 caches.
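A minimal sketch of the discovery half of that idea using the public hwloc API (the binding and stealing policy are where the real work is):

```cpp
#include <hwloc.h>
#include <stdio.h>

int main(void) {
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    // Find the depth of unified L3 caches in the topology tree.
    int depth = hwloc_get_cache_type_depth(topo, 3, HWLOC_OBJ_CACHE_UNIFIED);
    if (depth >= 0) {
        unsigned n = hwloc_get_nbobjs_by_depth(topo, depth);
        for (unsigned i = 0; i < n; ++i) {
            hwloc_obj_t l3 = hwloc_get_obj_by_depth(topo, depth, i);
            // One worker group per L3; a pool would bind its threads to l3->cpuset.
            printf("L3 #%u: %d PUs share this cache\n",
                   i, hwloc_bitmap_weight(l3->cpuset));
        }
    }

    hwloc_topology_destroy(topo);
    return 0;
}
```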

Newer AMD Zen parts have 8 cores sharing each cache, so the impact of this is reduced, but not eliminated. I think you can see some of it in the Ryzen 5950X benchmark, where concurrencpp's performance gain shrinks going from 8 to 16 threads. That part has two 8-core chiplets, so that tracks.
