performance of thread_pool_executor in fork-join is very bad #174
Comments
Looking at it.
Update: I looked thoroughly into the other libraries and concurrencpp. For example, at least in the fibonacci test, some implementations fork the first step (fib(n-1)) but call the second step (fib(n-2)) directly, meaning that 50% of the tasks run inline rather than asynchronously. concurrencpp forks every step. Also, concurrencpp, along with the other C++20 coroutine libraries, actually passes the asynchronous result as a real function return; other libraries require weird and unsafe workarounds to do it - from passing the asynchronous result as a reference into the async function (it's UB if you read or write to it before the task is done and you've synchronized the value), to weird syntax (libfork is especially weird in this regard). But at the end of the day - you're right - parallel coroutines are bad for F/J. From internal benchmarks, fibonacci now takes 1/5 to 1/6 of the time it took with parallel coroutines, and skynet takes about 1/3 of the time. Also:
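For illustration, here is a minimal sketch of the "fork one child, call the other inline" pattern described above. It uses std::async purely as a stand-in for a real fork-join pool (a real implementation would reuse worker threads instead of spawning one per fork); the point is only that roughly half of the recursive calls never leave the calling thread, which cuts scheduling overhead.

```cpp
#include <future>

long fib(int n) {
    if (n < 2) return n;
    auto left = std::async(std::launch::async, fib, n - 1); // forked: runs asynchronously
    long right = fib(n - 2);                                // called inline on this thread
    return left.get() + right;
}

int main() { return fib(15) == 610 ? 0 : 1; }
```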
Noted, will be added in the future.
Due to differences in the library APIs, they don't all measure exactly the same behavior. I tried to get the implementations as close as possible, but the API differences make it impossible to get things exactly identical. The goal is simply to measure the best performance of "run N tasks in parallel and get N results" - in whatever manner is available for this in each library. Then the reader can judge how usable and expressive they feel the syntax required to do this is. As you observed, libfork is very fast but not expressive, requiring arcane syntax and only exposing fork and join operators. Other libraries like coros can be made much faster, but that requires actively fighting against the library API - this change results in about a 6x speedup. For concurrencpp, though, I was just a bit stumped about how to modify the implementation for better performance. One thing - I understand that safety is an important feature of your library. Maybe you can look at why passing a range directly to when_all here breaks the program? The issue is resolved by materializing a vector first - but this adds more overhead.
Hi. Regarding the UB, I assume it's a compiler bug. I pushed my ForkJoin implementation to the fork_join branch: fork_join_fibbonacci.cpp.txt
That makes a marked improvement in performance. I'm doing a full benchmark run and will report back with the results shortly. A couple of notes on the implementation:
I re-optimized a couple of the benches (these changes are ~10% faster on my test setup):
I suspect the reason passing the range begin/end directly to when_all fails is that it enumerates the entire range once to assert the elements are non-null, and then again to save them. Since this range produces elements on demand, the second enumeration creates additional tasks that are never executed. To be compatible with this kind of iterator, you would need to do everything in one pass. I suspect that fj_result may also be susceptible to the same issue: even though you materialize the values into m_results first and then operate on that - which makes a single pass against the original iterator - you call std::distance on the iterator first. In the case of the nqueens range, which is not random access (since it has the unpredictable filter), that requires iterating the entire range just to calculate the distance. So this still results in 2 passes against the range.
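A minimal sketch of the double-traversal problem, not taken from the benchmark code: the counter stands in for task creation, and the side-effecting lambda is there only to make the repeated traversal visible. Each pass over a lazy view re-invokes the transform, so two passes produce the elements twice.

```cpp
#include <iostream>
#include <ranges>

int main() {
    int created = 0;
    // Lazy range: the lambda runs every time an element is pulled through the view.
    auto tasks = std::views::iota(0, 4)
               | std::views::transform([&created](int i) { ++created; return i; });

    for (int t : tasks) { (void)t; } // first pass, e.g. validation or std::distance
    for (int t : tasks) { (void)t; } // second pass, e.g. storing the elements

    // Prints 8, not 4: each traversal produces the elements again.
    std::cout << "elements produced: " << created << '\n';
}
```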
Interesting. I have to say that while I love C# LINQ, I don't like the C++ version and don't have much experience with it - but thanks for the direction, I will try to see what's going on there.
Looking forward to it 😎
New benchmark results based on the development version are temporarily available at: https://fleetcode.com/runtime-benchmarks/release_test/. You can compare them to the original at: https://fleetcode.com/runtime-benchmarks generated from this commit: tzcnt/runtime-benchmarks@658a8ca
These results are very good!
The 7742 has multiple independent L3 caches with 4 cores sharing each cache, so it could be considered a "NUCA" architecture. See https://www.anandtech.com/show/16529/amd-epyc-milan-review/4 for reference. My machine has the option to be configured as up to 4 NUMA nodes, but I only have it configured as 1 NUMA node. These L3 caches also have multiple tiers of latency domains, as their communication is mediated through the I/O die, which has 4 quadrants.

Both libfork and TooManyCooks use hwloc in some way to understand the structure of the underlying machine. I use hwloc to partition the thread groups according to L3 cache and use a work-stealing strategy that is aware of the locality of threads to nearby L3 caches. Newer AMD Zen parts have 8 cores sharing each cache, so the impact of this is reduced, but not eliminated. I think you can see some of this in the Ryzen 5950X benchmark, where the performance gain for concurrencpp shrinks going from 8 to 16 threads. That part has two 8-core chiplets, so that tracks.
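As a rough sketch of the kind of hwloc query involved (assuming hwloc 2.x, compiled with -lhwloc; this is not the actual libfork or TooManyCooks code), you can enumerate the machine's L3 cache domains and the logical CPUs sharing each one, which is the information a work-stealing pool can use to group worker threads per L3:

```cpp
#include <hwloc.h>
#include <cstdio>

int main() {
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    int n_l3 = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_L3CACHE);
    for (int i = 0; i < n_l3; ++i) {
        hwloc_obj_t l3 = hwloc_get_obj_by_type(topo, HWLOC_OBJ_L3CACHE, i);
        // l3->cpuset lists the logical CPUs sharing this cache; a thread group
        // could be bound to exactly this set to keep work stealing cache-local.
        std::printf("L3 #%d: %d logical CPUs\n", i, hwloc_bitmap_weight(l3->cpuset));
    }

    hwloc_topology_destroy(topo);
}
```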
I am putting together a cross-cutting comparison of async runtimes. You can see the current benchmark results here: https://github.com/tzcnt/runtime-benchmarks
The performance of concurrencpp is, at the moment, quite bad. I want to give you the opportunity to put your library in the best light. Perhaps there is a more optimal implementation for these benchmarks? Here is the concurrencpp implementation folder: https://github.com/tzcnt/runtime-benchmarks/tree/main/cpp/concurrencpp
One thing I tried was to remove the `std::shared_ptr<thread_pool_executor>` from the function parameters and replace it with a `thread_pool_executor&` - however, this doesn't work with `when_all()`, because that function simply requires a `shared_ptr` in the interface.
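A minimal, hypothetical sketch (not concurrencpp's API; the `executor` type and function names are made up) of why a hot recursive path would prefer taking the executor by reference: passing a `std::shared_ptr` by value performs an atomic ref-count increment and decrement on every call, which adds up and can contend across threads in a fork-join workload, whereas a reference parameter avoids that traffic entirely.

```cpp
#include <memory>

struct executor {};  // stand-in for a real executor type

long fib_by_value(std::shared_ptr<executor> ex, int n) {   // ref-count traffic on every call
    return n < 2 ? n : fib_by_value(ex, n - 1) + fib_by_value(ex, n - 2);
}

long fib_by_ref(executor& ex, int n) {                     // no ref-count traffic
    return n < 2 ? n : fib_by_ref(ex, n - 1) + fib_by_ref(ex, n - 2);
}

int main() {
    auto ex = std::make_shared<executor>();
    return (fib_by_value(ex, 20) == fib_by_ref(*ex, 20)) ? 0 : 1;
}
```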