Commit 2926208

Merge pull request #86 from Krastanov/master
disable_polyester_threads
2 parents 259c20f + 36aacb8 commit 2926208

4 files changed, +261 -7 lines changed

Project.toml

Lines changed: 2 additions & 2 deletions
@@ -1,7 +1,7 @@
 name = "Polyester"
 uuid = "f517fe37-dbe3-4b94-8317-1923a5111588"
 authors = ["Chris Elrod <elrodc@gmail.com> and contributors"]
-version = "0.6.13"
+version = "0.6.14"
 
 [deps]
 ArrayInterface = "4fba245c-0d91-5ea0-9b3e-6abc04ee57a9"
@@ -21,7 +21,7 @@ BitTwiddlingConvenienceFunctions = "0.1"
 CPUSummary = "0.1.2 - 0.1.8, 0.1.11"
 IfElse = "0.1"
 ManualMemory = "0.1.3"
-PolyesterWeave = "0.1"
+PolyesterWeave = "0.1.7"
 Requires = "1"
 Static = "0.7"
 StrideArraysCore = "0.3.11"

README.md

Lines changed: 164 additions & 1 deletion
@@ -413,6 +413,8 @@ Note that `@batch` defaults to using up to one thread per physical core, instead
 is because [LoopVectorization.jl](https://github.com/JuliaSIMD/LoopVectorization.jl) currently only uses up to 1 thread per physical core, and switching the number of
 threads incurs some overhead. See the docstring on `@batch` (i.e., `?@batch` in a Julia REPL) for some more discussion.
 
+## Local per-thread storage
+
 You also can define local storage for each thread, providing a vector containing each of the local storages at the end.
 
 ```julia
@@ -446,4 +448,165 @@ julia> let
 end
 
 Float16[83.0, 90.0, 27.0, 65.0]
-```
+```
+
+## Disabling Polyester threads
+
+When running many repetitions of a Polyester-multithreaded function (e.g. in an embarrassingly parallel problem that repeatedly executes a small, already Polyester-multithreaded function), it can be beneficial to disable Polyester (the inner multithreaded loop) and multithread only at the outer level (e.g. with `Base.Threads`). This can be done with the `disable_polyester_threads` context manager. The expandable section below shows examples with benchmarks.
+
+It is best to call `disable_polyester_threads` only once, before any `@threads` loop starts, to avoid overhead. For example, prefer:
+```julia
+disable_polyester_threads() do
+    @threads for i in 1:n
+        f()
+    end
+end
+```
+over the following unnecessarily slow pattern:
+```julia
+@threads for i in 1:n # DO NOT DO THIS
+    disable_polyester_threads() do # IT HAS UNNECESSARY OVERHEAD
+        f()
+    end
+end
+```
+
+
+<details>
+<summary>Benchmarks of nested multi-threading with Polyester</summary>
+
+```julia
+# Big inner problem, repeated only a few times
+
+y = rand(10000000,4);
+x = rand(size(y)...);
+
+@btime inner($x,$y,1) # 73.319 ms (0 allocations: 0 bytes)
+@btime inner_polyester($x,$y,1) # 8.936 ms (0 allocations: 0 bytes)
+@btime inner_thread($x,$y,1) # 11.206 ms (49 allocations: 4.56 KiB)
+
+@btime sequential_sequential($x,$y) # 274.926 ms (0 allocations: 0 bytes)
+@btime sequential_polyester($x,$y) # 36.963 ms (0 allocations: 0 bytes)
+@btime sequential_thread($x,$y) # 49.373 ms (196 allocations: 18.25 KiB)
+
+@btime threads_of_polyester($x,$y) # 78.828 ms (58 allocations: 4.84 KiB)
+# the following is a purposefully suboptimal way to disable threads
+@btime threads_of_polyester_inner_disable($x,$y) # 70.182 ms (47 allocations: 4.50 KiB)
+# the following is a good way to disable threads (the disable call happening once in the outer scope)
+@btime Polyester.disable_polyester_threads() do; threads_of_polyester($x,$y) end; # 71.141 ms (47 allocations: 4.50 KiB)
+@btime threads_of_sequential($x,$y) # 70.857 ms (46 allocations: 4.47 KiB)
+@btime threads_of_thread($x,$y) # 45.116 ms (219 allocations: 22.00 KiB)
+
+# Small inner problem, repeated many times
+
+y = rand(1000,1000);
+x = rand(size(y)...);
+
+@btime inner($x,$y,1) # 7.028 μs (0 allocations: 0 bytes)
+@btime inner_polyester($x,$y,1) # 1.917 μs (0 allocations: 0 bytes)
+@btime inner_thread($x,$y,1) # 7.544 μs (45 allocations: 4.44 KiB)
+
+@btime sequential_sequential($x,$y) # 6.790 ms (0 allocations: 0 bytes)
+@btime sequential_polyester($x,$y) # 2.070 ms (0 allocations: 0 bytes)
+@btime sequential_thread($x,$y) # 9.296 ms (49002 allocations: 4.46 MiB)
+
+@btime threads_of_polyester($x,$y) # 2.090 ms (42 allocations: 4.34 KiB)
+# the following is a purposefully suboptimal way to disable threads
+@btime threads_of_polyester_inner_disable($x,$y) # 1.065 ms (42 allocations: 4.34 KiB)
+# the following is a good way to disable threads (the disable call happening once in the outer scope)
+@btime Polyester.disable_polyester_threads() do; threads_of_polyester($x,$y) end; # 997.918 μs (49 allocations: 4.56 KiB)
+@btime threads_of_sequential($x,$y) # 1.057 ms (48 allocations: 4.53 KiB)
+@btime threads_of_thread($x,$y) # 4.105 ms (42059 allocations: 4.25 MiB)
+
# The tested functions
522+
# All of these would be better implemented by just using @tturbo,
523+
# but these suboptimal implementations serve as good test case for
524+
# Polyster-vs-Base thread scheduling.
525+
526+
function inner(x,y,j)
527+
for i axes(x,1)
528+
y[i,j] = sin(x[i,j])
529+
end
530+
end
531+
532+
function inner_polyester(x,y,j)
533+
@batch for i axes(x,1)
534+
y[i,j] = sin(x[i,j])
535+
end
536+
end
537+
538+
function inner_thread(x,y,j)
539+
@threads for i axes(x,1)
540+
y[i,j] = sin(x[i,j])
541+
end
542+
end
543+
544+
function sequential_sequential(x,y)
545+
for j axes(x,2)
546+
inner(x,y,j)
547+
end
548+
end
549+
550+
function sequential_polyester(x,y)
551+
for j axes(x,2)
552+
inner_polyester(x,y,j)
553+
end
554+
end
555+
556+
function sequential_thread(x,y)
557+
for j axes(x,2)
558+
inner_thread(x,y,j)
559+
end
560+
end
561+
562+
function threads_of_polyester(x,y)
563+
@threads for j axes(x,2)
564+
inner_polyester(x,y,j)
565+
end
566+
end
567+
568+
function threads_of_polyester_inner_disable(x,y)
569+
# XXX This is a bad way to disable Polyester threads as
570+
# it causes unnecessary overhead for each @threads thread.
571+
# See the benchmarks above for a better way.
572+
@threads for j axes(x,2)
573+
Polyester.disable_polyester_threads() do
574+
inner_polyester(x,y,j)
575+
end
576+
end
577+
end
578+
579+
function threads_of_thread(x,y)
580+
@threads for j axes(x,2)
581+
inner_thread(x,y,j)
582+
end
583+
end
584+
585+
function threads_of_thread(x,y)
586+
@threads for j axes(x,2)
587+
inner_thread(x,y,j)
588+
end
589+
end
590+
591+
function threads_of_sequential(x,y)
592+
@threads for j axes(x,2)
593+
inner(x,y,j)
594+
end
595+
end
596+
```
+Benchmarks executed on:
+```
+Julia Version 1.9.0-DEV.998
+Commit e1739aa42a1 (2022-07-18 10:27 UTC)
+Platform Info:
+  OS: Linux (x86_64-linux-gnu)
+  CPU: 16 × AMD Ryzen 7 1700 Eight-Core Processor
+  WORD_SIZE: 64
+  LIBM: libopenlibm
+  LLVM: libLLVM-14.0.5 (ORCJIT, znver1)
+  Threads: 8 on 16 virtual cores
+Environment:
+  JULIA_EDITOR = code
+  JULIA_NUM_THREADS = 8
+```
+</details>
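
The pattern recommended in the README text above can be sketched as a small self-contained script. This is only an illustration, assuming Polyester 0.6.14 or later is installed; `scale_column!` and `scale_all!` are hypothetical names that do not appear in this commit.

```julia
using Polyester                # provides @batch and disable_polyester_threads
using Base.Threads: @threads

# Hypothetical inner kernel: multithreaded with Polyester's @batch.
function scale_column!(y, x, j)
    @batch for i in axes(x, 1)
        y[i, j] = 2 * x[i, j]
    end
    return y
end

# Hypothetical outer driver: the columns are split across Base threads.
function scale_all!(y, x)
    @threads for j in axes(x, 2)
        scale_column!(y, x, j)
    end
    return y
end

x = rand(100, 8)
y = similar(x)

# Disable Polyester once, outside the @threads loop, so that only the
# outer loop is multithreaded while this block runs.
disable_polyester_threads() do
    scale_all!(y, x)
end
```

Whether this is faster than leaving Polyester enabled depends on the problem size and thread count, as the benchmarks above suggest, so it is worth measuring both variants.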

src/Polyester.jl

Lines changed: 3 additions & 2 deletions
@@ -8,10 +8,11 @@ using ManualMemory: Reference
 using Static
 using Requires
 using PolyesterWeave:
-  request_threads, free_threads!, mask, UnsignedIteratorEarlyStop, assume
+  request_threads, free_threads!, mask, UnsignedIteratorEarlyStop, assume,
+  disable_polyester_threads
 using CPUSummary: num_threads, num_cores
 
-export batch, @batch, num_threads
+export batch, @batch, num_threads, disable_polyester_threads
 
 
 include("batch.jl")
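
The effect of the import and export change above can be checked with a short sketch, assuming Polyester 0.6.14 or later is installed. It only inspects names and does not rely on any behavior beyond what the diff shows.

```julia
using Polyester

# The name is now exported, so the qualified and unqualified spellings
# refer to the same function object.
@assert disable_polyester_threads === Polyester.disable_polyester_threads

# The function is imported from PolyesterWeave; Polyester re-exports it.
defining_module = nameof(parentmodule(disable_polyester_threads))
println(defining_module)
```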

test/runtests.jl

Lines changed: 92 additions & 2 deletions
@@ -2,6 +2,7 @@ println(
   "Starting tests with $(Threads.nthreads()) threads out of `Sys.CPU_THREADS = $(Sys.CPU_THREADS)`...",
 )
 using Polyester, Aqua, ForwardDiff
+using Base.Threads: @threads
 using Test
 
 function bsin!(y, x, r = eachindex(y, x))
@@ -395,10 +396,10 @@ end
 # issue 78 (lack of support for keyword arguments using only variable names without `=`)
 
 f(a; b=10.0, c=100.0) = a + b + c
-
+
 buf = [0, 0]
 b = 0.0
-
+
 Threads.nthreads() == 1 && println("the issue arises only on multithreading runs")
 
 @batch for i in 1:2
@@ -408,6 +409,95 @@ end
 @test buf == [1, 2]
 end
 
+@testset "disable_polyester_threads" begin
+    function inner(x,y,j)
+        for i ∈ axes(x,1)
+            y[i,j] = sin(x[i,j])
+        end
+    end
+
+    function inner_polyester(x,y,j)
+        @batch for i ∈ axes(x,1)
+            y[i,j] = sin(x[i,j])
+        end
+    end
+
+    function inner_thread(x,y,j)
+        @threads for i ∈ axes(x,1)
+            y[i,j] = sin(x[i,j])
+        end
+    end
+
+    function sequential_sequential(x,y)
+        for j ∈ axes(x,2)
+            inner(x,y,j)
+        end
+    end
+
+    function sequential_polyester(x,y)
+        for j ∈ axes(x,2)
+            inner_polyester(x,y,j)
+        end
+    end
+
+    function sequential_thread(x,y)
+        for j ∈ axes(x,2)
+            inner_thread(x,y,j)
+        end
+    end
+
+    function threads_of_polyester(x,y)
+        @threads for j ∈ axes(x,2)
+            inner_polyester(x,y,j)
+        end
+    end
+
+    function threads_of_polyester_inner_disable(x,y)
+        @threads for j ∈ axes(x,2)
+            Polyester.disable_polyester_threads() do
+                inner_polyester(x,y,j)
+            end
+        end
+    end
+
+    function threads_of_thread(x,y)
+        @threads for j ∈ axes(x,2)
+            inner_thread(x,y,j)
+        end
+    end
+
+    function threads_of_sequential(x,y)
+        @threads for j ∈ axes(x,2)
+            inner(x,y,j)
+        end
+    end
+
+    y = rand(10,10); # (size of inner problem, size of outer problem)
+    x = rand(size(y)...);
+    inner(x,y,1)
+    good_y = copy(y)
+    inner_polyester(x,y,1)
+    @assert good_y == y
+    inner_thread(x,y,1)
+    @assert good_y == y
+    sequential_sequential(x,y)
+    good_y = copy(y)
+    sequential_polyester(x,y)
+    @assert good_y == y
+    sequential_thread(x,y)
+    @assert good_y == y
+    threads_of_polyester(x,y)
+    @assert good_y == y
+    threads_of_polyester_inner_disable(x,y)
+    @assert good_y == y
+    disable_polyester_threads() do; threads_of_polyester(x,y) end
+    @assert good_y == y
+    threads_of_sequential(x,y)
+    @assert good_y == y
+    threads_of_thread(x,y)
+    @assert good_y == y
+end
+
 if VERSION ≥ v"1.6"
   println("Package tests complete. Running `Aqua` checks.")
   Aqua.test_all(Polyester)
