Commit dc18a50

Merge pull request #11 from efwright/updating_graphs
Updating images
2 parents: b55c62d + 9eb561e

6 files changed, +19 -18 lines

04-Parallelize.markdown

Lines changed: 13 additions & 12 deletions
@@ -626,25 +626,26 @@ discussed in a later chapter.
 At this point we have expressed all of the parallelism in the example code and
 the compiler has parallelized it for an accelerator device. Analyzing the
 performance of this code may yield surprising results on some accelerators,
-however. The results below demonstrate the performance of this code on 1 - 8
-CPU threads on a modern CPU at the time of publication and an NVIDIA Tesla K40
+however. The results below demonstrate the performance of this code on 1 - 16
+CPU threads on an AMD Threadripper CPU and an NVIDIA Volta V100
 GPU using both implementations above. The *y axis* for figure 3.1 is execution
 time in seconds, so smaller is better. For the two OpenACC versions, the bar is
-divided by time transferring data between the host and device, time executing
-on the device, and other time.
+divided into time transferring data between the host and device and time
+executing on the device.
 
 ![Jacobi Iteration Performance - Step 1](images/jacobi_step1_graph.png)
 
-Notice that the performance of this code improves as CPU threads are added to
-the calcuation, but the OpenACC versions perform poorly compared to the CPU
-baseline. The OpenACC `kernels` version performs slightly better than the
-serial version, but the `parallel loop` case performs dramaticaly worse than
-even the slowest CPU version. Further performance analysis is necessary to
+The performance of this code improves as more CPU threads are added to the
+calculation; however, since the code is memory-bound, the benefit of adding
+threads quickly diminishes. The OpenACC versions also perform poorly compared
+to the CPU: both the `kernels` and `parallel loop` versions are slower than
+the serial CPU baseline, and the `parallel loop` version clearly spends
+significantly more time in data transfer than the `kernels` version.
+Further performance analysis is necessary to
 identify the source of this slowdown. This analysis has already been applied to
 the graph above, which breaks down time spent
-computing the solution, copying data to and from the accelerator, and
-miscelaneous time, which includes various overheads involved in scheduling data
-transfers and computation.
+computing the solution and copying data to and from the accelerator.
 
 A variety of tools are available for performing this analysis, but since this
 case study was compiled for an NVIDIA GPU, NVIDIA Nsight Systems will be
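
For reference, the Jacobi kernels themselves are not part of this hunk. Below is a minimal sketch of the `parallel loop` variant being measured, assuming the chapter's array names (`A`, `Anew`) and a simple convergence loop; without explicit data directives the compiler may copy both arrays between host and device around each compute region, which is the data-transfer time broken out in the graph.

~~~~ {.c}
#include <math.h>

// Sketch only: array names, sizes, and the convergence check are assumptions
// based on the chapter's Jacobi example, not this commit's exact listing.
void jacobi(int n, int m, double A[n][m], double Anew[n][m],
            int iter_max, double tol)
{
  double error = tol + 1.0;
  for (int iter = 0; iter < iter_max && error > tol; iter++) {
    error = 0.0;
    // The `parallel loop` version: the programmer asserts both loops are
    // parallel and requests a max reduction on the convergence error.
    #pragma acc parallel loop reduction(max:error)
    for (int j = 1; j < n - 1; j++) {
      #pragma acc loop reduction(max:error)
      for (int i = 1; i < m - 1; i++) {
        Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
        error = fmax(error, fabs(Anew[j][i] - A[j][i]));
      }
    }
    // Copy the updated values back for the next sweep.
    #pragma acc parallel loop
    for (int j = 1; j < n - 1; j++) {
      #pragma acc loop
      for (int i = 1; i < m - 1; i++) {
        A[j][i] = Anew[j][i];
      }
    }
  }
}
~~~~

The `kernels` variant instead wraps the same loop nests in a single `#pragma acc kernels` region and lets the compiler discover the parallelism, which is why the two versions can manage data movement differently.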

06-Loops.markdown

Lines changed: 6 additions & 6 deletions
@@ -486,7 +486,7 @@ parallelism to fill each *gang* with more of these short vectors. Below is the
 modified code.
 
 ~~~~ {.c .numberLines}
-#pragma acc parallel loop gang worker num_workers(32) vector_length(32)
+#pragma acc parallel loop gang worker num_workers(4) vector_length(32)
 for(int i=0;i<num_rows;i++) {
   double sum=0;
   int row_start=row_offsets[i];
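
The hunk above shows only the first few lines of the loop nest this directive controls. For context, a sketch of the full CSR SpMV kernel follows; the array names (`row_offsets`, `cols`, `Acoefs`, `xcoefs`, `ycoefs`) follow the chapter's example and are assumptions here.

~~~~ {.c}
// Sketch of the CSR SpMV loop the directive above applies to; array names
// are assumptions based on the chapter's example.
void spmv(int num_rows, const int *restrict row_offsets,
          const int *restrict cols, const double *restrict Acoefs,
          const double *restrict xcoefs, double *restrict ycoefs)
{
  // Each row is assigned to a gang/worker pair; its nonzeros are processed
  // by a 32-wide vector, matching the short row lengths the chapter notes.
  #pragma acc parallel loop gang worker num_workers(4) vector_length(32)
  for (int i = 0; i < num_rows; i++) {
    double sum = 0;
    int row_start = row_offsets[i];
    int row_end = row_offsets[i+1];
    #pragma acc loop vector reduction(+:sum)
    for (int j = row_start; j < row_end; j++) {
      sum += Acoefs[j] * xcoefs[cols[j]];
    }
    ycoefs[i] = sum;
  }
}
~~~~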
@@ -521,19 +521,19 @@ modified code.
 enddo
 ~~~~
 
-In this version of the code, I've explicitly mapped the outermost look to both
+In this version of the code, I've explicitly mapped the outermost loop to both
 gang and worker parallelism and will vary the number of workers using the
 `num_workers` clause. The results follow.
 
 ![Speed-up from varying number of workers for a vector length of
 32.](images/spmv_speedup_num_workers.png)
 
 On this particular hardware, the best performance comes from a vector length of
-32 and 32 workers. This turns out to be the maximum amount of parallelism that
-the particular accelerator being used supports within a gang. In this case, we
-observed a 1.3X speed-up from decreasing the vector length and another 2.1X
+32 and 4 workers, which performs similarly to the simpler loop with the default
+vector length of 128. In this case, we
+observed a 2.5X speed-up from decreasing the vector length and another 1.26X
 speed-up from varying the number of workers within each gang, resulting in an
-overall 2.9X performance improvement from the untuned OpenACC code.
+overall 3.15X performance improvement from the untuned OpenACC code.
 
 ***Best Practice:*** Although not shown in order to save space, it's generally
 best to use the `device_type` clause whenever specifying the sorts of
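
As a rough illustration of the best practice mentioned above (not part of this commit), the tuned clauses can be scoped with `device_type` so that other targets keep the compiler's defaults; the `num_workers(4)` and `vector_length(32)` values are the ones tuned in this chapter and would only be assumptions on other hardware.

~~~~ {.c}
// Illustrative sketch only: the num_workers/vector_length tuning is applied
// via device_type so it affects NVIDIA targets, while other devices fall
// back to the compiler's defaults.
void scale(int n, double *restrict y, double a)
{
  #pragma acc parallel loop gang worker device_type(nvidia) num_workers(4) vector_length(32)
  for (int i = 0; i < n; i++) {
    y[i] *= a;
  }
}
~~~~

Applied to the SpMV loop above, the same clause ordering keeps the NVIDIA-specific tuning from pessimizing other devices.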

images/jacobi_step1_graph.png (15.6 KB)

images/jacobi_step2_graph.png (15.3 KB)

images/spmv_speedup_num_workers.png (21 KB)

images/spmv_speedup_vector_length.png (16.1 KB)
