@@ -626,25 +626,26 @@ discussed in a later chapter.
At this point we have expressed all of the parallelism in the example code and
the compiler has parallelized it for an accelerator device. Analyzing the
performance of this code may yield surprising results on some accelerators,
- however. The results below demonstrate the performance of this code on 1 - 8
- CPU threads on a modern CPU at the time of publication and an NVIDIA Tesla K40
+ however. The results below demonstrate the performance of this code on 1 - 16
+ CPU threads on an AMD Threadripper CPU and an NVIDIA Volta V100
GPU using both implementations above. The *y axis* for figure 3.1 is execution
time in seconds, so smaller is better. For the two OpenACC versions, the bar is
- divided by time transferring data between the host and device, time executing
- on the device, and other time.
+ divided into the time spent transferring data between the host and device and
+ the time spent executing on the device.

![Jacobi Iteration Performance - Step 1](images/jacobi_step1_graph.png)

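As a reminder of what is being timed, the listing below sketches the two OpenACC variants; the complete listings appear earlier in the chapter, so the array names, fixed sizes, and the omitted copy-back loop here are illustrative only.

```c
#include <math.h>

#define N 4096
#define M 4096

/* Sketch of the `kernels` variant: the compiler analyzes the region and
   decides how to parallelize the loops (the copy-back loop is omitted). */
double sweep_kernels(double A[N][M], double Anew[N][M])
{
    double error = 0.0;
    #pragma acc kernels
    {
        for (int j = 1; j < N - 1; j++) {
            for (int i = 1; i < M - 1; i++) {
                Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
                error = fmax(error, fabs(Anew[j][i] - A[j][i]));
            }
        }
    }
    return error;
}

/* Sketch of the `parallel loop` variant: the programmer asserts that the loops
   are safe to parallelize and requests a max reduction on the error value. */
double sweep_parallel_loop(double A[N][M], double Anew[N][M])
{
    double error = 0.0;
    #pragma acc parallel loop reduction(max:error)
    for (int j = 1; j < N - 1; j++) {
        #pragma acc loop reduction(max:error)
        for (int i = 1; i < M - 1; i++) {
            Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
            error = fmax(error, fabs(Anew[j][i] - A[j][i]));
        }
    }
    return error;
}
```

The `kernels` version leaves the parallelization decisions to the compiler, while the `parallel loop` version states them explicitly; both express the same computation.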
- Notice that the performance of this code improves as CPU threads are added to
- the calcuation, but the OpenACC versions perform poorly compared to the CPU
- baseline. The OpenACC `kernels` version performs slightly better than the
- serial version, but the `parallel loop` case performs dramaticaly worse than
- even the slowest CPU version. Further performance analysis is necessary to
+ The performance of this code improves as more CPU threads are added to the
+ calculation; however, since the code is memory-bound, the benefit of adding
+ more threads quickly diminishes. The OpenACC versions also perform poorly
+ compared to the CPU baseline: both the `kernels` and `parallel loop` versions
+ are slower than the serial CPU version, and the `parallel loop` version spends
+ significantly more time in data transfer than the `kernels` version. Further
+ performance analysis is necessary to
identify the source of this slowdown. This analysis has already been applied to
the graph above, which breaks down time spent
- computing the solution, copying data to and from the accelerator, and
- miscelaneous time, which includes various overheads involved in scheduling data
- transfers and computation.
+ computing the solution and copying data to and from the accelerator.
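A quick way to obtain a comparable breakdown with the PGI/NVHPC compilers (assuming that toolchain; the executable name below is illustrative) is the OpenACC runtime's built-in profiling, which prints per-region compute and data-transfer times when the program exits; a more detailed timeline can also be captured with Nsight Systems, which the next paragraph introduces.

```bash
# Assumes the PGI/NVHPC toolchain; "jacobi" is an illustrative executable name.
# The runtime prints per-region kernel and data-transfer times at program exit.
$ PGI_ACC_TIME=1 ./jacobi

# A full timeline with summary statistics can be collected with Nsight Systems.
$ nsys profile --stats=true -o jacobi_step1 ./jacobi
```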
A variety of tools are available for performing this analysis, but since this
case study was compiled for an NVIDIA GPU, NVIDIA Nsight Systems will be