Commit dc18a50

Merge pull request #11 from efwright/updating_graphs
Updating images
2 parents: b55c62d + 9eb561e

6 files changed, +19 -18 lines

04-Parallelize.markdown

Lines changed: 13 additions & 12 deletions
@@ -626,25 +626,26 @@ discussed in a later chapter.
 At this point we have expressed all of the parallelism in the example code and
 the compiler has parallelized it for an accelerator device. Analyzing the
 performance of this code may yield surprising results on some accelerators,
-however. The results below demonstrate the performance of this code on 1 - 8
-CPU threads on a modern CPU at the time of publication and an NVIDIA Tesla K40
+however. The results below demonstrate the performance of this code on 1 - 16
+CPU threads on an AMD Threadripper CPU and an NVIDIA Volta V100
 GPU using both implementations above. The *y axis* for figure 3.1 is execution
 time in seconds, so smaller is better. For the two OpenACC versions, the bar is
-divided by time transferring data between the host and device, time executing
-on the device, and other time.
+divided into time transferring data between the host and device and time
+executing on the device.
 
 ![Jacobi Iteration Performance - Step 1](images/jacobi_step1_graph.png)
 
-Notice that the performance of this code improves as CPU threads are added to
-the calcuation, but the OpenACC versions perform poorly compared to the CPU
-baseline. The OpenACC `kernels` version performs slightly better than the
-serial version, but the `parallel loop` case performs dramaticaly worse than
-even the slowest CPU version. Further performance analysis is necessary to
+The performance of this code improves as more CPU threads are added to the
+calculation; however, since the code is memory-bound, the benefit of adding
+threads quickly diminishes. The OpenACC versions also perform poorly compared
+to the CPU: both the `kernels` and `parallel loop` versions are slower than
+the serial CPU baseline, and the `parallel loop` version clearly spends
+significantly more time in data transfer than the `kernels` version.
+Further performance analysis is necessary to
 identify the source of this slowdown. This analysis has already been applied to
 the graph above, which breaks down time spent
-computing the solution, copying data to and from the accelerator, and
-miscelaneous time, which includes various overheads involved in scheduling data
-transfers and computation.
+computing the solution and copying data to and from the accelerator.
 
 A variety of tools are available for performing this analysis, but since this
 case study was compiled for an NVIDIA GPU, NVIDIA Nsight Systems will be
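
For reference, the Jacobi kernels themselves are not part of this hunk. Below is a minimal sketch of the `parallel loop` variant being measured, assuming the chapter's array names (`A`, `Anew`) and a simple convergence loop; without explicit data directives the compiler may copy both arrays between host and device around each compute region, which is the data-transfer time broken out in the graph.

~~~~ {.c}
#include <math.h>

// Sketch only: array names, sizes, and the convergence check are assumptions
// based on the chapter's Jacobi example, not this commit's exact listing.
void jacobi(int n, int m, double A[n][m], double Anew[n][m],
            int iter_max, double tol)
{
  double error = tol + 1.0;
  for (int iter = 0; iter < iter_max && error > tol; iter++) {
    error = 0.0;
    // The `parallel loop` version: the programmer asserts both loops are
    // parallel and requests a max reduction on the convergence error.
    #pragma acc parallel loop reduction(max:error)
    for (int j = 1; j < n - 1; j++) {
      #pragma acc loop reduction(max:error)
      for (int i = 1; i < m - 1; i++) {
        Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
        error = fmax(error, fabs(Anew[j][i] - A[j][i]));
      }
    }
    // Copy the updated values back for the next sweep.
    #pragma acc parallel loop
    for (int j = 1; j < n - 1; j++) {
      #pragma acc loop
      for (int i = 1; i < m - 1; i++) {
        A[j][i] = Anew[j][i];
      }
    }
  }
}
~~~~

The `kernels` variant instead wraps the same loop nests in a single `#pragma acc kernels` region and lets the compiler discover the parallelism, which is why the two versions can manage data movement differently.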

06-Loops.markdown

Lines changed: 6 additions & 6 deletions
@@ -486,7 +486,7 @@ parallelism to fill each *gang* with more of these short vectors. Below is the
 modified code.
 
 ~~~~ {.c .numberLines}
-#pragma acc parallel loop gang worker num_workers(32) vector_length(32)
+#pragma acc parallel loop gang worker num_workers(4) vector_length(32)
 for(int i=0;i<num_rows;i++) {
   double sum=0;
   int row_start=row_offsets[i];
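
The hunk above shows only the first few lines of the loop nest this directive controls. For context, a sketch of the full CSR SpMV kernel follows; the array names (`row_offsets`, `cols`, `Acoefs`, `xcoefs`, `ycoefs`) follow the chapter's example and are assumptions here.

~~~~ {.c}
// Sketch of the CSR SpMV loop the directive above applies to; array names
// are assumptions based on the chapter's example.
void spmv(int num_rows, const int *restrict row_offsets,
          const int *restrict cols, const double *restrict Acoefs,
          const double *restrict xcoefs, double *restrict ycoefs)
{
  // Each row is assigned to a gang/worker pair; its nonzeros are processed
  // by a 32-wide vector, matching the short row lengths the chapter notes.
  #pragma acc parallel loop gang worker num_workers(4) vector_length(32)
  for (int i = 0; i < num_rows; i++) {
    double sum = 0;
    int row_start = row_offsets[i];
    int row_end = row_offsets[i+1];
    #pragma acc loop vector reduction(+:sum)
    for (int j = row_start; j < row_end; j++) {
      sum += Acoefs[j] * xcoefs[cols[j]];
    }
    ycoefs[i] = sum;
  }
}
~~~~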
@@ -521,19 +521,19 @@ modified code.
 enddo
 ~~~~
 
-In this version of the code, I've explicitly mapped the outermost look to both
+In this version of the code, I've explicitly mapped the outermost loop to both
 gang and worker parallelism and will vary the number of workers using the
 `num_workers` clause. The results follow.
 
 ![Speed-up from varying number of workers for a vector length of
 32.](images/spmv_speedup_num_workers.png)
 
 On this particular hardware, the best performance comes from a vector length of
-32 and 32 workers. This turns out to be the maximum amount of parallelism that
-the particular accelerator being used supports within a gang. In this case, we
-observed a 1.3X speed-up from decreasing the vector length and another 2.1X
+32 and 4 workers, which performs similarly to the simpler loop with the default
+vector length of 128. In this case, we
+observed a 2.5X speed-up from decreasing the vector length and another 1.26X
 speed-up from varying the number of workers within each gang, resulting in an
-overall 2.9X performance improvement from the untuned OpenACC code.
+overall 3.15X performance improvement from the untuned OpenACC code.
 
 ***Best Practice:*** Although not shown in order to save space, it's generally
 best to use the `device_type` clause whenever specifying the sorts of
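
As a rough illustration of the best practice mentioned above (not part of this commit), the tuned clauses can be scoped with `device_type` so that other targets keep the compiler's defaults; the `num_workers(4)` and `vector_length(32)` values are the ones tuned in this chapter and would only be assumptions on other hardware.

~~~~ {.c}
// Illustrative sketch only: the num_workers/vector_length tuning is applied
// via device_type so it affects NVIDIA targets, while other devices fall
// back to the compiler's defaults.
void scale(int n, double *restrict y, double a)
{
  #pragma acc parallel loop gang worker device_type(nvidia) num_workers(4) vector_length(32)
  for (int i = 0; i < n; i++) {
    y[i] *= a;
  }
}
~~~~

Applied to the SpMV loop above, the same clause ordering keeps the NVIDIA-specific tuning from pessimizing other devices.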

images/jacobi_step1_graph.png (15.6 KB)

images/jacobi_step2_graph.png (15.3 KB)

images/spmv_speedup_num_workers.png (21 KB)

images/spmv_speedup_vector_length.png (16.1 KB)
