
Commit 44c7404

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into optimze/remove_in_out_operatorbase
2 parents d06d855 + 7783d3b

File tree: 2 files changed, +97 -48 lines changed


paddle/fluid/inference/tests/api/int8_mkldnn_quantization.md

Lines changed: 65 additions & 20 deletions
@@ -1,6 +1,6 @@
 # INT8 MKL-DNN quantization

-This document describes how to use Paddle inference Engine to convert the FP32 model to INT8 model on ResNet-50 and MobileNet-V1. We provide the instructions on enabling INT8 MKL-DNN quantization in Paddle inference and show the ResNet-50 and MobileNet-V1 results in accuracy and performance.
+This document describes how to use the Paddle inference engine to convert FP32 models to INT8 models. We provide instructions for enabling INT8 MKL-DNN quantization in Paddle inference and show the accuracy and performance results of the quantized models, including 7 image classification models (GoogleNet, MobileNet-V1, MobileNet-V2, ResNet-101, ResNet-50, VGG16, VGG19) and 1 object detection model (Mobilenet-SSD).

 ## 0. Install PaddlePaddle

@@ -15,7 +15,7 @@ Note: MKL-DNN and MKL are required.

 ## 1. Enable INT8 MKL-DNN quantization

-For reference, please examine the code of unit test enclosed in [analyzer_int8_image_classification_tester.cc](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/inference/tests/api/analyzer_int8_image_classification_tester.cc).
+For reference, please examine the code of the unit tests in [analyzer_int8_image_classification_tester.cc](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/inference/tests/api/analyzer_int8_image_classification_tester.cc) and [analyzer_int8_object_detection_tester.cc](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/inference/tests/api/analyzer_int8_object_detection_tester.cc).

 * ### Create Analysis config

@@ -34,12 +34,10 @@ cfg.mkldnn_quantizer_config()->SetWarmupData(warmup_data);
 cfg.mkldnn_quantizer_config()->SetWarmupBatchSize(100);
 ```

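For orientation, the two `SetWarmup*` calls shown in this hunk are only part of the INT8 setup. A minimal C++ sketch of the surrounding configuration, assuming a hypothetical `model_dir` and a pre-built `warmup_data` vector (the API names follow the `AnalysisConfig` interface used by the tests linked above; verify them against your branch):

```cpp
#include <memory>
#include <string>
#include <vector>

#include "paddle/fluid/inference/api/paddle_inference_api.h"

// Sketch of a full INT8 MKL-DNN quantization setup. The model path and
// warmup tensors are placeholders supplied by the caller; the config
// calls mirror the snippet in the hunk above.
std::unique_ptr<paddle::PaddlePredictor> BuildInt8Predictor(
    const std::string& model_dir,
    std::shared_ptr<std::vector<paddle::PaddleTensor>> warmup_data) {
  paddle::AnalysisConfig cfg;
  cfg.SetModel(model_dir);      // FP32 model to be quantized
  cfg.EnableMKLDNN();           // INT8 kernels are MKL-DNN based
  cfg.EnableMkldnnQuantizer();  // turn on INT8 quantization
  // Warmup data is used to collect the tensor statistics (scales)
  // needed to map FP32 values onto the INT8 range.
  cfg.mkldnn_quantizer_config()->SetWarmupData(warmup_data);
  cfg.mkldnn_quantizer_config()->SetWarmupBatchSize(100);
  return paddle::CreatePaddlePredictor(cfg);
}
```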
-## 2. Accuracy and Performance benchmark
+## 2. Accuracy and Performance benchmark for Image Classification models

 We provide the results of accuracy and performance measured on Intel(R) Xeon(R) Gold 6271 on single core.

->**Dataset: ILSVRC2012 Validation dataset**
-
 >**I. Top-1 Accuracy on Intel(R) Xeon(R) Gold 6271**

 | Model | FP32 Accuracy | INT8 Accuracy | Accuracy Diff(FP32-INT8) |
@@ -64,20 +62,10 @@ We provide the results of accuracy and performance measured on Intel(R) Xeon(R)
 | VGG16 | 3.64 | 10.56 | 2.90 |
 | VGG19 | 2.95 | 9.02 | 3.05 |

-Notes:
-
-* Measurement of accuracy requires a model which accepts two inputs: data and labels.
-
-* Different sampling batch size data may cause slight difference on INT8 top accuracy.
-* CAPI performance data is better than python API performance data because of the python overhead. Especially for the small computational model, python overhead will be more obvious.
-
-## 3. Commands to reproduce the above accuracy and performance benchmark
-
-Two steps to reproduce the above-mentioned accuracy results, and we take GoogleNet benchmark as an example:

-* ### Prepare dataset
+* ## Prepare dataset

-Running the following commands to download and preprocess the ILSVRC2012 Validation dataset.
+Run the following commands to download and preprocess the ILSVRC2012 Validation dataset.

 ```bash
 cd /PATH/TO/PADDLE/build
@@ -86,12 +74,13 @@ python ../paddle/fluid/inference/tests/api/full_ILSVRC2012_val_preprocess.py

 Then the ILSVRC2012 Validation dataset will be preprocessed and saved by default in `~/.cache/paddle/dataset/int8/download/int8_full_val.bin`

-* ### Commands to reproduce benchmark
+* ## Commands to reproduce image classification benchmark

-You can run `test_analyzer_int8_imagenet_classification` with the following arguments to reproduce the accuracy result on GoogleNet.
+You can run `test_analyzer_int8_image_classification` with the following arguments to reproduce the accuracy result on ResNet-50.

 ```bash
-./paddle/fluid/inference/tests/api/test_analyzer_int8_image_classification --infer_model=third_party/inference_demo/int8v2/resnet50/model --infer_data=/~/.cache/paddle/dataset/int8/download/int8_full_val.bin --batch_size=1 --paddle_num_threads=1
+cd /PATH/TO/PADDLE/build
+./paddle/fluid/inference/tests/api/test_analyzer_int8_image_classification --infer_model=third_party/inference_demo/int8v2/resnet50/model --infer_data=$HOME/.cache/paddle/dataset/int8/download/int8_full_val.bin --batch_size=1 --paddle_num_threads=1
 ```

 To verify all the 7 models, you need to set the parameter of `--infer_model` to one of the following values in command line:
@@ -103,3 +92,59 @@ To verify all the 7 models, you need to set the parameter of `--infer_model` to
 ```text
 MODEL_NAME=googlenet, mobilenetv1, mobilenetv2, resnet101, resnet50, vgg16, vgg19
 ```
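For example, all 7 models can be benchmarked in sequence with a small shell loop (an illustrative convenience, not part of the test suite; it assumes each model is already present under `third_party/inference_demo/int8v2/`):

```bash
# Hypothetical convenience loop: run the INT8 benchmark for each of the
# 7 image classification models in turn, using the flags shown above.
cd /PATH/TO/PADDLE/build
for MODEL_NAME in googlenet mobilenetv1 mobilenetv2 resnet101 resnet50 vgg16 vgg19; do
  ./paddle/fluid/inference/tests/api/test_analyzer_int8_image_classification \
    --infer_model=third_party/inference_demo/int8v2/${MODEL_NAME}/model \
    --infer_data=$HOME/.cache/paddle/dataset/int8/download/int8_full_val.bin \
    --batch_size=1 --paddle_num_threads=1
done
```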
+
+## 3. Accuracy and Performance benchmark for Object Detection models
+
+>**I. mAP on Intel(R) Xeon(R) Gold 6271 (batch size 1 on single core):**
+
+| Model | FP32 Accuracy | INT8 Accuracy | Accuracy Diff(FP32-INT8) |
+| :----------: | :-------------: | :------------: | :--------------: |
+| Mobilenet-SSD | 73.80% | 73.17% | 0.63% |
+
+>**II. Throughput on Intel(R) Xeon(R) Gold 6271 (batch size 1 on single core)**
+
+| Model | FP32 Throughput(images/s) | INT8 Throughput(images/s) | Ratio(INT8/FP32) |
+| :-----------: | :------------: | :------------: | :------------: |
+| Mobilenet-SSD | 37.8180 | 115.0604 | 3.04 |
+
+* ## Prepare dataset
+
+* Run the following commands to download and preprocess the Pascal VOC2007 test set.
+
+```bash
+cd /PATH/TO/PADDLE/build
+python ./paddle/fluid/inference/tests/api/full_pascalvoc_test_preprocess.py --choice=VOC_test_2007
+```
+
+Then the Pascal VOC2007 test set will be preprocessed and saved by default in `~/.cache/paddle/dataset/pascalvoc/pascalvoc_full.bin`
+
+* Run the following commands to prepare your own dataset.
+
+```bash
+cd /PATH/TO/PADDLE/build
+python ./paddle/fluid/inference/tests/api/full_pascalvoc_test_preprocess.py --choice=local \
+    --data_dir=./third_party/inference_demo/int8v2/pascalvoc_small \
+    --img_annotation_list=test_100.txt \
+    --label_file=label_list \
+    --output_file=pascalvoc_small.bin \
+    --resize_h=300 \
+    --resize_w=300 \
+    --mean_value='[127.5,127.5,127.5]' \
+    --ap_version=11point
+```
+Then the user dataset will be preprocessed and saved by default in `/PATH/TO/PADDLE/build/third_party/inference_demo/int8v2/pascalvoc_small/pascalvoc_small.bin`
+
+* ## Commands to reproduce object detection benchmark
+
+You can run `test_analyzer_int8_object_detection` with the following arguments to reproduce the benchmark results for Mobilenet-SSD.
+
+```bash
+cd /PATH/TO/PADDLE/build
+./paddle/fluid/inference/tests/api/test_analyzer_int8_object_detection --infer_model=third_party/inference_demo/int8v2/mobilenet-ssd/model --infer_data=$HOME/.cache/paddle/dataset/pascalvoc/pascalvoc_full.bin --warmup_batch_size=10 --batch_size=100 --paddle_num_threads=1
+```
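If you prepared the small local dataset above, the same test binary can presumably consume it by pointing `--infer_data` at the generated file; an untested sketch (mAP over 100 images will differ from the full-set numbers):

```bash
# Hypothetical run against the small user-prepared dataset from the
# "Prepare dataset" step; the doc does not show this command, so treat
# it as an assumption about how --infer_data is consumed.
cd /PATH/TO/PADDLE/build
./paddle/fluid/inference/tests/api/test_analyzer_int8_object_detection \
  --infer_model=third_party/inference_demo/int8v2/mobilenet-ssd/model \
  --infer_data=third_party/inference_demo/int8v2/pascalvoc_small/pascalvoc_small.bin \
  --warmup_batch_size=10 --batch_size=100 --paddle_num_threads=1
```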
+
+## 4. Notes
+
+* Measurement of accuracy requires a model which accepts two inputs: data and labels.
+* Different sampling batch sizes may cause slight differences in INT8 accuracy.
+* C API performance data is better than Python API performance data because of the Python overhead, which is especially noticeable for small computational models.

paddle/fluid/operators/conv_cudnn_op.cu

Lines changed: 32 additions & 28 deletions
@@ -540,23 +540,25 @@ class CUDNNConvGradOpKernel : public framework::OpKernel<T> {
           workspace_size);
     }

-    std::vector<int> starts(transformed_input_channel.dims().size(), 0);
-    std::vector<int> axes(transformed_input_channel.dims().size(), 0);
+    if (!is_sys_pad) {
+      std::vector<int> starts(transformed_input_channel.dims().size(), 0);
+      std::vector<int> axes(transformed_input_channel.dims().size(), 0);

-    for (size_t i = 0; i < transformed_input_channel.dims().size(); ++i) {
-      starts[i] = input_pad[2 * i];
-      axes[i] = i;
-    }
+      for (size_t i = 0; i < transformed_input_channel.dims().size(); ++i) {
+        starts[i] = input_pad[2 * i];
+        axes[i] = i;
+      }

-    transformed_input_grad_channel.mutable_data(ctx.GetPlace());
-    if (transformed_input_channel.dims().size() == 4) {
-      Slice_2<paddle::platform::CUDADeviceContext, T, 4>(
-          ctx, &transformed_input_grad, &transformed_input_grad_channel,
-          starts, axes);
-    } else {
-      Slice_2<paddle::platform::CUDADeviceContext, T, 5>(
-          ctx, &transformed_input_grad, &transformed_input_grad_channel,
-          starts, axes);
+      transformed_input_grad_channel.mutable_data(ctx.GetPlace());
+      if (transformed_input_channel.dims().size() == 4) {
+        Slice_2<paddle::platform::CUDADeviceContext, T, 4>(
+            ctx, &transformed_input_grad, &transformed_input_grad_channel,
+            starts, axes);
+      } else {
+        Slice_2<paddle::platform::CUDADeviceContext, T, 5>(
+            ctx, &transformed_input_grad, &transformed_input_grad_channel,
+            starts, axes);
+      }
     }

     if (channel_last) {
@@ -982,20 +984,22 @@ class CUDNNConvDoubleGradOpKernel : public framework::OpKernel<T> {
           workspace_size);
     }

-    // reverse padded input
-    std::vector<int> starts(X->dims().size(), 0);
-    std::vector<int> axes(X->dims().size(), 0);
+    if (!is_sys_pad) {
+      // reverse padded input
+      std::vector<int> starts(X->dims().size(), 0);
+      std::vector<int> axes(X->dims().size(), 0);

-    for (size_t i = 0; i < X->dims().size(); ++i) {
-      starts[i] = input_pad[2 * i];
-      axes[i] = i;
-    }
-    if (X->dims().size() == 4) {
-      Slice_2<paddle::platform::CUDADeviceContext, T, 4>(
-          ctx, &transformed_dX, &transformed_dX_channel, starts, axes);
-    } else {
-      Slice_2<paddle::platform::CUDADeviceContext, T, 5>(
-          ctx, &transformed_dX, &transformed_dX_channel, starts, axes);
+      for (size_t i = 0; i < X->dims().size(); ++i) {
+        starts[i] = input_pad[2 * i];
+        axes[i] = i;
+      }
+      if (X->dims().size() == 4) {
+        Slice_2<paddle::platform::CUDADeviceContext, T, 4>(
+            ctx, &transformed_dX, &transformed_dX_channel, starts, axes);
+      } else {
+        Slice_2<paddle::platform::CUDADeviceContext, T, 5>(
+            ctx, &transformed_dX, &transformed_dX_channel, starts, axes);
+      }
     }
     if (channel_last) {
       TransToChannelLast<paddle::platform::CUDADeviceContext, T>(
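Both hunks apply the same change: the slice that strips explicit padding off the computed input gradient now runs only when `is_sys_pad` is false, i.e. when the input really was padded by hand before the cuDNN call, so the slice no longer runs on the symmetric-padding path. A minimal host-side C++ sketch of the pattern, with a stand-in for `Slice_2` and 1-D data (names are illustrative, not the real kernel types):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for Slice_2: copy out->size() elements of `padded`,
// starting at `start`, into `out`.
void SliceBack(const std::vector<float>& padded, int start,
               std::vector<float>* out) {
  for (std::size_t i = 0; i < out->size(); ++i) (*out)[i] = padded[start + i];
}

// Mirrors the patched control flow: strip the explicit padding from the
// gradient only when it was added in the first place (is_sys_pad == false).
void FinalizeInputGrad(const std::vector<float>& transformed_grad,
                       int pad_before, bool is_sys_pad,
                       std::vector<float>* input_grad) {
  if (!is_sys_pad) {
    // Explicitly padded path: slice the interior region back out,
    // as the Slice_2 calls in the diff do.
    SliceBack(transformed_grad, pad_before, input_grad);
  } else {
    // Symmetric padding was passed to cuDNN directly, so the gradient
    // already has the right shape. In the real kernel the two tensors
    // alias in this case and nothing needs to happen; the copy here
    // only keeps the sketch self-contained.
    assert(transformed_grad.size() == input_grad->size());
    *input_grad = transformed_grad;
  }
}
```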
