
Commit 48e7afe

Merge pull request #520 from huangjun12/pptsmv2-addatt-0830
add attention module for PP-TSMv2
2 parents 9b75aca + 05c34cc commit 48e7afe

File tree (5 files changed, +66 -9 lines changed)

configs/recognition/pptsm/v2/pptsm_lcnet_k400_16frames_uniform_dml_distillation.yaml
docs/zh-CN/benchmark.md
docs/zh-CN/model_zoo/recognition/pp-tsm.md
docs/zh-CN/model_zoo/recognition/pp-tsm_v2.md
paddlevideo/modeling/backbones/pptsm_v2.py

configs/recognition/pptsm/v2/pptsm_lcnet_k400_16frames_uniform_dml_distillation.yaml

Lines changed: 1 addition & 0 deletions
@@ -22,6 +22,7 @@ MODEL: #MODEL field
       name: "PPTSM_v2" #Mandatory, The name of backbone.
       pretrained: "data/PPLCNetV2_base_ssld_pretrained.pdparams" #Optional, pretrained model path.
       num_seg: 16
+      use_temporal_att: True
       class_num: 400
     head:
       name: "MoViNetHead" #Mandatory, indicate the type of head, associate to the 'paddlevideo/modeling/heads'

docs/zh-CN/benchmark.md

Lines changed: 2 additions & 2 deletions
@@ -73,7 +73,7 @@ mean fps: 25
 | TSM | R50 | [tsm_k400_frames.yaml](../../configs/recognition/tsm/tsm_k400_frames.yaml) | 71.06 | 52.02 | 9.87 | 61.89 |
 |**PP-TSM** | R50 | [pptsm_k400_frames_uniform.yaml](../../configs/recognition/pptsm/pptsm_k400_frames_uniform.yaml) | **75.11** | 51.84 | 11.26 | **63.1** |
 |PP-TSM | R101 | [pptsm_k400_frames_dense_r101.yaml](../../configs/recognition/pptsm/pptsm_k400_frames_dense_r101.yaml) | 76.35| 52.1 | 17.91 | 70.01 |
-| PP-TSMv2 | PP-LCNet_v2.16f | [pptsm_lcnet_k400_16frames_uniform.yaml](../../configs/recognition/pptsm/v2/pptsm_lcnet_k400_16frames_uniform.yaml) | 74.38 | 69.4 | 7.26 | 76.66 |
+| PP-TSMv2 | PP-LCNet_v2.16f | [pptsm_lcnet_k400_16frames_uniform.yaml](../../configs/recognition/pptsm/v2/pptsm_lcnet_k400_16frames_uniform.yaml) | 74.38 | 69.4 | 7.55 | 76.95 |
 | SlowFast | 4*16 | [slowfast.yaml](../../configs/recognition/slowfast/slowfast.yaml) | 74.35 | 99.27 | 27.4 | 126.67 |
 | *VideoSwin | B | [videoswin_k400_videos.yaml](../../configs/recognition/videoswin/videoswin_k400_videos.yaml) | 82.4 | 95.65 | 117.22 | 212.88 |
 | MoViNet | A0 | [movinet_k400_frame.yaml](../../configs/recognition/movinet/movinet_k400_frame.yaml) | 66.62 | 150.36 | 47.24 | 197.60 |
@@ -95,7 +95,7 @@ mean fps: 25
 | PP-TSM | MobileNetV2 | [pptsm_mv2_k400_videos_uniform.yaml](../../configs/recognition/pptsm/pptsm_mv2_k400_videos_uniform.yaml) | 68.09 | 52.62 | 137.03 | 189.65 |
 | PP-TSM | MobileNetV3 | [pptsm_mv3_k400_frames_uniform.yaml](../../configs/recognition/pptsm/pptsm_mv3_k400_frames_uniform.yaml) | 69.84| 53.44 | 139.13 | 192.58 |
 | **PP-TSMv2** | PP-LCNet_v2.8f | [pptsm_lcnet_k400_8frames_uniform.yaml](../../configs/recognition/pptsm/v2/pptsm_lcnet_k400_8frames_uniform.yaml) | **72.45**| 53.37 | 189.62 | **242.99** |
-| **PP-TSMv2** | PP-LCNet_v2.16f | [pptsm_lcnet_k400_16frames_uniform.yaml](../../configs/recognition/pptsm/v2/pptsm_lcnet_k400_16frames_uniform.yaml) | **74.38**| 68.07 | 365.23 | **433.31** |
+| **PP-TSMv2** | PP-LCNet_v2.16f | [pptsm_lcnet_k400_16frames_uniform.yaml](../../configs/recognition/pptsm/v2/pptsm_lcnet_k400_16frames_uniform.yaml) | **74.38**| 68.07 | 388.64 | **456.71** |
 | SlowFast | 4*16 | [slowfast.yaml](../../configs/recognition/slowfast/slowfast.yaml) | 74.35 | 110.04 | 1201.36 | 1311.41 |
 | TSM | R50 | [tsm_k400_frames.yaml](../../configs/recognition/tsm/tsm_k400_frames.yaml) | 71.06 | 52.47 | 1302.49 | 1354.96 |
 |PP-TSM | R50 | [pptsm_k400_frames_uniform.yaml](../../configs/recognition/pptsm/pptsm_k400_frames_uniform.yaml) | 75.11 | 52.26 | 1354.21 | 1406.48 |

docs/zh-CN/model_zoo/recognition/pp-tsm.md

Lines changed: 3 additions & 3 deletions
@@ -37,7 +37,7 @@ PP-TSM is optimized on top of the ResNet-50 backbone, covering data augmentation, network structure
 
 ### PP-TSMv2
 
-PP-TSMv2 is a lightweight video classification model built on the CPU-side model [PP-LCNetV2](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.4/docs/zh_CN/models/PP-LCNetV2.md). It is tuned in six areas: backbone and pretrained model selection, data augmentation, network structure adjustment (optimal number and placement of inserted tsm modules, plus a new temporal attention module), input frame count optimization, decoding speed optimization, and dml distillation. Under center-sampling evaluation it reaches 74.38% accuracy, and inference on a 10 s input video takes only 433 ms on CPU. See the [PP-TSMv2 technical report](./pp-tsm_v2.md) for more details.
+PP-TSMv2 is a lightweight video classification model built on the CPU-side model [PP-LCNetV2](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.4/docs/zh_CN/models/PP-LCNetV2.md). It is tuned in seven areas: backbone and pretrained model selection, data augmentation, tsm module tuning, input frame count optimization, decoding speed optimization, dml distillation, and a new temporal attention module. Under center-sampling evaluation it reaches 75.16% accuracy, and inference on a 10 s input video takes only 456 ms on CPU. See the [PP-TSMv2 technical report](./pp-tsm_v2.md) for more details.
 
 
 <a name="2"></a>
@@ -50,7 +50,7 @@ CPU inference speed comparison between PP-TSMv2 and mainstream models (sorted by total prediction time)
 | PP-TSM | MobileNetV2 | 68.09 | 52.62 | 137.03 | 189.65 |
 | PP-TSM | MobileNetV3 | 69.84| 53.44 | 139.13 | 192.58 |
 | **PP-TSMv2** | PP-LCNet_v2.8f | **72.45**| 53.37 | 189.62 | **242.99** |
-| **PP-TSMv2** | PP-LCNet_v2.16f | **74.38**| 68.07 | 365.23 | **433.31** |
+| **PP-TSMv2** | PP-LCNet_v2.16f | **75.16**| 68.07 | 388.64 | **456.71** |
 | SlowFast | 4*16 |74.35 | 110.04 | 1201.36 | 1311.41 |
 | TSM | R50 | 71.06 | 52.47 | 1302.49 | 1354.96 |
 |PP-TSM | R50 | 75.11 | 52.26 | 1354.21 | 1406.48 |
@@ -271,7 +271,7 @@ PaddleVideo provides Paddle2ONNX-based conversion of inference models to ONNX
 | Model | Backbone | Distillation | Test mode | Sampled frames | Top-1% | Trained model |
 | :------: | :----------: | :----: | :----: | :----: | :---- | :---- |
 | PP-TSMv2 | LCNet_v2 | DML | Uniform | 8 | 72.45 | [Download](https://videotag.bj.bcebos.com/PaddleVideo-release2.3/PPTSMv2_k400_8f_dml.pdparams) \| [Student model](https://videotag.bj.bcebos.com/PaddleVideo-release2.3/PPTSMv2_k400_8f_dml_student.pdparams) |
-| PP-TSMv2 | LCNet_v2 | DML | Uniform | 16 | 74.38 | [Download](https://videotag.bj.bcebos.com/PaddleVideo-release2.3/PPTSMv2_k400_16f_dml.pdparams) \| [Student model](https://videotag.bj.bcebos.com/PaddleVideo-release2.3/PPTSMv2_k400_16f_dml_student.pdparams) |
+| PP-TSMv2 | LCNet_v2 | DML | Uniform | 16 | 75.16 | [Download](https://videotag.bj.bcebos.com/PaddleVideo-release2.3/PPTSMv2_k400_16f_dml.pdparams) \| [Student model](https://videotag.bj.bcebos.com/PaddleVideo-release2.3/PPTSMv2_k400_16f_dml_student.pdparams) |
 | PP-TSM | ResNet50 | KD | Uniform | 8 | 75.11 | [Download](https://videotag.bj.bcebos.com/PaddleVideo-release2.1/PPTSM/ppTSM_k400_uniform_distill.pdparams) |
 | PP-TSM | ResNet50 | KD | Dense | 8 | 76.16 | [Download](https://videotag.bj.bcebos.com/PaddleVideo-release2.1/PPTSM/ppTSM_k400_dense_distill.pdparams) |
 | PP-TSM | ResNet101 | KD | Uniform | 8 | 76.35 | [Download](https://videotag.bj.bcebos.com/PaddleVideo-release2.2/ppTSM_k400_uniform_distill_r101.pdparams) |

docs/zh-CN/model_zoo/recognition/pp-tsm_v2.md

Lines changed: 16 additions & 4 deletions
@@ -6,10 +6,11 @@
 - [2. Model details](#2)
 - [2.1 Backbone and pretrained model selection](#21)
 - [2.2 Data augmentation](#22)
-- [2.3 Network structure adjustment](#23)
+- [2.3 tsm module tuning](#23)
 - [2.4 Input frame count optimization](#24)
 - [2.5 Decoding speed optimization](#25)
 - [2.6 DML distillation](#26)
+- [2.7 New temporal attention module](#27)
 - [3. Quick start](#3)
 - [4. Model training, compression, and inference deployment](#4)
 
@@ -19,7 +20,7 @@
 
 A video classification task takes a video as input and outputs a category label. When the labels are all action categories, the task is also called action recognition. As AI is adopted across industries, the demand for lightweight action recognition models in industrial and sports scenarios keeps growing, so we propose PP-TSMv2, an efficient and lightweight action recognition model.
 
-PP-TSMv2 inherits part of PP-TSM's optimization strategy and tunes the model in six areas: backbone and pretrained model selection, data augmentation, network structure adjustment, input frame count optimization, decoding speed optimization, and dml distillation. Under center-sampling evaluation it reaches 74.38% accuracy, and inference on a 10 s input video takes only 433 ms on CPU.
+PP-TSMv2 inherits part of PP-TSM's optimization strategy and tunes the model in seven areas: backbone and pretrained model selection, data augmentation, tsm module tuning, input frame count optimization, decoding speed optimization, dml distillation, and a new temporal attention module. Under center-sampling evaluation it reaches 75.16% accuracy, and inference on a 10 s input video takes only 456 ms on CPU.
 
 
 <a name="2"></a>
@@ -54,7 +55,7 @@ PP-TSMv2 inherits part of PP-TSM's optimization strategy, tuning backbone and pretrained model
 | baseline + VideoMix | 69.36(+**0.3**) |
 
 <a name="23"></a>
-### 2.3 Network structure adjustment
+### 2.3 tsm module tuning
 
 On top of the backbone, we add temporal shift modules to extract temporal information. Regarding insertion position, the original TSM paper inserts the temporal_shift module inside residual blocks, but PP-LCNetV2 removes some residual connections to speed up the model. PP-LCNetV2 consists of 4 stages overall, and we experimented to find the best insertion position for the temporal shift module. Regarding insertion count, temporal_shift modules increase the model's runtime, so we explored the optimal number to insert; the results are shown in the table below.
 
@@ -164,6 +165,16 @@ PP-TSMv2 inherits part of PP-TSM's optimization strategy, tuning backbone and pretrained model
 | DML | PP-TSM_ResNet50 | 71.27%(**+2.20%**) |
 
 
+<a name="27"></a>
+### 2.7 New temporal attention module
+
+The temporal shift module captures temporal information by shifting features along the temporal dimension, but this kind of shift only lets local features interact and lacks the ability to model global temporal information. We therefore propose a lightweight temporal attention module: global pooling combined with a learnable fc layer produces temporal attention at the global scale. The temporal attention module is placed before the tsm module, so the network performs the temporal shift under the guidance of global information. This lightweight temporal attention module further improves accuracy with essentially no increase in inference time.
+
+| Strategy | Top-1 Acc(\%) |
+|:--:|:--:|
+| pptsmv2 without temporal attention | 74.38 |
+| pptsmv2 with temporal attention | 75.16(+**0.78**) |
+
 <a name="3"></a>
 ## 3. Quick start
 
@@ -172,4 +183,5 @@ PP-TSMv2 inherits part of PP-TSM's optimization strategy, tuning backbone and pretrained model
 <a name="4"></a>
 ## 4. Model training, compression, and inference deployment
 
-For more tutorials, including model training, model compression, and inference deployment, please refer to the [PP-TSM documentation](./pp-tsm.md)
+
+For more tutorials, including model training, model compression, and inference deployment, please refer to the [usage documentation](./pp-tsm.md)
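The new section 2.7 above pairs a per-frame global average pool with a learnable fc layer over the frame dimension, then rescales each frame before the temporal shift described in section 2.3. Below is a minimal sketch of that idea, not the PaddleVideo implementation itself (the actual `GlobalAttention` layer is in the `pptsm_v2.py` diff further down); shapes assume the frame-batched `[N*T, C, H, W]` layout with `T = num_seg`, and the example sizes are arbitrary.

```python
# Minimal sketch of "temporal attention before temporal shift" as described in
# section 2.7; the real module (GlobalAttention) is added in pptsm_v2.py below.
# x is a frame-batched feature map of shape [N*T, C, H, W] with T = num_seg.
import paddle
import paddle.nn.functional as F


def temporal_attention_then_shift(x, fc, num_seg):
    n_t, c, h, w = x.shape
    # 1) global average pool per frame -> one score per frame: [N, T]
    score = x.reshape([-1, num_seg, c * h * w]).mean(axis=2)
    # 2) learnable fc over the T frame scores, squashed to (0, 1): [N, T]
    att = F.sigmoid(fc(score))
    # 3) re-weight each frame by its attention value
    x = x.reshape([-1, num_seg, c, h, w]) * att.reshape([-1, num_seg, 1, 1, 1])
    x = x.reshape([n_t, c, h, w])
    # 4) temporal shift on the re-weighted frames, with the same shift ratio
    #    (1 / num_seg) used in PPTSM_v2_LCNet.forward
    return F.temporal_shift(x, num_seg, 1.0 / num_seg)


# Example sizes (arbitrary): 2 clips x 16 frames, 64-channel 7x7 feature maps
num_seg = 16
fc = paddle.nn.Linear(num_seg, num_seg)
x = paddle.randn([2 * num_seg, 64, 7, 7])
y = temporal_attention_then_shift(x, fc, num_seg)
print(y.shape)  # [32, 64, 7, 7]
```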

paddlevideo/modeling/backbones/pptsm_v2.py

Lines changed: 44 additions & 0 deletions
@@ -53,7 +53,41 @@ def make_divisible(v, divisor=8, min_value=None):
     return new_v
 
 
+class GlobalAttention(nn.Layer):
+    """
+    Lightweight temporal attention module.
+    """
+
+    def __init__(self, num_seg=8):
+        super().__init__()
+        self.fc = nn.Linear(in_features=num_seg,
+                            out_features=num_seg,
+                            weight_attr=ParamAttr(learning_rate=5.0,
+                                                  regularizer=L2Decay(1e-4)),
+                            bias_attr=ParamAttr(learning_rate=10.0,
+                                                regularizer=L2Decay(0.0)))
+        self.num_seg = num_seg
+
+    def forward(self, x):
+        _, C, H, W = x.shape
+        x0 = x
+
+        x = x.reshape([-1, self.num_seg, C * H * W])
+        x = paddle.mean(x, axis=2)  # efficient way of avg_pool
+        x = x.squeeze(axis=-1)
+        x = self.fc(x)
+        attention = F.sigmoid(x)
+        attention = attention.reshape(
+            (-1, self.num_seg, 1, 1, 1))  # for broadcast
+
+        x0 = x0.reshape([-1, self.num_seg, C, H, W])
+        y = paddle.multiply(x0, attention)
+        y = y.reshape_([-1, C, H, W])
+        return y
+
+
 class ConvBNLayer(nn.Layer):
+
     def __init__(self,
                  in_channels,
                  out_channels,
@@ -87,6 +121,7 @@ def forward(self, x):
 
 
 class SEModule(nn.Layer):
+
     def __init__(self, channel, reduction=4):
         super().__init__()
         self.avg_pool = AdaptiveAvgPool2D(1)
@@ -115,6 +150,7 @@ def forward(self, x):
 
 
 class RepDepthwiseSeparable(nn.Layer):
+
     def __init__(self,
                  in_channels,
                  out_channels,
@@ -242,12 +278,14 @@ def _pad_tensor(self, tensor, to_size):
 
 
 class PPTSM_v2_LCNet(nn.Layer):
+
     def __init__(self,
                  scale,
                  depths,
                  class_num=400,
                  dropout_prob=0,
                  num_seg=8,
+                 use_temporal_att=False,
                  pretrained=None,
                  use_last_conv=True,
                  class_expand=1280):
@@ -256,6 +294,7 @@ def __init__(self,
         self.use_last_conv = use_last_conv
         self.class_expand = class_expand
         self.num_seg = num_seg
+        self.use_temporal_att = use_temporal_att
         self.pretrained = pretrained
 
         self.stem = nn.Sequential(*[
@@ -306,6 +345,8 @@ def __init__(self,
         in_features = self.class_expand if self.use_last_conv else NET_CONFIG[
             "stage4"][0] * 2 * scale
         self.fc = Linear(in_features, class_num)
+        if self.use_temporal_att:
+            self.global_attention = GlobalAttention(num_seg=self.num_seg)
 
     def init_weights(self):
         """Initiate the parameters.
@@ -325,6 +366,9 @@ def forward(self, x):
         for stage in self.stages:
             # only add temporal attention and tsm in stage3 for efficiency
             if count == 2:
+                # add temporal attention
+                if self.use_temporal_att:
+                    x = self.global_attention(x)
                 x = F.temporal_shift(x, self.num_seg, 1.0 / self.num_seg)
             count += 1
             x = stage(x)
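As a quick sanity check of the new layer (a sketch, assuming this branch of PaddleVideo and PaddlePaddle are installed and importable; otherwise copy the `GlobalAttention` class from the diff above into a local file), the module preserves the `[N*T, C, H, W]` shape of its input, which is why it can be dropped in front of `F.temporal_shift` in stage 3 without changing anything else in the backbone:

```python
# Sketch: exercising the new GlobalAttention layer on a dummy clip. Assumes
# this branch of PaddleVideo is importable; otherwise copy the class above
# into a local module. Sizes are arbitrary.
import paddle
from paddlevideo.modeling.backbones.pptsm_v2 import GlobalAttention

num_seg = 16
att = GlobalAttention(num_seg=num_seg)

# 2 clips x 16 frames of 64-channel 7x7 feature maps, batched as [N*T, C, H, W]
x = paddle.randn([2 * num_seg, 64, 7, 7])
y = att(x)

# Same shape in, same shape out: only the per-frame scale changes, so the layer
# can sit directly in front of F.temporal_shift inside stage 3.
print(y.shape)  # [32, 64, 7, 7]
```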
