
Commit 48e7afe

Merge pull request #520 from huangjun12/pptsmv2-addatt-0830
add attention module for PP-TSMv2
2 parents 9b75aca + 05c34cc commit 48e7afe

File tree (5 files changed, +66 -9 lines changed)

configs/recognition/pptsm/v2/pptsm_lcnet_k400_16frames_uniform_dml_distillation.yaml
docs/zh-CN/benchmark.md
docs/zh-CN/model_zoo/recognition/pp-tsm.md
docs/zh-CN/model_zoo/recognition/pp-tsm_v2.md
paddlevideo/modeling/backbones/pptsm_v2.py

configs/recognition/pptsm/v2/pptsm_lcnet_k400_16frames_uniform_dml_distillation.yaml

Lines changed: 1 addition & 0 deletions
@@ -22,6 +22,7 @@ MODEL: #MODEL field
       name: "PPTSM_v2" #Mandatory, The name of backbone.
       pretrained: "data/PPLCNetV2_base_ssld_pretrained.pdparams" #Optional, pretrained model path.
       num_seg: 16
+      use_temporal_att: True
       class_num: 400
     head:
       name: "MoViNetHead" #Mandatory, indicate the type of head, associate to the 'paddlevideo/modeling/heads'

docs/zh-CN/benchmark.md

Lines changed: 2 additions & 2 deletions
@@ -73,7 +73,7 @@ mean fps: 25
 | TSM | R50 | [tsm_k400_frames.yaml](../../configs/recognition/tsm/tsm_k400_frames.yaml) | 71.06 | 52.02 | 9.87 | 61.89 |
 |**PP-TSM** | R50 | [pptsm_k400_frames_uniform.yaml](../../configs/recognition/pptsm/pptsm_k400_frames_uniform.yaml) | **75.11** | 51.84 | 11.26 | **63.1** |
 |PP-TSM | R101 | [pptsm_k400_frames_dense_r101.yaml](../../configs/recognition/pptsm/pptsm_k400_frames_dense_r101.yaml) | 76.35| 52.1 | 17.91 | 70.01 |
-| PP-TSMv2 | PP-LCNet_v2.16f | [pptsm_lcnet_k400_16frames_uniform.yaml](../../configs/recognition/pptsm/v2/pptsm_lcnet_k400_16frames_uniform.yaml) | 74.38 | 69.4 | 7.26 | 76.66 |
+| PP-TSMv2 | PP-LCNet_v2.16f | [pptsm_lcnet_k400_16frames_uniform.yaml](../../configs/recognition/pptsm/v2/pptsm_lcnet_k400_16frames_uniform.yaml) | 74.38 | 69.4 | 7.55 | 76.95 |
 | SlowFast | 4*16 | [slowfast.yaml](../../configs/recognition/slowfast/slowfast.yaml) | 74.35 | 99.27 | 27.4 | 126.67 |
 | *VideoSwin | B | [videoswin_k400_videos.yaml](../../configs/recognition/videoswin/videoswin_k400_videos.yaml) | 82.4 | 95.65 | 117.22 | 212.88 |
 | MoViNet | A0 | [movinet_k400_frame.yaml](../../configs/recognition/movinet/movinet_k400_frame.yaml) | 66.62 | 150.36 | 47.24 | 197.60 |
@@ -95,7 +95,7 @@ mean fps: 25
 | PP-TSM | MobileNetV2 | [pptsm_mv2_k400_videos_uniform.yaml](../../configs/recognition/pptsm/pptsm_mv2_k400_videos_uniform.yaml) | 68.09 | 52.62 | 137.03 | 189.65 |
 | PP-TSM | MobileNetV3 | [pptsm_mv3_k400_frames_uniform.yaml](../../configs/recognition/pptsm/pptsm_mv3_k400_frames_uniform.yaml) | 69.84| 53.44 | 139.13 | 192.58 |
 | **PP-TSMv2** | PP-LCNet_v2.8f | [pptsm_lcnet_k400_8frames_uniform.yaml](../../configs/recognition/pptsm/v2/pptsm_lcnet_k400_8frames_uniform.yaml) | **72.45**| 53.37 | 189.62 | **242.99** |
-| **PP-TSMv2** | PP-LCNet_v2.16f | [pptsm_lcnet_k400_16frames_uniform.yaml](../../configs/recognition/pptsm/v2/pptsm_lcnet_k400_16frames_uniform.yaml) | **74.38**| 68.07 | 365.23 | **433.31** |
+| **PP-TSMv2** | PP-LCNet_v2.16f | [pptsm_lcnet_k400_16frames_uniform.yaml](../../configs/recognition/pptsm/v2/pptsm_lcnet_k400_16frames_uniform.yaml) | **74.38**| 68.07 | 388.64 | **456.71** |
 | SlowFast | 4*16 | [slowfast.yaml](../../configs/recognition/slowfast/slowfast.yaml) | 74.35 | 110.04 | 1201.36 | 1311.41 |
 | TSM | R50 | [tsm_k400_frames.yaml](../../configs/recognition/tsm/tsm_k400_frames.yaml) | 71.06 | 52.47 | 1302.49 | 1354.96 |
 |PP-TSM | R50 | [pptsm_k400_frames_uniform.yaml](../../configs/recognition/pptsm/pptsm_k400_frames_uniform.yaml) | 75.11 | 52.26 | 1354.21 | 1406.48 |

docs/zh-CN/model_zoo/recognition/pp-tsm.md

Lines changed: 3 additions & 3 deletions
@@ -37,7 +37,7 @@ PP-TSM is optimized on top of the ResNet-50 backbone, covering data augmentation, network structure
 
 ### PP-TSMv2
 
-PP-TSMv2 is a lightweight video classification model built on the CPU-side model [PP-LCNetV2](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.4/docs/zh_CN/models/PP-LCNetV2.md). It is tuned in six areas: backbone and pretrained model selection, data augmentation, network structure adjustment (optimal number and placement of inserted tsm modules, plus a new temporal attention module), input frame count optimization, decoding speed optimization, and dml distillation. Under center-sampling evaluation it reaches 74.38% accuracy, and inference on a 10 s input video takes only 433 ms on CPU. See the [PP-TSMv2 technical report](./pp-tsm_v2.md) for more details.
+PP-TSMv2 is a lightweight video classification model built on the CPU-side model [PP-LCNetV2](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.4/docs/zh_CN/models/PP-LCNetV2.md). It is tuned in seven areas: backbone and pretrained model selection, data augmentation, tsm module tuning, input frame count optimization, decoding speed optimization, dml distillation, and a new temporal attention module. Under center-sampling evaluation it reaches 75.16% accuracy, and inference on a 10 s input video takes only 456 ms on CPU. See the [PP-TSMv2 technical report](./pp-tsm_v2.md) for more details.
 
 
 <a name="2"></a>
@@ -50,7 +50,7 @@ CPU inference speed comparison between PP-TSMv2 and mainstream models (sorted by total prediction time)
 | PP-TSM | MobileNetV2 | 68.09 | 52.62 | 137.03 | 189.65 |
 | PP-TSM | MobileNetV3 | 69.84| 53.44 | 139.13 | 192.58 |
 | **PP-TSMv2** | PP-LCNet_v2.8f | **72.45**| 53.37 | 189.62 | **242.99** |
-| **PP-TSMv2** | PP-LCNet_v2.16f | **74.38**| 68.07 | 365.23 | **433.31** |
+| **PP-TSMv2** | PP-LCNet_v2.16f | **75.16**| 68.07 | 388.64 | **456.71** |
 | SlowFast | 4*16 |74.35 | 110.04 | 1201.36 | 1311.41 |
 | TSM | R50 | 71.06 | 52.47 | 1302.49 | 1354.96 |
 |PP-TSM | R50 | 75.11 | 52.26 | 1354.21 | 1406.48 |
@@ -271,7 +271,7 @@ PaddleVideo provides Paddle2ONNX-based conversion of inference models to ONNX
 | Model | Backbone | Distillation | Test mode | Sampled frames | Top-1% | Trained model |
 | :------: | :----------: | :----: | :----: | :----: | :---- | :---- |
 | PP-TSMv2 | LCNet_v2 | DML | Uniform | 8 | 72.45 | [Download](https://videotag.bj.bcebos.com/PaddleVideo-release2.3/PPTSMv2_k400_8f_dml.pdparams) \| [Student model](https://videotag.bj.bcebos.com/PaddleVideo-release2.3/PPTSMv2_k400_8f_dml_student.pdparams) |
-| PP-TSMv2 | LCNet_v2 | DML | Uniform | 16 | 74.38 | [Download](https://videotag.bj.bcebos.com/PaddleVideo-release2.3/PPTSMv2_k400_16f_dml.pdparams) \| [Student model](https://videotag.bj.bcebos.com/PaddleVideo-release2.3/PPTSMv2_k400_16f_dml_student.pdparams) |
+| PP-TSMv2 | LCNet_v2 | DML | Uniform | 16 | 75.16 | [Download](https://videotag.bj.bcebos.com/PaddleVideo-release2.3/PPTSMv2_k400_16f_dml.pdparams) \| [Student model](https://videotag.bj.bcebos.com/PaddleVideo-release2.3/PPTSMv2_k400_16f_dml_student.pdparams) |
 | PP-TSM | ResNet50 | KD | Uniform | 8 | 75.11 | [Download](https://videotag.bj.bcebos.com/PaddleVideo-release2.1/PPTSM/ppTSM_k400_uniform_distill.pdparams) |
 | PP-TSM | ResNet50 | KD | Dense | 8 | 76.16 | [Download](https://videotag.bj.bcebos.com/PaddleVideo-release2.1/PPTSM/ppTSM_k400_dense_distill.pdparams) |
 | PP-TSM | ResNet101 | KD | Uniform | 8 | 76.35 | [Download](https://videotag.bj.bcebos.com/PaddleVideo-release2.2/ppTSM_k400_uniform_distill_r101.pdparams) |

docs/zh-CN/model_zoo/recognition/pp-tsm_v2.md

Lines changed: 16 additions & 4 deletions
@@ -6,10 +6,11 @@
 - [2. Model details](#2)
 - [2.1 Backbone and pretrained model selection](#21)
 - [2.2 Data augmentation](#22)
-- [2.3 Network structure adjustment](#23)
+- [2.3 tsm module tuning](#23)
 - [2.4 Input frame count optimization](#24)
 - [2.5 Decoding speed optimization](#25)
 - [2.6 DML distillation](#26)
+- [2.7 New temporal attention module](#27)
 - [3. Quick start](#3)
 - [4. Model training, compression, and inference deployment](#4)
 
@@ -19,7 +20,7 @@
 
 A video classification task takes a video as input and outputs a category label. When the labels are all action categories, the task is also called action recognition. As AI is adopted across industries, the demand for lightweight action recognition models in industrial and sports scenarios keeps growing, so we propose PP-TSMv2, an efficient and lightweight action recognition model.
 
-PP-TSMv2 inherits part of PP-TSM's optimization strategy and tunes the model in six areas: backbone and pretrained model selection, data augmentation, network structure adjustment, input frame count optimization, decoding speed optimization, and dml distillation. Under center-sampling evaluation it reaches 74.38% accuracy, and inference on a 10 s input video takes only 433 ms on CPU.
+PP-TSMv2 inherits part of PP-TSM's optimization strategy and tunes the model in seven areas: backbone and pretrained model selection, data augmentation, tsm module tuning, input frame count optimization, decoding speed optimization, dml distillation, and a new temporal attention module. Under center-sampling evaluation it reaches 75.16% accuracy, and inference on a 10 s input video takes only 456 ms on CPU.
 
 
 <a name="2"></a>
@@ -54,7 +55,7 @@ PP-TSMv2 inherits part of PP-TSM's optimization strategy, tuning backbone and pretrained model
 | baseline + VideoMix | 69.36(+**0.3**) |
 
 <a name="23"></a>
-### 2.3 Network structure adjustment
+### 2.3 tsm module tuning
 
 On top of the backbone, we add temporal shift modules to extract temporal information. Regarding insertion position, the original TSM paper inserts the temporal_shift module inside residual blocks, but PP-LCNetV2 removes some residual connections to speed up the model. PP-LCNetV2 consists of 4 stages overall, and we experimented to find the best insertion position for the temporal shift module. Regarding insertion count, temporal_shift modules increase the model's runtime, so we explored the optimal number to insert; the results are shown in the table below.
 
@@ -164,6 +165,16 @@ PP-TSMv2 inherits part of PP-TSM's optimization strategy, tuning backbone and pretrained model
 | DML | PP-TSM_ResNet50 | 71.27%(**+2.20%**) |
 
 
+<a name="27"></a>
+### 2.7 New temporal attention module
+
+The temporal shift module captures temporal information by shifting features along the temporal dimension, but this kind of shift only lets local features interact and lacks the ability to model global temporal information. We therefore propose a lightweight temporal attention module: global pooling combined with a learnable fc layer produces temporal attention at the global scale. The temporal attention module is placed before the tsm module, so the network performs the temporal shift under the guidance of global information. This lightweight temporal attention module further improves accuracy with essentially no increase in inference time.
+
+| Strategy | Top-1 Acc(\%) |
+|:--:|:--:|
+| pptsmv2 without temporal attention | 74.38 |
+| pptsmv2 with temporal attention | 75.16(+**0.78**) |
+
 <a name="3"></a>
 ## 3. Quick start
 
@@ -172,4 +183,5 @@ PP-TSMv2 inherits part of PP-TSM's optimization strategy, tuning backbone and pretrained model
 <a name="4"></a>
 ## 4. Model training, compression, and inference deployment
 
-For more tutorials, including model training, model compression, and inference deployment, please refer to the [PP-TSM documentation](./pp-tsm.md)
+
+For more tutorials, including model training, model compression, and inference deployment, please refer to the [usage documentation](./pp-tsm.md)
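The new section 2.7 above pairs a per-frame global average pool with a learnable fc layer over the frame dimension, then rescales each frame before the temporal shift described in section 2.3. Below is a minimal sketch of that idea, not the PaddleVideo implementation itself (the actual `GlobalAttention` layer is in the `pptsm_v2.py` diff further down); shapes assume the frame-batched `[N*T, C, H, W]` layout with `T = num_seg`, and the example sizes are arbitrary.

```python
# Minimal sketch of "temporal attention before temporal shift" as described in
# section 2.7; the real module (GlobalAttention) is added in pptsm_v2.py below.
# x is a frame-batched feature map of shape [N*T, C, H, W] with T = num_seg.
import paddle
import paddle.nn.functional as F


def temporal_attention_then_shift(x, fc, num_seg):
    n_t, c, h, w = x.shape
    # 1) global average pool per frame -> one score per frame: [N, T]
    score = x.reshape([-1, num_seg, c * h * w]).mean(axis=2)
    # 2) learnable fc over the T frame scores, squashed to (0, 1): [N, T]
    att = F.sigmoid(fc(score))
    # 3) re-weight each frame by its attention value
    x = x.reshape([-1, num_seg, c, h, w]) * att.reshape([-1, num_seg, 1, 1, 1])
    x = x.reshape([n_t, c, h, w])
    # 4) temporal shift on the re-weighted frames, with the same shift ratio
    #    (1 / num_seg) used in PPTSM_v2_LCNet.forward
    return F.temporal_shift(x, num_seg, 1.0 / num_seg)


# Example sizes (arbitrary): 2 clips x 16 frames, 64-channel 7x7 feature maps
num_seg = 16
fc = paddle.nn.Linear(num_seg, num_seg)
x = paddle.randn([2 * num_seg, 64, 7, 7])
y = temporal_attention_then_shift(x, fc, num_seg)
print(y.shape)  # [32, 64, 7, 7]
```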

paddlevideo/modeling/backbones/pptsm_v2.py

Lines changed: 44 additions & 0 deletions
@@ -53,7 +53,41 @@ def make_divisible(v, divisor=8, min_value=None):
     return new_v
 
 
+class GlobalAttention(nn.Layer):
+    """
+    Lightweight temporal attention module.
+    """
+
+    def __init__(self, num_seg=8):
+        super().__init__()
+        self.fc = nn.Linear(in_features=num_seg,
+                            out_features=num_seg,
+                            weight_attr=ParamAttr(learning_rate=5.0,
+                                                  regularizer=L2Decay(1e-4)),
+                            bias_attr=ParamAttr(learning_rate=10.0,
+                                                regularizer=L2Decay(0.0)))
+        self.num_seg = num_seg
+
+    def forward(self, x):
+        _, C, H, W = x.shape
+        x0 = x
+
+        x = x.reshape([-1, self.num_seg, C * H * W])
+        x = paddle.mean(x, axis=2)  # efficient way of avg_pool
+        x = x.squeeze(axis=-1)
+        x = self.fc(x)
+        attention = F.sigmoid(x)
+        attention = attention.reshape(
+            (-1, self.num_seg, 1, 1, 1))  # for broadcast
+
+        x0 = x0.reshape([-1, self.num_seg, C, H, W])
+        y = paddle.multiply(x0, attention)
+        y = y.reshape_([-1, C, H, W])
+        return y
+
+
 class ConvBNLayer(nn.Layer):
+
     def __init__(self,
                  in_channels,
                  out_channels,
@@ -87,6 +121,7 @@ def forward(self, x):
 
 
 class SEModule(nn.Layer):
+
     def __init__(self, channel, reduction=4):
         super().__init__()
         self.avg_pool = AdaptiveAvgPool2D(1)
@@ -115,6 +150,7 @@ def forward(self, x):
 
 
 class RepDepthwiseSeparable(nn.Layer):
+
     def __init__(self,
                  in_channels,
                  out_channels,
@@ -242,12 +278,14 @@ def _pad_tensor(self, tensor, to_size):
 
 
 class PPTSM_v2_LCNet(nn.Layer):
+
     def __init__(self,
                  scale,
                  depths,
                  class_num=400,
                  dropout_prob=0,
                  num_seg=8,
+                 use_temporal_att=False,
                  pretrained=None,
                  use_last_conv=True,
                  class_expand=1280):
@@ -256,6 +294,7 @@ def __init__(self,
         self.use_last_conv = use_last_conv
         self.class_expand = class_expand
         self.num_seg = num_seg
+        self.use_temporal_att = use_temporal_att
         self.pretrained = pretrained
 
         self.stem = nn.Sequential(*[
@@ -306,6 +345,8 @@ def __init__(self,
         in_features = self.class_expand if self.use_last_conv else NET_CONFIG[
             "stage4"][0] * 2 * scale
         self.fc = Linear(in_features, class_num)
+        if self.use_temporal_att:
+            self.global_attention = GlobalAttention(num_seg=self.num_seg)
 
     def init_weights(self):
         """Initiate the parameters.
@@ -325,6 +366,9 @@ def forward(self, x):
         for stage in self.stages:
             # only add temporal attention and tsm in stage3 for efficiency
             if count == 2:
+                # add temporal attention
+                if self.use_temporal_att:
+                    x = self.global_attention(x)
                 x = F.temporal_shift(x, self.num_seg, 1.0 / self.num_seg)
             count += 1
             x = stage(x)
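As a quick sanity check of the new layer (a sketch, assuming this branch of PaddleVideo and PaddlePaddle are installed and importable; otherwise copy the `GlobalAttention` class from the diff above into a local file), the module preserves the `[N*T, C, H, W]` shape of its input, which is why it can be dropped in front of `F.temporal_shift` in stage 3 without changing anything else in the backbone:

```python
# Sketch: exercising the new GlobalAttention layer on a dummy clip. Assumes
# this branch of PaddleVideo is importable; otherwise copy the class above
# into a local module. Sizes are arbitrary.
import paddle
from paddlevideo.modeling.backbones.pptsm_v2 import GlobalAttention

num_seg = 16
att = GlobalAttention(num_seg=num_seg)

# 2 clips x 16 frames of 64-channel 7x7 feature maps, batched as [N*T, C, H, W]
x = paddle.randn([2 * num_seg, 64, 7, 7])
y = att(x)

# Same shape in, same shape out: only the per-frame scale changes, so the layer
# can sit directly in front of F.temporal_shift inside stage 3.
print(y.shape)  # [32, 64, 7, 7]
```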
