Commit 77558c6

committed
commit fat
1 parent 2f30632 commit 77558c6

File tree

15 files changed

+777
-0
lines changed

README.md

Lines changed: 1 addition & 0 deletions
@@ -132,6 +132,7 @@ python -u tools/static_trainer.py -m models/rank/dnn/config.yaml # 静态图训
 | Rank | [Dnn](models/rank/dnn/)([doc](https://paddlerec.readthedocs.io/en/latest/models/rank/dnn.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3240347) ||||| >=2.1.0 | / |
 | Rank | [FM](models/rank/fm/)([doc](https://paddlerec.readthedocs.io/en/latest/models/rank/fm.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3240371) |||| x | >=2.1.0 | [IEEE Data Mining 2010][Factorization machines](https://analyticsconsultores.com.mx/wp-content/uploads/2019/03/Factorization-Machines-Steffen-Rendle-Osaka-University-2010.pdf) |
 | Rank | [BERT4REC](models/rank/bert4rec/) | - |||| x | >=2.1.0 | [CIKM 2019][BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer](https://arxiv.org/pdf/1904.06690.pdf) |
+| Rank | [FAT_DeepFFM](models/rank/fat_deepffm/) | - |||| x | >=2.1.0 | [2019][FAT-DeepFFM: Field Attentive Deep Field-aware Factorization Machine](https://arxiv.org/pdf/1905.06336.pdf) |
 | Rank | [FFM](models/rank/ffm/)([doc](https://paddlerec.readthedocs.io/en/latest/models/rank/ffm.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3240369) |||| x | >=2.1.0 | [RECSYS 2016][Field-aware Factorization Machines for CTR Prediction](https://dl.acm.org/doi/pdf/10.1145/2959100.2959134) |
 | Rank | [FNN](https://github.com/PaddlePaddle/PaddleRec/tree/release/1.8.5/models/rank/fnn/) | - |||| x | [1.8.5](https://github.com/PaddlePaddle/PaddleRec/tree/release/1.8.5) | [ECIR 2016][Deep Learning over Multi-field Categorical Data](https://arxiv.org/pdf/1601.02376.pdf) |
 | Rank | [Deep Crossing](https://github.com/PaddlePaddle/PaddleRec/tree/release/1.8.5/models/rank/deep_crossing/) | - |||| x | [1.8.5](https://github.com/PaddlePaddle/PaddleRec/tree/release/1.8.5) | [ACM 2016][Deep Crossing: Web-Scale Modeling without Manually Crafted Combinatorial Features](https://www.kdd.org/kdd2016/papers/files/adf0975-shanA.pdf) |

README_EN.md

Lines changed: 1 addition & 0 deletions
@@ -119,6 +119,7 @@ python -u tools/static_trainer.py -m models/rank/dnn/config.yaml # Training wit
 | Rank | [Dnn](models/rank/dnn/)([doc](https://paddlerec.readthedocs.io/en/latest/models/rank/dnn.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3240347) ||||| >=2.1.0 | / |
 | Rank | [FM](models/rank/fm/)([doc](https://paddlerec.readthedocs.io/en/latest/models/rank/fm.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3240371) |||| x | >=2.1.0 | [IEEE Data Mining 2010][Factorization machines](https://analyticsconsultores.com.mx/wp-content/uploads/2019/03/Factorization-Machines-Steffen-Rendle-Osaka-University-2010.pdf) |
 | Rank | [BERT4REC](models/rank/bert4rec/) | - |||| x | >=2.1.0 | [CIKM 2019][BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer](https://arxiv.org/pdf/1904.06690.pdf) |
+| Rank | [FAT_DeepFFM](models/rank/fat_deepffm/) | - |||| x | >=2.1.0 | [2019][FAT-DeepFFM: Field Attentive Deep Field-aware Factorization Machine](https://arxiv.org/pdf/1905.06336.pdf) |
 | Rank | [FFM](models/rank/ffm/)([doc](https://paddlerec.readthedocs.io/en/latest/models/rank/ffm.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3240369) |||| x | >=2.1.0 | [RECSYS 2016][Field-aware Factorization Machines for CTR Prediction](https://dl.acm.org/doi/pdf/10.1145/2959100.2959134) |
 | Rank | [FNN](https://github.com/PaddlePaddle/PaddleRec/tree/release/1.8.5/models/rank/fnn/) | - |||| x | [1.8.5](https://github.com/PaddlePaddle/PaddleRec/tree/release/1.8.5) | [ECIR 2016][Deep Learning over Multi-field Categorical Data](https://arxiv.org/pdf/1601.02376.pdf) |
 | Rank | [Deep Crossing](https://github.com/PaddlePaddle/PaddleRec/tree/release/1.8.5/models/rank/deep_crossing/) | - |||| x | [1.8.5](https://github.com/PaddlePaddle/PaddleRec/tree/release/1.8.5) | [ACM 2016][Deep Crossing: Web-Scale Modeling without Manually Crafted Combinatorial Features](https://www.kdd.org/kdd2016/papers/files/adf0975-shanA.pdf) |

models/rank/fat_deepffm/README.md

Lines changed: 140 additions & 0 deletions
@@ -0,0 +1,140 @@
# Click-through rate prediction with the FAT_DeepFFM model

The directory structure of this example, with brief descriptions:

```
├── data # sample data
    ├── sample_data # sample data
        ├── train
            ├── sample_train.txt # sample training data
├── __init__.py
├── README.md # documentation
├── config.yaml # configuration for the sample data
├── config_bigdata.yaml # configuration for the full dataset
├── net.py # core model network (shared by dynamic and static graphs)
├── criteo_reader.py # data reader
├── dygraph_model.py # builds the dynamic graph
```
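
Note: before reading this example, we recommend going through the following:

[PaddleRec getting-started tutorial](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md)

## Contents

- [Model introduction](#model-introduction)
- [Data preparation](#data-preparation)
- [Environment](#environment)
- [Quick start](#quick-start)
- [Model network](#model-network)
- [Reproducing results](#reproducing-results)
- [Advanced usage](#advanced-usage)
- [FAQ](#FAQ)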
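
## Model introduction
`CTR (Click-Through Rate)` is a key metric in fields such as recommender systems and computational advertising, and predicting it underlies decisions such as product recommendation and ad placement. Put simply, CTR prediction estimates, for each ad impression, whether the user will click or not. A CTR prediction model weighs many factors and features, is trained on large amounts of historical data, and ultimately supports business decisions. This model implements the FAT_DeepFFM model from the following paper:

```text
@article{FAT-DeepFFM2019,
  title={FAT-DeepFFM: Field Attentive Deep Field-aware Factorization Machine},
  author={Junlin Zhang and Tongwen Huang and Zhiqi Zhang},
  journal={arXiv preprint arXiv:1905.06336},
  year={2019},
  url={https://arxiv.org/pdf/1905.06336},
}
```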
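
## Data preparation

The training and test data come from the Criteo dataset used in the [Display Advertising Challenge](https://www.kaggle.com/c/criteo-display-ad-challenge/). The dataset has two parts: a training set and a test set. The training set contains part of Criteo's traffic over a period of time, and the test set corresponds to the ad click traffic of the day following the training data.
Each line has the following format:
```
<label> <integer feature 1> ... <integer feature 13> <categorical feature 1> ... <categorical feature 26>
```
Here ```<label>``` indicates whether the ad was clicked: 1 for clicked, 0 for not clicked. ```<integer feature>``` denotes a numerical (continuous) feature; there are 13 of them. ```<categorical feature>``` denotes a categorical (discrete) feature; there are 26 of them. Adjacent features are separated by ```\t```, and a missing feature is represented by a blank. The ```<label>``` field has been removed from the test set.
Sample data for a quick run is provided in the data directory under the model directory. To use the full dataset, see the [Reproducing results](#reproducing-results) section below.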
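
As a rough illustration of the raw line format described above, the following sketch splits one line into its label, 13 integer features, and 26 categorical features. The function name `parse_criteo_line` and the choice to map missing values to `None` are assumptions made for this example; they are not taken from the repository's preprocessing scripts.

```python
NUM_INT = 13   # integer (continuous) features per line
NUM_CAT = 26   # categorical (discrete) features per line


def parse_criteo_line(line):
    """Split one raw training line into (label, ints, cats).

    Hypothetical helper: missing values are returned as None so the
    caller can decide how to fill them.
    """
    fields = line.rstrip("\n").split("\t")
    expected = 1 + NUM_INT + NUM_CAT
    if len(fields) != expected:
        raise ValueError("expected %d fields, got %d" % (expected, len(fields)))
    label = int(fields[0])
    ints = [int(f) if f.strip() else None for f in fields[1:1 + NUM_INT]]
    cats = [f if f.strip() else None for f in fields[1 + NUM_INT:]]
    return label, ints, cats


# A made-up line: the second integer and second categorical feature are missing.
sample = "\t".join(
    ["1"] + ["3", "", "7"] + ["0"] * 10 + ["68fd1e64", ""] + ["a1b2c3d4"] * 24
)
label, ints, cats = parse_criteo_line(sample)
```

Note that the reader shipped with this example appears to consume an already preprocessed `slot:value` format rather than these raw lines, so a helper like this would only matter when working from the original Kaggle files.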

## Environment
PaddlePaddle>=2.0

python 2.7/3.5/3.6/3.7

os : windows/linux/macos

## Quick start
This example ships with sample data for a quick trial and can be run from any directory. From the fat_deepffm model directory, the quick-start commands are:
```bash
# Enter the model directory
# cd models/rank/fat_deepffm # can be run from any directory
# Dynamic-graph training
python -u ../../../tools/trainer.py -m config.yaml # use config_bigdata.yaml for the full dataset
# Dynamic-graph inference
python -u ../../../tools/infer.py -m config.yaml

# Static-graph training
python -u ../../../tools/static_trainer.py -m config.yaml # use config_bigdata.yaml for the full dataset
# Static-graph inference
python -u ../../../tools/static_infer.py -m config.yaml
```
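
## Model network

See `net.py` for the FAT_DeepFFM network code. The model consists mainly of an Embedding layer, a CENet layer, a DeepFFM feature-interaction layer, and a DNN layer, plus the loss and AUC computation for the classification task. The model architecture is shown below:

<img align="center" src="picture/11.jpg" width="400" height="300">


### **CENet layer**

The inputs to the FAT_DeepFFM model are mainly sparse categorical features. (Dense numerical features are projected to a higher dimension and concatenated with the sparse categorical features.)
The sparse features are looked up in the embedding layer to obtain their embedding vectors. CENet is then used to explicitly model the dependencies between features. The CENet structure is shown below:

<img align="center" src="picture/2.jpg" width="400" height="300">

Following the structure in the figure, CENet's attention mechanism selectively highlights informative features and suppresses less useful ones, as given by the formula below:

<img align="center" src="picture/3.jpg" width="400" height="60">
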
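
To make the idea of attention-based re-weighting more concrete, here is a minimal sketch in the squeeze-and-excitation style: each embedding vector is reduced to one scalar, a small two-layer network turns those scalars into attention weights, and each embedding is scaled by its weight. This is an illustration under stated assumptions, not the code in `net.py`: the mean pooling used as the descriptor, the ReLU activations, and the names `field_attention`, `w1`, and `w2` are all hypothetical, and the exact formula in the figure above may differ.

```python
def relu(x):
    return max(0.0, x)


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def field_attention(embeddings, w1, w2):
    """Re-weight embedding vectors with a small attention network.

    embeddings: list of m vectors, each of length k
    w1: (hidden x m) weights of the first fully connected layer (assumed)
    w2: (m x hidden) weights of the second fully connected layer (assumed)
    """
    # 1. Summarize every embedding vector as one scalar descriptor.
    #    Mean pooling is a stand-in chosen for this sketch.
    descriptors = [sum(e) / len(e) for e in embeddings]
    # 2. Two fully connected layers produce one attention weight per vector.
    hidden = [relu(h) for h in matvec(w1, descriptors)]
    weights = [relu(a) for a in matvec(w2, hidden)]
    # 3. Scale each embedding vector by its attention weight.
    reweighted = [[a * x for x in e] for a, e in zip(weights, embeddings)]
    return weights, reweighted


embeddings = [[1.0, 3.0], [2.0, 2.0], [0.0, 4.0]]   # 3 vectors, k = 2
w1 = [[1.0, 0.0, 0.0], [0.0, 1.0, -1.0]]            # 3 -> 2
w2 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]           # 2 -> 3
weights, reweighted = field_attention(embeddings, w1, w2)
```

With these toy weights the second vector receives weight 0 and is suppressed entirely, while the first is amplified, which is the highlight-and-suppress behavior the text describes.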
### **DeepFFM layer**
The DeepFFM structure is shown below:

<img align="center" src="picture/4.jpg" width="400" height="300">

FFM is used to model the relationships between the different fields of the features. The formula is as follows:

<img align="center" src="picture/55.jpg" width="500" height="100">

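
For readers unfamiliar with field-aware interactions, the sketch below computes the standard second-order FFM term, in which feature `i` uses a different latent vector for every field it interacts with. It assumes exactly one active feature per field and sums the pairwise inner products into a scalar; the function name `ffm_interactions` and the toy numbers are made up for this example. A deep variant would typically pass the pairwise interaction terms on to a DNN rather than summing them, so treat this as background, not as the model's exact computation.

```python
def ffm_interactions(v, x):
    """Second-order FFM term: sum over i < j of <v[i][j], v[j][i]> * x[i] * x[j].

    v[i][j] is the latent vector that field i's feature uses when it
    interacts with field j; x[i] is the value of field i's feature.
    """
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(v[i][j], v[j][i]))
            total += dot * x[i] * x[j]
    return total


# Three fields, latent dimension 2. The diagonal entries v[i][i] are unused.
v = [
    [[0.0, 0.0], [1.0, 2.0], [3.0, 0.0]],
    [[1.0, 1.0], [0.0, 0.0], [0.0, 1.0]],
    [[2.0, 1.0], [4.0, 5.0], [0.0, 0.0]],
]
x = [1.0, 1.0, 2.0]
score = ffm_interactions(v, x)
```

Here the three pairs contribute 3, 12, and 10 respectively, so `score` is 25.0.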
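
### **Loss and AUC computation**
- To obtain, for each sample, the probability of belonging to the positive and the negative class, we concatenate the prediction with `1-predict` to get `predict_2d`, which is then used to compute `auc`.
- The per-sample loss is the negative log loss; the label is cast to float before being fed in.
- The batch loss `avg_cost` is the sum of the per-sample losses.
- We also compute the AUC of the predictions.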
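
The steps listed above can be sketched in plain Python as follows. This is only an illustration: the column order of `predict_2d`, the small `eps` added inside the logarithms, and the pairwise AUC computation are assumptions for this example and are not taken from the model code. Both the sum and the mean of the per-sample losses are shown, because the text describes `avg_cost` as a sum while its name suggests an average.

```python
import math


def log_loss(p, y, eps=1e-4):
    """Negative log loss for one sample; eps guards against log(0)."""
    return -y * math.log(p + eps) - (1.0 - y) * math.log(1.0 - p + eps)


def auc(preds, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [p for p, y in zip(preds, labels) if y == 1]
    neg = [p for p, y in zip(preds, labels) if y == 0]
    wins = sum(
        1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg
    )
    return wins / (len(pos) * len(neg))


preds = [0.9, 0.2, 0.6, 0.4]   # predicted click probabilities (made up)
labels = [1, 0, 0, 1]

# Two-column view: probability of "not clicked" and of "clicked".
predict_2d = [[1.0 - p, p] for p in preds]

# Labels are cast to float before entering the loss.
losses = [log_loss(p, float(y)) for p, y in zip(preds, labels)]
batch_sum = sum(losses)
batch_mean = batch_sum / len(losses)
batch_auc = auc(preds, labels)
```

In this toy batch three of the four positive-negative pairs are ordered correctly, so `batch_auc` is 0.75.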

## Reproducing results
To help users get each model running quickly, we provide sample data with every model. To reproduce the results reported in this README, follow the steps below in order.
Metrics on the full dataset:

| Model | auc | batch_size | epoch_num | Time of each epoch |
| :------| :------ | :------ | :------| :------ |
| FAT_DeepFFM | 0.8037 | 1000 | 1 | about 3.5 hours |

1. Make sure your current directory is `PaddleRec/models/rank/fat_deepffm`.
2. Go to the `PaddleRec/datasets/criteo` directory and run the script there. It downloads our preprocessed full Criteo dataset from a server in mainland China and extracts it to the designated folder.
``` bash
cd ../../../datasets/criteo
sh run.sh
```
3. Switch back to the model directory and run on the full dataset.
```bash
cd - # switch back to the model directory
# Dynamic-graph training
python -u ../../../tools/trainer.py -m config_bigdata.yaml # use config_bigdata.yaml for the full dataset
python -u ../../../tools/infer.py -m config_bigdata.yaml # use config_bigdata.yaml for the full dataset
```

## Advanced usage

## FAQ

models/rank/fat_deepffm/__init__.py

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

models/rank/fat_deepffm/config.yaml

Lines changed: 56 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,56 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# workspace
#workspace: "models/rank/fat_deepffm"


runner:
  train_data_dir: "data/sample_data/train"
  train_reader_path: "criteo_reader" # importlib format
  use_gpu: False
  use_auc: True
  train_batch_size: 1
  epochs: 1
  print_interval: 100

  model_save_path: "output_model_fat_deepffm"
  infer_batch_size: 1000
  infer_reader_path: "criteo_reader" # importlib format
  test_data_dir: "data/sample_data/train"

  infer_load_path: "output_model_fat_deepffm"
  infer_start_epoch: 0
  infer_end_epoch: 1

  # distribute_config
  sync_mode: "async"
  split_file_list: False
  thread_num: 1 # 1


# hyper parameters of user-defined network
hyper_parameters:
  # optimizer config
  optimizer:
    class: Adam
    learning_rate: 0.0001
    strategy: async
  # user-defined <key, value> pairs
  sparse_inputs_slots: 27
  sparse_feature_number: 1000001
  sparse_feature_dim: 10
  dense_input_dim: 13
  distributed_embedding: 0
  layer_sizes_dnn: [1600,1600]
Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# workspace
#workspace: "models/rank/fat_deepffm"


runner:
  train_data_dir: "../../../datasets/criteo/slot_train_data_full" # criteo_sample_train slot_train_data_full
  train_reader_path: "criteo_reader" # importlib format
  use_gpu: True
  use_auc: True
  train_batch_size: 1000
  epochs: 1
  print_interval: 100

  model_save_path: "output_model_all_fat_deepffm"
  infer_batch_size: 1000
  infer_reader_path: "criteo_reader" # importlib format
  test_data_dir: "../../../datasets/criteo/slot_test_data_full"

  infer_load_path: "output_model_fat_deepffm"
  infer_start_epoch: 0
  infer_end_epoch: 1

  # distribute_config
  sync_mode: "async"
  split_file_list: False
  thread_num: 1 # 1


# hyper parameters of user-defined network
hyper_parameters:
  # optimizer config
  optimizer:
    class: Adam
    learning_rate: 0.0001
    strategy: async
  # user-defined <key, value> pairs
  sparse_inputs_slots: 27
  sparse_feature_number: 1000001
  sparse_feature_dim: 10
  dense_input_dim: 13
  distributed_embedding: 0
  layer_sizes_dnn: [1600,1600]
Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from __future__ import print_function
import numpy as np

from paddle.io import IterableDataset


class RecDataset(IterableDataset):
    def __init__(self, file_list, config):
        super(RecDataset, self).__init__()
        self.file_list = file_list
        self.init()

    def init(self):
        from operator import mul
        padding = 0
        sparse_slots = "click 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26"
        self.sparse_slots = sparse_slots.strip().split(" ")
        self.dense_slots = ["dense_feature"]
        self.dense_slots_shape = [13]
        self.slots = self.sparse_slots + self.dense_slots
        self.slot2index = {}
        self.visit = {}
        for i in range(len(self.slots)):
            self.slot2index[self.slots[i]] = i
            self.visit[self.slots[i]] = False
        self.padding = padding

    def __iter__(self):
        full_lines = []
        self.data = []
        for file in self.file_list:
            with open(file, "r") as rf:
                for l in rf:
                    line = l.strip().split(" ")
                    output = [(i, []) for i in self.slots]
                    for i in line:
                        slot_feasign = i.split(":")
                        slot = slot_feasign[0]
                        if slot not in self.slots:
                            continue
                        if slot in self.sparse_slots:
                            feasign = int(slot_feasign[1])
                        else:
                            feasign = float(slot_feasign[1])
                        output[self.slot2index[slot]][1].append(feasign)
                        self.visit[slot] = True
                    for i in self.visit:
                        slot = i
                        if not self.visit[slot]:
                            if i in self.dense_slots:
                                output[self.slot2index[i]][1].extend(
                                    [self.padding] *
                                    self.dense_slots_shape[self.slot2index[i]])
                            else:
                                output[self.slot2index[i]][1].extend(
                                    [self.padding])
                        else:
                            self.visit[slot] = False
                    # sparse
                    output_list = []
                    for key, value in output[:-1]:
                        output_list.append(np.array(value).astype('int64'))
                    # dense
                    output_list.append(
                        np.array(output[-1][1]).astype("float32"))
                    # list
                    yield output_list
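
The per-line logic of the reader above can be tried out without Paddle or NumPy. The sketch below is a simplified, standalone restatement, assuming (as the reader code suggests) that each input line is a space-separated list of `slot:value` tokens. The function name `parse_slot_line` and the module-level constants are introduced here for illustration only.

```python
SPARSE_SLOTS = ["click"] + [str(i) for i in range(1, 27)]
DENSE_SLOT = "dense_feature"
DENSE_DIM = 13


def parse_slot_line(line, padding=0):
    """Group the slot:value tokens of one line by slot name.

    Sparse values are parsed as int, dense values as float. Slots that
    do not occur in the line are filled with the padding value.
    """
    values = {slot: [] for slot in SPARSE_SLOTS + [DENSE_SLOT]}
    for token in line.strip().split(" "):
        slot, _, feasign = token.partition(":")
        if slot not in values:
            continue  # unknown slot names are skipped, as in the reader
        if slot == DENSE_SLOT:
            values[slot].append(float(feasign))
        else:
            values[slot].append(int(feasign))
    for slot in SPARSE_SLOTS:
        if not values[slot]:
            values[slot].append(padding)
    if not values[DENSE_SLOT]:
        values[DENSE_SLOT].extend([float(padding)] * DENSE_DIM)
    return values


parsed = parse_slot_line(
    "click:1 1:737395 3:12 99:5 dense_feature:0.5 dense_feature:0.25"
)
no_dense = parse_slot_line("click:0 5:42")
```

One design difference is deliberate: this sketch pads a missing dense slot using the constant `DENSE_DIM`, whereas the reader indexes `dense_slots_shape` with the slot's position in the combined slot list. With a single-element `dense_slots_shape` that lookup looks like it would fail for a line with no `dense_feature` tokens, so it may be worth checking against real data.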
