Commit 46b09dd

2 parents d23d1db + a54cd10 commit 46b09dd

154 files changed: +8714 -269 lines changed

README_CN.md

+64-60
Large diffs are not rendered by default.

README_EN.md

+7-1
@@ -8,6 +8,7 @@

 <h2 align="center">News<img src="./doc/imgs/rec_new_icon.png" width="40"/></h2>

+* [2022/5/18] Add 3 algorithms: [aitm](models/multitask/aitm), [sign](models/rank/sign), [dsin](models/rank/dsin)
 * [2022/3/21] Add a new [paper](./paper) directory with our analysis of the top conference papers on recommender systems from 2021 and a list of industry recommender-system papers for your reference.
 * [2022/3/10] Add 5 algorithms: [DCN_V2](models/rank/dcn_v2), [MHCN](models/recall/mhcn), [FLEN](models/rank/flen), [Dselect_K](models/multitask/dselect_k), [AutoFIS](models/rank/autofis)
 * [2022/1/12] Add an AI Studio [Online running](https://aistudio.baidu.com/aistudio/projectdetail/3240640) function, so you can quickly try our models online on the AI Studio platform.
@@ -159,13 +160,18 @@ python -u tools/static_trainer.py -m models/rank/dnn/config.yaml # Training wit
 | Rank | [FLEN](models/rank/flen/) | - ||| >=2.1.0 | [2019][FLEN: Leveraging Field for Scalable CTR Prediction]( https://arxiv.org/pdf/1911.04690.pdf) |
 | Rank | [DeepRec](models/rank/deeprec/) | - ||| >=2.1.0 | [2017][Training Deep AutoEncoders for Collaborative Filtering](https://arxiv.org/pdf/1708.01715v3.pdf) |
 | Rank | [AutoFIS](models/rank/autofis/) | - ||| >=2.1.0 | [KDD 2020][AutoFIS: Automatic Feature Interaction Selection in Factorization Models for Click-Through Rate Prediction](https://arxiv.org/pdf/2003.11235v3.pdf) |
-| Rank | [DCN_V2](models/rank/dcn_v2/) | - | ✓ | ✓ | >=2.1.0 | [WWW 2021][DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems](https://arxiv.org/pdf/2008.13535v2.pdf)
+| Rank | [DCN_V2](models/rank/dcn_v2/) | - ||| >=2.1.0 | [WWW 2021][DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems](https://arxiv.org/pdf/2008.13535v2.pdf)|
+| Rank | [DSIN](models/rank/dsin/) | - ||| >=2.1.0 | [IJCAI 2019][Deep Session Interest Network for Click-Through Rate Prediction](https://arxiv.org/pdf/1905.06482v1.pdf) |
+| Rank | [SIGN](models/rank/sign/)([doc](https://paddlerec.readthedocs.io/en/latest/models/rank/sign.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3869111) ||| >=2.1.0 | [AAAI 2021][Detecting Beneficial Feature Interactions for Recommender Systems](https://arxiv.org/pdf/2008.00404v6.pdf) |
+| Rank | [IPRec](models/rank/iprec/)([doc](https://paddl7erec.readthedocs.io/en/latest/models/rank/iprec.html)) | - ||| >=2.1.0 | [SIGIR 2021][Package Recommendation with Intra- and Inter-Package Attention Networks](http://nlp.csai.tsinghua.edu.cn/~xrb/publications/SIGIR-21_IPRec.pdf) |
+| Multi-Task | [AITM](models/rank/aitm/) | - ||| >=2.1.0 | [KDD 2021][Modeling the Sequential Dependence among Audience Multi-step Conversions with Multi-task Learning in Targeted Display Advertising](https://arxiv.org/pdf/2105.08489v2.pdf) |
 | Multi-Task | [PLE](models/multitask/ple/)<br>([doc](https://paddlerec.readthedocs.io/en/latest/models/multitask/ple.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3238938) ||| >=2.1.0 | [RecSys 2020][Progressive Layered Extraction (PLE): A Novel Multi-Task Learning (MTL) Model for Personalized Recommendations](https://dl.acm.org/doi/abs/10.1145/3383313.3412236) |
 | Multi-Task | [ESMM](models/multitask/esmm/)<br>([doc](https://paddlerec.readthedocs.io/en/latest/models/multitask/esmm.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3238583) ||| >=2.1.0 | [SIGIR 2018][Entire Space Multi-Task Model: An Effective Approach for Estimating Post-Click Conversion Rate](https://arxiv.org/abs/1804.07931) |
 | Multi-Task | [MMOE](models/multitask/mmoe/)<br>([doc](https://paddlerec.readthedocs.io/en/latest/models/multitask/mmoe.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3238934) ||| >=2.1.0 | [KDD 2018][Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts](https://dl.acm.org/doi/abs/10.1145/3219819.3220007) |
 | Multi-Task | [ShareBottom](models/multitask/share_bottom/)<br>([doc](https://paddlerec.readthedocs.io/en/latest/models/multitask/share_bottom.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3238943) ||| >=2.1.0 | [1998][Multitask learning](http://reports-archive.adm.cs.cmu.edu/anon/1997/CMU-CS-97-203.pdf) |
 | Multi-Task | [Maml](models/multitask/maml/)<br>([doc](https://paddlerec.readthedocs.io/en/latest/models/multitask/maml.html)) | [Python CPU/GPU](https://aistudio.baidu.com/aistudio/projectdetail/3238412) | x | x | >=2.1.0 | [PMLR 2017][Model-agnostic meta-learning for fast adaptation of deep networks](https://arxiv.org/pdf/1703.03400.pdf) |
 | Multi-Task | [DSelect_K](models/multitask/dselect_k/)<br>([doc](https://paddlerec.readthedocs.io/en/latest/models/multitask/dselect_k.html)) | - | x | x | >=2.1.0 | [NeurIPS 2021][DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning](https://arxiv.org/pdf/2106.03760v3.pdf) |
+| Multi-Task | [ESCM2](models/multitask/escm2/) | - | x | x | >=2.1.0 | [SIGIR 2022][ESCM2: Entire Space Counterfactual Multi-Task Model for Post-Click Conversion Rate Estimation](https://arxiv.org/pdf/2204.05125.pdf) |
 | Re-Rank | [Listwise](https://github.com/PaddlePaddle/PaddleRec/tree/release/1.8.5/models/rerank/listwise/) | - || x | [1.8.5](https://github.com/PaddlePaddle/PaddleRec/tree/release/1.8.5) | [2019][Sequential Evaluation and Generation Framework for Combinatorial Recommender System](https://arxiv.org/pdf/1902.00245.pdf) |

 <h2 align="center">Community</h2>

contributor.md

+1
@@ -20,5 +20,6 @@
 | [FLEN](models/rank/flen/) | [LinJayan](https://github.com/LinJayan) | https://github.com/PaddlePaddle/PaddleRec/pull/685 | Paper Reproduction Challenge, Round 5 |
 | [MHCN](models/recall/mhcn/) | [Andy1314Chen](https://github.com/Andy1314Chen) | https://github.com/PaddlePaddle/PaddleRec/pull/679 | Paper Reproduction Challenge, Round 5 |
 | [DCN_V2](models/rank/dcn_v2/) | [LinJayan](https://github.com/LinJayan) | https://github.com/PaddlePaddle/PaddleRec/pull/677 | Paper Reproduction Challenge, Round 5 |
+| [SIGN](models/rank/sign/) | [BamLubi](https://github.com/BamLubi) | https://github.com/PaddlePaddle/PaddleRec/pull/748 | Paper Reproduction Challenge, Round 6 |

 </div>
@@ -0,0 +1,10 @@
+mkdir raw_data
+cd raw_data
+wget https://paddlerec.bj.bcebos.com/datasets/dmr/user_profile.csv.tar.gz
+tar -zxvf user_profile.csv.tar.gz
+wget https://paddlerec.bj.bcebos.com/datasets/dmr/raw_sample.csv.tar.gz
+tar -zxvf raw_sample.csv.tar.gz
+wget https://paddlerec.bj.bcebos.com/datasets/dmr/behavior_log.csv.tar.gz
+tar -zxvf behavior_log.csv.tar.gz
+wget https://paddlerec.bj.bcebos.com/datasets/dmr/ad_feature.csv.tar.gz
+tar -zxvf ad_feature.csv.tar.gz
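
For readers who prefer to script the download step in Python, the following is a minimal sketch of the same fetch-and-unpack loop. The base URL and archive names are copied from the script above; the helper names `fetch_and_extract` and `fetch_all` are hypothetical and not part of the commit, and skipping an archive that is already on disk is an added convenience, not something the shell script does.

```python
import tarfile
import urllib.request
from pathlib import Path

# Base URL and archive names are taken from the shell script above.
BASE_URL = "https://paddlerec.bj.bcebos.com/datasets/dmr"
ARCHIVES = [
    "user_profile.csv.tar.gz",
    "raw_sample.csv.tar.gz",
    "behavior_log.csv.tar.gz",
    "ad_feature.csv.tar.gz",
]


def fetch_and_extract(url: str, dest: Path) -> Path:
    """Download one .tar.gz archive into dest (if missing) and unpack it there."""
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / url.rsplit("/", 1)[-1]
    if not archive.exists():  # assumption: re-runs may reuse an existing archive
        urllib.request.urlretrieve(url, archive)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
    return archive


def fetch_all(dest: Path = Path("raw_data")) -> None:
    """Fetch every archive listed in ARCHIVES into dest."""
    for name in ARCHIVES:
        fetch_and_extract(f"{BASE_URL}/{name}", dest)
```

Calling `fetch_all()` would populate `raw_data/` much as the script does; nothing runs at import time, so the functions can be exercised against local archives first.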
@@ -0,0 +1,58 @@
+# Ali_Display_Ad_Click dataset
+[Ali_Display_Ad_Click](https://tianchi.aliyun.com/dataset/dataDetail?dataId=56) is a Taobao display-ad click-through rate (CTR) prediction dataset provided by Alibaba.
+
+## Raw dataset
+- Raw sample skeleton raw_sample: ad display/click logs (26 million records) over 8 days for 1.14 million users randomly sampled from the Taobao website, forming the raw sample skeleton
+1. user: anonymized user ID;
+2. adgroup_id: anonymized ad unit ID;
+3. time_stamp: timestamp;
+4. pid: resource slot (ad placement);
+5. nonclk: 1 means no click; 0 means click;
+6. clk: 0 means no click; 1 means click;
+
+```
+user,time_stamp,adgroup_id,pid,nonclk,clk
+581738,1494137644,1,430548_1007,1,0
+```
+
+- Ad feature table ad_feature: covers the basic information of every ad in raw_sample
+1. adgroup_id: anonymized ad ID;
+2. cate_id: anonymized product category ID;
+3. campaign_id: anonymized ad campaign ID;
+4. customer: anonymized advertiser ID;
+5. brand: anonymized brand ID;
+6. price: item price
+```
+adgroup_id,cate_id,campaign_id,customer,brand,price
+63133,6406,83237,1,95471,170.0
+```
+
+- User profile table user_profile: covers the basic information of every user in raw_sample
+1. userid: anonymized user ID;
+2. cms_segid: micro-group ID;
+3. cms_group_id: cms_group_id;
+4. final_gender_code: gender, 1: male, 2: female;
+5. age_level: age level; 1234
+6. pvalue_level: consumption grade, 1: low, 2: mid, 3: high;
+7. shopping_level: shopping depth, 1: shallow user, 2: moderate user, 3: deep user
+8. occupation: whether the user is a college student, 1: yes, 0: no
+9. new_user_class_level: city tier
+```
+userid,cms_segid,cms_group_id,final_gender_code,age_level,pvalue_level,shopping_level,occupation,new_user_class_level
+234,0,5,2,5,,3,0,3
+```
+
+- User behavior log behavior_log: covers 22 days of shopping behavior for every user in raw_sample
+1. user: anonymized user ID;
+2. time_stamp: timestamp;
+3. btag: behavior type, one of four: (pv: browse), (cart: add to cart), (fav: favorite), (buy: purchase)
+4. cate: anonymized product category ID;
+5. brand: anonymized brand ID;
+```
+user,time_stamp,btag,cate,brand
+558157,1493741625,pv,6250,91286
+```
+
+## Preprocessed dataset
+The four files of the raw dataset are processed following [the original paper's data preprocessing](https://github.com/shenweichen/DSIN/tree/master/code), producing a dataset that meets the conditions of the DSIN paper and can be read directly by the reader.
+The dataset consists of eight pkl files, four each for the training set and the test set. Taking the training set as an example, the four files are train_feat_input.pkl, train_sess_input, train_sess_length and train_label.pkl. They store, respectively, the user and item feature inputs sampled at a ratio of 0.25, the user session feature inputs, the user session lengths, and the labels.
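
The raw_sample layout described above can be illustrated with a small parsing sketch. It reads the sample row from the readme by column name and checks that `clk` and `nonclk` are complementary, as the field descriptions state. The `Impression` dataclass and `parse_raw_sample` function are hypothetical names introduced only for this illustration; they are not part of the commit or of the DSIN preprocessing code.

```python
import csv
import io
from dataclasses import dataclass

# Header and sample row copied from the raw_sample example in the readme.
SAMPLE = (
    "user,time_stamp,adgroup_id,pid,nonclk,clk\n"
    "581738,1494137644,1,430548_1007,1,0\n"
)


@dataclass
class Impression:
    user: int
    time_stamp: int
    adgroup_id: int
    pid: str
    clicked: bool


def parse_raw_sample(text: str) -> list:
    """Parse raw_sample CSV text into Impression records."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        clk, nonclk = int(rec["clk"]), int(rec["nonclk"])
        # Per the field descriptions, exactly one of clk/nonclk is 1.
        if clk + nonclk != 1:
            raise ValueError(f"inconsistent click flags: {rec}")
        rows.append(
            Impression(
                user=int(rec["user"]),
                time_stamp=int(rec["time_stamp"]),
                adgroup_id=int(rec["adgroup_id"]),
                pid=rec["pid"],  # kept as a string: values look like 430548_1007
                clicked=(clk == 1),
            )
        )
    return rows
```

Reading columns by name rather than by position sidesteps the fact that the field list and the CSV header give `adgroup_id` and `time_stamp` in a different order.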
+12
@@ -0,0 +1,12 @@
+mkdir big_train
+mkdir big_test
+wget https://paddlerec.bj.bcebos.com/datasets/Ali_Display_Ad_Click/model_input.tar.gz
+tar -zxvf model_input.tar.gz
+mv model_input/test_feat_input.pkl big_test/
+mv model_input/test_label.pkl big_test/
+mv model_input/test_sess_input.pkl big_test/
+mv model_input/test_session_length.pkl big_test/
+mv model_input/train_feat_input.pkl big_train/
+mv model_input/train_label.pkl big_train/
+mv model_input/train_sess_input.pkl big_train/
+mv model_input/train_session_length.pkl big_train/
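
The eight `mv` commands above sort the extracted files by their `train_` or `test_` prefix. A minimal Python sketch of that same sorting step is shown below; the function name `split_model_input` is hypothetical, and the sketch assumes `model_input/` has already been extracted next to the target directories.

```python
import shutil
from pathlib import Path


def split_model_input(src: Path, root: Path) -> dict:
    """Move <split>_*.pkl files from src into root/big_<split>, by prefix."""
    moved = {"train": [], "test": []}
    for split in moved:
        target = root / f"big_{split}"
        target.mkdir(parents=True, exist_ok=True)
        # sorted() materializes the listing before any file is moved
        for pkl in sorted(src.glob(f"{split}_*.pkl")):
            shutil.move(str(pkl), str(target / pkl.name))
            moved[split].append(pkl.name)
    return moved
```

For example, `split_model_input(Path("model_input"), Path("."))` would leave four files in each of `big_train/` and `big_test/`, assuming the archive contains the eight files named in the script.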

datasets/Avazu_flen/data_config.yaml

+1-1
@@ -14,7 +14,7 @@


 runner:
-raw_file_dir: "path" # raw_data dir
+raw_file_dir: "raw_file/train" # raw_data dir
 raw_filled_file_dir: "./raw_data" # raw_data_filled dir
 train_data_dir: "./train_data_full" # train datasets
 test_data_dir: "./test_data_full" # test datasets

datasets/Avazu_flen/preprocess.py

+1-1
@@ -59,7 +59,7 @@ def __init__(self, config):
 self.min_threshold = self.config.get("runner.min_threshold")
 self.feature_map_cache = self.config.get("runner.feature_map_cache")

-# self.filled_raw()
+self.filled_raw()

 self.init()
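
The hunk above reads settings through dotted keys such as `runner.min_threshold`. One way such lookups can work is to flatten the nested YAML mapping into a single-level dictionary. The sketch below illustrates that idea on a plain dict that mirrors the `runner` section of data_config.yaml; the `flatten` helper is hypothetical, and whether preprocess.py builds its config this way is an assumption, not something the diff shows.

```python
def flatten(cfg: dict, prefix: str = "") -> dict:
    """Flatten a nested mapping into dotted keys, e.g. {'a': {'b': 1}} -> {'a.b': 1}."""
    flat = {}
    for key, value in cfg.items():
        dotted = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, dotted))
        else:
            flat[dotted] = value
    return flat


# Values copied from the data_config.yaml hunk in this commit.
config = flatten(
    {
        "runner": {
            "raw_file_dir": "raw_file/train",
            "raw_filled_file_dir": "./raw_data",
            "train_data_dir": "./train_data_full",
            "test_data_dir": "./test_data_full",
        }
    }
)
```

With this shape, `config.get("runner.raw_file_dir")` returns the configured path, and a key that is absent from the mapping returns `None` instead of raising.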

datasets/Avazu_flen/readme.md

+4-4
@@ -2,11 +2,11 @@
 #### 1.Get raw datasets:
 You can go to: [https://www.kaggle.com/c/avazu-ctr-prediction/data](https://www.kaggle.com/c/avazu-ctr-prediction)

-Set the directory of the downloaded raw data in data_config.yaml, then run the command to get the full dataset
+After extracting the downloaded data, keep only the training set and name it `train`

 | Name | Description |
 | -------- | -------- |
-| raw_file_dir | Raw dataset directory |
+| raw_file | Raw dataset directory |
 | raw_filled_file_dir | Directory for the raw data after missing-value handling |
 | train_data_dir | Training set directory |
 | test_data_dir | Test set directory |
@@ -15,9 +15,9 @@ you can go to:[https://www.kaggle.com/c/avazu-ctr-prediction/data](https://www
 | feature_map_cache | Feature cache data |


-
+Then run the script:
 ```bash
-sh data_process.sh
+sh run.sh
 ```
 #### 2.Get preprocessed datasets:
 You can also go to: [AiStudio dataset](https://aistudio.baidu.com/aistudio/datasetdetail/125200)

datasets/Avazu_flen/run.sh

+6
@@ -1 +1,7 @@
+mkdir train_data_full
+mkdir test_data_full
+mkdir raw_file
+mkdir raw_filled_file_dir
+mv train ./raw_file
+
 python preprocess.py -m data_config.yaml
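
The staging steps in run.sh fail on a second run, because plain `mkdir` errors when a directory already exists. As a hedged illustration, the sketch below performs the same staging in Python in a way that can be repeated safely, then checks that the training file ends up at `raw_file/train`, the path that data_config.yaml points to. The function name `stage_raw_train` is hypothetical and not part of the commit.

```python
import shutil
from pathlib import Path

# Directory names copied from run.sh.
WORK_DIRS = ("train_data_full", "test_data_full", "raw_file", "raw_filled_file_dir")


def stage_raw_train(workdir: Path) -> Path:
    """Create the working directories and move `train` into raw_file/."""
    for name in WORK_DIRS:
        (workdir / name).mkdir(exist_ok=True)  # no error if it already exists
    src = workdir / "train"
    dst = workdir / "raw_file" / "train"
    if src.exists() and not dst.exists():
        shutil.move(str(src), str(dst))
    if not dst.exists():
        raise FileNotFoundError(f"expected the Avazu training file at {dst}")
    return dst
```

After `stage_raw_train(Path("."))` succeeds, `python preprocess.py -m data_config.yaml` can be run as in the script.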

datasets/ali-cpp_aitm/data_process.sh

+7
@@ -0,0 +1,7 @@
+mkdir data
+mkdir data/whole_data && mkdir data/whole_data/train && mkdir data/whole_data/test
+tar zxvf data/sample_train.tar.gz -C data
+tar zxvf data/sample_test.tar.gz -C data
+python process_public_data.py
+mv data/ctr_cvr.train data/whole_data/train
+mv data/ctr_cvr.test data/whole_data/test
