
Commit 1bdde2b

Authored by MrChengmo, yinhaofeng, fuyinno4, seiriosPlus

Six PR for new paddlerec version (#240)

* logistic_regression
* adaptation windows

Co-authored-by: yinhaofeng <1841837261@qq.com>
Co-authored-by: yinhaofeng <66763551+yinhaofeng@users.noreply.github.com>
Co-authored-by: wuzhihua <35824027+fuyinno4@users.noreply.github.com>
Co-authored-by: tangwei12 <tangwei12@baidu.com>

1 parent 3b9c100 · commit 1bdde2b


47 files changed: +1304 -155 lines

models/match/dssm/data/preprocess.py (+2 -1)

````diff
@@ -63,7 +63,8 @@
 # split into training and test sets
 query_list = list(pos_dict.keys())
 #print(len(query_list))
-#random.shuffle(query_list)
+np.random.seed(107)
+np.random.shuffle(query_list)
 train_query = query_list[:11600]
 test_query = query_list[11600:]
````
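The change above replaces the commented-out, unseeded shuffle with a seeded NumPy shuffle, so the train/test split is reproducible across runs. A minimal standalone sketch of the same idea follows; the seed 107 and the cut-off pattern come from the diff, while the toy query list and the cut at 15 are invented here (the real script cuts at 11600):

```python
import numpy as np

# Toy stand-in for list(pos_dict.keys()); the real script builds this from the corpus.
query_list = ["query_%d" % i for i in range(20)]

np.random.seed(107)            # fixed seed -> the same order every run
np.random.shuffle(query_list)  # in-place shuffle

cut = 15                       # the real script uses 11600
train_query = query_list[:cut]
test_query = query_list[cut:]
```

Because the seed is fixed before the shuffle, rerunning the script always produces the same split, which makes the reported metrics comparable between runs.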

models/match/dssm/readme.md (+11 -5)

````diff
@@ -60,6 +60,12 @@ rm -f dssm%2Fbq.tar.gz
 已经在银行换了新预留号码。 我现在换了电话号码,这个需要更换吗 1
 Each field is tab-separated; columns 1 and 2 are the two texts, and column 3 is the label (0 or 1: 0 means the two texts are dissimilar, 1 means they are similar).
 ```
+This example uses the jieba and sklearn libraries. If they are not already installed in your environment, install them with:
+```
+pip install sklearn
+pip install jieba
+```
+
 ## Runtime environment
 PaddlePaddle>=1.7.2
@@ -153,11 +159,11 @@ (labels in label.txt for the test set)
 4. Go back to the dssm directory, open config.yaml, and change its parameters:
 
 Set workspace to your current absolute path (use the pwd command to obtain it).
-Change batch_size in dataset_train from 8 to 128.
-Change slice_end in hyper_parameters from 8 to 128; whenever you change the batch size, this parameter must change with it.
-Change data_path in dataset_train to {workspace}/data/big_train.
-Change data_path in dataset_infer to {workspace}/data/big_test.
-Change trigram_d in hyper_parameters to 5913.
+Change batch_size in dataset_train from 8 to 128.
+Change slice_end in hyper_parameters from 8 to 128; whenever you change the batch size, this parameter must change with it.
+Change data_path in dataset_train to {workspace}/data/big_train.
+Change data_path in dataset_infer to {workspace}/data/big_test.
+Change trigram_d in hyper_parameters to 5913.
 
 5. Run the script to start training. The script runs python -m paddlerec.run -m ./config.yaml, writes the output to the result file, then runs transform.py to consolidate the data, and finally computes the positive/negative order metric:
 ```
````

(The second hunk is an indentation/renumbering-only change; the instructions themselves are unchanged.)
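For orientation, after the edits described in step 4 the relevant config.yaml fields would look roughly like the fragment below. This is a hedged sketch, not the full file: the absolute workspace path is a placeholder, and all surrounding keys are omitted.

```yaml
workspace: "/path/to/PaddleRec/models/match/dssm"   # placeholder; use your own pwd output

dataset:
  - name: dataset_train
    batch_size: 128                           # was 8
    data_path: "{workspace}/data/big_train"
  - name: dataset_infer
    data_path: "{workspace}/data/big_test"

hyper_parameters:
  trigram_d: 5913
  slice_end: 128   # must track batch_size whenever it changes
```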

models/match/dssm/run.sh (+1 -5)

````diff
@@ -13,11 +13,7 @@
 # limitations under the License.
 #!/bin/bash
 echo "................run................."
-python -m paddlerec.run -m ./config.yaml &> result1.txt
-grep -i "query_doc_sim" ./result1.txt >./result2.txt
-sed '$d' result2.txt >result.txt
-rm -f result1.txt
-rm -f result2.txt
+python -m paddlerec.run -m ./config.yaml &> result.txt
 python transform.py
 sort -t $'\t' -k1,1 -k 2nr,2 pair.txt >result.txt
 rm -f pair.txt
````

models/match/dssm/transform.py (+14)

````diff
@@ -16,6 +16,20 @@
 import numpy as np
 import sklearn.metrics
 
+filename = './result.txt'
+f = open(filename, "r")
+lines = f.readlines()
+f.close()
+result = []
+for line in lines:
+    if "query_doc_sim" in str(line):
+        result.append(line)
+result = result[:-1]
+f = open(filename, "w")
+for i in range(len(result)):
+    f.write(str(result[i]))
+f.close()
+
 label = []
 filename = './data/label.txt'
 f = open(filename, "r")
````

models/match/match-pyramid/eval.py (+15 -1)

````diff
@@ -32,6 +32,20 @@ def eval_MAP(pred, gt):
     return map_value / r
 
 
+filename = './result.txt'
+f = open(filename, "r")
+lines = f.readlines()
+f.close()
+result = []
+for line in lines:
+    if "prediction" in str(line):
+        result.append(line)
+result = result[:-1]
+f = open(filename, "w")
+for i in range(len(result)):
+    f.write(str(result[i]))
+f.close()
+
 filename = './data/relation.test.fold1.txt'
 gt = []
 qid = []
@@ -56,7 +70,7 @@ def eval_MAP(pred, gt):
     pred.append(float(line))
 
 result_dict = {}
-for i in range(len(qid)):
+for i in range(len(pred)):
     if qid[i] not in result_dict:
         result_dict[qid[i]] = []
     result_dict[qid[i]].append([gt[i], pred[i]])
````
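The loop-bound fix in the second hunk matters because, after result.txt is filtered, pred can hold fewer entries than qid; iterating over len(pred) avoids an IndexError while still grouping every available prediction under its query id. A minimal sketch of that grouping step, on toy data (the real script reads qid and gt from relation.test.fold1.txt, and uses an explicit if-not-in check rather than setdefault):

```python
qid = ["q1", "q1", "q2", "q2", "q2"]
gt = [1, 0, 0, 1, 0]
pred = [0.9, 0.2, 0.4, 0.8]   # one short of qid, as can happen after filtering

result_dict = {}
for i in range(len(pred)):    # bound by pred, not qid, to avoid an IndexError
    result_dict.setdefault(qid[i], []).append([gt[i], pred[i]])
```

Each key then holds [label, score] pairs for one query, which is exactly the shape eval_MAP consumes per query.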

models/match/match-pyramid/readme.md (+6)

````diff
@@ -55,6 +55,12 @@
 3. Relation files: relation files store the relationship between two sentences, e.g. between a query and a document. Examples: relation.train.fold1.txt, relation.test.fold1.txt
 4. Embedding file: we store the pre-trained word vectors in the embedding file. Example: embed_wiki-pdc_d50_norm
 
+This example uses the jieba and sklearn libraries. If they are not already installed in your environment, install them with:
+```
+pip install sklearn
+pip install jieba
+```
+
 ## Runtime environment
 PaddlePaddle>=1.7.2
 python 2.7/3.5/3.6/3.7
````

models/match/match-pyramid/run.sh (+1 -5)

````diff
@@ -1,8 +1,4 @@
 #!/bin/bash
 echo "................run................."
-python -m paddlerec.run -m ./config.yaml &>result1.txt
-grep -i "prediction" ./result1.txt >./result2.txt
-sed '$d' result2.txt >result.txt
-rm -f result2.txt
-rm -f result1.txt
+python -m paddlerec.run -m ./config.yaml &>result.txt
 python eval.py
````

models/match/multiview-simnet/readme.md (+5)

````diff
@@ -61,6 +61,11 @@ rm -f dssm%2Fbq.tar.gz
 0:358 0:206 0:205 0:250 0:9 0:3 0:207 0:10 0:330 0:164 1:1144 1:217 1:206 1:9 1:3 1:207 1:10 1:398 1:2 2:217 2:206 2:9 2:3 2:207 2:10 2:398 2:2
 0:358 0:206 0:205 0:250 0:9 0:3 0:207 0:10 0:330 0:164 1:951 1:952 1:206 1:9 1:3 1:207 1:10 1:398 2:217 2:206 2:9 2:3 2:207 2:10 2:398 2:2
 ```
+This example uses the jieba and sklearn libraries. If they are not already installed in your environment, install them with:
+```
+pip install sklearn
+pip install jieba
+```
 
 ## Runtime environment
 PaddlePaddle>=1.7.2
````

models/match/multiview-simnet/run.sh (+2 -6)

````diff
@@ -14,12 +14,8 @@
 
 #!/bin/bash
 echo "................run................."
-python -m paddlerec.run -m ./config.yaml &>result1.txt
-grep -i "query_pt_sim" ./result1.txt >./result2.txt
-sed '$d' result2.txt >result.txt
-rm -f result1.txt
-rm -f result2.txt
+python -m paddlerec.run -m ./config.yaml &>result.txt
 python transform.py
-sort -t $'\t' -k1,1 -k 2nr,2 pair.txt >result.txt
+sort -t $'\t' -k1,1 -k 2nr,2 pair.txt &>result.txt
 rm -f pair.txt
 python ../../../tools/cal_pos_neg.py result.txt
````
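The sort call above orders pair.txt by query (field 1, ascending) and then by similarity score (field 2, numeric, descending), which is the ordering cal_pos_neg.py expects. The same two-level ordering expressed in Python, on made-up rows:

```python
# (query, similarity, label) rows, like the tab-separated columns of pair.txt
rows = [
    ("q2", 0.3, 1),
    ("q1", 0.7, 0),
    ("q1", 0.9, 1),
]

# Equivalent of `sort -t $'\t' -k1,1 -k 2nr,2`:
# primary key ascending, secondary key numeric descending (negated).
rows.sort(key=lambda r: (r[0], -r[1]))
```

Negating the numeric key is the usual Python idiom for a descending secondary sort while the primary sort stays ascending.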

models/match/multiview-simnet/transform.py (+16 -2)

````diff
@@ -15,6 +15,20 @@
 import random
 import numpy as np
 
+filename = './result.txt'
+f = open(filename, "r")
+lines = f.readlines()
+f.close()
+result = []
+for line in lines:
+    if "query_pt_sim" in str(line):
+        result.append(line)
+result = result[:-1]
+f = open(filename, "w")
+for i in range(len(result)):
+    f.write(str(result[i]))
+f.close()
+
 label = []
 filename = './data/label.txt'
 f = open(filename, "r")
@@ -31,7 +45,7 @@
 sim = []
 for line in open(filename):
     line = line.strip().split(",")
-    print(line)
+    #print(line)
     line[3] = line[3].split(":")
     line = line[3][1].strip(" ")
     line = line.strip("[")
@@ -50,6 +64,6 @@
 filename = 'pair.txt'
 f = open(filename, "w")
 for i in range(len(sim)):
-    print(i)
+    #print(i)
     f.write(str(query[i]) + "\t" + str(sim[i]) + "\t" + str(label[i]) + "\n")
 f.close()
````

models/match/readme.md (+2 -2)

````diff
@@ -53,6 +53,6 @@ python -m paddlerec.run -m models/contentunderstanding/match-pyramid/config.yaml
 
 | Dataset | Model | pos/neg order ratio | MAP |
 | :-----: | :---: | :-----------------: | :-: |
-| zhidao | DSSM | 2.25 | -- |
+| zhidao | DSSM | 2.75 | -- |
 | Letor07 | match-pyramid | -- | 0.42 |
-| zhidao | multiview-simnet | 1.72 | -- |
+| zhidao | multiview-simnet | 13.67 | -- |
````

models/rank/deepfm/config.yaml (+15 -12)

````diff
@@ -19,22 +19,22 @@ workspace: "models/rank/deepfm"
 
 dataset:
 - name: train_sample
-  type: QueueDataset
+  type: DataLoader
   batch_size: 5
   data_path: "{workspace}/data/sample_data/train"
   sparse_slots: "label feat_idx"
   dense_slots: "feat_value:39"
 - name: infer_sample
-  type: QueueDataset
+  type: DataLoader
   batch_size: 5
   data_path: "{workspace}/data/sample_data/train"
   sparse_slots: "label feat_idx"
   dense_slots: "feat_value:39"
 
 hyper_parameters:
   optimizer:
-    class: SGD
-    learning_rate: 0.0001
+    class: Adam
+    learning_rate: 0.001
   sparse_feature_number: 1086460
   sparse_feature_dim: 9
   num_field: 39
@@ -43,7 +43,7 @@ hyper_parameters:
   act: "relu"
 
 
-mode: train_runner
+mode: [train_runner,infer_runner]
 # if infer, change mode to "infer_runner" and change phase to "infer_phase"
 
 runner:
@@ -57,19 +57,22 @@ runner:
   save_checkpoint_path: "increment"
   save_inference_path: "inference"
   print_interval: 1
+  phases: phase1
 - name: infer_runner
   class: infer
   device: cpu
-  init_model_path: "increment/0"
+  init_model_path: "increment/1"
   print_interval: 1
+  phases: infer_phase
 
 
 phase:
 - name: phase1
   model: "{workspace}/model.py"
   dataset_name: train_sample
-  thread_num: 1
-#- name: infer_phase
-#  model: "{workspace}/model.py"
-#  dataset_name: infer_sample
-#  thread_num: 1
+  thread_num: 10
+- name: infer_phase
+  model: "{workspace}/model.py"
+  dataset_name: infer_sample
+  thread_num: 10
````

models/rank/deepfm/data/download_preprocess.py (+1 -1)

````diff
@@ -28,7 +28,7 @@
 
 print("download and extract starting...")
 download_file_and_uncompress(url)
-download_file(url2, "./sample_data/feat_dict_10.pkl2", True)
+download_file(url2, "./deepfm%2Ffeat_dict_10.pkl2", True)
 print("download and extract finished")
 
 print("preprocessing...")
````

models/rank/deepfm/data/get_slot_data.py (+1 -2)

````diff
@@ -79,8 +79,7 @@ def data_iter():
             v = i[1]
             for j in v:
                 s += " " + k + ":" + str(j)
-            print(s.strip())
-            yield None
+            print(s.strip())  # add print for data preprocessing
 
     return data_iter
````
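With `yield None` removed, the generator's only remaining job is to print each slot-formatted sample, so the accompanying run.sh can redirect the script's stdout straight into the slot data files. A toy sketch of how such a "name:value" line is assembled; the helper name and the sample slot names are mine, inferred from the concatenation in the diff rather than taken from the actual script:

```python
def to_slot_line(pairs):
    # pairs: list of (slot_name, list_of_values) for one parsed sample.
    s = ""
    for k, v in pairs:
        for j in v:
            s += " " + k + ":" + str(j)   # same concatenation as in get_slot_data.py
    return s.strip()                      # strip the leading space, as the script does
```

Printing one such line per sample means the slot files are just the captured stdout, with no intermediate in-memory dataset.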

models/rank/deepfm/data/run.sh (+1 -1)

````diff
@@ -1,5 +1,5 @@
 python download_preprocess.py
-
+mv ./deepfm%2Ffeat_dict_10.pkl2 sample_data/feat_dict_10.pkl2
 mkdir slot_train_data
 for i in `ls ./train_data`
 do
````

Binary files added (images):

- models/rank/deepfm/picture/1.jpg (6.64 KB)
- models/rank/deepfm/picture/2.jpg (4.12 KB)
- models/rank/deepfm/picture/3.jpg (11.3 KB)
- models/rank/deepfm/picture/4.jpg (25.8 KB)
