Commit d2e419a

committed
Merge branch 'develop' into link
2 parents deed483 + 8e25fbb commit d2e419a

2 files changed

+183
-28
lines changed

+155
@@ -0,0 +1,155 @@
1+
# Run Distributed Training
2+
3+
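In this article, we explain how to run distributed Paddle training jobs on clusters. We will create the distributed version of the single-process training example, [recommendation](https://github.com/baidu/Paddle/tree/develop/demo/recommendation).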
4+
5+
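The [scripts](https://github.com/baidu/Paddle/tree/develop/paddle/scripts/cluster_train) used in this article launch distributed jobs via SSH. They can also serve as a reference for users running more sophisticated cluster management systems such as MPI and [Kubernetes](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/howto/usage/cluster/k8s).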
6+
7+
## Prerequisites
8+
9+
1. The scripts above use the Python library [fabric](http://www.fabfile.org/) to run SSH commands. We use `pip` to install fabric:
10+
11+
```bash
12+
pip install fabric
13+
```
14+
15+
2. We need to install PaddlePaddle on all nodes in the cluster. To enable GPUs, CUDA must be installed in `/usr/local/cuda`; otherwise Paddle will report an error at runtime.
16+
17+
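3. Set `ROOT_DIR` in `cluster_train/conf.py`; this directory must exist on all nodes. For convenience, we often create a Unix user `paddle` on all nodes and set `ROOT_DIR=/home/paddle`. This way, we can write the SSH public key into `/home/paddle/.ssh/authorized_keys` so that user `paddle` can SSH to all nodes without a password.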
18+
19+
## Prepare the Workspace
20+
21+
We refer to the directory where we put dependent libraries, config files, etc., as the *workspace*.
22+
23+
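The `train/test` data should be prepared before launching the cluster job. To allow train/test data to be placed in a directory different from the workspace, PADDLE refers to train/test data through index files named `train.list/test.list`, which are used in the model config file, so the train/test data also includes the two list files train.list and test.list. All local training demos already provide scripts to help you create these two files, and under normal conditions all nodes in a cluster job handle the files with the same logic.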
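
To make the index-file idea concrete, here is a minimal sketch of producing such a list file. It assumes the list format is one data file path per line; the helper name `write_file_list` and the suffix-based matching are hypothetical and only for illustration (the demos ship their own scripts for this).

```python
import os
import tempfile


def write_file_list(data_dir, suffix, list_path):
    """Write the paths of all files in data_dir ending with suffix to
    list_path, one absolute path per line, and return the paths written."""
    paths = sorted(
        os.path.abspath(os.path.join(data_dir, name))
        for name in os.listdir(data_dir)
        if name.endswith(suffix)
    )
    with open(list_path, "w") as f:
        for path in paths:
            f.write(path + "\n")
    return paths


if __name__ == "__main__":
    # Throwaway directory standing in for a real workspace's data directory.
    with tempfile.TemporaryDirectory() as data_dir:
        for name in ("ratings.dat.train", "ratings.dat.test"):
            open(os.path.join(data_dir, name), "w").close()
        train_list = os.path.join(data_dir, "train.list")
        print(write_file_list(data_dir, ".train", train_list))
```

Because each node reads its own list file, the same script can be run on every node even when the data files differ per node.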
24+
25+
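Generally, you can use the same model file from local training for cluster training. Keep in mind that the `batch_size` set in the `setting` function of the model file is the batch size on **each** node of the cluster job, not the total batch size, when synchronous SGD is used.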
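
The arithmetic behind this note can be sketched as follows. It assumes that under synchronous SGD every node contributes exactly one batch per update step; the function names are hypothetical and not part of Paddle.

```python
def total_batch_size(per_node_batch_size, num_nodes):
    """Samples consumed per synchronous SGD step across the whole cluster,
    assuming each node processes one batch per step."""
    if per_node_batch_size <= 0 or num_nodes <= 0:
        raise ValueError("batch size and node count must be positive")
    return per_node_batch_size * num_nodes


def per_node_batch_size(target_total, num_nodes):
    """Per-node batch_size to configure in order to keep a desired total."""
    if target_total % num_nodes != 0:
        raise ValueError("target total is not divisible by the node count")
    return target_total // num_nodes


if __name__ == "__main__":
    # Two nodes (as in the default HOSTS) with batch_size = 1600 on each.
    print(total_batch_size(1600, 2))
    # Keeping the local-training total of 1600 across two nodes.
    print(per_node_batch_size(1600, 2))
```

If a model was tuned locally with a given batch size, dividing it by the node count is one way to keep the effective batch size unchanged when moving to the cluster.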
26+
27+
The following steps are based on [demo/recommendation](https://github.com/PaddlePaddle/Paddle/tree/develop/demo/recommendation) in the demo directory.
28+
29+
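You only need to go through the demo/recommendation tutorial up to the `Train` section, after which you will have the train/test data and the model configuration file. Finally, just use demo/recommendation as the workspace for cluster training.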
30+
31+
In the end, your workspace should look as follows:
32+
```
33+
.
34+
|-- common_utils.py
35+
|-- data
36+
| |-- config.json
37+
| |-- config_generator.py
38+
| |-- meta.bin
39+
| |-- meta_config.json
40+
| |-- meta_generator.py
41+
| |-- ml-1m
42+
| |-- ml_data.sh
43+
| |-- ratings.dat.test
44+
| |-- ratings.dat.train
45+
| |-- split.py
46+
| |-- test.list
47+
| `-- train.list
48+
|-- dataprovider.py
49+
|-- evaluate.sh
50+
|-- prediction.py
51+
|-- preprocess.sh
52+
|-- requirements.txt
53+
|-- run.sh
54+
`-- trainer_config.py
55+
```
56+
Not all of these files are needed for cluster training, but there is no need to remove the unused ones.
57+
58+
`trainer_config.py`
59+
Indicates the model config file.
60+
61+
`train.list` and `test.list`
62+
File index. It stores the relative or absolute paths of all train/test data files on the current node.
63+
64+
`dataprovider.py`
65+
Used to read train/test samples. This is the same as in local training.
66+
67+
`data`
68+
All files in the data directory are referenced by train.list/test.list.
69+
70+
71+
## Prepare Cluster Job Configuration
72+
73+
The following options must be carefully set in cluster_train/conf.py
74+
75+
`HOSTS` hostnames or IPs of all nodes that will run the cluster job. You can also attach a user and SSH port to the hostname, such as root@192.168.100.17:9090.
76+
77+
`ROOT_DIR` the workspace ROOT directory in which the JOB workspace directory is placed
78+
79+
`PADDLE_NIC` the NIC (Network Interface Card) interface name for the cluster communication channel, such as eth0 for Ethernet or ib0 for InfiniBand.
80+
81+
`PADDLE_PORT` the port number for the cluster communication channel
82+
83+
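`PADDLE_PORTS_NUM` the number of ports used for the cluster communication channel. If the number of cluster nodes is small (fewer than 5~6 nodes), it is recommended to set it to a larger value, such as 2~8, for better network performance.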
84+
85+
`PADDLE_PORTS_NUM_FOR_SPARSE` the number of ports used for the sparse remote updater cluster communication channel. If sparse remote update is used, set it like `PADDLE_PORTS_NUM`.
86+
87+
`LD_LIBRARY_PATH` sets an additional LD_LIBRARY_PATH for the cluster job. You can use it to set the path of the CUDA libraries.
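
As an illustration of the `HOSTS` entry format described above (`user@host:port`, with user and port optional), the following sketch splits an entry into its parts. The helper `parse_host` is hypothetical and is not how the cluster scripts or fabric parse hosts internally; it also does not handle IPv6 literals.

```python
def parse_host(entry, default_user=None, default_port=22):
    """Split 'user@host:port' into (user, host, port).

    The user and port parts are optional; 22 is the standard SSH port.
    """
    user, _, rest = entry.rpartition("@")
    host, sep, port = rest.partition(":")
    return (user or default_user, host, int(port) if sep else default_port)


if __name__ == "__main__":
    for entry in ("root@192.168.100.17:9090", "root@192.168.100.18", "192.168.100.19"):
        print(entry, "->", parse_host(entry))
```

A check like this can be run over every `HOSTS` entry before launching, to catch malformed entries early.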
88+
89+
The default configuration is as follows:
90+
91+
```python
92+
HOSTS = [
93+
"root@192.168.100.17",
94+
"root@192.168.100.18",
95+
]
96+
97+
'''
98+
workspace configuration
99+
'''
100+
101+
#root dir for workspace
102+
ROOT_DIR = "/home/paddle"
103+
104+
'''
105+
network configuration
106+
'''
107+
#pserver NIC
108+
PADDLE_NIC = "eth0"
109+
#pserver port
110+
PADDLE_PORT = 7164
111+
#pserver ports num
112+
PADDLE_PORTS_NUM = 2
113+
#pserver sparse ports num
114+
PADDLE_PORTS_NUM_FOR_SPARSE = 2
115+
116+
#environment settings for all processes in the cluster job
117+
LD_LIBRARY_PATH="/usr/local/cuda/lib64:/usr/lib64"
118+
```
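
When planning firewall rules for these settings, it can help to list the ports a pserver might listen on. The sketch below assumes the ports are allocated consecutively starting at `PADDLE_PORT`, with the sparse ports following the dense ones; that layout is an assumption and is not stated in this document, so verify it against your Paddle version. The function `pserver_ports` is hypothetical.

```python
def pserver_ports(port, ports_num, ports_num_for_sparse=0):
    """Return (dense_ports, sparse_ports) under the ASSUMPTION that ports are
    consecutive from `port`, dense ports first and sparse ports after them."""
    dense = list(range(port, port + ports_num))
    sparse_start = port + ports_num
    sparse = list(range(sparse_start, sparse_start + ports_num_for_sparse))
    return dense, sparse


if __name__ == "__main__":
    # Values taken from the default configuration above.
    dense, sparse = pserver_ports(7164, 2, 2)
    print("dense:", dense)
    print("sparse:", sparse)
```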
119+
120+
### Launching Cluster Job
121+
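`paddle.py` provides automated scripts to start all PaddlePaddle cluster processes on different nodes. By default, all command line options can be set as `paddle.py` command options, and `paddle.py` will transparently and automatically pass these options to the lower-level PaddlePaddle processes.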
122+
123+
`paddle.py` provides two distinct command options for easy job launching.
124+
125+
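`job_dispatch_package` set it to a local `workspace` directory, which will be dispatched to all nodes set in conf.py. This eases the burden on users who frequently modify and access workspace files; otherwise frequent multi-node workspace deployment can be tedious.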
126+
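`job_workspace` set it to an already deployed workspace directory; `paddle.py` will skip the dispatch stage and directly launch the cluster job on all nodes. This helps reduce dispatch latency.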
127+
128+
`cluster_train/run.sh` provides a sample command line to run the `demo/recommendation` cluster job. Just modify `job_dispatch_package` and `job_workspace` with your own directories, then:
129+
```
130+
sh run.sh
131+
```
132+
133+
The cluster job will start in several seconds.
134+
135+
### Kill Cluster Job
136+
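`paddle.py` can capture the `Ctrl + C` SIGINT signal to automatically kill all processes it launched, so just interrupt `paddle.py` to kill the cluster job. If the program crashes, you can also kill the job manually.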
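
The general pattern of a launcher that cleans up its processes on SIGINT can be sketched as below. This is a local-only illustration using the standard library; it is not the actual `paddle.py` implementation, which manages processes on remote nodes over SSH. The names `launch` and `kill_children` are hypothetical.

```python
import signal
import subprocess
import sys

_children = []  # processes started by this launcher


def launch(cmd):
    """Start a child process and remember it for later cleanup."""
    proc = subprocess.Popen(cmd)
    _children.append(proc)
    return proc


def kill_children():
    """Terminate every child that is still running, then reap them all."""
    for proc in _children:
        if proc.poll() is None:
            proc.terminate()
    for proc in _children:
        proc.wait()


def _on_sigint(signum, frame):
    kill_children()
    sys.exit(130)  # conventional exit status after SIGINT


if __name__ == "__main__":
    signal.signal(signal.SIGINT, _on_sigint)
    # Stand-ins for long-running trainer/pserver processes.
    for _ in range(2):
        launch([sys.executable, "-c", "import time; time.sleep(3600)"])
    for proc in list(_children):
        proc.wait()  # press Ctrl+C to trigger the cleanup
```

If the launcher itself crashes, no handler runs, which is why a manual kill is still needed in that case.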
137+
138+
### Check Cluster Training Result
139+
Check the logs in $workspace/log for details; every node has the same log structure.
140+
141+
`paddle_trainer.INFO`
142+
Provides almost all internal output logs for training, the same as in local training. Check runtime model convergence here.
143+
144+
`paddle_pserver2.INFO`
145+
Provides the pserver running log, which helps diagnose distributed errors.
146+
147+
`server.log`
148+
Provides the stderr and stdout of the pserver process. Check the error log if training fails.
149+
150+
`train.log`
151+
Provides the stderr and stdout of the trainer process. Check the error log if training fails.
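
When a job fails, a quick way to triage the `.INFO` logs is to pull out only the error and fatal lines. The sketch below assumes those files use the glog line format, where each line starts with a severity letter (`I`, `W`, `E`, `F`) followed by the month and day; that is an assumption about the log format rather than something stated here. `server.log` and `train.log` hold raw stdout/stderr and are better read directly. The helper names are hypothetical.

```python
import re

# glog-style prefix: severity letter, then mmdd, then a space (assumed format).
_PROBLEM = re.compile(r"^[EF]\d{4} ")


def problem_lines(lines):
    """Return the lines whose glog severity is ERROR or FATAL."""
    return [line.rstrip("\n") for line in lines if _PROBLEM.match(line)]


def scan_log(path):
    """Read one log file and return its ERROR/FATAL lines."""
    with open(path, errors="replace") as f:
        return problem_lines(f)


if __name__ == "__main__":
    sample = [
        "I0103 10:00:00.000000  1234 Trainer.cpp:1] pass 0 finished\n",
        "F0103 10:00:01.000000  1234 Socket.cpp:2] Check failed\n",
    ]
    for line in problem_lines(sample):
        print(line)
```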
152+
153+
### Check Model Output
154+
After the run completes, model files are written to the `output` directory on node 0.
155+
`nodefile` in the workspace indicates the node ID of the current cluster job.

doc/howto/usage/cluster/cluster_train_en.md

+28-28
@@ -2,7 +2,7 @@
22

33
In this article, we explain how to run distributed Paddle training jobs on clusters. We will create the distributed version of the single-process training example, [recommendation](https://github.com/baidu/Paddle/tree/develop/demo/recommendation).
44

5-
[Scripts](https://github.com/baidu/Paddle/tree/develop/paddle/scripts/cluster_train) used in this article launch distributed jobs via SSH. They also work as a reference for users running more sophisticated cluster management systems like MPI and Kubernetes.
5+
[Scripts](https://github.com/baidu/Paddle/tree/develop/paddle/scripts/cluster_train) used in this article launch distributed jobs via SSH. They also work as a reference for users running more sophisticated cluster management systems like MPI and [Kubernetes](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/howto/usage/cluster/k8s).
66

77
## Prerequisite
88

@@ -20,13 +20,13 @@ In this article, we explain how to run distributed Paddle training jobs on clust
2020

2121
We refer to the directory where we put dependent libraries, config files, etc., as *workspace*.
2222

23-
These ```train/test``` data should be prepared before launching cluster job. To satisfy the requirement that train/test data are placed in different directory from workspace, PADDLE refers train/test data according to index file named as ```train.list/test.list``` which are used in model config file. So the train/test data also contains train.list/test.list two list file. All local training demo already provides scripts to help you create these two files, and all nodes in cluster job will handle files with same logical code in normal condition.
23+
The `train/test` data should be prepared before launching the cluster job. To allow train/test data to be placed in a directory different from the workspace, PADDLE refers to train/test data through index files named `train.list/test.list`, which are used in the model config file. So the train/test data also includes the two list files train.list and test.list. All local training demos already provide scripts to help you create these two files, and all nodes in a cluster job will normally handle the files with the same logic.
2424

25-
Generally, you can use same model file from local training for cluster training. What you should have in mind that, the ```batch_size``` set in ```setting``` function in model file means batch size in ```each``` node of cluster job instead of total batch size if synchronization SGD was used.
25+
Generally, you can use the same model file from local training for cluster training. Keep in mind that the `batch_size` set in the `setting` function in the model file means the batch size on `each` node of the cluster job, not the total batch size, when synchronous SGD is used.
2626

27-
Following steps are based on demo/recommendation demo in demo directory.
27+
The following steps are based on the [demo/recommendation](https://github.com/PaddlePaddle/Paddle/tree/develop/demo/recommendation) demo in the demo directory.
2828

29-
You just go through demo/recommendation tutorial doc until ```Train``` section, and at last you will get train/test data and model configuration file. Finaly, just use demo/recommendation as workspace for cluster training.
29+
Just go through the demo/recommendation tutorial doc up to the `Train` section; at the end you will have the train/test data and the model configuration file. Finally, just use demo/recommendation as the workspace for cluster training.
3030

3131
In the end, your workspace should look as follows:
3232
```
@@ -55,36 +55,36 @@ At last your workspace should look like as follow:
5555
```
5656
Not all of these files are needed for cluster training, but it's not necessary to remove useless files.
5757

58-
```trainer_config.py```
58+
`trainer_config.py`
5959
Indicates the model config file.
6060

61-
```train.list``` and ```test.list```
61+
`train.list` and `test.list`
6262
File index. It stores all relative or absolute file paths of all train/test data at current node.
6363

64-
```dataprovider.py```
64+
`dataprovider.py`
6565
Used to read train/test samples. It's the same as in local training.
6666

67-
```data```
67+
`data`
6868
All files in the data directory are referred to by train.list/test.list, which are in turn referred to by the data provider.
6969

7070

7171
## Prepare Cluster Job Configuration
7272

7373
The options below must be carefully set in cluster_train/conf.py
7474

75-
```HOSTS``` all nodes hostname or ip that will run cluster job. You can also append user and ssh port with hostname, such as root@192.168.100.17:9090.
75+
`HOSTS` hostnames or IPs of all nodes that will run the cluster job. You can also attach a user and SSH port to the hostname, such as root@192.168.100.17:9090.
7676

77-
```ROOT_DIR``` workspace ROOT directory for placing JOB workspace directory
77+
`ROOT_DIR` the workspace ROOT directory in which the JOB workspace directory is placed
7878

79-
```PADDLE_NIC``` the NIC(Network Interface Card) interface name for cluster communication channel, such as eth0 for ethternet, ib0 for infiniband.
79+
`PADDLE_NIC` the NIC (Network Interface Card) interface name for the cluster communication channel, such as eth0 for Ethernet or ib0 for InfiniBand.
8080

81-
```PADDLE_PORT``` port number for cluster commnunication channel
81+
`PADDLE_PORT` the port number for the cluster communication channel
8282

83-
```PADDLE_PORTS_NUM``` the number of port used for cluster communication channle. if the number of cluster nodes is small(less than 5~6nodes), recommend you set it to larger, such as 2 ~ 8, for better network performance.
83+
`PADDLE_PORTS_NUM` the number of ports used for the cluster communication channel. If the number of cluster nodes is small (fewer than 5~6 nodes), we recommend setting it to a larger value, such as 2 ~ 8, for better network performance.
8484

85-
```PADDLE_PORTS_NUM_FOR_SPARSE``` the number of port used for sparse updater cluster commnunication channel. if sparse remote update is used, set it like ```PADDLE_PORTS_NUM```
85+
`PADDLE_PORTS_NUM_FOR_SPARSE` the number of ports used for the sparse updater cluster communication channel. If sparse remote update is used, set it like `PADDLE_PORTS_NUM`.
8686

87-
```LD_LIBRARY_PATH``` set addtional LD_LIBRARY_PATH for cluster job. You can use it to set CUDA libraries path.
87+
`LD_LIBRARY_PATH` sets an additional LD_LIBRARY_PATH for the cluster job. You can use it to set the path of the CUDA libraries.
8888

8989
The default configuration is as follows:
9090

@@ -118,39 +118,39 @@ LD_LIBRARY_PATH="/usr/local/cuda/lib64:/usr/lib64"
118118
```
119119

120120
### Launching Cluster Job
121-
```paddle.py``` provides automatical scripts to start all PaddlePaddle cluster processes in different nodes. By default, all command line options can set as ```paddle.py``` command options and ```paddle.py``` will transparently and automatically set these options to PaddlePaddle lower level processes.
121+
`paddle.py` provides automated scripts to start all PaddlePaddle cluster processes on different nodes. By default, all command line options can be set as `paddle.py` command options, and `paddle.py` will transparently and automatically pass these options to the lower-level PaddlePaddle processes.
122122

123-
```paddle.py```provides two distinguished command option for easy job launching.
123+
`paddle.py` provides two distinct command options for easy job launching.
124124

125-
```job_dispatch_package``` set it with local ```workspace```directory, it will be dispatched to all nodes set in conf.py. It could be helpful for frequent hacking workspace files, otherwise frequent mulit-nodes workspace deployment could make your crazy.
126-
```job_workspace``` set it with already deployed workspace directory, ```paddle.py``` will skip dispatch stage to directly launch cluster job with all nodes. It could help to reduce heavy
125+
`job_dispatch_package` set it to a local `workspace` directory; it will be dispatched to all nodes set in conf.py. This is helpful when you frequently hack on workspace files; otherwise frequent multi-node workspace deployment could drive you crazy.
126+
`job_workspace` set it to an already deployed workspace directory; `paddle.py` will skip the dispatch stage and directly launch the cluster job on all nodes. It could help to reduce heavy
127127
dispatch latency.
128128

129-
```cluster_train/run.sh``` provides command line sample to run ```demo/recommendation``` cluster job, just modify ```job_dispatch_package``` and ```job_workspace``` with your defined directory, then:
129+
`cluster_train/run.sh` provides a sample command line to run the `demo/recommendation` cluster job. Just modify `job_dispatch_package` and `job_workspace` with your own directories, then:
130130
```
131131
sh run.sh
132132
```
133133

134134
The cluster job will start in several seconds.
135135

136136
### Kill Cluster Job
137-
```paddle.py``` can capture ```Ctrl + C``` SIGINT signal to automatically kill all processes launched by it. So just stop ```paddle.py``` to kill cluster job. You should mannally kill job if program crashed.
137+
`paddle.py` can capture the `Ctrl + C` SIGINT signal to automatically kill all processes launched by it, so just stop `paddle.py` to kill the cluster job. You should manually kill the job if the program crashes.
138138

139139
### Check Cluster Training Result
140140
Check the logs in $workspace/log for details; each node has the same log structure.
141141

142-
```paddle_trainer.INFO```
142+
`paddle_trainer.INFO`
143143
It provides almost all internal output logs for training, the same as local training. Check runtime model convergence here.
144144

145-
```paddle_pserver2.INFO```
145+
`paddle_pserver2.INFO`
146146
It provides the pserver running log, which could help to diagnose distributed errors.
147147

148-
```server.log```
148+
`server.log`
149149
It provides the stderr and stdout of the pserver process. Check the error log if training crashes.
150150

151-
```train.log```
151+
`train.log`
152152
It provides the stderr and stdout of the trainer process. Check the error log if training crashes.
153153

154154
### Check Model Output
155-
After one pass finished, model files will be writed in ```output``` directory in node 0.
156-
```nodefile``` in workspace indicates the node id of current cluster job.
155+
After one pass finishes, model files will be written to the `output` directory on node 0.
156+
`nodefile` in workspace indicates the node id of current cluster job.
