doc/howto/usage/cluster/cluster_train_en.md
+28-28
@@ -2,7 +2,7 @@
In this article, we explain how to run distributed Paddle training jobs on clusters. We will create the distributed version of the single-process training example, [recommendation](https://github.com/baidu/Paddle/tree/develop/demo/recommendation).

-[Scripts](https://github.com/baidu/Paddle/tree/develop/paddle/scripts/cluster_train) used in this article launch distributed jobs via SSH. They also work as a reference for users running more sophisticated cluster management systems like MPI and Kubernetes.
+[Scripts](https://github.com/baidu/Paddle/tree/develop/paddle/scripts/cluster_train) used in this article launch distributed jobs via SSH. They also work as a reference for users running more sophisticated cluster management systems like MPI and [Kubernetes](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/howto/usage/cluster/k8s).

## Prerequisite
@@ -20,13 +20,13 @@ In this article, we explain how to run distributed Paddle training jobs on clust
We refer to the directory where we put dependent libraries, config files, etc., as *workspace*.

-These ```train/test``` data should be prepared before launching cluster job. To satisfy the requirement that train/test data are placed in different directory from workspace, PADDLE refers train/test data according to index file named as ```train.list/test.list``` which are used in model config file. So the train/test data also contains train.list/test.list two list file. All local training demo already provides scripts to help you create these two files, and all nodes in cluster job will handle files with same logical code in normal condition.
+The `train/test` data should be prepared before launching the cluster job. Because the train/test data may be placed in a directory different from the workspace, Paddle locates the data through index files named `train.list` and `test.list`, which are referenced in the model config file. The train/test data therefore also includes these two list files. All local training demos already provide scripts to help you create them, and under normal conditions all nodes in a cluster job handle the files with the same code logic.

-Generally, you can use same model file from local training for cluster training. What you should have in mind that, the ```batch_size``` set in ```setting``` function in model file means batch size in ```each``` node of cluster job instead of total batch size if synchronization SGD was used.
+Generally, you can use the same model file for cluster training as for local training. Keep in mind that the `batch_size` set in the `settings` function of the model file is the batch size on each node of the cluster job, not the total batch size, when synchronous SGD is used.
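As a rough illustration of the per-node batch size described above, the following sketch computes how many samples one synchronous SGD step consumes across the whole job. The helper name `effective_batch_size` and the numbers are hypothetical; they are not part of Paddle.

```python
def effective_batch_size(per_node_batch_size, num_trainer_nodes):
    """Total samples per synchronous SGD step across all trainer nodes.

    Hypothetical helper for illustration only; not part of the Paddle API.
    """
    if per_node_batch_size <= 0 or num_trainer_nodes <= 0:
        raise ValueError("batch size and node count must be positive")
    return per_node_batch_size * num_trainer_nodes


# Assumed values: batch_size=1600 in the model file, 4 trainer nodes.
print(effective_batch_size(1600, 4))
```

If you want to keep the total batch size of a local run, this suggests dividing the local `batch_size` by the number of trainer nodes before launching the cluster job.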
-Following steps are based on demo/recommendation demo in demo directory.
+The following steps are based on the [demo/recommendation](https://github.com/PaddlePaddle/Paddle/tree/develop/demo/recommendation) demo in the demo directory.

-You just go through demo/recommendation tutorial doc until ```Train``` section, and at last you will get train/test data and model configuration file. Finaly, just use demo/recommendation as workspace for cluster training.
+Go through the demo/recommendation tutorial up to the `Train` section; at that point you will have the train/test data and the model configuration file. Finally, use demo/recommendation as the workspace for cluster training.

In the end, your workspace should look as follows:
```
@@ -55,36 +55,36 @@ At last your workspace should look like as follow:
```
Not all of these files are needed for cluster training, but there is no need to remove the unused ones.

-```trainer_config.py```
+`trainer_config.py`
Indicates the model config file.

-```train.list``` and ```test.list```
+`train.list` and `test.list`
File indexes. They store the relative or absolute paths of all train/test data files on the current node.

-```dataprovider.py```
+`dataprovider.py`
Used to read train/test samples. It is the same as in local training.

-```data```
+`data`
All files in the data directory are referenced by train.list/test.list, which are in turn referenced by the data provider.
## Prepare Cluster Job Configuration

The options below must be set carefully in cluster_train/conf.py.

-```HOSTS``` all nodes hostname or ip that will run cluster job. You can also append user and ssh port with hostname, such as root@192.168.100.17:9090.
+`HOSTS`: the hostnames or IP addresses of all nodes that will run the cluster job. You can also add a user and an SSH port to the hostname, such as root@192.168.100.17:9090.

-```ROOT_DIR``` workspace ROOT directory for placing JOB workspace directory
+`ROOT_DIR`: the root directory under which the job workspace directory is placed.

-```PADDLE_NIC``` the NIC(Network Interface Card) interface name for cluster communication channel, such as eth0 for ethternet, ib0 for infiniband.
+`PADDLE_NIC`: the name of the NIC (Network Interface Card) used for the cluster communication channel, such as eth0 for Ethernet or ib0 for InfiniBand.

-```PADDLE_PORT``` port number for cluster commnunication channel
+`PADDLE_PORT`: the port number for the cluster communication channel.

-```PADDLE_PORTS_NUM``` the number of port used for cluster communication channle. if the number of cluster nodes is small(less than 5~6nodes), recommend you set it to larger, such as 2 ~ 8, for better network performance.
+`PADDLE_PORTS_NUM`: the number of ports used for the cluster communication channel. If the cluster is small (fewer than 5 to 6 nodes), a larger value such as 2 to 8 is recommended for better network performance.

-```PADDLE_PORTS_NUM_FOR_SPARSE``` the number of port used for sparse updater cluster commnunication channel. if sparse remote update is used, set it like ```PADDLE_PORTS_NUM```
+`PADDLE_PORTS_NUM_FOR_SPARSE`: the number of ports used for the sparse updater cluster communication channel. If sparse remote update is used, set it in the same way as `PADDLE_PORTS_NUM`.

-```LD_LIBRARY_PATH``` set addtional LD_LIBRARY_PATH for cluster job. You can use it to set CUDA libraries path.
+`LD_LIBRARY_PATH`: additional LD_LIBRARY_PATH entries for the cluster job. You can use it to set the path to the CUDA libraries.
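To make the options above concrete, here is a minimal sketch of what a cluster_train/conf.py might contain. Every value is a placeholder assumption (addresses, paths, port numbers), and the `split_host` helper is hypothetical: it only illustrates the `user@host:port` form accepted in `HOSTS` and is not part of the Paddle scripts.

```python
# Sketch of cluster_train/conf.py. All values are placeholders (assumptions),
# chosen only to illustrate the options described above.
HOSTS = [
    "root@192.168.100.17",       # user@host
    "root@192.168.100.18:9090",  # user@host:ssh_port
]
ROOT_DIR = "/home/paddle"        # assumed path
PADDLE_NIC = "eth0"
PADDLE_PORT = 7164               # assumed port number
PADDLE_PORTS_NUM = 2
PADDLE_PORTS_NUM_FOR_SPARSE = 2
LD_LIBRARY_PATH = "/usr/local/cuda/lib64"  # assumed CUDA library path


def split_host(entry, default_port=22):
    """Split 'user@host:port' into (user, host, port).

    Hypothetical helper, not part of the Paddle scripts.
    The user and the port are both optional.
    """
    user, _, rest = entry.rpartition("@")
    host, _, port = rest.partition(":")
    return (user or None, host, int(port) if port else default_port)


for entry in HOSTS:
    print(split_host(entry))
```

Keeping the configuration as plain module-level names means a launcher script can simply import it and read the values.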
-```paddle.py``` provides automatical scripts to start all PaddlePaddle cluster processes in different nodes. By default, all command line options can set as ```paddle.py``` command options and ```paddle.py``` will transparently and automatically set these options to PaddlePaddle lower level processes.
+`paddle.py` provides automation scripts to start all PaddlePaddle cluster processes on the different nodes. By default, all command-line options can be set as `paddle.py` command options, and `paddle.py` transparently passes them on to the lower-level PaddlePaddle processes.

-```paddle.py```provides two distinguished command option for easy job launching.
+`paddle.py` provides two distinct command options for easy job launching.

-```job_dispatch_package``` set it with local ```workspace```directory, it will be dispatched to all nodes set in conf.py. It could be helpful for frequent hacking workspace files, otherwise frequent mulit-nodes workspace deployment could make your crazy.
-```job_workspace``` set it with already deployed workspace directory, ```paddle.py``` will skip dispatch stage to directly launch cluster job with all nodes. It could help to reduce heavy
+`job_dispatch_package`: set it to a local `workspace` directory, which will be dispatched to all nodes set in conf.py. This is helpful when you frequently modify workspace files; otherwise, repeatedly deploying the workspace to multiple nodes could drive you crazy.
+`job_workspace`: set it to an already deployed workspace directory; `paddle.py` will skip the dispatch stage and directly launch the cluster job on all nodes. This helps to avoid heavy
dispatch latency.
-```cluster_train/run.sh``` provides command line sample to run ```demo/recommendation``` cluster job, just modify ```job_dispatch_package``` and ```job_workspace``` with your defined directory, then:
+`cluster_train/run.sh` provides a sample command line to run the `demo/recommendation` cluster job. Just set `job_dispatch_package` and `job_workspace` to your own directories, then run:
```
sh run.sh
```

The cluster job will start in several seconds.
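The choice between the two launch options can be sketched as a small command builder. This is only an illustration under stated assumptions: the `--name=value` option form and the rule that exactly one of the two options is given are assumptions of this sketch, the path is a placeholder, and `build_launch_command` is a hypothetical helper rather than part of the Paddle scripts.

```python
def build_launch_command(dispatch_package=None, workspace=None):
    """Build an argv list for paddle.py.

    Assumptions of this sketch: options use the --name=value form, and
    exactly one of the two options is given. Hypothetical helper.
    """
    if (dispatch_package is None) == (workspace is None):
        raise ValueError("set exactly one of dispatch_package or workspace")
    cmd = ["python", "paddle.py"]
    if dispatch_package is not None:
        # Local workspace that still has to be copied to every node.
        cmd.append("--job_dispatch_package=" + dispatch_package)
    else:
        # Workspace already deployed on every node; the dispatch stage is skipped.
        cmd.append("--job_workspace=" + workspace)
    return cmd


# Placeholder path, for illustration only.
print(" ".join(build_launch_command(dispatch_package="/path/to/demo/recommendation")))
```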
### Kill Cluster Job
-```paddle.py``` can capture ```Ctrl + C``` SIGINT signal to automatically kill all processes launched by it. So just stop ```paddle.py``` to kill cluster job. You should mannally kill job if program crashed.
+`paddle.py` captures the `Ctrl + C` SIGINT signal and automatically kills all processes it launched, so simply stop `paddle.py` to kill the cluster job. If the program crashes, you have to kill the job manually.
### Check Cluster Training Result
Check the logs in $workspace/log for details; every node has the same log structure.

-```paddle_trainer.INFO```
+`paddle_trainer.INFO`
Provides almost all internal output logs for training, the same as in local training. Check the runtime model convergence here.

-```paddle_pserver2.INFO```
+`paddle_pserver2.INFO`
Provides the pserver running log, which can help to diagnose distributed errors.

-```server.log```
+`server.log`
Provides the stderr and stdout of the pserver process. Check the error log here if training crashes.

-```train.log```
+`train.log`
Provides the stderr and stdout of the trainer process. Check the error log here if training crashes.
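When a job misbehaves, a quick first check is whether all four log files listed above exist on a node. The sketch below does only that; `missing_logs` is a hypothetical helper, and the assumption that the current directory is the workspace is ours.

```python
import os

# Log file names listed above; every node is expected to have the same set.
EXPECTED_LOGS = [
    "paddle_trainer.INFO",
    "paddle_pserver2.INFO",
    "server.log",
    "train.log",
]


def missing_logs(log_dir):
    """Return the expected log files that are absent from log_dir.

    Hypothetical helper for illustration; not part of the Paddle scripts.
    """
    return [name for name in EXPECTED_LOGS
            if not os.path.isfile(os.path.join(log_dir, name))]


# Assumes the current directory is the workspace, so logs live in ./log.
print(missing_logs(os.path.join(os.getcwd(), "log")))
```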
### Check Model Output
-After one pass finished, model files will be writed in ```output``` directory in node 0.
-```nodefile``` in workspace indicates the node id of current cluster job.
+After one pass has finished, model files will be written to the `output` directory on node 0.
+`nodefile` in the workspace indicates the node ID of the current cluster job.