Commit 70d25c4

Merge pull request #293 from drinktee/k8s
Add PaddlePaddle QuickStart Demo on Kubernetes
2 parents 8a6b744 + 3bfb898 commit 70d25c4

7 files changed: +742 -0 lines changed
Lines changed: 205 additions & 0 deletions
@@ -0,0 +1,205 @@
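# Paddle on Kubernetes: Single-Node Training

This document describes how to launch a single-node, CPU-only Paddle training job on a Kubernetes cluster. The next document describes how to launch a distributed training job.
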
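## Building the Docker Image

In a fully featured Kubernetes cluster, a distributed file system such as Ceph is usually installed to store the training data, so that every process of a distributed Paddle training job can read its data from Ceph. This example demonstrates only a single-node job, so we can relax the environment requirements and put the training data directly into the Paddle Docker image. To do that, we need to build a Paddle image that contains the training data.

Paddle's [Quick Start Tutorial](http://www.paddlepaddle.org/doc/demo/quick_start/index_en.html) describes how to download the training data with the scripts in the Paddle source tree. The `paddledev/paddle:cpu-demo-latest` image contains the Paddle source code and the demos. (Note that the default Paddle image, `paddledev/paddle:cpu-latest`, does not include the source code; see the [Docker installation guide](http://www.paddlepaddle.org/doc/build/docker_install.html) for the available Paddle images.) We therefore use the `cpu-demo-latest` image to download the training data into a Docker container, and then save that container, which now holds the training data, as a new image.
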
### Run the Container

```
$ docker run --name quick_start_data -it paddledev/paddle:cpu-demo-latest
```

### Download the Data

Inside the container, change to the `/root/paddle/demo/quick_start/data` directory and run `get_data.sh` to download the data:

```
root@fbd1f2bb71f4:~/paddle/demo/quick_start/data# ./get_data.sh

Downloading Amazon Electronics reviews data...
--2016-10-31 01:33:43-- http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz
Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80
Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 495854086 (473M) [application/x-gzip]
Saving to: 'reviews_Electronics_5.json.gz'

10% [=======> ] 874,279 64.7KB/s eta 2h 13m

```

### Modify the Startup Script

After the download finishes, edit `/root/paddle/demo/quick_start/train.sh` so that it reads as follows (a `cd` command has been added):

```
set -e
# Added: switch to the demo directory so that the relative paths below
# resolve no matter where the script is launched from.
cd /root/paddle/demo/quick_start
cfg=trainer_config.lr.py
#cfg=trainer_config.emb.py
#cfg=trainer_config.cnn.py
#cfg=trainer_config.lstm.py
#cfg=trainer_config.bidi-lstm.py
#cfg=trainer_config.db-lstm.py
paddle train \
  --config=$cfg \
  --save_dir=./output \
  --trainer_count=4 \
  --log_period=20 \
  --num_passes=15 \
  --use_gpu=false \
  --show_parameter_stats_period=100 \
  --test_all_data_in_one_period=1 \
  2>&1 | tee 'train.log'
```

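If you would rather not edit the file by hand, the same change can be scripted. The following is a minimal sketch, assuming the original `train.sh` contains a line that is exactly `set -e`; the function name `add_cd_line` is hypothetical and not part of Paddle.

```shell
#!/bin/sh
# Hypothetical helper (not part of Paddle): insert "cd <dir>" right after the
# first line that is exactly "set -e", unless that cd line is already present.
add_cd_line() {
    file=$1
    dir=$2
    # Do nothing if the script already changes into the directory.
    if grep -qxF "cd $dir" "$file"; then
        return 0
    fi
    tmp="$file.tmp.$$"
    awk -v dir="$dir" '
        { print }
        $0 == "set -e" && !inserted { print "cd " dir; inserted = 1 }
    ' "$file" > "$tmp" || return 1
    # Write back through the original file so its executable bit is preserved.
    cat "$tmp" > "$file" && rm -f "$tmp"
}
```

Inside the container this could be called as `add_cd_line /root/paddle/demo/quick_start/train.sh /root/paddle/demo/quick_start`. If the file has no `set -e` line, the sketch leaves it unchanged, so it is worth checking the result before committing the image.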
### Commit the Image

After modifying the startup script, exit the container and create a new image with `docker commit`:

```
$ docker commit quick_start_data mypaddle/paddle:quickstart
```

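## Training with Kubernetes

>For workloads in which the container exits on its own once the task finishes, Kubernetes provides the Job resource type. The rest of this document uses a Job to run the training.
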
### Write the YAML File

Training output may be deleted when the container is destroyed, so a volume must be mounted before the container is created in order to keep the training results. Using the image built earlier, we can create a [Kubernetes Job](http://kubernetes.io/docs/user-guide/jobs/#what-is-a-job). A simple YAML file looks like this:

```
apiVersion: batch/v1
kind: Job
metadata:
  name: quickstart
spec:
  parallelism: 1
  completions: 1
  template:
    metadata:
      name: quickstart
    spec:
      volumes:
      - name: output
        hostPath:
          path: /home/work/paddle_output
      containers:
      - name: pi
        image: mypaddle/paddle:quickstart
        command: ["/bin/bash", "-c", "/root/paddle/demo/quick_start/train.sh"]
        volumeMounts:
        - name: output
          mountPath: /root/paddle/demo/quick_start/output
      restartPolicy: Never
```

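The image name and the host directory will usually differ from one cluster to the next, so it can help to generate the manifest from parameters. The following is a sketch only: `render_job` is a hypothetical helper, and it prints the same Job as the manifest in this section with two values substituted. It spells the interpreter as the absolute path `/bin/bash` so that the command does not depend on the container's working directory.

```shell
#!/bin/sh
# Hypothetical helper: print the quickstart Job manifest for a given image
# and host output directory.
# Usage: render_job IMAGE HOST_DIR
render_job() {
    image=$1
    host_dir=$2
    cat <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: quickstart
spec:
  parallelism: 1
  completions: 1
  template:
    metadata:
      name: quickstart
    spec:
      volumes:
      - name: output
        hostPath:
          path: ${host_dir}
      containers:
      - name: pi
        image: ${image}
        command: ["/bin/bash", "-c", "/root/paddle/demo/quick_start/train.sh"]
        volumeMounts:
        - name: output
          mountPath: /root/paddle/demo/quick_start/output
      restartPolicy: Never
EOF
}
```

For example, `render_job mypaddle/paddle:quickstart /home/work/paddle_output > paddle.yaml` writes the manifest to a file that `kubectl create -f` can consume.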
### Create the Paddle Job

Create the Kubernetes Job from the YAML file written above (saved here as `paddle.yaml`):

```
$ kubectl create -f paddle.yaml
```

Inspect the Job. Note that the image in the sample output (`registry.baidu.com/public/paddle:cpu-demo-latest`) differs from the `mypaddle/paddle:quickstart` image in the manifest; your own output will show whichever image your Job uses.

```
$ kubectl get job
NAME         DESIRED   SUCCESSFUL   AGE
quickstart   1         0            58s

$ kubectl describe job quickstart
Name:           quickstart
Namespace:      default
Image(s):       registry.baidu.com/public/paddle:cpu-demo-latest
Selector:       controller-uid=f120da72-9f18-11e6-b363-448a5b355b84
Parallelism:    1
Completions:    1
Start Time:     Mon, 31 Oct 2016 11:20:16 +0800
Labels:         controller-uid=f120da72-9f18-11e6-b363-448a5b355b84,job-name=quickstart
Pods Statuses:  0 Running / 1 Succeeded / 0 Failed
Volumes:
  output:
    Type:       HostPath (bare host directory volume)
    Path:       /home/work/paddle_output
Events:
  FirstSeen  LastSeen  Count  From               SubobjectPath  Type      Reason            Message
  ---------  --------  -----  ----               -------------  --------  ------            -------
  1m         1m        1      {job-controller }                 Normal    SuccessfulCreate  Created pod: quickstart-fa0wx
```

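`kubectl get job` only reports the state at the moment it is run (in the sample above, `SUCCESSFUL` is still 0 after 58 seconds). To block until training is done, one option is to poll the Job status. The following is a sketch: `wait_for_job` is a hypothetical helper, and the default timeout and interval are arbitrary assumptions. It reads the Job's `.status.succeeded` field and does not try to detect failed pods.

```shell
#!/bin/sh
# Hypothetical helper: poll a Job until it reports at least one succeeded pod.
# Usage: wait_for_job NAME [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
wait_for_job() {
    job=$1
    timeout=${2:-600}
    interval=${3:-5}
    waited=0
    while :; do
        # .status.succeeded is the number of pods that finished successfully;
        # it is empty until the first pod succeeds.
        succeeded=$(kubectl get job "$job" -o jsonpath='{.status.succeeded}')
        if [ "${succeeded:-0}" -ge 1 ]; then
            echo "job $job succeeded"
            return 0
        fi
        if [ "$waited" -ge "$timeout" ]; then
            echo "timed out waiting for job $job" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
}
```

A call such as `wait_for_job quickstart 900 10` would check every 10 seconds for up to 15 minutes.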
### View the Training Results

The details of the Pod that belongs to the Job show which host the Pod ran on (the `Node` field):

```
$ kubectl describe pod quickstart-fa0wx
Name:           quickstart-fa0wx
Namespace:      default
Node:           paddle-demo-let02/10.206.202.44
Start Time:     Mon, 31 Oct 2016 11:20:17 +0800
Labels:         controller-uid=f120da72-9f18-11e6-b363-448a5b355b84,job-name=quickstart
Status:         Succeeded
IP:             10.0.0.9
Controllers:    Job/quickstart
Containers:
  quickstart:
    Container ID:   docker://b8561f5c79193550d64fa47418a9e67ebdd71546186e840f88de5026b8097465
    Image:          registry.baidu.com/public/paddle:cpu-demo-latest
    Image ID:       docker://18e457ce3d362ff5f3febf8e7f85ffec852f70f3b629add10aed84f930a68750
    Port:
    Command:
      bin/bash
      -c
      /root/paddle/demo/quick_start/train.sh
    QoS Tier:
      cpu:          BestEffort
      memory:       BestEffort
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 31 Oct 2016 11:20:20 +0800
      Finished:     Mon, 31 Oct 2016 11:21:46 +0800
    Ready:          False
    Restart Count:  0
    Environment Variables:
Conditions:
  Type      Status
  Ready     False
Volumes:
  output:
    Type:   HostPath (bare host directory volume)
    Path:   /home/work/paddle_output
```

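Reading the `Node` field out of the full description works, but the node can also be queried directly. The following is a sketch under two assumptions: the Job's pods carry the `job-name=<job>` label (as the `Labels` line in the sample output shows), and your `kubectl` version lists completed pods by default. The helper names `job_pod` and `job_node` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers: look up the first pod created by a Job and the node
# that pod was scheduled on, using the job-name label set on the Job's pods.

# Print the name of the first pod that belongs to the Job.
job_pod() {
    kubectl get pods -l "job-name=$1" \
        -o jsonpath='{.items[0].metadata.name}'
}

# Print the name of the node that ran the first pod of the Job.
job_node() {
    kubectl get pods -l "job-name=$1" \
        -o jsonpath='{.items[0].spec.nodeName}'
}
```

With these, `job_node quickstart` prints the node name, and `kubectl logs "$(job_pod quickstart)"` shows the training log that `train.sh` wrote to standard output.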
We can also log in to that host and inspect the training results:

```
[root@paddle-demo-let02 paddle_output]# ll
total 60
drwxr-xr-x 2 root root 4096 Oct 31 11:20 pass-00000
drwxr-xr-x 2 root root 4096 Oct 31 11:20 pass-00001
drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00002
drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00003
drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00004
drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00005
drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00006
drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00007
drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00008
drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00009
drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00010
drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00011
drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00012
drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00013
drwxr-xr-x 2 root root 4096 Oct 31 11:21 pass-00014
```
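
The listing contains 15 directories, `pass-00000` through `pass-00014`, which matches `--num_passes=15` in `train.sh`. A quick way to confirm that no pass is missing is sketched below; `check_passes` is a hypothetical helper, and it assumes the `pass-NNNNN` naming seen in the listing.

```shell
#!/bin/sh
# Hypothetical helper: verify that OUTPUT_DIR holds one pass-NNNNN directory
# for each expected pass, numbered from 0.
# Usage: check_passes OUTPUT_DIR EXPECTED_PASSES
check_passes() {
    out_dir=$1
    expected=$2
    i=0
    while [ "$i" -lt "$expected" ]; do
        pass_dir=$(printf '%s/pass-%05d' "$out_dir" "$i")
        if [ ! -d "$pass_dir" ]; then
            echo "missing: $pass_dir" >&2
            return 1
        fi
        i=$((i + 1))
    done
    echo "found all $expected passes in $out_dir"
}
```

On the host from the example, this would be run as `check_passes /home/work/paddle_output 15`.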

doc_cn/cluster/k8s/Dockerfile

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
FROM paddledev/paddle:cpu-latest

MAINTAINER zjsxzong89@gmail.com

COPY start.sh /root/
COPY start_paddle.py /root/
CMD ["bash", "-c", "/root/start.sh"]
