
Commit 2ad6078

update docs according to testing results
1 parent: fe4090f

File tree: 8 files changed, +23 −20 lines

- examples/gpt/README.md
- examples/gpt/hybrid_parallel/README.md
- examples/gpt/hybrid_parallel/run.sh
- examples/gpt/hybrid_parallel/run_pretrain.py
- examples/gpt/single/run.sh
- examples/gpt/single/run_pretrain.py
- fleetx/data/data_tools/cpp/Makefile
- requirements.txt

examples/gpt/README.md

+1 −4

@@ -28,21 +28,18 @@ GPT-[2](https://cdn.openai.com/better-language-models/language_models_are_unsupe
 - regex
 - colorlog
 - colorama
-- cached_path >= 1.1.5
 - omegaconf
 - sentencepiece >= 0.1.94
 - tqdm
 - visualdl
-- paddlepaddle-gpu >= 2.2rc
 - pybind11
 - lac (optional)
 - zstandard (optional)
 
 **Installation command**
 ```shell
-pip install regex colorlog colorama cached_path omegaconf sentencepiece tqdm visualdl pybind11 lac zstandard
+python -m pip install regex colorlog colorama omegaconf sentencepiece tqdm visualdl
 ```
-Note: PaddlePaddle >= 2.2rc (or the latest develop build) is required; see the Paddle [official site](https://www.paddlepaddle.org.cn) for installation instructions.
 
 ### Data preparation
 
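The updated install command leaves out the optional packages from the dependency list above. A minimal sketch, assuming the optional `lac` and `zstandard` features are actually wanted:

```shell
# Optional dependencies from the README list; install only if needed.
python -m pip install lac zstandard
```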

examples/gpt/hybrid_parallel/README.md

+14 −4

@@ -215,18 +215,28 @@ python -m paddle.distributed.launch --log_dir $log_dir --devices "0,1,2,3,4,5,6,
 
 ### Multi-node training
 
-To train large models on more machines, run the launch command on every node that takes part in training. Taking 2-node group-sharded parallel training of the 6.7B model as an example, the launch command is:
+To train large models on more machines, set the master node's IP/port on every participating node before running the launch command (the master IP can simply be the IP of any one of the machines used for training).
+
+Taking 2-node group-sharded parallel training of the 6.7B model as an example, the launch command is:
 
 ```shell
+master_ip=<master node IP>
+master_port=<an available free port>
+
 log_dir=log_sharding16
-python -m paddle.distributed.launch --log_dir $log_dir --master=10.10.1.1:49178 --nnodes=2 --devices "0,1,2,3,4,5,6,7" run_pretrain.py \
+python -m paddle.distributed.launch --log_dir $log_dir --master=$master_ip:$master_port --nnodes=2 --devices "0,1,2,3,4,5,6,7" run_pretrain.py \
     -c ./configs_6.7B_sharding16.yaml
 ```
 
-To run 16-node hybrid-parallel training of the 175B model, since there are many nodes you may consider using an `ssh` script or `mpirun` to distribute the command across nodes; the launch command is:
+To run 16-node hybrid-parallel training of the 175B model, the launch command is:
 
 ```shell
+master_ip=<master node IP>
+master_port=<an available free port>
+
 log_dir=log_mp8_pp16
-mpirun python -m paddle.distributed.launch --log_dir $log_dir --master=10.10.1.1:49178 --nnodes=16 --devices "0,1,2,3,4,5,6,7" run_pretrain.py \
+python -m paddle.distributed.launch --log_dir $log_dir --master=$master_ip:$master_port --nnodes=16 --devices "0,1,2,3,4,5,6,7" run_pretrain.py \
     -c ./configs_175B_mp8_pp16.yaml
 ```
+
+When there are many nodes, consider using an `ssh` script or `mpirun` to distribute the launch command across the nodes.
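The note above suggests `ssh` or `mpirun` for fanning the launch command out to many nodes. Below is a minimal ssh-based sketch; `hosts.txt`, the repository path, and the concrete master IP/port values are placeholder assumptions, not part of the documented workflow:

```shell
# Run the same launch command on every node listed in hosts.txt (one IP per line);
# all nodes join the same 16-node job through the shared master address.
master_ip=10.10.1.1      # IP of any one of the training nodes (placeholder)
master_port=49178        # a free port on that node (placeholder)
while read -r host; do
  ssh -n "$host" "cd /path/to/repo/examples/gpt/hybrid_parallel && \
    python -m paddle.distributed.launch --log_dir log_mp8_pp16 \
      --master=$master_ip:$master_port --nnodes=16 --devices '0,1,2,3,4,5,6,7' \
      run_pretrain.py -c ./configs_175B_mp8_pp16.yaml" &
done < hosts.txt
wait
```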

examples/gpt/hybrid_parallel/run.sh

+0 −2

@@ -12,8 +12,6 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-export PYTHONPATH=$PYTHONPATH:../../../
-
 log_dir=log_hybrid
 rm -rf $log_dir
 
examples/gpt/hybrid_parallel/run_pretrain.py

+3 −3

@@ -18,17 +18,17 @@
 import random
 import time
 import sys
-sys.path.append("..")
-from examples.gpt.tools import parse_args, parse_yaml
-
 import numpy as np
+
 import paddle
 from paddle.distributed import fleet
 from paddle.distributed.fleet.meta_parallel import get_rng_state_tracker
 from paddle.distributed.sharding import group_sharded_parallel
 from paddle.fluid.dygraph.parallel import sync_params_buffers
 from paddle.distributed.fleet.utils.hybrid_parallel_util import fused_allreduce_gradients
 
+sys.path.append("../../../")
+from examples.gpt.tools import parse_args, parse_yaml
 from fleetx.datasets.gpt import create_pretrained_dataset, get_train_data_file
 from fleetx.data.tokenizers import GPTTokenizer
 from fleetx.utils import logger

examples/gpt/single/run.sh

+0 −2

@@ -12,8 +12,6 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-export PYTHONPATH=$PYTHONPATH:../../../
-
 # 345M
 python run_pretrain.py -c ./configs_345m_single_card.yaml
 
examples/gpt/single/run_pretrain.py

+3 −3

@@ -18,11 +18,11 @@
 import random
 import time
 import sys
-sys.path.append("..")
-from examples.gpt.tools import parse_args, parse_yaml
-
 import numpy as np
+
 import paddle
+sys.path.append("../../../")
+from examples.gpt.tools import parse_args, parse_yaml
 from fleetx.models.gpt_model.modeling import GPTModel, GPTForPretraining, GPTPretrainingCriterion
 from fleetx.datasets.gpt import create_pretrained_dataset, get_train_data_file
 from fleetx.data.tokenizers import GPTTokenizer
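A note on the relative path: `sys.path.append("../../../")` is resolved against the current working directory, so with the `PYTHONPATH` export removed from `run.sh` the scripts assume they are launched from their own example directory. A minimal sketch of the assumed usage:

```shell
# Launch from the example directory so "../../../" points at the repository root
# and the examples.gpt.tools import (and the fleetx imports, if fleetx is not
# installed as a package) resolve without setting PYTHONPATH.
cd examples/gpt/single
bash run.sh
```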

fleetx/data/data_tools/cpp/Makefile

+1 −1

@@ -1,5 +1,5 @@
 CXXFLAGS += -O3 -Wall -shared -std=c++11 -fPIC -fdiagnostics-color
-CPPFLAGS += $(shell python3 -m pybind11 --includes)
+CPPFLAGS += $(shell python -m pybind11 --includes)
 LIBNAME = fast_index_map_helpers
 LIBEXT = ".so"
 
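The `python -m pybind11 --includes` helper used above prints the compiler include flags for the active interpreter's headers and the installed pybind11 headers, so the Makefile now builds against whatever `python` resolves to. A quick check of what `CPPFLAGS` will receive (the paths in the comment are only illustrative):

```shell
# Prints something along the lines of:
#   -I/usr/include/python3.9 -I/usr/lib/python3.9/site-packages/pybind11/include
python -m pybind11 --includes
```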

requirements.txt

+1 −1

@@ -1,6 +1,6 @@
 regex
 colorlog
 colorama
-inspect
 omegaconf
 tqdm
+pybind11
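With `inspect` removed (it is a Python standard-library module rather than an installable pip package) and `pybind11` added, the file should install cleanly. A minimal usage sketch from the repository root:

```shell
# Install the listed runtime dependencies.
python -m pip install -r requirements.txt
```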
