
How to correctly run transformer? #1059


Open
sfraczek opened this issue Jul 19, 2018 · 25 comments

@sfraczek
Contributor

sfraczek commented Jul 19, 2018

Hi,

I have encountered a number of problems with the fluid/neural_machine_translation/transformer model. Am I doing something wrong? How do I run it correctly?

Steps I have taken

Following the instructions in https://github.com/PaddlePaddle/models/blob/develop/fluid/neural_machine_translation/transformer/README_cn.md, I downloaded the WMT'16 EN-DE dataset from https://github.com/google/seq2seq/blob/master/docs/data.md by clicking the download link.

Next I extracted it to the wmt16_en_de directory.

Next I ran paste -d ' \ t ' train.tok.clean.bpe.32000.en train.tok.clean.bpe.32000.de > train.tok.clean.bpe.32000.en-de

Then I ran sed -i '1i\<s>\n<e>\n<unk>' vocab.bpe.32000
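After that command, the first lines of vocab.bpe.32000 should be the three special tokens (matching the --special_token arguments used below):

<s>
<e>
<unk>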

In config.py I changed use_gpu = True to False.
In train.py I added import multiprocessing and changed dev_count = fluid.core.get_cuda_device_count() to dev_count = fluid.core.get_cuda_device_count() if TrainTaskConfig.use_gpu else multiprocessing.cpu_count().
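In diff form, that train.py change looks roughly like this (paraphrased, not the exact upstream file):

+import multiprocessing

-    dev_count = fluid.core.get_cuda_device_count()
+    dev_count = fluid.core.get_cuda_device_count() \
+        if TrainTaskConfig.use_gpu else multiprocessing.cpu_count()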

Training

I launched training with

python -u train.py --src_vocab_fpath wmt16_en_de/vocab.bpe.32000 --trg_vocab_fpath wmt16_en_de/vocab.bpe.32000 --special_token '<s>' '<e>' '<unk>' --train_file_pattern wmt16_en_de/train.tok.clean.bpe.32000.en-de --use_token_batch True --batch_size 3200 --sort_type pool --pool_size 200000

but I got

E0719 14:26:29.439303 55138 graph.cc:43] softmax_with_cross_entropy_grad input var not in all_var list: softmax_with_cross_entropy_0.tmp_0@GRAD
epoch: 0, consumed 0.000161s
Traceback (most recent call last):
  File "train.py", line 428, in <module>
    train(args)
  File "train.py", line 419, in train
    "pass_" + str(pass_id) + ".checkpoint"))
  File "/home/sfraczek/Paddle/build/python/paddle/fluid/io.py", line 288, in save_persistables
    filename=filename)
  File "/home/sfraczek/Paddle/build/python/paddle/fluid/io.py", line 166, in save_vars
    filename=filename)
  File "/home/sfraczek/Paddle/build/python/paddle/fluid/io.py", line 197, in save_vars
    executor.run(save_program)
  File "/home/sfraczek/Paddle/build/python/paddle/fluid/executor.py", line 449, in run
    self.executor.run(program.desc, scope, 0, True, True)
paddle.fluid.core.EnforceNotMet: holder_ should not be null
Tensor not initialized yet when Tensor::type() is called. at [/home/sfraczek/Paddle/paddle/fluid/framework/tensor.h:139]
PaddlePaddle Call Stacks:
0       0x7f060e948f1cp paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 572
1       0x7f060e94b901p paddle::framework::Tensor::type() const + 209
2       0x7f060f617bf6p paddle::operators::SaveOp::SaveLodTensor(boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_,
boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::va
riant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> con
st&, paddle::framework::Variable*) const + 614
3       0x7f060f618472p paddle::operators::SaveOp::RunImpl(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boos
t::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::varian
t::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::
detail::variant::void_> const&) const + 210

So I have commented out

#fluid.io.save_persistables(
#    exe,
#    os.path.join(TrainTaskConfig.ckpt_dir,
#                 "pass_" + str(pass_id) + ".checkpoint"))

and it worked.

Inference

So next I tried to run inference.
I found that the file wmt16_en_de/newstest2013.tok.bpe.32000.en-de doesn't exist, but based on the README I guessed that I should run
paste -d ' \ t ' newstest2013.tok.bpe.32000.en newstest2013.tok.bpe.32000.de > newstest2013.tok.bpe.32000.en-de
Is this correct?

I ran

python -u infer.py --src_vocab_fpath wmt16_en_de/vocab.bpe.32000 --trg_vocab_fpath wmt16_en_de/vocab.bpe.32000 --special_token '<s>' '<e>' '<unk>' --test_file_pattern wmt16_en_de/newstest2013.tok.bpe.32000.en-de --batch_size 4 model_path trained_models/pass_20.infer.model beam_size 5

but there was no output from the script. It ended without error, too.

I tried giving it other files, but it doesn't output anything either.

I added profiling by importing paddle.fluid.profiler as profiler and adding

+    parser.add_argument(
+        "--profile",
+        type=bool,
+        default=False,
+        help="Enables/disables profiling.")

and

+    if args.profile:
+        with profiler.profiler("CPU", sorted_key='total') as cpuprof:
+            infer(args)
+    else:
+        infer(args)

But there is no output from the profiler.
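As an aside, argparse's type=bool converts any non-empty string (including "False") to True, so the --profile flag defined above behaves like a plain switch; an action="store_true" argument would be less ambiguous:

+    parser.add_argument(
+        "--profile",
+        action="store_true",
+        help="Enables profiling.")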

Please help.

@sfraczek
Contributor Author

sfraczek commented Jul 20, 2018

Update on debugging:

I found that there was a problem with the paste command. It didn't insert the delimiter between the English and German sentences, so the files were not parsed correctly. Are you not experiencing this problem?

I have replaced the call to paste with a Python script:

from itertools import izip  # Python 2; on Python 3 use the built-in zip instead

def concat_files(f1, f2, f3):
    # Join the source and target files line by line, writing "<src> \t <trg>"
    # per line, producing the parallel file that paste was supposed to create.
    with open(f1) as textfile1, open(f2) as textfile2, open(f3, "w") as output_file:
        for x, y in izip(textfile1, textfile2):
            x = x.strip()
            y = y.strip()
            output_file.write("{0} \t {1}\n".format(x, y))

def main():
    f1 = "newstest2013.tok.bpe.32000.en"
    f2 = "newstest2013.tok.bpe.32000.de"
    f3 = "newstest2013.tok.bpe.32000.en-de"
    concat_files(f1, f2, f3)

    f1 = "train.tok.clean.bpe.32000.en"
    f2 = "train.tok.clean.bpe.32000.de"
    f3 = "train.tok.clean.bpe.32000.en-de"
    concat_files(f1, f2, f3)

if __name__ == "__main__":
    main()

Now I'm repeating the training and inference, and I will keep you updated.

@kuke
Collaborator

kuke commented Jul 20, 2018

Are you sure you didn't miss the \t when using the paste cmd? It works fine on our Linux systems.
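For reference, paste expands a \t escape in its delimiter list itself, so the intended command is presumably a single '\t' with no surrounding spaces:

paste -d '\t' train.tok.clean.bpe.32000.en train.tok.clean.bpe.32000.de > train.tok.clean.bpe.32000.en-de

With -d ' \ t ' the first delimiter character is a space, and with only two input files only that first delimiter is used per line, which would explain why no tab shows up in the output.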

@sfraczek
Contributor Author

I have run into

Traceback (most recent call last):
  File "train.py", line 428, in <module>
    train(args)
  File "train.py", line 401, in train
    feed=feed_list)
  File "/home/sfraczek/Paddle/build/python/paddle/fluid/parallel_executor.py", line 269, in run
    self.executor.run(fetch_list, fetch_var_name)
paddle.fluid.core.EnforceNotMet: enforce posix_memalign(&p, 4096ul, size) == 0 failed, 12 != 0
Alloc 5242880000 error! at [/home/sfraczek/Paddle/paddle/fluid/memory/detail/system_allocator.cc:52]
PaddlePaddle Call Stacks:
0       0x7f7950c082e9p paddle::memory::detail::CPUAllocator::Alloc(unsigned long*, unsigned long) + 5209
1       0x7f7950c044ecp paddle::memory::detail::BuddyAllocator::RefillPool() + 92
2       0x7f7950c04d68p paddle::memory::detail::BuddyAllocator::Alloc(unsigned long) + 760
3       0x7f7950b2a1f0p void* paddle::memory::Alloc<paddle::platform::CPUPlace>(paddle::platform::CPUPlace, unsigned long) + 192
4       0x7f7950b1d7fap paddle::framework::Tensor::mutable_data(boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, std::type_index) + 410
5       0x7f794fa7d571p float* paddle::framework::Tensor::mutable_data<float>(boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>) + 97
6       0x7f7950592f3dp paddle::operators::LayerNormKernel<paddle::platform::CPUDeviceContext, float>::Compute(paddle::framework::ExecutionContext const&) const + 781
7       0x7f79505932cfp std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CPUPlace, false, 0ul, paddle::operators::LayerNormKernel<paddle::platform::CPUDeviceContext, float>, paddle::operators::LayerNormKernel<paddle::platform::CPUDeviceContext, double> >::operator()(char const*, char const*) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&) + 47
8       0x7f7950a6114bp paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&) const + 747
9       0x7f7950a5ca0dp paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&) + 221
10      0x7f7950935246p
11      0x7f795096a4c1p paddle::framework::details::OpHandleBase::RunAndRecordEvent(std::function<void ()> const&) + 65
12      0x7f7950934d63p paddle::framework::details::ComputationOpHandle::RunImpl() + 83
13      0x7f795096e1e9p paddle::framework::details::OpHandleBase::Run(bool) + 393
14      0x7f7950961208p
15      0x7f7950962210p
16      0x7f79507a152ep std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) + 46
17      0x7f79fea9ba99p
18      0x7f795096033dp
19      0x7f7950966304p std::thread::_Impl<std::_Bind_simple<ThreadPool::ThreadPool(unsigned long)::{lambda()#1} ()> >::_M_run() + 340
20      0x7f79f8582c80p
21      0x7f79fea946bap
22      0x7f79fe7ca41dp clone + 109

@sfraczek
Contributor Author

I need to update the transformer code to a new version and start again because there were a lot of changes.

@sfraczek
Contributor Author

sfraczek commented Jul 24, 2018

I have been trying to run the transformer on the most recent develop branch, but I keep running out of memory even with --batch_size 1 --pool_size 2 (I don't know exactly what pool_size does, though). The process gets killed by the system, displaying just Killed. I have roughly 180 GB of RAM. I don't see any iteration before it gets killed, and it takes a long time before it gets killed. Also, I didn't notice any obvious memory usage increasing steadily, so it probably happens quickly.
Can you help me? Have you successfully run training on CPU?

The command I have been using

python -u train.py   --src_vocab_fpath wmt16_en_de/vocab.bpe.32000   --trg_vocab_fpath wmt16_en_de/vocab.bpe.32000   --special_token '<s>' '<e>' '<unk>'   --train_file_pattern wmt16_en_de/train.tok.clean.bpe.32000.en-de   --use_token_batch True   --batch_size 1   --sort_type pool --pool_size 2 --device CPU

I think it is caused by running out of memory, because the command dmesg -T | grep -E -i -B100 'killed process' returns this:

[wto lip 24 14:58:27 2018] Out of memory: Kill process 128566 (python) score 878 or sacrifice child
[wto lip 24 14:58:27 2018] Killed process 128566 (python) total-vm:393885000kB, anon-rss:176935556kB, file-rss:70104kB, shmem-rss:4096kB

@sfraczek sfraczek reopened this Jul 24, 2018
@sfraczek
Contributor Author

I tried running the transformer with Valgrind's massif tool, but the tool didn't work, so I couldn't track down the issue.
This is a blocker for us. Please fix the CPU path so we can work on optimizing it.
We must suspend the effort until it starts working.

@guoshengCS
Collaborator

I reproduced the error. However, I can run successfully with export CPU_NUM=1; the requested memory increases linearly with the number of CPUs.

@sfraczek
Contributor Author

sfraczek commented Jul 30, 2018

Thanks. I have tried it but it didn't help.

sfraczek@gklab-48-118:~/paddle-models/fluid/neural_machine_translation/transformer$ export CPU_NUM=1
sfraczek@gklab-48-118:~/paddle-models/fluid/neural_machine_translation/transformer$ python -u train.py   --src_vocab_fpath wmt16_en_de/vocab.bpe.32000   --trg_vocab_fpath wmt16_en_de/vocab.bpe.32000   --special_token '<s>' '<e>' '<unk>'   --train_file_pattern wmt16_en_de/train.tok.clean.bpe.32000.en-de   --use_token_batch True   --batch_size 1   --sort_type pool --pool_size 1
Killed

Running dmesg -T| grep -E -i -B100 'killed process' still returns

[pon lip 30 12:13:09 2018] Out of memory: Kill process 279624 (python) score 909 or sacrifice child
[pon lip 30 12:13:09 2018] Killed process 279624 (python) total-vm:400701208kB, anon-rss:182189832kB, file-rss:71028kB, shmem-rss:4096kB

I will try again with the latest develop branch; I forgot to switch branches.

@sfraczek
Contributor Author

sfraczek commented Jul 30, 2018

It ran out of memory on develop too (~188 GB).
Log:

sfraczek@gklab-48-118:~/paddle-models/fluid/neural_machine_translation/transformer$ python -u train.py   --src_vocab_fpath wmt16_en_de/vocab.bpe.32000   --trg_vocab_fpath wmt16_en_de/vocab.bpe.32000   --special_token '<s>' '<e>' '<unk>'   --train_file_pattern wmt16_en_de/train.tok.clean.bpe.32000.en-de   --use_token_batch True   --batch_size 1   --sort_type pool --pool_size 1
Namespace(batch_size=1, device='GPU', local=True, opts=[], pool_size=1, shuffle=True, shuffle_batch=True, sort_type='pool', special_token=['<s>', '<e>', '<unk>'], src_vocab_fpath='wmt16_en_de/vocab.bpe.32000', sync=True, train_file_pattern='wmt16_en_de/train.tok.clean.bpe.32000.en-de', trg_vocab_fpath='wmt16_en_de/vocab.bpe.32000', use_token_batch=True, val_file_pattern=None)
local start_up:
init fluid.framework.default_startup_program
Killed

@sfraczek
Contributor Author

Any idea what this error means?

local start_up:
init fluid.framework.default_startup_program
epoch: 0, batch: 0, avg loss: 3.810426, normalized loss: 3.176239, ppl: 45.169689
epoch: 0, batch: 1, avg loss: 3.837661, normalized loss: 3.203474, ppl: 46.416790
epoch: 0, batch: 2, avg loss: 3.781586, normalized loss: 3.147399, ppl: 43.885597
epoch: 0, batch: 3, avg loss: 3.787708, normalized loss: 3.153520, ppl: 44.155060
epoch: 0, batch: 4, avg loss: 3.759843, normalized loss: 3.125656, ppl: 42.941700
epoch: 0, batch: 5, avg loss: 3.691916, normalized loss: 3.057729, ppl: 40.121647
epoch: 0, batch: 6, avg loss: 3.644374, normalized loss: 3.010186, ppl: 38.258801
Traceback (most recent call last):
  File "train.py", line 553, in <module>
    train(args)
  File "train.py", line 509, in train
    lr_scheduler, token_num, predict)
  File "train.py", line 388, in train_loop
    ModelHyperParams.d_model)
  File "train.py", line 189, in prepare_batch_input
    [inst[0] for inst in insts], src_pad_idx, n_head, is_target=False)
  File "train.py", line 140, in pad_batch_data
    max_len = max(len(inst) for inst in insts)
ValueError: max() arg is an empty sequence

@guoshengCS
Collaborator

guoshengCS commented Aug 29, 2018

It might be caused by the empty list insts.
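For example, a guard at the top of pad_batch_data in train.py would make that failure explicit (a hypothetical debugging addition, not upstream code):

+    if not insts:
+        raise ValueError("empty batch passed to pad_batch_data; "
+                         "check --batch_size, --pool_size and --use_token_batch")
     max_len = max(len(inst) for inst in insts)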

@sfraczek
Contributor Author

sfraczek commented Aug 29, 2018

But what does this mean from the perspective of a user of the framework who doesn't know the internals? Is it a bug, or something in the data?

@guoshengCS
Collaborator

insts is the list of samples and normally will not be empty. I am not sure if there is a bug when processing the data:
https://github.com/PaddlePaddle/models/blob/develop/fluid/neural_machine_translation/transformer/train.py#L246
Could you please paste the command you run and the size of your data (number of samples)?

@mrysztow

@guoshengCS can you share how much RAM your platform has?

@sfraczek
Contributor Author

I run:

python -u train.py \
    --src_vocab_fpath toy/train/vocab.sources.txt \
    --trg_vocab_fpath toy/train/vocab.targets.txt \
    --train_file_pattern toy/train/pattern.txt \
    --use_token_batch True \
    --batch_size 1 \
    --sort_type pool \
    --pool_size 2 \
    --token_delimiter '\t' \
    --device CPU

My data files have the following sizes:

[sfraczek@nervana-skx42 train]$ wc -l vocab.sources.txt
23 vocab.sources.txt
[sfraczek@nervana-skx42 train]$ wc -l vocab.targets.txt
23 vocab.targets.txt
[sfraczek@nervana-skx42 train]$ wc -l pattern.txt
10003 pattern.txt

And my dataset looks like this:

[sfraczek@nervana-skx42 train]$ head vocab.sources.txt
<s>
<e>
<unk>
2       6404
3       6370
12      6295
9       6294
10      6293
笑      6280
6       6274
[sfraczek@nervana-skx42 train]$ head vocab.targets.txt
<s>
<e>
<unk>
2       6404
3       6370
12      6295
9       6294
10      6293
笑      6280
6       6274
[sfraczek@nervana-skx42 train]$ head pattern.txt
<s>      <s>
<e>      <e>
<unk>    <unk>
14 11 11 13 6 4 17 10 5 5 9 6 1 3 15     14 11 11 13 6 4 17 10 5 5 9 6 1 3 15
2 笑 7 11 5 8 8 7 14 0 11        2 笑 7 11 5 8 8 7 14 0 11
8 13 17 6 18 18 14 1 18 14 1 18 17 18    8 13 17 6 18 18 14 1 18 14 1 18 17 18
2 17 5 笑 8 3 8 3 17     2 17 5 笑 8 3 8 3 17
17 7 11 1 0 5 4 15 16 14 7 9 15 6 0      17 7 11 1 0 5 4 15 16 14 7 9 15 6 0
9 11 11 2 1 14 9 7 16 15 12 5 12 笑      9 11 11 2 1 14 9 7 16 15 12 5 12 笑
15 3 13 1 11 15 3 11 15          15 3 13 1 11 15 3 11 15

@sfraczek
Contributor Author

sfraczek commented Aug 29, 2018

I tried batch_size 23 and pool_size 230 this time, and it worked. Thanks for pointing me to the clue about the data size. I wonder why the previous parameters failed, though?

@guoshengCS
Collaborator

guoshengCS commented Aug 29, 2018

@mrysztow The RAM size of my platform is about 216 GB.

@guoshengCS
Collaborator

@sfraczek I suggest using the setting --use_token_batch False when debugging, which makes --batch_size indicate the number of samples. Otherwise, --batch_size=1 means that the number of words, rather than sentences (samples), included in one mini-batch is 1, which might lead to undesirable behavior, though I haven't found what exactly the bug is.
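For example, a debugging run along these lines (the small values are only illustrative):

python -u train.py \
    --src_vocab_fpath wmt16_en_de/vocab.bpe.32000 \
    --trg_vocab_fpath wmt16_en_de/vocab.bpe.32000 \
    --special_token '<s>' '<e>' '<unk>' \
    --train_file_pattern wmt16_en_de/train.tok.clean.bpe.32000.en-de \
    --use_token_batch False \
    --batch_size 2 \
    --sort_type pool \
    --pool_size 200 \
    --device CPU

With --use_token_batch False, each mini-batch then contains 2 sentence pairs regardless of their length.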

@sfraczek
Contributor Author

The inference script prints only empty lines.
This is the only printing instruction:


and it always prints an empty line. Shouldn't there be anything? It seems to be doing something for a while; it enters that print statement many times over.

@sfraczek
Contributor Author

@guoshengCS thank you for the suggestion. When I specify use_token_batch False, what pool_size should I use?

@guoshengCS
Collaborator

guoshengCS commented Aug 29, 2018

pool_size indicates the size of the buffer used to pool data (samples). Some processing (sorting by sentence length) is then carried out on the data in the buffer. It can be set as you like.
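Conceptually, the pooling works roughly like this (a sketch of the idea, not the repo's actual reader code):

def pooled_batches(sample_reader, pool_size, batch_size):
    # Buffer up to pool_size samples, sort them by source length so that
    # sentences of similar length land in the same mini-batch, then emit batches.
    pool = []
    for sample in sample_reader:
        pool.append(sample)
        if len(pool) == pool_size:
            pool.sort(key=lambda s: len(s[0]))
            for i in range(0, len(pool), batch_size):
                yield pool[i:i + batch_size]
            pool = []
    if pool:  # flush whatever is left at the end
        pool.sort(key=lambda s: len(s[0]))
        for i in range(0, len(pool), batch_size):
            yield pool[i:i + batch_size]

So a larger pool_size mostly trades memory for better length-based sorting.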

@sfraczek
Contributor Author

sfraczek commented Aug 29, 2018

When I specify --pool_size 230 and --use_token_batch False I get
UnboundLocalError: local variable 'val_avg_cost' referenced before assignment
But when I specified batch_size 23 and pool_size 230, it worked on the toy dataset.

@sfraczek
Contributor Author

sfraczek commented Aug 30, 2018

I still get weird errors when trying to run inference on wmt16_en_de. I think the cause is that I might have a bad model file, from the toy dataset instead of the wmt16 dataset. If so, this error doesn't tell me anything. Paddle could use better error handling.

[sfraczek@nervana-skx42 transformer]$ ./infer2.sh
create data reader
running fast infer
iterating now
('batch_id ', 0)
preparing batch input
Traceback (most recent call last):
  File "infer.py", line 242, in <module>
    infer(args)
  File "infer.py", line 237, in infer
    inferencer(test_data, trg_idx2word, args.use_wordpiece)
  File "infer.py", line 186, in fast_infer
    return_numpy=False)
  File "/nfs/site/home/sfraczek/paddle/build.release/python/paddle/fluid/executor.py", line 470, in run
    self.executor.run(program.desc, scope, 0, True, True)
paddle.fluid.core.EnforceNotMet: Enforce failed. Expected output_shape[unk_dim_idx] * capacity == -in_size, but received output_shape[unk_dim_idx] * capacity:-37000 != -in_size:-69000.
Invalid shape is given. at [/ec/fm/disks/nrv_algo_home01/sfraczek/paddle/paddle/fluid/operators/reshape_op.cc:98]

@sfraczek
Contributor Author

That was indeed a bad model file - from the other dataset. Since the model is saved only once per epoch, I had to shorten the epoch by inserting a break after a few iterations (it could take maybe days just to finish 1 epoch). This got me a model I could use for inference, and it finally ran. But after a few iterations, Linux "Killed" the process again.

[Tue Aug 28 06:09:03 2018] Out of memory: Kill process 430041 (python) score 990 or sacrifice child
[Tue Aug 28 06:09:03 2018] Killed process 429849 (python) total-vm:416198832kB, anon-rss:390268240kB, file-rss:0kB

So, the question is... how do you claim that this model works on CPU? Do you use some kind of multi-node setup? How do you fit this onto GPUs if it consumes so much memory? Do you have any estimate of how much memory is required to train on wmt16?

@guoshengCS
Collaborator

guoshengCS commented Sep 3, 2018

model code: commit 597bae2
Paddle code: release/0.15.0
command to run:

export CPU_NUM=1
export PYTHONPATH=/home/paddle/guosheng/repos/paddle-0.15.0/Paddle/build/python
python -u train.py \
  --src_vocab_fpath /home/paddle/guosheng/data/wmt_bpe/vocab_all.bpe.32000 \
  --trg_vocab_fpath /home/paddle/guosheng/data/wmt_bpe/vocab_all.bpe.32000 \
  --special_token '<s>' '<e>' '<unk>' \
  --train_file_pattern /home/paddle/guosheng/data/wmt_bpe/train.tok.clean.bpe.32000.en-de.tiny \
  --use_token_batch True \
  --batch_size 4096 \
  --sort_type pool \
  --pool_size 200000 \
  --device CPU \
  learning_rate 2.0 \
  warmup_steps 8000 \
  beta2 0.997 \
  pass_num 100

train.tok.clean.bpe.32000.en-de.tiny was obtained by running head -n400000 on the whole training data.
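(i.e. something along the lines of head -n400000 train.tok.clean.bpe.32000.en-de > train.tok.clean.bpe.32000.en-de.tiny)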
The training log and memory usage are as follows:
[screenshot of the training log and memory usage]

When training on GPU, there is similar memory usage (about 20 GB) on each device. However, I haven't trained on wmt16 completely with CPU, and the CPU support was not added by me...
