
How to correctly run transformer? #1059


Open
sfraczek opened this issue Jul 19, 2018 · 25 comments

@sfraczek
Contributor

sfraczek commented Jul 19, 2018

Hi,

I have encountered a number of problems with the fluid/neural_machine_translation/transformer model. Am I doing something wrong? How do I run it correctly?

Steps I have taken

Following the instructions in https://github.com/PaddlePaddle/models/blob/develop/fluid/neural_machine_translation/transformer/README_cn.md, I downloaded the WMT'16 EN-DE dataset from https://github.com/google/seq2seq/blob/master/docs/data.md by clicking the download link.

Next I extracted it to the wmt16_en_de directory.

Next I ran paste -d ' \ t ' train.tok.clean.bpe.32000.en train.tok.clean.bpe.32000.de > train.tok.clean.bpe.32000.en-de

Then I ran sed -i '1i\<s>\n<e>\n<unk>' vocab.bpe.32000
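After that command, the first lines of vocab.bpe.32000 should be the three special tokens (matching the --special_token arguments used below):

<s>
<e>
<unk>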

In config.py I changed use_gpu = True to False.
In train.py I added import multiprocessing and changed dev_count = fluid.core.get_cuda_device_count() to dev_count = fluid.core.get_cuda_device_count() if TrainTaskConfig.use_gpu else multiprocessing.cpu_count().
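In diff form, that train.py change looks roughly like this (paraphrased, not the exact upstream file):

+import multiprocessing

-    dev_count = fluid.core.get_cuda_device_count()
+    dev_count = fluid.core.get_cuda_device_count() \
+        if TrainTaskConfig.use_gpu else multiprocessing.cpu_count()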

Training

I launched training with

python -u train.py --src_vocab_fpath wmt16_en_de/vocab.bpe.32000 --trg_vocab_fpath wmt16_en_de/vocab.bpe.32000 --special_token '<s>' '<e>' '<unk>' --train_file_pattern wmt16_en_de/train.tok.clean.bpe.32000.en-de --use_token_batch True --batch_size 3200 --sort_type pool --pool_size 200000

but I got

E0719 14:26:29.439303 55138 graph.cc:43] softmax_with_cross_entropy_grad input var not in all_var list: softmax_with_cross_entropy_0.tmp_0@GRAD
epoch: 0, consumed 0.000161s
Traceback (most recent call last):
  File "train.py", line 428, in <module>
    train(args)
  File "train.py", line 419, in train
    "pass_" + str(pass_id) + ".checkpoint"))
  File "/home/sfraczek/Paddle/build/python/paddle/fluid/io.py", line 288, in save_persistables
    filename=filename)
  File "/home/sfraczek/Paddle/build/python/paddle/fluid/io.py", line 166, in save_vars
    filename=filename)
  File "/home/sfraczek/Paddle/build/python/paddle/fluid/io.py", line 197, in save_vars
    executor.run(save_program)
  File "/home/sfraczek/Paddle/build/python/paddle/fluid/executor.py", line 449, in run
    self.executor.run(program.desc, scope, 0, True, True)
paddle.fluid.core.EnforceNotMet: holder_ should not be null
Tensor not initialized yet when Tensor::type() is called. at [/home/sfraczek/Paddle/paddle/fluid/framework/tensor.h:139]
PaddlePaddle Call Stacks:
0       0x7f060e948f1cp paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 572
1       0x7f060e94b901p paddle::framework::Tensor::type() const + 209
2       0x7f060f617bf6p paddle::operators::SaveOp::SaveLodTensor(boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_,
boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::va
riant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> con
st&, paddle::framework::Variable*) const + 614
3       0x7f060f618472p paddle::operators::SaveOp::RunImpl(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boos
t::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::varian
t::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::
detail::variant::void_> const&) const + 210

So I have commented out

#fluid.io.save_persistables(
#    exe,
#    os.path.join(TrainTaskConfig.ckpt_dir,
#                 "pass_" + str(pass_id) + ".checkpoint"))

and it worked.

Inference

So next I tried to run inference.
I found that the file wmt16_en_de/newstest2013.tok.bpe.32000.en-de doesn't exist, but based on the README I guessed that I should run
paste -d ' \ t ' newstest2013.tok.bpe.32000.en newstest2013.tok.bpe.32000.de > newstest2013.tok.bpe.32000.en-de
Is this correct?

I ran

python -u infer.py --src_vocab_fpath wmt16_en_de/vocab.bpe.32000 --trg_vocab_fpath wmt16_en_de/vocab.bpe.32000 --special_token '<s>' '<e>' '<unk>' --test_file_pattern wmt16_en_de/newstest2013.tok.bpe.32000.en-de --batch_size 4 model_path trained_models/pass_20.infer.model beam_size 5

but there was no output from the script. It ended without error, too.

I tried giving it other files, but it doesn't output anything either.

I added profiling by importing paddle.fluid.profiler as profiler and adding

+    parser.add_argument(
+        "--profile",
+        type=bool,
+        default=False,
+        help="Enables/disables profiling.")

and

+    if args.profile:
+        with profiler.profiler("CPU", sorted_key='total') as cpuprof:
+            infer(args)
+    else:
+        infer(args)

But there is no output from the profiler.
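As an aside, argparse's type=bool converts any non-empty string (including "False") to True, so the --profile flag defined above behaves like a plain switch; an action="store_true" argument would be less ambiguous:

+    parser.add_argument(
+        "--profile",
+        action="store_true",
+        help="Enables profiling.")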

Please help.

@sfraczek
Contributor Author

sfraczek commented Jul 20, 2018

Update on debugging:

I found that there was a problem with the paste command. It didn't insert the delimiter between the English and German sentences, so the files were not parsed correctly. Are you not experiencing this problem?

I have replaced the call to paste with a Python script:

from itertools import izip  # Python 2; on Python 3 use the built-in zip instead

def concat_files(f1, f2, f3):
    # Join the source and target files line by line, writing "<src> \t <trg>"
    # per line, producing the parallel file that paste was supposed to create.
    with open(f1) as textfile1, open(f2) as textfile2, open(f3, "w") as output_file:
        for x, y in izip(textfile1, textfile2):
            x = x.strip()
            y = y.strip()
            output_file.write("{0} \t {1}\n".format(x, y))

def main():
    f1 = "newstest2013.tok.bpe.32000.en"
    f2 = "newstest2013.tok.bpe.32000.de"
    f3 = "newstest2013.tok.bpe.32000.en-de"
    concat_files(f1, f2, f3)

    f1 = "train.tok.clean.bpe.32000.en"
    f2 = "train.tok.clean.bpe.32000.de"
    f3 = "train.tok.clean.bpe.32000.en-de"
    concat_files(f1, f2, f3)

if __name__ == "__main__":
    main()

Now I'm repeating the training and inference, and I will keep you updated.

@kuke
Collaborator

kuke commented Jul 20, 2018

Are you sure you didn't miss the \t when using the paste cmd? It works fine on our Linux systems.
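For reference, paste expands a \t escape in its delimiter list itself, so the intended command is presumably a single '\t' with no surrounding spaces:

paste -d '\t' train.tok.clean.bpe.32000.en train.tok.clean.bpe.32000.de > train.tok.clean.bpe.32000.en-de

With -d ' \ t ' the first delimiter character is a space, and with only two input files only that first delimiter is used per line, which would explain why no tab shows up in the output.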

@sfraczek
Contributor Author

I have run into

Traceback (most recent call last):
  File "train.py", line 428, in <module>
    train(args)
  File "train.py", line 401, in train
    feed=feed_list)
  File "/home/sfraczek/Paddle/build/python/paddle/fluid/parallel_executor.py", line 269, in run
    self.executor.run(fetch_list, fetch_var_name)
paddle.fluid.core.EnforceNotMet: enforce posix_memalign(&p, 4096ul, size) == 0 failed, 12 != 0
Alloc 5242880000 error! at [/home/sfraczek/Paddle/paddle/fluid/memory/detail/system_allocator.cc:52]
PaddlePaddle Call Stacks:
0       0x7f7950c082e9p paddle::memory::detail::CPUAllocator::Alloc(unsigned long*, unsigned long) + 5209
1       0x7f7950c044ecp paddle::memory::detail::BuddyAllocator::RefillPool() + 92
2       0x7f7950c04d68p paddle::memory::detail::BuddyAllocator::Alloc(unsigned long) + 760
3       0x7f7950b2a1f0p void* paddle::memory::Alloc<paddle::platform::CPUPlace>(paddle::platform::CPUPlace, unsigned long) + 192
4       0x7f7950b1d7fap paddle::framework::Tensor::mutable_data(boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, std::type_index) + 410
5       0x7f794fa7d571p float* paddle::framework::Tensor::mutable_data<float>(boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>) + 97
6       0x7f7950592f3dp paddle::operators::LayerNormKernel<paddle::platform::CPUDeviceContext, float>::Compute(paddle::framework::ExecutionContext const&) const + 781
7       0x7f79505932cfp std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CPUPlace, false, 0ul, paddle::operators::LayerNormKernel<paddle::platform::CPUDeviceContext, float>, paddle::operators::LayerNormKernel<paddle::platform::CPUDeviceContext, double> >::operator()(char const*, char const*) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&) + 47
8       0x7f7950a6114bp paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&) const + 747
9       0x7f7950a5ca0dp paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&) + 221
10      0x7f7950935246p
11      0x7f795096a4c1p paddle::framework::details::OpHandleBase::RunAndRecordEvent(std::function<void ()> const&) + 65
12      0x7f7950934d63p paddle::framework::details::ComputationOpHandle::RunImpl() + 83
13      0x7f795096e1e9p paddle::framework::details::OpHandleBase::Run(bool) + 393
14      0x7f7950961208p
15      0x7f7950962210p
16      0x7f79507a152ep std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) + 46
17      0x7f79fea9ba99p
18      0x7f795096033dp
19      0x7f7950966304p std::thread::_Impl<std::_Bind_simple<ThreadPool::ThreadPool(unsigned long)::{lambda()#1} ()> >::_M_run() + 340
20      0x7f79f8582c80p
21      0x7f79fea946bap
22      0x7f79fe7ca41dp clone + 109

@sfraczek
Contributor Author

I need to update the transformer code to a new version and start again because there were a lot of changes.

@sfraczek
Contributor Author

sfraczek commented Jul 24, 2018

I have been trying to run the transformer on the most recent develop branch, but I keep running out of memory even with --batch_size 1 --pool_size 2 (I don't know exactly what pool_size does, though). The process gets killed by the system, displaying just Killed. I have roughly 180 GB of RAM. I don't see any iteration before it gets killed, and it takes a long time before it gets killed. Also, I didn't notice any obvious memory usage increasing steadily, so it probably happens quickly.
Can you help me? Have you successfully run training on CPU?

The command I have been using

python -u train.py   --src_vocab_fpath wmt16_en_de/vocab.bpe.32000   --trg_vocab_fpath wmt16_en_de/vocab.bpe.32000   --special_token '<s>' '<e>' '<unk>'   --train_file_pattern wmt16_en_de/train.tok.clean.bpe.32000.en-de   --use_token_batch True   --batch_size 1   --sort_type pool --pool_size 2 --device CPU

I think it is caused by running out of memory, because the command dmesg -T | grep -E -i -B100 'killed process' returns this:

[wto lip 24 14:58:27 2018] Out of memory: Kill process 128566 (python) score 878 or sacrifice child
[wto lip 24 14:58:27 2018] Killed process 128566 (python) total-vm:393885000kB, anon-rss:176935556kB, file-rss:70104kB, shmem-rss:4096kB

@sfraczek sfraczek reopened this Jul 24, 2018
@sfraczek
Contributor Author

I tried running the transformer with Valgrind's massif tool, but the tool didn't work, so I couldn't track down the issue.
This is a blocker for us. Please fix the CPU path so we can work on optimizing it.
We must suspend the effort until it starts working.

@guoshengCS
Collaborator

I reproduced the error. However, I can run successfully with export CPU_NUM=1; the requested memory increases linearly with the number of CPUs.

@sfraczek
Contributor Author

sfraczek commented Jul 30, 2018

Thanks. I have tried it but it didn't help.

sfraczek@gklab-48-118:~/paddle-models/fluid/neural_machine_translation/transformer$ export CPU_NUM=1
sfraczek@gklab-48-118:~/paddle-models/fluid/neural_machine_translation/transformer$ python -u train.py   --src_vocab_fpath wmt16_en_de/vocab.bpe.32000   --trg_vocab_fpath wmt16_en_de/vocab.bpe.32000   --special_token '<s>' '<e>' '<unk>'   --train_file_pattern wmt16_en_de/train.tok.clean.bpe.32000.en-de   --use_token_batch True   --batch_size 1   --sort_type pool --pool_size 1
Killed

Running dmesg -T| grep -E -i -B100 'killed process' still returns

[pon lip 30 12:13:09 2018] Out of memory: Kill process 279624 (python) score 909 or sacrifice child
[pon lip 30 12:13:09 2018] Killed process 279624 (python) total-vm:400701208kB, anon-rss:182189832kB, file-rss:71028kB, shmem-rss:4096kB

I will try again with the latest develop branch; I forgot to switch branches.

@sfraczek
Contributor Author

sfraczek commented Jul 30, 2018

It ran out of memory on develop too (~188 GB).
Log:

sfraczek@gklab-48-118:~/paddle-models/fluid/neural_machine_translation/transformer$ python -u train.py   --src_vocab_fpath wmt16_en_de/vocab.bpe.32000   --trg_vocab_fpath wmt16_en_de/vocab.bpe.32000   --special_token '<s>' '<e>' '<unk>'   --train_file_pattern wmt16_en_de/train.tok.clean.bpe.32000.en-de   --use_token_batch True   --batch_size 1   --sort_type pool --pool_size 1
Namespace(batch_size=1, device='GPU', local=True, opts=[], pool_size=1, shuffle=True, shuffle_batch=True, sort_type='pool', special_token=['<s>', '<e>', '<unk>'], src_vocab_fpath='wmt16_en_de/vocab.bpe.32000', sync=True, train_file_pattern='wmt16_en_de/train.tok.clean.bpe.32000.en-de', trg_vocab_fpath='wmt16_en_de/vocab.bpe.32000', use_token_batch=True, val_file_pattern=None)
local start_up:
init fluid.framework.default_startup_program
Killed

@sfraczek
Contributor Author

Any idea what this error means?

local start_up:
init fluid.framework.default_startup_program
epoch: 0, batch: 0, avg loss: 3.810426, normalized loss: 3.176239, ppl: 45.169689
epoch: 0, batch: 1, avg loss: 3.837661, normalized loss: 3.203474, ppl: 46.416790
epoch: 0, batch: 2, avg loss: 3.781586, normalized loss: 3.147399, ppl: 43.885597
epoch: 0, batch: 3, avg loss: 3.787708, normalized loss: 3.153520, ppl: 44.155060
epoch: 0, batch: 4, avg loss: 3.759843, normalized loss: 3.125656, ppl: 42.941700
epoch: 0, batch: 5, avg loss: 3.691916, normalized loss: 3.057729, ppl: 40.121647
epoch: 0, batch: 6, avg loss: 3.644374, normalized loss: 3.010186, ppl: 38.258801
Traceback (most recent call last):
  File "train.py", line 553, in <module>
    train(args)
  File "train.py", line 509, in train
    lr_scheduler, token_num, predict)
  File "train.py", line 388, in train_loop
    ModelHyperParams.d_model)
  File "train.py", line 189, in prepare_batch_input
    [inst[0] for inst in insts], src_pad_idx, n_head, is_target=False)
  File "train.py", line 140, in pad_batch_data
    max_len = max(len(inst) for inst in insts)
ValueError: max() arg is an empty sequence

@guoshengCS
Collaborator

guoshengCS commented Aug 29, 2018

It might be caused by the empty list insts.
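For example, a guard at the top of pad_batch_data in train.py would make that failure explicit (a hypothetical debugging addition, not upstream code):

+    if not insts:
+        raise ValueError("empty batch passed to pad_batch_data; "
+                         "check --batch_size, --pool_size and --use_token_batch")
     max_len = max(len(inst) for inst in insts)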

@sfraczek
Contributor Author

sfraczek commented Aug 29, 2018

But what does this mean from the perspective of a user of the framework who doesn't know the internals? Is it a bug, or something in the data?

@guoshengCS
Collaborator

insts is the list of samples and normally will not be empty. I am not sure if there is a bug when processing the data:
https://github.com/PaddlePaddle/models/blob/develop/fluid/neural_machine_translation/transformer/train.py#L246
Could you please paste the command you run and the size of your data (number of samples)?

@mrysztow

@guoshengCS can you share how much RAM your platform has?

@sfraczek
Contributor Author

I run:

python -u train.py \
    --src_vocab_fpath toy/train/vocab.sources.txt \
    --trg_vocab_fpath toy/train/vocab.targets.txt \
    --train_file_pattern toy/train/pattern.txt \
    --use_token_batch True \
    --batch_size 1 \
    --sort_type pool \
    --pool_size 2 \
    --token_delimiter '\t' \
    --device CPU

My data files have the following sizes:

[sfraczek@nervana-skx42 train]$ wc -l vocab.sources.txt
23 vocab.sources.txt
[sfraczek@nervana-skx42 train]$ wc -l vocab.targets.txt
23 vocab.targets.txt
[sfraczek@nervana-skx42 train]$ wc -l pattern.txt
10003 pattern.txt

And my dataset looks like this:

[sfraczek@nervana-skx42 train]$ head vocab.sources.txt
<s>
<e>
<unk>
2       6404
3       6370
12      6295
9       6294
10      6293
笑      6280
6       6274
[sfraczek@nervana-skx42 train]$ head vocab.targets.txt
<s>
<e>
<unk>
2       6404
3       6370
12      6295
9       6294
10      6293
笑      6280
6       6274
[sfraczek@nervana-skx42 train]$ head pattern.txt
<s>      <s>
<e>      <e>
<unk>    <unk>
14 11 11 13 6 4 17 10 5 5 9 6 1 3 15     14 11 11 13 6 4 17 10 5 5 9 6 1 3 15
2 笑 7 11 5 8 8 7 14 0 11        2 笑 7 11 5 8 8 7 14 0 11
8 13 17 6 18 18 14 1 18 14 1 18 17 18    8 13 17 6 18 18 14 1 18 14 1 18 17 18
2 17 5 笑 8 3 8 3 17     2 17 5 笑 8 3 8 3 17
17 7 11 1 0 5 4 15 16 14 7 9 15 6 0      17 7 11 1 0 5 4 15 16 14 7 9 15 6 0
9 11 11 2 1 14 9 7 16 15 12 5 12 笑      9 11 11 2 1 14 9 7 16 15 12 5 12 笑
15 3 13 1 11 15 3 11 15          15 3 13 1 11 15 3 11 15

@sfraczek
Contributor Author

sfraczek commented Aug 29, 2018

I tried batch_size 23 and pool_size 230 this time, and it worked. Thanks for pointing me to the clue about the data size. I wonder why the previous parameters failed, though?

@guoshengCS
Collaborator

guoshengCS commented Aug 29, 2018

@mrysztow The RAM size of my platform is about 216 GB.

@guoshengCS
Collaborator

@sfraczek I suggest using the setting --use_token_batch False when debugging, which makes --batch_size indicate the number of samples. Otherwise, --batch_size=1 means that the number of words, rather than sentences (samples), included in one mini-batch is 1, which might lead to undesirable behavior, though I haven't found what exactly the bug is.
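For example, a debugging run along these lines (the small values are only illustrative):

python -u train.py \
    --src_vocab_fpath wmt16_en_de/vocab.bpe.32000 \
    --trg_vocab_fpath wmt16_en_de/vocab.bpe.32000 \
    --special_token '<s>' '<e>' '<unk>' \
    --train_file_pattern wmt16_en_de/train.tok.clean.bpe.32000.en-de \
    --use_token_batch False \
    --batch_size 2 \
    --sort_type pool \
    --pool_size 200 \
    --device CPU

With --use_token_batch False, each mini-batch then contains 2 sentence pairs regardless of their length.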

@sfraczek
Contributor Author

The inference script prints only empty lines.
This is the only printing instruction:


and it always prints an empty line. Shouldn't there be anything? It seems to be doing something for a while; it enters that print statement many times over.

@sfraczek
Contributor Author

@guoshengCS thank you for the suggestion. When I specify use_token_batch False, what pool_size should I use?

@guoshengCS
Collaborator

guoshengCS commented Aug 29, 2018

pool_size indicates the size of the buffer used to pool data (samples). Some processing (sorting by sentence length) is then carried out on the data in the buffer. It can be set as you like.
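Conceptually, the pooling works roughly like this (a sketch of the idea, not the repo's actual reader code):

def pooled_batches(sample_reader, pool_size, batch_size):
    # Buffer up to pool_size samples, sort them by source length so that
    # sentences of similar length land in the same mini-batch, then emit batches.
    pool = []
    for sample in sample_reader:
        pool.append(sample)
        if len(pool) == pool_size:
            pool.sort(key=lambda s: len(s[0]))
            for i in range(0, len(pool), batch_size):
                yield pool[i:i + batch_size]
            pool = []
    if pool:  # flush whatever is left at the end
        pool.sort(key=lambda s: len(s[0]))
        for i in range(0, len(pool), batch_size):
            yield pool[i:i + batch_size]

So a larger pool_size mostly trades memory for better length-based sorting.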

@sfraczek
Contributor Author

sfraczek commented Aug 29, 2018

When I specify --pool_size 230 and --use_token_batch False I get
UnboundLocalError: local variable 'val_avg_cost' referenced before assignment
But when I specified batch_size 23 and pool_size 230, it worked on the toy dataset.

@sfraczek
Contributor Author

sfraczek commented Aug 30, 2018

I still get weird errors when trying to run inference on wmt16_en_de. I think the cause is that I might have a bad model file, from the toy dataset instead of the wmt16 dataset. If so, this error doesn't tell me anything. Paddle could use better error handling.

[sfraczek@nervana-skx42 transformer]$ ./infer2.sh
create data reader
running fast infer
iterating now
('batch_id ', 0)
preparing batch input
Traceback (most recent call last):
  File "infer.py", line 242, in <module>
    infer(args)
  File "infer.py", line 237, in infer
    inferencer(test_data, trg_idx2word, args.use_wordpiece)
  File "infer.py", line 186, in fast_infer
    return_numpy=False)
  File "/nfs/site/home/sfraczek/paddle/build.release/python/paddle/fluid/executor.py", line 470, in run
    self.executor.run(program.desc, scope, 0, True, True)
paddle.fluid.core.EnforceNotMet: Enforce failed. Expected output_shape[unk_dim_idx] * capacity == -in_size, but received output_shape[unk_dim_idx] * capacity:-37000 != -in_size:-69000.
Invalid shape is given. at [/ec/fm/disks/nrv_algo_home01/sfraczek/paddle/paddle/fluid/operators/reshape_op.cc:98]

@sfraczek
Contributor Author

That was indeed a bad model file - from the other dataset. Since the model is saved only once per epoch, I had to shorten the epoch by inserting a break after a few iterations (it could take maybe days just to finish 1 epoch). This got me a model I could use for inference, and it finally ran. But after a few iterations, Linux "Killed" the process again.

[Tue Aug 28 06:09:03 2018] Out of memory: Kill process 430041 (python) score 990 or sacrifice child
[Tue Aug 28 06:09:03 2018] Killed process 429849 (python) total-vm:416198832kB, anon-rss:390268240kB, file-rss:0kB

So, the question is... how do you claim that this model works on CPU? Do you use some kind of multi-node setup? How do you fit this onto GPUs if it consumes so much memory? Do you have any estimate of how much memory is required to train on wmt16?

@guoshengCS
Collaborator

guoshengCS commented Sep 3, 2018

model code: commit 597bae2
Paddle code: release/0.15.0
command to run:

export CPU_NUM=1
export PYTHONPATH=/home/paddle/guosheng/repos/paddle-0.15.0/Paddle/build/python
python -u train.py \
  --src_vocab_fpath /home/paddle/guosheng/data/wmt_bpe/vocab_all.bpe.32000 \
  --trg_vocab_fpath /home/paddle/guosheng/data/wmt_bpe/vocab_all.bpe.32000 \
  --special_token '<s>' '<e>' '<unk>' \
  --train_file_pattern /home/paddle/guosheng/data/wmt_bpe/train.tok.clean.bpe.32000.en-de.tiny \
  --use_token_batch True \
  --batch_size 4096 \
  --sort_type pool \
  --pool_size 200000 \
  --device CPU \
  learning_rate 2.0 \
  warmup_steps 8000 \
  beta2 0.997 \
  pass_num 100

train.tok.clean.bpe.32000.en-de.tiny was obtained by running head -n400000 on the whole training data.
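(i.e. something along the lines of head -n400000 train.tok.clean.bpe.32000.en-de > train.tok.clean.bpe.32000.en-de.tiny)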
The training log and memory usage are as follows:
[screenshot of the training log and memory usage]

When training on GPU, there is similar memory usage (about 20 GB) on each device. However, I haven't trained on wmt16 completely with CPU, and the CPU support was not added by me...
