How to correctly run transformer? #1059
Update on debugging: I found that there was a problem with the delimiter when joining the parallel files, so I have replaced the call to `paste` with this Python script:

```python
from itertools import izip  # Python 2; on Python 3 use the built-in zip


def concat_files(f1, f2, f3):
    """Join two parallel corpus files line by line into one tab-separated file."""
    with open(f1) as textfile1, open(f2) as textfile2, open(f3, "w") as output_file:
        for x, y in izip(textfile1, textfile2):
            x = x.strip()
            y = y.strip()
            # note: this writes space-tab-space between the two sentences
            output_file.write("{0} \t {1}\n".format(x, y))


def main():
    f1 = "newstest2013.tok.bpe.32000.en"
    f2 = "newstest2013.tok.bpe.32000.de"
    f3 = "newstest2013.tok.bpe.32000.en-de"
    concat_files(f1, f2, f3)
    f1 = "train.tok.clean.bpe.32000.en"
    f2 = "train.tok.clean.bpe.32000.de"
    f3 = "train.tok.clean.bpe.32000.en-de"
    concat_files(f1, f2, f3)


if __name__ == "__main__":
    main()
```

Now I'm repeating the training and inference, and I will keep you updated.
Are you sure you didn't miss the `\t` when using `paste`?
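A quick way to verify (a hypothetical snippet of mine, not part of the repo) is to count how many lines of the merged file actually contain a tab:

```python
# Sanity check: every line of the merged parallel file should contain a tab.
with open("train.tok.clean.bpe.32000.en-de") as f:
    total = tabbed = 0
    for line in f:
        total += 1
        tabbed += "\t" in line
print("{0} of {1} lines contain a tab".format(tabbed, total))
```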
I have run into:
I need to update the transformer code to a new version and start again because there were a lot of changes.
I have been trying to run the transformer on the most recent develop branch, but I keep running out of memory even with batch_size 1. The command I have been using:

```
python -u train.py --src_vocab_fpath wmt16_en_de/vocab.bpe.32000 --trg_vocab_fpath wmt16_en_de/vocab.bpe.32000 --special_token '<s>' '<e>' '<unk>' --train_file_pattern wmt16_en_de/train.tok.clean.bpe.32000.en-de --use_token_batch True --batch_size 1 --sort_type pool --pool_size 2 --device CPU
```

I think the process is being killed because it runs out of memory with this command.
I tried running the transformer under Valgrind's massif tool, but the tool didn't work, so I couldn't track down the issue.
I reproduced the error, while I can run successfully with
Thanks. I have tried it, but it didn't help.
Running
I will try again with the latest develop branch; I forgot to switch branches.
It ran out of memory on develop too, at ~188 GB.
Any idea what this error means?
It might be caused by the empty list
But what does this mean from the perspective of a user of the framework who doesn't know the internals? Is it a bug, or something in the data?
@guoshengCS can you share how much RAM your platform has?
I run:

```
python -u train.py \
    --src_vocab_fpath toy/train/vocab.sources.txt \
    --trg_vocab_fpath toy/train/vocab.targets.txt \
    --train_file_pattern toy/train/pattern.txt \
    --use_token_batch True \
    --batch_size 1 \
    --sort_type pool \
    --pool_size 2 \
    --token_delimiter '\t' \
    --device CPU
```

I have data of this size:

```
[sfraczek@nervana-skx42 train]$ wc -l vocab.sources.txt
23 vocab.sources.txt
[sfraczek@nervana-skx42 train]$ wc -l vocab.targets.txt
23 vocab.targets.txt
[sfraczek@nervana-skx42 train]$ wc -l pattern.txt
10003 pattern.txt
```

And my dataset looks like this:
I have tried batch_size 23 and pool_size 230 this time, and it worked. Thanks for pointing me to the clue about the data size. I wonder why the previous parameters failed, though?
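My guess, as a rough sketch (hypothetical code, not the repo's actual reader): with `--use_token_batch True`, batch_size is a token budget rather than a sentence count, so batch_size 1 can never accommodate a whole sentence:

```python
# Hypothetical illustration of token-based batching: batch_size is a
# budget of tokens per batch, not a number of sentences.
def token_batches(pairs, max_tokens):
    batch, tokens = [], 0
    for src, trg in pairs:
        cost = max(len(src), len(trg))  # tokens this sentence pair adds
        if batch and tokens + cost > max_tokens:
            yield batch  # budget exceeded, emit the current batch
            batch, tokens = [], 0
        batch.append((src, trg))
        tokens += cost
    if batch:
        yield batch
```

If the real reader additionally drops sentences that exceed the budget, a budget of 1 token would leave every batch empty, which would match the empty-list error above.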
@mrysztow The RAM size of my platform is about 216 GB.
@sfraczek I suggest using the setting
The inference script prints only empty lines. It always prints an empty line. Shouldn't there be anything there? It seems to be doing something for a while; it enters that print statement many times over.
@guoshengCS thank you for the suggestion. When I specify

When I specify
I am still getting weird errors trying to run inference on wmt16_en_de. I think the cause is that I might have a bad model file, from the forget dataset instead of from the wmt16 dataset. If so, this error doesn't tell me anything; Paddle could use better error handling.
That was indeed a bad model file, from the other dataset. Since the model is saved only once per epoch, I had to shorten the epoch by inserting a break after a few iterations (it could take maybe days just to finish one epoch). This got me a model I could use for inference, and it finally ran. But after a few iterations, Linux "Killed" the process again.
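For reference, the early-break change looked roughly like this (an illustrative sketch; names like `train_reader`, `exe`, and `feeder` stand in for the real variables in train.py):

```python
# Stop each pass after a few batches so the end-of-pass save still runs.
MAX_BATCHES = 100

for pass_id in range(pass_num):
    for batch_id, data in enumerate(train_reader()):
        exe.run(feed=feeder.feed(data))  # one training iteration
        if batch_id >= MAX_BATCHES:
            break  # shorten the "epoch"
    # the script's existing per-pass checkpoint then writes a usable
    # pass_<n>.infer.model that infer.py can load
```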
So, the question is: how do you claim that this model works on CPU? Do you use some kind of multi-node setup? How do you fit this onto GPUs if it consumes so much memory? Do you have any estimate of how much memory is required to train on wmt16?
model code: commit 597bae2
When training on GPU, there is similar memory usage (~20 GB) on each device. I haven't trained on wmt16 completely with CPU, though, and the CPU support was not added by me...
Hi,
I have encountered a number of problems with the fluid/neural_machine_translation/transformer model. Am I doing something wrong? How do I run it correctly?
Steps I have taken
Following the instructions in https://github.com/PaddlePaddle/models/blob/develop/fluid/neural_machine_translation/transformer/README_cn.md, I downloaded WMT'16 EN-DE from https://github.com/google/seq2seq/blob/master/docs/data.md by clicking download.
Next I extracted it to the wmt16_en_de directory.
Next I did:

```
paste -d ' \ t ' train.tok.clean.bpe.32000.en train.tok.clean.bpe.32000.de > train.tok.clean.bpe.32000.en-de
```
Then I did:

```
sed -i '1i\<s>\n<e>\n<unk>' vocab.bpe.32000
```
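That sed prepends the three special tokens to the vocabulary file; a quick check (my own hypothetical snippet, assuming the paths above) confirms they landed at the top:

```python
# The first three vocab entries should be the special tokens.
with open("wmt16_en_de/vocab.bpe.32000") as f:
    first_three = [next(f).strip() for _ in range(3)]
print(first_three)  # expected: ['<s>', '<e>', '<unk>']
```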
In config.py I changed `use_gpu = True` to `False`.
In train.py I added `import multiprocessing` and changed

```
dev_count = fluid.core.get_cuda_device_count()
```

to

```
dev_count = fluid.core.get_cuda_device_count() if TrainTaskConfig.use_gpu else multiprocessing.cpu_count()
```

Training
I launched training with:

```
python -u train.py --src_vocab_fpath wmt16_en_de/vocab.bpe.32000 --trg_vocab_fpath wmt16_en_de/vocab.bpe.32000 --special_token '<s>' '<e>' '<unk>' --train_file_pattern wmt16_en_de/train.tok.clean.bpe.32000.en-de --use_token_batch True --batch_size 3200 --sort_type pool --pool_size 200000
```

but I got

So I have commented out

and it worked.
Inference
So next I tried to run inference.
I found that the file wmt16_en_de/newstest2013.tok.bpe.32000.en-de doesn't exist, but based on the README I guessed that I should run:

```
paste -d ' \ t ' newstest2013.tok.bpe.32000.en newstest2013.tok.bpe.32000.de > newstest2013.tok.bpe.32000.en-de
```

Is this correct? Then I ran:

```
python -u infer.py --src_vocab_fpath wmt16_en_de/vocab.bpe.32000 --trg_vocab_fpath wmt16_en_de/vocab.bpe.32000 --special_token '<s>' '<e>' '<unk>' --test_file_pattern wmt16_en_de/newstest2013.tok.bpe.32000.en-de --batch_size 4 --model_path trained_models/pass_20.infer.model --beam_size 5
```

but there was no output from the script. It ended without error too. I tried giving it other files, but it doesn't output anything either.
I added profiling by adding

```
import paddle.fluid.profiler as profiler
```

and the profiler calls around the run, but there is no output from the profiler.
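For reference, my change was shaped roughly like this (an illustrative sketch using the Fluid profiler context manager; the exact placement in my local train.py may differ):

```python
import paddle.fluid.profiler as profiler

# Wrap a few iterations in the profiler; a summary is printed when the
# context exits (sorted_key='total' orders ops by total time spent).
with profiler.profiler('CPU', sorted_key='total'):
    for batch_id, data in enumerate(train_reader()):
        exe.run(feed=feeder.feed(data))  # names stand in for the real ones
        if batch_id >= 10:  # profile only a handful of batches
            break
```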
Please help.