When batching data, Saber truncates or right-pads each sequence to a length of `saber.constants.MAX_SENT_LEN`.
Truncation should only be applied to the train set, ensuring that we don't drop (parts of) examples in the evaluation partitions (`dataset_folder/valid.*` and `dataset_folder/test.*`).
Furthermore, a user should be able to specify a percentile (e.g. 0.99), which would set the maximum sequence length to the smallest value that truncates at most 1% of all training examples. This would be a principled way to choose `MAX_SENT_LEN`, and could substantially reduce training time when a handful of very long sentences would otherwise inflate it. A sketch of the idea follows.
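
A minimal sketch of how this could work, assuming NumPy and Keras' `pad_sequences`; the function name `percentile_max_len`, the toy data, and the exact integration point in Saber are illustrative assumptions, not existing API:

```python
import numpy as np
from keras.preprocessing.sequence import pad_sequences

def percentile_max_len(train_seqs, percentile=0.99):
    """Smallest length that covers `percentile` of the training sequences."""
    lengths = [len(seq) for seq in train_seqs]
    # np.percentile takes a value in [0, 100]; ceil so the returned length
    # covers at least the requested fraction of examples.
    return int(np.ceil(np.percentile(lengths, percentile * 100)))

# Toy token-index sequences standing in for the real partitions.
train_seqs = [[1, 2, 3], [4, 5], list(range(50))]
valid_seqs = [[6, 7, 8, 9], [10]]

# Train: truncate / right-pad to the percentile-derived length.
max_len = percentile_max_len(train_seqs, percentile=0.99)
train_padded = pad_sequences(train_seqs, maxlen=max_len,
                             padding='post', truncating='post')

# Valid / test: right-pad to the partition's own longest sequence; never truncate.
valid_padded = pad_sequences(valid_seqs,
                             maxlen=max(len(seq) for seq in valid_seqs),
                             padding='post')
```

Padding the evaluation partitions to their own maximum means batch shapes differ between partitions, but no tokens (and hence no gold labels) are ever discarded at evaluation time.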