
Truncating should affect only the train set #166

@JohnGiorgi


When batching data, Saber truncates or right-pads each sequence to a fixed length of saber.constants.MAX_SENT_LEN.

Truncation should only happen on the train set, so that we never drop tokens from examples in the evaluation partitions (dataset_folder/valid.* and dataset_folder/test.*).
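
A minimal sketch of the intended behavior, assuming a Keras-style pipeline (the pad_sequences call and the train_seqs / valid_seqs names are illustrative, not Saber's actual internals):

```python
from keras.preprocessing.sequence import pad_sequences

# Illustrative stand-in for saber.constants.MAX_SENT_LEN.
MAX_SENT_LEN = 4

# Toy partitions: lists of token-ID sequences of varying length.
train_seqs = [[1, 2, 3], [4, 5, 6, 7, 8]]
valid_seqs = [[9, 10], [11, 12, 13, 14, 15, 16]]

# Train set: truncate AND right-pad to the fixed MAX_SENT_LEN.
x_train = pad_sequences(train_seqs, maxlen=MAX_SENT_LEN,
                        padding='post', truncating='post')

# Evaluation partitions: only right-pad (to the longest sequence in the
# partition), so no tokens are ever dropped at evaluation time.
x_valid = pad_sequences(valid_seqs, padding='post')

print(x_train.shape)  # (2, 4) -- second train sentence truncated to 4 tokens
print(x_valid.shape)  # (2, 6) -- nothing truncated
```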

Furthermore, a user should be able to specify a percentile (e.g. 0.99) that sets the max sequence length to the smallest length which truncates only 1% of the training examples. This would be a principled way to choose the value, and could yield big reductions in training time when a handful of very long sentences would otherwise inflate the max length.
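
A sketch of how that percentile-based length could be computed (percentile_max_len is a hypothetical helper, not an existing Saber function):

```python
import math

import numpy as np

def percentile_max_len(sequences, q=0.99):
    """Return the max sequence length that leaves a fraction q of the
    training examples untruncated (e.g. q=0.99 truncates only ~1%)."""
    lengths = [len(s) for s in sequences]
    # Round up so we never truncate more than (1 - q) of the examples.
    return int(math.ceil(np.percentile(lengths, q * 100)))

# One pathological 400-token sentence among otherwise short ones.
# (q=0.75 here only because the toy set has just five sentences.)
train_seqs = [[0] * n for n in (5, 8, 12, 15, 400)]
print(percentile_max_len(train_seqs, q=0.75))  # 15: ignores the outlier
```

The resulting value could then be used as the maxlen for the train partition, in place of a hard-coded saber.constants.MAX_SENT_LEN.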
