
Truncating should affect only the train set #166

@JohnGiorgi


When batching data, Saber truncates or right-pads each sequence to a fixed length of saber.constants.MAX_SENT_LEN.

Truncation should only happen on the train set, so that we never drop tokens from examples in the evaluation partitions (dataset_folder/valid.* and dataset_folder/test.*).
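
A minimal sketch of the intended behavior, assuming a Keras-style pipeline (the pad_sequences call and the train_seqs / valid_seqs names are illustrative, not Saber's actual internals):

```python
from keras.preprocessing.sequence import pad_sequences

# Illustrative stand-in for saber.constants.MAX_SENT_LEN.
MAX_SENT_LEN = 4

# Toy partitions: lists of token-ID sequences of varying length.
train_seqs = [[1, 2, 3], [4, 5, 6, 7, 8]]
valid_seqs = [[9, 10], [11, 12, 13, 14, 15, 16]]

# Train set: truncate AND right-pad to the fixed MAX_SENT_LEN.
x_train = pad_sequences(train_seqs, maxlen=MAX_SENT_LEN,
                        padding='post', truncating='post')

# Evaluation partitions: only right-pad (to the longest sequence in the
# partition), so no tokens are ever dropped at evaluation time.
x_valid = pad_sequences(valid_seqs, padding='post')

print(x_train.shape)  # (2, 4) -- second train sentence truncated to 4 tokens
print(x_valid.shape)  # (2, 6) -- nothing truncated
```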

Furthermore, a user should be able to specify a percentile (e.g. 0.99) that sets the max sequence length to the smallest length which truncates only 1% of the training examples. This would be a principled way to choose the value, and could yield big reductions in training time when a handful of very long sentences would otherwise inflate the max length.
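
A sketch of how that percentile-based length could be computed (percentile_max_len is a hypothetical helper, not an existing Saber function):

```python
import math

import numpy as np

def percentile_max_len(sequences, q=0.99):
    """Return the max sequence length that leaves a fraction q of the
    training examples untruncated (e.g. q=0.99 truncates only ~1%)."""
    lengths = [len(s) for s in sequences]
    # Round up so we never truncate more than (1 - q) of the examples.
    return int(math.ceil(np.percentile(lengths, q * 100)))

# One pathological 400-token sentence among otherwise short ones.
# (q=0.75 here only because the toy set has just five sentences.)
train_seqs = [[0] * n for n in (5, 8, 12, 15, 400)]
print(percentile_max_len(train_seqs, q=0.75))  # 15: ignores the outlier
```

The resulting value could then be used as the maxlen for the train partition, in place of a hard-coded saber.constants.MAX_SENT_LEN.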
