pe-trik commented Oct 13, 2021

The CTC decoder may perform worse when the vocabulary consists of multi-character tokens (e.g., BPE units). This issue is mentioned in #173.

The reason is that the implementation sets is_character_based in Scorer to false because the tokens in the LM's dictionary are longer than one character. When is_character_based is false, the Scorer builds its FST from malformed transitions: add_word_to_dictionary() splits the tokens/words in the LM's dictionary into characters rather than into tokens.
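To illustrate the mismatch, here is a minimal Python sketch. It is not the decoder's actual code: the vocabulary, its ids, and the helper names are hypothetical. It only shows why splitting a multi-character token into characters yields symbols that the token vocabulary does not contain.

```python
# Hypothetical BPE-style vocabulary (token -> label id); not taken from the PR.
vocab = {"he": 0, "llo": 1, "wor": 2, "ld": 3}


def split_into_characters(entry):
    """Split a dictionary entry into single characters."""
    return list(entry)


def labels_or_missing(symbols, vocab):
    """Map each symbol to its label id, or None if the vocabulary lacks it."""
    return [vocab.get(symbol) for symbol in symbols]


# "he" is one token in the vocabulary, but its characters are not tokens,
# so a character-level split produces transitions with no valid label.
chars = split_into_characters("he")
char_labels = labels_or_missing(chars, vocab)
print(chars, char_labels)
```

In this toy setup neither "h" nor "e" has a label, so any transitions built from them cannot match the acoustic model's outputs.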

This pull request adds an option, is_token_based, which indicates that the vocabulary consists of custom (multi-character) tokens.
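A rough sketch of what such a flag could change, written in Python for brevity. The function entry_to_labels and the toy vocabulary are hypothetical, and the sketch assumes that in token-based mode each LM dictionary entry is itself a single vocabulary token; the PR's real implementation may differ.

```python
# Hypothetical BPE-style vocabulary (token -> label id); not taken from the PR.
vocab = {"he": 0, "llo": 1, "wor": 2, "ld": 3}


def entry_to_labels(entry, vocab, is_token_based):
    """Convert one LM dictionary entry into a list of label ids.

    Returns None when any resulting symbol is missing from the vocabulary.
    """
    # Assumption: a token-based entry is kept whole; otherwise it is
    # split into characters.
    symbols = [entry] if is_token_based else list(entry)
    if any(symbol not in vocab for symbol in symbols):
        return None
    return [vocab[symbol] for symbol in symbols]


token_path = entry_to_labels("llo", vocab, is_token_based=True)
char_path = entry_to_labels("llo", vocab, is_token_based=False)
print(token_path, char_path)
```

With the flag set, the entry maps to one valid label; without it, the character split finds no matching symbols.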

pe-trik commented Oct 15, 2021

Hi @SeanNaren, could you please review the PR? Thanks, Peter.

@TehGreatCat

@SeanNaren, please merge this branch; this is a really needed feature.
