(Also reported at https://code.google.com/p/mitlm/issues/detail?id=44; filing it here too in case GitHub gets more attention these days.)
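estimate-ngram aborts with an assertion failure when it loads a text counts file containing a token that starts with #. The crash only happens if the n-gram order is higher than 1, and only if the # occurs at the start of a token.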
My guess is that the counts loader treats a # at the beginning of a line in a text counts file as the start of a comment and skips that line, so a unigram beginning with # is missing from the term dictionary when it is encountered in a later bigram.
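To make the guess concrete, here is a minimal Python sketch of a counts reader that skips lines starting with #. It is not mitlm's actual loader: `load_counts`, `unknown_words`, and the `skip_hash_comments` flag are hypothetical names, and seeding the vocabulary with `<s>` and `</s>` is an assumption (the counts file in the steps below has no unigram line for `</s>`).

```python
# Hypothetical sketch of the suspected behavior; NOT mitlm's real code.
BOUNDARY_MARKERS = {"<s>", "</s>"}  # assumption: always known to the loader


def load_counts(lines, skip_hash_comments=True):
    """Read 'w1 ... wn count' lines into a vocabulary and an n-gram table."""
    vocab = set(BOUNDARY_MARKERS)
    ngrams = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if skip_hash_comments and line.startswith("#"):
            continue  # assumed behavior: the whole line is treated as a comment
        *words, count = line.split()
        if len(words) == 1:
            vocab.add(words[0])  # unigram lines define the term dictionary
        ngrams[tuple(words)] = int(count)
    return vocab, ngrams


def unknown_words(vocab, ngrams):
    """Return words used in higher-order n-grams that have no dictionary entry."""
    return sorted(
        {w for words in ngrams if len(words) > 1 for w in words if w not in vocab}
    )


# Same content as the counts file produced in the reproduction steps.
COUNTS = """\
<s> 1
a 1
#hashtag 1
<s> a 1
a #hashtag 1
#hashtag </s> 1
<s> a #hashtag 1
a #hashtag </s> 1
"""

for skip in (True, False):
    vocab, ngrams = load_counts(COUNTS.splitlines(), skip_hash_comments=skip)
    print(skip, unknown_words(vocab, ngrams))
```

With the skip enabled, the unigram line for #hashtag is dropped, yet the bigram "a #hashtag" is still read, so #hashtag shows up as an unknown word. With the skip disabled, nothing is unknown. That would be consistent with the crash needing an order above 1, but it remains a guess about the cause.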
What steps will reproduce the problem?
$ estimate-ngram -wc counts -text <(echo 'a #hashtag')
0.001 Loading corpus /dev/fd/63...
0.002 Smoothing[1] = ModKN
0.002 Smoothing[2] = ModKN
0.002 Smoothing[3] = ModKN
0.002 Set smoothing algorithms...
0.002 Saving counts to counts...
$ cat counts
<s> 1
a 1
#hashtag 1
<s> a 1
a #hashtag 1
#hashtag </s> 1
<s> a #hashtag 1
a #hashtag </s> 1
$ estimate-ngram -counts counts -wl lm.arpa
0.001 Loading counts counts...
estimate-ngram: src/NgramModel.cpp:800: void mitlm::NgramModel::_ComputeBackoffs(): Assertion `allTrue(backoffs != NgramVector::Invalid)' failed.
Aborted (core dumped)
What version of the product are you using? On what operating system?
Built from the latest master on GitHub. Ubuntu 14.04.1