
freecraver

See #26 (comment)

Introduces a non-breaking change that allows overriding the default word-level tokenization.

The new f_tokenize_words argument accepts a function that maps a text to its list of words.

Example:

from readability import Readability
from nltk import word_tokenize

r = Readability(text, f_tokenize_words=word_tokenize)
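Any callable with the same shape works, not just NLTK tokenizers. As a minimal sketch, a plain regex-based tokenizer (the name `simple_word_tokenize` and the word pattern are illustrative, not part of this PR) could be passed the same way:

```python
import re

def simple_word_tokenize(text):
    # Illustrative tokenizer: a word is a run of letters/digits,
    # optionally joined by internal apostrophes or hyphens
    # (so "We've" stays one token).
    return re.findall(r"[A-Za-z0-9]+(?:['\-][A-Za-z0-9]+)*", text)

# Used the same way as word_tokenize above:
# r = Readability(text, f_tokenize_words=simple_word_tokenize)
```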

Tests run ✔️
Tests added ✔️
Added section 'What makes a word' to Readme ✔️

Additional remarks:

  • The main difference I observed between NLTK's TweetTokenizer and the TreebankWordTokenizer is the handling of clitics and abbreviations:

| Text | TweetTokenizer | TreebankWordTokenizer |
| --- | --- | --- |
| "We've got two different solutions" | ["We've", 'got', 'two', 'different', 'solutions'] | ['We', "'ve", 'got', 'two', 'different', 'solutions'] |
| 'How common are abbreviations in the U.S.?' | ['How', 'common', 'are', 'abbreviations', 'in', 'the', 'U', '.', 'S', '.', '?'] | ['How', 'common', 'are', 'abbreviations', 'in', 'the', 'U.S.', '?'] |
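The clitic behavior in the table can be illustrated without NLTK. This is a toy sketch of Treebank-style clitic splitting (not NLTK's actual implementation; the helper name and the clitic list are assumptions for illustration):

```python
import re

# Trailing clitics that Treebank-style tokenizers split off a word.
CLITICS = re.compile(r"(?i)(\w+)('ve|'re|'ll|'d|'s|'m|n't)$")

def split_clitics(tokens):
    # Post-process whitespace tokens: "We've" -> "We", "'ve";
    # tokens without a trailing clitic pass through unchanged.
    out = []
    for tok in tokens:
        m = CLITICS.match(tok)
        if m:
            out.extend([m.group(1), m.group(2)])
        else:
            out.append(tok)
    return out
```

Keeping contractions whole (TweetTokenizer-style) is then simply skipping this step, which is why the two tokenizers yield different word counts for the same text.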
