What is the convention to filter non-words? #1251
Answered by AngledLuffa
otakutyrant asked this question in Q&A
Replies: 1 comment
-
We don't provide that facility, but of course you can filter punctuation as
you need, or you can read about stopwords.
This seems to cover some of it:
https://byteiota.com/stopwords/
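The suggestion above (drop punctuation yourself, then drop stopwords) could be sketched roughly like this. It is only an illustration, not a facility of the library: `STOPWORDS` is a hypothetical and deliberately tiny set, `is_symbol` and `filter_counts` are made-up helper names, and the token list is invented for the example.

```python
import unicodedata
from collections import Counter

# Hypothetical, deliberately tiny stopword set; a real list would be much longer.
STOPWORDS = {"i", "my", "'s"}


def is_symbol(token: str) -> bool:
    """True when every character is Unicode punctuation (P*) or a symbol (S*)."""
    return bool(token) and all(
        unicodedata.category(ch)[0] in "PS" for ch in token
    )


def filter_counts(counts: Counter) -> Counter:
    """Keep only entries that are neither pure symbols nor stopwords."""
    return Counter({
        token: n
        for token, n in counts.items()
        if not is_symbol(token) and token.lower() not in STOPWORDS
    })


# Illustrative lemma list, not real output from the book.
lemmas = ["I", ",", "my", "book", ".", "book", "—", "'s", "|"]
filtered = filter_counts(Counter(lemmas))
print(filtered.most_common())
```

Checking the Unicode category rather than a fixed character list catches dashes and quote marks such as `—` and `''` without enumerating them, while a clitic such as `'s` still has to be handled through the stopword set.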
On Sun, May 28, 2023 at 11:09 PM otakutyrant wrote:
I lemmatized the whole book and counted the lemmas, but the top entries are symbols:
, 44348
. 19035
I 9928
— 6975
'' 6835
my 3714
; 3421
- 3172
! 2036
) 1183
's 1154
( 1148
? 1116
' 941
| 575
: 543
[ 313
] 312
* 281
I know how to filter them. Just use isalpha() and isascii(). But I wonder
what the convention is in NLP, like an internal API which I do not know?
Answer selected by otakutyrant
-
I lemmatized a book and counted the lemmas, but the top entries are almost all symbols.
I know how to filter them: just use `isalpha()` and `isascii()`. But I wonder what the convention is in NLP. Is there some internal API that I do not know about?
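For reference, the `isalpha()` plus `isascii()` filter mentioned in the question might look like the sketch below. The helper names `is_plain_word` and `count_words` are hypothetical, and the input is assumed to be a plain list of lemma strings.

```python
from collections import Counter


def is_plain_word(token: str) -> bool:
    """True for tokens made only of ASCII letters (str.isascii needs Python 3.7+)."""
    return token.isalpha() and token.isascii()


def count_words(lemmas) -> Counter:
    """Count only the lemmas that pass the ASCII-letters check."""
    return Counter(lemma for lemma in lemmas if is_plain_word(lemma))


# Illustrative input, not real output from the book.
counts = count_words(["I", ",", "my", "—", "'s", "book", "book", "café"])
print(counts.most_common())
```

Note that this check is strict: it also discards accented words such as `café` (rejected by `isascii()`) and hyphenated or apostrophe-bearing tokens (rejected by `isalpha()`), which may or may not be what is wanted for a given book.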