bug: StringEncoder and TextEncoder raise exceptions when the input is already categorical #1400
Comments
Hey @MarieSacksick, thanks for opening the issue. It seems like the problem is that the object columns are already encoded as categories, and the StringEncoder does not handle that gracefully 🙈 For the time being, you should be able to get past the error by converting the categorical columns back to strings and continuing from there. I'll fix this issue ASAP.
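A minimal sketch of that workaround, assuming plain pandas is enough to undo the categorical encoding before the frame reaches skrub. The helper name `categoricals_to_strings` is hypothetical and not part of skrub.

```python
import pandas as pd


def categoricals_to_strings(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with categorical columns cast to object dtype.

    Hypothetical helper for illustration only; it is not a skrub function.
    """
    out = df.copy()
    for name in out.columns:
        if isinstance(out[name].dtype, pd.CategoricalDtype):
            # astype(object) yields the category labels themselves and keeps
            # missing values as NaN instead of turning them into the text "nan".
            out[name] = out[name].astype(object)
    return out


df = pd.DataFrame(
    {
        "category1": pd.Categorical(["A", "B", "A", "C"]),
        "numeric": [1, 2, 3, 4],
    }
)
df_str = categoricals_to_strings(df)
```

The converted frame `df_str` can then be passed to the vectorizer in place of `df`; non-categorical columns are left untouched.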
It seems like the TextEncoder has the same issue.
No worries, I rolled back to a stable version and it works fine.
This is a minimal example for reproducibility:

import pandas as pd
from skrub import TableVectorizer, StringEncoder

data = {
    'category1': pd.Categorical(['A', 'B', 'A', 'C']),
    'category2': pd.Categorical(['X', 'Y', 'X', 'Z']),
    'numeric': [1, 2, 3, 4]
}
df = pd.DataFrame(data)
tv = TableVectorizer(low_cardinality=StringEncoder(n_components=2))
out = tv.fit_transform(df)

Gives
After an IRL talk with @Vincent-Maladiere, we thought about converting the categorical columns to strings. The StringEncoder definitely needs this, because it tries to fill null values and fails on categorical columns. The TextEncoder raises an exception if the column's dtype is not string, but it might work if that particular check is modified to accept categorical columns 🤔 We could either modify the
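To illustrate the null-filling failure mentioned above, here is a small pandas-only sketch. It is an assumption that this mirrors what happens inside the StringEncoder; the placeholder value `"missing"` is made up for the example.

```python
import pandas as pd

# A categorical column with one missing value.
col = pd.Series(pd.Categorical(["A", None, "C"]))

# Filling with a value that is not one of the existing categories is rejected
# by pandas (TypeError or ValueError depending on the pandas version).
try:
    col.fillna("missing")
    fill_failed = False
except (TypeError, ValueError):
    fill_failed = True

# After casting to object dtype, the same fill works.
as_strings = col.astype(object).fillna("missing")
```

This is why converting to strings first sidesteps the error: once the column is no longer categorical, any fill value is accepted.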
Describe the bug
I'm trying to use the tabular learner on census data, but the StringEncoder raises an error.
Steps/Code to Reproduce
Expected Results
The pipeline runs (fit/transform + predict).
Actual Results
Versions