bug: StringEncoder and TextEncoder raise exceptions when the input is already categorical #1400


Open
MarieSacksick opened this issue May 21, 2025 · 5 comments · May be fixed by #1401
Labels
bug Something isn't working

Comments

@MarieSacksick

Describe the bug

I'm trying to use the tabular learner on the adult census data, but the StringEncoder raises an error.

Steps/Code to Reproduce

# %%
from sklearn.datasets import fetch_openml

X, y = fetch_openml("adult", version=2, as_frame=True, return_X_y=True)
# %%
# Let's take a look at the data
# in real life, we would do a lot more data exploration.
X.info()
# %%
y.value_counts()

# %%
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_encoded = le.fit_transform(y)

# %%
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, random_state=1)

# %%
from skrub import tabular_learner

baseline = tabular_learner("classification")
baseline

# %%
from skore import EstimatorReport

baseline_report = EstimatorReport(
    baseline,
    X_train=X_train,
    y_train=y_train,
    X_test=X_test,
    y_test=y_test,
)
baseline_report.help()

Expected Results

The pipeline runs end to end (fit/transform + predict).

Actual Results

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File ~/anaconda3/envs/skore_test/lib/python3.12/site-packages/skrub/_on_each_column.py:543, in _fit_transform_column(column, y, columns_to_handle, transformer, allow_reject)
    542 try:
--> 543     output = transformer.fit_transform(transformer_input, y=y)
    544 except allowed:

File ~/anaconda3/envs/skore_test/lib/python3.12/site-packages/skrub/_on_each_column.py:154, in _wrap_add_check_single_column.<locals>.fit_transform(self, X, y)
    153 self._check_single_column(X, f.__name__)
--> 154 return f(self, X, y=y)

File ~/anaconda3/envs/skore_test/lib/python3.12/site-packages/skrub/_string_encoder.py:136, in StringEncoder.fit_transform(***failed resolving arguments***)
    131     raise ValueError(
    132         f"Unknown vectorizer {self.vectorizer}. Options are 'tfidf' or"
    133         f" 'hashing', got {self.vectorizer!r}"
    134     )
--> 136 X_filled = sbd.fill_nulls(X, "")
    137 X_out = self.vectorizer_.fit_transform(X_filled).astype("float32")

File ~/anaconda3/envs/skore_test/lib/python3.12/functools.py:909, in singledispatch.<locals>.wrapper(*args, **kw)
    906     raise TypeError(f'{funcname} requires at least '
    907                     '1 positional argument')
--> 909 return dispatch(args[0].__class__)(*args, **kw)

File ~/anaconda3/envs/skore_test/lib/python3.12/site-packages/skrub/_dataframe/_common.py:1126, in _fill_nulls_pandas(obj, value)
   1125 with pd.option_context("future.no_silent_downcasting", True):
-> 1126     return obj.fillna(value)

File ~/anaconda3/envs/skore_test/lib/python3.12/site-packages/pandas/core/generic.py:7349, in NDFrame.fillna(self, value, method, axis, inplace, limit, downcast)
   7343         raise TypeError(
   7344             '"value" parameter must be a scalar, dict '
   7345             "or Series, but you passed a "
   7346             f'"{type(value).__name__}"'
   7347         )
-> 7349     new_data = self._mgr.fillna(
   7350         value=value, limit=limit, inplace=inplace, downcast=downcast
   7351     )
   7353 elif isinstance(value, (dict, ABCSeries)):

File ~/anaconda3/envs/skore_test/lib/python3.12/site-packages/pandas/core/internals/base.py:186, in DataManager.fillna(self, value, limit, inplace, downcast)
    184     limit = libalgos.validate_limit(None, limit=limit)
--> 186 return self.apply_with_block(
    187     "fillna",
    188     value=value,
    189     limit=limit,
    190     inplace=inplace,
    191     downcast=downcast,
    192     using_cow=using_copy_on_write(),
    193     already_warned=_AlreadyWarned(),
    194 )

File ~/anaconda3/envs/skore_test/lib/python3.12/site-packages/pandas/core/internals/managers.py:363, in BaseBlockManager.apply(self, f, align_keys, **kwargs)
    362 else:
--> 363     applied = getattr(b, f)(**kwargs)
    364 result_blocks = extend_blocks(applied, result_blocks)

File ~/anaconda3/envs/skore_test/lib/python3.12/site-packages/pandas/core/internals/blocks.py:2334, in ExtensionBlock.fillna(self, value, limit, inplace, downcast, using_cow, already_warned)
   2333 refs = None
-> 2334 new_values = self.values.fillna(value=value, method=None, limit=limit)
   2335 # issue the warning *after* retrying, in case the TypeError
   2336 #  was caused by an invalid fill_value

File ~/anaconda3/envs/skore_test/lib/python3.12/site-packages/pandas/core/arrays/_mixins.py:372, in NDArrayBackedExtensionArray.fillna(self, value, method, limit, copy)
    371             new_values = self[:]
--> 372         new_values[mask] = value
    373 else:
    374     # We validate the fill_value even if there is nothing to fill

File ~/anaconda3/envs/skore_test/lib/python3.12/site-packages/pandas/core/arrays/_mixins.py:261, in NDArrayBackedExtensionArray.__setitem__(self, key, value)
    260 key = check_array_indexer(self, key)
--> 261 value = self._validate_setitem_value(value)
    262 self._ndarray[key] = value

File ~/anaconda3/envs/skore_test/lib/python3.12/site-packages/pandas/core/arrays/categorical.py:1589, in Categorical._validate_setitem_value(self, value)
   1588 else:
-> 1589     return self._validate_scalar(value)

File ~/anaconda3/envs/skore_test/lib/python3.12/site-packages/pandas/core/arrays/categorical.py:1614, in Categorical._validate_scalar(self, fill_value)
   1613 else:
-> 1614     raise TypeError(
   1615         "Cannot setitem on a Categorical with a new "
   1616         f"category ({fill_value}), set the categories first"
   1617     ) from None
   1618 return fill_value

TypeError: Cannot setitem on a Categorical with a new category (), set the categories first

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
Cell In[8], line 3
      1 from skore import EstimatorReport
----> 3 baseline_report = EstimatorReport(
      4     baseline,
      5     X_train=X_train,
      6     y_train=y_train,
      7     X_test=X_test,
      8     y_test=y_test,
      9 )
     10 baseline_report.help()

File ~/Documents/skore/skore/src/skore/sklearn/_estimator/report.py:141, in EstimatorReport.__init__(self, estimator, fit, X_train, y_train, X_test, y_test)
    139         self._estimator = self._copy_estimator(estimator)
    140     except NotFittedError:
--> 141         self._estimator, fit_time = self._fit_estimator(
    142             estimator, X_train, y_train
    143         )
    144 elif fit is True:
    145     self._estimator, fit_time = self._fit_estimator(estimator, X_train, y_train)

File ~/Documents/skore/skore/src/skore/sklearn/_estimator/report.py:104, in EstimatorReport._fit_estimator(estimator, X_train, y_train)
    102 estimator_ = clone(estimator)
    103 with MeasureTime() as fit_time:
--> 104     estimator_.fit(X_train, y_train)
    105 return estimator_, fit_time()

File ~/anaconda3/envs/skore_test/lib/python3.12/site-packages/sklearn/base.py:1389, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
   1382     estimator._validate_params()
   1384 with config_context(
   1385     skip_parameter_validation=(
   1386         prefer_skip_nested_validation or global_skip_validation
   1387     )
   1388 ):
-> 1389     return fit_method(estimator, *args, **kwargs)

File ~/anaconda3/envs/skore_test/lib/python3.12/site-packages/sklearn/pipeline.py:654, in Pipeline.fit(self, X, y, **params)
    647     raise ValueError(
    648         "The `transform_input` parameter can only be set if metadata "
    649         "routing is enabled. You can enable metadata routing using "
    650         "`sklearn.set_config(enable_metadata_routing=True)`."
    651     )
    653 routed_params = self._check_method_params(method="fit", props=params)
--> 654 Xt = self._fit(X, y, routed_params, raw_params=params)
    655 with _print_elapsed_time("Pipeline", self._log_message(len(self.steps) - 1)):
    656     if self._final_estimator != "passthrough":

File ~/anaconda3/envs/skore_test/lib/python3.12/site-packages/sklearn/pipeline.py:588, in Pipeline._fit(self, X, y, routed_params, raw_params)
    581 # Fit or load from cache the current transformer
    582 step_params = self._get_metadata_for_step(
    583     step_idx=step_idx,
    584     step_params=routed_params[name],
    585     all_params=raw_params,
    586 )
--> 588 X, fitted_transformer = fit_transform_one_cached(
    589     cloned_transformer,
    590     X,
    591     y,
    592     weight=None,
    593     message_clsname="Pipeline",
    594     message=self._log_message(step_idx),
    595     params=step_params,
    596 )
    597 # Replace the transformer of the step with the fitted
    598 # transformer. This is necessary when loading the transformer
    599 # from the cache.
    600 self.steps[step_idx] = (name, fitted_transformer)

File ~/anaconda3/envs/skore_test/lib/python3.12/site-packages/joblib/memory.py:312, in NotMemorizedFunc.__call__(self, *args, **kwargs)
    311 def __call__(self, *args, **kwargs):
--> 312     return self.func(*args, **kwargs)

File ~/anaconda3/envs/skore_test/lib/python3.12/site-packages/sklearn/pipeline.py:1551, in _fit_transform_one(transformer, X, y, weight, message_clsname, message, params)
   1549 with _print_elapsed_time(message_clsname, message):
   1550     if hasattr(transformer, "fit_transform"):
-> 1551         res = transformer.fit_transform(X, y, **params.get("fit_transform", {}))
   1552     else:
   1553         res = transformer.fit(X, y, **params.get("fit", {})).transform(
   1554             X, **params.get("transform", {})
   1555         )

File ~/anaconda3/envs/skore_test/lib/python3.12/site-packages/sklearn/utils/_set_output.py:319, in _wrap_method_output.<locals>.wrapped(self, X, *args, **kwargs)
    317 @wraps(f)
    318 def wrapped(self, X, *args, **kwargs):
--> 319     data_to_wrap = f(self, X, *args, **kwargs)
    320     if isinstance(data_to_wrap, tuple):
    321         # only wrap the first output for cross decomposition
    322         return_tuple = (
    323             _wrap_data_with_container(method, data_to_wrap[0], X, self),
    324             *data_to_wrap[1:],
    325         )

File ~/anaconda3/envs/skore_test/lib/python3.12/site-packages/skrub/_table_vectorizer.py:775, in TableVectorizer.fit_transform(self, X, y)
    773 self._check_specific_columns()
    774 self._make_pipeline()
--> 775 output = self._pipeline.fit_transform(X, y=y)
    776 self.all_outputs_ = sbd.column_names(output)
    777 self._store_processing_steps()

File ~/anaconda3/envs/skore_test/lib/python3.12/site-packages/sklearn/base.py:1389, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
   1382     estimator._validate_params()
   1384 with config_context(
   1385     skip_parameter_validation=(
   1386         prefer_skip_nested_validation or global_skip_validation
   1387     )
   1388 ):
-> 1389     return fit_method(estimator, *args, **kwargs)

File ~/anaconda3/envs/skore_test/lib/python3.12/site-packages/sklearn/pipeline.py:718, in Pipeline.fit_transform(self, X, y, **params)
    679 """Fit the model and transform with the final estimator.
    680 
    681 Fit all the transformers one after the other and sequentially transform
   (...)
    715     Transformed samples.
    716 """
    717 routed_params = self._check_method_params(method="fit_transform", props=params)
--> 718 Xt = self._fit(X, y, routed_params)
    720 last_step = self._final_estimator
    721 with _print_elapsed_time("Pipeline", self._log_message(len(self.steps) - 1)):

File ~/anaconda3/envs/skore_test/lib/python3.12/site-packages/sklearn/pipeline.py:588, in Pipeline._fit(self, X, y, routed_params, raw_params)
    581 # Fit or load from cache the current transformer
    582 step_params = self._get_metadata_for_step(
    583     step_idx=step_idx,
    584     step_params=routed_params[name],
    585     all_params=raw_params,
    586 )
--> 588 X, fitted_transformer = fit_transform_one_cached(
    589     cloned_transformer,
    590     X,
    591     y,
    592     weight=None,
    593     message_clsname="Pipeline",
    594     message=self._log_message(step_idx),
    595     params=step_params,
    596 )
    597 # Replace the transformer of the step with the fitted
    598 # transformer. This is necessary when loading the transformer
    599 # from the cache.
    600 self.steps[step_idx] = (name, fitted_transformer)

File ~/anaconda3/envs/skore_test/lib/python3.12/site-packages/joblib/memory.py:312, in NotMemorizedFunc.__call__(self, *args, **kwargs)
    311 def __call__(self, *args, **kwargs):
--> 312     return self.func(*args, **kwargs)

File ~/anaconda3/envs/skore_test/lib/python3.12/site-packages/sklearn/pipeline.py:1551, in _fit_transform_one(transformer, X, y, weight, message_clsname, message, params)
   1549 with _print_elapsed_time(message_clsname, message):
   1550     if hasattr(transformer, "fit_transform"):
-> 1551         res = transformer.fit_transform(X, y, **params.get("fit_transform", {}))
   1552     else:
   1553         res = transformer.fit(X, y, **params.get("fit", {})).transform(
   1554             X, **params.get("transform", {})
   1555         )

File ~/anaconda3/envs/skore_test/lib/python3.12/site-packages/sklearn/utils/_set_output.py:319, in _wrap_method_output.<locals>.wrapped(self, X, *args, **kwargs)
    317 @wraps(f)
    318 def wrapped(self, X, *args, **kwargs):
--> 319     data_to_wrap = f(self, X, *args, **kwargs)
    320     if isinstance(data_to_wrap, tuple):
    321         # only wrap the first output for cross decomposition
    322         return_tuple = (
    323             _wrap_data_with_container(method, data_to_wrap[0], X, self),
    324             *data_to_wrap[1:],
    325         )

File ~/anaconda3/envs/skore_test/lib/python3.12/site-packages/skrub/_on_each_column.py:451, in OnEachColumn.fit_transform(self, X, y)
    449 parallel = Parallel(n_jobs=self.n_jobs)
    450 func = delayed(_fit_transform_column)
--> 451 results = parallel(
    452     func(
    453         sbd.col(X, col_name),
    454         y,
    455         self._columns,
    456         self.transformer,
    457         self.allow_reject,
    458     )
    459     for col_name in all_columns
    460 )
    461 return self._process_fit_transform_results(results, X)

File ~/anaconda3/envs/skore_test/lib/python3.12/site-packages/joblib/parallel.py:1918, in Parallel.__call__(self, iterable)
   1916     output = self._get_sequential_output(iterable)
   1917     next(output)
-> 1918     return output if self.return_generator else list(output)
   1920 # Let's create an ID that uniquely identifies the current call. If the
   1921 # call is interrupted early and that the same instance is immediately
   1922 # re-used, this id will be used to prevent workers that were
   1923 # concurrently finalizing a task from the previous call to run the
   1924 # callback.
   1925 with self._lock:

File ~/anaconda3/envs/skore_test/lib/python3.12/site-packages/joblib/parallel.py:1847, in Parallel._get_sequential_output(self, iterable)
   1845 self.n_dispatched_batches += 1
   1846 self.n_dispatched_tasks += 1
-> 1847 res = func(*args, **kwargs)
   1848 self.n_completed_tasks += 1
   1849 self.print_progress()

File ~/anaconda3/envs/skore_test/lib/python3.12/site-packages/skrub/_on_each_column.py:547, in _fit_transform_column(column, y, columns_to_handle, transformer, allow_reject)
    545     return col_name, [column], None
    546 except Exception as e:
--> 547     raise ValueError(
    548         f"Transformer {transformer.__class__.__name__}.fit_transform "
    549         f"failed on column {col_name!r}. See above for the full traceback."
    550     ) from e
    551 output = _utils.check_output(transformer, transformer_input, output)
    552 output_cols = sbd.to_column_list(output)

ValueError: Transformer StringEncoder.fit_transform failed on column 'native-country'. See above for the full traceback.

Versions

System:
    python: 3.12.8 | packaged by Anaconda, Inc. | (main, Dec 11 2024, 16:31:09) [GCC 11.2.0]
executable: /home/marie/anaconda3/envs/skore_test/bin/python
   machine: Linux-6.8.0-59-generic-x86_64-with-glibc2.35

Python dependencies:
      sklearn: 1.6.1
          pip: 24.2
   setuptools: 75.1.0
        numpy: 2.2.0
        scipy: 1.14.1
       Cython: None
       pandas: 2.2.3
   matplotlib: 3.9.3
       joblib: 1.4.2
threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 12
         prefix: libscipy_openblas
       filepath: /home/marie/anaconda3/envs/skore_test/lib/python3.12/site-packages/numpy.libs/libscipy_openblas64_-6bb31eeb.so
        version: 0.3.28
threading_layer: pthreads
   architecture: Haswell

       user_api: blas
   internal_api: openblas
    num_threads: 12
         prefix: libscipy_openblas
       filepath: /home/marie/anaconda3/envs/skore_test/lib/python3.12/site-packages/scipy.libs/libscipy_openblas-c128ec02.so
        version: 0.3.27.dev
threading_layer: pthreads
   architecture: Haswell

       user_api: openmp
   internal_api: openmp
    num_threads: 12
         prefix: libgomp
       filepath: /home/marie/anaconda3/envs/skore_test/lib/python3.12/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None
0.6.dev0

(skrub: skrub @ git+https://github.com/skrub-data/skrub.git@3f6d0c7301ebe75c5eb46bc3b2a988a395afc8cf)
@MarieSacksick MarieSacksick added the bug Something isn't working label May 21, 2025
@rcap107
Contributor

rcap107 commented May 21, 2025

Hey @MarieSacksick, thanks for opening the issue.

It seems the problem is that the string columns are already encoded as pandas categoricals, and the StringEncoder does not handle that gracefully 🙈

For the time being, you should be able to get past the error by converting the categorical columns back to strings and continuing from there. I'll fix this issue ASAP.
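For example, the workaround could look like this (a hypothetical `decategorize` helper, not part of skrub; it simply casts categorical columns back to a string dtype before the pipeline ever sees them):

```python
import pandas as pd

# Hypothetical workaround helper (not part of skrub): cast categorical
# columns back to plain strings before fitting the pipeline.
def decategorize(df):
    out = df.copy()
    for col in out.columns:
        if isinstance(out[col].dtype, pd.CategoricalDtype):
            # "string" keeps missing values as <NA>, instead of the
            # literal "nan" strings you would get with astype(str)
            out[col] = out[col].astype("string")
    return out

df = pd.DataFrame({"c": pd.Categorical(["A", None, "B"]), "n": [1, 2, 3]})
clean = decategorize(df)
```

After this cast, filling nulls with `""` works again, since the column is no longer constrained to a fixed set of categories.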

@rcap107
Contributor

rcap107 commented May 21, 2025

It seems like the TextEncoder has the same issue.

@MarieSacksick
Author

No worries, I rolled back to a stable version and it works fine.

@rcap107
Contributor

rcap107 commented May 21, 2025

Here is a minimal example for reproducibility:

import pandas as pd
from skrub import TableVectorizer, StringEncoder

data = {
    'category1': pd.Categorical(['A', 'B', 'A', 'C']),
    'category2': pd.Categorical(['X', 'Y', 'X', 'Z']),
    'numeric': [1, 2, 3, 4],
}
df = pd.DataFrame(data)

tv = TableVectorizer(low_cardinality=StringEncoder(n_components=2))
out = tv.fit_transform(df)

This gives:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File ~/Projects/skrub/skrub/_on_each_column.py:543, in _fit_transform_column(column, y, columns_to_handle, transformer, allow_reject)
    542 try:
--> 543     output = transformer.fit_transform(transformer_input, y=y)
    544 except allowed:

File ~/Projects/skrub/skrub/_on_each_column.py:154, in _wrap_add_check_single_column.<locals>.fit_transform(self, X, y)
    153 self._check_single_column(X, f.__name__)
--> 154 return f(self, X, y=y)

File ~/Projects/skrub/skrub/_string_encoder.py:134, in StringEncoder.fit_transform(***failed resolving arguments***)
    129     raise ValueError(
    130         f"Unknown vectorizer {self.vectorizer}. Options are 'tfidf' or"
    131         f" 'hashing', got {self.vectorizer!r}"
    132     )
--> 134 X_filled = sbd.fill_nulls(X, "")
    135 X_out = self.vectorizer_.fit_transform(X_filled).astype("float32")

File ~/.local/share/uv/python/cpython-3.11.10-linux-x86_64-gnu/lib/python3.11/functools.py:909, in singledispatch.<locals>.wrapper(*args, **kw)
    906     raise TypeError(f'{funcname} requires at least '
    907                     '1 positional argument')
--> 909 return dispatch(args[0].__class__)(*args, **kw)

File ~/Projects/skrub/skrub/_dataframe/_common.py:1126, in _fill_nulls_pandas(obj, value)
   1125 with pd.option_context("future.no_silent_downcasting", True):
-> 1126     return obj.fillna(value)

File ~/Projects/skrub/sk3/lib/python3.11/site-packages/pandas/core/generic.py:7349, in NDFrame.fillna(self, value, method, axis, inplace, limit, downcast)
   7343         raise TypeError(
   7344             '"value" parameter must be a scalar, dict '
   7345             "or Series, but you passed a "
   7346             f'"{type(value).__name__}"'
   7347         )
-> 7349     new_data = self._mgr.fillna(
   7350         value=value, limit=limit, inplace=inplace, downcast=downcast
   7351     )
   7353 elif isinstance(value, (dict, ABCSeries)):

File ~/Projects/skrub/sk3/lib/python3.11/site-packages/pandas/core/internals/base.py:186, in DataManager.fillna(self, value, limit, inplace, downcast)
    184     limit = libalgos.validate_limit(None, limit=limit)
--> 186 return self.apply_with_block(
    187     "fillna",
    188     value=value,
    189     limit=limit,
    190     inplace=inplace,
    191     downcast=downcast,
    192     using_cow=using_copy_on_write(),
    193     already_warned=_AlreadyWarned(),
    194 )

File ~/Projects/skrub/sk3/lib/python3.11/site-packages/pandas/core/internals/managers.py:363, in BaseBlockManager.apply(self, f, align_keys, **kwargs)
    362 else:
--> 363     applied = getattr(b, f)(**kwargs)
    364 result_blocks = extend_blocks(applied, result_blocks)

File ~/Projects/skrub/sk3/lib/python3.11/site-packages/pandas/core/internals/blocks.py:2334, in ExtensionBlock.fillna(self, value, limit, inplace, downcast, using_cow, already_warned)
   2333 refs = None
-> 2334 new_values = self.values.fillna(value=value, method=None, limit=limit)
   2335 # issue the warning *after* retrying, in case the TypeError
   2336 #  was caused by an invalid fill_value

File ~/Projects/skrub/sk3/lib/python3.11/site-packages/pandas/core/arrays/_mixins.py:376, in NDArrayBackedExtensionArray.fillna(self, value, method, limit, copy)
    375 if value is not None:
--> 376     self._validate_setitem_value(value)
    378 if not copy:

File ~/Projects/skrub/sk3/lib/python3.11/site-packages/pandas/core/arrays/categorical.py:1589, in Categorical._validate_setitem_value(self, value)
   1588 else:
-> 1589     return self._validate_scalar(value)

File ~/Projects/skrub/sk3/lib/python3.11/site-packages/pandas/core/arrays/categorical.py:1614, in Categorical._validate_scalar(self, fill_value)
   1613 else:
-> 1614     raise TypeError(
   1615         "Cannot setitem on a Categorical with a new "
   1616         f"category ({fill_value}), set the categories first"
   1617     ) from None
   1618 return fill_value

TypeError: Cannot setitem on a Categorical with a new category (), set the categories first

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
Cell In[2], line 12
      9 df = pd.DataFrame(data)
     11 tv = TableVectorizer(low_cardinality=StringEncoder(n_components=2), n_jobs=1)
---> 12 out = tv.fit_transform(df)

File ~/Projects/skrub/sk3/lib/python3.11/site-packages/sklearn/utils/_set_output.py:319, in _wrap_method_output.<locals>.wrapped(self, X, *args, **kwargs)
    317 @wraps(f)
    318 def wrapped(self, X, *args, **kwargs):
--> 319     data_to_wrap = f(self, X, *args, **kwargs)
    320     if isinstance(data_to_wrap, tuple):
    321         # only wrap the first output for cross decomposition
    322         return_tuple = (
    323             _wrap_data_with_container(method, data_to_wrap[0], X, self),
    324             *data_to_wrap[1:],
    325         )

File ~/Projects/skrub/skrub/_table_vectorizer.py:775, in TableVectorizer.fit_transform(self, X, y)
    773 self._check_specific_columns()
    774 self._make_pipeline()
--> 775 output = self._pipeline.fit_transform(X, y=y)
    776 self.all_outputs_ = sbd.column_names(output)
    777 self._store_processing_steps()

File ~/Projects/skrub/sk3/lib/python3.11/site-packages/sklearn/base.py:1389, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
   1382     estimator._validate_params()
   1384 with config_context(
   1385     skip_parameter_validation=(
   1386         prefer_skip_nested_validation or global_skip_validation
   1387     )
   1388 ):
-> 1389     return fit_method(estimator, *args, **kwargs)

File ~/Projects/skrub/sk3/lib/python3.11/site-packages/sklearn/pipeline.py:718, in Pipeline.fit_transform(self, X, y, **params)
    679 """Fit the model and transform with the final estimator.
    680 
    681 Fit all the transformers one after the other and sequentially transform
   (...)    715     Transformed samples.
    716 """
    717 routed_params = self._check_method_params(method="fit_transform", props=params)
--> 718 Xt = self._fit(X, y, routed_params)
    720 last_step = self._final_estimator
    721 with _print_elapsed_time("Pipeline", self._log_message(len(self.steps) - 1)):

File ~/Projects/skrub/sk3/lib/python3.11/site-packages/sklearn/pipeline.py:588, in Pipeline._fit(self, X, y, routed_params, raw_params)
    581 # Fit or load from cache the current transformer
    582 step_params = self._get_metadata_for_step(
    583     step_idx=step_idx,
    584     step_params=routed_params[name],
    585     all_params=raw_params,
    586 )
--> 588 X, fitted_transformer = fit_transform_one_cached(
    589     cloned_transformer,
    590     X,
    591     y,
    592     weight=None,
    593     message_clsname="Pipeline",
    594     message=self._log_message(step_idx),
    595     params=step_params,
    596 )
    597 # Replace the transformer of the step with the fitted
    598 # transformer. This is necessary when loading the transformer
    599 # from the cache.
    600 self.steps[step_idx] = (name, fitted_transformer)

File ~/Projects/skrub/sk3/lib/python3.11/site-packages/joblib/memory.py:312, in NotMemorizedFunc.__call__(self, *args, **kwargs)
    311 def __call__(self, *args, **kwargs):
--> 312     return self.func(*args, **kwargs)

File ~/Projects/skrub/sk3/lib/python3.11/site-packages/sklearn/pipeline.py:1551, in _fit_transform_one(transformer, X, y, weight, message_clsname, message, params)
   1549 with _print_elapsed_time(message_clsname, message):
   1550     if hasattr(transformer, "fit_transform"):
-> 1551         res = transformer.fit_transform(X, y, **params.get("fit_transform", {}))
   1552     else:
   1553         res = transformer.fit(X, y, **params.get("fit", {})).transform(
   1554             X, **params.get("transform", {})
   1555         )

File ~/Projects/skrub/sk3/lib/python3.11/site-packages/sklearn/utils/_set_output.py:319, in _wrap_method_output.<locals>.wrapped(self, X, *args, **kwargs)
    317 @wraps(f)
    318 def wrapped(self, X, *args, **kwargs):
--> 319     data_to_wrap = f(self, X, *args, **kwargs)
    320     if isinstance(data_to_wrap, tuple):
    321         # only wrap the first output for cross decomposition
    322         return_tuple = (
    323             _wrap_data_with_container(method, data_to_wrap[0], X, self),
    324             *data_to_wrap[1:],
    325         )

File ~/Projects/skrub/skrub/_on_each_column.py:451, in OnEachColumn.fit_transform(self, X, y)
    449 parallel = Parallel(n_jobs=self.n_jobs)
    450 func = delayed(_fit_transform_column)
--> 451 results = parallel(
    452     func(
    453         sbd.col(X, col_name),
    454         y,
    455         self._columns,
    456         self.transformer,
    457         self.allow_reject,
    458     )
    459     for col_name in all_columns
    460 )
    461 return self._process_fit_transform_results(results, X)

File ~/Projects/skrub/sk3/lib/python3.11/site-packages/joblib/parallel.py:1918, in Parallel.__call__(self, iterable)
   1916     output = self._get_sequential_output(iterable)
   1917     next(output)
-> 1918     return output if self.return_generator else list(output)
   1920 # Let's create an ID that uniquely identifies the current call. If the
   1921 # call is interrupted early and that the same instance is immediately
   1922 # re-used, this id will be used to prevent workers that were
   1923 # concurrently finalizing a task from the previous call to run the
   1924 # callback.
   1925 with self._lock:

File ~/Projects/skrub/sk3/lib/python3.11/site-packages/joblib/parallel.py:1847, in Parallel._get_sequential_output(self, iterable)
   1845 self.n_dispatched_batches += 1
   1846 self.n_dispatched_tasks += 1
-> 1847 res = func(*args, **kwargs)
   1848 self.n_completed_tasks += 1
   1849 self.print_progress()

File ~/Projects/skrub/skrub/_on_each_column.py:547, in _fit_transform_column(column, y, columns_to_handle, transformer, allow_reject)
    545     return col_name, [column], None
    546 except Exception as e:
--> 547     raise ValueError(
    548         f"Transformer {transformer.__class__.__name__}.fit_transform "
    549         f"failed on column {col_name!r}. See above for the full traceback."
    550     ) from e
    551 output = _utils.check_output(transformer, transformer_input, output)
    552 output_cols = sbd.to_column_list(output)

ValueError: Transformer StringEncoder.fit_transform failed on column 'category1'. See above for the full traceback.
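The underlying failure is easy to trigger directly in pandas: `fillna` on a categorical series rejects any fill value that is not already one of its categories, which is exactly what `sbd.fill_nulls(X, "")` runs into:

```python
import pandas as pd

s = pd.Series(pd.Categorical(["A", None, "B"]))
try:
    s.fillna("")  # "" is not one of the categories {"A", "B"}
except TypeError as e:
    print(e)  # "Cannot setitem on a Categorical with a new category ..."

# Registering "" as a category first (or casting to a string dtype)
# makes the fill succeed.
print(s.cat.add_categories([""]).fillna("").tolist())
```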

@rcap107 rcap107 changed the title bug: StringEncoder trouble with the adult census data bug: StringEncoder and TextEncoder raise exceptions when the input is already categorical May 21, 2025
@rcap107
Contributor

rcap107 commented May 21, 2025

After an IRL talk with @Vincent-Maladiere, we thought about converting the categorical columns to strings.

The StringEncoder definitely needs this: it fails as soon as it tries to fill null values in a categorical column.

The TextEncoder raises an exception if the column does not have a string datatype, but it might work if that particular check is relaxed to accept categorical columns 🤔

We could either modify the ToStr transformer to force the conversion of categoricals when needed, make a local change in both the TextEncoder and the StringEncoder, or write another transformer.
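A minimal sketch of the "local change" option, assuming the cast happens on each column just before the null-filling step (the `ensure_string` name is illustrative, not skrub's actual internals):

```python
import pandas as pd

def ensure_string(column: pd.Series) -> pd.Series:
    """Illustrative guard: cast a categorical column to a plain string
    dtype so that filling nulls with "" cannot hit an unknown category."""
    if isinstance(column.dtype, pd.CategoricalDtype):
        column = column.astype("string")
    return column

# A categorical column with a missing value, like 'native-country' above.
col = pd.Series(pd.Categorical(["A", None, "B"]), name="category1")
filled = ensure_string(col).fillna("")
```

Non-categorical columns pass through untouched, so the guard is safe to apply unconditionally on every column the encoders handle.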
