Various features and improvements for the skrub.datasets utilities #1422

rcap107 · 2025-05-28T13:20:32Z

I think the skrub.datasets utilities for loading and returning datasets could be improved a bit for a better user experience:

skrub supports both pandas and polars, but datasets are always returned as pandas dataframes. It would be good to add an option to return them as polars dataframes instead. This could be implemented as an option in the global config proposed in "Adding a skrub global config" (#1377), for example a "preferred dataframe engine" setting.
Currently, datasets are returned as a Bunch object, which is pretty much a dict that shows the repr of the dataframes when printed in a notebook cell. I think it would be better to have a more compact repr that shows the name and shape of the dataframes in the bunch, as well as metadata info if available.
This was mentioned in another issue ("Add some stats on the size of the datasets provided in skrub.datasets", #1252), but it would be useful to have more info on the datasets (at least their size) in the doc entry for the respective function.
It would be nice to have an option to save the datasets in parquet rather than csv (this could be another option for the global config in #1377).
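
To make the compact repr idea concrete, here is a rough sketch of what it could look like. `CompactBunch` and `FakeFrame` are hypothetical names used only for illustration, not skrub's or scikit-learn's `Bunch`; the shapes are made-up placeholders, and the stand-in frame only exists so the snippet runs without pandas or polars installed.

```python
class CompactBunch(dict):
    """Hypothetical dict with attribute access and a short repr."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def __repr__(self):
        lines = [f"{type(self).__name__}("]
        for key, value in self.items():
            shape = getattr(value, "shape", None)
            if shape is not None:
                # Anything exposing .shape (pandas, polars, numpy) is
                # summarized by type and shape instead of its full repr.
                lines.append(
                    f"  {key}: {type(value).__name__} with shape {tuple(shape)}"
                )
            else:
                # Other entries (metadata, names) are shown, truncated if long.
                text = repr(value)
                if len(text) > 40:
                    text = text[:37] + "..."
                lines.append(f"  {key}: {text}")
        lines.append(")")
        return "\n".join(lines)


class FakeFrame:
    """Stand-in for a dataframe; only the .shape attribute matters here."""

    def __init__(self, n_rows, n_cols):
        self.shape = (n_rows, n_cols)


# Placeholder shapes, not the real size of any skrub dataset.
bunch = CompactBunch(X=FakeFrame(5, 2), y=FakeFrame(5, 1), name="toxicity")
summary = repr(bunch)
print(summary)
```

Relying on `.shape` rather than checking for a specific dataframe type would keep such a repr independent of whether the bunch holds pandas or polars objects.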

GaelVaroquaux · 2025-05-28T13:46:57Z

Rather than supporting polars, I would prefer adding the filenames and changing many of our examples to load the data.

I think that this is a better pattern. It's closer to code that people might write.

rcap107 · 2025-05-28T13:52:49Z

Rather than supporting polars, I would prefer adding the filenames and changing many of our examples to load the data.

I think that this is a better pattern. It's closer to code that people might write.

So rather than

data = skrub.datasets.fetch_toxicity()

something like this?

data_dir = skrub.datasets.get_data_dir()

data = pl.read_csv(data_dir / "toxicity/toxicity/toxicity.csv")

ideally less janky than that

GaelVaroquaux · 2025-05-28T14:19:24Z

Not "get_data_dir": we need to trigger the download.

More:

data = skrub.datasets.fetch_toxicity()
df = pl.read_csv(data.filename)
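
A minimal, stdlib-only sketch of that pattern, under stated assumptions: `fetch_toy` is a hypothetical fetcher (not a skrub function) that makes sure the file exists, writing a tiny CSV where a real fetcher would download, and returns an object exposing `filename`. The caller then loads the file with whatever reader they prefer; the `csv` module stands in for `pl.read_csv` or `pd.read_csv` here.

```python
import csv
import tempfile
from pathlib import Path
from types import SimpleNamespace


def fetch_toy(data_dir):
    """Hypothetical fetcher: ensure the file exists, return its location."""
    path = Path(data_dir) / "toy" / "toy.csv"
    if not path.exists():
        # Stand-in for the download step a real fetcher would trigger.
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("w", newline="") as f:
            csv.writer(f).writerows(
                [["text", "label"], ["hello", "0"], ["go away", "1"]]
            )
    # The returned object carries the path rather than a loaded dataframe.
    return SimpleNamespace(filename=str(path))


with tempfile.TemporaryDirectory() as tmp:
    data = fetch_toy(tmp)
    # The user picks the reader; any CSV reader works on data.filename.
    with open(data.filename, newline="") as f:
        rows = list(csv.DictReader(f))

print(rows)
```

The point of the design is that the fetcher stays agnostic about the dataframe library: it guarantees the file is on disk, and the example code shows an explicit read call, as a user would write it.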

Vincent-Maladiere · 2025-05-28T14:35:17Z

Hey @rcap107, these are interesting ideas! That said, I’d challenge the priority of working on this, as it doesn’t seem to bring much benefit to users.

rcap107 · 2025-05-28T14:40:55Z

Hey @rcap107, these are interesting ideas! That said, I’d challenge the priority of working on this, as it doesn’t seem to bring much benefit to users.

I agree it's not a particularly urgent issue. I noted it down as I was working on an example in case someone wants to work on it.

Vincent-Maladiere · 2025-05-28T15:29:15Z

This could be a good first issue to suggest during sprints indeed!

GaelVaroquaux · 2025-05-28T15:38:23Z

This could be a good first issue to suggest during sprints indeed!

I would start simple by suggesting the "filename" thing (maybe create a subissue with "good first issue" label)

rcap107 added the "help wanted" and "good first issue" labels on May 28, 2025

rcap107 removed the "help wanted" label on May 28, 2025
