Various features and improvements for the skrub.datasets utilities #1422

Open
rcap107 opened this issue May 28, 2025 · 7 comments
Labels
good first issue Good for newcomers

Comments

@rcap107
Contributor

rcap107 commented May 28, 2025

I think the skrub.datasets utilities for loading and returning datasets could be improved a bit for a better user experience:

  • skrub supports both pandas and polars, but datasets are always returned as pandas. It would be good to add an option to return the dataframes as polars dataframes instead. This could be implemented as an option for Adding a skrub global config #1377 ("preferred dataframe engine" for example)
  • Currently, datasets are returned as a Bunch object, which is pretty much a dict that shows the repr of the dataframes when printed in a notebook cell. I think it would be better to have a more compact repr that shows the name and shape of the dataframes in the bunch, as well as metadata info if available.
  • This was mentioned in another issue (Add some stats on the size of the datasets provided in skrub.datasets #1252), but it would be useful to have more info on the datasets (at least, their size) on the doc entry for the respective function.
  • It would be nice to have an option to save the datasets in parquet, rather than csv (this could be another option for Adding a skrub global config #1377)
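
A minimal sketch of what the more compact repr from the second bullet could look like. `CompactBunch` and `FakeFrame` are hypothetical names used only for illustration (they are not part of skrub), and the shape shown is a placeholder rather than the size of any real dataset:

```python
class CompactBunch(dict):
    """Hypothetical dict-like container with a short repr (not skrub's Bunch)."""

    def __getattr__(self, name):
        # Allow attribute-style access to the stored keys.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def __repr__(self):
        lines = [f"{type(self).__name__}("]
        for key, value in self.items():
            shape = getattr(value, "shape", None)
            if shape is not None:
                # Dataframe-like values: show only the type and shape.
                lines.append(f"  {key}: {type(value).__name__}, shape={tuple(shape)}")
            else:
                # Metadata: show a truncated repr.
                text = repr(value)
                if len(text) > 40:
                    text = text[:37] + "..."
                lines.append(f"  {key}: {text}")
        lines.append(")")
        return "\n".join(lines)


class FakeFrame:
    """Stand-in for a dataframe so the sketch runs without pandas or polars."""

    def __init__(self, n_rows, n_cols):
        self.shape = (n_rows, n_cols)


# Placeholder content, not real dataset statistics.
bunch = CompactBunch(X=FakeFrame(5, 2), name="toxicity")
print(bunch)
```

Both pandas and polars dataframes expose a `shape` attribute, so the same check would work for either engine.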
rcap107 added the "help wanted" and "good first issue" labels May 28, 2025
@GaelVaroquaux
Member

Rather than supporting polars, I would prefer adding the filenames and changing many of our examples to load the data.

I think that this is a better pattern. It's closer to code that people might write.

@rcap107
Contributor Author

rcap107 commented May 28, 2025

Rather than supporting polars, I would prefer adding the filenames and changing many of our examples to load the data.

I think that this is a better pattern. It's closer to code that people might write.

So rather than

data = skrub.datasets.fetch_toxicity()

something like this?

data_dir = skrub.datasets.get_data_dir()

data = pl.read_csv(data_dir / "toxicity/toxicity/toxicity.csv")

ideally less janky than that

@GaelVaroquaux
Member

Not "get_data_dir": we need to trigger the download.

More:

data = skrub.datasets.fetch_toxicity()
df = pl.read_csv(data.filename)
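
A rough, standard-library-only sketch of the pattern suggested above: the fetcher makes sure the file is present (the step that `get_data_dir` alone would not trigger) and returns its path, and the caller then loads it with whatever reader they prefer. `fetch_example`, the `filename` attribute on the returned object, and the CSV contents are all assumptions made for illustration; `csv.DictReader` stands in for `pl.read_csv`:

```python
import csv
import tempfile
from pathlib import Path
from types import SimpleNamespace


def fetch_example(data_dir):
    """Hypothetical fetcher: ensure the file exists, then return its path."""
    path = Path(data_dir) / "example.csv"
    if not path.exists():
        # Stand-in for the real download step; the content is made up.
        path.write_text("text,label\nhello,0\nworld,1\n", encoding="utf-8")
    return SimpleNamespace(filename=path)


with tempfile.TemporaryDirectory() as tmp:
    data = fetch_example(tmp)
    # The caller chooses how to read the file.
    with open(data.filename, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

print(rows)
```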

@Vincent-Maladiere
Member

Hey @rcap107, these are interesting ideas! That said, I’d challenge the priority of working on this, as it doesn’t seem to bring much benefit to users.

rcap107 removed the "help wanted" label May 28, 2025
@rcap107
Contributor Author

rcap107 commented May 28, 2025

Hey @rcap107, these are interesting ideas! That said, I’d challenge the priority of working on this, as it doesn’t seem to bring much benefit to users.

I agree it's not a particularly urgent issue. I noted it down as I was working on an example in case someone wants to work on it.

@Vincent-Maladiere
Member

This could be a good first issue to suggest during sprints indeed!

@GaelVaroquaux
Member

GaelVaroquaux commented May 28, 2025 via email


3 participants