Various features and improvements for the skrub.datasets utilities #1422
Comments
Rather than supporting polars, I would prefer adding the filenames and changing many of our examples to load the data. I think that this is a better pattern: it's closer to code that people might write. |
So rather than
    data = skrub.datasets.fetch_toxicity()
something like this?
    data_dir = skrub.datasets.get_data_dir()
    data = pl.read_csv(data_dir / "toxicity/toxicity/toxicity.csv")
ideally less janky than that |
Not "get_data_dir": we need to trigger the download. Something more like:
    data = skrub.datasets.fetch_toxicity()
    df = pl.read_csv(data.filename) |
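The pattern proposed above (the fetcher triggers the download, and the caller reads the file through a `filename` attribute) can be sketched roughly as follows. This is only an illustration under stated assumptions: `fetch_toy_dataset`, the `toy/toy.csv` layout, and the returned `name` field are hypothetical and not part of skrub, and the "download" is simulated by writing a small CSV locally.

```python
import csv
import tempfile
from pathlib import Path
from types import SimpleNamespace


def fetch_toy_dataset(data_dir):
    """Hypothetical fetcher: make sure the file exists, then return its path."""
    path = Path(data_dir) / "toy" / "toy.csv"
    if not path.exists():
        # Stand-in for the download step; a real fetcher would retrieve
        # the file from a remote location here.
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("w", newline="") as f:
            csv.writer(f).writerows(
                [["text", "is_toxic"], ["hello", "0"], ["go away", "1"]]
            )
    # The caller receives metadata plus the location of the file on disk.
    return SimpleNamespace(name="toy", filename=path)


with tempfile.TemporaryDirectory() as tmp:
    data = fetch_toy_dataset(tmp)
    # Any CSV reader can take over from here, e.g. pl.read_csv(data.filename).
    with open(data.filename, newline="") as f:
        rows = list(csv.DictReader(f))
```

Because the fetcher only guarantees that the file is present and reports where it is, the choice of dataframe library stays with the user, which is the point made in the comment above.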
Hey @rcap107, these are interesting ideas! That said, I’d challenge the priority of working on this, as it doesn’t seem to bring much benefit to users. |
I agree it's not a particularly urgent issue. I noted it down as I was working on an example in case someone wants to work on it. |
This could be a good first issue to suggest during sprints indeed! |
I would start simple by suggesting the "filename" thing (maybe create a subissue with the "good first issue" label) |
I think the skrub.datasets utilities for loading and returning datasets could be improved a bit for a better user experience:
Bunch object, which is pretty much a dict that shows the repr of the dataframes when printed in a notebook cell. I think it would be better to have a more compact repr that shows the name and shape of the dataframes in the bunch, as well as metadata info if available.
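One possible shape for such a compact repr is sketched below. It assumes a dict subclass that reports the shape of any value exposing a `.shape` attribute and prints other values as-is. `CompactBunch` and `FakeFrame` are hypothetical names used only for illustration; `FakeFrame` stands in for a pandas or polars dataframe so that the sketch has no dependencies.

```python
class CompactBunch(dict):
    """Hypothetical dict subclass with attribute access and a short repr."""

    def __getattr__(self, key):
        # Only called when normal attribute lookup fails, so dict
        # methods such as .items() keep working.
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key) from None

    def __repr__(self):
        lines = [f"{type(self).__name__}("]
        for key, value in self.items():
            shape = getattr(value, "shape", None)
            if shape is not None:
                # Dataframe-like value: show its type and shape only.
                lines.append(
                    f"  {key}: {type(value).__name__} with shape {tuple(shape)}"
                )
            else:
                # Metadata and other small values are shown directly.
                lines.append(f"  {key}: {value!r}")
        lines.append(")")
        return "\n".join(lines)


class FakeFrame:
    """Stand-in for a dataframe; only the .shape attribute matters here."""

    def __init__(self, n_rows, n_cols):
        self.shape = (n_rows, n_cols)


bunch = CompactBunch(X=FakeFrame(1000, 1), y=FakeFrame(1000, 1), name="toxicity")
summary = repr(bunch)
```

Relying on `.shape` rather than on a specific dataframe type keeps the sketch agnostic to pandas versus polars; a real implementation would also need to decide how to truncate long metadata strings.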