Add subsampling hint in TableReport used in expression #1384

GaelVaroquaux opened this issue May 14, 2025 · 6 comments · May be fixed by #1418
Open

Labels
enhancement New feature or request expressions Something related to the skrub expressions TableReport anything related to the TableReport

Comments

@GaelVaroquaux
Member

At the bottom of the first tab of the TableReport, there is some space (where it says "100 rows ✕ 9 columns. "):
Image

We should use this to add usability messages in general, when the TableReport is called from a given context.

Specifically, in the case of subsampled previews, we should write here "100 rows ✕ 9 columns (subsampled from xxxx rows)".

@GaelVaroquaux GaelVaroquaux added enhancement New feature or request TableReport anything related to the TableReport expressions Something related to the skrub expressions labels May 14, 2025
@Vincent-Maladiere
Member

Should we remove this line from the display? It is kinda redundant with your suggestion, but since the display might be switched to a regular pandas or polars dataframe in the future as per #1377, we might have to keep it.

Image

@GaelVaroquaux
Member Author

For the reason that you give, I think that for now we should not remove the line on the top.

I do think that we can be more creative with the space on the top, and have a better visual display of the information. But I still think that the information that this is working on a subsample should be very, very visible, and hence duplication is fine.

@Vincent-Maladiere
Member

> when the TableReport is called from a given context.

In which context should we provide the unsampled number of rows? As @rcap107 pointed out, if we simply evaluate the expression using the full data to get the number of samples of the output, then we don't benefit from subsampling at all.
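To make the cost concern concrete, here is a toy sketch in plain Python. It is not skrub's evaluation machinery: `expensive_step`, the `stats` counter, and the data are hypothetical stand-ins, assumed only for illustration. It shows that the preview touches just the subsample, while learning the true output row count means pushing every row through the step.

```python
# Hypothetical stand-in for a costly pipeline step (not skrub code).
# It records how many input rows it had to process.
stats = {"rows_processed": 0}


def expensive_step(rows):
    stats["rows_processed"] += len(rows)
    # The output size depends on the data, so it cannot be known in advance.
    return [r for r in rows if r % 3 == 0]


full_data = list(range(10_000))

# Subsampled preview: only the first 100 rows go through the step.
preview = expensive_step(full_data[:100])
preview_cost = stats["rows_processed"]

# Getting the unsampled output row count requires processing all rows.
stats["rows_processed"] = 0
full_output_rows = len(expensive_step(full_data))
full_cost = stats["rows_processed"]
```

In this sketch the preview cost scales with the subsample size and the full count scales with the whole dataset, which is the loss of benefit described above.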

@Vincent-Maladiere
Member

An alternative is to simply mention "subsampled", without the number of rows

Image

@GaelVaroquaux
Member Author

> An alternative is to simply mention "subsampled", without the number of rows

Yes, that's the right way to do it IMHO (I wrote the original issue a bit too fast, without thinking it through). And the corresponding API in the TableReport could be something like footer_msg="(subsampled)" (well, "" by default, but called here with "(subsampled)").
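A minimal sketch of how such a parameter might behave, in plain Python. `footer_msg` is only the name floated in this comment, and `shape_line` is a hypothetical helper; neither is confirmed skrub API, and the real TableReport renders this line in its own HTML template.

```python
def shape_line(n_rows, n_cols, footer_msg=""):
    """Build the shape summary shown at the bottom of the first tab.

    `footer_msg` is the hypothetical parameter proposed in this thread:
    empty by default, so the existing display stays unchanged.
    """
    base = f"{n_rows} rows ✕ {n_cols} columns"
    # Append the context message only when the caller supplies one.
    return f"{base} {footer_msg}" if footer_msg else base


default_line = shape_line(100, 9)
subsampled_line = shape_line(100, 9, footer_msg="(subsampled)")
```

With the default, the line is unchanged; when called from a subsampled preview, the caller passes "(subsampled)" and the hint is appended without needing the unsampled row count.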

Thanks!!
