-
Notifications
You must be signed in to change notification settings - Fork 184
Skipping associations compute if the number of cols > specified #1304
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Skipping associations compute if the number of cols > specified #1304
Conversation
|
Final changes:
|
Vincent-Maladiere
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @victoris93, thank you for this PR!
@jeromedockes, we probably should have a single parameter to control both max_plot_columns and max_association_columns, WDYT? We can do it in another PR, though.
|
@Vincent-Maladiere we briefly touched upon this in our last conversation w/ @jeromedockes; imo it's an elegant simplification but from the user standpoint it's a big assumption that they would always prefer to avoid both the plotting and the association computation and for the same n. of columns. But totally your call ofc. |
|
Though if you decide to implement it later, would be down to submit another PR |
|
I don't have a strong opinion. @victoris93 is right that it's an assumption that a user would want to tie the 2 together. OTOH my guess would be that most users will leave those parameters to the default value all the time 🤔 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that very few users will actually tweak these parameters. That said, having both parameters could make sense for users who, for example, have many correlated columns but would rather not display the distributions of all of them. Therefore we can keep things as they are :)
| self.dataframe = dataframe | ||
| self.verbose = verbose | ||
| self.max_plot_columns = max_plot_columns | ||
| self.max_association_columns = max_association_columns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: we could handle None values (below) when setting the class attributes (above) to simplify the flow a bit.
| with_plots=with_plots, | ||
| with_associations=with_associations, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could simplify the lines above like:
| with_plots=with_plots, | |
| with_associations=with_associations, | |
| with_plots=self.max_plot_columns >= self.n_columns, | |
| with_associations=self.max_association_columns >= self.n_columns, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh that's right. There're some lines to delete just above, so will commit separately
skrub/_reporting/_table_report.py
Outdated
| self.max_association_columns = max_association_columns | ||
| self.n_columns = sbd.shape(self.dataframe)[1] | ||
|
|
||
| if self.max_plot_columns is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Vincent-Maladiere Not sure I understand the comment about None value handling. Would you like None values to be handled differently than what I have here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's almost nothing, having something like:
self.max_plot_columns = self.n_columns if max_plot_columns is None else max_plot_columnsThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah to follow @jeromedockes suggestion, we could have self.max_plot_columns = max_plot_columns, and then derive self.max_plot_columns_ using your logic. self.max_plot_columns_ would be the one actually used. I won't comment further as this gets too nitpicky
skrub/_reporting/_table_report.py
Outdated
| self.n_columns = sbd.shape(self.dataframe)[1] | ||
|
|
||
| if self.max_plot_columns is None: | ||
| self.max_plot_columns = self.n_columns | ||
| if self.max_association_columns is None: | ||
| self.max_association_columns = self.n_columns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather still store them as None so it is easy to check if the report was created with max_plot_columns set to None or to the number of columns in the dataframe
skrub/_reporting/_table_report.py
Outdated
| return summarize_dataframe( | ||
| self.dataframe, with_plots=True, title=self.title, **self._summary_kwargs | ||
| ) | ||
| def _lightify_summary(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the reason why we had 2 cached properties is that the json does not need the plots so we might want either a full summary or a plot-less one, and both would be cached.
Now if we don't have just 2 kinds of report but several parameters we should just give up on that it wasn't really necessary, and more a relic from an early version where the tablereport was also able to produce output in text format.
We can have just
@functools.cached_property
def _summary(self):
n_columns = sbd.shape(self.dataframe)[1]
with_plots = self.max_plot_columns is None or self.max_plot_columns >= n_columns
with_associations = ...
return summarize_dataframe(self.dataframe, with_plots=with_plots, ...)
(and no more _lightify_summary nor _get_summary )
|
I think the review comments (at least mine) are getting unclear 😅 sorry about that @victoris93 . As the PR is working fine I suggest we merge it and any further changes can be done in another PR, WDYT? @victoris93 was there anything you still wanted to change before I merge it? |
|
@jeromedockes np pb, I think I got it. Just a sec, the last commit should address the comments, almost there |
|
ok thanks! If you do get rid of |
…max_association_columns atts as initially set.
|
@jeromedockes @Vincent-Maladiere should be good now I think |
jeromedockes
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks a lot! LGTM with the suggested changes
Co-authored-by: Jérôme Dockès <[email protected]>
Co-authored-by: Jérôme Dockès <[email protected]>
Co-authored-by: Jérôme Dockès <[email protected]>
jeromedockes
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great!! thanks so much @victoris93 this is really very useful 🚀 as the cramerv computation in particular can get very slow and take a lot of memory for a large number of columns
formatting
|
oops I wanted to run the pre-commit checks before you merged but it was too late, sry about that. I hope it's ok. |
there was just an extra blank line but I removed it before merging; the pre-commit checks are passing without complaints :) thanks again, and whenever you have time don't hesitate to ping us if you want help in picking the next thing to work on! |
|
Thanks Victoria!!
|
Closes #1287.
I made changes to skip assocations computation. Haven't modified to the tests yet, it's coming right up.
It would make sense to remove the
Associationstab from the html table altogether.