Skip to content

Conversation

@victoris93
Copy link
Contributor

@victoris93 victoris93 commented Apr 11, 2025

Closes #1287.

I made changes to skip assocations computation. Haven't modified to the tests yet, it's coming right up.
It would make sense to remove the Associations tab from the html table altogether.

@victoris93 victoris93 changed the title Skipping associations compute if the number of cols > default Skipping associations compute if the number of cols > specified Apr 11, 2025
@victoris93 victoris93 marked this pull request as ready for review April 11, 2025 15:55
@victoris93
Copy link
Contributor Author

Final changes:

  • max_association_columns = 30
  • If associations are not computed (n_cols > max_association_columns), the Associations tab is excluded from the report

Copy link
Member

@Vincent-Maladiere Vincent-Maladiere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @victoris93, thank you for this PR!

@jeromedockes, we probably should have a single parameter to control both max_plot_columns and max_association_columns, WDYT? We can do it in another PR, though.

@victoris93
Copy link
Contributor Author

@Vincent-Maladiere we briefly touched upon this in our last conversation w/ @jeromedockes; imo it's an elegant simplification but from the user standpoint it's a big assumption that they would always prefer to avoid both the plotting and the association computation and for the same n. of columns. But totally your call ofc.

@victoris93
Copy link
Contributor Author

victoris93 commented Apr 14, 2025

Though if you decide to implement it later, would be down to submit another PR

@jeromedockes
Copy link
Member

I don't have a strong opinion. @victoris93 is right that it's an assumption that a user would want to tie the 2 together. OTOH my guess would be that most users will leave those parameters to the default value all the time 🤔

Copy link
Member

@Vincent-Maladiere Vincent-Maladiere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that very few users will actually tweak these parameters. That said, having both parameters could make sense for users who, for example, have many correlated columns but would rather not display the distributions of all of them. Therefore we can keep things as they are :)

self.dataframe = dataframe
self.verbose = verbose
self.max_plot_columns = max_plot_columns
self.max_association_columns = max_association_columns
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: we could handle None values (below) when setting the class attributes (above) to simplify the flow a bit.

Comment on lines +167 to +168
with_plots=with_plots,
with_associations=with_associations,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could simplify the lines above like:

Suggested change
with_plots=with_plots,
with_associations=with_associations,
with_plots=self.max_plot_columns >= self.n_columns,
with_associations=self.max_association_columns >= self.n_columns,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh that's right. There're some lines to delete just above, so will commit separately

self.max_association_columns = max_association_columns
self.n_columns = sbd.shape(self.dataframe)[1]

if self.max_plot_columns is None:
Copy link
Contributor Author

@victoris93 victoris93 Apr 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Vincent-Maladiere Not sure I understand the comment about None value handling. Would you like None values to be handled differently than what I have here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's almost nothing, having something like:

self.max_plot_columns = self.n_columns if max_plot_columns is None else max_plot_columns

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah to follow @jeromedockes suggestion, we could have self.max_plot_columns = max_plot_columns, and then derive self.max_plot_columns_ using your logic. self.max_plot_columns_ would be the one actually used. I won't comment further as this gets too nitpicky

Comment on lines 144 to 149
self.n_columns = sbd.shape(self.dataframe)[1]

if self.max_plot_columns is None:
self.max_plot_columns = self.n_columns
if self.max_association_columns is None:
self.max_association_columns = self.n_columns
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather still store them as None so it is easy to check if the report was created with max_plot_columns set to None or to the number of columns in the dataframe

return summarize_dataframe(
self.dataframe, with_plots=True, title=self.title, **self._summary_kwargs
)
def _lightify_summary(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the reason why we had 2 cached properties is that the json does not need the plots so we might want either a full summary or a plot-less one, and both would be cached.

Now if we don't have just 2 kinds of report but several parameters we should just give up on that it wasn't really necessary, and more a relic from an early version where the tablereport was also able to produce output in text format.

We can have just

@functools.cached_property
def _summary(self):
    n_columns = sbd.shape(self.dataframe)[1]
    with_plots = self.max_plot_columns is None or self.max_plot_columns >= n_columns
    with_associations = ...
    return summarize_dataframe(self.dataframe, with_plots=with_plots, ...)

(and no more _lightify_summary nor _get_summary )

@jeromedockes
Copy link
Member

I think the review comments (at least mine) are getting unclear 😅 sorry about that @victoris93 . As the PR is working fine I suggest we merge it and any further changes can be done in another PR, WDYT?

@victoris93 was there anything you still wanted to change before I merge it?

@victoris93
Copy link
Contributor Author

@jeromedockes np pb, I think I got it. Just a sec, the last commit should address the comments, almost there

@jeromedockes
Copy link
Member

ok thanks! If you do get rid of _get_summary() you will need to replace _get_summary() with _summary everywhere in _reporting/tests/test_table_report.py

…max_association_columns atts as initially set.
@victoris93
Copy link
Contributor Author

@jeromedockes @Vincent-Maladiere should be good now I think

Copy link
Member

@jeromedockes jeromedockes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks a lot! LGTM with the suggested changes

Copy link
Member

@jeromedockes jeromedockes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great!! thanks so much @victoris93 this is really very useful 🚀 as the cramerv computation in particular can get very slow and take a lot of memory for a large number of columns

@jeromedockes jeromedockes enabled auto-merge (squash) April 14, 2025 14:01
@jeromedockes jeromedockes merged commit 66329b0 into skrub-data:main Apr 14, 2025
21 of 22 checks passed
@jeromedockes jeromedockes added the TableReport anything related to the TableReport label Apr 14, 2025
@victoris93
Copy link
Contributor Author

victoris93 commented Apr 14, 2025

oops I wanted to run the pre-commit checks before you merged but it was too late, sry about that. I hope it's ok.

@jeromedockes
Copy link
Member

oops I wanted to run the pre-commit checks before you merged but it was too late, sry about that. I hope it's ok.

there was just an extra blank line but I removed it before merging; the pre-commit checks are passing without complaints :) thanks again, and whenever you have time don't hesitate to ping us if you want help in picking the next thing to work on!

@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented Apr 14, 2025 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

TableReport anything related to the TableReport

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Avoid computing column associations when the number of columns is large

4 participants