-
Notifications
You must be signed in to change notification settings - Fork 185
Adding DropUninformative #1313
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding DropUninformative #1313
Conversation
|
addressing #1286 |
|
Can't make comments/suggestions in unchanged files, so just leaving this here:
and Let me know if it's somewhat useful. |
skrub/_drop_uninformative.py
Outdated
| all values must be null for the column to be dropped). | ||
| - The column includes only one unique value (the column is constant). Missing | ||
| values are considered a separate value. | ||
| - The number of unique values in the column is equal to the length of the column. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this category of non-informativeness should be handled with care because free-form text is informative but might have unique values.
We might need additional heuristics to separate non-informative IDs from free-form text. I'm thinking about mean text length, variance of text length, or looking for a structure –e.g. uuid4 are common types of ids that all look like 9268c073-1cb3-4817-a40d-d9296b0d5a8c. You could also argue that numerical range IDs (from 0 to N) could also bring information if the data is not exchangeable.
There is also the somewhat nicher use case of mini-batching using partial fit, but we can ignore it for now.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes and there is no restriction on the types transformed by this column so for example a column of real numbers would also easily have all unique values.
this parameter is set to False by default, though... but I'm not sure I would ever want to set it to true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes I totally agree the "column of uniques" should be kept off by default
the scenario I was considering was more along the lines of "the user has figured out the text columns and is handling them separately, anything else is an ID and can be dropped"
maybe I could add a check that it's only for non-numeric columns, what do you think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think relying on the number of unique values to detect IDs is a bit tricky for any type: unique ints could be timestamps, unique floats could be a continuous quantity, unique strings could be text ... so it's kind of a topic on its own. So my hunch would be to maybe leave that aside to avoid blocking this PR and discuss it separately
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
however if other people prefer to keep it I'm ok with it as long as it's off by default
Co-authored-by: Vincent M <[email protected]>
…nto drop-if-constant
|
I can't replicate the test failures even using pixi |
is it only on windows? |
I've only seen it fail on windows, I'm not sure if it's because other runs are skipped by the failure |
|
It does look like it's only windows 🙈 |
jeromedockes
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
really cool @rcap107 ! apart from adding to the reference index I only have nitpicks
skrub/_drop_uninformative.py
Outdated
| if self.null_fraction_threshold == 1.0: | ||
| return sbd.is_all_null(column) | ||
| # No nulls found, or no threshold | ||
| if self.null_count == 0 or self.null_fraction_threshold is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we must store null_count in an attribute, we should name it _null_count or null_count_ to fit the scikit-learn convention. Alternatively, as it may not be a super useful attribute, we could avoid storing it and pass it as an argument
edit: I see the null count is used everywhere so maybe it makes sense to store it in a private attribute
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also we can skip computing the null count if that criterion is disabled (if easy, not important)
skrub/_drop_uninformative.py
Outdated
|
|
||
| def _drop_if_constant(self, column): | ||
| if self.constant_column: | ||
| if (sbd.n_unique(column) == 1) and (sum(sbd.is_null(column)) == 0): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is that sum the same thing as the null count?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I replaced it with not sbd.has_nulls
skrub/_table_vectorizer.py
Outdated
| this selection is disabled: no columns are dropped based on the number | ||
| of null values they contain. | ||
| drop_constant : bool, default=True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the docstring says True but the init says False
I think it is simpler (and makes the repr shorter) if the default in the tablevectorizer is the same as the default in the dropuninformative columns
jeromedockes
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool, thanks so much @rcap107 !
|
oops I approved th echanges too quick :) |
Vincent-Maladiere
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few nitpicks and it's good to go :)
skrub/_drop_uninformative.py
Outdated
| If True, drop the column if it contains only one unique value. Missing values | ||
| count as one additional distinct value. | ||
| drop_if_id : bool, default=False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about drop_if_unique? As we have no way to identify IDs, drop_if_id could be a little misleading? Also, it could be nice to add an explainer saying that free-form text is likely to be dropped too, so to be used with caution
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the note in two places, maybe it's redundant
Co-authored-by: Vincent M <[email protected]>
Co-authored-by: Vincent M <[email protected]>
Vincent-Maladiere
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you @rcap107! :)
closes #1317
Adding the
DropUninformativetransformer, which uses different tests to check whether if a column is considered to be "uninformative".For the moment, I have implemented
Missing values are considered as additional missing values as per @GaelVaroquaux's comment.
The idea adding new heuristics for "uninformative" columns here directly rather than having a bunch of different very small objects.
Still missing: docstring and examples, adding a deprecation warning for the drop if too many nulls transformer.