Conversation

@rcap107
Member

@rcap107 rcap107 commented Apr 11, 2025

closes #1317

Adding the DropUninformative transformer, which uses different tests to check whether a column is considered to be "uninformative".

For the moment, I have implemented

  • Drop if too many nulls (I moved the logic from the transformer that is already there)
  • Drop if constant
  • Drop if all values are distinct

Missing values are considered as an additional distinct value, as per @GaelVaroquaux's comment.

The idea is to add new heuristics for "uninformative" columns here directly, rather than having many very small objects.

Still missing: docstrings and examples, and adding a deprecation warning for the drop-if-too-many-nulls transformer.
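The three heuristics above can be sketched as a single check. This is an illustrative pandas version of the intended behavior, not the actual skrub implementation; the function name and the threshold handling are my own assumptions:

```python
import pandas as pd

def is_uninformative(col: pd.Series, null_threshold: float = 1.0) -> bool:
    """Return True if the column looks uninformative (illustrative sketch)."""
    n = len(col)
    null_count = int(col.isna().sum())
    # 1. Too many nulls: the null fraction reaches the threshold
    #    (with the default of 1.0, all values must be null).
    if null_threshold is not None and null_count / n >= null_threshold:
        return True
    # 2. Constant column: one unique value and no nulls
    #    (a null would count as an extra distinct value).
    if col.nunique(dropna=True) == 1 and null_count == 0:
        return True
    # 3. All values distinct: likely an ID-like column.
    if col.nunique(dropna=True) == n:
        return True
    return False
```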

@rcap107
Member Author

rcap107 commented Apr 11, 2025

addressing #1286

@victoris93
Contributor

Can't make comments/suggestions in unchanged files, so just leaving this here:

  1. In DropIfTooManyNulls in drop_if_too_many_nulls.py:

     ```python
     def __init__(self, threshold=1.0):
         self.threshold = threshold
         warnings.warn(
             (
                 "DropIfTooManyNulls will be deprecated in the next release. "
                 "Equivalent functionality is available in DropUninformative."
             ),
             category=DeprecationWarning,
         )
     ```

  2. In test_drop_if_too_many_nulls.py:

     ```python
     def test_drop_if_too_many_nulls_warning():
         with pytest.warns(
             DeprecationWarning,
             match=(
                 "DropIfTooManyNulls will be deprecated in the next release. "
                 "Equivalent functionality is available in DropUninformative."
             ),
         ):
             dn = DropIfTooManyNulls()
     ```

and `@pytest.mark.filterwarnings("ignore::DeprecationWarning")` for all the other tests.

Let me know if it's somewhat useful.
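The suggested warning can also be exercised outside pytest. Here is a minimal, self-contained sketch using a hypothetical `OldTransformer` as a stand-in for the deprecated class (the class and message are illustrative, not skrub's):

```python
import warnings

class OldTransformer:
    """Hypothetical stand-in for a transformer slated for deprecation."""

    def __init__(self):
        warnings.warn(
            "OldTransformer will be deprecated in the next release.",
            category=DeprecationWarning,
        )

# Capture the warning on instantiation, as pytest.warns would in a test.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    OldTransformer()
```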

> all values must be null for the column to be dropped).
> - The column includes only one unique value (the column is constant). Missing
>   values are considered a separate value.
> - The number of unique values in the column is equal to the length of the column.
Member

I guess this category of non-informativeness should be handled with care, because free-form text is informative but might have all-unique values.

We might need additional heuristics to separate non-informative IDs from free-form text. I'm thinking about mean text length, variance of text length, or looking for structure, e.g. UUID4 values such as 9268c073-1cb3-4817-a40d-d9296b0d5a8c are a common shape of ID. You could also argue that numerical range IDs (from 0 to N) could bring information if the data is not exchangeable.
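One of the heuristics mentioned above, spotting UUID4-shaped strings, could look roughly like this. It is only a sketch: the regex and the 90% match threshold are my own assumptions, not anything implemented in this PR:

```python
import re

# Pattern for lowercase UUID4-style identifiers such as
# 9268c073-1cb3-4817-a40d-d9296b0d5a8c (hypothetical heuristic).
UUID4_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$"
)

def looks_like_uuid_column(values, min_fraction=0.9):
    """Flag a column whose string values are mostly UUID4-shaped."""
    strings = [v for v in values if isinstance(v, str)]
    if not strings:
        return False
    matches = sum(bool(UUID4_RE.match(s)) for s in strings)
    return matches / len(strings) >= min_fraction
```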

There is also the somewhat more niche use case of mini-batching using partial_fit, but we can ignore it for now.

WDYT?

Member

yes, and there is no restriction on the types handled by this transformer, so for example a column of real numbers would also easily have all unique values.

this parameter is set to False by default, though... but I'm not sure I would ever want to set it to True

Member Author

yes I totally agree the "column of uniques" should be kept off by default

the scenario I was considering was more along the lines of "the user has figured out the text columns and is handling them separately, anything else is an ID and can be dropped"

maybe I could add a check so that it only applies to non-numeric columns, what do you think?

Member

I think relying on the number of unique values to detect IDs is a bit tricky for any type: unique ints could be timestamps, unique floats could be a continuous quantity, unique strings could be text ... so it's kind of a topic on its own. So my hunch would be to maybe leave that aside to avoid blocking this PR and discuss it separately

Member

however if other people prefer to keep it I'm ok with it as long as it's off by default

@rcap107 rcap107 marked this pull request as ready for review April 14, 2025 13:46
@rcap107
Member Author

rcap107 commented Apr 14, 2025

I can't replicate the test failures even using pixi

@jeromedockes
Member

> I can't replicate the test failures even using pixi

is it only on windows?

@rcap107
Member Author

rcap107 commented Apr 14, 2025

> > I can't replicate the test failures even using pixi
>
> is it only on windows?

I've only seen it fail on windows, I'm not sure if it's because other runs are skipped by the failure

@rcap107
Member Author

rcap107 commented Apr 14, 2025

It does look like it's only windows 🙈

Member

@jeromedockes jeromedockes left a comment

really cool @rcap107 ! apart from adding to the reference index I only have nitpicks

```python
if self.null_fraction_threshold == 1.0:
    return sbd.is_all_null(column)
# No nulls found, or no threshold
if self.null_count == 0 or self.null_fraction_threshold is None:
```
Member

if we must store null_count in an attribute, we should name it `_null_count` or `null_count_` to fit the scikit-learn convention. Alternatively, as it may not be a super useful attribute, we could avoid storing it and pass it as an argument instead

edit: I see the null count is used everywhere, so maybe it makes sense to store it in a private attribute
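The edge cases in the snippet under review (a None threshold disabling the check, and a threshold of exactly 1.0 requiring the column to be entirely null) could be sketched as follows. This is an illustrative pandas version of the assumed semantics, not skrub's actual code, and whether the final comparison is strict is my own guess:

```python
import pandas as pd

def drop_if_too_many_nulls(column: pd.Series, null_fraction_threshold):
    """Return True if the column should be dropped (illustrative sketch)."""
    null_count = int(column.isna().sum())
    # A threshold of exactly 1.0 means: drop only if *all* values are null.
    if null_fraction_threshold == 1.0:
        return null_count == len(column)
    # No nulls found, or criterion disabled: keep the column.
    if null_count == 0 or null_fraction_threshold is None:
        return False
    # Otherwise, drop when the null fraction exceeds the threshold
    # (strict comparison is an assumption).
    return null_count / len(column) > null_fraction_threshold
```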

Member

also we can skip computing the null count if that criterion is disabled (if easy, not important)


```python
def _drop_if_constant(self, column):
    if self.constant_column:
        if (sbd.n_unique(column) == 1) and (sum(sbd.is_null(column)) == 0):
```
Member

is that sum the same thing as the null count?

Member Author

yes, I replaced it with `not sbd.has_nulls`

> this selection is disabled: no columns are dropped based on the number
> of null values they contain.
> drop_constant : bool, default=True
Member

the docstring says True but the init says False

I think it is simpler (and makes the repr shorter) if the default in the TableVectorizer is the same as the default in DropUninformative

jeromedockes
jeromedockes previously approved these changes Apr 23, 2025
Member

@jeromedockes jeromedockes left a comment
cool, thanks so much @rcap107 !

@jeromedockes
Member

oops, I approved the changes too quickly :)

@glemaitre glemaitre self-requested a review April 25, 2025 15:41
Member

@Vincent-Maladiere Vincent-Maladiere left a comment
A few nitpicks and it's good to go :)

> If True, drop the column if it contains only one unique value. Missing values
> count as one additional distinct value.
> drop_if_id : bool, default=False
Member

@Vincent-Maladiere Vincent-Maladiere Apr 28, 2025
What about drop_if_unique? As we have no way to identify IDs, drop_if_id could be a little misleading. Also, it could be nice to add an explainer saying that free-form text is likely to be dropped too, so the option should be used with caution.

Member Author

I added the note in two places, maybe it's redundant

Member

@Vincent-Maladiere Vincent-Maladiere left a comment

LGTM, thank you @rcap107! :)

@rcap107 rcap107 merged commit b7c10ed into skrub-data:main Apr 28, 2025
26 checks passed