Conversation

@rcap107
Member

@rcap107 rcap107 commented Apr 11, 2025

closes #1317

Adding the DropUninformative transformer, which uses different tests to check whether a column is considered to be "uninformative".

For the moment, I have implemented

  • Drop if too many nulls (I moved the logic from the transformer that is already there)
  • Drop if constant
  • Drop if all values are distinct

Missing values are considered as an additional distinct value, as per @GaelVaroquaux's comment.

The idea is to add new heuristics for "uninformative" columns here directly, rather than having many very small objects.

Still missing: docstrings and examples, and adding a deprecation warning for the drop-if-too-many-nulls transformer.
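The three heuristics above can be sketched as a single check. This is an illustrative pandas version of the intended behavior, not the actual skrub implementation; the function name and the threshold handling are my own assumptions:

```python
import pandas as pd

def is_uninformative(col: pd.Series, null_threshold: float = 1.0) -> bool:
    """Return True if the column looks uninformative (illustrative sketch)."""
    n = len(col)
    null_count = int(col.isna().sum())
    # 1. Too many nulls: the null fraction reaches the threshold
    #    (with the default of 1.0, all values must be null).
    if null_threshold is not None and null_count / n >= null_threshold:
        return True
    # 2. Constant column: one unique value and no nulls
    #    (a null would count as an extra distinct value).
    if col.nunique(dropna=True) == 1 and null_count == 0:
        return True
    # 3. All values distinct: likely an ID-like column.
    if col.nunique(dropna=True) == n:
        return True
    return False
```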

@rcap107
Member Author

rcap107 commented Apr 11, 2025

addressing #1286

@victoris93
Contributor

Can't make comments/suggestions in unchanged files, so just leaving this here:

  1. In DropIfTooManyNulls in drop_if_too_many_nulls.py:

     ```python
     def __init__(self, threshold=1.0):
         self.threshold = threshold
         warnings.warn(
             (
                 "DropIfTooManyNulls will be deprecated in the next release. "
                 "Equivalent functionality is available in DropUninformative."
             ),
             category=DeprecationWarning,
         )
     ```

  2. In test_drop_if_too_many_nulls.py:

     ```python
     def test_drop_if_too_many_nulls_warning():
         with pytest.warns(
             DeprecationWarning,
             match=(
                 "DropIfTooManyNulls will be deprecated in the next release. "
                 "Equivalent functionality is available in DropUninformative."
             ),
         ):
             dn = DropIfTooManyNulls()
     ```

and `@pytest.mark.filterwarnings("ignore::DeprecationWarning")` for all the other tests.

Let me know if it's somewhat useful.
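The suggested warning can also be exercised outside pytest. Here is a minimal, self-contained sketch using a hypothetical `OldTransformer` as a stand-in for the deprecated class (the class and message are illustrative, not skrub's):

```python
import warnings

class OldTransformer:
    """Hypothetical stand-in for a transformer slated for deprecation."""

    def __init__(self):
        warnings.warn(
            "OldTransformer will be deprecated in the next release.",
            category=DeprecationWarning,
        )

# Capture the warning on instantiation, as pytest.warns would in a test.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    OldTransformer()
```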

> all values must be null for the column to be dropped).
> - The column includes only one unique value (the column is constant). Missing
>   values are considered a separate value.
> - The number of unique values in the column is equal to the length of the column.
Member

I guess this category of non-informativeness should be handled with care, because free-form text is informative but might have all-unique values.

We might need additional heuristics to separate non-informative IDs from free-form text. I'm thinking about mean text length, variance of text length, or looking for structure, e.g. UUID4 values such as 9268c073-1cb3-4817-a40d-d9296b0d5a8c are a common shape of ID. You could also argue that numerical range IDs (from 0 to N) could bring information if the data is not exchangeable.
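One of the heuristics mentioned above, spotting UUID4-shaped strings, could look roughly like this. It is only a sketch: the regex and the 90% match threshold are my own assumptions, not anything implemented in this PR:

```python
import re

# Pattern for lowercase UUID4-style identifiers such as
# 9268c073-1cb3-4817-a40d-d9296b0d5a8c (hypothetical heuristic).
UUID4_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$"
)

def looks_like_uuid_column(values, min_fraction=0.9):
    """Flag a column whose string values are mostly UUID4-shaped."""
    strings = [v for v in values if isinstance(v, str)]
    if not strings:
        return False
    matches = sum(bool(UUID4_RE.match(s)) for s in strings)
    return matches / len(strings) >= min_fraction
```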

There is also the somewhat more niche use case of mini-batching using partial_fit, but we can ignore it for now.

WDYT?

Member

yes, and there is no restriction on the types handled by this transformer, so for example a column of real numbers would also easily have all unique values.

this parameter is set to False by default, though... but I'm not sure I would ever want to set it to True

Member Author

yes I totally agree the "column of uniques" should be kept off by default

the scenario I was considering was more along the lines of "the user has figured out the text columns and is handling them separately, anything else is an ID and can be dropped"

maybe I could add a check so that it only applies to non-numeric columns, what do you think?

Member

I think relying on the number of unique values to detect IDs is a bit tricky for any type: unique ints could be timestamps, unique floats could be a continuous quantity, unique strings could be text ... so it's kind of a topic on its own. So my hunch would be to maybe leave that aside to avoid blocking this PR and discuss it separately

Member

however if other people prefer to keep it I'm ok with it as long as it's off by default

@rcap107 rcap107 marked this pull request as ready for review April 14, 2025 13:46
@rcap107
Member Author

rcap107 commented Apr 14, 2025

I can't replicate the test failures even using pixi

@jeromedockes
Member

> I can't replicate the test failures even using pixi

is it only on windows?

@rcap107
Member Author

rcap107 commented Apr 14, 2025

> > I can't replicate the test failures even using pixi
>
> is it only on windows?

I've only seen it fail on windows, I'm not sure if it's because other runs are skipped by the failure

@rcap107
Member Author

rcap107 commented Apr 14, 2025

It does look like it's only windows 🙈

Member

@jeromedockes jeromedockes left a comment

really cool @rcap107 ! apart from adding to the reference index I only have nitpicks

```python
if self.null_fraction_threshold == 1.0:
    return sbd.is_all_null(column)
# No nulls found, or no threshold
if self.null_count == 0 or self.null_fraction_threshold is None:
```
Member

if we must store null_count in an attribute, we should name it `_null_count` or `null_count_` to fit the scikit-learn convention. Alternatively, as it may not be a super useful attribute, we could avoid storing it and pass it as an argument instead

edit: I see the null count is used everywhere, so maybe it makes sense to store it in a private attribute
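The edge cases in the snippet under review (a None threshold disabling the check, and a threshold of exactly 1.0 requiring the column to be entirely null) could be sketched as follows. This is an illustrative pandas version of the assumed semantics, not skrub's actual code, and whether the final comparison is strict is my own guess:

```python
import pandas as pd

def drop_if_too_many_nulls(column: pd.Series, null_fraction_threshold):
    """Return True if the column should be dropped (illustrative sketch)."""
    null_count = int(column.isna().sum())
    # A threshold of exactly 1.0 means: drop only if *all* values are null.
    if null_fraction_threshold == 1.0:
        return null_count == len(column)
    # No nulls found, or criterion disabled: keep the column.
    if null_count == 0 or null_fraction_threshold is None:
        return False
    # Otherwise, drop when the null fraction exceeds the threshold
    # (strict comparison is an assumption).
    return null_count / len(column) > null_fraction_threshold
```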

Member

also we can skip computing the null count if that criterion is disabled (if easy, not important)


```python
def _drop_if_constant(self, column):
    if self.constant_column:
        if (sbd.n_unique(column) == 1) and (sum(sbd.is_null(column)) == 0):
```
Member

is that sum the same thing as the null count?

Member Author

yes, I replaced it with `not sbd.has_nulls`

> this selection is disabled: no columns are dropped based on the number
> of null values they contain.
> drop_constant : bool, default=True
Member

the docstring says True but the init says False

I think it is simpler (and makes the repr shorter) if the default in the TableVectorizer is the same as the default in DropUninformative

jeromedockes
jeromedockes previously approved these changes Apr 23, 2025
Member

@jeromedockes jeromedockes left a comment
cool, thanks so much @rcap107 !

@jeromedockes
Member

oops, I approved the changes too quickly :)

@glemaitre glemaitre self-requested a review April 25, 2025 15:41
Member

@Vincent-Maladiere Vincent-Maladiere left a comment
A few nitpicks and it's good to go :)

> If True, drop the column if it contains only one unique value. Missing values
> count as one additional distinct value.
> drop_if_id : bool, default=False
Member

@Vincent-Maladiere Vincent-Maladiere Apr 28, 2025
What about drop_if_unique? As we have no way to identify IDs, drop_if_id could be a little misleading. Also, it could be nice to add an explainer saying that free-form text is likely to be dropped too, so the option should be used with caution.

Member Author

I added the note in two places, maybe it's redundant

Member

@Vincent-Maladiere Vincent-Maladiere left a comment

LGTM, thank you @rcap107! :)

@rcap107 rcap107 merged commit b7c10ed into skrub-data:main Apr 28, 2025
26 checks passed