Skip to content

Conversation

@PilliSiddharth
Copy link
Contributor

Add add_tostr flag to Cleaner to control ToStr transformation

Summary

This PR introduces an add_tostr parameter to the Cleaner transformer, allowing users to disable the automatic ToStr() conversion applied to non-numeric, non-categorical, and non-datetime columns. This prevents list/object-like columns from being converted to strings unless explicitly requested.

What’s changed

  • Add add_tostr argument to Cleaner (default: True).
  • Update _get_preprocessors to include ToStr() only when add_tostr=True.
  • Update Cleaner docstring (parameter documented; notes updated).
  • Add test_add_tostr to validate both add_tostr=True and add_tostr=False.

Motivation

Cleaner should not silently change complex dtypes (e.g., list, object) to strings. This flag gives users control and mirrors the existing pattern used for numeric_dtype="float32".

Test coverage

  • New unit test skrub/tests/test_cleaner.py::test_add_tostr added.
  • Local test run: relevant tests pass.

rcap107
rcap107 previously approved these changes Dec 5, 2025
Copy link
Member

@rcap107 rcap107 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @PilliSiddharth. Overall, it looks good, but there are a few things that should be changed.

We should also rethink the name of the add_tostr parameter, but at the moment I don't have a very good alternative in mind.

CHANGES.rst Outdated
``.skb.set_name()``, ``.skb.set_description()``, ``.skb.mark_as_X()`` or
``.skb.mark_as_y()`` used to raise an error, this has been fixed in :pr:`1782`
by :user:`Jérôme Dockès <jeromedockes>`.
- Added ``add_tostr`` parameter to :class:`Cleaner` to prevent unintended
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it, even though I suggested it I don't like add_tostr as a parameter name for the Cleaner 🤔 we might have to change this before merging.

A user doesn't know what tostr is (since it's a private transformer), so add_tostr does not give enough information.

Comment on lines 256 to 257
categorical, or datetime. This step is controlled by the ``add_tostr``
parameter. When ``add_tostr=True`` (default), string conversion is applied.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as mentioned above, the name add_tostr will probably be changed

@rcap107 rcap107 dismissed their stale review December 5, 2025 08:58

approved by mistake

@PilliSiddharth
Copy link
Contributor Author

Hey @rcap107, I moved the test_add_tostr test into tests/test_table_vectorizer.py, set the default to False, fixed the ordering so CleanCategories runs before ToStr, and fixed the docstring text. About the name, add_tostr works technically, but I agree it doesn’t feel very descriptive. Something like convert_objects might be clearer, but happy to go with whatever you prefer. Let me know if this looks right or if you want me to adjust anything else before I update the PR again.

@rcap107
Copy link
Member

rcap107 commented Dec 8, 2025

Hi @PilliSiddharth, I think you forgot to push your latest commit.

In the end, I'd like the parameter to be called cast_to_str. Also, the test_add_tostr should be modified slightly:

  • Change the name to cast_to_str
  • Test that the default behavior of the Cleaner does not change the type, and that setting cast_to_str=True does change the type as expected. In the current version, we're not testing the default behavior.

@PilliSiddharth PilliSiddharth force-pushed the fix-cleaner-add-tostr-flag branch from 0f6eb11 to 28b0403 Compare December 8, 2025 14:18
@PilliSiddharth
Copy link
Contributor Author

Hey @rcap107, I pushed the updates: renamed the parameter to cast_to_str, updated the test and docstring, and fixed the example output.

@rcap107
Copy link
Member

rcap107 commented Dec 9, 2025

Hi @PilliSiddharth, thanks a lot for the PR. I made a small change in the order of parameters, swapping cast_to_str and n_jobs. Aside from that, it looks good to me and I'll merge it once the CI is done.

An aside, if you are having issues with doctests failing, it's often easier to use the ELLIPSIS (...) to skip parts of the test. In some cases the same test can fail with specific package/python versions because the string representations change from one version to another, and using ... is the simplest way to get the tests to pass.

@rcap107
Copy link
Member

rcap107 commented Dec 9, 2025

Whoops, the order change broke the CI, I'll fix it

@rcap107 rcap107 merged commit b84a994 into skrub-data:main Dec 9, 2025
29 checks passed
@PilliSiddharth
Copy link
Contributor Author

Sounds good, glad I could be a part! 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants