Skip to content

Conversation

@dierickxsimon
Copy link
Contributor

@dierickxsimon dierickxsimon commented Oct 2, 2025

I can't find any issues related to this PR, but this problem came up during the pydata2025 sprints. It's a fix to ensure that the skrub-data folder does not get full of unused reports when creating a full_report.

This fix ensure that when creating a full_report with output_dir=None reports older than 7 days in the skrub_data folder are deleted.

@dierickxsimon dierickxsimon force-pushed the prune-the-skrub-data-report-folder branch from 246a49f to f612dc4 Compare October 3, 2025 07:28
@dierickxsimon dierickxsimon marked this pull request as ready for review October 3, 2025 09:10
@dierickxsimon
Copy link
Contributor Author

dierickxsimon commented Oct 3, 2025

I' m running in some issues with the ci/cd, but it's a bit unclear for me what exactly the problem is.

@rcap107
Copy link
Member

rcap107 commented Oct 3, 2025

It looks like the CI issue is related to #1658, so now that that PR is merged it should be fixed

time_threshold = time.time() - 7 * 24 * 3600 # 7 days ago

folders = (
(f, os.path.getmtime(os.path.join(path, f)))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All tests are passing except the codecov (not enough test coverage). In the test cov report these lines (line 73-77) are not covered. Maybe because assignment of the variable is only done on line 73.

try:
if dir[1] < time_threshold and dir[0].startswith("full_data_op_report_"):
shutil.rmtree(os.path.join(path, dir[0]))
except Exception:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line isn’t covered by the tests: I added the try-except block to make sure it fails silently if the file happens to still be open. I’m not sure if this is really needed (and if so, whether the exception block should be tested). Thanks in advance!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated comment: it's much much preferable to capture a specific Exception rather than the base class. So this code should capture the specific one that is trigger by the problem.

Also, I would always add a small comment on why we capture the exception.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should catch the exception as a PermissionError, and the exception should be tested. However, this should be a Windows specific error, so the test should be skipped when the platform isn't Windows with something like

import sys
import pytest

@pytest.mark.skipif(sys.platform != 'win32', reason="Test is Windows-specific")
def test_windows_specific_functionality():
    # test code here

If you don't have a Windows machine to test this one, it will have to be tested on CI which is going to be annoying unfortunately.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the general rule but here I think catching a broad exception type makes sense: we are performing an accessory background task (cleanup of old files created by other programs), so we don't want to crash the program if it fails. it might fail for other reasons beside the one mentioned here: another thread may have removed it already between the listdir() and the rmtree() calls, we may not have permission if it was created by another user, etc. . In those cases failing the current report generation would not help, so I suggest we emit a warning (so that if they see the warning often the user might want to report it or manually delete the directory if it is large) and proceed with the main task without raising an error

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to test on other platforms than windows (and even on windows actually), use unittest.mock and the pytest monkeypatch fixture to mock rmtree so it raises an exception (set side_effect = OSError() on the mock)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with @jeromedockes point: we can not worry about being specific and catch a broad exception

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay then i will just catch a broad exception and write a test for it, I willl work on it today or tomorrow!

Comment on lines 72 to 80

folders = (
(f, os.path.getmtime(os.path.join(path, f)))
for f in os.listdir(path)
if os.path.isdir(os.path.join(path, f))
)
for dir in folders:
try:
if dir[1] < time_threshold and dir[0].startswith("full_data_op_report_"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of changes: I moved the name condition to the first if, and I unpacked dir into dir_path and dir_time. dir is a python keyword so it should not be used as a variable name.

Suggested change
folders = (
(f, os.path.getmtime(os.path.join(path, f)))
for f in os.listdir(path)
if os.path.isdir(os.path.join(path, f))
)
for dir in folders:
try:
if dir[1] < time_threshold and dir[0].startswith("full_data_op_report_"):
folders = (
(f, os.path.getmtime(os.path.join(path, f)))
for f in os.listdir(path)
if os.path.isdir(os.path.join(path, f))
and f.startswith("full_data_op_report_"))
)
for dir_path, dir_time in folders:
try:
if dir_time < time_threshold :

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably going to break the pre-commit checks

@rcap107 rcap107 added this to the 0.7.0 milestone Oct 7, 2025
.. note::
reports starting with ``full_data_op_report_`` that are stored
in the skrub data folder are automatically deleted after 7 days.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can tweak the wording to make it clearer that each execution of the report checks for old files and removes them? so that wary users don't wonder if we are trying to schedule something or leave some daemon cleanup process laying around

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sound good to me. I will change it!

@jeromedockes
Copy link
Member

jeromedockes commented Oct 13, 2025 via email

@rcap107 rcap107 changed the title Implementation for pruning full reports of the skrub-data folder FEAT - Pruning skrub-data folder to remove old data ops execution reports Oct 17, 2025
@dierickxsimon dierickxsimon force-pushed the prune-the-skrub-data-report-folder branch from 9a61846 to f612dc4 Compare October 23, 2025 15:17
@dierickxsimon dierickxsimon requested a review from rcap107 October 23, 2025 19:44
Copy link
Member

@jeromedockes jeromedockes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have only nitpicks, nothing strictly required -- thank you @dierickxsimon !

return
time_threshold = time.time() - 7 * 24 * 3600 # 7 days ago

folders = (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: the rest of skrub uses the (newer) pathlib module for those kind of filesystem manipulation rather than os, so for consistency it would be great to use it here, especially since the path you get from _get_output_dir is already a pathlib.Path

for example you can get the mtime with stat, the Path object representing the directory has iterdir() and glob() methods, you can join with the / operator or joinpath, etc.

also maybe we could name folders directories instead? since we use the term 'directory' rather than 'folder' for other variables (same for the function name prune_folder)

finally why do you prefer 2 separate loops for finding the directories and then removing them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

totally agree with the variable naming — I’ll change it, and I’ll switch to using the pathlib module as well. As for the two separate loops: I initially used them for debugging purposes so I could easily print all the folders that were selected. But maybe its better to and even more readable to put it all in one loop. I will change it!

)
for dir_path, dir_time in folders:
try:
if dir_time < time_threshold:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess the test could be outside the try block

for dir_path, dir_time in folders:
try:
if dir_time < time_threshold:
shutil.rmtree(os.path.join(path, dir_path))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should have an even stricter check of the name (eg check with a regex there is a timestamp) just to be extra sure we don't erase something we shouldn't ... but I guess "full_data_op_report_" is already very specific 🤔

@dierickxsimon
Copy link
Contributor Author

@jeromedockes I updated all code based on your comments. I also cleaned up the pytests to make them a bit more readable and to remove the code duplication.

Copy link
Member

@rcap107 rcap107 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me! I think we can merge this now

Thanks a lot @dierickxsimon, great job

@rcap107 rcap107 merged commit f289677 into skrub-data:main Nov 4, 2025
29 checks passed
@jeromedockes
Copy link
Member

@jeromedockes I updated all code based on your comments. I also cleaned up the pytests to make them a bit more readable and to remove the code duplication.

awesome, thank you for this feature @dierickxsimon ! this is an important addition to make the reports convenient to use :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants