FEAT - Pruning skrub-data folder to remove old data ops execution reports #1657

dierickxsimon · 2025-10-02T19:08:11Z

I can't find any issues related to this PR, but this problem came up during the pydata2025 sprints. It's a fix to ensure that the skrub-data folder does not get full of unused reports when creating a full_report.

This fix ensure that when creating a full_report with output_dir=None reports older than 7 days in the skrub_data folder are deleted.

dierickxsimon · 2025-10-03T09:12:25Z

I' m running in some issues with the ci/cd, but it's a bit unclear for me what exactly the problem is.

rcap107 · 2025-10-03T13:24:09Z

It looks like the CI issue is related to #1658, so now that that PR is merged it should be fixed

dierickxsimon · 2025-10-03T15:10:51Z

skrub/_data_ops/_utils.py

+    time_threshold = time.time() - 7 * 24 * 3600  # 7 days ago
+
+    folders = (
+        (f, os.path.getmtime(os.path.join(path, f)))


All tests are passing except the codecov (not enough test coverage). In the test cov report these lines (line 73-77) are not covered. Maybe because assignment of the variable is only done on line 73.

dierickxsimon · 2025-10-03T15:10:55Z

skrub/_data_ops/_utils.py

+        try:
+            if dir[1] < time_threshold and dir[0].startswith("full_data_op_report_"):
+                shutil.rmtree(os.path.join(path, dir[0]))
+        except Exception:


This line isn’t covered by the tests: I added the try-except block to make sure it fails silently if the file happens to still be open. I’m not sure if this is really needed (and if so, whether the exception block should be tested). Thanks in advance!

Unrelated comment: it's much much preferable to capture a specific Exception rather than the base class. So this code should capture the specific one that is trigger by the problem.

Also, I would always add a small comment on why we capture the exception.

I think we should catch the exception as a PermissionError, and the exception should be tested. However, this should be a Windows specific error, so the test should be skipped when the platform isn't Windows with something like

import sys import pytest @pytest.mark.skipif(sys.platform != 'win32', reason="Test is Windows-specific") def test_windows_specific_functionality(): # test code here

If you don't have a Windows machine to test this one, it will have to be tested on CI which is going to be annoying unfortunately.

I agree with the general rule but here I think catching a broad exception type makes sense: we are performing an accessory background task (cleanup of old files created by other programs), so we don't want to crash the program if it fails. it might fail for other reasons beside the one mentioned here: another thread may have removed it already between the listdir() and the rmtree() calls, we may not have permission if it was created by another user, etc. . In those cases failing the current report generation would not help, so I suggest we emit a warning (so that if they see the warning often the user might want to report it or manually delete the directory if it is large) and proceed with the main task without raising an error

to test on other platforms than windows (and even on windows actually), use unittest.mock and the pytest monkeypatch fixture to mock rmtree so it raises an exception (set side_effect = OSError() on the mock)

I'm fine with @jeromedockes point: we can not worry about being specific and catch a broad exception

Okay then i will just catch a broad exception and write a test for it, I willl work on it today or tomorrow!

rcap107 · 2025-10-07T08:34:35Z

skrub/_data_ops/_utils.py

+
+    folders = (
+        (f, os.path.getmtime(os.path.join(path, f)))
+        for f in os.listdir(path)
+        if os.path.isdir(os.path.join(path, f))
+    )
+    for dir in folders:
+        try:
+            if dir[1] < time_threshold and dir[0].startswith("full_data_op_report_"):


A couple of changes: I moved the name condition to the first if, and I unpacked dir into dir_path and dir_time. dir is a python keyword so it should not be used as a variable name.

Suggested change

folders = (

(f, os.path.getmtime(os.path.join(path, f)))

for f in os.listdir(path)

if os.path.isdir(os.path.join(path, f))

)

for dir in folders:

try:

if dir[1] < time_threshold and dir[0].startswith("full_data_op_report_"):

folders = (

(f, os.path.getmtime(os.path.join(path, f)))

for f in os.listdir(path)

if os.path.isdir(os.path.join(path, f))

and f.startswith("full_data_op_report_"))

)

for dir_path, dir_time in folders:

try:

if dir_time < time_threshold :

This is probably going to break the pre-commit checks

jeromedockes · 2025-10-13T06:57:34Z

skrub/_data_ops/_skrub_namespace.py

+        .. note::
+
+            reports starting with ``full_data_op_report_`` that are stored
+            in the skrub data folder are automatically deleted after 7 days.


maybe we can tweak the wording to make it clearer that each execution of the report checks for old files and removes them? so that wary users don't wonder if we are trying to schedule something or leave some daemon cleanup process laying around

sound good to me. I will change it!

jeromedockes · 2025-10-13T08:36:50Z

sound good to me. I will change it!

and maybe we could add something along the lines of "for reports that you want to keep, provide an output directory" so users know what to do if they don't want it removed? Thanks!

…a-report-folder

skrub/_data_ops/tests/test_utils.py

Co-authored-by: Riccardo Cappuzzo <[email protected]>

jeromedockes

I have only nitpicks, nothing strictly required -- thank you @dierickxsimon !

jeromedockes · 2025-11-04T10:43:48Z

skrub/_data_ops/_utils.py

+        return
+    time_threshold = time.time() - 7 * 24 * 3600  # 7 days ago
+
+    folders = (


nitpick: the rest of skrub uses the (newer) pathlib module for those kind of filesystem manipulation rather than os, so for consistency it would be great to use it here, especially since the path you get from _get_output_dir is already a pathlib.Path

for example you can get the mtime with stat, the Path object representing the directory has iterdir() and glob() methods, you can join with the / operator or joinpath, etc.

also maybe we could name folders directories instead? since we use the term 'directory' rather than 'folder' for other variables (same for the function name prune_folder)

finally why do you prefer 2 separate loops for finding the directories and then removing them?

totally agree with the variable naming — I’ll change it, and I’ll switch to using the pathlib module as well. As for the two separate loops: I initially used them for debugging purposes so I could easily print all the folders that were selected. But maybe its better to and even more readable to put it all in one loop. I will change it!

jeromedockes · 2025-11-04T10:49:16Z

skrub/_data_ops/_utils.py

+    )
+    for dir_path, dir_time in folders:
+        try:
+            if dir_time < time_threshold:


i guess the test could be outside the try block

jeromedockes · 2025-11-04T10:52:02Z

skrub/_data_ops/_utils.py

+    for dir_path, dir_time in folders:
+        try:
+            if dir_time < time_threshold:
+                shutil.rmtree(os.path.join(path, dir_path))


I wonder if we should have an even stricter check of the name (eg check with a regex there is a timestamp) just to be extra sure we don't erase something we shouldn't ... but I guess "full_data_op_report_" is already very specific 🤔

…a-report-folder

dierickxsimon · 2025-11-04T15:11:09Z

@jeromedockes I updated all code based on your comments. I also cleaned up the pytests to make them a bit more readable and to remove the code duplication.

rcap107

Looks great to me! I think we can merge this now

Thanks a lot @dierickxsimon, great job

jeromedockes · 2025-11-08T09:58:14Z

@jeromedockes I updated all code based on your comments. I also cleaned up the pytests to make them a bit more readable and to remove the code duplication.

awesome, thank you for this feature @dierickxsimon ! this is an important addition to make the reports convenient to use :)

dierickxsimon added 6 commits October 2, 2025 20:25

tests and implementation for the prune_folder func still no docs

f560b23

changed the implementation to work with dirs instead of files

4d08cc2

updated the docs

fa955f8

use pixi.lock from main

7cf9bc3

update changelogs

3a76e20

add chech if path exist on top

f612dc4

dierickxsimon force-pushed the prune-the-skrub-data-report-folder branch from 246a49f to f612dc4 Compare October 3, 2025 07:28

dierickxsimon marked this pull request as ready for review October 3, 2025 09:10

dierickxsimon commented Oct 3, 2025

View reviewed changes

rcap107 reviewed Oct 7, 2025

View reviewed changes

rcap107 added this to the 0.7.0 milestone Oct 7, 2025

jeromedockes reviewed Oct 13, 2025

View reviewed changes

rcap107 changed the title ~~Implementation for pruning full reports of the skrub-data folder~~ FEAT - Pruning skrub-data folder to remove old data ops execution reports Oct 17, 2025

dierickxsimon force-pushed the prune-the-skrub-data-report-folder branch from 9a61846 to f612dc4 Compare October 23, 2025 15:17

dierickxsimon added 2 commits October 23, 2025 20:34

Merge remote-tracking branch 'upstream/main' into prune-the-skrub-dat…

3e4a145

…a-report-folder

added tests for exception and refining the docs

9b14f78

dierickxsimon requested a review from rcap107 October 23, 2025 19:44

updating docs to be more specific

01be627

rcap107 reviewed Nov 4, 2025

View reviewed changes

skrub/_data_ops/tests/test_utils.py Outdated Show resolved Hide resolved

rcap107 reviewed Nov 4, 2025

View reviewed changes

skrub/_data_ops/tests/test_utils.py Outdated Show resolved Hide resolved

rcap107 reviewed Nov 4, 2025

View reviewed changes

skrub/_data_ops/tests/test_utils.py Outdated Show resolved Hide resolved

dierickxsimon and others added 4 commits November 4, 2025 10:47

Update skrub/_data_ops/tests/test_utils.py

7474b72

Co-authored-by: Riccardo Cappuzzo <[email protected]>

Update skrub/_data_ops/tests/test_utils.py

fe088af

Co-authored-by: Riccardo Cappuzzo <[email protected]>

Update skrub/_data_ops/tests/test_utils.py

4aeb179

Co-authored-by: Riccardo Cappuzzo <[email protected]>

pre-commits

a0987e6

jeromedockes approved these changes Nov 4, 2025

View reviewed changes

changed the tests and implemented the reviews

dadb13b

dierickxsimon added 2 commits November 4, 2025 15:22

Merge remote-tracking branch 'upstream/main' into prune-the-skrub-dat…

282128e

…a-report-folder

update function name to prune_directory

d135246

dierickxsimon requested a review from jeromedockes November 4, 2025 15:11

rcap107 approved these changes Nov 4, 2025

View reviewed changes

rcap107 merged commit f289677 into skrub-data:main Nov 4, 2025
29 checks passed

FEAT - Pruning skrub-data folder to remove old data ops execution reports #1657

FEAT - Pruning skrub-data folder to remove old data ops execution reports #1657

Uh oh!

Conversation

dierickxsimon commented Oct 2, 2025 • edited by rcap107 Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dierickxsimon commented Oct 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rcap107 commented Oct 3, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jeromedockes commented Oct 13, 2025 via email

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jeromedockes left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dierickxsimon commented Nov 4, 2025

Uh oh!

rcap107 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

jeromedockes commented Nov 8, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

dierickxsimon commented Oct 2, 2025 •

edited by rcap107

Loading

dierickxsimon commented Oct 3, 2025 •

edited

Loading