-
Notifications
You must be signed in to change notification settings - Fork 185
FEAT - Pruning skrub-data folder to remove old data ops execution reports #1657
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FEAT - Pruning skrub-data folder to remove old data ops execution reports #1657
Conversation
246a49f to
f612dc4
Compare
|
I' m running in some issues with the ci/cd, but it's a bit unclear for me what exactly the problem is. |
|
It looks like the CI issue is related to #1658, so now that that PR is merged it should be fixed |
skrub/_data_ops/_utils.py
Outdated
| time_threshold = time.time() - 7 * 24 * 3600 # 7 days ago | ||
|
|
||
| folders = ( | ||
| (f, os.path.getmtime(os.path.join(path, f))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All tests are passing except the codecov (not enough test coverage). In the test cov report these lines (line 73-77) are not covered. Maybe because assignment of the variable is only done on line 73.
skrub/_data_ops/_utils.py
Outdated
| try: | ||
| if dir[1] < time_threshold and dir[0].startswith("full_data_op_report_"): | ||
| shutil.rmtree(os.path.join(path, dir[0])) | ||
| except Exception: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line isn’t covered by the tests: I added the try-except block to make sure it fails silently if the file happens to still be open. I’m not sure if this is really needed (and if so, whether the exception block should be tested). Thanks in advance!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated comment: it's much much preferable to capture a specific Exception rather than the base class. So this code should capture the specific one that is trigger by the problem.
Also, I would always add a small comment on why we capture the exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should catch the exception as a PermissionError, and the exception should be tested. However, this should be a Windows specific error, so the test should be skipped when the platform isn't Windows with something like
import sys
import pytest
@pytest.mark.skipif(sys.platform != 'win32', reason="Test is Windows-specific")
def test_windows_specific_functionality():
# test code hereIf you don't have a Windows machine to test this one, it will have to be tested on CI which is going to be annoying unfortunately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the general rule but here I think catching a broad exception type makes sense: we are performing an accessory background task (cleanup of old files created by other programs), so we don't want to crash the program if it fails. it might fail for other reasons beside the one mentioned here: another thread may have removed it already between the listdir() and the rmtree() calls, we may not have permission if it was created by another user, etc. . In those cases failing the current report generation would not help, so I suggest we emit a warning (so that if they see the warning often the user might want to report it or manually delete the directory if it is large) and proceed with the main task without raising an error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to test on other platforms than windows (and even on windows actually), use unittest.mock and the pytest monkeypatch fixture to mock rmtree so it raises an exception (set side_effect = OSError() on the mock)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with @jeromedockes point: we can not worry about being specific and catch a broad exception
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay then i will just catch a broad exception and write a test for it, I willl work on it today or tomorrow!
skrub/_data_ops/_utils.py
Outdated
|
|
||
| folders = ( | ||
| (f, os.path.getmtime(os.path.join(path, f))) | ||
| for f in os.listdir(path) | ||
| if os.path.isdir(os.path.join(path, f)) | ||
| ) | ||
| for dir in folders: | ||
| try: | ||
| if dir[1] < time_threshold and dir[0].startswith("full_data_op_report_"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of changes: I moved the name condition to the first if, and I unpacked dir into dir_path and dir_time. dir is a python keyword so it should not be used as a variable name.
| folders = ( | |
| (f, os.path.getmtime(os.path.join(path, f))) | |
| for f in os.listdir(path) | |
| if os.path.isdir(os.path.join(path, f)) | |
| ) | |
| for dir in folders: | |
| try: | |
| if dir[1] < time_threshold and dir[0].startswith("full_data_op_report_"): | |
| folders = ( | |
| (f, os.path.getmtime(os.path.join(path, f))) | |
| for f in os.listdir(path) | |
| if os.path.isdir(os.path.join(path, f)) | |
| and f.startswith("full_data_op_report_")) | |
| ) | |
| for dir_path, dir_time in folders: | |
| try: | |
| if dir_time < time_threshold : |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably going to break the pre-commit checks
skrub/_data_ops/_skrub_namespace.py
Outdated
| .. note:: | ||
| reports starting with ``full_data_op_report_`` that are stored | ||
| in the skrub data folder are automatically deleted after 7 days. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can tweak the wording to make it clearer that each execution of the report checks for old files and removes them? so that wary users don't wonder if we are trying to schedule something or leave some daemon cleanup process laying around
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sound good to me. I will change it!
|
sound good to me. I will change it!
and maybe we could add something along the lines of "for reports that you want to keep, provide an output directory" so users know what to do if they don't want it removed? Thanks!
|
9a61846 to
f612dc4
Compare
Co-authored-by: Riccardo Cappuzzo <[email protected]>
Co-authored-by: Riccardo Cappuzzo <[email protected]>
Co-authored-by: Riccardo Cappuzzo <[email protected]>
jeromedockes
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have only nitpicks, nothing strictly required -- thank you @dierickxsimon !
skrub/_data_ops/_utils.py
Outdated
| return | ||
| time_threshold = time.time() - 7 * 24 * 3600 # 7 days ago | ||
|
|
||
| folders = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: the rest of skrub uses the (newer) pathlib module for those kind of filesystem manipulation rather than os, so for consistency it would be great to use it here, especially since the path you get from _get_output_dir is already a pathlib.Path
for example you can get the mtime with stat, the Path object representing the directory has iterdir() and glob() methods, you can join with the / operator or joinpath, etc.
also maybe we could name folders directories instead? since we use the term 'directory' rather than 'folder' for other variables (same for the function name prune_folder)
finally why do you prefer 2 separate loops for finding the directories and then removing them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
totally agree with the variable naming — I’ll change it, and I’ll switch to using the pathlib module as well. As for the two separate loops: I initially used them for debugging purposes so I could easily print all the folders that were selected. But maybe its better to and even more readable to put it all in one loop. I will change it!
skrub/_data_ops/_utils.py
Outdated
| ) | ||
| for dir_path, dir_time in folders: | ||
| try: | ||
| if dir_time < time_threshold: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i guess the test could be outside the try block
skrub/_data_ops/_utils.py
Outdated
| for dir_path, dir_time in folders: | ||
| try: | ||
| if dir_time < time_threshold: | ||
| shutil.rmtree(os.path.join(path, dir_path)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should have an even stricter check of the name (eg check with a regex there is a timestamp) just to be extra sure we don't erase something we shouldn't ... but I guess "full_data_op_report_" is already very specific 🤔
|
@jeromedockes I updated all code based on your comments. I also cleaned up the pytests to make them a bit more readable and to remove the code duplication. |
rcap107
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great to me! I think we can merge this now
Thanks a lot @dierickxsimon, great job
awesome, thank you for this feature @dierickxsimon ! this is an important addition to make the reports convenient to use :) |
I can't find any issues related to this PR, but this problem came up during the pydata2025 sprints. It's a fix to ensure that the skrub-data folder does not get full of unused reports when creating a
full_report.This fix ensure that when creating a
full_reportwithoutput_dir=Nonereports older than 7 days in the skrub_data folder are deleted.