This PR fixes the cleanup procedures for the `DAGFileProcessor`s spawned by the `DagFileProcessorManager` instance. Thus, this PR closes #22191.
We came across this bug when a timeout triggered the forceful killing of these `DAGFileProcessor`s.
We have had a somewhat lengthy discussion on the issue (#22191), which we encourage anyone to read for more insight into the cause, discovery, and eventual fix. People on that thread were extremely helpful! 😸
We have not included tests because we're unsure of how to test behaviour such as this. If pointed in the right direction, we would be more than happy to add them. We can say, however, that we applied these changes to our own production Airflow instance and we haven't encountered the issue since.
On top of that, we would more than welcome any suggestions on the code: for instance, when cleaning up a dictionary we're iterating over, we decided to take note of the problematic `DAGFileProcessor`s and then remove them in a second `for` loop. Our programming background is much stronger in other languages, so we feel uncomfortable pushing Python 'to the limit' in terms of relying on its implementation to make design choices. If there's anything that can be done better, by all means say so.
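To make the two-pass idea concrete, here is a minimal, self-contained sketch of what we mean; the dictionary contents and the `timed_out` flag are placeholders for illustration, not the actual Airflow attributes:

```python
# Illustrative sketch of the two-pass cleanup described above; names and
# structures are placeholders, not Airflow's real ones.

# Pretend mapping of file path -> processor state, where a processor that
# has exceeded its timeout must be dropped from the dict.
processors = {
    "/dags/a.py": {"timed_out": False},
    "/dags/b.py": {"timed_out": True},
}

# First pass: collect the problematic entries without mutating the dict,
# since removing keys while iterating over it raises a RuntimeError.
to_remove = [path for path, proc in processors.items() if proc["timed_out"]]

# Second pass: remove them once iteration is finished.
for path in to_remove:
    processors.pop(path, None)

print(processors)  # {'/dags/a.py': {'timed_out': False}}
```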
Another fertile topic for discussion is how to `wait()` for the processes being killed with the `SIGKILL` signal. This was brought up by @dlesco on #22191, and we agree with him on adding an optional timeout to the operation to avoid blocking in very bizarre circumstances (which the current solution would do). However, we decided to contribute our initial approach and then iterate on solutions within the PR.
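For reference, a minimal sketch of the "kill, then wait with a timeout" idea, using a plain `multiprocessing.Process` as a stand-in for the real processor (which wraps one); the timeout value and names are illustrative, not what the PR currently does:

```python
# Sketch only: kill a child process and reap it without blocking forever.
import multiprocessing
import time


def busy() -> None:
    # Stand-in workload for a DAG file processor.
    while True:
        time.sleep(1)


if __name__ == "__main__":
    proc = multiprocessing.Process(target=busy)
    proc.start()

    # Forcefully kill the child, as the manager does on a processor timeout.
    proc.kill()  # sends SIGKILL on POSIX

    # Reap it, but never block indefinitely: join() with a timeout returns
    # even if the process somehow fails to die, leaving exitcode as None.
    proc.join(timeout=5)
    if proc.exitcode is None:
        print("process did not exit within the timeout")
    else:
        print(f"process reaped with exitcode {proc.exitcode}")
```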
Thanks a lot for letting me contribute to a project of Airflow's size and calibre: it's truly an honour.
References to processors weren't being cleaned up after
killing them in the event of a timeout. This led to
a crash caused by an unhandled exception when trying to
read from a closed end of a pipe.
Hope to make the merge as seamless as possible 😜
closes: #22191