Skip to content

Conversation

pcolladosoto
Copy link
Contributor

This PR fixes the cleanup procedures of the DAGFileProcessors being spawned by the DagFileProcessorManager instance. Thus, this PR closes #22191.

We came across this bug when encountering a timeout triggering the forceful killing of these DAGFileProcessors.

We have had somewhat lengthy discussion on an issue (#22191) which we encourage anyone to read for some more insight into the cause, discovery and posterior fix. People on that thread were extremely helpful! 😸

We have not included tests because we're unsure of how to test a behaviour such as this one. If pointed in the right direction we would be more than happy to add them. We can say however that we applied these changes on our own production Airflow instance and we haven't encountered the issue ever since.

On top of that we would be more than welcome if you made any suggestions to the code: for instance, when cleaning up a dictionary we're iterating over we decided to take note of the problematic DAGFileProcessors to then remove them on a second for loop. Our background in programming is much stronger on some other languages and so we feel really uncomfortable pushing Python 'to the limit' in terms of relying on its implementation to make design choices. If there's anything that can be done better by all means say so.

Another fertile topic for discussion is how to wait() for the processes being killed through the SIGKILL signal. This has been brought up by @dlesco on #22191 and we agree with him on adding an optional timeout to the operation to avoid blocking in very bizarre circumstances (which the current solution would do). However, we decided to contribute our initial approach and then iterate on solutions within the PR.

Thanks a lot for letting me contribute to a tool with the expertise and size of Airflow: it's truly an honour.

Hope to make the merge as seamless as possible 😜

closes: #22191

References to processors weren't being cleaned up after
killing them in the event of a timeout. This lead to
a crash caused by an unhandled exception when trying to
read from a closed end of a pipe.
@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Apr 1, 2022
@boring-cyborg