Description
Summary
Python pipelines running on Python 3.11 may periodically get stuck. Beam Dataflow users might see this stuckness accompanied by errors like:
Unable to retrieve status info from SDK harness sdk_harness_id within allowed time
SDK worker appears to be permanently unresponsive. Aborting the SDK.
The issue may be more pronounced in pipelines that frequently trigger garbage collection.
Mitigation: Use Python 3.12, Python 3.10, or switch to Beam 2.64.0 once it is released.
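To check quickly whether a given worker environment falls in the affected range, a minimal sketch follows. The helper name is_affected_python is hypothetical (it is not part of Beam), and it only encodes the statement above that Python 3.11 is affected while 3.10 and 3.12 are not; it does not check the Beam version.

```python
import sys


def is_affected_python(version_info=sys.version_info):
    """Return True for Python 3.11.x, the only series reported as affected.

    Hypothetical helper for illustration; not part of the Beam SDK.
    """
    # Compare only (major, minor); the patch level is ignored here.
    return tuple(version_info[:2]) == (3, 11)


if __name__ == "__main__":
    if is_affected_python():
        print("Python 3.11 detected: pipeline may be exposed to this issue.")
    else:
        print("Python version is outside the affected 3.11 series.")
```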
Details
The Beam SDK has a mechanism that provides a status report to the runner, capturing the work in progress. The status report includes stack traces of running threads.
To collect such stack traces, we inspect the content of running Python frames via sys._current_frames().
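For readers unfamiliar with this mechanism, here is a simplified sketch of how a thread dump can be built on top of sys._current_frames(). It is an illustration only, not the actual code in worker_status.py; the output format and the name lookup via threading.enumerate() are assumptions.

```python
import sys
import threading
import traceback


def thread_dump():
    """Return a text dump with the current stack of every Python thread."""
    # Map thread ids to names so the dump is easier to read.
    names = {t.ident: t.name for t in threading.enumerate()}
    sections = []
    # sys._current_frames() maps each thread id to its topmost frame.
    frames = sys._current_frames()  # pylint: disable=protected-access
    for thread_id, frame in frames.items():
        header = "--- Thread %s (%s) ---" % (
            thread_id, names.get(thread_id, "unknown"))
        stack = "".join(traceback.format_stack(frame))
        sections.append(header + "\n" + stack)
    return "\n".join(sections)


if __name__ == "__main__":
    print(thread_dump())
```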
It appears that on Python 3.11, this call can deadlock if garbage collection is triggered during the call to sys._current_frames(): python/cpython#106883. The issue is not reproducible on Python 3.10 or Python 3.12.
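Since the deadlock seems to require a collection starting inside the call, one conceivable way to narrow the window is to pause automatic garbage collection while the frames are being read. The sketch below is only an assumption about a possible workaround: it is not confirmed here as the fix shipped in Beam, nor as sufficient in every case, and the names gc_paused and current_frames_with_gc_paused are hypothetical.

```python
import gc
import sys
from contextlib import contextmanager


@contextmanager
def gc_paused():
    """Temporarily disable automatic garbage collection.

    The previous state is restored on exit, so code that had already
    disabled gc is not affected.
    """
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()


def current_frames_with_gc_paused():
    """Call sys._current_frames() with automatic gc paused (assumed workaround)."""
    with gc_paused():
        return sys._current_frames()  # pylint: disable=protected-access
```

Note that gc.disable() only stops automatic collections; an explicit gc.collect() elsewhere is not prevented by this sketch.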
On Python 3.11, a Beam job might get stuck. A stuck job running on Dataflow might have errors like:
Unable to retrieve status info from SDK harness sdk_harness_id within allowed time
SDK worker appears to be permanently unresponsive. Aborting the SDK.
As noted in https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact , such errors can happen when a thread in the Python process permanently holds the GIL.
Inspecting the Dataflow workers with pystack, for example via an automated script like https://gist.github.com/tvalentyn/82fcee6b93253740d2ae50bd425916a5 , reveals a thread holding the GIL with a stack trace in frames = sys._current_frames() and a thread performing garbage collection; sometimes these are the same thread:
Traceback for thread 107 (python) [Has the GIL,Garbage collecting] (most recent call last):
(C) File "Python/thread_pthread.h", line 241, in pythread_wrapper (/usr/local/lib/libpython3.11.so.1.0)
(C) File "./Modules/_threadmodule.c", line 1124, in thread_run (/usr/local/lib/libpython3.11.so.1.0)
(Python) File "/usr/local/lib/python3.11/threading.py", line 1002, in _bootstrap
self._bootstrap_inner()
(Python) File "/usr/local/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
self.run()
(Python) File "/usr/local/lib/python3.11/threading.py", line 982, in run
self._target(*self._args, **self._kwargs)
(Python) File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/worker_status.py", line 175, in <lambda>
target=lambda: self._serve(), name='fn_api_status_handler')
(Python) File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/worker_status.py", line 200, in _serve
id=request.id, status_info=self.generate_status_response()))
(Python) File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/worker_status.py", line 219, in generate_status_response
all_status_sections.append(thread_dump())
(Python) File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/worker_status.py", line 60, in thread_dump
frames = sys._current_frames() # pylint: disable=protected-access
(C) File "Modules/gcmodule.c", line 2290, in gc_alloc (inlined) (/usr/local/lib/libpython3.11.so.1.0)
(C) File "Modules/gcmodule.c", line 1400, in gc_collect_with_callback (/usr/local/lib/libpython3.11.so.1.0)
(C) File "Modules/gcmodule.c", line 1287, in gc_collect_main (/usr/local/lib/libpython3.11.so.1.0)
(C) File "Modules/gcmodule.c", line 1013, in delete_garbage (inlined) (/usr/local/lib/libpython3.11.so.1.0)
(C) File "Objects/typeobject.c", line 1279, in subtype_clear (/usr/local/lib/libpython3.11.so.1.0)
(C) File "Objects/typeobject.c", line 1463, in subtype_dealloc (/usr/local/lib/libpython3.11.so.1.0)
(C) File "./Modules/_threadmodule.c", line 904, in local_dealloc (/usr/local/lib/libpython3.11.so.1.0)
(C) File "./Modules/_threadmodule.c", line 872, in local_clear (/usr/local/lib/libpython3.11.so.1.0)
(C) File "Python/thread_pthread.h", line 497, in PyThread_acquire_lock_timed (/usr/local/lib/libpython3.11.so.1.0)
This failure mode matches the description of python/cpython#106883, which is known to affect CPython 3.11, has been fixed in CPython 3.12 and has not been reproduced in CPython 3.10.
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components
- Component: Python SDK
- Component: Java SDK
- Component: Go SDK
- Component: Typescript SDK
- Component: IO connector
- Component: Beam YAML
- Component: Beam examples
- Component: Beam playground
- Component: Beam katas
- Component: Website
- Component: Infrastructure
- Component: Spark Runner
- Component: Flink Runner
- Component: Samza Runner
- Component: Twister2 Runner
- Component: Hazelcast Jet Runner
- Component: Google Cloud Dataflow Runner