[Bug]: Python pipelines running on Python 3.11 may experience periodic stuckness. #33966

@tvalentyn

Summary

Python pipelines running on Python 3.11 may experience periodic stuckness. Beam Dataflow users might see this stuckness accompanied by errors like:

Unable to retrieve status info from SDK harness sdk_harness_id within allowed time

SDK worker appears to be permanently unresponsive. Aborting the SDK.

The issue may be more pronounced in pipelines that frequently trigger garbage collection.

Mitigation: Use Python 3.12, Python 3.10, or switch to Beam 2.64.0 once it is released.
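Until an upgrade is possible, a pipeline launcher could at least flag the affected interpreter at startup. The following is a minimal sketch, not part of Beam; the function name `affected_by_cpython_106883` is hypothetical, and the check assumes that every 3.11.x release is affected, as the issue suggests.

```python
import sys


def affected_by_cpython_106883(version_info=sys.version_info):
    """Return True for Python 3.11.x, the series this issue reports as affected.

    Hypothetical helper for illustration; it only compares major and minor.
    """
    return tuple(version_info[:2]) == (3, 11)


if affected_by_cpython_106883():
    # Print to stderr so the warning shows up in worker or launcher logs.
    print(
        "Warning: Python 3.11 detected; see python/cpython#106883. "
        "Consider Python 3.10 or 3.12.",
        file=sys.stderr,
    )
```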

Details

The Beam SDK has a mechanism to provide a status report to the runner that captures the ongoing work. The status report includes stack traces of running threads.

To collect such stacktraces, we inspect the content of running Python frames via sys._current_frames().
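For readers unfamiliar with this technique, here is a minimal sketch of a thread dump built on sys._current_frames(). It is an illustration only, not Beam's actual worker_status.py code; the output format is an assumption.

```python
import sys
import threading
import traceback


def thread_dump():
    """Return a text dump of the current stack of every Python thread."""
    # Map thread ids to names so the dump is easier to read.
    names = {t.ident: t.name for t in threading.enumerate()}
    # sys._current_frames() maps each thread id to its topmost frame.
    frames = sys._current_frames()
    sections = []
    for thread_id, frame in frames.items():
        name = names.get(thread_id, "unknown")
        stack = "".join(traceback.format_stack(frame))
        sections.append(f"--- Thread {thread_id} ({name}) ---\n{stack}")
    return "\n".join(sections)


if __name__ == "__main__":
    print(thread_dump())
```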

It appears that on Python 3.11, such an invocation can cause a deadlock if garbage collection triggers during the call to sys._current_frames(): python/cpython#106883. The issue is not reproducible on Python 3.10 or Python 3.12.
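Since the deadlock requires garbage collection to trigger during the call, one possible workaround is to pause automatic collection around it. The sketch below is an assumption, not something this issue confirms as the fix; the helper name `current_frames_without_gc` is hypothetical, and it does not prevent an explicit gc.collect() from another thread.

```python
import gc
import sys


def current_frames_without_gc():
    """Call sys._current_frames() with automatic GC paused on Python 3.11.

    Hypothetical workaround sketch; on other versions it calls through directly.
    """
    needs_guard = sys.version_info[:2] == (3, 11) and gc.isenabled()
    if not needs_guard:
        return sys._current_frames()
    gc.disable()  # process-wide: no automatic collection in any thread
    try:
        return sys._current_frames()
    finally:
        gc.enable()  # restore the previous (enabled) state
```

The guard only re-enables collection if it was enabled beforehand, so code that deliberately runs with GC disabled is left alone.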

On Python 3.11, a Beam job might get stuck. A stuck job running on Dataflow might have errors like:

Unable to retrieve status info from SDK harness sdk_harness_id within allowed time

SDK worker appears to be permanently unresponsive. Aborting the SDK.

As noted in https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact , such errors can happen when a thread in the Python process permanently holds the GIL.

Inspecting the Dataflow workers with pystack, for example via an automated script like https://gist.github.com/tvalentyn/82fcee6b93253740d2ae50bd425916a5 , reveals a thread that holds the GIL with a stack trace in frames = sys._current_frames(), and a thread doing garbage collection; sometimes these are the same thread:

Traceback for thread 107 (python) [Has the GIL,Garbage collecting] (most recent call last):
    (C) File "Python/thread_pthread.h", line 241, in pythread_wrapper (/usr/local/lib/libpython3.11.so.1.0)
    (C) File "./Modules/_threadmodule.c", line 1124, in thread_run (/usr/local/lib/libpython3.11.so.1.0)
    (Python) File "/usr/local/lib/python3.11/threading.py", line 1002, in _bootstrap
        self._bootstrap_inner()
    (Python) File "/usr/local/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
        self.run()
    (Python) File "/usr/local/lib/python3.11/threading.py", line 982, in run
        self._target(*self._args, **self._kwargs)
    (Python) File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/worker_status.py", line 175, in <lambda>
        target=lambda: self._serve(), name='fn_api_status_handler')
    (Python) File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/worker_status.py", line 200, in _serve
        id=request.id, status_info=self.generate_status_response()))
    (Python) File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/worker_status.py", line 219, in generate_status_response
        all_status_sections.append(thread_dump())
    (Python) File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/worker_status.py", line 60, in thread_dump
        frames = sys._current_frames()  # pylint: disable=protected-access
    (C) File "Modules/gcmodule.c", line 2290, in gc_alloc (inlined) (/usr/local/lib/libpython3.11.so.1.0)
    (C) File "Modules/gcmodule.c", line 1400, in gc_collect_with_callback (/usr/local/lib/libpython3.11.so.1.0)
    (C) File "Modules/gcmodule.c", line 1287, in gc_collect_main (/usr/local/lib/libpython3.11.so.1.0)
    (C) File "Modules/gcmodule.c", line 1013, in delete_garbage (inlined) (/usr/local/lib/libpython3.11.so.1.0)
    (C) File "Objects/typeobject.c", line 1279, in subtype_clear (/usr/local/lib/libpython3.11.so.1.0)
    (C) File "Objects/typeobject.c", line 1463, in subtype_dealloc (/usr/local/lib/libpython3.11.so.1.0)
    (C) File "./Modules/_threadmodule.c", line 904, in local_dealloc (/usr/local/lib/libpython3.11.so.1.0)
    (C) File "./Modules/_threadmodule.c", line 872, in local_clear (/usr/local/lib/libpython3.11.so.1.0)
    (C) File "Python/thread_pthread.h", line 497, in PyThread_acquire_lock_timed (/usr/local/lib/libpython3.11.so.1.0)

This failure mode matches the description of python/cpython#106883, which is known to affect CPython 3.11, has been fixed in CPython 3.12 and has not been reproduced in CPython 3.10.

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
