-
Notifications
You must be signed in to change notification settings - Fork 10.9k
Fix a Python spinlock bug #13665
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix a Python spinlock bug #13665
Conversation
|
|
|
d6db378
to
9edf330
Compare
|
|
|
src/core/lib/iomgr/ev_poll_posix.cc
Outdated
@@ -1031,6 +1033,9 @@ static grpc_error* pollset_work(grpc_exec_ctx* exec_ctx, grpc_pollset* pollset, | |||
pfds[i].fd, (pfds[i].revents & POLLIN_CHECK) != 0, | |||
(pfds[i].revents & POLLOUT_CHECK) != 0, pfds[i].revents); | |||
} | |||
if (pfds[i].revents & POLLHUP) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment here that this is just a mitigation and add a reference to the PR.
9edf330
to
69559f9
Compare
|
|
|
Based on grpc/grpc#13665 and backported to the `v1.7.3` release in https://github.com/dhermes/grpc/tree/1.7.3-with-13665 The wheel was created via: ``` $ docker run \ > --rm \ > --tty \ > --interactive \ > --volume $(pwd):/var/wheels/ \ > quay.io/pypa/manylinux1_x86_64:latest \ > /bin/bash ```
Based on grpc/grpc#13665 and backported to the `v1.7.3` release in https://github.com/dhermes/grpc/tree/1.7.3-with-13665 The wheel was created via: ``` $ docker run \ > --rm \ > --tty \ > --interactive \ > --volume $(pwd):/var/wheels/ \ > quay.io/pypa/manylinux1_x86_64:latest \ > /bin/bash % cd tmp/ % git clone https://github.com/dhermes/grpc % cd grpc % git checkout 1.7.3-with-13665 % git submodule update --init % /opt/python/cp36-cp36m/bin/python3.6 -m pip install --upgrade pip wheel % /opt/python/cp36-cp36m/bin/python3.6 -m pip install --requirement requirements.txt % export REPO_ROOT=$(pwd) % export GRPC_PYTHON_BUILD_WITH_CYTHON=1 % /opt/python/cp36-cp36m/bin/python3.6 -m pip wheel . --wheel-dir $(pwd) % auditwheel repair grpcio-1.7.4.dev1-cp36-cp36m-linux_x86_64.whl --wheel-dir /var/wheels/ ```
Something is pretty messed up here: It looks like when I remove fd's then end with pollhup they never end up getting freed... |
@kpayson64 FWIW I have a (Python) case that reliably gets into the The spinlock is always triggered by this line, but the real culprit seems to be in the C core. In my reproducible case, there are ~54 threads that run I let it run for 80 minutes exactly, just to be sure the spinlock triggers (hence the reason there are ~54 such "channel spin" threads even though the thrashing starts by the 41st). UPDATE: I just finished an 80 minute run of the case and there was NO thrashing. Highest measured CPU usage was 3.2%. I still observed exactly one "channel spin" thread (the 42nd out of 54) kicking off a thread to check metadata. UPDATE 2: Here is my case https://github.com/dhermes/google-cloud-pubsub-performance/tree/f6defb0994604208cd9088e564417f956fff180e/no-messages-too. (I use the commit hash since I may update the data.) The failure occurs at about ( |
69559f9
to
2df509f
Compare
|
|
|
🎉 |
I'll try to get a patch release out by tomorrow. |
1.8.2 has been pushed |
This is due to a nasty spinlock bug [1] that has been partially fixed [2] in `1.8.2`. [1]: grpc/grpc#9688 [2]: grpc/grpc#13665
This is due to a nasty spinlock bug [1] that has been partially fixed [2] in `1.8.2`. [1]: grpc/grpc#9688 [2]: grpc/grpc#13665
This is causing MacOS-specific issues in the poller now. Running the resource_quota_server test under mac will fail almost immediately with this PR in. |
More specifically, the test results of this PR were already failing: |
@kpayson64 @nathanielmanistaatgoogle Now that this has been reverted (#13898), will there be a new PR to track progress? |
I've created #13918 to track the roll forward on master. I'm currently looking into the issue. |
Thanks @kpayson64! |
Hopefully finally fixes #9688
There has been an issue where
poll()
calls spinlock in the Python implementation. The strace looks as follows.I've addressed several Python fd refcounting issues that could have possibly been the cause, but users are still reporting the issue. This code
shouldbe uneeded, but given that users see this in production code but we don't have a reproduction case, I would like to solve this issue once and for all.