Skip to content

Conversation

kpayson64
Copy link
Contributor

Hopefully finally fixes #9688

There has been an issue where poll() calls spinlock in the Python implementation. The strace looks as follows.

poll([{fd=3, events=POLLIN}, {fd=8, events=POLLIN}, {fd=5, events=POLLIN|POLLOUT}, {fd=11, events=0}], 4, 188) = 1 ([{fd=11, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {905930, 24155064}) = 0
poll([{fd=3, events=POLLIN}, {fd=8, events=POLLIN}, {fd=5, events=POLLIN|POLLOUT}, {fd=11, events=0}], 4, 188) = 1 ([{fd=11, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {905930, 24206581}) = 0
poll([{fd=3, events=POLLIN}, {fd=8, events=POLLIN}, {fd=5, events=POLLIN|POLLOUT}, {fd=11, events=0}], 4, 188) = 1 ([{fd=11, revents=POLLHUP}])
....

I've addressed several Python fd refcounting issues that could have possibly been the cause, but users are still reporting the issue. This code should be uneeded, but given that users see this in production code but we don't have a reproduction case, I would like to solve this issue once and for all.

@grpc-testing
Copy link

****************************************************************

libgrpc.so

     VM SIZE                                           FILE SIZE
 ++++++++++++++ GROWING                             ++++++++++++++
  +0.3%     +32 src/core/lib/iomgr/ev_poll_posix.cc     +32  +0.3%
      +0.6%     +17 pollset_work                            +17  +0.6%
      +2.4%      +8 [Unmapped]                               +8  +2.4%
      +3.0%      +7 fd_create                                +7  +3.0%
  [ = ]       0 [None]                                   +8  +0.0%

  +0.0%     +32 TOTAL                                   +40  +0.0%


****************************************************************

libgrpc++.so

     VM SIZE        FILE SIZE
 ++++++++++++++  ++++++++++++++

  [ = ]       0        0  [ = ]



@grpc-testing
Copy link

[trickle] No significant performance differences

@grpc-testing
Copy link

[microbenchmarks] No significant performance differences

@grpc-testing
Copy link

****************************************************************

libgrpc.so

     VM SIZE                                           FILE SIZE
 ++++++++++++++ GROWING                             ++++++++++++++
  +0.6%     +80 src/core/lib/iomgr/ev_poll_posix.cc     +80  +0.6%
      +1.2%     +33 pollset_work                            +33  +1.2%
      +9.0%     +32 fd_orphan                               +32  +9.0%
      +4.6%     +11 fd_create                               +11  +4.6%
      +1.2%      +4 [Unmapped]                               +4  +1.2%

 -+-+-+-+-+-+-+ MIXED                               +-+-+-+-+-+-+-
  -0.0%     -16 [None]                                 +136  +0.0%

  +0.0%     +64 TOTAL                                  +216  +0.0%


****************************************************************

libgrpc++.so

     VM SIZE        FILE SIZE
 ++++++++++++++  ++++++++++++++

  [ = ]       0        0  [ = ]



@grpc-testing
Copy link

[trickle] No significant performance differences

@grpc-testing
Copy link

[microbenchmarks] No significant performance differences

@@ -1031,6 +1033,9 @@ static grpc_error* pollset_work(grpc_exec_ctx* exec_ctx, grpc_pollset* pollset,
pfds[i].fd, (pfds[i].revents & POLLIN_CHECK) != 0,
(pfds[i].revents & POLLOUT_CHECK) != 0, pfds[i].revents);
}
if (pfds[i].revents & POLLHUP) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a comment here that this is just a mitigation and add a reference to the PR.

@grpc-testing
Copy link

****************************************************************

libgrpc.so

     VM SIZE                                           FILE SIZE
 ++++++++++++++ GROWING                             ++++++++++++++
  +0.8%     +96 src/core/lib/iomgr/ev_poll_posix.cc     +96  +0.8%
      +1.7%     +49 pollset_work                            +49  +1.7%
      +9.0%     +32 fd_orphan                               +32  +9.0%
      +4.6%     +11 fd_create                               +11  +4.6%
      +1.2%      +4 [Unmapped]                               +4  +1.2%

 -------------- SHRINKING                           --------------
  [ = ]       0 [None]                                  -72  -0.0%

  +0.0%     +96 TOTAL                                   +24  +0.0%


****************************************************************

libgrpc++.so

     VM SIZE        FILE SIZE
 ++++++++++++++  ++++++++++++++

  [ = ]       0        0  [ = ]



@grpc-testing
Copy link

[trickle] No significant performance differences

@grpc-testing
Copy link

[microbenchmarks] No significant performance differences

dhermes added a commit to dhermes/google-cloud-pubsub-performance that referenced this pull request Dec 14, 2017
Based on grpc/grpc#13665 and backported to the
`v1.7.3` release in https://github.com/dhermes/grpc/tree/1.7.3-with-13665

The wheel was created via:

```
$ docker run \
>   --rm \
>   --tty \
>   --interactive \
>   --volume $(pwd):/var/wheels/ \
>   quay.io/pypa/manylinux1_x86_64:latest \
>   /bin/bash
```
dhermes added a commit to dhermes/google-cloud-pubsub-performance that referenced this pull request Dec 14, 2017
Based on grpc/grpc#13665 and backported to the
`v1.7.3` release in https://github.com/dhermes/grpc/tree/1.7.3-with-13665

The wheel was created via:

```
$ docker run \
>   --rm \
>   --tty \
>   --interactive \
>   --volume $(pwd):/var/wheels/ \
>   quay.io/pypa/manylinux1_x86_64:latest \
>   /bin/bash
% cd tmp/
% git clone https://github.com/dhermes/grpc
% cd grpc
% git checkout 1.7.3-with-13665
% git submodule update --init
% /opt/python/cp36-cp36m/bin/python3.6 -m pip install --upgrade pip wheel
% /opt/python/cp36-cp36m/bin/python3.6 -m pip install --requirement requirements.txt
% export REPO_ROOT=$(pwd)
% export GRPC_PYTHON_BUILD_WITH_CYTHON=1
% /opt/python/cp36-cp36m/bin/python3.6 -m pip wheel . --wheel-dir $(pwd)
% auditwheel repair grpcio-1.7.4.dev1-cp36-cp36m-linux_x86_64.whl --wheel-dir /var/wheels/
```
@kpayson64
Copy link
Contributor Author

kpayson64 commented Dec 14, 2017

Something is pretty messed up here: It looks like when I remove fd's then end with pollhup they never end up getting freed...

@dhermes
Copy link

dhermes commented Dec 14, 2017

@kpayson64 FWIW I have a (Python) case that reliably gets into the POLLHUP spinlock 100% of the time, but it typically takes 65 minutes to trigger (one time it did after 44 minutes). I have backported this PR on the v1.7.3 branch and am currently running my case with the modified grpcio.

The spinlock is always triggered by this line, but the real culprit seems to be in the C core. In my reproducible case, there are ~54 threads that run target=channel_spin and the CPU thrashing doesn't happen until the 41st or 42nd such thread. In every case, exactly one of these threads kicks off a thread to check metadata, and that exactly corresponds to the point where the thrashing starts.

I let it run for 80 minutes exactly, just to be sure the spinlock triggers (hence the reason there are ~54 such "channel spin" threads even though the thrashing starts by the 41st).

UPDATE: I just finished an 80 minute run of the case and there was NO thrashing. Highest measured CPU usage was 3.2%. I still observed exactly one "channel spin" thread (the 42nd out of 54) kicking off a thread to check metadata.

UPDATE 2: Here is my case https://github.com/dhermes/google-cloud-pubsub-performance/tree/f6defb0994604208cd9088e564417f956fff180e/no-messages-too. (I use the commit hash since I may update the data.) The failure occurs at about (21:31:33 local / 5:31:33 UTC / 3674391ms relative to start).

@grpc-testing
Copy link

****************************************************************

libgrpc.so

     VM SIZE                                           FILE SIZE
 ++++++++++++++ GROWING                             ++++++++++++++
  [ = ]       0 [None]                                 +280  +0.0%
  +0.8%     +96 src/core/lib/iomgr/ev_poll_posix.cc     +96  +0.8%
      +1.7%     +48 pollset_work                            +48  +1.7%
      +9.0%     +32 fd_orphan                               +32  +9.0%
      +4.6%     +11 fd_create                               +11  +4.6%
      +1.5%      +5 [Unmapped]                               +5  +1.5%

  +0.0%     +96 TOTAL                                  +376  +0.0%


****************************************************************

libgrpc++.so

     VM SIZE        FILE SIZE
 ++++++++++++++  ++++++++++++++

  [ = ]       0        0  [ = ]



@grpc-testing
Copy link

[trickle] No significant performance differences

@grpc-testing
Copy link

[microbenchmarks] No significant performance differences

@kpayson64 kpayson64 merged commit 958649e into grpc:v1.8.x Dec 20, 2017
@dhermes
Copy link

dhermes commented Dec 20, 2017

🎉

@kpayson64
Copy link
Contributor Author

I'll try to get a patch release out by tomorrow.

@kpayson64
Copy link
Contributor Author

1.8.2 has been pushed

dhermes added a commit to dhermes/google-cloud-python that referenced this pull request Dec 20, 2017
This is due to a nasty spinlock bug [1] that has been partially
fixed [2] in `1.8.2`.

[1]: grpc/grpc#9688
[2]: grpc/grpc#13665
dhermes added a commit to googleapis/google-cloud-python that referenced this pull request Dec 20, 2017
This is due to a nasty spinlock bug [1] that has been partially
fixed [2] in `1.8.2`.

[1]: grpc/grpc#9688
[2]: grpc/grpc#13665
@nicolasnoble
Copy link
Contributor

This is causing MacOS-specific issues in the poller now. Running the resource_quota_server test under mac will fail almost immediately with this PR in.

@dhermes
Copy link

dhermes commented Jan 3, 2018

@kpayson64 @nathanielmanistaatgoogle Now that this has been reverted (#13898), will there be a new PR to track progress?

@kpayson64
Copy link
Contributor Author

I've created #13918 to track the roll forward on master. I'm currently looking into the issue.

@dhermes
Copy link

dhermes commented Jan 5, 2018

Thanks @kpayson64!

@lock lock bot locked as resolved and limited conversation to collaborators Jan 21, 2019
@lock lock bot unassigned sreecha Jan 21, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants