
Conversation

@potatosalad (Contributor) commented Oct 23, 2025

Potential fix for #10322

If I'm understanding correctly what happens when scheduler pollset migration gets triggered: once state->count++ > 10, the "hot fd" is migrated to the scheduler pollset.

However, state->count is shared across every process selecting on the fd, so if there are multiple processes all calling socket:accept at the same time, state->count++ > 10 can be triggered almost immediately.

For things like socket:recv, this generally works as expected, since there is typically a single process calling it in a loop.

However, for socket:accept it is common practice to have multiple processes all trying to accept inbound connections on the same listen socket.

If all of that's correct, then this is my assumption:

  • OTP 27: Direct message delivery from the I/O thread(?)
  • OTP 28: Process signals, sent from the normal scheduler thread(?)

The change here is to add a state->last_select_pid field and reset state->count to 0 when we detect that the calling process has changed between calls to enif_select.
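
Roughly, the idea looks like this (just an illustrative sketch, not the actual erl_check_io.c code; the struct, field names, pid type, and threshold below are placeholders):

/* Sketch only: placeholder names, not the real ERTS definitions. */
typedef long example_pid_t;              /* stands in for the real process id type */

typedef struct {
    int           count;                 /* consecutive selects, used to detect a "hot fd" */
    example_pid_t last_select_pid;       /* new field: who performed the previous select */
} example_fd_state;

#define EXAMPLE_HOT_FD_THRESHOLD 10      /* placeholder for the real limit */

/* Conceptually called on each enif_select for the fd. Returns nonzero
 * when the fd should be migrated to the scheduler pollset. */
static int example_note_select(example_fd_state *state, example_pid_t caller)
{
    if (state->last_select_pid != caller) {
        /* A different process is selecting now (e.g. another acceptor),
         * so restart the hot-fd counting instead of letting many
         * processes trip the threshold together. */
        state->count = 0;
        state->last_select_pid = caller;
    }
    return state->count++ > EXAMPLE_HOT_FD_THRESHOLD;
}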

github-actions bot commented Oct 23, 2025

CT Test Results

    3 files    135 suites   49m 22s ⏱️
1 655 tests 1 598 ✅ 57 💤 0 ❌
2 293 runs  2 217 ✅ 76 💤 0 ❌

Results for commit c89cbee.

♻️ This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run tests locally.

Artifacts

// Erlang/OTP Github Action Bot

@potatosalad marked this pull request as ready for review on October 23, 2025 20:45
@IngelaAndin added the team:VM label (Assigned to OTP team VM) on Oct 27, 2025
@sverker self-assigned this on Oct 27, 2025
@sverker (Contributor) commented Oct 29, 2025

This looks promising. I pushed two minor fixups.
Does this PR fix your regression?

@potatosalad (Contributor Author)

@sverker Yes, we tested the patches today and all the issues we were seeing appear to have been resolved.

@sverker force-pushed the potatosalad/10322-non-global-scheduler-pollset-count branch from 0b47de1 to c89cbee on October 30, 2025 15:45
@sverker changed the base branch from master to maint on October 30, 2025 15:46
@sverker (Contributor) commented Oct 30, 2025

I rebased and squashed it for inclusion into maint (OTP-28.2).

@sverker added the fix and testing labels (testing: currently being tested, tag is used by OTP internal CI) on Oct 30, 2025
@sverker self-requested a review on October 30, 2025 15:49
@potatosalad (Contributor Author)

@sverker Some additional information: we've noticed a latency regression at high CPU utilization (90%+) that seems to be caused by the schedulers slowing down due to the large number of pollset entries per scheduler.

I'll draw some ASCII art in a separate comment...

@potatosalad (Contributor Author) commented Oct 30, 2025

@sverker Alright, so before scheduler pollsets, my understanding is that this is roughly what was happening:

┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐
│ S0      │  │ S1      │  │ S2      │  │ S3      │  │ I/O     │
└─────────┘  └─────────┘  └─────────┘  └─────────┘  │ PS a..h │
                                                    └─────────┘

At high load, all of the pollset operations are still isolated to the I/O event loop and won't uniformly slow down the schedulers when there is a really large number of file descriptors in the pollsets.

With the new scheduler pollset feature, we have this:

┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐
│ S0      │  │ S1      │  │ S2      │  │ S3      │
│ PS a..b │  │ PS c..d │  │ PS e..f │  │ PS g..h │
└─────────┘  └─────────┘  └─────────┘  └─────────┘

First off: is this an accurate view of how this works?

We tested a version of OTP where ERTS_POLL_USE_SCHEDULER_POLLING was defined as (0) to disable the feature entirely, and the latency regression at high CPU went away.
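
For reference, the local change we tested was essentially just something along these lines (approximate; the exact file and location of the macro definition omitted here):

/* Approximate local patch used only for this test: force the feature off. */
#undef  ERTS_POLL_USE_SCHEDULER_POLLING
#define ERTS_POLL_USE_SCHEDULER_POLLING (0)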

If all of this is correct, would it make sense to have a socket option that we could use to selectively prevent individual sockets from being included in the scheduler pollset optimization? We think this might help us in scenarios with a very large number of sockets.

For a small number of sockets, the scheduler pollset optimization should provide more of a benefit, I would think.

Maybe something like:

socket:setopt(Socket, {otp, scheduler_polling}, false).

What do you think?

@sverker (Contributor) commented Nov 3, 2025

There is one scheduler pollset, namely sched_pollset in erl_check_io.c. Only one scheduler at a time waits for events from the pollset, and it does not do so until all outstanding triggered events have been handled. There are some internal docs written about this at https://www.erlang.org/doc/apps/erts/checkio.html.

You can disable the scheduler pollset with erl +IOs false. If you don't get any performance gain from the scheduler pollset, then that seems to be the easy solution.

If an option would be necessary, I would prefer something less implementation specific than {otp, scheduler_polling} if possible.
