Remove go and use chan to keep CDP msg order #524
Conversation
I once tried using buffered channels across the handlers while working on this problem, which didn't help back then. Another thing that worried me was making the interaction between components/events more unpredictable due to the nature of buffered channels.
I've had great success with this approach so far. I've been running stress tests for many minutes, and usually I'd hit the following, but at the moment nothing:
Can you elaborate on what exactly makes buffered channels unpredictable? It's definitely keeping the order of the incoming events now.
A buffered channel can easily hide deadlock and race condition problems. With an unbuffered channel, the communication between parties is synchronized, and this nature of unbuffered channels helps bring deadlocks to the surface. With a buffered channel, however, there is no synchronization between the concurrent components (no lock-step), so it is (and will be) harder to figure out the underlying race condition problems. The race conditions and deadlocks might not be because of the channel capacity; instead, the problems might be happening because the components' orchestration and logic are incorrect. The deadlocks may seem to go away while the components' work is lightweight, but the race condition problems will probably come back unpredictably later (let's hope that won't happen in production!). 🤞
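To illustrate with a standalone sketch (not code from this repo): with an unbuffered channel the very first send below blocks and the missing receiver is immediately obvious, while a buffer of 10 lets the first sends succeed and hides the same bug until the buffer fills under heavier load.

```go
package main

import "fmt"

func main() {
	// Change this to make(chan int) and the bug surfaces on the first send.
	// With a buffer of 10, the first 10 sends succeed and the missing
	// receiver only becomes a problem later, under more load.
	events := make(chan int, 10)

	for i := 0; i < 100; i++ {
		events <- i // blocks (and eventually deadlocks) once the buffer is full
		fmt.Println("queued", i)
	}
}
```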
Aha, yes, that makes sense. In this case, just maybe, there's a possibility that by removing the goroutine that was placing the events on the handlers, we're preventing the deadlock, and it seems that the order of the incoming events is kept.
I've been running the stress tests locally, and unfortunately I've run into the "nil response body" and "nil error when trying to print the session id" issues, but still not the deadlock issue that I'm trying to tackle in the first place 🤞 The next step is testing this in the cloud. Prior to this fix, I was coming across the deadlock issue in the cloud 90% of the time I ran a test. Hopefully it doesn't show up there, which would make this a good candidate. We still have #512, though, which forces the synchronisation of the create and clear contexts and also needs to be tested in the cloud.
The thing is, each component has its own event emitter, so there can be out-of-order events between different event emitters. The cause of the problem is that there is no mechanism to manage this at a higher level. Maybe we should look into that.
I'm suspicious that we might be hiding it (the deadlock). Even if there isn't an obvious deadlock, it doesn't mean that we don't have race conditions :( I'll try this fix with a complex website like Vistaprint when you finish 🤞
```diff
 		continue
 	default:
-		go emitEvent(handler)
+		emitEvent(handler)
```
Bye bye goroutine -- this helps keep the order of incoming events.
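For context, here is a minimal, runnable approximation of the emit path this diff touches (the names and types are illustrative, not the actual xk6-browser source). With the `go` removed, the event is placed on every handler's channel before `emit` returns, so a later `emit` cannot overtake an earlier one.

```go
package main

import (
	"context"
	"fmt"
)

// eventHandler is a stand-in for a registered listener: a context that
// signals when it goes away, and a buffered channel it reads events from.
type eventHandler struct {
	ctx context.Context
	ch  chan string
}

// emit delivers ev to every handler synchronously and in registration
// order. Because there is no `go` here, emit(A) returns only after A has
// been placed on every handler's channel, so emit(B) cannot overtake it.
func emit(handlers []*eventHandler, ev string) {
	for _, h := range handlers {
		select {
		case <-h.ctx.Done():
			continue // handler has gone away, skip it
		default:
			h.ch <- ev
		}
	}
}

func main() {
	h := &eventHandler{ctx: context.Background(), ch: make(chan string, 10)}

	emit([]*eventHandler{h}, "Runtime.executionContextCreated")
	emit([]*eventHandler{h}, "Runtime.executionContextsCleared")

	fmt.Println(<-h.ch) // always the "created" event first
	fmt.Println(<-h.ch) // then the "cleared" event
}
```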
```go
// Event listener.
EventListenerDefaultChanBufferSize int64 = 10
```
I would prefer not to have to work with buffered channels, but the issue is that the handlers not only listen for and handle incoming events, they can also emit events. Without the buffered channel this would deadlock 😭
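To make that deadlock concrete, here is a hypothetical minimal reproduction (not the actual handler code). The handler emits a new event while it is still handling the current one; with an unbuffered inbox the synchronous emit would block forever, because the only reader of that channel is the handler itself, which is stuck inside the emit. The buffer gives emit somewhere to put the event, so the handler can finish and come back to read it.

```go
package main

import "fmt"

func main() {
	// Change this to make(chan string) (unbuffered) and the program
	// deadlocks: the handler ends up waiting for itself.
	inbox := make(chan string, 10)
	done := make(chan struct{})

	// emit delivers an event to the handler synchronously (no goroutine),
	// mirroring the change in this PR.
	emit := func(ev string) { inbox <- ev }

	// The handler both reads events and emits new ones.
	go func() {
		for ev := range inbox {
			fmt.Println("handling", ev)
			switch ev {
			case "pageLoad":
				emit("metricsEmitted") // emitting while still handling
			case "metricsEmitted":
				close(done)
			}
		}
	}()

	emit("pageLoad")
	<-done
}
```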
Why is it `10`? I'm asking this because, technically, a buffered channel is the same as an unbuffered one; the only difference is that it will deadlock only once it's full.
The `10` was just an arbitrary number.
I was thinking about this, and it might be better to look at preventing the handler(s) from both emitting and receiving events, so that we don't need the buffered channels (and can leave them unbuffered). IMO, it doesn't feel right for a handler to be doing two jobs; it should concentrate on handling events and not on emitting new ones. But I need to look into this a bit more (which could take some time to make sense of).
If it's too much work for the handler to not emit, then maybe we need to allow for async emitting (i.e. another buffered channel on the receiving end of the emit).
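A rough sketch of what that async emitting could look like (purely illustrative; none of these names exist in the codebase): anything the handler wants to emit goes into a separate buffered outbox that a dedicated goroutine drains, so the handler never blocks sending to its own inbox, even if the inbox stays unbuffered.

```go
package main

import "fmt"

func main() {
	inbox := make(chan string)      // unbuffered: emit and handler stay in lock-step
	outbox := make(chan string, 10) // buffered queue for events the handler emits
	done := make(chan struct{})

	// A dedicated forwarder drains the outbox and re-delivers the events,
	// so the handler never sends to its own inbox directly.
	go func() {
		for ev := range outbox {
			inbox <- ev
		}
	}()

	// The handler only reads from the inbox; anything it wants to emit
	// goes to the outbox instead.
	go func() {
		for ev := range inbox {
			fmt.Println("handling", ev)
			switch ev {
			case "pageLoad":
				outbox <- "metricsEmitted"
			case "metricsEmitted":
				close(done)
			}
		}
	}()

	inbox <- "pageLoad"
	<-done
}
```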
I agree with Inanc. If we're relying on a buffered channel to avoid the deadlock issue, then this change merely alleviates the problem during low activity. On complex sites with many concurrent events we would probably run into it again once the buffer is full.
Have you tested with a complex script like the Vistaprint one from #381? It will need adjusting to the recent breaking changes.
```go
// and the channel (which should be buffered) is full. It might
// be that increasing the buffer size on the channel will help
// alleviate the problem.
logger.Error("emit timed out waiting for handler to read message")
```
Should we then cut this short here (`return` instead of continuing the `for` loop)? Right now it retries; is that expected?
Yes, I'll add a `return` if we're all happy with this change -- it's currently useful as a tool for debugging an issue with pprof 🙂
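For reference, this is roughly the shape of the logic being discussed (an illustrative approximation, not the actual source): the send sits in a `select` with a timeout inside a loop, so on timeout it currently logs and retries; the suggestion is to `return` after logging instead.

```go
package main

import (
	"log"
	"time"
)

type event struct{ name string }

// emitWithTimeout tries to hand the event to the handler; if the handler
// hasn't read it within the timeout, it logs and retries. The suggestion
// in this thread is to return after logging instead of looping forever.
func emitWithTimeout(handler chan event, ev event) {
	for {
		select {
		case handler <- ev:
			return
		case <-time.After(100 * time.Millisecond):
			// Happens when the handler is slow (or stuck) and its buffered
			// channel is full; a bigger buffer might only postpone this.
			log.Print("emit timed out waiting for handler to read message")
			// Proposed change: return here rather than retrying.
		}
	}
}

func main() {
	handler := make(chan event, 1)
	handler <- event{name: "fills the buffer"} // nothing reads, so the next emit times out

	go emitWithTimeout(handler, event{name: "Runtime.executionContextCreated"})
	time.Sleep(350 * time.Millisecond) // observe a few timeout logs, then exit
}
```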
Nice work digging into this 👏
While I do think this would improve the ordering of emitted events, it wouldn't address the functional concern of dependencies between events. I think the deadlock might not be manifesting in simple scenarios, but would happen on more complex sites, as mentioned in comments above.
For example, consider the `Runtime.executionContextCreated` / `Runtime.executionContextsCleared` situation. Among other things, we need to ensure that:
- `FrameSession.onExecutionContextCreated()` always finishes processing before `FrameSession.onExecutionContextsCleared()` processing starts. The `FrameSession.contextIDToContextMu` mutex addresses the concurrency problem, but doesn't enforce the flow of execution. I.e. one could finish before the other, and we could end up with contexts that should've been cleared, or cleared contexts that shouldn't have been.
- That `Frame.waitForExecutionContext()` isn't called, or at least returns early, if the frame is done navigating and `onExecutionContextsCleared()` was called. I.e. if there are no contexts, and no navigation is happening/expected, whatever called `waitForExecutionContext()` is probably anticipating a document to exist, and it points to the same functional race condition we've seen in other places as well.
I'm a bit wary of doing this global change on `EventEmitter`, since it might impact how other events are processed. I.e. given that event emission is no longer async, it might introduce delays for other events, of which we don't know the effects yet.
So I'd prefer a specific fix just for this `waitForExecutionContext()` issue instead. But it's not a strong preference, and if you or Inanc oppose this, I would be OK with merging this and testing it out for some time. We can always revert it later, and we're still early in the cycle.
Are there any major drawbacks of merging the fix from #512? I haven't tested it in depth, and the code wasn't trivial to grok when I first took a look, but it might be preferable to this global change.
Another approach I'm still lingering on from #427 is having a separate channel or a small finite-state machine for specific event flows. E.g. we know that for a specific frame, `executionContextCreated` should always be processed before `executionContextsCleared`. If that is violated, it means we're receiving CDP messages out of order, and that's a protocol-level error out of our hands (and AFAIK we haven't noticed that yet). Even though CDP is async, some events are always ordered within the context of a frame.
So I'm picturing a separate frame-specific channel to which these events are sent, and then a separate frame-specific reader that would process them synchronously, waiting for each event handler to complete before starting the next one. This would get rid of most locking, since we'd be sure things like `FrameSession.contextIDToContext` couldn't be accessed concurrently.
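A minimal sketch of that idea, with made-up types and names (nothing here is existing xk6-browser code): each frame owns a channel and a single reader goroutine that runs each handler to completion before starting the next, so for that frame `executionContextCreated` is always fully processed before `executionContextsCleared`, and the context map needs no mutex because only the loop goroutine touches it.

```go
package main

import (
	"fmt"
	"sync"
)

// frameEvent is a stand-in for a CDP event scoped to one frame.
type frameEvent struct {
	method string // e.g. "Runtime.executionContextCreated"
}

// frameEventLoop owns all execution-context state for a single frame.
// Only the run goroutine touches contexts, so no locking is needed.
type frameEventLoop struct {
	events   chan frameEvent
	contexts map[int]string
	wg       sync.WaitGroup
}

func newFrameEventLoop() *frameEventLoop {
	l := &frameEventLoop{
		events:   make(chan frameEvent, 10),
		contexts: make(map[int]string),
	}
	l.wg.Add(1)
	go l.run()
	return l
}

// run processes the frame's events strictly in arrival order, waiting
// for each handler to finish before starting the next one.
func (l *frameEventLoop) run() {
	defer l.wg.Done()
	for ev := range l.events {
		switch ev.method {
		case "Runtime.executionContextCreated":
			l.contexts[1] = "main world"
		case "Runtime.executionContextsCleared":
			l.contexts = make(map[int]string)
		}
		fmt.Printf("%s handled, %d context(s) left\n", ev.method, len(l.contexts))
	}
}

func main() {
	loop := newFrameEventLoop()
	// The created event is always fully handled before the cleared one.
	loop.events <- frameEvent{method: "Runtime.executionContextCreated"}
	loop.events <- frameEvent{method: "Runtime.executionContextsCleared"}
	close(loop.events)
	loop.wg.Wait()
}
```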
Maybe it's worth exploring that solution for this isolated issue, and we can expand it to other similar event flows later. I tried doing something like this for the #427 refactor, but it was a lot of effort to take on all at once.
Anyway, just an idea. I would be fine with #512 if it improves this issue.
Closes: #482
The order of the CDP messages can quite easily be lost after they are received and handed to the handlers that are listening in on the events. To alleviate this ordering issue, remove the `goroutine` that is used to place the event on the listening handlers, and instead use a `buffered channel`. The `buffered channel` avoids a deadlock whereby an event handler could be busy emitting another event while the emit method is busy trying to pass an event to the same handler.

Diagram:

This has been tested in the cloud, and after several tests xk6-browser has not deadlocked 🎉
I have another solution, #512 (which is still WIP), but I prefer this one as it's closer to the root cause of the issue.