This repository was archived by the owner on Jan 30, 2025. It is now read-only.

Conversation

ankur22
Collaborator

@ankur22 ankur22 commented Sep 9, 2022

Closes: #482

The order of the CDP messages can quite easily be lost after they are received and handed to the handlers that are listening for the events. To alleviate this ordering issue, remove the goroutine that is used to place the event on the listening handlers, and instead use a buffered channel. The buffered channel avoids a deadlock whereby an event handler could be busy emitting another event while the emit method is busy trying to pass an event to the same handler.
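
For illustration, here is a minimal standalone Go sketch of the pattern described above (the `Event`/`eventHandler` types and the `emit` function are hypothetical stand-ins, not the actual xk6-browser code): each handler reads from its own buffered channel, and emit hands events over in arrival order without spawning a goroutine per event.

```go
package main

import "fmt"

// Event is a simplified stand-in for a CDP event (hypothetical type).
type Event struct {
	Name string
}

// eventHandler is a hypothetical listener: it reads events from its own
// buffered channel.
type eventHandler struct {
	ch chan Event
}

// emit hands the event to every handler synchronously (no goroutine), which
// preserves the order in which events arrive. The buffered channel is what
// keeps this from deadlocking when a handler is itself busy emitting.
func emit(handlers []*eventHandler, e Event) {
	for _, h := range handlers {
		h.ch <- e // buffered send; only blocks once the buffer is full
	}
}

func main() {
	h := &eventHandler{ch: make(chan Event, 10)} // buffered, as in this PR
	done := make(chan struct{})
	go func() {
		for i := 0; i < 2; i++ {
			fmt.Println("handled:", (<-h.ch).Name)
		}
		close(done)
	}()
	emit([]*eventHandler{h}, Event{Name: "Runtime.executionContextCreated"})
	emit([]*eventHandler{h}, Event{Name: "Runtime.executionContextsCleared"})
	<-done
}
```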

Diagram: CDP event loop break ordering (drawio attachment).

This has been tested in the cloud and after several tests, xk6-browser has not deadlocked 🎉

I have another solution, #512 (which is still WIP), but I prefer this one as it's closer to the root cause of the issue.

@ankur22 ankur22 marked this pull request as draft September 9, 2022 11:44
@inancgumus
Collaborator

I once tried using buffered channels across the handlers while working on this problem, which didn't help back then. Another thing that worried me was making the interaction between components/events more unpredictable due to the nature of buffered channels.

@ankur22
Collaborator Author

ankur22 commented Sep 9, 2022

I once tried using buffered channels across the handlers while working on this problem, which didn't help back then.

I've had great success with this approach so far. I've been running stress tests for many minutes, and while I usually hit the following issues, at the moment I'm seeing none of them:

  • nil response in the locator tests,
  • deadlock in the locator tests,
  • nil error when trying to print the session id.

Another thing that worried me was making the interaction between components/events more unpredictable due to the nature of buffered channels.

Can you elaborate on what exactly makes buffered channels unpredictable? It's definitely keeping the order of the incoming events now.

@inancgumus
Collaborator

inancgumus commented Sep 9, 2022

Can you elaborate on what exactly makes buffered channels unpredictable? It's definitely keeping the order of the incoming events now.

A buffered channel can easily hide deadlock and race condition problems.

With an unbuffered channel, the communication between parties is synchronized. And this nature of unbuffered channels helps bring the deadlocks to the surface. However, with a buffered channel, there is no synchronization between the concurrent components (no lock-step). So, it is (and will be) harder to figure out the underlying race condition problems.
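
As a tiny standalone illustration of that lock-step difference (not code from this project): a send on an unbuffered channel blocks until a receiver is ready, while a send on a buffered channel succeeds as long as there is spare capacity, so a missing receiver only surfaces once the buffer fills.

```go
package main

import "fmt"

func main() {
	unbuffered := make(chan int)  // capacity 0: sender and receiver meet in lock-step
	buffered := make(chan int, 2) // capacity 2: sends succeed until the buffer is full

	// These sends do not block, so a missing or slow receiver goes unnoticed...
	buffered <- 1
	buffered <- 2
	fmt.Println("buffered sends completed without any receiver")

	// ...whereas an unbuffered send with no receiver ready would deadlock right
	// here, which is what makes such bugs surface earlier.
	done := make(chan struct{})
	go func() {
		fmt.Println("received:", <-unbuffered)
		close(done)
	}()
	unbuffered <- 3 // only completes because a receiver exists
	<-done
}
```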

The race conditions and deadlocks might not be because of the channel capacity. Instead, the problems might be happening because the components' orchestration and logic are incorrect. The deadlocks may seem to go away while the components' work is lightweight, but the race condition problems will probably come back unpredictably later (let's hope that won't happen in production!).

🤞

@ankur22
Collaborator Author

ankur22 commented Sep 9, 2022

With an unbuffered channel, the communication between parties is synchronized. And this nature of unbuffered channels helps bring the deadlocks to the surface. However, with a buffered channel, there is no synchronization between the concurrent components (no lock-step). So, it is (and will be) harder to figure out the underlying race condition problems.

Aha, yes, that makes sense. In this case there's a possibility that removing the goroutine that was placing the events on the handlers is what's preventing the deadlock, and it seems that the order of the incoming events is kept. The synchronous nature itself also helps prevent a deadlock, as it forces one event to be handled (placed on the handlers' channel) at a time -- but that's likely causing it to run slower than before. 😵‍💫

The race conditions and deadlocks might not be because of the channel capacity. Instead, the problems might be happening because the components' orchestration and logic are incorrect. The deadlocks may seem to go away while the components' work is lightweight, but the race condition problems will probably come back unpredictably later (let's hope that won't happen in production!).

I've been running the stress tests locally, and unfortunately I've run into the "nil response body" and the "nil error when trying to print the session id" issues, but still not the deadlock issue that I'm trying to tackle in the first place 🤞

The next step is testing this in the cloud. Prior to this fix, I was coming across the deadlock issue in the cloud 90% of the time I ran a test. Hopefully it doesn't show up there, which would make this a good candidate. We still have #512 though, which forces the synchronisation of the create and clear contexts and also needs to be tested in the cloud.

@inancgumus
Collaborator

inancgumus commented Sep 12, 2022

In this case there's a possibility that removing the goroutine that was placing the events on the handlers is what's preventing the deadlock, and it seems that the order of the incoming events is kept.

The thing is, each component has its own event emitter, so there can be out-of-order events between different event emitters. The cause of the problem is that there is no mechanism to manage this at a higher level. Maybe we should look into that.

still not the deadlock issue that I'm trying to tackle in the first place

I'm suspicious that we might just be hiding the deadlock. Even if there isn't an obvious deadlock, it doesn't mean we don't have race conditions :( I'll try this fix with a complex website like Vistaprint when you finish 🤞

@ankur22 ankur22 force-pushed the 482-waitForExecutionContext-timeout-chan branch from 175788d to b6ce43c Compare September 15, 2022 16:26
@ankur22 ankur22 force-pushed the 482-waitForExecutionContext-timeout-chan branch from db61e48 to 6ea67f5 Compare September 16, 2022 14:09
  continue
  default:
- 	go emitEvent(handler)
+ 	emitEvent(handler)
Collaborator Author

Bye bye goroutine -- this helps keep the order of incoming events.


// Event listener.

EventListenerDefaultChanBufferSize int64 = 10
Collaborator Author

I would prefer not to have to work with buffered channels, but the issue is that the handlers not only listen for and handle incoming events, they can also emit events. Without the buffered channel this would deadlock 😭
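
A contrived standalone sketch of that deadlock (hypothetical event names; the real emitter is more involved): with an unbuffered channel, a handler that emits while it is still busy handling the current event leaves both sides stuck, whereas a buffer gives the delivery somewhere to go.

```go
package main

import "fmt"

func main() {
	const buffer = 1 // change this to 0 (unbuffered) and the program deadlocks
	events := make(chan string, buffer)

	emit := func(e string) { events <- e }

	done := make(chan struct{})
	go func() { // the handler goroutine
		for e := range events {
			fmt.Println("handling:", e)
			if e == "executionContextCreated" {
				// The handler emits a follow-up event while it is still busy
				// handling the current one. With an unbuffered channel this
				// send would block forever, because the only receiver (this
				// goroutine) is busy right here.
				emit("frameNavigated")
			}
			if e == "frameNavigated" {
				close(done)
				return
			}
		}
	}()

	emit("executionContextCreated")
	<-done
}
```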

Collaborator

@inancgumus inancgumus Sep 16, 2022

Why is it 10?

I'm asking this because, technically, a buffered channel is the same as an unbuffered channel -- the former just deadlocks once it's full.
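
A tiny standalone snippet illustrating that point (not project code): once the buffer is full, a send blocks exactly like an unbuffered send with no receiver would.

```go
package main

import "fmt"

func main() {
	ch := make(chan int, 2) // once two items sit unread, it behaves like an unbuffered channel

	for i := 1; i <= 3; i++ {
		select {
		case ch <- i:
			fmt.Println("sent", i)
		default:
			// With no reader, the third send would block forever, exactly as
			// an unbuffered send would block on the very first item.
			fmt.Println("send", i, "would block: buffer is full")
		}
	}
}
```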

Collaborator Author

@ankur22 ankur22 Sep 19, 2022

The 10 was just an arbitrary number.

I was thinking about this, and it might be better to look at preventing the handler(s) from both emitting and receiving events, so that we don't need the buffered channels (and can leave them unbuffered). IMO it doesn't feel right for a handler to be doing two jobs; it should concentrate on handling events and not emit new ones. But I need to look into this a bit more (which could take some time to make sense of).

If it's too much work for the handlers to stop emitting, then maybe we need to allow for async emitting (i.e. another buffered channel on the receiving end of the emit).
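
A rough standalone sketch of that async-emitting idea (the handler/inbox/outbox names are made up for illustration): the handler never calls emit directly; it drops outgoing events onto its own buffered outbox and a separate goroutine forwards them, so handling never blocks on emitting.

```go
package main

import "fmt"

// handler is hypothetical: it receives events on inbox and, instead of calling
// emit directly, drops outgoing events onto its own buffered outbox, which a
// separate goroutine drains.
type handler struct {
	inbox  chan string
	outbox chan string
}

func main() {
	h := &handler{
		inbox:  make(chan string),     // can stay unbuffered
		outbox: make(chan string, 10), // buffered only on the emitting side
	}
	done := make(chan struct{})

	// Forwarder: re-emits events on the handler's behalf.
	go func() {
		for e := range h.outbox {
			fmt.Println("re-emitted:", e)
		}
		close(done)
	}()

	// Handler: only handles; emitting is a hand-off to the outbox.
	go func() {
		for e := range h.inbox {
			fmt.Println("handled:", e)
			h.outbox <- e + ".done"
		}
		close(h.outbox)
	}()

	h.inbox <- "executionContextCreated"
	h.inbox <- "frameNavigated"
	close(h.inbox)
	<-done
}
```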

Contributor

I agree with Inanc. If we're relying on a buffered channel to avoid the deadlock issue, then this change merely alleviates the problem during low activity. On complex sites with many concurrent events we would probably run into it again once the buffer is full.

Have you tested with a complex script like the Vistaprint one from #381? It will need adjusting to the recent breaking changes.

@ankur22 ankur22 changed the title from "Draft: Remove go and use chan to keep CDP msg order" to "Remove go and use chan to keep CDP msg order" on Sep 16, 2022
@ankur22 ankur22 marked this pull request as ready for review September 16, 2022 14:36
@ankur22 ankur22 requested review from imiric and mstoykov September 16, 2022 14:37
// and the channel (which should be buffered) is full. It might
// be that increasing the buffer size on the channel will help
// alleviate the problem.
logger.Error("emit timed out waiting for handler to read message")
Collaborator

Should we then cut this short here (return instead of staying in the for loop)? Right now it retries -- is that expected?

Collaborator Author

Yes, I'll add a return if we're all happy with this change -- it's currently useful as a tool to debug an issue with pprof 🙂
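
For reference, a standalone sketch of roughly the shape being discussed (placeholder names and timeout value; not the actual diff): on timeout the function logs and returns instead of retrying.

```go
package main

import (
	"fmt"
	"time"
)

// emitTo sketches the suggested shape: try to hand the event to the handler's
// buffered channel, and on timeout log and return instead of retrying forever.
func emitTo(ch chan<- string, event string, timeout time.Duration) {
	select {
	case ch <- event:
		// Delivered in order.
	case <-time.After(timeout):
		// The handler hasn't read the message and its (buffered) channel is
		// full; returning here cuts the retry loop short.
		fmt.Println("emit timed out waiting for handler to read message")
	}
}

func main() {
	ch := make(chan string, 1)
	emitTo(ch, "Runtime.executionContextCreated", 100*time.Millisecond)  // delivered
	emitTo(ch, "Runtime.executionContextsCleared", 100*time.Millisecond) // times out: buffer full, no reader
	fmt.Println("buffered event:", <-ch)
}
```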

Contributor

@imiric imiric left a comment

Nice work digging into this 👏

While I do think this would improve the ordering of emitted events, it wouldn't address the functional concern of dependencies between events. I think the deadlock might not be manifesting in simple scenarios, but would happen on more complex sites, as mentioned in comments above.

For example, consider the Runtime.executionContextCreated / Runtime.executionContextsCleared situation. Among other things, we need to ensure that:

  • FrameSession.onExecutionContextCreated() always finishes processing before FrameSession.onExecutionContextsCleared() processing starts.
    The FrameSession.contextIDToContextMu mutex addresses the concurrency problem, but doesn't enforce the flow of execution. I.e. one could finish before the other, and we could end up with contexts that should've been cleared, or cleared contexts that shouldn't have been.

  • That Frame.waitForExecutionContext() isn't called, or at least returns early, if the frame is done navigating, and onExecutionContextsCleared() was called.
    I.e. if there are no contexts, and no navigation is happening/expected, whatever called waitForExecutionContext() is probably anticipating a document to exist, and it points to the same functional race condition we've seen in other places as well.

I'm a bit wary of doing this global change on EventEmitter, since it might impact how other events are processed. I.e. given that event emission is no longer async, it might introduce delays for other events, whose effects we don't know yet.

So I'd prefer a specific fix just for this waitForExecutionContext() issue instead. But it's not a strong preference, and if you or Inanc oppose this, I would be OK with merging this and testing it out for some time. We can always revert it later, and we're still early in the cycle.

Are there any major drawbacks of merging the fix from #512? I haven't tested it in depth, and the code wasn't trivial to grok when I first took a look, but it might be preferable to this global change.


Another approach I'm still lingering on from #427 is having a separate channel or a small finite-state machine for specific event flows. E.g. we know that for a specific frame, executionContextCreated should always be processed before executionContextsCleared. If that is violated, it means we're receiving CDP messages out of order, which is a protocol-level error out of our hands (and AFAIK we haven't noticed that yet). Even though CDP is async, some events are always ordered within the context of a frame.

So I'm picturing a separate frame-specific channel to which these events are sent, and then a separate frame-specific reader would process them synchronously and wait for each event handler to complete before starting the next one. This would get rid of most locking, since we'd be sure things like FrameSession.contextIDToContext couldn't be accessed concurrently.
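
A standalone sketch of that idea, with made-up names (frameWorker, frameEvent) rather than the real FrameSession types: a per-frame channel plus a single reader goroutine that handles events strictly in arrival order, so the contexts map needs no mutex.

```go
package main

import "fmt"

// frameEvent is a stand-in for the execution-context CDP events (hypothetical).
type frameEvent struct {
	name      string
	contextID int
}

// frameWorker: one channel per frame and a single reader that processes events
// strictly in arrival order, so executionContextCreated always finishes before
// executionContextsCleared starts, and the contexts map is never accessed
// concurrently (no mutex needed).
type frameWorker struct {
	events   chan frameEvent
	contexts map[int]string // stand-in for FrameSession.contextIDToContext
	done     chan struct{}
}

func newFrameWorker() *frameWorker {
	w := &frameWorker{
		events:   make(chan frameEvent, 10),
		contexts: make(map[int]string),
		done:     make(chan struct{}),
	}
	go w.loop()
	return w
}

func (w *frameWorker) loop() {
	defer close(w.done)
	for e := range w.events {
		switch e.name {
		case "executionContextCreated":
			w.contexts[e.contextID] = "created"
		case "executionContextsCleared":
			w.contexts = make(map[int]string)
		}
		fmt.Printf("processed %s, contexts left: %d\n", e.name, len(w.contexts))
	}
}

func main() {
	w := newFrameWorker()
	w.events <- frameEvent{name: "executionContextCreated", contextID: 1}
	w.events <- frameEvent{name: "executionContextsCleared"}
	close(w.events)
	<-w.done
}
```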

Maybe it's worth exploring that solution for this isolated issue, and we can expand it to other similar event flows later. I tried doing something like this for the #427 refactor, but it was a lot of effort to take on all at once.

Anyway, just an idea. I would be fine with #512 if it improves this issue.

  // EventEmitter that all event emitters need to implement.
  type EventEmitter interface {
- 	emit(event string, data interface{})
+ 	emit(logger *log.Logger, event string, data interface{})
Contributor

Let's not change the emit() signature for this. The logger is irrelevant to the event and data being emitted.

Instead, I think it would be cleaner if we added a logger argument to NewBaseEventEmitter and used it in emit() instead. It's not quite the same thing, as it would use the event emitter's logger instead of the emit() caller's, but do we need that distinction?
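
Something along these lines, as a standalone sketch (field names assumed, and the standard library logger used here in place of the project's own): the emitter keeps its own logger, so emit() keeps its original signature.

```go
package main

import (
	"log"
	"os"
)

// BaseEventEmitter keeps its own logger (field names assumed for illustration).
type BaseEventEmitter struct {
	logger   *log.Logger
	handlers []chan interface{} // placeholder for the real handler bookkeeping
}

// NewBaseEventEmitter takes the logger once, instead of emit() taking it on
// every call.
func NewBaseEventEmitter(logger *log.Logger) *BaseEventEmitter {
	return &BaseEventEmitter{logger: logger}
}

func (e *BaseEventEmitter) emit(event string, data interface{}) {
	// Log with the emitter's own logger rather than one passed in by the caller.
	e.logger.Printf("emitting %q", event)
	// ... deliver event and data to registered handlers ...
	_ = data
}

func main() {
	emitter := NewBaseEventEmitter(log.New(os.Stderr, "emitter: ", log.LstdFlags))
	emitter.emit("Runtime.executionContextCreated", nil)
}
```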



The order of the CDP messages can quite easily be lost after being received and handed to the handlers that are listening in on the events. To alleviate this ordering issue, remove the goroutine that is used to place the event on the listening handlers, and instead use a buffered channel. The buffered channel avoids a deadlock whereby an event handler could be busy emitting another event, while the emit method is busy trying to pass an event to the same handler.
@ankur22
Collaborator Author

ankur22 commented Oct 5, 2022

This is unsuitable, as it only hides the deadlock; with more complex tests it will definitely deadlock.

@ankur22 ankur22 closed this Oct 5, 2022
@ankur22 ankur22 deleted the 482-waitForExecutionContext-timeout-chan branch November 7, 2022 15:16

Development

Successfully merging this pull request may close these issues.

Flaky timeout in Frame.waitForExecutionContext()