This repository was archived by the owner on Jan 30, 2025. It is now read-only.

Conversation

ankur22
Collaborator

@ankur22 ankur22 commented Sep 9, 2022

Closes: #482

The order of the CDP messages can quite easily be lost after they are received and handed to the handlers that are listening for the events. To alleviate this ordering issue, remove the goroutine that is used to place the event on the listening handlers, and instead use a buffered channel. The buffered channel avoids a deadlock whereby an event handler could be busy emitting another event while the emit method is busy trying to pass an event to the same handler.
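
For illustration, here is a minimal standalone Go sketch of the pattern described above (the `Event`/`eventHandler` types and the `emit` function are hypothetical stand-ins, not the actual xk6-browser code): each handler reads from its own buffered channel, and emit hands events over in arrival order without spawning a goroutine per event.

```go
package main

import "fmt"

// Event is a simplified stand-in for a CDP event (hypothetical type).
type Event struct {
	Name string
}

// eventHandler is a hypothetical listener: it reads events from its own
// buffered channel.
type eventHandler struct {
	ch chan Event
}

// emit hands the event to every handler synchronously (no goroutine), which
// preserves the order in which events arrive. The buffered channel is what
// keeps this from deadlocking when a handler is itself busy emitting.
func emit(handlers []*eventHandler, e Event) {
	for _, h := range handlers {
		h.ch <- e // buffered send; only blocks once the buffer is full
	}
}

func main() {
	h := &eventHandler{ch: make(chan Event, 10)} // buffered, as in this PR
	done := make(chan struct{})
	go func() {
		for i := 0; i < 2; i++ {
			fmt.Println("handled:", (<-h.ch).Name)
		}
		close(done)
	}()
	emit([]*eventHandler{h}, Event{Name: "Runtime.executionContextCreated"})
	emit([]*eventHandler{h}, Event{Name: "Runtime.executionContextsCleared"})
	<-done
}
```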

Diagram: CDP event loop break ordering (drawio attachment).

This has been tested in the cloud and after several tests, xk6-browser has not deadlocked 🎉

I have another solution, #512 (which is still WIP), but I prefer this one as it's closer to the root cause of the issue.

@ankur22 ankur22 marked this pull request as draft September 9, 2022 11:44
@inancgumus
Collaborator

I once tried using buffered channels across the handlers while working on this problem, which didn't help back then. Another thing that worried me was making the interaction between components/events more unpredictable due to the nature of buffered channels.

@ankur22
Collaborator Author

ankur22 commented Sep 9, 2022

I once tried using buffered channels across the handlers while working on this problem, which didn't help back then.

I've had great success with this approach so far. I've been running stress tests for many minutes, and while I usually hit the following issues, at the moment I'm seeing none of them:

  • nil response in the locator tests,
  • deadlock in the locator tests,
  • nil error when trying to print the session id.

Another thing that worried me was making the interaction between components/events more unpredictable due to the nature of buffered channels.

Can you elaborate on what exactly makes buffered channels unpredictable? It's definitely keeping the order of the incoming events now.

@inancgumus
Collaborator

inancgumus commented Sep 9, 2022

Can you elaborate on what exactly makes buffered channels unpredictable? It's definitely keeping the order of the incoming events now.

A buffered channel can easily hide deadlock and race condition problems.

With an unbuffered channel, the communication between parties is synchronized. And this nature of unbuffered channels helps bring the deadlocks to the surface. However, with a buffered channel, there is no synchronization between the concurrent components (no lock-step). So, it is (and will be) harder to figure out the underlying race condition problems.
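
As a tiny standalone illustration of that lock-step difference (not code from this project): a send on an unbuffered channel blocks until a receiver is ready, while a send on a buffered channel succeeds as long as there is spare capacity, so a missing receiver only surfaces once the buffer fills.

```go
package main

import "fmt"

func main() {
	unbuffered := make(chan int)  // capacity 0: sender and receiver meet in lock-step
	buffered := make(chan int, 2) // capacity 2: sends succeed until the buffer is full

	// These sends do not block, so a missing or slow receiver goes unnoticed...
	buffered <- 1
	buffered <- 2
	fmt.Println("buffered sends completed without any receiver")

	// ...whereas an unbuffered send with no receiver ready would deadlock right
	// here, which is what makes such bugs surface earlier.
	done := make(chan struct{})
	go func() {
		fmt.Println("received:", <-unbuffered)
		close(done)
	}()
	unbuffered <- 3 // only completes because a receiver exists
	<-done
}
```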

The race conditions and deadlocks might not be because of the channel capacity. Instead, the problems might be happening because the components' orchestration and logic are incorrect. The deadlocks may seem to go away while the components' work is lightweight, but the race condition problems will probably come back unpredictably later (let's hope that won't happen in production!).

🤞

@ankur22
Collaborator Author

ankur22 commented Sep 9, 2022

With an unbuffered channel, the communication between parties is synchronized. And this nature of unbuffered channels helps bring the deadlocks to the surface. However, with a buffered channel, there is no synchronization between the concurrent components (no lock-step). So, it is (and will be) harder to figure out the underlying race condition problems.

Aha, yes, that makes sense. In this case there's a possibility that removing the goroutine that was placing the events on the handlers is what's preventing the deadlock, and it seems that the order of the incoming events is kept. The synchronous nature itself also helps prevent a deadlock, as it forces one event to be handled (placed on the handlers' channel) at a time -- but that's likely causing it to run slower than before. 😵‍💫

The race conditions and deadlocks might not be because of the channel capacity. Instead, the problems might be happening because the components' orchestration and logic are incorrect. The deadlocks may seem to go away while the components' work is lightweight, but the race condition problems will probably come back unpredictably later (let's hope that won't happen in production!).

I've been running the stress tests locally, and unfortunately I've run into the "nil response body" and the "nil error when trying to print the session id" issues, but still not the deadlock issue that I'm trying to tackle in the first place 🤞

The next step is testing this in the cloud. Prior to this fix, I was coming across the deadlock issue in the cloud 90% of the time I ran a test. Hopefully it doesn't show up there, which would make this a good candidate. We still have #512 though, which forces the synchronisation of the create and clear contexts and also needs to be tested in the cloud.

@inancgumus
Collaborator

inancgumus commented Sep 12, 2022

In this case there's a possibility that removing the goroutine that was placing the events on the handlers is what's preventing the deadlock, and it seems that the order of the incoming events is kept.

The thing is, each component has its own event emitter, so there can be out-of-order events between different event emitters. The cause of the problem is that there is no mechanism to manage this at a higher level. Maybe we should look into that.

still not the deadlock issue that I'm trying to tackle in the first place

I'm suspicious that we might just be hiding the deadlock. Even if there isn't an obvious deadlock, it doesn't mean we don't have race conditions :( I'll try this fix with a complex website like Vistaprint when you finish 🤞

@ankur22 ankur22 force-pushed the 482-waitForExecutionContext-timeout-chan branch from 175788d to b6ce43c Compare September 15, 2022 16:26
@ankur22 ankur22 force-pushed the 482-waitForExecutionContext-timeout-chan branch from db61e48 to 6ea67f5 Compare September 16, 2022 14:09
  continue
  default:
- 	go emitEvent(handler)
+ 	emitEvent(handler)
Collaborator Author

Bye bye goroutine -- this helps keep the order of incoming events.


// Event listener.

EventListenerDefaultChanBufferSize int64 = 10
Collaborator Author

I would prefer not to have to work with buffered channels, but the issue is that the handlers not only listen for and handle incoming events, they can also emit events. Without the buffered channel this would deadlock 😭
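
A contrived standalone sketch of that deadlock (hypothetical event names; the real emitter is more involved): with an unbuffered channel, a handler that emits while it is still busy handling the current event leaves both sides stuck, whereas a buffer gives the delivery somewhere to go.

```go
package main

import "fmt"

func main() {
	const buffer = 1 // change this to 0 (unbuffered) and the program deadlocks
	events := make(chan string, buffer)

	emit := func(e string) { events <- e }

	done := make(chan struct{})
	go func() { // the handler goroutine
		for e := range events {
			fmt.Println("handling:", e)
			if e == "executionContextCreated" {
				// The handler emits a follow-up event while it is still busy
				// handling the current one. With an unbuffered channel this
				// send would block forever, because the only receiver (this
				// goroutine) is busy right here.
				emit("frameNavigated")
			}
			if e == "frameNavigated" {
				close(done)
				return
			}
		}
	}()

	emit("executionContextCreated")
	<-done
}
```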

Collaborator

@inancgumus inancgumus Sep 16, 2022

Why is it 10?

I'm asking this because, technically, a buffered channel is the same as an unbuffered channel -- the former just deadlocks once it's full.
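
A tiny standalone snippet illustrating that point (not project code): once the buffer is full, a send blocks exactly like an unbuffered send with no receiver would.

```go
package main

import "fmt"

func main() {
	ch := make(chan int, 2) // once two items sit unread, it behaves like an unbuffered channel

	for i := 1; i <= 3; i++ {
		select {
		case ch <- i:
			fmt.Println("sent", i)
		default:
			// With no reader, the third send would block forever, exactly as
			// an unbuffered send would block on the very first item.
			fmt.Println("send", i, "would block: buffer is full")
		}
	}
}
```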

Collaborator Author

@ankur22 ankur22 Sep 19, 2022

The 10 was just an arbitrary number.

I was thinking about this, and it might be better to look at preventing the handler(s) from both emitting and receiving events, so that we don't need the buffered channels (and can leave them unbuffered). IMO it doesn't feel right for a handler to be doing two jobs; it should concentrate on handling events and not emit new ones. But I need to look into this a bit more (which could take some time to make sense of).

If it's too much work for the handlers to stop emitting, then maybe we need to allow for async emitting (i.e. another buffered channel on the receiving end of the emit).
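
A rough standalone sketch of that async-emitting idea (the handler/inbox/outbox names are made up for illustration): the handler never calls emit directly; it drops outgoing events onto its own buffered outbox and a separate goroutine forwards them, so handling never blocks on emitting.

```go
package main

import "fmt"

// handler is hypothetical: it receives events on inbox and, instead of calling
// emit directly, drops outgoing events onto its own buffered outbox, which a
// separate goroutine drains.
type handler struct {
	inbox  chan string
	outbox chan string
}

func main() {
	h := &handler{
		inbox:  make(chan string),     // can stay unbuffered
		outbox: make(chan string, 10), // buffered only on the emitting side
	}
	done := make(chan struct{})

	// Forwarder: re-emits events on the handler's behalf.
	go func() {
		for e := range h.outbox {
			fmt.Println("re-emitted:", e)
		}
		close(done)
	}()

	// Handler: only handles; emitting is a hand-off to the outbox.
	go func() {
		for e := range h.inbox {
			fmt.Println("handled:", e)
			h.outbox <- e + ".done"
		}
		close(h.outbox)
	}()

	h.inbox <- "executionContextCreated"
	h.inbox <- "frameNavigated"
	close(h.inbox)
	<-done
}
```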

Contributor

I agree with Inanc. If we're relying on a buffered channel to avoid the deadlock issue, then this change merely alleviates the problem during low activity. On complex sites with many concurrent events we would probably run into it again once the buffer is full.

Have you tested with a complex script like the Vistaprint one from #381? It will need adjusting to the recent breaking changes.

@ankur22 ankur22 changed the title from "Draft: Remove go and use chan to keep CDP msg order" to "Remove go and use chan to keep CDP msg order" on Sep 16, 2022
@ankur22 ankur22 marked this pull request as ready for review September 16, 2022 14:36
@ankur22 ankur22 requested review from imiric and mstoykov September 16, 2022 14:37
// and the channel (which should be buffered) is full. It might
// be that increasing the buffer size on the channel will help
// alleviate the problem.
logger.Error("emit timed out waiting for handler to read message")
Collaborator

Should we then cut this short here (return instead of staying in the for loop)? Right now it retries -- is that expected?

Collaborator Author

Yes, I'll add a return if we're all happy with this change -- it's currently useful as a tool to debug an issue with pprof 🙂
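
For reference, a standalone sketch of roughly the shape being discussed (placeholder names and timeout value; not the actual diff): on timeout the function logs and returns instead of retrying.

```go
package main

import (
	"fmt"
	"time"
)

// emitTo sketches the suggested shape: try to hand the event to the handler's
// buffered channel, and on timeout log and return instead of retrying forever.
func emitTo(ch chan<- string, event string, timeout time.Duration) {
	select {
	case ch <- event:
		// Delivered in order.
	case <-time.After(timeout):
		// The handler hasn't read the message and its (buffered) channel is
		// full; returning here cuts the retry loop short.
		fmt.Println("emit timed out waiting for handler to read message")
	}
}

func main() {
	ch := make(chan string, 1)
	emitTo(ch, "Runtime.executionContextCreated", 100*time.Millisecond)  // delivered
	emitTo(ch, "Runtime.executionContextsCleared", 100*time.Millisecond) // times out: buffer full, no reader
	fmt.Println("buffered event:", <-ch)
}
```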

Contributor

@imiric imiric left a comment

Nice work digging into this 👏

While I do think this would improve the ordering of emitted events, it wouldn't address the functional concern of dependencies between events. I think the deadlock might not be manifesting in simple scenarios, but would happen on more complex sites, as mentioned in comments above.

For example, consider the Runtime.executionContextCreated / Runtime.executionContextsCleared situation. Among other things, we need to ensure that:

  • FrameSession.onExecutionContextCreated() always finishes processing before FrameSession.onExecutionContextsCleared() processing starts.
    The FrameSession.contextIDToContextMu mutex addresses the concurrency problem, but doesn't enforce the flow of execution. I.e. one could finish before the other, and we could end up with contexts that should've been cleared, or cleared contexts that shouldn't have been.

  • That Frame.waitForExecutionContext() isn't called, or at least returns early, if the frame is done navigating, and onExecutionContextsCleared() was called.
    I.e. if there are no contexts, and no navigation is happening/expected, whatever called waitForExecutionContext() is probably anticipating a document to exist, and it points to the same functional race condition we've seen in other places as well.

I'm a bit wary of doing this global change on EventEmitter, since it might impact how other events are processed. I.e. given that event emission is no longer async, it might introduce delays for other events, whose effects we don't know yet.

So I'd prefer a specific fix just for this waitForExecutionContext() issue instead. But it's not a strong preference, and if you or Inanc oppose this, I would be OK with merging this and testing it out for some time. We can always revert it later, and we're still early in the cycle.

Are there any major drawbacks of merging the fix from #512? I haven't tested it in depth, and the code wasn't trivial to grok when I first took a look, but it might be preferable to this global change.


Another approach I'm still lingering on from #427 is having a separate channel or a small finite-state machine for specific event flows. E.g. we know that for a specific frame, executionContextCreated should always be processed before executionContextsCleared. If that is violated, it means we're receiving CDP messages out of order, which is a protocol-level error out of our hands (and AFAIK we haven't noticed that yet). Even though CDP is async, some events are always ordered within the context of a frame.

So I'm picturing a separate frame-specific channel to which these events are sent, and then a separate frame-specific reader would process them synchronously and wait for each event handler to complete before starting the next one. This would get rid of most locking, since we'd be sure things like FrameSession.contextIDToContext couldn't be accessed concurrently.
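
A standalone sketch of that idea, with made-up names (frameWorker, frameEvent) rather than the real FrameSession types: a per-frame channel plus a single reader goroutine that handles events strictly in arrival order, so the contexts map needs no mutex.

```go
package main

import "fmt"

// frameEvent is a stand-in for the execution-context CDP events (hypothetical).
type frameEvent struct {
	name      string
	contextID int
}

// frameWorker: one channel per frame and a single reader that processes events
// strictly in arrival order, so executionContextCreated always finishes before
// executionContextsCleared starts, and the contexts map is never accessed
// concurrently (no mutex needed).
type frameWorker struct {
	events   chan frameEvent
	contexts map[int]string // stand-in for FrameSession.contextIDToContext
	done     chan struct{}
}

func newFrameWorker() *frameWorker {
	w := &frameWorker{
		events:   make(chan frameEvent, 10),
		contexts: make(map[int]string),
		done:     make(chan struct{}),
	}
	go w.loop()
	return w
}

func (w *frameWorker) loop() {
	defer close(w.done)
	for e := range w.events {
		switch e.name {
		case "executionContextCreated":
			w.contexts[e.contextID] = "created"
		case "executionContextsCleared":
			w.contexts = make(map[int]string)
		}
		fmt.Printf("processed %s, contexts left: %d\n", e.name, len(w.contexts))
	}
}

func main() {
	w := newFrameWorker()
	w.events <- frameEvent{name: "executionContextCreated", contextID: 1}
	w.events <- frameEvent{name: "executionContextsCleared"}
	close(w.events)
	<-w.done
}
```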

Maybe it's worth exploring that solution for this isolated issue, and we can expand it to other similar event flows later. I tried doing something like this for the #427 refactor, but it was a lot of effort to take on all at once.

Anyway, just an idea. I would be fine with #512 if it improves this issue.

  // EventEmitter that all event emitters need to implement.
  type EventEmitter interface {
- 	emit(event string, data interface{})
+ 	emit(logger *log.Logger, event string, data interface{})
Contributor

Let's not change the emit() signature for this. The logger is irrelevant to the event and data being emitted.

Instead, I think it would be cleaner if we added a logger argument to NewBaseEventEmitter and used it in emit() instead. It's not quite the same thing, as it would use the event emitter's logger instead of the emit() caller's, but do we need that distinction?
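
Something along these lines, as a standalone sketch (field names assumed, and the standard library logger used here in place of the project's own): the emitter keeps its own logger, so emit() keeps its original signature.

```go
package main

import (
	"log"
	"os"
)

// BaseEventEmitter keeps its own logger (field names assumed for illustration).
type BaseEventEmitter struct {
	logger   *log.Logger
	handlers []chan interface{} // placeholder for the real handler bookkeeping
}

// NewBaseEventEmitter takes the logger once, instead of emit() taking it on
// every call.
func NewBaseEventEmitter(logger *log.Logger) *BaseEventEmitter {
	return &BaseEventEmitter{logger: logger}
}

func (e *BaseEventEmitter) emit(event string, data interface{}) {
	// Log with the emitter's own logger rather than one passed in by the caller.
	e.logger.Printf("emitting %q", event)
	// ... deliver event and data to registered handlers ...
	_ = data
}

func main() {
	emitter := NewBaseEventEmitter(log.New(os.Stderr, "emitter: ", log.LstdFlags))
	emitter.emit("Runtime.executionContextCreated", nil)
}
```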



The order of the CDP messages can quite easily be lost after being received and handed to the handlers that are listening in on the events. To alleviate this ordering issue, remove the goroutine that is used to place the event on the listening handlers, and instead use a buffered channel. The buffered channel avoids a deadlock whereby an event handler could be busy emitting another event, while the emit method is busy trying to pass an event to the same handler.
@ankur22
Collaborator Author

ankur22 commented Oct 5, 2022

This is unsuitable, as it only hides the deadlock; with more complex tests it will definitely deadlock.

@ankur22 ankur22 closed this Oct 5, 2022
@ankur22 ankur22 deleted the 482-waitForExecutionContext-timeout-chan branch November 7, 2022 15:16

Development

Successfully merging this pull request may close these issues.

Flaky timeout in Frame.waitForExecutionContext()