@karalekas
Member

@karalekas karalekas commented Oct 5, 2025

I really thought I did something in #86 but turns out I was wrong.

The situation I observed was not simply due to an unfortunately timed reweight + rewind -- it was only possible due to a very unlucky combination of stale information + simultaneous reweights.

[Image: Screenshot 2025-10-04 at 8 29 35 PM]

Revisiting the offending problem graph -- the only reason that 11241 is able to make it to its reweight critical section is that the pong from 11226 that contributed to its reweight was so stale that 11226 was still unweighted and unmatched at the time the pong was sent:

30.26: [#<D-N 11241>] processing reply pong AUGMENT 4 #<11241>--->#<11237> from #<11237>
30.26: [#<D-N 11241>] processing reply pong AUGMENT 2 #<11241>--->#<11226> from #<11226>
30.26: [#<D-N 11241>] processing reply pong AUGMENT 4 #<11241>--->#<11274> from #<11274>
30.26: [#<D-N 11241>] processing reply pong AUGMENT 2 #<11241>--->#<8435> from #<8435>
30.26: [#<D-N 11241>] unified pong is AUGMENT 2 #<11241>--->#<8435> from #<8435>
30.34: [#<D-S 36117>] got AUGMENT 2 #<11241>--->#<8435> from #<11241>

This stale pong AUGMENT 2 #<11241>--->#<11226> from #<11226> then happens to have the same weight as the soft pong that is returned to 11241 during CHECK-REWEIGHT, as 11226 is in the middle of a CONTRACT 2 reweight and has been de-weighted back to 0:

44.06: [#<D-N 11241>] pinging vertices #<14111> #<11237> #<11226> #<11156> #<11274> #<8435>
46.94: [#<D-N 11241>] processing reply pong AUGMENT 2 #<11241>--->#<14111> from #<14111>
46.94: [#<D-N 11241>] processing reply pong AUGMENT 2 #<11241>--->#<11237> from #<14111>
46.94: [#<D-N 11241>] processing reply pong HOLD 2 #<11241>--->#<11226> from #<14111>
46.94: [#<D-N 11241>] processing reply pong AUGMENT 4 #<11241>--->#<11156> from #<11156>
46.94: [#<D-N 11241>] processing reply pong AUGMENT 2 #<11241>--->#<11274> from #<11274>
46.94: [#<D-N 11241>] processing reply pong AUGMENT 2 #<11241>--->#<8435> from #<8435>
46.94: [#<D-N 11241>] unified pong is AUGMENT 2 #<11241>--->#<8435> from #<8435>

As such, we get a HOLD 2 #<11241>--->#<11226> from #<14111> pong and think everything is fine because the lowest-weight rec is still 2, so we clear ourselves to reweight. Then, because 11241 is a solo node (whereas 11226 is part of a 3-tree), it is able to reweight, re-check (deciding that everything is once again fine), and finalize the reweight before 11226's tree realizes it has reweighted too much and needs to rewind. The result is a negative-weight edge between 11241 and 11226.
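To make the arithmetic of that race concrete, here is a toy sketch in Python. Everything in it is an assumption for illustration: the formula (edge slack = distance minus both endpoint weights) is the usual primal-dual convention and is not confirmed by this PR, the function names are hypothetical, the numbers are made up to line up with the `2`s in the logs, and the rewind is modeled as a full rewind.

```python
def adjusted_weight(distance: int, weight_a: int, weight_b: int) -> int:
    """Slack left on an edge after subtracting both endpoint weights.

    Assumed formula; not taken from the project source.
    """
    return distance - weight_a - weight_b


def simulate_race() -> tuple[int, int]:
    distance = 2      # made-up raw edge length between the two nodes
    solo = 0          # made-up weight of the solo node (11241 in the logs)
    tree_node = 2     # made-up weight of the tree node (11226 in the logs)

    # The tree starts a CONTRACT 2 and de-weights its node, but has not finalized.
    tree_node_mid_reweight = tree_node - 2

    # The solo node's ping lands mid-reweight, so it sees slack based on weight 0.
    seen = adjusted_weight(distance, solo, tree_node_mid_reweight)

    # The solo node reweights by the slack it saw and finalizes first.
    solo += seen

    # The tree then rewinds, restoring its node's original weight.
    tree_node_after_rewind = tree_node

    final = adjusted_weight(distance, solo, tree_node_after_rewind)
    return seen, final


seen, final = simulate_race()
print(seen, final)
```

Under these made-up numbers the solo node sees a slack of 2 and the edge ends at -2 once the tree rewinds, which is the shape of the failure described above.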

Clearly this is very rare, but of course that has never stopped me, so after much consternation and deep thought I realized that we basically just need a way to say "don't trust mid-reweight pongs from negative nodes." However, making them non-pingable is a recipe for deadlock, and so what I came up with is having the negative nodes "pretend" that they actually haven't reweighted until the operation is fully finalized. We do this by "stashing" their original weights in a slot before the reweight happens and returning this value when pinged. This value is then "unstashed" at the end of the reweight operation.
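A minimal sketch of the stash idea, in Python rather than the project's own language. The class, slot, and method names are all hypothetical, and "unstashing" is modeled here as clearing the slot; the merged change may differ in detail.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NegativeNode:
    """Hypothetical stand-in for a negative node in a tree being reweighted."""

    weight: int
    # The "slot": holds the pre-reweight weight while a reweight is in flight.
    stashed_weight: Optional[int] = None

    def begin_reweight(self, amount: int) -> None:
        # Stash the original weight before applying the tentative change.
        self.stashed_weight = self.weight
        self.weight -= amount

    def rewind(self, amount: int) -> None:
        # Undo part or all of the tentative change; the stash stays in place.
        self.weight += amount

    def finalize(self) -> None:
        # "Unstash": from here on, pings see the real weight again.
        self.stashed_weight = None

    def weight_for_ping(self) -> int:
        # Mid-reweight, pretend nothing has happened yet.
        if self.stashed_weight is not None:
            return self.stashed_weight
        return self.weight


node = NegativeNode(weight=2)
node.begin_reweight(2)
print(node.weight, node.weight_for_ping())  # internal weight vs. what a ping sees
node.finalize()
print(node.weight_for_ping())
```

The point of the design is that a pinging node only ever sees the weight from before the supervisor operation started or the weight after it finished, never the tentative value in between.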

NB: This basically reverts all the material changes in #86 (only the logging improvements remain).

Contributor

@ecpeterson ecpeterson left a comment

I think this looks safe to me: I think this message resolution order was possible before, where pings to will-have-negative-weight nodes could resolve before whatever reweight becomes in-progress, and so we’re now just guaranteeing that that’s the effective message outcome all the time. Whether that fixes things / doesn’t cause more problems is less clear to me, but I’m optimistic.

@karalekas
Member Author

From offline discussion

We’re basically guaranteeing that negative node info is not stale

Yeah that. It reads to me like we’re guaranteeing that it is stale, in a sense, since we’re recalling from a stash

Well we’re guaranteeing that it is the same as the beginning of the supervisor operation, which is “deterministically stale” I guess but not arbitrarily stale

@karalekas karalekas merged commit 7fadb6d into main Oct 5, 2025
1 check passed
@karalekas karalekas deleted the neg-weight-bug-take-2 branch October 5, 2025 20:47