Revert HOLD preference; instead don't trust reweighting negative nodes #87

karalekas · 2025-10-05T03:41:47Z

I really thought I did something in #86 but turns out I was wrong.

The situation I observed was not simply due to an unfortunately timed reweight + rewind -- it was only possible due to a very unlucky combination of stale information + simultaneous reweights.

Revisiting the offending problem graph -- the only reason that 11241 is able to make it to its reweight critical section is that the pong from 11226 that contributed to its reweight was so stale that 11226 was still unweighted and unmatched at the time the pong was sent:

30.26: [#<D-N 11241>] processing reply pong AUGMENT 4 #<11241>--->#<11237> from #<11237>
30.26: [#<D-N 11241>] processing reply pong AUGMENT 2 #<11241>--->#<11226> from #<11226>
30.26: [#<D-N 11241>] processing reply pong AUGMENT 4 #<11241>--->#<11274> from #<11274>
30.26: [#<D-N 11241>] processing reply pong AUGMENT 2 #<11241>--->#<8435> from #<8435>
30.26: [#<D-N 11241>] unified pong is AUGMENT 2 #<11241>--->#<8435> from #<8435>
30.34: [#<D-S 36117>] got AUGMENT 2 #<11241>--->#<8435> from #<11241>

This stale pong AUGMENT 2 #<11241>--->#<11226> from #<11226> then happens to have the same weight as the soft pong that is returned to 11241 during CHECK-REWEIGHT, as 11226 is in the middle of a CONTRACT 2 reweight and has been de-weighted back to 0:

44.06: [#<D-N 11241>] pinging vertices #<14111> #<11237> #<11226> #<11156> #<11274> #<8435>
46.94: [#<D-N 11241>] processing reply pong AUGMENT 2 #<11241>--->#<14111> from #<14111>
46.94: [#<D-N 11241>] processing reply pong AUGMENT 2 #<11241>--->#<11237> from #<14111>
46.94: [#<D-N 11241>] processing reply pong HOLD 2 #<11241>--->#<11226> from #<14111>
46.94: [#<D-N 11241>] processing reply pong AUGMENT 4 #<11241>--->#<11156> from #<11156>
46.94: [#<D-N 11241>] processing reply pong AUGMENT 2 #<11241>--->#<11274> from #<11274>
46.94: [#<D-N 11241>] processing reply pong AUGMENT 2 #<11241>--->#<8435> from #<8435>
46.94: [#<D-N 11241>] unified pong is AUGMENT 2 #<11241>--->#<8435> from #<8435>

As such, we get a HOLD 2 #<11241>--->#<11226> from #<14111> pong and think everything is fine because the lowest-weight rec is still 2, so we clear ourselves to reweight. Then, because 11241 is a solo node (whereas 11226 is part of a 3-tree), it is able to reweight, re-check (deciding that everything is once again fine), and finalize the reweight before 11226's tree realizes it has reweighted too much and needs to rewind, resulting in a negative-weight edge between 11241 and 11226.
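To make the coincidence concrete, here is a minimal Python sketch (not the project's Lisp code) that replays the two rounds of pongs from the logs above. The `Pong` tuple and the `lowest_weight` helper are hypothetical names, and the idea that the check looks only at the lowest recommended weight is an assumption taken from the description above, not from the source.

```python
from typing import NamedTuple


class Pong(NamedTuple):
    action: str  # "AUGMENT" or "HOLD", as printed in the logs
    weight: int  # recommended weight carried by the pong
    target: int  # id of the vertex at the far end of the edge


def lowest_weight(pongs):
    """Hypothetical stand-in for the check: only the minimum weight matters."""
    return min(p.weight for p in pongs)


# Replies logged at 30.26, including the stale pong about 11226.
before = [
    Pong("AUGMENT", 4, 11237),
    Pong("AUGMENT", 2, 11226),
    Pong("AUGMENT", 4, 11274),
    Pong("AUGMENT", 2, 8435),
]

# Replies logged at 46.94, during CHECK-REWEIGHT.
during = [
    Pong("AUGMENT", 2, 14111),
    Pong("AUGMENT", 2, 11237),
    Pong("HOLD", 2, 11226),
    Pong("AUGMENT", 4, 11156),
    Pong("AUGMENT", 2, 11274),
    Pong("AUGMENT", 2, 8435),
]

# Under the assumed check, nothing appears to have changed.
check_passes = lowest_weight(before) == lowest_weight(during)
```

Both rounds bottom out at weight 2, so a comparison on weight alone cannot tell that the reply concerning 11226 went from a stale AUGMENT to a mid-reweight HOLD.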

Clearly this is very rare, but of course that has never stopped me, so after much consternation and deep thought I realized that we basically just need a way to say "don't trust mid-reweight pongs from negative nodes." However, making them non-pingable is a recipe for deadlock, and so what I came up with is having the negative nodes "pretend" that they actually haven't reweighted until the operation is fully finalized. We do this by "stashing" their original weights in a slot before the reweight happens and returning this value when pinged. This value is then "unstashed" at the end of the reweight operation.
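The stash/unstash idea can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not the code in src/supervisor.lisp: the `Node` class and its method names are hypothetical, "unstashing" is assumed to mean clearing the slot, and a ping is simplified to returning the node's weight.

```python
class Node:
    """Toy node that hides an in-progress reweight from pings."""

    def __init__(self, weight=0):
        self.weight = weight
        self.stashed_weight = None  # set only while a reweight is in flight

    def stash(self):
        # Remember the pre-reweight weight before anything is changed.
        self.stashed_weight = self.weight

    def unstash(self):
        # Assumed meaning of "unstash": clear the slot once the operation ends.
        self.stashed_weight = None

    def ping(self):
        # While a stash is present, report the original weight.
        if self.stashed_weight is not None:
            return self.stashed_weight
        return self.weight


# Reweight that is finalized: pings see 2 until the very end, then 0.
a = Node(weight=2)
a.stash()
a.weight -= 2
seen_mid_reweight = a.ping()
a.unstash()
seen_after_finalize = a.ping()

# Reweight that is rewound: pings never see the intermediate 0.
b = Node(weight=2)
b.stash()
b.weight -= 2
seen_before_rewind = b.ping()
b.weight += 2
b.unstash()
seen_after_rewind = b.ping()
```

In this model the node stays pingable throughout, which avoids the deadlock concern, while the tentative weight is never visible to other nodes.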

NB: This basically reverts all the material changes in #86 (only the logging improvements remain).

ecpeterson

I think this looks safe to me: I think this message resolution order was possible before, where pings to will-have-negative-weight nodes could resolve before whatever reweight becomes in-progress, and so we’re now just guaranteeing that that’s the effective message outcome all the time. Whether that fixes things / doesn’t cause more problems is less clear to me, but I’m optimistic.

src/supervisor.lisp

karalekas · 2025-10-05T04:23:26Z

From offline discussion

We’re basically guaranteeing that negative node info is not stale

Yeah that. It reads to me like we’re guaranteeing that it is stale, in a sense, since we’re recalling from a stash

Well we’re guaranteeing that it is the same as the beginning of the supervisor operation, which is “deterministically stale” I guess but not arbitrarily stale

karalekas added 12 commits October 4, 2025 14:41

Add log-entry for check-pong aborting

e9e5192

Don't log the root-bucket anymore for HOLD reweights

0bd3050

Remove the additional abort checks from check-reweight

0c80647

Go back to aggregating the root-buckets only for HOLD 0s

a1fbd67

Revert preference for non-zero HOLDs

0ba20b9

Also bring back the augment preference

71635ce

Revert the unify-pongs test due to augment precedence change

c1a0035

Add stashing and unstashing mechanisms

4a5c13b

Use the stashed-weight when adjoining root if it is set

47ff3ec

Add back test-supervisor-multireweight-simultaneous-rewind-non-integer

3ec019b

Revert changes to test-supervisor-multireweight-simultaneous-rewind-halfway

13cc9d8

Revert changes to test-supervisor-reweight-rewind-simultaneous

88d83d6

karalekas requested a review from ecpeterson October 5, 2025 03:42

karalekas assigned ecpeterson Oct 5, 2025

ecpeterson approved these changes Oct 5, 2025

src/supervisor.lisp (outdated, resolved)

Review feedback: if -> when

5934c3c

karalekas merged commit 7fadb6d into main Oct 5, 2025
1 check passed

karalekas deleted the neg-weight-bug-take-2 branch October 5, 2025 20:47
