New Sentinel - sentinels frequently executing conflicting failovers #1446

@shanemadden

In Redis 2.8.2, the sentinels sometimes get stuck in a loop of voting for and promoting a new master every couple of seconds.

Three-node setup, each node running a Redis server and a sentinel:

1)  1) "name"
   2) "test"
   3) "ip"
   4) "10.33.25.11"
   5) "port"
   6) "6379"
   7) "runid"
   8) "b531ea7398bb78ce8e1bc529b0a46a4670c3bf48"
   9) "flags"
  10) "master"
  11) "pending-commands"
  12) "0"
  13) "last-ok-ping-reply"
  14) "366"
  15) "last-ping-reply"
  16) "366"
  17) "info-refresh"
  18) "5836"
  19) "role-reported"
  20) "master"
  21) "role-reported-time"
  22) "45971"
  23) "config-epoch"
  24) "0"
  25) "num-slaves"
  26) "2"
  27) "num-other-sentinels"
  28) "2"
  29) "quorum"
  30) "2"
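
For anyone scripting against this output, here is a minimal Python sketch that folds the flat field/value reply into a dictionary and derives the sentinel count from it. The helper name `reply_to_dict` is made up for illustration, and only a subset of the fields above is copied in. The majority calculation reflects the fact that Sentinel needs a majority of the known sentinels to authorize a failover.

```python
def reply_to_dict(flat):
    """Fold a flat [field, value, field, value, ...] reply into a dict."""
    if len(flat) % 2 != 0:
        raise ValueError("expected an even number of items")
    return dict(zip(flat[0::2], flat[1::2]))


# A few of the field/value pairs from the output above.
master = reply_to_dict([
    "name", "test",
    "ip", "10.33.25.11",
    "port", "6379",
    "flags", "master",
    "num-slaves", "2",
    "num-other-sentinels", "2",
    "quorum", "2",
])

# Total sentinels = the one answering plus the others it knows about.
total_sentinels = int(master["num-other-sentinels"]) + 1
quorum = int(master["quorum"])
majority = total_sentinels // 2 + 1

print(f"{master['name']}: {total_sentinels} sentinels, "
      f"quorum {quorum}, majority {majority}")
```

With three sentinels, both the quorum and the majority are 2, so any two sentinels that agree can declare the master down and authorize a failover.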

The sdown timer (down-after-milliseconds) is set to 15000 ms.

On some (but not all) failovers, the sentinels start firing failovers in rapid succession: they never seem to converge on one version of the configuration before voting and failing over again, roughly every two seconds.

I've seen it happen mostly on manual failovers, but also on a crash-initiated failover. I haven't been able to pin down a consistent reproduction, but it seems to occur frequently with the following steps:

  1. Run a manual failover on one sentinel
  2. Wait for a minute
  3. Run a manual failover on a different sentinel
  4. Wait for issue to occur
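
The steps above can be sketched in Python roughly as follows. This is an illustration, not the exact procedure used: the choice of which two sentinels to target is an assumption, the function names are hypothetical, and port 26379 is the sentinel port that appears in the vote lines of the logs. As written, the script only prints the commands; `run_repro()` has to be called explicitly to actually run them.

```python
import subprocess
import time

MASTER_NAME = "test"             # master name from the output above
SENTINEL_PORT = 26379            # sentinel port as it appears in the logs
FIRST_SENTINEL = "10.33.25.13"   # assumption: which sentinel goes first
SECOND_SENTINEL = "10.33.25.12"  # assumption: any other sentinel


def failover_command(host, port=SENTINEL_PORT, master=MASTER_NAME):
    """Build the redis-cli invocation for a manual failover."""
    return ["redis-cli", "-h", host, "-p", str(port),
            "SENTINEL", "FAILOVER", master]


def run_repro(wait_seconds=60):
    """Steps 1-3: failover, wait a minute, failover on a different sentinel."""
    subprocess.run(failover_command(FIRST_SENTINEL), check=True)
    time.sleep(wait_seconds)
    subprocess.run(failover_command(SECOND_SENTINEL), check=True)


# Dry run: show what would be executed without touching any sentinel.
for host in (FIRST_SENTINEL, SECOND_SENTINEL):
    print(" ".join(failover_command(host)))
```

Step 4 is then a matter of watching the sentinel logs for repeated `+switch-master` lines.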

I can't figure out why the two nodes that didn't execute the failover marked the new master as +sdown about 45 seconds after the failover completed (3x the sdown timer), even though they had working command and pubsub links.
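
To make the timing concrete, here is a small Python sketch of the arithmetic, using three timestamps copied from the logs in this report. It assumes all events fall on the same day, so only the time of day is parsed; the variable and function names are illustrative.

```python
from datetime import datetime, timedelta

SDOWN_TIMER = timedelta(milliseconds=15000)


def ts(text):
    """Parse the time-of-day part of a sentinel log timestamp."""
    return datetime.strptime(text, "%H:%M:%S.%f")


failover_requested = ts("16:08:44.937")  # .13: user requested FAILOVER
failover_end = ts("16:08:47.045")        # .13: +failover-end
first_sdown = ts("16:09:29.925")         # .11: first +sdown on the new master

for label, start in [("request", failover_requested),
                     ("failover-end", failover_end)]:
    gap = first_sdown - start
    print(f"{label} -> +sdown: {gap.total_seconds():.3f}s "
          f"({gap / SDOWN_TIMER:.2f}x the sdown timer)")
```

By these timestamps the first +sdown arrives about 45.0 seconds after the failover was requested and about 42.9 seconds after +failover-end, so the 3x figure lines up most closely with the moment the failover started.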

.13 (executes the failover):

[25096] 05 Dec 16:08:37.907 - Accepted 127.0.0.1:41777
[25096] 05 Dec 16:08:44.937 # Executing user requested FAILOVER of 'test'
[25096] 05 Dec 16:08:44.937 # +new-epoch 3
[25096] 05 Dec 16:08:44.937 # +try-failover master test 10.33.25.11 6379
[25096] 05 Dec 16:08:44.954 # +vote-for-leader fb4d1fc2e6d93abe89c8ad74454b815ee56863ff 3
[25096] 05 Dec 16:08:44.954 # +elected-leader master test 10.33.25.11 6379
[25096] 05 Dec 16:08:44.954 # +failover-state-select-slave master test 10.33.25.11 6379
[25096] 05 Dec 16:08:45.009 # +selected-slave slave 10.33.25.12:6379 10.33.25.12 6379 @ test 10.33.25.11 6379
[25096] 05 Dec 16:08:45.009 * +failover-state-send-slaveof-noone slave 10.33.25.12:6379 10.33.25.12 6379 @ test 10.33.25.11 6379
[25096] 05 Dec 16:08:45.064 * +failover-state-wait-promotion slave 10.33.25.12:6379 10.33.25.12 6379 @ test 10.33.25.11 6379
[25096] 05 Dec 16:08:45.972 # +promoted-slave slave 10.33.25.12:6379 10.33.25.12 6379 @ test 10.33.25.11 6379
[25096] 05 Dec 16:08:45.972 # +failover-state-reconf-slaves master test 10.33.25.11 6379
[25096] 05 Dec 16:08:46.052 * +slave-reconf-sent slave 10.33.25.13:6379 10.33.25.13 6379 @ test 10.33.25.11 6379
[25096] 05 Dec 16:08:46.103 - Client closed connection
[25096] 05 Dec 16:08:46.990 * +slave-reconf-inprog slave 10.33.25.13:6379 10.33.25.13 6379 @ test 10.33.25.11 6379
[25096] 05 Dec 16:08:46.990 * +slave-reconf-done slave 10.33.25.13:6379 10.33.25.13 6379 @ test 10.33.25.11 6379
[25096] 05 Dec 16:08:46.991 * +convert-to-slave slave 10.33.25.12:6379 10.33.25.12 6379 @ test 10.33.25.11 6379
[25096] 05 Dec 16:08:47.045 # +failover-end master test 10.33.25.11 6379
[25096] 05 Dec 16:08:47.045 # +switch-master test 10.33.25.11 6379 10.33.25.12 6379
[25096] 05 Dec 16:08:47.046 * +slave slave 10.33.25.13:6379 10.33.25.13 6379 @ test 10.33.25.12 6379
[25096] 05 Dec 16:08:47.048 * +slave slave 10.33.25.11:6379 10.33.25.11 6379 @ test 10.33.25.12 6379
[25096] 05 Dec 16:08:47.130 . +cmd-link slave 10.33.25.13:6379 10.33.25.13 6379 @ test 10.33.25.12 6379
[25096] 05 Dec 16:08:47.130 . +cmd-link master test 10.33.25.12 6379
[25096] 05 Dec 16:08:47.130 . +pubsub-link master test 10.33.25.12 6379
[25096] 05 Dec 16:08:47.130 . +pubsub-link slave 10.33.25.13:6379 10.33.25.13 6379 @ test 10.33.25.12 6379
[25096] 05 Dec 16:08:47.130 . +cmd-link slave 10.33.25.11:6379 10.33.25.11 6379 @ test 10.33.25.12 6379
[25096] 05 Dec 16:08:47.130 . +pubsub-link slave 10.33.25.11:6379 10.33.25.11 6379 @ test 10.33.25.12 6379
[25096] 05 Dec 16:09:30.032 # +new-epoch 4
[25096] 05 Dec 16:09:30.032 # +vote-for-leader e78001c73fde38c410e8cd5d4f9c39105b0c0f27 4
[25096] 05 Dec 16:09:30.038 # +sdown master test 10.33.25.12 6379
[25096] 05 Dec 16:09:30.114 # +odown master test 10.33.25.12 6379 #quorum 3/2
[25096] 05 Dec 16:09:32.014 # +switch-master test 10.33.25.12 6379 10.33.25.13 6379
[25096] 05 Dec 16:09:32.015 * +slave slave 10.33.25.11:6379 10.33.25.11 6379 @ test 10.33.25.13 6379
[25096] 05 Dec 16:09:32.018 * +slave slave 10.33.25.12:6379 10.33.25.12 6379 @ test 10.33.25.13 6379
[25096] 05 Dec 16:09:32.034 # +new-epoch 5
[25096] 05 Dec 16:09:32.034 # +vote-for-leader b24e8d29557469c90bff8194c10c39e1f6f6b446 5
[25096] 05 Dec 16:09:32.075 # +sdown master test 10.33.25.13 6379
[25096] 05 Dec 16:09:32.075 # +odown master test 10.33.25.13 6379 #quorum 3/2
[25096] 05 Dec 16:09:32.075 . +cmd-link master test 10.33.25.13 6379
[25096] 05 Dec 16:09:32.075 . +pubsub-link master test 10.33.25.13 6379
[25096] 05 Dec 16:09:32.075 . +cmd-link slave 10.33.25.12:6379 10.33.25.12 6379 @ test 10.33.25.13 6379
[25096] 05 Dec 16:09:32.075 . +pubsub-link slave 10.33.25.12:6379 10.33.25.12 6379 @ test 10.33.25.13 6379
[25096] 05 Dec 16:09:32.076 . +cmd-link slave 10.33.25.11:6379 10.33.25.11 6379 @ test 10.33.25.13 6379
[25096] 05 Dec 16:09:32.076 . +pubsub-link slave 10.33.25.11:6379 10.33.25.11 6379 @ test 10.33.25.13 6379
[25096] 05 Dec 16:09:34.027 # +switch-master test 10.33.25.13 6379 10.33.25.12 6379
[25096] 05 Dec 16:09:34.028 * +slave slave 10.33.25.11:6379 10.33.25.11 6379 @ test 10.33.25.12 6379
[25096] 05 Dec 16:09:34.030 * +slave slave 10.33.25.13:6379 10.33.25.13 6379 @ test 10.33.25.12 6379

.12:

[26140] 05 Dec 16:08:45.140 # +new-epoch 3
[26140] 05 Dec 16:08:46.053 # +switch-master test 10.33.25.11 6379 10.33.25.12 6379
[26140] 05 Dec 16:08:46.053 * +slave slave 10.33.25.13:6379 10.33.25.13 6379 @ test 10.33.25.12 6379
[26140] 05 Dec 16:08:46.057 * +slave slave 10.33.25.11:6379 10.33.25.11 6379 @ test 10.33.25.12 6379
[26140] 05 Dec 16:08:46.098 . +cmd-link master test 10.33.25.12 6379
[26140] 05 Dec 16:08:46.098 . +pubsub-link master test 10.33.25.12 6379
[26140] 05 Dec 16:08:46.098 . +cmd-link slave 10.33.25.13:6379 10.33.25.13 6379 @ test 10.33.25.12 6379
[26140] 05 Dec 16:08:46.098 . +pubsub-link slave 10.33.25.13:6379 10.33.25.13 6379 @ test 10.33.25.12 6379
[26140] 05 Dec 16:08:46.098 . +cmd-link slave 10.33.25.11:6379 10.33.25.11 6379 @ test 10.33.25.12 6379
[26140] 05 Dec 16:08:46.098 . +pubsub-link slave 10.33.25.11:6379 10.33.25.11 6379 @ test 10.33.25.12 6379
[26140] 05 Dec 16:09:29.975 # +sdown master test 10.33.25.12 6379
[26140] 05 Dec 16:09:30.031 # +odown master test 10.33.25.12 6379 #quorum 2/2
[26140] 05 Dec 16:09:30.031 # +new-epoch 4
[26140] 05 Dec 16:09:30.032 # +try-failover master test 10.33.25.12 6379
[26140] 05 Dec 16:09:30.032 # +vote-for-leader e78001c73fde38c410e8cd5d4f9c39105b0c0f27 4
[26140] 05 Dec 16:09:30.032 # 10.33.25.13:26379 voted for e78001c73fde38c410e8cd5d4f9c39105b0c0f27 4
[26140] 05 Dec 16:09:30.032 # 10.33.25.11:26379 voted for e78001c73fde38c410e8cd5d4f9c39105b0c0f27 4
[26140] 05 Dec 16:09:30.094 # +elected-leader master test 10.33.25.12 6379
[26140] 05 Dec 16:09:30.094 # +failover-state-select-slave master test 10.33.25.12 6379
[26140] 05 Dec 16:09:30.184 # +selected-slave slave 10.33.25.13:6379 10.33.25.13 6379 @ test 10.33.25.12 6379
[26140] 05 Dec 16:09:30.184 * +failover-state-send-slaveof-noone slave 10.33.25.13:6379 10.33.25.13 6379 @ test 10.33.25.12 6379
[26140] 05 Dec 16:09:30.255 * +failover-state-wait-promotion slave 10.33.25.13:6379 10.33.25.13 6379 @ test 10.33.25.12 6379
[26140] 05 Dec 16:09:31.075 # +promoted-slave slave 10.33.25.11:6379 10.33.25.11 6379 @ test 10.33.25.12 6379
[26140] 05 Dec 16:09:31.075 # +failover-state-reconf-slaves master test 10.33.25.12 6379
[26140] 05 Dec 16:09:31.133 * +slave-reconf-sent slave 10.33.25.11:6379 10.33.25.11 6379 @ test 10.33.25.12 6379
[26140] 05 Dec 16:09:32.134 # +new-epoch 5
[26140] 05 Dec 16:09:32.143 # -odown master test 10.33.25.12 6379
[26140] 05 Dec 16:09:32.144 * +slave-reconf-inprog slave 10.33.25.11:6379 10.33.25.11 6379 @ test 10.33.25.12 6379
[26140] 05 Dec 16:09:32.144 * +slave-reconf-done slave 10.33.25.11:6379 10.33.25.11 6379 @ test 10.33.25.12 6379
[26140] 05 Dec 16:09:32.201 # +failover-end master test 10.33.25.12 6379
[26140] 05 Dec 16:09:32.201 # +switch-master test 10.33.25.12 6379 10.33.25.13 6379
[26140] 05 Dec 16:09:32.201 * +slave slave 10.33.25.11:6379 10.33.25.11 6379 @ test 10.33.25.13 6379
[26140] 05 Dec 16:09:32.204 * +slave slave 10.33.25.12:6379 10.33.25.12 6379 @ test 10.33.25.13 6379
[26140] 05 Dec 16:09:32.275 # +sdown master test 10.33.25.13 6379
[26140] 05 Dec 16:09:32.275 . +cmd-link master test 10.33.25.13 6379
[26140] 05 Dec 16:09:32.275 . +cmd-link slave 10.33.25.12:6379 10.33.25.12 6379 @ test 10.33.25.13 6379
[26140] 05 Dec 16:09:32.276 . +pubsub-link master test 10.33.25.13 6379
[26140] 05 Dec 16:09:32.276 . +pubsub-link slave 10.33.25.12:6379 10.33.25.12 6379 @ test 10.33.25.13 6379
[26140] 05 Dec 16:09:32.276 . +cmd-link slave 10.33.25.11:6379 10.33.25.11 6379 @ test 10.33.25.13 6379
[26140] 05 Dec 16:09:32.276 . +pubsub-link slave 10.33.25.11:6379 10.33.25.11 6379 @ test 10.33.25.13 6379
[26140] 05 Dec 16:09:33.090 # +vote-for-leader b24e8d29557469c90bff8194c10c39e1f6f6b446 5
[26140] 05 Dec 16:09:33.205 # +odown master test 10.33.25.13 6379 #quorum 3/2
[26140] 05 Dec 16:09:34.027 # +switch-master test 10.33.25.13 6379 10.33.25.12 6379
[26140] 05 Dec 16:09:34.028 * +slave slave 10.33.25.11:6379 10.33.25.11 6379 @ test 10.33.25.12 6379
[26140] 05 Dec 16:09:34.030 * +slave slave 10.33.25.13:6379 10.33.25.13 6379 @ test 10.33.25.12 6379
[26140] 05 Dec 16:09:34.051 # +sdown master test 10.33.25.12 6379
[26140] 05 Dec 16:09:34.051 # +odown master test 10.33.25.12 6379 #quorum 3/2

.11:

[28554] 05 Dec 16:08:45.140 # +new-epoch 3
[28554] 05 Dec 16:08:46.053 # +switch-master test 10.33.25.11 6379 10.33.25.12 6379
[28554] 05 Dec 16:08:46.053 * +slave slave 10.33.25.13:6379 10.33.25.13 6379 @ test 10.33.25.12 6379
[28554] 05 Dec 16:08:46.057 * +slave slave 10.33.25.11:6379 10.33.25.11 6379 @ test 10.33.25.12 6379
[28554] 05 Dec 16:08:46.088 . +cmd-link master test 10.33.25.12 6379
[28554] 05 Dec 16:08:46.088 . +pubsub-link master test 10.33.25.12 6379
[28554] 05 Dec 16:08:46.088 . +cmd-link slave 10.33.25.13:6379 10.33.25.13 6379 @ test 10.33.25.12 6379
[28554] 05 Dec 16:08:46.088 . +pubsub-link slave 10.33.25.13:6379 10.33.25.13 6379 @ test 10.33.25.12 6379
[28554] 05 Dec 16:08:46.088 . +cmd-link slave 10.33.25.11:6379 10.33.25.11 6379 @ test 10.33.25.12 6379
[28554] 05 Dec 16:08:46.088 . +pubsub-link slave 10.33.25.11:6379 10.33.25.11 6379 @ test 10.33.25.12 6379
[28554] 05 Dec 16:09:29.925 # +sdown master test 10.33.25.12 6379
[28554] 05 Dec 16:09:30.032 # +new-epoch 4
[28554] 05 Dec 16:09:30.032 # +vote-for-leader e78001c73fde38c410e8cd5d4f9c39105b0c0f27 4
[28554] 05 Dec 16:09:31.024 # +odown master test 10.33.25.12 6379 #quorum 3/2
[28554] 05 Dec 16:09:32.014 # +switch-master test 10.33.25.12 6379 10.33.25.13 6379
[28554] 05 Dec 16:09:32.015 * +slave slave 10.33.25.11:6379 10.33.25.11 6379 @ test 10.33.25.13 6379
[28554] 05 Dec 16:09:32.018 * +slave slave 10.33.25.12:6379 10.33.25.12 6379 @ test 10.33.25.13 6379
[28554] 05 Dec 16:09:32.033 # +sdown master test 10.33.25.13 6379
[28554] 05 Dec 16:09:32.033 # +odown master test 10.33.25.13 6379 #quorum 3/2
[28554] 05 Dec 16:09:32.033 # +new-epoch 5
[28554] 05 Dec 16:09:32.033 # +try-failover master test 10.33.25.13 6379
[28554] 05 Dec 16:09:32.034 # +vote-for-leader b24e8d29557469c90bff8194c10c39e1f6f6b446 5
[28554] 05 Dec 16:09:32.034 . +cmd-link master test 10.33.25.13 6379
[28554] 05 Dec 16:09:32.034 . +pubsub-link master test 10.33.25.13 6379
[28554] 05 Dec 16:09:32.034 . +cmd-link slave 10.33.25.12:6379 10.33.25.12 6379 @ test 10.33.25.13 6379
[28554] 05 Dec 16:09:32.034 . +pubsub-link slave 10.33.25.12:6379 10.33.25.12 6379 @ test 10.33.25.13 6379
[28554] 05 Dec 16:09:32.034 . +cmd-link slave 10.33.25.11:6379 10.33.25.11 6379 @ test 10.33.25.13 6379
[28554] 05 Dec 16:09:32.035 . +pubsub-link slave 10.33.25.11:6379 10.33.25.11 6379 @ test 10.33.25.13 6379
[28554] 05 Dec 16:09:32.035 # 10.33.25.13:26379 voted for b24e8d29557469c90bff8194c10c39e1f6f6b446 5
[28554] 05 Dec 16:09:32.089 # -odown master test 10.33.25.13 6379
[28554] 05 Dec 16:09:32.089 # +elected-leader master test 10.33.25.13 6379
[28554] 05 Dec 16:09:32.089 # +failover-state-select-slave master test 10.33.25.13 6379
[28554] 05 Dec 16:09:32.160 # +selected-slave slave 10.33.25.12:6379 10.33.25.12 6379 @ test 10.33.25.13 6379
[28554] 05 Dec 16:09:32.160 * +failover-state-send-slaveof-noone slave 10.33.25.12:6379 10.33.25.12 6379 @ test 10.33.25.13 6379
[28554] 05 Dec 16:09:32.261 * +failover-state-wait-promotion slave 10.33.25.12:6379 10.33.25.12 6379 @ test 10.33.25.13 6379
[28554] 05 Dec 16:09:33.090 # 10.33.25.12:26379 voted for b24e8d29557469c90bff8194c10c39e1f6f6b446 5
[28554] 05 Dec 16:09:33.093 # +promoted-slave slave 10.33.25.12:6379 10.33.25.12 6379 @ test 10.33.25.13 6379
[28554] 05 Dec 16:09:33.093 # +failover-state-reconf-slaves master test 10.33.25.13 6379
[28554] 05 Dec 16:09:33.148 # +odown master test 10.33.25.13 6379 #quorum 3/2
[28554] 05 Dec 16:09:33.148 * +slave-reconf-sent slave 10.33.25.11:6379 10.33.25.11 6379 @ test 10.33.25.13 6379
[28554] 05 Dec 16:09:34.099 * +slave-reconf-inprog slave 10.33.25.11:6379 10.33.25.11 6379 @ test 10.33.25.13 6379
[28554] 05 Dec 16:09:34.099 * +slave-reconf-done slave 10.33.25.11:6379 10.33.25.11 6379 @ test 10.33.25.13 6379

This continues indefinitely, until enough sentinels are stopped that the remaining ones can no longer gather the votes needed for a failover.
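
One way to spot the loop from a log file is to pull out the `+switch-master` events and look at the gaps between them. The Python sketch below does that for the three `+switch-master` lines from the .13 log above; the regular expression and the function name are my own and only cover the line format shown in this report.

```python
import re
from datetime import datetime

# Matches lines such as:
# [25096] 05 Dec 16:09:32.014 # +switch-master test 10.33.25.12 6379 10.33.25.13 6379
SWITCH_RE = re.compile(
    r"^\[\d+\] \d+ \w+ (?P<time>\d{2}:\d{2}:\d{2}\.\d{3}) # \+switch-master "
    r"(?P<name>\S+) (?P<old_ip>\S+) (?P<old_port>\d+) "
    r"(?P<new_ip>\S+) (?P<new_port>\d+)$"
)


def switch_events(lines):
    """Yield (time, old_master_ip, new_master_ip) per +switch-master line."""
    for line in lines:
        m = SWITCH_RE.match(line.strip())
        if m:
            when = datetime.strptime(m["time"], "%H:%M:%S.%f")
            yield when, m["old_ip"], m["new_ip"]


# The three +switch-master lines from the .13 log above.
log_13 = """\
[25096] 05 Dec 16:08:47.045 # +switch-master test 10.33.25.11 6379 10.33.25.12 6379
[25096] 05 Dec 16:09:32.014 # +switch-master test 10.33.25.12 6379 10.33.25.13 6379
[25096] 05 Dec 16:09:34.027 # +switch-master test 10.33.25.13 6379 10.33.25.12 6379
""".splitlines()

events = list(switch_events(log_13))
# Seconds elapsed between consecutive master switches.
gaps = [(b[0] - a[0]).total_seconds() for a, b in zip(events, events[1:])]

for (when, old, new), gap in zip(events, [None] + gaps):
    since = "" if gap is None else f" (+{gap:.3f}s)"
    print(f"{when:%H:%M:%S} {old} -> {new}{since}")
```

On this excerpt the second and third switches are 2.013 seconds apart, which matches the failover-every-two-seconds behaviour described above.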
