Description
NPD has no backoff on connection attempts to sync node conditions.
func (c *conditionManager) sync() {
	c.latestTry = c.clock.Now() // recorded before the (potentially slow) SetConditions call
	c.resyncNeeded = false
	conditions := []v1.NodeCondition{}
	for i := range c.conditions {
		conditions = append(conditions, problemutil.ConvertToAPICondition(c.conditions[i]))
	}
	if err := c.client.SetConditions(conditions); err != nil {
		// The conditions will be updated again in a future sync.
		glog.Errorf("failed to update node conditions: %v", err)
		c.resyncNeeded = true
		return
	}
}
In sync(), c.latestTry is set before the call to SetConditions, and SetConditions takes roughly 15 seconds to time out. By the time SetConditions fails, more than the 10-second resyncPeriod has already elapsed since latestTry was recorded, so sync() calls SetConditions again within about a second of the failure. In a large cluster this can prolong an outage: if kube-apiserver becomes overloaded and stops accepting connections, every node in the cluster starts retrying repeatedly with effectively no delay between attempts.
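For context, here is a sketch of the resync gate that produces this behavior. It is an assumption based on the timing described above, not a verbatim quote of condition_manager.go; the exact check there may differ.

// Sketch only: a resync gate that compares the elapsed time since latestTry
// against a fixed resyncPeriod.
const resyncPeriod = 10 * time.Second

func (c *conditionManager) needResync() bool {
	// latestTry is set before SetConditions, which blocks ~15s before failing,
	// so by the time the error returns this is already true, and the next
	// periodic check (roughly every second) immediately re-runs sync().
	return c.resyncNeeded && c.clock.Now().Sub(c.latestTry) >= resyncPeriod
}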
I propose either updating latestTry in the SetConditions error-handling branch (which would add roughly 10 seconds between attempts) or, ideally, adding exponential backoff for retries. A sketch of the second option follows.
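A minimal sketch of what that could look like, assuming a new retryDelay field on conditionManager and a hypothetical maxRetryDelay cap; neither exists in the current NPD code, and this is not a definitive implementation.

// Sketch only: measure the retry interval from when a failed attempt finished
// and grow the delay exponentially up to a cap.
const maxRetryDelay = 5 * time.Minute

func (c *conditionManager) sync() {
	c.latestTry = c.clock.Now()
	c.resyncNeeded = false
	conditions := []v1.NodeCondition{}
	for i := range c.conditions {
		conditions = append(conditions, problemutil.ConvertToAPICondition(c.conditions[i]))
	}
	if err := c.client.SetConditions(conditions); err != nil {
		glog.Errorf("failed to update node conditions: %v", err)
		c.resyncNeeded = true
		// Reset latestTry after the slow call so the wait starts from when the
		// attempt actually ended, then double the delay for the next retry.
		c.latestTry = c.clock.Now()
		c.retryDelay *= 2
		if c.retryDelay < resyncPeriod {
			c.retryDelay = resyncPeriod
		}
		if c.retryDelay > maxRetryDelay {
			c.retryDelay = maxRetryDelay
		}
		return
	}
	// Success: fall back to the normal resync period.
	c.retryDelay = resyncPeriod
}

// The resync gate then waits for the backed-off delay instead of a fixed period.
func (c *conditionManager) needResync() bool {
	return c.resyncNeeded && c.clock.Now().Sub(c.latestTry) >= c.retryDelay
}

A real implementation could instead reuse an existing backoff helper from client-go; the sketch just keeps to the field-based style of the snippet above.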