When an API server gets into a heavily loaded state, NPD will continuously try to reconnect. #764

@ChristinaJShafer

NPD has no backoff on connection attempts to sync node conditions.

func (c *conditionManager) sync() {
	c.latestTry = c.clock.Now()
	c.resyncNeeded = false
	conditions := []v1.NodeCondition{}
	for i := range c.conditions {
		conditions = append(conditions, problemutil.ConvertToAPICondition(c.conditions[i]))
	}
	if err := c.client.SetConditions(conditions); err != nil {
		// The conditions will be updated again in future sync
		glog.Errorf("failed to update node conditions: %v", err)
		c.resyncNeeded = true
		return
	}
}

In the sync() function, c.latestTry is set before the call to SetConditions, and SetConditions takes about 15 seconds to time out. By the time SetConditions fails, more than the 10s resyncPeriod has already elapsed since the last attempt, so sync() tries SetConditions again within one second of the failure. In a large cluster this can extend outages: if kube-apiserver becomes overloaded and stops accepting connections, every node in the cluster starts retrying repeatedly with no delay.
I propose either updating latestTry in the SetConditions error-handling branch (which would add 10 seconds between attempts) or, ideally, adding exponential backoff for retries.
