When an API server gets into a heavily loaded state, NPD will continuously try to reconnect.

NPD has no backoff on connection attempts to [sync node conditions](https://github.com/kubernetes/node-problem-detector/blob/master/pkg/exporters/k8sexporter/condition/manager.go#L153).
```go
func (c *conditionManager) sync() {
	c.latestTry = c.clock.Now()
	c.resyncNeeded = false
	conditions := []v1.NodeCondition{}
	for i := range c.conditions {
		conditions = append(conditions, problemutil.ConvertToAPICondition(c.conditions[i]))
	}
	if err := c.client.SetConditions(conditions); err != nil {
		// The conditions will be updated again in future sync
		glog.Errorf("failed to update node conditions: %v", err)
		c.resyncNeeded = true
		return
	}
}
```
In `sync()`, `c.latestTry` is set before the call to `SetConditions`, and `SetConditions` takes about 15 seconds to time out. By the time `SetConditions` fails, more than the 10-second `resyncPeriod` has already elapsed since the last attempt, so `sync()` calls `SetConditions` again within one second of the failure. In a large cluster this can extend outages: if kube-apiserver becomes overloaded and stops accepting connections, every node in the cluster retries repeatedly with effectively no delay between attempts.
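
To make the timing concrete, here is a small simulation of the retry check rather than NPD's actual loop. It assumes the loop re-checks once per second (the `updateTick` constant and the `idleGap` helper are hypothetical names), uses the 10-second `resyncPeriod` and the roughly 15-second timeout described above, and stamps `latestTry` before the call as `sync()` does.

```go
package main

import (
	"fmt"
	"time"
)

const (
	resyncPeriod = 10 * time.Second // resync period cited in this issue
	updateTick   = 1 * time.Second  // assumed interval at which the loop re-checks
)

// idleGap simulates one failed attempt that blocks for callDuration and
// returns how long the loop then stays idle before the next attempt starts.
// latestTry is stamped before the call, mirroring sync().
func idleGap(callDuration time.Duration) time.Duration {
	var now time.Duration // simulated clock, starting at zero
	latestTry := now      // stamped before the call starts
	now += callDuration   // the call blocks until it fails
	failedAt := now
	for {
		now += updateTick
		if now-latestTry >= resyncPeriod {
			return now - failedAt
		}
	}
}

func main() {
	fmt.Println("gap after a 15s timeout:", idleGap(15*time.Second))
	fmt.Println("gap after a 2s failure: ", idleGap(2*time.Second))
}
```

Under these assumptions, a call that fails quickly still leaves most of the resync period as a pause, but a call that blocks longer than `resyncPeriod` leaves only a single tick before the next attempt.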
I propose either updating `latestTry` in the `SetConditions` error-handling branch (which would add 10 seconds between attempts) or, ideally, adding exponential backoff for retries.
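
As a rough illustration of the backoff option, the sketch below doubles the wait after each consecutive failure and resets it on success. It is not existing NPD code: the `retryBackoff` type, its methods, and the 5-minute `maxRetryDelay` cap are all hypothetical, and only the 10-second base comes from the `resyncPeriod` mentioned above.

```go
package main

import (
	"fmt"
	"time"
)

const (
	resyncPeriod  = 10 * time.Second // base delay, taken from this issue
	maxRetryDelay = 5 * time.Minute  // assumed cap, chosen for illustration
)

// retryBackoff tracks the delay to apply after consecutive failures.
type retryBackoff struct {
	base, max, current time.Duration
}

func newRetryBackoff(base, max time.Duration) *retryBackoff {
	return &retryBackoff{base: base, max: max}
}

// nextDelay returns the wait to apply after a failure. The first failure
// waits base; each further failure doubles the wait, capped at max.
func (b *retryBackoff) nextDelay() time.Duration {
	if b.current == 0 {
		b.current = b.base
	} else {
		b.current *= 2
	}
	if b.current > b.max {
		b.current = b.max
	}
	return b.current
}

// reset clears the failure streak; call it after a successful sync.
func (b *retryBackoff) reset() {
	b.current = 0
}

func main() {
	b := newRetryBackoff(resyncPeriod, maxRetryDelay)
	for i := 1; i <= 7; i++ {
		fmt.Printf("failure %d: wait %v\n", i, b.nextDelay())
	}
	b.reset()
	fmt.Printf("after a success, the next failure waits %v\n", b.nextDelay())
}
```

In `sync()`, the error branch could record the time of the failure plus `nextDelay()` as the earliest allowed retry, and the success path could call `reset()`. Adding random jitter to each delay would further help avoid every node retrying at the same moment; the `wait.Backoff` type in `k8s.io/apimachinery/pkg/util/wait` may be a better fit than a hand-rolled helper.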
