Skip to content

service: retry on transport-level errors from Etcd within keyspace.Watch #294

@jgraettinger

Description

@jgraettinger

During a recent automated GKE upgrade, all brokers and Etcd pods were simultaneously signaled to exit (not ideal, but also not the issue at hand).

Etcd pods exited, and on the way out Gazette brokers observed transport-level errors which were treated as terminal, and caused a controlled but fatal shutdown across all brokers (along with a pod restart):

{"err":"service.Watch: rpc error: code = Unknown desc = closing transport due to: connection error: desc = \"error reading from server: EOF\", received prior goaway: code: NO_ERROR, debug data: ","level":"fatal","msg":"broker task failed","time":"2021-09-01T14:08:13Z"}

The shutdown was controlled -- no data loss is believed or expected to have occurred -- but it did cause cluster consistency to be lost and require operator intervention (gazctl journals reset-head).

What should happen instead

Brokers should have retried the Etcd watch on this transport-level error.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions