Motivation
- Reduce dependence on ZooKeeper by offering Consul as a first-class discovery and leader-election option.
- Many operators already run Consul; supporting it simplifies their Druid deployments and aligns with existing service catalogs and security (ACL/TLS).
- Completes the HTTP-based discovery path (works with http server view/task runner) for clusters outside Kubernetes.
Proposed changes
- Add a contrib extension `druid-consul-extensions` that wires Consul into Druid discovery (announcer + service watcher) and leader election for Coordinator/Overlord.
- Provide configuration for agent host/port, service prefix, datacenter pinning, ACL token, Basic Auth, TLS/mTLS, watch retry/backoff, health-check interval/TTL, deregister timeout, and extra service tags (a hypothetical sketch follows this list).
- Package the extension and document setup, config examples, and operational notes.
- Emit metrics for Consul registration, leader election, and watch retries; log failures clearly.
- Add dev harness docs (e.g., docker run consul) and integration tests/smoke checks that exercise registration, discovery, and leader failover.
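To make the configuration surface concrete, here is a minimal sketch of a `common.runtime.properties` fragment, assuming hypothetical `druid.discovery.consul.*` property names; only `druid.extensions.loadList` is an existing Druid property, and every other key and value is a placeholder rather than a proposed final name:

```properties
# Load the proposed contrib extension (name per this proposal).
druid.extensions.loadList=["druid-consul-extensions"]

# Consul agent connection (hypothetical keys, for illustration only).
druid.discovery.consul.host=127.0.0.1
druid.discovery.consul.port=8500
druid.discovery.consul.datacenter=dc1
druid.discovery.consul.servicePrefix=druid

# Security: ACL token, Basic Auth, TLS/mTLS.
druid.discovery.consul.aclToken=${CONSUL_ACL_TOKEN}
druid.discovery.consul.tls.enabled=true

# Health check / watch behavior.
druid.discovery.consul.healthCheck.ttl=PT10S
druid.discovery.consul.deregisterTimeout=PT1M
druid.discovery.consul.watch.retryBackoff=PT5S

# Extra tags attached to registered services.
druid.discovery.consul.serviceTags=["druid","prod"]
```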
Rationale
- Consul is widely deployed and already provides a service catalog, a KV store, and sessions suitable for leader locks.
- Alternatives considered: keep using ZooKeeper or K8s API-based discovery; neither fits environments standardized on Consul or where operators want to retire ZooKeeper.
- Reuses Druid’s existing HTTP-based server view/task runner, avoiding additional protocols.
Operational impact
- Prereqs: `druid.serverview.type=http` and `druid.indexer.runner.type=httpRemote` must be set cluster-wide before switching discovery/leader selectors to Consul (see the property sketch after this list).
- Migration path from ZooKeeper:
- Enable HTTP server view/task runner while still on ZooKeeper discovery.
- Load the Consul extension and set Consul configs on all nodes (common DC).
- Cut over discovery + leader selectors to Consul per role/tier; monitor metrics/logs.
- Remove the ZooKeeper dependency after stability is validated; roll back by pointing selectors back to ZooKeeper.
- Rolling/zero downtime: true zero downtime is not realistic because the two discovery catalogs and leader-election mechanisms can't be mixed. Roll out the bits/configs while still on ZooKeeper, then do a fast, coordinated switch to Consul (expect a brief disruption for restarts and the leader flip).
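As a sketch of the prerequisite settings plus a hypothetical cutover line: `druid.serverview.type`, `druid.indexer.runner.type`, and `druid.discovery.type` are existing Druid properties, but the `consul` value is a placeholder for whatever type the extension would actually register:

```properties
# Existing Druid options: HTTP-based server view and task runner,
# required cluster-wide before any cutover to Consul.
druid.serverview.type=http
druid.indexer.runner.type=httpRemote

# Hypothetical cutover: assumes the extension registers a "consul"
# discovery type (placeholder value, not final).
druid.discovery.type=consul
```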
Test plan (optional)
- Stand up a local Consul agent and verify node registration, health TTL updates, service discovery watch, and Coordinator/Overlord leader election failover.
- Add unit tests for config parsing and session/watch retry logic; integration tests for watch loop and leader lock behavior.
- Manual smoke doc: docker-run Consul (see the sketch after this list), run minimal Druid nodes with Consul discovery/election, and exercise restart/failover.
- CI setup that can be triggered on demand.
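A minimal local harness sketch for the manual smoke doc, assuming the stock Consul container image running a single dev-mode agent; the ports and endpoints below are standard Consul and nothing here is Druid-specific:

```sh
# Start a single-node Consul dev agent (in-memory state, for local testing only).
docker run -d --name dev-consul \
  -p 8500:8500 -p 8600:8600/udp \
  hashicorp/consul agent -dev -client=0.0.0.0

# Sanity checks: a leader is elected and the catalog is reachable.
curl http://127.0.0.1:8500/v1/status/leader
curl http://127.0.0.1:8500/v1/catalog/services
```

Druid nodes configured with the (hypothetical) properties sketched above would register against this agent; leader failover can then be exercised by stopping the active Coordinator/Overlord.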