
Proposal: Add Consul-based service discovery and leader election (as a contrib extension) #18746

@santosh-d3vpl3x

Description


Motivation

  • Reduce dependence on ZooKeeper by offering Consul as a first-class discovery and leader-election option.
  • Many operators already run Consul; supporting it simplifies their Druid deployments and aligns with existing service catalogs and security (ACL/TLS).
  • Rounds out the HTTP-based discovery path (compatible with the HTTP server view and task runner) for clusters running outside Kubernetes.

Proposed changes

  • Add contrib extension druid-consul-extensions that wires Consul into Druid discovery (announcer + service watcher) and leader election for Coordinator/Overlord.
  • Provide configuration for agent host/port, service prefix, datacenter pinning, ACL token, Basic Auth, TLS/mTLS, watch retry/backoff, health-check interval/TTL, deregister timeout, and extra service tags (see the illustrative config after this list).
  • Package the extension and document setup, config examples, and operational notes.
  • Emit metrics for Consul registration, leader election, and watch retries; log failures clearly.
  • Add dev harness docs (e.g., docker run consul) and integration tests/smoke checks that exercise registration, discovery, and leader failover.
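
By way of illustration, node configuration might look like the following. Every druid.discovery.consul.* property name here is hypothetical and would be finalized in the PR; druid.extensions.loadList is an existing Druid property, and druid.discovery.type follows the pattern the Kubernetes extension already uses:

```properties
# Load the proposed contrib extension
druid.extensions.loadList=["druid-consul-extensions"]

# Hypothetical properties -- names and shape to be finalized during review
druid.discovery.type=consul
druid.discovery.consul.agentHost=127.0.0.1
druid.discovery.consul.agentPort=8500
druid.discovery.consul.servicePrefix=druid
druid.discovery.consul.datacenter=dc1
druid.discovery.consul.aclToken=<acl-token>
druid.discovery.consul.tls.enabled=true
druid.discovery.consul.healthCheck.interval=PT10S
druid.discovery.consul.healthCheck.ttl=PT30S
druid.discovery.consul.deregisterAfter=PT5M
druid.discovery.consul.watch.retryBackoff=PT1S
druid.discovery.consul.extraTags=["env:prod"]
```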

Rationale

  • Consul is widely deployed and already provides a service catalog, a KV store, and sessions suitable for leader locks (see the sketch after this list).
  • Alternatives considered: staying on ZooKeeper or using K8s API-based discovery; neither fits environments that have standardized on Consul or where operators want to retire ZooKeeper.
  • Reuses Druid’s existing HTTP-based server view/task runner, avoiding additional protocols.
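
To make the sessions point concrete, here is roughly how a leader lock works against Consul's actual session/KV HTTP API; the KV path and payload below are illustrative:

```shell
# Create a session with a TTL; Consul responds with {"ID": "<session-id>"}
curl -s -X PUT http://127.0.0.1:8500/v1/session/create \
  -d '{"Name": "druid-coordinator-leader", "TTL": "15s"}'

# Contend for the lock; exactly one acquirer gets "true" back
curl -s -X PUT \
  "http://127.0.0.1:8500/v1/kv/druid/leader/coordinator?acquire=<session-id>" \
  -d '{"host": "coordinator1.example.com:8081"}'

# The leader renews the session before the TTL lapses to retain the lock
curl -s -X PUT http://127.0.0.1:8500/v1/session/renew/<session-id>
```

If the leader stops renewing, the session expires and the lock is released, letting a standby acquire it; that behavior maps naturally onto Druid's DruidLeaderSelector contract.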

Operational impact

  • Prereqs: druid.serverview.type=http and druid.indexer.runner.type=httpRemote must be set cluster-wide before switching discovery/leader selectors to Consul (see the snippet after this list).
  • Migration path from ZooKeeper:
    1. Enable HTTP server view/task runner while still on ZooKeeper discovery.
    2. Load the Consul extension and set Consul configs on all nodes (common DC).
    3. Cut over discovery + leader selectors to Consul per role/tier; monitor metrics/logs.
    4. Remove ZooKeeper dependency after stability validated; rollback by pointing selectors back to ZooKeeper.
  • Rolling upgrade/zero downtime: true zero downtime is not realistic because a cluster cannot mix two discovery catalogs or leader-election backends at once. Roll out the binaries and configs while still on ZooKeeper, then do a fast, coordinated switch to Consul (expect a brief disruption during restarts and the leader flip).
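
Concretely, steps 1 and 3 above translate to properties like these; the first two are existing Druid settings, the last is the hypothetical property from the sketch in "Proposed changes":

```properties
# Step 1: existing Druid properties; set cluster-wide while still on ZooKeeper
druid.serverview.type=http
druid.indexer.runner.type=httpRemote

# Step 3: flip discovery + leader election to Consul (hypothetical property name)
druid.discovery.type=consul
```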

Test plan (optional)

  • Stand up a local Consul agent and verify node registration, health TTL updates, service discovery watch, and Coordinator/Overlord leader election failover.
  • Add unit tests for config parsing and session/watch retry logic; integration tests for watch loop and leader lock behavior.
  • Manual smoke doc: run Consul via Docker (see the command after this list), start minimal Druid nodes with Consul discovery/election, and exercise restart/failover.
  • An optional, on-demand CI job that runs the integration tests.
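
The dev harness can be as simple as the official image in dev mode (single node, in-memory state, HTTP API and UI on port 8500):

```shell
# Single-node Consul dev agent; -client=0.0.0.0 exposes the API beyond the container
docker run --rm -p 8500:8500 hashicorp/consul:latest agent -dev -client=0.0.0.0
```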
