Multi-server pubsub healthcheck endpoint #1730
Conversation
@@ -0,0 +1,121 @@
import {GristServer} from 'app/server/lib/GristServer';
Could you add a doc-comment to this file that describes how these work, and why it's done this way? In particular, I assume it's a deliberate choice to rely on redis rather than, say, call each server's /status endpoint -- is that because we don't want to assume that servers can communicate with each other directly? That also means that this check doesn't reflect the ability of the servers to communicate with each other, right?
Sure, I'll write something.
How would servers know what other servers exist without Redis?
And since we're already using Redis to communicate between doc workers, it seemed like a natural extension to communicate with all Grist instances.
Does that make sense?
Well... this isn't entirely convincing. But I might not fully understand the goal of this check. Is it about checking whether a multi-server setup works correctly, or is it narrower than that, like how many home-servers or doc-workers are running?
> How would servers know what other servers exist without Redis?
Good question, even without the "Redis" part. Is this check intended to verify that servers know about each other for the purpose of their functionality, or only for the purpose of the health check?
> we're already using Redis to communicate between doc workers
We use Redis to coordinate assignments of docs to workers, and for some other things, like sessions and notifications, but for most traffic between home servers and doc workers, I think we rely on them being able to make HTTP requests to one another.
I am asking these questions to understand the purpose and to make sure we are on the same page, not to say this should be part of this healthcheck -- I can imagine that when setting up, there is value in knowing that, say, 3 doc-workers are running and healthy, even if the networking part of communicating with them isn't yet working. Is that the goal?
One more question: if I restart some servers, do we want this check to tell me how many got successfully restarted? Just knowing how many are healthy doesn't tell me if I have a mix of old and new ones.
The ultimate goal here is to be able to restart all the instances. That will be the next step. The restart is a button in the admin panel, but currently it only restarts whichever server receives the restart request.
After restarting all the instances, how does the web browser know when it's time to reload the page and get the newly restarted Grist page?
I decided it should be when all of the servers were restarted and ready to serve requests. That's how I arrived at this design. If we reloaded the page before all servers were restarted, we might be in some ambiguous state where some servers aren't ready yet.
> I can imagine that when setting up, there is value to know that, say, 3 doc-workers are running and healthy, even if the networking part of communicating to them isn't yet working. Is that the goal?
Not explicitly. This is using the `ready` status of an instance, which is a little different from `healthy`. The `healthy` state just means that the `status` endpoint is ready, which happens very early during Grist initialisation. This new endpoint instead checks that all instances that are registered on Redis have hit the `ready` status, which comes at the end of the initialisation of an instance, one of the last things that `MergedServer` does. The `ready` status should mean that networking is also ready. Of course, if you have some networking problem between your servers where they can't even reach each other or Redis, this check won't catch that.

As to the choice of using Redis instead of internal HTTP requests between the Grist instances, it just seemed easier to me to use a shared Redis channel, since Redis seemed inevitable for discovery to begin with.
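To illustrate the distinction made above, here is a minimal sketch (not Grist's actual code; the class and method names are illustrative) of how `healthy` can be true as soon as the status endpoint is serving, while `ready` only flips at the very end of initialisation:

```typescript
// Illustrative sketch of "healthy" vs "ready". In Grist, the final step
// would be performed near the end of MergedServer's initialisation.
class ServerStatus {
  private ready = false;

  // Called as one of the last steps of startup, once the server
  // can actually serve requests.
  markReady(): void { this.ready = true; }

  // "healthy": the status endpoint itself can answer, which is
  // available very early, long before initialisation finishes.
  isHealthy(): boolean { return true; }

  // "ready": full initialisation has completed.
  isReady(): boolean { return this.ready; }
}
```

A freshly started server would thus report healthy but not ready until initialisation completes.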
If restart is the main goal, then I think my last question from previous comment might be relevant?
Right, so it seems to me that the problem of an instance spontaneously dying already requires infrastructure that isn't shipped with Grist Core. Maybe we can offer advice on how to handle this problem, or supply this infrastructure, outside of this PR.
What I can add here is an extra endpoint to instruct a server to clean up the registry of instances on Redis.
I think you are right, and this comment comes closest to explaining why I feel this whole approach isn't the right one for checking the health of a multi-server installation. The issue is that no part of this knows anything about which servers are expected to be part of the installation. That knowledge is presumably in some external infrastructure, but then only with the help of that external infrastructure can you know the health of the installation.
E.g. let's say you start a cluster with 3 home servers and 10 doc workers, and you forgot to configure Redis for the doc workers. Then home servers register with Redis, doc workers don't, and the healthcheck done this way will report that everything is healthy, when it's not even close.

Or let's say everything is up and connected to Redis, and you updated some config and told everything to restart, but half your servers didn't react to your restart request and are still running with the old config (e.g. wrong activation key, or wrong version, etc). Still, this healthcheck will report healthy.

If you messed up and got two home servers running that point to different Redis URLs, whichever one the LB happens to ask will independently report that it is healthy, though the setup is broken.
All this approach does is check whether servers that successfully connected to Redis and registered themselves are still connected. Even for the narrow situation when everything is up and clean and registered in Redis, and there is nothing stale, does this help with restarts? As soon as you tell all servers to restart, they'll unregister themselves before exiting, and the healthcheck will return "all clear" even if they don't start up again, since it only checks servers that have successfully started and registered themselves.
I'm just making up scenarios, but fundamentally, it seems like anyone running an actual multi-server installation must have some external infrastructure to know what the servers are intended to be running. If we try to build tools to manage a cluster (even e.g. turn on "enterprise") without at all being aware of this infrastructure, or knowing which servers the admin intends to be affected, I think we might make the job of the administrator harder rather than easier.
Context

In order to properly restart multi-instance Grist installs, I need to know when all of the instances (home servers, doc workers) are up, ready, and restarted with the new Grist parameters.

This adds a new option to the `/status` health check endpoint called `allServersReady` to make sure that all the servers are up and ready to serve requests. In the absence of Redis, the same endpoint also works with the in-memory dummy pubsub provided by the `PubSubManager`.

Proposed solution

This adds to `FlexServer` a new option handled by a `HealthChecker` class that uses Redis pubsub to broadcast and handle requests to check each instance's `ready` status. Each server registers and de-registers itself on Redis during startup and shutdown respectively. During a request to check whether all servers are okay, the server handling the request from the web browser reports back once all registered servers have responded one way or another, with a default timeout of 10 seconds.

This also adds a new optional `GRIST_INSTANCE_ID` parameter, which defaults to a combination of the instance's host and port. This ID is used to identify registered instances, as well as to supply the pubsub listening channels unique to each instance.

Related issues

Follow-up to #1697

Has this been tested?
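The register-then-broadcast-and-collect flow described in the proposed solution could be sketched roughly as follows. This is a hypothetical illustration, not Grist's `HealthChecker`: it uses a minimal in-memory pub/sub (standing in for Redis or the `PubSubManager` dummy), and all names (`InMemoryPubSub`, `registerInstance`, `checkAllServersReady`) are made up for the example.

```typescript
type Handler = (message: string) => void;

// Minimal in-memory pub/sub standing in for Redis channels.
class InMemoryPubSub {
  private channels = new Map<string, Handler[]>();

  subscribe(channel: string, handler: Handler): void {
    const handlers = this.channels.get(channel) ?? [];
    handlers.push(handler);
    this.channels.set(channel, handlers);
  }

  publish(channel: string, message: string): void {
    for (const handler of this.channels.get(channel) ?? []) {
      handler(message);
    }
  }
}

// Each instance registers itself on startup and answers readiness
// probes on a channel unique to its instance ID.
function registerInstance(bus: InMemoryPubSub, registry: Set<string>,
                          instanceId: string, isReady: () => boolean): void {
  registry.add(instanceId);
  bus.subscribe(`ready-check:${instanceId}`, (replyChannel) => {
    if (isReady()) {
      bus.publish(replyChannel, instanceId);
    }
  });
}

// The instance handling the browser's request probes every registered
// instance and resolves true only if all reply before the timeout
// (10 seconds by default, mirroring the PR's description).
function checkAllServersReady(bus: InMemoryPubSub, registry: Set<string>,
                              timeoutMs = 10_000): Promise<boolean> {
  return new Promise((resolve) => {
    const pending = new Set(registry);
    const replyChannel = `ready-replies:${Math.random()}`;
    const timer = setTimeout(() => resolve(false), timeoutMs);
    bus.subscribe(replyChannel, (instanceId) => {
      pending.delete(instanceId);
      if (pending.size === 0) {
        clearTimeout(timer);
        resolve(true);
      }
    });
    for (const id of registry) {
      bus.publish(`ready-check:${id}`, replyChannel);
    }
  });
}
```

Note that this sketch inherits the limitation discussed in the review: an instance that never registered (or that de-registered before exiting) is simply absent from `registry`, so the check cannot notice it is missing.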