Multi-server pubsub healthcheck endpoint #1730
Conversation
@@ -0,0 +1,121 @@
import {GristServer} from 'app/server/lib/GristServer';
Could you add a doc-comment to this file that describes how these work, and why it's done this way? In particular, I assume it's a deliberate choice to rely on redis rather than, say, call each server's /status endpoint -- is that because we don't want to assume that servers can communicate with each other directly? That also means that this check doesn't reflect the ability of the servers to communicate with each other, right?
Sure, I'll write something.
How would servers know what other servers exist without Redis?
And since we're already using Redis to communicate between doc workers, it seemed like a natural extension to communicate with all Grist instances.
Does that make sense?
Well... this isn't entirely convincing. But I might not fully understand the goal of this check. Is it about checking whether a multi-server setup works correctly, or is it narrower than that, like how many home-servers or doc-workers are running?
> How would servers know what other servers exist without Redis?
Good question, even without the "Redis" part. Is this check intended to verify that servers know about each other for the purpose of their functionality, or only for the purpose of the health check?
> we're already using Redis to communicate between doc workers
We use Redis to coordinate assignments of docs to workers, and for some other things, like sessions and notifications, but for most traffic between home servers and doc workers, I think we rely on them being able to make HTTP requests to one another.
I am asking these questions to understand the purpose and to make sure we are on the same page, not to say this should be part of this healthcheck -- I can imagine that when setting up, there is value in knowing that, say, 3 doc-workers are running and healthy, even if the networking part of communicating with them isn't yet working. Is that the goal?
One more question: if I restart some servers, do we want this check to tell me how many got successfully restarted? Just knowing how many are healthy doesn't tell me if I have a mix of old and new ones.
The ultimate goal here is to be able to restart all the instances. That will be the next step. The restart is a button in the admin panel, but currently it only restarts whichever server receives the restart request.
After restarting all the instances, how does the web browser know when it's time to reload the page and get the newly restarted Grist page?
I decided it should be when all of the servers were restarted and ready to serve requests. That's how I arrived at this design. If we reloaded the page before all servers were restarted, we might be in some ambiguous state where some servers aren't ready yet.
> I can imagine that when setting up, there is value to know that, say, 3 doc-workers are running and healthy, even if the networking part of communicating to them isn't yet working. Is that the goal?
Not explicitly. This is using the `ready` status of an instance, which is a little different from `healthy`. The `healthy` state just means that the `status` endpoint is ready, which happens very early during Grist initialisation. This new endpoint instead checks that all instances that are registered on Redis have hit the `ready` status, which comes at the end of the initialisation of an instance, one of the last things that `MergedServer` does. The `ready` status should mean that networking is also ready. Of course, if you have some networking problem between your servers where they can't even reach each other or Redis, this check won't catch that.

As to the choice of using Redis instead of internal HTTP requests between the Grist instances, it just seemed easier to me to use a shared Redis channel, since Redis seemed inevitable for discovery to begin with.
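To illustrate the distinction made above, here is a minimal sketch (not Grist's actual code; the class and method names are illustrative) of how `healthy` can be true as soon as the status endpoint is serving, while `ready` only flips at the very end of initialisation:

```typescript
// Illustrative sketch of "healthy" vs "ready". In Grist, the final step
// would be performed near the end of MergedServer's initialisation.
class ServerStatus {
  private ready = false;

  // Called as one of the last steps of startup, once the server
  // can actually serve requests.
  markReady(): void { this.ready = true; }

  // "healthy": the status endpoint itself can answer, which is
  // available very early, long before initialisation finishes.
  isHealthy(): boolean { return true; }

  // "ready": full initialisation has completed.
  isReady(): boolean { return this.ready; }
}
```

A freshly started server would thus report healthy but not ready until initialisation completes.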
If restart is the main goal, then I think my last question from previous comment might be relevant?
Right, so it seems to me that the problem of an instance spontaneously dying already requires infrastructure that isn't shipped with Grist Core. Maybe we can offer advice on how to handle this problem, or supply this infrastructure, outside of this PR.
What I can add here is an extra endpoint to instruct a server to clean up the registry of instances on Redis.
I think you are right, and this comment comes closest to explaining why I feel this whole approach isn't the right one for checking the health of a multi-server installation. The issue is that no part of this knows anything about which servers are expected to be part of the installation. That knowledge is presumably in some external infrastructure, but then only with the help of that external infrastructure can you know the health of the installation.
E.g. let's say you start a cluster with 3 home servers and 10 doc workers, and you forgot to configure Redis for the doc workers. Then home servers register with Redis, doc workers don't, and the healthcheck done this way will report that everything is healthy, when it's not even close.

Or let's say everything is up and connected to Redis, and you updated some config and told everything to restart, but half your servers didn't react to your restart request and are still running with the old config (e.g. wrong activation key, or wrong version, etc). Still, this healthcheck will report healthy.

If you messed up and got two home servers running that point to different Redis URLs, whichever one the LB happens to ask will independently report that it is healthy, though the setup is broken.
All this approach does is check whether servers that successfully connected to Redis and registered themselves are still connected. Even for the narrow situation when everything is up and clean and registered in Redis, and there is nothing stale, does this help with restarts? As soon as you tell all servers to restart, they'll unregister themselves before exiting, and the healthcheck will return "all clear" even if they don't start up again, since it only checks servers that have successfully started and registered themselves.
I'm just making up scenarios, but fundamentally, it seems like anyone running an actual multi-server installation must have some external infrastructure to know what the servers are intended to be running. If we try to build tools to manage a cluster (even e.g. turn on "enterprise") without at all being aware of this infrastructure, or knowing which servers the admin intends to be affected, I think we might make the job of the administrator harder rather than easier.
Context

In order to properly restart multi-instance Grist installs, I need to know when all of the instances (home servers, doc workers) are up, ready, and restarted with the new Grist parameters.

This adds a new option to the `/status` health check endpoint called `allServersReady` to make sure that all the servers are up and ready to serve requests. In the absence of Redis, the same endpoint also works with the in-memory dummy pubsub provided by the `PubSubManager`.

Proposed solution

This adds to `FlexServer` a new option handled by a `HealthChecker` class that uses Redis pubsub to broadcast and handle requests to check each instance's `ready` status. Each server registers and de-registers itself on Redis during startup and shutdown respectively. During a request to check whether all servers are okay, the server handling the request from the web browser reports back once all registered servers have responded one way or another, with a default timeout of 10 seconds.

This also adds a new optional `GRIST_INSTANCE_ID` parameter, which defaults to a combination of the instance's host and port. This ID is used to identify registered instances, as well as to supply the pubsub listening channels unique to each instance.

Related issues

Follow-up to #1697

Has this been tested?
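The register-then-broadcast-and-collect flow described in the proposed solution could be sketched roughly as follows. This is a hypothetical illustration, not Grist's `HealthChecker`: it uses a minimal in-memory pub/sub (standing in for Redis or the `PubSubManager` dummy), and all names (`InMemoryPubSub`, `registerInstance`, `checkAllServersReady`) are made up for the example.

```typescript
type Handler = (message: string) => void;

// Minimal in-memory pub/sub standing in for Redis channels.
class InMemoryPubSub {
  private channels = new Map<string, Handler[]>();

  subscribe(channel: string, handler: Handler): void {
    const handlers = this.channels.get(channel) ?? [];
    handlers.push(handler);
    this.channels.set(channel, handlers);
  }

  publish(channel: string, message: string): void {
    for (const handler of this.channels.get(channel) ?? []) {
      handler(message);
    }
  }
}

// Each instance registers itself on startup and answers readiness
// probes on a channel unique to its instance ID.
function registerInstance(bus: InMemoryPubSub, registry: Set<string>,
                          instanceId: string, isReady: () => boolean): void {
  registry.add(instanceId);
  bus.subscribe(`ready-check:${instanceId}`, (replyChannel) => {
    if (isReady()) {
      bus.publish(replyChannel, instanceId);
    }
  });
}

// The instance handling the browser's request probes every registered
// instance and resolves true only if all reply before the timeout
// (10 seconds by default, mirroring the PR's description).
function checkAllServersReady(bus: InMemoryPubSub, registry: Set<string>,
                              timeoutMs = 10_000): Promise<boolean> {
  return new Promise((resolve) => {
    const pending = new Set(registry);
    const replyChannel = `ready-replies:${Math.random()}`;
    const timer = setTimeout(() => resolve(false), timeoutMs);
    bus.subscribe(replyChannel, (instanceId) => {
      pending.delete(instanceId);
      if (pending.size === 0) {
        clearTimeout(timer);
        resolve(true);
      }
    });
    for (const id of registry) {
      bus.publish(`ready-check:${id}`, replyChannel);
    }
  });
}
```

Note that this sketch inherits the limitation discussed in the review: an instance that never registered (or that de-registered before exiting) is simply absent from `registry`, so the check cannot notice it is missing.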