2 | 2 | <#import "/templates/links.adoc" as links>
3 | 3 |
4 | 4 | <@tmpl.guide |
5 | | -title="Deploy an AWS Lambda to guard against Split-Brain" |
| 5 | +title="Deploy an AWS Lambda to disable a non-responding site" |
6 | 6 | summary="Building block for loadbalancer resilience" |
7 | 7 | tileVisible="false" > |
8 | 8 |
9 | | -This {section} explains how to reduce the impact when split-brain scenarios occur between two sites in a multi-site deployment. |
| 9 | +This {section} explains how to resolve a split-brain scenario between two sites in a multi-site deployment.
| 10 | +The deployment also disables replication if one site fails, so that the other site can continue to serve requests.
10 | 11 |
11 | 12 | This deployment is intended to be used with the setup described in the <@links.ha id="concepts-multi-site"/> {section}. |
12 | 13 | Use this deployment with the other building blocks outlined in the <@links.ha id="bblocks-multi-site"/> {section}. |
13 | 14 |
14 | 15 | include::partials/blueprint-disclaimer.adoc[] |
15 | 16 |
16 | 17 | == Architecture |
17 | | -In the event of a network communication failure between the two sites in a multi-site deployment, it is no |
18 | | -longer possible for the two sites to continue to replicate data between themselves and the two sites |
19 | | -will become increasingly out-of-sync. As it is possible for subsequent Keycloak requests to be routed to different |
20 | | -sites, this may lead to unexpected behaviour as previous updates will not have been applied to both sites. |
21 | 18 |
22 | | -In such scenarios a quorum is commonly used to determine which sites are marked as online or offline, however as multi-site deployments only consist of two sites, this is not possible. |
23 | | -Instead, we leverage "`fencing`" to ensure that when one of the sites is unable to connect to the other site, only one site remains in the loadbalancer configuration and hence only this site is able to serve subsequent users requests. |
| 19 | +In the event of a network communication failure between sites in a multi-site deployment, it is no longer possible for the two sites to continue to replicate data between them. |
| 20 | +{jdgserver_name} is configured with a `FAIL` failure policy, which ensures consistency over availability. Consequently, all user requests are served with an error message until the failure is resolved, either by restoring the network connection or by disabling cross-site replication.
24 | 21 |
25 | | -Once the fencing procedure is triggered the replication between two {jdgserver_name} clusters in each site is no longer enabled and as a result the sites will be out-of-sync. |
26 | | -To recover from the out-of-sync state a manual re-sync is necessary as described in <@links.ha id="operate-synchronize" />. |
27 | | -This is why a site which is removed via fencing will not be re-added automatically when the network communication failure is resolved, but only after such a synchronisation using the manual procedure <@links.ha id="operate-site-online" />. |
| 22 | +In such scenarios, a quorum is commonly used to determine which sites are marked as online or offline. |
| 23 | +However, as multi-site deployments only consist of two sites, this is not possible. |
| 24 | +Instead, we leverage "`fencing`" to ensure that when one of the sites is unable to connect to the other site, only one site remains in the loadbalancer configuration, and hence only this site is able to serve subsequent user requests.
| 25 | + |
| 26 | +In addition to updating the loadbalancer configuration, the fencing procedure disables replication between the two {jdgserver_name} clusters so that user requests can be served from the site that remains in the loadbalancer configuration.
| 27 | +As a result, the sites will be out-of-sync once the replication has been disabled. |
| 28 | + |
| 29 | +To recover from the out-of-sync state, a manual re-sync is necessary as described in <@links.ha id="operate-synchronize" />. |
| 30 | +This is why a site which is removed via fencing will not be re-added automatically when the network communication failure is resolved. The removed site should only be re-added once the two sites have been synchronized using the procedure outlined in <@links.ha id="operate-site-online" />.
28 | 31 |
29 | 32 | In this {section} we describe how to implement fencing using a combination of https://prometheus.io/docs/alerting/latest/overview/[Prometheus Alerts] |
30 | | -and AWS Lambda functions. A Prometheus Alert is triggered when split-brain is detected by the {jdgserver_name} server metrics, |
31 | | -which results in the Prometheus AlertManager calling the AWS Lambda based webhook. The triggered Lambda function inspects |
32 | | -the current Global Accelerator configuration and removes the site reported to be offline. |
| 33 | +and AWS Lambda functions. |
| 34 | +A Prometheus Alert is triggered when split-brain is detected by the {jdgserver_name} server metrics, which results in the Prometheus AlertManager calling the AWS Lambda-based webhook.
| 35 | +The triggered Lambda function inspects the current Global Accelerator configuration and removes the site reported to be offline. |
33 | 36 |
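To make the fencing step concrete, here is a minimal sketch of that removal logic, not the exact function used by this deployment.
It assumes Python with boto3; the listener ARN and endpoint ID below are hypothetical placeholders that would in practice be derived from the AlertManager webhook payload.

[source,python]
----
import boto3

# The Global Accelerator control plane is homed in us-west-2,
# regardless of the regions in which the endpoints themselves run.
ga = boto3.client("globalaccelerator", region_name="us-west-2")

def remove_offline_site(listener_arn, offline_endpoint_id):
    """Remove the offline site's endpoint from every endpoint group of the listener."""
    groups = ga.list_endpoint_groups(ListenerArn=listener_arn)["EndpointGroups"]
    for group in groups:
        remaining = [
            {"EndpointId": e["EndpointId"], "Weight": e["Weight"]}
            for e in group["EndpointDescriptions"]
            if e["EndpointId"] != offline_endpoint_id
        ]
        # Refuse to empty the group: one site must always remain to serve requests.
        if remaining and len(remaining) < len(group["EndpointDescriptions"]):
            ga.update_endpoint_group(
                EndpointGroupArn=group["EndpointGroupArn"],
                EndpointConfigurations=remaining,
            )

# Hypothetical ARNs for illustration only.
remove_offline_site(
    "arn:aws:globalaccelerator::123456789012:accelerator/a1b2c3/listener/d4e5f6",
    "arn:aws:elasticloadbalancing:eu-west-1:123456789012:loadbalancer/net/site-b/0123456789abcdef",
)
----
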
34 | | -In a true split-brain scenario, where both sites are still up but network communication is down, it is possible that both |
35 | | -sites will trigger the webhook simultaneously. We guard against this by ensuring that only a single Lambda instance can be executed at |
36 | | -a given time. |
| 37 | +In a true split-brain scenario, where both sites are still up but network communication is down, it is possible that both sites will trigger the webhook simultaneously. |
| 38 | +We guard against this by ensuring that only a single Lambda instance can be executed at a given time. |
| 39 | +The logic in the AWS Lambda ensures that one site entry always remains in the loadbalancer configuration.
37 | 40 |
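One way to enforce this single-instance constraint, assuming reserved concurrency is the mechanism used, is to cap the function at one concurrent execution.
The boto3 call below is an illustrative sketch; the function name is a hypothetical placeholder.

[source,python]
----
import boto3

lambda_client = boto3.client("lambda")

# With a reserved concurrency of 1, AWS Lambda never runs two copies of the
# function in parallel; a second, simultaneous webhook call is throttled
# (HTTP 429) and can be retried by the caller instead of racing the first one.
lambda_client.put_function_concurrency(
    FunctionName="SplitBrainFencing",  # hypothetical function name
    ReservedConcurrentExecutions=1,
)
----

Combined with the earlier sketch, which refuses to empty the endpoint group, a retried invocation that arrives after a site has already been removed becomes a no-op rather than removing the last remaining entry.
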
38 | 41 | == Prerequisites |
39 | 42 |