
Commit cb12f03

Rework AWS Lambda doc to show it is required (#33462)
Closes #33461 Signed-off-by: Alexander Schwartz <[email protected]>
1 parent c165344 commit cb12f03

3 files changed: +28 -22 lines changed


docs/documentation/release_notes/topics/26_0_0.adoc

Lines changed: 3 additions & 1 deletion
@@ -81,7 +81,9 @@ The new `footer.ftl` template provides a `content` macro that is rendered at the
 
 {project_name} 26 introduces significant improvements to the recommended HA multi-site architecture, most notably:
 
-- {project_name} deployments on each site are now able to handle user requests simultaneously, therefore active/active setups are now supported.
+- {project_name} deployments are now able to handle user requests simultaneously in both sites.
+
+- Active monitoring of the connectivity between the sites is now required to update the replication between the sites in case of a failure.
 
 - The loadbalancer blueprint has been updated to use the AWS Global Accelerator as this avoids prolonged fail-over times caused by DNS caching by clients.
 

docs/documentation/upgrading/topics/changes/changes-26_0_0.adoc

Lines changed: 5 additions & 4 deletions
@@ -105,8 +105,9 @@ This is enforced by default, and can be disabled using the SPI option `spi-singl
 
 {project_name} 26 introduces significant improvements to the recommended HA multi-site architecture, most notably:
 
-- {project_name} deployments on each site are now able to handle user requests simultaneously, therefore active/active
-setups are now supported, while previous configurations which leveraged active/passive loadbalancer will continue to work.
+- {project_name} deployments are now able to handle user requests simultaneously in both sites. Previous load balancer configurations handling requests only in one site at a time will continue to work.
+
+- Active monitoring of the connectivity between the sites is now required to update the replication between the sites in case of a failure. The blueprints describe a setup with Alertmanager and AWS Lambda.
 
 - The loadbalancer blueprint has been updated to use the AWS Global Accelerator as this avoids prolonged fail-over times
 caused by DNS caching by clients.
@@ -127,8 +128,8 @@ While previous versions of the cache configurations only logged warnings when th
 Due to that, you need to set up monitoring to disconnect the two sites in case of a site failure.
 The Keycloak High Availability Guide contains a blueprint on how to set this up.
 
-. While previous LoadBalancer configurations will continue to work with {project_name}, consider upgrading
-an existing Route53 configurations to avoid prolonged failover times due to client side DNS caching.
+. While previous load balancer configurations will continue to work with {project_name}, consider upgrading
+an existing Route53 configuration to avoid prolonged failover times due to client side DNS caching.
 
 . If you have updated your cache configuration XML file with remote-store configurations, those will no longer work.
 Instead, enable the `multi-site` feature and use the `cache-remote-*` options.

docs/guides/high-availability/deploy-aws-accelerator-fencing-lambda.adoc

Lines changed: 20 additions & 17 deletions
@@ -2,38 +2,41 @@
 <#import "/templates/links.adoc" as links>
 
 <@tmpl.guide
-title="Deploy an AWS Lambda to guard against Split-Brain"
+title="Deploy an AWS Lambda to disable a non-responding site"
 summary="Building block for loadbalancer resilience"
 tileVisible="false" >
 
-This {section} explains how to reduce the impact when split-brain scenarios occur between two sites in a multi-site deployment.
+This {section} explains how to resolve a split-brain scenario between two sites in a multi-site deployment.
+It also disables replication if one site fails, so the other site can continue to serve requests.
 
 This deployment is intended to be used with the setup described in the <@links.ha id="concepts-multi-site"/> {section}.
 Use this deployment with the other building blocks outlined in the <@links.ha id="bblocks-multi-site"/> {section}.
 
 include::partials/blueprint-disclaimer.adoc[]
 
 == Architecture
-In the event of a network communication failure between the two sites in a multi-site deployment, it is no
-longer possible for the two sites to continue to replicate data between themselves and the two sites
-will become increasingly out-of-sync. As it is possible for subsequent Keycloak requests to be routed to different
-sites, this may lead to unexpected behaviour as previous updates will not have been applied to both sites.
 
-In such scenarios a quorum is commonly used to determine which sites are marked as online or offline, however as multi-site deployments only consist of two sites, this is not possible.
-Instead, we leverage "`fencing`" to ensure that when one of the sites is unable to connect to the other site, only one site remains in the loadbalancer configuration and hence only this site is able to serve subsequent users requests.
+In the event of a network communication failure between sites in a multi-site deployment, it is no longer possible for the two sites to continue to replicate data between them.
+The {jdgserver_name} is configured with a `FAIL` failure policy, which ensures consistency over availability. Consequently, all user requests are served with an error message until the failure is resolved, either by restoring the network connection or by disabling cross-site replication.
 
-Once the fencing procedure is triggered the replication between two {jdgserver_name} clusters in each site is no longer enabled and as a result the sites will be out-of-sync.
-To recover from the out-of-sync state a manual re-sync is necessary as described in <@links.ha id="operate-synchronize" />.
-This is why a site which is removed via fencing will not be re-added automatically when the network communication failure is resolved, but only after such a synchronisation using the manual procedure <@links.ha id="operate-site-online" />.
+In such scenarios, a quorum is commonly used to determine which sites are marked as online or offline.
+However, as multi-site deployments only consist of two sites, this is not possible.
+Instead, we leverage "`fencing`" to ensure that when one of the sites is unable to connect to the other site, only one site remains in the loadbalancer configuration, and hence only this site is able to serve subsequent user requests.
+
+In addition to the loadbalancer configuration, the fencing procedure disables replication between the two {jdgserver_name} clusters to allow serving user requests from the site that remains in the loadbalancer configuration.
+As a result, the sites will be out-of-sync once the replication has been disabled.
+
+To recover from the out-of-sync state, a manual re-sync is necessary as described in <@links.ha id="operate-synchronize" />.
+This is why a site which is removed via fencing will not be re-added automatically when the network communication failure is resolved. The removed site should only be re-added once the two sites have been synchronized using the outlined procedure <@links.ha id="operate-site-online" />.
 
 In this {section} we describe how to implement fencing using a combination of https://prometheus.io/docs/alerting/latest/overview/[Prometheus Alerts]
-and AWS Lambda functions. A Prometheus Alert is triggered when split-brain is detected by the {jdgserver_name} server metrics,
-which results in the Prometheus AlertManager calling the AWS Lambda based webhook. The triggered Lambda function inspects
-the current Global Accelerator configuration and removes the site reported to be offline.
+and AWS Lambda functions.
+A Prometheus Alert is triggered when split-brain is detected by the {jdgserver_name} server metrics, which results in the Prometheus Alertmanager calling the AWS Lambda based webhook.
+The triggered Lambda function inspects the current Global Accelerator configuration and removes the site reported to be offline.
 
-In a true split-brain scenario, where both sites are still up but network communication is down, it is possible that both
-sites will trigger the webhook simultaneously. We guard against this by ensuring that only a single Lambda instance can be executed at
-a given time.
+In a true split-brain scenario, where both sites are still up but network communication is down, it is possible that both sites will trigger the webhook simultaneously.
+We guard against this by ensuring that only a single Lambda instance can be executed at a given time.
+The logic in the AWS Lambda ensures that one site entry always remains in the loadbalancer configuration.
 
 == Prerequisites
 
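The commit changes documentation only and does not include the Lambda source. As a rough sketch of the behaviour the Architecture section describes (inspect the Global Accelerator configuration, drop the endpoint of the site reported offline, and never empty the endpoint group), a minimal Python handler might look as follows; the listener ARN, the `site` alert label, and the webhook payload shape are assumptions for illustration, not the blueprint's actual code:

```python
# Minimal sketch of the fencing webhook, assuming Alertmanager posts its JSON
# payload to this Lambda and that an alert label carries the endpoint (e.g.
# the ALB ARN) of the site reported offline. Placeholder values throughout.
import json

import boto3

# The Global Accelerator control-plane API is served from us-west-2.
ga = boto3.client("globalaccelerator", region_name="us-west-2")

# Hypothetical listener of the accelerator fronting the two sites.
LISTENER_ARN = "arn:aws:globalaccelerator::123456789012:accelerator/EXAMPLE/listener/EXAMPLE"


def handler(event, context):
    payload = json.loads(event["body"])
    # "site" is an assumed label name; the real alert may use another.
    offline_endpoint = payload["alerts"][0]["labels"]["site"]

    for group in ga.list_endpoint_groups(ListenerArn=LISTENER_ARN)["EndpointGroups"]:
        endpoints = group["EndpointDescriptions"]
        remaining = [
            {"EndpointId": e["EndpointId"], "Weight": e["Weight"]}
            for e in endpoints
            if e["EndpointId"] != offline_endpoint
        ]
        # Only act if the offline site is still listed, and never remove the
        # last entry: one site must always stay in the configuration.
        if remaining and len(remaining) < len(endpoints):
            ga.update_endpoint_group(
                EndpointGroupArn=group["EndpointGroupArn"],
                EndpointConfigurations=remaining,
            )
    return {"statusCode": 200}
```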
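The diff states that only a single Lambda instance may execute at a given time but does not show the mechanism. One way to achieve this is to cap the function's reserved concurrency at 1, sketched here with a placeholder function name:

```python
import boto3

# Reserving a concurrency of 1 throttles any overlapping invocation, so at
# most one instance of the fencing function runs at a time.
# "fencing-lambda" is a placeholder, not the blueprint's actual name.
boto3.client("lambda").put_function_concurrency(
    FunctionName="fencing-lambda",
    ReservedConcurrentExecutions=1,
)
```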