Keycloak health checks failing & server repeatedly restarting after second deployment following upgrade to 26.4.0 on AWS Fargate #43194
Replies: 5 comments · 25 replies
-
DB connection errors will under some circumstances cause the "Keycloak database connections async health check" to fail. From what you are showing, the problem you are having is with the "Keycloak cluster health check", which is new for the 26.4 release. cc @pruivo @ahus1 How many Keycloak instances are you trying to cluster? Can you try setting the log level to debug and look for logs from org.keycloak.infinispan.health.impl.JdbcPingClusterHealthImpl - that should give an indication of why that check is failing.
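If it helps anyone reproduce this, here is a minimal sketch of how that could be wired into a container build. The category:level override syntax for log-level / KC_LOG_LEVEL comes from the Keycloak logging guide; the category used here is simply the package of the class named above:

```dockerfile
# Sketch: keep the root logger at INFO but log the cluster health
# check classes (package of JdbcPingClusterHealthImpl) at debug.
FROM quay.io/keycloak/keycloak:26.4
ENV KC_LOG_LEVEL="INFO,org.keycloak.infinispan.health:debug"
```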
-
@pruivo @ahus1 thank you for the information. In #43194 (reply in thread) I neglected to adjust a security group mapping, which I realized because it was mentioned in #23400.
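For anyone else who lands here: the security group fix amounts to letting the tasks reach each other on the JGroups port (7800/tcp in the errors below). A minimal CloudFormation sketch, with an illustrative resource name for the tasks' security group:

```yaml
# Sketch: self-referencing ingress rule so Keycloak tasks can reach
# each other on the JGroups TCP port that appears in the errors below.
JGroupsIngress:
  Type: AWS::EC2::SecurityGroupIngress
  Properties:
    GroupId: !Ref KeycloakTaskSecurityGroup          # hypothetical SG attached to the Fargate tasks
    SourceSecurityGroupId: !Ref KeycloakTaskSecurityGroup
    IpProtocol: tcp
    FromPort: 7800
    ToPort: 7800
```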
Recap

Log Summary

After deploying again, I now see two patterns of messages. Those messages stop after 1 minute and 20 seconds. @pruivo This makes me think the DB is being updated & the way the container is stopped isn't the cause of the problem. Then I start to see a similar but different message. After that, another task starts up. Right after that new task starts up, I start to see both the above "Connect timed out" IP and a new one too.

My Current Understanding
Attempts to fix the problem

I'm trying some things - both https://www.keycloak.org/server/caching#_running_instances_on_different_networks and changing how health checks are read, as well as looking at adjusting the entire setup - and will post again tomorrow.
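For context, my reading of that caching page is that "different networks" comes down to passing JGroups address properties through the container environment. A rough sketch; the property names are my assumption from that page and should be verified against it for your Keycloak version:

```dockerfile
# Rough sketch: pass the externally reachable address/port to JGroups
# via JAVA_OPTS_APPEND. Property names (jgroups.external_addr /
# jgroups.external_port) are assumed from the linked caching guide --
# verify against that page for your version. On Fargate the address
# would need to be resolved at runtime (each task gets its own IP),
# e.g. in an entrypoint script, rather than hardcoded like this.
ENV JAVA_OPTS_APPEND="-Djgroups.external_addr=203.0.113.10 -Djgroups.external_port=7800"
```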
-
Thank you everyone for your help. I seem to have found a configuration which works and doesn't intermittently restart. For future reference, this is a working gist using Fargate + RDS / Aurora, in CloudFormation with a Dockerfile & deployment instructions: https://gist.github.com/Gregory-Ledray/b1dd6a8958c6ed224aed4838450922fb

Main changes from my post were:
These changes resulted in:
As for why not having these changes applied resulted in the exact behavior I saw, I'm not sure. I hadn't set container-level health checks in Fargate, so container health wasn't being evaluated using the HEALTHCHECK in the Dockerfile (see the sketch at the end of this comment). Tasks cycled because restarts were initiated by load balancer health checks. And the errors were due to misconfiguration. But why did those errors sometimes go away, resulting in a stable deployment? That I do not know and don't have time to investigate. Thank you again to everyone who helped!
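For anyone fixing the same gap: a container-level health check is declared on the task definition rather than in the image, since Fargate does not evaluate the Dockerfile HEALTHCHECK. A minimal CloudFormation sketch (values illustrative; it also assumes a curl binary inside the image, which the stock Keycloak image does not ship):

```yaml
# Sketch: ECS container health check. Fargate ignores the image's
# HEALTHCHECK, so it must be declared on the task definition itself.
TaskDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    # ...networking mode, CPU/memory, execution role etc. elided...
    ContainerDefinitions:
      - Name: keycloak
        Image: quay.io/keycloak/keycloak:26.4
        HealthCheck:
          # Assumes curl exists in the image (added via extra RPMs).
          Command: ["CMD-SHELL", "curl -fsS http://localhost:9000/health || exit 1"]
          Interval: 30
          Timeout: 5
          Retries: 3
          StartPeriod: 120
```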
-
I never got this to work on that version. This was probably caused by whatever caused #43561, which caused shutdown to never run; I checked the logs to confirm. I have since updated to a newer build.
-
Thank you for sharing your approach in the gist. Some notes on adding RPM packages: the docs describe an approach that might be slightly simpler, as it avoids copying over single files or an extra dependency. See the snippet just after "It is possible to install new RPMs if absolutely required..." There you could copy those extra binaries and their dependencies onto the official Keycloak image, which would make your script a lot smaller (a sketch follows at the end of this comment). It would also be less fragile to changes in our build file like, for example, changing the JDK version. Our latest images use JDK 21, while your script is using JDK 17. The docs also state how to run your custom endpoint scripts. Looking at your example, …
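To make that concrete, a sketch of the pattern the docs describe there (the package list is illustrative; curl stands in for whatever the health check script needs):

```dockerfile
# Sketch of the documented RPM approach: install the extra packages
# into a scratch rootfs on a UBI builder stage, then copy that whole
# rootfs onto the official Keycloak image.
FROM registry.access.redhat.com/ubi9 AS ubi-micro-build
RUN mkdir -p /mnt/rootfs \
 && dnf install --installroot /mnt/rootfs curl \
      --releasever 9 --setopt install_weak_deps=false --nodocs -y \
 && dnf --installroot /mnt/rootfs clean all

FROM quay.io/keycloak/keycloak:26.4
COPY --from=ubi-micro-build /mnt/rootfs /
```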
-
Hi. After upgrading to 26.4.0 I've been unable to keep Keycloak stable: the service has repeatedly reported failing health checks, and when that happens the service/task restarts.
During functional testing the service tends to work as expected, i.e. I can log in both to the admin portal and as a user.
Timeline
2025-09-25T11:15:48.330-05:00: deployed upgrade to 26.3.5

Saw this error shortly after startup. This error was NOT new with 26.3.5:

2025-09-25T11:16:25.006-05:00
    2025-09-25 16:16:25,006 ERROR [org.jgroups.protocols.TCP] (pd-bundler-8,ip-10-30-181-178-15876) ip-10-30-181-178-15876: failed sending message to 10.30.184.253:7800: java.net.SocketTimeoutException: Connect timed out

This error appeared once and then stopped. In a subsequent deployment at 2025-09-26T16:37:47.720-05:00, still on 26.3.5, this error would appear twice during startup and then stop.

2025-09-30T18:01:28.699-05:00: deployed upgrade to 26.4.0

Starting with the 26.4.0 release, I began to see the following log message accompany the above ERROR in the logs, initially preceding the above error but also appearing after & at other intervals. In the first deployment, these failing health checks occurred for 1 minute and 30 seconds and then stopped. Everything seemed to be OK.

2025-10-02T11:21:14.048-05:00: deployed with 26.4.0 for the second time

Starting in this deployment, the above health-down status was repeatedly logged over 8 minutes and 30 seconds. The "failed sending message to..." error was interspersed with these problems. From an admin perspective and an end-user perspective the service seems functional; I'm still able to log in to the admin page & to the application.
Because so many health checks failed, the service was terminated and a new one was started up. The new one had the same problem as the last one - it started up, reported unhealthy, and shut down. The service performed this loop several hundred times. I downgraded to 26.3.5 to "fix" the problem.
Does anyone have any advice on how to fix this?
Infrastructure / Setup
The service runs on AWS Fargate as a Task with Docker. The service restarts if health checks fail. The health check path is /health on port 9000. Docker configured health checks as follows:

Dockerfile
Not shown in the Dockerfile are the default admin username/password & KC_HOSTNAME.
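As a purely hypothetical reconstruction of what a health check on /health port 9000 could look like in such a Dockerfile (it assumes health endpoints are enabled and that a curl binary was added to the image, since the stock image ships neither):

```dockerfile
# Hypothetical reconstruction -- the original Dockerfile was not shown.
FROM quay.io/keycloak/keycloak:26.4
# Health endpoints are served on the management port (9000) when enabled.
ENV KC_HEALTH_ENABLED=true
# Assumes a curl binary was added to the image (not present by default).
HEALTHCHECK --interval=30s --timeout=5s --start-period=120s --retries=3 \
  CMD curl -fsS http://localhost:9000/health || exit 1
```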
What I Have Tried
The database logs show:

    LOG: could not receive data from client: Connection reset by peer

but these seem to happen during service restarts, as they only happened every 7 to 8 minutes.

Questions
1. What changed in 26.4.0 to cause these DB connection errors to result in the service being marked unhealthy?
2. Why did the second deployment of 26.4.0 cause this problem, but not the first? I presume that the first startup included migrating to the new version, so maybe something in there delayed startup and caused this issue to not manifest?