-
Notifications
You must be signed in to change notification settings - Fork 7.9k
Avoid loading entire composite role collection to add/remove items or check presence (fixes #11796) #11799
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
@ahus1 Here is the first PR for storage-related changes. |
|
Thank you for this pull request. I see that you started with a change on the JPA legacy store. I see that this is touching only a small area of the code, which is good. Even better, it doesn't change the table structure, which would otherwise need even more testing. I'd be happy if we could go without structural DB changes. Still, as it is touching the legacy store, I'll have to ask what makes this PR a good starting point to increase the realm performance. A good argument might be that it has the biggest impact overall, or that it has the best ratio of code change vs. performance impact. If there is a change in another area, I'd rather touch that other area first. If this is the PR we would start with, I ask you to open the next PR only after this one has been closed, or if that other PR would have a higher priority. This would avoid conflicts, and I would be able to focus on one PR after the other, and would work better with my time constraints given my focus on the new store. Please comment if you'd see this differently. |
Indeed. ;-)
Yes, I do feel better as well if the DB structure stays the same. I know the pain. :-)
Agreed. So this specific one was one of the very first top outliers when going through the measuring-profiling-optimizing steps (always optimize for the top most cost prohibitive operation). It also introduces the entity for the join table which will be used in the next PR - which will probably be more chunky btw.
Sure, no problem. |
pedroigor
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
model/jpa/src/main/java/org/keycloak/models/jpa/RoleAdapter.java
Outdated
Show resolved
Hide resolved
|
@ahus1 @pedroigor I saw that the CI workflow failed before the PR was updated - do I need to do something here? |
|
@davoustp - as you're contributing to this repository for the first time, a maintainer (like Pedro) will need to approve the GitHub actions to run. The previous run was https://github.com/keycloak/keycloak/actions/runs/2258568976. The Quickstart tests currently fail, and this is unrelated to your changes; this will need to be fixed in main by someone else. The base test suite showed the following failing test, and this might be related to your changes as it (at least by the name) is about roles. There is a guide on how to run those tests - if you find yourself struggling with a local setup, can give it a try and report back. |
Ok
Ah! I did not see that was just the plain arquillian tests base - dummy me.
That makes sense - because this is exactly the problem @pedroigor mentioned: I can reproduce the issue locally with the initial PR codebase, and test is now passing with the updated PR. |
|
Ok, so now it's working with Wildfly but still not with Quarkus. However, I'm struggling to debug the Quarkus process while the test is running. Any advice, @ahus1 or @pedroigor ? |
|
@ahus1 @pedroigor - I was able to reproduce the test case issue with a plain REST API call (could not find any way to debug the Quarkus server with integration tests). I was able to upgrade the Undertow deployment up to the same Hibernate version as used by Quarkus, but it seems it still working fine with Unertow, so I suspect that Hibernate settings are slightly different between the two flavours - which ones, I don't know. The core problem is that the SQL statements are produced out of order by Hibernate: the Obviously, adding a I highly suspect that this related to the way the newly introduced |
|
Hi @pedroigor and @ahus1 As I said, I reproduced the issue with the Quarkus distro, even outside the arquillian test suite. I finally managed to build the Wildfly distro (maven repo issue on my side), and run the arquillan test against it and... it passes with flying colors. So I updated the Wildfly distro to match the same exact Hibernate version, and it still passes... Now I'm puzzled. What's so special with the way Hibernate is used with the Quarkus distro? I also deeply scanned the Hibernate issues to find a matching problem as a sanity check, to no avail. I also attempted to change the definition of the So I'm starting to wonder if there would be some different settings / instrumentation / whatever into the Quarkus side of things... Any idea? |
|
@pedroigor @ahus1 - updated the PR with some changes. It's not been a bed of roses. :-/ First, I found out why there was a difference between the Wildfly and the Quarkus distro: it seems that the This explains that the tests were passing with Wildfly (and Undertow btw), because when the collection is already loaded, the executed code path is basically the existing (already tested) one. And this allowed me to see why the new code path (performing inserts directly into the BTW, I did not go into why Wildfly always loads the lazy collections, mind. There was enough on my plate already. So I went down the road to make Hibernate aware that pending statements on both I finally settled onto a basic, standard native Note that all other query spaces are left untouched. In other words, this does not trigger a full flush as an I also reviewed the deletion to use a named query to save on the I successfully ran the arquillian base tests with Undertow, Wildfly and Quarkus. Hopefully this will (finally) result into a successful build. ;-) |
model/jpa/src/main/java/org/keycloak/models/jpa/entities/CompositeRoleEntity.java
Outdated
Show resolved
Hide resolved
model/jpa/src/main/java/org/keycloak/models/jpa/RoleAdapter.java
Outdated
Show resolved
Hide resolved
model/jpa/src/main/java/org/keycloak/models/jpa/RoleAdapter.java
Outdated
Show resolved
Hide resolved
… check presence (fixes keycloak#11796)
|
Since @ahus1 mentionned the quick test issue was due to a pb in |
|
@davoustp - thanks for updating the PR, the code now looks good to me, and all relevant tests are green as well. That's great! As this PR is all about a performance improvement, could you please run a performance test and show some before/after results? I assume keycloak-benchmark CreateRealms might be the right test case here? It would also be good to know about the parameters of your test (for example, if you would go for the Quarkus distribution, and which DB you would use). Thanks! |
|
@ahus1 - sure, no problem. HostType: MacBook Pro (16 inches, 2019) Docker: Docker Desktop for Mac Benchmarking processRunning the database and Keycloak in Docker: Running Gatling scenario Benchmark BASELINE - commit
|
|
Hi @ahus1 , Just eager to know when this changes will get merged. |
Hi @sreekesh93 , this is just the first PR, right, there are some more in the pipe - but we go one at a time, avoiding merge conflicts and being focused onto each. So even though this one helps (as showed by the benchmark), there is still way to go. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @davoustp - thank you for this PR. I also did some performance tests locally, and found that adding realms is still slow, but works faster as before as the entities are not fetched from the database all the time. This is good.
I created a setup with 200 realms and then compared logging in for the first time (with no filled cache) and after the cache has been filled.
With the filled cache login is quick. When the cache is filled during the first login, the slowness is expected, but has some quirks:
With the Wildfly distribution before and after your change, this take ~1 minute to login with an unfilled cache. As that distribution doesn't do lazy loading and therefore won't touch your code, this is expected. This is also the baseline.
With the Quarkus distribution before your change, it also took ~1 minute to login. After your change it tool ~2 minutes to login. This is a regression.
I think this is due to the changed intrinsics: There are several places in the code that first check if the role is a composite role, and then fetch the composite stream. Those will hit the database twice: once to check if there are composite roles, and if there are, it is hitting it another time to actually fetch the composites. I found some places here - f110388. After those are fixed, it seems to be as fast as before (~1 min to login).
There might also be a potential regression if within a request the isComposite() is queried multiple times; still I didn't find an evidence that there is a performance impact due to that, so I wouldn't touch this.
I now wonder: did you plan changes in regarding to not calling isComposite() when the full list would be fetched in the same method in a follow-up PR?
I'm still not sure if this should be part of this PR -- I wonder if removing those would break assumptions of a different store. Please first let me hear your thoughts on this before changing this PR.
|
@sreekesh93 - there is no fixed schedule when this will be merged. @davoustp has done a great job to find a working solution here, and we're now looking at the performance. Both areas - correctness and performance - are important here, and don't allow for rushing. I'll continue to provide feedback as time permits while working on the new store for Keycloak and performance tests. For the Keycloak team, working on the new store has priority as it will allow for example zero-downtime upgrades of Keycloak and a wider range of stores. I'll try to find the time for this PR to provide my feedback here in a timely manner. |
@ahus1 Thanks for the clarification and appreciate your support. |
Thanks for investigating!
Yup
That's correct.
Exactly.
There is a chance that this occurs, because the JPA legacy implementation goes to the DB each time if the collection was not loaded in the first place. I did not see any such occurence, but that's possible.
Yes, there is line 49 at https://github.com/davoustp/keycloak/blob/KEYCLOAK-4593-investigations/model/infinispan/src/main/java/org/keycloak/models/cache/infinispan/entities/CachedRole.java#L49 that shows: As you can see here (disregarding the way children ids are retrieved), I'm not even relying onto the However I did not touch the
Well, the SPI provides a way to know whether the role is composite (= has children) without actually getting them, fine. This is interesting because, beyond the better code readability it provides, it allows SPI implementations to be optimized according to their own internal structures. Obviously, this is not the case with the JPA legacy implementation (since it can trigger a full load operation, and rely on the JPA layer to provide caching when done once already). It would be much better if this flag was actually denormalized in the In addition to this, checking the If an implementation for Regarding whether we should change the So we've got two ways here: either we do a small PR fixing this code pattern (which is actually disconnected from the changes in this PR) in addition to this one, or we integrate the few changes in this PR. Honestly, I'd go for a separate PR and link them together, as this is touching other modules than the JPA model one. WDYT? |
|
@davoustp - I gave it a thought and also presented this PR to the storage team. I want this PR to include all the changes needed for an in-par-performance, and don't depend on another PR to restore the performance. One of the reasons why I'm interested in the PR is to figure out to handle the I then compared how this behaves compared to the old version; it still seems to be slower when logging in for the first time (roughly double the time when testing with 200 realms). I assume it is pulling in too many entities in the context, and the dirty checking takes up more time. It is probably one extra dirty check per composite role that makes it so slow, still I don't see it touching the database an extra time. It might be that when Hibernate does this internally, it can do without one more dirty checking. Seeing that this is slower than before, and seeing that the Admin UI is unusable with a large number of realms (with both the old and the new way), I suggest to take a step back and have a look what else could be done. At the moment for each realm there is a new client with client roles attached to it. I wonder if the problem we see could be solved by removing that client once the admin user has set up that realm and added an admin user to that specific realm so it can be managed afterwards. So what I am suggesting is as a workflow like this for each new realm:
Although the last step is not as simple as it sounds, I wonder if that would solve your scenario: It would allow Keycloak to scale to many realm with one admin user per realm, although no overall admin user. If that would work for you, there are some things that need to be ironed out: Keycloak will be unhappy about removing the client: It won't allow you to remove a client with the suffix "-realm". A workaround is to first rename it, then remove it. After that, there are at least two places that need fixing to allow for this new situation, see fdbf876. Once this has been done, the new realm will no longer appear in the admin UI when he master realm is opened, thereby the master realm's UI is fast. The last missing piece would then how to re-add the client in case the user of the new realm locks out itself. I haven't tested that, still I don't see why this shouldn't be possible by re-adding the client with minimal roles. I understand this is heading in a very different direction from this PR and your previous work and is a big step. Still it might solve both the creating of lots of realms, and make the Admin UI (plus the REST endpoints) usable. Once we consider this path worth exploring, I'll double-check with a maintainer to get indication on that approach. |
|
@ahus1 - sorry, was off for a few days.
Ok, we're all agree then.
I agree: I find this code part brittle and fragile from a maintenance standpoint.
I see that. Actually, I already went that route when investigating (I still think the Also note that method
Hmm, I suspect this is because the I think the change suggested in comment #11799 (review) is actually more appropriate - and I would not load the collection from the
Yup, avoiding the huge Role composition graph in the first place would obviously be a better approach - even though the graph loading mechanism should be optimized (next PR) in any case. I'm not in a position to tell whether this is possible, of course - only maintainers can, I imagine. My guess is that it would be a large breaking change: the root admin account could not manage all the realms any more. I suspect that this is hardly acceptable from a functional standpoint, but we can / should still ask maintainers, of course.
No problem - if there is a more elegant and faster way of fixing the issue, why not? ;-) |
|
I've started two GitHub discussions as a spin-off from this PR:
I'd be happy if you could add your view to those. There needs to be buy-in from some maintainers and the community first before they can be attempted, so let's continue with this PR for now until it is faster enough. I'll reply on your previous comment in more details soon. |
|
Hi @ahus1 Shall we proceed with removing the isComposite-then-getCompositeRoles code pattern as discussed previously, since this looks like the most appropriate and straightforward way to keep the same performance re: UI login time? |
|
Hi @davoustp - no worries, I am quite busy myself. I really like the concepts of this PR, still I don't see how this could
Another note about the UI performance: Even when we get the UI performance as good as before the PR for lots of realms (>200), the performance is so bad that it is unusable. Therefore I plan to pause my work here and focus on the new store:
|
Emerging the JPA entity join COMPOSITE_ROLE table to allow optimized access and lookups.
Used here in the role composite collection management (to avoid trigger a full JPA load via the join table).
Closes #11796