Skip to content

Conversation

@davoustp
Copy link

@davoustp davoustp commented May 2, 2022

Emerging the JPA entity join COMPOSITE_ROLE table to allow optimized access and lookups.

Used here in the role composite collection management (to avoid trigger a full JPA load via the join table).

Closes #11796

@davoustp
Copy link
Author

davoustp commented May 2, 2022

@ahus1 Here is the first PR for storage-related changes.

@ahus1 ahus1 self-assigned this May 2, 2022
@ahus1
Copy link
Contributor

ahus1 commented May 2, 2022

Thank you for this pull request. I see that you started with a change on the JPA legacy store.

I see that this is touching only a small area of the code, which is good. Even better, it doesn't change the table structure, which would otherwise need even more testing. I'd be happy if we could go without structural DB changes.

Still, as it is touching the legacy store, I'll have to ask what makes this PR a good starting point to increase the realm performance. A good argument might be that it has the biggest impact overall, or that it has the best ratio of code change vs. performance impact. If there is a change in another area, I'd rather touch that other area first.

If this is the PR we would start with, I ask you to open the next PR only after this one has been closed, or if that other PR would have a higher priority. This would avoid conflicts, and I would be able to focus on one PR after the other, and would work better with my time constraints given my focus on the new store. Please comment if you'd see this differently.

@davoustp
Copy link
Author

davoustp commented May 2, 2022

Thank you for this pull request. I see that you started with a change on the JPA legacy store.

Indeed. ;-)

I see that this is touching only a small area of the code, which is good. Even better, it doesn't change the table structure, which would otherwise need even more testing. I'd be happy if we could go without DB changes.

Yes, I do feel better as well if the DB structure stays the same. I know the pain. :-)
This is the case will all the changes: there is no schema change, hence no schema upgrade or testing required in this area.

Still, as it is touching the legacy store, I'll have to ask what makes this PR a good starting point to increase the realm performance. A good argument might be that it has the biggest impact overall, or that it has the best ratio of code change vs. performance impact. If there is a change in another area, I'd rather touch that other area first.

Agreed.
I'm trying to balance the dependencies between changes against the self-containment, the simplicity of each change, in addition to the performance benefits that each bring.

So this specific one was one of the very first top outliers when going through the measuring-profiling-optimizing steps (always optimize for the top most cost prohibitive operation).

It also introduces the entity for the join table which will be used in the next PR - which will probably be more chunky btw.

If this is the PR we would start with, I ask you to open the next PR only after this one has been closed, or if that other PR would have a higher priority. This would avoid conflicts, and I would be able to focus on one PR after the other, and would work better with my time constraints given my focus on the new store. Please comment if you'd see this differently.

Sure, no problem.

Copy link
Contributor

@pedroigor pedroigor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davoustp I'm also wondering if the changes from this PR should also help here 86d2a89.

@davoustp
Copy link
Author

davoustp commented May 4, 2022

@ahus1 @pedroigor I saw that the CI workflow failed before the PR was updated - do I need to do something here?
I don't have any idea about what tests are run for each group.
Some are also complaining about failing dependencies, but there is no change around dependencies or versions in this PR... Not sure what to do here.

@ahus1
Copy link
Contributor

ahus1 commented May 4, 2022

@davoustp - as you're contributing to this repository for the first time, a maintainer (like Pedro) will need to approve the GitHub actions to run.

The previous run was https://github.com/keycloak/keycloak/actions/runs/2258568976.

The Quickstart tests currently fail, and this is unrelated to your changes; this will need to be fixed in main by someone else.

The base test suite showed the following failing test, and this might be related to your changes as it (at least by the name) is about roles.

Error: OR] Failures: 
Error: OR]   ClientRegistrationTest.clientWithDefaultRoles:203 
-- Expected: ["test-default-role1", "test-default-role2"] in any order
--      but: no item matches: "test-default-role2" in ["test-default-role1"]
-- [INFO] 
Error: OR] Tests run: 1950, Failures: 1, Errors: 0, Skipped: 21

There is a guide on how to run those tests - if you find yourself struggling with a local setup, can give it a try and report back.

@davoustp
Copy link
Author

davoustp commented May 4, 2022

The Quickstart tests currently fail, and this is unrelated to your changes; this will need to be fixed in main by someone else.

Ok

There is a guide on how to run those tests - if you find yourself struggling with a local setup, can give it a try and report back.

Ah! I did not see that was just the plain arquillian tests base - dummy me.
Sorry for asking, I should have pushed my investigation further before bothering you. :-(

The base test suite showed the following failing test, and this might be related to your changes as it (at least by the name) is about roles.

That makes sense - because this is exactly the problem @pedroigor mentioned: I can reproduce the issue locally with the initial PR codebase, and test is now passing with the updated PR.
So next time the GitHub actions are enabled/run, this test group should be fine.

@davoustp
Copy link
Author

davoustp commented May 4, 2022

Ok, so now it's working with Wildfly but still not with Quarkus.
I can definitely reproduce the issue locally by running:

mvn -f testsuite/integration-arquillian/pom.xml clean install \
  -Dtest=ClientRegistrationTest#clientWithDefaultRoles -Pauth-server-quarkus

However, I'm struggling to debug the Quarkus process while the test is running.
I did not find any instruction how to do so (actually found a line stating that I'm basically in a difficult situation ;-)).
Attempted to run the test against a 'remote' Keycloak (it being my own Keycloak which I can debug), but it chokes onto the management APIs it cannot find on Quarkus...

Any advice, @ahus1 or @pedroigor ?

@davoustp
Copy link
Author

davoustp commented May 5, 2022

@ahus1 @pedroigor - I was able to reproduce the test case issue with a plain REST API call (could not find any way to debug the Quarkus server with integration tests).

I was able to upgrade the Undertow deployment up to the same Hibernate version as used by Quarkus, but it seems it still working fine with Unertow, so I suspect that Hibernate settings are slightly different between the two flavours - which ones, I don't know.
But at least I was able to rule any Hibernate issue out.

The core problem is that the SQL statements are produced out of order by Hibernate: the INSERTs into KEYCLOAK_ROLEand COMPOSITE_ROLE are issued after invoking the getCompositeRoles(), which means that the newly added roles are not collected when the collection is loaded from the database.
Hence the API response does not contain the added default roles, leading to the test case failing.

Obviously, adding a em.flush()does fix the problem, but this is a code smell to me, so I'm going deeper into this to understand why Hibernate does not detect the query space collision and does not flush the pending INSERTs.

I highly suspect that this related to the way the newly introduced CompositeRoleEntity is declared, as it has no connection to the RoleEntity at object level (no exactly by the book :-) ).

@davoustp
Copy link
Author

davoustp commented May 9, 2022

Hi @pedroigor and @ahus1
I've had a hard time investigating this one, and I'm not done yet. :-/

As I said, I reproduced the issue with the Quarkus distro, even outside the arquillian test suite.
However, the problem occurs only during the very first client registration: if I loop and create some more, no more problem...

I finally managed to build the Wildfly distro (maven repo issue on my side), and run the arquillan test against it and... it passes with flying colors.

So I updated the Wildfly distro to match the same exact Hibernate version, and it still passes...
It leaves me with Undertow and Wildfly based Keycloak passing the test, and Quarkus failing, with the same Hibernate version.

Now I'm puzzled. What's so special with the way Hibernate is used with the Quarkus distro?

I also deeply scanned the Hibernate issues to find a matching problem as a sanity check, to no avail.

I also attempted to change the definition of the CompositeRoleEntity entity to make Hibernate aware of the @ManyToOne side of the relationship (and potentially trigger a proper auto flush event, even though I don't think this is necessary, to my knowledge), but again, this does not make the problem go away.

So I'm starting to wonder if there would be some different settings / instrumentation / whatever into the Quarkus side of things...

Any idea?

@davoustp
Copy link
Author

@pedroigor @ahus1 - updated the PR with some changes.

It's not been a bed of roses. :-/

First, I found out why there was a difference between the Wildfly and the Quarkus distro: it seems that the RoleEntity.compositeRoles is eagerly loaded using Wildfly and not with Quarkus.

This explains that the tests were passing with Wildfly (and Undertow btw), because when the collection is already loaded, the executed code path is basically the existing (already tested) one.

And this allowed me to see why the new code path (performing inserts directly into the COMPOSITE_ROLE table) was failing: it looks like the @ManyToMany mapping basically skips any query space detection onto the join table, as it is supposed to be entirely managed using the owning side of the association. As a result, and since the CompositeRoleEntity does not have any relation to the RoleEntity, then Hibernate auto-flush scanning is oblivious to this dependency (leads to out-of-order SQL statements).
Could be seen as a problem at Hibernate side of things, but I would not be pushing too hard on this one, as I'm somehow bending the model to avoid using the purpose-built join table management implemented by @ManyToMany.

BTW, I did not go into why Wildfly always loads the lazy collections, mind. There was enough on my plate already.

So I went down the road to make Hibernate aware that pending statements on both KEYCLOAK_ROLE and COMPOSITE_ROLE tables need to be flushed before doing the INSERT.
And there is no way (as far as I know) to do this using a plain JPA entity (as it's supposed to be automatically detected using the relationships).
No way either using a JPA named query, as this is an INSERT operation.

I finally settled onto a basic, standard native INSERT query with query spaces set to both tables. This way, Hibernate scans and flushes any pending statements onto those tables before executing the native query.

Note that all other query spaces are left untouched. In other words, this does not trigger a full flush as an EntityManager.flush() would do, or running a native query without any specified query space.

I also reviewed the deletion to use a named query to save on the EntityManager.find() call.

I successfully ran the arquillian base tests with Undertow, Wildfly and Quarkus.

Hopefully this will (finally) result into a successful build. ;-)

@ahus1 ahus1 self-requested a review May 16, 2022 09:58
@davoustp
Copy link
Author

Since @ahus1 mentionned the quick test issue was due to a pb in main, I just updated (rebased) onto latest main, hoping that this would be fixed for next run...

@ahus1 ahus1 self-requested a review May 17, 2022 09:58
@ahus1
Copy link
Contributor

ahus1 commented May 17, 2022

@davoustp - thanks for updating the PR, the code now looks good to me, and all relevant tests are green as well. That's great!

As this PR is all about a performance improvement, could you please run a performance test and show some before/after results?

I assume keycloak-benchmark CreateRealms might be the right test case here? It would also be good to know about the parameters of your test (for example, if you would go for the Quarkus distribution, and which DB you would use).

Thanks!

@davoustp
Copy link
Author

@ahus1 - sure, no problem.
I'll go for Quarkus on top of Postgres.

Host

Type: MacBook Pro (16 inches, 2019)
CPU: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz (8 cores with HT)
Memory: 32 GB 2667 MHz DDR4
DIsk: SSD 512 GB

Docker: Docker Desktop for Mac
Version: 3.6.0 (67351)
Engine: 20.10.8
CPU resources: 16 cores
Memory resources: 16 GB

Benchmarking process

Running the database and Keycloak in Docker:

docker network create keycloak
docker run -d --name postgres-keycloak \
  --network keycloak \
  -e POSTGRES_USER="postgres" \
  -e POSTGRES_PASSWORD="postgres" \
  -e POSTGRES_DB="keycloak" \
  postgres:14.2-bullseye
docker run -it --rm --name keycloak \
  --network keycloak \
  --cap-add SYS_ADMIN \
  -p 8080:8080 \
  -p 8787:8787 \
  -p 8999:8999 \
  -e KEYCLOAK_ADMIN="keycloak" \
  -e KEYCLOAK_ADMIN_PASSWORD="keycloak" \
  -e KC_DB_USERNAME="postgres" \
  -e KC_DB_PASSWORD="postgres" \
  -e DEBUG="true" \
  -e DEBUG_PORT="*:8787" \
  -e JAVA_OPTS_APPEND="-Xmx1g \
    -Dcom.sun.management.jmxremote.port=8999 -Dcom.sun.management.jmxremote.rmi.port=8999 \
    -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false \
    -Dcom.sun.management.jmxremote.local.only=false -Djava.rmi.server.hostname="$(hostname)" \
    -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/keycloak.hprof" \
  quay.io/keycloak/keycloak:999-SNAPSHOT start-dev \
    --db=postgres \
    --db-url-host=postgres-keycloak.keycloak \
    --db-url-port=5432 \
    --db-url-database=keycloak \
    --log-level=INFO

Running Gatling scenario CreateRealms from host:

./kcb.sh --scenario=keycloak.scenario.admin.CreateRealms \
  --server-url=http://localhost:8080 \
  --concurrent-users=5 \
  --measurement=90 \
  --log-http-on-failure \
  --admin-username=keycloak \
  --admin-password=keycloak

Benchmark BASELINE - commit 06b6e7ed7bcf8a1a21afbe8f73311edcb4607e8f

Overall KPIs:
image

=> 172 created realms

Response time for step create realm:

image

Full report:
createrealms-20220517104741313-baseline-main-commit-06b6e7ed7bcf8a1a21afbe8f73311edcb4607e8f.tar.gz

Benchmark PR11799 - commit ecd6d08642fdf4d64bf51d2642cbe94d7cbcbacf

Overall KPIs:

image

=> 248 created realms (+44%)

Time to reach nb of created realms by baseline:

================================================================================
2022-05-17 13:03:06                                          60s elapsed
---- Requests ------------------------------------------------------------------
> Global                                                   (OK=349    KO=0     )
> Get admin-cli token                                      (OK=176    KO=0     )
> Create realm                                             (OK=173    KO=0     )

---- Create realms -------------------------------------------------------------
          active: 5      / done: 173   
================================================================================

=> baseline result reached in 60 s

Response time for step create realm (60 s runtime to compare with baseline):

image

Full report:
createrealms-20220517110205220-PR11799-commit-ecd6d08642fdf4d64bf51d2642cbe94d7cbcbacf.tar.gz

@sreekesh93
Copy link

Hi @ahus1 , Just eager to know when this changes will get merged.

@davoustp
Copy link
Author

davoustp commented May 20, 2022

Hi @ahus1 , Just eager to know when this changes will get merged.

Hi @sreekesh93 , this is just the first PR, right, there are some more in the pipe - but we go one at a time, avoiding merge conflicts and being focused onto each. So even though this one helps (as showed by the benchmark), there is still way to go.
To give you a hint, the same benchmark run with the investigation branch containing all the changes gives roughly 400 realms created in 90 seconds, each with sub-second response time... ;-)

Copy link
Contributor

@ahus1 ahus1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @davoustp - thank you for this PR. I also did some performance tests locally, and found that adding realms is still slow, but works faster as before as the entities are not fetched from the database all the time. This is good.

I created a setup with 200 realms and then compared logging in for the first time (with no filled cache) and after the cache has been filled.

With the filled cache login is quick. When the cache is filled during the first login, the slowness is expected, but has some quirks:

With the Wildfly distribution before and after your change, this take ~1 minute to login with an unfilled cache. As that distribution doesn't do lazy loading and therefore won't touch your code, this is expected. This is also the baseline.

With the Quarkus distribution before your change, it also took ~1 minute to login. After your change it tool ~2 minutes to login. This is a regression.

I think this is due to the changed intrinsics: There are several places in the code that first check if the role is a composite role, and then fetch the composite stream. Those will hit the database twice: once to check if there are composite roles, and if there are, it is hitting it another time to actually fetch the composites. I found some places here - f110388. After those are fixed, it seems to be as fast as before (~1 min to login).

There might also be a potential regression if within a request the isComposite() is queried multiple times; still I didn't find an evidence that there is a performance impact due to that, so I wouldn't touch this.

I now wonder: did you plan changes in regarding to not calling isComposite() when the full list would be fetched in the same method in a follow-up PR?

I'm still not sure if this should be part of this PR -- I wonder if removing those would break assumptions of a different store. Please first let me hear your thoughts on this before changing this PR.

@ahus1
Copy link
Contributor

ahus1 commented May 23, 2022

@sreekesh93 - there is no fixed schedule when this will be merged. @davoustp has done a great job to find a working solution here, and we're now looking at the performance. Both areas - correctness and performance - are important here, and don't allow for rushing.

I'll continue to provide feedback as time permits while working on the new store for Keycloak and performance tests. For the Keycloak team, working on the new store has priority as it will allow for example zero-downtime upgrades of Keycloak and a wider range of stores. I'll try to find the time for this PR to provide my feedback here in a timely manner.

@sreekesh93
Copy link

@sreekesh93 - there is no fixed schedule when this will be merged. @davoustp has done a great job to find a working solution here, and we're now looking at the performance. Both areas - correctness and performance - are important here, and don't allow for rushing.

I'll continue to provide feedback as time permits while working on the new store for Keycloak and performance tests. For the Keycloak team, working on the new store has priority as it will allow for example zero-downtime upgrades of Keycloak and a wider range of stores. I'll try to find the time for this PR to provide my feedback here in a timely manner.

@ahus1 Thanks for the clarification and appreciate your support.

@davoustp
Copy link
Author

Hi @davoustp - thank you for this PR. I also did some performance tests locally, and found that adding realms is still slow, but works faster as before as the entities are not fetched from the database all the time. This is good.

Thanks for investigating!

I created a setup with 200 realms and then compared logging in for the first time (with no filled cache) and after the cache has been filled.
With the filled cache login is quick. When the cache is filled during the first login, the slowness is expected, but has some quirks:

Yup

With the Wildfly distribution before and after your change, this take ~1 minute to login with an unfilled cache. As that distribution doesn't do lazy loading and therefore won't touch your code, this is expected. This is also the baseline.

With the Quarkus distribution before your change, it also took ~1 minute to login. After your change it tool ~2 minutes to login. This is a regression.

That's correct.

I think this is due to the changed intrinsics: There are several places in the code that first check if the role is a composite role, and then fetch the composite stream. Those will hit the database twice: once to check if there are composite roles, and if there are, it is hitting it another time to actually fetch the composites.

Exactly.

Yes, that's what I found as well.

There might also be a potential regression if within a request the isComposite() is queried multiple times; still I didn't find an evidence that there is a performance impact due to that, so I wouldn't touch this.

There is a chance that this occurs, because the JPA legacy implementation goes to the DB each time if the collection was not loaded in the first place. I did not see any such occurence, but that's possible.

I now wonder: did you plan changes in regarding to not calling isComposite() when the full list would be fetched in the same method in a follow-up PR?

Yes, there is line 49 at https://github.com/davoustp/keycloak/blob/KEYCLOAK-4593-investigations/model/infinispan/src/main/java/org/keycloak/models/cache/infinispan/entities/CachedRole.java#L49 that shows:

        composites = new HashSet<>(model.getCompositeRoleIds());
        composite = !composites.isEmpty();

As you can see here (disregarding the way children ids are retrieved), I'm not even relying onto the model.isComposite() method, just implementing the SPI contract by looking at whether the composites collection is empty or not.

However I did not touch the server-spi-private/src/main/java/org/keycloak/models/utils/KeycloakModelUtils.java file that you discovered in your POC. I did change server-spi/src/main/java/org/keycloak/models/utils/RoleUtils.java though, but with a different approach.

I'm still not sure if this should be part of this PR -- I wonder if removing those would break assumptions of a different store. Please first let me hear your thoughts on this before changing this PR.

Well, the SPI provides a way to know whether the role is composite (= has children) without actually getting them, fine. This is interesting because, beyond the better code readability it provides, it allows SPI implementations to be optimized according to their own internal structures.

Obviously, this is not the case with the JPA legacy implementation (since it can trigger a full load operation, and rely on the JPA layer to provide caching when done once already). It would be much better if this flag was actually denormalized in the KEYCLOAK_ROLE table and maintained into the JPA RoleEntity itself: the flag would be loaded with the entity, and checking the flag would not trigger any DB round-trip in any case, irrespective of the lazy loading state of the collection.
But this is not what we have here (and I tried to avoid any schema change...).

In addition to this, checking the isComposite flag before actually getting the collection is an anti-pattern to me: the SPI defines getCompositeRoles(), and this is the responsibility of each implementation to provide the best optimized way of performing this operation. Why having the code using the SPI do this kind of "optimization"?

If an implementation for getCompositeRoles() has got a fast way to know whether it really has to perform the heavy-weight lookup/load/query operation, then it can do it internally, and decide to return an empty Collection / Stream right away when the role has got no child.

Regarding whether we should change the isComposite-then-getCompositeRoles code pattern wherever we find it, I'd go for at least the CachedRole small change in the Infinispan layer.
The reason is that I always suppose that the code changes can stop right after the PR is merged, so introducing a performance drop which is supposed to be fixed by a later PR which is never going to be merged is baaad. ;-)

So we've got two ways here: either we do a small PR fixing this code pattern (which is actually disconnected from the changes in this PR) in addition to this one, or we integrate the few changes in this PR.

Honestly, I'd go for a separate PR and link them together, as this is touching other modules than the JPA model one.

WDYT?

@ahus1
Copy link
Contributor

ahus1 commented May 25, 2022

@davoustp - I gave it a thought and also presented this PR to the storage team. I want this PR to include all the changes needed for an in-par-performance, and don't depend on another PR to restore the performance.

One of the reasons why I'm interested in the PR is to figure out to handle the @xxxToMany associations in a better way. While the tests show that your approach works and delivery correct results with the test suite, I find it hard to see how this would be maintainable given native query and its hints when adopted for other entities as well. I prepared a commit to explore how this could be handled in a more standard Hibernate way using a refactoring to @ManyToOne, see 1d6e0e0. It works when I add/remote composites from a role manually, I didn't run the full test suite. It uses the entity to put a small "backpack" to cache results about isComposite() and getCompsitesStream().

I then compared how this behaves compared to the old version; it still seems to be slower when logging in for the first time (roughly double the time when testing with 200 realms). I assume it is pulling in too many entities in the context, and the dirty checking takes up more time. It is probably one extra dirty check per composite role that makes it so slow, still I don't see it touching the database an extra time. It might be that when Hibernate does this internally, it can do without one more dirty checking.

Seeing that this is slower than before, and seeing that the Admin UI is unusable with a large number of realms (with both the old and the new way), I suggest to take a step back and have a look what else could be done.

At the moment for each realm there is a new client with client roles attached to it. I wonder if the problem we see could be solved by removing that client once the admin user has set up that realm and added an admin user to that specific realm so it can be managed afterwards.

So what I am suggesting is as a workflow like this for each new realm:

  • create a realm
  • add a user user to that new realm that and assign it admin permissions for that realm
  • remove the client of the realm from the master realm - this would clear the composite roles and make it fast again.

Although the last step is not as simple as it sounds, I wonder if that would solve your scenario: It would allow Keycloak to scale to many realm with one admin user per realm, although no overall admin user.

If that would work for you, there are some things that need to be ironed out:

Keycloak will be unhappy about removing the client: It won't allow you to remove a client with the suffix "-realm". A workaround is to first rename it, then remove it. After that, there are at least two places that need fixing to allow for this new situation, see fdbf876.

Once this has been done, the new realm will no longer appear in the admin UI when he master realm is opened, thereby the master realm's UI is fast.

The last missing piece would then how to re-add the client in case the user of the new realm locks out itself. I haven't tested that, still I don't see why this shouldn't be possible by re-adding the client with minimal roles.

I understand this is heading in a very different direction from this PR and your previous work and is a big step. Still it might solve both the creating of lots of realms, and make the Admin UI (plus the REST endpoints) usable.

Once we consider this path worth exploring, I'll double-check with a maintainer to get indication on that approach.

@davoustp
Copy link
Author

davoustp commented May 30, 2022

@ahus1 - sorry, was off for a few days.

@davoustp - I gave it a thought and also presented this PR to the storage team. I want this PR to include all the changes needed for an in-par-performance, and don't depend on another PR to restore the performance.

Ok, we're all agree then.

One of the reasons why I'm interested in the PR is to figure out to handle the @xxxToMany associations in a better way. While the tests show that your approach works and delivery correct results with the test suite, I find it hard to see how this would be maintainable given native query and its hints when adopted for other entities as well.

I agree: I find this code part brittle and fragile from a maintenance standpoint.

I prepared a commit to explore how this could be handled in a more standard Hibernate way using a refactoring to @manytoone, see 1d6e0e0. It works when I add/remote composites from a role manually, I didn't run the full test suite.

I see that. Actually, I already went that route when investigating (I still think the @EmbeddableId is a more elegant solution, we should probably keep this), but I did not go all the way to removing the @ManyToMany entirely - you're basically managing the many-to-many relationship yourself.
I skipped this approach because could not find a way to reconcile the @ManyToMany collection and the rest of the changes.
During this attempt, I ran into a problem when running theClientRegistrationTest#clientWithDefaultRoles test, so I ran it against your code to check if it did not have the flaws mine had, and there is also a problem here:

[INFO] 2022-05-30 09:33:38,293 WARN  [org.hibernate.engine.jdbc.spi.SqlExceptionHelper] (executor-thread-0) SQL Error: 23503, SQLState: 23503
[INFO] 2022-05-30 09:33:38,293 ERROR [org.hibernate.engine.jdbc.spi.SqlExceptionHelper] (executor-thread-0) Intégrité référentielle violation de contrainte: "FK_A63WVEKFTU8JO1PNJ81E7MCE2: PUBLIC.COMPOSITE_ROLE FOREIGN KEY(COMPOSITE) REFERENCES PUBLIC.KEYCLOAK_ROLE(ID) ('a32b171e-8f24-47e9-85fc-67b5c9923c49')"
[INFO] Referential integrity constraint violation: "FK_A63WVEKFTU8JO1PNJ81E7MCE2: PUBLIC.COMPOSITE_ROLE FOREIGN KEY(COMPOSITE) REFERENCES PUBLIC.KEYCLOAK_ROLE(ID) ('a32b171e-8f24-47e9-85fc-67b5c9923c49')"; SQL statement:
[INFO] delete from KEYCLOAK_ROLE where ID=? [23503-197]
[INFO] 2022-05-30 09:33:38,296 INFO  [org.hibernate.engine.jdbc.batch.internal.AbstractBatchImpl] (executor-thread-0) HHH000010: On release of batch it still contained JDBC statements
[INFO] 2022-05-30 09:33:38,297 ERROR [org.keycloak.services.error.KeycloakErrorHandler] (executor-thread-0) Uncaught server error: javax.persistence.PersistenceException: org.hibernate.exception.ConstraintViolationException: could not execute statement
[INFO] 	at org.hibernate.internal.ExceptionConverterImpl.convert(ExceptionConverterImpl.java:154)
[INFO] 	at org.hibernate.internal.ExceptionConverterImpl.convert(ExceptionConverterImpl.java:181)
[INFO] 	at org.hibernate.internal.ExceptionConverterImpl.convert(ExceptionConverterImpl.java:188)
[INFO] 	at org.hibernate.internal.SessionImpl.doFlush(SessionImpl.java:1406)
[INFO] 	at org.hibernate.internal.SessionImpl.flush(SessionImpl.java:1389)
[INFO] 	at org.hibernate.query.internal.NativeQueryImpl.beforeQuery(NativeQueryImpl.java:268)
[INFO] 	at org.hibernate.query.internal.AbstractProducedQuery.executeUpdate(AbstractProducedQuery.java:1694)
[INFO] 	at org.keycloak.models.jpa.JpaRealmProvider.preRemove(JpaRealmProvider.java:637)
[INFO] 	at org.keycloak.models.jpa.JpaRealmProviderFactory.onEvent(JpaRealmProviderFactory.java:87)
[INFO] 	at org.keycloak.services.DefaultKeycloakSessionFactory.publish(DefaultKeycloakSessionFactory.java:92)
[INFO] 	at org.keycloak.models.jpa.JpaRealmProvider.removeRole(JpaRealmProvider.java:384)
[INFO] 	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
[INFO] 	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
[INFO] 	at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
[INFO] 	at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
[INFO] 	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
[INFO] 	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
[INFO] 	at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
[INFO] 	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
[INFO] 	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
[INFO] 	at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497)
[INFO] 	at org.hibernate.query.spi.StreamDecorator.forEach(StreamDecorator.java:153)
[INFO] 	at org.keycloak.utils.ClosingStream.forEach(ClosingStream.java:128)
[INFO] 	at org.keycloak.models.jpa.JpaRealmProvider.removeRoles(JpaRealmProvider.java:414)
[INFO] 	at org.keycloak.storage.RoleStorageManager.removeRoles(RoleStorageManager.java:191)
[INFO] 	at org.keycloak.models.cache.infinispan.RealmCacheSession.removeRoles(RealmCacheSession.java:814)
[INFO] 	at org.keycloak.models.jpa.JpaRealmProvider.removeClient(JpaRealmProvider.java:784)
[INFO] 	at org.keycloak.models.jpa.JpaRealmProvider.removeClients(JpaRealmProvider.java:771)
[INFO] 	at org.keycloak.storage.ClientStorageManager.removeClients(ClientStorageManager.java:247)
[INFO] 	at org.keycloak.models.cache.infinispan.RealmCacheSession.removeClients(RealmCacheSession.java:578)
[INFO] 	at org.keycloak.models.jpa.JpaRealmProvider.removeRealm(JpaRealmProvider.java:175)
[INFO] 	at org.keycloak.models.cache.infinispan.RealmCacheSession.removeRealm(RealmCacheSession.java:495)
[INFO] 	at org.keycloak.services.managers.RealmManager.removeRealm(RealmManager.java:257)
[INFO] 	at org.keycloak.services.resources.admin.RealmAdminResource.deleteRealm(RealmAdminResource.java:473)

Also note that method org.keycloak.models.jpa.JpaRealmProvider.removeRole(RoleModel) makes use of a native query against COMPOSITE_ROLE - in any case this should be handled as a NamedQuery instead (the native query does not specify the query spaces, therefore it will trigger a full flush behind the scene).

I then compared how this behaves compared to the old version; it still seems to be slower when logging in for the first time (roughly double the time when testing with 200 realms). I assume it is pulling in too many entities in the context, and the dirty checking takes up more time. It is probably one extra dirty check per composite role that makes it so slow, still I don't see it touching the database an extra time. It might be that when Hibernate does this internally, it can do without one more dirty checking.

Hmm, I suspect this is because the isComposite method now actually triggers the loading of the RoleEntity instances. So even if you do not invoke the getCompositesStream() method, then the entities are still loaded and managed by Hibernate - hence the extra work to hydrate the entities in memory and flooding the context with a large amount of entities to check at flush time.

I think the change suggested in comment #11799 (review) is actually more appropriate - and I would not load the collection from the isComposite() method.

Seeing that this is slower than before, and seeing that the Admin UI is unusable with a large number of realms (with both the old and the new way), I suggest to take a step back and have a look what else could be done.

At the moment for each realm there is a new client with client roles attached to it. I wonder if the problem we see could be solved by removing that client once the admin user has set up that realm and added an admin user to that specific realm so it can be managed afterwards.

Yup, avoiding the huge Role composition graph in the first place would obviously be a better approach - even though the graph loading mechanism should be optimized (next PR) in any case.

I'm not in a position to tell whether this is possible, of course - only maintainers can, I imagine.

My guess is that it would be a large breaking change: the root admin account could not manage all the realms any more. I suspect that this is hardly acceptable from a functional standpoint, but we can / should still ask maintainers, of course.

I understand this is heading in a very different direction from this PR and your previous work and is a big step. Still it might solve both the creating of lots of realms, and make the Admin UI (plus the REST endpoints) usable.

No problem - if there is a more elegant and faster way of fixing the issue, why not? ;-)

@ahus1
Copy link
Contributor

ahus1 commented Jun 3, 2022

I've started two GitHub discussions as a spin-off from this PR:

I'd be happy if you could add your view to those.

There needs to be buy-in from some maintainers and the community first before they can be attempted, so let's continue with this PR for now until it is faster enough.

I'll reply on your previous comment in more details soon.

@sreekesh93
Copy link

Hi @ahus1 / @davoustp ,
So what are the next steps planned ? Any updates on this PR

@davoustp
Copy link
Author

Hi @ahus1
Sorry, I'm quite busy at the moment :-(
Will comment on the discussions, for sure.

Shall we proceed with removing the isComposite-then-getCompositeRoles code pattern as discussed previously, since this looks like the most appropriate and straightforward way to keep the same performance re: UI login time?

@ahus1
Copy link
Contributor

ahus1 commented Jun 23, 2022

Hi @davoustp - no worries, I am quite busy myself.

I really like the concepts of this PR, still I don't see how this could

  • lead to maintainable code with the legacy store
  • give a good performance in the admin UI
  • be isolated to the legacy store

Another note about the UI performance: Even when we get the UI performance as good as before the PR for lots of realms (>200), the performance is so bad that it is unusable.

Therefore I plan to pause my work here and focus on the new store:

  • The new store will be used with no caching at the beginning, therefore all the problems in the Keycloak's infinispan package are "solved".
  • The new JPA store stores Roles as JSON in the database, and lacks composites as a separate table. With the learnings I take from this PR, this should become a more efficient with its own entity like you propose it without loading the whole list when adding a new realm. This would be the first task.
  • To evaluate roles, I will give the recursive queries a try to avoid loading data from the database as those are supported in PostgreSQL and CockroachDB. Those will not provide the ultimate scalability we have in mind as the database will do the tree traversal, but they might be a first step into that direction. In places where we load all roles and that can't be avoided, we would return them without repeated SQL queries.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Avoid loading entire composite role collection to add/remove items or check presence

5 participants