Support for perf measurements on the Jellyfin app & initial findings #3
base: composite-test
Conversation
1. Use a pre-recorded nuget cache to speed up project restores (reduces docker build running time from 450 to 45 secs on my box).
2. Add two more application startup checkpoints (apphost & webhost initialization).
3. Improve JellyfinBench by adding a brief statistics summary.

Thanks

Tomas
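The nuget-cache trick in (1) can be sketched roughly as follows; the container name, cache path, and image tag are illustrative, not the actual scripts in this PR:

```shell
# Hypothetical sketch of pre-recording the NuGet cache for docker builds.
# Step 1 (one-time): run a restore in a throwaway container, then copy
# its package cache out to the host.
docker cp restore-container:/root/.nuget/packages ./nuget-cache

# Step 2 (every build): the Dockerfile copies the cache back in before
# "dotnet restore", e.g. with a line like
#     COPY nuget-cache/ /root/.nuget/packages/
# so the restore is served from the local cache instead of nuget.org.
docker build -t jellyfin-bench .
```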
For now I have modified the
ReadyToRun OFF, TieredCompilation OFF:

```
TOTAL |   % | RUNTIME |   % | APPHOST |   % | WEBHOST |   % | APP |   % | MODE
==================================================================================
 8407 | 100 |     167 | 100 |    2840 | 100 |    4498 | 100 | 902 | 100 | baseline
 7679 |  91 |     158 |  94 |    2736 |  96 |    3974 |  88 | 811 |  89 | r2r
 8377 |  99 |     165 |  98 |    2934 | 103 |    4420 |  98 | 858 |  95 | app-composite-avx2
 8419 | 100 |     185 | 110 |    3016 | 106 |    4362 |  96 | 856 |  94 | one-big-composite-avx2
 8257 |  98 |     161 |  96 |    2939 | 103 |    4329 |  96 | 828 |  91 | r2r-platform-composite-avx2
 8359 |  99 |     158 |  94 |    2987 | 105 |    4373 |  97 | 841 |  93 | jit-platform-composite-avx2
```
ReadyToRun OFF, TieredCompilation ON:

```
TOTAL |   % | RUNTIME |   % | APPHOST |   % | WEBHOST |   % | APP |   % | MODE
==================================================================================
 6063 | 100 |     133 | 100 |    1965 | 100 |    3362 | 100 | 603 | 100 | baseline
 6524 | 107 |     148 | 111 |    2081 | 105 |    3654 | 108 | 641 | 106 | r2r
 7548 | 124 |     171 | 128 |    2450 | 124 |    4219 | 125 | 708 | 117 | app-composite-avx2
 7559 | 124 |     173 | 130 |    2444 | 124 |    4238 | 126 | 704 | 116 | one-big-composite-avx2
 6738 | 111 |     156 | 117 |    2098 | 106 |    3843 | 114 | 641 | 106 | r2r-platform-composite-avx2
 6767 | 111 |     160 | 120 |    2219 | 112 |    3747 | 111 | 641 | 106 | jit-platform-composite-avx2
```
ReadyToRun ON, TieredCompilation OFF:

```
TOTAL |   % | RUNTIME |   % | APPHOST |   % | WEBHOST |   % | APP |   % | MODE
==================================================================================
 5861 | 100 |      91 | 100 |     924 | 100 |    4216 | 100 | 630 | 100 | baseline
 3303 |  56 |      94 | 103 |     819 |  88 |    1957 |  46 | 433 |  68 | r2r
 2747 |  46 |     127 | 139 |     794 |  85 |    1447 |  34 | 379 |  60 | app-composite-avx2
 2931 |  50 |     108 | 118 |     729 |  78 |    1671 |  39 | 423 |  67 | one-big-composite-avx2
 3126 |  53 |     138 | 151 |     739 |  79 |    1820 |  43 | 429 |  68 | r2r-platform-composite-avx2
 6035 | 102 |      91 | 100 |    1052 | 113 |    4209 |  99 | 683 | 108 | jit-platform-composite-avx2
```
ReadyToRun ON, TieredCompilation ON:

```
TOTAL |   % | RUNTIME |   % | APPHOST |   % | WEBHOST |   % | APP |   % | MODE
==================================================================================
 4251 | 100 |      77 | 100 |     731 | 100 |    2842 | 100 | 601 | 100 | baseline
 2654 |  62 |      94 | 122 |     689 |  94 |    1488 |  52 | 383 |  63 | r2r
 2535 |  59 |     113 | 146 |     747 | 102 |    1392 |  48 | 283 |  47 | app-composite-avx2
 2543 |  59 |     109 | 141 |     615 |  84 |    1503 |  52 | 316 |  52 | one-big-composite-avx2
 3008 |  70 |     104 | 135 |     714 |  97 |    1782 |  62 | 408 |  67 | r2r-platform-composite-avx2
 4993 | 117 |      86 | 111 |     904 | 123 |    3347 | 117 | 656 | 109 | jit-platform-composite-avx2
```
As additional instrumentation I added two more checkpoints to the server initialization routine, one past the initialization of the AppHost and another past the async initialization of the WebHost. The purpose was to observe whether the perf variations among the build modes affect all parts of app initialization evenly (which would indicate either a codegen issue, such as suboptimal code even in the presence of a large version bubble, or a systemic runtime issue, e.g. slower method lookup in the longer composite R2R tables), or just some part of it (indicative of a specific issue tied to a particular construct, library or runtime functionality).
The first run, with ReadyToRun and TieredCompilation both turned off, can be used as a sanity check: if everything works as expected, there should be no variation among the different build modes because ReadyToRun code is never used. The app should be somewhat slower than normal because no methods ever get optimized by tiered rejitting. The first table demonstrates this nicely; most percentages are very close to the baseline, except for the 91% outlier, but startup time is relatively noisy (the detailed tables show almost 20% fluctuation) so perhaps that run was somehow lucky. Incidentally, the tables above are based on averages across 10 runs in each build mode.
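For clarity, the percentage columns in the summary are just each mode's average expressed relative to the baseline average (baseline = 100). A minimal shell sketch using the TOTAL numbers from the first table:

```shell
# Average a series of per-run measurements (the summary averages 10 runs
# per build mode), then express a mode's average relative to baseline.
avg() {
    local sum=0 n=0
    for v in "$@"; do sum=$((sum + v)); n=$((n + 1)); done
    echo $((sum / n))
}

baseline_total=8407   # baseline TOTAL from the first table
r2r_total=7679        # r2r TOTAL from the first table

pct=$(( r2r_total * 100 / baseline_total ))
echo "r2r TOTAL: ${pct}% of baseline"   # prints 91, matching the table
```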
An interesting bit that stands out in the second run (ReadyToRun OFF, TieredCompilation ON) is the slower performance of the composite build variants. This may indicate that we unnecessarily load the composite image even when ReadyToRun is turned OFF; while a niche scenario, this might be an easy .NET 7 perf issue to fix in the CoreCLR runtime.
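One way to probe that hypothesis directly would be to toggle ReadyToRun at run time rather than at build time; a sketch assuming the standard CoreCLR configuration knobs (the app entry point name is a placeholder):

```shell
# Disable consumption of ReadyToRun code at run time via the standard
# CoreCLR environment knob (spelled DOTNET_ReadyToRun on .NET 6+,
# COMPlus_ReadyToRun on earlier runtimes), then check whether the
# composite images still get mapped into the process.
export DOTNET_ReadyToRun=0
export DOTNET_TieredCompilation=1
dotnet Jellyfin.dll   # placeholder for the actual app entry point
```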
For ReadyToRun ON, I believe the perf difference between the baseline and plain r2r indicates that startup runs lots of application code; it's not dominated by runtime code. As a next step I plan to work with the diagnostics team on figuring out how to use EventPipe as the Linux replacement for ETW to obtain lists of methods being jitted and run in the app, and to use that to validate that we're not, for example, unnecessarily JITting R2R-compiled code in the composite cases. The differences between TieredCompilation ON and OFF seem to show that composite R2R code is "almost as good as optimized tiered code but not quite". The baseline (JITted) case is much slower without tiered compilation because all methods remain unoptimized. The plain r2r mode is still limited by assembly versioning boundaries. The composite version using the large version bubble is almost as fast without tiering as with it, but tiering still seems to improve perf by 10% or so.
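As a sketch of what that EventPipe collection might look like, assuming the dotnet-trace global tool (the PID, keyword, and file names are illustrative):

```shell
# Collect JIT-related runtime events over EventPipe with dotnet-trace,
# the cross-platform counterpart of ETW collection. 0x10 is the CLR
# "JIT" keyword; 1234 stands in for the Jellyfin server's PID.
dotnet-trace collect \
    --process-id 1234 \
    --providers 'Microsoft-Windows-DotNETRuntime:0x10:4' \
    --output jellyfin-jit.nettrace

# Convert for viewing, e.g. in speedscope.app:
dotnet-trace convert jellyfin-jit.nettrace --format speedscope
```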
The fundamental outlier is the last line, the JITted app against the composite framework. This is always slower than both the baseline and all the other composite modes, which doesn't make much sense to me at this moment; I believe this to be the most important one to investigate. The good news is that I believe this to be in accordance with Richard's original measurements. The time it takes to load the composite PE images doesn't seem to be contributing to the slowness: we see that the RUNTIME checkpoint (beginning of Main) is reached in basically the same time in the baseline and in the last line (91 / 91 msecs with TC off, 77 / 86 msecs with TC on), and actually much faster than in the other composite modes. In both modes the slowdown is more or less evenly spread across the three later checkpoints (apphost initialization, webhost initialization, total startup time), suggestive of either a codegen or a runtime issue. Interestingly enough, the slowdown is even more pronounced when tiered compilation is turned on, which might suggest a codegen issue (suboptimal codegen for calls into the composite framework).
Sadly I was unable to reproduce the measurements on my newly installed physical Linux box: after fixing various issues in the scripts, I found out to my dismay that my Dell 5600 box from the Midori times is apparently too old and doesn't support AVX2.
@richlander - these are my initial findings; would there be a way to share this PR with the Crossgen2 team (basically David, Manish, Ivan, maybe JanV and JanK)? I assume I'll add more data once I manage to get it based on the initial analysis. It would be great if you could try out the JellyfinBench app on a Linux box when you have a chance, to see whether we're in agreement regarding the numbers - at first glance they seem somewhat similar to yours, and after all, getting precise numbers is hard. For now I haven't spent much time polishing the command-line interface of the JellyfinBench app; in your case it should suffice to run it without parameters using the "dotnet" command (I'm normally using the one from my parallel runtime repo clone) from the root of the jellyfin repo. As described above, you need to populate a crossgen2 drop in the
For the UseTieredCompilation and UseReadyToRun flags, these are defined as constants in the JellyfinBench source code (

Thanks

Tomas
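For reference, outside of JellyfinBench the two switches correspond to ordinary MSBuild properties that can also be passed straight to dotnet publish (the project name and runtime identifier here are illustrative):

```shell
# Publish with R2R precompilation on and tiered compilation off,
# mirroring the "ReadyToRun ON, TieredCompilation OFF" configuration.
dotnet publish Jellyfin.Server \
    -c Release \
    -r linux-x64 \
    -p:PublishReadyToRun=true \
    -p:TieredCompilation=false
```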
Gist with the detailed results in case anyone fancies a closer look: https://gist.github.com/trylek/9b24611e41fe79bb973f7e6a2e9ee27d
Yes, you can share this PR with whomever you want. It's a public page on the internet. |
OK, thanks for clarifying, I probably just got confused by the fact that the GitHub web page only suggests you and me when I tried to @-mention other people in the PR. CC-ing @dotnet/crossgen-contrib right now to share this more broadly. If that doesn't work, I'll just post the link to the PR in the CG2 chat. |
I've made a few changes to the JellyfinBench app in order to make it more versatile and simpler to run. Those changes can be found in the CoreCLRVMPerfExercises branch of Tomas's fork of Rich's Jellyfin repository. Now, after running several tests with the latest nightly builds of the .NET runtime, we have acquired some promising-looking results, which I'm displaying here.
Thanks Ivan for sharing the latest measurements. I believe they indicate that the perf outlier Rich and I observed last year in the configuration "JITted app + composite framework" is no longer there. I have also double-checked with Ivan that the "platform composite" means .NET framework + ASP.NET. While we no longer see any perf regression specific to the composite platform, according to this particular test the composite platform doesn't outperform the separate R2R-compiled framework / ASP.NET assemblies either, all diffs between the baseline and JITted app + composite framework are now below noise level. |
To achieve reproducibility and transparency of Jellyfin benchmarking I have written a simple C# app JellyfinBench that uses docker commands to build and run the app in various modes. I'm including my initial results in the PR thread below.
Thanks
Tomas