trylek

@trylek trylek commented Aug 29, 2021

To make Jellyfin benchmarking reproducible and transparent, I have written a simple C# app, JellyfinBench, that uses docker commands to build and run the app in various modes. I'm including my initial results in the PR thread below.

Thanks

Tomas

@trylek
Author

trylek commented Aug 29, 2021

For now I have modified the PublishJellyfinServer.sh in three ways:

  1. It doesn't pull down an official drop of Crossgen2; instead it uses a manual drop that must be put under the crossgen2 folder in the Jellyfin repo clone. The preview version was crashing during the framework R2R build due to some problem with opening portable PDB debug information in the MSIL files; a freshly built tar.gz from the runtime repo installers seems to work fine.

  2. The script looks for the file nuget-cache.tar in the root folder of the Jellyfin repo clone and unpacks its contents relative to the directory root /. As a preparatory step I deleted everything from the Dockerfile script past the call to PublishJellyfinServer.sh, built the image (docker build .), opened the container with the bash shell and manually ran tar -cf nuget-cache.tar /root/.nuget in the root folder. This removes the need to repeatedly download all the packages during project restoration. On my box it speeds up the build about 10 times (it reduced the average running time from 450 s, or 7.5 minutes, to about 45 seconds).

  3. All parameters affecting the app build mode are defined as ARG items so that they can be passed to docker image build using --build-arg; this is what the JellyfinBench tool does. This doesn't include the UseTieredCompilation and UseReadyToRun flags, which are passed directly as the environment variables COMPlus_TieredCompilation and COMPlus_ReadyToRun using the --env command-line argument to the docker run command.
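
The split between build-time and run-time settings in item 3 could be sketched roughly as below. This is a Python illustration, not the JellyfinBench C# code: the ARG names (`COMPOSITE`, `USE_AVX2`) and the image tag are hypothetical placeholders, and the `1`/`0` values are assumed to follow the usual convention for the two `COMPlus_` switches. Only `--build-arg`, `--env` and the two environment variable names come from the description above.

```python
def docker_build_command(image_tag, build_args):
    """Return the argv list for building the image with the given ARG values."""
    command = ["docker", "build", "-t", image_tag]
    for name, value in sorted(build_args.items()):
        command += ["--build-arg", f"{name}={value}"]
    command.append(".")  # build context: root of the Jellyfin repo clone
    return command


def docker_run_command(image_tag, tiered_compilation, ready_to_run):
    """Return the argv list for running the image with the two runtime switches."""
    env = {
        # Assumption: 1 enables and 0 disables each switch.
        "COMPlus_TieredCompilation": "1" if tiered_compilation else "0",
        "COMPlus_ReadyToRun": "1" if ready_to_run else "0",
    }
    command = ["docker", "run", "--rm"]
    for name, value in env.items():
        command += ["--env", f"{name}={value}"]
    command.append(image_tag)
    return command


if __name__ == "__main__":
    # Hypothetical ARG names and image tag, for illustration only.
    build = docker_build_command("jellyfin-bench", {"COMPOSITE": "true", "USE_AVX2": "true"})
    run = docker_run_command("jellyfin-bench", tiered_compilation=False, ready_to_run=True)
    print(" ".join(build))
    print(" ".join(run))
```

Keeping the two runtime switches out of the image means one built image can be measured in all four ReadyToRun / TieredCompilation combinations without a rebuild.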

@trylek
Author

trylek commented Aug 29, 2021

ReadyToRun OFF, TieredCompilation OFF:

TOTAL |  %  | RUNTIME |  %  | APPHOST |  %  | WEBHOST |  %  |   APP   |  %  | MODE
==================================================================================
 8407 | 100 |     167 | 100 |    2840 | 100 |    4498 | 100 |     902 | 100 | baseline
 7679 |  91 |     158 |  94 |    2736 |  96 |    3974 |  88 |     811 |  89 | r2r
 8377 |  99 |     165 |  98 |    2934 | 103 |    4420 |  98 |     858 |  95 | app-composite-avx2
 8419 | 100 |     185 | 110 |    3016 | 106 |    4362 |  96 |     856 |  94 | one-big-composite-avx2
 8257 |  98 |     161 |  96 |    2939 | 103 |    4329 |  96 |     828 |  91 | r2r-platform-composite-avx2
 8359 |  99 |     158 |  94 |    2987 | 105 |    4373 |  97 |     841 |  93 | jit-platform-composite-avx2

@trylek
Author

trylek commented Aug 29, 2021

ReadyToRun OFF, TieredCompilation ON:

TOTAL |  %  | RUNTIME |  %  | APPHOST |  %  | WEBHOST |  %  |   APP   |  %  | MODE
==================================================================================
 6063 | 100 |     133 | 100 |    1965 | 100 |    3362 | 100 |     603 | 100 | baseline
 6524 | 107 |     148 | 111 |    2081 | 105 |    3654 | 108 |     641 | 106 | r2r
 7548 | 124 |     171 | 128 |    2450 | 124 |    4219 | 125 |     708 | 117 | app-composite-avx2
 7559 | 124 |     173 | 130 |    2444 | 124 |    4238 | 126 |     704 | 116 | one-big-composite-avx2
 6738 | 111 |     156 | 117 |    2098 | 106 |    3843 | 114 |     641 | 106 | r2r-platform-composite-avx2
 6767 | 111 |     160 | 120 |    2219 | 112 |    3747 | 111 |     641 | 106 | jit-platform-composite-avx2

@trylek
Author

trylek commented Aug 29, 2021

ReadyToRun ON, TieredCompilation OFF:

TOTAL |  %  | RUNTIME |  %  | APPHOST |  %  | WEBHOST |  %  |   APP   |  %  | MODE
==================================================================================
 5861 | 100 |      91 | 100 |     924 | 100 |    4216 | 100 |     630 | 100 | baseline
 3303 |  56 |      94 | 103 |     819 |  88 |    1957 |  46 |     433 |  68 | r2r
 2747 |  46 |     127 | 139 |     794 |  85 |    1447 |  34 |     379 |  60 | app-composite-avx2
 2931 |  50 |     108 | 118 |     729 |  78 |    1671 |  39 |     423 |  67 | one-big-composite-avx2
 3126 |  53 |     138 | 151 |     739 |  79 |    1820 |  43 |     429 |  68 | r2r-platform-composite-avx2
 6035 | 102 |      91 | 100 |    1052 | 113 |    4209 |  99 |     683 | 108 | jit-platform-composite-avx2

@trylek
Author

trylek commented Aug 29, 2021

ReadyToRun ON, TieredCompilation ON:

TOTAL |  %  | RUNTIME |  %  | APPHOST |  %  | WEBHOST |  %  |   APP   |  %  | MODE
==================================================================================
 4251 | 100 |      77 | 100 |     731 | 100 |    2842 | 100 |     601 | 100 | baseline
 2654 |  62 |      94 | 122 |     689 |  94 |    1488 |  52 |     383 |  63 | r2r
 2535 |  59 |     113 | 146 |     747 | 102 |    1392 |  48 |     283 |  47 | app-composite-avx2
 2543 |  59 |     109 | 141 |     615 |  84 |    1503 |  52 |     316 |  52 | one-big-composite-avx2
 3008 |  70 |     104 | 135 |     714 |  97 |    1782 |  62 |     408 |  67 | r2r-platform-composite-avx2
 4993 | 117 |      86 | 111 |     904 | 123 |    3347 | 117 |     656 | 109 | jit-platform-composite-avx2

@trylek
Author

trylek commented Aug 29, 2021

As yet another piece of instrumentation I added two more checkpoints to the server initialization routine, one past the initialization of the AppHost and another past the async initialization of the WebHost. The purpose was to observe whether the perf variations among the various build modes evenly affect all parts of the app initialization (which would indicate either a codegen issue, like suboptimal code even in the presence of a large version bubble, or a systemic runtime issue, e.g. slower method lookup in the longer composite R2R tables), or just some part of it (indicative of a specific issue tied to a certain construct, library or runtime functionality).
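
The checkpoint idea can be sketched as follows. This is a minimal Python illustration, not the actual C# instrumentation in the Jellyfin server; the class and method names are hypothetical. It assumes the table columns are per-phase durations, which is consistent with the posted rows (for example 167 + 2840 + 4498 + 902 = 8407 in the first baseline row).

```python
import time


class StartupCheckpoints:
    """Records named checkpoints and reports the time spent in each phase."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._start = clock()
        self._marks = []  # list of (name, seconds elapsed since start)

    def mark(self, name):
        """Record that the named checkpoint has just been reached."""
        self._marks.append((name, self._clock() - self._start))

    def phases_ms(self):
        """Return {name: milliseconds since the previous checkpoint}."""
        phases = {}
        previous = 0.0
        for name, elapsed in self._marks:
            phases[name] = round((elapsed - previous) * 1000)
            previous = elapsed
        return phases


if __name__ == "__main__":
    checkpoints = StartupCheckpoints()
    for name in ("RUNTIME", "APPHOST", "WEBHOST", "APP"):
        time.sleep(0.01)  # stand-in for the real initialization work
        checkpoints.mark(name)
    phases = checkpoints.phases_ms()
    print(phases, "TOTAL =", sum(phases.values()))
```

Reporting per-phase deltas rather than cumulative times makes it easier to see whether a slowdown is spread across all phases or concentrated in one.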

@trylek
Author

trylek commented Aug 29, 2021

The first run with ReadyToRun and TieredCompilation both turned off can be used as a "sanity check": if everything works fine, there should be no variations among the different build modes as ReadyToRun code is not used. The app should be somewhat slower than normal because no methods ever get optimized by tiered rejitting. The first table demonstrates this nicely: most percentages are very close to the baseline, with the exception of the 91% outlier in the r2r row. The startup time is relatively noisy, though; the detailed tables show almost 20% fluctuation, so perhaps this run was somehow lucky. Btw the above tables are based on averages across 10 runs in each build mode.
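
For anyone who wants to re-check the percentage columns, here is a small Python sketch that parses rows in the table format used above and recomputes each value relative to the baseline row. Truncating integer division reproduces the rows shown here; whether JellyfinBench itself truncates, or computes from unrounded averages, is an assumption.

```python
# First three rows of the "ReadyToRun OFF, TieredCompilation OFF" table.
TABLE = """
 8407 | 100 |     167 | 100 |    2840 | 100 |    4498 | 100 |     902 | 100 | baseline
 7679 |  91 |     158 |  94 |    2736 |  96 |    3974 |  88 |     811 |  89 | r2r
 8377 |  99 |     165 |  98 |    2934 | 103 |    4420 |  98 |     858 |  95 | app-composite-avx2
"""


def parse_rows(table_text):
    """Parse 'time | % | time | % | ... | mode' lines into (mode, [times])."""
    rows = []
    for line in table_text.strip().splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        mode = cells[-1]
        times = [int(cell) for cell in cells[0:-1:2]]  # every other cell is a time
        rows.append((mode, times))
    return rows


def percent_of_baseline(rows):
    """Return {mode: [percent per column]} relative to the first (baseline) row."""
    baseline = rows[0][1]
    return {
        mode: [100 * value // base for value, base in zip(times, baseline)]
        for mode, times in rows
    }


if __name__ == "__main__":
    for mode, percents in percent_of_baseline(parse_rows(TABLE)).items():
        print(f"{mode:20} {percents}")
```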

@trylek
Author

trylek commented Aug 29, 2021

An interesting bit that stands out in the second run (ReadyToRun OFF, TieredCompilation ON) is the slower performance of composite build variants. This may indicate that we unnecessarily load the composite image even when ReadyToRun is turned OFF and, while a niche scenario, this might be an easy .NET 7 perf issue to fix in the CoreCLR runtime.

@trylek
Author

trylek commented Aug 29, 2021

For ReadyToRun ON, I believe that the perf difference between the baseline and plain r2r indicates that the startup is running lots of application code; it's not dominated by runtime code. As a next step I plan to work with the diagnostic team on figuring out how to use the event pipe as a Linux replacement for ETW to get lists of the methods being jitted and run in the app, and to use them to validate that we're not e.g. unnecessarily JITting some R2R-compiled code in the composite cases.

The differences between TieredCompilation ON and OFF seem to show how composite R2R code is "almost as good as optimized tiered code but not quite". The baseline (JITted case) is much slower without tiered compilation as all methods remain unoptimized. The plain r2r mode is still limited due to assembly versioning boundaries. The composite version using the large version bubble is almost as fast without tiering as with it, but tiering still seems to be improving perf by 10% or so.
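
The "10% or so" figure can be checked against the TOTAL column of the two ReadyToRun ON tables. The following is just a Python sketch of that arithmetic, using values copied from the tables above; the function name is hypothetical.

```python
# TOTAL startup (ms) with ReadyToRun ON, copied from the two tables above.
TOTAL_TC_OFF = {
    "baseline": 5861,
    "r2r": 3303,
    "app-composite-avx2": 2747,
    "one-big-composite-avx2": 2931,
}
TOTAL_TC_ON = {
    "baseline": 4251,
    "r2r": 2654,
    "app-composite-avx2": 2535,
    "one-big-composite-avx2": 2543,
}


def tiering_gain_percent(tc_off, tc_on):
    """Return {mode: percent of startup time saved by turning tiering on}."""
    return {
        mode: round(100 * (tc_off[mode] - tc_on[mode]) / tc_off[mode], 1)
        for mode in tc_off
    }


if __name__ == "__main__":
    for mode, gain in tiering_gain_percent(TOTAL_TC_OFF, TOTAL_TC_ON).items():
        print(f"{mode:24} {gain:5.1f}% faster with tiering")
```

With these numbers the two composite modes gain roughly 8% and 13% from tiering, while the JITted baseline gains around 27%, which matches the reading above.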

@trylek
Author

trylek commented Aug 29, 2021

The fundamental outlier is the last line, JITted app against composite framework. With ReadyToRun ON this is slower than both the baseline and all the other composite modes, which doesn't make much sense to me at this moment. I believe this to be the most important one to investigate; the good news is that I believe this to be in accordance with Richard's original measurements.

Apparently the time it takes to load the composite PE images doesn't contribute to the slowness: the RUNTIME checkpoint (beginning of Main) is reached in basically the same time in the baseline and in the last line (91 / 91 msecs with TC off, 77 / 86 msecs with TC on), and actually much faster than in the other composite modes.

In both modes we see that the slowdown is more or less evenly spread between the three later checkpoints (apphost initialization, webhost initialization, total startup time), which is suggestive of either a codegen or a runtime issue. Interestingly enough, the slowdown is even more pronounced when tiered compilation is turned on, which might suggest some codegen issue (suboptimal codegen for calls to the composite framework).

@trylek
Author

trylek commented Aug 29, 2021

Sadly enough, I was unable to reproduce the measurements on my newly installed physical Linux box: after fixing the various issues in the scripts I found out to my dismay that my Dell 5600 box from the Midori times is apparently too old and doesn't support AVX2.

@trylek
Author

trylek commented Aug 29, 2021

@richlander - these are my initial findings, would there be a way to share this PR with the Crossgen2 team (basically David, Manish, Ivan, maybe JanV and JanK)? I assume I'll add more data once I manage to get it based on the initial analysis. It would be great if you could try out the JellyfinBench app on a Linux box when you have a chance, to see whether we're in agreement regarding the numbers. At first glance I think the numbers seem somewhat similar to yours, and after all getting precise numbers is hard.

For now I haven't spent too much time polishing the command-line interface of the JellyfinBench app. In your case it should suffice to run it without parameters using the "dotnet" command (I'm normally using the one from my parallel runtime repo clone) from the root of the jellyfin repo. As described above, you need to populate a crossgen2 drop in the crossgen2 subfolder of the jellyfin repo clone and pack the nuget cache. Maybe this could be semi-automated in the PublishJellyfinServer.sh script, but I haven't tried it out yet.

The UseTieredCompilation and UseReadyToRun flags are defined as constants in the JellyfinBench source code (Program.cs); you need to edit it to alter their values. Similarly you can edit the file to change the list or parameters of the build modes to use and / or the number of iterations to execute. I used 10 and 50 during development.

Thanks

Tomas

@trylek
Author

trylek commented Aug 29, 2021

Gist with the detailed results in case anyone fancies a more detailed look:

https://gist.github.com/trylek/9b24611e41fe79bb973f7e6a2e9ee27d

@richlander
Owner

Yes, you can share this PR with whomever you want. It's a public page on the internet.

@trylek
Author

trylek commented Aug 31, 2021

OK, thanks for clarifying, I probably just got confused by the fact that the GitHub web page only suggests you and me when I tried to @-mention other people in the PR. CC-ing @dotnet/crossgen-contrib right now to share this more broadly. If that doesn't work, I'll just post the link to the PR in the CG2 chat.

@ivdiazsa

I've made a few changes to the JellyfinBench app, in order to make it more versatile and simpler to run. Those changes can be found in the CoreCLRVMPerfExercises branch attached to Tomas' fork of Rich's Jellyfin repository fork.

Now, after running several tests with the latest nightly builds of the .NET runtime, we acquired some promising-looking results, which I'm displaying here.

TOTAL |  %  | RUNTIME |  %  | APPHOST |  %  | WEBHOST |  %  |   APP   |  %  | MODE
==================================================================================
 5260 | 100 |     126 | 100 |    1687 | 100 |    2819 | 100 |     628 | 100 | baseline
 5231 |  99 |     129 | 102 |    1697 | 100 |    2774 |  98 |     631 | 100 | r2r
 5223 |  99 |     120 |  95 |    1698 | 100 |    2775 |  98 |     630 | 100 | app-composite-avx2
 5220 |  99 |     122 |  96 |    1684 |  99 |    2787 |  98 |     627 |  99 | one-big-composite-avx2
 5242 |  99 |     125 |  99 |    1690 | 100 |    2798 |  99 |     629 | 100 | r2r-platform-composite-avx2
 5225 |  99 |     125 |  99 |    1670 |  98 |    2806 |  99 |     624 |  99 | jit-platform-composite-avx2
 3313 |  62 |      63 |  50 |     612 |  36 |    2211 |  78 |     427 |  67 | baseline-usereadytorun
 1644 |  31 |      68 |  53 |     395 |  23 |     949 |  33 |     232 |  36 | r2r-usereadytorun
 1333 |  25 |      66 |  52 |     344 |  20 |     740 |  26 |     183 |  29 | app-composite-avx2-usereadytorun
 1302 |  24 |      67 |  53 |     327 |  19 |     733 |  26 |     175 |  27 | one-big-composite-avx2-usereadytorun
 1305 |  24 |      63 |  50 |     333 |  19 |     736 |  26 |     173 |  27 | r2r-platform-composite-avx2-usereadytorun
 3321 |  63 |      70 |  55 |     613 |  36 |    2209 |  78 |     429 |  68 | jit-platform-composite-avx2-usereadytorun
 4106 |  78 |     113 |  89 |    1312 |  77 |    2201 |  78 |     480 |  76 | baseline-usetieredcompilation
 4144 |  78 |     117 |  92 |    1346 |  79 |    2197 |  77 |     484 |  77 | r2r-usetieredcompilation
 4109 |  78 |     117 |  92 |    1338 |  79 |    2173 |  77 |     481 |  76 | app-composite-avx2-usetieredcompilation
 4096 |  77 |     113 |  89 |    1332 |  78 |    2172 |  77 |     479 |  76 | one-big-composite-avx2-usetieredcompilation
 4107 |  78 |     118 |  93 |    1337 |  79 |    2171 |  77 |     481 |  76 | r2r-platform-composite-avx2-usetieredcompilation
 4119 |  78 |     116 |  92 |    1315 |  77 |    2206 |  78 |     482 |  76 | jit-platform-composite-avx2-usetieredcompilation
 2561 |  48 |      62 |  49 |     506 |  29 |    1657 |  58 |     336 |  53 | baselineusereadytorun-and-tieredcompilation
 1553 |  29 |      67 |  53 |     379 |  22 |     900 |  31 |     207 |  32 | r2rusereadytorun-and-tieredcompilation
 1296 |  24 |      66 |  52 |     329 |  19 |     733 |  26 |     168 |  26 | app-composite-avx2usereadytorun-and-tieredcompilation
 1267 |  24 |      66 |  52 |     322 |  19 |     716 |  25 |     163 |  25 | one-big-composite-avx2usereadytorun-and-tieredcompilation
 1290 |  24 |      70 |  55 |     326 |  19 |     727 |  25 |     167 |  26 | r2r-platform-composite-avx2usereadytorun-and-tieredcompilation
 2573 |  48 |      68 |  53 |     504 |  29 |    1665 |  59 |     336 |  53 | jit-platform-composite-avx2usereadytorun-and-tieredcompilation

@trylek
Author

trylek commented Mar 28, 2022

Thanks Ivan for sharing the latest measurements. I believe they indicate that the perf outlier Rich and I observed last year in the configuration "JITted app + composite framework" is no longer there. I have also double-checked with Ivan that "platform composite" means .NET framework + ASP.NET. While we no longer see any perf regression specific to the composite platform, according to this particular test the composite platform doesn't outperform the separately R2R-compiled framework / ASP.NET assemblies either; all diffs between the baseline and JITted app + composite framework are now below noise level.
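
The "below noise level" reading can be checked directly from the TOTAL column of Ivan's table. The Python sketch below compares the baseline row with the jit-platform-composite-avx2 row for each of the four mode suffixes; the 1% threshold in the final check is only an illustrative assumption, not a measured noise figure.

```python
# TOTAL startup (ms) from Ivan's table, keyed by mode suffix:
# (baseline row, jit-platform-composite-avx2 row).
TOTALS = {
    "(no suffix)": (5260, 5225),
    "usereadytorun": (3313, 3321),
    "usetieredcompilation": (4106, 4119),
    "usereadytorun-and-tieredcompilation": (2561, 2573),
}


def relative_diff_percent(baseline, other):
    """Signed difference of `other` from `baseline`, in percent."""
    return 100.0 * (other - baseline) / baseline


if __name__ == "__main__":
    for suffix, (baseline, jit_composite) in TOTALS.items():
        diff = relative_diff_percent(baseline, jit_composite)
        print(f"{suffix:40} {diff:+.2f}%")
    # Illustrative threshold only: every pair differs by less than 1%.
    print(all(abs(relative_diff_percent(b, o)) < 1.0 for b, o in TOTALS.values()))
```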
