Support for perf measurements on the Jellyfin app & initial findings #3

trylek · 2021-08-29T19:19:13Z

To achieve reproducibility and transparency of Jellyfin benchmarking I have written a simple C# app JellyfinBench that uses docker commands to build and run the app in various modes. I'm including my initial results in the PR thread below.

Thanks

Tomas

1) Use a pre-recorded nuget cache to speed up project restores (reduces docker build running time from 450 to 45 secs on my box).
2) Add two more application startup checkpoints (apphost & webhost initialization).
3) Improve JellyfinBench by adding a brief statistics summary.

Thanks

Tomas

trylek · 2021-08-29T19:30:13Z

For now I have modified the PublishJellyfinServer.sh in three ways:

1) It doesn't pull down an official drop of Crossgen2; instead it uses a manual drop that must be placed under the crossgen2 folder in the Jellyfin repo clone. The preview version was crashing during the framework R2R build due to a problem opening portable PDB debug information in the MSIL files; a freshly built tar.gz from the runtime repo installers seems to work fine.
2) The script looks for the file nuget-cache.tar in the root folder of the Jellyfin repo clone and unpacks its contents relative to the root directory /. As a preparatory step I deleted everything from the Dockerfile past the call to PublishJellyfinServer.sh, built the image (docker build .), opened the container with a bash shell and manually ran tar -cf nuget-cache.tar /root/.nuget in the root folder. This removes the need to repeatedly download all the packages during project restore. On my box it speeds up the build about 10 times (the average running time dropped from 450 s, or 7.5 minutes, to about 45 seconds).
3) All parameters affecting the app build mode are defined as ARG items so that they can be passed to the docker image build using --build-arg; this is what the JellyfinBench tool does. This doesn't include the UseTieredCompilation and UseReadyToRun flags, which are passed directly as the environment variables COMPlus_TieredCompilation and COMPlus_ReadyToRun using the --env command-line argument to the docker run command.
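The split between build-time and run-time switches described above can be sketched roughly as follows. This is not the JellyfinBench code (which is C#), only an illustration in Python; the image tag and the build-arg name HYPOTHETICAL_BUILD_MODE are made up for the example, while --build-arg, --env and the two COMPlus_ variables come from the description above.

```python
from typing import Dict, List


def docker_build_command(image_tag: str, build_args: Dict[str, str]) -> List[str]:
    """Build an argv list for `docker build`, one --build-arg per build-mode parameter."""
    cmd = ["docker", "build", "--tag", image_tag]
    for name, value in sorted(build_args.items()):
        cmd += ["--build-arg", f"{name}={value}"]
    cmd.append(".")  # build context: the repo root
    return cmd


def docker_run_command(image_tag: str, tiered_compilation: bool, ready_to_run: bool) -> List[str]:
    """Build an argv list for `docker run`; the runtime flags travel as environment variables."""
    return [
        "docker", "run", "--rm",
        "--env", f"COMPlus_TieredCompilation={int(tiered_compilation)}",
        "--env", f"COMPlus_ReadyToRun={int(ready_to_run)}",
        image_tag,
    ]


if __name__ == "__main__":
    # Hypothetical tag and build-arg name, for illustration only.
    print(" ".join(docker_build_command("jellyfin-bench", {"HYPOTHETICAL_BUILD_MODE": "r2r"})))
    print(" ".join(docker_run_command("jellyfin-bench", tiered_compilation=True, ready_to_run=False)))
```

Keeping the two runtime flags out of the image means one built image can be measured in all four ReadyToRun / TieredCompilation combinations without rebuilding.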

trylek · 2021-08-29T19:32:04Z

ReadyToRun OFF, TieredCompilation OFF:

TOTAL |  %  | RUNTIME |  %  | APPHOST |  %  | WEBHOST |  %  |   APP   |  %  | MODE
==================================================================================
 8407 | 100 |     167 | 100 |    2840 | 100 |    4498 | 100 |     902 | 100 | baseline
 7679 |  91 |     158 |  94 |    2736 |  96 |    3974 |  88 |     811 |  89 | r2r
 8377 |  99 |     165 |  98 |    2934 | 103 |    4420 |  98 |     858 |  95 | app-composite-avx2
 8419 | 100 |     185 | 110 |    3016 | 106 |    4362 |  96 |     856 |  94 | one-big-composite-avx2
 8257 |  98 |     161 |  96 |    2939 | 103 |    4329 |  96 |     828 |  91 | r2r-platform-composite-avx2
 8359 |  99 |     158 |  94 |    2987 | 105 |    4373 |  97 |     841 |  93 | jit-platform-composite-avx2

trylek · 2021-08-29T19:32:36Z

ReadyToRun OFF, TieredCompilation ON:

TOTAL |  %  | RUNTIME |  %  | APPHOST |  %  | WEBHOST |  %  |   APP   |  %  | MODE
==================================================================================
 6063 | 100 |     133 | 100 |    1965 | 100 |    3362 | 100 |     603 | 100 | baseline
 6524 | 107 |     148 | 111 |    2081 | 105 |    3654 | 108 |     641 | 106 | r2r
 7548 | 124 |     171 | 128 |    2450 | 124 |    4219 | 125 |     708 | 117 | app-composite-avx2
 7559 | 124 |     173 | 130 |    2444 | 124 |    4238 | 126 |     704 | 116 | one-big-composite-avx2
 6738 | 111 |     156 | 117 |    2098 | 106 |    3843 | 114 |     641 | 106 | r2r-platform-composite-avx2
 6767 | 111 |     160 | 120 |    2219 | 112 |    3747 | 111 |     641 | 106 | jit-platform-composite-avx2

trylek · 2021-08-29T19:33:08Z

ReadyToRun ON, TieredCompilation OFF:

TOTAL |  %  | RUNTIME |  %  | APPHOST |  %  | WEBHOST |  %  |   APP   |  %  | MODE
==================================================================================
 5861 | 100 |      91 | 100 |     924 | 100 |    4216 | 100 |     630 | 100 | baseline
 3303 |  56 |      94 | 103 |     819 |  88 |    1957 |  46 |     433 |  68 | r2r
 2747 |  46 |     127 | 139 |     794 |  85 |    1447 |  34 |     379 |  60 | app-composite-avx2
 2931 |  50 |     108 | 118 |     729 |  78 |    1671 |  39 |     423 |  67 | one-big-composite-avx2
 3126 |  53 |     138 | 151 |     739 |  79 |    1820 |  43 |     429 |  68 | r2r-platform-composite-avx2
 6035 | 102 |      91 | 100 |    1052 | 113 |    4209 |  99 |     683 | 108 | jit-platform-composite-avx2

trylek · 2021-08-29T19:33:29Z

ReadyToRun ON, TieredCompilation ON:

TOTAL |  %  | RUNTIME |  %  | APPHOST |  %  | WEBHOST |  %  |   APP   |  %  | MODE
==================================================================================
 4251 | 100 |      77 | 100 |     731 | 100 |    2842 | 100 |     601 | 100 | baseline
 2654 |  62 |      94 | 122 |     689 |  94 |    1488 |  52 |     383 |  63 | r2r
 2535 |  59 |     113 | 146 |     747 | 102 |    1392 |  48 |     283 |  47 | app-composite-avx2
 2543 |  59 |     109 | 141 |     615 |  84 |    1503 |  52 |     316 |  52 | one-big-composite-avx2
 3008 |  70 |     104 | 135 |     714 |  97 |    1782 |  62 |     408 |  67 | r2r-platform-composite-avx2
 4993 | 117 |      86 | 111 |     904 | 123 |    3347 | 117 |     656 | 109 | jit-platform-composite-avx2

trylek · 2021-08-29T19:38:01Z

As yet another extra instrumentation I added two more checkpoints to the server initialization routine, one past the initialization of the AppHost and another past the async initialization of the WebHost. The purpose of this was to try to observe whether the perf variations among the various build modes are evenly affecting all parts of the app initialization (which would indicate either a codegen issue like suboptimal code even in the presence of large version bubble or a systemic runtime issue e.g. slower method lookup in the longer composite R2R tables), or just some part of it (indicative of a specific issue tied to a certain construct, library or runtime functionality).

trylek · 2021-08-29T19:42:52Z

The first run, with ReadyToRun and TieredCompilation both turned off, can be used as a "sanity check": if everything works fine, there should be no variation among the different build modes because ReadyToRun code is not used. The app should be somewhat slower than normal because no methods ever get optimized by tiered rejitting. The first table demonstrates this nicely; most percentages are very close to the baseline, except for the 91% outlier in the r2r row. The startup time is relatively noisy, though, and the detailed tables show almost 20% fluctuation, so perhaps this run was just lucky. Btw the above tables are based on averages across 10 runs in each build mode.
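For readers who want to recompute the % columns, here is a small Python sketch using the averaged TOTAL values from the first table. The percentages in the tables appear to be truncated rather than rounded (e.g. 158 / 167 is shown as 94), so the sketch uses integer division; that is an inference from the numbers, not something confirmed from the JellyfinBench source.

```python
from typing import Dict

# Averaged TOTAL startup times in msecs, copied from the
# "ReadyToRun OFF, TieredCompilation OFF" table.
TOTAL_MS: Dict[str, int] = {
    "baseline": 8407,
    "r2r": 7679,
    "app-composite-avx2": 8377,
    "one-big-composite-avx2": 8419,
    "r2r-platform-composite-avx2": 8257,
    "jit-platform-composite-avx2": 8359,
}


def percent_of_baseline(value_ms: int, baseline_ms: int) -> int:
    """Percentage relative to the baseline, truncated toward zero (assumed table behavior)."""
    return 100 * value_ms // baseline_ms


def percent_column(averages: Dict[str, int], baseline_mode: str = "baseline") -> Dict[str, int]:
    """Compute the % column for every build mode against the chosen baseline mode."""
    base = averages[baseline_mode]
    return {mode: percent_of_baseline(value, base) for mode, value in averages.items()}


if __name__ == "__main__":
    for mode, pct in percent_column(TOTAL_MS).items():
        print(f"{TOTAL_MS[mode]:>6} | {pct:>3} | {mode}")
```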

trylek · 2021-08-29T19:44:46Z

An interesting bit that stands out in the second run (ReadyToRun OFF, TieredCompilation ON) is the slower performance of composite build variants. This may indicate that we unnecessarily load the composite image even when ReadyToRun is turned OFF and, while a niche scenario, this might be an easy .NET 7 perf issue to fix in the CoreCLR runtime.

trylek · 2021-08-29T19:53:15Z

For ReadyToRun ON, I believe that the perf difference between the baseline and plain r2r indicates that the startup is running lots of application code and is not dominated by runtime code. As a next step I plan to work with the diagnostics team on figuring out how to use the event pipe as a Linux replacement for ETW to get lists of the methods being jitted and run in the app, and to use them to validate that we're not e.g. unnecessarily JITting some R2R-compiled code in the composite cases.

The differences between TieredCompilation ON and OFF seem to show that composite R2R code is "almost as good as optimized tiered code but not quite". The baseline (JITted case) is much slower without tiered compilation as all methods remain unoptimized. The plain r2r mode is still limited by assembly versioning boundaries. The composite version using the large version bubble is almost as fast without tiering as with it, but tiering still seems to improve perf by 10% or so.

trylek · 2021-08-29T20:02:43Z

The fundamental outlier is the last line, JITted app against composite framework. This is always slower than both the baseline and all the other composite modes and doesn't make much sense to me at the moment. I believe this is the most important one to investigate; the good news is that I believe it to be in accordance with Richard's original measurements.

Apparently the time it takes to load the composite PE images doesn't seem to be contributing to the slowness, we see that the RUNTIME checkpoint (beginning of Main) is reached in basically the same time in the baseline and in the last line (91 / 91 msecs with TC off, 77 / 86 msecs with TC on) and actually much faster than in the other composite modes.

In both modes we see that the slowdown is more or less evenly spread between the three later checkpoints (apphost initialization, webhost initialization, total startup time), which is suggestive of either a codegen or a runtime issue. Interestingly enough, the slowdown looks even more pronounced when tiered compilation is turned on, which might suggest some codegen issue (suboptimal codegen for calls to the composite framework).
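The per-checkpoint comparison for the last line can be redone mechanically. The Python sketch below takes the baseline and jit-platform-composite-avx2 rows from the two ReadyToRun ON tables and prints the truncated percentage per phase; the helper names are made up for this illustration. It also relies on the observation that the four phase columns add up to the TOTAL column in these rows.

```python
from typing import Dict, Tuple

PHASES = ("RUNTIME", "APPHOST", "WEBHOST", "APP")
Row = Tuple[int, int, int, int]

# Per-phase averages in msecs from the two "ReadyToRun ON" tables.
ROWS: Dict[str, Dict[str, Row]] = {
    "TC off": {
        "baseline": (91, 924, 4216, 630),
        "jit-platform-composite-avx2": (91, 1052, 4209, 683),
    },
    "TC on": {
        "baseline": (77, 731, 2842, 601),
        "jit-platform-composite-avx2": (86, 904, 3347, 656),
    },
}


def phase_percentages(baseline: Row, candidate: Row) -> Dict[str, int]:
    """Truncated percentage of each phase relative to the baseline row."""
    return {phase: 100 * c // b for phase, b, c in zip(PHASES, baseline, candidate)}


def total_ms(row: Row) -> int:
    """The TOTAL column is the sum of the four phases in these tables."""
    return sum(row)


if __name__ == "__main__":
    for config, rows in ROWS.items():
        base = rows["baseline"]
        cand = rows["jit-platform-composite-avx2"]
        print(config, phase_percentages(base, cand),
              "TOTAL", 100 * total_ms(cand) // total_ms(base))
```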

trylek · 2021-08-29T20:05:40Z

Sadly enough I was unable to reproduce the measurements on my newly installed physical Linux box as after fixing the various issues in the scripts I found out to my dismay that my Dell 5600 box from the Midori times is apparently too old and doesn't support AVX2.

trylek · 2021-08-29T20:14:00Z

@richlander - these are my initial findings, would there be a way to share this PR with the Crossgen2 team (basically David, Manish, Ivan, maybe JanV and JanK)? I assume I'll add additional data, based on the initial analysis, once I manage to get it. It would be great if you could try out the JellyfinBench app on a Linux box when you have a chance, to see whether we're in agreement regarding the numbers - but at first glance I think the numbers seem somewhat similar to yours, and after all getting precise numbers is hard.

For now I haven't spent too much time polishing the command-line interface of the JellyfinBench app, in your case it should suffice to run it without parameters using the "dotnet" command (I'm normally using the one from my parallel runtime repo clone) from the root of the jellyfin repo. As described above, you need to populate a crossgen2 drop in the crossgen2 subfolder of the jellyfin repo clone and pack the nuget cache. Maybe this could be somehow semi-automated in the PublishJellyfinServer.sh script but I haven't tried it out yet.

The UseTieredCompilation and UseReadyToRun flags are defined as constants in the JellyfinBench source code (Program.cs); you need to edit the file to alter their values. Similarly, you can edit the file to change the list or parameters of the build modes to use and / or the number of iterations to execute. I used 10 and 50 during development.

Thanks

Tomas

trylek · 2021-08-29T20:26:00Z

Gist with the detailed results in case anyone fancies a more detailed look:

https://gist.github.com/trylek/9b24611e41fe79bb973f7e6a2e9ee27d

richlander · 2021-08-31T17:02:35Z

Yes, you can share this PR with whomever you want. It's a public page on the internet.

trylek · 2021-08-31T17:39:22Z

OK, thanks for clarifying, I probably just got confused by the fact that the GitHub web page only suggests you and me when I tried to @-mention other people in the PR. CC-ing @dotnet/crossgen-contrib right now to share this more broadly. If that doesn't work, I'll just post the link to the PR in the CG2 chat.

ivdiazsa · 2022-03-28T17:50:50Z

I've made a few changes to the JellyfinBench app, in order to make it more versatile and simpler to run. Those changes can be found in the CoreCLRVMPerfExercises branch attached to Tomas' fork of Rich's Jellyfin repository fork.

Now, after running several tests with the latest nightly builds of the .NET runtime, we obtained some promising-looking results, which I'm displaying here.

TOTAL |  %  | RUNTIME |  %  | APPHOST |  %  | WEBHOST |  %  |   APP   |  %  | MODE
==================================================================================
 5260 | 100 |     126 | 100 |    1687 | 100 |    2819 | 100 |     628 | 100 | baseline
 5231 |  99 |     129 | 102 |    1697 | 100 |    2774 |  98 |     631 | 100 | r2r
 5223 |  99 |     120 |  95 |    1698 | 100 |    2775 |  98 |     630 | 100 | app-composite-avx2
 5220 |  99 |     122 |  96 |    1684 |  99 |    2787 |  98 |     627 |  99 | one-big-composite-avx2
 5242 |  99 |     125 |  99 |    1690 | 100 |    2798 |  99 |     629 | 100 | r2r-platform-composite-avx2
 5225 |  99 |     125 |  99 |    1670 |  98 |    2806 |  99 |     624 |  99 | jit-platform-composite-avx2
 3313 |  62 |      63 |  50 |     612 |  36 |    2211 |  78 |     427 |  67 | baseline-usereadytorun
 1644 |  31 |      68 |  53 |     395 |  23 |     949 |  33 |     232 |  36 | r2r-usereadytorun
 1333 |  25 |      66 |  52 |     344 |  20 |     740 |  26 |     183 |  29 | app-composite-avx2-usereadytorun
 1302 |  24 |      67 |  53 |     327 |  19 |     733 |  26 |     175 |  27 | one-big-composite-avx2-usereadytorun
 1305 |  24 |      63 |  50 |     333 |  19 |     736 |  26 |     173 |  27 | r2r-platform-composite-avx2-usereadytorun
 3321 |  63 |      70 |  55 |     613 |  36 |    2209 |  78 |     429 |  68 | jit-platform-composite-avx2-usereadytorun
 4106 |  78 |     113 |  89 |    1312 |  77 |    2201 |  78 |     480 |  76 | baseline-usetieredcompilation
 4144 |  78 |     117 |  92 |    1346 |  79 |    2197 |  77 |     484 |  77 | r2r-usetieredcompilation
 4109 |  78 |     117 |  92 |    1338 |  79 |    2173 |  77 |     481 |  76 | app-composite-avx2-usetieredcompilation
 4096 |  77 |     113 |  89 |    1332 |  78 |    2172 |  77 |     479 |  76 | one-big-composite-avx2-usetieredcompilation
 4107 |  78 |     118 |  93 |    1337 |  79 |    2171 |  77 |     481 |  76 | r2r-platform-composite-avx2-usetieredcompilation
 4119 |  78 |     116 |  92 |    1315 |  77 |    2206 |  78 |     482 |  76 | jit-platform-composite-avx2-usetieredcompilation
 2561 |  48 |      62 |  49 |     506 |  29 |    1657 |  58 |     336 |  53 | baselineusereadytorun-and-tieredcompilation
 1553 |  29 |      67 |  53 |     379 |  22 |     900 |  31 |     207 |  32 | r2rusereadytorun-and-tieredcompilation
 1296 |  24 |      66 |  52 |     329 |  19 |     733 |  26 |     168 |  26 | app-composite-avx2usereadytorun-and-tieredcompilation
 1267 |  24 |      66 |  52 |     322 |  19 |     716 |  25 |     163 |  25 | one-big-composite-avx2usereadytorun-and-tieredcompilation
 1290 |  24 |      70 |  55 |     326 |  19 |     727 |  25 |     167 |  26 | r2r-platform-composite-avx2usereadytorun-and-tieredcompilation
 2573 |  48 |      68 |  53 |     504 |  29 |    1665 |  59 |     336 |  53 | jit-platform-composite-avx2usereadytorun-and-tieredcompilation

trylek · 2022-03-28T18:06:05Z

Thanks Ivan for sharing the latest measurements. I believe they indicate that the perf outlier Rich and I observed last year in the configuration "JITted app + composite framework" is no longer there. I have also double-checked with Ivan that "platform composite" means .NET framework + ASP.NET. While we no longer see any perf regression specific to the composite platform, according to this particular test the composite platform doesn't outperform the separately R2R-compiled framework / ASP.NET assemblies either; all diffs between the baseline and JITted app + composite framework are now below noise level.
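As a quick way to see how small those diffs are, the following Python sketch compares the TOTAL values of the baseline and jit-platform-composite-avx2 rows in each of the four configurations from Ivan's table. The configuration labels and the helper name are chosen for this illustration only, and the thread does not state a numeric noise threshold, so the sketch just prints the relative differences.

```python
from typing import Dict, Tuple

# (baseline TOTAL, jit-platform-composite-avx2 TOTAL) in msecs, from Ivan's table.
TOTALS: Dict[str, Tuple[int, int]] = {
    "R2R off, TC off": (5260, 5225),
    "R2R on, TC off": (3313, 3321),
    "R2R off, TC on": (4106, 4119),
    "R2R on, TC on": (2561, 2573),
}


def relative_diff_percent(baseline_ms: int, candidate_ms: int) -> float:
    """Absolute difference between the two averages as a percentage of the baseline."""
    return 100.0 * abs(candidate_ms - baseline_ms) / baseline_ms


if __name__ == "__main__":
    for config, (base, cand) in TOTALS.items():
        print(f"{config}: {relative_diff_percent(base, cand):.2f}% difference")
```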

trylek added 11 commits August 19, 2021 22:12

Update Jellyfin to facilitate perf measurements

4e05f46

Add benchmarking app to facilitate reproducible Jellyfin measurements

1539aeb

Output timing for runtime and app startup; improve publishing script

871a51e

Fix propagation of the AVX2 flag

a3d33ab

Increase number of iterations to 50 to reduce standard deviation

d877aeb

Split build / run logging; fix version bubble for AVX2; case sensitivity

3dcc6c9

Mark JellyfinBench as targeting net6.0

81d5910

Fix input bubble for AVX2 builds; fix missing [[ / ]] in script

1a963b9

Logging improvements in the JellyfinBench app

2a8bbc6

Add support for measuring with / without ReadyToRun and TieredCompilation

4c5c10e

trylek mentioned this pull request Sep 13, 2021

Tracking issue for Core Runtime HouseKeeping Items dotnet/runtime#58120

Closed

24 tasks

trylek added 3 commits January 24, 2022 19:00

Adapt the JellyfinBench tool for measuring ASP.NET on Windows

de1d3de

Add support for warmup iterations

e2f63ec

Revert change for measuring ASP.NET; update Dockerfile to .NET 7

7230ac3
