Lookup compiled functions by FuncKey at runtime #11630
Conversation
This treats compiled functions homogeneously, removing the need to add new metadata tables to places like `CompiledModuleInfo` whenever we add a new kind of function, and simplifying the process of constructing the metadata for a final, linked compilation artifact. This also paves the way to doing gc-sections in our linking (getting smaller code sizes, removing functions that have been inlined into every caller, etc.) as we no longer assume that certain types of function index spaces are dense.

This does, however, replace a couple of operations that were previously O(1) table lookups with O(log n) binary searches. Notably, some of these are on the `VMFuncRef`-creation path, and therefore on the force-initialization-of-a-lazy-funcref-table-slot path, when we look up a Wasm function and its trampolines. Our call-indirect micro-benchmarks show that indirectly calling every funcref once in a table of 64Ki slots went from taking ~2.6ms to ~3.8ms (a +46% slowdown). Note that this edge case is both synthetic and the worst-case scenario for this commit's change: we are measuring, as much as we can, *only* the force-initialization-of-a-lazy-funcref-table-slot path. All other call-indirect benchmarks are within the noise, which is what we would expect.

Also, the size of `.cwasm`s is slightly larger: `spidermonkey.wasm`'s `.cwasm` size went from `19_750_632` bytes to `19_785_872` bytes, a ~0.18% increase.

Ultimately, I believe that the simplification and the possibility of doing gc-sections in the future are worth the downsides. That said, if others feel differently, there are some things we could try to improve the situation, although most things I can think of off the top of my head (e.g. LEB128s and delta encoding, or making certain `FuncKey` kinds' index spaces dense) will improve one of code size or lookup times while pessimizing the other. I'm sure we could come up with something given enough effort, though.
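To illustrate the core tradeoff described above, here is a minimal, self-contained sketch. The types below (`FuncKey`, `FuncLoc`, `FuncTable`) are hypothetical stand-ins for illustration, not Wasmtime's actual definitions: one table sorted by key, searched in O(log n), replaces per-kind dense index tables, and entries removed by a future gc-sections pass simply disappear from the table.

```rust
// Hypothetical sketch (not Wasmtime's actual types): look up a compiled
// function's code range by a (kind, index) key via binary search over a
// single sorted table, instead of a dense per-kind metadata table.

/// A simplified stand-in for a `FuncKey`: (kind, index).
#[derive(Copy, Clone, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct FuncKey {
    kind: u8,   // e.g. 0 = wasm function, 1 = trampoline, ...
    index: u32, // index within that kind's (possibly sparse) space
}

/// Byte range of the compiled function within the code section.
#[derive(Copy, Clone, Debug, PartialEq)]
struct FuncLoc {
    start: u32,
    len: u32,
}

/// One table sorted by key; O(log n) lookup for any function kind.
struct FuncTable {
    entries: Vec<(FuncKey, FuncLoc)>, // invariant: sorted by FuncKey
}

impl FuncTable {
    fn lookup(&self, key: FuncKey) -> Option<FuncLoc> {
        self.entries
            .binary_search_by_key(&key, |&(k, _)| k)
            .ok()
            .map(|i| self.entries[i].1)
    }
}

fn main() {
    let table = FuncTable {
        entries: vec![
            (FuncKey { kind: 0, index: 0 }, FuncLoc { start: 0, len: 16 }),
            // A sparse index space: entries 1..=6 were (hypothetically)
            // dropped by gc-sections, and nothing breaks.
            (FuncKey { kind: 0, index: 7 }, FuncLoc { start: 16, len: 32 }),
            (FuncKey { kind: 1, index: 0 }, FuncLoc { start: 48, len: 8 }),
        ],
    };
    assert_eq!(
        table.lookup(FuncKey { kind: 0, index: 7 }),
        Some(FuncLoc { start: 16, len: 32 })
    );
    // A key that was garbage-collected is simply absent:
    assert_eq!(table.lookup(FuncKey { kind: 0, index: 3 }), None);
    println!("ok");
}
```

The dense-table approach this replaces is a plain `Vec<FuncLoc>` indexed directly by function index: O(1), but it requires one table per function kind and forbids holes in the index space.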
<details>
<summary>call-indirect micro-benchmarks results</summary>

```
call-indirect/same-callee/table-init-lazy/65536-calls
                        time:   [144.14 µs 145.26 µs 146.56 µs]
                        thrpt:  [447.15 Melem/s 451.15 Melem/s 454.68 Melem/s]
                 change:
                        time:   [−5.5066% −3.6611% −1.9130%] (p = 0.00 < 0.05)
                        thrpt:  [+1.9503% +3.8002% +5.8275%]
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  7 (7.00%) high mild
  2 (2.00%) high severe

call-indirect/different-callees/table-init-lazy/65536-calls
                        time:   [3.8128 ms 3.8433 ms 3.8763 ms]
                        thrpt:  [16.907 Melem/s 17.052 Melem/s 17.188 Melem/s]
                 change:
                        time:   [+43.064% +46.066% +49.080%] (p = 0.00 < 0.05)
                        thrpt:  [−32.922% −31.538% −30.101%]
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild

call-indirect/same-callee/table-init-strict/65536-calls
                        time:   [130.27 µs 131.66 µs 133.40 µs]
                        thrpt:  [491.26 Melem/s 497.75 Melem/s 503.09 Melem/s]
                 change:
                        time:   [−6.4798% −4.1871% −1.8965%] (p = 0.00 < 0.05)
                        thrpt:  [+1.9332% +4.3701% +6.9288%]
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe

call-indirect/different-callees/table-init-strict/65536-calls
                        time:   [176.22 µs 178.49 µs 180.99 µs]
                        thrpt:  [362.10 Melem/s 367.18 Melem/s 371.90 Melem/s]
                 change:
                        time:   [−18.431% −15.397% −12.330%] (p = 0.00 < 0.05)
                        thrpt:  [+14.064% +18.200% +22.595%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe
```

</details>
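The description above mentions LEB128s plus delta encoding as one possible way to claw back code size. A minimal sketch of that idea (hypothetical helper names, not Wasmtime code): a sorted list of code offsets is stored as ULEB128-encoded deltas, which shrinks the table since deltas are small, but random access is lost, so lookups would need a sequential decode or an extra side index. That is exactly the size-vs-lookup-time tension the description points out.

```rust
// Hypothetical sketch of LEB128 + delta encoding for a sorted offset
// table. Smaller on disk, but decoding is sequential: binary search
// over the raw bytes is no longer possible without extra structure.

fn uleb128_encode(mut v: u32, out: &mut Vec<u8>) {
    loop {
        let byte = (v & 0x7f) as u8;
        v >>= 7;
        if v == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80); // continuation bit
    }
}

fn uleb128_decode(bytes: &[u8], pos: &mut usize) -> u32 {
    let (mut result, mut shift) = (0u32, 0);
    loop {
        let byte = bytes[*pos];
        *pos += 1;
        result |= ((byte & 0x7f) as u32) << shift;
        if byte & 0x80 == 0 {
            return result;
        }
        shift += 7;
    }
}

/// Delta-encode a sorted list of offsets as ULEB128 bytes.
fn encode_offsets(offsets: &[u32]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut prev = 0;
    for &off in offsets {
        uleb128_encode(off - prev, &mut out);
        prev = off;
    }
    out
}

/// Decode all offsets back (sequential scan; no random access).
fn decode_offsets(bytes: &[u8], count: usize) -> Vec<u32> {
    let (mut pos, mut prev, mut out) = (0usize, 0u32, Vec::with_capacity(count));
    for _ in 0..count {
        prev += uleb128_decode(bytes, &mut pos);
        out.push(prev);
    }
    out
}

fn main() {
    let offsets = [0u32, 16, 48, 1024, 1040, 70_000];
    let encoded = encode_offsets(&offsets);
    // Deltas are small, so most entries take 1 byte instead of 4.
    assert!(encoded.len() < offsets.len() * 4);
    assert_eq!(decode_offsets(&encoded, offsets.len()), offsets);
    println!("ok");
}
```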
It is required at runtime now, not just at compile time.
This issue or pull request has been labeled: "wasmtime:api", "winch"
Really nice how this turned out, thanks for pushing on this!
One more benchmark before merging, I now realize: component instantiation. IIRC we confirmed core wasm instantiation was largely unaffected by this, but reading this over I'm remembering that component instantiation, when it hits the various initializers for trampolines/builtins/etc., will now do the `FuncKey` lookup. Can you make a simple-ish `spidermonkey.wasm` component and compare before/after instantiation numbers?
```rust
/// This module's index.
pub module_index: Option<StaticModuleIndex>,
```
Could this expand on what a `None` case means?

(Also, if it's purely to be able to use `Module::default`, I think it'd be ok to remove that construction method.)