Skip to content

Conversation

@redzic
Copy link
Collaborator

@redzic redzic commented Dec 21, 2021

This should improve the codegen of width_log2, height_log2, tx_size, and is_rect_tx_allowed by removing checks of the invalid variant, and it also forces callers of subsampled_size and subsize to check that the return value was not an invalid block size. If there is any performance-critical code where it is not desirable to check for the invalid block size, then it is still possible to use unsafe to remove the check.

@shssoichiro
Copy link
Collaborator

Is it possible to have a benchmark to see what impact (if any) this has on overall performance?

@redzic
Copy link
Collaborator Author

redzic commented Dec 22, 2021

@shssoichiro Yes it's possible, but how should I do it for something like this? Would hawktracer be suitable, or AWCY?

@shssoichiro
Copy link
Collaborator

shssoichiro commented Dec 22, 2021

If I'm going to get a picture of whole binary performance, I'll usually use hyperfine. AWCY isn't the best at giving a consistent encoding speed across runs, and hawktracer is great for profiling, or if you're targeting a specific part of the codebase. But given that this touches a few different areas, I'd make a release binary from master, and a release binary with these changes, and benchmark them against each other in hyperfine, probably with a couple of different videos.

Personally I think this change does make the code more idiomatic for Rust, so I'm leaning toward merging even if the performance improvement is negligible. But I'd like to see if there is a measurable improvement as well.

@lu-zero
Copy link
Collaborator

lu-zero commented Dec 22, 2021

@shssoichiro Yes it's possible, but how should I do it for something like this? Would hawktracer be suitable, or AWCY?

https://github.com/lu-zero/speed-levels-rs gives you a mean to automate some of the hyperfine workflow.

Copy link
Collaborator

@tdaede tdaede left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent work! BLOCK_INVALID is an old leftover from aomenc / the AV1 spec that made it easier to implement some of the context stuff but was pretty inelegant (and still is, in aomenc).

Is there any place we would actually want to handle the error for subsampled_size rather than just .unwrap()ing? Possibly we could just assert and not return a Result.

@redzic
Copy link
Collaborator Author

redzic commented Dec 22, 2021

@shssoichiro There seems to be a small but measurable performance improvement with this PR.

Tested on 3 different videos, this PR was always slightly faster.

rav1e_master is the current master, and rav1e_bsize is this (now rebased onto current master) PR.

out7.y4m: 5 seconds, 256x256 resolution, yuv420p

out3.y4m: 5 seconds, 240x240 resolution, yuv420p

outx.y4m: 5 seconds, 1920x1080 resolution, yuv420p

redzic@threadripper ~/P/rav1e (master)> hyperfine "./rav1e_master ~/out7.y4m -o out.ivf -y"
Benchmark 1: ./rav1e_master ~/out7.y4m -o out.ivf -y
  Time (mean ± σ):      8.168 s ±  0.057 s    [User: 30.287 s, System: 5.900 s]
  Range (min … max):    8.084 s …  8.298 s    10 runs

redzic@threadripper ~/P/rav1e (master)> hyperfine "./rav1e_bsize ~/out7.y4m -o out.ivf -y"
Benchmark 1: ./rav1e_bsize ~/out7.y4m -o out.ivf -y
  Time (mean ± σ):      8.134 s ±  0.043 s    [User: 30.155 s, System: 6.092 s]
  Range (min … max):    8.079 s …  8.194 s    10 runs

redzic@threadripper ~/P/rav1e (master)> hyperfine "./rav1e_bsize ~/out3.y4m -o out.ivf -y"
Benchmark 1: ./rav1e_bsize ~/out3.y4m -o out.ivf -y
  Time (mean ± σ):     21.774 s ±  0.049 s    [User: 51.109 s, System: 7.758 s]
  Range (min … max):   21.692 s … 21.850 s    10 runs

redzic@threadripper ~/P/rav1e (master)> hyperfine "./rav1e_master ~/out3.y4m -o out.ivf -y"
Benchmark 1: ./rav1e_master ~/out3.y4m -o out.ivf -y
  Time (mean ± σ):     21.973 s ±  0.061 s    [User: 51.372 s, System: 7.676 s]
  Range (min … max):   21.850 s … 22.064 s    10 runs

redzic@threadripper ~/P/rav1e (master)> hyperfine "./rav1e_master ~/outx.y4m -s 10 -o out.ivf -y"
Benchmark 1: ./rav1e_master ~/outx.y4m -s 10 -o out.ivf -y
  Time (mean ± σ):     79.930 s ±  0.410 s    [User: 116.721 s, System: 7.363 s]
  Range (min … max):   79.299 s … 80.539 s    10 runs

redzic@threadripper ~/P/rav1e (master)> hyperfine "./rav1e_bsize ~/outx.y4m -s 10 -o out.ivf -y"
Benchmark 1: ./rav1e_bsize ~/outx.y4m -s 10 -o out.ivf -y
  Time (mean ± σ):     79.733 s ±  0.361 s    [User: 116.573 s, System: 7.520 s]
  Range (min … max):   78.820 s … 80.015 s    10 runs

Copy link
Collaborator

@shssoichiro shssoichiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent! 🚀

@shssoichiro
Copy link
Collaborator

@tdaede Did you have any other concerns before merging?

@redzic
Copy link
Collaborator Author

redzic commented Dec 23, 2021

Is there any place we would actually want to handle the error for subsampled_size rather than just .unwrap()ing? Possibly we could just assert and not return a Result.

I guess we sort of already "handle" the error with the .expect("invalid block size for this subsampling mode"), but I can shift the panic to the functions themselves rather than the caller if we prefer that

@tdaede
Copy link
Collaborator

tdaede commented Dec 24, 2021

I have no preference, the current way is fine. Just thought possibly not using Result might provide slightly better codegen if we didn't see a perf improvement as is.

@shssoichiro shssoichiro merged commit 5a5711e into xiph:master Dec 24, 2021
@redzic redzic deleted the bsize-refactor branch December 25, 2021 02:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants