Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve sqrt performance in lighting shaders by exploiting floating point binary representation #92129

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

rainydaysavings
Copy link

@rainydaysavings rainydaysavings commented May 19, 2024

These changes aim to improve sqrt computation performance, the first commit is based upon a presentation of Michal Drobot that can be found here.
The new function, named sqrt_IEEE_int_approximation, is quite similar to the infamous FISR algorithm. Albeit, for sqrt itself.

The second commit was intended to replace the use of the built-in length function for V_GGX and V_GGX_anisotropic, and use the aforementioned approximation. These, in turn, are based upon "Course Notes: Moving Frostbite to PBR", found here.

I've made several runs (n=5) comparing performance before and after these changes. Using godot-benchmarks. These results are based upon the Rendering/Lights and Meshes benchmarks:

Benchmark Results

I believe this new sqrt approximation could be taken advantage of in other shader includes, but have had a hard time understanding where best to place it. As of now, it resides under /servers/rendering/renderer_rd/shaders/scene_forward_lights_inc.glsl. Given this, I would appreciate any comments regarding a more general approach.

Outlined in Low Level Optimizations for GCN by Michal Drobot in 2014
The improved functions are based upon the work found in "Course Notes: Moving Frostbite to PBR", page 12.
Found here: https://media.contentapi.ea.com/content/dam/eacom/frostbite/files/course-notes-moving-frostbite-to-pbr-v32.pdf
@rainydaysavings rainydaysavings requested a review from a team as a code owner May 19, 2024 19:02
@AThousandShips
Copy link
Member

I've made several runs (n=5) comparing performance before and after these changes.

What is n here? And what is the "test number"?

I don't know that this data is conclusive, also are the results for test 9 identical? Or is the optimized result covering the original? Please also add the raw data so the results can be analysed more closely, a bar graph is pretty limited

@rainydaysavings
Copy link
Author

What is n here? And what is the "test number"?

There are 9 different benchmarks under the Light and Meshes tests, I have made 5 different runs over those 9 for the original version and 5 other runs over the same 9 benchmarks for mine. The results were then averaged out.

The results for benchmark number 9 were the same up to 2 decimal points.

Please also add the raw data so the results can be analysed more closely, a bar graph is pretty limited

I'll run both versions later today as done here and with a profiler, and will share the raw data here. I understand the bar graph isn't that great.

@AThousandShips
Copy link
Member

So reading the document this comes from, they use "RME", what does that refer to? It's not as far as I can find an established term to describe errors (is it maybe RMSE, root square mean error?)

Is the domain safe here and is it guaranteed to not produce degenerate or invalid results within this use case?

@rainydaysavings
Copy link
Author

So reading the document this comes from, they use "RME", what does that refer to? It's not as far as I can find an established term to describe errors (is it maybe RMSE, root square mean error?)

I believe it stands for Mean Relative Error. More can be seen here.

Is the domain safe here and is it guaranteed to not produce degenerate or invalid results within this use case?

If there's a difference, it is imperceptible. From my testing. I'll also try to provide comparisons.
Regarding degenerate or invalid results, it'll always stay close to the real sqrt.

The biggest concern is error accumulation, so one must use it selectively.

P.S. I've realised I've made an error in a previous reply, there are 13 benchmarks, not 9.

@AThousandShips
Copy link
Member

As long as the error is small enough and reliable enough it should be safe, but the square root is extra sensitive I suspect as the IEEE standard requires it to be exact, which I assume GPUs are expected to follow, so as long as the algorithms used for lighting are lenient enough this shouldn't be a problem, assuming it's correct for the domain used here, but getting some data on that the method is considered reliable would be good

Copy link
Member

@clayjohn clayjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using sqrt_IEEE_int_approximation carefully seems like a nice improvement, but I am unsure about the changes to the GGX functions. We carefully selected those variants of the GGX approximation because there were so cheap. Your new versions appear to produce way more instructions (I assume they are more accurate as a result). (see for some history #51716)

The Filament docs have a good explanation for why we use the optimized V_GGX instead of the more correct term. https://google.github.io/filament/Filament.md.html#materialsystem/specularbrdf/geometricshadowing(specularg)

Comment on lines +28 to +31
float Lambda_GGXV = NdotL * sqrt_IEEE_int_approximation((-NdotV * alpha + NdotV) * NdotV + alpha);
float Lambda_GGXL = NdotV * sqrt_IEEE_int_approximation((-NdotL * alpha + NdotL) * NdotL + alpha);

return 0.5 / (Lambda_GGXV + Lambda_GGXL);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it uses way more instructions than the old version, are you sure that this is faster?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, sorry to not have shared benchmarks in the meantime, hope I can today.

As for the V_GGX implementation, you're correct, it wouldn't make much sense if it were. To be honest, I started out by trying to use the version you see and only then I tried to approximate the sqrt. Even though this is backwards in their commit order.

This V_GGX implementation might be hurting the sqrt approximation gains, if anything. The original V_GGX implementation seems to have been pretty well thought out, and more performant.

As promised I'll provide further testing today.

I've disassembled the shader on my end to check on this, resulting in the following:
V_SmithGGXCorrelated_spirv_comp.txt

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the focus of the PR should be on the main change, the rewrite of the algorithm, or split it into two separate ones if the sqrt changes can be independent

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look, indeed we end up with a lot more instructions.

I suspect that in the short term we will split this file in two and have a separate implementation for mobile and desktop. These variants that you have added are probably more appropriate for desktop as we want maximum quality on desktop (a couple instructions won't cost much). But on mobile we absolutely need to reduce the instruction count as much as possible

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants