
Add NHWC support for group normalization #126635

Open · wants to merge 3 commits into main
Conversation

@ZelboK (Contributor) commented May 19, 2024

Fixes #111824

Currently, if the user specifies NHWC (channels_last) format for group normalization, PyTorch defaults to NCHW tensors and converts. This conversion is not obvious to the user unless they check the memory format themselves, which is not intuitive. This PR adds NHWC support for CUDA by adding the necessary kernels.
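For illustration, a minimal sketch of the behavior this PR addresses (shapes are arbitrary, and the exact output format may vary by PyTorch version):

```python
import torch
import torch.nn.functional as F

x = torch.randn(8, 32, 64, 64, device="cuda").to(memory_format=torch.channels_last)
weight = torch.ones(32, device="cuda")
bias = torch.zeros(32, device="cuda")

y = F.group_norm(x, num_groups=8, weight=weight, bias=bias)

# Without NHWC kernels, the input is silently converted and the
# channels_last layout is not preserved on the output:
print(x.is_contiguous(memory_format=torch.channels_last))  # True
print(y.is_contiguous(memory_format=torch.channels_last))  # False before this PR
```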

cc: @mikaylagawarecki

@ZelboK ZelboK requested a review from eqy as a code owner May 19, 2024 03:14

pytorch-bot bot commented May 19, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126635

Note: Links to docs will display an error until the docs builds have been completed.

❌ 10 New Failures, 2 Pending, 1 Unrelated Failure

As of commit c9565ce with merge base e3230f8:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were already present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

// Apply the per-(sample, channel) bias and scale from the statistics pass.
auto cur_b = b[cur_sample * C + cur_channel];
Y[index] = (static_cast<T_ACC>(X[index]) + cur_b) * cur_a;
  }
}
ZelboK (Contributor, Author) commented:

I'm actually not sure how to achieve this behavior using tensor iterators and gpu_kernel - so it may be the case that this is unneeded.
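For context, the excerpt above applies a fused scale and bias elementwise; in tensor terms it is roughly the following (a sketch, where `a` and `b` are the per-(sample, channel) coefficients produced by the reduction pass):

```python
import torch

def apply_scale_bias(x, a, b):
    # x: (N, C, H, W); a, b: (N, C) coefficients from the statistics pass.
    # Equivalent to Y[index] = (X[index] + cur_b) * cur_a in the kernel above.
    n, c = a.shape
    return (x + b.view(n, c, 1, 1)) * a.view(n, c, 1, 1)
```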


// Accumulate Welford statistics over this group's channels, which are
// adjacent in memory in the NHWC layout.
for (int64_t c = 0; c < group_channels; c++) {
  val = welford_op.reduce(val, static_cast<T_ACC>(X[index + c]), index + c);
}
ZelboK (Contributor, Author) commented May 19, 2024:

This kernel uses a different indexing strategy that works for NHWC tensors, and it uses Welford's algorithm. Aside from that, the logic is very similar.
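For readers unfamiliar with it, Welford's algorithm computes mean and variance in a single pass with one running update per element. A minimal Python sketch of the scalar update that `welford_op.reduce` performs (the CUDA version also handles pairwise combination across threads):

```python
def welford_update(count, mean, m2, new_value):
    # One step of Welford's online algorithm: returns the updated running
    # count, mean, and sum of squared deviations from the mean (M2).
    count += 1
    delta = new_value - mean
    mean += delta / count
    m2 += delta * (new_value - mean)
    return count, mean, m2

# After all elements are consumed: variance = m2 / count.
```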

case MemoryFormat::ChannelsLast: {
  // Dispatch the NHWC-specific kernel added in this PR.
  ApplyScaleBiasNHWCKernel<T><<<N * G, num_threads, 0, cuda_stream>>>(
      X_data, Y_data, N, height, width, C, D * HxW, a_data, b_data);
  C10_CUDA_KERNEL_LAUNCH_CHECK();
  break;
ZelboK (Contributor, Author) commented:

I do not know how to do this with a tensor iterator. If someone could show me how, that would be great.

ZelboK (Contributor, Author) commented May 19, 2024

I intend to add tests for this in test_nn.py, for example - I would just like an initial glance over the general code before I proceed further, in case I missed something obvious. A sketch of what such a test might look like is below.
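A hedged sketch of such a test (the test name and structure are illustrative, not from the PR; the real test would follow the test_nn.py harness conventions):

```python
import torch
import torch.nn.functional as F

def test_group_norm_channels_last_cuda():
    x = torch.randn(4, 16, 8, 8, device="cuda")
    w = torch.randn(16, device="cuda")
    b = torch.randn(16, device="cuda")

    ref = F.group_norm(x, 4, w, b)  # NCHW reference path
    out = F.group_norm(x.to(memory_format=torch.channels_last), 4, w, b)

    # The NHWC kernels should match the NCHW result and preserve the layout.
    torch.testing.assert_close(out, ref)
    assert out.is_contiguous(memory_format=torch.channels_last)
```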

@ZelboK changed the title from "Add NHWC suppor for grup normalization. New kernels needed" to "Add NHWC support for group normalization" May 19, 2024
drhead commented May 19, 2024

Really looking forward to this! I've found that at least in some cases a naive implementation is actually faster than the existing native group norm kernel when using channels last format, and I'd love to see how much better a proper channels last kernel does.

ZelboK (Contributor, Author) commented May 20, 2024

> Really looking forward to this! I've found that at least in some cases a naive implementation is actually faster than the existing native group norm kernel when using channels last format, and I'd love to see how much better a proper channels last kernel does.

Interested to know the context - what GPU and architecture are you on? If possible, could you give me a minimal reproducible example of a naive implementation outperforming the native kernel so I can take a look?

drhead commented May 20, 2024

> Interested to know the context - what GPU and architecture are you on? If possible, could you give me a minimal reproducible example of a naive implementation outperforming the native kernel so I can take a look?

This was on a 3090, using Stable Diffusion 1.5 in inference mode -- I'm not sure it would be easy to make a minimal reproducible example, because I think it is at least partially dependent on having operations dispatched fairly far ahead of how fast they execute on the GPU. But to summarize: I first made sure that every aten::copy_ and every blocking operation was gone from the main inference loop. After that, and after switching the model to channels_last memory format, the only aten::copy_ ops remaining were right before the group norm kernel (to switch to contiguous) and at the next conv layer following a group norm (to switch back to channels last). After switching to a naive implementation, which let me avoid the copy operations entirely, I saw around a 5% overall increase in speed. I have not tested training with this setup.
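drhead's actual implementation isn't shown, but a naive channels-last-friendly group norm along these lines avoids the contiguous round trips (a sketch, with the function name mine; for a channels_last tensor the permute below is a free view, so no copy is triggered):

```python
import torch

def naive_group_norm(x, num_groups, weight, bias, eps=1e-5):
    # x: (N, C, H, W) in channels_last memory format.
    n, c, h, w = x.shape
    # permute(0, 2, 3, 1) on a channels_last tensor yields a contiguous view,
    # so the reshape below does not copy.
    v = x.permute(0, 2, 3, 1).reshape(n, h * w, num_groups, c // num_groups)
    mean = v.mean(dim=(1, 3), keepdim=True)
    var = v.var(dim=(1, 3), unbiased=False, keepdim=True)
    v = (v - mean) / torch.sqrt(var + eps)
    out = v.reshape(n, h, w, c).permute(0, 3, 1, 2)  # back to channels_last NCHW
    return out * weight.view(1, c, 1, 1) + bias.view(1, c, 1, 1)
```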

@drisspg added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module.) May 20, 2024
Labels: open source, triaged

Successfully merging this pull request may close: GroupNorm & InstanceNorm does not handle channels_last correctly (#111824)