Introduce cuda_p2p based fused_all_gather_matmul and fused_matmul_reduce_scatter #126634
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126634
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: there is 1 currently active SEV. If your PR is affected, please view it below.
✅ You can merge normally! (3 unrelated failures) As of commit 29e6b1f with merge base ff65b18:
FLAKY: the following jobs failed but were likely due to flakiness present on trunk.
UNSTABLE: the following job failed but was likely due to flakiness present on trunk and has been marked as unstable.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
…_matmul_reduce_scatter" ## Context See context [here](#122163). ## This PR Introduces `cuda_p2p` based `fused_all_gather_matmul` and `fused_matmul_reduce_scatter` dispatcher ops which performs micro-pipelining TP for `all-gather -> matmul` and `matmul -> reduce-scatter` respectively. Fusion vs. decomposition - in principle, the micro-pipelining is achieved via decomposition. However, in practice, today Inductor can't deal with the decomposed patterns well. So instead performing decomposition in Inductor, we fuse the patterns to be decomposed and dispatch them to corresponding operators that handle decomposition + micropipelining. cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k [ghstack-poisoned]
I do feel like we should be able to provide a higher level API for this 🤔 It would be nice if it could be the same API for both allgather and reduce_scatter.
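For illustration, here is one purely hypothetical shape such a unified API could take (a sketch of the suggestion, not anything in this PR; the underscore-prefixed helpers are placeholders standing in for the two fused ops):

```python
def _fused_all_gather_matmul(A_shard, B, *, gather_dim, group):
    raise NotImplementedError  # placeholder for the PR's all-gather -> matmul op

def _fused_matmul_reduce_scatter(A_shard, B, *, scatter_dim, group):
    raise NotImplementedError  # placeholder for the PR's matmul -> reduce-scatter op

def fused_matmul(A_shard, B, *, group, collective, dim=0):
    """Single entry point that fuses a matmul with either collective."""
    if collective == "all_gather":
        return _fused_all_gather_matmul(A_shard, B, gather_dim=dim, group=group)
    if collective == "reduce_scatter":
        return _fused_matmul_reduce_scatter(A_shard, B, scatter_dim=dim, group=group)
    raise ValueError(f"unknown collective: {collective!r}")
```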
looks awesome!
@contextmanager
def test_with_non_cuda_p2p_group():
nit: these test utils should move to the torch.testing package instead?
ag_shape = list(A_shard.shape)
ag_shape[gather_dim] *= group_size
ag_out = A_shard.new_empty(ag_shape)
return ag_out, [ag_out @ B for B in Bs]
for meta formulas, wondering if this matmul would actually incur computation or just call the matmul meta kernel (I guess it's the latter?)
Yeah we are calling the meta kernels to deduce device, shape, and strides.
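A quick standalone illustration of that behavior (not code from the PR): a matmul on meta tensors runs only the meta kernel, producing output metadata without allocating storage or executing any FLOPs.

```python
import torch

# Meta tensors carry device/shape/stride information but no data.
a = torch.empty(128, 64, device="meta")
b = torch.empty(64, 256, device="meta")
out = a @ b  # dispatches to the matmul meta kernel; no computation happens
print(out.shape, out.device)  # torch.Size([128, 256]) meta
```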
…_matmul_reduce_scatter" ## Context See context [here](#122163). ## This PR Introduces `cuda_p2p` based `fused_all_gather_matmul` and `fused_matmul_reduce_scatter` dispatcher ops which performs micro-pipelining TP for `all-gather -> matmul` and `matmul -> reduce-scatter` respectively. Fusion vs. decomposition - in principle, the micro-pipelining is achieved via decomposition. However, in practice, today Inductor can't deal with the decomposed patterns well. So instead performing decomposition in Inductor, we fuse the patterns to be decomposed and dispatch them to corresponding operators that handle decomposition + micropipelining. cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k [ghstack-poisoned]
…_matmul_reduce_scatter" ## Context See context [here](#122163). ## This PR Introduces `cuda_p2p` based `fused_all_gather_matmul` and `fused_matmul_reduce_scatter` dispatcher ops which performs micro-pipelining TP for `all-gather -> matmul` and `matmul -> reduce-scatter` respectively. Fusion vs. decomposition - in principle, the micro-pipelining is achieved via decomposition. However, in practice, today Inductor can't deal with the decomposed patterns well. So instead performing decomposition in Inductor, we fuse the patterns to be decomposed and dispatch them to corresponding operators that handle decomposition + micropipelining. cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k [ghstack-poisoned]
…_matmul_reduce_scatter" ## Context See context [here](#122163). ## This PR Introduces `cuda_p2p` based `fused_all_gather_matmul` and `fused_matmul_reduce_scatter` dispatcher ops which performs micro-pipelining TP for `all-gather -> matmul` and `matmul -> reduce-scatter` respectively. Fusion vs. decomposition - in principle, the micro-pipelining is achieved via decomposition. However, in practice, today Inductor can't deal with the decomposed patterns well. So instead performing decomposition in Inductor, we fuse the patterns to be decomposed and dispatch them to corresponding operators that handle decomposition + micropipelining. cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k [ghstack-poisoned]
…uce_scatter ghstack-source-id: cfada01c278b4ed552914d073147c77aa29e6a04 Pull Request resolved: #126634
…_matmul_reduce_scatter" ## Context See context [here](#122163). ## This PR Introduces `cuda_p2p` based `fused_all_gather_matmul` and `fused_matmul_reduce_scatter` dispatcher ops which performs micro-pipelining TP for `all-gather -> matmul` and `matmul -> reduce-scatter` respectively. Fusion vs. decomposition - in principle, the micro-pipelining is achieved via decomposition. However, in practice, today Inductor can't deal with the decomposed patterns well. So instead performing decomposition in Inductor, we fuse the patterns to be decomposed and dispatch them to corresponding operators that handle decomposition + micropipelining. cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k [ghstack-poisoned]
…_matmul_reduce_scatter" ## Context See context [here](#122163). ## This PR Introduces `cuda_p2p` based `fused_all_gather_matmul` and `fused_matmul_reduce_scatter` dispatcher ops which performs micro-pipelining TP for `all-gather -> matmul` and `matmul -> reduce-scatter` respectively. Fusion vs. decomposition - in principle, the micro-pipelining is achieved via decomposition. However, in practice, today Inductor can't deal with the decomposed patterns well. So instead performing decomposition in Inductor, we fuse the patterns to be decomposed and dispatch them to corresponding operators that handle decomposition + micropipelining. cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k [ghstack-poisoned]
…_matmul_reduce_scatter" [ghstack-poisoned]
…_matmul_reduce_scatter" [ghstack-poisoned]
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…uce_scatter (#126634) Pull Request resolved: #126634 Approved by: https://github.com/Chillee, https://github.com/wanchaol (cherry picked from commit 1071437)
Stack from ghstack (oldest at bottom):
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k