[c10d] Fix stuck after onCompletionhook exception was caught #126666

Open
wants to merge 1 commit into
base: main

@HOOLoLo HOOLoLo commented May 20, 2024

Currently, the runHookLoop thread does not clean up completedWorkList_ if the onCompletionHook raises an exception.
As a result, the main thread that called waitForPendingWorks gets stuck.
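
The failure mode can be modeled in plain Python. This is a simplified sketch, not the actual C++ code in the process group: `drain_with_outer_try`, `failing_hook`, and the work names are hypothetical, and it assumes the try block wraps the whole loop, so the first exception skips the erase step and leaves entries in the list.

```python
from collections import deque


def drain_with_outer_try(completed_work_list, hook):
    """Hypothetical model of a hook loop whose try block wraps the whole loop.

    Returns the list of exceptions that were caught.
    """
    errors = []
    try:
        while completed_work_list:
            work = completed_work_list[0]
            hook(work)  # may raise
            completed_work_list.popleft()  # skipped when the hook raises
    except RuntimeError as exc:
        # The exception is caught, but the failing entry is never erased.
        errors.append(exc)
    return errors


def failing_hook(work):
    # Stand-in for a user hook that always fails.
    raise RuntimeError("hook error")


pending = deque(["broadcast-0", "broadcast-1"])
errors = drain_with_outer_try(pending, failing_hook)

# The list still holds every entry, so a caller that waits for it to
# become empty would never return.
remaining = len(pending)
```

In this model the list is never drained once a hook raises, which matches the hang described above for a thread waiting on pending works.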

Test Case:

    def test_on_completion_hook_exception(self):
        pg = self._get_process_group()

        # Hook that always fails, to exercise the exception path.
        def hook(work_info: torch._C._distributed_c10d.WorkInfo):
            raise RuntimeError("hook error")

        pg._register_on_completion_hook(hook)
        tensor = torch.ones([2, 3]).cuda(self.rank) * self.rank
        pg.broadcast([tensor]).wait()

        # N.B.: destroy_process_group is necessary to wait for
        # all pending works to finish.
        c10d.destroy_process_group(pg)

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k


pytorch-bot bot commented May 20, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126666

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c835056 with merge base 5fb11cd:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels May 20, 2024
@HOOLoLo HOOLoLo changed the title Fix stuck after onCompletionhook exception was caught [c10d] Fix stuck after onCompletionhook exception was caught May 20, 2024
@drisspg drisspg added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label May 20, 2024

HOOLoLo commented May 21, 2024

This PR moves the try/catch into the for loop so that each completed work is handled individually, and the work is always erased whether or not an exception is caught.
However, I don't know why the comm is aborted after a hook exception is caught.
What should happen when a hook raises an exception? Kill the process, or just log a message?
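
The per-item handling described in this comment can be sketched in Python as follows. This is an illustrative model under assumptions, not the PR's C++ diff: `drain_with_inner_try`, `flaky_hook`, and the work names are hypothetical. Each completed work gets its own try block, and the entry is removed in a `finally` clause, so the list is drained even when a hook raises.

```python
from collections import deque


def drain_with_inner_try(completed_work_list, hook):
    """Hypothetical model: handle each completed work in its own try block.

    Returns a list of (work, exception) pairs for hooks that failed.
    """
    errors = []
    while completed_work_list:
        work = completed_work_list[0]
        try:
            hook(work)
        except RuntimeError as exc:
            errors.append((work, exc))
        finally:
            # Always erase the entry, whether or not the hook raised.
            completed_work_list.popleft()
    return errors


def flaky_hook(work):
    # Stand-in for a user hook that fails on one specific work item.
    if work == "broadcast-1":
        raise RuntimeError("hook error")


pending = deque(["broadcast-0", "broadcast-1", "broadcast-2"])
errors = drain_with_inner_try(pending, flaky_hook)
```

The sketch only records the failures. Whether to log them, re-raise, or abort is the open question raised above.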

@wconstab
Contributor

@shuqiangzhang can you review?

Labels
oncall: distributed Add this issue/PR to distributed oncall triage queue open source release notes: distributed (c10d) release notes category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module