OS and version: Linux SMP x86_64 x86_64 x86_64 GNU/Linux
GCC version: 7.3.1
CMake version: 3.14
Bug report:
import horovod.torch as hvd
import time

worker_1_process_set = hvd.ProcessSet([1])
worker_2_process_set = hvd.ProcessSet([0, 2])

hvd.init(process_sets="dynamic")
hvd.add_process_set(worker_1_process_set)
hvd.add_process_set(worker_2_process_set)


@hvd.elastic.run
def main(state):
    rank = hvd.rank()
    size = hvd.size()
    if rank == 0:
        while True:
            print(f"Sleeping for 1 second: {rank}", flush=True)
            time.sleep(1)
    elif rank == 1:
        while True:
            print(f"Sleeping for 1 second: {rank}", flush=True)
            time.sleep(1)
    elif rank == 2:
        while True:
            print(f"Sleeping for 1 second: {rank}", flush=True)
            time.sleep(1)


if __name__ == '__main__':
    print(f"Initialized with rank {hvd.rank()}", flush=True)
    # Initialize the TorchState
    state = hvd.elastic.TorchState()
    print(f"Running main with rank {hvd.rank()}", flush=True)
    main(state)
    print(f"Finished running main with rank {hvd.rank()}", flush=True)
    print(f"Joined with rank {hvd.rank()}", flush=True)
I am running the code above with elastic Horovod and process sets. I use the command below to run all 3 workers on a single node. After I kill one of the processes from a terminal, all the remaining processes are killed as well. If I repeat the same workflow with the same command but without process sets, the remaining 2 workers are not terminated after I kill one process. With process sets and elastic Horovod, I expected that one worker failure would not terminate the remaining processes, which is what happens in the log below. When I don't use process sets, the remaining workers stay alive as expected. What could be the reason? Is this a bug, or am I missing something about using process sets? Please help.
(horovod-setup) (miniconda3) [pgadikar@ip-10-20-1-15 experiments]$ horovodrun -np 3 --min-np 2 --host-discovery-script discover-hosts.sh --elastic-timeout 5 --network-interfaces eth0,lo python master-child-exp.py
[1]<stdout>:Initialized with rank 1
[1]<stdout>:Running main with rank 1
[2]<stdout>:Initialized with rank 2
[2]<stdout>:Running main with rank 2
[0]<stdout>:Initialized with rank 0
[0]<stdout>:Running main with rank 0
[1]<stdout>:Sleeping for 1 second: 1
[2]<stdout>:Sleeping for 1 second: 2
[0]<stdout>:Sleeping for 1 second: 0
[2]<stderr>:[2024-02-07 04:16:27.910743: E /tmp/pip-install-ozxdndi9/horovod_e8e6eba6ed5e495cb7b495d7bb552c01/horovod/common/operations.cc:697] [2]: Horovod background loop uncaught exception: [/tmp/pip-install-ozxdndi9/horovod_e8e6eba6ed5e495cb7b495d7bb552c01/third_party/compatible_gloo/gloo/transport/tcp/pair.cc:589] Read error [10.20.1.15]:20903: Connection reset by peer
[0]<stderr>:[2024-02-07 04:16:27.910752: E /tmp/pip-install-ozxdndi9/horovod_e8e6eba6ed5e495cb7b495d7bb552c01/horovod/common/operations.cc:697] [0]: Horovod background loop uncaught exception: [/tmp/pip-install-ozxdndi9/horovod_e8e6eba6ed5e495cb7b495d7bb552c01/third_party/compatible_gloo/gloo/transport/tcp/pair.cc:589] Read error [10.20.1.15]:49541: Connection reset by peer
[2]<stderr>:terminate called after throwing an instance of 'gloo::IoException'
[0]<stderr>:terminate called after throwing an instance of 'gloo::IoException'
[2]<stderr>: what(): [/tmp/pip-install-ozxdndi9/horovod_e8e6eba6ed5e495cb7b495d7bb552c01/third_party/compatible_gloo/gloo/transport/tcp/pair.cc:589] Read error [10.20.1.15]:20903: Connection reset by peer
[0]<stderr>: what(): [/tmp/pip-install-ozxdndi9/horovod_e8e6eba6ed5e495cb7b495d7bb552c01/third_party/compatible_gloo/gloo/transport/tcp/pair.cc:589] Read error [10.20.1.15]:49541: Connection reset by peer
Process 1 exit with status code 143.
Process 2 exit with status code 134.
Process 0 exit with status code 134.
ERROR:root:failure count == 3 -> stop running
Traceback (most recent call last):
File "/home/pgadikar/miniconda3/envs/horovod-setup/bin/horovodrun", line 8, in <module>
sys.exit(run_commandline())
File "/home/pgadikar/miniconda3/envs/horovod-setup/lib/python3.9/site-packages/horovod/runner/launch.py", line 837, in run_commandline
_run(args)
File "/home/pgadikar/miniconda3/envs/horovod-setup/lib/python3.9/site-packages/horovod/runner/launch.py", line 825, in _run
return _run_elastic(args)
File "/home/pgadikar/miniconda3/envs/horovod-setup/lib/python3.9/site-packages/horovod/runner/launch.py", line 738, in _run_elastic
return gloo_run_elastic(settings, env, args.run_func if args.run_func else args.command, executable)
File "/home/pgadikar/miniconda3/envs/horovod-setup/lib/python3.9/site-packages/horovod/runner/gloo_run.py", line 380, in gloo_run_elastic
return launch_gloo_elastic(command_or_func, exec_command, settings, env, get_common_interfaces, rendezvous, executable)
File "/home/pgadikar/miniconda3/envs/horovod-setup/lib/python3.9/site-packages/horovod/runner/gloo_run.py", line 351, in launch_gloo_elastic
raise RuntimeError('Horovod detected that one or more processes exited with non-zero '
RuntimeError: Horovod detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: ip-10-20-1-15.us-east-2.compute.internal[1]
Exit code: 143
(horovod-setup) (miniconda3) [pgadikar@ip-10-20-1-15 experiments]$
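A note on reading the exit codes in the log: assuming they follow the usual shell convention of 128 + signal number, 143 corresponds to SIGTERM (the process killed by hand) and 134 to SIGABRT, which matches the "terminate called after throwing an instance of 'gloo::IoException'" lines for ranks 0 and 2. The small helper below is only a sketch for decoding such codes; describe_exit_status is a hypothetical name and not part of Horovod.

```python
import signal


def describe_exit_status(code: int) -> str:
    """Interpret a shell-style exit status.

    Assumption: statuses above 128 mean "killed by signal (code - 128)",
    which is the common shell convention, not something Horovod documents.
    """
    if code > 128:
        signum = code - 128
        try:
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            # Not a signal number known on this platform.
            return f"killed by unknown signal {signum}"
    return f"exited normally with status {code}"


# Exit codes copied from the log above, as (rank, status) pairs.
for rank, code in [(1, 143), (2, 134), (0, 134)]:
    print(f"Process {rank}: {describe_exit_status(code)}")
```

Read this way, rank 1 is the only process that was terminated from outside; ranks 0 and 2 aborted on their own after the uncaught exception in the Horovod background loop.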