
Unknown: ncclCommInitRank failed: unhandled system error #4053

@Scaramouch33

Description

Hello, I am using the Horovod Docker image to train deep learning models, and the job fails during hvd.broadcast_global_variables with the NCCL error shown in the log below.

ENV: nvcr.io/nvidia/tensorflow:20.06-tf1-py3
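For context, train.py / apd_benchmark.py follow the usual Horovod TensorFlow 1.x pattern. Below is a minimal sketch of that pattern, not my exact code (model building is omitted); the broadcast call at the bottom is the op that fails in the log.

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one process per GPU; must run before any Horovod op is created

# Pin each process to a single GPU via its local rank.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

global_step = tf.train.get_or_create_global_step()

# Rank 0 broadcasts the initial variable values to all other ranks.
# Running this op is what triggers NCCL communicator creation
# (ncclCommInitRank), which is where the error below is raised.
bcast_global_variables_op = hvd.broadcast_global_variables(0)

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(bcast_global_variables_op)
    # ... training loop ...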

WARNING:tensorflow:From /dockerdata/trainer/apd_model.py:112: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.

WARNING:tensorflow:From /dockerdata/trainer/apd_benchmark.py:201: The name tf.train.get_global_step is deprecated. Please use tf.compat.v1.train.get_global_step instead.

WARNING:tensorflow:From /dockerdata/trainer/apd_benchmark.py:214: The name tf.summary.merge_all is deprecated. Please use tf.compat.v1.summary.merge_all instead.

WARNING:tensorflow:From /dockerdata/trainer/apd_benchmark.py:67: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From /dockerdata/trainer/apd_benchmark.py:75: The name tf.OptimizerOptions is deprecated. Please use tf.compat.v1.OptimizerOptions instead.

2024-07-08 00:49:01.389541: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2500000000 Hz
2024-07-08 00:49:01.389991: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x19757f70 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2024-07-08 00:49:01.390197: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
WARNING:tensorflow:From /dockerdata/trainer/apd_benchmark.py:67: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From /dockerdata/trainer/apd_benchmark.py:75: The name tf.OptimizerOptions is deprecated. Please use tf.compat.v1.OptimizerOptions instead.

2024-07-08 00:49:03.231094: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2500000000 Hz
2024-07-08 00:49:03.231519: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1a5db340 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2024-07-08 00:49:03.231650: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2024-07-08 00:49:03.235593: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2024-07-08 00:49:03.685146: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1a5dc130 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2024-07-08 00:49:03.685417: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla V100-SXM2-32GB, Compute Capability 7.0
2024-07-08 00:49:03.690708: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties:
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:1a:00.0
2024-07-08 00:49:03.690950: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
2024-07-08 00:49:03.695503: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.11
2024-07-08 00:49:03.697520: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2024-07-08 00:49:03.698148: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2024-07-08 00:49:03.702433: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2024-07-08 00:49:03.703722: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.11
2024-07-08 00:49:03.704088: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.8
2024-07-08 00:49:03.709970: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0
2024-07-08 00:49:03.710164: I te
2024-07-08 00:49:05.856798: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1648] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.UnknownError'>, 2 root error(s) found.
(0) Unknown: ncclCommInitRank failed: unhandled system error
[[node HorovodBroadcast_global_step_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[HorovodBroadcast_global_step_0/_1471]]
(1) Unknown: ncclCommInitRank failed: unhandled system error
[[node HorovodBroadcast_global_step_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'HorovodBroadcast_global_step_0':
File "train.py", line 41, in
BENCH.run()
File "/dockerdata/trainer/apd_benchmark.py", line 193, in run
self.benchmark_run()
File "/dockerdata/trainer/apd_benchmark.py", line 223, in benchmark_run
bcast_global_variables_op = hvd.broadcast_global_variables(0)
File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/init.py", line 296, in broadcast_global_variables
return broadcast_variables(_global_variables(), root_rank)
File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/functions.py", line 56, in broadcast_variables
return broadcast_group(variables, root_rank)
File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/functions.py", line 42, in broadcast_group
for var in variables])
File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/functions.py", line 42, in
for var in variables])
File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_ops.py", line 331, in broadcast
ignore_name_scope=ignore_name_scope)
File "", line 567, in horovod_broadcast
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in init
self._traceback = tf_stack.extract_stack()

ignore_name_scope=ignore_name_scope)
File "<string>", line 567, in horovod_broadcast
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in init
self._traceback = tf_stack.extract_stack()
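The "unhandled system error" status from ncclCommInitRank is NCCL's generic failure code, so I am trying to capture NCCL's own debug output to see the underlying cause (shared memory, peer access, or network setup are the usual suspects). Below is a minimal sketch of the diagnostic I am adding at the top of train.py; NCCL_DEBUG is read when the communicator is created, so it has to be set before the broadcast op first runs (exporting it in the launch environment would work as well).

import os

# NCCL reads these at communicator creation time (the failing
# ncclCommInitRank call), so set them before the first Horovod collective.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,NET")

import horovod.tensorflow as hvd

hvd.init()
print("rank=%d local_rank=%d size=%d"
      % (hvd.rank(), hvd.local_rank(), hvd.size()))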

Exception in thread cache_3:
Traceback (most recent call last):
File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/dockerdata/trainer/diskcache/cache.py", line 146, in _cache
self.LOG_FN("[rank %d] cache file queue len is %d" % ((thread_idx % self._consumers), cache_queue.qsize()))
TypeError: %d format: a number is required, not NoneType

Exception in thread cache_0:
Traceback (most recent call last):
File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/dockerdata/trainer/diskcache/cache.py", line 146, in _cache
self.LOG_FN("[rank %d] cache file queue len is %d" % ((thread_idx % self._consumers), cache_queue.qsize()))
TypeError: %d format: a number is required, not NoneType

Exception in thread cache_1:
Traceback (most recent call last):
File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/dockerdata/trainer/diskcache/cache.py", line 146, in _cache
self.LOG_FN("[rank %d] cache file queue len is %d" % ((thread_idx % self._consumers), cache_queue.qsize()))
TypeError: %d format: a number is required, not NoneType

Exception in thread cache_2:
Traceback (most recent call last):
File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/dockerdata/trainer/diskcache/cache.py", line 146, in _cache
self.LOG_FN("[rank %d] cache file queue len is %d" % ((thread_idx % self._consumers), cache_queue.qsize()))
TypeError: %d format: a number is required, not NoneType
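The cache_* thread crashes are a secondary problem in my own diskcache/cache.py: when the run is torn down, one of the two %d arguments on line 146 is None. A defensive sketch of that log call is below (the helper name _log_queue_len is mine, and I am assuming thread_idx, self._consumers and cache_queue mean what the traceback suggests).

def _log_queue_len(self, thread_idx, cache_queue):
    # Tolerate None values that show up while worker threads are shutting down.
    rank = thread_idx % self._consumers if thread_idx is not None else -1
    qsize = cache_queue.qsize() if cache_queue is not None else None
    self.LOG_FN("[rank %d] cache file queue len is %d"
                % (rank, qsize if qsize is not None else -1))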
