
Unknown: ncclCommInitRank failed: unhandled system error #4053

@Scaramouch33

Description

Hello, I am using the Horovod Docker image to train deep learning models, and the job fails during hvd.broadcast_global_variables with the NCCL error shown in the log below.

ENV: nvcr.io/nvidia/tensorflow:20.06-tf1-py3
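For context, train.py / apd_benchmark.py follow the usual Horovod TensorFlow 1.x pattern. Below is a minimal sketch of that pattern, not my exact code (model building is omitted); the broadcast call at the bottom is the op that fails in the log.

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one process per GPU; must run before any Horovod op is created

# Pin each process to a single GPU via its local rank.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

global_step = tf.train.get_or_create_global_step()

# Rank 0 broadcasts the initial variable values to all other ranks.
# Running this op is what triggers NCCL communicator creation
# (ncclCommInitRank), which is where the error below is raised.
bcast_global_variables_op = hvd.broadcast_global_variables(0)

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(bcast_global_variables_op)
    # ... training loop ...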

WARNING:tensorflow:From /dockerdata/trainer/apd_model.py:112: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.

WARNING:tensorflow:From /dockerdata/trainer/apd_benchmark.py:201: The name tf.train.get_global_step is deprecated. Please use tf.compat.v1.train.get_global_step instead.

WARNING:tensorflow:From /dockerdata/trainer/apd_benchmark.py:214: The name tf.summary.merge_all is deprecated. Please use tf.compat.v1.summary.merge_all instead.

WARNING:tensorflow:From /dockerdata/trainer/apd_benchmark.py:67: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From /dockerdata/trainer/apd_benchmark.py:75: The name tf.OptimizerOptions is deprecated. Please use tf.compat.v1.OptimizerOptions instead.

2024-07-08 00:49:01.389541: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2500000000 Hz
2024-07-08 00:49:01.389991: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x19757f70 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2024-07-08 00:49:01.390197: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
WARNING:tensorflow:From /dockerdata/trainer/apd_benchmark.py:67: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From /dockerdata/trainer/apd_benchmark.py:75: The name tf.OptimizerOptions is deprecated. Please use tf.compat.v1.OptimizerOptions instead.

2024-07-08 00:49:03.231094: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2500000000 Hz
2024-07-08 00:49:03.231519: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1a5db340 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2024-07-08 00:49:03.231650: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2024-07-08 00:49:03.235593: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2024-07-08 00:49:03.685146: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1a5dc130 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2024-07-08 00:49:03.685417: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla V100-SXM2-32GB, Compute Capability 7.0
2024-07-08 00:49:03.690708: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties:
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:1a:00.0
2024-07-08 00:49:03.690950: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
2024-07-08 00:49:03.695503: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.11
2024-07-08 00:49:03.697520: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2024-07-08 00:49:03.698148: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2024-07-08 00:49:03.702433: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2024-07-08 00:49:03.703722: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.11
2024-07-08 00:49:03.704088: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.8
2024-07-08 00:49:03.709970: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0
2024-07-08 00:49:03.710164: I te
2024-07-08 00:49:05.856798: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1648] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.UnknownError'>, 2 root error(s) found.
(0) Unknown: ncclCommInitRank failed: unhandled system error
[[node HorovodBroadcast_global_step_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[HorovodBroadcast_global_step_0/_1471]]
(1) Unknown: ncclCommInitRank failed: unhandled system error
[[node HorovodBroadcast_global_step_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'HorovodBroadcast_global_step_0':
File "train.py", line 41, in
BENCH.run()
File "/dockerdata/trainer/apd_benchmark.py", line 193, in run
self.benchmark_run()
File "/dockerdata/trainer/apd_benchmark.py", line 223, in benchmark_run
bcast_global_variables_op = hvd.broadcast_global_variables(0)
File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/init.py", line 296, in broadcast_global_variables
return broadcast_variables(_global_variables(), root_rank)
File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/functions.py", line 56, in broadcast_variables
return broadcast_group(variables, root_rank)
File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/functions.py", line 42, in broadcast_group
for var in variables])
File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/functions.py", line 42, in
for var in variables])
File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_ops.py", line 331, in broadcast
ignore_name_scope=ignore_name_scope)
File "", line 567, in horovod_broadcast
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in init
self._traceback = tf_stack.extract_stack()

ignore_name_scope=ignore_name_scope)
File "<string>", line 567, in horovod_broadcast
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in init
self._traceback = tf_stack.extract_stack()
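The "unhandled system error" status from ncclCommInitRank is NCCL's generic failure code, so I am trying to capture NCCL's own debug output to see the underlying cause (shared memory, peer access, or network setup are the usual suspects). Below is a minimal sketch of the diagnostic I am adding at the top of train.py; NCCL_DEBUG is read when the communicator is created, so it has to be set before the broadcast op first runs (exporting it in the launch environment would work as well).

import os

# NCCL reads these at communicator creation time (the failing
# ncclCommInitRank call), so set them before the first Horovod collective.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,NET")

import horovod.tensorflow as hvd

hvd.init()
print("rank=%d local_rank=%d size=%d"
      % (hvd.rank(), hvd.local_rank(), hvd.size()))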

Exception in thread cache_3:
Traceback (most recent call last):
File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/dockerdata/trainer/diskcache/cache.py", line 146, in _cache
self.LOG_FN("[rank %d] cache file queue len is %d" % ((thread_idx % self._consumers), cache_queue.qsize()))
TypeError: %d format: a number is required, not NoneType

Exception in thread cache_0:
Traceback (most recent call last):
File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/dockerdata/trainer/diskcache/cache.py", line 146, in _cache
self.LOG_FN("[rank %d] cache file queue len is %d" % ((thread_idx % self._consumers), cache_queue.qsize()))
TypeError: %d format: a number is required, not NoneType

Exception in thread cache_1:
Traceback (most recent call last):
File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/dockerdata/trainer/diskcache/cache.py", line 146, in _cache
self.LOG_FN("[rank %d] cache file queue len is %d" % ((thread_idx % self._consumers), cache_queue.qsize()))
TypeError: %d format: a number is required, not NoneType

Exception in thread cache_2:
Traceback (most recent call last):
File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/dockerdata/trainer/diskcache/cache.py", line 146, in _cache
self.LOG_FN("[rank %d] cache file queue len is %d" % ((thread_idx % self._consumers), cache_queue.qsize()))
TypeError: %d format: a number is required, not NoneType
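The cache_* thread crashes are a secondary problem in my own diskcache/cache.py: when the run is torn down, one of the two %d arguments on line 146 is None. A defensive sketch of that log call is below (the helper name _log_queue_len is mine, and I am assuming thread_idx, self._consumers and cache_queue mean what the traceback suggests).

def _log_queue_len(self, thread_idx, cache_queue):
    # Tolerate None values that show up while worker threads are shutting down.
    rank = thread_idx % self._consumers if thread_idx is not None else -1
    qsize = cache_queue.qsize() if cache_queue is not None else None
    self.LOG_FN("[rank %d] cache file queue len is %d"
                % (rank, qsize if qsize is not None else -1))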
