Description
Hello, I am using a Horovod Docker image to train deep learning models, and training fails with the NCCL error shown in the log below.
Environment: nvcr.io/nvidia/tensorflow:20.06-tf1-py3
WARNING:tensorflow:From /dockerdata/trainer/apd_model.py:112: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.
WARNING:tensorflow:From /dockerdata/trainer/apd_benchmark.py:201: The name tf.train.get_global_step is deprecated. Please use tf.compat.v1.train.get_global_step instead.
WARNING:tensorflow:From /dockerdata/trainer/apd_benchmark.py:214: The name tf.summary.merge_all is deprecated. Please use tf.compat.v1.summary.merge_all instead.
WARNING:tensorflow:From /dockerdata/trainer/apd_benchmark.py:67: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.
WARNING:tensorflow:From /dockerdata/trainer/apd_benchmark.py:75: The name tf.OptimizerOptions is deprecated. Please use tf.compat.v1.OptimizerOptions instead.
2024-07-08 00:49:01.389541: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2500000000 Hz
2024-07-08 00:49:01.389991: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x19757f70 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2024-07-08 00:49:01.390197: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default
WARNING:tensorflow:From /dockerdata/trainer/apd_benchmark.py:67: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.
WARNING:tensorflow:From /dockerdata/trainer/apd_benchmark.py:75: The name tf.OptimizerOptions is deprecated. Please use tf.compat.v1.OptimizerOptions instead.
2024-07-08 00:49:03.231094: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2500000000 Hz
2024-07-08 00:49:03.231519: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1a5db340 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2024-07-08 00:49:03.231650: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2024-07-08 00:49:03.235593: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2024-07-08 00:49:03.685146: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1a5dc130 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2024-07-08 00:49:03.685417: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla V100-SXM2-32GB, Compute Capability 7.0
2024-07-08 00:49:03.690708: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties:
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:1a:00.0
2024-07-08 00:49:03.690950: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
2024-07-08 00:49:03.695503: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.11
2024-07-08 00:49:03.697520: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2024-07-08 00:49:03.698148: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2024-07-08 00:49:03.702433: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2024-07-08 00:49:03.703722: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.11
2024-07-08 00:49:03.704088: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.8
2024-07-08 00:49:03.709970: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0
2024-07-08 00:49:03.710164: I te
2024-07-08 00:49:05.856798: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1648] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.UnknownError'>, 2 root error(s) found.
(0) Unknown: ncclCommInitRank failed: unhandled system error
[[node HorovodBroadcast_global_step_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[HorovodBroadcast_global_step_0/_1471]]
(1) Unknown: ncclCommInitRank failed: unhandled system error
[[node HorovodBroadcast_global_step_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.
Original stack trace for 'HorovodBroadcast_global_step_0':
File "train.py", line 41, in <module>
BENCH.run()
File "/dockerdata/trainer/apd_benchmark.py", line 193, in run
self.benchmark_run()
File "/dockerdata/trainer/apd_benchmark.py", line 223, in benchmark_run
bcast_global_variables_op = hvd.broadcast_global_variables(0)
File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 296, in broadcast_global_variables
return broadcast_variables(_global_variables(), root_rank)
File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/functions.py", line 56, in broadcast_variables
return broadcast_group(variables, root_rank)
File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/functions.py", line 42, in broadcast_group
for var in variables])
File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/functions.py", line 42, in <listcomp>
for var in variables])
File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_ops.py", line 331, in broadcast
ignore_name_scope=ignore_name_scope)
File "<string>", line 567, in horovod_broadcast
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
self._traceback = tf_stack.extract_stack()
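"unhandled system error" by itself does not say which system call failed inside ncclCommInitRank. One way to get more detail is to raise NCCL's log level before Horovod initializes. The following is only a minimal sketch, assuming the variables are set from inside the training script; the helper name `enable_nccl_debug` is hypothetical, while `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` are NCCL's own environment variables.

```python
import os


def enable_nccl_debug(env=os.environ, subsystems="INIT,NET"):
    """Hypothetical helper: request verbose NCCL logging.

    It has to run before hvd.init(), because NCCL reads these
    variables when the communicator is created.
    """
    # setdefault keeps any value already exported by the launcher.
    env.setdefault("NCCL_DEBUG", "INFO")
    env.setdefault("NCCL_DEBUG_SUBSYS", subsystems)
    return {key: env[key] for key in ("NCCL_DEBUG", "NCCL_DEBUG_SUBSYS")}


if __name__ == "__main__":
    # Show the values that the NCCL library would pick up.
    print(enable_nccl_debug())
```

With this in place, each rank should print NCCL's initialization and network setup messages ahead of the failure, which usually narrows down whether the problem is shared memory, the network interface, or something else.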
Exception in thread cache_3:
Traceback (most recent call last):
File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/dockerdata/trainer/diskcache/cache.py", line 146, in _cache
self.LOG_FN("[rank %d] cache file queue len is %d" % ((thread_idx % self._consumers), cache_queue.qsize()))
TypeError: %d format: a number is required, not NoneType
Exception in thread cache_0:
Traceback (most recent call last):
File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/dockerdata/trainer/diskcache/cache.py", line 146, in _cache
self.LOG_FN("[rank %d] cache file queue len is %d" % ((thread_idx % self._consumers), cache_queue.qsize()))
TypeError: %d format: a number is required, not NoneType
Exception in thread cache_1:
Traceback (most recent call last):
File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/dockerdata/trainer/diskcache/cache.py", line 146, in _cache
self.LOG_FN("[rank %d] cache file queue len is %d" % ((thread_idx % self._consumers), cache_queue.qsize()))
TypeError: %d format: a number is required, not NoneType
Exception in thread cache_2:
Traceback (most recent call last):
File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/dockerdata/trainer/diskcache/cache.py", line 146, in _cache
self.LOG_FN("[rank %d] cache file queue len is %d" % ((thread_idx % self._consumers), cache_queue.qsize()))
TypeError: %d format: a number is required, not NoneType
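The TypeError in the cache threads looks like a secondary failure rather than the cause of the NCCL error. If `thread_idx` or `self._consumers` were None, the modulo would raise a different TypeError, so the None value is presumably what `cache_queue.qsize()` returned. A minimal sketch of the failure and of a log line that tolerates it; the helper name `format_queue_len` is hypothetical and is not part of the trainer code:

```python
def format_queue_len(rank, qsize):
    """Hypothetical helper: build the log line even when qsize is None."""
    # %d rejects None, so substitute a placeholder string instead.
    size = "unknown" if qsize is None else "%d" % qsize
    return "[rank %d] cache file queue len is %s" % (rank, size)


if __name__ == "__main__":
    # Reproduce the error from the traceback: %d cannot format None.
    try:
        "[rank %d] cache file queue len is %d" % (0, None)
    except TypeError as exc:
        print("reproduced:", exc)

    # The tolerant version logs a placeholder instead of raising.
    print(format_queue_len(0, None))
    print(format_queue_len(1, 12))
```

This only keeps the logging thread alive; why the queue reports no size would still need to be checked separately.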