[CUDA12] set_device change #94864
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/94864. Note: links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 88d7b8b. This comment was automatically generated by Dr. CI and updates every 15 minutes.
From offline discussion: this PR would most likely need to use a lazy loading approach.
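For context on what a lazy approach would avoid, here is a minimal, hypothetical Python sketch (not code from this PR): under CUDA 12, `cudaSetDevice` itself creates the primary context on the target device, so even a bare device switch can reserve per-GPU context memory. `torch._C._cuda_hasPrimaryContext` is a private helper used in PyTorch's own primary-context tests, and the device index assumes a machine with at least two GPUs.

```python
import torch

# Hypothetical illustration, not code from this PR.
# Under CUDA 12, cudaSetDevice eagerly creates the primary context on the
# target device, so even switching devices without allocating a tensor can
# already reserve context memory on that GPU.
torch.cuda.set_device(1)

# Private helper used by PyTorch's own tests to check whether a primary
# context has been created on a given device.
print(torch._C._cuda_hasPrimaryContext(1))  # may already be True under CUDA 12
```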
@pytorchbot merge
Merge started: your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed, first few of them are: Meta Internal-Only Changes Check. Details for Dev Infra team: raised by workflow job.
I've unlinked the PR internally, so the next merge attempt should succeed, but let's not do it before the weekend.
@pytorchbot merge
Merge started: your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed, first few of them are: Meta Internal-Only Changes Check. Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started: your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Hi @Aidyn-A! Soon after this PR landed, some long-standing FSDP unit tests became flaky (#99011, #98821). I am not entirely sure of the connection, but we see:
More stacktrace (run on 8 GPUs)
I wonder if there could be any conflict with this change. Perhaps one possible remediation is to revert this PR for now? cc: @ezyang @ngimel
@pytorchbot revert
❌ 🤖 pytorchbot command failed:
Try
@pytorchbot revert -m "causes flaky fsdp failures" -c weird |
@pytorchbot successfully started a revert job. Check the current status here.
Reverting PR 94864 failed. Reason: Command
Details for Dev Infra team: raised by workflow job.
This PR adds a workaround for the CUDA 12 `cudaSetDevice` change, which now always creates a primary context on the target device. Without it, operations like the one sketched below would create a primary context on device `cuda:1`, because a tensor is created there, and also on device `cuda:0`, because the destructor of the CUDA device guard calls `cudaSetDevice(0)`. After this PR the CUDA device guard will not call `cudaSetDevice(0)` if a primary context does not exist on `cuda:0`.

cc @ezyang @gchanan
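The snippet referenced in the description is not reproduced on this page, so here is a hedged reconstruction of the scenario it describes (assumes two GPUs; `torch._C._cuda_hasPrimaryContext` is a private helper used in PyTorch's own tests, not part of the public API):

```python
import torch

# Creating a tensor on the second GPU initializes the primary context on
# cuda:1, as expected.
x = torch.zeros(10, device="cuda:1")

# Before this PR, the CUDA device guard used during the allocation restored
# the original device via cudaSetDevice(0) in its destructor, which under
# CUDA 12 also created a primary context on cuda:0. After this PR the guard
# skips that call when cuda:0 has no primary context yet.
print(torch._C._cuda_hasPrimaryContext(1))  # True: the tensor lives here
print(torch._C._cuda_hasPrimaryContext(0))  # expected to stay False after this PR
```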