Aidyn-A (Collaborator) commented Feb 14, 2023

This PR adds a workaround for the CUDA 12 change to cudaSetDevice, which now always creates a primary context on the target device. As a result, operations like this:

import torch
x = torch.randn(1, device="cuda:1")

would create a primary context on cuda:1, because the tensor is created there, and also on cuda:0, because the destructor of the CUDA device guard calls cudaSetDevice(0).
After this PR, the CUDA device guard will not call cudaSetDevice(0) if no primary context exists on cuda:0.
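
The behavior described above can be illustrated with a small toy model. This is only a sketch in plain Python, not PyTorch's implementation: `FakeCudaRuntime`, `DeviceGuard`, `create_tensor_on`, and the `pending` attribute are hypothetical names introduced for illustration, and the fake runtime simply assumes that every set-device call creates a primary context, as the description says CUDA 12 does.

```python
class FakeCudaRuntime:
    """Toy stand-in for the runtime: set_device always creates a primary context."""

    def __init__(self):
        self.current = 0       # device the runtime considers current
        self.pending = None    # hypothetical: device to restore lazily later
        self.contexts = set()  # devices that have a primary context

    def set_device(self, index):
        self.current = index
        self.contexts.add(index)  # assumed CUDA 12 behavior from the description

    def has_primary_context(self, index):
        return index in self.contexts


class DeviceGuard:
    """Switch to a device on entry and restore the previous one on exit."""

    def __init__(self, runtime, index, check_context):
        self.runtime = runtime
        self.index = index
        self.check_context = check_context

    def __enter__(self):
        self.prev = self.runtime.current
        self.runtime.set_device(self.index)
        return self

    def __exit__(self, *exc):
        rt = self.runtime
        if self.check_context and not rt.has_primary_context(self.prev):
            # Workaround path: skip the set-device call so that no context
            # is created on the previous device; only remember the intent.
            rt.pending = self.prev
        else:
            rt.set_device(self.prev)
        return False


def create_tensor_on(runtime, index, check_context):
    """Mimic creating a tensor on `index` and report which devices got a context."""
    with DeviceGuard(runtime, index, check_context):
        pass  # the allocation on the target device would happen here
    return sorted(runtime.contexts)


print(create_tensor_on(FakeCudaRuntime(), 1, check_context=False))  # [0, 1]
print(create_tensor_on(FakeCudaRuntime(), 1, check_context=True))   # [1]
```

With `check_context=False` the guard's exit touches device 0 and a second context appears; with `check_context=True` only device 1 ends up with a context.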

cc @ezyang @gchanan

pytorch-bot bot commented Feb 14, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/94864

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 88d7b8b:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

jjsjann123 added the ciflow/trunk label (Trigger trunk jobs on your pull request) Feb 15, 2023
Aidyn-A marked this pull request as ready for review February 15, 2023 17:05
Aidyn-A changed the title from "[wip] CUDA12 set_device change" to "CUDA12 set_device change" Feb 15, 2023

Aidyn-A (Collaborator, Author) commented Feb 15, 2023

cc @atalman @malfet @ngimel @ptrblck

ngimel added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Feb 16, 2023

ptrblck (Collaborator) commented Feb 16, 2023

From offline discussion:
Linking against the driver API might break use cases where no driver is installed; for example, the import could fail.
From ATenNVRTC.h:

// ATen does not directly link to either libnvrtc or libcuda because they
// require libcuda to be installed, yet we want our GPU build to work on CPU
// machines as long as CUDA is not initialized.

so this PR would most likely need to use a lazy loading approach.

Aidyn-A marked this pull request as draft February 20, 2023 02:02
Aidyn-A marked this pull request as ready for review April 7, 2023 21:18

Aidyn-A (Collaborator, Author) commented Apr 7, 2023

@pytorchbot merge

pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

pytorchmergebot (Collaborator) commented

Merge failed

Reason: 1 jobs have failed, first few of them are: Meta Internal-Only Changes Check

ngimel (Collaborator) commented Apr 7, 2023

I've unlinked the PR internally, so the next merge attempt should succeed, but let's not do it before the weekend.

ngimel (Collaborator) commented Apr 10, 2023

@pytorchbot merge

pytorchmergebot (Collaborator) commented

Merge started

pytorchmergebot (Collaborator) commented

Merge failed

Reason: 1 jobs have failed, first few of them are: Meta Internal-Only Changes Check

Aidyn-A (Collaborator, Author) commented Apr 10, 2023

@pytorchbot merge

pytorchmergebot (Collaborator) commented

Merge started

awgu (Collaborator) commented Apr 14, 2023

Hi @Aidyn-A!

Soon after this PR landed, some long-standing FSDP unit tests became flaky (#99011, #98821). I am not entirely sure of the connection, but we see SIGABRT or SIGSEGV without much information. I am able to reproduce it after many runs, e.g.:

Exception raised from c10_cuda_check_implementation at /fsx/users/andgu/work/pytorch/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7fd198b6f9ec in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7fd198b3365a in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3cc (0x7fd198c0575c in /fsx/users/andgu/work/pytorch/torch/lib/libc10_cuda.so)
frame #3: c10::cuda::SetDevice(int) + 0x51 (0x7fd198c05af1 in /fsx/users/andgu/work/pytorch/torch/lib/libc10_cuda.so)
More stacktrace (run on 8 GPUs)
INFO:torch.distributed.fsdp.flat_param:FSDP FlatParameter address alignment created 3 numel of padding (462 vs. 459)
INFO:torch.distributed.fsdp.flat_param:FSDP FlatParameter world size divisibility created 2 numel of padding
.dist init r=3, world=8
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 3
dist init r=1, world=8
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 1
dist init r=2, world=8
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 2
dist init r=0, world=8
dist init r=5, world=8
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 5
dist init r=4, world=8
dist init r=6, world=8
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 4
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 6
dist init r=7, world=8
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 7
INFO:torch.distributed.distributed_c10d:Rank 7: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
INFO:torch.distributed.distributed_c10d:Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
INFO:torch.distributed.distributed_c10d:Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
INFO:torch.distributed.distributed_c10d:Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
INFO:torch.distributed.distributed_c10d:Rank 5: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
INFO:torch.distributed.distributed_c10d:Rank 4: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
INFO:torch.distributed.distributed_c10d:Rank 6: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 1
INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 3
INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 4
INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 5
INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 6
INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 2
INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 0
INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 7
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: driver shutting down
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /fsx/users/andgu/work/pytorch/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7fd198b6f9ec in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7fd198b3365a in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3cc (0x7fd198c0575c in /fsx/users/andgu/work/pytorch/torch/lib/libc10_cuda.so)
frame #3: c10::cuda::SetDevice(int) + 0x51 (0x7fd198c05af1 in /fsx/users/andgu/work/pytorch/torch/lib/libc10_cuda.so)
frame #4: std::_Sp_counted_ptr_inplace<std::vector<at::cuda::CUDAEvent, std::allocator<at::cuda::CUDAEvent> >, std::allocator<std::vector<at::cuda::CUDAEvent, std::allocator<at::cuda::CUDAEvent> > >, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0xba (0x7fd199aa8e7a in /fsx/users/andgu/work/pytorch/torch/lib/libtorch_cuda.so)
frame #5: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x48 (0x7fd1ae31c258 in /fsx/users/andgu/work/pytorch/torch/lib/libtorch_python.so)
frame #6: c10d::ProcessGroupNCCL::WorkNCCL::~WorkNCCL() + 0x135 (0x7fd199a80315 in /fsx/users/andgu/work/pytorch/torch/lib/libtorch_cuda.so)
frame #7: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x24a (0x7fd199a8ca7a in /fsx/users/andgu/work/pytorch/torch/lib/libtorch_cuda.so)
frame #8: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x8c (0x7fd199a8cccc in /fsx/users/andgu/work/pytorch/torch/lib/libtorch_cuda.so)
frame #9: <unknown function> + 0xc819d (0x7fd1b0dd819d in /fsx/users/andgu/conda/envs/pytorch/bin/../lib/libstdc++.so.6)
frame #10: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #11: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570285: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: <unknown function> + 0x12d0431 (0x7fd1a3404431 in /fsx/users/andgu/work/pytorch/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x468a7 (0x7fd1dfdf88a7 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: on_exit + 0 (0x7fd1dfdf8a60 in /lib/x86_64-linux-gnu/libc.so.6)
frame #5: <unknown function> + 0x1168d7 (0x564eafa698d7 in /fsx/users/andgu/conda/envs/pytorch/bin/python)
frame #6: <unknown function> + 0x116903 (0x564eafa69903 in /fsx/users/andgu/conda/envs/pytorch/bin/python)
frame #7: <unknown function> + 0x116952 (0x564eafa69952 in /fsx/users/andgu/conda/envs/pytorch/bin/python)
frame #8: PyRun_SimpleStringFlags + 0x4d (0x564eafa6adae in /fsx/users/andgu/conda/envs/pytorch/bin/python)
frame #9: <unknown function> + 0x118fdf (0x564eafa6bfdf in /fsx/users/andgu/conda/envs/pytorch/bin/python)
frame #10: Py_BytesMain + 0x39 (0x564eafbaa729 in /fsx/users/andgu/conda/envs/pytorch/bin/python)
frame #11: __libc_start_main + 0xf3 (0x7fd1dfdd6083 in /lib/x86_64-linux-gnu/libc.so.6)
frame #12: <unknown function> + 0x1e6995 (0x564eafb39995 in /fsx/users/andgu/conda/envs/pytorch/bin/python)
SIGABRT(6), PID: 2570285, Thread 2570365: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570366: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570367: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570368: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570369: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570370: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570371: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570372: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570373: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570374: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570375: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570376: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570377: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570378: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570379: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570380: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570381: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570382: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570383: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570384: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570385: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570386: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570387: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570388: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570389: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570390: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570391: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570392: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570393: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570394: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570395: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570396: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570397: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570398: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570399: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570400: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570401: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570402: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570403: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570404: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570405: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570406: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570407: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570408: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570409: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570410: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570411: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570412: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570413: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570414: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570415: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570416: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570417: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570418: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570419: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570420: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570421: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570422: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570423: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570424: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570425: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570426: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570427: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x339deb (0x7fd183465deb in /data/home/andgu/.local/lib/python3.8/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570816: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: __poll + 0x4f (0x7fd1dfec499f in /lib/x86_64-linux-gnu/libc.so.6)
frame #3: <unknown function> + 0x292ec9 (0x7fd1af24dec9 in /lib/x86_64-linux-gnu/libcuda.so)
frame #4: <unknown function> + 0x34d9ab (0x7fd1af3089ab in /lib/x86_64-linux-gnu/libcuda.so)
frame #5: <unknown function> + 0x2957f8 (0x7fd1af2507f8 in /lib/x86_64-linux-gnu/libcuda.so)
frame #6: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #7: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570825: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: c10::FatalSignalHandler::fatalSignalHandler(int) + 0x152 (0x7fd198b76a62 in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #2: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: gsignal + 0xcb (0x7fd1dfdf500b in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: abort + 0x12b (0x7fd1dfdd4859 in /lib/x86_64-linux-gnu/libc.so.6)
frame #5: __gnu_cxx::__verbose_terminate_handler() + 0xbc (0x7fd1b0dbd84a in /fsx/users/andgu/conda/envs/pytorch/bin/../lib/libstdc++.so.6)
frame #6: <unknown function> + 0xabf47 (0x7fd1b0dbbf47 in /fsx/users/andgu/conda/envs/pytorch/bin/../lib/libstdc++.so.6)
frame #7: <unknown function> + 0xab3a5 (0x7fd1b0dbb3a5 in /fsx/users/andgu/conda/envs/pytorch/bin/../lib/libstdc++.so.6)
frame #8: __gxx_personality_v0 + 0x348 (0x7fd1b0dbbbd8 in /fsx/users/andgu/conda/envs/pytorch/bin/../lib/libstdc++.so.6)
frame #9: <unknown function> + 0x10bef (0x7fd1b0e94bef in /lib/x86_64-linux-gnu/libgcc_s.so.1)
frame #10: _Unwind_Resume + 0x12a (0x7fd1b0e955aa in /lib/x86_64-linux-gnu/libgcc_s.so.1)
frame #11: <unknown function> + 0x13b3d (0x7fd198bccb3d in /fsx/users/andgu/work/pytorch/torch/lib/libc10_cuda.so)
frame #12: c10::cuda::SetDevice(int) + 0x51 (0x7fd198c05af1 in /fsx/users/andgu/work/pytorch/torch/lib/libc10_cuda.so)
frame #13: std::_Sp_counted_ptr_inplace<std::vector<at::cuda::CUDAEvent, std::allocator<at::cuda::CUDAEvent> >, std::allocator<std::vector<at::cuda::CUDAEvent, std::allocator<at::cuda::CUDAEvent> > >, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0xba (0x7fd199aa8e7a in /fsx/users/andgu/work/pytorch/torch/lib/libtorch_cuda.so)
frame #14: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x48 (0x7fd1ae31c258 in /fsx/users/andgu/work/pytorch/torch/lib/libtorch_python.so)
frame #15: c10d::ProcessGroupNCCL::WorkNCCL::~WorkNCCL() + 0x135 (0x7fd199a80315 in /fsx/users/andgu/work/pytorch/torch/lib/libtorch_cuda.so)
frame #16: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x24a (0x7fd199a8ca7a in /fsx/users/andgu/work/pytorch/torch/lib/libtorch_cuda.so)
frame #17: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x8c (0x7fd199a8cccc in /fsx/users/andgu/work/pytorch/torch/lib/libtorch_cuda.so)
frame #18: <unknown function> + 0xc819d (0x7fd1b0dd819d in /fsx/users/andgu/conda/envs/pytorch/bin/../lib/libstdc++.so.6)
frame #19: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #20: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570836: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: __poll + 0x4f (0x7fd1dfec499f in /lib/x86_64-linux-gnu/libc.so.6)
frame #3: <unknown function> + 0x292ec9 (0x7fd1af24dec9 in /lib/x86_64-linux-gnu/libcuda.so)
frame #4: <unknown function> + 0x34d9ab (0x7fd1af3089ab in /lib/x86_64-linux-gnu/libcuda.so)
frame #5: <unknown function> + 0x2957f8 (0x7fd1af2507f8 in /lib/x86_64-linux-gnu/libcuda.so)
frame #6: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #7: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570903: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: __poll + 0x4f (0x7fd1dfec499f in /lib/x86_64-linux-gnu/libc.so.6)
frame #3: <unknown function> + 0x308e343 (0x7fd19bcb8343 in /fsx/users/andgu/work/pytorch/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570910: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_wait + 0x216 (0x7fd1dffb3376 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x308d950 (0x7fd19bcb7950 in /fsx/users/andgu/work/pytorch/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570940: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_timedwait + 0x271 (0x7fd1dffb37d1 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x13839a (0x564eafa8b39a in /fsx/users/andgu/conda/envs/pytorch/bin/python)
frame #4: PyEval_RestoreThread + 0x2b (0x564eafa8b41b in /fsx/users/andgu/conda/envs/pytorch/bin/python)
frame #5: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0xe5 (0x7fd1ae6be885 in /fsx/users/andgu/work/pytorch/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0xc819d (0x7fd1b0dd819d in /fsx/users/andgu/conda/envs/pytorch/bin/../lib/libstdc++.so.6)
frame #7: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #8: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570944: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_timedwait + 0x271 (0x7fd1dffb37d1 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x13839a (0x564eafa8b39a in /fsx/users/andgu/conda/envs/pytorch/bin/python)
frame #4: PyEval_RestoreThread + 0x2b (0x564eafa8b41b in /fsx/users/andgu/conda/envs/pytorch/bin/python)
frame #5: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0xe5 (0x7fd1ae6be885 in /fsx/users/andgu/work/pytorch/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0xc819d (0x7fd1b0dd819d in /fsx/users/andgu/conda/envs/pytorch/bin/../lib/libstdc++.so.6)
frame #7: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #8: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570950: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_timedwait + 0x271 (0x7fd1dffb37d1 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x13839a (0x564eafa8b39a in /fsx/users/andgu/conda/envs/pytorch/bin/python)
frame #4: PyEval_RestoreThread + 0x2b (0x564eafa8b41b in /fsx/users/andgu/conda/envs/pytorch/bin/python)
frame #5: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0xe5 (0x7fd1ae6be885 in /fsx/users/andgu/work/pytorch/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0xc819d (0x7fd1b0dd819d in /fsx/users/andgu/conda/envs/pytorch/bin/../lib/libstdc++.so.6)
frame #7: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #8: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570955: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_timedwait + 0x271 (0x7fd1dffb37d1 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x13839a (0x564eafa8b39a in /fsx/users/andgu/conda/envs/pytorch/bin/python)
frame #4: PyEval_RestoreThread + 0x2b (0x564eafa8b41b in /fsx/users/andgu/conda/envs/pytorch/bin/python)
frame #5: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0xe5 (0x7fd1ae6be885 in /fsx/users/andgu/work/pytorch/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0xc819d (0x7fd1b0dd819d in /fsx/users/andgu/conda/envs/pytorch/bin/../lib/libstdc++.so.6)
frame #7: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #8: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570962: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_timedwait + 0x271 (0x7fd1dffb37d1 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x13839a (0x564eafa8b39a in /fsx/users/andgu/conda/envs/pytorch/bin/python)
frame #4: PyEval_RestoreThread + 0x2b (0x564eafa8b41b in /fsx/users/andgu/conda/envs/pytorch/bin/python)
frame #5: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0xe5 (0x7fd1ae6be885 in /fsx/users/andgu/work/pytorch/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0xc819d (0x7fd1b0dd819d in /fsx/users/andgu/conda/envs/pytorch/bin/../lib/libstdc++.so.6)
frame #7: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #8: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570971: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_timedwait + 0x271 (0x7fd1dffb37d1 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x13839a (0x564eafa8b39a in /fsx/users/andgu/conda/envs/pytorch/bin/python)
frame #4: PyEval_RestoreThread + 0x2b (0x564eafa8b41b in /fsx/users/andgu/conda/envs/pytorch/bin/python)
frame #5: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0xe5 (0x7fd1ae6be885 in /fsx/users/andgu/work/pytorch/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0xc819d (0x7fd1b0dd819d in /fsx/users/andgu/conda/envs/pytorch/bin/../lib/libstdc++.so.6)
frame #7: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #8: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570978: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_timedwait + 0x271 (0x7fd1dffb37d1 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x13839a (0x564eafa8b39a in /fsx/users/andgu/conda/envs/pytorch/bin/python)
frame #4: PyEval_RestoreThread + 0x2b (0x564eafa8b41b in /fsx/users/andgu/conda/envs/pytorch/bin/python)
frame #5: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0xe5 (0x7fd1ae6be885 in /fsx/users/andgu/work/pytorch/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0xc819d (0x7fd1b0dd819d in /fsx/users/andgu/conda/envs/pytorch/bin/../lib/libstdc++.so.6)
frame #7: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #8: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
SIGABRT(6), PID: 2570285, Thread 2570985: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8b (0x7fd198b7651b in /fsx/users/andgu/work/pytorch/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14420 (0x7fd1dffb8420 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: pthread_cond_timedwait + 0x271 (0x7fd1dffb37d1 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x13839a (0x564eafa8b39a in /fsx/users/andgu/conda/envs/pytorch/bin/python)
frame #4: PyEval_RestoreThread + 0x2b (0x564eafa8b41b in /fsx/users/andgu/conda/envs/pytorch/bin/python)
frame #5: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0xe5 (0x7fd1ae6be885 in /fsx/users/andgu/work/pytorch/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0xc819d (0x7fd1b0dd819d in /fsx/users/andgu/conda/envs/pytorch/bin/../lib/libstdc++.so.6)
frame #7: <unknown function> + 0x8609 (0x7fd1dffac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #8: clone + 0x43 (0x7fd1dfed1133 in /lib/x86_64-linux-gnu/libc.so.6)
Fdist init r=3, world=8
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 3
dist init r=2, world=8
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 2
dist init r=1, world=8
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 1
dist init r=4, world=8
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 4
dist init r=0, world=8
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
dist init r=6, world=8
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 6
dist init r=5, world=8
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 5
dist init r=7, world=8
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 7
INFO:torch.distributed.distributed_c10d:Rank 7: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
INFO:torch.distributed.distributed_c10d:Rank 5: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
INFO:torch.distributed.distributed_c10d:Rank 6: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
INFO:torch.distributed.distributed_c10d:Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
INFO:torch.distributed.distributed_c10d:Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
INFO:torch.distributed.distributed_c10d:Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
INFO:torch.distributed.distributed_c10d:Rank 4: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 3
INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 6
INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 2
INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 0
INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 5
INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 1
INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 7
INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 4
.                                                                                                 [100%]
======================================================================= FAILURES =======================================================================
______________________________________________ TestNoGrad.test_transformer_no_grad_mixed_precision_False _______________________________________________
Traceback (most recent call last):
  File "/fsx/users/andgu/work/pytorch/torch/testing/_internal/common_distributed.py", line 541, in wrapper
    self._join_processes(fn)
  File "/fsx/users/andgu/work/pytorch/torch/testing/_internal/common_distributed.py", line 760, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/fsx/users/andgu/work/pytorch/torch/testing/_internal/common_distributed.py", line 815, in _check_return_codes
    self.assertEqual(
  File "/fsx/users/andgu/work/pytorch/torch/testing/_internal/common_utils.py", line 3031, in assertEqual
    raise error_metas[0].to_error(
AssertionError: Scalars are not equal!
Expected 0 but got -6.
Absolute difference: 6
Relative difference: inf
Expect process 1 exit code to match Process 0 exit code of 0, but got -6
=============================================================== short test summary info ================================================================
FAILED [11.3519s] test/distributed/fsdp/test_fsdp_core.py::TestNoGrad::test_transformer_no_grad_mixed_precision_False - AssertionError: Scalars are n...
================================================ 1 failed, 7 passed, 52 deselected in 119.87s (0:01:59) ================================================
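
A note for reading the output above: the `-6` in the assertion is consistent with the `SIGABRT(6)` lines in the stack traces, since Python's process APIs report a child killed by signal N as exit code `-N`. A minimal sketch of that decoding follows; the helper name `describe_exit_code` is hypothetical and not part of PyTorch or its test harness.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Describe a child-process exit code (hypothetical helper).

    Python's subprocess and multiprocessing APIs report a child that was
    terminated by signal N as a negative exit code -N; non-negative values
    are ordinary exit statuses.
    """
    if code < 0:
        try:
            return f"killed by {signal.Signals(-code).name}"
        except ValueError:
            # The number does not map to a signal known on this platform.
            return f"killed by unknown signal {-code}"
    return f"exited with status {code}"


# Exit code reported for process 1 in the test output above.
print(describe_exit_code(-6))
```

This only explains the exit code; it says nothing about why the rank aborted.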

I wonder if there could be any conflict with ProcessGroupNCCL.

Perhaps one remediation is to revert this PR for now? cc: @ezyang @ngimel

@ngimel
Collaborator

ngimel commented Apr 14, 2023

@pytorchbot revert

@pytorch-bot

pytorch-bot bot commented Apr 14, 2023

❌ 🤖 pytorchbot command failed:

@pytorchbot revert: error: the following arguments are required: -m/--message, -c/--classification

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst}

Try @pytorchbot --help for more info.

@ngimel
Collaborator

ngimel commented Apr 14, 2023

@pytorchbot revert -m "causes flaky fsdp failures" -c weird

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Collaborator

Reverting PR 94864 failed

Reason: Command git -C /home/runner/work/pytorch/pytorch revert --no-edit 69eef5a4bec822943d08322da728db8f6787d4fa returned non-zero exit code 1

Auto-merging .lintrunner.toml
CONFLICT (content): Merge conflict in .lintrunner.toml
Auto-merging aten/src/ATen/native/cudnn/Conv_v8.cpp
Auto-merging c10/cuda/CUDACachingAllocator.cpp
Auto-merging torch/csrc/cuda/Module.cpp
error: could not revert 69eef5a4bec... [CUDA12] set_device change (#94864)
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git revert --continue".
hint: You can instead skip this commit with "git revert --skip".
hint: To abort and get back to the state before "git revert",
hint: run "git revert --abort".

ngimel pushed a commit that referenced this pull request Apr 14, 2023
@ngimel
Collaborator

ngimel commented Apr 14, 2023

#99162. @awgu, is it possible to reproduce your failures? We would need to work with you to reland this PR.

Labels
ciflow/trunk (Trigger trunk jobs on your pull request)
Merged
merging
module: bc-breaking (Related to a BC-breaking change)
open source
release notes: jit (release notes category)
Reverted
topic: bc breaking (topic category)
triaged (This issue has been looked at a team member, and triaged and prioritized into an appropriate module)