enable channels last for reflection padding on CPU #102518
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/102518
✅ No failures as of commit 8227be8. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Add channels last support for reflection padding on CPU. The following test cases will pass with this patch:

```
python test_modules.py TestModuleCPU.test_memory_format_nn_ReflectionPad2d_cpu_float32
python test_modules.py TestModuleCPU.test_memory_format_nn_ReflectionPad3d_cpu_float32
```

The following benchmark results were gathered on an Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, with 20 cores per socket.

### single core inference

```
(before)
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NHWC: 0.356 ms
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NHWC: 86.821 ms

(after)
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NHWC: 0.328 ms
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NHWC: 16.806 ms
```

### single socket inference

```
(before)
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NHWC: 0.142 ms
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NHWC: 7.367 ms

(after)
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NHWC: 0.027 ms
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NHWC: 3.181 ms
```

cc @jgong5 @XiaobingSuper @sanchitintel @ashokei @jingxu10
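The benchmark script itself is not included in the PR; below is a minimal sketch of how comparable numbers could be collected. The padding and tensor sizes follow the table above, while the warm-up and timing loop are assumptions, not the author's script.

```python
import time
import torch
import torch.nn as nn

def bench(pad, size, iters=100):
    # Channels last (NHWC) input, matching the benchmark table above.
    x = torch.randn(size).to(memory_format=torch.channels_last)
    m = nn.ReflectionPad2d(pad)
    with torch.no_grad():
        for _ in range(10):  # warm-up
            m(x)
        start = time.perf_counter()
        for _ in range(iters):
            m(x)
        ms = (time.perf_counter() - start) / iters * 1000
    print(f"ReflectionPad2d({pad}) size: {tuple(size)}, NHWC: {ms:.3f} ms")

# torch.set_num_threads(1) would approximate the single-core numbers.
bench((2, 2, 2, 2), (1, 3, 224, 224))
bench((2, 2, 2, 2), (128, 64, 56, 56))
```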
@@ -3028,14 +3028,20 @@ def module_inputs_torch_nn_ConstantPad3d(module_info, device, dtype, requires_gr
    module_inputs_func=module_inputs_torch_nn_ReflectionPad2d,
Can you double check that these test inputs cover good channels-last-specific test cases?
I'm worried that they might work well for generic strided inputs but not also exercise the channels last format.
Right now, all the memory format related tests have been restructured into `test_memory_format` in `test_modules.py`. It checks:

- output memory format (a given NHWC input will get an NHWC output)
- mixed memory format (some ops can have different memory formats for input and weight)
- correctness, comparing the outputs of contiguous and channels last inputs

So if I intentionally let the channels last kernel generate wrong output, it will fail test cases such as the following:
(pytorch-mingfei) [mingfeim@mlt-skx091 test]$ python test_modules.py TestModuleCPU.test_memory_format_nn_ReflectionPad2d_cpu_float32
F
======================================================================
FAIL: test_memory_format_nn_ReflectionPad2d_cpu_float32 (__main__.TestModuleCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/mingfeim/pytorch/pytorch/torch/testing/_internal/common_device_type.py", line 414, in instantiated_test
result = test(self, **param_kwargs)
File "/home/mingfeim/pytorch/pytorch/torch/testing/_internal/common_modules.py", line 118, in test_wrapper
return test(*args, **kwargs)
File "/home/mingfeim/pytorch/pytorch/torch/testing/_internal/common_cuda.py", line 162, in wrapped
return f(*args, **kwargs)
File "test_modules.py", line 695, in test_memory_format
self.assertEqual(outputs, desired_outputs, rtol=rtol, atol=atol)
File "/home/mingfeim/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 3100, in assertEqual
raise error_metas[0].to_error(
AssertionError: Tensor-likes are not close!
Mismatched elements: 1008 / 1296 (77.8%)
Greatest absolute difference: 16.82485580444336 at index (0, 0, 1, 1) (up to 1e-05 allowed)
Greatest relative difference: 143.67701721191406 at index (0, 1, 0, 3) (up to 1e-05 allowed)
----------------------------------------------------------------------
Ran 1 test in 0.013s
FAILED (failures=1)
(pytorch-mingfei) [mingfeim@mlt-skx091 test]$ python test_modules.py TestModuleCPU.test_memory_format_nn_ReflectionPad3d_cpu_float32
F
======================================================================
FAIL: test_memory_format_nn_ReflectionPad3d_cpu_float32 (__main__.TestModuleCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/mingfeim/pytorch/pytorch/torch/testing/_internal/common_device_type.py", line 414, in instantiated_test
result = test(self, **param_kwargs)
File "/home/mingfeim/pytorch/pytorch/torch/testing/_internal/common_modules.py", line 118, in test_wrapper
return test(*args, **kwargs)
File "/home/mingfeim/pytorch/pytorch/torch/testing/_internal/common_cuda.py", line 162, in wrapped
return f(*args, **kwargs)
File "test_modules.py", line 695, in test_memory_format
self.assertEqual(outputs, desired_outputs, rtol=rtol, atol=atol)
File "/home/mingfeim/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 3100, in assertEqual
raise error_metas[0].to_error(
AssertionError: Tensor-likes are not close!
Mismatched elements: 1650 / 1944 (84.9%)
Greatest absolute difference: 17.179780960083008 at index (0, 0, 5, 1, 3) (up to 1e-05 allowed)
Greatest relative difference: 3617.848876953125 at index (1, 1, 3, 5, 0) (up to 1e-05 allowed)
----------------------------------------------------------------------
Ran 1 test in 0.012s
FAILED (failures=1)
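For reference, the property these tests enforce can also be checked with a small standalone snippet. This is only a sketch of the idea, not the actual `test_memory_format` code, and it assumes this patch is applied:

```python
import torch
import torch.nn as nn

m = nn.ReflectionPad2d((2, 2, 2, 2))
x = torch.randn(2, 3, 8, 8)

out_ref = m(x)                                        # contiguous (NCHW) path
out_cl = m(x.to(memory_format=torch.channels_last))   # channels last (NHWC) path

# With this patch, the output memory format follows the input memory format...
assert out_cl.is_contiguous(memory_format=torch.channels_last)
# ...and the values must match the contiguous reference.
torch.testing.assert_close(out_cl, out_ref)
```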
@cpuhrsch could you please help review this one? Thanks!
@cpuhrsch could you please help review this one again? Thanks! The current test cases are sufficient to cover the correctness of the channels last memory format.
@@ -258,7 +258,12 @@ void reflection_pad2d_out_template(
  if (ndim == 3) {
    output.resize_({nplane, output_h, output_w});
  } else {
Why is it enough to only update this branch?
`resize_()` has a default argument of `memory_format=at::MemoryFormat::Contiguous`.

- For 3-dim inputs (NCW), there is no channels last format yet (the current rule is that channels last only applies to 4-dim and 5-dim tensors), so we don't have to explicitly pass a memory format argument for them. Even if the input is given as NWC, it is treated as a non-contiguous NCW tensor and produces an NCW output.
- For 4-dim inputs (NCHW) and 5-dim inputs (NCDHW), the `memory_format` argument should be given the same value as the input's memory format. That is the memory format propagation rule: an NCHW input gets an NCHW output; an NHWC input gets an NHWC output. (See the sketch below this list.)
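A small sketch from the Python side illustrating this propagation rule (assuming this patch is applied; the printed values are what the rule above implies):

```python
import torch
import torch.nn as nn

# 4-dim (NCHW/NHWC) input: the memory_format passed to resize_() follows the
# input, so a channels last (NHWC) input yields a channels last output.
x4 = torch.randn(2, 3, 8, 8).to(memory_format=torch.channels_last)
y4 = nn.ReflectionPad2d(2)(x4)
print(y4.is_contiguous(memory_format=torch.channels_last))  # True

# 3-dim (NCW) input: channels last is only defined for 4-dim and 5-dim
# tensors, so resize_() keeps its default contiguous memory format.
x3 = torch.randn(2, 3, 16)
y3 = nn.ReflectionPad1d(2)(x3)
print(y3.is_contiguous())  # True
```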
@CaoE - Do you plan to review this as well?
LGTM.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.