Conversation


@mingfeima mingfeima commented May 30, 2023

Stack from ghstack (oldest at bottom):

Add channels last support for reflection padding on CPU. The following test cases will pass with this patch:
```
python test_modules.py TestModuleCPU.test_memory_format_nn_ReflectionPad2d_cpu_float32
python test_modules.py TestModuleCPU.test_memory_format_nn_ReflectionPad3d_cpu_float32
```

The following benchmark results were gathered on an Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz with 20 cores per socket.

### single core inference
```
(before)
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]), NHWC: 0.356 ms
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]), NHWC: 86.821 ms

(after)
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]), NHWC: 0.328 ms
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]), NHWC: 16.806 ms
```

### single socket inference
```
(before)
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]), NHWC: 0.142 ms
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]), NHWC: 7.367 ms

(after)
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]), NHWC: 0.027 ms
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]), NHWC: 3.181 ms
```
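For reference, the per-dimension index mapping that reflection padding performs can be sketched in plain Python (an illustrative model only, not the vectorized ATen kernel this PR adds):

```python
def reflect_index(o, pad, size):
    """Map an output index o (0 <= o < size + 2*pad) to the input index
    it reflects to, assuming pad < size as ReflectionPad2d requires."""
    i = o - pad
    if i < 0:
        i = -i                      # reflect across the left edge
    elif i >= size:
        i = 2 * (size - 1) - i      # reflect across the right edge
    return i

def reflection_pad_1d(row, pad):
    """Apply reflection padding along one dimension of a Python list."""
    size = len(row)
    return [row[reflect_index(o, pad, size)] for o in range(size + 2 * pad)]

print(reflection_pad_1d([0, 1, 2, 3], 2))  # [2, 1, 0, 1, 2, 3, 2, 1]
```

In a channels-last (NHWC) kernel this mapping is computed per spatial position, and the copy then vectorizes over the contiguous channel dimension, which is where the speedup above comes from.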

cc @jgong5 @XiaobingSuper @sanchitintel @ashokei @jingxu10


pytorch-bot bot commented May 30, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/102518

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 8227be8:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions github-actions bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label May 30, 2023
mingfeima added a commit that referenced this pull request May 30, 2023
ghstack-source-id: e66928b
Pull Request resolved: #102518
@mingfeima mingfeima marked this pull request as draft May 30, 2023 07:25
@mingfeima mingfeima added the topic: not user facing topic category label May 30, 2023
@mingfeima mingfeima added the ciflow/trunk Trigger trunk jobs on your pull request label May 31, 2023
@mingfeima mingfeima requested a review from cpuhrsch June 1, 2023 01:36
@mingfeima mingfeima marked this pull request as ready for review June 1, 2023 01:37
mingfeima added 3 commits June 6, 2023 09:57
@@ -3028,14 +3028,20 @@ def module_inputs_torch_nn_ConstantPad3d(module_info, device, dtype, requires_gr
module_inputs_func=module_inputs_torch_nn_ReflectionPad2d,
Contributor
Can you double check that these test inputs cover good channels-last-specific test cases?

I'm worried that they might work well for generic strided inputs but might not also exercise the channels last format.

Collaborator Author
Right now, all the memory-format-related tests have been restructured into test_memory_format in test_modules.py, which checks:

  • output memory format (an NHWC input produces an NHWC output)
  • mixed memory formats (some ops can have different memory formats for input and weight)
  • correctness, comparing the outputs for contiguous and channels last inputs
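The output memory format check in the first bullet comes down to inspecting strides; here is a pure-Python sketch of the NCHW vs. NHWC stride layouts (illustrative only, not the actual test_memory_format code):

```python
def nchw_strides(n, c, h, w):
    # Contiguous (NCHW) layout: W is the fastest-moving dimension.
    return (c * h * w, h * w, w, 1)

def nhwc_strides(n, c, h, w):
    # Channels last (NHWC) layout: C is the fastest-moving dimension,
    # while the logical shape stays (N, C, H, W).
    return (h * w * c, 1, w * c, c)

def is_channels_last(shape, strides):
    """Check whether a (shape, strides) pair is in channels last layout."""
    n, c, h, w = shape
    return strides == nhwc_strides(n, c, h, w)

shape = (1, 3, 224, 224)
print(is_channels_last(shape, nhwc_strides(*shape)))  # True
print(is_channels_last(shape, nchw_strides(*shape)))  # False
```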

So if I intentionally make the channels last kernel generate wrong output, it fails test cases such as the following:

```
(pytorch-mingfei) [mingfeim@mlt-skx091 test]$ python test_modules.py TestModuleCPU.test_memory_format_nn_ReflectionPad2d_cpu_float32
F
======================================================================
FAIL: test_memory_format_nn_ReflectionPad2d_cpu_float32 (__main__.TestModuleCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/mingfeim/pytorch/pytorch/torch/testing/_internal/common_device_type.py", line 414, in instantiated_test
    result = test(self, **param_kwargs)
  File "/home/mingfeim/pytorch/pytorch/torch/testing/_internal/common_modules.py", line 118, in test_wrapper
    return test(*args, **kwargs)
  File "/home/mingfeim/pytorch/pytorch/torch/testing/_internal/common_cuda.py", line 162, in wrapped
    return f(*args, **kwargs)
  File "test_modules.py", line 695, in test_memory_format
    self.assertEqual(outputs, desired_outputs, rtol=rtol, atol=atol)
  File "/home/mingfeim/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 3100, in assertEqual
    raise error_metas[0].to_error(
AssertionError: Tensor-likes are not close!

Mismatched elements: 1008 / 1296 (77.8%)
Greatest absolute difference: 16.82485580444336 at index (0, 0, 1, 1) (up to 1e-05 allowed)
Greatest relative difference: 143.67701721191406 at index (0, 1, 0, 3) (up to 1e-05 allowed)

----------------------------------------------------------------------
Ran 1 test in 0.013s

FAILED (failures=1)
(pytorch-mingfei) [mingfeim@mlt-skx091 test]$ python test_modules.py TestModuleCPU.test_memory_format_nn_ReflectionPad3d_cpu_float32
F
======================================================================
FAIL: test_memory_format_nn_ReflectionPad3d_cpu_float32 (__main__.TestModuleCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/mingfeim/pytorch/pytorch/torch/testing/_internal/common_device_type.py", line 414, in instantiated_test
    result = test(self, **param_kwargs)
  File "/home/mingfeim/pytorch/pytorch/torch/testing/_internal/common_modules.py", line 118, in test_wrapper
    return test(*args, **kwargs)
  File "/home/mingfeim/pytorch/pytorch/torch/testing/_internal/common_cuda.py", line 162, in wrapped
    return f(*args, **kwargs)
  File "test_modules.py", line 695, in test_memory_format
    self.assertEqual(outputs, desired_outputs, rtol=rtol, atol=atol)
  File "/home/mingfeim/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 3100, in assertEqual
    raise error_metas[0].to_error(
AssertionError: Tensor-likes are not close!

Mismatched elements: 1650 / 1944 (84.9%)
Greatest absolute difference: 17.179780960083008 at index (0, 0, 5, 1, 3) (up to 1e-05 allowed)
Greatest relative difference: 3617.848876953125 at index (1, 1, 3, 5, 0) (up to 1e-05 allowed)

----------------------------------------------------------------------
Ran 1 test in 0.012s

FAILED (failures=1)
```

Collaborator Author
@cpuhrsch could you please help review this one? Thanks!

@mingfeima mingfeima requested a review from cpuhrsch July 3, 2023 02:02
@mingfeima
Collaborator Author

@cpuhrsch could you please help review this one again? Thanks! The current test cases are sufficient to cover correctness for the channels last memory format.

@@ -258,7 +258,12 @@ void reflection_pad2d_out_template(
if (ndim == 3) {
output.resize_({nplane, output_h, output_w});
} else {
Contributor
Why is it enough to only update this branch?

Collaborator Author
resize_() has a default argument of memory_format=at::MemoryFormat::Contiguous.

  • For 3-dim inputs (NCW), there is no channels last format yet (the current rule is that channels last only applies to 4-dim and 5-dim tensors), so we don't have to explicitly pass a memory_format argument. Even if the input is given as NWC, it is treated as a non-contiguous NCW and produces an NCW output.
  • For 4-dim inputs (NCHW) and 5-dim inputs (NCDHW), the memory_format argument should be given the same value as the input's memory format, per the memory format propagation rule: an NCHW input gets an NCHW output, and an NHWC input gets an NHWC output.
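That propagation rule can be condensed into a tiny, hypothetical helper (plain Python, purely illustrative of the dispatch logic, not the ATen code):

```python
def output_memory_format(ndim, input_is_channels_last):
    """Pick the memory format the padded output should be allocated with.

    Channels last currently only applies to 4-dim (NHWC) and 5-dim
    (NDHWC) tensors; 3-dim (NCW) inputs always get a contiguous output.
    """
    if ndim in (4, 5) and input_is_channels_last:
        return "channels_last"
    return "contiguous"

# NWC input is treated as non-contiguous NCW -> contiguous NCW output.
assert output_memory_format(3, True) == "contiguous"
# NHWC in, NHWC out; NCHW in, NCHW out.
assert output_memory_format(4, True) == "channels_last"
assert output_memory_format(4, False) == "contiguous"
```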

@cpuhrsch
Contributor

@CaoE - Do you plan to review this as well?

@mingfeima mingfeima requested review from cpuhrsch and CaoE July 13, 2023 01:14
@CaoE
Collaborator

CaoE commented Jul 13, 2023

LGTM.

@cpuhrsch
Contributor

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@facebook-github-bot facebook-github-bot deleted the gh/mingfeima/121/head branch July 17, 2023 14:17
@cpuhrsch cpuhrsch added release notes: nn release notes category and removed topic: not user facing topic category labels Jul 17, 2023