Update on "optimize segment_reduce forward and backward path on CPU" cc jgong5 @XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
Update on "unify reduction types from different operators: scatter, s… …catter_reduce, segment_reduce" The target of this PR is to unify `ReductionType` for reduce operators so that we have the same set of reduce utils for `init`, or `update` for vectorization. [ghstack-poisoned]
Update on "inductor: add conv+hardsigmoid fusion for cpu path" cc jgong5 mingfeima sanchitintel ashokei jingxu10 mlazos soumith voznesenskym yanboliang penguinwu anijain2305 EikanWang Guobing-Chen chunyuan-w zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire [ghstack-poisoned]
Update on "optimize sampled_addmm performance on CPU (SparseCSR)" ### Target and background This PR is improving the performance of `sampled_addmm` on CPU device. This is part of effort for improving PyG performance on CPU for GNN training/inference. The current implementation is a reference design which converts `SparseCSR` tensor back to dense tensor and then do the addmm and convert back to `SparseCSR` again: this is going to be very slow and won't be able to run most of the datasets under https://github.com/snap-stanford/ogb (convert to dense would trigger `OOM`). ### Benchmarks Right now we don't have any hands-on benchmark or workload to test this since this operator is not used in PyG yet. I fetched the dataset from `ogb-products` where: * #nodes: 2.4 * 10^6 * number of edges: 1.26 * 10^8 * number of features: 128 So if we store the **adjacency matrix** is dense, it is going to be 2.4 * 2.4 * 4 * 10^12 bytes, this will be OOB on current code. I abstract the first 1k rows to compare, **1100x** speedup: CPU: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, dual socket, 20 cores per socket. ``` ### before: run 1000 rows from the whole dataset sampled_addmm: running dataset ogb-products first 1000 rows: each iter takes 1212.000 ms! ### after: run 1000 rows from the whole dataset sampled_addmm: running dataset ogb-products first 1000 rows: each iter takes 1.102 ms! ### after: run the whole dataset sampled_addmm: running dataset ogb-products (the whole dataset) 2449029 rows: each iter takes 873.306 ms! ``` [ghstack-poisoned]
Update on "[Inductor] add missing ops for cpp vectorization overrides" For micro-benchmark, aten.elu.default and aten.elu_backward.default have poor performance with inductor compared to eager. The main reason is lack of the vectorization. With adding missing ops for cpp vectorization overrides, the vectorization could be successfully applied. Performance data for eager v.s. inductor: <html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40"> <head> <meta name=ProgId content=Excel.Sheet> <meta name=Generator content="Microsoft Excel 15"> <link id=Main-File rel=Main-File href="file:///C:/Users/xuanliao/AppData/Local/Temp/msohtmlclip1/01/clip.htm"> <link rel=File-List href="file:///C:/Users/xuanliao/AppData/Local/Temp/msohtmlclip1/01/clip_filelist.xml"> <!--table {mso-displayed-decimal-separator:"\."; mso-displayed-thousand-separator:"\,";} page {margin:.75in .7in .75in .7in; mso-header-margin:.3in; mso-footer-margin:.3in;} tr {mso-height-source:auto;} col {mso-width-source:auto;} br {mso-data-placement:same-cell;} td {padding-top:1px; padding-right:1px; padding-left:1px; mso-ignore:padding; color:black; font-size:11.0pt; font-weight:400; font-style:normal; text-decoration:none; font-family:Calibri, sans-serif; mso-font-charset:0; mso-number-format:General; text-align:general; vertical-align:bottom; border:none; mso-background-source:auto; mso-pattern:auto; mso-protection:locked visible; white-space:nowrap; mso-rotate:0;} .xl63 {mso-number-format:Percent;} .xl64 {color:gray;} --> </head> <body link="#0563C1" vlink="#954F72"> op | speedup_old | RSD (3) | speedup_new | RSD (3) | increased_performance -- | -- | -- | -- | -- | -- aten.elu.default | 0.205947276 | 1.73% | 0.995302802 | 4.76% | 383.28% aten.elu_backward.default | 0.336280639 | 0.58% | 1.69473642 | 1.96% | 403.96% </body> </html> The new supported ops for cpp vectorization overrides: - eq - ne - lt - gt - le - ge cc mlazos soumith voznesenskym yanboliang penguinwu anijain2305 EikanWang jgong5 Guobing-Chen chunyuan-w XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire [ghstack-poisoned]
Update on "add __hash__ to FunctionSchema" This PR adds __hash__ to FunctionSchema pybind binding, so that it could be used for things like dict indexing [ghstack-poisoned]
Update on "Implement hybrid compressed sparse to/from dense conversio… …ns." cc nikitaved pearu cpuhrsch amjames bhosmer gujinghui PenghuiCheng XiaobingSuper jianyuh jgong5 mingfeima sanchitintel ashokei jingxu10 min-jean-cho yanbing-j Guobing-Chen Xia-Weiwen [ghstack-poisoned]
Update on "optimize gather performance for gnn usage on CPU" On classic pyg user case for message passing, `gather` has `index` tensor in a broadcasted shape, e.g. with shape `5000, 128` and stride `[1, 0]`. That indicated gather is done on each row of the self tensor. The current implementation will try to parallel on the inner dimension which is bad performance for CPU and unable to be vectorized. This PR addressed this use case and optimize in a similar manner to index_select, parallel on outer dimension of `index` and do vectorized copy on inner dimension. Performance benchmarking on Xeon Icelake single socket on `GCN`: the `gather` reduced from `150.787ms` to `10.926ms`, after this optimization, `gather` will no longer be the major bottleneck for training of GNN models when `EdgeIndex` is in COO format. for more details, please refer to pyg-team/pytorch_geometric#4891 (comment) cc @VitalyFedyunin jgong5 @XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
Update on "optimize scatter_add performance for gnn usage on CPU" ### Motivation of this PR This PR is targeting at improving performance of `scatter_add` for GNN usage scenarios on PyG. Currently only CPU optimizations is covered. `Message Passing` is the major step in GNN learning which means exchanging/aggregating info between nodes. And from the perf point of view, if the `EdgeIndex` is stored as [2, num_edges], `scatter_reduce` would be a major perf hotspot on current pytorch implementations. To be more specific, in the process of message passing, `scatter_add` is used in a very similar way as `index_select`, except that the `self` tensor is written into while `index_select` is only reading. Therefore, the `index` tensor passed to `scatter_add` is an expanded tensor on dim0, which means all the rest of dims would end up with the same value. ### Algorithm Current impl on scatter would do parallel on the inner dims for such case which would cause bad perf: non-contiguous memory access pattern and non-vectorized. This PR did sorting on the `index` to solve the write conflicts if we directly parallel on dim0. The algorithm is equivalent to: * convert memory format from `COO` to `CSR` * do spmm reduce ### Perf improvement The benchmark comes from https://github.com/pyg-team/pytorch_geometric/tree/master/examples, `python reddit.py` which runs model SAGE on dataset reddit. CPU type: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz ` aten::scatter_add_` has been reduced from **37.797s** to **5.989s**: * breakdown before ``` ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ aten::scatter_add_ 49.00% 37.797s 49.00% 37.797s 41.445ms 912 aten::index_select 19.74% 15.223s 19.74% 15.227s 6.678ms 2280 aten::linear 0.01% 5.706ms 15.04% 11.602s 12.721ms 912 aten::addmm 6.62% 5.108s 7.92% 6.112s 13.403ms 456 aten::matmul 0.00% 2.339ms 7.10% 5.475s 12.006ms 456 ``` * breakdown after ``` ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ aten::index_select 32.41% 14.677s 32.42% 14.681s 6.439ms 2280 aten::linear 0.01% 6.665ms 26.43% 11.968s 13.123ms 912 aten::addmm 11.76% 5.328s 13.76% 6.232s 13.667ms 456 aten::scatter_add_ 13.22% 5.989s 13.22% 5.989s 6.566ms 912 aten::matmul 0.01% 2.303ms 12.63% 5.720s 12.543ms 456 ``` cc @VitalyFedyunin jgong5 @XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]