Update on "optimize segment_reduce forward and backward path on CPU" cc jgong5 @XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
Update on "unify reduction types from different operators: scatter, s… …catter_reduce, segment_reduce" The target of this PR is to unify `ReductionType` for reduce operators so that we have the same set of reduce utils for `init`, or `update` for vectorization. [ghstack-poisoned]
Update on "inductor: add conv+hardsigmoid fusion for cpu path" cc jgong5 mingfeima sanchitintel ashokei jingxu10 mlazos soumith voznesenskym yanboliang penguinwu anijain2305 EikanWang Guobing-Chen chunyuan-w zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire [ghstack-poisoned]
Update on "optimize sampled_addmm performance on CPU (SparseCSR)" ### Target and background This PR is improving the performance of `sampled_addmm` on CPU device. This is part of effort for improving PyG performance on CPU for GNN training/inference. The current implementation is a reference design which converts `SparseCSR` tensor back to dense tensor and then do the addmm and convert back to `SparseCSR` again: this is going to be very slow and won't be able to run most of the datasets under https://github.com/snap-stanford/ogb (convert to dense would trigger `OOM`). ### Benchmarks Right now we don't have any hands-on benchmark or workload to test this since this operator is not used in PyG yet. I fetched the dataset from `ogb-products` where: * #nodes: 2.4 * 10^6 * number of edges: 1.26 * 10^8 * number of features: 128 So if we store the **adjacency matrix** is dense, it is going to be 2.4 * 2.4 * 4 * 10^12 bytes, this will be OOB on current code. I abstract the first 1k rows to compare, **1100x** speedup: CPU: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, dual socket, 20 cores per socket. ``` ### before: run 1000 rows from the whole dataset sampled_addmm: running dataset ogb-products first 1000 rows: each iter takes 1212.000 ms! ### after: run 1000 rows from the whole dataset sampled_addmm: running dataset ogb-products first 1000 rows: each iter takes 1.102 ms! ### after: run the whole dataset sampled_addmm: running dataset ogb-products (the whole dataset) 2449029 rows: each iter takes 873.306 ms! ``` [ghstack-poisoned]
Update on "[Inductor] add missing ops for cpp vectorization overrides" For micro-benchmark, aten.elu.default and aten.elu_backward.default have poor performance with inductor compared to eager. The main reason is lack of the vectorization. With adding missing ops for cpp vectorization overrides, the vectorization could be successfully applied. Performance data for eager v.s. inductor: <html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40"> <head> <meta name=ProgId content=Excel.Sheet> <meta name=Generator content="Microsoft Excel 15"> <link id=Main-File rel=Main-File href="file:///C:/Users/xuanliao/AppData/Local/Temp/msohtmlclip1/01/clip.htm"> <link rel=File-List href="file:///C:/Users/xuanliao/AppData/Local/Temp/msohtmlclip1/01/clip_filelist.xml"> <!--table {mso-displayed-decimal-separator:"\."; mso-displayed-thousand-separator:"\,";} page {margin:.75in .7in .75in .7in; mso-header-margin:.3in; mso-footer-margin:.3in;} tr {mso-height-source:auto;} col {mso-width-source:auto;} br {mso-data-placement:same-cell;} td {padding-top:1px; padding-right:1px; padding-left:1px; mso-ignore:padding; color:black; font-size:11.0pt; font-weight:400; font-style:normal; text-decoration:none; font-family:Calibri, sans-serif; mso-font-charset:0; mso-number-format:General; text-align:general; vertical-align:bottom; border:none; mso-background-source:auto; mso-pattern:auto; mso-protection:locked visible; white-space:nowrap; mso-rotate:0;} .xl63 {mso-number-format:Percent;} .xl64 {color:gray;} --> </head> <body link="#0563C1" vlink="#954F72"> op | speedup_old | RSD (3) | speedup_new | RSD (3) | increased_performance -- | -- | -- | -- | -- | -- aten.elu.default | 0.205947276 | 1.73% | 0.995302802 | 4.76% | 383.28% aten.elu_backward.default | 0.336280639 | 0.58% | 1.69473642 | 1.96% | 403.96% </body> </html> The new supported ops for cpp vectorization overrides: - eq - ne - lt - gt - le - ge cc mlazos soumith voznesenskym yanboliang penguinwu anijain2305 EikanWang jgong5 Guobing-Chen chunyuan-w XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire [ghstack-poisoned]
Update on "add __hash__ to FunctionSchema" This PR adds __hash__ to FunctionSchema pybind binding, so that it could be used for things like dict indexing [ghstack-poisoned]
Update on "Implement hybrid compressed sparse to/from dense conversio… …ns." cc nikitaved pearu cpuhrsch amjames bhosmer gujinghui PenghuiCheng XiaobingSuper jianyuh jgong5 mingfeima sanchitintel ashokei jingxu10 min-jean-cho yanbing-j Guobing-Chen Xia-Weiwen [ghstack-poisoned]
Update on "optimize gather performance for gnn usage on CPU" On classic pyg user case for message passing, `gather` has `index` tensor in a broadcasted shape, e.g. with shape `5000, 128` and stride `[1, 0]`. That indicated gather is done on each row of the self tensor. The current implementation will try to parallel on the inner dimension which is bad performance for CPU and unable to be vectorized. This PR addressed this use case and optimize in a similar manner to index_select, parallel on outer dimension of `index` and do vectorized copy on inner dimension. Performance benchmarking on Xeon Icelake single socket on `GCN`: the `gather` reduced from `150.787ms` to `10.926ms`, after this optimization, `gather` will no longer be the major bottleneck for training of GNN models when `EdgeIndex` is in COO format. for more details, please refer to pyg-team/pytorch_geometric#4891 (comment) cc @VitalyFedyunin jgong5 @XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
Update on "optimize scatter_add performance for gnn usage on CPU" ### Motivation of this PR This PR is targeting at improving performance of `scatter_add` for GNN usage scenarios on PyG. Currently only CPU optimizations is covered. `Message Passing` is the major step in GNN learning which means exchanging/aggregating info between nodes. And from the perf point of view, if the `EdgeIndex` is stored as [2, num_edges], `scatter_reduce` would be a major perf hotspot on current pytorch implementations. To be more specific, in the process of message passing, `scatter_add` is used in a very similar way as `index_select`, except that the `self` tensor is written into while `index_select` is only reading. Therefore, the `index` tensor passed to `scatter_add` is an expanded tensor on dim0, which means all the rest of dims would end up with the same value. ### Algorithm Current impl on scatter would do parallel on the inner dims for such case which would cause bad perf: non-contiguous memory access pattern and non-vectorized. This PR did sorting on the `index` to solve the write conflicts if we directly parallel on dim0. The algorithm is equivalent to: * convert memory format from `COO` to `CSR` * do spmm reduce ### Perf improvement The benchmark comes from https://github.com/pyg-team/pytorch_geometric/tree/master/examples, `python reddit.py` which runs model SAGE on dataset reddit. CPU type: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz ` aten::scatter_add_` has been reduced from **37.797s** to **5.989s**: * breakdown before ``` ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ aten::scatter_add_ 49.00% 37.797s 49.00% 37.797s 41.445ms 912 aten::index_select 19.74% 15.223s 19.74% 15.227s 6.678ms 2280 aten::linear 0.01% 5.706ms 15.04% 11.602s 12.721ms 912 aten::addmm 6.62% 5.108s 7.92% 6.112s 13.403ms 456 aten::matmul 0.00% 2.339ms 7.10% 5.475s 12.006ms 456 ``` * breakdown after ``` ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ aten::index_select 32.41% 14.677s 32.42% 14.681s 6.439ms 2280 aten::linear 0.01% 6.665ms 26.43% 11.968s 13.123ms 912 aten::addmm 11.76% 5.328s 13.76% 6.232s 13.667ms 456 aten::scatter_add_ 13.22% 5.989s 13.22% 5.989s 6.566ms 912 aten::matmul 0.01% 2.303ms 12.63% 5.720s 12.543ms 456 ``` cc @VitalyFedyunin jgong5 @XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]