@davidberard98 davidberard98 commented Apr 19, 2023

Stack from ghstack:

Background: Prior to this PR, traces for PT2 w/ inductor don't contain connections between CUDA kernels and the CPU launch site. This PR adds those connections.

Details: Triton kernels launched by inductor use cuLaunchKernel instead of cudaLaunchKernel. cuLaunchKernel is part of the driver API, while cudaLaunchKernel is part of the runtime API. In order to support cuLaunchKernel, we added support in kineto (pytorch/kineto#752) to also listen to driver events, which is why we need to update the kineto submodule.

After the change in kineto, we just need to turn this on in the PyTorch repo by adding the CUDA_DRIVER activity type to the CPU and CUDA activity type lists.

Testing: Added test/inductor/test_profiler.py to check for cuLaunchKernel in json trace files.
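The test file itself isn't reproduced here, but the core check — scanning an exported chrome trace for a `cuLaunchKernel` event — can be sketched as follows. This is a minimal illustration, not the actual contents of test/inductor/test_profiler.py; it assumes the standard chrome trace JSON layout with a top-level `traceEvents` list whose entries carry a `name` field.

```python
import json

def trace_has_culaunch(trace_path: str) -> bool:
    # Chrome traces exported by the profiler are JSON objects with a
    # top-level "traceEvents" list; each event has a "name" field.
    with open(trace_path) as f:
        trace = json.load(f)
    return any(
        event.get("name") == "cuLaunchKernel"
        for event in trace.get("traceEvents", [])
    )
```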

Also, I ran this test:

```python
import torch

x = torch.rand((2, 2), device='cuda')

def fn(x):
    return x.relu()

fn_c = torch.compile(fn)
fn_c(x)

with torch.profiler.profile(with_stack=True) as prof:
    fn_c(x)

prof.export_chrome_trace("relu_profile.json")
```

which generated a chrome trace (screenshot omitted) in which you can see flows between a cuLaunchKernel on the CPU side and the Triton kernel on the GPU.
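In the chrome trace format, these flows are encoded as paired flow events: a start event (`"ph": "s"`) at the CPU launch site and a finish event (`"ph": "f"`) at the GPU kernel, sharing the same `id`. A rough sketch of verifying that such pairs exist in an exported trace, assuming that encoding, could look like this (the function name is illustrative, not from this PR):

```python
import json

def linked_flow_ids(trace_path: str):
    # Flow events come in start ("ph": "s") and finish ("ph": "f")
    # pairs that share an "id"; each launch->kernel arrow drawn by the
    # trace viewer corresponds to one such pair.
    with open(trace_path) as f:
        events = json.load(f).get("traceEvents", [])
    starts = {e["id"] for e in events if e.get("ph") == "s"}
    finishes = {e["id"] for e in events if e.get("ph") == "f"}
    return starts & finishes
```

An id that appears in both sets represents a completed CPU-to-GPU flow; an unmatched start would indicate a launch whose kernel event was not linked.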

Kineto Updates: To pick up the kineto-side changes required for CUPTI driver events, this PR updates the kineto submodule pin.

cc @robieta @chaekit @aaronenyeshi @ngimel @nbcsm @guotuofeng @guyang3532 @gaoteng-git @tiffzhaofb @dzhulgakov @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @desertfire


pytorch-bot bot commented Apr 19, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/99571

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 26d50ae:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

davidberard98 added a commit that referenced this pull request Apr 19, 2023
ghstack-source-id: 85fe6b4
Pull Request resolved: #99571
@davidberard98 added the `ciflow/trunk` (Trigger trunk jobs on your pull request), `release notes: profiler`, and `oncall: profiler` (profiler-related issues: cpu, gpu, kineto) labels Apr 19, 2023
@aaronenyeshi (Member) left a comment

LGTM! Thank you for adding the flow fix for cuLaunchKernel and updating the Kineto submodule!

@davidberard98

@pytorchbot merge

@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


@facebook-github-bot facebook-github-bot deleted the gh/davidberard98/184/head branch June 8, 2023 16:02
Labels: ciflow/trunk, Merged, module: inductor, oncall: profiler, release notes: profiler