I would like to use record_function to mark the GPU-side execution time of both the forward and backward passes during training. In the collected profiling results, however, only the forward-pass range is annotated on the GPU stream; the backward-pass range shows up only on the CPU side of the profile. How should I apply record_function so that the backward-pass range is also associated with the GPU stream?
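
For concreteness, here is a minimal sketch of the pattern described above. It assumes record_function comes from PyTorch's torch.profiler. The model, tensor shapes, and the range names "forward" and "backward" are placeholders for illustration, not the actual training code.

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

# Placeholder model and data (assumption: the real training setup is not shown).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(64, 8).to(device)
inputs = torch.randn(32, 64, device=device)
targets = torch.randn(32, 8, device=device)

# Request CUDA activity only when a GPU is present, so the sketch also runs on CPU.
activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    # Range wrapped around the forward pass.
    with record_function("forward"):
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
    # Range wrapped around the backward pass, in the same way.
    with record_function("backward"):
        loss.backward()

# Collect the names of all recorded events, including the two user-defined ranges.
event_names = {evt.key for evt in prof.key_averages()}
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```

Calling prof.export_chrome_trace("trace.json") afterwards (the file name is arbitrary) writes a timeline in which the placement of the two ranges can be compared.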
