Infinite loop when tuning the TMA persistent kernel

I'm running the `matmul_kernel_tma_persistent` kernel with various configs, but it often gets stuck in an infinite loop with the GPU at 100% utilization.

Is this a bug?