A tracing infrastructure for heterogeneous computing applications. We currently have backends for OpenCL, CUDA, and Level Zero (L0).
The build system is a classic autotools-based system.
As an alternative, one can use spack to install THAPI.
The THAPI package is not yet in upstream spack; in the meantime, please follow https://github.com/argonne-lcf/THAPI-spack.
Packages:
- babeltrace2, libbabeltrace2-dev
- liblttng-ust-dev
- lttng-tools
- ruby, ruby-dev
- libffi, libffi-dev
babeltrace2 should be patched before installation, see: https://github.com/Kerilk/spack/tree/develop/var/spack/repos/builtin/packages/babeltrace2
Optional packages:
- binutils-dev or libiberty-dev for demangling, depending on the platform (demangle.h)
Ruby Gems:
- cast-to-yaml
- nokogiri
- babeltrace2
Optional Gem:
- opencl_ruby_ffi
The tracer can be heavily tuned and each event can be monitored independently from others, but for convenience a series of default presets are defined in the tracer_opencl.sh script:
tracer_opencl.sh [options] [--] <application> <application-arguments>
  --help                        Show this screen
  --version                     Print the version string
  -l, --lightweight             Filter out some high traffic functions
  -p, --profiling               Enable profiling
  -s, --source                  Dump program sources to disk
  -a, --arguments               Dump argument and kernel infos
  -b, --build                   Dump program build infos
  -h, --host-profile            Gather precise host profiling information
  -d, --dump                    Dump kernels input and output to disk
  -i, --iteration VALUE         Dump inputs and outputs for kernel with enqueue counter VALUE
  -s, --iteration-start VALUE   Dump inputs and outputs for kernels starting with enqueue counter VALUE
  -e, --iteration-end VALUE     Dump inputs and outputs for kernels until enqueue counter VALUE
  -v, --visualize               Visualize trace on the fly
  --devices                     Dump devices information
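The three iteration options above select which kernel enqueues get their inputs and outputs dumped. The following is a minimal sketch of one plausible reading, not the tracer's actual implementation: the function name `should_dump` is hypothetical, and both the inclusive end bound and the precedence of `--iteration` over the start/end window are assumptions.

```python
from typing import Optional


def should_dump(counter: int,
                iteration: Optional[int] = None,
                start: Optional[int] = None,
                end: Optional[int] = None) -> bool:
    """Decide whether a kernel with the given enqueue counter is dumped.

    Assumed semantics (not confirmed by the tracer documentation):
    - iteration: dump only the kernel whose counter equals this value
    - start/end: dump kernels with start <= counter <= end (inclusive)
    - no option set: dump every kernel
    """
    if iteration is not None:
        # Assumption: a single-iteration request overrides the window.
        return counter == iteration
    if start is not None and counter < start:
        return False
    if end is not None and counter > end:
        return False
    return True


# Which of the first eight enqueues fall into the window 2..5?
print([c for c in range(8) if should_dump(c, start=2, end=5)])
```

Under these assumptions, the window is simply the intersection of the lower and upper bounds, and either bound can be omitted.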
Traces can be viewed using babeltrace, babeltrace2 or babeltrace_opencl. The last of these should give more structured information, at the cost of speed.
Similarly to OpenCL, a wrapper script with presets is provided, tracer_ze.sh:
tracer_ze.sh [options] [--] <application> <application-arguments>
  --help                        Show this screen
  --version                     Print the version string
  -b, --build                   Dump module build infos
  -p, --profiling               Enable profiling
  -v, --visualize               Visualize trace on the fly
  --properties                  Dump drivers and devices properties
Traces can be viewed using babeltrace, babeltrace2 or babeltrace_ze. The last of these should give more structured information, at the cost of speed.
Similarly to OpenCL, a wrapper script with presets is provided, tracer_cuda.sh:
tracer_cuda.sh [options] [--] <application> <application-arguments>
  --help                        Show this screen
  --version                     Print the version string
  --cudart                      Trace CUDA runtime on top of CUDA driver
  -a, --arguments               Extract argument infos and values
  -p, --profiling               Enable profiling
  -e, --exports                 Trace export functions
  -v, --visualize               Visualize trace on the fly
  --properties                  Dump devices infos
Traces can be viewed using babeltrace, babeltrace2 or babeltrace_cuda. The last of these should give more structured information, at the cost of speed.
iprof is another wrapper around the OpenCL, Level Zero, and CUDA tracers. It gives aggregated profiling information.
Usage: iprof [options]
    -m, --tracing-mode=MODE          Define the category of events traced
        --traced-ranks=RANK          Select which MPI rank will be traced.
                                     Use -1 to mean all ranks.
                                     Default: -1
        --[no-]profile               Device activities will not be profiled
    -b, --backend BACKEND            Select which backends need to be handled, and how.
                                     Format: backend_name[:backend_level],...
                                     Default: omp:2,cl:1,ze:1,cuda:1,hip:1
    -r, --replay [PATH]              Replay traces for post-mortem analysis
    -t, --trace                      Pretty print the trace
    -l, --timeline                   Dump a timeline of the trace.
                                     This will create a 'out.pftrace' file that can be opened in perfetto: https://ui.perfetto.dev/#!/viewer
    -j, --json                       The tally will be dumped as json
    -e, --extended                   The tally will be printed for each Hostname / Process / Thread / Device
    -k, --kernel-verbose             The tally will report kernels execution time with SIMD width and global/local sizes
        --max-name-size SIZE         Maximum size allowed for kernels names.
                                     Use -1 to mean no limit.
                                     Default: 80
        --metadata                   Display trace Metadata
    -v, --version                    Display THAPI version
        --debug [LEVEL]              Level of debug [default 0]
                                                      __
For complaints, praises, or bug reports please use: <(o )___
   https://github.com/argonne-lcf/THAPI              ( ._> /
   or send email to {apl,bvideau}@anl.gov             `---'
Programming model specific variants exist: clprof.sh, zeprof.sh, and cuprof.sh.
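The `--backend` option above takes a string of the form `backend_name[:backend_level],...`. As an illustration of that format only, here is a small sketch that parses and rebuilds such a string. The helper names `parse_backend_spec` and `format_backend_spec` are hypothetical, and treating a missing level as 1 is an assumption, not documented iprof behavior.

```python
from typing import Dict


def parse_backend_spec(spec: str, default_level: int = 1) -> Dict[str, int]:
    """Parse 'name[:level],...' into a {name: level} mapping.

    default_level is used when ':level' is omitted; the value 1 is an
    assumption made for this sketch.
    """
    backends: Dict[str, int] = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty entries such as a trailing comma
        name, sep, level = item.partition(":")
        if not name:
            raise ValueError(f"missing backend name in {item!r}")
        backends[name] = int(level) if sep else default_level
    return backends


def format_backend_spec(backends: Dict[str, int]) -> str:
    """Build the 'name:level,...' string from a {name: level} mapping."""
    return ",".join(f"{name}:{level}" for name, level in backends.items())


# The default value listed in the help text above.
default = parse_backend_spec("omp:2,cl:1,ze:1,cuda:1,hip:1")
print(default)

# Keep only the Level Zero and CUDA entries and rebuild the option value.
subset = {name: default[name] for name in ("ze", "cuda")}
print(format_backend_spec(subset))
```

The resulting string could then be passed as the value of `-b`, for example when a script launches `iprof` for a subset of backends.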
tapplencourt> iprof ./a.out
API calls | 1 Hostnames | 1 Processes | 1 Threads
                         Name |     Time | Time(%) | Calls |  Average |      Min |      Max | Failed |
     cuDevicePrimaryCtxRetain |  54.64ms |  51.77% |     1 |  54.64ms |  54.64ms |  54.64ms |      0 |
         cuMemcpyDtoHAsync_v2 |  24.11ms |  22.85% |     1 |  24.11ms |  24.11ms |  24.11ms |      0 |
 cuDevicePrimaryCtxRelease_v2 |  18.16ms |  17.20% |     1 |  18.16ms |  18.16ms |  18.16ms |      0 |
           cuModuleLoadDataEx |   4.73ms |   4.48% |     1 |   4.73ms |   4.73ms |   4.73ms |      0 |
               cuModuleUnload |   1.30ms |   1.23% |     1 |   1.30ms |   1.30ms |   1.30ms |      0 |
               cuLaunchKernel |   1.05ms |   0.99% |     1 |   1.05ms |   1.05ms |   1.05ms |      0 |
                cuMemAlloc_v2 | 970.60us |   0.92% |     1 | 970.60us | 970.60us | 970.60us |      0 |
               cuStreamCreate | 402.21us |   0.38% |    32 |  12.57us |   1.58us | 183.49us |      0 |
           cuStreamDestroy_v2 | 103.36us |   0.10% |    32 |   3.23us |   2.81us |   8.80us |      0 |
              cuMemcpyDtoH_v2 |  36.17us |   0.03% |     1 |  36.17us |  36.17us |  36.17us |      0 |
         cuMemcpyHtoDAsync_v2 |  13.11us |   0.01% |     1 |  13.11us |  13.11us |  13.11us |      0 |
          cuStreamSynchronize |   8.77us |   0.01% |     1 |   8.77us |   8.77us |   8.77us |      0 |
              cuCtxSetCurrent |   5.47us |   0.01% |     9 | 607.78ns | 220.00ns |   1.74us |      0 |
         cuDeviceGetAttribute |   2.71us |   0.00% |     3 | 903.33ns | 490.00ns |   1.71us |      0 |
   cuDevicePrimaryCtxGetState |   2.70us |   0.00% |     1 |   2.70us |   2.70us |   2.70us |      0 |
                cuCtxGetLimit |   2.30us |   0.00% |     2 |   1.15us | 510.00ns |   1.79us |      0 |
         cuModuleGetGlobal_v2 |   2.24us |   0.00% |     2 |   1.12us | 440.00ns |   1.80us |      1 |
                       cuInit |   1.65us |   0.00% |     1 |   1.65us |   1.65us |   1.65us |      0 |
          cuModuleGetFunction |   1.61us |   0.00% |     1 |   1.61us |   1.61us |   1.61us |      0 |
           cuFuncGetAttribute |   1.00us |   0.00% |     1 |   1.00us |   1.00us |   1.00us |      0 |
               cuCtxGetDevice | 850.00ns |   0.00% |     1 | 850.00ns | 850.00ns | 850.00ns |      0 |
cuDevicePrimaryCtxSetFlags_v2 | 670.00ns |   0.00% |     1 | 670.00ns | 670.00ns | 670.00ns |      0 |
                  cuDeviceGet | 640.00ns |   0.00% |     1 | 640.00ns | 640.00ns | 640.00ns |      0 |
             cuDeviceGetCount | 460.00ns |   0.00% |     1 | 460.00ns | 460.00ns | 460.00ns |      0 |
                        Total | 105.54ms | 100.00% |    98 |                                       1 |
Device profiling | 1 Hostnames | 1 Processes | 1 Threads | 1 Device pointers
                Name |    Time | Time(%) | Calls | Average |     Min |     Max |
  test_target__teams | 25.14ms |  99.80% |     1 | 25.14ms | 25.14ms | 25.14ms |
     cuMemcpyDtoH_v2 | 24.35us |   0.10% |     1 | 24.35us | 24.35us | 24.35us |
cuMemcpyDtoHAsync_v2 | 18.14us |   0.07% |     1 | 18.14us | 18.14us | 18.14us |
cuMemcpyHtoDAsync_v2 |  8.77us |   0.03% |     1 |  8.77us |  8.77us |  8.77us |
               Total | 25.19ms | 100.00% |     4 |
Explicit memory traffic | 1 Hostnames | 1 Processes | 1 Threads
                Name |  Byte | Byte(%) | Calls | Average |   Min |   Max |
cuMemcpyHtoDAsync_v2 | 4.00B |  44.44% |     1 |   4.00B | 4.00B | 4.00B |
cuMemcpyDtoHAsync_v2 | 4.00B |  44.44% |     1 |   4.00B | 4.00B | 4.00B |
     cuMemcpyDtoH_v2 | 1.00B |  11.11% |     1 |   1.00B | 1.00B | 1.00B |
               Total | 9.00B | 100.00% |     3 |
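The tally tables above are plain text with `|` separated columns and durations carrying a unit suffix (`ns`, `us`, `ms`). As a rough sketch of how such output could be post-processed, the following parses the rows of the device profiling table shown above and recomputes the share of the dominant kernel. The helper names are hypothetical and the parsing rules are inferred from this sample only; for robust processing the `--json` option is probably the better starting point.

```python
import re

# Seconds per unit, for the suffixes that appear in the sample output.
_UNIT_SECONDS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0}


def duration_to_seconds(text):
    """Convert a string such as '970.60us' to seconds."""
    match = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*(ns|us|ms|s)\s*", text)
    if match is None:
        raise ValueError(f"not a duration: {text!r}")
    return float(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def parse_tally_rows(lines):
    """Return {name: (seconds, calls)} for the data rows of a time tally."""
    rows = {}
    for line in lines:
        cells = [cell.strip() for cell in line.split("|")]
        if len(cells) < 4 or cells[0] in ("Name", "Total"):
            continue  # skip the column header and the summary row
        try:
            rows[cells[0]] = (duration_to_seconds(cells[1]), int(cells[3]))
        except ValueError:
            continue  # skip lines that are not data rows, e.g. table titles
    return rows


sample = [
    "                Name |    Time | Time(%) | Calls | Average |     Min |     Max |",
    "  test_target__teams | 25.14ms |  99.80% |     1 | 25.14ms | 25.14ms | 25.14ms |",
    "     cuMemcpyDtoH_v2 | 24.35us |   0.10% |     1 | 24.35us | 24.35us | 24.35us |",
    "cuMemcpyDtoHAsync_v2 | 18.14us |   0.07% |     1 | 18.14us | 18.14us | 18.14us |",
    "cuMemcpyHtoDAsync_v2 |  8.77us |   0.03% |     1 |  8.77us |  8.77us |  8.77us |",
    "               Total | 25.19ms | 100.00% |     4 |",
]

rows = parse_tally_rows(sample)
total = sum(seconds for seconds, _ in rows.values())
share = rows["test_target__teams"][0] / total
print(f"test_target__teams: {share:.2%} of device time")
```

The recomputed share agrees with the `Time(%)` column of the table, which suggests that the percentage is each row's time divided by the table total.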