Description
Summary
I need to iterate over a 3D array (with dimensions sizex, sizey, sizez) on the GPU using CUDA. The iteration should cover the full Cartesian product of the index space:
(x, y, z) ∈ [0, sizex) × [0, sizey) × [0, sizez)
Each CUDA thread should process multiple points in this 3D space, but not all threads start from the beginning. Instead, each thread starts at its own threadIdx.x and iterates over a flattened 1D index space in strides of blockDim.x. This is the standard block-stride pattern for splitting work across the threads of a block:
for (int idx = threadIdx.x; idx < total; idx += blockDim.x) {
    // Convert flat index `idx` to (x, y, z)
}
The code below shows how to convert this flattened index (idx) back to the 3D coordinates (x, y, z) using the known dimensions:
static constexpr int sizex = ns;
static constexpr int sizey = ns;
static constexpr int sizez = ns;
static constexpr int plane = sizex * sizey;
static constexpr int total = plane * sizez;
for (int idx = threadIdx.x; idx < total; idx += blockDim.x) {
    const int z = idx / plane;
    const int rem1 = idx % plane;
    const int y = rem1 / sizex;
    const int x = rem1 % sizex;
    // (x, y, z) is the 3D index
}
Proposed Abstraction
I want to generalize this loop using a utility function like:
for (auto [x, y, z] : ndrange(threadIdx.x, blockDim.x, {sizex, sizey, sizez})) {
    // do stuff
}
Question
Would it be feasible or advisable to encapsulate this 3D index mapping logic into a reusable function like cartesianProduct() for clarity and reusability in CUDA kernels?