Skip to content

cuda ndrange #13

@DiamonDinoia

Description

@DiamonDinoia

Summary

I need to iterate over a multi-dimensional 3D array (with dimensions sizex, sizey, sizez) on the GPU using CUDA. The iteration should cover the full Cartesian product of the index space:

(x, y, z) ∈ [0, sizex) × [0, sizey) × [0, sizez)

Each CUDA thread should process multiple points in this 3D space, but not all threads start from the beginning. Instead, each thread starts at its own threadIdx.x and iterates over a flattened 1D index space in strides of blockDim.x. This is a standard CUDA pattern for splitting work across threads:

for (int idx = threadIdx.x; idx < total; idx += blockDim.x) {
    // Convert flat index `idx` to (x, y, z)
}

The code below shows how to convert this flattened index (idx) back to the 3D coordinates (x, y, z) using known dimensions:

static constexpr int sizex = ns;
static constexpr int sizey = ns;
static constexpr int sizez = ns;
static constexpr int plane = sizex * sizey;
static constexpr int total = plane * sizez;

for (int idx = threadIdx.x; idx < total; idx += blockDim.x) {
    const int z   = idx / plane;
    const int rem1 = idx % plane;
    const int y   = rem1 / sizex;
    const int x   = rem1 % sizex;
    // (x, y, z) is the 3D index
}

Proposed Abstraction

I want to generalize this loop using a utility function like:

for [x,y,z] in ndrange(threadIdx.x, blockDim.x, {sizex, sizey, sizez}) {
// do struff 
}


Question

Would it be feasible or advisable to encapsulate this 3D index mapping logic into a reusable function like cartesianProduct() for clarity and reusability in CUDA kernels?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions