How fast can we get integer multiplication on an H100?
A lot of time and effort has been spent optimizing matmuls on GPUs (see here). And rightfully so: they're probably the most important computation ever performed on a GPU.
But what about integer multiplication? Multiplying numbers millions of digits long is still pretty important (in cryptography, for example), but it's not being actively optimized on modern GPUs. At least not publicly.
So, the goal of this repo is to serve as a benchmark for integer multiplication and to try and multiply million-digit numbers as fast as we possibly can.
We are given two positive numbers as arrays of `uint32` chunks in little-endian order (least significant chunk first). This lets us store and perform operations on numbers of arbitrary length.
In grade school, we're taught to multiply two numbers digit by digit and add up the shifted partial products. In our case, we'll need to multiply every `uint32` chunk of $A$ by every `uint32` chunk of $B$, accumulating each product into the right position of the result $C$.
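As a point of reference, here's a minimal CPU sketch of the grade-school algorithm on `uint32` chunks (the function and variable names are illustrative, not taken from the repo's kernels; unlike the naive GPU kernel below, this version folds carries in as it goes):

```c
#include <stdint.h>
#include <stddef.h>

/* Schoolbook multiplication of little-endian uint32 chunk arrays.
 * a has na chunks, b has nb chunks; c must have na + nb chunks, zeroed.
 * Carries are propagated row by row, so no intermediate overflows uint64:
 * (2^32-1)^2 + (2^32-1) + (2^32-1) = 2^64 - 1 exactly. */
static void mul_schoolbook(const uint32_t *a, size_t na,
                           const uint32_t *b, size_t nb, uint32_t *c)
{
    for (size_t i = 0; i < na; i++) {
        uint64_t carry = 0;
        for (size_t j = 0; j < nb; j++) {
            /* 32x32 -> 64-bit product, plus the chunk already at this
             * position, plus the running carry from the previous chunk. */
            uint64_t t = (uint64_t)a[i] * b[j] + c[i + j] + carry;
            c[i + j] = (uint32_t)t;
            carry = t >> 32;
        }
        c[i + nb] = (uint32_t)carry;  /* this slot is still zero here */
    }
}
```

This sequential version is $O(n^2)$ in the number of chunks, which is exactly the work we'd like to spread across GPU threads.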
To distribute this work across threads, we can have each thread compute the product of one `uint32` chunk of $A$ with one chunk of $B$, accumulating it into a shared `uint64` result array. Then we can have a single thread perform sequential carry propagation.
For example, if we had a block with just 4 threads, it might look like this:
While this technically works, and achieves ~0.002 seconds for 19-digit multiplication, there are several problems with the approach:
- $O(n^2)$ is just too slow.
- We need to use the `atomicAdd` operation, meaning multiple threads that accumulate their products into the same chunk have to wait for each other.
- We have to do carry propagation, which is a sequential operation.
- It doesn't work for $n \geq 20$ digits! We're accumulating `uint32` products into a `uint64` word. At 20 digits, we'll need two `uint32` chunks to represent $A$ and $B$. Computing one of the chunks of $C$ will require an operation like $A[0]\cdot B[1] + A[1] \cdot B[0]$, which can overflow a `uint64`. Of course, we can try to use different and bigger types, maybe 128-bit integers, but that only pushes the problem further down the line.
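To see the overflow concretely: a single 32×32-bit product can be as large as $(2^{32}-1)^2 = 2^{64} - 2^{33} + 1$, so the sum of just two such products can exceed the `uint64` range. A small check (the helper name is made up; `__builtin_add_overflow` assumes a GCC/Clang-style compiler):

```c
#include <stdint.h>
#include <stdbool.h>

/* Returns true if the cross-term sum a0*b1 + a1*b0 overflows uint64
 * for these chunk values. */
static bool cross_terms_overflow(uint32_t a0, uint32_t a1,
                                 uint32_t b0, uint32_t b1)
{
    uint64_t p0 = (uint64_t)a0 * b1;  /* up to 2^64 - 2^33 + 1 */
    uint64_t p1 = (uint64_t)a1 * b0;
    uint64_t sum;
    return __builtin_add_overflow(p0, p1, &sum);
}
```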
This is a fundamental problem with the naive multiplication kernel. It can't be fixed without intermediate carry propagation, which would reduce the parallelism to the point where using a GPU no longer makes sense. So, we move on to better, more interesting multiplication algorithms:
Coming soon!
Coming soon!