[D] 60% MatMul Performance Bug in cuBLAS on RTX 5090
cuBLAS dispatches an inefficient kernel for every batched FP32 workload, from 256×256 up to 8192×8192 with batch 8, using only ~40% of the available compute on RTX GPUs. Tested on an RTX 5090, but likely all non-Pro RTX GPUs are affected.
I tested with the latest CUDA 13.2.51, cuBLAS 13.3.0, and driver 595.58.03. Previous versions are even worse.
I wrote a simple, yet efficient kernel and compared it to cuBLAS across a variety of workloads.
Batched perf vs cuBLAS on 5090 (>100% means my kernel is faster):
| Size | B=4 | B=8 | B=16 |
|---|---|---|---|
| 256 | 91% | 80% | 90% |
| 512 | 120% | 153% | 135% |
| 1024 | 137% | 142% | 142% |
| 2048 | 158% | 155% | 157% |
| 4096 | 157% | 162% | 170% |
| 8192 | 158% | 152% | 148% |
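For reference, the kind of workload behind these numbers can be driven with a minimal `cublasSgemmStridedBatched` harness. This is a hedged sketch, not the repo's actual benchmark: the sizes, iteration count, and warm-up policy here are my own illustrative choices.

```cpp
// Minimal batched FP32 GEMM timing sketch (illustrative, not the repo harness).
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int N = 4096, B = 8, iters = 10;
    const long long stride = (long long)N * N;
    float *A, *Bm, *C;
    cudaMalloc(&A,  sizeof(float) * stride * B);
    cudaMalloc(&Bm, sizeof(float) * stride * B);
    cudaMalloc(&C,  sizeof(float) * stride * B);

    cublasHandle_t h;
    cublasCreate(&h);
    const float alpha = 1.f, beta = 0.f;

    // Warm-up so one-time kernel selection doesn't pollute the timing
    cublasSgemmStridedBatched(h, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                              &alpha, A, N, stride, Bm, N, stride,
                              &beta, C, N, stride, B);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    for (int i = 0; i < iters; i++)
        cublasSgemmStridedBatched(h, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                                  &alpha, A, N, stride, Bm, N, stride,
                                  &beta, C, N, stride, B);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms;
    cudaEventElapsedTime(&ms, t0, t1);
    // 2*N^3 FLOPs per GEMM, B GEMMs per call, iters calls
    double tflops = (double)iters * 2.0 * B * N * (double)N * N / (ms * 1e-3) / 1e12;
    printf("%.1f TFLOP/s\n", tflops);
    cublasDestroy(h);
    return 0;
}
```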
cuBLAS uses a proper kernel on other GPUs. RTX GPUs clearly receive less love from NVIDIA:
- Pro 6000: escalates through three tile sizes, reaching 73% FMA (fused multiply-add) pipe utilization
- H200: best implementation, mixing CUTLASS and xmma kernel families, reaching 82% FMA pipe utilization
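To check which kernel cuBLAS dispatches on your own GPU, Nsight Compute can report per-kernel FMA-pipe utilization. A hedged sketch — `./bench` is a placeholder for your benchmark binary, not a script from the repo:

```shell
# Show each launched kernel's FMA pipe utilization (% of peak).
# The kernel names in the output reveal which cuBLAS implementation was picked.
ncu --metrics sm__pipe_fma_cycles_active.avg.pct_of_peak_sustained_active \
    --print-summary per-kernel ./bench
```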
An in-depth analysis is available in the article linked below: full NCU profiling data across all three GPUs, a deep dive into SASS scheduling that explains the remaining 5% gap in single (non-batched) mode between my kernel and a properly selected cuBLAS SGEMM, and repro scripts.
Besides the bug, the article covers a simple TMA (Tensor Memory Accelerator) double-buffer kernel. It beats cuBLAS by 46-65% in batched mode on the 5090 and reaches 80-120% of the performance of a properly selected kernel, making it a nice technique for writing simple yet very performant kernels.
vs. the proper Pro 6000 kernel:
| Size | B=4 | B=8 | B=16 |
|---|---|---|---|
| 256 | 87% | 95% | 77% |
| 512 | 102% | 124% | 101% |
| 1024 | 101% | 104% | 96% |
| 2048 | 90% | 102% | 93% |
| 4096 | 93% | 93% | 93% |
| 8192 | 94% | 95% | 95% |
vs. the proper H200 kernel:
| Size | B=4 | B=8 | B=16 |
|---|---|---|---|
| 256 | 85% | 104% | 77% |
| 512 | 105% | 97% | 88% |
| 1024 | 87% | 89% | 89% |
| 2048 | 89% | 90% | 92% |
| 4096 | 91% | 89% | 90% |
| 8192 | 88% | 87% | 87% |
Double buffer pipeline visualization:
```
Tile 0: [load buf0] [wait] [compute buf0 + load buf1]
Tile 1: [wait buf1]        [compute buf1 + load buf0]
Tile 2: [wait buf0]        [compute buf0 + load buf1]
...
```
Simplified kernel source:
```cuda
__global__ __launch_bounds__(256) void fused_matmul(
    const __grid_constant__ CUtensorMap A_tma,
    const __grid_constant__ CUtensorMap B_tma,
    float* C)
{
    extern __shared__ __align__(128) char dsmem[];
    float* smem = (float*)dsmem;

    // Two mbarriers for double-buffer synchronization, placed after the tiles
    uint64_t* mbar = (uint64_t*)(dsmem + 2 * STAGE * 4);

    // Shared memory addresses for TMA targets (one A/B pair per buffer)
    const uint32_t as0 = __cvta_generic_to_shared(&smem[0]);
    const uint32_t bs0 = __cvta_generic_to_shared(&smem[A_SIZE]);
    const uint32_t as1 = __cvta_generic_to_shared(&smem[STAGE]);
    const uint32_t bs1 = __cvta_generic_to_shared(&smem[STAGE + A_SIZE]);

    // Thread identity
    int tid = threadIdx.y * 32 + threadIdx.x;
    int tr = threadIdx.y * TM, tc = threadIdx.x * 4;
    int bm = blockIdx.y * BM, bn = blockIdx.x * BN;

    // Initialize mbarriers (thread 0 only)
    if (tid == 0) { mbarrier_init(mbar[0]); mbarrier_init(mbar[1]); }
    __syncthreads();

    float c[TM][4] = {};    // Accumulators
    int phase[2] = {0, 0};  // mbarrier phase parity per buffer

    // Pre-load first tile
    if (tid == 0) {
        mbarrier_expect_tx(mbar[0], BYTES);
        tma_load_2d(as0, &A_tma, /*k=*/0, bm, mbar[0]);
        tma_load_2d(bs0, &B_tma, bn, /*k=*/0, mbar[0]);
    }

    const int nt = K / BK;  // Number of K tiles
    for (int t = 0; t < nt; t++) {
        int s = t % 2;      // Current buffer

        // Wait for current tile's TMA to complete
        mbarrier_wait(mbar[s], phase[s]);
        phase[s] ^= 1;

        // Start loading the NEXT tile into the other buffer
        // (overlaps with compute)
        if (tid == 0 && t + 1 < nt) {
            int ns = s ^ 1;
            mbarrier_expect_tx(mbar[ns], BYTES);
            tma_load_2d(ns ? as1 : as0, &A_tma, (t + 1) * BK, bm, mbar[ns]);
            tma_load_2d(ns ? bs1 : bs0, &B_tma, bn, (t + 1) * BK, mbar[ns]);
        }

        // Compute: all 256 threads do FMAs from shared memory
        float* As = &smem[s * STAGE];
        float* Bs = &smem[s * STAGE + A_SIZE];
        #pragma unroll
        for (int kk = 0; kk < BK; kk++) {
            float b0 = Bs[kk*BN+tc], b1 = Bs[kk*BN+tc+1];  // ... b2, b3
            for (int i = 0; i < TM; i++) {
                float a = As[(tr+i)*BK+kk];
                c[i][0] += a * b0;
                c[i][1] += a * b1;
                // ... 4 FMAs per row
            }
        }
        __syncthreads();
    }

    // Write results to global memory
    for (int i = 0; i < TM; i++)
        store_row(C, bm+tr+i, bn+tc, c[i]);
}
```
The full article is available here
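The kernel consumes `CUtensorMap` descriptors built on the host with the driver API's `cuTensorMapEncodeTiled`. A hedged sketch of that setup — the tile sizes and the row-major layout are my assumptions, and `make_tma_desc` is an illustrative helper, not a function from the repo:

```cpp
// Host-side sketch: encode a 2-D row-major float tensor for TMA box loads.
#include <cuda.h>
#include <cstdint>

// Assumed tile sizes; must match the kernel's compile-time constants.
constexpr uint32_t BM = 128, BN = 128, BK = 16;

CUtensorMap make_tma_desc(float* gmem, uint64_t rows, uint64_t cols,
                          uint32_t box_rows, uint32_t box_cols) {
    CUtensorMap desc;
    // dim[0] is the fastest-moving dimension (columns, for row-major)
    uint64_t dims[2]    = {cols, rows};
    uint64_t strides[1] = {cols * sizeof(float)};  // bytes; rank-1 entries
    uint32_t box[2]     = {box_cols, box_rows};    // shared-memory box shape
    uint32_t elem[2]    = {1, 1};                  // dense element strides
    cuTensorMapEncodeTiled(&desc, CU_TENSOR_MAP_DATA_TYPE_FLOAT32,
                           /*rank=*/2, gmem, dims, strides, box, elem,
                           CU_TENSOR_MAP_INTERLEAVE_NONE,
                           CU_TENSOR_MAP_SWIZZLE_NONE,
                           CU_TENSOR_MAP_L2_PROMOTION_L2_128B,
                           CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
    return desc;
}
// Usage (M, N, K and dA/dB are the GEMM shape and device pointers):
//   CUtensorMap A_tma = make_tma_desc(dA, M, K, BM, BK);
//   CUtensorMap B_tma = make_tma_desc(dB, K, N, BK, BN);
```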
Repo with repro scripts and benchmark data