# [Study] CUDA Kernel Optimization — Memory Access Patterns
While studying deep learning inference optimization, I explored how memory access patterns in CUDA kernels dramatically affect performance.
## Key Concepts

### Coalesced Memory Access
When threads within a warp access contiguous addresses in global memory, the GPU coalesces them into a single transaction. Strided (non-contiguous) access multiplies the number of transactions and tanks bandwidth efficiency.
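To make the difference concrete, here is a minimal sketch (hypothetical kernel names, not from my benchmark code) contrasting the two access patterns:

```cuda
// Coalesced: thread i reads element i, so a warp's 32 reads fall in
// consecutive addresses and coalesce into few wide transactions.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: thread i reads element i * stride. For large strides each
// thread touches a different memory segment, multiplying transactions
// and wasting most of the bytes each transaction fetches.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];
}
```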
### Shared Memory Tiling

Shared memory is on-chip SRAM that, on recent NVIDIA architectures, shares physical storage with the L1 cache. By staging data in shared-memory tiles that every thread in a block can reuse, we drastically reduce round-trips to global memory.
```cuda
#define TILE 16  // tile width; 16x16 = 256 threads per block

__global__ void matmul_tiled(const float *A, const float *B, float *C, int N) {
    __shared__ float sA[TILE][TILE];  // tile of A staged in shared memory
    __shared__ float sB[TILE][TILE];  // tile of B staged in shared memory
    // ...
}
```
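Filling in the elided body, a minimal complete version might look like the sketch below. This assumes N is an exact multiple of TILE; boundary checks for ragged edges are omitted for brevity:

```cuda
#define TILE 16

__global__ void matmul_tiled(const float *A, const float *B, float *C, int N) {
    __shared__ float sA[TILE][TILE];
    __shared__ float sB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;  // output row this thread owns
    int col = blockIdx.x * TILE + threadIdx.x;  // output column this thread owns
    float acc = 0.0f;

    // Walk across the K dimension one tile at a time.
    for (int t = 0; t < N / TILE; ++t) {
        // Cooperative load: each thread fetches one element of each tile.
        // Consecutive threadIdx.x values read consecutive addresses,
        // so both loads are coalesced.
        sA[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        sB[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();  // tiles fully populated before use

        // Each global-memory element loaded above is reused TILE times here.
        for (int k = 0; k < TILE; ++k)
            acc += sA[threadIdx.y][k] * sB[k][threadIdx.x];
        __syncthreads();  // done reading before the next tile overwrites
    }
    C[row * N + col] = acc;
}
```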
## Benchmark Results
| Implementation | Throughput (GFLOPS) |
|---|---|
| Naive (global memory) | 42 |
| Coalesced access | 198 |
| + Shared memory tiling | 573 |
Together, coalesced access and shared memory tiling yielded a ~13.6× speedup over the naive kernel (573 vs. 42 GFLOPS); tiling alone contributed roughly 2.9× on top of the coalesced version.
## Next Goals

- Analyze bank conflicts and test padding strategies
- Explore the `__ldg()` read-only data cache
- Minimize warp divergence
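For the bank-conflict item, one padding strategy I plan to test is widening each shared-memory row by one element, which staggers rows across the 32 four-byte banks so that column-wise accesses stop colliding. A sketch of just the declaration change (not yet measured):

```cuda
// With a 16-wide float row, elements two rows apart in the same column
// land in the same bank, so threads walking a column serialize.
// Padding the row to TILE + 1 shifts each row by one bank.
__shared__ float sB[TILE][TILE + 1];  // was: sB[TILE][TILE]
```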
