
[Study] CUDA Kernel Optimization — Memory Access Patterns

hwkim-dev · Developer

While studying deep learning inference optimization, I explored how memory access patterns in CUDA kernels dramatically affect performance.

Key Concepts

Coalesced Memory Access

When threads within a warp access contiguous addresses in global memory, the GPU coalesces them into a single transaction. Strided (non-contiguous) access multiplies the number of transactions and tanks bandwidth efficiency.
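As a minimal illustration (kernel names are mine, not from the original benchmark), compare a copy kernel where adjacent threads read adjacent addresses against one where each thread jumps by a stride:

```cuda
// Coalesced: thread i reads element i, so a 32-thread warp touches
// one contiguous 128-byte segment -> one (or few) transactions.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: thread i reads element i * stride, so the warp's addresses
// are scattered across many segments -> many transactions, same data.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```

Profiling both with Nsight Compute makes the transaction-count difference directly visible.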

Shared Memory Tiling

Shared memory is on-chip SRAM that, on modern NVIDIA architectures, shares physical storage with the L1 cache. By staging data in tiles and reusing each tile many times, we drastically reduce round-trips to global memory.

#define TILE 16  // tile width; one thread computes one output element

__global__ void matmul_tiled(const float *A, const float *B, float *C, int N) {
    __shared__ float sA[TILE][TILE];  // tile of A staged in shared memory
    __shared__ float sB[TILE][TILE];  // tile of B staged in shared memory
    // ... loop over tiles: load, __syncthreads(), accumulate, __syncthreads()
}
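For context, a host-side launch for such a kernel might look like the sketch below (the `d_A`/`d_B`/`d_C` device pointers and the assumption that N is a multiple of TILE are mine):

```cuda
// Hypothetical launch configuration: one TILE x TILE thread block
// per TILE x TILE output tile. Assumes N % TILE == 0.
dim3 block(TILE, TILE);
dim3 grid(N / TILE, N / TILE);
matmul_tiled<<<grid, block>>>(d_A, d_B, d_C, N);
cudaDeviceSynchronize();  // wait for the kernel before timing/reading results
```

Matching the block shape to the tile shape keeps the shared-memory load indices and the compute indices identical, which is what makes the tiling loop simple.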

Benchmark Results

Implementation            Throughput (GFLOPS)
Naive (global memory)     42
Coalesced access          198
+ Shared memory tiling    573

Coalescing alone gave ~4.7×; adding shared memory tiling on top brought the combined speedup to ~13.6× over the naive kernel.

Next Goals

  • Analyze bank conflicts and test padding strategies
  • Explore __ldg() read-only cache
  • Minimize warp divergence patterns
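For the first goal, the padding strategy I plan to test is the classic one-extra-column trick (sketch, not yet benchmarked):

```cuda
// Shared memory is split into 32 banks. With a 32-wide row, column-wise
// accesses (sA[threadIdx.x][k]) all hit the same bank and serialize.
// One padding column shifts each row by one bank, spreading the accesses.
__shared__ float sA[32][32 + 1];  // +1 column breaks the bank alignment
```

The cost is a small amount of wasted shared memory per row in exchange for conflict-free column reads.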