# [Study] CUDA Kernel Optimization — Memory Access Patterns
While studying deep learning inference optimization, I explored how memory access patterns in CUDA kernels dramatically affect performance.
## Key Concepts

### Coalesced Memory Access
When threads within a warp access contiguous addresses in global memory, the GPU coalesces them into a single transaction. Strided (non-contiguous) access multiplies the number of transactions and tanks bandwidth efficiency.
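To make the difference concrete, here is a minimal sketch (hypothetical kernel names, not from my benchmark code) contrasting the two access patterns:

```cuda
// Coalesced: thread i reads element i, so a warp's 32 reads fall in
// consecutive addresses and coalesce into few wide transactions.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: thread i reads element i * stride. For large strides each
// thread touches a different memory segment, multiplying transactions
// and wasting most of the bytes each transaction fetches.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];
}
```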
### Shared Memory Tiling

Shared memory is on-chip SRAM that, on recent NVIDIA architectures, shares physical storage with the L1 cache. By staging data in shared-memory tiles that every thread in a block can reuse, we drastically reduce round-trips to global memory.
```cuda
#define TILE 16  // tile width; 16x16 = 256 threads per block

__global__ void matmul_tiled(const float *A, const float *B, float *C, int N) {
    __shared__ float sA[TILE][TILE];  // tile of A staged in shared memory
    __shared__ float sB[TILE][TILE];  // tile of B staged in shared memory
    // ...
}
```
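Filling in the elided body, a minimal complete version might look like the sketch below. This assumes N is an exact multiple of TILE; boundary checks for ragged edges are omitted for brevity:

```cuda
#define TILE 16

__global__ void matmul_tiled(const float *A, const float *B, float *C, int N) {
    __shared__ float sA[TILE][TILE];
    __shared__ float sB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;  // output row this thread owns
    int col = blockIdx.x * TILE + threadIdx.x;  // output column this thread owns
    float acc = 0.0f;

    // Walk across the K dimension one tile at a time.
    for (int t = 0; t < N / TILE; ++t) {
        // Cooperative load: each thread fetches one element of each tile.
        // Consecutive threadIdx.x values read consecutive addresses,
        // so both loads are coalesced.
        sA[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        sB[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();  // tiles fully populated before use

        // Each global-memory element loaded above is reused TILE times here.
        for (int k = 0; k < TILE; ++k)
            acc += sA[threadIdx.y][k] * sB[k][threadIdx.x];
        __syncthreads();  // done reading before the next tile overwrites
    }
    C[row * N + col] = acc;
}
```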
## Benchmark Results
| Implementation | Throughput (GFLOPS) |
|---|---|
| Naive (global memory) | 42 |
| Coalesced access | 198 |
| + Shared memory tiling | 573 |
Together, coalesced access and shared memory tiling yielded a ~13.6× speedup over the naive kernel (573 vs. 42 GFLOPS); tiling alone contributed roughly 2.9× on top of the coalesced version.
## Next Goals

- Analyze bank conflicts and test padding strategies
- Explore the `__ldg()` read-only data cache
- Minimize warp divergence
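For the bank-conflict item, one padding strategy I plan to test is widening each shared-memory row by one element, which staggers rows across the 32 four-byte banks so that column-wise accesses stop colliding. A sketch of just the declaration change (not yet measured):

```cuda
// With a 16-wide float row, elements two rows apart in the same column
// land in the same bank, so threads walking a column serialize.
// Padding the row to TILE + 1 shifts each row by one bank.
__shared__ float sB[TILE][TILE + 1];  // was: sB[TILE][TILE]
```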
