Memory Hierarchy

The pccx v002 memory subsystem is a four-level hierarchy: host DDR4 → Weight Buffer / L2 Cache → L1 / Constant Cache → PE registers. Each level is sized to match bandwidth with the next and to prevent data starvation in the compute cores.

flowchart TB
  DDR[("Host DDR4<br/>19.2 GB/s")]
  subgraph ext["External 250 MHz"]
    HP["AXI HP2 / HP3<br/>256 bit/clk × 2"]
  end
  subgraph core["Internal 400 MHz"]
    WB["Weight Buffer<br/>URAM FIFO"]
    L2[("L2 Cache<br/>URAM ~1.75 MB")]
    L1["L1 Cache<br/>per-core BRAM"]
    CC["Constant Cache<br/>BRAM"]
    L0["L0 Accumulator<br/>DSP48E2 P-reg"]
  end
  DDR --> HP
  HP -->|weights| WB
  HP -->|activations| L2
  WB --> L1
  L2 --> L1
  L2 --> CC
  L1 --> L0
    

1. Hierarchy

| Level | Media | Capacity (KV260) | Peak Bandwidth | Purpose |
|---|---|---|---|---|
| L0 Register | FF | Inside DSP48E2 | 48 bit / clk / DSP | Accumulator |
| L1 Cache | BRAM | A few KB per core | 32 elements / clk | GEMV activation / result staging |
| Constant Cache | BRAM | A few KB per core | 16 bit × N / clk | ISA shape/size pointers, scale factors |
| L2 Cache | URAM | 1.75 MB (114,688 × 128-bit; ~50 of 64 URAM) | 256 bit × 2 / clk (both slices) | Activations, KV cache, intermediate results |
| Weight Buffer | URAM (FIFO) | 4 × 64 KB (4 HP ports, 4096 deep each) | 128 bit/clk per HP port @ 250 MHz | INT4 weight stream |
| Host DDR4 | External DRAM | 4 × 512 Mb × 16-bit | 19.2 GB/s | Model weights, inputs, token outputs |
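The capacity figures for the two URAM-backed levels can be checked from the word counts quoted in the hierarchy table. The following is a minimal Python sketch, assuming binary units (1 KB = 1,024 bytes) and assuming each Weight Buffer FIFO entry is 128 bits wide (the per-port width given in the table); `capacity_bytes` is a hypothetical helper, not part of the pccx toolchain.

```python
def capacity_bytes(words: int, width_bits: int) -> int:
    """Storage in bytes for `words` entries of `width_bits` bits each."""
    return words * width_bits // 8

# L2 cache: 114,688 words of 128 bits (figures from the hierarchy table).
l2_bytes = capacity_bytes(114_688, 128)
l2_mib = l2_bytes / (1024 * 1024)

# Weight Buffer: one FIFO per HP port, 4,096 entries deep.
# Assumption: each entry is 128 bits wide, matching the per-port bus width.
fifo_bytes = capacity_bytes(4_096, 128)
fifo_kib = fifo_bytes / 1024

print(f"L2 cache: {l2_mib} MiB")
print(f"Weight Buffer: {fifo_kib} KiB per port, {4 * fifo_kib} KiB total")
```

Both results agree with the table: 1.75 MB for L2 and 64 KB per Weight Buffer FIFO.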

2. Bandwidth Matching

2.1 Weight Path

Goal: the HP ports must deliver enough weight bandwidth to feed the GEMM systolic array each cycle.

  • Systolic array: 32 × 32 = 1,024 DSP at 400 MHz (one grid, cascade split at row 16 into two 32 × 16 sub-chains).

  • With W4A8 dual-channel packing, 1 DSP = 2 MAC, so 2,048 MAC/clk.

  • Weight demand: 2,048 × 4 bit = 8,192 bit/clk @ 400 MHz.

  • Supply: HP0 + HP1 deliver 2 × 128 bit/clk @ 250 MHz (= 64 Gbit/s total raw), which normalises to ~160 bit/clk @ 400 MHz downstream of the CDC FIFO.

The gap of roughly 51× (8,192 bit/clk demanded versus ~160 bit/clk supplied) is closed by weight reuse (Weight Stationary): the GEMM systolic array preloads weights once and reuses them for hundreds to thousands of cycles, while the Weight Buffer only prefetches. See GEMM Core (Systolic Array) for the exact reuse pattern.
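The demand and supply figures in this subsection can be reproduced with a few lines of arithmetic. This is a sketch using only the numbers quoted above; the variable names are illustrative, and protocol overhead on the HP ports is ignored (an assumption).

```python
CORE_MHZ = 400   # internal clock
HP_MHZ = 250     # external AXI HP clock

# Demand: 32 x 32 DSPs, 2 MACs per DSP (W4A8 dual-channel packing),
# one 4-bit weight per MAC.
dsp_count = 32 * 32
macs_per_clk = dsp_count * 2
demand_bits_per_clk = macs_per_clk * 4

# Supply: two HP ports, 128 bit/clk each, in the 250 MHz domain.
hp_ports = 2
hp_width_bits = 128
supply_mbit_per_s = hp_ports * hp_width_bits * HP_MHZ      # raw Mbit/s
supply_bits_per_core_clk = supply_mbit_per_s / CORE_MHZ    # seen at 400 MHz

# Minimum reuse factor needed for prefetch to keep up with consumption.
min_reuse = demand_bits_per_clk / supply_bits_per_core_clk

print(f"demand: {demand_bits_per_clk} bit/clk")
print(f"supply: {supply_mbit_per_s / 1000} Gbit/s = {supply_bits_per_core_clk} bit/clk")
print(f"minimum reuse factor: {min_reuse}")
```

Under these figures, refilling the array's 8,192 bits of weights takes about 51 core cycles, so a preloaded weight set has to stay resident for at least that long for prefetch to be hidden. The hundreds to thousands of reuse cycles quoted above leave a wide margin.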

2.2 Activation Path

Goal: L2 cache must satisfy concurrent activation reads from GEMM, GEMV, and SFU.

  • L2 cache ports: dual-port URAM — ACP DMA on Port A, NPU compute on Port B, both 128-bit wide per cycle.

  • Peak slice-side demand: 4 GEMV cores × 32 INT8 elements/clk = 128 INT8 elem/clk total. A single 128-bit URAM read supplies 16 INT8 elements per cycle, so the GEMV broadcast path (activation is reused across all 4 cores) works within a single port.

2.3 Host ↔ Device Path

Goal: load model weights during prefill, and support KV cache updates plus token output during decoding.

  • DMA runs over the AXI ACP port and is capped by the host DDR4's 19.2 GB/s.

  • At ~20 tokens/s the host ↔ device traffic is dominated by KV cache updates and new token writes — well within the budget.

3. Cache Operating Policy

3.1 L2 Cache: Central Shared Buffer

L2 cache runs as a software-managed scratchpad — there is no hardware replacement policy. Addresses are named directly in the instruction stream (MEMCPY dest_addr, GEMM src_addr). Benefits:

  • Predictable latency (no tag matching, no miss handling).

  • The compiler can lay out data statically and route around interconnect contention.
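Static layout in a software-managed scratchpad amounts to assigning fixed base addresses at compile time. The sketch below shows one simple way to do this with a bump allocator over the 114,688 words of L2. The class name, region names, and region sizes are all hypothetical; the actual pccx compiler's allocation strategy is not described in this section.

```python
from dataclasses import dataclass, field

L2_WORDS = 114_688  # 128-bit words, from the hierarchy table


@dataclass
class ScratchpadLayout:
    """Compile-time bump allocator over a word-addressed scratchpad."""
    size_words: int = L2_WORDS
    next_free: int = 0
    regions: dict = field(default_factory=dict)  # name -> (base, words)

    def alloc(self, name: str, words: int) -> int:
        """Reserve `words` consecutive words and return the base address."""
        if words <= 0:
            raise ValueError("region size must be positive")
        if self.next_free + words > self.size_words:
            raise MemoryError(f"{name}: scratchpad overflow")
        base = self.next_free
        self.regions[name] = (base, words)
        self.next_free += words
        return base


# Hypothetical region sizes, chosen only for illustration.
layout = ScratchpadLayout()
act_base = layout.alloc("activations", 4_096)
kv_base = layout.alloc("kv_cache", 65_536)
out_base = layout.alloc("gemm_out", 4_096)
print(layout.regions)
```

Because every base address is known before execution, the addresses can be emitted directly as instruction operands, and an overflow is a compile-time error rather than a runtime miss.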

3.2 Constant Cache: ISA Pointer Backing Store

The ISA references shape / size metadata through 6-bit shape_ptr_addr and size_ptr_addr fields. These pointers index into the Constant Cache’s 64 entries, which are preloaded by MEMSET. See Per-Instruction Encoding for the encoding.
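A 6-bit pointer addresses exactly 2^6 = 64 entries, which is why the field width and the cache depth match. The following is a minimal behavioural model of that lookup. The class and method names are hypothetical, and entries are stored as plain Python integers because the entry format is not specified here.

```python
PTR_BITS = 6
ENTRIES = 1 << PTR_BITS  # 64 entries addressable by a 6-bit pointer


class ConstantCacheModel:
    """Behavioural model: 64 preloaded entries indexed by 6-bit pointers."""

    def __init__(self) -> None:
        self.entries = [0] * ENTRIES

    @staticmethod
    def _check(ptr: int) -> None:
        if not 0 <= ptr < ENTRIES:
            raise IndexError(f"pointer {ptr} does not fit in {PTR_BITS} bits")

    def memset(self, addr: int, value: int) -> None:
        """Preload one entry (stands in for the MEMSET instruction)."""
        self._check(addr)
        self.entries[addr] = value

    def read(self, ptr: int) -> int:
        """Resolve a shape_ptr_addr or size_ptr_addr field to its value."""
        self._check(ptr)
        return self.entries[ptr]


cc = ConstantCacheModel()
cc.memset(5, 4096)   # hypothetical: entry 5 holds a dimension of 4096
shape_ptr_addr = 5   # value carried in the instruction's 6-bit field
print(cc.read(shape_ptr_addr))
```

Keeping shapes and sizes behind short pointers lets an instruction reference wide metadata while spending only 6 bits per operand.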

3.3 Weight Buffer: Streaming FIFO

The Weight Buffer is implemented as a circular FIFO that absorbs the timing difference between HP port prefetch and core consumption. It supports both GEMM’s Weight Stationary reuse and GEMV’s Weight Streaming pattern via bank-level interleaving.
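The circular FIFO behaviour can be illustrated with a small software model. This is a sketch only: the class name and interface are hypothetical, the default depth of 4,096 comes from the hierarchy table, and clock-domain crossing and bank interleaving are not modelled.

```python
class CircularFifo:
    """Fixed-depth ring buffer with separate read and write indices."""

    def __init__(self, depth: int = 4096) -> None:
        self.depth = depth
        self.buf = [None] * depth
        self.head = 0    # next slot to read (consumer side)
        self.tail = 0    # next slot to write (producer side)
        self.count = 0   # number of valid entries

    def push(self, word) -> bool:
        """Write one word; return False when full (producer must stall)."""
        if self.count == self.depth:
            return False
        self.buf[self.tail] = word
        self.tail = (self.tail + 1) % self.depth  # wrap around
        self.count += 1
        return True

    def pop(self):
        """Read the oldest word; raise IndexError when empty."""
        if self.count == 0:
            raise IndexError("FIFO empty")
        word = self.buf[self.head]
        self.head = (self.head + 1) % self.depth  # wrap around
        self.count -= 1
        return word


# Small depth so the wrap-around is visible.
fifo = CircularFifo(depth=4)
for w in (10, 11, 12, 13):
    fifo.push(w)
overflow_ok = fifo.push(14)            # buffer is full here
first_two = [fifo.pop(), fifo.pop()]   # frees two slots
fifo.push(14)
fifo.push(15)                          # these writes wrap to slots 0 and 1
rest = [fifo.pop() for _ in range(fifo.count)]
print(overflow_ok, first_two, rest)
```

The producer stalls only when the buffer is full and the consumer only when it is empty, which is how the FIFO absorbs rate differences between HP port prefetch and core consumption.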

4. Preventing Data Starvation

Pipeline stalls are avoided with double-buffering throughout:

  • GEMM activations: ping-pong buffers between L2 and PE.

  • GEMV activations: bank-split L1 cache for simultaneous read/write.

  • Weights: ping-pong FIFO inside the Weight Buffer.
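The double-buffering pattern listed above can be sketched as a two-bank buffer in which the producer fills one bank while the consumer drains the other, with the roles swapped at each tile boundary. The class and method names are hypothetical, and this model serialises what the hardware does concurrently.

```python
class PingPongBuffer:
    """Two equally sized banks; one is filled while the other is drained."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.banks = [[0] * size, [0] * size]
        self.fill = 0  # index of the bank the producer writes

    @property
    def drain(self) -> int:
        """Index of the bank the consumer reads."""
        return self.fill ^ 1

    def write(self, data) -> None:
        """Producer side: load the next tile into the fill bank."""
        if len(data) != self.size:
            raise ValueError("tile size must match bank size")
        self.banks[self.fill] = list(data)

    def read(self) -> list:
        """Consumer side: read the tile that was loaded before the last swap."""
        return list(self.banks[self.drain])

    def swap(self) -> None:
        """Exchange roles at a tile boundary."""
        self.fill ^= 1


buf = PingPongBuffer(size=4)
buf.write([1, 2, 3, 4])    # tile 0 goes into bank 0
buf.swap()                 # bank 0 becomes readable, bank 1 fillable
buf.write([5, 6, 7, 8])    # tile 1 loads while tile 0 is consumed
tile0 = buf.read()
buf.swap()
tile1 = buf.read()
print(tile0, tile1)
```

As long as loading the next tile takes no longer than consuming the current one, the consumer never waits on a transfer.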

The design targets 100% busy-rate for every compute core under ideal conditions. Measured utilization will be reported in the Implementation section once synthesis results come in.