Top-Level Architecture¶

Figure 1 **Figure 1.** pccx v002 top-level block diagram. Shows the data path between host DDR4 and the device-side L2 cache, the clock domain crossings, and the placement of each heterogeneous compute core.¶

pccx v002 system context — PS, PL NPU, and external DRAM — Figure 2 **Figure 2.** System context. The AXI-Lite command path (`CMD_IN`) carries 64-bit instructions from the host driver into the Control Unit + Dispatcher; the dual-direction AXI-Lite `STAT_OUT` reports async completion. Four 128-bit AXI-HP ports plus one ACP port move the bulk data (weights, activations, KV cache) between external DDR4 and the AXI Interconnect. Inside the PL, the Dispatcher fans μops out over a single trunk to the GEMM, GEMV, SFU, and DMA back-ends, which read and write through the L1 Weight Buffer and the L2 cache.¶

1. Design Principles¶

The top level is organized around four principles.

Clock domain separation — The AXI infrastructure (250 MHz) is decoupled from the core logic (400 MHz) so that AXI protocol timing never constrains the core frequency.
Decoupled dataflow — Instruction decode (front-end) and execution (back-end) are separated by an asynchronous FIFO, eliminating pipeline stalls due to dispatch latency.
Separated weight and activation paths — Weights stream directly from host DDR4 into the weight buffer, while activations flow from the central L2 cache. The two streams never collide on the interconnect.
Centralized L2 cache — GEMM, GEMV, and SFU are placed physically close to L2, eliminating the cost of reshuffling activations between layers.

2. Major Blocks¶

Block	Domain	Role
Host Interface	250 MHz	AXI-Lite (control registers), AXI HP ports (data streaming), AXI ACP (host-cache DMA). Bridged to the 400 MHz domain through a CDC FIFO.
Device Interface	400 MHz	Core-side AXI slave. Bridges onto the Weight Bus and the L2 cache R/W ports.
Decoder / Dispatcher	400 MHz	Decodes 64-bit instructions and emits per-opcode control μops. See Instruction Encoding.
Control Unit	400 MHz	Global scheduler. Handles dependency resolution, dispatch ordering, and completion notification.
Weight Buffer	400 MHz	Broadcasts INT4 weights streamed in from the HP ports to the GEMM / GEMV cores.
L2 Cache	400 MHz	Central shared cache. Stores activations, intermediate results, and the KV cache.
Constant Cache	400 MHz	Stores small constants — shape / size pointers, scale factors — written by MEMSET.
Systolic Array (32 × 32, cascade break @ row 16)	400 MHz	GEMM-only 2D systolic array. Single instance; two 32 × 16 sub-chains share the grid. See GEMM Core (Systolic Array).
GEMV Core (×4)	400 MHz	Four parallel cores, each a 32-MAC LUT pipeline + 5-stage reduction tree. See GEMV Core.
SFU (×1)	400 MHz	A single scalar BF16 pipeline (1 element / clk) for Softmax, GELU, RMSNorm, RoPE, and friends. See SFU Core (Complex Vector Operations).

3. Clock Domain Strategy¶

3.1 Why Split the Domains¶

The KV260 AXI interconnect closes timing comfortably at 250 MHz, but the core logic — when all DSP48E2 pipeline stages are enabled — can reach 400 MHz. Separating the two lets each subsystem run at its natural rate:

AXI domain (250 MHz): Host, DMA, AXI-Lite control register access.
Core domain (400 MHz): Dispatcher, Control Unit, every compute core, L2 cache.

3.2 CDC (Clock Domain Crossing)¶

The domain boundary is crossed through an asynchronous FIFO. Each AXI HP port delivers bursts at burst_length × AXI_clock bandwidth, sized so that the 400 MHz core side is never starved. See the bandwidth matching analysis in Memory Hierarchy.

4. Data Path Summary¶

        flowchart LR
  DDR[(Host DDR4)] -->|HP2/HP3| WB[Weight Buffer]
  DDR -->|ACP DMA| L2[(L2 Cache<br/>Central URAM)]
  WB --> SA[Systolic Array<br/>32×32]
  WB --> GV[GEMV Core<br/>×4]
  L2 --> SA
  L2 --> GV
  L2 <--> SFU[SFU<br/>×1]
  GV <-. direct FIFO .-> SFU
  SA --> L2
  GV --> L2
  SFU --> L2
  DISP[Decoder / Dispatcher] -->|μop| SA
  DISP -->|μop| GV
  DISP -->|μop| SFU
  DISP -->|MEMSET| CC[Constant Cache]

Instruction	Primary Data Path	Notes
GEMM	Weight Buffer → Systolic Array ← L2	Prefill stage. Weights and activations meet inside the array.
GEMV	Weight Buffer → GEMV Core ← L2	Decoding stage. Four GEMV cores partition across batch / head.
CVO (SFU)	L2 ↔ SFU, with direct GEMV ↔ SFU FIFO	Weight path unused. Softmax / GELU / RoPE.
MEMCPY	Host DDR4 ↔ L2	ACP port. Synchronous or asynchronous.
MEMSET	Dispatcher → Constant Cache	Initializes small constants — shape / size pointers and the like.

For per-instruction dataflow detail, see Per-Instruction Dataflow.