Design Rationale: v001 → v002

pccx v001 was close to implementation when it was archived under docs/archive/experimental_v001/ instead of being taped out. This page documents the architectural weaknesses behind that decision and how v002 resolves each of them.

1. Core Flaws in v001

Flaw	Symptom
Ambiguous core roles	The boundaries between Matrix, Vector, and CVO cores were fuzzy. Some operations were redundantly supported across multiple cores, and others fit none of them cleanly.
Too many buses	Weights, activations, and intermediate results each had their own bus, crossing the fabric in different directions. The result was routing congestion and poor timing.
L2 and Global Cache overlap	The two cache levels covered overlapping responsibilities, so the same data ended up duplicated in both, and the coherence logic added constant overhead.
Inefficient HP port layout	A single systolic array was served by a single HP port. The external 250 MHz limit capped the internal 400 MHz consumption rate.
Under-utilized systolic array	The 1-DSP-per-1-MAC structure left most of the DSP48E2’s operand width unused.

2. v002’s Response

Each flaw maps to a specific design decision in v002.

Response	Description
Three-core organization	GEMV, GEMM, and SFU are cleanly separated. Each core is wired to the L2 cache, the weight buffer, and the pipeline registers / FIFOs that suit its access pattern — no more overlapping roles.
Bus simplification	Everything collapses onto two orthogonal axes: `WEIGHT BUS` and `ACTIVATION BUS`. The two buses are physically perpendicular to avoid routing contention.
Centralized L2	Global Cache responsibilities are folded into L2, and L2 is placed in the middle of the floorplan. The upper and lower slices see it symmetrically.
Distributed HP ports	HP2 and HP3 are assigned to independent slices, eliminating the weight-supply bottleneck.
Dual-channel bit packing	1 DSP = 2 MAC (DSP48E2 W4A8 Bit Packing and Sign Recovery). Across the whole systolic array this works out to 2,048 multiplies + 2,048 accumulates per clock cycle.
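The dual-channel packing can be illustrated with a small arithmetic model. This is a minimal sketch, not the pccx RTL: it assumes the two INT4 weights share one multiplier operand with the upper weight shifted left by 23 bits (a split point inferred from the "23-bit split logic" listed under the trade-offs), and it uses unbounded Python integers rather than modelling the DSP48E2 port widths. The names `SHIFT`, `pack_weights`, `unpack_sums`, and `packed_dot` are hypothetical.

```python
SHIFT = 23  # assumed split point between the two packed channels


def pack_weights(w_hi: int, w_lo: int) -> int:
    """Place two signed INT4 weights into one multiplier operand."""
    assert -8 <= w_hi <= 7 and -8 <= w_lo <= 7
    return (w_hi << SHIFT) + w_lo


def unpack_sums(acc: int) -> tuple[int, int]:
    """Split an accumulated packed result into (hi_sum, lo_sum)."""
    lo = acc & ((1 << SHIFT) - 1)      # raw low field
    if lo >= 1 << (SHIFT - 1):         # sign-extend the low field
        lo -= 1 << SHIFT
    # Sign recovery: a negative low sum borrowed 1 from the high field,
    # so add it back (the role a 1-bit adder would play in hardware).
    hi = (acc >> SHIFT) + (1 if lo < 0 else 0)
    return hi, lo


def packed_dot(w_hi, w_lo, acts) -> tuple[int, int]:
    """One multiply per step yields two dot products over the same activations."""
    acc = 0
    for wh, wl, a in zip(w_hi, w_lo, acts):
        assert -128 <= a <= 127        # INT8 activation
        acc += pack_weights(wh, wl) * a
    return unpack_sums(acc)


print(packed_dot([3, -8], [-2, 7], [100, -128]))  # (1324, -1096)
```

Each loop iteration performs a single multiplication yet advances two independent accumulations, which is the sense in which one DSP delivers two MACs. The recovery step is only valid while the low-channel sum stays inside its 23-bit signed field.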

3. Speedup Analysis: 3.2×

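The theoretical throughput gain over v001 comes from three levers. Two of them contribute a numeric factor; the third (dual HP ports) contributes no factor of its own but removes the supply bottleneck that would otherwise keep those factors from being realized.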

Lever	Factor	Justification
Higher internal clock	× 400 / 250 = 1.6	External AXI 250 MHz decoupled from internal core 400 MHz.
Dual HP ports	Enabler (no separate factor)	Two of the four HP ports (HP2 / HP3) are independently assigned to the upper and lower slices, doubling weight-supply bandwidth so that the 400 MHz core can be kept fed.
Bit packing	× 2	1 DSP now executes 2 MACs simultaneously.

With the supply bottleneck removed, the two numeric factors multiply to 1.6 × 2 = 3.2× effective throughput.

3.1 Load-Side Derivation

v001: 250 MHz × 1 HP × 1 MAC/DSP = 250 units of throughput.

v002: HP2 and HP3 write weights into a buffer at 250 MHz. The internal 400 MHz domain then drains that buffer at 2 MACs per DSP, giving 400 × 2 = 800 units of internal consumption rate.

\[\frac{800}{250} \;=\; \mathbf{3.2\,\times}\]

The external port rate didn’t change. The win is structural: buffer externally, drain quickly internally, and perform two MACs per DSP per cycle. The effective throughput seen by the systolic array is 3.2× higher.
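The load-side arithmetic can be reproduced in a few lines. This is only a sketch of the unit bookkeeping used in this section; the function name `load_side_units` is hypothetical, and the "units" are the same dimensionless clock × MACs-per-DSP product used above.

```python
def load_side_units(clock_mhz: int, macs_per_dsp: int) -> int:
    """Throughput in the section's abstract units: clock (MHz) x MACs per DSP."""
    return clock_mhz * macs_per_dsp


v001_units = load_side_units(clock_mhz=250, macs_per_dsp=1)  # bound by the external port
v002_units = load_side_units(clock_mhz=400, macs_per_dsp=2)  # drained in the core domain
ratio = v002_units / v001_units

print(v001_units, v002_units, ratio)
```

The model deliberately leaves out HP port widths and burst behaviour, which the section does not specify. It only captures the clock and packing factors.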

3.2 Per-Cycle Internal Throughput

flowchart LR
  subgraph ext[External 250 MHz Domain]
    HP2[AXI HP2] --> BUF[Weight Buffer<br/>CDC FIFO]
    HP3[AXI HP3] --> BUF
  end
  subgraph core[Internal 400 MHz Domain]
    BUF -->|broadcast| SA[Systolic Array<br/>32×32 · 1 DSP = 2 MAC<br/>cascade break @ row 16]
    SA --> ACC[Result Accumulator<br/>819 GMAC/s peak]
  end

The single 32 × 32 grid holds 1,024 PEs × 2 MAC = 2,048 MAC/clk. Running at 400 MHz, this yields a theoretical peak of 2,048 × 400 MHz = 819.2 GMAC/s, quoted as 819 GMAC/s.
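The peak figure follows directly from the array geometry. The sketch below assumes one DSP per PE, as the 1,024 PEs × 2 MAC count implies; the constant names are hypothetical.

```python
ROWS, COLS = 32, 32       # systolic array dimensions
MACS_PER_DSP = 2          # dual-channel bit packing
CORE_HZ = 400e6           # internal clock domain

pes = ROWS * COLS                          # processing elements
macs_per_clk = pes * MACS_PER_DSP          # MACs issued per core cycle
peak_gmacs = macs_per_clk * CORE_HZ / 1e9  # theoretical peak, GMAC/s

print(pes, macs_per_clk, peak_gmacs)
```

This is an upper bound that assumes every PE is busy on every cycle. Pipeline fill, tiling overhead, and weight-supply stalls are not modelled.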

4. New Trade-offs

The speed gain is not free. v002 accepts the following constraints.

Constraint	Description
Weight precision ceiling	Beyond W4, guard bits run out and `N_max` collapses. W5/W6 support would require a separate mode.
K-split required	Layers with K > 4,096 must be tiled by the driver / compiler.
Sign-recovery post-processing	Each PE gains a 1-bit adder and 23-bit split logic. No throughput impact, but extra area.
CDC complexity	Asynchronous 250 MHz ↔ 400 MHz FIFOs need careful design and verification.
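The precision ceiling and the K-split rule can be related with a rough guard-bit estimate. This sketch rests on assumptions not stated on this page: that `N_max` means the number of worst-case products a packed low field can accumulate before overflowing, that the low field is 23 bits wide for W4, and that widening the upper weight to 5 bits would shrink it to 22 bits. The helpers `n_max`, `split_k`, and `tiled_dot` are hypothetical, and the tiling is shown in plain Python rather than as driver or compiler code.

```python
def n_max(weight_bits: int, act_bits: int, field_bits: int) -> int:
    """Approximate count of worst-case products that fit in a signed field.

    Ignores the one-value asymmetry of two's complement at the positive extreme.
    """
    worst_product = (1 << (weight_bits - 1)) * (1 << (act_bits - 1))
    return (1 << (field_bits - 1)) // worst_product


def split_k(k: int, tile: int = 4096) -> list[tuple[int, int]]:
    """Half-open [start, end) ranges covering K in chunks of at most `tile`."""
    return [(s, min(s + tile, k)) for s in range(0, k, tile)]


def tiled_dot(weights, acts, tile: int = 4096) -> int:
    """Dot product computed tile by tile, with partial sums added outside the array."""
    total = 0
    for start, end in split_k(len(weights), tile):
        total += sum(w * a for w, a in zip(weights[start:end], acts[start:end]))
    return total


print(n_max(4, 8, field_bits=23))   # W4A8 under the assumed 23-bit field
print(n_max(5, 8, field_bits=22))   # W5A8 under the assumed 22-bit field
print(split_k(10000))
```

Under these assumptions the W4A8 estimate matches the 4,096 limit in the K-split row, and one extra weight bit cuts the estimate by a factor of four, which is one way to read the statement that `N_max` collapses beyond W4.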

5. Summary vs. Archived v001

Aspect	v001 (Archived)	v002
Design bias	GEMM-centric (prefill-optimized)	Three-core layout: GEMM · GEMV · SFU
L2 cache placement	Peripheral	Central, symmetric interconnect on both sides
Global Cache	Separate block	Absorbed into L2
Quantization	W4A16 (BF16 activations)	W4A8 (INT8 activations)
HP port	One per SA	HP2 / HP3 distributed (upper / lower slices)
DSP utilization	1 DSP = 1 MAC	1 DSP = 2 MAC
Peak throughput (400 MHz)	~320 GMAC/s	819 GMAC/s (~2.56× the v001 peak, theoretical)