Design Rationale: v001 → v002¶
pccx v001 reached the brink of implementation before being moved to
docs/archive/experimental_v001/ rather than taped out. This page
documents which architectural weaknesses pushed us to that decision and
how v002 resolves each of them.
1. Core Flaws in v001¶
Flaw |
Symptom |
|---|---|
Ambiguous core roles |
The boundaries between Matrix, Vector, and CVO cores were fuzzy. Some operations were redundantly supported across multiple cores, and others fit none of them cleanly. |
Too many buses |
Weights, activations, and intermediate results each had their own bus, crossing the fabric in different directions. The result was routing congestion and poor timing. |
L2 and Global Cache overlap |
The two cache levels covered overlapping responsibilities, so the same data ended up duplicated on both sides and coherence logic added a constant tax. |
Inefficient HP port layout |
A single systolic array was served by a single HP port. The external 250 MHz limit capped the internal 400 MHz consumption rate. |
Under-utilized systolic array |
The 1-DSP-per-1-MAC structure left most of DSP48E2’s bit space unused. |
2. v002’s Response¶
Each flaw maps to a specific design decision in v002.
Response |
Description |
|---|---|
Three-core organization |
GEMV, GEMM, and SFU are cleanly separated. Each core is wired to the L2 cache, the weight buffer, and the pipeline registers / FIFOs that suit its access pattern — no more overlapping roles. |
Bus simplification |
Everything collapses onto two orthogonal axes: |
Centralized L2 |
Global Cache responsibilities are folded into L2, and L2 is placed in the middle of the floorplan. The upper and lower slices see it symmetrically. |
Distributed HP ports |
HP2 and HP3 are assigned to independent slices, eliminating the weight-supply bottleneck. |
Dual-channel bit packing |
1 DSP = 2 MAC (DSP48E2 W4A8 Bit Packing and Sign Recovery). Across the whole systolic array this works out to 2,048 multiplies + 2,048 accumulates per clock cycle. |
3. Speedup Analysis — 3.125×¶
The theoretical throughput gain over v001 comes from three independent levers, multiplied together.
Lever |
Factor |
Justification |
|---|---|---|
Higher internal clock |
× 400 / 250 = 1.6 |
External AXI 250 MHz decoupled from internal core 400 MHz. |
Dual HP ports |
(already consumed at 400 MHz) |
2 of 4 HP ports (HP2 / HP3) are independently assigned to the upper and lower slices, doubling weight-supply bandwidth. |
Bit packing |
× 2 |
1 DSP now executes 2 MACs simultaneously. |
Multiplying the three levers gives 1.6 × 2 × (bottleneck removed) ≈ 3.125× effective throughput.
3.1 Load-Side Derivation¶
v001: 250 MHz × 1 HP × 1 MAC/DSP = 250 units of throughput. v002: HP2 + HP3 stack weights at 250 MHz into a buffer, which is then consumed by the internal 400 MHz domain at 2 MACs per DSP, giving 800 units of internal consumption rate.
The external port rate didn’t change. The win is structural: buffer externally, drain quickly internally, and perform two MACs per cycle. The effective throughput seen by the systolic array is 3.125× higher.
3.2 Per-Cycle Internal Throughput¶
flowchart LR
subgraph ext[External 250 MHz Domain]
HP2[AXI HP2] --> BUF[Weight Buffer<br/>CDC FIFO]
HP3[AXI HP3] --> BUF
end
subgraph core[Internal 400 MHz Domain]
BUF -->|broadcast| SA[Systolic Array<br/>32×32 · 1 DSP = 2 MAC<br/>cascade break @ row 16]
SA --> ACC[Result Accumulator<br/>819 GMAC/s peak]
end
The single 32 × 32 grid holds 1,024 PEs × 2 MAC = 2,048 MAC/clk. Running at 400 MHz, this yields a 819 GMAC/s theoretical peak.
4. New Trade-offs¶
The speed gain is not free. v002 accepts the following constraints.
Constraint |
Description |
|---|---|
Weight precision ceiling |
Beyond W4, guard bits run out and |
K-split required |
Layers with K > 4,096 must be tiled by the driver / compiler. |
Sign-recovery post-processing |
Each PE gains a 1-bit adder and 23-bit split logic. No throughput impact, but extra area. |
CDC complexity |
Asynchronous 250 MHz ↔ 400 MHz FIFOs need careful design and verification. |
5. Summary vs. Archived v001¶
Aspect |
v001 (Archived) |
v002 |
|---|---|---|
Design bias |
GEMM-centric (prefill-optimized) |
Three-core layout: GEMM · GEMV · SFU |
L2 cache placement |
Peripheral |
Central, symmetric interconnect on both sides |
Global Cache |
Separate block |
Absorbed into L2 |
Quantization |
W4A16 (BF16 activations) |
W4A8 (INT8 activations) |
HP port |
One per SA |
HP2 / HP3 distributed (upper / lower slices) |
DSP utilization |
1 DSP = 1 MAC |
1 DSP = 2 MAC |
Peak throughput (400 MHz) |
~320 GMAC/s |
819 GMAC/s (~2.56× measured improvement expected) |
See also
v001 details: Archive: v001 Experimental Architecture
Bit packing details: DSP48E2 W4A8 Bit Packing and Sign Recovery
KV cache strategy: KV Cache Optimization Strategy