Overview


1. Project Goal

pccx (Parallel Compute Core eXecutor) v002 is a general-purpose NPU architecture that accelerates quantized Transformer-based LLMs in a bare-metal environment, with the Xilinx Kria KV260 SoM as its primary target.

Core Design Principles

| Principle | Description |
| --- | --- |
| Generality | The architecture is not tied to a single model (e.g., Gemma 3N E4B). A device-agnostic instruction set (ISA) and a decoupled dataflow let the same silicon host a wide range of Transformer variants. |
| Scalability | Systolic-array dimensions, GEMV/SFU core counts, and local cache sizes are all exposed as generate parameters so the design can be resynthesized to fit the resource budget of a different target. |
| Memory-centric Layout | The L2 cache is physically centered in the floorplan and serves as the shared activation source for GEMM, GEMV, and CVO. This removes the inter-layer shuffle cost that dogged v001. |

2. Target Workload

The decoding stage of the target models is a GEMV-dominated workload (batch size 1, sequence length 1), whereas prefill is GEMM-dominated. pccx v002 runs both efficiently on a single architecture by physically separating the matrix core (GEMM) from the vector core (GEMV), and by adding a dedicated Special Function Unit (SFU) for Complex Vector Operations (CVO) so that non-linear activations never stall the main pipeline.
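The split between the two workloads can be made concrete by counting how often each weight is reused. The sketch below is only illustrative: the layer dimensions (2048 × 2048) and the prefill length (128) are hypothetical values chosen for the example, not figures from pccx or Gemma 3N.

```python
def layer_stats(d_in: int, d_out: int, batch: int, seq_len: int) -> dict:
    """Count MACs and weight reuse for one linear layer Y = X @ W.

    X has shape (batch * seq_len, d_in) and W has shape (d_in, d_out),
    so every weight element is used once per input row.
    """
    rows = batch * seq_len
    weights = d_in * d_out
    macs = rows * weights
    return {
        "rows": rows,
        "weights": weights,
        "macs": macs,
        "macs_per_weight": macs // weights,
    }


# Hypothetical layer size; only the ratio between the two cases matters.
decode = layer_stats(d_in=2048, d_out=2048, batch=1, seq_len=1)
prefill = layer_stats(d_in=2048, d_out=2048, batch=1, seq_len=128)

print(decode["macs_per_weight"])   # 1   -> matrix-vector product (GEMV)
print(prefill["macs_per_weight"])  # 128 -> matrix-matrix product (GEMM)
```

With one MAC per weight fetched, decoding is limited by how fast weights can be streamed, while prefill reuses each weight across many rows and so rewards a large systolic array.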

Performance Targets

| Metric | Target | Rationale |
| --- | --- | --- |
| Decoding throughput | 20 tok/s (Gemma 3N E4B) | Bandwidth-matched between L2 cache and the GEMV cores |
| Core clock frequency | 400 MHz | DSP48E2 timing ceiling |
| Quantization | W4A8 (INT4 × INT8) | Best fit for integer math on KV260’s DSP48E2 |
| SFU precision | BF16 / FP32 promotion | Numerical stability for Softmax, RMSNorm, and GELU |
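To show what W4A8 means arithmetically, here is a minimal sketch of an integer GEMV with INT4 weights and INT8 activations. It assumes symmetric quantization with a single scale per tensor, which this page does not specify; the function names and the sample values are hypothetical.

```python
INT4_MIN, INT4_MAX = -8, 7
INT8_MIN, INT8_MAX = -128, 127


def quantize(values, scale, lo, hi):
    """Symmetric quantization: round(value / scale), clamped to [lo, hi]."""
    return [max(lo, min(hi, round(v / scale))) for v in values]


def w4a8_gemv(w_q, a_q, w_scale, a_scale):
    """Multiply INT4 weight rows by an INT8 activation vector.

    Products are accumulated as plain integers (a wide accumulator in
    hardware) and rescaled to real values only once per output element.
    """
    out = []
    for row in w_q:
        acc = 0
        for w, a in zip(row, a_q):
            acc += w * a  # INT4 x INT8 product
        out.append(acc * w_scale * a_scale)
    return out


# Sample data chosen so that quantization is exact (assumed scales).
weights = [[0.5, -1.0, 0.25], [1.5, 0.0, -0.75]]
acts = [1.0, 2.0, -4.0]
w_scale, a_scale = 0.25, 0.0625

w_q = [quantize(row, w_scale, INT4_MIN, INT4_MAX) for row in weights]
a_q = quantize(acts, a_scale, INT8_MIN, INT8_MAX)

print(w_q)                                    # [[2, -4, 1], [6, 0, -3]]
print(a_q)                                    # [16, 32, -64]
print(w4a8_gemv(w_q, a_q, w_scale, a_scale))  # [-2.5, 4.5]
```

The inner loop contains only integer multiplies and adds, which is the part that maps onto DSP slices; the floating-point rescale happens once per output.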

3. Key Differences from v001

The transition rationale and the full breakdown of the 3.125× throughput improvement are covered in Design Rationale: v001 → v002. In short:

| Aspect | v001 (Archived) | v002 |
| --- | --- | --- |
| Design bias | GEMM-centric (prefill-optimized) | Three-core layout: GEMM · GEMV · SFU |
| L2 cache placement | Peripheral; overlaps with Global Cache | Central placement, Global Cache absorbed, symmetric interconnect |
| Quantization | W4A16 (BF16 activations) | W4A8 (INT8 activations) |
| Core composition | Matrix + Vector + CVO (blurred boundaries) | Matrix (32 × 32 systolic) + 4 × 32-MAC GEMV cores + 1 BF16-scalar SFU |
| HP port layout | One SA mapped to one port (250 MHz ceiling) | HP2 / HP3 distributed, consumed internally at 400 MHz |
| DSP utilization | 1 DSP = 1 MAC | 1 DSP = 2 MAC (dual-channel bit packing) |
| Theoretical speedup | | × 3.125 (1.6 × 2) |
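The "1 DSP = 2 MAC" entry relies on packing two narrow multiplications into one wide multiply. The sketch below shows the general idea in software and is not the pccx RTL: it assumes two INT4 weights share one INT8 activation, and it places the second weight 18 bits up, an illustrative offset that fits the DSP48E2's 27-bit multiplier operand. The actual lane layout used by pccx is not described on this page.

```python
SHIFT = 18  # assumed lane offset; leaves guard bits above each product


def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def packed_dual_mac(w_hi: int, w_lo: int, a: int) -> tuple:
    """Compute w_hi * a and w_lo * a with a single multiplication."""
    packed = (w_hi << SHIFT) + w_lo      # two INT4 weights in one operand
    product = packed * a                 # the one wide multiply
    lo = sign_extend(product, SHIFT)     # low lane holds w_lo * a
    hi = (product - lo) >> SHIFT         # remove low lane, then shift down
    return hi, lo


print(packed_dual_mac(-8, 7, -128))  # (1024, -896)
```

Subtracting the sign-extended low lane before shifting is what corrects the borrow that a negative low product would otherwise leak into the high lane.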


4. Ecosystem Layers

pccx is split into three strictly separated layers so that it stays portable across devices.

| Layer | Location | Responsibility |
| --- | --- | --- |
| Architecture | codes/v002/hw/rtl/ | Core RTL logic and generate parameters. Defines ISA, pipeline, and scheduling. Vendor-independent. |
| Device | codes/v002/hw/device/ | Maps the architecture onto a specific target (e.g., KV260). Fixes systolic-array size, AXI interfaces, URAM layout. |
| Driver | codes/v002/sw/ | C/C++ hardware abstraction layer (HAL) and high-level API. Handles instruction dispatch, memory mapping, and host-device sync. |
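One way to picture the Architecture/Device split is as a set of free parameters that a device layer pins to concrete values. The sketch below models that in Python purely for illustration. The class and field names are hypothetical and do not mirror the RTL parameter names; the numbers are the v002 figures quoted on this page (32 × 32 systolic array, 4 GEMV cores of 32 MACs, 1 SFU, 400 MHz).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchParams:
    """Hypothetical stand-in for the architecture layer's generate parameters."""
    systolic_rows: int
    systolic_cols: int
    gemv_cores: int
    gemv_macs_per_core: int
    sfu_cores: int

    @property
    def macs_per_cycle(self) -> int:
        """MAC units as listed, ignoring any DSP-level packing."""
        gemm = self.systolic_rows * self.systolic_cols
        gemv = self.gemv_cores * self.gemv_macs_per_core
        return gemm + gemv


@dataclass(frozen=True)
class DeviceConfig:
    """Hypothetical device layer: fixes the parameters for one target."""
    name: str
    arch: ArchParams
    core_clock_mhz: int

    @property
    def peak_gmacs(self) -> float:
        """Arithmetic upper bound in GMAC/s, not a measured figure."""
        return self.arch.macs_per_cycle * self.core_clock_mhz / 1000.0


KV260 = DeviceConfig("KV260", ArchParams(32, 32, 4, 32, 1), core_clock_mhz=400)

print(KV260.arch.macs_per_cycle)  # 1152
print(KV260.peak_gmacs)           # 460.8
```

Retargeting then amounts to defining another `DeviceConfig` with different sizes, leaving everything that depends only on `ArchParams` untouched.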

5. Documentation Map

| Section | Contents |
| --- | --- |
| Hardware Architecture | Top-level block diagram, physical floorplan, GEMM/GEMV/SFU microarchitecture, memory hierarchy, and the DSP48E2 W4A8 bit-packing technique. |
| Instruction Set Architecture (ISA) | 64-bit instruction format, encodings for the five opcodes (GEMV / GEMM / MEMCPY / MEMSET / CVO), and per-instruction dataflow. |
| Software Stack | C API overview and the instruction dispatch flow. |
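As a rough illustration of what a 64-bit instruction word with five opcodes might look like to software, the sketch below packs an opcode and an opaque payload into one word. Everything about the layout is an assumption made for the example: the 4-bit opcode field in the top bits, the numeric opcode values, and the single undivided payload. The real encodings are defined in the ISA section.

```python
from enum import IntEnum


class Opcode(IntEnum):
    # Names come from the ISA summary; the numeric values are assumed.
    GEMV = 0
    GEMM = 1
    MEMCPY = 2
    MEMSET = 3
    CVO = 4


OPCODE_BITS = 4                       # assumed field width
PAYLOAD_BITS = 64 - OPCODE_BITS
PAYLOAD_MASK = (1 << PAYLOAD_BITS) - 1


def encode(op: Opcode, payload: int) -> int:
    """Pack an opcode and payload into one 64-bit word."""
    if not 0 <= payload <= PAYLOAD_MASK:
        raise ValueError("payload does not fit in the payload field")
    return (int(op) << PAYLOAD_BITS) | payload


def decode(word: int) -> tuple:
    """Split a 64-bit word into (Opcode, payload); unknown opcodes raise."""
    if not 0 <= word < (1 << 64):
        raise ValueError("instruction word must fit in 64 bits")
    return Opcode(word >> PAYLOAD_BITS), word & PAYLOAD_MASK


word = encode(Opcode.MEMCPY, 0x1234)
print(f"{word:#018x}")  # 0x2000000000001234
print(decode(word))
```

Keeping the opcode in a fixed position lets a dispatcher route the word to the right core before interpreting any opcode-specific fields.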