Archive: v001 Experimental Architecture¶

Warning

This architecture is a preliminary (v001) experimental design. Although structurally superior, it was engineered with a heavy inclination toward GEMM (Matrix) computations. Because local LLM environments are predominantly bound by GEMV (Vector) operations, this branch has been currently archived in favor of an optimized structure.

Project Overview¶

pccx (formerly uXC) is a customized SystemVerilog-based Neural Processing Unit (NPU) engineered fundamentally from the ground up to accelerate the quantized Gemma 3N E4B Large Language Model on the bare-metal Xilinx Kria KV260 FPGA (400 MHz). The architecture is meticulously designed to push the absolute physical constraints of the KV260 platform, exploiting its 1,248 DSP48E2 slices and 144 BRAMs to their functional ceiling.

Software Baseline: llm-lite (x64 CPU reference implementation)
Full-Stack Co-Design: Hardware accelerator (SystemVerilog), Trace-Driven validation model (Python), and an AXI DMA memory pipeline.

Quick Menu¶

Architecture Overview

Illustrates the internal NPU architecture, the 3-tier core system decoupled model, and the memory transition layer layout.
ISA Specification

Explains the 64-bit VLIW core, Opcode routing design, register mappings, and pipeline scheduling methodologies.
ISA Spreadsheet

Provides an internal spreadsheet-view breakdown of the overall modular ISA structure.
C API Detail

Focuses on the primary wrapping interfaces of pccx_v1_api.c and pccx_v1_api.h targeting the active NPU host controller.
RTL Source Reference

Per-core SystemVerilog deep-dive: every module under codes/v001/ rendered inline with syntax highlighting and collapsible dropdowns.

Quantization Strategy: W4A16 with BF16 Activations¶

The primary core computational path operates strictly at W4A16 precision:

Data	Type	Width	Notes
Weight	INT4	4-bit	Streamed through HP Ports and consumed purely as an INT4 layer
Feature Map	BF16	16-bit	Undergoes conversion from BF16 → 27-bit Fixed-Point for native MAC arithmetic
Accumulator	INT48	48-bit	Accumulated recursively through the P-Register of the DSP48E2 blocks
SFU I/O	BF16	16-bit	Reconstructed as BF16 Po st-Normalization heading for Non-Linear operations

Precision Promotion Flow¶

        graph TD
    A[Weight: INT4] --> MAC[DSP48E2 MAC]
    B[FMap: BF16 → 27-bit fixed-pt] --> MAC
    MAC -->|Accumulator| C[INT48]
    C -->|Barrel Shift + LOD| D[Normalization: BF16]
    D -->|SFU / CORDIC| E[Non-Linear Ops: exp, RMSNorm, Softmax...]
    E --> F[Output: BF16 to next layer]

At the transition segment toward the Non-Linear operations loop (Complex Vector Operation), the computation elevates precisely into BF16.

Compute Engines¶

Engine	Operation	Weights Input	Activation Fetch	Ac cumulator
Matrix Core	GEMM (prefill, projections)	HP0/1 (32 INT4/clk)	BF16 :math:` rightarrow` 27-bit fixed-pt	INT48 (DSP48E2)
Vector Core	GEMV ( autoregressive decode)	HP2/3 (32 INT4/clk each)	BF16 :math:` rightarrow` 27-bit fixed-pt	INT48 (DSP48E2)
CVO Core	Non-linear ops (Softmax, GELU, RoPE)	N/A	BF16 Stream via L2	BF16

Applying a structural Decoupled Dataflow design principle ensures operation instructions execute asynchronously. Distributed from the Global Pipeline across distinct modules, it completely prevents architectural stalling and pushes mathematical hardware throughput to its peak.