Archive: v001 Experimental Architecture

Warning

This architecture is a preliminary (v001) experimental design. Although structurally superior, it was engineered with a strong bias toward GEMM (matrix-matrix) computation. Because local LLM inference is predominantly bound by GEMV (matrix-vector) operations, this branch has been archived in favor of a structure optimized for that workload.
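The GEMM/GEMV distinction can be made concrete by counting how often each fetched weight is reused. The following is a back-of-the-envelope sketch rather than a measurement: the function name `weight_reuse` and the 2048 x 2048 layer size are illustrative assumptions, not values taken from this project.

```python
def weight_reuse(d_out: int, d_in: int, n_tokens: int) -> float:
    """MACs performed per weight element fetched.

    A (d_out x d_in) weight matrix applied to n_tokens activation columns
    performs d_out * d_in * n_tokens MACs while reading d_out * d_in weights.
    """
    macs = d_out * d_in * n_tokens
    weights_fetched = d_out * d_in
    return macs / weights_fetched


# Hypothetical layer size, chosen only for illustration.
D_OUT, D_IN = 2048, 2048

prefill_reuse = weight_reuse(D_OUT, D_IN, n_tokens=512)  # GEMM: many tokens at once
decode_reuse = weight_reuse(D_OUT, D_IN, n_tokens=1)     # GEMV: one token per step

print(f"prefill (GEMM): {prefill_reuse:.0f} MACs per weight fetched")
print(f"decode  (GEMV): {decode_reuse:.0f} MAC per weight fetched")
```

With a single token per step every weight is used exactly once, so decode throughput tends to be limited by how fast weights can be streamed rather than by the number of available multipliers.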



Project Overview

pccx (formerly uXC) is a custom SystemVerilog Neural Processing Unit (NPU) built from the ground up to accelerate the quantized Gemma 3N E4B large language model on a bare-metal Xilinx Kria KV260 FPGA (400 MHz). The architecture is designed to use the KV260's resources as fully as possible, in particular its 1,248 DSP48E2 slices and 144 BRAMs.

  • Software Baseline: llm-lite (x64 CPU reference implementation)

  • Full-Stack Co-Design: Hardware accelerator (SystemVerilog), Trace-Driven validation model (Python), and an AXI DMA memory pipeline.


Quick Menu

  • Architecture Overview

    Illustrates the internal NPU architecture, the decoupled 3-tier core model, and the layout of the memory transition layer.

  • ISA Specification

    Explains the 64-bit VLIW core, Opcode routing design, register mappings, and pipeline scheduling methodologies.

  • ISA Spreadsheet

    Provides an internal spreadsheet-view breakdown of the overall modular ISA structure.

  • C API Detail

    Covers the primary wrapper interfaces in pccx_v1_api.c and pccx_v1_api.h, which target the active NPU host controller.

  • RTL Source Reference

    Per-core SystemVerilog deep-dive: every module under codes/v001/ rendered inline with syntax highlighting and collapsible dropdowns.


Quantization Strategy: W4A16 with BF16 Activations

The primary core computational path operates strictly at W4A16 precision:

| Data | Type | Width | Notes |
|---|---|---|---|
| Weight | INT4 | 4-bit | Streamed through HP Ports and consumed purely as an INT4 layer |
| Feature Map | BF16 | 16-bit | Undergoes conversion from BF16 → 27-bit Fixed-Point for native MAC arithmetic |
| Accumulator | INT48 | 48-bit | Accumulated recursively through the P-Register of the DSP48E2 blocks |
| SFU I/O | BF16 | 16-bit | Reconstructed as BF16 post-normalization, heading for Non-Linear operations |

Precision Promotion Flow

        graph TD
    A[Weight: INT4] --> MAC[DSP48E2 MAC]
    B[FMap: BF16 → 27-bit fixed-pt] --> MAC
    MAC -->|Accumulator| C[INT48]
    C -->|Barrel Shift + LOD| D[Normalization: BF16]
    D -->|SFU / CORDIC| E[Non-Linear Ops: exp, RMSNorm, Softmax...]
    E --> F[Output: BF16 to next layer]
    

At the hand-off to the non-linear operations path (Complex Vector Operation, CVO), the normalized accumulator result is carried in BF16.
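The flow above can be sketched as a small software model. This is a minimal illustration under stated assumptions, not the RTL's behavior: the number of fractional bits (`FRAC_BITS`), the saturating conversion, and truncation during normalization are all assumptions, and the function names are hypothetical. Only the 27-bit, 48-bit, INT4, and BF16 widths come from the text.

```python
import math

FIXED_BITS = 27     # fixed-point activation width named in the source
ACC_BITS = 48       # accumulator width named in the source
FRAC_BITS = 16      # ASSUMPTION: fractional bits of the fixed-point format
BF16_SIG_BITS = 8   # BF16 significand: 1 implicit bit + 7 stored bits


def to_fixed(x: float) -> int:
    """Quantize an activation to signed 27-bit fixed point (saturating)."""
    lo, hi = -(1 << (FIXED_BITS - 1)), (1 << (FIXED_BITS - 1)) - 1
    return max(lo, min(hi, round(x * (1 << FRAC_BITS))))


def mac(acc: int, weight_int4: int, act_fixed: int) -> int:
    """One multiply-accumulate step, wrapped to a signed 48-bit register."""
    assert -8 <= weight_int4 <= 7, "INT4 weights lie in [-8, 7]"
    acc = (acc + weight_int4 * act_fixed) & ((1 << ACC_BITS) - 1)
    # Reinterpret the 48-bit pattern as a signed two's-complement value.
    return acc - (1 << ACC_BITS) if acc >> (ACC_BITS - 1) else acc


def normalize_to_bf16(acc: int) -> float:
    """Leading-one detect plus shift, keeping 8 significant bits (truncating)."""
    if acc == 0:
        return 0.0
    sign = -1.0 if acc < 0 else 1.0
    mag = abs(acc)
    lead = mag.bit_length() - 1            # leading-one position (LOD)
    shift = lead - (BF16_SIG_BITS - 1)     # barrel-shift amount
    sig = mag >> shift if shift >= 0 else mag << -shift
    return sign * math.ldexp(sig, shift - FRAC_BITS)


acc = 0
for w, x in zip([3, -2, 7], [0.5, 1.25, -0.75]):
    acc = mac(acc, w, to_fixed(x))
result = normalize_to_bf16(acc)
print(result)  # -6.25 for these inputs
```

The inputs were chosen so that the dot product is exactly representable; with arbitrary activations the truncating normalization would drop low-order bits, which is where a real design would have to pick a rounding mode.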


Compute Engines

| Engine | Operation | Weights Input | Activation Fetch | Accumulator |
|---|---|---|---|---|
| Matrix Core | GEMM (prefill, projections) | HP0/1 (32 INT4/clk) | BF16 → 27-bit fixed-pt | INT48 (DSP48E2) |
| Vector Core | GEMV (autoregressive decode) | HP2/3 (32 INT4/clk each) | BF16 → 27-bit fixed-pt | INT48 (DSP48E2) |
| CVO Core | Non-linear ops (Softmax, GELU, RoPE) | N/A | BF16 Stream via L2 | BF16 |
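The per-port weight rate in the table above translates into a stream width and an upper-bound bandwidth with simple arithmetic. The sketch below assumes that each HP port delivers its 32 INT4 weights on every cycle of the 400 MHz clock mentioned in the project overview; the source does not confirm the port clock or sustained efficiency, so treat the results as ceilings. The function name is hypothetical.

```python
CLOCK_HZ = 400_000_000   # clock from the project overview (ASSUMED to apply to the ports)
INT4_BITS = 4
WEIGHTS_PER_CLK = 32     # per-port figure from the table


def weight_stream(weights_per_clk: int, clock_hz: int = CLOCK_HZ):
    """Return (bits per clock, bytes per second) for an INT4 weight stream."""
    bits_per_clk = weights_per_clk * INT4_BITS
    bytes_per_s = bits_per_clk * clock_hz // 8
    return bits_per_clk, bytes_per_s


# One HP port.
port_bits, port_bps = weight_stream(WEIGHTS_PER_CLK)
# Vector Core: two ports (HP2/3), 32 INT4/clk each.
vec_bits, vec_bps = weight_stream(2 * WEIGHTS_PER_CLK)

print(f"per port:    {port_bits} bits/clk, {port_bps / 1e9:.1f} GB/s peak")
print(f"vector core: {vec_bits} bits/clk, {vec_bps / 1e9:.1f} GB/s peak")
```

Under these assumptions a single port carries 128 bits per clock, and the Vector Core's two ports together carry 256 bits per clock.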

A decoupled dataflow design lets instructions execute asynchronously. The Global Pipeline distributes them across distinct modules, so a long-running operation on one engine does not stall the others and the compute hardware stays as busy as possible.
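A toy model helps show what decoupling buys. The sketch below is not the project's scheduler: the opcode names, the `ROUTE` table, and the latencies are hypothetical, all instructions are assumed to be issued at cycle 0, and data dependencies between engines are ignored. Engine names follow the Compute Engines table.

```python
from collections import deque

# Hypothetical opcode-to-engine routing.
ROUTE = {"GEMM": "matrix", "GEMV": "vector", "CVO": "cvo"}


def simulate(program):
    """Toy cycle model. `program` is a list of (opcode, latency) pairs.

    Returns {instruction index: completion cycle}. Each engine drains its
    own FIFO in order, independently of the other engines.
    """
    fifos = {engine: deque() for engine in ROUTE.values()}
    for idx, (opcode, latency) in enumerate(program):
        fifos[ROUTE[opcode]].append((idx, latency))  # dispatch never blocks

    done = {}
    for fifo in fifos.values():
        cycle = 0
        while fifo:
            idx, latency = fifo.popleft()
            cycle += latency        # ops on the same engine run back to back
            done[idx] = cycle
    return done


program = [("GEMM", 100), ("GEMV", 10), ("CVO", 5), ("GEMV", 10)]
finish = simulate(program)

serial_cycles = sum(latency for _, latency in program)  # one shared in-order pipe
decoupled_cycles = max(finish.values())                 # engines run concurrently
print(finish, serial_cycles, decoupled_cycles)
```

In this example the two GEMV instructions and the CVO instruction complete long before the 100-cycle GEMM, whereas a single in-order pipeline would need the sum of all latencies.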