pccx: Parallel Compute Core eXecutor


Notice: Active development in progress.

pccx is a scalable, modular Neural Processing Unit (NPU) architecture designed to accelerate Transformer-based large language models (LLMs) on resource-constrained edge devices.


1. Architecture Overview

pccx is a device-agnostic hardware-software co-design framework. It attacks the memory-bandwidth and compute bottlenecks of edge hardware by scaling the core architecture to match the physical resource budget of each target device.

1.1 Ecosystem Structure

The project is strictly separated into three layers for maximum portability and scalability.

  • /architecture (Logic Layer) — core RTL and generate parameters.

    • Defines the logical pipeline, instruction scheduling, and the custom 64-bit ISA.

    • Independent of any specific hardware vendor or interface protocol.

  • /device (Implementation Layer) — maps the pccx architecture onto a specific hardware target.

    • Adjusts core count, systolic-array dimensions, and memory port widths to the available resource budget (DSP count, local memory size, etc.).

  • /driver (Software Layer) — a C/C++ hardware abstraction layer (HAL) and high-level API.

    • Handles instruction dispatch and memory mapping, bridging high-level AI models with the pccx hardware.


2. Key Technical Features

2.1 Decoupled Dataflow & Custom ISA

pccx uses a custom 64-bit ISA tuned for matrix and vector operations. A decoupled-dataflow pipeline separates instruction decode from execution, eliminating dispatch-side stalls and maximizing throughput.

2.2 W4A8 Dynamic Precision Promotion

pccx balances efficiency with accuracy:

  • Compute: a parallel 2D systolic array executes dense INT4 (weight) × INT8 (activation) operations.

  • Promotion: during non-linear operations (Softmax, RMSNorm, GELU), the CVO core automatically promotes precision to BF16 / FP32 so numerical integrity is preserved.

2.3 Tiered Memory Hierarchy

  • Matrix core: a dedicated GEMM engine with a scalable systolic-array size.

  • Vector core: GEMV and element-wise operations.

  • Shared interconnect: a flexible bus that lets cores and local caches access each other concurrently without arbitration overhead.


3. Documentation

Detailed technical specifications live under pccx v002 Architecture:

  1. Instruction Set Architecture (ISA) — 64-bit custom instruction set.

  2. Hardware Architecture — core microarchitecture and floorplan.

  3. Software Stack — driver and SDK documentation.

The v001 architecture is archived at Archive: v001 Experimental Architecture.


4. License

Licensed under the Apache License 2.0, which grants freedom of use and modification while its explicit patent grant protects the architecture from patent-related risks, keeping the ecosystem safe for open-source hardware development.


5. Ecosystem

RTL Implementation

github.com/hwkim-dev/pccx-FPGA-NPU-LLM-kv260

The active v002 SystemVerilog sources — ISA package, controller, compute cores (GEMM / GEMV / CVO), memory hierarchy. Target device is the Xilinx Kria KV260 (Zynq UltraScale+ ZU5EV).

Every v002 RTL reference page on this site links back to the exact .sv file in that repository.

Documentation source

github.com/hwkim-dev/pccx — the Sphinx project powering this site.

Author portfolio

hwkim-dev.github.io/hwkim-dev — blog, other projects, about.
