C API Overview¶
This page documents the v002 driver’s public C API. The implementation
lives under codes/v002/sw/driver/ (uCA_v1_api.h + uCA_v1_hal.h).
The stack is organized into two layers.
HAL (
uca_hal_*) — direct access to AXI-Lite / AXI-HP registers and the CMD_IN / STAT_OUT FIFOs. Applications should not call this layer directly.Public API (
uca_*) — “CUDA-equivalent” compute/memory primitives. Every call assembles a 64-bit VLIW instruction (Instruction Encoding) and issues it through the HAL.
Note
The API is named uca_* (micro Compute Architecture), not
pccx_*. pccx is the architecture / documentation project; uCA is
the RTL + software contract it exposes.
1. Initialization and Lifecycle¶
#include "uCA_v1_api.h"
// Bring the NPU up and verify the status register responds.
// Returns 0 on success, -1 on failure.
int uca_init(void);
// Park the NPU and release the HAL.
void uca_deinit(void);
The API does not take a context pointer — a single process talks to a single NPU instance, so the HAL keeps its state in a file-scope singleton.
2. Compute Primitives¶
Every compute call returns immediately after writing the instruction
into the CMD_IN FIFO. Use uca_sync (§5) to wait for completion.
2.1 Vector Core — uca_gemv¶
// y = W · x (INT4 weight × BF16/INT8 activation → BF16 result)
void uca_gemv(uint32_t dest_reg, // 17-bit L2 destination
uint32_t src_addr, // 17-bit L2 source (activation)
uint8_t flags, // OR of UCA_FLAG_* (see §4.1)
uint8_t size_ptr, // 6-bit size descriptor pointer
uint8_t shape_ptr, // 6-bit shape descriptor pointer
uint8_t lanes); // 5-bit parallel-lane mask (1–4)
2.2 Matrix Core — uca_gemm¶
// Y = W · X on the 32 × 32 systolic array.
// Identical field layout to uca_gemv; differs only in the opcode
// routed into the dispatcher.
void uca_gemm(uint32_t dest_reg,
uint32_t src_addr,
uint8_t flags,
uint8_t size_ptr,
uint8_t shape_ptr,
uint8_t lanes);
2.3 CVO / SFU — uca_cvo¶
// Complex vector operations:
// Softmax (EXP, REDUCE_SUM, SCALE)
// RMSNorm (SQRT, RECIP, SCALE)
// Activation (GELU)
// RoPE (SIN / COS)
void uca_cvo(uint8_t cvo_func, // UCA_CVO_* (see §4.3)
uint32_t src_addr, // 17-bit L2 source
uint32_t dst_addr, // 17-bit L2 destination
uint16_t length, // element count
uint8_t flags, // OR of UCA_CVO_FLAG_*
uint8_t async); // 1 = fire-and-forget, 0 = block
3. Memory Primitives¶
3.1 uca_memcpy¶
// Route-encoded DMA. The route enum combines source and destination
// (see §4.4). Typical use: load a weight tile from host DDR4 into
// L2 before a GEMM/GEMV.
void uca_memcpy(uint8_t route, // UCA_ROUTE_*
uint32_t dest_addr, // 17-bit destination
uint32_t src_addr, // 17-bit source
uint8_t shape_ptr, // 6-bit shape descriptor
uint8_t async);
3.2 uca_memset¶
// Write a shape / size descriptor (up to three 16-bit values) into
// one of the two descriptor caches.
void uca_memset(uint8_t dest_cache, // 0 = fmap_shape, 1 = weight_shape
uint8_t dest_addr, // 6-bit pointer slot
uint16_t a, // usually M
uint16_t b, // usually N
uint16_t c); // usually K
4. Constants¶
The header exposes named constants for every ISA field so application code never hard-codes bit patterns.
4.1 GEMV / GEMM flags (6-bit)¶
UCA_FLAG_FINDEMAX // bit 5 — record e_max of the output (softmax)
UCA_FLAG_ACCM // bit 4 — accumulate into dest (do not overwrite)
UCA_FLAG_W_SCALE // bit 3 — apply the weight scale during MAC
4.2 CVO flags (5-bit)¶
UCA_CVO_FLAG_SUB_EMAX // bit 4 — subtract e_max before the op
UCA_CVO_FLAG_RECIP_SCALE // bit 3 — SCALE uses 1 / scalar
UCA_CVO_FLAG_ACCM // bit 2 — accumulate into dst
4.3 CVO function codes (4-bit)¶
UCA_CVO_EXP // 0x0 element-wise exp(x) [SFU]
UCA_CVO_SQRT // 0x1 element-wise sqrt(x) [SFU]
UCA_CVO_GELU // 0x2 element-wise GELU(x) [SFU]
UCA_CVO_SIN // 0x3 element-wise sin(x) [CORDIC]
UCA_CVO_COS // 0x4 element-wise cos(x) [CORDIC]
UCA_CVO_REDUCE_SUM // 0x5 sum-reduction → scalar [SFU + adder]
UCA_CVO_SCALE // 0x6 element-wise × scalar [SFU]
UCA_CVO_RECIP // 0x7 element-wise 1/x [SFU]
4.4 Memory routes (8-bit)¶
UCA_ROUTE_HOST_TO_L2 // 0x01 host DDR4 → L2
UCA_ROUTE_L2_TO_HOST // 0x10 L2 → host DDR4
UCA_ROUTE_L2_TO_L1_GEMM // 0x12 L2 → GEMM L1 (input stage)
UCA_ROUTE_L2_TO_L1_GEMV // 0x13 L2 → GEMV L1 (input stage)
UCA_ROUTE_GEMM_RES_TO_L2 // 0x21 GEMM output → L2
UCA_ROUTE_GEMV_RES_TO_L2 // 0x31 GEMV output → L2
UCA_ROUTE_CVO_RES_TO_L2 // 0x41 SFU / CVO output → L2
5. Synchronization¶
// Poll UCA_STAT_BUSY until every issued instruction retires.
// Returns 0 when idle, -1 on timeout.
int uca_sync(uint32_t timeout_us);
The pccx controller is fully decoupled — uca_init configures the
decoder once, after which every uca_* issue is independent. Call
uca_sync before reading results back to host memory with
uca_memcpy(UCA_ROUTE_L2_TO_HOST, …).
6. Example — One FFN Block¶
A minimal FFN y = W_down · GELU(W_up · x) looks like this:
// 0) Bring the NPU up
if (uca_init() != 0) return -1;
// 1) Preload shape descriptors.
// slot 0: W_up (M, N, K) = (1, 4096, 4096)
// slot 1: W_down (M, N, K) = (1, 4096, 4096)
uca_memset(/*cache*/ 1, /*slot*/ 0, 1, 4096, 4096);
uca_memset(/*cache*/ 1, /*slot*/ 1, 1, 4096, 4096);
// 2) W_up · x → 0x0100 (GEMV)
uca_gemv(/*dest*/ 0x0100, /*src*/ 0x0000,
/*flags*/ 0, /*size_ptr*/ 0, /*shape_ptr*/ 0,
/*lanes*/ 0x0F); // all four cores active
// 3) GELU at 0x0100 → 0x0200
uca_cvo(UCA_CVO_GELU,
/*src*/ 0x0100, /*dst*/ 0x0200,
/*length*/ 4096,
/*flags*/ 0, /*async*/ 0);
// 4) W_down · activation → 0x0300
uca_gemv(/*dest*/ 0x0300, /*src*/ 0x0200,
/*flags*/ 0, /*size_ptr*/ 0, /*shape_ptr*/ 1,
/*lanes*/ 0x0F);
// 5) Wait for completion
uca_sync(/*timeout_us*/ 100000);
7. Error Handling¶
uca_init and uca_sync return 0 on success and -1 on
failure. All other uca_* calls return void — they only push an
instruction into the CMD_IN FIFO, and failure modes surface through the
next uca_sync (which polls UCA_STAT_*).
Typical failure surfaces:
uca_initreturns-1if the NPU STAT register does not read the expected boot pattern.uca_syncreturns-1if theBUSYbit stays asserted pasttimeout_us.Field overflows (e.g. address > 17 bits) are silently masked by the instruction encoder — keep application code within the documented widths.
See also
Instruction encoding: Instruction Encoding
Dataflow per opcode: Per-Instruction Dataflow
RTL side of the ISA: ISA Type Package