pccx ISA Specification¶
pccx (Parallel Compute Core eXecutor) — the FPGA NPU instruction set.
Target: Kria KV260 | Word width: 64-bit | Encoding: VLIW
RTL source: hw/rtl/NPU_Controller/NPU_Control_Unit/ISA_PACKAGE/
1. Instruction Format¶
Every instruction is 64 bits wide.
[63:60] [59:0]
OPCODE BODY (60 bits, layout depends on opcode)
The top-level decoder (ctrl_npu_decoder.sv) strips the 4-bit opcode
and routes the remaining 60-bit body to the appropriate execution
engine.
2. Opcode Table¶
O pcode |
Mnemonic |
Value |
Target Engine |
|---|---|---|---|
` OP_G EMV` |
Vec tor–Matrix Multiply |
|
Vector Core (μV-Cores) |
` OP_G EMM` |
Mat rix–Matrix Multiply |
|
Matrix Core (Systolic Array) |
|
Memory Copy |
|
MEM Dispatcher |
|
Memory Set |
|
MEM Dispatcher |
|
Complex Vector Op |
|
CVO Core (μCVO-Cores) |
— |
Reserved |
`` 4’h5` –``4 ‘hF` |
— |
3. Instruction Encoding¶
3.1 GEMV / GEMM (OP_GEMV, OP_GEMM)¶
Both share the same body layout.
[59:43] dest_reg 17-bit Destination register / address
[42:26] src_addr 17-bit Source address
[25:20] flags 6-bit Control flags (see §4)
[19:14] size_ptr_addr 6-bit Pointer to size descriptor
[13:8] shape_ptr_addr 6-bit Pointer to shape descriptor
[7:3] parallel_lane 5-bit Number of active parallel lanes
[2:0] reserved 3-bit
3.2 MEMCPY (OP_MEMCPY)¶
[59] from_device 1-bit 0=FROM_NPU, 1=FROM_HOST
[58] to_device 1-bit 0=TO_NPU, 1=TO_HOST
[57:41] dest_addr 17-bit Destination address
[40:24] src_addr 17-bit Source address
[23:7] aux_addr 17-bit Auxiliary address (reserved)
[6:1] shape_ptr_addr 6-bit Pointer to shape descriptor
[0] async 1-bit 0=sync, 1=async transfer
3.3 MEMSET (OP_MEMSET)¶
[59:58] dest_cache 2-bit 0=fmap_shape, 1=weight_shape
[57:52] dest_addr 6-bit Destination pointer address (ptr_addr_t)
[51:36] a_value 16-bit Value A
[35:20] b_value 16-bit Value B
[19:4] c_value 16-bit Value C
[3:0] reserved 4-bit
3.4 CVO (OP_CVO)¶
Dispatched to the CVO Core (2× μCVO-Cores). Each μCVO-Core contains a CORDIC unit (sin/cos) and an SFU (exp, sqrt, GELU). Required for Transformer softmax, RMSNorm, and activation functions.
[59:56] cvo_func 4-bit Function code (see §3.4.1)
[55:39] src_addr 17-bit Source address in L2 cache
[38:22] dst_addr 17-bit Destination address in L2 cache
[21:6] length 16-bit Number of elements (vector length)
[5:1] flags 5-bit Control flags (see §3.4.2)
[0] async 1-bit 0=sync, 1=async
3.4.1 CVO Function Codes¶
Code |
Mnemonic |
Description |
Hardware Unit |
|---|---|---|---|
|
`` CVO_EXP`` |
Element-wise exp(x) |
SFU |
|
|
Element-wise sqrt(x) |
SFU |
|
|
Element-wise GELU(x) |
SFU |
|
`` CVO_SIN`` |
Element-wise sin(x) |
CORDIC |
|
`` CVO_COS`` |
Element-wise cos(x) |
CORDIC |
|
|
Sum all elements → scalar at dst_addr |
SFU + Adder |
|
|
Element-wise multiply by scalar at src_addr+0 |
SFU |
|
|
Element-wise 1/x |
SFU |
`
4’h
8`–
|
— |
Reserved |
— |
Softmax sequence (one CVO pipeline pass): 1.
OP_GEMVwithFLAG_FINDEMAX— find e_max over attention scores 2.OP_CVO CVO_EXPwithFLAG_SUB_EMAX— exp(x − e_max) for each score 3.OP_CVO CVO_REDUCE_SUM— sum of exps (denominator) 4.OP_CVO CVO_SCALEwithFLAG_RECIP_SCALE— divide each exp by sumRMSNorm sequence: 1.
OP_GEMVwithFLAG_FINDEMAXduring projection (emax already tracked) 2.OP_CVO CVO_REDUCE_SUM(of squares) → then 3.OP_CVO CVO_SQRT+CVO_RECIP→ normalization factor 4.OP_CVO CVO_SCALE— apply learned weight γ
3.4.2 CVO Flags (5-bit, [5:1] of body)¶
[5] sub_emax Subtract e_max from input before operation (requires prior FINDEMAX)
[4] recip_scale Use reciprocal of scalar for SCALE (divide instead of multiply)
[3] accm Accumulate into dst (do not overwrite)
[2:1] reserved
4. Flags Field for GEMV/GEMM (6-bit, [25:20])¶
[5] findemax Find and register the exponent maximum (e_max) for output normalization
[4] accm Accumulate result into destination register (do not overwrite)
[3] w_scale Apply weight scale factor during MAC
[2:0] reserved
5. Memory Routing Table (MEMCPY)¶
Defined in isa_memctrl.svh as data_route_e.
Route Enum |
Encoding
( |
Description |
|---|---|---|
|
|
Host DDR4 → L2 cache (fmap DMA in via ACP) |
|
|
L2 cache → Host DDR4 (result DMA out via ACP) |
|
|
L2 → Matrix Core fmap broadcast |
|
|
L2 → Vector Core fmap broadcast |
|
|
L2 → CVO Core input stream |
|
|
Vector Core result → L2 cache |
|
|
Matrix Core result → L2 cache |
|
|
CVO Core result → L2 cache |
6. Micro-Op (uop) Structures¶
After decoding, the Global Scheduler splits the instruction body into engine-specific micro-ops before dispatch.
6.1 GEMV / GEMM Control uop¶
typedef struct packed {
flags_t flags; // 6-bit
ptr_addr_t size_ptr_addr; // 6-bit
parallel_lane_t parallel_lane; // 5-bit
} gemv_control_uop_t; // = gemm_control_uop_t
6.2 Memory Control uop¶
typedef struct packed {
data_route_e data_dest; // 8-bit (source[3:0] | dest[3:0])
dest_addr_t dest_addr; // 17-bit
src_addr_t src_addr; // 17-bit
ptr_addr_t shape_ptr_addr; // 6-bit
async_e async; // 1-bit
} memory_control_uop_t;
6.3 Memory Set uop¶
typedef struct packed {
dest_cache_e dest_cache; // 2-bit
ptr_addr_t dest_addr; // 6-bit
a_value_t a_value;
b_value_t b_value;
c_value_t c_value;
} memory_set_uop_t;
6.4 CVO Control uop¶
typedef struct packed {
cvo_func_e cvo_func; // 4-bit
src_addr_t src_addr; // 17-bit
dst_addr_t dst_addr; // 17-bit
length_t length; // 16-bit
cvo_flags_t flags; // 5-bit
async_e async; // 1-bit
} cvo_control_uop_t;
7. Decoupled Dataflow Pipeline¶
The front-end and execution engines are strictly decoupled.
Host (AXI-Lite) --> [AXIL_CMD_IN] --> ctrl_npu_decoder
|
+----------+------+------+------+-----------+
v v v v v
GEMV FIFO GEMM FIFO CVO FIFO MEM FIFO MEMSET FIFO
| | | | |
μV-Core Systolic μCVO-Core mem_dispatcher mem_set
(GEMV) Array(GEMM) (CVO)
The front-end (ctrl_npu_decoder) issues instructions into per-engine
FIFOs and immediately returns — it never stalls waiting for execution to
complete. Each engine’s local dispatcher independently pops from its
FIFO and fires when operands are ready.
8. AXI-Lite Register Map¶
Control is via S_AXIL_CTRL (HPM port on KV260).
Off set |
Wi dth |
Dire ction |
Description |
|---|---|---|---|
`` 0x0 0`` |
32- bit |
W |
VLIW instruction [31:0] (write lower word first) |
`` 0x0 4`` |
32- bit |
W |
VLIW instruction [63:32] (writing this word triggers NPU latch) |
`` 0x0 8`` |
32- bit |
R |
NPU status register (see §9) |
9. Status Register (0x08)¶
Bit |
Name |
Description |
|---|---|---|
[0] |
|
NPU is executing — do not issue new instruction |
[1] |
|
Last operation completed successfully |
[31:2] |
— |
Reserved |