pccx ISA Specification

pccx (Parallel Compute Core eXecutor) — the FPGA NPU instruction set.

Target: Kria KV260 | Word width: 64-bit | Encoding: VLIW

RTL source: hw/rtl/NPU_Controller/NPU_Control_Unit/ISA_PACKAGE/


1. Instruction Format

Every instruction is 64 bits wide.

[63:60]   [59:0]
OPCODE    BODY (60 bits, layout depends on opcode)

The top-level decoder (ctrl_npu_decoder.sv) strips the 4-bit opcode and routes the remaining 60-bit body to the appropriate execution engine.
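The opcode/body split described above can be mirrored on the host side for testing. The following is a minimal sketch, not the RTL behavior of ctrl_npu_decoder.sv; the function names `split_instruction` and `join_instruction` are hypothetical helpers introduced here for illustration.

```python
OPCODE_SHIFT = 60              # opcode occupies bits [63:60]
BODY_MASK = (1 << 60) - 1      # body occupies bits [59:0]

def split_instruction(word):
    """Return (opcode, body) for one 64-bit instruction word."""
    if not 0 <= word < (1 << 64):
        raise ValueError("instruction must fit in 64 bits")
    return word >> OPCODE_SHIFT, word & BODY_MASK

def join_instruction(opcode, body):
    """Inverse of split_instruction: place the opcode above the 60-bit body."""
    if not 0 <= opcode < 16:
        raise ValueError("opcode must fit in 4 bits")
    if not 0 <= body <= BODY_MASK:
        raise ValueError("body must fit in 60 bits")
    return (opcode << OPCODE_SHIFT) | body
```

For instance, `split_instruction(0x4000000000000001)` yields opcode 4 and body 1.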


2. Opcode Table

Opcode     Mnemonic                Value      Target Engine
OP_GEMV    Vector–Matrix Multiply  4'h0       Vector Core (μV-Cores)
OP_GEMM    Matrix–Matrix Multiply  4'h1       Matrix Core (Systolic Array)
OP_MEMCPY  Memory Copy             4'h2       MEM Dispatcher
OP_MEMSET  Memory Set              4'h3       MEM Dispatcher
OP_CVO     Complex Vector Op       4'h4       CVO Core (μCVO-Cores)
Reserved                           4'h5–4'hF
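A host-side tool may want to reject reserved opcodes before they reach the hardware. The sketch below copies the five defined values from the table; the `Opcode` enum, the `TARGET_ENGINE` map and the `target_engine` function are illustrative names, not part of the RTL package.

```python
from enum import IntEnum

class Opcode(IntEnum):
    OP_GEMV = 0x0
    OP_GEMM = 0x1
    OP_MEMCPY = 0x2
    OP_MEMSET = 0x3
    OP_CVO = 0x4

# Engine names as listed in the opcode table.
TARGET_ENGINE = {
    Opcode.OP_GEMV: "Vector Core",
    Opcode.OP_GEMM: "Matrix Core",
    Opcode.OP_MEMCPY: "MEM Dispatcher",
    Opcode.OP_MEMSET: "MEM Dispatcher",
    Opcode.OP_CVO: "CVO Core",
}

def target_engine(opcode_value):
    """Return the engine name, or raise ValueError for 4'h5 to 4'hF."""
    try:
        opcode = Opcode(opcode_value)
    except ValueError:
        raise ValueError(f"reserved opcode 4'h{opcode_value:X}") from None
    return TARGET_ENGINE[opcode]
```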


3. Instruction Encoding

3.1 GEMV / GEMM (OP_GEMV, OP_GEMM)

Both share the same body layout.

[59:43]  dest_reg       17-bit  Destination register / address
[42:26]  src_addr       17-bit  Source address
[25:20]  flags           6-bit  Control flags (see §4)
[19:14]  size_ptr_addr   6-bit  Pointer to size descriptor
[13:8]   shape_ptr_addr  6-bit  Pointer to shape descriptor
[7:3]    parallel_lane   5-bit  Number of active parallel lanes
[2:0]    reserved        3-bit
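To make the layout concrete, here is a small table-driven packer for the GEMV/GEMM body. It is a sketch under the assumption that unspecified fields and the reserved bits [2:0] are written as zero; `GEMV_FIELDS` and `pack_gemv_body` are hypothetical names.

```python
# (name, msb, lsb) copied from the layout above; reserved [2:0] stays zero.
GEMV_FIELDS = [
    ("dest_reg", 59, 43),
    ("src_addr", 42, 26),
    ("flags", 25, 20),
    ("size_ptr_addr", 19, 14),
    ("shape_ptr_addr", 13, 8),
    ("parallel_lane", 7, 3),
]

def pack_gemv_body(**values):
    """Pack named field values into the 60-bit GEMV/GEMM body."""
    known = {name for name, _, _ in GEMV_FIELDS}
    unknown = set(values) - known
    if unknown:
        raise ValueError(f"unknown field(s): {sorted(unknown)}")
    body = 0
    for name, msb, lsb in GEMV_FIELDS:
        width = msb - lsb + 1
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        body |= value << lsb
    return body
```

The range check matters most for `parallel_lane`: the field is 5 bits wide, so 31 is the largest value that can be encoded.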

3.2 MEMCPY (OP_MEMCPY)

[59]     from_device     1-bit  0=FROM_NPU, 1=FROM_HOST
[58]     to_device       1-bit  0=TO_NPU,   1=TO_HOST
[57:41]  dest_addr      17-bit  Destination address
[40:24]  src_addr       17-bit  Source address
[23:7]   aux_addr       17-bit  Auxiliary address (reserved)
[6:1]    shape_ptr_addr  6-bit  Pointer to shape descriptor
[0]      async           1-bit  0=sync, 1=async transfer
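When inspecting a trace, it helps to go the other way and pull the fields back out of a MEMCPY body. The following sketch assumes nothing beyond the bit positions listed above; `bits` and `unpack_memcpy_body` are illustrative helper names.

```python
def bits(value, msb, lsb):
    """Extract value[msb:lsb] as an unsigned integer."""
    return (value >> lsb) & ((1 << (msb - lsb + 1)) - 1)

def unpack_memcpy_body(body):
    """Split a 60-bit MEMCPY body into its named fields."""
    return {
        "from_device": bits(body, 59, 59),   # 0=FROM_NPU, 1=FROM_HOST
        "to_device": bits(body, 58, 58),     # 0=TO_NPU,   1=TO_HOST
        "dest_addr": bits(body, 57, 41),
        "src_addr": bits(body, 40, 24),
        "aux_addr": bits(body, 23, 7),       # reserved per the layout
        "shape_ptr_addr": bits(body, 6, 1),
        "async": bits(body, 0, 0),           # 0=sync, 1=async
    }
```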

3.3 MEMSET (OP_MEMSET)

[59:58]  dest_cache      2-bit  0=fmap_shape, 1=weight_shape
[57:52]  dest_addr       6-bit  Destination pointer address (ptr_addr_t)
[51:36]  a_value        16-bit  Value A
[35:20]  b_value        16-bit  Value B
[19:4]   c_value        16-bit  Value C
[3:0]    reserved        4-bit
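As a worked example, the sketch below builds a complete 64-bit MEMSET word, opcode included. It assumes the reserved bits [3:0] are written as zero; `encode_memset` is a hypothetical helper, and the meaning of values A, B and C is left to the caller because the layout above does not define it.

```python
OP_MEMSET = 0x3  # from the opcode table

def encode_memset(dest_cache, dest_addr, a_value, b_value, c_value):
    """Return the full 64-bit MEMSET instruction word."""
    # (value, width, lsb) for each field of the body.
    fields = [
        (dest_cache, 2, 58),
        (dest_addr, 6, 52),
        (a_value, 16, 36),
        (b_value, 16, 20),
        (c_value, 16, 4),
    ]
    body = 0
    for value, width, lsb in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        body |= value << lsb
    return (OP_MEMSET << 60) | body
```

For example, `encode_memset(1, 2, 3, 4, 5)` produces `0x3420003000400050`: opcode 3, dest_cache 1 (weight_shape), dest_addr 2, then the three 16-bit values.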

3.4 CVO (OP_CVO)

Dispatched to the CVO Core (2× μCVO-Cores). Each μCVO-Core contains a CORDIC unit (sin/cos) and an SFU (exp, sqrt, GELU). Required for Transformer softmax, RMSNorm, and activation functions.

[59:56]  cvo_func        4-bit  Function code (see §3.4.1)
[55:39]  src_addr       17-bit  Source address in L2 cache
[38:22]  dst_addr       17-bit  Destination address in L2 cache
[21:6]   length         16-bit  Number of elements (vector length)
[5:1]    flags           5-bit  Control flags (see §3.4.2)
[0]      async           1-bit  0=sync, 1=async

3.4.1 CVO Function Codes

Code       Mnemonic        Description                                    Hardware Unit
4'h0       CVO_EXP         Element-wise exp(x)                            SFU
4'h1       CVO_SQRT        Element-wise sqrt(x)                           SFU
4'h2       CVO_GELU        Element-wise GELU(x)                           SFU
4'h3       CVO_SIN         Element-wise sin(x)                            CORDIC
4'h4       CVO_COS         Element-wise cos(x)                            CORDIC
4'h5       CVO_REDUCE_SUM  Sum all elements → scalar at dst_addr          SFU + Adder
4'h6       CVO_SCALE       Element-wise multiply by scalar at src_addr+0  SFU
4'h7       CVO_RECIP       Element-wise 1/x                               SFU
4'h8–4'hF  Reserved
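Combining the function codes with the body layout in §3.4, a CVO body could be assembled as sketched here. The `CvoFunc` enum and `pack_cvo_body` are illustrative names; the flags argument is the raw 5-bit field value, and reserved function codes are rejected.

```python
from enum import IntEnum

class CvoFunc(IntEnum):
    CVO_EXP = 0x0
    CVO_SQRT = 0x1
    CVO_GELU = 0x2
    CVO_SIN = 0x3
    CVO_COS = 0x4
    CVO_REDUCE_SUM = 0x5
    CVO_SCALE = 0x6
    CVO_RECIP = 0x7

def pack_cvo_body(func, src_addr, dst_addr, length, flags=0, is_async=0):
    """Pack a 60-bit CVO body; CvoFunc() raises ValueError on 4'h8 to 4'hF."""
    fields = [
        (int(CvoFunc(func)), 4, 56),
        (src_addr, 17, 39),
        (dst_addr, 17, 22),
        (length, 16, 6),
        (flags, 5, 1),
        (is_async, 1, 0),
    ]
    body = 0
    for value, width, lsb in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        body |= value << lsb
    return body
```

Because `length` is a 16-bit field, a single CVO instruction can describe at most 65535 elements in this encoding.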

Softmax sequence (one CVO pipeline pass):

1. OP_GEMV with FLAG_FINDEMAX: find e_max over attention scores
2. OP_CVO CVO_EXP with FLAG_SUB_EMAX: exp(x − e_max) for each score
3. OP_CVO CVO_REDUCE_SUM: sum of exps (denominator)
4. OP_CVO CVO_SCALE with FLAG_RECIP_SCALE: divide each exp by sum

RMSNorm sequence:

1. OP_GEMV with FLAG_FINDEMAX during projection (e_max already tracked)
2. OP_CVO CVO_REDUCE_SUM: sum of squares
3. OP_CVO CVO_SQRT + CVO_RECIP: normalization factor
4. OP_CVO CVO_SCALE: apply learned weight γ
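A floating-point reference for the four softmax steps above is useful when checking NPU output against a host model. This sketch assumes that e_max can be stood in for by the largest score; the spec describes e_max as an exponent maximum, so the value the hardware subtracts may differ. Softmax is unchanged by subtracting any common offset from all inputs, so the final result is the same either way. `softmax_reference` is a hypothetical name.

```python
import math

def softmax_reference(scores):
    """Host-side model of the four-step softmax sequence."""
    e_max = max(scores)                            # step 1: FINDEMAX (assumed: max score)
    exps = [math.exp(x - e_max) for x in scores]   # step 2: CVO_EXP with sub_emax
    total = sum(exps)                              # step 3: CVO_REDUCE_SUM
    recip = 1.0 / total                            # step 4: CVO_SCALE with recip_scale
    return [e * recip for e in exps]
```

Subtracting the offset before exponentiation keeps every exponent at or below zero, which avoids overflow for large scores.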

3.4.2 CVO Flags (5-bit, [5:1] of body)

[5]  sub_emax      Subtract e_max from input before operation (requires prior FINDEMAX)
[4]  recip_scale   Use reciprocal of scalar for SCALE (divide instead of multiply)
[3]  accm          Accumulate into dst (do not overwrite)
[2:1] reserved

4. Flags Field for GEMV/GEMM (6-bit, [25:20])

[5]  findemax   Find and register the exponent maximum (e_max) for output normalization
[4]  accm       Accumulate result into destination register (do not overwrite)
[3]  w_scale    Apply weight scale factor during MAC
[2:0] reserved
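The two flag listings appear to use different index conventions: the CVO flags in §3.4.2 are numbered by body bit position ([5:1]), while the GEMV/GEMM flags here are numbered within the 6-bit field that sits at body bits [25:20]. The sketch below encodes that reading; the enum and function names are illustrative only.

```python
from enum import IntFlag

class GemvFlags(IntFlag):
    # Indices relative to the 6-bit flags field (body bits [25:20]).
    FINDEMAX = 1 << 5
    ACCM = 1 << 4
    W_SCALE = 1 << 3

class CvoFlags(IntFlag):
    # Indices are body bit positions, as listed in section 3.4.2.
    SUB_EMAX = 1 << 5
    RECIP_SCALE = 1 << 4
    ACCM = 1 << 3

def gemv_flags_in_body(flags):
    """Shift the 6-bit GEMV/GEMM flags field to body bits [25:20]."""
    return int(flags) << 20

def cvo_flags_field(flags):
    """Return the 5-bit CVO flags field value (body bits [5:1] shifted down)."""
    return int(flags) >> 1
```

Under this reading, FINDEMAX lands on body bit 25 and sub_emax on body bit 5.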

5. Memory Routing Table (MEMCPY)

Defined in isa_memctrl.svh as data_route_e.

Route Enum            Encoding (src[3:0] | dst[3:0])  Description
from_host_to_L2       8'h01                           Host DDR4 → L2 cache (fmap DMA in via ACP)
from_L2_to_host       8'h10                           L2 cache → Host DDR4 (result DMA out via ACP)
from_L2_to_L1_GEMM    8'h12                           L2 → Matrix Core fmap broadcast
from_L2_to_L1_GEMV    8'h13                           L2 → Vector Core fmap broadcast
from_L2_to_CVO        8'h14                           L2 → CVO Core input stream
from_GEMV_res_to_L2   8'h31                           Vector Core result → L2 cache
from_GEMM_res_to_L2   8'h21                           Matrix Core result → L2 cache
from_CVO_res_to_L2    8'h41                           CVO Core result → L2 cache
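The encodings in the table follow a source-nibble, destination-nibble pattern. The endpoint IDs in this sketch (host 0, L2 1, GEMM 2, GEMV 3, CVO 4) are inferred from the listed values, not taken from isa_memctrl.svh, so treat them as an assumption; `route_code` is a hypothetical helper that only accepts routes present in the table.

```python
# Endpoint IDs inferred from the table values (assumption).
ENDPOINT_ID = {"host": 0x0, "L2": 0x1, "GEMM": 0x2, "GEMV": 0x3, "CVO": 0x4}

# The eight encodings listed in the routing table.
DEFINED_ROUTES = {0x01, 0x10, 0x12, 0x13, 0x14, 0x31, 0x21, 0x41}

def route_code(src, dst):
    """Return the 8-bit route value: src in [7:4], dst in [3:0]."""
    code = (ENDPOINT_ID[src] << 4) | ENDPOINT_ID[dst]
    if code not in DEFINED_ROUTES:
        raise ValueError(f"no route from {src} to {dst} in the table")
    return code
```

Note that every listed route has L2 on one side; a pair such as host to CVO has no entry and is rejected by the sketch.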


6. Micro-Op (uop) Structures

After decoding, the Global Scheduler splits the instruction body into engine-specific micro-ops before dispatch.

6.1 GEMV / GEMM Control uop

typedef struct packed {
    flags_t         flags;           // 6-bit
    ptr_addr_t      size_ptr_addr;   // 6-bit
    parallel_lane_t parallel_lane;   // 5-bit
} gemv_control_uop_t;  // = gemm_control_uop_t

6.2 Memory Control uop

typedef struct packed {
    data_route_e data_dest;      // 8-bit  (source[3:0] | dest[3:0])
    dest_addr_t  dest_addr;      // 17-bit
    src_addr_t   src_addr;       // 17-bit
    ptr_addr_t   shape_ptr_addr; // 6-bit
    async_e      async;          // 1-bit
} memory_control_uop_t;

6.3 Memory Set uop

typedef struct packed {
    dest_cache_e dest_cache;  // 2-bit
    ptr_addr_t   dest_addr;   // 6-bit
    a_value_t    a_value;     // 16-bit
    b_value_t    b_value;     // 16-bit
    c_value_t    c_value;     // 16-bit
} memory_set_uop_t;

6.4 CVO Control uop

typedef struct packed {
    cvo_func_e  cvo_func;     // 4-bit
    src_addr_t  src_addr;     // 17-bit
    dst_addr_t  dst_addr;     // 17-bit
    length_t    length;       // 16-bit
    cvo_flags_t flags;        // 5-bit
    async_e     async;        // 1-bit
} cvo_control_uop_t;

7. Decoupled Dataflow Pipeline

The front-end and execution engines are strictly decoupled.

Host (AXI-Lite) --> [AXIL_CMD_IN] --> ctrl_npu_decoder
                                            |
                   +----------+------+------+------+-----------+
                   v          v      v             v           v
              GEMV FIFO  GEMM FIFO  CVO FIFO  MEM FIFO    MEMSET FIFO
                   |          |      |             |           |
              μV-Core    Systolic  μCVO-Core  mem_dispatcher  mem_set
             (GEMV)    Array(GEMM) (CVO)

The front-end (ctrl_npu_decoder) issues instructions into per-engine FIFOs and immediately returns — it never stalls waiting for execution to complete. Each engine’s local dispatcher independently pops from its FIFO and fires when operands are ready.
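The issue-and-return behavior can be modeled in a few lines of software. This is a toy model, not the RTL: it assumes OP_MEMCPY feeds the MEM FIFO and OP_MEMSET feeds the MEMSET FIFO (read off the diagram), and it ignores FIFO depth and operand readiness. All names are hypothetical.

```python
from collections import deque

BODY_MASK = (1 << 60) - 1

# Opcode -> FIFO name, following the diagram above (assumed mapping).
FIFO_FOR_OPCODE = {0x0: "GEMV", 0x1: "GEMM", 0x2: "MEM", 0x3: "MEMSET", 0x4: "CVO"}

def make_fifos():
    """Create one empty queue per engine."""
    return {name: deque() for name in FIFO_FOR_OPCODE.values()}

def issue(word, fifos):
    """Front-end step: push the body into its engine FIFO and return at once."""
    opcode = word >> 60
    if opcode not in FIFO_FOR_OPCODE:
        raise ValueError(f"reserved opcode 4'h{opcode:X}")
    fifos[FIFO_FOR_OPCODE[opcode]].append(word & BODY_MASK)

def drain(name, fifos):
    """Engine step: pop every queued body in issue order."""
    popped = []
    while fifos[name]:
        popped.append(fifos[name].popleft())
    return popped
```

Issuing two CVO instructions around a GEMV instruction leaves the CVO bodies in order in their own queue, independent of when the GEMV queue is drained.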


8. AXI-Lite Register Map

Control is via S_AXIL_CTRL (HPM port on KV260).

Offset  Width   Direction  Description
0x00    32-bit  W          VLIW instruction [31:0] (write lower word first)
0x04    32-bit  W          VLIW instruction [63:32] (writing this word triggers NPU latch)
0x08    32-bit  R          NPU status register (see §9)
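Since the write to 0x04 triggers the latch, a driver has to send the lower word first. The sketch below only computes the ordered (offset, value) pairs; how they reach the bus (for example via a memory-mapped region) is left out, and `instruction_writes` is a hypothetical helper.

```python
REG_INSTR_LO = 0x00  # VLIW instruction [31:0]
REG_INSTR_HI = 0x04  # VLIW instruction [63:32], write triggers the latch

def instruction_writes(word):
    """Return the two 32-bit register writes in the required order."""
    if not 0 <= word < (1 << 64):
        raise ValueError("instruction must fit in 64 bits")
    return [
        (REG_INSTR_LO, word & 0xFFFFFFFF),
        (REG_INSTR_HI, (word >> 32) & 0xFFFFFFFF),
    ]
```

For the word `0x4000000000000001` this gives a write of 0x00000001 to offset 0x00 followed by 0x40000000 to offset 0x04.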


9. Status Register (0x08)

Bit     Name      Description
[0]     BUSY      NPU is executing; do not issue new instruction
[1]     DONE      Last operation completed successfully
[31:2]  Reserved
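A host driver would typically poll BUSY before issuing the next instruction. The following is a sketch of that loop; `read_status` stands for any caller-supplied function that returns the 32-bit value at offset 0x08, and the poll limit is an arbitrary choice, not something the spec defines.

```python
STATUS_BUSY = 1 << 0
STATUS_DONE = 1 << 1

def decode_status(value):
    """Decode the two defined status bits; reserved bits [31:2] are ignored."""
    return {"busy": bool(value & STATUS_BUSY), "done": bool(value & STATUS_DONE)}

def wait_until_idle(read_status, max_polls=1000):
    """Poll until BUSY clears, then return the decoded status."""
    for _ in range(max_polls):
        status = read_status()
        if not status & STATUS_BUSY:
            return decode_status(status)
    raise TimeoutError("NPU still busy after max_polls reads")
```

A caller can then check the returned `done` flag before reading results back.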