Per-Instruction Encoding
Every instruction consists of a 4-bit opcode, opcode[3:0], plus a 60-bit body, for 64 bits in total. This
page details the body layout and field semantics for each opcode.
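As a rough illustration of how the opcode and body fit together, the sketch below joins and splits a 64-bit word. The placement is an assumption: the name opcode[3:0] suggests the opcode occupies the low four bits of the word, but this page does not state it. The helper names `make_instruction` and `split_instruction` are hypothetical.

```python
OPCODE_BITS = 4
BODY_BITS = 60


def make_instruction(opcode: int, body: int) -> int:
    """Join a 4-bit opcode and a 60-bit body into one 64-bit word.

    ASSUMPTION: the opcode sits in word bits [3:0], the body in [63:4].
    """
    if not 0 <= opcode < (1 << OPCODE_BITS):
        raise ValueError(f"opcode {opcode:#x} does not fit in {OPCODE_BITS} bits")
    if not 0 <= body < (1 << BODY_BITS):
        raise ValueError(f"body {body:#x} does not fit in {BODY_BITS} bits")
    return (body << OPCODE_BITS) | opcode


def split_instruction(word: int) -> tuple[int, int]:
    """Inverse of make_instruction: return (opcode, body)."""
    return word & ((1 << OPCODE_BITS) - 1), word >> OPCODE_BITS


# 0x1 is the GEMM opcode; 0xABC is an arbitrary body value.
word = make_instruction(0x1, 0xABC)
opcode, body = split_instruction(word)
```

If the hardware places the opcode in the top four bits instead, only the two shift expressions change.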
1. GEMV / GEMM (opcode = 0x0 / 0x1)
GEMV and GEMM share an identical body layout.
| Bits | Width | Field | Description |
|---|---|---|---|
| [59:43] | 17 |  | L2 address where the result is written. |
| [42:26] | 17 |  | L2 address of the activation source. |
| [25:20] | 6 |  | Flags field (see section 1.1). |
| [19:14] | 6 |  | Constant Cache index for the size entry. |
| [13:8] | 6 |  | Constant Cache index for the shape entry. |
| [7:3] | 5 |  | Number of core lanes to activate (0 = all). |
| [2:0] | 3 | reserved | Must be zero. |
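The layout above can be packed with a small table-driven helper. This is a minimal sketch, not the toolchain's encoder: the field names (`dest_addr`, `src_addr`, `flags`, `size_idx`, `shape_idx`, `lanes`) are hypothetical, and it assumes bits [25:20] hold the 6-bit flags field described in section 1.1.

```python
# (hypothetical field name, msb, lsb) for the shared GEMV / GEMM body.
GEMX_FIELDS = [
    ("dest_addr", 59, 43),  # L2 address where the result is written
    ("src_addr", 42, 26),   # L2 address of the activation source
    ("flags", 25, 20),      # ASSUMPTION: the flags field of section 1.1
    ("size_idx", 19, 14),   # Constant Cache index for the size entry
    ("shape_idx", 13, 8),   # Constant Cache index for the shape entry
    ("lanes", 7, 3),        # number of core lanes to activate (0 = all)
]


def pack_gemx_body(**values: int) -> int:
    """Pack named field values into a 60-bit body; bits [2:0] stay zero."""
    known = {name for name, _, _ in GEMX_FIELDS}
    unknown = set(values) - known
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    body = 0
    for name, msb, lsb in GEMX_FIELDS:
        width = msb - lsb + 1
        value = values.get(name, 0)  # omitted fields default to zero
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value:#x} does not fit in {width} bits")
        body |= value << lsb
    return body


# Arbitrary example addresses and cache indices; lanes=0 selects all lanes.
gemv_body = pack_gemx_body(dest_addr=0x100, src_addr=0x200,
                           size_idx=1, shape_idx=2, lanes=0)
```

Because the reserved bits [2:0] never appear in the field table, they are zero by construction.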
1.1 Flags Field (6 bit)

| Bit | Name | Description |
|---|---|---|
| [5] |  | Update the e_max register for output normalization (used in softmax). |
| [4] |  | Accumulate into |
| [3] |  | Apply the weight scale factor during MAC. |
| [2:0] | reserved | Must be zero. |
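One way to keep these bit positions readable in host code is an `enum.IntFlag`. The member names below (`UPDATE_EMAX`, `ACCUMULATE`, `APPLY_SCALE`) are hypothetical stand-ins for the flag names; only the bit positions come from the table.

```python
from enum import IntFlag


class GemxFlags(IntFlag):
    """Hypothetical names for the GEMV / GEMM flag bits."""
    UPDATE_EMAX = 1 << 5   # bit [5]: update the e_max register
    ACCUMULATE = 1 << 4    # bit [4]: accumulate
    APPLY_SCALE = 1 << 3   # bit [3]: apply the weight scale factor during MAC


RESERVED_MASK = 0b000111   # bits [2:0] must be zero
FLAGS_WIDTH = 6


def encode_flags(flags: int) -> int:
    """Validate a flags value and return it as a plain 6-bit integer."""
    value = int(flags)
    if value < 0 or value >> FLAGS_WIDTH:
        raise ValueError(f"flags {value:#x} do not fit in {FLAGS_WIDTH} bits")
    if value & RESERVED_MASK:
        raise ValueError("reserved flag bits [2:0] must be zero")
    return value


softmax_flags = encode_flags(GemxFlags.UPDATE_EMAX | GemxFlags.APPLY_SCALE)
```

Rejecting reserved bits at encode time catches mistakes before an instruction reaches the device.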
2. MEMCPY (opcode = 0x2)

| Bits | Width | Field | Description |
|---|---|---|---|
| [59] | 1 |  | 0 = NPU, 1 = Host. |
| [58] | 1 |  | 0 = NPU, 1 = Host. |
| [57:41] | 17 |  | Destination address. |
| [40:24] | 17 |  | Source address. |
| [23:7] | 17 |  | Auxiliary address (e.g., a host DDR offset). |
| [6:1] | 6 |  | Shape pointer for the transfer. |
| [0] | 1 |  | 0 = wait for completion, 1 = fire-and-forget. |
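A MEMCPY body could be assembled as sketched below. All field names are hypothetical. In particular, the table does not say which of bits [59] and [58] selects the source side and which the destination side, so the sketch names them only by bit position (`dev_bit59`, `dev_bit58`) rather than guessing.

```python
NPU, HOST = 0, 1  # encoding of the two device-select bits, per the table

# hypothetical field name -> (msb, lsb)
MEMCPY_FIELDS = {
    "dev_bit59": (59, 59),        # 0 = NPU, 1 = Host
    "dev_bit58": (58, 58),        # 0 = NPU, 1 = Host
    "dest_addr": (57, 41),        # destination address
    "src_addr": (40, 24),         # source address
    "aux_addr": (23, 7),          # auxiliary address, e.g. a host DDR offset
    "shape_ptr": (6, 1),          # shape pointer for the transfer
    "fire_and_forget": (0, 0),    # 0 = wait for completion, 1 = fire-and-forget
}


def pack_memcpy_body(**values: int) -> int:
    """Pack named MEMCPY fields into a 60-bit body (omitted fields are zero)."""
    body = 0
    for name, value in values.items():
        if name not in MEMCPY_FIELDS:
            raise ValueError(f"unknown field: {name}")
        msb, lsb = MEMCPY_FIELDS[name]
        width = msb - lsb + 1
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value:#x} does not fit in {width} bits")
        body |= value << lsb
    return body


# Arbitrary example: one side on the host, blocking transfer.
copy_body = pack_memcpy_body(dev_bit59=HOST, dev_bit58=NPU,
                             dest_addr=0x40, src_addr=0x80,
                             aux_addr=0x1000, shape_ptr=5)
```

The seven fields cover all 60 bits, so this layout has no reserved bits to check.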
3. MEMSET (opcode = 0x3)
Writes to the Constant Cache’s shape / size / scale registers.
| Bits | Width | Field | Description |
|---|---|---|---|
| [59:58] | 2 |  | 0 = fmap_shape, 1 = weight_shape. |
| [57:52] | 6 |  | Constant Cache index. |
| [51:36] | 16 |  | First 16-bit value. |
| [35:20] | 16 |  | Second 16-bit value. |
| [19:4] | 16 |  | Third 16-bit value. |
| [3:0] | 4 | reserved | Must be zero. |
Tip
The three slots (a / b / c) let a single MEMSET write an entire (M, N, K) tuple in one shot.
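To make the tip concrete, the sketch below packs an (M, N, K) tuple into the three 16-bit slots and unpacks it again. The parameter names `target` and `index` are hypothetical; the slot names a / b / c follow the tip, and the values (128, 64, 32) and index 3 are arbitrary.

```python
FMAP_SHAPE, WEIGHT_SHAPE = 0, 1  # target selector values from the table


def pack_memset_body(target: int, index: int, a: int, b: int, c: int) -> int:
    """Pack a MEMSET body; reserved bits [3:0] are left at zero."""
    fields = [
        (target, 2, 58),  # [59:58] target selector
        (index, 6, 52),   # [57:52] Constant Cache index
        (a, 16, 36),      # [51:36] first 16-bit value
        (b, 16, 20),      # [35:20] second 16-bit value
        (c, 16, 4),       # [19:4]  third 16-bit value
    ]
    body = 0
    for value, width, lsb in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value:#x} does not fit in {width} bits")
        body |= value << lsb
    return body


def unpack_memset_body(body: int) -> dict:
    """Recover the fields of a MEMSET body (useful for debugging traces)."""
    return {
        "target": (body >> 58) & 0x3,
        "index": (body >> 52) & 0x3F,
        "a": (body >> 36) & 0xFFFF,
        "b": (body >> 20) & 0xFFFF,
        "c": (body >> 4) & 0xFFFF,
    }


# Write (M, N, K) = (128, 64, 32) to weight_shape entry 3 in one instruction.
M, N, K = 128, 64, 32
shape_body = pack_memset_body(WEIGHT_SHAPE, 3, M, N, K)
fields = unpack_memset_body(shape_body)
```

Each slot is 16 bits wide, so a single dimension is limited to 65535 under this layout.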
4. CVO (opcode = 0x4)
Complex Vector Operation — the instruction sent to the SFU.
| Bits | Width | Field | Description |
|---|---|---|---|
| [59:56] | 4 |  | Function code (table below). |
| [55:39] | 17 |  | L2 address of the input vector. |
| [38:22] | 17 |  | L2 address for the result. |
| [21:6] | 16 |  | Number of elements to process. |
| [5:1] | 5 |  | Flags field (see section 4.2). |
| [0] | 1 |  | 0 = synchronous, 1 = asynchronous. |
4.1 CVO Function Codes

| Function | Code | Description |
|---|---|---|
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  | Sum of vector elements. |
|  |  | Scalar × vector. |
|  |  |  |
|  | — | Reserved for future extensions. |
4.2 CVO Flags (5 bit)

| Bit | Name | Description |
|---|---|---|
| [4] |  | Subtract e_max before computing (softmax stabilization). |
| [3] |  | Use the reciprocal of the scalar (turns multiplication into division). |
| [2] |  | Accumulate into |
| [1:0] | reserved | Must be zero. |
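Putting the CVO body and its flags together, a dataclass-based encoder might look like the sketch below. The class, attribute, and flag names are hypothetical, and the sketch assumes that body bits [5:1] carry the 5-bit flags field from this section. The function code is passed in as a raw 4-bit number, and the `0x0` used in the example is only a placeholder, not a documented code.

```python
from dataclasses import dataclass

# Hypothetical names for the CVO flag bits (positions within the 5-bit field).
SUB_EMAX = 1 << 4      # bit [4]: subtract e_max before computing
RECIP_SCALE = 1 << 3   # bit [3]: use the reciprocal of the scalar
ACCUMULATE = 1 << 2    # bit [2]: accumulate
CVO_RESERVED = 0b00011  # bits [1:0] must be zero


@dataclass(frozen=True)
class CvoBody:
    func: int              # [59:56] function code
    src_addr: int          # [55:39] L2 address of the input vector
    dest_addr: int         # [38:22] L2 address for the result
    length: int            # [21:6]  number of elements to process
    flags: int = 0         # [5:1]   ASSUMPTION: the flags field of this section
    is_async: bool = False  # [0]    0 = synchronous, 1 = asynchronous

    def encode(self) -> int:
        """Return the 60-bit body as an integer."""
        if self.flags & CVO_RESERVED:
            raise ValueError("reserved flag bits [1:0] must be zero")
        fields = [
            (self.func, 4, 56),
            (self.src_addr, 17, 39),
            (self.dest_addr, 17, 22),
            (self.length, 16, 6),
            (self.flags, 5, 1),
            (int(self.is_async), 1, 0),
        ]
        body = 0
        for value, width, lsb in fields:
            if not 0 <= value < (1 << width):
                raise ValueError(f"value {value:#x} does not fit in {width} bits")
            body |= value << lsb
        return body


# Placeholder function code 0x0; addresses and length are arbitrary.
cvo_body = CvoBody(func=0x0, src_addr=0x10, dest_addr=0x20,
                   length=256, flags=SUB_EMAX).encode()
```

A frozen dataclass keeps each encoded instruction immutable, which makes it safe to cache or reuse across layers.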
5. Summary

| Opcode | Primary use case |
|---|---|
| GEMM | Prefill. Q/K/V projection, FFN up / down projection. |
| GEMV | Decoding. Every projection in the autoregressive step. |
| CVO | Softmax, RMSNorm, RoPE, GELU. |
| MEMCPY | Host ↔ device weight loads, KV cache updates, token output. |
| MEMSET | Preset shape / size pointers at layer start, inject scale factors. |