SFU Core (Complex Vector Operations)
The SFU (Special Function Unit) handles the Transformer’s non-linear operations — Softmax, GELU, RMSNorm, and RoPE. In the ISA, these appear under the CVO (Complex Vector Operation) opcode family.
1. Configuration
| Parameter | Value |
|---|---|
| Unit pipeline | 1 BF16 scalar / clk (streaming: a single operation works its way through … |
| Instances | 1 (single … |
| Sub-units | |
| Internal precision | BF16 / FP32 (dynamic promotion) |
| Supported functions | EXP, SQRT, GELU, SIN, COS, REDUCE_SUM, SCALE, RECIP |
2. Precision Promotion Strategy
Inputs and outputs stay in INT8 or BF16 to keep L2 bandwidth usage reasonable, but internal computation is promoted as follows.
| Function | Input | Internal | Rationale |
|---|---|---|---|
| | BF16 | FP32 | Overflow protection (softened further by the … |
| | BF16 | FP32 | Numerical stability for RMSNorm’s … |
| | BF16 | BF16 | Approximation (tanh or rational) suffices. |
| | BF16 | FP32 | Prevents phase drift in RoPE. |
| | BF16 / INT8 | FP32 | Minimizes long-running accumulation error. |
| | INT8 / BF16 | BF16 | Dequantize × scalar. |
| | BF16 | FP32 | Softmax denominator, layer-norm divisor. |
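The accumulation-error rationale can be illustrated with a small software model. This is a minimal sketch, not the RTL: the helper names (`to_bf16`, `to_f32`, `sum_bf16`, `sum_fp32`) are hypothetical, and round-to-nearest-even is an assumed rounding mode that the text above does not specify.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round to BF16 (top 16 bits of FP32), assuming round-to-nearest-even.

    NaN/Inf handling is omitted to keep the sketch short.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias, ties go to even
    bits &= 0xFFFF0000                   # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def sum_bf16(values):
    """Serial sum where the accumulator itself is kept in BF16."""
    acc = 0.0
    for v in values:
        acc = to_bf16(acc + to_bf16(v))
    return acc


def sum_fp32(values):
    """Serial sum with BF16 inputs, an FP32 accumulator, one demotion at the end."""
    acc = 0.0
    for v in values:
        acc = to_f32(acc + to_bf16(v))
    return to_bf16(acc)


ones = [1.0] * 1024
print(sum_bf16(ones), sum_fp32(ones))
```

With an 8-bit significand, the BF16 accumulator stops growing once the running sum reaches 256, because adding 1.0 then rounds back to the same value. The FP32 accumulator reaches the exact total of 1024 before the single demotion on output.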
3. Implementation Techniques
3.1 CORDIC + LUT Hybrid
Inherits from v001’s CVO_cordic_unit.sv and CVO_sfu_unit.sv:
- CORDIC: iterative-convergence operators (SIN, COS, SQRT, RECIP). 15–20 pipeline stages.
- LUT + polynomial correction: EXP and GELU. The LUT produces a coarse estimate; a 2nd- or 3rd-order polynomial refines it.
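Both techniques can be sketched in floating-point Python to show the arithmetic involved. This is an illustrative model only: the iteration count (`N_ITER`), the LUT size (`LUT_BITS`), the polynomial order, and all function names are assumptions, and the actual `CVO_cordic_unit.sv` / `CVO_sfu_unit.sv` datapaths may differ.

```python
import math

N_ITER = 16  # hypothetical; the text above quotes 15-20 pipeline stages
ATAN_TABLE = [math.atan(2.0 ** -i) for i in range(N_ITER)]

# Accumulated CORDIC gain, compensated by pre-scaling the start vector.
GAIN = 1.0
for i in range(N_ITER):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * i))


def cordic_sin_cos(theta: float):
    """Rotation-mode CORDIC for theta in [-pi/2, pi/2]; returns (sin, cos)."""
    x, y, z = 1.0 / GAIN, 0.0, theta
    for i in range(N_ITER):
        d = 1.0 if z >= 0.0 else -1.0          # rotate toward z == 0
        shift = 2.0 ** -i                      # a right shift in hardware
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * ATAN_TABLE[i]
    return y, x


LUT_BITS = 4  # hypothetical 16-entry table of 2**(j/16)
EXP2_LUT = [2.0 ** (j / (1 << LUT_BITS)) for j in range(1 << LUT_BITS)]


def lut_poly_exp(x: float) -> float:
    """exp(x) = 2**k * LUT[j] * poly(r): coarse LUT value, 2nd-order refinement."""
    y = x * math.log2(math.e)
    k = math.floor(y)                          # integer exponent
    f = y - k                                  # fractional part in [0, 1]
    j = min(int(f * (1 << LUT_BITS)), (1 << LUT_BITS) - 1)
    r = f - j / (1 << LUT_BITS)                # residual not covered by the LUT
    t = r * math.log(2.0)
    poly = 1.0 + t + 0.5 * t * t               # 2nd-order Taylor of exp(t)
    return math.ldexp(EXP2_LUT[j] * poly, k)


s, c = cordic_sin_cos(0.5)
print(s, c, lut_poly_exp(1.0))
```

In this model the CORDIC residual angle after 16 iterations is bounded by atan(2^-15), and the 2nd-order polynomial over a 1/16-wide interval leaves a relative error on the order of 1e-5, which is already below BF16 resolution.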
3.2 Reduction
CVO_REDUCE_SUM uses the BF16 accumulator inside CVO_sfu_unit to
sum one element per cycle serially for IN_length cycles, then
emits the scalar sum. When upstream data arrives as a 32-wide GEMV
output, the GEMV reduction tree has already collapsed that to a scalar,
so the SFU only needs to accumulate and normalise. Softmax’s denominator
is the canonical user.
3.3 Softmax Fast Path
Softmax decomposes into a 3-instruction sequence.
sequenceDiagram
autonumber
participant D as Dispatcher
participant S as SFU
participant E as e_max reg
D->>S: CVO_REDUCE_SUM (flags.findemax)
S->>E: update e_max
D->>S: CVO_EXP (flags.sub_emax)
S->>S: exp(x − e_max)
D->>S: CVO_RECIP × CVO_SCALE
S-->>D: softmax(x)
The findemax and sub_emax flags perform online
max-normalization in hardware, so the software layer never has to run
a separate scan pass.
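The arithmetic behind the sequence above can be modelled in a few lines of Python. This is a sketch under stated assumptions: `e_max` is treated as a register holding the running maximum of the inputs, the denominator is assumed to be accumulated from the EXP outputs, and the function name `softmax_fast_path` is hypothetical. It models the numerics only, not the instruction timing.

```python
import math


def softmax_fast_path(xs):
    """Numerical model of the max-normalized softmax sequence."""
    # Step 1: reduction pass with findemax set; the running maximum
    # is assumed to land in the e_max register.
    e_max = -math.inf
    for x in xs:
        e_max = max(e_max, x)

    # Step 2: EXP with sub_emax set computes exp(x - e_max), so the
    # largest exponent argument is 0 and the result cannot overflow.
    exps = [math.exp(x - e_max) for x in xs]

    # Assumption: the softmax denominator is the serial sum of the EXP outputs.
    denom = 0.0
    for e in exps:
        denom += e

    # Step 3: RECIP of the denominator, then SCALE of every element.
    inv = 1.0 / denom
    return [e * inv for e in exps]


probs = softmax_fast_path([1000.0, 1001.0, 1002.0])
print(probs, sum(probs))
```

Without the subtraction of `e_max`, `math.exp(1000.0)` would raise an overflow error in this model, which is the same failure the hardware flags are meant to avoid.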
4. Pipeline Integration
The SFU is wired to the GEMV core through a direct FIFO, enabling two tight loops.
- GEMV → SFU: Attention’s Q·K^T → softmax and the FFN’s GEMV → GELU hand-offs skip the L2 round trip.
- SFU → GEMV: the softmax output is multiplied with V immediately, again without going through L2.
The SFU supports async execution (the async_e ISA field) so the
controller can dispatch the next instruction without waiting for
completion. Completion notifications are handled by
fsmout_npu_stat_collector (inherited from v001).
5. Physical Placement
The floorplan (Physical Floorplan) reserves a single SFU instance near the centre, accessible to both slices. Siting it adjacent to the GEMV cores keeps the direct FIFO between GEMV and SFU as short as possible.
6. Scalability
Adding a function to the CVO table (cvo_func_e in
Per-Instruction Encoding) requires:
1. Extending the cvo_func_e enum (4 bits total, 8 of 16 used so far).
2. Adding the corresponding CORDIC / LUT block inside the SFU.
3. Updating the Dispatcher’s decode table.
The hardware function slots are gated by a generate parameter
SFU_ENABLE_MASK, letting you drop unused functions from synthesis to
save LUTs.
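The relationship between the 4-bit function field and the enable mask can be sketched as a small decode model. Everything specific here is an assumption for illustration: the numeric encodings, the reading of `SFU_ENABLE_MASK` as one bit per encoding, and the names `CvoFunc` and `decode` are hypothetical and are not taken from Per-Instruction Encoding.

```python
from enum import IntEnum

FUNC_BITS = 4  # width of the cvo_func_e field, per the text above


class CvoFunc(IntEnum):
    # Hypothetical encodings; the real values live in the ISA definition.
    EXP = 0
    SQRT = 1
    GELU = 2
    SIN = 3
    COS = 4
    REDUCE_SUM = 5
    SCALE = 6
    RECIP = 7


def decode(field: int, enable_mask: int):
    """Return the decoded function, or None if unassigned or synthesized out."""
    field &= (1 << FUNC_BITS) - 1
    try:
        func = CvoFunc(field)
    except ValueError:
        return None                      # one of the unused encodings
    # Assumption: bit i of the mask enables the function with encoding i.
    return func if (enable_mask >> func) & 1 else None


free_slots = (1 << FUNC_BITS) - len(CvoFunc)
no_trig_mask = 0xFF & ~((1 << CvoFunc.SIN) | (1 << CvoFunc.COS))
print(free_slots, decode(0, no_trig_mask), decode(3, no_trig_mask))
```

In this model a build that does not need RoPE could clear the SIN and COS bits, and any instruction using those encodings would then decode as unsupported.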