SFU Core (Complex Vector Operations)

The SFU (Special Function Unit) handles the Transformer’s non-linear operations — Softmax, GELU, RMSNorm, and RoPE. In the ISA, these appear under the CVO (Complex Vector Operation) opcode family.

1. Configuration

| Parameter | Value |
| --- | --- |
| Unit pipeline | 1 BF16 scalar / clk (streaming: a single operation works its way through IN_length elements, one element per cycle) |
| Instances | 1 (single CVO_top, shared by both slices) |
| Sub-units | CVO_sfu_unit (EXP / SQRT / GELU / RECIP / SCALE / REDUCE_SUM) + CVO_cordic_unit (SIN / COS) |
| Internal precision | BF16 / FP32 (dynamic promotion) |
| Supported functions | EXP, SQRT, GELU, SIN, COS, REDUCE_SUM, SCALE, RECIP |

2. Precision Promotion Strategy

Inputs and outputs stay in INT8 or BF16 to keep L2 bandwidth usage reasonable, but internal computation is promoted as follows.

| Function | Input | Internal | Rationale |
| --- | --- | --- | --- |
| CVO_EXP | BF16 | FP32 | Overflow protection (further mitigated by the sub_emax flag in softmax). |
| CVO_SQRT | BF16 | FP32 | Numerical stability for RMSNorm's 1/sqrt(mean(x²) + ε). |
| CVO_GELU | BF16 | BF16 | Approximation (tanh or rational) suffices. |
| CVO_SIN / COS | BF16 | FP32 | Prevents phase drift in RoPE. |
| CVO_REDUCE_SUM | BF16 / INT8 | FP32 | Minimizes long-running accumulation error. |
| CVO_SCALE | INT8 / BF16 | BF16 | Dequantize × scalar. |
| CVO_RECIP | BF16 | FP32 | Softmax denominator, layer-norm divisor. |
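The promotion pattern (narrow input, wider intermediates, narrow output) can be modelled in software. The sketch below is a hypothetical reference model, not the RTL: `to_fp32`, `to_bf16`, and `rsqrt_promoted` are illustrative names, and the round-to-nearest-even behaviour is an assumption about how results are demoted back to BF16.

```python
import math
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round to the nearest BF16 value (round-to-nearest-even on the FP32 bits).

    NaN/Inf handling is omitted; this helper is for finite demo inputs only.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias; ties go to even
    bits &= 0xFFFF0000                   # keep sign, exponent, top 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def rsqrt_promoted(x_bf16: float, eps: float = 1e-6) -> float:
    """BF16 in, FP32 intermediates, BF16 out (assumed model of a promoted 1/sqrt)."""
    s = to_fp32(x_bf16 + eps)
    root = to_fp32(math.sqrt(s))
    return to_bf16(to_fp32(1.0 / root))


print(to_bf16(257.0))                # 256.0: BF16 keeps only 8 significant bits
print(rsqrt_promoted(4.0, eps=0.0))  # 0.5
```

Rounding only once, at the output, is what keeps the BF16 interface from compounding error across intermediate steps.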

3. Implementation Techniques

3.1 CORDIC + LUT Hybrid

Inherits from v001’s CVO_cordic_unit.sv and CVO_sfu_unit.sv:

  • CORDIC: iterative-convergence operators — SIN, COS, SQRT, RECIP. 15–20 pipeline stages.

  • LUT + polynomial correction: EXP and GELU. The LUT produces a coarse estimate; a 2nd- or 3rd-order polynomial refines it.
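Both techniques can be sketched in a few lines of floating-point Python. This is an illustrative model only: the iteration count, the 16-entry table, the 2nd-order polynomial, and the function names are assumptions chosen for the example, and the actual units use their own fixed-point formats and table sizes.

```python
import math

# --- CORDIC (rotation mode) for sin/cos -------------------------------------
N_ITER = 16  # assumed; one iteration per pipeline stage
ATAN_TABLE = [math.atan(2.0 ** -i) for i in range(N_ITER)]
GAIN = 1.0
for i in range(N_ITER):
    GAIN /= math.sqrt(1.0 + 2.0 ** (-2 * i))  # pre-compensate the CORDIC gain


def cordic_sin_cos(angle: float) -> tuple[float, float]:
    """Return (sin, cos) for an angle in [-pi/2, pi/2] using shift-add rotations."""
    x, y, z = GAIN, 0.0, angle
    for i in range(N_ITER):
        d = 1.0 if z >= 0.0 else -1.0         # rotate toward z == 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ATAN_TABLE[i]
    return y, x


# --- LUT + polynomial correction for exp -------------------------------------
LUT_BITS = 4  # assumed table size: 16 entries
EXP2_LUT = [2.0 ** (j / (1 << LUT_BITS)) for j in range(1 << LUT_BITS)]


def exp_lut_poly(x: float) -> float:
    """exp(x) = 2**k * LUT[j] * poly(r): coarse LUT value, 2nd-order refinement."""
    t = x * math.log2(math.e)
    k = math.floor(t)                          # integer part becomes an exponent shift
    f = t - k                                  # fractional part in [0, 1]
    j = min(int(f * (1 << LUT_BITS)), (1 << LUT_BITS) - 1)
    r = (f - j / (1 << LUT_BITS)) * math.log(2.0)  # small residual for the polynomial
    poly = 1.0 + r + 0.5 * r * r
    return math.ldexp(EXP2_LUT[j] * poly, k)


s, c = cordic_sin_cos(0.5)
print(abs(s - math.sin(0.5)), abs(c - math.cos(0.5)))
print(abs(exp_lut_poly(1.0) - math.e) / math.e)
```

In this sketch the CORDIC error shrinks by roughly one bit per iteration, while the LUT path trades table size against polynomial order.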

3.2 Reduction

CVO_REDUCE_SUM uses the accumulator inside CVO_sfu_unit (FP32 internally, per the promotion table in Section 2) to sum one element per cycle, serially, for IN_length cycles, then emits the scalar sum. When upstream data arrives as a 32-wide GEMV output, the GEMV reduction tree has already collapsed it to a scalar, so the SFU only needs to accumulate and normalise. The softmax denominator is the canonical use case.
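To illustrate why accumulator width matters for a serial reduction, the sketch below models the one-element-per-cycle loop with a pluggable rounding step. It is a hypothetical software model: `reduce_sum_stream` and the rounding helpers are illustrative names, and round-to-nearest-even is assumed.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round to the nearest BF16 value (round-to-nearest-even; finite inputs only)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def reduce_sum_stream(values, acc_round):
    """Accumulate one element per 'cycle', rounding the accumulator each step."""
    acc = 0.0
    for v in values:
        acc = acc_round(acc + v)
    return acc


ones = [1.0] * 512
print(reduce_sum_stream(ones, to_bf16))  # 256.0: 256 + 1 rounds back to 256 in BF16
print(reduce_sum_stream(ones, to_fp32))  # 512.0
```

With only 8 significant bits, a BF16 running sum stops growing once the increment falls below half a unit in the last place, which is the accumulation error a wider accumulator avoids.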

3.3 Softmax Fast Path

Softmax decomposes into a 3-instruction sequence.

sequenceDiagram
  autonumber
  participant D as Dispatcher
  participant S as SFU
  participant E as e_max reg
  D->>S: CVO_REDUCE_SUM (flags.findemax)
  S->>E: update e_max
  D->>S: CVO_EXP (flags.sub_emax)
  S->>S: exp(x − e_max)
  D->>S: CVO_RECIP × CVO_SCALE
  S-->>D: softmax(x)

The findemax and sub_emax flags perform online max-normalization in hardware, so the software layer never has to run a separate scan pass.
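A functional reference for the sequence might look like the sketch below. It assumes e_max holds the maximum input value (the text does not spell out whether the register stores the element maximum or only its exponent), and it places the denominator sum between the EXP and RECIP steps, which is also an assumption. `softmax_sequence` is an illustrative name, and the arithmetic is plain Python floats rather than BF16/FP32.

```python
import math


def softmax_sequence(x):
    """Model the softmax fast path step by step (assumed semantics)."""
    e_max = max(x)                           # findemax: capture the running maximum
    exps = [math.exp(v - e_max) for v in x]  # CVO_EXP with sub_emax: exp(x - e_max)
    denom = sum(exps)                        # reduction of the exponentials
    inv = 1.0 / denom                        # CVO_RECIP
    return [e * inv for e in exps]           # CVO_SCALE by the reciprocal


# Large logits would overflow a naive exp(x); subtracting e_max keeps every
# exponent argument at or below zero.
probs = softmax_sequence([1000.0, 1001.0, 1002.0])
print(probs, sum(probs))
```

Because every argument to exp is non-positive after the subtraction, the largest exponential is exactly 1 and the denominator is at least 1.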

4. Pipeline Integration

The SFU is wired to the GEMV core through a direct FIFO, enabling two tight loops.

  • GEMV → SFU: Attention’s Q·K^T softmax and the FFN’s GEMV → GELU hand-offs skip the L2 round trip.

  • SFU → GEMV: the softmax output is multiplied with V immediately, again without going through L2.

The SFU supports async execution (the async_e ISA field) so the controller can dispatch the next instruction without waiting for completion. Completion notifications are handled by fsmout_npu_stat_collector (inherited from v001).

5. Physical Placement

The floorplan (see Physical Floorplan) reserves a single SFU instance near the centre, accessible to both slices. Siting it adjacent to the GEMV cores keeps the direct FIFO between GEMV and SFU as short as possible.

6. Scalability

Adding a function to the CVO table (the cvo_func_e enum; see Per-Instruction Encoding) requires:

  1. Extending the cvo_func_e enum (4 bits total, 8 of 16 used so far).

  2. Adding the corresponding CORDIC / LUT block inside the SFU.

  3. Updating the Dispatcher’s decode table.

The hardware function slots are gated by a generate parameter, SFU_ENABLE_MASK, so unused functions can be dropped from synthesis to save logic LUTs.
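The slot accounting and mask gating can be modelled as below. The numeric encodings are hypothetical (the text gives only the 4-bit width and the 8 functions in use), as is the assumption that SFU_ENABLE_MASK carries one bit per function code; `CvoFunc`, `free_slots`, `is_enabled`, and `mask_without` are illustrative names.

```python
from enum import IntEnum

FUNC_BITS = 4  # width of the function field, per the text


class CvoFunc(IntEnum):
    # Hypothetical encoding order; the real values live in cvo_func_e.
    EXP = 0
    SQRT = 1
    GELU = 2
    SIN = 3
    COS = 4
    REDUCE_SUM = 5
    SCALE = 6
    RECIP = 7


def free_slots() -> int:
    """Number of unused encodings left in the 4-bit function field."""
    return (1 << FUNC_BITS) - len(CvoFunc)


def is_enabled(func: CvoFunc, enable_mask: int) -> bool:
    """True if the function's bit is set in the (assumed one-bit-per-code) mask."""
    return bool((enable_mask >> func) & 1)


def mask_without(*dropped: CvoFunc) -> int:
    """Build a mask with every defined function enabled except the dropped ones."""
    mask = sum(1 << f for f in CvoFunc)
    for f in dropped:
        mask &= ~(1 << f)
    return mask


no_rope = mask_without(CvoFunc.SIN, CvoFunc.COS)  # e.g. a build without RoPE support
print(free_slots(), hex(no_rope), is_enabled(CvoFunc.SIN, no_rope))
```

A decode table built from the same enum could reject opcodes whose mask bit is clear, keeping the Dispatcher consistent with what was synthesized.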