DSP48E2 W4A8 Bit Packing and Sign Recovery

pccx v002 executes two independent MACs per DSP48E2 slice by packing two weights into a 23-bit-shifted layout on Port A. This page documents the mathematical rationale, how negative lower-channel results are handled, and the RTL-level implementation.

Note

This technique applies uniformly to both the GEMM systolic array and the GEMV cores, and doubles the theoretical integer throughput of the DSP48E2. Widening the weights to W5 or W6 eats into the guard band and shrinks N_max sharply, so W4A8 is the sweet spot on KV260.

1. Problem Statement

DSP48E2 provides a single 27-bit × 18-bit multiply with 48-bit accumulation.

\[P[47:0] \leftarrow P[47:0] + A[26:0] \times B[17:0]\]

In W4A8, the weight W is 4-bit and the activation A is 8-bit, leaving substantial unused bit space on both Port A and Port B. We exploit that slack to pack two independent MACs into a single DSP via dual-channel packing.

2. Bit-Packing Layout

DSP48E2 Port A / Port B W4A8 dual-channel packing layout

Figure 8 Figure W4A8-Layout. Port A (27 bits) holds W₁ in the top 4 bits and W₂ in the bottom 4 bits, with 19 guard bits between them. Port B (18 bits) holds activation A₁ in the bottom 8 bits and sign-extends it (A₁[7]) through the upper 10 bits.

2.1 Port A — Dual Weight (27 bits)

Bit range

Content

Description

A[26:23]

W₁ (upper channel)

4-bit signed weight. Aligned at 2²³ so the MSB maps to the sign bit.

A[22:4]

Guard bits

19 zero-filled bits. Absorb overflow from the lower channel’s accumulation.

A[3:0]

W₂ (lower channel)

4-bit signed weight.

2.2 Port B — Shared Activation (18 bits)

Bit range

Content

Description

B[7:0]

A₁

8-bit signed activation.

B[17:8]

Sign extension guard

Sign-extended from A₁[7] to preserve the multiplication’s sign correctness.

2.3 Result Separation

The Port A × Port B product naturally splits into two regions:

\[A \times B = (W_1 \cdot 2^{23} + W_2) \times A_1 = (W_1 \cdot A_1) \cdot 2^{23} + (W_2 \cdot A_1)\]
  • Upper 25 bits acc[47:23]: W₁ · A₁ accumulates here.

  • Lower 23 bits acc[22:0]: W₂ · A₁ accumulates here.

3. Safe Accumulation Limit

The maximum number of cycles the lower 23-bit region can safely accumulate before crossing into the guard band is derived below.

\[|W_2 \cdot A_1|_{\max} = 2^3 \cdot 2^7 = 2^{10}\]
\[N_{\max} = \frac{2^{22}}{2^{10}} = \mathbf{4{,}096 \ \text{cycles}}\]

(We budget up to 2²² to leave headroom for the result’s sign bit.)

During those 4,096 cycles the guard band protects the upper channel. By the same argument, the upper 25-bit region safely accumulates W₁ · A₁ in parallel.

Important

Layers with K > 4,096 must be split (K-split) into multiple accumulation sessions. The split unit is decided at the driver / compiler level.

4. Negative Accumulation and the Borrow Effect

When the lower channel’s result is negative, the two’s-complement hardware of DSP48E2 sign-extends through the 48-bit register. This produces a “borrow” from the upper bit region.

4.1 Mathematical Analysis

Let the total lower-channel sum be -X. In 48-bit two’s complement:

\[ACC = \Big(\sum U_i\Big) \cdot 2^{23} - X\]

Rewriting against the 23-bit boundary:

\[ACC = \Big(\big(\sum U_i\big) - 1\Big) \cdot 2^{23} + \Big(2^{23} - X\Big)\]

So:

  • Lower 23 bits acc[22:0]: 2²³ X — the correct two’s complement negative value.

  • Upper 25 bits acc[47:23]: (∑Uᵢ) 1 — off by 1.

Key observation

The data is not lost. The upper subtraction is an exact, mathematically correct borrow from sign extension, and can be reversed in a single post-processing cycle.

4.2 Sign Recovery Logic

The lower channel’s sign is read directly from a single bit: acc[22].

  • acc[22] == 1: lower channel was negative → upper lost 1 → restore.

  • acc[22] == 0: lower channel non-negative → no restoration needed.

Recovery runs exactly once, in the final extraction cycle after all accumulation completes.

Sign-recovery Verilog

Figure 9 Figure Sign-Recover. On the accumulate_done cycle the upper channel adds acc[22] back in; the lower channel keeps its 23-bit two’s-complement value as-is.

// Sign bit of the lower 23-bit accumulation (1: negative, 0: non-negative)
wire lower_sign = acc[22];

// Executed only on the accumulate_done cycle
always @(posedge clk) begin
    if (accumulate_done) begin
        // Upper channel: add 1 back if the lower channel was negative
        PE[1][2] <= acc[47:23] + lower_sign;
        // Lower channel: keep the 23-bit two's-complement result
        PE[1][1] <= acc[22:0];
    end
end

5. Full Operation Sequence

        flowchart LR
  PACK["Weight Pack<br/>{W₁ | guard | W₂}"] -->|Port A 27b| DSP[DSP48E2<br/>M·P reg]
  ACT["Activation Stream<br/>A₁ + sign ext"] -->|Port B 18b| DSP
  DSP -->|48-bit ACC| ACCR{"N ≤ 4,096?"}
  ACCR -- yes --> LOOP[Continue Accumulate]
  LOOP --> DSP
  ACCR -- done --> SREC["Sign Recovery<br/>acc[47:23] + acc[22]"]
  SREC --> OUT[/Upper ch · Lower ch/]
    

6. Design Benefits

Aspect

Effect

DSP utilization

1 DSP = 2 MAC → theoretical peak doubled.

Extra logic cost

One 1-bit adder plus a 23-bit comparator per PE.

Pipeline impact

Recovery runs once per accumulation, so throughput is unaffected.

Generality

Applies to any signed 4-bit × 8-bit multiplication, not just W4A8.

7. Limits and Trade-offs

  • No W-width extension: pushing to W5 or W6 runs out of guard bits and collapses N_max. Given DSP48E2’s shape, W4A8 is optimal on KV260.

  • Assumes signed activations: if A is unsigned, the guard-bit strategy shifts; the current design assumes INT8 signed activations.

  • Shared scale constraint: W₁ and W₂ are always multiplied by the same A, so the technique only helps when both channels share a common output location. This aligns naturally with GEMM’s M-axis tiling.

See also

The PE implementations that use this technique live in GEMM Core (Systolic Array) and GEMV Core. For the resulting system-level speedup, see the “3.125×” analysis in Design Rationale: v001 → v002.