# Gemma 3N — Attention and RoPE Constraints
Gemma 3N makes two decisions about the attention block that diverge sharply from the standard Transformer recipe. Both of them simplify the pccx v002 instruction stream, so they are worth stating explicitly.
## 1. No Attention Scaling, No Softcap
In a textbook attention block, the Q · Kᵀ result is divided by
\(\sqrt{d_{head}}\). Earlier Gemma revisions additionally applied a
softcap (usually 50.0) on both the attention logits and the
final logits.
Gemma 3N drops both of these.
| | Standard Transformer | Gemma 3N |
|---|---|---|
| Attention score | \(Q \cdot K^{T} / \sqrt{d_{head}}\) | \(Q \cdot K^{T}\) |
| Softcap | \(50 \cdot \tanh(x / 50)\) before softmax | None |
| Final-logit cap | optional | \(30 \cdot \tanh(x / 30)\) once, at the very end |
```python
import numpy as np

# Wrong (old recipe):
# attn_weights = np.dot(Q, K.T) / np.sqrt(256)   # scale by 1/sqrt(d_head)
# attn_weights = softcap(attn_weights, 50.0)     # 50 * tanh(x / 50)

# Gemma 3N: raw scores, no scaling, no softcap
attn_weights = np.dot(Q, K.T)
```
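The one cap that survives, the final-logit tanh from the table above, can be sketched as follows (the function name is illustrative, not from the model code):

```python
import numpy as np

def final_logit_softcap(logits: np.ndarray, cap: float = 30.0) -> np.ndarray:
    # 30 * tanh(x / 30): a smooth clamp to (-30, 30), applied once at the very end
    return cap * np.tanh(logits / cap)
```

Because tanh is the identity near zero, small logits pass through almost unchanged; only extreme values are squashed toward ±30.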
**Note: Hardware consequence on pccx v002.** The softmax sequence collapses from four CVO invocations to three:
```
GEMV flags.findemax=1           ; Q · Kᵀ, track e_max
CVO CVO_EXP flags.sub_emax=1    ; exp(score - e_max)
CVO CVO_REDUCE_SUM              ; Σ exp → scalar
CVO CVO_SCALE flags.recip_scale=1 ; divide each exp by the sum
```

No extra CVO_SCALE before CVO_EXP, no softcap tanh in the middle.
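In NumPy terms, the three CVO steps implement the usual max-subtracted softmax; a minimal sketch (names are illustrative, not part of the ISA):

```python
import numpy as np

def pccx_softmax(scores: np.ndarray) -> np.ndarray:
    e_max = scores.max()           # e_max tracked during the GEMV (findemax)
    exps = np.exp(scores - e_max)  # CVO_EXP with sub_emax
    total = exps.sum()             # CVO_REDUCE_SUM
    return exps / total            # CVO_SCALE with recip_scale
```

Subtracting e_max before exponentiating keeps every exponent ≤ 0, so the unscaled Q · Kᵀ scores cannot overflow the exp.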
## 2. Dynamic Alternating RoPE θ

Rotary Position Embedding normally uses one fixed theta_base (10 000 or 1 000 000) for every layer. Gemma 3N instead alternates between the two values in a fixed 5-layer cycle.
### 2.1 The 5-Layer Pattern
[Local, Local, Local, Local, Global], repeated.
| Layer slot | Role | theta_base | Receptive field |
|---|---|---|---|
| 0, 1, 2, 3, 5, 6, 7, 8, … | Local | 10 000 | Short-range syntax |
| 4, 9, 14, 19, 24, 29, 34 | Global | 1 000 000 | Long-range semantic |
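The slot assignment reduces to a modulo test; a hypothetical helper:

```python
def theta_base_for_layer(layer: int) -> int:
    # Slot 4 of every 5-layer cycle is global (1,000,000); the rest are local
    return 1_000_000 if layer % 5 == 4 else 10_000
```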
### 2.2 Visualization
```mermaid
block-beta
    columns 5
    L0["Layer 0\nLocal\n(10,000)"]
    L1["Layer 1\nLocal\n(10,000)"]
    L2["Layer 2\nLocal\n(10,000)"]
    L3["Layer 3\nLocal\n(10,000)"]
    L4["Layer 4\nGlobal\n(1,000,000)"]
    L5["Layer 5\nLocal\n(10,000)"]
    L6["Layer 6\nLocal\n(10,000)"]
    L7["Layer 7\nLocal\n(10,000)"]
    L8["Layer 8\nLocal\n(10,000)"]
    L9["Layer 9\nGlobal\n(1,000,000)"]
    E1["..."]
    E2["..."]
    E3["..."]
    E4["..."]
    E5["..."]
```
2.3 Hardware Consequence on pccx v002¶
The
theta_baseis a per-layer constant, not per-token. It can be preloaded into the Constant Cache with a singleMEMSETat the start of each layer.The
CVO_SIN/CVO_COSkernels only need the phase \(pos \cdot \omega_j\) where \(\omega_j = \theta^{-2j/d_{head}}\). These frequency tables are precomputed on host andMEMCPY’d once at boot for both θ values.target_K/target_Vare pulled from the correct cache slot (layer 19 for global θ, layer 18 for local θ) in the cross-layer sharing regime — see Gemma 3N E4B — Operator-Level Pipeline §4.B-4.
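The host-side precompute can be sketched as follows. Assumptions: d_head = 256 (implied by the sqrt(256) in §1) and illustrative names; only the formula \(\omega_j = \theta^{-2j/d_{head}}\) comes from the text.

```python
import numpy as np

D_HEAD = 256  # implied by the sqrt(256) scaling in the old recipe

def rope_freq_table(theta_base: float, d_head: int = D_HEAD) -> np.ndarray:
    # omega_j = theta ** (-2j / d_head) for j = 0 .. d_head/2 - 1
    j = np.arange(d_head // 2)
    return theta_base ** (-2.0 * j / d_head)

# One table per theta value, MEMCPY'd once at boot
local_freqs = rope_freq_table(10_000.0)
global_freqs = rope_freq_table(1_000_000.0)

def rope_phase(pos: int, freqs: np.ndarray) -> np.ndarray:
    # Phase fed to CVO_SIN / CVO_COS: pos * omega_j
    return pos * freqs
```

Note that the larger global θ shrinks the higher frequencies, which is what stretches the rotation period and gives the global layers their long-range reach.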
## 3. Combined Effect on Tokens per Second

The two simplifications together remove one CVO_SCALE and one CVO_TANH per attention block, i.e. two CVO invocations per layer. Over the 35 layers of Gemma 3N E4B, that is 70 CVO invocations saved per decode step. At the decoding target of 20 tok/s, this frees roughly 2–3 % of wall-clock time in the SFU budget.
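A back-of-envelope check of the count, using only figures stated above:

```python
LAYERS = 35            # Gemma 3N E4B layer count
REMOVED_PER_LAYER = 2  # one CVO_SCALE + one CVO_TANH per attention block
TOKENS_PER_S = 20      # decoding target

saved_per_step = REMOVED_PER_LAYER * LAYERS       # CVO invocations per decode step
saved_per_second = saved_per_step * TOKENS_PER_S  # at the 20 tok/s target
```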
**See also**

- Operator spec: Gemma 3N E4B — Operator-Level Pipeline §4.B.
- CVO function codes: Per-Instruction Encoding §4.