# Gemma 3N E4B — Overview
pccx v002 is sized to run Gemma 3N E4B at 20 tok/s on a bare-metal Kria KV260. Before diving into the operator-level pipeline, this page fixes the key dimensions and enumerates the deviations from a “standard” decoder-only Transformer that the hardware has to accommodate.
## 1. Model Dimensions
| Quantity | Value | Notes |
|---|---|---|
| Hidden dim | 2048 | Main residual stream width. |
| FFN intermediate | 16384 | 8× expansion. |
| Number of layers | 35 | 35 decoder blocks. |
| Attention heads | 8 Q / 2 KV | Grouped-Query Attention, 4:1 ratio. |
| Head dim | 256 | |
| Vocab size | 262,400 | |
| Patch/router dims | | PLE patch embedding and AltUp router. |
| Streams (AltUp) | 4 | |
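As a quick sanity check on the table above, the projection widths implied by the GQA layout can be derived directly from the head counts. This is a minimal sketch (variable names are illustrative, not from the pccx toolchain):

```python
# Model dimensions from the table above.
HIDDEN = 2048
N_Q_HEADS = 8
N_KV_HEADS = 2
HEAD_DIM = 256

# The query projection spans all 8 heads; K/V span only the 2 shared KV heads.
q_proj_out = N_Q_HEADS * HEAD_DIM       # 2048, matches the hidden dim
kv_proj_out = N_KV_HEADS * HEAD_DIM     # 512, the KV width seen in section 3
group_ratio = N_Q_HEADS // N_KV_HEADS   # 4 query heads share each KV head

print(q_proj_out, kv_proj_out, group_ratio)  # 2048 512 4
```

The 512-wide K/V output is the same width that reappears in the KV-cache shapes in section 3.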
## 2. What Makes Gemma 3N Non-standard
Gemma 3N departs from the textbook decoder in five places. Each one has a direct consequence for how pccx v002 schedules its instructions.
| Feature | Behavior | Hardware Consequence |
|---|---|---|
| AltUp 4-stream | Four parallel residual streams; the shadow streams receive depth-dependent updates while the main stream stays clean. | Four copies of |
| Alternating RoPE θ | 5-layer cycle | θ is a per-layer constant preloaded via |
| No attention scaling / softcap | Attention score is | One fewer |
| LAuReL parallel branch | Low-rank side path combined with the attention output, then divided by | Two tiny GEMVs ( |
| PLE shadow injection | Per-Layer Embedding is injected only into | Main stream path is never polluted by PLE; the scheduler keeps PLE activity off the critical path. |
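The alternating-θ row can be made concrete with a small sketch. Two assumptions are made here, both hedged: the 5-layer cycle places the global-RoPE layer last in each group (consistent with layer 19 being the global layer and layer 18 the local layer in the KV-sharing scheme of section 3), and the local/global base frequencies are the 10,000 / 1,000,000 values published for the Gemma 3 family:

```python
# Hedged sketch: map each decoder layer to its RoPE base frequency theta.
# ASSUMPTIONS: a 4-local / 1-global repeating pattern with the global layer
# last in each 5-layer cycle, and Gemma-3-family theta values; the real
# preload table on pccx v002 may differ.
THETA_LOCAL = 10_000.0
THETA_GLOBAL = 1_000_000.0
N_LAYERS = 35

def rope_theta(layer: int) -> float:
    """Per-layer constant the hardware preloads before running each block."""
    return THETA_GLOBAL if layer % 5 == 4 else THETA_LOCAL

thetas = [rope_theta(i) for i in range(N_LAYERS)]
```

Under these assumptions, layer 18 gets the local θ and layer 19 the global θ, and 7 of the 35 layers are global.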
Full math for each of these is in the following pages:

- Gemma 3N — Attention and RoPE Constraints — scaling removal and dynamic θ.
- Gemma 3N — LAuReL and PLE Calibration Modules — LAuReL scaling and PLE injection rules.
- Gemma 3N — FFN Gaussian Top-K Sparsity — Gaussian Top-K gate on early layers.
## 3. Cross-Layer KV Sharing
Gemma 3N does not store a KV entry in every layer. Of the 35 layers:
- Layers 0–19 store their own K/V in the cache.
- Layers 20–34 reuse the caches of layer 18 (local RoPE) or layer 19 (global RoPE), depending on the 5-layer cycle.
Concretely, `K_cache` and `V_cache` are sized `[20, max_seq, 512]` rather than `[35, max_seq, 512]`. This is the main reason the KV footprint budget in KV Cache Optimization Strategy lists ~40 KB per token, not 70 KB.
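The ~40 KB figure follows directly from these shapes. A quick sketch of the arithmetic, plus the layer-to-cache mapping implied by the rules above (the function name is illustrative; the cycle-position rule for picking layer 18 vs 19 is an assumption derived from the description above):

```python
# Per-token KV footprint with cross-layer sharing, FP16 baseline.
KV_WIDTH = 512          # 2 KV heads x head_dim 256, per K (and per V)
FP16_BYTES = 2
CACHED_LAYERS = 20      # layers 0-19 store their own K/V
TOTAL_LAYERS = 35

per_layer = 2 * KV_WIDTH * FP16_BYTES   # K + V for one token: 2048 bytes
shared = CACHED_LAYERS * per_layer      # 40,960 bytes ~= 40 KB per token
unshared = TOTAL_LAYERS * per_layer     # 71,680 bytes ~= 70 KB per token

def kv_source_layer(layer: int) -> int:
    """Cache slot a layer reads from. ASSUMPTION: layers 20-34 pick
    layer 19 when they sit at the global position of the 5-layer cycle,
    and layer 18 otherwise."""
    if layer < CACHED_LAYERS:
        return layer
    return 19 if layer % 5 == 4 else 18
```

Dropping the 15 sharing layers from the cache is exactly the 70 KB → 40 KB saving quoted above.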
## 4. Datatype Map
How each tensor type lands on pccx v002 compute:
| Tensor | Storage | Compute path | Notes |
|---|---|---|---|
| Weights (Q / K / V / O / FFN) | INT4 packed | Systolic Array (GEMM) or GEMV Core | W4 + per-channel scale. |
| Activations (hidden, Q / K / V) | INT8 on L2 | Same, after preprocess | Promoted to BF16 only through the SFU. |
| KV cache | FP16 (baseline), INT8/INT4 recommended | MEMCPY host ↔ L2 | |
| AltUp / LAuReL / PLE scales | FP32 (host) → BF16 (device) | SFU | Small vectors, amortized. |
| Logits | FP32 on host | Post-processing on CPU | Top-P / temperature happen outside the NPU. |
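To make the "W4 + per-channel scale" row concrete, here is a minimal sketch of unpacking two signed INT4 weights per byte and dequantizing one output channel. The nibble order and helper names are assumptions for illustration, not the pccx v002 wire format:

```python
def unpack_int4_pair(byte: int) -> tuple[int, int]:
    """Split one byte into two signed 4-bit weights.
    ASSUMPTION: low-nibble-first packing; the real layout may differ."""
    lo = byte & 0x0F
    hi = (byte >> 4) & 0x0F
    # Sign-extend from 4 bits: raw values 8..15 map to -8..-1.
    sign = lambda v: v - 16 if v >= 8 else v
    return sign(lo), sign(hi)

def dequant_row(packed: bytes, scale: float) -> list[float]:
    """Dequantize one output channel: each W4 value times that
    channel's FP scale factor."""
    out = []
    for b in packed:
        lo, hi = unpack_int4_pair(b)
        out.extend((lo * scale, hi * scale))
    return out

# Example: byte 0x1F holds raw nibbles 0xF (-> -1) and 0x1 (-> +1).
row = dequant_row(bytes([0x1F]), 0.5)  # [-0.5, 0.5]
```

Because dequantization is just a per-channel multiply, it folds naturally into the output path of the Systolic Array or GEMV Core rather than requiring a separate pass.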
## 5. Where to Go Next
- Full operator-level spec (embedding → sampling): Gemma 3N E4B — Operator-Level Pipeline.
- Instruction-level mapping and scheduling: Gemma 3N E4B on pccx v002 — Execution and Scheduling.
- Baseline x64 CPU reference: llm-lite.