Roadmap (Two-Track)¶

As of 2026-04-20, pccx is developed along two parallel tracks. v002 is the currently active architecture; v003 targets the next-generation model (Gemma 4 E4B) on the same KV260 platform. The two tracks share RTL assets — sparse weight fetcher, SSD dispatcher, tree mask generator, EAGLE training pipeline.

The long-term stretch goal is an Auto-Porting Pipeline α — a compiler that emits a pccx ISA stream from an arbitrary transformer config. It begins in Year 2 once both tracks are stable.

1. Common assumptions ¶

Platform: Xilinx Kria KV260 (Zynq UltraScale+ ZU5EV), bare-metal
Quantization: W4A8 (INT4 weights × INT8 activations), KV cache INT4
Clock: AXI 250 MHz, core 400 MHz
VLIW ISA: 5 base opcodes (GEMV, GEMM, MEMCPY, MEMSET, CVO) + SPEC extensions
L2 URAM 2.25 MB budget: 1.0 MB activation pinning, 0.5 MB KV prefetch, 0.25 MB tree mask + scheduler state, 0.5 MB hot-attention tile

2. Track 1 — v002 Extended (Gemma 3N E4B, 20 tok/s)¶

Layers FFN sparsity and a speculative-decoding stack on top of the active v002 to reach the promised 20 tok/s measured throughput.

Tiered targets ¶

Tier	tok/s	Gate
Baseline	5–6	Phase A–F done
Viable	10–12	Phase G + H done
Promise	20	Phase G–K done
Stretch	25+	Add Tree EAGLE (Phase J)

Phase plan ¶

Phase	Weeks	Main work	Target tok/s
A–F	1–26	Re-parameterization → driver → Gemma 3N app → verification → synthesis → board bring-up	5–6
G	27–30	All-layer Gaussian Top-K sparsity (BW 1.95 → 1.36 GB/token)	8–9
H	31–32	Gemma 3 1B vanilla drafter (fast path, tokenizer-compatible)	11–14
H+	33–38	EAGLE-3 head training & swap-in ($20–30 on Vast.ai RTX 4090)	14–16
I	39–42	SSD async overlap (draft/verify pipelining)	17–19
J	43–46	Tree EAGLE (optional, stretch)	20–23
K	47–49	Final tuning & official benchmark	20

Why the schedule changed: the original Phase H trained EAGLE-3 first. Current plan attaches Gemma 3 1B as a vanilla drafter first (fast path) and swaps in a trained EAGLE-3 head only if the measured acceptance rate is insufficient.

Decision points & fallbacks ¶

Week	Condition	Action
26	baseline < 5 tok/s	Root-cause analysis, hold G–K
36	EAGLE acceptance < 2.0×	Run Phase I only, drop J
40	< 15 tok/s	Phase J becomes mandatory
47	< 20 tok/s	Settle for 15–18 tok/s, push 20 tok/s to v003

3. Track 2 — v003 (Gemma 4 E4B, 12–15 tok/s)¶

Gemma 4 E4B (42 layers, MQA, sliding + full attention, 128 K context) on the same KV260 platform. Reuses v002 RTL assets to cut Phase 2+ implementation cost by ~30 %.

Tiered targets ¶

Tier	tok/s	Gate
Minimum viable	10	Phase 2 done
Acceptable	12	Phase 3 done
Target	12–15	Phase 5 done
Stretch	15+	Add experimental techniques (e.g. DEER)

Phase plan ¶

Phase	Weeks	Main work	Target tok/s
1	16–26	Foundation — extend `quantize_and_save`, vocab trim 262K → 50K, re-parameterize RTL	7
2	27–34	EAGLE-3 linear chain baseline ($30–50 or TRC TPU if granted)	10
3	35–39	Tree EAGLE verify (acceptance 3.5–4×)	12
4	40–43	SSD async overlap (reuse v002 Phase I RTL)	13–14
5	44–52	P-EAGLE + LTD (dynamic K, RL policy)	15

Cross-track asset reuse ¶

        flowchart LR
  subgraph v002["v002 Extended — Gemma 3N E4B"]
    A[A–F baseline] --> G[G: sparsity] --> H[H/H+: EAGLE-3] --> I[I: SSD] --> J[J: Tree] --> K[K: 20 tok/s benchmark]
  end
  subgraph v003["v003 — Gemma 4 E4B"]
    P1[1: foundation] --> P2[2: EAGLE linear] --> P3[3: Tree] --> P4[4: SSD] --> P5[5: P-EAGLE + LTD]
  end
  G -. sparse weight fetcher .-> P1
  H -. EAGLE training pipeline .-> P2
  J -. tree mask generator .-> P3
  I -. SSD dispatcher .-> P4

4. Integrated timeline (52 weeks)¶

Week	v002	v003
1–15	Phase A–C	—
16–18	Phase D	Phase 1 starts
19–26	Phase E–F (baseline 5–6 tok/s)	Phase 1 continues
27–30	Phase G	Phase 2 starts
31–38	Phase H / H+	Phase 2 completes
39–42	Phase I	Phase 3
43–46	Phase J (optional)	Phase 4
47–49	Phase K (20 tok/s benchmark)	Phase 4 continues
50–52	—	Phase 5 (15 tok/s)

Total 52 weeks (~12 months) assuming full-time solo work. Double the duration for a part-time schedule.

5. Compute budget ¶

Item	Cost	Window
Vast.ai / RunPod sign-up	$10 minimum deposit	Week 0
v002 Phase H+ EAGLE-3 (Gemma 3N)	$20–30	Week 33–38
v002 Phase J EAGLE tree variant	$10–15	Week 43–46
v003 EAGLE-3 (Gemma 4)	$30–50 (or $0 with TRC)	Week 27–34
Total	$70–100 ($40 with TRC)	—

Submit TRC TPU application at Phase 0 (1–2 week approval).
First 3–4 weeks need no cloud compute — local development only.
GPU becomes necessary starting at Phase H+.

6. Year 2 — Auto-Porting Pipeline α (stretch)¶

Kicks off once v002 + v003 are stable. Takes an arbitrary transformer config.json + weight safetensors and emits C code that calls the pccx driver API, plus a quantized weight binary.

Technical shape ¶

        flowchart TD
  CFG[Transformer config.json] --> P[Config Parser]
  P --> R[Architecture Resolver]
  R --> F[Special Feature Detector]
  F --> I[ISA Code Generator]
  I --> W[Weight Layout Planner]
  W --> C[C Stub Generator]
  C --> V[Validator: Python golden vs RTL]

Target priority ¶

Tier 1: regenerate hand-ported models — Gemma 3N E4B, Gemma 4 E4B
Tier 2: standard architectures — Llama 3.x, Qwen3, Mistral 7B
Tier 3: complex architectures — DeepSeek-V3 (MoE), Gemma 4 26B A4B (MoE), Phi-3/4

Schedule (Week 53+)¶

Week 53–76 (24 weeks / 6 months), full-time:

Week 53–58 — Parser + Resolver + Gemma 3N/4 regeneration check
Week 59–64 — Tier 2 support (Llama, Qwen, Mistral)
Week 65–70 — Feature plugin system + MoE support
Week 71–76 — E2E automation, web UI / CLI, docs

Publication angle ¶

“Auto-Compilation of Transformer Inference Workloads to Custom NPU ISAs” — target venues ISCA / MICRO / HPCA / FCCM / FPGA.

7. Milestones ¶

Year 1 KPIs ¶

Week 26 — Coherent Gemma 3N E4B decode on board, 5+ tok/s
Week 38 — EAGLE-3 Gemma 3N checkpoint released on HF (first public)
Week 47 — Gemma 3N E4B 20 tok/s officially measured ← promise met
Week 52 — Gemma 4 E4B at 12+ tok/s
Blog post / paper draft (v002 results)

Year 2 KPIs (Auto-Porting α)¶

Week 76 — Llama 3.1 8B auto-generated + running on KV260
Year 2 end — 5+ model families supported
Academic publication

8. RTL repository ¶

Both tracks are implemented in hwkim-dev/pccx-FPGA-NPU-LLM-kv260. At v002 freeze the codes/v002/ snapshot is pinned into this docs repo (see the version-cutover ceremony), then v003 branches off.

Document version: 2026-04-20, first edition. Compiled from local plan drafts (pccx_master_roadmap_final.md, pccx_v002_extended_20toks_plan.md, tinynpu_v003_gemma4_e4b_plan.md). Next update: at v002 Phase F completion (~ Week 26).