Analyzer API¶
The TraceAnalyzer trait is the single abstraction every analysis
in pccx-lab implements. Consumers (UI / CLI / Copilot / extensions)
never call individual roofline::analyze() / bottleneck::detect()
free functions anymore — they iterate the registry and pattern-match
on AnalysisPayload variants.
Trait¶
pub trait TraceAnalyzer: Send + Sync {
fn id(&self) -> &'static str; // stable id
fn display_name(&self) -> &'static str; // menu label
fn description(&self) -> &'static str; // one-sentence blurb
fn analyze(&self, trace: &NpuTrace, hw: &HardwareModel) -> AnalysisReport;
}
Purity requirement: analyze() must not mutate external state or
issue IPC. The registry reserves the right to parallelise calls.
Return shape¶
pub struct AnalysisReport {
pub analyzer_id: String, // == analyzer's id()
pub summary: String, // ≤ 500 chars — LLM-ready
pub payload: AnalysisPayload,
}
pub enum AnalysisPayload {
Roofline(RooflinePointPayload),
RooflineHierarchical(Vec<RooflineBand>),
Bottleneck(Vec<BottleneckInterval>),
Custom(serde_json::Value), // for new analyzers without a typed variant
Text, // summary is the whole result
}
Custom is the escape hatch — experimental analyzers can carry any
JSON-serialisable payload without modifying the core enum.
Built-in registry¶
id |
display |
payload |
what it reports |
|---|---|---|---|
|
Roofline |
|
Williams/Waterman AI vs. achieved GOPS |
|
Hierarchical Roofline |
|
Per-memory-tier dwell (Ilic 2014, Yang 2020) |
|
Bottleneck Windows |
|
Sliding-window contention detector |
|
DMA Utilisation |
|
Read/write/MAC cycle share, starved vs. saturated flags |
|
Stall Histogram |
|
Log-bucket distribution of |
|
Per-Core Throughput |
|
MAC share per |
|
Latency Distribution |
|
P50/P95/P99 of |
|
Power Estimate |
|
Dynamic W + % of KV260 7 W TDP; thermal-bound flag |
|
KV-Cache Pressure |
|
Per-decode KV footprint vs L2 URAM; flags HBM spill |
|
Prefill/Decode Phase |
|
Tags spans as prefill/decode/idle; wall-clock split |
|
Arithmetic Intensity Trend |
|
Rolling-window AI (ops/byte); flags prefill→decode collapse |
|
DMA Burst Efficiency |
|
AXI effective BW after overhead amortisation |
|
Matryoshka Subnet |
|
Gemma-3N E2B/E4B swap savings projection |
|
MoE Expert Sparsity |
|
Per-expert activation rate; flags over-sparse routing |
|
FlashAttention Tile Efficiency |
|
Share of MAC bursts meeting tile threshold |
|
Dual-HP-Port Balance |
|
HP0/1 vs HP2/3 traffic share; one-port starvation |
Research lineage¶
The post-2025 analyzers are grounded in recent literature so the UVM Copilot can cite evidence when narrating findings. See the Research lineage page for the complete auto-generated citation table.
Running everything¶
use pccx_core::{analyze_all, HardwareModel, NpuTrace};
let hw = HardwareModel::pccx_reference();
for report in analyze_all(&trace, &hw) {
println!("[{}] {}", report.analyzer_id, report.summary);
}
Writing a new analyzer¶
use pccx_core::analyzer::{TraceAnalyzer, AnalysisReport, AnalysisPayload};
use pccx_core::{NpuTrace, HardwareModel};
pub struct DmaLatencyAnalyzer;
impl TraceAnalyzer for DmaLatencyAnalyzer {
fn id(&self) -> &'static str { "dma_latency" }
fn display_name(&self) -> &'static str { "DMA Latency Distribution" }
fn description(&self) -> &'static str {
"Distribution of DMA_READ durations; flags long-tail outliers."
}
fn analyze(&self, trace: &NpuTrace, _hw: &HardwareModel) -> AnalysisReport {
let durs: Vec<u64> = trace.events.iter()
.filter(|e| e.event_type == "DMA_READ")
.map(|e| e.duration)
.collect();
let max = durs.iter().copied().max().unwrap_or(0);
let mean = if durs.is_empty() { 0.0 }
else { durs.iter().sum::<u64>() as f64 / durs.len() as f64 };
AnalysisReport {
analyzer_id: self.id().to_string(),
summary: format!("{} DMA reads · mean {:.0} cy · max {} cy",
durs.len(), mean, max),
payload: AnalysisPayload::Text,
}
}
}
To ship it:
Append
Box::new(DmaLatencyAnalyzer)tobuiltin_analyzers().Re-export from
lib.rsif consumers will construct the type directly.Add a unit test in
analyzer::tests.(Optional) Add a
Citationentry incore/src/research.rs.cargo test -p pccx-coreshould pass.
The Copilot picks up the new analyzer through the registry; the UI
sees it via the same invoke("run_all_analyzers", ...) Tauri command.
Summary formatting convention¶
Keep
summary≤ 500 chars.Start with the metric:
"AI 0.69 ops/byte · 321.4 GOPS (0% of peak) · memory-bound".Reserved uppercase prefixes drive the UI severity colour:
error:/HBM-SPILL/OVER-SPARSE/PORT-STARVED/TILE-BOUND/AI-COLLAPSE/OVERHEAD-BOUND→ redTHERMAL-BOUND/PREFILL-BOUND/DECODE-BOUND/SWAP-E2B→ amber
The Copilot joins all summaries into one LLM system prompt; this is why each one must stand on its own without reader context.