10 queries. Real codebase.
A standing benchmark: 10 hand-curated queries on a real production codebase (~12k Rust, ~4k TypeScript, and ~2k Python files). A pass means the tool returns the exact symbol set the human author would expect. The bench harness ships with Mneme.
From 2/10 to ~6/10.
Genesis closes the recall gap by shipping three real symbol resolvers (Rust, TypeScript, Python). Same queries, same corpus, dramatic improvement.
Chart legend: Mneme v0.4.0, Mneme v0.3.2, CRG, ripgrep.
All ten. Side by side.
A representative sample of what AI coding agents actually ask in practice. Each row is scored against the known-correct answer set.
WorkerPool::spawn called? | ✓ exact | ✓ exact | ~ text
spawn() | ✓ exact | ~ partial | ✗ miss
processOrder? | ✓ exact | ✓ exact | ✗ miss
health.rs | ✓ exact | ✓ exact | ✗ miss
__init__ with super().__init__ | ~ partial | ✓ exact | ~ text
BaseHandler | ✓ exact | ✓ exact | ~ text
impl Display for * | ✓ exact | ~ partial | ~ text
tracing::info! | ✗ miss | ✗ miss | ✗ miss
get across hierarchy | ~ partial | ✓ exact | ✗ miss
How we measure.
Real production codebase. Hand-authored queries. Reproducible harness. No cherry-picking — every result published with its scoring rule.
Corpus
~12k Rust files, ~4k TypeScript, ~2k Python. ~14k canonical symbols, ~50k call-graph edges.
Queries
10 hand-authored queries representing what AI coding agents actually ask. Each has a known-correct answer set provided by the original authors.
Scoring
Pass = returns the exact correct symbol set. Partial = returns a subset of the correct set, with some symbols missing. Miss = wrong or empty answer.
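The scoring rule above can be sketched in a few lines of Python. This is an illustration only, not the harness that ships with Mneme: the names Score, score_query and pass_count are hypothetical, and treating any result that contains a symbol outside the expected set as a miss is an assumption based on the "wrong answer" wording.

```python
from enum import Enum


class Score(Enum):
    PASS = "pass"
    PARTIAL = "partial"
    MISS = "miss"


def score_query(returned: set[str], expected: set[str]) -> Score:
    """Classify one query result against its known-correct symbol set."""
    if returned == expected:
        return Score.PASS
    # Non-empty proper subset: nothing wrong was returned, but some symbols are missing.
    if returned and returned < expected:
        return Score.PARTIAL
    # Empty result, or at least one symbol outside the expected set (assumed to be a miss).
    return Score.MISS


def pass_count(results: list[tuple[set[str], set[str]]]) -> int:
    """Count exact passes over (returned, expected) pairs, i.e. the N in N/10."""
    return sum(score_query(r, e) is Score.PASS for r, e in results)


# Hypothetical symbol names, for illustration only.
expected = {"pool::WorkerPool::spawn", "pool::spawn_blocking"}
print(score_query({"pool::WorkerPool::spawn"}, expected).value)
```

Under these assumptions only exact set equality counts toward the headline score; partial results are reported but do not raise it.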
Reproducibility
The bench harness ships with Mneme. Run mneme bench in any indexed project to score it locally.
To 10/10. v0.5 targets.
Genesis closes four gaps, moving the score from 2/10 to ~6/10. Four remain, each with a stated target. The roadmap is public; gaps are not hidden.
→ Macro-expanded Rust
Symbol resolution through tracing macros, derive impls, async-trait. Targeted for v0.5.
→ Python overload hierarchies
Better MRO resolution, ABCs, mixin chains. Targeted for v0.5.
→ Cross-language FFI
TS calling Rust over WASM boundaries, Python ctypes. Research target.
→ Dynamic dispatch
Trait objects, virtual methods. Hardest; involves runtime data. Long-term research.
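To make the Python hierarchy target above concrete, here is a small sketch of why resolving a method such as get across a mixin chain needs the full method resolution order (MRO). It uses only standard Python; the classes Base, CacheMixin and Handler and the helper defining_class are hypothetical, and nothing here describes how Mneme's resolver is implemented.

```python
class Base:
    def get(self, key):
        return f"Base.get({key})"


class CacheMixin:
    def get(self, key):
        # super() here resolves to whatever follows CacheMixin in the
        # *instance's* MRO, which is not knowable from this class alone.
        return f"CacheMixin.get -> {super().get(key)}"


class Handler(CacheMixin, Base):
    pass


def defining_class(cls, name):
    """Return the first class in cls's MRO that defines `name` in its own body."""
    for klass in cls.__mro__:
        if name in vars(klass):
            return klass
    return None


print([c.__name__ for c in Handler.__mro__])
print(defining_class(Handler, "get").__name__)
print(Handler().get("k"))
```

A call to Handler().get lands in CacheMixin first and reaches Base only through super(), so a resolver that looks at a single class definition would report an incomplete symbol set.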