Graph-Native Assimilation Engine
CONCEPT:KG-2.7 - Research Assimilation (+ KG-2.5 synergy, KG-2.10 orchestration synthesis)
Package: agent_utilities/knowledge_graph/assimilation/ · Driver: research/golden_loop.py · MCP: graph_orchestrate(action="assimilate") · Strategy/plan: .specify/specs/ecosystem-evolution/
Why
The first evolution pass read every paper with an LLM, found synergies by hand, and answered "does this already exist?" by re-reading the corpus. None of that scales, and the pass repeatedly re-proposed features that were already built. The assimilation engine turns the hard parts into graph operations over a single Evidence/Capability Knowledge Graph, so the LLM is used only at the edges (per-source extraction; plan synthesis from a neighborhood). Cost grows with the delta, not the corpus.
The graph
Sources (Article/Source/Document/Requirement/Decision), capabilities
(SDDFeature/Capability), and our Concepts, linked by:
SIMILAR_TO (dedup) · SUPERSEDES (duplicate/older) · SATISFIED_BY (a feature our
code already provides) · HAS_SYNERGY_WITH (cross-pillar bundle) ·
DERIVED_FROM_RESEARCH / ASSIMILATED_INTO (provenance close-out) ·
ADDRESSED_BY (an in-flight plan). Edges carry a _rel property marker so the
lifecycle read path is backend-portable (out_edges/in_edges expose properties,
not the relationship label). Type matching is case-insensitive (the live graph
stores capitalized labels like Article; our enum values are lowercase).
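The two conventions above can be sketched as follows, assuming an in-memory edge list rather than the real backend: the relationship name is mirrored into a _rel property so it can be read from edge properties alone, and node types are compared case-insensitively. Edge, make_edge, out_edges_by_rel, and type_matches are hypothetical names for illustration, not the package's API.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    """Hypothetical in-memory edge; the real backend is not shown here."""
    source: str
    target: str
    properties: dict = field(default_factory=dict)


def make_edge(source: str, target: str, rel: str, **props) -> Edge:
    # Mirror the relationship name into a `_rel` property so readers that only
    # see edge properties (not the relationship label) can still filter by it.
    return Edge(source, target, {"_rel": rel, **props})


def out_edges_by_rel(edges: list, node_id: str, rel: str) -> list:
    # Filter outgoing edges using the property marker only.
    return [
        e for e in edges
        if e.source == node_id and e.properties.get("_rel") == rel
    ]


def type_matches(stored_label: str, enum_value: str) -> bool:
    # Stored labels may be capitalized ("Article"); enum values are lowercase.
    return stored_label.casefold() == enum_value.casefold()


edges = [
    make_edge("feat:1", "concept:7", "SATISFIED_BY"),
    make_edge("feat:1", "feat:2", "HAS_SYNERGY_WITH"),
    make_edge("feat:3", "feat:1", "SUPERSEDES"),
]
satisfied = out_edges_by_rel(edges, "feat:1", "SATISFIED_BY")
```

Because the filter never looks at a relationship label, the same read path would work on any backend that returns edge properties.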
Pipeline (all graph compute except where noted)
flowchart LR
ING["ingest (content-addressed)<br/>papers · OSS · repos · docs/chat"] --> DEDUP
DEDUP["dedup<br/>SIMILAR_TO + SUPERSEDES"] --> GAP
GAP["gap (auto_satisfy)<br/>SATISFIED_BY · open_features()"] --> SYN
SYN["synergy + rank<br/>HAS_SYNERGY_WITH · PageRank"] --> PLAN
PLAN["plan synthesis (LLM)<br/>from KG neighborhood"] --> CLOSE
CLOSE["close-out on implement<br/>DERIVED_FROM_RESEARCH · ASSIMILATED_INTO"]
| Stage | Module | What it does |
|---|---|---|
| ingest | ingest.py, breadth_ingest.py | docs→Requirement, chat→Decision, codebases via IngestionEngine; canonical_source_id collapses arxiv/DOI/URL/path dupes; content_fingerprint per-item skip |
| dedup | dedup.py | embedding all-pairs (engine compute_similarity_edges fast path / local fallback) → SIMILAR_TO; cluster → SUPERSEDES survivor→dup |
| gap | gap_analysis.py | match features↔our concepts → SATISFIED_BY; open_features = no closing edge/status (the "stop rediscovering" filter) |
| synergy + rank | synergy.py | Louvain communities (engine / components fallback) → cross-pillar HAS_SYNERGY_WITH; rank_features = source_count × (1+centrality) (PageRank / degree fallback) |
| plan synthesis | plan_synthesis.py | hydrate_feature neighborhood → SDD plan (planner role; grounded-template fallback) → propose + flip feature to proposed |
| ledger / close-out | ledger.py | record_feature/set_status; on implement close_out writes provenance edges + status → permanently excluded |
| pilot | pilot.py | acceptance: asserts no already-built feature is re-proposed + emits ranked gaps |
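The dedup and rank rows can be sketched in a few lines. This is an illustration under stated assumptions, not the engine's implementation: the 0.9 cosine threshold, the choice of the lexicographically smallest id as cluster survivor, and normalized degree as the centrality measure (standing in for PageRank, as in the table's degree fallback) are all assumptions, and every function name here is hypothetical.

```python
import math
from itertools import combinations


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similar_pairs(embeddings, threshold=0.9):
    # All-pairs comparison; each pair at or above the (assumed) threshold
    # would become a SIMILAR_TO edge.
    return [
        (a, b)
        for a, b in combinations(sorted(embeddings), 2)
        if cosine(embeddings[a], embeddings[b]) >= threshold
    ]


def supersedes_edges(pairs):
    # Union-find over SIMILAR_TO pairs; each cluster keeps one survivor and
    # yields SUPERSEDES edges as (survivor, duplicate).
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Assumption: the lexicographically smallest id survives.
            parent[max(ra, rb)] = min(ra, rb)
    return sorted((find(n), n) for n in parent if find(n) != n)


def rank_scores(source_count, edges):
    # score = source_count * (1 + centrality), using normalized degree.
    degree = {n: 0 for n in source_count}
    for a, b in edges:
        for n in (a, b):
            if n in degree:
                degree[n] += 1
    denom = max(len(source_count) - 1, 1)
    scores = {n: source_count[n] * (1 + degree[n] / denom) for n in source_count}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy data, made up for illustration.
emb = {"paper:a": [1.0, 0.0], "paper:b": [0.99, 0.05], "paper:c": [0.0, 1.0]}
pairs = similar_pairs(emb)
dups = supersedes_edges(pairs)
ranked = rank_scores({"f1": 3, "f2": 1, "f3": 2}, [("f1", "f2"), ("f2", "f3")])
```

In the toy data, paper:a and paper:b collapse into one cluster with paper:a as survivor, and f1 ranks first because its source count outweighs the higher centrality of f2.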
Idempotency (the "don't re-hit it" guarantee)
- Ingest is content-addressed: unchanged source = no-op (content_fingerprint); duplicate URIs collapse (canonical_source_id).
- Cycle carries a state watermark ((id, status, content_hash) of feature/source nodes); golden_loop skips the assimilate pass when the graph is unchanged (force overrides).
- Lifecycle excludes satisfied/superseded/implemented/in-flight features from open_features, so nothing built is re-proposed. Dedup/satisfy/synergy edges MERGE → re-runs converge.
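The idempotency checks above can be sketched as follows. This is a minimal illustration, not the package's code: the normalization rules in canonical_id, the use of SHA-256, and the in-memory seen map are assumptions, and every function name is hypothetical (the real canonical_source_id and content_fingerprint may behave differently). The arXiv id in the example is a placeholder.

```python
import hashlib
import re
from typing import Optional


def canonical_id(uri: str) -> str:
    # Assumed normalization: collapse arXiv abs/pdf URLs (with optional
    # version suffix) to one id, and DOI resolver URLs to the bare DOI.
    u = uri.strip()
    m = re.search(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})(?:v\d+)?", u, re.I)
    if m:
        return f"arxiv:{m.group(1)}"
    m = re.search(r"doi\.org/(10\.\d{4,9}/\S+)", u, re.I)
    if m:
        return f"doi:{m.group(1).lower()}"
    return u.rstrip("/")


def fingerprint(text: str) -> str:
    # Content-addressed: identical text always hashes to the same value.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def ingest(seen: dict, uri: str, text: str) -> bool:
    """Return True if the source was (re)ingested, False if it was a no-op."""
    key, fp = canonical_id(uri), fingerprint(text)
    if seen.get(key) == fp:
        return False  # unchanged source: skip
    seen[key] = fp
    return True


def watermark(nodes: list) -> str:
    # Hash the sorted (id, status, content_hash) triples of feature/source
    # nodes; an equal watermark means the graph state is unchanged.
    triples = sorted((n["id"], n["status"], n["content_hash"]) for n in nodes)
    return hashlib.sha256(repr(triples).encode("utf-8")).hexdigest()


def should_run(previous: Optional[str], nodes: list, force: bool = False) -> bool:
    # Skip the pass when nothing changed, unless force is set.
    return force or previous != watermark(nodes)


seen = {}
first = ingest(seen, "https://arxiv.org/abs/0000.00000v2", "body")
second = ingest(seen, "https://arxiv.org/pdf/0000.00000", "body")
nodes = [{"id": "f1", "status": "open", "content_hash": "abc"}]
wm = watermark(nodes)
```

Here the second ingest call is a no-op because both URLs reduce to the same canonical id and the text is unchanged, and should_run(wm, nodes) stays false until a node's status or content hash changes.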
Running it
# programmatic
from agent_utilities.knowledge_graph.research.golden_loop import run_assimilation_pass
run_assimilation_pass(synthesize=True, top_n=10)
# MCP
graph_orchestrate(action="assimilate") # dedup→gap→synergy→rank
graph_orchestrate(action="assimilate", task="synthesize") # + propose plans
# autonomous daemon (golden-loop tick) — env-gated
KG_LOOP=1 KG_LOOP_BREADTH=1 \
KG_BREADTH_LIBRARY_ROOTS=/path/to/open-source-libraries \
KG_BREADTH_REPO_ROOTS=/path/to/agent-packages
# breadth + acceptance pilot CLI
python scripts/run_assimilation_breadth.py ingest --libraries … --repos … --pilot
Monitoring
Every run_one_cycle carries a metrics block (per-stage timings, error_count,
open_gaps, total duration), logs a structured health line, surfaces stage errors,
and persists a queryable EvolutionCycle node. To watch error rates and latencies
and tune ingestion, query:
MATCH (c) WHERE c.type='orchestration_cycle' RETURN c.error_count, c.stage_ms
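Once rows come back from a query like the one above, error rate and per-stage latency can be summarized in a few lines. A minimal sketch, assuming each row is a dict with an integer error_count and a stage_ms mapping of stage name to milliseconds; that row shape and the summarize helper are assumptions for illustration, and the sample values are made up.

```python
from collections import defaultdict
from statistics import mean


def summarize(rows: list) -> dict:
    # Count cycles that reported at least one stage error.
    failing = sum(1 for r in rows if r.get("error_count", 0) > 0)
    # Collect per-stage timings across cycles.
    per_stage = defaultdict(list)
    for r in rows:
        for stage, ms in (r.get("stage_ms") or {}).items():
            per_stage[stage].append(ms)
    return {
        "cycles": len(rows),
        "error_rate": failing / len(rows) if rows else 0.0,
        "mean_stage_ms": {s: mean(v) for s, v in per_stage.items()},
    }


# Made-up sample rows, shaped like the assumed query result.
rows = [
    {"error_count": 0, "stage_ms": {"dedup": 100, "gap": 40}},
    {"error_count": 2, "stage_ms": {"dedup": 300, "gap": 60}},
]
summary = summarize(rows)
```

A rising error_rate or a single stage dominating mean_stage_ms would be the signal to revisit ingestion settings.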
Related
- In-House Training Substrate · Global Workspace Attention · MASS
- Strategy: .specify/specs/ecosystem-evolution/ASSIMILATION_STRATEGY.md; plan: …/PHASE0_IMPLEMENTATION_PLAN.md.