KG Connectors, Ingestors & Enrichers — the unified ingestion architecture¶
One entrypoint, one provenance contract, one delta model. Every external system the Knowledge Graph knows about — enterprise apps, code, documents, research — flows in through the same mechanism and is enriched by the same OWL/RDF reasoning. This is the map of all of them. (CONCEPT:KG-2.9)
This page is the canonical inventory and architecture for how the KG is
hydrated. The connector list at the bottom is auto-generated from the live
registries (scripts/generate_connector_map.py) so it never drifts.
1. The one mental model¶
flowchart LR
subgraph SRC["External systems (~40+ connectors)"]
EA["Enterprise apps\nLeanIX/ServiceNow/ERPNext/Jira/…"]
CODE["Code\nGitLab/GitHub repos"]
PROC["Process\nCamunda/ARIS/Egeria/ArchiMate"]
DOC["Documents & web\nArchiveBox/crawl4ai/scholarx/search"]
end
subgraph CORE["agent-utilities — one ingestion core"]
SS["source_sync() · sweep_all_sources()\nTHE entrypoint (delta / full / reconcile)"]
PROV["stamp_source()\nsource_system + domain (provenance)"]
DELTA["write-layer content-hash delta\nskip unchanged → no write, no re-reason"]
WB["write_entities() — THE one writer\n(ingest_external_batch + write_batch\nare thin adapters over it)"]
end
subgraph EG["epistemic-graph (Rust)"]
PARSE["tree-sitter parse + resolve\nIndexRepository (ast_hash = content hash)"]
STORE["LPG + Neo4j/FalkorDB/Stardog/pg-age/fanout"]
end
subgraph ENR["Enrichers"]
OWL["OWLBridge reasoning\ntransitive :calls/:dependsOn, crosswalks"]
EXTR["extractors (code/test, facts, process)"]
end
EA & PROC & DOC --> SS
CODE --> PARSE --> SS
SS --> PROV --> DELTA --> WB --> STORE
STORE --> OWL --> STORE
WB --> EXTR --> STORE
STORE -->|writeback sinks| SRC
Three things are deliberately uniform across every connector:
- One entrypoint —
sync_source(engine, source, mode)(and its fleet-wide siblingsweep_all_sources). No connector hydrates ad hoc. - One provenance contract —
stamp_source()stampssource_system+domainon every row, so named-graph routing, federation, and mirroring treat all connectors identically. - One delta model — see §4.
- One writer —
core/materialization.write_entities()is the single materialization implementation. The two historical write paths (ingest_external_batch, dict entities; andwrite_batch, typedExtractionBatchfor the materialize/extractor fleet) are now thin input adapters over it with zero duplicated logic, so provenance, the content-hash delta, and typed-label batching are implemented once. Sinceexecute/execute_batchare@abstractmethodonGraphBackend(every backend provides them), the writer has just two branches: UNWIND MERGE (all backends) and a per-row MERGE variant for Ladybug (Kuzu has no UNWIND). The schema helpers (normalize_label/schema_valid_keys/set_clause) also live here once — the engine's_normalize_label/_get_set_clausedelegate to them.
2. The standardized surface (3 MCP tools → clear roles)¶
The Python core was always unified (sync_source is "the single entrypoint").
The MCP surface is now standardized to match:
| MCP tool | Role | Delegates to |
|---|---|---|
source_sync |
Canonical connector→KG ingestion. source=<name> or source="all" (fleet sweep); mode=delta\|full\|reconcile. |
sync_source / sweep_all_sources |
graph_hydrate |
Back-compat alias (full mode). Kept so existing callers don't break. | sync_source(mode="full") |
graph_ingest |
Different concern: content ingestion — paths, URLs, documents, codebases, corpus/job control. Its sync/materialize_source actions delegate to the same core. |
sync_source / run_materialize_source |
REST twins live under /api/dashboard/ (hydrate/{source}, hydrate,
hydration-status, daemon/start).
Rule of thumb: sync a system → source_sync; ingest a file/URL/repo path
→ graph_ingest.
3. The three ingestion paths (how a connector gets in)¶
A connector participates in one or more of these, dispatched by sync_source:
flowchart TD
S["sync_source(engine, source, mode)"] --> A{source in\n_DELTA_HANDLERS?}
A -->|yes| D["delta handler\nwatermark poll + reconcile\n(leanix / gitlab / archivebox)"]
A -->|no| B{source in\nMATERIALIZE_SOURCES?}
B -->|yes| M["run_materialize_source\nvendor client → extractor → write_batch\n(camunda / egeria / okta / …)"]
B -->|no| C["HydrationManager.hydrate_source\ngeneric full hydrate via CAPABILITY_REGISTRY"]
D & M & C --> P["stamp_source → content-hash delta → write"]
- Delta handlers (
_DELTA_HANDLERS) — native incremental sync with a per-source watermark (SourceSyncStatenode) + reconcile (tombstone upstream deletions). The most efficient path. - Materialize extractors (
MATERIALIZE_SOURCES) — an in-process vendor client + extractor maps the system to BFO/PROV-O entities, persisted viawrite_batch, followed by one OWL reasoning cycle. - Capability hydrate (
CAPABILITY_REGISTRY) — the generic full-hydrate fallback for any registered source that hasn't grown a delta handler yet.
Plus a fourth, document-oriented path: MCP_TOOL_PRESETS declarative
connectors that pull records/files/search results as Documents through the
generic McpToolSourceConnector (used by graph_ingest/build_skill_graph).
4. Delta for every connector (the optimization)¶
"Delta-focused ingestion for all connectors" is two layers — and the second is what makes it universal:
(a) Fetch-layer watermark (per-source, opportunistic). Where the source API
supports "changed since", the delta handler stores the max updatedAt/
last_activity_at/created_at on a SourceSyncState node and fetches only the
delta next run. Today: LeanIX, GitLab, ArchiveBox.
(b) Write-layer content-hash delta (generic, all connectors). At the single
write fan-in (ingest_external_batch), every entity gets a stable content_hash
over its semantic properties. Before writing, stored hashes are read in one
batched round-trip and unchanged entities are dropped — no MERGE, no
re-reasoning — even when the source was fetched in full. This is what makes a
full re-mirror cheap and turns every connector incremental regardless of whether
its API supports watermarks. Disable with KG_WRITE_DELTA=0.
flowchart LR
E["incoming entities"] --> H["content_hash each\n(id + volatile timestamps excluded)"]
H --> Q["batch read stored hashes\n(MATCH … WHERE n.id IN $ids)"]
Q --> F{hash changed?}
F -->|yes / new| W["MERGE + re-reason"]
F -->|no| K["skip (skipped_unchanged++)"]
Leveraging Rust epistemic-graph. For code, the content hash is free: the
tree-sitter parser already emits a content-stable ast_hash and uses it as the
symbol:<hash> node id, so "which symbols changed" is answered by node existence
(HasNodesBatch) with zero extra compute. IndexRepository resolves an entire
repo's :calls/:dependsOn in one parallel (rayon) pass off-reactor. The
generic write-layer delta extends that same content-hash idea to every non-code
connector.
5. Background ingestion across the board¶
A single host-role daemon runs skill_scheduler every 60s, reading
deploy/schedules.yml. The fleet sweep is one declarative entry:
- name: all-sources-delta-sweep
cron: "*/20 * * * *"
kind: skill
ref: all # → sync_source(engine, "all", mode="delta") → sweep_all_sources
action: delta
enabled: true
sweep_all_sources(mode="delta") enumerates the union of delta handlers +
configured capability sources + materialize extractors and syncs each,
isolating per-connector failures (unconfigured → skipped, not errored). With
the write-layer delta, each 20-minute pass is proportional to what changed.
Per-source entries (e.g. a nightly LeanIX reconcile, or a tighter cadence for a
hot source) still live alongside it when a source needs its own schedule.
6. Enrichers (what happens after the write)¶
Ingestion is only half the story — the KG's differentiator is that everything lands in one ontology and is reasoned over together:
- OWLBridge reasoning — transitive
:calls/:dependsOn/:covers, cross-vendor process crosswalks,:Featureclustering; runs as a cycle after materialize and on the Loop. (core/owl_bridge.py,ontology_*.ttl) - Extractors —
code_test(symbols/tests →:Code/:Test), the document fact extractor (text → atomic fact edges), process lift (Camunda/ARIS → ArchiMate). - Writeback sinks — the outbound half: KG intelligence is pushed back into
the source systems (issues, CMDB CIs, fact-sheet attributes). High-stakes sinks
are propose-only via the ProposalQueue. (
enrichment/writeback/sinks/)
See also: KG as Bidirectional ETL Hub, Content-Aware Ingestion, Code Intelligence, Vendor-Neutral Enterprise Ontology, Camunda + ARIS ↔ KG.
7. Connector inventory¶
Auto-generated — do not edit by hand. Run python scripts/generate_connector_map.py.
50 distinct connectors across the ingestion/enrichment paths: 3 delta handlers · 34 capability-hydrate · 23 materialize extractors · 28 writeback sinks · 27 document-ingest presets.
Connector × path matrix¶
in = ingests into the KG · out = writes KG intelligence back to the system.
| Connector | Delta (in) | Hydrate (in) | Materialize (in) | Writeback (out) |
|---|---|---|---|---|
ansible |
— | — | ✅ | ✅ |
archimate |
— | — | ✅ | ✅ |
archivebox |
✅ | — | — | — |
aris |
— | ✅ | ✅ | — |
caddy |
— | ✅ | ✅ | ✅ |
camunda |
— | — | ✅ | — |
capability |
— | — | — | ✅ |
databases |
— | ✅ | — | — |
egeria |
— | — | ✅ | ✅ |
emerald |
— | — | ✅ | ✅ |
emerald_exchange |
— | ✅ | — | — |
enterprise_architecture |
— | ✅ | — | — |
erpnext |
— | ✅ | ✅ | ✅ |
essential_ea |
— | ✅ | — | — |
github |
— | ✅ | — | ✅ |
gitlab |
✅ | ✅ | — | ✅ |
glpi |
— | ✅ | — | — |
homeassistant |
— | — | ✅ | ✅ |
issue_tracking |
— | ✅ | — | — |
jira |
— | ✅ | — | ✅ |
kafka |
— | — | ✅ | ✅ |
keycloak |
— | ✅ | ✅ | ✅ |
langfuse |
— | ✅ | — | — |
leanix |
✅ | ✅ | — | ✅ |
legal |
— | — | — | ✅ |
lgtm |
— | ✅ | ✅ | ✅ |
listmonk |
— | ✅ | — | — |
mattermost |
— | ✅ | — | — |
mealie |
— | — | ✅ | ✅ |
message_protocol |
— | ✅ | — | — |
microsoft |
— | — | ✅ | — |
nextcloud |
— | ✅ | ✅ | ✅ |
okta |
— | — | ✅ | ✅ |
openbao |
— | ✅ | — | — |
openmaint |
— | ✅ | — | — |
plane |
— | ✅ | — | ✅ |
portainer |
— | ✅ | ✅ | ✅ |
postiz |
— | ✅ | — | — |
process |
— | — | — | ✅ |
process_modeling |
— | ✅ | — | — |
relational_database |
— | ✅ | — | — |
salesforce |
— | — | ✅ | ✅ |
scholarx |
— | ✅ | — | — |
servicenow |
— | ✅ | ✅ | ✅ |
source_control |
— | ✅ | — | — |
technitium_dns |
— | ✅ | ✅ | ✅ |
tunnel_manager |
— | ✅ | — | — |
twenty |
— | ✅ | ✅ | ✅ |
uptime_kuma |
— | ✅ | ✅ | ✅ |
wger |
— | — | ✅ | ✅ |
Document-ingest presets (MCP_TOOL_PRESETS)¶
Declarative connectors that pull records/files/search-results as Documents through the generic McpToolSourceConnector:
archiveboxgithub-reposgitlab-issuesgitlab-merge-requestsharness-runskeycloak-usersmealie-recipesnextcloud-filesobjectstore-prefixokta-userspulselink-bilibilipulselink-exapulselink-githubpulselink-hackernewspulselink-newspulselink-redditpulselink-rsspulselink-v2expulselink-webpulselink-xpulselink-xiaohongshupulselink-xueqiupulselink-youtubesearxng-searchservicenow-tablesql-querysql-table