Skip to content

Harvesters, Cross-Links & Federation — Complete Map

The full reference for what egeria-mcp ingests, how the layers are cross-linked, and how it federates back into the knowledge graph. All layers are config-driven (read their connection from the environment), tolerant (skip with a clear message when a source/credential is absent), and idempotent (reconcile by qualifiedName).

Run order and dispatch: python -m egeria_mcp.harvest [layer … | all | reconcile | audit] or the egeria_harvest(layer) MCP tool. A full run does harvest → reconcile → audit. Confidentiality scale: 0 Unclassified · 1 Internal · 2 Confidential · 3 Sensitive · 4 Restricted.

1. Ingesters (34 source layers)

Infrastructure substrate

Layer Source Env Egeria asset(s) Conf Catalogs
hosts inventory file HOST_INVENTORY SoftwareServer 1 managed hosts from a YAML/JSON inventory (systems/tunnel-manager)
datastores declared estate EGERIA_HARVEST_TOPOLOGY SoftwareServer, RelationalDatabase per-store data-store servers + databases + DataFlow lineage (anchor layer)
containers Portainer/Swarm PORTAINER_URL,PORTAINER_API_KEY,PORTAINER_ENDPOINT_ID SoftwareServer (nodes), DeployedSoftwareComponent (services) 1 swarm nodes + services
dns Technitium DNS TECHNITIUM_DNS_URL,TECHNITIUM_DNS_TOKEN Collection (zone) authoritative DNS zones
proxy Caddy admin CADDY_ADMIN_URL DeployedSoftwareComponent (route) 1 reverse-proxy routed hosts + upstreams

Data stores

Layer Source Env Egeria asset(s) Conf Catalogs
documentdb MongoDB MONGODB_URI / MONGODB_HOST RelationalDatabase (db), DeployedDatabaseSchema (collection) 2 databases + collections (needs pymongo)
vectors Qdrant QDRANT_URL,QDRANT_API_KEY SoftwareServer (store), DeployedDatabaseSchema (collection) 1 vector collections
semantic Apache Jena Fuseki JENA_FUSEKI_URL,JENA_USERNAME/JENA_PASSWORD SoftwareServer (store), DeployedDatabaseSchema (dataset) 1 RDF datasets / named graphs
kafka Kafka REST proxy KAFKA_REST_URL,KAFKA_TOKEN DeployedDatabaseSchema (topic) 1 Kafka topics

Business & finance

Layer Source Env Egeria asset(s) Conf Catalogs
finance Firefly-III FIREFLY_URL,FIREFLY_TOKEN SoftwareServer (store), DeployedDatabaseSchema (account) 2 personal-finance accounts
markets emerald-exchange EMERALD_URL,EMERALD_TOKEN / EMERALD_PORTFOLIO SoftwareServer (store), DeployedDatabaseSchema (instrument) 2 financial instruments / holdings
crm Twenty CRM TWENTY_URL,TWENTY_TOKEN,TWENTY_API_PREFIX SoftwareServer (store), DeployedDatabaseSchema (record) 2–3 companies (2) + people (3, PII)
erpnext ERPNext/Frappe ERPNEXT_URL,ERPNEXT_TOKEN SoftwareServer, RelationalDatabase, DeployedDatabaseSchema (DocType) 1–3 business DocTypes; HR/payroll→3, PII/financial→2
mailing Listmonk LISTMONK_URL,LISTMONK_USER,LISTMONK_TOKEN DeployedDatabaseSchema (list) 2 mailing lists (PII)
files Nextcloud NEXTCLOUD_URL,NEXTCLOUD_USERNAME,NEXTCLOUD_PASSWORD DeployedDatabaseSchema (share) 2 shared files/folders
ml data-science DATA_SCIENCE_MCP_URL,DATA_SCIENCE_MCP_TOKEN DeployedSoftwareComponent (model), DeployedDatabaseSchema (dataset) 1–2 ML models (1) + training datasets (2)

Identity & governance

Layer Source Env Egeria asset(s) Conf Catalogs
identity Keycloak KEYCLOAK_URL,KEYCLOAK_TOKEN / KEYCLOAK_CLIENT_ID+KEYCLOAK_CLIENT_SECRET,KEYCLOAK_REALM Collection (realm), DeployedSoftwareComponent (client) 1 security realms + OIDC clients
secrets OpenBao/Vault OPENBAO_URL/VAULT_ADDR,OPENBAO_TOKEN/VAULT_TOKEN DeployedSoftwareComponent (engine 2, policy 1) 1–2 secret-engine mounts + ACL policy names (never values)
servicenow ServiceNow CMDB SERVICENOW_URL,SERVICENOW_USER+SERVICENOW_PASSWORD / SERVICENOW_TOKEN SoftwareServer (CI) 1 CMDB configuration items

Knowledge & enterprise architecture

Layer Source Env Egeria asset(s) Conf Catalogs
knowledge Confluence CONFLUENCE_URL,CONFLUENCE_USER,CONFLUENCE_TOKEN (or ATLASSIAN_AGENT_*) Collection (space) Confluence spaces
m365 Microsoft Graph MSGRAPH_TOKEN,MSGRAPH_URL Collection (site/group) SharePoint sites + M365 groups
iot Home Assistant HOME_ASSISTANT_URL,HOME_ASSISTANT_TOKEN SoftwareServer (hub), Collection (domain) 1 integration domains
archimate ArchiMate model ARCHI_MODEL_PATH layer-mapped (DeployedSoftwareComponent/Process/SoftwareServer/DeployedDatabaseSchema) 1 ArchiMate Open Exchange elements
leanix SAP LeanIX LEANIX_URL,LEANIX_API_TOKEN Collection (portfolio), DeployedSoftwareComponent/Process/… (fact sheet) 1 EA fact sheets

Automation, code & work

Layer Source Env Egeria asset(s) Conf Catalogs
automation Ansible/AWX Tower TOWER_URL,TOWER_TOKEN Collection (inventory), Process (job template) 1 inventories + job templates
processes Camunda 7 CAMUNDA7_URL / CAMUNDA_URL Process 1 BPMN process definitions
projects Plane / Jira PLANE_URL,PLANE_TOKEN,PLANE_WORKSPACE / JIRA_URL,JIRA_USER,JIRA_TOKEN Project work-tracking projects
gitlab GitLab GITLAB_URL,GITLAB_TOKEN DeployedSoftwareComponent (repo) by visibility repositories
github GitHub GITHUB_TOKEN,GITHUB_ORG DeployedSoftwareComponent (repo) by visibility repositories

Collaboration & observability

Layer Source Env Egeria asset(s) Conf Catalogs
chat Mattermost MATTERMOST_URL,MATTERMOST_TOKEN Collection (team) teams
observability Grafana GRAFANA_URL,LGTM_TOKEN/GRAFANA_TOKEN DeployedSoftwareComponent (datasource), Collection (dashboard) 1 data sources + dashboards
monitoring Uptime Kuma UPTIME_KUMA_URL,UPTIME_KUMA_TOKEN DeployedSoftwareComponent (monitor) 1 monitored targets (via /metrics)
archive ArchiveBox ARCHIVEBOX_URL,ARCHIVEBOX_API_KEY Collection (corpus), DeployedDatabaseSchema (snapshot) 1 web snapshots
llmops Langfuse LANGFUSE_BASE_URL,LANGFUSE_PUBLIC_KEY,LANGFUSE_SECRET_KEY DeployedDatabaseSchema (dataset) 1 LLM datasets

Excluded by design (lookup/search proxies with no owned inventory): scholarx, legal-peripherals. Excluded as non-governable: media/personal peripherals.

First-party ↔ open-source coverage

The federation is vendor-pluggable: a capability is served by a first-party tool, an open-source equivalent, or both — and every adapter catalogs into the same Egeria types, so reasoning is identical. egeria-mcp ships harvesters for both columns where the estate runs both:

Capability First-party harvester Open-source harvester
ITSM / service mgmt servicenow erpnext (DocTypes Issue/HD Ticket, capability=ITSM)
Enterprise architecture leanix archimate
VCS / code github gitlab
Productivity / knowledge m365, knowledge (Confluence) files (Nextcloud)
Project / work tracking projects (Jira) projects (Plane), erpnext (capability=PM)
BPM / process processes (Camunda), automation (Ansible)
Metadata / catalog Apache Egeria (the SoR)

ERPNext is intentionally multi-role — its capability property (ERP / ITSM / PM) lets the federation reconcile it with whichever first-party tool it substitutes for. The canonical capability map (every capability × first-party × open-source × canonical concept) lives in the agent-utilities vendor-neutral ontology doc.

Each matcher creates a labelled DataFlow edge (or, for P10, propagates confidentiality). Matching is deterministic; linking reads existing lineage first so re-runs add nothing.

# Pattern Match rule Edge label Layers joined
P1 host-hosting asset.additionalProperties.hostNode == Node.name hosts containers ↔ data/erp/finance
P3 service-store Service::*_<name>DataStore::<name> realizes containers ↔ datastores
P4 dataset-store Dataset::<store>::* parent → DataStore::<store> hosts within/across data
P11 source-store asset.source ∈ {Kafka,Qdrant,Jena,Firefly,Twenty,ERPNext,Emerald,HA} → its DataStore hosts data layers ↔ store layer
P5 ingress-exposure Route.upstream host == Service name routes-to proxy ↔ containers
P6 monitoring Monitor.name ≍ Route/Service/Node/Store monitors monitoring ↔ infra
P7 cmdb-identity CI::ServiceNow::<name> == infra name same-as servicenow ↔ infra
P8 access-control Keycloak Client::*::<id> ≍ Service secures identity ↔ containers
P12 repo-service Repository::*/<name> ≍ Service name deploys code ↔ infra
P16 datasource-store Datasource::Grafana::<name>DataStore::<name> reads observability ↔ data
P21 ea-realization ArchiMate/LeanIX element name ≍ Service/Store/Repo realized-by EA ↔ running reality
P9 semantic-assignment asset.displayName == Glossary Term means any ↔ glossary
P22 capability-cohort assets sharing a capability from ≥2 sources → a Capability::<cap> collection groups same-capability, cross-vendor (ITSM: ServiceNow+ERPNext; vcs: GitLab+GitHub; EA: LeanIX+ArchiMate)
P23 cross-vendor-identity same displayName, same capability, different source same-as the same entity tracked in two vendor tools
P10 confidentiality-propagation store hosts higher-classified data (raises store level) governance enrichment

Capability tagging. Each asset carries a canonical capability property (ERP/ITSM/PM on ERPNext, vcs on GitLab/GitHub, enterprise-architecture on LeanIX/ArchiMate, ITSM on ServiceNow, …); where untagged, reconcile derives it from the asset's source/prefix. P22/P23 use it to connect first-party and open-source adapters that serve the same capability, so a query spans both regardless of vendor.

The audit (audit() / egeria_audit) reports which assets remain unlinked islands and per-layer coverage % — the radar for which layers to harvest next.

3. Federation back into the KG (bidirectional)

EgeriaApi.list_data_flows() enumerates the catalogue's DataFlow edges (shared scanner lineage_scan.py, scanning the low-cardinality hubs). The agent-utilities egeria extractor turns them into KG edges:

Reconcile edge label KG edge type
routes-to, produces, consumes, harvest flows :flowsTo
hosts, realizes, secures, monitors, same-as, means, deploys, reads :dependsOn

Egeria asset typeName → KG node type: SoftwareServerServer, DeployedSoftwareComponentTool, RelationalDatabase/DeployedDatabaseSchemaDataObject, ProcessProcessModel, CollectionConcept, else DataConnector. Every node carries externalToolId (Egeria GUID) + domain="egeria" (federation key).

See architecture.md for the pipeline diagrams.