Skip to content

Knowledge Distillation → Skill-Graphs

CONCEPT:KG-2.7 (standardized ingestion) · CONCEPT:AHE-3.9 (physical distillation) CONCEPT:KG-2.90 (connector → skill synthesis) · CONCEPT:KG-2.91 (skill-synthesis ontology links) Packages: agent_utilities/knowledge_graph/distillation/ · agent_utilities/knowledge_graph/ingestion/ Engine: epistemic-graph GetSubgraph (batched subgraph read) MCP: graph_ingest(action="distill" | "import_pack") · CLIs: python -m agent_utilities.knowledge_graph.distillation.skill_graph_distiller, python -m agent_utilities.knowledge_graph.ingestion Skills (universal-skills): skill-graph-builder (generate_skill.py --from-kg), web-crawler (crawl.py --ingest-kg), knowledge-graph-ingest

Why

The KG already holds curated knowledge; skill-graphs are how that knowledge gets packaged, versioned, and shared with agents. Historically those were two disconnected worlds: a skill-graph was built by crawling a website into a reference/ markdown tree, while the KG was populated separately. This makes the KG the single source of truth and a skill-graph a versioned, round-trippable projection of a KG subgraph — so a slice of knowledge ("everything about ServiceNow") becomes a curated, shareable package that another KG can re-import and dedup-merge.

The insight: a skill-graph is already a degenerate knowledge graph

Skill-graph artifact …is really a KG equivalent
SKILL.md table of contents node index subgraph manifest
reference/**/*.md content nodes Document.content / IdeaBlock.trusted_answer
folder hierarchy edges CONTAINS / PART_OF
concept: frontmatter typed anchor Concept node

So distilling a skill-graph is a graph projection + serialization, and importing one is ingestion — the two are inverses.


1. Standardized document ingestion (the prerequisite)

Distillation is only faithful if the KG actually retains document text. Previously the same content got a different node shape depending on how it was submitted:

Submission form Old path Shape Full text?
Single file extract_document Document(no body) + Concept(summary) ❌ lossy
Directory / URL KBIngestionEngine RawSource + curated Article ✅ (LLM-rewritten)
Manual DistillationEngine.ingest_text verbatim IdeaBlock ✅ but unreachable via the tool

This was consolidated (strangler-then-delete) into one verbatim contract — the same regardless of file / directory / URL:

flowchart LR
    SRC["document<br/>(file · dir · URL)"] --> UNIT["_ingest_document_file<br/>(canonical per-doc unit)"]
    UNIT --> DOC["Document{content: full verbatim}"]
    UNIT --> CHUNK["IdeaBlock{trusted_answer: chunk}<br/>(PART_OF Document)"]
    UNIT --> CONC["Concept{summary}<br/>(MENTIONS)"]
  • Document{content} — full verbatim body, re-materialisable.
  • IdeaBlock chunks (distillation_engine.chunk_text, deterministic ids {doc.id}:chunk:{i}) linked PART_OF the Document — the retrieval/dedup substrate.
  • Concept nodes via MENTIONS — the interlinking layer.

LLM curation into Article nodes survives only as the explicit KNOWLEDGE_BASE / curate_wiki content type, or opt-in via manifest.metadata["curate"]=True — never an accidental consequence of passing a directory. Code: ingestion/engine.py (_ingest_document, _ingest_document_file, _ingest_document_dir, _ingest_document_url), enrichment/models.py (Document.content), enrichment/extractors/document.py.


2. Distillation (KG subgraph → skill-graph)

SkillGraphDistiller (distillation/skill_graph_distiller.py) walks a coherent subgraph and materialises a neutral reference/ tree + kg_manifest.json — format-agnostic output that skill-graph-builder consumes verbatim as a "local directory" source (no change to the existing TOC/SKILL.md generator).

flowchart LR
    SEL["select<br/>seed id OR semantic_search → BFS to depth"] --> SUB
    SUB["fetch_subgraph<br/>one GetSubgraph round-trip"] --> TAX
    TAX["taxonomy<br/>community_detection → folders"] --> MAT
    MAT["materialize<br/>content → reference/*.md · edges → cross-links"] --> OUT["reference/ + kg_manifest.json"]

Mapping, KG-native → skill-graph-native:

  • Selection — seed by node id, or by graph.semantic_search on a query embedding, then an undirected hop-bounded BFS (max_nodes cap, closest-first).
  • community_detection (Louvain) → reference/<cluster>/ folders; cluster names from the highest-signal Concept title.
  • Hierarchy/relationship edges (CONTAINS, MENTIONS, RELATES_TO, …) → TOC nesting + inline "Related" cross-links between files.
  • Body text (content / trusted_answer / summary) → reference/**/*.md.
  • Parent/child dedup — because a Document and its PART_OF chunks both carry text, a chunk whose parent Document is itself materialised is recorded in the manifest but not written as a duplicate file.
  • kg_manifest.json{schema, ontology, snapshot_ts, selector, nodes:[{id,type,title,file}], edges:[{src,dst,type}], clusters} — the provenance record that makes the package round-trippable.

generate_skill.py --from-kg "<seed-or-query>" shells out to the distiller, merges the result through the existing pipeline, copies kg_manifest.json into the skill dir, and surfaces provenance in the SKILL.md frontmatter:

kg_manifest: kg_manifest.json
kg_ontology: agent-utilities
kg_snapshot: 2026-06-09T12:15:49Z
kg_anchors: ['concept:servicenow']

3. Round-trip import (shareable knowledge packages)

import_skill_graph_pack (distillation/skill_graph_importer.py) reads a distilled pack's kg_manifest.json and reconstructs the subgraph in a recipient KG — preserving original node ids and edges. Because ids are preserved (and chunk ids deterministic), re-import is idempotent. corpus_name="dedup" runs the existing IdeaBlock deduplicator (engine.distill_knowledge) so two packages on the same topic converge instead of duplicating.

flowchart LR
    KG1["KG (source of truth)"] -->|distill| PKG["skill-graph package<br/>reference/ + kg_manifest.json"]
    PKG -->|pip / registry| SHARE((share))
    SHARE -->|import_pack| KG2["recipient KG"]
    KG2 -->|dedup-merge| KG2

distill = serialize a subgraph · share = pip/registry · import = ingest + dedup-merge.


4. Paired graph-native skill-workflows

The KG's Procedure/Playbook/Policy nodes and PRECEDES edges map directly onto a workflow step-DAG. SkillGraphDistiller.distill_workflow (or graph_ingest action="distill", content_type="workflow") emits a SKILL.md whose ### Step N: <token> [depends_on: Step k] ordering is a topological sort over PRECEDES — validatable by skill-workflow-builder's build_workflow.py validate. From one subgraph you can therefore distill a pair: the docs (skill-graph) and the how-to-act (skill-workflow), versioned together.


4a. Connector → skill synthesis (propose-only)

CONCEPT:KG-2.90 (the distiller) · CONCEPT:KG-2.91 (the ontology links) Module: agent_utilities/knowledge_graph/distillation/skill_synthesizer.py (ConnectorSkillDistiller)

The same "graph ops, LLM at the edges" philosophy runs outward: once connectors (egeria / leanix / aris / camunda) have mapped their processes into the KG, those processes are themselves a source of skills. The ConnectorSkillDistiller is a KG-native, propose-only background distiller that turns mapped processes into NEW atomic-skill and skill-workflow PROPOSALS — a human/Claude reviews + approves; nothing lands in any repo automatically.

It is generic over the ontology, never per-connector. Every connector already lifts into the same ArchiMate/capability classes, so one ontology-driven pass covers them all:

Ontology class / edge …becomes a
a lone BusinessTask / Capability atomic-skill candidate
a flowsTo-chain of ≥2 BusinessTasks skill-workflow candidate (steps = atomic skills)
an unresolved manual: task (ProcessPlanCompiler gap) atomic-skill candidate (automation gap)
a recurring cross-process inference (OntologyReasoningDriver) cross-process candidate
flowchart LR
    subgraph CONN["connectors → KG (already mapped)"]
        C1["camunda"]:::c --> KG
        C2["aris"]:::c --> KG
        C3["egeria"]:::c --> KG
        C4["leanix"]:::c --> KG
    end
    KG["KG ontology<br/>BusinessProcess · BusinessTask · flowsTo · Capability"] --> DISC
    DISC["discover<br/>processes + unresolved manual: gaps + OWL patterns"] --> CLS
    CLS["classify<br/>action→atomic · flowsTo-chain→workflow"] --> DED
    DED["dedup<br/>ConceptMatcher / skill registry → covered|related|novel"] --> PROP
    PROP["propose (PROPOSE-ONLY)<br/>SkillProposal / SkillWorkflowProposal<br/>+ AUTOMATES · DERIVED_FROM · COMPOSES"] --> REV
    REV{{"human / Claude<br/>review + approve"}} -->|approve| MAT
    MAT["materialize<br/>PhysicalDistillationEngine → SKILL.md (STAGING dir)"]
    classDef c fill:#eef,stroke:#88a;

Ontology additions (KG-2.91): SkillProposal / SkillWorkflowProposal interfaces (ontology/interfaces.py) + node types + the object properties AUTOMATES (skill/workflow → process/capability), DERIVED_FROM (proposal → source node, inverse :derives), COMPOSES (workflow → atomic skills, transitive) in ontology_orchestration.ttl. The new node-type labels are marked promotable (core/owl_bridge.py) so OWL reasoning runs transitive/inverse over them — a workflow proposal's COMPOSES chain resolves to its leaf atomic skills, and AUTOMATES/derives relate a proposal to the process it covers and the source it came from.

Dual-mode workflow SKILL.md. A workflow proposal's artifact runs under BOTH Claude AND graph-os. Frontmatter carries name/description/domain/tags/ team_config (specialist_ids + tool_assignments)/concept; the body has a machine-readable step DAG (### Step N: <atomic-skill> [depends_on: Step M, ...]), a Claude-executable ## Execution section (run independent steps in parallel, dependents after), and a standard delegation footer:

If graph-os is reachable, offload the whole DAG via graph_orchestrate action=execute_workflow (or the kg-delegation-router skill); otherwise execute steps natively in dependency order.

Wiring (default-ON, propose-only). The distiller runs as the distill_skills stage of LoopController.run_one_cycle (best-effort, alongside reason/standardize/ distill; reuses the per-cycle embedder for semantic dedup), and is reachable on both surfaces: MCP graph_orchestrate(action="distill_skills") and REST POST /api/graph/orchestrate/distill-skills, both dispatching into the same action core. Review uses the proposal nodes; on approval graph_orchestrate(action="distill_skills", task="materialize:<proposal_id>") materializes via PhysicalDistillationEngine — into a staging dir, never a source repo.


5. Crawler → KG routing

So the KG is the canonical store, crawled docs land there first (via the standardized contract), then distill from the KG. Because web-crawler (universal-skills) must not import agent-utilities, routing is a process-boundary shell-out to the ingest CLI — mirroring how generate_skill.py already shells out to crawl.py:

  • crawl.py --ingest-kg → ingests the crawl output dir after writing markdown.
  • generate_skill.py --ingest-kg → ingests the final merged reference/ once.
  • Both call python -m agent_utilities.knowledge_graph.ingestion <path> --content-type document, which runs the standardized IngestionEngine against the live daemon. Graceful, clearly-messaged degradation if agent-utilities/daemon is absent.

6. Batched GetSubgraph (engine optimization)

Distillation reads a node's properties and the edges among the selection. Doing that per-node would be N socket round-trips against the out-of-process engine (plus a full edge scan). The engine's GetSubgraph returns the induced subgraph — decoded node properties + in-set edges — in one round-trip:

{ "nodes": [ {"id": "...", "properties": { ... }} ],
  "edges": [ {"source": "...", "target": "...", "properties": { ... }} ] }

Implemented across the three engine layers (src/protocol.rs Method::GetSubgraph, src/server.rs dispatch, epistemic_graph/client.py graph.get_subgraph); the distiller's fetch_subgraph uses it with a per-node fallback so it still works against older engines.

The prior GetSubgraph dispatch serialized to msgpack and then mis-parsed those bytes as JSON (expected value at line 1 column 1) — it never worked, which is why no client method existed. The fix decodes the property blobs server-side into JSON. Activating it live requires the daemon to run the rebuilt binary; until then the fallback path is used.


Surface reference

Action Invocation
Distill skill-graph graph_ingest(action="distill", target_path="<out>", corpus_name="<seed>" \| description="<query>", max_depth=2)
Distill workflow …same, with content_type="workflow"
Connector → skill proposals graph_orchestrate(action="distill_skills"[, task="draft"]) · POST /api/graph/orchestrate/distill-skills
Materialize an approved proposal graph_orchestrate(action="distill_skills", task="materialize:<proposal_id>")
Import a pack graph_ingest(action="import_pack", target_path="<dir>", corpus_name="dedup")
Build from KG generate_skill.py --from-kg "<seed-or-query>" <name>
Route crawl → KG crawl.py --ingest-kg · generate_skill.py --ingest-kg
Ingest CLI python -m agent_utilities.knowledge_graph.ingestion <path> [--content-type] [--curate]
Distiller CLI python -m …distillation.skill_graph_distiller --seed\|--query … --out-dir <dir> [--workflow]

File map

Concern Path
Distiller (KG → reference/ + manifest, workflows) knowledge_graph/distillation/skill_graph_distiller.py
Connector → skill synthesis (propose-only, KG-2.90/2.83) knowledge_graph/distillation/skill_synthesizer.py
Loop stage (_distill_skills) knowledge_graph/research/loop_controller.py
distill_skills action (MCP + REST) mcp/tools/analysis_tools.py (graph_orchestrate) · mcp/kg_server.py
Importer (pack → KG) knowledge_graph/distillation/skill_graph_importer.py
Standardized document ingestion knowledge_graph/ingestion/engine.py
Ingest CLI knowledge_graph/ingestion/__main__.py
MCP actions (distill, import_pack) mcp/kg_server.py (graph_ingest)
Batched subgraph (engine) epistemic-graph/src/{protocol,server}.rs, epistemic_graph/client.py
Builder --from-kg / --ingest-kg universal-skills/.../skill-graph-builder/scripts/generate_skill.py
Crawler --ingest-kg universal-skills/.../web-crawler/scripts/crawl.py

See also