Knowledge Distillation → Skill-Graphs¶
CONCEPT:KG-2.7 (standardized ingestion) · CONCEPT:AHE-3.9 (physical distillation) · CONCEPT:KG-2.90 (connector → skill synthesis) · CONCEPT:KG-2.91 (skill-synthesis ontology links)
Packages: agent_utilities/knowledge_graph/distillation/ · agent_utilities/knowledge_graph/ingestion/
Engine: epistemic-graph GetSubgraph (batched subgraph read)
MCP: graph_ingest (action="distill" | "import_pack")
CLIs: python -m agent_utilities.knowledge_graph.distillation.skill_graph_distiller, python -m agent_utilities.knowledge_graph.ingestion
Skills (universal-skills): skill-graph-builder (generate_skill.py --from-kg), web-crawler (crawl.py --ingest-kg), knowledge-graph-ingest
Why¶
The KG already holds curated knowledge; skill-graphs are how that knowledge gets
packaged, versioned, and shared with agents. Historically those were two disconnected
worlds: a skill-graph was built by crawling a website into a reference/ markdown tree,
while the KG was populated separately. This design makes the KG the single source of truth
and a skill-graph a versioned, round-trippable projection of a KG subgraph — so a
slice of knowledge ("everything about ServiceNow") becomes a curated, shareable package
that another KG can re-import and dedup-merge.
The insight is that a skill-graph is already a degenerate knowledge graph:
| Skill-graph artifact | …is really a | KG equivalent |
|---|---|---|
| SKILL.md table of contents | node index | subgraph manifest |
| reference/**/*.md | content nodes | Document.content / IdeaBlock.trusted_answer |
| folder hierarchy | edges | CONTAINS / PART_OF |
| concept: frontmatter | typed anchor | Concept node |
So distilling a skill-graph is a graph projection + serialization, and importing one is ingestion — the two are inverses.
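To make the "folder hierarchy is really edges" row concrete, here is a minimal sketch that derives CONTAINS edges from relative file paths. The function name tree_to_edges and the use of path strings as node ids are assumptions for illustration, not the distiller's actual representation.

```python
from pathlib import PurePosixPath


def tree_to_edges(files):
    """Derive (parent, "CONTAINS", child) edges from relative file paths.

    Hypothetical helper: path strings stand in for node ids.
    """
    edges = set()
    for rel in files:
        parts = PurePosixPath(rel).parts
        # Link every folder to its immediate child (sub-folder or file).
        for depth in range(1, len(parts)):
            parent = "/".join(parts[:depth])
            child = "/".join(parts[: depth + 1])
            edges.add((parent, "CONTAINS", child))
    return sorted(edges)


files = ["reference/itsm/incident.md", "reference/itsm/change.md"]
edges = tree_to_edges(files)
for edge in edges:
    print(edge)
```

Reading the edges back and recreating the folders would be the inverse direction, which is the sense in which distillation and import mirror each other.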
1. Standardized document ingestion (the prerequisite)¶
Distillation is only faithful if the KG actually retains document text. Previously the same content got a different node shape depending on how it was submitted:
| Submission form | Old path | Shape | Full text? |
|---|---|---|---|
| Single file | extract_document | Document (no body) + Concept (summary) | ❌ lossy |
| Directory / URL | KBIngestionEngine | RawSource + curated Article | ✅ (LLM-rewritten) |
| Manual | DistillationEngine.ingest_text | verbatim IdeaBlock | ✅ but unreachable via the tool |
This was consolidated (strangler-then-delete) into one verbatim contract — the same regardless of file / directory / URL:
flowchart LR
SRC["document<br/>(file · dir · URL)"] --> UNIT["_ingest_document_file<br/>(canonical per-doc unit)"]
UNIT --> DOC["Document{content: full verbatim}"]
UNIT --> CHUNK["IdeaBlock{trusted_answer: chunk}<br/>(PART_OF Document)"]
UNIT --> CONC["Concept{summary}<br/>(MENTIONS)"]
- Document{content}: full verbatim body, re-materialisable.
- IdeaBlock chunks (distillation_engine.chunk_text, deterministic ids {doc.id}:chunk:{i}) linked PART_OF the Document: the retrieval/dedup substrate.
- Concept nodes via MENTIONS: the interlinking layer.
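The deterministic chunk ids are what later make re-ingestion and re-import safe, so a small sketch may help. It assumes fixed-width character chunking, which is almost certainly simpler than the real distillation_engine.chunk_text; the Chunk dataclass and chunk_document name are hypothetical. Only the id pattern {doc.id}:chunk:{i} and the PART_OF link come from the contract above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    id: str        # "{doc.id}:chunk:{i}", as in the contract above
    text: str      # would be stored as IdeaBlock.trusted_answer
    part_of: str   # id of the parent Document (the PART_OF edge)


def chunk_document(doc_id, content, size=200):
    """Split content into fixed-width pieces (an assumed strategy)."""
    pieces = [content[i:i + size] for i in range(0, len(content), size)]
    return [
        Chunk(id=f"{doc_id}:chunk:{i}", text=piece, part_of=doc_id)
        for i, piece in enumerate(pieces)
    ]


chunks = chunk_document("doc:servicenow-intro", "x" * 450)
print([c.id for c in chunks])
```

Because the ids depend only on the document id and the chunk position, ingesting the same text twice produces the same ids rather than new nodes.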
LLM curation into Article nodes survives only as the explicit KNOWLEDGE_BASE /
curate_wiki content type, or opt-in via manifest.metadata["curate"]=True — never an
accidental consequence of passing a directory. Code: ingestion/engine.py
(_ingest_document, _ingest_document_file, _ingest_document_dir,
_ingest_document_url), enrichment/models.py (Document.content),
enrichment/extractors/document.py.
2. Distillation (KG subgraph → skill-graph)¶
SkillGraphDistiller (distillation/skill_graph_distiller.py) walks a coherent subgraph
and materialises a neutral reference/ tree + kg_manifest.json — format-agnostic output
that skill-graph-builder consumes verbatim as a "local directory" source (no change to
the existing TOC/SKILL.md generator).
flowchart LR
SEL["select<br/>seed id OR semantic_search → BFS to depth"] --> SUB
SUB["fetch_subgraph<br/>one GetSubgraph round-trip"] --> TAX
TAX["taxonomy<br/>community_detection → folders"] --> MAT
MAT["materialize<br/>content → reference/*.md · edges → cross-links"] --> OUT["reference/ + kg_manifest.json"]
Mapping, KG-native → skill-graph-native:
- Selection: seed by node id, or by graph.semantic_search on a query embedding, then an undirected hop-bounded BFS (max_nodes cap, closest-first).
- community_detection (Louvain) → reference/<cluster>/ folders; cluster names from the highest-signal Concept title.
- Hierarchy/relationship edges (CONTAINS, MENTIONS, RELATES_TO, …) → TOC nesting + inline "Related" cross-links between files.
- Body text (content / trusted_answer / summary) → reference/**/*.md.
- Parent/child dedup: because a Document and its PART_OF chunks both carry text, a chunk whose parent Document is itself materialised is recorded in the manifest but not written as a duplicate file.
- kg_manifest.json: {schema, ontology, snapshot_ts, selector, nodes:[{id,type,title,file}], edges:[{src,dst,type}], clusters}, the provenance record that makes the package round-trippable.
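The selection step can be sketched as a plain breadth-first search. This is an illustration under stated assumptions, not the distiller's code: edges are given as (src, dst) pairs, the function name select_subgraph is hypothetical, and "closest-first" is taken to mean that the max_nodes cap fills up in order of hop distance from the seed.

```python
from collections import deque


def select_subgraph(edges, seed, max_depth=2, max_nodes=50):
    """Return {node_id: hop_distance} for an undirected, hop-bounded BFS."""
    adjacency = {}
    for src, dst in edges:
        # Treat every edge as undirected for the purpose of selection.
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)

    depth = {seed: 0}
    queue = deque([seed])
    while queue and len(depth) < max_nodes:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # hop bound reached, do not expand further
        for neighbor in sorted(adjacency.get(node, ())):
            if neighbor not in depth and len(depth) < max_nodes:
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)
    return depth


edges = [("a", "b"), ("b", "c"), ("c", "d"), ("e", "a")]
selected = select_subgraph(edges, "a", max_depth=2)
print(selected)
```

BFS visits nodes in non-decreasing hop distance, so when the cap cuts the selection short, the nodes that survive are the ones nearest the seed.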
generate_skill.py --from-kg "<seed-or-query>" shells out to the distiller, merges the
result through the existing pipeline, copies kg_manifest.json into the skill dir, and
surfaces provenance in the SKILL.md frontmatter:
kg_manifest: kg_manifest.json
kg_ontology: agent-utilities
kg_snapshot: 2026-06-09T12:15:49Z
kg_anchors: ['concept:servicenow']
3. Round-trip import (shareable knowledge packages)¶
import_skill_graph_pack (distillation/skill_graph_importer.py) reads a distilled pack's
kg_manifest.json and reconstructs the subgraph in a recipient KG — preserving original
node ids and edges. Because ids are preserved (and chunk ids deterministic), re-import is
idempotent. corpus_name="dedup" runs the existing IdeaBlock deduplicator
(engine.distill_knowledge) so two packages on the same topic converge instead of
duplicating.
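The idempotency claim follows from upserting by id. The sketch below uses an in-memory dict as a stand-in for the recipient KG, and import_pack here is a hypothetical simplification, not import_skill_graph_pack itself. The manifest field names (nodes with id/type/title/file, edges with src/dst/type) follow the kg_manifest.json shape described earlier.

```python
def import_pack(graph, manifest):
    """Upsert manifest nodes and edges into an in-memory graph by id."""
    for node in manifest["nodes"]:
        # The original id is the key, so a repeat import overwrites in place.
        graph["nodes"][node["id"]] = node
    for edge in manifest["edges"]:
        # A set of (src, type, dst) triples cannot hold duplicates.
        graph["edges"].add((edge["src"], edge["type"], edge["dst"]))
    return graph


manifest = {
    "nodes": [
        {"id": "doc:1", "type": "Document", "title": "Intro",
         "file": "reference/intro.md"},
        {"id": "doc:1:chunk:0", "type": "IdeaBlock", "title": "Intro (chunk 0)",
         "file": None},
    ],
    "edges": [{"src": "doc:1:chunk:0", "dst": "doc:1", "type": "PART_OF"}],
}

graph = {"nodes": {}, "edges": set()}
import_pack(graph, manifest)
once = (dict(graph["nodes"]), set(graph["edges"]))
import_pack(graph, manifest)
twice = (dict(graph["nodes"]), set(graph["edges"]))
print(once == twice)
```

Importing the same pack a second time leaves the graph unchanged, which is the property that lets a package be re-shared and re-applied safely.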
flowchart LR
KG1["KG (source of truth)"] -->|distill| PKG["skill-graph package<br/>reference/ + kg_manifest.json"]
PKG -->|pip / registry| SHARE((share))
SHARE -->|import_pack| KG2["recipient KG"]
KG2 -->|dedup-merge| KG2
distill = serialize a subgraph · share = pip/registry · import = ingest + dedup-merge.
4. Paired graph-native skill-workflows¶
The KG's Procedure/Playbook/Policy nodes and PRECEDES edges map directly onto a
workflow step-DAG. SkillGraphDistiller.distill_workflow (or graph_ingest action="distill",
content_type="workflow") emits a SKILL.md whose ### Step N: <token> [depends_on: Step k]
ordering is a topological sort over PRECEDES — validatable by skill-workflow-builder's
build_workflow.py validate. From one subgraph you can therefore distill a pair: the
docs (skill-graph) and the how-to-act (skill-workflow), versioned together.
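As a rough illustration of the step ordering, the following sketch topologically sorts PRECEDES pairs with the standard-library graphlib module and renders headings in the ### Step N: <token> [depends_on: Step k] form. The function name render_steps and the input shape (a list of (earlier, later) pairs) are assumptions; distill_workflow may differ in how it reads the graph.

```python
from graphlib import TopologicalSorter


def render_steps(precedes):
    """Turn (earlier, later) PRECEDES pairs into SKILL.md step headings."""
    predecessors = {}
    for earlier, later in precedes:
        predecessors.setdefault(earlier, set())
        predecessors.setdefault(later, set()).add(earlier)

    # static_order() yields each node after all of its predecessors.
    order = list(TopologicalSorter(predecessors).static_order())
    number = {name: i for i, name in enumerate(order, start=1)}

    lines = []
    for name in order:
        step_numbers = sorted(number[p] for p in predecessors[name])
        deps = ", ".join(f"Step {n}" for n in step_numbers)
        suffix = f" [depends_on: {deps}]" if deps else ""
        lines.append(f"### Step {number[name]}: {name}{suffix}")
    return lines


steps = render_steps([("triage", "diagnose"), ("diagnose", "resolve")])
print("\n".join(steps))
```

TopologicalSorter raises graphlib.CycleError on a cyclic input, which would be a natural place to reject a malformed PRECEDES subgraph before emitting anything.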
4a. Connector → skill synthesis (propose-only)¶
CONCEPT:KG-2.90 (the distiller) · CONCEPT:KG-2.91 (the ontology links)
Module: agent_utilities/knowledge_graph/distillation/skill_synthesizer.py (ConnectorSkillDistiller)
The same "graph ops, LLM at the edges" philosophy runs outward: once
connectors (egeria / leanix / aris / camunda) have mapped their processes into
the KG, those processes are themselves a source of skills. The
ConnectorSkillDistiller is a KG-native, propose-only background distiller
that turns mapped processes into NEW atomic-skill and skill-workflow
PROPOSALS — a human/Claude reviews + approves; nothing lands in any repo
automatically.
It is generic over the ontology, never per-connector. Every connector already lifts into the same ArchiMate/capability classes, so one ontology-driven pass covers them all:
| Ontology class / edge | …becomes a |
|---|---|
| a lone BusinessTask / Capability | atomic-skill candidate |
| a flowsTo-chain of ≥2 BusinessTasks | skill-workflow candidate (steps = atomic skills) |
| an unresolved manual: task (ProcessPlanCompiler gap) | atomic-skill candidate (automation gap) |
| a recurring cross-process inference (OntologyReasoningDriver) | cross-process candidate |
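The first two rows of the table can be sketched as a small classifier over flowsTo edges. This is a simplification under a loud assumption: every task has at most one predecessor and one successor, so chains are linear. The function name classify and the tuple output are hypothetical and do not reflect ConnectorSkillDistiller's actual data model.

```python
def classify(tasks, flows_to):
    """Classify tasks into atomic-skill or skill-workflow candidates.

    tasks: iterable of task ids; flows_to: list of (src, dst) edges.
    Assumes linear chains (one successor and one predecessor per task).
    """
    successor = dict(flows_to)
    has_predecessor = {dst for _, dst in flows_to}

    candidates = []
    for task in sorted(tasks):
        if task in has_predecessor:
            continue  # not the head of a chain, it is covered by its head
        chain = [task]
        # Follow flowsTo links; the membership check guards against loops.
        while chain[-1] in successor and successor[chain[-1]] not in chain:
            chain.append(successor[chain[-1]])
        kind = "skill-workflow" if len(chain) >= 2 else "atomic-skill"
        candidates.append((kind, chain))
    return candidates


tasks = ["approve", "fulfil", "request", "audit"]
flows_to = [("request", "approve"), ("approve", "fulfil")]
candidates = classify(tasks, flows_to)
print(candidates)
```

Branching processes would need a real graph traversal instead of the single-successor dictionary used here.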
flowchart LR
subgraph CONN["connectors → KG (already mapped)"]
C1["camunda"]:::c --> KG
C2["aris"]:::c --> KG
C3["egeria"]:::c --> KG
C4["leanix"]:::c --> KG
end
KG["KG ontology<br/>BusinessProcess · BusinessTask · flowsTo · Capability"] --> DISC
DISC["discover<br/>processes + unresolved manual: gaps + OWL patterns"] --> CLS
CLS["classify<br/>action→atomic · flowsTo-chain→workflow"] --> DED
DED["dedup<br/>ConceptMatcher / skill registry → covered|related|novel"] --> PROP
PROP["propose (PROPOSE-ONLY)<br/>SkillProposal / SkillWorkflowProposal<br/>+ AUTOMATES · DERIVED_FROM · COMPOSES"] --> REV
REV{{"human / Claude<br/>review + approve"}} -->|approve| MAT
MAT["materialize<br/>PhysicalDistillationEngine → SKILL.md (STAGING dir)"]
classDef c fill:#eef,stroke:#88a;
Ontology additions (KG-2.91): SkillProposal / SkillWorkflowProposal
interfaces (ontology/interfaces.py) + node types + the object properties
AUTOMATES (skill/workflow → process/capability), DERIVED_FROM (proposal →
source node, inverse :derives), COMPOSES (workflow → atomic skills,
transitive) in ontology_orchestration.ttl. The new node-type labels are marked
promotable (core/owl_bridge.py) so OWL reasoning applies transitive and inverse
inference over them: a workflow proposal's COMPOSES chain resolves to its leaf atomic
skills, and AUTOMATES/derives relate a proposal to the process it covers and
the source it came from.
Dual-mode workflow SKILL.md. A workflow proposal's artifact runs under BOTH
Claude AND graph-os. Frontmatter carries name/description/domain/tags/
team_config (specialist_ids + tool_assignments)/concept; the body has a
machine-readable step DAG (### Step N: <atomic-skill> [depends_on: Step M, ...]),
a Claude-executable ## Execution section (run independent steps in parallel,
dependents after), and a standard delegation footer:
If graph-os is reachable, offload the whole DAG via graph_orchestrate action=execute_workflow (or the kg-delegation-router skill); otherwise execute steps natively in dependency order.
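For the native (non-graph-os) path, the step DAG has to be read back out of the markdown and grouped into batches that can run in parallel. The sketch below shows one way this might look. The regular expression assumes the exact heading form quoted above with a whitespace-free step token, and parse_steps and parallel_batches are hypothetical names.

```python
import re

# Matches "### Step N: <token>" with an optional "[depends_on: Step M, ...]".
STEP = re.compile(r"^### Step (\d+): (\S+)(?: \[depends_on: ([^\]]*)\])?\s*$")


def parse_steps(markdown):
    """Return {step_number: (skill_name, [dependency_numbers])}."""
    steps = {}
    for line in markdown.splitlines():
        match = STEP.match(line)
        if match:
            deps = [int(n) for n in re.findall(r"Step (\d+)", match.group(3) or "")]
            steps[int(match.group(1))] = (match.group(2), deps)
    return steps


def parallel_batches(steps):
    """Group steps into batches whose dependencies are already complete."""
    done, batches = set(), []
    while len(done) < len(steps):
        ready = sorted(
            n for n, (_, deps) in steps.items()
            if n not in done and all(d in done for d in deps)
        )
        if not ready:
            raise ValueError("cycle or missing dependency in step DAG")
        batches.append(ready)
        done.update(ready)
    return batches


body = """### Step 1: fetch-ticket
### Step 2: fetch-cmdb
### Step 3: correlate [depends_on: Step 1, Step 2]
"""
parsed = parse_steps(body)
batches = parallel_batches(parsed)
print(batches)
```

Each inner list is a set of independent steps; a runner would execute one batch concurrently and move to the next once it finishes.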
Wiring (default-ON, propose-only). The distiller runs as the distill_skills
stage of LoopController.run_one_cycle (best-effort, alongside reason/standardize/
distill; reuses the per-cycle embedder for semantic dedup), and is reachable on
both surfaces: MCP graph_orchestrate(action="distill_skills") and REST
POST /api/graph/orchestrate/distill-skills, both dispatching into the same
action core. Review uses the proposal nodes; on approval
graph_orchestrate(action="distill_skills", task="materialize:<proposal_id>")
materializes via PhysicalDistillationEngine — into a staging dir, never a
source repo.
5. Crawler → KG routing¶
So that the KG is the canonical store, crawled docs land there first (via the standardized
contract), and skill-graphs are then distilled from the KG. Because web-crawler (universal-skills) must not import
agent-utilities, routing is a process-boundary shell-out to the ingest CLI — mirroring how
generate_skill.py already shells out to crawl.py:
- crawl.py --ingest-kg → ingests the crawl output dir after writing markdown.
- generate_skill.py --ingest-kg → ingests the final merged reference/ once.
- Both call python -m agent_utilities.knowledge_graph.ingestion <path> --content-type document, which runs the standardized IngestionEngine against the live daemon. Graceful, clearly-messaged degradation if agent-utilities/daemon is absent.
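A process-boundary shell-out with graceful degradation might look roughly like the sketch below. The command line is the one quoted above; the helper names build_ingest_command and ingest_into_kg, the timeout value, and the wording of the messages are assumptions rather than what crawl.py or generate_skill.py actually do.

```python
import subprocess
import sys


def build_ingest_command(path):
    """Build the ingest CLI invocation described above."""
    return [
        sys.executable, "-m", "agent_utilities.knowledge_graph.ingestion",
        str(path), "--content-type", "document",
    ]


def ingest_into_kg(path, timeout=300):
    """Run the ingest CLI; return False (with a message) instead of raising."""
    try:
        result = subprocess.run(
            build_ingest_command(path),
            capture_output=True, text=True, timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        print(f"KG ingest skipped: {exc}", file=sys.stderr)
        return False
    if result.returncode != 0:
        # Covers a missing agent-utilities install or an unreachable daemon.
        tail = result.stderr.strip()[-200:]
        print(f"KG ingest skipped (exit {result.returncode}): {tail}", file=sys.stderr)
        return False
    return True


command = build_ingest_command("reference")
print(" ".join(command[1:]))
```

Keeping the call behind a subprocess boundary means the crawler never imports agent-utilities, so a missing install shows up as a non-zero exit code rather than an ImportError.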
6. Batched GetSubgraph (engine optimization)¶
Distillation reads a node's properties and the edges among the selection. Doing that
per-node would be N socket round-trips against the out-of-process engine (plus a full edge
scan). The engine's GetSubgraph returns the induced subgraph — decoded node properties +
in-set edges — in one round-trip:
{ "nodes": [ {"id": "...", "properties": { ... }} ],
"edges": [ {"source": "...", "target": "...", "properties": { ... }} ] }
Implemented across the three engine layers (src/protocol.rs Method::GetSubgraph,
src/server.rs dispatch, epistemic_graph/client.py graph.get_subgraph); the distiller's
fetch_subgraph uses it with a per-node fallback so it still works against older engines.
The prior GetSubgraph dispatch serialized to msgpack and then mis-parsed those bytes as JSON (expected value at line 1 column 1), so it never worked, which is why no client method existed. The fix decodes the property blobs server-side into JSON. Activating it live requires the daemon to run the rebuilt binary; until then the fallback path is used.
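The batched-read-with-fallback pattern can be sketched against a stand-in client. Everything about the client here is hypothetical: LegacyClient, get_node, and get_edges are invented for illustration, and detecting an older engine by the absence of a get_subgraph attribute is an assumption (the real fallback may instead react to a failed call). Only the response shape, nodes with id/properties and edges with source/target/properties, follows the JSON above.

```python
class LegacyClient:
    """Stand-in for an older engine that only supports per-node reads."""

    def __init__(self, nodes, edges):
        self._nodes, self._edges = nodes, edges

    def get_node(self, node_id):    # hypothetical per-node property read
        return self._nodes[node_id]

    def get_edges(self, node_id):   # hypothetical per-node outgoing-edge read
        return [e for e in self._edges if e["source"] == node_id]


def fetch_subgraph(client, node_ids):
    """Return the induced subgraph, preferring one batched round-trip."""
    ids = set(node_ids)
    batched = getattr(client, "get_subgraph", None)
    if batched is not None:
        return batched(sorted(ids))  # single round-trip on newer engines
    # Fallback: N per-node reads, keeping only edges that stay inside the set.
    nodes = [{"id": n, "properties": client.get_node(n)} for n in sorted(ids)]
    edges = [
        e for n in sorted(ids) for e in client.get_edges(n)
        if e["target"] in ids
    ]
    return {"nodes": nodes, "edges": edges}


client = LegacyClient(
    nodes={"a": {"title": "A"}, "b": {"title": "B"}, "c": {}},
    edges=[
        {"source": "a", "target": "b", "properties": {}},
        {"source": "a", "target": "c", "properties": {}},
    ],
)
subgraph = fetch_subgraph(client, ["a", "b"])
print(subgraph)
```

Both paths return the same shape, so the rest of the distiller does not need to know which one was taken.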
Surface reference¶
| Action | Invocation |
|---|---|
| Distill skill-graph | graph_ingest(action="distill", target_path="<out>", corpus_name="<seed>" \| description="<query>", max_depth=2) |
| Distill workflow | …same, with content_type="workflow" |
| Connector → skill proposals | graph_orchestrate(action="distill_skills"[, task="draft"]) · POST /api/graph/orchestrate/distill-skills |
| Materialize an approved proposal | graph_orchestrate(action="distill_skills", task="materialize:<proposal_id>") |
| Import a pack | graph_ingest(action="import_pack", target_path="<dir>", corpus_name="dedup") |
| Build from KG | generate_skill.py --from-kg "<seed-or-query>" <name> |
| Route crawl → KG | crawl.py --ingest-kg · generate_skill.py --ingest-kg |
| Ingest CLI | python -m agent_utilities.knowledge_graph.ingestion <path> [--content-type] [--curate] |
| Distiller CLI | python -m …distillation.skill_graph_distiller --seed\|--query … --out-dir <dir> [--workflow] |
File map¶
| Concern | Path |
|---|---|
| Distiller (KG → reference/ + manifest, workflows) | knowledge_graph/distillation/skill_graph_distiller.py |
| Connector → skill synthesis (propose-only, KG-2.90/2.91) | knowledge_graph/distillation/skill_synthesizer.py |
| Loop stage (_distill_skills) | knowledge_graph/research/loop_controller.py |
| distill_skills action (MCP + REST) | mcp/tools/analysis_tools.py (graph_orchestrate) · mcp/kg_server.py |
| Importer (pack → KG) | knowledge_graph/distillation/skill_graph_importer.py |
| Standardized document ingestion | knowledge_graph/ingestion/engine.py |
| Ingest CLI | knowledge_graph/ingestion/__main__.py |
| MCP actions (distill, import_pack) | mcp/kg_server.py (graph_ingest) |
| Batched subgraph (engine) | epistemic-graph/src/{protocol,server}.rs, epistemic_graph/client.py |
| Builder --from-kg / --ingest-kg | universal-skills/.../skill-graph-builder/scripts/generate_skill.py |
| Crawler --ingest-kg | universal-skills/.../web-crawler/scripts/crawl.py |
See also¶
- Graph-Native Assimilation Engine — the same "graph ops, LLM at the edges" philosophy for research/capability dedup.
- Knowledge Graph Ingestion Stability — backend locking/lifecycle under bulk ingest.