Supported Deployment Configurations¶
This is the flagship configuration ladder for agent-utilities: five complete, copy-paste-ready configurations, from a zero-infrastructure laptop to an autonomous multi-host fleet. Each rung builds on the previous one and states its delta explicitly.
Every flag name, default, and behavior on this page is verified against
agent_utilities/core/config.py (AgentConfig), the shipped
docker/*.compose.yml files, and the module that reads the flag. The
authoritative flag inventory (with per-flag verdicts) is
Configuration Reference & Flag Audit.
Configurations the CI pipeline cannot stand up (live Kafka, live Postgres,
multi-shard engines, live actuators) are explicitly marked
not exercised in CI below, with a pointer to the unit suites that cover
their logic against injected fakes.
Summary matrix¶
| Rung | Durable state | Task queue | Agent dispatch | Auth | Engines | Autonomy |
|---|---|---|---|---|---|---|
| (a) Zero-infra dev | per-host SQLite | sqlite (auto) |
inline (default) |
none (open, startup warning) | 1 local engine (UDS) | off; ActionPolicy = approval-required default |
| (b) Secured single node | per-host SQLite | sqlite (auto) |
inline |
engine HMAC + required JWT identity + brain enforcement | 1 local engine, HMAC | off |
| (c) Durable single node | shared Postgres (STATE_DB_URI) |
postgres (auto) |
inline |
as (b) | 1 local engine | off |
| (d) Scaled multi-host | shared Postgres | kafka (explicit) |
queue + worker fleet |
as (b) | 3 TCP shards (HRW-routed) | off |
| (e) Autonomous operations | shared Postgres | kafka |
queue |
as (b) | 3 TCP shards | golden loop, failure-driven evolution, fleet reconciler, autoscaler, event webhook |
How configuration is loaded¶
Three layers, in precedence order:
- Environment variables /
.env— every flag is a typed field onAgentConfig(agent_utilities/core/config.py); the env alias is the flag's canonical name. The CI gatescripts/check_no_env_sprawl.pykeeps new flags onAgentConfiginstead of scatteredos.environreads. - XDG
config.json— discovered at~/.config/agent-utilities/config.json(override the directory withAGENT_UTILITIES_CONFIG_DIR; resolution lives inagent_utilities/core/paths.py). Each key is upper-cased to its env name and applied only if that env var is not already set — so the environment always wins. Lists/dicts are JSON-encoded (this is howchat_models/embedding_modelsregistries are normally configured). A templateconfig.jsonis written on first run if none exists.GRAPH_SERVICE_ENDPOINTSand other list flags accept either a JSON list (natural inconfig.json) or a comma-separated string (natural in.env). - Defaults — the values baked into
AgentConfig. The zero-infra rung below runs entirely on these.
Data lives under XDG paths: config ~/.config/agent-utilities/, data
~/.local/share/agent-utilities/ (AGENT_UTILITIES_DATA_DIR override), cache
~/.cache/agent-utilities/.
The relevant processes (console scripts from pyproject.toml
[project.scripts]):
| Command | Role |
|---|---|
graph-os |
MCP tool surface (stdio default; --transport streamable-http --port <p> for HTTP; compose: docker/mcp.compose.yml, port 8004) |
graph-os-daemon |
Standalone KG host daemon: holds the flock host lock, drains the durable task queue, runs all maintenance/autonomy ticks. --status and --drain-queue flags. It serves no HTTP. |
python -m agent_utilities |
The REST gateway (FastAPI agent server): /health, /api/graph/*, /api/sessions, /api/goals, /api/fleet/*, /api/dashboard/*, /metrics. Binds HOST:PORT (defaults 0.0.0.0:9000). GATEWAY_WORKERS pre-forks it. When it runs, it hosts the KG daemon itself (flock-elected) — you do not also need graph-os-daemon. |
kg-ingest-worker |
Decoupled ingest consumer (kg-ingest group), engine client only (KG-2.57) |
agent-dispatch-worker |
Stateless agent-turn consumer (agent-dispatch group), engine client only (ORCH-1.45) |
Rung (a): Zero-infra dev¶
What you get: the full platform on one machine with no external services.
The knowledge graph runs on the default tiered backend — the
epistemic_graph Rust engine as the L1 working store plus embedded LadybugDB
as the L2 durable store (GRAPH_BACKEND_L2 resolves to ladybug when unset)
— all state in per-host SQLite files under the XDG data dir, agent turns
executed in-process (AGENT_DISPATCH_BACKEND=inline is the default), and no
authentication. What you don't get: identity (callers are trusted as-is
and a one-time startup warning says so), durability beyond this host, more
than one host, and any autonomous operations (all autonomy flags default off;
the shipped ActionPolicy marks every mutating operational action
approval_required).
.env¶
An empty .env is a valid rung-(a) configuration — every line below is
either the shipped default (shown commented, for visibility) or optional:
# ---- Everything below the first block is the shipped default ----
# Spawn the local Rust engine on first connect if it is not already running.
# The `epistemic-graph-server` binary ships with the epistemic-graph wheel.
# Without this flag, the engine daemon must already be running (there is no
# in-process fallback); autostart only ever applies to the local endpoint.
EPISTEMIC_GRAPH_AUTOSTART=1
# Model provider — only needed for LLM-backed features (agents, enrichment).
# Graph operations run without it.
OPENAI_API_KEY=sk-your-key
# ...or any other provider key on AgentConfig (ANTHROPIC_API_KEY,
# GEMINI_API_KEY, ...), or OPENAI_BASE_URL for a local vLLM/Ollama endpoint.
# -- Defaults, listed explicitly (do not need to be set) --
#GRAPH_BACKEND=tiered # L1 epistemic_graph + L2 LadybugDB, embedded
#GRAPH_BACKEND_L1=epistemic_graph
#KG_DAEMON_ROLE=auto # flock host-lock elects ONE host per machine
#STATE_DB_URI= # unset = per-host SQLite state files
#TASK_QUEUE_BACKEND= # unset = auto: sqlite (no STATE_DB_URI)
#AGENT_DISPATCH_BACKEND=inline # agent turns execute in-process
#KG_AUTH_REQUIRED=false # open access + one-time startup warning
#KG_BRAIN_ENFORCE=false # ACL/permission checks are no-ops
#GATEWAY_WORKERS=1
#GATEWAY_METRICS=true # /metrics on the gateway
#KG_DEV_MODE=false # false = all maintenance daemons on
#FLEET_RECONCILER=false
#FLEET_AUTOSCALER=false
#KG_LOOP=false
#KG_FAILURE_EVOLUTION=false
#ACTION_POLICY_PATH= # empty = shipped conservative default policy
config.json¶
None required. On first run a template is written to
~/.config/agent-utilities/config.json; the one thing most dev setups put in
it is the model registry (chat_models / embedding_models) — see
docs/examples/config.json.
Engine auth on this rung¶
Even with zero configuration, engine traffic is authenticated: when
GRAPH_SERVICE_AUTH_SECRET is unset, a per-install HMAC secret is minted once
and persisted at ~/.local/share/agent-utilities/engine_secret (mode 0600,
knowledge_graph/core/graph_compute.py), and every local process — including
any engine this install spawns — agrees on it (CONCEPT:OS-5.14).
KG_ENGINE_INSECURE=1 is the explicit dev opt-out.
Verify¶
# 1. KG host daemon status (host lock holder + live daemon threads)
uv run graph-os-daemon --status
# 2. Python facade round-trip (no services needed beyond the local engine)
python3 -c "
from agent_utilities.knowledge_graph.facade import KnowledgeGraph
kg = KnowledgeGraph()
print(kg.query('MATCH (n) RETURN count(n) AS n'))
"
# 3. MCP surface over stdio (register in your IDE's mcp_config.json)
# { "mcpServers": { "graph-os": { "command": "uv", "args": ["run", "graph-os"] } } }
# 4. REST gateway (also hosts the KG daemon when it runs)
python -m agent_utilities & # binds 0.0.0.0:9000 by default (HOST/PORT)
curl -s localhost:9000/health
curl -s -X POST localhost:9000/api/graph/query \
-H 'content-type: application/json' \
-d '{"cypher": "MATCH (n) RETURN count(n) AS n"}'
Recipe form: Tiny. MCP consumption patterns: mcp-consumption.
Rung (b): Secured single node¶
What you get: everything from (a) plus a closed identity perimeter: engine HMAC (already automatic — now made explicit), server-minted JWT identity required on the KG surface, and fail-closed node-level permission enforcement. What you don't get: durability beyond this host, scale-out, autonomy.
The pieces (all CONCEPT:OS-5.14, validated in
agent_utilities/security/auth.py and
agent_utilities/security/request_identity.py):
- Engine HMAC — automatic per-install secret as in rung (a); set
GRAPH_SERVICE_AUTH_SECRETexplicitly only when multiple installs/hosts must share one engine. KG_AUTH_REQUIRED=1—ActorIdentityMiddlewarevalidatesAuthorization: Bearer <JWT>againstAUTH_JWT_JWKS_URI(JWKS cached 5 minutes), checksAUTH_JWT_ISSUER/AUTH_JWT_AUDIENCEwhen set, and mints the server-sideActorContextfrom the claims (sub/client_id/azp→ actor,roles/realm_access.roles/scope→ roles,tenant_id/tenant/org_id/tid→ tenant). An invalid token is always a 401 — even withKG_AUTH_REQUIREDoff. With it on, requests without a valid token get 401; only health paths (/health,/healthz,/api/health,/api/healthz) and/metricsstay open. Caller-supplied_actor/_roles/_tenantkwargs are ignored.KG_AUTH_TOKEN— a JWT minting the identity of a stdio MCP process (stdio has no Authorization header); validated against the same JWKS.KG_BRAIN_ENFORCE=1— turns the node-ACL / ontology-permission checks from no-ops into real fail-closed enforcement (read inknowledge_graph/core/company_brain_runtime.py; truthy values1,true,yes,on; defaultfalse). With enforcement on, nodes without an ACL are denied by default;KG_ACL_DEFAULT_ALLOW=1is the explicit escape hatch that allows un-ACL'd nodes.
.env¶
# ---- Everything from rung (a) ----
EPISTEMIC_GRAPH_AUTOSTART=1
OPENAI_API_KEY=sk-your-key
# ---- plus: identity & enforcement (OS-5.14) ----
# Required server-validated JWT identity on the KG REST/MCP surface
KG_AUTH_REQUIRED=1
AUTH_JWT_JWKS_URI=https://keycloak.example.internal/realms/agents/protocol/openid-connect/certs
AUTH_JWT_ISSUER=https://keycloak.example.internal/realms/agents
AUTH_JWT_AUDIENCE=agent-utilities
# Identity for stdio MCP processes (no Authorization header on stdio)
#KG_AUTH_TOKEN=eyJhbGciOi...
# Fail-closed node-level permissioning
KG_BRAIN_ENFORCE=1
#KG_ACL_DEFAULT_ALLOW=0 # default: nodes without an ACL are denied
# Engine HMAC is automatic (per-install secret at data_dir()/engine_secret).
# Set explicitly only to share one engine across installs:
#GRAPH_SERVICE_AUTH_SECRET=<openssl rand -hex 32>
#KG_ENGINE_INSECURE=false # default: secure; 1 = dev opt-out
Verify¶
python -m agent_utilities &
# Health stays open
curl -s localhost:9000/health
# Without a token: 401
curl -s -o /dev/null -w '%{http_code}\n' -X POST localhost:9000/api/graph/query \
-H 'content-type: application/json' -d '{"cypher":"MATCH (n) RETURN n LIMIT 1"}'
# -> 401
# With a valid JWT from your issuer: 200
TOKEN=$(curl -s -X POST "$AUTH_JWT_ISSUER/protocol/openid-connect/token" \
-d grant_type=client_credentials -d client_id=agent-utilities \
-d client_secret=REDACTED | jq -r .access_token)
curl -s -X POST localhost:9000/api/graph/query \
-H "authorization: Bearer $TOKEN" -H 'content-type: application/json' \
-d '{"cypher":"MATCH (n) RETURN count(n) AS n"}'
# Engine secret was minted automatically
ls -l ~/.local/share/agent-utilities/engine_secret # mode 0600
JWT validation (JWKS fetch/cache, claim mapping, 401 paths) is unit-tested; a live identity provider is not exercised in CI — validate the issuer wiring against your IdP once per environment. Worked example: identity-jwt. Background: Secrets & auth.
Rung (c): Durable single node¶
What you get: everything from (b) plus durability that survives the host
process — and the schema/locking groundwork for multi-host. One flag,
STATE_DB_URI, externalizes ALL durable platform state (durable-execution
checkpoints, sessions/turns/goals, the KG task + staging queue) onto one
shared Postgres through a single connection pool (CONCEPT:OS-5.16–5.18,
state externalization). A second
DSN, GRAPH_DB_URI, gives the tiered KG a durable Postgres/pg-age tier.
What you don't get: horizontal scale-out of ingest or agent execution
(still one host doing the work), autonomy.
What turns on, with no further flags:
TASK_QUEUE_BACKENDauto-resolution — unset means auto:postgreswhenSTATE_DB_URIis set, elsesqlite(knowledge_graph/core/queue_backend.py: create_task_queue). The Postgres queue claims withFOR UPDATE SKIP LOCKED(at-least-once, 600 s visibility timeout). SettingTASK_QUEUE_BACKEND=postgresexplicitly makes an unreachable database a fail-loud startup error instead of a logged SQLite fallback (CONCEPT:KG-2.55).- Daemon leadership —
DaemonLeadership(core/leadership.py) holds a Postgres session advisory lock per role; maintenance ticks (analysis, golden loop, failure ingest, fuseki publish, reconciler, autoscaler) become leader-only fleet-wide. A crashed leader's lock releases server-side and a follower takes over within one tick (CONCEPT:OS-5.17). Under SQLite this is a no-op (single host).
Compose¶
docker/pg-age.compose.yml provisions a graph-enabled Postgres (ParadeDB)
on host port 5433, database agent_kg, user/password agent/agent,
with init scripts from docker/pg-age-init/:
docker compose -f docker/pg-age.compose.yml up -d
# A separate logical DB for platform state keeps it apart from the graph tier:
docker exec agent-pg-age psql -U agent -d agent_kg -c 'CREATE DATABASE agent_state'
Any Postgres you already run works the same way; the compose file is the worked single-host example.
.env¶
# ---- Everything from rung (b) ----
EPISTEMIC_GRAPH_AUTOSTART=1
OPENAI_API_KEY=sk-your-key
KG_AUTH_REQUIRED=1
AUTH_JWT_JWKS_URI=https://keycloak.example.internal/realms/agents/protocol/openid-connect/certs
AUTH_JWT_ISSUER=https://keycloak.example.internal/realms/agents
AUTH_JWT_AUDIENCE=agent-utilities
KG_BRAIN_ENFORCE=1
# ---- plus: durable state (OS-5.16) + durable graph tier ----
# ONE flag moves checkpoints + sessions/turns/goals + the KG task queue
# onto shared Postgres (unset = the per-host SQLite files of rungs a/b)
STATE_DB_URI=postgresql://agent:agent@localhost:5433/agent_state
#STATE_DB_POOL_SIZE=8 # default: max connections in the ONE shared pool
# Durable Postgres/pg-age tier for the tiered KG backend
GRAPH_DB_URI=postgresql://agent:agent@localhost:5433/agent_kg
# Task queue: leave unset — auto resolves to postgres because STATE_DB_URI is
# set. Set explicitly to make a missing database fail loud at startup:
#TASK_QUEUE_BACKEND=postgres
Verify¶
uv run graph-os-daemon --status # or run the gateway; queue backend should be postgres
# State survives a restart: create a goal, restart, list goals
curl -s -X POST localhost:9000/api/goals -H "authorization: Bearer $TOKEN" \
-H 'content-type: application/json' -d '{"description":"durability probe"}'
# restart the gateway process, then:
curl -s localhost:9000/api/goals -H "authorization: Bearer $TOKEN"
# The live-Postgres integration pass (skipped without STATE_DB_URI):
STATE_DB_URI=postgresql://agent:agent@localhost:5433/agent_state \
python -m pytest tests/integration/test_state_postgres_live.py
Not exercised in CI: the live-Postgres paths. CI runs the unit suites
(tests/unit/test_state_store.py, tests/unit/test_durable_state_postgres.py,
tests/unit/knowledge_graph/test_queue_backend.py) against fake
pools/connections; the live suite tests/integration/test_state_postgres_live.py
exercises the real SKIP LOCKED claims, advisory leadership, and schema, but
only runs when STATE_DB_URI points at a reachable Postgres — run it once
against your database.
Recipe form: Single-node prod.
Rung (d): Scaled multi-host¶
What you get: everything from (c) plus horizontal scale-out of all three
work planes — ingest (Kafka-partitioned task queue + kg-ingest-worker
fleet), agent execution (session-keyed agent_turns queue +
agent-dispatch-worker fleet), and the graph engine itself (N shards,
HRW-routed by tenant/graph) — with N gateway workers/replicas behind a load
balancer and Prometheus scraping every tier. What you don't get:
autonomy (rung e); automatic data re-sharding (adding/removing a shard
reassigns ~1/N of graphs but moves no data — migration is a manual snapshot
export/import, see engine sharding).
Not exercised in CI. No live Kafka broker, multi-shard engine topology,
or cross-host worker fleet runs in CI. The selection/routing/partitioning/
delivery contracts are covered by unit suites with injected transport fakes:
tests/unit/knowledge_graph/test_kafka_ingest_scaleout.py,
tests/unit/knowledge_graph/test_engine_sharding.py,
tests/unit/test_agent_dispatch.py. Validate this rung end-to-end in a
staging environment before relying on it.
Compose¶
# Kafka (KRaft, single broker).
# Minimal — topics are created by the app's idempotent ensure-topic:
docker compose -f docker/docker-compose.kafka.yml up -d
# ...or fully provisioned (kg.* event topics, retention policies, tunable
# partitions via KG_TASKS_PARTITIONS):
docker compose -f docker/kafka-kraft.compose.yml up -d
# 3 engine shards, one shared HMAC secret (the engine binary refuses to start
# without one — there is deliberately no baked-in default):
export GRAPH_SERVICE_AUTH_SECRET="$(openssl rand -hex 32)" # SAME everywhere
docker compose -f docker/engine-shards.compose.yml up -d
# shard RPC: tcp://:9101 :9102 :9103 — Prometheus /metrics: :9111 :9112 :9113
.env¶
# ---- Everything from rung (c) ----
OPENAI_API_KEY=sk-your-key
KG_AUTH_REQUIRED=1
AUTH_JWT_JWKS_URI=https://keycloak.example.internal/realms/agents/protocol/openid-connect/certs
AUTH_JWT_ISSUER=https://keycloak.example.internal/realms/agents
AUTH_JWT_AUDIENCE=agent-utilities
KG_BRAIN_ENFORCE=1
STATE_DB_URI=postgresql://agent:agent@pg.example.internal:5433/agent_state
GRAPH_DB_URI=postgresql://agent:agent@pg.example.internal:5433/agent_kg
# (EPISTEMIC_GRAPH_AUTOSTART dropped: remote tcp:// shards are never
# auto-spawned — an unreachable shard is a fail-loud ConnectionError.)
# ---- plus: Kafka task queue (KG-2.55/2.56) ----
TASK_QUEUE_BACKEND=kafka # explicit = fail-loud if the broker is down
KAFKA_BOOTSTRAP_SERVERS=kafka.example.internal:9092
#KG_TASKS_PARTITIONS=6 # default; ensured at startup, grow-only;
# bounds kg-ingest consumer parallelism
# ---- plus: queue-driven agent dispatch (ORCH-1.45) ----
AGENT_DISPATCH_BACKEND=queue # turns return a job handle; workers execute
#AGENT_TURNS_PARTITIONS=6 # default; bounds concurrent-session parallelism
# ---- plus: engine shards (KG-2.58 / OS-5.14) ----
# Comma-separated or JSON list. The list must be IDENTICAL (verbatim strings)
# on every client; order does not matter (HRW hashing).
GRAPH_SERVICE_ENDPOINTS=tcp://kg-shard-1.example.internal:9101,tcp://kg-shard-2.example.internal:9102,tcp://kg-shard-3.example.internal:9103
GRAPH_SERVICE_AUTH_SECRET=<the one shared secret>
#KG_DEFAULT_GRAPH=__bus__ # default; tenant-mapped before HRW routing
# ---- plus: gateway scale + observability (OS-5.23) ----
GATEWAY_WORKERS=4 # pre-forked workers on ONE listen socket;
# the flock elects ONE KG host among them
#GATEWAY_METRICS=true # default: /metrics on the gateway
#GATEWAY_RATE_LIMIT=0 # per-tenant req/s; 0 = off (buckets per process)
#GATEWAY_RATE_BURST=0 # 0 = 2x rate
#ENGINE_BREAKER_THRESHOLD=5 # default; 0 = off
#ENGINE_BREAKER_COOLDOWN=15 # default, seconds
The same settings in config.json form (JSON list for the endpoints):
{
"task_queue_backend": "kafka",
"kafka_bootstrap_servers": "kafka.example.internal:9092",
"agent_dispatch_backend": "queue",
"graph_service_endpoints": [
"tcp://kg-shard-1.example.internal:9101",
"tcp://kg-shard-2.example.internal:9102",
"tcp://kg-shard-3.example.internal:9103"
],
"graph_service_auth_secret": "<the one shared secret>"
}
Worker fleets¶
Run on any host that can reach Kafka, Postgres, and the shards. Both force
KG_DAEMON_ROLE=client (never contend for the KG host flock) and fail loud at
startup if they cannot reach the engine with the shared HMAC secret:
# Ingest workers (consumer group "kg-ingest"); worker count autosized from
# CPU/memory when --workers is omitted:
kg-ingest-worker --workers 4 --bootstrap-servers kafka.example.internal:9092
# Agent dispatch workers (consumer group "agent-dispatch"); default 1 thread —
# turns are LLM-bound:
agent-dispatch-worker --workers 2
Load balancer (Caddy)¶
N gateway replicas (each GATEWAY_WORKERS=1) or one multi-worker gateway —
both are supported; see gateway scaling
for the per-process state table (metrics registries, rate-limit buckets).
Prometheus scrape¶
/metrics registries are per-process — scrape each gateway replica directly
(not through the load balancer), plus each engine shard's metrics listener
(ports from docker/engine-shards.compose.yml):
scrape_configs:
- job_name: agent-utilities-gateway
static_configs:
- targets: ["gw-1:9000", "gw-2:9000", "gw-3:9000"]
- job_name: epistemic-graph-shards
static_configs:
- targets: ["kg-shard-1.example.internal:9111",
"kg-shard-2.example.internal:9112",
"kg-shard-3.example.internal:9113"]
Key series: agent_utilities_gateway_requests_total,
agent_utilities_engine_shard_up{endpoint},
agent_utilities_dispatch_queue_depth,
agent_utilities_dispatch_turns_total{outcome},
agent_utilities_dispatch_workers. Full walkthrough:
observability.
Verify¶
# Kafka topics exist with the right partition counts (created/grown at startup)
docker exec kafka kafka-topics --bootstrap-server localhost:9092 --describe \
--topic kg_tasks # docker-compose.kafka.yml image; the kafka-kraft.compose.yml
# image uses /opt/kafka/bin/kafka-topics.sh
# Shard topology: per-shard reachability + breaker state
curl -s localhost:9000/api/dashboard/daemon/shards -H "authorization: Bearer $TOKEN"
# Dispatch fleet is visible (worker heartbeats in the topology)
curl -s localhost:9000/api/fleet/topology -H "authorization: Bearer $TOKEN"
# Metrics are flowing
curl -s gw-1:9000/metrics | grep agent_utilities_engine_shard_up
curl -s kg-shard-1.example.internal:9111/metrics | head
Worked examples: sharding-walkthrough, queue-dispatch-walkthrough. Deep dives: engine sharding, agent dispatch, event backbone, capacity model.
Rung (e): Autonomous operations¶
What you get: everything from (d) plus the platform operating on itself:
the golden-loop research/remediation cycle, failure-driven evolution from
Langfuse telemetry, the desired-state fleet reconciler with a real actuator,
the reactive replica autoscaler, and webhook ingress for monitoring events.
Every mutating action still flows through the ONE ActionPolicy gate — the
shipped default policy queues all of it for human approval, so "autonomous"
is opt-in per action kind. What you don't get: unattended mutation out of
the box (you must relax the policy rule-by-rule), and auto-merge of evolution
proposals unless you explicitly enable KG_GOLDEN_AUTO_MERGE.
All ticks below run in the KG host daemon and are leader-only under
STATE_DB_URI (rung c) — exactly one host in the fleet runs them.
Not exercised in CI. The control logic (policy gate, reconciler diff,
autoscaler bounds, golden-loop stages) is unit-tested, but CI never runs a
live Langfuse, a real Docker actuator, or a live Prometheus signal source.
The shipped defaults are deliberately inert (FLEET_ACTUATOR=dryrun,
approval-required policy); treat every relaxation as a production change.
.env¶
# ---- Everything from rung (d), plus: ----
# -- Golden loop: propose-only self-evolution (intake -> acquire -> distill) --
KG_LOOP=true
#KG_LOOP_INTERVAL=3600 # default, seconds
#KG_LOOP_TOPICS=5 # default: hot topics per tick
#KG_LOOP_DISTILL=false # default; opt-in distillation stage
#KG_LOOP_BREADTH=false # default; opt-in breadth scan
#KG_LOOP_STANDARDIZE=false # default; opt-in standardization stage
KG_GOLDEN_AUTO_MERGE=false # default; keep merges human-gated
#EVOLUTION_WORKTREE_ROOT= # default: data_dir()/evolution_worktrees
# -- Failure-driven evolution (AHE-3.18): Langfuse failures -> remediation --
KG_FAILURE_EVOLUTION=true
#KG_FAILURE_EVOLUTION_INTERVAL=3600 # default, seconds
#KG_FAILURE_EVOLUTION_WINDOW=86400 # default: telemetry look-back, seconds
#KG_FAILURE_REGRESSION_DATASET=false # default; dataset-based regression path
LANGFUSE_HOST=https://langfuse.example.internal
LANGFUSE_PUBLIC_KEY=pk-lf-REDACTED
LANGFUSE_SECRET_KEY=sk-lf-REDACTED
# -- Fleet reconciler (OS-5.25): desired state vs observed, policy-gated --
FLEET_RECONCILER=true
#FLEET_RECONCILER_INTERVAL=120 # default, seconds
#FLEET_RECONCILER_MAX_ACTIONS=5 # default: storm guard per tick
#FLEET_REGISTRY_PATH= # empty = shipped deploy/mcp-fleet.registry.yml
#FLEET_DESIRED_STATE_PATH= # optional per-service replicas/version overlay
FLEET_ACTUATOR=docker # default "dryrun" records intent, mutates NOTHING;
# "docker" = reference CLI actuator.
# Portainer/Swarm: set_fleet_actuator() seam (below)
#DEPLOY_WATCH_WINDOW=300 # default: post-deploy health watch, seconds
#DEPLOY_WATCH_POLL=15 # default: probe interval inside the watch
# -- Autoscaler (OS-5.29): load signal -> registry min/max -> policy gate --
FLEET_AUTOSCALER=true
#FLEET_AUTOSCALER_INTERVAL=60 # default, seconds
SCALING_PROMETHEUS_URL=http://prometheus.example.internal:9090
# unset = zero-infra in-process gauges
# -- ActionPolicy (OS-5.24): the single autonomy decision point --
ACTION_POLICY_PATH=/etc/agent-utilities/action-policy.yml
# Empty (default) = the shipped conservative policy
# (deploy/action-policy.default.yml): every mutating kind approval_required,
# only diagnose/observe/notify/record_dry_run auto. KG governance_rule
# overrides (scope: action_policy) win over file rules either way.
# -- Monitoring webhook ingress (OS-5.15) --
FLEET_EVENTS_TOKEN=<openssl rand -hex 32>
# Shared secret for POST /api/fleet/events (header X-Fleet-Events-Token,
# constant-time compare, re-read per request so rotation needs no restart).
# Default unset = the endpoint accepts unauthenticated posts.
Custom actuator (Portainer/Swarm)¶
FLEET_ACTUATOR selects between dryrun and the reference docker CLI
actuator; anything else is deployment-wired through the seam:
from agent_utilities.orchestration.fleet_actuation import set_fleet_actuator
set_fleet_actuator(MyPortainerActuator()) # real actuation behind the policy gate
Verify¶
# Reconciler/autoscaler proposals land in the approval queue (default policy
# queues every mutating action):
curl -s localhost:9000/api/fleet/approvals -H "authorization: Bearer $TOKEN"
# Webhook ingress: rejected without the token...
curl -s -o /dev/null -w '%{http_code}\n' -X POST localhost:9000/api/fleet/events \
-H 'content-type: application/json' -d '{"alerts":[]}'
# -> 401
# ...accepted with it:
curl -s -X POST 'localhost:9000/api/fleet/events?source=alertmanager' \
-H "x-fleet-events-token: $FLEET_EVENTS_TOKEN" \
-H 'content-type: application/json' \
-d '{"alerts":[{"status":"firing","labels":{"alertname":"probe"}}]}'
# Daemon ticks are registered (leader host):
uv run graph-os-daemon --status
Worked examples: action-policy-postures (relaxing the default posture rule-by-rule), fleet-events-wiring (Alertmanager / Uptime Kuma / Portainer payloads), autoscaling-signals, evolution-publication (how promoted proposals become reviewable local branches). Deep dives: fleet autonomy, failure-driven evolution, autonomous evolution guide.
Recipe form: Enterprise.
Where to go next¶
- Flag-by-flag inventory and the configuration-discipline rule: Configuration Reference & Flag Audit
- Worked end-to-end examples: ontology-to-workflow, identity-jwt, observability, sharding-walkthrough, queue-dispatch-walkthrough, fleet-events-wiring, action-policy-postures, autoscaling-signals, evolution-publication, mcp-consumption
- Architecture deep dives: state externalization, engine sharding, gateway scaling, agent dispatch, fleet autonomy, graph service layer
- Sizing: capacity model