Fleet Autonomy Control Plane (OS-5.24 to OS-5.27, OS-5.29)
The Tranche-3 autonomy build: the layer that lets the platform act on its fleet — restart, scale, deploy, remediate — without ever acting outside policy. Five pieces, one decision point.
```mermaid
flowchart LR
    subgraph Ingress
        AM[Alertmanager / Uptime Kuma / Portainer] -->|POST /api/fleet/events| FE[FleetEvent KG nodes\nOS-5.15]
    end
    FE --> TR[fleet_event_triage task]
    TR --> PB[Remediation playbooks\nOS-5.26]
    REG[deploy/mcp-fleet.registry.yml\n+ desired-state override] --> RC[Fleet reconciler tick\nOS-5.25]
    REG -->|scaling: bounds| AS[Autoscaler tick\nOS-5.29]
    SIG[ScalingSignalProvider\nlocal gauges / Prometheus / injected] --> AS
    OBS[FleetObserver\nKG events + docker / injected] --> RC
    OBS --> AS
    OBS --> PB
    PB --> AP{ActionPolicy\nOS-5.24}
    RC --> AP
    AS --> AP
    DW[Deploy watch\nOS-5.27] --> AP
    AP -->|allow / allow+notify| ACT[FleetActuator\ndry-run default / docker / injected]
    AP -->|queue_approval| APQ[ActionApproval nodes\n/api/fleet/approvals]
    AP -->|deny| AUD
    APQ -->|granted| RC
    ACT -->|restart/deploy| DW
    AP --> AUD[ActionDecision audit ledger]
    ACT --> EXE[ActionExecution records]
```
OS-5.24: ActionPolicy, the single autonomy decision point
orchestration/action_policy.py. Every autonomous mutating operational
action — reconciler convergence, playbook restarts, deploy-watch rollbacks —
passes through ActionPolicy.decide(ActionRequest) first. No caller
actuates without a decision.
Before this layer, every autonomy gate was a binary env flag (on = fully autonomous, off = dead). ActionPolicy replaces the cliff with per-action tiers and budgets:
| Concept | Meaning |
|---|---|
| tier | auto / auto_notify / approval_required / forbidden |
| rate limit | max allowed executions per action+target per window → exceeded = deny (a looping remediation is broken) |
| blast radius | max distinct targets per action kind per window → exceeded = queue for approval (a wide wave needs a human) |
| maintenance window | "HH:MM-HH:MM" UTC; auto tiers outside the window are queued for approval |
Policy sources, most-specific first:
- KG overrides: governance_rule nodes with scope: action_policy (flat props or a rule_json payload; active: false disables). These always beat file rules and can be written at runtime.
- YAML file: ACTION_POLICY_PATH, default = the shipped conservative deploy/action-policy.default.yml (embedded byte-identically in code for installed wheels): everything mutating is approval_required; only no-op/diagnostic kinds are auto.
Example policy:
```yaml
version: 1
defaults:
  tier: approval_required
  rate_limit: {max: 3, window_s: 3600}
  blast_radius: {max_targets: 3, window_s: 3600}
rules:
  - {kind: diagnose, target: "*", tier: auto}
  - kind: restart_service
    target: "staging-*"
    tier: auto_notify
    rate_limit: {max: 2, window_s: 1800}
    maintenance_window: "02:00-05:00"
  - {kind: restart_service, target: "*", tier: approval_required}
  - {kind: scale_service, target: "prod-db", tier: forbidden}
```
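How a request might be resolved against the example policy can be sketched as follows. This is a minimal sketch, not the shipped matcher: it assumes rules are evaluated top to bottom with the first match winning, that target is a shell-style glob, and that an unmatched request falls back to defaults. The function name `resolve_rule` and the in-memory dict shape are hypothetical.

```python
from fnmatch import fnmatchcase

# The example policy as plain dicts (hypothetical in-memory shape).
POLICY = {
    "defaults": {
        "tier": "approval_required",
        "rate_limit": {"max": 3, "window_s": 3600},
        "blast_radius": {"max_targets": 3, "window_s": 3600},
    },
    "rules": [
        {"kind": "diagnose", "target": "*", "tier": "auto"},
        {"kind": "restart_service", "target": "staging-*", "tier": "auto_notify",
         "rate_limit": {"max": 2, "window_s": 1800},
         "maintenance_window": "02:00-05:00"},
        {"kind": "restart_service", "target": "*", "tier": "approval_required"},
        {"kind": "scale_service", "target": "prod-db", "tier": "forbidden"},
    ],
}


def resolve_rule(policy: dict, kind: str, target: str) -> dict:
    """Return defaults overlaid by the first matching rule.

    Assumption: first match wins and targets are shell-style globs.
    """
    effective = dict(policy["defaults"])
    for rule in policy["rules"]:
        if rule["kind"] == kind and fnmatchcase(target, rule["target"]):
            effective.update(rule)
            break
    return effective
```

Under these assumptions the ordering of the two restart_service rules matters: the staging glob has to precede the catch-all, otherwise staging targets would resolve to approval_required.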
Decisions are allow / allow_notify / queue_approval / deny; internal
errors fail closed (deny). Every decision is audit-logged as an
ActionDecision KG node — which is also the durable ledger the rate/blast
accounting reads, so budgets hold across processes and restarts.
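The mapping from tier and budgets to a decision can be illustrated with a small standalone function. It is a sketch under stated assumptions, not ActionPolicy.decide itself: the order of the checks, the `Budget` shape, and the exact meaning of "exceeded" (prior executions already at the maximum; distinct targets above the maximum once this one is counted) are all illustrative choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    """Window counts as they might be read from the audit ledger (hypothetical)."""
    executions_for_target: int  # executions already recorded for this action+target
    distinct_targets: int       # distinct targets for this kind, counting this one


def decide(tier: str, budget: Budget, *, rate_max: int, blast_max: int,
           in_maintenance_window: bool = True) -> str:
    """Map a tier plus budgets onto allow / allow_notify / queue_approval / deny."""
    try:
        if tier == "forbidden":
            return "deny"
        if budget.executions_for_target >= rate_max:
            return "deny"            # rate limit exceeded: a looping remediation
        if tier == "approval_required":
            return "queue_approval"
        if budget.distinct_targets > blast_max:
            return "queue_approval"  # blast radius exceeded: a wide wave
        if not in_maintenance_window:
            return "queue_approval"  # auto tiers outside the window are queued
        if tier == "auto":
            return "allow"
        if tier == "auto_notify":
            return "allow_notify"
        raise ValueError(f"unknown tier: {tier!r}")
    except Exception:
        return "deny"                # internal errors fail closed
```

Wrapping the whole body in the fail-closed handler means an unknown tier or a malformed budget can never produce an allow.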
Approval flow (reused, not forked). queue_approval files an
ActionApproval node (deduped per kind+target while pending). The existing
fleet approvals routes carry it end-to-end: GET /api/fleet/approvals lists
it next to orchestrator Task approvals, POST /api/fleet/approvals/grant
(job_id = the action_approval:* id) stamps it, and the reconciler's
approved-action drain executes it on the next tick. It is deliberately not
a Task node — pending Tasks are claimed by the engine's task workers,
which would execute the action unapproved.
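The dedupe, grant, and drain lifecycle described above can be modelled in memory. This is only an illustration of the flow: the real entries are ActionApproval KG nodes, and the `ApprovalQueue` class, its status strings, and the numeric id suffix are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ApprovalQueue:
    """In-memory stand-in for ActionApproval nodes (hypothetical shape)."""
    entries: dict = field(default_factory=dict)  # approval id -> record

    def file(self, kind: str, target: str) -> str:
        """File an approval, reusing a pending one for the same kind+target."""
        for approval_id, rec in self.entries.items():
            if (rec["kind"], rec["target"], rec["status"]) == (kind, target, "pending"):
                return approval_id
        approval_id = f"action_approval:{len(self.entries) + 1}"
        self.entries[approval_id] = {"kind": kind, "target": target, "status": "pending"}
        return approval_id

    def grant(self, approval_id: str) -> None:
        """Stamp an entry as granted (what the grant route does conceptually)."""
        self.entries[approval_id]["status"] = "granted"

    def drain_granted(self) -> list:
        """Hand out granted entries exactly once, as a reconciler tick would."""
        ready = [rec for rec in self.entries.values() if rec["status"] == "granted"]
        for rec in ready:
            rec["status"] = "executed"
        return ready
```

Nothing executes until an entry is granted, which is the property a pending Task node would not give.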
Relation to AHE-3.20. The promotion-governance validator
(knowledge_graph/research/promotion_governance.py) remains the
code-evolution gate; it was the conceptual template (typed verdicts,
constitution rules). The adoption follow-up is now closed:
GovernedAutoMerger (knowledge_graph/research/auto_merge.py) routes its
own promotion decision through a merge_promotion ActionRequest before
the proposal→active lifecycle flip — deny blocks the promotion (recorded
and audited, fail-closed on policy errors), the shipped approval_required
tier queues the same ActionApproval the AHE-3.21 publication bridge
consumes (deduped per kind+target), and auto/auto_notify tiers proceed.
See Autonomous Evolution.
OS-5.25: Desired-state fleet reconciler
orchestration/fleet_reconciler.py, registered as the leader-only
fleet_reconciler maintenance tick (opt-in: FLEET_RECONCILER=1, interval
FLEET_RECONCILER_INTERVAL).
- Desired state: deploy/mcp-fleet.registry.yml (override path: FLEET_REGISTRY_PATH): every listed service should be running, 1 replica unless stated. An optional FLEET_DESIRED_STATE_PATH YAML layers per-service replicas / desired: running|stopped / version.
- Observed state: the FleetObserver protocol (orchestration/fleet_observation.py). Default = KGFleetObserver (folds the OS-5.15 FleetEvent stream into per-service status + flap counts) composed with DockerFleetObserver when a docker CLI exists. Deployments inject richer observers (Portainer, Prometheus) via set_fleet_observer().
- Divergence rules (conservative): observed down ⇒ restart_service; replica mismatch ⇒ scale_service; up-but-desired-stopped ⇒ stop_service; no observation ⇒ no action (never act on zero evidence).
- Every proposal passes ActionPolicy, then the FleetActuator protocol (orchestration/fleet_actuation.py). The default DryRunActuator records intent (ActionExecution nodes, dry_run: true) and mutates nothing, so it is safe to enable fleet-wide before any real actuator is wired. FLEET_ACTUATOR=docker selects the reference docker CLI actuator; Portainer/Swarm actuation is deployment-wired via set_fleet_actuator().
- Restart/deploy executions schedule an OS-5.27 health watch; granted ActionApproval entries are drained and executed each tick; a storm guard caps actions per tick (FLEET_RECONCILER_MAX_ACTIONS); each pass writes a ReconcileReport node.
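The divergence rules reduce to a small pure function. The sketch below assumes simplified `Desired` and `Observed` records (both hypothetical; the real observer and registry types are not shown here) and returns the action kind to propose, or None.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Desired:
    state: str = "running"  # "running" or "stopped"
    replicas: int = 1       # 1 replica unless stated


@dataclass(frozen=True)
class Observed:
    status: str                     # "up" or "down"
    replicas: Optional[int] = None  # None when no replica count was observed


def propose(desired: Desired, observed: Optional[Observed]) -> Optional[str]:
    """Return the action kind to propose, or None when nothing should happen."""
    if observed is None:
        return None  # no observation: never act on zero evidence
    if desired.state == "stopped":
        return "stop_service" if observed.status == "up" else None
    if observed.status == "down":
        return "restart_service"
    if observed.replicas is not None and observed.replicas != desired.replicas:
        return "scale_service"
    return None
```

Whatever this returns is only a proposal; it still has to pass ActionPolicy before the actuator sees it.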
OS-5.26: Remediation playbooks (on the OS-5.15 seam)
knowledge_graph/adaptation/remediation_playbooks.py, registered through
the existing register_playbook() seam for critical/error events from
alertmanager / uptime-kuma / portainer / generic (warnings keep the OS-5.15
default playbook). The dispatcher still runs the default playbook first, so
correlation + failure_gap golden-loop escalation are preserved.
| Playbook | Steps |
|---|---|
| service_down | confirm via observer (recovered ⇒ stop) → ActionPolicy → actuate restart → schedule OS-5.27 verification watch → escalate on deny/failure |
| service_flapping | ≥3 down-events in the observer window ⇒ back off (no restart even on an auto tier) + escalate a restart proposal to a human |
| resource_pressure | disk/memory/CPU markers ⇒ never auto-act: notify + queue an investigate_resource_pressure proposal |
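The interplay between service_down and service_flapping for a single down event can be sketched as one function that returns the step trail. This is an illustration only: the step names, the function `remediate`, the choice to check flapping before consulting the policy decision, and the treatment of any non-allow decision as an escalation are assumptions, not the shipped playbook code.

```python
from typing import List

FLAP_THRESHOLD = 3  # down-events in the observer window, per the table above


def remediate(still_down: bool, down_events: int, decision: str) -> List[str]:
    """Return the ordered step trail for one critical 'service down' event."""
    if not still_down:
        return ["confirmed_recovered"]            # observer says recovered: stop
    if down_events >= FLAP_THRESHOLD:
        return ["flapping_backoff", "escalate"]   # no restart, even on an auto tier
    if decision in ("allow", "allow_notify"):
        return ["restart", "schedule_watch"]      # actuate, then verify via OS-5.27
    return ["escalate"]                           # denied or queued: hand to a human
```

A trail like this is the kind of record that would be appended to the originating event's remediation_log.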
Every step outcome is appended to the originating FleetEvent node
(remediation_log JSON trail + remediation_status). Escalation = an
approval-queue entry plus a notification through the KG-2.42 notifier
seam (actions/dispatch.send_notification — journaled by default; a
deployment registers a real Slack/email Notifier via
set_default_notifier).
OS-5.27: Health-gated deploy + rollback
orchestration/deploy_watch.py — the safety net every mutating action ends
in. watch_deploy(engine, service, version, window_s, ...) enqueues a
durable deploy_watch task; the worker-side run_deploy_watch probes the
FleetObserver every DEPLOY_WATCH_POLL seconds until the recorded
deadline (a watch requeued by the zombie-task reaper after a host crash
resumes its original window):
- any down observation ⇒ failed → on_fail (default: a rollback_service request through ActionPolicy, queued under the shipped default and actuated under permissive policy, plus operator escalation);
- window closes with ≥1 healthy and 0 down probes ⇒ success;
- zero observations ⇒ unobserved: notify (monitoring gap), never roll back on zero evidence.
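The three outcomes can be expressed as a classifier over the probe results collected during the window. A minimal sketch, assuming each poll yields "up", "down", or None for "no observation"; the function name and probe encoding are hypothetical.

```python
from typing import Iterable, Optional


def classify_watch(probes: Iterable[Optional[str]]) -> str:
    """Classify a watch window as 'failed', 'success', or 'unobserved'."""
    healthy = 0
    for probe in probes:
        if probe == "down":
            return "failed"  # any down observation fails the watch immediately
        if probe == "up":
            healthy += 1
    # Window closed without a down probe.
    return "success" if healthy >= 1 else "unobserved"
```

Only "failed" leads to a rollback request; "unobserved" signals a monitoring gap and should trigger a notification instead.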
Outcomes persist as DeployWatch nodes. The reconciler and the
service_down playbook schedule a watch after every successful
restart/deploy actuation.
OS-5.29: Reactive replica autoscaling
orchestration/fleet_autoscaler.py, registered as the leader-only
fleet_autoscaler maintenance tick (opt-in: FLEET_AUTOSCALER=1, interval
FLEET_AUTOSCALER_INTERVAL, default 60 s). Where the reconciler converges on
declared replica counts, the autoscaler chooses them from load — the
smallest viable autoscaler, composed from the existing primitives.
Bounds — registry-declared. A service is only ever autoscaled when its
registry/override entry carries a scaling: block:
```yaml
services:
  - name: vector-mcp
    scaling:
      min: 1               # replica floor (>= 0; default 1)
      max: 3               # replica ceiling (required; >= min)
      signal: queue_depth  # queue_depth | consumer_lag | cpu | custom metric
      target: 200          # per-replica target value (required; > 0)
      scale_up_step: 1     # max replicas added per evaluation (default 1)
      scale_down_step: 1   # max replicas removed per evaluation (default 1)
      cooldown_s: 300      # min seconds between scale actions (default 300)
```
Because deploy/mcp-fleet.registry.yml is machine-generated, the normal home
for scaling blocks is the FLEET_DESIRED_STATE_PATH override (same shape,
layered on top; scaling: null explicitly disables). Validation is strict —
max >= min >= 0, signal/target/max explicit — and an invalid block is
dropped with a warning (the service keeps the static OS-5.25 reconcile);
a typo never produces surprise scaling.
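Strict validation of a scaling block might look like the sketch below. It follows the constraints listed above (max, signal, and target explicit; max >= min >= 0; target > 0) and applies the documented defaults. The function name, the returned dict shape, and the extra checks on step sizes and cooldown are assumptions; a caller would log the warning when None comes back for a non-null block.

```python
from typing import Optional

DEFAULTS = {"min": 1, "scale_up_step": 1, "scale_down_step": 1, "cooldown_s": 300}
INT_KEYS = ("min", "max", "scale_up_step", "scale_down_step", "cooldown_s")


def validate_scaling(block: object) -> Optional[dict]:
    """Return the block with defaults applied, or None if absent or invalid."""
    if not isinstance(block, dict):
        return None  # absent, or `scaling: null` (explicitly disabled)
    if any(key not in block for key in ("max", "signal", "target")):
        return None  # required keys must be explicit
    merged = {**DEFAULTS, **block}
    if any(type(merged[key]) is not int for key in INT_KEYS):
        return None
    if type(merged["target"]) not in (int, float) or merged["target"] <= 0:
        return None
    if not isinstance(merged["signal"], str) or not merged["signal"]:
        return None
    if not (merged["max"] >= merged["min"] >= 0):
        return None
    # Assumed extra constraints: positive steps, non-negative cooldown.
    if min(merged["scale_up_step"], merged["scale_down_step"]) < 1 or merged["cooldown_s"] < 0:
        return None
    return merged
```

Returning None for every invalid shape keeps the service on the static reconcile path instead of scaling on a half-parsed block.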
Signal hierarchy — pluggable, no hard Prometheus dependency
(orchestration/scaling_signals.py, ScalingSignalProvider protocol:
signal_value(service, signal) -> float | None):
- Injected: set_scaling_signal_provider(MyProvider()) (lgtm-mcp, Thanos, pushgateway, …) wins over everything.
- Prometheus: SCALING_PROMETHEUS_URL set ⇒ instant HTTP queries against /api/v1/query (plain httpx GET, no client lib). Well-known signals use shipped PromQL; a custom signal is sent verbatim as PromQL with a {service} placeholder.
- Local (default, zero-infra): this process's own OS-5.23/KG-2.55 gauges: queue_depth → agent_utilities_kg_ingest_queue_depth, consumer_lag → agent_utilities_kg_ingest_consumer_lag; other names resolve verbatim in the local registry.
None (unknown signal, provider failure, metrics extra absent, empty query
result) means no data ⇒ no scaling action — the same
never-act-on-zero-evidence rule the reconciler applies to unobserved
services.
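The provider seam and the "None means no data" rule can be sketched with a structural protocol. Only the signal_value(service, signal) -> float | None signature comes from the description above; `StaticProvider`, `FailingProvider`, `select_provider`, and `read_signal` are hypothetical helpers, and selection (rather than fall-through between providers) is an assumption.

```python
from typing import Optional, Protocol


class ScalingSignalProvider(Protocol):
    def signal_value(self, service: str, signal: str) -> Optional[float]: ...


class StaticProvider:
    """Toy provider backed by a dict keyed on (service, signal)."""

    def __init__(self, values: dict):
        self._values = values

    def signal_value(self, service: str, signal: str) -> Optional[float]:
        return self._values.get((service, signal))


class FailingProvider:
    """Toy provider that simulates an unreachable metrics backend."""

    def signal_value(self, service: str, signal: str) -> Optional[float]:
        raise ConnectionError("metrics backend unreachable")


def select_provider(injected, prometheus, local):
    """Injected wins; else Prometheus when configured; else local gauges."""
    if injected is not None:
        return injected
    return prometheus if prometheus is not None else local


def read_signal(provider: ScalingSignalProvider, service: str, signal: str) -> Optional[float]:
    """Collapse provider failures into None so the caller takes no action."""
    try:
        value = provider.signal_value(service, signal)
    except Exception:
        return None
    return None if value is None else float(value)
```

Because the protocol is structural, a deployment-specific provider only needs a matching signal_value method; no base class is required.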
Target tracking (as shipped) derives a desired replica count from the per-replica signal value and the configured target,
with effective_current = max(current, 1) (scale-from-zero works) and
fleet-total signals (queue_depth, consumer_lag) first normalized to
per-replica (value / effective_current); cpu and custom signals are
treated as per-replica averages already (write custom PromQL with avg, not
sum). The result is clamped to [min, max] and step-capped (at most
scale_up_step added / scale_down_step removed per evaluation), and
current comes from the FleetObserver — a service with no replica
observation (or observed down: that is the reconciler's restart problem)
is skipped.
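The normalization, clamping, and step-capping described above can be put together in a short function. The core ratio used here, ceil(effective_current * per_replica / target), is the common target-tracking formula and is an assumption; the shipped expression is not reproduced in this section. The function name and parameters are hypothetical.

```python
import math
from typing import Optional

FLEET_TOTAL_SIGNALS = {"queue_depth", "consumer_lag"}


def desired_replicas(current: int, value: Optional[float], signal: str, *,
                     target: float, min_r: int, max_r: int,
                     up_step: int = 1, down_step: int = 1) -> Optional[int]:
    """Return the next replica count, or None when there is no data."""
    if value is None:
        return None                          # no data: no scaling action
    effective_current = max(current, 1)      # lets scale-from-zero work
    if signal in FLEET_TOTAL_SIGNALS:
        per_replica = value / effective_current  # normalize fleet totals
    else:
        per_replica = value                  # cpu/custom: already per-replica
    raw = math.ceil(effective_current * per_replica / target)  # assumed formula
    clamped = max(min_r, min(max_r, raw))
    # Step cap: move at most up_step up or down_step down from current.
    return max(current - down_step, min(current + up_step, clamped))
```

For example, one replica with a queue_depth of 450 against a target of 200 wants three replicas, but a scale_up_step of 1 limits this evaluation to two; the next evaluation after the cooldown can add the third.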
Cooldown / flap rules. No scale action — in either direction — within
cooldown_s of the service's last allowed/executed scale_service entry in
the durable ActionDecision/ActionExecution ledger (shared across
processes and restarts). That single rule is also the flap guard: an
up-then-down oscillation inside the window is impossible. The per-tick
action budget reuses FLEET_RECONCILER_MAX_ACTIONS.
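The cooldown rule is a single comparison against the most recent scale entry. A minimal sketch, assuming the relevant ledger entries have already been reduced to a list of epoch-second timestamps; the function name and inputs are hypothetical.

```python
from typing import Iterable


def in_cooldown(now: float, scale_timestamps: Iterable[float], cooldown_s: float) -> bool:
    """True when the last allowed/executed scale is within cooldown_s of now."""
    last = max(scale_timestamps, default=None)
    return last is not None and (now - last) < cooldown_s
```

Since the check ignores direction, a scale-up followed by a scale-down inside the window is rejected by the same test.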
Gate → actuate → watch. Every proposal is a scale_service
ActionRequest (source: autoscaler, params carry
replicas/from_replicas/direction/signal/value/target) through ActionPolicy
— approval_required under the shipped default policy, so enabling
FLEET_AUTOSCALER out of the box only queues scale proposals for a human.
Allowed actions run through the FleetActuator seam; a successful scale-up
schedules an OS-5.27 deploy watch, and scale-downs do too when the policy
file opts in.
Audit. At most ONE compact AutoscaleEvaluation node per tick (per-tick
verdict list in details_json; quiet ticks write nothing) — per-action
audit already lives in the ActionDecision/ActionExecution ledger.
Strangled: capabilities/auto_healing.py
The dormant AutoHealingEngine shell (disabled by default, never-wired
skill_evolver/fallback_router hooks, no production caller) is deleted.
Its one useful bit — threshold-counted repeated-failure escalation — is
absorbed into graph/parallel_engine.py
(_escalate_repeated_failure), which now files a failure_gap Concept
topic through the live AHE-3.18 propose-only remediation chain at the same
3-failure threshold.
Flags
| Flag | Default | Concept |
|---|---|---|
| ACTION_POLICY_PATH | shipped conservative default | OS-5.24 |
| FLEET_RECONCILER | false (opt-in) | OS-5.25 |
| FLEET_RECONCILER_INTERVAL | 120 s | OS-5.25 |
| FLEET_RECONCILER_MAX_ACTIONS | 5 / tick | OS-5.25 |
| FLEET_REGISTRY_PATH | deploy/mcp-fleet.registry.yml | OS-5.25 |
| FLEET_DESIRED_STATE_PATH | unset | OS-5.25 |
| FLEET_ACTUATOR | dryrun | OS-5.25 |
| DEPLOY_WATCH_WINDOW | 300 s | OS-5.27 |
| DEPLOY_WATCH_POLL | 15 s | OS-5.27 |
| FLEET_AUTOSCALER | false (opt-in) | OS-5.29 |
| FLEET_AUTOSCALER_INTERVAL | 60 s | OS-5.29 |
| SCALING_PROMETHEUS_URL | unset (local gauges) | OS-5.29 |
All of these are typed AgentConfig fields (see docs/architecture/configuration.md).
Wiring a real deployment
```python
from agent_utilities.orchestration.fleet_actuation import set_fleet_actuator
from agent_utilities.orchestration.fleet_observation import set_fleet_observer
from agent_utilities.orchestration.scaling_signals import set_scaling_signal_provider
from agent_utilities.knowledge_graph.actions.dispatch import set_default_notifier

# The My* classes are placeholders for deployment-provided implementations
# of the corresponding protocols; they are not shipped with the package.
set_fleet_observer(MyPortainerObserver())      # richer observed state
set_fleet_actuator(MyPortainerActuator())      # real actuation
set_scaling_signal_provider(MyLgtmProvider())  # richer load signals (optional)
set_default_notifier(MySlackNotifier())        # real escalation channel
```
For autoscaling without a custom provider, point the built-in at your
Prometheus (SCALING_PROMETHEUS_URL=http://prometheus:9090), declare
scaling: bounds in the FLEET_DESIRED_STATE_PATH override, and set
FLEET_AUTOSCALER=1.
Then set FLEET_RECONCILER=1 and relax ACTION_POLICY_PATH rule-by-rule
(staging targets first) as confidence grows — the audit ledger
(ActionDecision / ActionExecution / ReconcileReport / DeployWatch /
AutoscaleEvaluation nodes) is the evidence trail.