Scaling the Gateway

CONCEPT:OS-5.23 — Gateway Middle-Tier Hardening (Python-tier Prometheus metrics, per-tenant rate limiting, engine circuit breaker, multi-worker readiness).

The API gateway (agent_utilities.server + gateway/graph_api.py) historically ran as exactly one process with one event loop. This page documents how to run more than one worker/replica, what is safe about that today, and which state remains per-process.

Worker model

GATEWAY_WORKERS=1   (default)  one process, one event loop, in-process KG daemon
GATEWAY_WORKERS=N   (N>1)      pre-forked worker pool on ONE shared listen socket

With GATEWAY_WORKERS>1, _run_agent_server binds the listen socket once, then forks before building the app, so every worker constructs its own FastAPI app, engine client connections and daemon role. The parent is worker 0 and reaps the children when its server exits. (uvicorn's own workers= flag requires an import-string app, which the dynamically-built gateway app cannot provide — hence the explicit pre-fork.) The flag is ignored under pytest and with the terminal UI.
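
The bind-once, fork-before-building order described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the body of _run_agent_server: the helper names bind_listen_socket, prefork and reap, and the serve callback, are hypothetical stand-ins for "build the app and serve on the inherited socket".

```python
import os
import socket
from typing import Callable, Dict, List


def bind_listen_socket(host: str = "127.0.0.1", port: int = 0) -> socket.socket:
    """Bind and listen once, in the parent, before any fork."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen(128)
    return sock


def prefork(
    workers: int,
    serve: Callable[[int, socket.socket], None],
    sock: socket.socket,
) -> List[int]:
    """Fork workers - 1 children; the caller carries on as worker 0.

    Each child runs serve(worker_id, sock) on the inherited socket and then
    exits without returning into the parent's code path.
    """
    child_pids: List[int] = []
    for worker_id in range(1, workers):
        pid = os.fork()
        if pid == 0:
            # Child: anything built from here on (app, clients) is per-process.
            status = 1
            try:
                serve(worker_id, sock)
                status = 0
            finally:
                os._exit(status)
        child_pids.append(pid)
    return child_pids


def reap(child_pids: List[int]) -> Dict[int, int]:
    """Wait for every child and return its exit code, keyed by pid."""
    codes: Dict[int, int] = {}
    for pid in child_pids:
        _, wait_status = os.waitpid(pid, 0)
        codes[pid] = os.waitstatus_to_exitcode(wait_status)
    return codes
```

Because the fork happens before serve runs, nothing constructed inside serve is shared between workers; only the listen socket is inherited.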

You can equally scale with N container replicas (each GATEWAY_WORKERS=1) behind a load balancer — every statement below about "per-process" state applies the same way.

KG host daemon: exactly one, by construction

The consolidated KG host daemon (queue drain, graph writer, task workers, maintenance/golden-loop ticks) is serialized by the advisory flock host lock (knowledge_graph/core/host_lock.py). Each worker resolves its role independently after the fork: the first to acquire the lock becomes the host, and every other worker self-heals to client, meaning it starts no daemon threads and leaves the durable queue for the host to drain. The lock auto-releases when the holder dies, so a crashed host worker never blocks a restart. This was verified against the fork model: role resolution happens per child, after the fork, so the inherited-lock-fd hazard does not arise.

Consequence: daemon ticks (enrichment, hygiene, golden loop, …) run in ONE worker only — that is the intended topology, identical to running the gateway next to MCP servers on one machine.
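
The host/client election can be illustrated with a non-blocking advisory flock. This is a minimal sketch of the technique, not the contents of host_lock.py: resolve_role and the lock path are hypothetical names, and the code assumes a POSIX system where the fcntl module is available.

```python
import fcntl
import os
from typing import Optional, Tuple


def resolve_role(lock_path: str) -> Tuple[str, Optional[int]]:
    """Try to take the host lock without blocking.

    Returns ("host", fd) when the lock was acquired; the caller must keep fd
    open for as long as it acts as host. Returns ("client", None) otherwise.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # LOCK_NB makes the call fail immediately if another holder exists.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return "client", None
    return "host", fd
```

The kernel drops a flock when the last descriptor for that open file is closed, which includes process death; that is what lets a replacement worker become host after a crash. Opening the file inside each worker, after the fork, gives every worker its own open file description, so no lock is inherited from the parent.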

What is per-process (deliberate, documented)

| State | Behaviour across workers/replicas |
| --- | --- |
| Prometheus metrics registry | Per-process. A scrape of /metrics through the shared socket samples ONE worker; aggregate in Prometheus (scrape each replica) or run 1 worker/container. |
| Rate-limit token buckets | Per-process: GATEWAY_RATE_LIMIT is effectively multiplied by the worker count. Precise distributed limiting belongs to the state-externalization track. |
| Engine circuit breaker | Per-process per endpoint. Each worker discovers a dead engine independently (≤ threshold extra probes per worker). |
| Dashboard Aggregator cache | 10s TTL read cache + thread pool: bounded divergence, safe to duplicate. |
| Dashboard layout (ConfigManager) | NOT divergent: reads/writes go to the shared YAML file (XDG config dir) on every request. |
| Durable execution / sessions / engine task queue | Externalized already (SQLite/engine-side); workers coordinate through the KG host. |

Nothing in gateway/api.py mutates module state after startup except save_layout, which persists straight to disk.

Python-tier metrics

The metrics middleware and the GET /metrics route are mounted by register_graph_routes, so the gateway and the agent-webui backend both get them. Naming mirrors the Rust engine's epistemic_graph_* series:

| Metric | Labels | Meaning |
| --- | --- | --- |
| agent_utilities_gateway_requests_total | route, method, status | Request count. route is always a route TEMPLATE (/api/things/{id}); unmatched requests collapse into unmatched. |
| agent_utilities_gateway_request_duration_seconds | route | Latency histogram. |
| agent_utilities_gateway_in_flight_requests | (none) | Gauge of in-flight requests. |
| agent_utilities_gateway_rate_limited_total | tenant | 429s from the token-bucket limiter. |
| agent_utilities_gateway_engine_requests_total | op, outcome | Engine client calls (ok / connection_error / error / short_circuited). |
| agent_utilities_gateway_engine_breaker_state | endpoint | 0=closed, 1=half-open, 2=open. |

GET /metrics is exempt from the identity middleware (scrapers cannot mint JWTs) and from rate limiting. prometheus_client is provided by the optional metrics extra; when it is not installed, every metric degrades to a no-op and /metrics returns a placeholder. Toggle with GATEWAY_METRICS (default on).
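
Two of the behaviours above, the optional dependency degrading to a no-op and the route label collapsing to a template or to unmatched, can be sketched as follows. This is an illustration under stated assumptions, not the gateway's middleware: route_label, record_request and render_metrics are hypothetical helpers, the segment-by-segment template matcher is a stand-in for whatever the web framework reports as the matched route, and the placeholder text is invented.

```python
from typing import Iterable

try:
    from prometheus_client import CollectorRegistry, Counter, generate_latest
except ImportError:  # optional extra not installed: fall back to no-ops
    CollectorRegistry = Counter = generate_latest = None

# A private registry keeps this sketch independent of the global default one.
REGISTRY = CollectorRegistry() if CollectorRegistry is not None else None
REQUESTS = (
    Counter(
        "agent_utilities_gateway_requests_total",
        "Request count.",
        ["route", "method", "status"],
        registry=REGISTRY,
    )
    if Counter is not None
    else None
)


def route_label(path: str, templates: Iterable[str]) -> str:
    """Map a concrete path to its route template, or to "unmatched"."""
    parts = path.strip("/").split("/")
    for template in templates:
        wanted = template.strip("/").split("/")
        if len(wanted) == len(parts) and all(
            (w.startswith("{") and w.endswith("}")) or w == p
            for w, p in zip(wanted, parts)
        ):
            return template
    return "unmatched"


def record_request(path: str, method: str, status: int,
                   templates: Iterable[str]) -> str:
    """Count one request under its template label; a no-op without the extra."""
    label = route_label(path, templates)
    if REQUESTS is not None:
        REQUESTS.labels(route=label, method=method, status=str(status)).inc()
    return label


def render_metrics() -> bytes:
    """Body for GET /metrics: real exposition text, or a placeholder."""
    if generate_latest is None:
        return b"# metrics disabled: prometheus_client is not installed\n"
    return generate_latest(REGISTRY)
```

Labelling by template rather than by raw path keeps the number of time series bounded: /api/things/1 and /api/things/2 land in the same series, and arbitrary probe URLs all land in unmatched.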

Per-tenant rate limiting

Limiting is configured with GATEWAY_RATE_LIMIT (sustained req/s, default 0 = off) and GATEWAY_RATE_BURST (token-bucket capacity, default 2× the rate). The ASGI limiter sits inside the OS-5.14 identity middleware, so the bucket key is derived from the server-minted ActorContext, falling back in order: tenant → authenticated actor id → client IP. Rejections are 429 responses with a Retry-After header and a JSON body. Health routes and /metrics are exempt.
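
A token bucket with these semantics can be sketched in plain Python. This is a minimal, single-threaded illustration, not the gateway's ASGI limiter: TokenBucketLimiter, bucket_key and the injectable clock are hypothetical, and the burst fallback simply mirrors the documented "0 means 2× rate" default.

```python
import time
from typing import Callable, Dict, Optional, Tuple


def bucket_key(tenant: Optional[str], actor_id: Optional[str],
               client_ip: str) -> str:
    """Pick the bucket key in the documented order of preference."""
    return tenant or actor_id or client_ip


class TokenBucketLimiter:
    def __init__(self, rate: float, burst: float = 0.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        # A burst of 0 falls back to twice the sustained rate.
        self.burst = burst if burst > 0 else 2 * rate
        self.clock = clock
        # key -> (tokens left, time of last update)
        self.buckets: Dict[str, Tuple[float, float]] = {}

    def check(self, key: str) -> float:
        """Return 0.0 if the request is allowed, else seconds until a token."""
        if self.rate <= 0:  # limiter is off
            return 0.0
        now = self.clock()
        tokens, updated = self.buckets.get(key, (self.burst, now))
        # Refill for the elapsed time, capped at the bucket capacity.
        tokens = min(self.burst, tokens + (now - updated) * self.rate)
        if tokens >= 1.0:
            self.buckets[key] = (tokens - 1.0, now)
            return 0.0
        self.buckets[key] = (tokens, now)
        return (1.0 - tokens) / self.rate
```

A non-zero return value is what a 429 response would round up into its Retry-After header. Each worker holds its own buckets dictionary, which is why the effective limit scales with the worker count.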

Engine circuit breaker

Every GraphComputeEngine call is guarded by a shared per-endpoint breaker (knowledge_graph/core/engine_breaker.py). ENGINE_BREAKER_THRESHOLD (default 5) consecutive connect/timeout failures open the circuit; once ENGINE_BREAKER_COOLDOWN (default 15s) has elapsed, a single half-open probe either heals the circuit or re-opens it. While the circuit is open, callers get the fast, typed EngineCircuitOpenError (a ConnectionError subclass) instead of hammering a dead socket. Application-level errors (bad Cypher, missing node) never trip the breaker. ENGINE_BREAKER_THRESHOLD=0 disables tripping.
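
The closed / open / half-open cycle can be sketched as a small state machine. This is an illustration, not the code in engine_breaker.py: the CircuitBreaker class and its injectable clock are hypothetical, the sketch is not thread-safe, and treating an application-level error as proof that the engine is reachable (resetting the failure count) is an assumption the text above does not state.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class EngineCircuitOpenError(ConnectionError):
    """Raised instead of calling the engine while the circuit is open."""


class CircuitBreaker:
    CLOSED, HALF_OPEN, OPEN = 0, 1, 2

    def __init__(self, threshold: int = 5, cooldown: float = 15.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None
        self.probing = False

    @property
    def state(self) -> int:
        if self.opened_at is None:
            return self.CLOSED
        if self.probing or self.clock() - self.opened_at >= self.cooldown:
            return self.HALF_OPEN
        return self.OPEN

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None:
            cooling = self.clock() - self.opened_at < self.cooldown
            if cooling or self.probing:
                raise EngineCircuitOpenError("engine circuit is open")
            self.probing = True  # only one half-open probe at a time
        try:
            result = fn()
        except (ConnectionError, TimeoutError):
            self._on_failure()
            raise
        except Exception:
            # Assumption: an application-level error means the engine answered.
            self._on_success()
            raise
        self._on_success()
        return result

    def _on_failure(self) -> None:
        if self.probing:
            # The probe failed: re-open and restart the cooldown.
            self.probing = False
            self.opened_at = self.clock()
            return
        self.failures += 1
        if self.threshold > 0 and self.failures >= self.threshold:
            self.opened_at = self.clock()

    def _on_success(self) -> None:
        self.failures = 0
        self.opened_at = None
        self.probing = False
```

Because EngineCircuitOpenError subclasses ConnectionError, callers that already handle connection failures need no new except clause; they simply fail faster.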

Flags

| Flag | Default | What it sets |
| --- | --- | --- |
| GATEWAY_METRICS | true | Python-tier Prometheus middleware + GET /metrics |
| GATEWAY_RATE_LIMIT | 0 (off) | Per-tenant sustained req/s |
| GATEWAY_RATE_BURST | 0 (→ 2× rate) | Token-bucket capacity |
| GATEWAY_WORKERS | 1 | Pre-forked gateway worker processes |
| ENGINE_BREAKER_THRESHOLD | 5 | Failures before the engine circuit opens (0 = off) |
| ENGINE_BREAKER_COOLDOWN | 15 | Seconds before the half-open probe |

All are typed fields on AgentConfig (core/config.py) per the configuration-discipline rule.