Scaling the Gateway

CONCEPT:OS-5.23 — Gateway Middle-Tier Hardening (Python-tier Prometheus metrics, per-tenant rate limiting, engine circuit breaker, multi-worker readiness).

The API gateway (agent_utilities.server + gateway/graph_api.py) historically ran as exactly one process with one event loop. This page documents how to run more than one worker/replica, what is safe about that today, and which state remains per-process.

Worker model

GATEWAY_WORKERS=1   (default)  one process, one event loop, in-process KG daemon
GATEWAY_WORKERS=N   (N>1)      pre-forked worker pool on ONE shared listen socket

With GATEWAY_WORKERS>1, _run_agent_server binds the listen socket once, then forks before building the app, so every worker constructs its own FastAPI app, engine client connections and daemon role. The parent is worker 0 and reaps the children when its server exits. (uvicorn's own workers= flag requires an import-string app, which the dynamically-built gateway app cannot provide — hence the explicit pre-fork.) The flag is ignored under pytest and with the terminal UI.
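The bind-once, fork-before-building-the-app order described above can be sketched as follows. This is a minimal illustration, not the gateway's `_run_agent_server`: `run_preforked` and the `serve` callback are hypothetical names, `os.fork` limits it to POSIX systems, and a real server would signal its children on shutdown rather than only wait for them.

```python
import os
import socket


def run_preforked(workers, serve, host="127.0.0.1", port=0):
    """Bind one listen socket, fork workers-1 children, run serve() in each process.

    `serve(sock, index)` is a hypothetical callback that builds the worker's
    own app state and serves on the inherited socket. Returns the children's
    raw wait statuses.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))   # bound exactly once, before any fork
    sock.listen(128)

    children = []
    for index in range(1, workers):
        pid = os.fork()
        if pid == 0:
            # Child: everything built from here on is private to this worker.
            status = 1
            try:
                serve(sock, index)
                status = 0
            finally:
                os._exit(status)  # never fall back into the parent's code path
        children.append(pid)

    try:
        serve(sock, 0)  # the parent acts as worker 0
    finally:
        sock.close()
    # Reap the children once the parent's own server has exited.
    return [os.waitpid(pid, 0)[1] for pid in children]
```

Because the fork happens before `serve` runs, nothing the callback constructs (app object, client connections, role) is shared between workers; only the listen socket is inherited.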

You can equally scale with N container replicas (each GATEWAY_WORKERS=1) behind a load balancer — every statement below about "per-process" state applies the same way.

KG host daemon: exactly one, by construction

The consolidated KG host daemon (queue drain, graph writer, task workers, maintenance/golden-loop ticks) is serialized by the advisory flock host-lock (knowledge_graph/core/host_lock.py). Each worker resolves its role independently after the fork: the first to acquire the lock becomes host, and every other worker self-heals to client, meaning it starts no daemon threads and lets the host drain the durable queue. The lock auto-releases when the holder dies, so a crashed host worker never blocks a restart. This was verified against the fork model: because role resolution happens per child, after the fork, the inherited-lock-fd hazard does not arise.

Consequence: daemon ticks (enrichment, hygiene, golden loop, …) run in ONE worker only — that is the intended topology, identical to running the gateway next to MCP servers on one machine.
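A minimal sketch of role resolution with a non-blocking advisory flock, assuming a Linux or other POSIX host. `resolve_role` and the lock path are hypothetical and do not reproduce the contents of host_lock.py; the sketch only shows why the first acquirer becomes host and why the lock disappears with its holder.

```python
import fcntl
import os


def resolve_role(lock_path):
    """Return ("host", fd) if this process won the lock, else ("client", None).

    Hypothetical helper. Each call opens its own file description, so the
    lock is owned by whoever called it after the fork, not inherited.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # LOCK_NB makes the attempt fail immediately instead of waiting.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return "client", None
    # Keep fd open for the life of the process: closing it, or the process
    # dying, releases the lock automatically.
    return "host", fd
```

The caller holding `"host"` would start the daemon threads; a `"client"` result means another worker already holds the lock and no daemon work should start in this process.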

What is per-process (deliberate, documented)

State	Behaviour across workers/replicas
Prometheus metrics registry	Per-process. A scrape of `/metrics` through the shared socket samples ONE worker; aggregate in Prometheus (scrape each replica) or run 1 worker/container.
Rate-limit token buckets	Per-process: `GATEWAY_RATE_LIMIT` is effectively multiplied by the worker count (for example, a limit of 10 req/s with 4 workers admits up to 40 req/s per tenant). Precise distributed limiting belongs to the state-externalization track.
Engine circuit breaker	Per-process per endpoint. Each worker discovers a dead engine independently (≤ threshold extra probes per worker).
Dashboard `Aggregator` cache	10s TTL read cache + thread pool — bounded divergence, safe to duplicate.
Dashboard layout (`ConfigManager`)	NOT divergent: reads/writes go to the shared YAML file (XDG config dir) on every request.
Durable execution / sessions / engine task queue	Externalized already (SQLite/engine-side) — workers coordinate through the KG host.

Nothing in gateway/api.py mutates module state after startup except save_layout, which persists straight to disk.

Python-tier metrics

Python-tier metrics are mounted by register_graph_routes, so the gateway and the agent-webui backend both get them. Naming mirrors the Rust engine's epistemic_graph_* series:

Metric	Labels	Meaning
`agent_utilities_gateway_requests_total`	`route`, `method`, `status`	Request count. `route` is always a route TEMPLATE (`/api/things/{id}`) — unmatched requests collapse into `unmatched`.
`agent_utilities_gateway_request_duration_seconds`	`route`	Latency histogram.
`agent_utilities_gateway_in_flight_requests`	—	Gauge of in-flight requests.
`agent_utilities_gateway_rate_limited_total`	`tenant`	429s from the token-bucket limiter.
`agent_utilities_gateway_engine_requests_total`	`op`, `outcome`	Engine client calls (`ok` / `connection_error` / `error` / `short_circuited`).
`agent_utilities_gateway_engine_breaker_state`	`endpoint`	0=closed, 1=half-open, 2=open.

GET /metrics is exempt from the identity middleware (scrapers cannot mint JWTs) and from rate limiting. prometheus_client comes from the optional metrics extra; if it is not installed, everything degrades to a no-op and /metrics returns a placeholder. Toggle with GATEWAY_METRICS (default on).
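The two behaviours above, template-only route labels and a no-op fallback when prometheus_client is missing, might look roughly like this. `RequestMetrics` and `observe` are hypothetical names, and the private `CollectorRegistry` is an assumption made to keep the sketch self-contained; only the metric name and its labels come from the table.

```python
try:
    from prometheus_client import CollectorRegistry, Counter
except ImportError:  # the metrics extra is optional
    CollectorRegistry = Counter = None


class RequestMetrics:
    """Hypothetical wrapper that counts requests, or does nothing without the extra."""

    def __init__(self):
        self.enabled = Counter is not None
        if self.enabled:
            self.registry = CollectorRegistry()
            self.requests = Counter(
                "agent_utilities_gateway_requests_total",
                "Gateway request count.",
                ["route", "method", "status"],
                registry=self.registry,
            )

    def observe(self, route_template, method, status):
        # Label with the matched route template only; a request that matched
        # no route has no template and collapses into "unmatched".
        route = route_template or "unmatched"
        if self.enabled:
            self.requests.labels(
                route=route, method=method, status=str(status)
            ).inc()
        return route
```

Labelling by template rather than raw path keeps the number of time series bounded, since `/api/things/1` and `/api/things/2` share one label value.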

Per-tenant rate limiting

Rate limiting is configured by GATEWAY_RATE_LIMIT (sustained req/s, default 0 = off) and GATEWAY_RATE_BURST (token-bucket capacity, default 2× rate). The ASGI limiter sits inside the OS-5.14 identity middleware, so the bucket key uses the server-minted ActorContext: tenant → authenticated actor id → client IP. Rejections are 429 with Retry-After and a JSON body. Health routes and /metrics are exempt.
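A token bucket with this rate/burst relationship can be sketched as below. All names here are hypothetical, the clock is passed in explicitly to keep the example deterministic, and reading the `tenant → actor id → client IP` chain as "first value that is present" is an assumption about the key order rather than something the page states.

```python
from dataclasses import dataclass


def bucket_key(tenant, actor_id, client_ip):
    """Pick the bucket key (assumed: the first identity that is present)."""
    if tenant:
        return f"tenant:{tenant}"
    if actor_id:
        return f"actor:{actor_id}"
    return f"ip:{client_ip}"


@dataclass
class TokenBucket:
    rate: float       # tokens added per second
    capacity: float   # maximum tokens, i.e. the burst size
    tokens: float
    updated: float

    def try_acquire(self, now):
        """Return 0.0 if a token was taken, else seconds until one is available."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return 0.0
        return (1.0 - self.tokens) / self.rate


class RateLimiter:
    def __init__(self, rate, burst=0.0):
        self.rate = rate
        # A burst of 0 falls back to twice the sustained rate.
        self.capacity = burst if burst > 0 else 2 * rate
        self.buckets = {}

    def check(self, key, now):
        """Return 0.0 when allowed, else the wait in seconds for Retry-After."""
        if self.rate <= 0:  # rate 0 means limiting is off
            return 0.0
        bucket = self.buckets.get(key)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.capacity, now)
            self.buckets[key] = bucket
        return bucket.try_acquire(now)
```

A non-zero return value is what a caller would round up to whole seconds for the Retry-After header of the 429 response. The `buckets` dict lives in one process, which is why the limit multiplies with the worker count.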

Engine circuit breaker

Every GraphComputeEngine call is guarded by a per-endpoint breaker shared within the process (knowledge_graph/core/engine_breaker.py). ENGINE_BREAKER_THRESHOLD (default 5) consecutive connect/timeout failures open the circuit; after ENGINE_BREAKER_COOLDOWN (default 15s), a single half-open probe either closes the circuit or re-opens it. While open, callers get the fast, typed EngineCircuitOpenError (a ConnectionError subclass) instead of hammering a dead socket. Application-level errors (bad Cypher, missing node) never trip the breaker. ENGINE_BREAKER_THRESHOLD=0 disables tripping.
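The closed → open → half-open cycle can be sketched as a small state machine. This is an illustration under stated assumptions, not the code in engine_breaker.py: `CircuitBreaker` and its methods are hypothetical, the clock is injectable for testing, and treating an application-level error as proof that the engine is reachable (which resets the failure count) is an assumption.

```python
import time


class EngineCircuitOpenError(ConnectionError):
    """Raised instead of calling the engine while the circuit is open."""


class CircuitBreaker:
    def __init__(self, threshold=5, cooldown=15.0, clock=time.monotonic):
        self.threshold = threshold   # 0 disables tripping
        self.cooldown = cooldown     # seconds before the half-open probe
        self.clock = clock
        self.failures = 0
        self.opened_at = None
        self.probing = False

    @property
    def state(self):
        """0=closed, 1=half-open, 2=open (same encoding as the gauge)."""
        if self.opened_at is None:
            return 0
        if self.probing or self.clock() - self.opened_at >= self.cooldown:
            return 1
        return 2

    def call(self, fn):
        if self.opened_at is not None:
            cooling = self.clock() - self.opened_at < self.cooldown
            if cooling or self.probing:
                raise EngineCircuitOpenError("engine circuit open")
            self.probing = True  # exactly one probe is let through
        try:
            result = fn()
        except (ConnectionError, TimeoutError):
            self._record_failure()
            raise
        except Exception:
            # Assumed: the engine answered, so the transport is healthy.
            self._record_success()
            raise
        self._record_success()
        return result

    def _record_success(self):
        self.failures = 0
        self.opened_at = None
        self.probing = False

    def _record_failure(self):
        self.failures += 1
        was_probe = self.probing
        self.probing = False
        if self.threshold > 0 and (was_probe or self.failures >= self.threshold):
            self.opened_at = self.clock()
```

Because the open-circuit error is raised before `fn` runs, it is never counted as a new failure, even though it subclasses ConnectionError.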

Flags

Flag	Default	What it sets
`GATEWAY_METRICS`	`true`	Python-tier Prometheus middleware + `GET /metrics`
`GATEWAY_RATE_LIMIT`	`0` (off)	Per-tenant sustained req/s
`GATEWAY_RATE_BURST`	`0` (→ 2× rate)	Token-bucket capacity
`GATEWAY_WORKERS`	`1`	Pre-forked gateway worker processes
`ENGINE_BREAKER_THRESHOLD`	`5`	Failures before the engine circuit opens (0 = off)
`ENGINE_BREAKER_COOLDOWN`	`15`	Seconds before the half-open probe

All are typed fields on AgentConfig (core/config.py) per the configuration-discipline rule.
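To show what typed flags with these defaults could look like, here is a standalone sketch. `GatewayFlags` is a hypothetical dataclass, not the real AgentConfig, and the field names, the numeric types and the set of strings accepted as true are all assumptions; only the environment variable names and default values come from the table.

```python
from dataclasses import dataclass
from typing import Mapping


def _as_bool(raw: str) -> bool:
    # Assumed truthy spellings; the real parser may accept others.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class GatewayFlags:
    metrics: bool = True
    rate_limit: float = 0.0        # 0 = off
    rate_burst: float = 0.0        # 0 = derive from rate_limit
    workers: int = 1
    breaker_threshold: int = 5     # 0 = never trip
    breaker_cooldown: float = 15.0

    @property
    def effective_burst(self) -> float:
        """Token-bucket capacity after applying the 2x-rate default."""
        return self.rate_burst if self.rate_burst > 0 else 2 * self.rate_limit

    @classmethod
    def from_env(cls, env: Mapping[str, str]) -> "GatewayFlags":
        def read(name, default, cast):
            raw = env.get(name)
            return default if raw is None or raw == "" else cast(raw)

        return cls(
            metrics=read("GATEWAY_METRICS", True, _as_bool),
            rate_limit=read("GATEWAY_RATE_LIMIT", 0.0, float),
            rate_burst=read("GATEWAY_RATE_BURST", 0.0, float),
            workers=read("GATEWAY_WORKERS", 1, int),
            breaker_threshold=read("ENGINE_BREAKER_THRESHOLD", 5, int),
            breaker_cooldown=read("ENGINE_BREAKER_COOLDOWN", 15.0, float),
        )
```

Passing `os.environ` as `env` reads the process environment; passing a plain dict, as in tests, avoids touching global state.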