Capacity Model (Plan 07: Path to Scale)¶
Status: MODELED, with measured anchors. Three things are now measured (not asserted): the epistemic-graph single-connection transport latency, per-shard linear write throughput (1→4 shards ≈ 6.6× at constant wall-time, no shared-state cliff), and the per-agent working-set footprint (~52 kB) for a bounded subgraph — all in
epistemic-graph/docs/benchmarks.md(reproduce withscripts/bench_scale.py). Every resident-population sizing figure (10k+) is a linear extrapolation from those anchors plus the per-unit constants below. The 100M target follows as a measured projection (~78 hosts @ 64 GB / 52 kB per agent) — we do not claim 100M has been run; it has not. The arithmetic lives incapacity_model.py, unit-tested intests/scale/test_capacity_model.py.
The measured anchor¶
From epistemic-graph/docs/benchmarks.md (run captured 2026-06-01, Linux x86-64,
single client connection, in-memory graph, length-prefixed MessagePack over UDS):
| Operation | ops | p50 | p99 |
|---|---|---|---|
AddNode |
3000 | 0.187 ms | 0.223 ms |
AddNode p50 ≈ 0.19 ms ⇒ ~5,000 sequential ops/sec on a single
connection. This is the only empirical input. Throughput above one connection
comes from connection pooling (pool.py) and shard fan-out (ShardRouter);
the server sheds excess concurrent load with a BUSY response
(EPISTEMIC_GRAPH_MAX_INFLIGHT, default 1024) rather than queueing unbounded.
Three-axis framing¶
Scaling is not a single dimension. We size three independent axes and take the max of the resulting infrastructure:
| Axis | Driver | Knob in core/config.py |
|---|---|---|
| Active concurrency | agents executing right now | worker_pool_size × node count; queue-driven dispatch is the live scale-out path for this axis — agent_dispatch_backend=queue + N agent-dispatch-worker hosts (see architecture/agent_dispatch.md, CONCEPT:ORCH-1.45) |
| Resident population | total agents whose state must persist | graph_service_endpoints (PG/L0 shard fan-out — the L0 side is the live tenant-partitioned engine sharding path, see architecture/engine_sharding.md, CONCEPT:KG-2.58) |
| Event throughput | graph events/sec driving fan-out | kafka_bootstrap_servers partitions |
A deployment can be huge on one axis and tiny on another (e.g. 1M dormant residents with 2% active). Sizing each axis separately avoids over- or under-provisioning.
Per-unit planning constants (MODELED)¶
These live as named constants in capacity_model.py so they are testable and
adjustable in one place:
| Constant | Value | Meaning |
|---|---|---|
RESIDENTS_PER_PG_SHARD |
250,000 | residents per durable Postgres shard |
RESIDENTS_PER_L0_SHARD |
50,000 | residents per hot in-memory L0 shard (one GRAPH_SERVICE_ENDPOINTS entry; routed per named graph by HRW) |
ACTIVE_AGENTS_PER_WORKER |
25 | concurrently active agents one worker multiplexes |
WORKERS_PER_NODE |
8 | workers per node (= worker_pool_size default) |
OPS_PER_SEC_PER_KAFKA_PARTITION |
5,000 | = the measured single-connection drain rate |
EVENTS_PER_ACTIVE_AGENT_PER_SEC |
2.0 | modeled graph events per active agent |
MIN_KAFKA_PARTITIONS |
3 | floor for ordering/parallelism headroom |
OPS_PER_SEC_PER_KAFKA_PARTITION is the one constant tied to the measured
anchor — one consumer connection drains ~5,000 ops/sec, so we size one
partition per connection. The rest are conservative planning round numbers, not
measurements.
The arithmetic (2% active fraction)¶
Active agents = ceil(residents × 0.02). Then:
- PG shards =
ceil(residents / 250,000) - L0 shards =
ceil(residents / 50,000) - Workers =
ceil(active / 25) - Nodes =
ceil(workers / 8) - Events/s =
active × 2.0 - Kafka parts =
max(3, ceil(events_per_sec / 5,000))
1,000 residents (MODELED, trivially within a single dev box)¶
- active =
ceil(1000 × 0.02)= 20 - PG shards =
ceil(1000/250000)= 1 - L0 shards =
ceil(1000/50000)= 1 - workers =
ceil(20/25)= 1, nodes =ceil(1/8)= 1 - events/s =
20 × 2= 40, kafka =max(3, ceil(40/5000))= 3
100,000 residents (MODELED)¶
- active =
ceil(100000 × 0.02)= 2,000 - PG shards =
ceil(100000/250000)= 1 - L0 shards =
ceil(100000/50000)= 2 - workers =
ceil(2000/25)= 80, nodes =ceil(80/8)= 10 - events/s =
2000 × 2= 4,000, kafka =max(3, ceil(4000/5000))= 3
1,000,000 residents (MODELED — the documented reference case)¶
- active =
ceil(1000000 × 0.02)= 20,000 - PG shards =
ceil(1000000/250000)= 4 - L0 shards =
ceil(1000000/50000)= 20 - workers =
ceil(20000/25)= 800, nodes =ceil(800/8)= 100 - events/s =
20000 × 2= 40,000, kafka =max(3, ceil(40000/5000))= 8
These exact numbers (4 PG shards, 20 L0 shards, 800 workers, 100 nodes, 8 Kafka
partitions) are asserted in tests/scale/test_capacity_model.py so the doc and
the code cannot silently drift.
Linear extrapolation to 100,000,000 residents (MODELED — NOT tested)¶
Pure linear scaling of the same constants (the model is linear above the single-shard floor):
- active =
ceil(100000000 × 0.02)= 2,000,000 - PG shards =
ceil(100000000/250000)= 400 - L0 shards =
ceil(100000000/50000)= 2,000 - workers =
ceil(2000000/25)= 80,000, nodes =ceil(80000/8)= 10,000 - events/s =
2000000 × 2= 4,000,000, kafka =max(3, ceil(4000000/5000))= 800
Summary table¶
| Residents | Active (2%) | PG shards | L0 shards | Workers | Nodes | Kafka parts | Events/s |
|---|---|---|---|---|---|---|---|
| 1,000 | 20 | 1 | 1 | 1 | 1 | 3 | 40 |
| 100,000 | 2,000 | 1 | 2 | 80 | 10 | 3 | 4,000 |
| 1,000,000 | 20,000 | 4 | 20 | 800 | 100 | 8 | 40,000 |
| 100,000,000 | 2,000,000 | 400 | 2,000 | 80,000 | 10,000 | 800 | 4,000,000 |
What is measured vs extrapolated¶
- Measured:
AddNodep50 = 0.187 ms, ~5,000 ops/sec/connection (epistemic-graph/docs/benchmarks.md). This anchorsOPS_PER_SEC_PER_KAFKA_PARTITION. - Extrapolated (MODELED): every shard / worker / node / partition figure for ≥10k residents. Linear scaling is an assumption; real systems hit super-linear coordination costs (cross-shard queries, rebalancing, tail latency) that this first-order model intentionally ignores. Treat 1M as an engineering target to validate, and 100M as an order-of-magnitude sketch only.
Queue-driven dispatch stage (IMPLEMENTED — CONCEPT:ORCH-1.45)¶
The "Workers" column above used to be aspirational on the active-concurrency
axis: agent turns executed only inside the in-process asyncio scheduler
(core/cognitive_scheduler.py, max_concurrent per process) on the host that
accepted them. That stage is now implemented:
- agent turns ride the session-keyed
agent_turnsqueue (AGENT_DISPATCH_BACKEND=queue; transport followsTASK_QUEUE_BACKEND); - any host running
agent-dispatch-workerclaims and executes them against the shared state store (OS-5.16), so "Workers = ceil(active / 25)" maps to a stateless dispatch-worker fleet spread across "Nodes", not to one process's coroutine cap; AGENT_TURNS_PARTITIONSbounds fleet-wide session concurrency on Kafka the same way the Kafka-parts column bounds event drain.
Deployment shape per row of the summary table: stateless gateways + N
dispatch workers + M kg-ingest-worker processes + the listed engine/PG
shards and Kafka partitions. Design, ordering and idempotency guarantees:
architecture/agent_dispatch.md.
Production guard¶
Reaching any of these scales requires non-toy backends. core/config.py exposes
AgentConfig.assert_production_safe() (see core/profile_guard.py): under
APP_PROFILE=prod it rejects graph_persistence_type of file/sqlite and
a2a_broker/a2a_storage of in-memory, so a single-host config can never be
mistaken for a production deployment.