Skip to content

Capacity Model (Plan 07: Path to Scale)

Status: MODELED, with measured anchors. Three things are now measured (not asserted): the epistemic-graph single-connection transport latency, per-shard linear write throughput (1→4 shards ≈ 6.6× at constant wall-time, no shared-state cliff), and the per-agent working-set footprint (~52 kB) for a bounded subgraph — all in epistemic-graph/docs/benchmarks.md (reproduce with scripts/bench_scale.py). Every resident-population sizing figure (10k+) is a linear extrapolation from those anchors plus the per-unit constants below. The 100M target follows as a measured projection (~78 hosts @ 64 GB / 52 kB per agent) — we do not claim 100M has been run; it has not. The arithmetic lives in capacity_model.py, unit-tested in tests/scale/test_capacity_model.py.

The measured anchor

From epistemic-graph/docs/benchmarks.md (run captured 2026-06-01, Linux x86-64, single client connection, in-memory graph, length-prefixed MessagePack over UDS):

Operation ops p50 p99
AddNode 3000 0.187 ms 0.223 ms

AddNode p50 ≈ 0.19 ms~5,000 sequential ops/sec on a single connection. This is the only empirical input. Throughput above one connection comes from connection pooling (pool.py) and shard fan-out (ShardRouter); the server sheds excess concurrent load with a BUSY response (EPISTEMIC_GRAPH_MAX_INFLIGHT, default 1024) rather than queueing unbounded.

Three-axis framing

Scaling is not a single dimension. We size three independent axes and take the max of the resulting infrastructure:

Axis Driver Knob in core/config.py
Active concurrency agents executing right now worker_pool_size × node count; queue-driven dispatch is the live scale-out path for this axisagent_dispatch_backend=queue + N agent-dispatch-worker hosts (see architecture/agent_dispatch.md, CONCEPT:ORCH-1.45)
Resident population total agents whose state must persist graph_service_endpoints (PG/L0 shard fan-out — the L0 side is the live tenant-partitioned engine sharding path, see architecture/engine_sharding.md, CONCEPT:KG-2.58)
Event throughput graph events/sec driving fan-out kafka_bootstrap_servers partitions

A deployment can be huge on one axis and tiny on another (e.g. 1M dormant residents with 2% active). Sizing each axis separately avoids over- or under-provisioning.

Per-unit planning constants (MODELED)

These live as named constants in capacity_model.py so they are testable and adjustable in one place:

Constant Value Meaning
RESIDENTS_PER_PG_SHARD 250,000 residents per durable Postgres shard
RESIDENTS_PER_L0_SHARD 50,000 residents per hot in-memory L0 shard (one GRAPH_SERVICE_ENDPOINTS entry; routed per named graph by HRW)
ACTIVE_AGENTS_PER_WORKER 25 concurrently active agents one worker multiplexes
WORKERS_PER_NODE 8 workers per node (= worker_pool_size default)
OPS_PER_SEC_PER_KAFKA_PARTITION 5,000 = the measured single-connection drain rate
EVENTS_PER_ACTIVE_AGENT_PER_SEC 2.0 modeled graph events per active agent
MIN_KAFKA_PARTITIONS 3 floor for ordering/parallelism headroom

OPS_PER_SEC_PER_KAFKA_PARTITION is the one constant tied to the measured anchor — one consumer connection drains ~5,000 ops/sec, so we size one partition per connection. The rest are conservative planning round numbers, not measurements.

The arithmetic (2% active fraction)

Active agents = ceil(residents × 0.02). Then:

  • PG shards = ceil(residents / 250,000)
  • L0 shards = ceil(residents / 50,000)
  • Workers = ceil(active / 25)
  • Nodes = ceil(workers / 8)
  • Events/s = active × 2.0
  • Kafka parts = max(3, ceil(events_per_sec / 5,000))

1,000 residents (MODELED, trivially within a single dev box)

  • active = ceil(1000 × 0.02) = 20
  • PG shards = ceil(1000/250000) = 1
  • L0 shards = ceil(1000/50000) = 1
  • workers = ceil(20/25) = 1, nodes = ceil(1/8) = 1
  • events/s = 20 × 2 = 40, kafka = max(3, ceil(40/5000)) = 3

100,000 residents (MODELED)

  • active = ceil(100000 × 0.02) = 2,000
  • PG shards = ceil(100000/250000) = 1
  • L0 shards = ceil(100000/50000) = 2
  • workers = ceil(2000/25) = 80, nodes = ceil(80/8) = 10
  • events/s = 2000 × 2 = 4,000, kafka = max(3, ceil(4000/5000)) = 3

1,000,000 residents (MODELED — the documented reference case)

  • active = ceil(1000000 × 0.02) = 20,000
  • PG shards = ceil(1000000/250000) = 4
  • L0 shards = ceil(1000000/50000) = 20
  • workers = ceil(20000/25) = 800, nodes = ceil(800/8) = 100
  • events/s = 20000 × 2 = 40,000, kafka = max(3, ceil(40000/5000)) = 8

These exact numbers (4 PG shards, 20 L0 shards, 800 workers, 100 nodes, 8 Kafka partitions) are asserted in tests/scale/test_capacity_model.py so the doc and the code cannot silently drift.

Linear extrapolation to 100,000,000 residents (MODELED — NOT tested)

Pure linear scaling of the same constants (the model is linear above the single-shard floor):

  • active = ceil(100000000 × 0.02) = 2,000,000
  • PG shards = ceil(100000000/250000) = 400
  • L0 shards = ceil(100000000/50000) = 2,000
  • workers = ceil(2000000/25) = 80,000, nodes = ceil(80000/8) = 10,000
  • events/s = 2000000 × 2 = 4,000,000, kafka = max(3, ceil(4000000/5000)) = 800

Summary table

Residents Active (2%) PG shards L0 shards Workers Nodes Kafka parts Events/s
1,000 20 1 1 1 1 3 40
100,000 2,000 1 2 80 10 3 4,000
1,000,000 20,000 4 20 800 100 8 40,000
100,000,000 2,000,000 400 2,000 80,000 10,000 800 4,000,000

What is measured vs extrapolated

  • Measured: AddNode p50 = 0.187 ms, ~5,000 ops/sec/connection (epistemic-graph/docs/benchmarks.md). This anchors OPS_PER_SEC_PER_KAFKA_PARTITION.
  • Extrapolated (MODELED): every shard / worker / node / partition figure for ≥10k residents. Linear scaling is an assumption; real systems hit super-linear coordination costs (cross-shard queries, rebalancing, tail latency) that this first-order model intentionally ignores. Treat 1M as an engineering target to validate, and 100M as an order-of-magnitude sketch only.

Queue-driven dispatch stage (IMPLEMENTED — CONCEPT:ORCH-1.45)

The "Workers" column above used to be aspirational on the active-concurrency axis: agent turns executed only inside the in-process asyncio scheduler (core/cognitive_scheduler.py, max_concurrent per process) on the host that accepted them. That stage is now implemented:

  • agent turns ride the session-keyed agent_turns queue (AGENT_DISPATCH_BACKEND=queue; transport follows TASK_QUEUE_BACKEND);
  • any host running agent-dispatch-worker claims and executes them against the shared state store (OS-5.16), so "Workers = ceil(active / 25)" maps to a stateless dispatch-worker fleet spread across "Nodes", not to one process's coroutine cap;
  • AGENT_TURNS_PARTITIONS bounds fleet-wide session concurrency on Kafka the same way the Kafka-parts column bounds event drain.

Deployment shape per row of the summary table: stateless gateways + N dispatch workers + M kg-ingest-worker processes + the listed engine/PG shards and Kafka partitions. Design, ordering and idempotency guarantees: architecture/agent_dispatch.md.

Production guard

Reaching any of these scales requires non-toy backends. core/config.py exposes AgentConfig.assert_production_safe() (see core/profile_guard.py): under APP_PROFILE=prod it rejects graph_persistence_type of file/sqlite and a2a_broker/a2a_storage of in-memory, so a single-host config can never be mistaken for a production deployment.