Runbook: MCP Fleet Auth & Monitoring
Operator procedures for the JWT+Eunomia auth model (architecture) and the LGTM observability stack (architecture).
Add a new MCP and get auth + metrics + alerts + dashboard automatically
- Add the connector to deploy/mcp-fleet.registry.yml (regenerated by scripts/gen_mcp_fleet_registry.py).
- Generate stacks: scripts/gen_mcp_service_stacks.py (production compose, auth-on by default) and the editable compose.dev.yml via gen_editable_compose.py.
- Refresh monitoring:
- Deploy the service; it serves /metrics + /health and enforces jwt+eunomia. No per-service Prometheus/Grafana edits are needed: the file-SD + templated dashboards cover it.
Enable auth on the fleet (one-time gates, then phased waves)
Gate 1 — multiplexer token. Create the Keycloak confidential client and store its secret:
Set on the multiplexer service: MCP_CLIENT_AUTH=oidc-client-credentials,
OIDC_CLIENT_ID=mcp-multiplexer, OIDC_CLIENT_SECRET (OpenBao ref), then restart it.
Gate 2 — baseline eunomia policy. Load an allow-authenticated / deny-unknown
policy at eunomia.arpa:
Waves. Flip services to AUTH_TYPE=jwt in waves; after each, verify
reachability before continuing:
- Wave 1: read-only (searxng, jellyfin, mealie, wger, uptime, vector…)
- Wave 2: data/integration (github, gitlab, repository-manager, nextcloud…)
- Wave 3: sensitive/infra (keycloak, openbao, kafka, jena, egeria…); portainer-mcp
last (it is the redeploy lever). Never flip mcp-multiplexer itself.
Per service: push the compose, redeploy_stack_git, then check:
# server enforces jwt:
curl -s -o /dev/null -w '%{http_code}\n' -X POST http://<svc>.arpa/mcp # → 401
# reachable through the multiplexer (token attached):
# call a tool on <svc> via the multiplexer → succeeds
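The tokenless check above can also be scripted when flipping a whole wave. The following is a minimal sketch, assuming each service exposes its MCP endpoint at http://<svc>.arpa/mcp as in the curl command; the helper names `unauthenticated_status` and `enforces_jwt` are hypothetical and not part of the fleet tooling.

```python
import urllib.error
import urllib.request


def unauthenticated_status(url: str, timeout: float = 5.0) -> int:
    """POST to `url` with no bearer token and return the HTTP status code."""
    request = urllib.request.Request(url, data=b"", method="POST")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the status code is what we want.
        return err.code


def enforces_jwt(status: int) -> bool:
    """A jwt-enforcing server rejects the tokenless call with 401."""
    return status == 401
```

A wave script could call `enforces_jwt(unauthenticated_status(url))` for each service and stop the rollout on the first False. Connection failures surface as `urllib.error.URLError`, which is deliberately left uncaught here so an unreachable service is not mistaken for an enforcing one. The second check (a tool call through the multiplexer) still has to be done separately.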
Rotate the multiplexer client secret
- Rotate the secret in Keycloak (client mcp-multiplexer).
- Update the OpenBao entry.
- Restart the multiplexer (the provider re-reads OIDC_CLIENT_SECRET and re-mints). In-flight tokens keep working until expiry.
Add or loosen a eunomia policy
Use eunomia-policy-manager to add a rule for the principal/server/tool. The rule
takes effect on the next call (remote PDP). Remember: no rule = deny.
Roll back a service to no-auth
Revert the auth block in that service's compose and run redeploy_stack_git; the
service returns to Auth: none and is reachable without a token. A0 (multiplexer
outbound auth) is additive, so disabling it (MCP_CLIENT_AUTH=none + restart
multiplexer) simply stops attaching tokens.
Activate / refresh monitoring
- Redeploy the LGTM stack to pick up prometheus.yml, rules.yml, blackbox-exporter, promtail, and Grafana provisioning.
- Prometheus reloads file-SD targets automatically; new dashboards appear within the provider's updateIntervalSeconds.
- MCP /metrics returns data only once the service runs an agent-utilities build carrying the /metrics route (image rebuild or editable source mount).
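To confirm that a service's /metrics endpoint actually returns data after a rebuild, one option is to count the samples in the response. This is a minimal sketch, assuming the endpoint serves the Prometheus text exposition format (it is scraped by Prometheus); `fetch_metrics` and `sample_count` are hypothetical helper names.

```python
import urllib.request


def fetch_metrics(url: str, timeout: float = 5.0) -> str:
    """GET the /metrics endpoint and return the body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def sample_count(exposition: str) -> int:
    """Count sample lines, ignoring blank lines and # HELP / # TYPE comments."""
    return sum(
        1
        for line in exposition.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    )
```

A result of zero from `sample_count(fetch_metrics(url))`, or an HTTP error from the fetch, suggests the service is still on a build without the /metrics route.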
Multiplexer outbound auth (A0): make jwt children reachable
A jwt-enforcing child rejects calls without a bearer token. Give the multiplexer one service identity so it mints + attaches a Keycloak token to every remote child (CONCEPT:OS-5.32). One-time setup (use your realm/host, not the placeholders):
- Keycloak confidential client (keycloak-client-onboarder, or admin API): create client mcp-multiplexer with Service Accounts enabled + an audience mapper adding aud=<MCP_AUDIENCE> (the audience the children verify). Grab its secret.
- Verify the token mints with the right claims:
- Store the secret in OpenBao:
  bao kv put <kv-mount>/mcp-multiplexer/oidc \
    OIDC_CLIENT_SECRET=<SECRET> OIDC_CLIENT_ID=mcp-multiplexer OIDC_AUDIENCE=<MCP_AUDIENCE> \
    OIDC_TOKEN_URL=<KEYCLOAK>/realms/<REALM>/protocol/openid-connect/token
- Wire the env into BOTH multiplexers (opt-in; inert if unset):
  - Swarm/deployed multiplexer: add to its Portainer stack Env and make the compose pass each (- VAR=${VAR}), then redeploy.
  - Local (Claude Code) multiplexer: add the same keys under mcpServers.mcp-multiplexer.env in ~/.claude.json, then restart Claude Code.
- Verify: a tool call on a jwt child through the multiplexer succeeds; child logs show 200.
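The token verification step in the setup above can be approximated with a short script. This is a sketch under stated assumptions, not the fleet's own tooling: it uses the standard OAuth2 client-credentials grant against the token URL shown in the OpenBao entry, and the helper names (`mint_token`, `jwt_claims`, `has_audience`) are hypothetical. The payload is decoded for inspection only; the signature is not verified here.

```python
import base64
import json
import urllib.parse
import urllib.request


def mint_token(token_url: str, client_id: str, client_secret: str) -> str:
    """Request an access token with the OAuth2 client-credentials grant."""
    form = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode()
    request = urllib.request.Request(token_url, data=form)  # data => POST
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["access_token"]


def jwt_claims(token: str) -> dict:
    """Decode the JWT payload WITHOUT verifying the signature (inspection only)."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64url padding
    return json.loads(base64.urlsafe_b64decode(payload))


def has_audience(claims: dict, audience: str) -> bool:
    """The `aud` claim may be a single string or a list of strings."""
    aud = claims.get("aud", [])
    return audience in ([aud] if isinstance(aud, str) else aud)
```

With real values substituted for the placeholders, `has_audience(jwt_claims(mint_token(url, "mcp-multiplexer", secret)), audience)` should be True when the audience mapper is configured as described. Avoid printing the token or the secret in shared logs.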
Sync a connector's runtime env from its .env
Deployed -mcp containers only get env the compose passes. To inject a connector's
agents/<name>/.env (tokens, URLs, tool toggles) into the running container:
- Put each var in the Portainer stack Env and add - VAR=${VAR} to the compose
environment so it's passed. Keep HOST/PORT/TRANSPORT as the deployment sets them
(don't let dev .env values override). Secrets live in the stack Env, not the file. Redeploy.
- Fleet sweep: per stack, map it to agents/<image-basename>/.env, overlay those vars into
the stack Env, rewrite the compose to pass them, update the stack.
- Order matters with auth: passing a stale AUTH_TYPE=jwt activates jwt — only do that
after A0 (above) + the eunomia policy are in place, or the child becomes unreachable.
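The per-connector sync described above can be sketched as a small helper that turns a .env file into compose passthrough entries. This is an illustration only, assuming simple KEY=VALUE lines; the function names and the `RESERVED` set are hypothetical, and updating the Portainer stack Env itself is left out.

```python
# The deployment sets these; never take them from a dev .env.
RESERVED = {"HOST", "PORT", "TRANSPORT"}


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip("'\"")
    return env


def passthrough_lines(env: dict[str, str]) -> list[str]:
    """Compose `environment:` entries forwarding each var from the stack Env."""
    return [f"- {key}=${{{key}}}" for key in sorted(env) if key not in RESERVED]


def activates_jwt(env: dict[str, str]) -> bool:
    """True if syncing this env would switch the child to jwt enforcement."""
    return env.get("AUTH_TYPE", "").lower() == "jwt"
```

A fleet sweep could call `activates_jwt` first and skip or flag any connector where it returns True until A0 and the eunomia policy are in place. The values themselves belong in the stack Env; only the `- VAR=${VAR}` lines go into the compose file.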
Triage common alerts
| Alert | First checks |
|---|---|
| McpServiceDown | Is the service on the metrics-bearing image? docker_get_stack_logs; is it scheduled? |
| McpProbeFailed | /health reachable on the overlay? auth accidentally on the route? |
| McpHighToolErrorRate | Per-tool error series on the per-service dashboard; child logs in Loki. |
| McpChildBreakerOpen | Child unreachable/erroring: check the child + the multiplexer token. |
| ContainerOOMKilled / ContainerHighMemory | Raise the limit or fix the leak; per-stack panel. |
| HostLowDisk | Prune images/volumes; check prometheus_data/loki growth. |