Runbook: MCP Fleet Auth & Monitoring

Operator procedures for the JWT+Eunomia auth model (architecture) and the LGTM observability stack (architecture).

Add a new MCP and get auth + metrics + alerts + dashboard automatically

Add the connector to deploy/mcp-fleet.registry.yml (regenerated by scripts/gen_mcp_fleet_registry.py).
Generate stacks: scripts/gen_mcp_service_stacks.py (production compose, auth-on by default) and the editable compose.dev.yml via gen_editable_compose.py.

Refresh monitoring:

python scripts/gen_prometheus_mcp_targets.py     # adds the scrape + probe target
python scripts/gen_grafana_dashboards.py         # dashboards pick it up via labels

Deploy the service; it serves /metrics + /health and enforces jwt+eunomia. No per-service Prometheus/Grafana edits — the file-SD + templated dashboards cover it.
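
Prometheus file-based service discovery reads target groups, each with a `targets` list and optional `labels`, from JSON or YAML files. As a rough illustration of what one generated entry could look like, the sketch below builds such a group. The label names, the port, and both function names are assumptions made for illustration; the actual output of scripts/gen_prometheus_mcp_targets.py is not shown in this runbook and may differ.

```python
import json


def file_sd_group(service: str, port: int) -> dict:
    """Build one file-SD target group; the label names are hypothetical."""
    return {
        "targets": [f"{service}:{port}"],
        "labels": {"job": "mcp", "service": service},
    }


def render_file_sd(services: dict[str, int]) -> str:
    """Render {service: port} as the JSON list that Prometheus file-SD reads."""
    groups = [file_sd_group(name, port) for name, port in sorted(services.items())]
    return json.dumps(groups, indent=2)


# Example (hypothetical service name and port):
#   print(render_file_sd({"example-mcp": 8000}))
```

Because the dashboards select on labels, a new connector only needs to land in this list with consistent labels to be picked up.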

Enable auth on the fleet (one-time gates, then phased waves)

Gate 1 — multiplexer token. Create the Keycloak confidential client and store its secret:

keycloak-client-onboarder   # client_id=mcp-multiplexer, audience agent-services

Set on the multiplexer service: MCP_CLIENT_AUTH=oidc-client-credentials, OIDC_CLIENT_ID=mcp-multiplexer, OIDC_CLIENT_SECRET (OpenBao ref), then restart it.

Gate 2 — baseline eunomia policy. Load an allow-authenticated / deny-unknown policy at eunomia.arpa:

eunomia-policy-manager

Confirm the principal derived from the minted token matches an allow rule.

Waves. Flip services to AUTH_TYPE=jwt in waves; after each wave, verify reachability before continuing:

- Wave 1: read-only (searxng, jellyfin, mealie, wger, uptime, vector…)
- Wave 2: data/integration (github, gitlab, repository-manager, nextcloud…)
- Wave 3: sensitive/infra (keycloak, openbao, kafka, jena, egeria…); portainer-mcp last (it is the redeploy lever).

Never flip mcp-multiplexer itself.
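
The ordering rules above can be written down as a small helper, which is handy when scripting a rollout. This is a minimal sketch, not part of the fleet tooling: `WAVES` lists only the services this runbook names (each real wave contains more), and `flip_order` is a hypothetical function name.

```python
# Only the services this runbook names; each real wave contains more.
WAVES = {
    1: ["searxng", "jellyfin", "mealie", "wger", "uptime", "vector"],
    2: ["github", "gitlab", "repository-manager", "nextcloud"],
    3: ["keycloak", "openbao", "kafka", "jena", "egeria"],
}
LAST_TO_FLIP = "portainer-mcp"   # the redeploy lever, so it goes last
NEVER_FLIP = "mcp-multiplexer"   # must stay reachable without a token


def flip_order(waves: dict[int, list[str]]) -> list[tuple[int, str]]:
    """Return (wave, service) pairs in flip order, ending with portainer-mcp."""
    order = []
    for number in sorted(waves):
        for service in waves[number]:
            if service == NEVER_FLIP:
                raise ValueError(f"{NEVER_FLIP} must never be flipped to jwt")
            if service != LAST_TO_FLIP:
                order.append((number, service))
    # portainer-mcp is always appended at the end of the final wave.
    order.append((max(waves), LAST_TO_FLIP))
    return order
```

Rejecting mcp-multiplexer outright, rather than silently skipping it, makes a misconfigured wave list fail before anything is redeployed.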

Per service: push the compose, run redeploy_stack_git, then check:

# server enforces jwt:
curl -s -o /dev/null -w '%{http_code}\n' -X POST http://<svc>.arpa/mcp   # → 401
# reachable through the multiplexer (token attached):
#   call a tool on <svc> via the multiplexer → succeeds
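
When a whole wave has to be checked, the same 401 probe can be scripted. The sketch below mirrors the curl call with the Python standard library; the function names are hypothetical, and bypassing environment proxies is an assumption based on the <svc>.arpa hosts being internal.

```python
import urllib.error
import urllib.request

# Assumption: fleet hosts are internal, so ignore any proxy set in the environment.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def unauthenticated_status(url: str, timeout: float = 5.0) -> int:
    """POST to the MCP endpoint without a bearer token; return the HTTP status."""
    request = urllib.request.Request(url, data=b"", method="POST")
    try:
        with _OPENER.open(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is what we are after.
        return err.code


def enforces_jwt(url: str) -> bool:
    """True when the service rejects a token-less call with 401."""
    return unauthenticated_status(url) == 401
```

For a wave, call `enforces_jwt` on each flipped service's /mcp URL and stop the rollout if any call returns False. The second check, reachability through the multiplexer, still has to be done with a real tool call.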

Rotate the multiplexer client secret

Rotate the secret in Keycloak (client mcp-multiplexer).
Update the OpenBao entry.
Restart the multiplexer (the provider re-reads OIDC_CLIENT_SECRET and re-mints). In-flight tokens keep working until expiry.

Add or loosen a eunomia policy

Use eunomia-policy-manager to add a rule for the principal/server/tool; it takes effect on the next call (remote PDP). Remember: no rule = deny.
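
The "no rule = deny" behaviour can be illustrated with a toy evaluator. This is only a sketch of default-deny matching: the `Rule` shape, the `*` wildcard, and `is_allowed` are hypothetical and do not reflect eunomia's actual policy schema or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical allow rule; not eunomia's real policy format."""
    principal: str
    server: str
    tool: str  # "*" matches any tool on the server


def is_allowed(rules: list[Rule], principal: str, server: str, tool: str) -> bool:
    """Default deny: a call passes only if some allow rule matches it."""
    for rule in rules:
        if (
            rule.principal == principal
            and rule.server == server
            and rule.tool in ("*", tool)
        ):
            return True
    return False  # no matching rule means deny
```

The practical consequence is that a new server or tool stays unreachable until a rule names it, so add the rule before flipping the service to jwt.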

Roll back a service to no-auth

Revert the auth block in that service's compose and run redeploy_stack_git; the service returns to Auth: none and is reachable without a token. A0 (the multiplexer outbound auth described below) is additive, so disabling it (MCP_CLIENT_AUTH=none + restart multiplexer) simply stops attaching tokens.

Activate / refresh monitoring

Redeploy the LGTM stack to pick up prometheus.yml, rules.yml, blackbox-exporter, promtail, and Grafana provisioning.
Prometheus reloads file-SD targets automatically; new dashboards appear within the provider's updateIntervalSeconds.
MCP /metrics returns data only once the service runs an agent-utilities build carrying the /metrics route (image rebuild or editable source mount).

Multiplexer outbound auth (A0): make jwt children reachable

A jwt-enforcing child rejects calls without a bearer token. Give the multiplexer one service identity so it mints + attaches a Keycloak token to every remote child (CONCEPT:OS-5.32). One-time setup (use your realm/host, not the placeholders):

Keycloak confidential client (keycloak-client-onboarder, or admin API): create client mcp-multiplexer with Service Accounts enabled + an audience mapper adding aud=<MCP_AUDIENCE> (the audience the children verify). Grab its secret.

Verify the token mints with the right claims:

curl -s -X POST "<KEYCLOAK>/realms/<REALM>/protocol/openid-connect/token" \
  -d grant_type=client_credentials -d client_id=mcp-multiplexer \
  -d client_secret=<SECRET>   # decode: iss=<KEYCLOAK>/realms/<REALM>, aud includes <MCP_AUDIENCE>
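
To do the "decode" step without pasting the token into a web tool, the payload can be inspected locally. The sketch below assumes the token is a standard three-part JWT taken from the `access_token` field of the response; the function names are hypothetical. It does not verify the signature, so use it for inspection only.

```python
import base64
import json


def decode_claims(token: str) -> dict:
    """Decode the JWT payload WITHOUT verifying the signature (inspection only)."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore the stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


def claims_ok(claims: dict, issuer: str, audience: str) -> bool:
    """Check iss matches exactly and aud (string or list) includes the audience."""
    aud = claims.get("aud", [])
    if isinstance(aud, str):
        aud = [aud]
    return claims.get("iss") == issuer and audience in aud
```

Pass `<KEYCLOAK>/realms/<REALM>` as the issuer and `<MCP_AUDIENCE>` as the audience. If `claims_ok` returns False, check the audience mapper on the client before touching the children.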

Store the secret in OpenBao:

bao kv put <kv-mount>/mcp-multiplexer/oidc \
  OIDC_CLIENT_SECRET=<SECRET> OIDC_CLIENT_ID=mcp-multiplexer OIDC_AUDIENCE=<MCP_AUDIENCE> \
  OIDC_TOKEN_URL=<KEYCLOAK>/realms/<REALM>/protocol/openid-connect/token

Wire the env into BOTH multiplexers (opt-in; inert if unset):

MCP_CLIENT_AUTH=oidc-client-credentials
OIDC_CLIENT_ID=mcp-multiplexer
OIDC_CLIENT_SECRET=<SECRET>          # from OpenBao
OIDC_AUDIENCE=<MCP_AUDIENCE>
OIDC_TOKEN_URL=<KEYCLOAK>/realms/<REALM>/protocol/openid-connect/token

Swarm/deployed multiplexer: add to its Portainer stack Env and make the compose pass each (- VAR=${VAR}), then redeploy.
Local (Claude Code) multiplexer: add the same keys under mcpServers.mcp-multiplexer.env in ~/.claude.json, then restart Claude Code.
Verify: a tool call on a jwt child through the multiplexer succeeds; child logs show 200.
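
For the local multiplexer, the edit to ~/.claude.json amounts to merging the five keys under mcpServers.mcp-multiplexer.env. A minimal sketch of that merge, operating on an already-loaded config dict; `with_multiplexer_env` is a hypothetical helper, and reading or writing the file is left to the caller.

```python
import copy


def with_multiplexer_env(config: dict, env: dict[str, str]) -> dict:
    """Return a copy of the config with env merged into mcpServers.mcp-multiplexer.env."""
    updated = copy.deepcopy(config)  # leave the caller's dict untouched
    server = updated.setdefault("mcpServers", {}).setdefault("mcp-multiplexer", {})
    server.setdefault("env", {}).update(env)
    return updated


# Example with the placeholders from this section:
#   with_multiplexer_env(config, {
#       "MCP_CLIENT_AUTH": "oidc-client-credentials",
#       "OIDC_CLIENT_ID": "mcp-multiplexer",
#   })
```

Other servers and any existing env keys are preserved, so the change stays opt-in: removing the keys again makes the provider inert.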

Sync a connector's runtime env from its `.env`

Deployed -mcp containers only get the env vars the compose passes. To inject a connector's agents/<name>/.env (tokens, URLs, tool toggles) into the running container:

- Put each var in the Portainer stack Env and add - VAR=${VAR} to the compose environment so it's passed. Keep HOST/PORT/TRANSPORT as the deployment sets them (don't let dev .env values override). Secrets live in the stack Env, not the file. Redeploy.
- Fleet sweep: per stack, map it to agents/<image-basename>/.env, overlay those vars into the stack Env, rewrite the compose to pass them, update the stack.
- Order matters with auth: passing a stale AUTH_TYPE=jwt activates jwt. Only do that after A0 (above) + the eunomia policy are in place, or the child becomes unreachable.
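
The overlay rules above can be sketched as three small functions. This is an illustration under assumptions, not the fleet's sweep script: all names are hypothetical, the .env parser is deliberately simplified (plain KEY=VALUE lines, optional surrounding quotes), and the `auth_ready` flag stands in for "A0 and the eunomia policy are in place".

```python
PROTECTED = {"HOST", "PORT", "TRANSPORT"}  # the deployment's values win


def parse_dotenv(text: str) -> dict[str, str]:
    """Parse plain KEY=VALUE lines; skips blanks, comments, and malformed lines."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def overlay(stack_env: dict[str, str], dotenv: dict[str, str],
            auth_ready: bool = False) -> dict[str, str]:
    """Overlay .env vars onto the stack Env, honouring the runbook's guards."""
    merged = dict(stack_env)
    for key, value in dotenv.items():
        if key in PROTECTED:
            continue  # never let dev values override HOST/PORT/TRANSPORT
        if key == "AUTH_TYPE" and value == "jwt" and not auth_ready:
            continue  # a stale AUTH_TYPE=jwt would make the child unreachable
        merged[key] = value
    return merged


def passthrough_lines(env: dict[str, str]) -> list[str]:
    """Compose `environment:` entries that pass each var from the stack Env."""
    return [f"- {key}=${{{key}}}" for key in sorted(env)]
```

Dropping AUTH_TYPE=jwt by default means the sweep cannot activate auth by accident; pass `auth_ready=True` only for services in a wave that is actually being flipped.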

Triage common alerts

| Alert | First checks |
| --- | --- |
| `McpServiceDown` | Is the service on the metrics-bearing image? `docker_get_stack_logs`; is it scheduled? |
| `McpProbeFailed` | `/health` reachable on the overlay? auth accidentally on the route? |
| `McpHighToolErrorRate` | Per-tool error series on the per-service dashboard; child logs in Loki. |
| `McpChildBreakerOpen` | Child unreachable/erroring: check the child + the multiplexer token. |
| `ContainerOOMKilled` / `ContainerHighMemory` | Raise the limit or fix the leak; per-stack panel. |
| `HostLowDisk` | Prune images/volumes; check `prometheus_data`/`loki` growth. |