
Runbook — MCP Fleet Auth & Monitoring

Operator procedures for the JWT+Eunomia auth model (architecture) and the LGTM observability stack (architecture).

Add a new MCP and get auth + metrics + alerts + dashboard automatically

  1. Add the connector to deploy/mcp-fleet.registry.yml (regenerated by scripts/gen_mcp_fleet_registry.py).
  2. Generate the stacks: run scripts/gen_mcp_service_stacks.py for the production compose (auth on by default) and gen_editable_compose.py for the editable compose.dev.yml.
  3. Refresh monitoring:
    python scripts/gen_prometheus_mcp_targets.py     # adds the scrape + probe target
    python scripts/gen_grafana_dashboards.py         # dashboards pick it up via labels
    
  4. Deploy the service; it serves /metrics + /health and enforces jwt+eunomia. No per-service Prometheus/Grafana edits — the file-SD + templated dashboards cover it.
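For orientation, Prometheus file-based service discovery reads files that contain a list of target groups, each with `targets` and optional `labels`. The sketch below shows what one entry for a new MCP might look like. The `<service>.arpa` host pattern is borrowed from the curl example in this runbook; the port, the label names, and the function name are assumptions for illustration, not the actual output of gen_prometheus_mcp_targets.py.

```python
import json


def file_sd_entry(service: str, port: int = 8000) -> dict:
    """Build one Prometheus file-SD target group for an MCP service.

    The port and the label names are hypothetical; check the generated
    targets file for the values your fleet really uses.
    """
    return {
        "targets": [f"{service}.arpa:{port}"],
        "labels": {"job": "mcp", "mcp_service": service},
    }


if __name__ == "__main__":
    # A file-SD file holds a JSON list of target groups.
    print(json.dumps([file_sd_entry("searxng")], indent=2))
```

Because dashboards select on labels rather than on hard-coded hostnames, a new entry of this shape is enough for templated panels to pick the service up.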

Enable auth on the fleet (one-time gates, then phased waves)

Gate 1 — multiplexer token. Create the Keycloak confidential client and store its secret:

keycloak-client-onboarder   # client_id=mcp-multiplexer, audience agent-services
On the multiplexer service, set MCP_CLIENT_AUTH=oidc-client-credentials, OIDC_CLIENT_ID=mcp-multiplexer, and OIDC_CLIENT_SECRET (as an OpenBao ref), then restart it.

Gate 2 — baseline eunomia policy. Load an allow-authenticated / deny-unknown policy at eunomia.arpa:

eunomia-policy-manager
Confirm the principal derived from the minted token matches an allow rule.

Waves. Flip services to AUTH_TYPE=jwt in waves; after each, verify reachability before continuing:

  • Wave 1: read-only (searxng, jellyfin, mealie, wger, uptime, vector…)
  • Wave 2: data/integration (github, gitlab, repository-manager, nextcloud…)
  • Wave 3: sensitive/infra (keycloak, openbao, kafka, jena, egeria…); portainer-mcp last (it is the redeploy lever).

Never flip mcp-multiplexer itself.
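The ordering rules can be encoded as a small guard, sketched here under assumptions: the constant and function names are hypothetical, and the service lists contain only the names this runbook spells out (the source lists are truncated with "…", so they are deliberately incomplete).

```python
# Only the services named in this runbook; the real waves are longer.
WAVES = [
    ["searxng", "jellyfin", "mealie", "wger", "uptime", "vector"],
    ["github", "gitlab", "repository-manager", "nextcloud"],
    ["keycloak", "openbao", "kafka", "jena", "egeria"],
]
NEVER_FLIP = {"mcp-multiplexer"}
FLIP_LAST = "portainer-mcp"


def flip_order(waves: list[list[str]]) -> list[str]:
    """Flatten the waves into one flip order with portainer-mcp forced last."""
    order: list[str] = []
    for wave in waves:
        for service in wave:
            if service in NEVER_FLIP:
                raise ValueError(f"{service} must never be flipped to jwt")
            if service != FLIP_LAST and service not in order:
                order.append(service)
    # portainer-mcp is the redeploy lever, so it goes after everything else.
    order.append(FLIP_LAST)
    return order
```

Calling `flip_order(WAVES)` yields the wave 1 services first and `portainer-mcp` at the end; a wave that includes `mcp-multiplexer` raises instead of producing an order.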

Per service: push the compose, redeploy_stack_git, then check:

# server enforces jwt:
curl -s -o /dev/null -w '%{http_code}\n' -X POST http://<svc>.arpa/mcp   # → 401
# reachable through the multiplexer (token attached):
#   call a tool on <svc> via the multiplexer → succeeds
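When sweeping a whole wave, the same 401 check can be scripted. This is a minimal sketch using only the Python standard library; the function names are hypothetical, and bypassing environment proxies is an assumption that suits internal `.arpa` hosts.

```python
import urllib.error
import urllib.request

# Assumption: internal hosts are reached directly, so ignore proxy env vars.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def unauthenticated_status(url: str, timeout: float = 5.0) -> int:
    """POST to url with no Authorization header and return the HTTP status."""
    request = urllib.request.Request(url, method="POST")
    try:
        with _OPENER.open(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is the result we want.
        return err.code


def enforces_jwt(url: str) -> bool:
    """True when the endpoint rejects an unauthenticated call with 401."""
    return unauthenticated_status(url) == 401
```

A service that still answers with anything other than 401 has not picked up AUTH_TYPE=jwt. This check covers only the first half of the verification; the call through the multiplexer still has to be exercised separately.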

Rotate the multiplexer client secret

  1. Rotate the secret in Keycloak (client mcp-multiplexer).
  2. Update the OpenBao entry.
  3. Restart the multiplexer (the provider re-reads OIDC_CLIENT_SECRET and re-mints). In-flight tokens keep working until expiry.

Add or loosen a eunomia policy

Use eunomia-policy-manager to add a rule for the principal/server/tool. It takes effect on the next call (remote PDP). Remember: no rule = deny.
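The "no rule = deny" behaviour is ordinary default-deny evaluation. The sketch below illustrates the idea only: the rule shape, the `"*"` wildcard, and the principal name are hypothetical and do not represent Eunomia's policy format or API.

```python
from typing import NamedTuple


class AllowRule(NamedTuple):
    principal: str
    server: str
    tool: str  # "*" matches any tool on that server (illustrative convention)


def is_allowed(rules: list[AllowRule], principal: str, server: str, tool: str) -> bool:
    """Default-deny: permit only when at least one allow rule matches the call."""
    return any(
        rule.principal == principal
        and rule.server == server
        and rule.tool in ("*", tool)
        for rule in rules
    )
```

With an empty rule list every call is denied, which is why a freshly flipped service with no matching rule looks unreachable even though the token itself is valid.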

Roll back a service to no-auth

Revert the auth block in that service's compose and run redeploy_stack_git; the service returns to Auth: none and is reachable without a token. A0 (multiplexer outbound auth) is additive, so disabling it (MCP_CLIENT_AUTH=none + restart multiplexer) simply stops attaching tokens.

Activate / refresh monitoring

  • Redeploy the LGTM stack to pick up prometheus.yml, rules.yml, blackbox-exporter, promtail, and Grafana provisioning.
  • Prometheus reloads file-SD targets automatically; new dashboards appear within the provider's updateIntervalSeconds.
  • MCP /metrics returns data only once the service runs an agent-utilities build carrying the /metrics route (image rebuild or editable source mount).

Multiplexer outbound auth (A0) — make jwt children reachable

A jwt-enforcing child rejects calls without a bearer token. Give the multiplexer one service identity so it mints + attaches a Keycloak token to every remote child (CONCEPT:OS-5.32). One-time setup (use your realm/host, not the placeholders):

  1. Keycloak confidential client (keycloak-client-onboarder, or admin API): create client mcp-multiplexer with Service Accounts enabled + an audience mapper adding aud=<MCP_AUDIENCE> (the audience the children verify). Grab its secret.
  2. Verify the token mints with the right claims:
    curl -s -X POST "<KEYCLOAK>/realms/<REALM>/protocol/openid-connect/token" \
      -d grant_type=client_credentials -d client_id=mcp-multiplexer \
      -d client_secret=<SECRET>   # decode: iss=<KEYCLOAK>/realms/<REALM>, aud includes <MCP_AUDIENCE>
    
  3. Store the secret in OpenBao:
    bao kv put <kv-mount>/mcp-multiplexer/oidc \
      OIDC_CLIENT_SECRET=<SECRET> OIDC_CLIENT_ID=mcp-multiplexer OIDC_AUDIENCE=<MCP_AUDIENCE> \
      OIDC_TOKEN_URL=<KEYCLOAK>/realms/<REALM>/protocol/openid-connect/token
  4. Wire the env into BOTH multiplexers (opt-in; inert if unset):
    MCP_CLIENT_AUTH=oidc-client-credentials
    OIDC_CLIENT_ID=mcp-multiplexer
    OIDC_CLIENT_SECRET=<SECRET>          # from OpenBao
    OIDC_AUDIENCE=<MCP_AUDIENCE>
    OIDC_TOKEN_URL=<KEYCLOAK>/realms/<REALM>/protocol/openid-connect/token
    
  5. Swarm/deployed multiplexer: add to its Portainer stack Env and make the compose pass each (- VAR=${VAR}), then redeploy.
  6. Local (Claude Code) multiplexer: add the same keys under mcpServers.mcp-multiplexer.env in ~/.claude.json, then restart Claude Code.
  7. Verify: a tool call on a jwt child through the multiplexer succeeds; child logs show 200.
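The "decode" part of step 2 can be done without extra tooling. The following is a minimal sketch, assuming the token endpoint returned a standard three-part JWT as `access_token`; the function names are hypothetical. It inspects claims only and does not verify the signature, so it is a setup check, not a substitute for the verification the children perform.

```python
import base64
import json


def decode_claims(token: str) -> dict:
    """Return the JWT payload WITHOUT verifying the signature (inspection only)."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64url padding
    return json.loads(base64.urlsafe_b64decode(payload))


def claim_problems(claims: dict, issuer: str, audience: str) -> list[str]:
    """List mismatches between the claims and what the children verify."""
    problems = []
    if claims.get("iss") != issuer:
        problems.append(f"iss is {claims.get('iss')!r}, expected {issuer!r}")
    aud = claims.get("aud", [])
    # The aud claim may be a single string or a list of strings.
    audiences = [aud] if isinstance(aud, str) else list(aud)
    if audience not in audiences:
        problems.append(f"aud {audiences!r} does not include {audience!r}")
    return problems
```

An empty list from `claim_problems(decode_claims(token), "<KEYCLOAK>/realms/<REALM>", "<MCP_AUDIENCE>")` means the issuer and audience line up; a missing audience usually points at the audience mapper from step 1.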

Sync a connector's runtime env from its .env

Deployed -mcp containers only get env the compose passes. To inject a connector's agents/<name>/.env (tokens, URLs, tool toggles) into the running container:

  • Put each var in the Portainer stack Env and add - VAR=${VAR} to the compose environment so it's passed. Keep HOST/PORT/TRANSPORT as the deployment sets them (don't let dev .env values override). Secrets live in the stack Env, not the file. Redeploy.
  • Fleet sweep: per stack, map it to agents/<image-basename>/.env, overlay those vars into the stack Env, rewrite the compose to pass them, update the stack.
  • Order matters with auth: passing a stale AUTH_TYPE=jwt activates jwt; only do that after A0 (above) + the eunomia policy are in place, or the child becomes unreachable.
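The per-connector part of that procedure can be sketched as follows. This assumes a plain KEY=VALUE .env file (no `export` prefixes or multi-line values); the function names are hypothetical, and actually writing to the Portainer stack Env is left out.

```python
# Vars the deployment owns; dev .env values must not override them.
DEPLOYMENT_OWNED = {"HOST", "PORT", "TRANSPORT"}


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; skip blanks, comments, and malformed lines."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip().strip("'\"")
    return env


def compose_passthrough(env: dict[str, str]) -> list[str]:
    """Return compose environment entries that pass each var from the stack Env."""
    return [f"- {key}=${{{key}}}" for key in sorted(env) if key not in DEPLOYMENT_OWNED]


def activates_jwt(env: dict[str, str]) -> bool:
    """True when syncing this env would switch the child to jwt enforcement."""
    return env.get("AUTH_TYPE", "").lower() == "jwt"
```

Checking `activates_jwt` before a fleet sweep is a cheap way to catch a stale AUTH_TYPE=jwt in a dev .env before it makes a child unreachable.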

Triage common alerts

| Alert | First checks |
|---|---|
| McpServiceDown | Is the service on the metrics-bearing image? docker_get_stack_logs; is it scheduled? |
| McpProbeFailed | /health reachable on the overlay? auth accidentally on the route? |
| McpHighToolErrorRate | Per-tool error series on the per-service dashboard; child logs in Loki. |
| McpChildBreakerOpen | Child unreachable/erroring: check the child + the multiplexer token. |
| ContainerOOMKilled / ContainerHighMemory | Raise the limit or fix the leak; per-stack panel. |
| HostLowDisk | Prune images/volumes; check prometheus_data/loki growth. |