Skip to content

RLM Resilience + Structured Telemetry (CONCEPT:ORCH-1.29)

Overview

Makes the RLM robust and gives GEPA a high-signal feedback channel: a structured RunTrace (per iteration + token usage) replaces free-text reflections, a failure taxonomy classifies errors, and recoverable-vs-fatal error handling stops the RLM burning iterations on a dead sandbox. Assimilated from predict-rlm (trace.py, telemetry.py, interpreter.py). Extends ORCH-1.12.

How it works

  • RunTrace / IterationStep / LMUsage (rlm/telemetry.py) — structured per-iteration capture (code, output, reasoning, finish reason) + main/sub-LM token usage; failure_summary() returns the dominant failure class for the proposer.
  • FailureClass taxonomy (model_generated_bad_code, host_tool_timeout, sandbox_exec_timeout, sandbox_fatal, sandbox_escalated, evaluator_reject, unknown) with precedence ordering; classify_failure maps an exception/text to a class (sandbox_escalated = a benign ORCH-1.38 router escalation, low precedence).
  • Recoverable vs fatalwith_tool_timeout gives each host tool a wall-clock budget and returns a recoverable error on timeout (sandbox survives), while SandboxFatalError (raised from repl.py's container path on irreversible sandbox death) fast-fails the run.

Key files / API

Piece Location
Telemetry + resilience rlm/telemetry.py (RunTrace, FailureClass, classify_failure, dominant_failure, with_tool_timeout, SandboxFatalError)
Fatal wiring rlm/sandboxes/docker_backend.py (DockerSandbox raises SandboxFatalError on dead container/timeout); the router propagates it without escalating (ORCH-1.38)

Wiring (≤3 hops)

graph_orchestrate(rlm_run)runnerrepl/telemetry (≤3 hops).

Research provenance

predict-rlm src/predict_rlm/trace.py, telemetry.py, interpreter.py (SandboxFatalError, TOOL_CALL_TIMEOUT_SEC) — verified.