RLM Resilience + Structured Telemetry (CONCEPT:ORCH-1.29)¶
Overview¶
Makes the RLM robust and gives GEPA a high-signal feedback channel: a structured RunTrace (per
iteration + token usage) replaces free-text reflections, a failure taxonomy classifies errors,
and recoverable-vs-fatal error handling stops the RLM burning iterations on a dead sandbox.
Assimilated from predict-rlm (trace.py, telemetry.py, interpreter.py). Extends ORCH-1.12.
How it works¶
RunTrace/IterationStep/LMUsage(rlm/telemetry.py) — structured per-iteration capture (code, output, reasoning, finish reason) + main/sub-LM token usage;failure_summary()returns the dominant failure class for the proposer.FailureClasstaxonomy (model_generated_bad_code,host_tool_timeout,sandbox_exec_timeout,sandbox_fatal,sandbox_escalated,evaluator_reject,unknown) with precedence ordering;classify_failuremaps an exception/text to a class (sandbox_escalated= a benign ORCH-1.38 router escalation, low precedence).- Recoverable vs fatal —
with_tool_timeoutgives each host tool a wall-clock budget and returns a recoverable error on timeout (sandbox survives), whileSandboxFatalError(raised fromrepl.py's container path on irreversible sandbox death) fast-fails the run.
Key files / API¶
| Piece | Location |
|---|---|
| Telemetry + resilience | rlm/telemetry.py (RunTrace, FailureClass, classify_failure, dominant_failure, with_tool_timeout, SandboxFatalError) |
| Fatal wiring | rlm/sandboxes/docker_backend.py (DockerSandbox raises SandboxFatalError on dead container/timeout); the router propagates it without escalating (ORCH-1.38) |
Wiring (≤3 hops)¶
graph_orchestrate(rlm_run) → runner → repl/telemetry (≤3 hops).
Research provenance¶
predict-rlm src/predict_rlm/trace.py, telemetry.py, interpreter.py (SandboxFatalError, TOOL_CALL_TIMEOUT_SEC) — verified.