Failure-Driven Evolution (CONCEPT:AHE-3.18)¶
The self-evolution loop learns from failures observed in production telemetry, not only from research papers and unresolved concepts. Failures recorded in Langfuse become addressable gaps in the Knowledge Graph, which the golden loop turns into regression-gated remediation proposals.
Why¶
The research-driven golden loop (CONCEPT:KG-2.7) assimilated papers, OSS, and
unresolved Concept topics — but a running fleet's most valuable signal, its own
failures, never entered the loop. Langfuse already received exported traces
(CONCEPT:AHE-3.0 Experience Observability), yet the integration was
write-only: nothing read failures back out, the ExecutionSummary /
PerformanceAnomaly schema sat dormant, and the daemon's telemetry_ingestion
sweep referenced a workflow that never existed (it raised every cycle). AHE-3.18
closes that loop.
The loop¶
Langfuse ── pull ──▶ cluster ──▶ materialize ──▶ intake ──▶ remediate ──▶ regression-gated merge ──▶ lock regression
(errors, (recurring (KG nodes + (golden (TeamSpec/ (auto-merge only when (AHE-3.25: on a
low scores, failure failure_gap loop) AgentSpec the failure is not verified fix, lock a
cost/latency) signatures) Concept topics) proposal) spiking; else hold) plain-English assertion)
- Pull — error observations, low-score traces, and cost/latency anomalies are pulled from Langfuse over a configurable window.
- Cluster — records are grouped into recurring failure signatures
(deterministic, LLM-free):
(workflow/observation name, normalized status/error class, anomaly type). Volatile parts (ids, paths, numbers) are normalized out so the same failure produces the same signature across occurrences. A pattern must recur at leastmin_occurrencestimes (default 2) to qualify — single one-offs are noise. - Materialize — each recurring pattern writes, into the durable KG:
- a
PerformanceAnomalynode (anomaly_type ∈ {ERROR_RATE, LOW_SCORE, TIMEOUT, HIGH_COST, HIGH_TOKEN_USAGE}), - an
ExecutionSummaryrollup per failing workflow (success_rate < 1.0, somaintainer.trigger_self_improvementpicks it up), and - a synthetic
failure_gapConcepttopic — labelledConceptwith noADDRESSED_BYedge andevidence_trace_idslinking back to Langfuse — so the golden loop's existingunresolved_topics()intake surfaces it unchanged. - Remediate — the failure-ingest path runs one golden-loop cycle that
addresses the just-materialized gaps directly (see Targeting below) and
synthesizes a
TeamSpec/AgentSpecremediation proposal. - Regression-gated merge — promotion of a failure remediation is gated by a
regression check bound to the originating failures: it re-queries Langfuse and
holds the proposal if any signature is actively spiking (current > baseline).
Each gap is also appended to the durable eval corpus and the failing
capability's reward is nudged down (
FeedbackService). - Lock regression (CONCEPT:AHE-3.25) — when the gate passes (the fix holds
against the originally-observed failures),
_lock_regression_casespromotes one plain-English assertion case per signature into the durableEvalCorpus(idempotent) — e.g. "The response does not reproduce the failure 'X' in workflow 'Y'." Thereafter the case is judged by LLM-as-judge (EvalStrategy.ASSERTION, with an offline lexical fallback) rather than by brittle expected-output matching, so the same failure cannot silently recur. This is the "lock-as-regression-test" close of the loop (the Opik Test Suite pattern).
Langfuse integration¶
The integration is bidirectional and both halves are wired through
agent_utilities/core/config.py (the config singleton, resolved from the XDG
config.json for graph-os):
| Direction | Path | Notes |
|---|---|---|
| Write (agent-utilities → Langfuse) | observability/langfuse_exporter.py + OTEL exporter → ${LANGFUSE_HOST}/api/public/otel |
Exports graph runs as traces. |
| Read (Langfuse → AHE-3.18) | harness/trace_backend.py::LangfuseTraceBackend |
get_error_observations, get_low_score_traces, get_cost_latency_anomalies (+ dataset helpers), wrapping langfuse_agent.LangfuseApi. |
Endpoint note. Self-hosted Langfuse (validated against 3.182.0) does not expose
the /api/public/v2/observations or /api/public/v2/metrics routes (404). The
read surface uses the stable /api/public/observations and /api/public/metrics
endpoints, which take an identical query schema. The metrics query requires a
non-null toTimestamp.
Environment variables are the official Langfuse SDK names — there is no deprecated fallback:
LANGFUSE_HOST(host base URL; notLANGFUSE_BASE_URL)LANGFUSE_PUBLIC_KEY,LANGFUSE_SECRET_KEY
Targeting: addressing the right gap¶
A brand-new failure_gap competes for a slot in unresolved_topics()'s limited,
arbitrarily-ordered scan over (potentially thousands of) existing concepts — so
generic intake can silently exclude it and the failure would never be remediated.
The failure-ingest path therefore passes its just-materialized gaps explicitly
to GoldenLoopController.run_one_cycle(topics=…), bypassing the generic scan so
remediation is deterministic.
How to run it¶
- On demand (MCP):
graph_orchestrate(action="failure_ingest")— pulls failures, materializes gaps, and runs the regression-gated remediation cycle, returning a JSON report (gap_concepts,anomalies, and aremediationblock). - As a daemon tick: the consolidated KG daemon registers a
failure_ingestmaintenance job whenKG_FAILURE_EVOLUTIONis enabled. This replaced the deadtelemetry_ingestionsweep.
Both share one implementation: failure_analyzer.run_failure_ingest(engine).
Configuration¶
| Flag | Default | Purpose |
|---|---|---|
KG_FAILURE_EVOLUTION |
False |
Enable the daemon failure_ingest tick. Parsed via to_boolean ("True"/"False"). |
KG_FAILURE_EVOLUTION_INTERVAL |
3600 |
Daemon tick interval (seconds). |
KG_FAILURE_EVOLUTION_WINDOW |
86400 |
How far back to pull telemetry (seconds). |
KG_FAILURE_REGRESSION_DATASET |
False |
Enable the dataset-based regression path (build a Langfuse dataset from the failing traces). |
KG_GOLDEN_AUTO_MERGE |
False |
Master switch for promoting a passing remediation proposal → active. Off ⇒ propose-only. |
Code paths¶
agent_utilities/knowledge_graph/adaptation/failure_analyzer.py—FailureAnalyzer,cluster_failures,make_regression_check,_lock_regression_cases(AHE-3.25),run_failure_ingest.agent_utilities/harness/eval_corpus.py—EvalCorpus.add_case(..., assertion=…), the durable regression-case store the verified fixes lock into.agent_utilities/harness/continuous_evaluation_engine.py—TestCase.assertion,EvalStrategy.ASSERTION,EvalRunner._assertion_judge(AHE-3.25).agent_utilities/harness/trace_backend.py— the Langfuse failure-read surface.agent_utilities/knowledge_graph/core/engine_tasks.py— the_tick_failure_ingestdaemon job.agent_utilities/knowledge_graph/research/golden_loop.py—run_one_cycle(topics=…)override; regression check threaded intoGovernedAutoMerger.agent_utilities/mcp/kg_server.py—graph_orchestrate(action="failure_ingest").agent_utilities/models/schema_definition.py—ExecutionSummary,PerformanceAnomaly.
Relationship to other concepts¶
- Builds on AHE-3.0 (Experience Observability) and the propose-only golden loop (KG-2.7).
- Reuses KG-2.8 feedback primitives (
FeedbackServiceeval/outcomecorrections) and the AHE-3.14GovernedAutoMerger(its injectableregression_check). - The remediation proposals are the same
TeamSpec/AgentSpecartifacts the research-driven synthesis (KG-2.10) produces. - AHE-3.25 (plain-English regression assertions) closes the loop's final step: a verified remediation locks a human-readable regression case into the eval corpus, the "lock-as-regression-test" guarantee.