LongMemEval-S Validation Harness (CONCEPT:AHE-3.12)¶
Overview¶
The LongMemEval-S Validation Harness is a FastAPI /benchmark surface that lets Quarq's HTTP
benchmark runner (quarqlabs/benchmarks) drive the agent-utilities memory-first stack
(ORCH-1.27 + KG-2.11/2.12/2.13) against LongMemEval-S and prove it meets or beats Quarq's 98.2%.
Extends AHE-3 (Agentic Harness Engineering).
How it works¶
- Reproducible corpus.
POST /benchmark/sessioningests haystack messages as episodic memory into a frozen, versionedEvaluationCorpus— reproducible across agent versions (Quarq re-derives FAISS each run; this does not). - Full pipeline per question.
POST /benchmark/queryruns HyDE + self-correcting two-pass retrieval (corpus-scoped), synthesizes via thegeneratorrole, and scores via thejudgerole with a deterministic pure-Python fallback (normalize_answer,judge_binary). - Report + CI gate.
GET /benchmark/report/{run_id}returns accuracy + per-category breakdown (aggregate_report).scripts/check_longmemeval.pygates CI on a frozen-subset floor (default 95%), sharing the exact scoring helpers so gate and live router never diverge; the full 500-question run is nightly/on-demand.
Key files / API¶
| Piece | Location |
|---|---|
| Router | server/routers/benchmark.py (/benchmark/session|query|report|health, normalize_answer, judge_binary, aggregate_report); mounted in server/app.py |
| CI gate | scripts/check_longmemeval.py |
Wiring (≤3 hops)¶
POST /benchmark/query → router → engine.search_hybrid(mode="hyde") → plan_and_retrieve (≤3 hops).
Research provenance¶
Quarq benchmark runner & eval pipeline — quarqlabs/benchmarks (HTTP) + agent-oss article eval pipeline.