Skip to content

LongMemEval-S Validation Harness (CONCEPT:AHE-3.12)

Overview

The LongMemEval-S Validation Harness is a FastAPI /benchmark surface that lets Quarq's HTTP benchmark runner (quarqlabs/benchmarks) drive the agent-utilities memory-first stack (ORCH-1.27 + KG-2.11/2.12/2.13) against LongMemEval-S and prove it meets or beats Quarq's 98.2%. Extends AHE-3 (Agentic Harness Engineering).

How it works

  • Reproducible corpus. POST /benchmark/session ingests haystack messages as episodic memory into a frozen, versioned EvaluationCorpus — reproducible across agent versions (Quarq re-derives FAISS each run; this does not).
  • Full pipeline per question. POST /benchmark/query runs HyDE + self-correcting two-pass retrieval (corpus-scoped), synthesizes via the generator role, and scores via the judge role with a deterministic pure-Python fallback (normalize_answer, judge_binary).
  • Report + CI gate. GET /benchmark/report/{run_id} returns accuracy + per-category breakdown (aggregate_report). scripts/check_longmemeval.py gates CI on a frozen-subset floor (default 95%), sharing the exact scoring helpers so gate and live router never diverge; the full 500-question run is nightly/on-demand.

Key files / API

Piece Location
Router server/routers/benchmark.py (/benchmark/session|query|report|health, normalize_answer, judge_binary, aggregate_report); mounted in server/app.py
CI gate scripts/check_longmemeval.py

Wiring (≤3 hops)

POST /benchmark/query → router → engine.search_hybrid(mode="hyde")plan_and_retrieve (≤3 hops).

Research provenance

Quarq benchmark runner & eval pipeline — quarqlabs/benchmarks (HTTP) + agent-oss article eval pipeline.